Jeff,
Here’s what I have done on this problem:
- I’ve written a simple TCP/IP client/server pair. They
take turns sending and receiving random-length messages of between 1 and MAX_SIZE
bytes. They do it as fast as possible, with no sleep between messages. I just
ran that test for 50 hours with MAX_SIZE=20,000 bytes. It did not
fail. Thus, the problem does not occur under this simple configuration.
- I have joined the Cygwin mailing list, and posted a
message about the problem. One of the Cygwin developers replied right
away that she did not claim that the Cygwin socket library had no bugs, but
that she was not aware of any problem resembling the one I reported.
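For readers who want to try something similar, here is a minimal sketch of the kind of turn-taking test described in the first item above. It is not the program that was actually run; the 4-byte length prefix, the fixed random seeds, the loopback address, and all function names are assumptions made for illustration.

```python
import random
import socket
import struct
import threading

MAX_SIZE = 20_000  # largest message size in bytes, as in the test described above


def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def send_msg(sock, rng):
    """Send one zero-filled message of random length with a 4-byte length prefix."""
    size = rng.randint(1, MAX_SIZE)
    sock.sendall(struct.pack("!I", size) + bytes(size))
    return size


def recv_msg(sock):
    """Receive one length-prefixed message and return its payload size."""
    (size,) = struct.unpack("!I", recv_exact(sock, 4))
    recv_exact(sock, size)
    return size


def serve(listener, rounds):
    """Server side: receive first, then reply, for a fixed number of rounds."""
    conn, _ = listener.accept()
    rng = random.Random(1)
    with conn:
        for _ in range(rounds):
            recv_msg(conn)
            send_msg(conn, rng)


def run_pingpong(rounds=200):
    """Run client and server over loopback; return payload bytes sent by the client."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    server = threading.Thread(target=serve, args=(listener, rounds))
    server.start()
    rng = random.Random(2)
    total = 0
    with socket.create_connection(listener.getsockname()) as client:
        for _ in range(rounds):
            total += send_msg(client, rng)
            recv_msg(client)
    server.join()
    listener.close()
    return total
```

Calling run_pingpong() with a large number of rounds approximates a long soak test. Because the two ends strictly alternate, each socket is only ever sending or receiving at any moment, which may be why a test of this shape does not trigger the failure.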
Here is what I propose. So far I have only observed
this problem when running my driver that talks to a PXI device. That is
not surprising, because that is the only application I have run under Cygwin
that does a very large number of monitors on arrays, which happen to be MCA
records (which are not part of base). If this problem lies where we think
it does then it should also occur in an IOC using only soft records, e.g. waveform
records and calc records doing lots of callbacks. I will create an IOC
using the Example application from base and create a database with lots of calc
and waveform records that are periodically scanned as fast as possible. I’ll
create an medm screen that monitors all of those PVs. Hopefully that will
trigger the problem.
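A database like the one proposed above could be generated with a short script rather than written by hand. The following is only a sketch: the record names, the "stress:" prefix, the record counts, and the choice of ".1 second" (the fastest periodic rate in the stock scan menu) are assumptions, and 2048 LONG elements is picked simply to give 8 KB arrays like those mentioned elsewhere in this thread.

```python
def make_stress_db(n_calc=100, n_wave=100, nelm=2048, prefix="stress:"):
    """Return EPICS database text with periodically scanned soft records."""
    lines = []
    for i in range(n_calc):
        name = f"{prefix}calc{i}"
        lines += [
            f'record(calc, "{name}") {{',
            '    field(SCAN, ".1 second")',
            '    field(CALC, "A+1")',          # simple counter: VAL = VAL + 1
            f'    field(INPA, "{name}.VAL")',  # read back the record's own value
            "}",
        ]
    for i in range(n_wave):
        lines += [
            f'record(waveform, "{prefix}wave{i}") {{',
            '    field(SCAN, ".1 second")',
            '    field(FTVL, "LONG")',         # 4-byte elements
            f'    field(NELM, "{nelm}")',      # 2048 * 4 bytes = 8 KB per array
            "}",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical output file name; load it from the IOC startup script.
    with open("stress.db", "w") as f:
        f.write(make_stress_db())
```

This only generates the records. Whether an unchanged soft waveform record posts a monitor on every scan may depend on the record's settings and the version of base, so something may also need to write changing data into the arrays.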
If it does then we have a system that you should be able to
reproduce. At that point would you be willing to try to tackle it?
Could you design a simple socket client/server pair that is
EPICS-independent that should cause the problem, that we can send to the Cygwin
developers?
Thanks,
Mark
From: Jeff Hill [mailto:[email protected]]
Sent: Wednesday, November 17, 2010 4:39 PM
To: Mark Rivers
Cc: Antonio Lanzirotti; 'tech-talk'
Subject: RE: EPICS CA problems
Mark,
> I wonder if a very simple socket server program and a simple socket client
> could reproduce this problem? Then there is some hope of the Cygwin developers
> being able to reproduce it and fix it.
Considering this further, what might be different with CA compared to other Cygwin socket codes?
o maybe sending and receiving simultaneously on the same socket from two different threads
o maybe the very large contiguous buffer sizes used currently when configuring EPICS_CA_MAX_ARRAY_SIZE
I will place my bet on the 2nd one.
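Both conditions listed above can be combined in an EPICS-independent test. The sketch below is one possible shape for it, not a known reproducer: the 1 MiB chunk size, the loopback connection, and all names are assumptions. Each end of one TCP connection runs a sender thread and a receiver thread on the same socket, and each send passes a large contiguous buffer.

```python
import socket
import threading

CHUNK = 1 << 20  # 1 MiB per sendall(); arbitrary stand-in for a large CA buffer


def tcp_pair():
    """Return two connected TCP sockets on the loopback interface."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
    listener.listen(1)
    a = socket.create_connection(listener.getsockname())
    b, _ = listener.accept()
    listener.close()
    return a, b


def start_duplex(sock, n_chunks, counts):
    """Start one sender and one receiver thread that share the same socket."""
    def sender():
        payload = bytes(CHUNK)
        for _ in range(n_chunks):
            sock.sendall(payload)  # a hang would show up here
            counts["sent"] += CHUNK

    def receiver():
        while counts["received"] < n_chunks * CHUNK:
            data = sock.recv(65536)
            if not data:
                break
            counts["received"] += len(data)

    threads = [threading.Thread(target=f, daemon=True) for f in (sender, receiver)]
    for t in threads:
        t.start()
    return threads


def run_duplex(n_chunks=8, timeout=30.0):
    """Return (finished, bytes received by a, bytes received by b)."""
    a, b = tcp_pair()
    ca = {"sent": 0, "received": 0}
    cb = {"sent": 0, "received": 0}
    threads = start_duplex(a, n_chunks, ca) + start_duplex(b, n_chunks, cb)
    for t in threads:
        t.join(timeout)
    finished = not any(t.is_alive() for t in threads)
    a.close()
    b.close()
    return finished, ca["received"], cb["received"]
```

A False first element in the result of run_duplex() means a thread was still stuck in sendall() or recv() when the timeout expired. Varying CHUNK while keeping the thread layout fixed, or the reverse, would help separate the two hypotheses.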
Jeff
______________________________________________________
Jeffrey O. Hill           Email [email protected]
LANL MS H820              Voice 505 665 1831
Los Alamos NM 87545 USA   FAX 505 665 5107
Message content: TSPA
With sufficient thrust, pigs fly just fine. However, this is not
necessarily a good idea. It is hard to be sure where they are going to
land, and it could be dangerous sitting under them as they fly
overhead. -- RFC 1925
From: Mark Rivers [mailto:[email protected]]
Sent: Wednesday, November 17, 2010 9:21 AM
To: Mark Rivers; tech-talk
Cc: Jeff Hill; Antonio Lanzirotti
Subject: RE: EPICS CA problems
Folks,
Just a quick follow-up on this:
> We have now observed the failure on the following clients: IDL, Python, medm, and
> (almost definitely) the EPICS sscan record running in another IOC.
I have now determined that indeed the sscan record client running on a
vxWorks VME IOC was hanging up for the same reason when getting data from the
Cygwin IOC. The CAS-client thread in the Cygwin IOC that was connected to
the VME IOC client was hung. It was waiting for the Send Lock mutex that a
CAS-event thread had taken and never released. This is almost certainly
because the CAS-event thread called send(), which never returned.
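The way one stuck send() stalls a second thread can be modelled in a few lines. This is only an illustration of the locking pattern described above, not code from rsrv: the lock and the events are stand-ins, and the blocked send() is simulated by waiting on an event.

```python
import threading


def demo(timeout=0.2):
    """Return True if a second thread cannot take the lock within timeout seconds."""
    send_lock = threading.Lock()   # stand-in for the Send Lock mutex
    lock_taken = threading.Event()
    release = threading.Event()    # stand-in for send() finally returning

    def cas_event():
        with send_lock:
            lock_taken.set()
            release.wait()         # models send() blocking while the lock is held

    holder = threading.Thread(target=cas_event, daemon=True)
    holder.start()
    lock_taken.wait()              # make sure the lock is held before testing it

    # This is the position of the CAS-client thread: it wants the same lock.
    got_it = send_lock.acquire(timeout=timeout)
    if got_it:
        send_lock.release()

    release.set()                  # let the stand-in thread finish
    holder.join()
    return not got_it
```

In the real server there is no timeout on the lock and nothing to release the blocked call, so the waiting thread stays hung for as long as send() does.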
I wonder if a very simple socket server program and a simple socket
client could reproduce this problem? Then there is some hope of the
Cygwin developers being able to reproduce it and fix it.
Mark
From: Mark Rivers
Sent: Saturday, November 13, 2010 4:32 PM
To: 'tech-talk'
Cc: Jeff Hill; Antonio Lanzirotti
Subject: RE: EPICS CA problems
Thanks to help from Jeff Hill, the problem with Channel Access clients
losing connection to Cygwin IOCs has been tracked down. It really
appears to be a problem with Cygwin itself.
We have now observed the failure on the following clients: IDL, Python,
medm, and (almost definitely) the EPICS sscan record running in another
IOC. The failure appears to only happen if there are channel
access monitors on the arrays in the MCA records in the IOC. Most of the
failures I observed were when the client had monitors on 16 arrays, each 8KB in
size. These were updating frequently, between 1 and 10 Hz. The
failure could take as long as several hours to occur.
I put print statements in the cas_send_bs_msg function in
rsrv/caserverio.c before and after the call to send(). send() is the
function that sends the CA data over the socket to the client. When
the failure happens send() simply never returns.
This could happen if the client had failed, and was not reading the
socket. However, since it fails on so many commonly used clients,
this is almost certainly not the case.
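The benign explanation mentioned above, a peer that has stopped reading, is easy to demonstrate, which is useful when checking whether a given hang is really a bug. In this sketch a local socket pair stands in for the IOC-to-client connection, and a timeout is set on the sender so that the stall becomes visible instead of blocking forever; the chunk size, timeout, and names are assumptions.

```python
import socket


def send_until_blocked(chunk=8192, timeout=0.5, limit=1 << 30):
    """Send to a peer that never reads; return (bytes accepted, stalled)."""
    a, b = socket.socketpair()     # b is the "client" that never calls recv()
    a.settimeout(timeout)          # without this, send() would block indefinitely
    total = 0
    stalled = False
    try:
        while total < limit:
            total += a.send(bytes(chunk))
    except socket.timeout:
        stalled = True             # the socket buffers are full
    finally:
        a.close()
        b.close()
    return total, stalled
```

Once the buffers fill, send() stops accepting data, which is expected behaviour and not a defect. The case reported in this thread is different because the clients were, as far as could be told, still reading.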
The problem really appears to be a bug in the Cygwin socket call.
This is really unfortunate, because it means that Cygwin cannot be used
as a reliable platform for an IOC until this problem is solved. Cygwin
has the following advantages over the win32-x86 architecture on Windows:
- The gcc compiler is free
- It supports termios, which means asyn can work on local serial
ports. win32-x86 cannot.
- It supports xdr and rpc. This is required for VXI-11
support in asyn. It is also required for the saveData utility in synApps
that saves data from the sscan record directly to disk.
If anyone has any ideas on how to proceed on getting this problem fixed
I'd love to hear them.
I think it is possible that this problem is new to Cygwin 1.7.x, since
it has not been previously reported in Cygwin 1.5.x, but perhaps we just
never stressed the systems in the same way with the older version of Cygwin.