> We have now observed the failure on the following
> clients: IDL, Python, medm, and
I have now determined that the sscan record client running on a
vxWorks VME IOC was indeed hanging for the same reason when getting
data from the Cygwin IOC. The CAS-client thread in the Cygwin IOC that
was connected to the VME IOC client was hung. It was waiting for the Send
Lock mutex that a CAS-event thread had taken and never released. This is
almost certainly because the CAS-event thread called send(), which never
returned.
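To make the hang mechanism concrete, here is a minimal Python sketch of the pattern described above. It is not the rsrv code: the names `send_lock`, `event_thread` and `client_thread_can_send` are hypothetical, and the stuck send() is modelled with an Event that is never set rather than with a real socket.

```python
import threading

send_lock = threading.Lock()             # stands in for the Send Lock mutex
send_never_returns = threading.Event()   # never set: models a send() that hangs


def event_thread(holding):
    """Models the CAS-event thread: take the lock, then block in 'send'."""
    with send_lock:
        holding.set()                    # tell the caller the lock is now held
        send_never_returns.wait()        # blocks for as long as 'send' is stuck


def client_thread_can_send(timeout=0.2):
    """Models the CAS-client thread: try to take the same lock."""
    acquired = send_lock.acquire(timeout=timeout)
    if acquired:
        send_lock.release()
    return acquired


if __name__ == "__main__":
    holding = threading.Event()
    threading.Thread(target=event_thread, args=(holding,), daemon=True).start()
    holding.wait()
    # The lock holder is stuck, so the second thread cannot get the lock.
    print(client_thread_can_send())
```

In the real server the waiting thread has no timeout, so it simply stays blocked for as long as send() does.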
I wonder if a very simple socket server program and a simple
socket client could reproduce this problem? Then there is some hope of
the Cygwin developers being able to reproduce it and fix it.
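As a starting point, such a test could look something like the Python sketch below. It is only a sketch under assumptions: the payload is plain zero bytes rather than Channel Access messages, the function names and the `STALL_SECONDS` threshold are made up for illustration, and both ends run in one process over loopback. The array count and size mirror the 16 arrays of 8 KB mentioned in the original message; the server deliberately uses a blocking send, and a separate watchdog loop reports when a send has not returned.

```python
import socket
import threading
import time

ARRAY_BYTES = 8 * 1024   # assumed: one 8 KB array per update
NUM_ARRAYS = 16          # 16 monitored arrays, as in the original report
STALL_SECONDS = 5.0      # assumed threshold for calling a send "stuck"


def run_server(listener, updates, period, state):
    """Accept one client and push NUM_ARRAYS payloads per update."""
    conn, _ = listener.accept()
    payload = b"\x00" * ARRAY_BYTES
    with conn:
        for _ in range(updates):
            for _ in range(NUM_ARRAYS):
                state["in_send_since"] = time.monotonic()
                conn.sendall(payload)          # blocking send on the socket
                state["in_send_since"] = None
                state["sent"] += len(payload)
            time.sleep(period)


def run_client(address, state):
    """Read until the server closes the connection, counting bytes."""
    with socket.create_connection(address) as sock:
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            state["received"] += len(chunk)


def reproduce(updates=20, period=0.1, host="127.0.0.1"):
    """Run both ends and report whether a send stalled."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, 0))                   # port 0: OS picks a free port
    listener.listen(1)
    state = {"sent": 0, "received": 0, "in_send_since": None}
    server = threading.Thread(
        target=run_server, args=(listener, updates, period, state), daemon=True)
    client = threading.Thread(
        target=run_client, args=(listener.getsockname(), state), daemon=True)
    server.start()
    client.start()
    status = "ok"
    while server.is_alive():                   # watchdog loop
        started = state["in_send_since"]
        if started is not None and time.monotonic() - started > STALL_SECONDS:
            status = "send blocked"
            break
        time.sleep(0.05)
    if status == "ok":
        client.join(timeout=STALL_SECONDS)     # let the client drain the socket
    listener.close()
    return {"status": status, "sent": state["sent"],
            "received": state["received"]}


if __name__ == "__main__":
    print(reproduce())
```

Since the real failure could take hours to appear and was seen with clients on other machines, a useful test would probably need a much larger `updates` value and the client half running on a separate host.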
Thanks to help from Jeff Hill, the problem with Channel
Access clients losing connection to Cygwin IOCs has been tracked
down. It really appears to be a problem with Cygwin itself.
We have now observed the failure on the following clients:
IDL, Python, medm, and (almost definitely) the EPICS sscan record running in
another IOC. The failure appears to happen only if there are
channel access monitors on the arrays in the MCA records in the IOC. Most
of the failures I observed were when the client had monitors on 16 arrays, each
8KB in size. These were updating frequently, between 1 and 10 Hz.
The failure could take as long as several hours to occur.
I put print statements in the cas_send_bs_msg function in
rsrv/caserverio.c before and after the call to send(). send() is the
function that sends the CA data over the socket to the client. When
the failure happens, send() simply never returns.
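The same before/after tracing idea can be sketched in Python as follows. This is an illustration of the technique, not the code added to caserverio.c; `traced_send` and its `log` parameter are hypothetical names.

```python
import socket
import time


def traced_send(sock, data, log=print):
    """Send data, logging one message before and one after the call."""
    log(f"send: about to send {len(data)} bytes")
    start = time.monotonic()
    sock.sendall(data)                     # the call under suspicion
    elapsed = time.monotonic() - start
    log(f"send: returned after {elapsed:.3f} s")
    return elapsed


if __name__ == "__main__":
    a, b = socket.socketpair()             # connected pair, for demonstration
    traced_send(a, b"x" * 100)
    a.close()
    b.close()
```

A "before" message with no matching "returned" message in the log is the signature of the hang.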
This could happen if the client had failed and was not
reading the socket, since a blocking send() waits once the socket
buffers are full. However, since it fails on so many commonly used
clients, this is almost certainly not the case.
The problem really appears to be a bug in the Cygwin socket
call.
This is really unfortunate, because it means that Cygwin
cannot be used as a reliable platform for an IOC until this problem is
solved. Cygwin has the following advantages over the win32-x86
architecture on Windows:
- The gcc compiler is free
- It supports termios, which means asyn can work on local
serial ports. win32-x86 cannot.
- It supports xdr and rpc. This is required
for VXI-11 support in asyn. It is also required for the saveData
utility in synApps that saves data from the sscan record directly to disk.
If anyone has any ideas on how to get this
problem fixed, I'd love to hear them.
I think it is possible that this problem is new to Cygwin
1.7.x, since it has not been previously reported in Cygwin 1.5.x, but
perhaps we just never stressed the systems in the same way with the older
version of Cygwin.