Subject: | RE: EPICS CA problems |
From: | "Mark Rivers" <[email protected]> |
To: | "Mark Rivers" <[email protected]>, "tech-talk" <[email protected]> |
Cc: | Antonio Lanzirotti <[email protected]> |
Date: | Wed, 17 Nov 2010 10:20:32 -0600 |
Folks,

Just a quick follow-up on this:

> We have now observed the failure on the following clients: IDL, Python,
> medm, and (almost definitely) the EPICS sscan record running in another IOC.

I have now determined that indeed the sscan record client running on a vxWorks
VME IOC was hanging up for the same reason when getting data from the Cygwin
IOC. The CAS-client thread in the Cygwin IOC that was connected to the VME IOC
client was hung. It was waiting for the Send Lock mutex that a CAS-event
thread had taken and never released. This is almost certainly because the
CAS-event thread called send(), which never returned.

I wonder if a very simple socket server program and a simple socket
client could reproduce this problem? Then there is some hope of the Cygwin
developers being able to reproduce it and fix it.

Mark

From: Mark Rivers

Folks,

Thanks to help from Jeff Hill, the problem with Channel Access clients
losing connection to Cygwin IOCs has been tracked down. It really
appears to be a problem with Cygwin itself.

We have now observed the failure on the following clients: IDL, Python,
medm, and (almost definitely) the EPICS sscan record running in another
IOC.

The failure appears to only happen if there are channel
access monitors on the arrays in the MCA records in the IOC. Most of the
failures I observed were when the client had monitors on 16 arrays, each 8 KB in
size. These were updating frequently, between 1 and 10 Hz. The
failure could take as long as several hours to occur.

I put print statements in the cas_send_bs_msg function in
rsrv/caserverio.c before and after the call to send(). send() is the
function that sends the CA data over the socket to the client. When
the failure happens send() simply never returns.

This could happen if the client had failed, and was not reading the
socket. However, since it fails on so many commonly used clients,
this is almost certainly not the case. The problem really appears to be
a bug in the Cygwin socket call.

This is really unfortunate, because it means that Cygwin cannot be used
as a reliable platform for an IOC until this problem is solved. Cygwin
has the following advantages over the win32-x86 architecture on Windows:

- The gcc compiler is free.
- It supports termios, which means asyn can work on local serial
  ports. win32-x86 cannot.
- It supports xdr and rpc. This is required for VXI-11
  support in asyn. It is also required for the saveData utility in synApps
  that saves data from the sscan record directly to disk.

If anyone has any ideas on how to proceed on getting this problem fixed
I'd love to hear them.

I think it is possible that this problem is new to Cygwin 1.7.x, since
it has not been previously reported in Cygwin 1.5.x, but perhaps we just
never stressed the systems in the same way with the older version of Cygwin.

Cheers,
Mark
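
The follow-up asks whether a very simple socket server and client could reproduce the hang. A minimal sketch of such a test is below, written in Python for brevity. It is not the EPICS CA server code: the function name run_stress and all of its parameters are hypothetical, and the defaults only mirror the load described above (16 arrays of 8 KB each, updating at up to 10 Hz). The server side sets a socket timeout so that a send that stops making progress is reported instead of blocking forever. Whether this load pattern alone is enough to trigger the Cygwin problem is an assumption that would need to be tested.

```python
import socket
import threading
import time


def run_stress(duration_s=1.0, n_arrays=16, array_bytes=8192, rate_hz=10.0,
               send_timeout_s=5.0):
    """Stream blocks over a local TCP socket and report stalled sends.

    Hypothetical helper, not part of EPICS.
    Returns (bytes_sent, bytes_received, send_stalled).
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    received = [0]  # one-element list so the client thread can update it

    def client():
        # A well-behaved client: read until the server closes the socket.
        with socket.create_connection(("127.0.0.1", port)) as sock:
            while True:
                chunk = sock.recv(65536)
                if not chunk:
                    break
                received[0] += len(chunk)

    reader = threading.Thread(target=client)
    reader.start()

    conn, _ = listener.accept()
    listener.close()
    # With a timeout set, a send that cannot make progress raises
    # socket.timeout rather than blocking indefinitely.
    conn.settimeout(send_timeout_s)

    payload = bytes(array_bytes)  # zero-filled stand-in for one array
    sent = 0
    stalled = False
    deadline = time.monotonic() + duration_s
    try:
        while time.monotonic() < deadline:
            for _ in range(n_arrays):
                conn.sendall(payload)
                sent += len(payload)
            time.sleep(1.0 / rate_hz)
    except socket.timeout:
        stalled = True
    finally:
        conn.close()
    reader.join()
    return sent, received[0], stalled


if __name__ == "__main__":
    sent, got, stalled = run_stress(duration_s=10.0)
    print("sent", sent, "received", got, "stalled", stalled)
```

To imitate the reported conditions, one would presumably run this for several hours under Cygwin 1.7.x (for example duration_s=6 * 3600) and, ideally, with the client on a separate machine rather than on the loopback interface. A stalled result while the client is still reading would point at the socket layer rather than at any EPICS code.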