Folks,
Thanks to help from Jeff Hill, the problem with Channel Access clients losing connection to Cygwin IOCs has been tracked down. It really appears to be a problem with Cygwin itself.
We have now observed the failure on the following clients: IDL, Python, medm, and (almost definitely) the EPICS sscan record running in another IOC. The failure appears to only happen if there are channel access monitors on the arrays in the MCA records in the IOC. Most of the failures I observed were when the client had monitors on 16 arrays, each 8KB in size. These were updating frequently, between 1 and 10 Hz. The failure could take as long as several hours to occur.
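For a rough sense of the load involved, the figures above can be turned into a payload rate. This is only a back-of-the-envelope sketch: it assumes "8KB" means 8192 bytes, that every array posts on every update, and it ignores Channel Access headers and TCP overhead. The function name is made up for illustration.

```python
def monitor_payload_rate(n_arrays, array_bytes, rate_hz):
    """Payload bytes per second for n_arrays monitored arrays, each posted at rate_hz."""
    return n_arrays * array_bytes * rate_hz


N_ARRAYS = 16           # monitored arrays, from the description above
ARRAY_BYTES = 8 * 1024  # assumption: "8KB" taken as 8192 bytes

low = monitor_payload_rate(N_ARRAYS, ARRAY_BYTES, 1)    # slowest quoted update rate
high = monitor_payload_rate(N_ARRAYS, ARRAY_BYTES, 10)  # fastest quoted update rate
print(f"{low / 1024:.0f} KiB/s to {high / 1024:.0f} KiB/s")
```

So the server is pushing on the order of 0.1 to 1.3 MB/s of array data to each such client.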
I put print statements in the cas_send_bs_msg function in rsrv/caserverio.c before and after the call to send(). send() is the function that sends the CA data over the socket to the client. When the failure happens send() simply never returns.
This could happen if the client had failed, and was not reading the socket. However, since it fails on so many commonly used clients, this is almost certainly not the case.
The problem really appears to be a bug in the Cygwin socket call.
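For reference, the "client not reading" case mentioned above is easy to reproduce in isolation. The sketch below is not the IOC code and does not reproduce the Cygwin behavior; it only illustrates, under the assumption of an ordinary local socket pair, how a send stalls once the peer stops reading and the socket buffers fill. The function name and the timeout are hypothetical; the server's send() is a plain blocking call with no timeout, which is why it never returns.

```python
import socket


def send_or_report_stall(sock, payload, timeout_s=1.0):
    """Return True if payload was fully sent, False if the send stalled past timeout_s."""
    sock.settimeout(timeout_s)
    try:
        sock.sendall(payload)  # blocks while the peer's receive buffer is full
        return True
    except socket.timeout:
        return False           # peer is not draining the socket


def demo():
    """Send to a peer that never reads: small payload succeeds, large one stalls."""
    writer, idle_reader = socket.socketpair()
    try:
        small_ok = send_or_report_stall(writer, b"x" * 64, 0.5)
        large_ok = send_or_report_stall(writer, b"x" * (32 * 1024 * 1024), 0.5)
        return small_ok, large_ok
    finally:
        writer.close()
        idle_reader.close()
```

Wrapping the send in this way turns a silent hang into something that can be logged, which is the same idea as the print statements around send() described above.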
This is really unfortunate, because it means that Cygwin cannot be used as a reliable platform for an IOC until this problem is solved. Cygwin has the following advantages over the win32-x86 architecture on Windows:
- The gcc compiler is free
- It supports termios, which means asyn can work on local serial ports. win32-x86 cannot.
- It supports xdr and rpc. This is required for VXI-11 support in asyn. It is also required for the saveData utility in synApps that saves data from the sscan record directly to disk.
If anyone has any ideas on how to proceed on getting this problem fixed I'd love to hear them.
I think it is possible that this problem is new to Cygwin 1.7.x, since it has not been previously reported in Cygwin 1.5.x, but perhaps we just never stressed the systems in the same way with the older version of Cygwin.
Cheers,
Mark
________________________________
From: Mark Rivers
Sent: Tue 10/26/2010 10:18 AM
To: Jeff Hill; 'tech-talk'
Cc: Antonio Lanzirotti
Subject: RE: EPICS CA problems
Hi Jeff,
Thanks for the reply, that is very useful. The IOC is running a detector (XIA xMAP) that I wrote the driver for, so there is a good chance it is a bug in my driver that is hanging up a CA thread as you described.
The system just hung up again, and when it did we ran epicsMutexShowAll to look for a deadlock, etc.
Here is the output:
epics> epicsMutexShowAll 1
ellCount(&mutexList) 59546 ellCount(&freeList) 23
epicsMutexId 0x12ec13e0 source ../caservertask.c line 732
So there is only 1 epicsMutex that is currently locked, and it is a caservertask that has the lock. Is this consistent with your hypothesis?
Thanks,
Mark
________________________________
From: Jeff Hill [mailto:[email protected]]
Sent: Mon 10/25/2010 4:22 PM
To: Mark Rivers; 'tech-talk'
Cc: Antonio Lanzirotti
Subject: RE: EPICS CA problems
> TCP 172.16.1.21:1752(X26A-Data): User="X26A User", V4.11, 1837 Channels,
> Priority=0
> Task Id=0x12ec19d0, Socket FD=16
> Secs since last send 131528.55, Secs since last receive 140881.60
> Unprocessed request bytes=8296, Undelivered response bytes=0
> State=up
> 377424 bytes allocated
There are no response bytes pending, but request bytes _are_ pending (for a
long time). I am going to go out on a limb and guess that the server's
per-client receive thread is trapped in db_put_field waiting for the db scan
lock, or waiting in the device support's signal write function - probably
due to a device driver issue. The server's per-client lock is held by the
receive thread in this situation, and that would also shut down subscription
updates to this client. The situation can be diagnosed in the debugger.
Typically the server's per-client receive thread is parked in socket
receive, and the server's per-client send thread is parked in event flag
wait. Symptomatic would be wedged in db_put_field (device driver issue) or
wedged always waiting for the same lock in the same place (deadlock).
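The signature described above (request bytes pending for a long time on a TCP circuit) can be picked out of saved casr output mechanically. The following is a minimal sketch that assumes the casr text has the same layout as the blocks quoted in this thread; the function name and the 60 second threshold are made up for illustration, and other versions of base may print the report differently.

```python
import re

_HEAD = re.compile(r"^TCP (\S+?)\(")
_SECS = re.compile(r"Secs since last send ([\d.]+), Secs since last receive ([\d.]+)")
_BYTES = re.compile(r"Unprocessed request bytes=(\d+), Undelivered response bytes=(\d+)")


def find_stuck_clients(casr_text, min_idle_s=60.0):
    """Return (address, pending_request_bytes, secs_since_receive) for suspect TCP circuits."""
    suspects = []
    addr = None
    idle = None
    for raw in casr_text.splitlines():
        line = raw.lstrip("> ").strip()  # tolerate email quote markers
        head = _HEAD.match(line)
        if head:
            addr, idle = head.group(1), None  # start of a new TCP client block
            continue
        secs = _SECS.search(line)
        if secs:
            idle = float(secs.group(2))
            continue
        pending = _BYTES.search(line)
        if pending and addr is not None and idle is not None:
            request_bytes = int(pending.group(1))
            # Requests waiting while nothing has been received for a long time
            if request_bytes > 0 and idle >= min_idle_s:
                suspects.append((addr, request_bytes, idle))
    return suspects
```

Run against the blocks quoted in this thread, only the circuit on port 1752 would be reported.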
Jeff
______________________________________________________
Jeffrey O. Hill Email [email protected]
LANL MS H820 Voice 505 665 1831
Los Alamos NM 87545 USA FAX 505 665 5107
Message content: TSPA
> -----Original Message-----
> From: Mark Rivers [mailto:[email protected]]
> Sent: Monday, October 25, 2010 10:55 AM
> To: Jeff Hill; tech-talk
> Cc: Antonio Lanzirotti
> Subject: RE: EPICS CA problems
>
> Hi Jeff,
>
> I have some more information on this. The problem does NOT appear to be a
> problem with caRepeater crashing. When the client loses connection to the
> IOC the Windows Task Manager shows that caRepeater is still running on the
> IOC PC. Normally we had been seeing the problem when the client and the
> IOC were running on the same computer.
>
> However, last night we managed to reproduce the problem with the client
> running on a separate PC.
>
> I have attached the output of casr(100) on the IOC when the client has
> lost communication. The IOC server is 172.16.1.20 (X26A-Control) and the
> client is running on 172.16.1.21 (X26A-Data).
>
> It appears that when this happens the client loses connection to all PVs
> on the server. But we know for sure that it lost connection to
> X26A:med:Acquiring.
>
> I think I see something suspicious in the output. Here is the start of
> one block of output from casr for the client machine that has lost
> connection:
>
> TCP 172.16.1.21:1726(X26A-Data): User="X26A User", V4.11, 1755 Channels,
> Priority=0
> Task Id=0x12a022b0, Socket FD=15
> Secs since last send 0.02, Secs since last receive 0.02
> Unprocessed request bytes=0, Undelivered response bytes=0
> State=up
> 360696 bytes allocated
> X26A:med:PresetMode(0rw) X26A:med:ElapsedReal(1rw)
> X26A:med:PresetReal(0rw)
>
> Here is the start of another block for the same client:
>
> TCP 172.16.1.21:1752(X26A-Data): User="X26A User", V4.11, 1837 Channels,
> Priority=0
> Task Id=0x12ec19d0, Socket FD=16
> Secs since last send 131528.55, Secs since last receive 140881.60
> Unprocessed request bytes=8296, Undelivered response bytes=0
> State=up
> 377424 bytes allocated
>
> Note that there are unprocessed request bytes there.
>
> There is then another block for the same client machine:
>
> TCP 172.16.1.21:2735(X26A-Data): User="X26A User", V4.11, 1837 Channels,
> Priority=0
> Task Id=0x12ec0a10, Socket FD=19
> Secs since last send 119.81, Secs since last receive 119.82
> Unprocessed request bytes=0, Undelivered response bytes=0
> State=up
> 377424 bytes allocated
>
> There is also a UDP entry for that client machine:
>
> UDP Server:
> UDP 172.16.1.21:2733(): User="", V4.11, 0 Channels, Priority=0
> Task Id=0x1293c4e0, Socket FD=11
> Secs since last send 131525.68, Secs since last receive 3.06
> Unprocessed request bytes=16, Undelivered response bytes=0
> State=up
> 180 bytes allocated
>
> Send Lock
>
> I am not sure how to interpret this.
>
> Thanks,
> Mark
>
>
>
>
> -----Original Message-----
> From: Jeff Hill [mailto:[email protected]]
> Sent: Tuesday, October 19, 2010 10:33 AM
> To: Mark Rivers; 'tech-talk'
> Cc: Antonio Lanzirotti
> Subject: RE: EPICS CA problems
>
> Hi Mark,
>
> This is the first I have heard of any issues with the ca repeater crashing.
>
> Is this running under cygwin or mingw? Compiled by ms visual c or gnu?
>
> The stack trace has no symbols so it's hard to determine a cause. If you
> could fire up the relevant debugger and get a stack trace with symbols that
> would help. You might need to build base for debugging. Set HOST_OPT=YES in
> CONFIG_SITE. Also, if you save the debugging session in visual c++ and email
> it to me I might be able to identify the issue.
>
> Jeff
> ______________________________________________________
> Jeffrey O. Hill Email [email protected]
> LANL MS H820 Voice 505 665 1831
> Los Alamos NM 87545 USA FAX 505 665 5107
>
> Message content: TSPA
>
>
> > -----Original Message-----
> > From: Mark Rivers [mailto:[email protected]]
> > Sent: Monday, October 18, 2010 7:58 PM
> > To: tech-talk; Jeff Hill
> > Cc: Antonio Lanzirotti
> > Subject: RE: EPICS CA problems
> >
> > Folks,
> >
> > I learned today that it appears that caRepeater has been crashing on
> > this system. I don't know for sure that this problem happens when
> > caRepeater has died, but that seems likely. The next time it happens
> > we will look to see if caRepeater is still running.
> >
> > Meanwhile, we have found that there are caRepeater stackdump files,
> > containing the following:
> >
> > Exception: STATUS_ACCESS_VIOLATION at eip=610B9F69
> > eax=00000000 ebx=00000001 ecx=00000000 edx=0014C6F0 esi=00000000
> > edi=011DCCD8
> > ebp=011DCB14 esp=011DCAEC program=C:\Program Files\EPICS WIN32
> > Extensions\caRepeater.exe, pid 2152, thread unknown (0xC44)
> > cs=001B ds=0023 es=0023 fs=003B gs=0000 ss=0023
> > Stack trace:
> > Frame Function Args
> > 011DCB14 610B9F69 (00000000, 00000000, 00000000, 00000000)
> > 011DCC24 610BA905 (00000000, 00000000, 00000000, 00000000)
> > 011DCCE4 610BB67A (FFFFFFFF, FFFFFFFF, 00000000, 00000000)
> > 011DCD34 61027DE2 (00000002, 011DCE64, 00000002, 011DCE00)
> > 011DCDC8 7C87655C (00000002, 011DCE00, 7C8763C0, 00000002)
> > End of stack trace
> >
> > Has anyone else seen such stackdumps from caRepeater? This is the
> > version of caRepeater.exe that is included in the most recent (Nov. 2,
> > 2007) APS "EPICS Win32 Extensions" package.
> >
> > Thanks,
> > Mark
> >
> >
> > ________________________________
> >
> > From: Mark Rivers
> > Sent: Wed 10/13/2010 11:14 AM
> > To: tech-talk; 'Jeff Hill'
> > Cc: Antonio Lanzirotti
> > Subject: EPICS CA problems
> >
> >
> >
> > Folks,
> >
> > We are having trouble with a Windows IOC at NSLS. Here are the
> > symptoms:
> >
> > - The IOC is running fine
> >
> > - The PC running the IOC has 2 local CA clients connected to the IOC,
> > medm and IDL. Occasionally (1-2 times per day) one of these clients
> > loses its connection to the IOC. Medm screens go white, IDL says it
> > cannot find a PV, etc. This happens when the client was running fine.
> > It typically only happens to one or the other client, not to both.
> >
> > - Restarting the client fixes the problem.
> >
> > - The same 2 clients are running on another PC connected to the same
> > IOC. Those clients are always fine, they do not lose connection when a
> > client on the PC with the IOC does.
> >
> > - Looking at the resources on the Windows machine (CPU, virtual and
> > physical memory usage) does not indicate any problems.
> >
> > How do we go about figuring out what is wrong?
> >
> > Thanks,
> > Mark
> >
>