Hi Jeff,
Thanks for the reply, that is very useful. The IOC is running a detector (XIA xMAP) that I wrote the driver for, so there is a good chance it is a bug in my driver that is hanging up a CA thread as you described.
The system just hung up again, and when it did we ran epicsMutexShowAll to look for a deadlock, etc.
Here is the output:
epics> epicsMutexShowAll 1
ellCount(&mutexList) 59546 ellCount(&freeList) 23
epicsMutexId 0x12ec13e0 source ../caservertask.c line 732
So there is only 1 epicsMutex that is currently locked, and it is a caservertask that has the lock. Is this consistent with your hypothesis?
Thanks,
Mark
________________________________
From: Jeff Hill [mailto:[email protected]]
Sent: Mon 10/25/2010 4:22 PM
To: Mark Rivers; 'tech-talk'
Cc: Antonio Lanzirotti
Subject: RE: EPICS CA problems
> TCP 172.16.1.21:1752(X26A-Data): User="X26A User", V4.11, 1837 Channels,
> Priority=0
> Task Id=0x12ec19d0, Socket FD=16
> Secs since last send 131528.55, Secs since last receive 140881.60
> Unprocessed request bytes=8296, Undelivered response bytes=0
> State=up
> 377424 bytes allocated
There are no response bytes pending, but request bytes _are_ pending (for a
long time). I am going to go out on a limb and guess that the server's
per-client receive thread is trapped in db_put_field waiting for the db scan
lock, or waiting in the device support's signal write function - probably
due to a device driver issue. The server's per-client lock is held by the
receive thread in this situation, and that would also shut down subscription
updates to this client. The situation can be diagnosed in the debugger.
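
As a rough illustration of the mechanism described above, here is a toy model in Python. It is not the CA server code: the names client_lock, receive_thread, and send_update are hypothetical stand-ins, and the "driver call" is simulated with an event that is never set until the test releases it. It only shows how a receive thread that blocks while holding a per-client lock also starves a sender that needs the same lock.

```python
import threading


def demo_blocked_send(timeout=0.2):
    """Toy model: a receive thread stuck in a 'driver write' while holding
    the per-client lock prevents the send path from taking that lock."""
    client_lock = threading.Lock()       # hypothetical stand-in for the per-client lock
    in_driver = threading.Event()        # set once the receive thread holds the lock
    driver_returns = threading.Event()   # set when the simulated driver call returns

    def receive_thread():
        # Process a put request: take the client lock, then call into the driver.
        with client_lock:
            in_driver.set()
            driver_returns.wait()        # simulated driver call that does not return

    def send_update():
        # A subscription update needs the same lock; give up after `timeout` seconds.
        if client_lock.acquire(timeout=timeout):
            client_lock.release()
            return True
        return False

    rx = threading.Thread(target=receive_thread, daemon=True)
    rx.start()
    in_driver.wait()                     # make sure the lock is held before testing

    blocked = not send_update()          # sender cannot get the lock while rx is wedged
    driver_returns.set()                 # let the simulated driver call finish
    rx.join()
    recovered = send_update()            # lock is free again, updates can flow
    return blocked, recovered


if __name__ == "__main__":
    print(demo_blocked_send())
```

In this sketch the sender times out only because the test asks it to; a real thread waiting on the lock without a timeout would simply stay parked, which matches the "wedged always waiting for the same lock in the same place" symptom.
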
Typically the server's per-client receive thread is parked in socket
receive, and the server's per-client send thread is parked in event flag
wait. Symptomatic would be wedged in db_put_field (device driver issue) or
wedged always waiting for the same lock in the same place (deadlock).
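
The signature in the casr output quoted above (request bytes pending, and a long time since the last receive) can be checked for mechanically. The following is a minimal sketch, assuming the casr text has been captured to a string and that it follows the layout shown in this thread. The function name find_stalled_clients and the 60 second threshold are illustrative choices, not part of EPICS.

```python
import re

# Patterns based only on the casr layout quoted in this thread.
CLIENT_RE = re.compile(r"^(TCP|UDP) (\S+?)\(")
SECS_RE = re.compile(
    r"Secs since last send ([\d.]+), Secs since last receive ([\d.]+)")
BYTES_RE = re.compile(
    r"Unprocessed request bytes=(\d+), Undelivered response bytes=(\d+)")

SAMPLE = """\
TCP 172.16.1.21:1726(X26A-Data): User="X26A User", V4.11, 1755 Channels,
Priority=0
Task Id=0x12a022b0, Socket FD=15
Secs since last send 0.02, Secs since last receive 0.02
Unprocessed request bytes=0, Undelivered response bytes=0
State=up
TCP 172.16.1.21:1752(X26A-Data): User="X26A User", V4.11, 1837 Channels,
Priority=0
Task Id=0x12ec19d0, Socket FD=16
Secs since last send 131528.55, Secs since last receive 140881.60
Unprocessed request bytes=8296, Undelivered response bytes=0
State=up
"""


def find_stalled_clients(casr_text, min_idle_secs=60.0):
    """Return client blocks that have unprocessed request bytes and have not
    received anything for at least min_idle_secs (an arbitrary threshold)."""
    stalled = []
    current = None
    for raw in casr_text.splitlines():
        line = raw.lstrip("> ").strip()      # tolerate email quote prefixes
        m = CLIENT_RE.match(line)
        if m:
            current = {"proto": m.group(1), "addr": m.group(2)}
            continue
        if current is None:
            continue
        m = SECS_RE.search(line)
        if m:
            current["since_recv"] = float(m.group(2))
            continue
        m = BYTES_RE.search(line)
        if m:
            current["unprocessed"] = int(m.group(1))
            idle = current.get("since_recv", 0.0)
            if current["unprocessed"] > 0 and idle >= min_idle_secs:
                stalled.append(current)
            current = None                   # this block is finished
    return stalled


if __name__ == "__main__":
    for client in find_stalled_clients(SAMPLE):
        print(client["proto"], client["addr"], client["unprocessed"])
```

A client flagged this way is only a candidate; the thread identified by its Task Id still has to be inspected in the debugger to see where it is parked.
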
Jeff
______________________________________________________
Jeffrey O. Hill Email [email protected]
LANL MS H820 Voice 505 665 1831
Los Alamos NM 87545 USA FAX 505 665 5107
Message content: TSPA
> -----Original Message-----
> From: Mark Rivers [mailto:[email protected]]
> Sent: Monday, October 25, 2010 10:55 AM
> To: Jeff Hill; tech-talk
> Cc: Antonio Lanzirotti
> Subject: RE: EPICS CA problems
>
> Hi Jeff,
>
> I have some more information on this. The problem does NOT appear to be a
> problem with caRepeater crashing. When the client loses connection to the
> IOC the Windows Task Manager shows that caRepeater is still running on the
> IOC PC. Normally we had been seeing the problem when the client and the
> IOC were running on the same computer.
>
> However, last night we managed to reproduce the problem with the client
> running on a separate PC.
>
> I have attached the output of casr(100) on the IOC when the client has
> lost communication. The IOC server is 172.16.1.20 (X26A-Control) and the
> client is running on 172.16.1.21 (X26A-Data).
>
> It appears that when this happens the client loses connection to all PVs
> on the server. But we know for sure that it lost connection to
> X26A:med:Acquiring.
>
> I think I see something suspicious in the output. Here is the start of
> one block of output from casr for the client machine that has lost
> connection:
>
> TCP 172.16.1.21:1726(X26A-Data): User="X26A User", V4.11, 1755 Channels,
> Priority=0
> Task Id=0x12a022b0, Socket FD=15
> Secs since last send 0.02, Secs since last receive 0.02
> Unprocessed request bytes=0, Undelivered response bytes=0
> State=up
> 360696 bytes allocated
> X26A:med:PresetMode(0rw) X26A:med:ElapsedReal(1rw)
> X26A:med:PresetReal(0rw)
>
> Here is the start of another block for the same client:
>
> TCP 172.16.1.21:1752(X26A-Data): User="X26A User", V4.11, 1837 Channels,
> Priority=0
> Task Id=0x12ec19d0, Socket FD=16
> Secs since last send 131528.55, Secs since last receive 140881.60
> Unprocessed request bytes=8296, Undelivered response bytes=0
> State=up
> 377424 bytes allocated
>
> Note that there are unprocessed request bytes there.
>
> There is then another block for the same client machine:
>
> TCP 172.16.1.21:2735(X26A-Data): User="X26A User", V4.11, 1837 Channels,
> Priority=0
> Task Id=0x12ec0a10, Socket FD=19
> Secs since last send 119.81, Secs since last receive 119.82
> Unprocessed request bytes=0, Undelivered response bytes=0
> State=up
> 377424 bytes allocated
>
> There is also a UDP entry for that client machine:
>
> UDP Server:
> UDP 172.16.1.21:2733(): User="", V4.11, 0 Channels, Priority=0
> Task Id=0x1293c4e0, Socket FD=11
> Secs since last send 131525.68, Secs since last receive 3.06
> Unprocessed request bytes=16, Undelivered response bytes=0
> State=up
> 180 bytes allocated
>
> Send Lock
>
> I am not sure how to interpret this.
>
> Thanks,
> Mark
>
>
>
>
> -----Original Message-----
> From: Jeff Hill [mailto:[email protected]]
> Sent: Tuesday, October 19, 2010 10:33 AM
> To: Mark Rivers; 'tech-talk'
> Cc: Antonio Lanzirotti
> Subject: RE: EPICS CA problems
>
> Hi Mark,
>
> This is the first I have heard of any issues with the CA repeater crashing.
>
> Is this running under cygwin or mingw? Compiled by ms visual c or gnu?
>
> The stack trace has no symbols so it's hard to determine a cause. If you
> could fire up the relevant debugger and get a stack trace with symbols that
> would help. You might need to build base for debugging. Set HOST_OPT=NO in
> CONFIG_SITE. Also, if you save the debugging session in visual c++ and email
> it to me I might be able to identify the issue.
>
> Jeff
> ______________________________________________________
> Jeffrey O. Hill Email [email protected]
> LANL MS H820 Voice 505 665 1831
> Los Alamos NM 87545 USA FAX 505 665 5107
>
> Message content: TSPA
>
>
> > -----Original Message-----
> > From: Mark Rivers [mailto:[email protected]]
> > Sent: Monday, October 18, 2010 7:58 PM
> > To: tech-talk; Jeff Hill
> > Cc: Antonio Lanzirotti
> > Subject: RE: EPICS CA problems
> >
> > Folks,
> >
> > I learned today that it appears that caRepeater has been crashing on
> > this system. I don't know for sure that this problem happens when
> > caRepeater has died, but that seems likely. The next time it happens
> > we will look to see if caRepeater is still running.
> >
> > Meanwhile, we have found that there are caRepeater stackdump files,
> > containing the following:
> >
> > Exception: STATUS_ACCESS_VIOLATION at eip=610B9F69
> > eax=00000000 ebx=00000001 ecx=00000000 edx=0014C6F0 esi=00000000
> > edi=011DCCD8
> > ebp=011DCB14 esp=011DCAEC program=C:\Program Files\EPICS WIN32
> > Extensions\caRepeater.exe, pid 2152, thread unknown (0xC44)
> > cs=001B ds=0023 es=0023 fs=003B gs=0000 ss=0023
> > Stack trace:
> > Frame Function Args
> > 011DCB14 610B9F69 (00000000, 00000000, 00000000, 00000000)
> > 011DCC24 610BA905 (00000000, 00000000, 00000000, 00000000)
> > 011DCCE4 610BB67A (FFFFFFFF, FFFFFFFF, 00000000, 00000000)
> > 011DCD34 61027DE2 (00000002, 011DCE64, 00000002, 011DCE00)
> > 011DCDC8 7C87655C (00000002, 011DCE00, 7C8763C0, 00000002)
> > End of stack trace
> >
> > Has anyone else seen such stackdumps from caRepeater? This is the
> > version of caRepeater.exe that is included in the most recent (Nov. 2,
> > 2007) APS "EPICS Win32 Extensions" package.
> >
> > Thanks,
> > Mark
> >
> >
> > ________________________________
> >
> > From: Mark Rivers
> > Sent: Wed 10/13/2010 11:14 AM
> > To: tech-talk; 'Jeff Hill'
> > Cc: Antonio Lanzirotti
> > Subject: EPICS CA problems
> >
> >
> >
> > Folks,
> >
> > We are having trouble with a Windows IOC at NSLS. Here are the
> > symptoms:
> >
> > - The IOC is running fine
> >
> > - The PC running the IOC has 2 local CA clients connected to the IOC,
> > medm and IDL. Occasionally (1-2 times per day) one of these clients
> > loses its connection to the IOC. Medm screens go white, IDL says it
> > cannot find a PV, etc. This happens when the client was running fine.
> > It typically only happens to one or the other client, not to both.
> >
> > - Restarting the client fixes the problem.
> >
> > - The same 2 clients are running on another PC connected to the same
> > IOC. Those clients are always fine, they do not lose connection when a
> > client on the PC with the IOC does.
> >
> > - Looking at the resources on the Windows machine (CPU, virtual and
> > physical memory usage) does not indicate any problems.
> >
> > How do we go about figuring out what is wrong?
> >
> > Thanks,
> > Mark
> >
>