Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: RE: Runaway connection count on IOC
From: "Hill, Jeff" <johill@lanl.gov>
To: "michael.abbott@diamond.ac.uk" <michael.abbott@diamond.ac.uk>, "tech-talk@aps.anl.gov" <tech-talk@aps.anl.gov>
Date: Mon, 11 Jun 2012 15:33:03 +0000
Hi Michael,

As Andrew mentions, the CA server runs as the lowest-priority entity 
in the IOC, with the CA UDP daemon holding the very lowest slot. 
Therefore, of course, if the IOC's CPU is saturated then we certainly 
expect that this could impact the server's ability to run (either to
clean up or to allow new clients to attach).

> However if the gateway uses a name-server instead of UDP broadcasts 
> this natural throttling will not be happening and I could see your 
> symptoms occurring as a result. 

Additionally, certain CA clients might be granted access to more of the 
CPU in this situation than others, depending on what priority is 
specified when the channel is created on the client side (internally 
in the GW, when it creates a channel using the client library in this 
situation).

Furthermore, if the IP kernel is low on buffers then activities in the server 
might also stall. It's even possible that TCP circuit shutdown activities 
might stall (i.e. socket close might block when a network buffer isn't available), 
depending on the IP kernel's implementation.

> 6. Running casr reports that practically all of these connections are to
> the gateway, for example:

Starting around R3.14.6 I made some changes in the CA client library so that it 
will _not_ disconnect an unresponsive circuit and start a new one in such (heavily
congested) situations. Instead it disconnects the application but does not 
disconnect the circuit, and simply waits for TCP to recover the circuit using 
its mature built-in capabilities for dealing with congestion.

However, if for any reason the CA GW were, of its own volition, to destroy the 
channel and create a new one (when the channel was unresponsive), then this would 
circumvent the protections mentioned in the previous paragraph. I didn't write
the GW, but I have looked inside it, and I don't recall that it does this. I do
seem to recall that if a channel isn't used for some time in the gateway then it 
will be destroyed and later recreated when a client (of the GW) needs it again, but
the timeouts are probably long enough that they are not coming into play in your 
situation? Another possibility might be that this gateway was somehow using 
a pre-R3.14.6 ca client library.

> 8. After `casr 2` has completed, the bogus channels have gone away:

As I recall, casr doesn't perform any cleanup activities, so I don't claim to have 
a clear explanation for this behavior. One possible guess is that casr, running at 
a higher priority than the vxWorks shell, temporarily prevents the scan tasks from 
adding more work to the event queue for the server to complete, thereby allowing 
network buffer starvation in the IP kernel to clear, and maybe this allows the 
server threads to finish their cleanup activities. Maybe a stretch, HTA.

The best way to find out what _is_ occurring would be to log into that IOC and use the 
"tt <task id>" vxWorks target shell command to determine where the CA server's TCP 
threads happen to be loitering.
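For example, something along these lines at the target shell (the task name, entry point, and ID below are illustrative; as I recall, the per-client TCP tasks spawned by the server are named CAS-client):

```
-> i                        <- summary of all tasks; find the CAS-client entries
  NAME         ENTRY        TID        PRI   STATUS
  ...
  CAS-client   camsgtask    0xNNNNNNNN 181   PEND
-> tt 0xNNNNNNNN            <- stack traceback showing where that task is blocked
```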

Jeff


> -----Original Message-----
> From: tech-talk-bounces@aps.anl.gov [mailto:tech-talk-bounces@aps.anl.gov]
> On Behalf Of michael.abbott@diamond.ac.uk
> Sent: Monday, June 11, 2012 1:17 AM
> To: tech-talk@aps.anl.gov
> Subject: Runaway connection count on IOC
> 
> I have a very odd problem with one particular vxWorks (EPICS 3.14.11) IOC,
> where the connection count as reported by casr climbs into the many
> thousands, all to one client (the gateway server).  Simply running `casr
> 2` is enough to clear this condition!
> 
> Let me try and be precise.
> 
> 1. The server is vxWorks 5.5.1 and EPICS 3.14.11
> 
> 2. The server is running an asyn driver interfacing to a firewire camera
> and providing images over EPICS
> 
> 3. The ethernet connection is horribly horribly overloaded (100MBit link),
> EPICS would like to deliver far more image frames than the network will
> permit
> 
> 4. The EPICS gateway clearly struggles to connect to the IOC, typically
> most of the PVs provided by the IOC are inaccessible through the gateway.
> 
> 5. At some random point during operation the number of IOC connections
> ($(IOC):CA:CNX as reported by vxStats) starts climbing steadily.
> 
> 6. Running casr reports that practically all of these connections are to
> the gateway, for example:
> 
> SR01C-DI-IOC-02 -> casr
> Channel Access Server V4.11
> Connected circuits:
> TCP 172.23.194.201:38552(cs03r-cs-gate-01.cs.diamond.ac.uk): User="gate", V4.11, 8697 Channels, Priority=0
> TCP 172.23.194.38:59307(cs03r-cs-serv-38.pri.diamond.ac.uk): User="epics_user", V4.11, 12 Channels, Priority=0
> TCP 172.23.194.201:38553(cs03r-cs-gate-01.cs.diamond.ac.uk): User="gate", V4.11, 18 Channels, Priority=0
> TCP 172.23.194.27:50143(cs03r-cs-serv-27.pri.diamond.ac.uk): User="archiver", V4.11, 1 Channels, Priority=20
> TCP 172.23.194.28:52559(cs03r-cs-serv-28.pri.diamond.ac.uk): User="archiver", V4.11, 1 Channels, Priority=20
> 
> 7. Running `casr 2` takes forever (well, several minutes; the connection
> is over a 9600 baud serial line).  Of the 8697 channels, the same PV is
> reported over and over and over and over again (it's a simple camera
> STATUS provided by the asyn driver).
> 
> 8. After `casr 2` has completed, the bogus channels have gone away:
> 
> SR01C-DI-IOC-02 -> casr
> Channel Access Server V4.11
> Connected circuits:
> TCP 172.23.194.201:38552(cs03r-cs-gate-01.cs.diamond.ac.uk): User="gate", V4.11, 6 Channels, Priority=0
> TCP 172.23.194.38:59307(cs03r-cs-serv-38.pri.diamond.ac.uk): User="epics_user", V4.11, 12 Channels, Priority=0
> TCP 172.23.194.201:38553(cs03r-cs-gate-01.cs.diamond.ac.uk): User="gate", V4.11, 18 Channels, Priority=0
> TCP 172.23.194.27:50143(cs03r-cs-serv-27.pri.diamond.ac.uk): User="archiver", V4.11, 1 Channels, Priority=20
> TCP 172.23.194.28:52559(cs03r-cs-serv-28.pri.diamond.ac.uk): User="archiver", V4.11, 1 Channels, Priority=20
> 
> 
> Very odd.  Any thoughts?
> 



ANJ, 18 Nov 2013