
Subject: RE: vxWorks network problems
From: "Jeff Hill" <[email protected]>
To: "'Dirk Zimoch'" <[email protected]>, <[email protected]>
Date: Mon, 23 May 2011 09:50:06 -0600
Hi Dirk,

I would make certain that your network interface driver has the very latest
patches installed.

> After restarting the gateway and killing the two clients, the IOC
> recovered.
>
> How should an IOC behave if a CA client which has subscribed for monitor
> events does not handle its input fast enough? Using up all network
> resources on vxWorks is not the best thing that can happen.
> 
> Some time ago I had increased the queue sizes on vxWorks from the
> default 8k to 64k. Was that a bad idea?

The server software is carefully designed to use two independent threads
for the maintenance of each client, and to use a finite amount of resource
when maintaining each client. Sometimes a client's consumption rate is less
than the server's production rate, and so, from queuing theory, we expect
all of the finite buffering to be consumed. In that situation the server
will use all of the kernel's per-TCP-circuit network buffering on behalf of
that client, then all of its own per-client buffering, and finally block
that client's threads until the client consumes some additional messages.
In modern vxWorks network kernels the number of network buffers is fixed
when you build the kernel. (There is also something called CA flow control,
but that's another topic.) What happens when the number of TCP circuits
times the per-circuit maximum buffer space exceeds the amount of buffer
space allocated when the kernel was built is 100% dependent on the network
kernel/driver implementation. Code based on the socket library depends on
the kernel not locking up in such situations, although of course we accept
that maximum throughput might not be optimal.
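
To put a rough number on it: if, hypothetically, 20 stalled TCP circuits
each pin a 64 kB socket send buffer (the 64k queues you mention above),
that alone is 20 x 64 kB = 1.28 MB of kernel cluster space, which can
easily exceed what a default vxWorks network data pool was built with. The
sketch below is only an illustration written against plain POSIX sockets
(it is not the rsrv source; on vxWorks you would use sockLib and
taskDelay), but it shows the kind of blocking send-and-retry behavior I am
describing, and why exhausting the kernel's buffers results in retries and
log messages rather than an immediate disconnect.

    /* Minimal sketch only, not the actual rsrv implementation: a
     * per-client send path that backs off and retries when the kernel
     * runs out of buffer space for the circuit. */
    #include <errno.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* returns 0 when the whole message went out, -1 on a dead circuit */
    static int sendWithRetry (int sock, const char *buf, size_t len)
    {
        size_t sent = 0;
        while (sent < len) {
            ssize_t n = send (sock, buf + sent, len - sent, 0);
            if (n > 0) {
                sent += (size_t) n;
            }
            else if (n < 0 && (errno == ENOBUFS || errno == EWOULDBLOCK)) {
                /* no kernel buffer space left for this circuit; back
                 * off and try again rather than dropping the client */
                sleep (15);
            }
            else {
                return -1; /* real socket error: disconnect this client */
            }
        }
        return 0;
    }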

> CA beacon (send to "...") error was "ENOBUFS"
> Again, inetstatShow showed two client connections with quite full send
> queues. But this time, mbufShow still showed free buffers. And killing
> the clients did not help!
> 
> I can increase the queue size, but how much? WindRiver never answers
> questions like "why does the network not recover?".

As I recall there are multiple diagnostics from WRS: netStackSysPoolShow,
netStackDataPoolShow (aka mbufShow), and endPoolShow (which takes the
network interface name, in quotes, as its argument). I would probably also
look at ifShow.
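
For example, from the target shell (the interface name "fei0" below is
just an example; use whatever your END driver is called):

    -> netStackSysPoolShow
    -> netStackDataPoolShow
    -> mbufShow
    -> endPoolShow "fei0"
    -> ifShow "fei0"
    -> inetstatShow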

Dave Thompson at the SNS appears to be one of the EPICS community's WRS IP
kernel tuning experts. However, as I recall, the summary was that the worst
root problems occur when you have a buggy network interface device driver.

http://www.diamond.ac.uk/dms/Technology/EPICS/MBUF_Problems.ppt

http://www.aps.anl.gov/epics/tech-talk/2007/msg01234.php

http://www-kryo.desy.de/documents/vxWorks/V5.4/vxworks/netguide/c-tcpip6.html

Jeff
______________________________________________________
Jeffrey O. Hill           Email        [email protected]
LANL MS H820              Voice        505 665 1831
Los Alamos NM 87545 USA   FAX          505 665 5107

Message content: TSPA

With sufficient thrust, pigs fly just fine. However, this is
not necessarily a good idea. It is hard to be sure where they
are going to land, and it could be dangerous sitting under them
as they fly overhead. -- RFC 1925


> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Dirk Zimoch
> Sent: Monday, May 23, 2011 4:07 AM
> To: [email protected]
> Subject: vxWorks network problems
> 
> Hi all,
> 
> We had some network problems over the weekend. Maybe someone knows what
> to do. Here is what I observed:
> 
> First incident:
> An IOC (vxWorks 5.5) lost CA connectivity. All medm panels went white.
> Logging in to the IOC over the serial port showed that the database was
> still running, but there were many messages
> "rsrv: system low on network buffers - send retry in 15 seconds"
> These messages made it a bit tough to debug, because they spill all over
> the output of any debug tool.
> 
> But what I found was: mbufShow showed 0 free buffers. Where are they?
> inetstatShow showed three CA connections with full send queues. Following
> the foreign address entries and using netstat -tp on the client
> computers I found one CA gateway and 2 Tcl/Tk clients. All had large
> numbers in their receive queues. At least the gateway reported in its
> log file that it had lost the connection to the IOC.
> 
> After restarting the gateway and killing the two clients, the IOC
> recovered.
> 
> It is not the first time this has happened. I have seen all kinds of
> clients cause this problem: Tcl, medm, gateway, ...
> 
> Some time ago I had increased the queue sizes on vxWorks from the
> default 8k to 64k. Was that a bad idea?
> 
> How should an IOC behave if a CA client which has subscribed for monitor
> events does not handle its input fast enough? Using up all network
> resources on vxWorks is not the best thing that can happen.
> 
> What could have stopped the clients from handling their input?
> 
> 
> Second incident:
> Another IOC lost CA connectivity. This time the error message was
> different:
> CA beacon (send to "...") error was "ENOBUFS"
> Again, inetstatShow showed two client connections with quite full send
> queues. But this time, mbufShow still showed free buffers. And killing
> the clients did not help!
> 
> Using a function I once got from WindRiver I found that the network
> interface send queue was full (size: 50 entries).
> 
> #include <vxWorks.h>
> #include <stdio.h>
> #include <net/if.h>   /* struct ifnet; ifunit() may need an extern decl */
> 
> /* print the drop count, current length, and maximum length of the
>    send queue of the named network interface */
> void ifQValuesShow (char *name) {
>     struct ifnet *ifp = ifunit (name);
>     if (ifp == NULL) {
>         printf ("Could not find %s interface\n", name);
>         return;
>     }
>     printf ("%s drops = %d queue length = %d max_len = %d\n",
>         name, ifp->if_snd.ifq_drops,
>         ifp->if_snd.ifq_len, ifp->if_snd.ifq_maxlen);
> }
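> 
> A hypothetical invocation from the target shell (the interface name is
> site specific; "fei0" here is just an example):
> 
>     ifQValuesShow "fei0"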
> 
> The only way to recover from this problem seems to be a reboot.
> 
> Any idea what went wrong here?
> 
> I can increase the queue size, but how much? WindRiver never answers
> questions like "why does the network not recover?".
> 
> 
> Dirk



