Hi Steven,
Steven M. Hartman wrote:
Dirk Zimoch wrote:
First incident:
An IOC (vxWorks 5.5) lost CA connectivity. All medm panels went white.
Logging in to the IOC over the serial port showed that the database was
still running, but there were many messages:
"rsrv: system low on network buffers - send retry in 15 seconds"
These messages made debugging a bit tough, because they spill all over
the output of any debug tool.
As you saw, mbuf starvation is usually recoverable once the client
starts behaving. Jeff's comments about queuing theory show that you
cannot protect the server completely, but you can hopefully tune your
buffer numbers to deal with most situations your network is likely to
experience. Was this a one-time event, and if so, was anything unusual
happening at the time? Which EPICS versions are the IOC and the clients running?
All clients and IOCs are running EPICS 3.14.8.2.
I tried to tune the network buffers. When I saw with inetstatShow that
the send queues filled up, I increased their size from the default 8k to
64k. This seems to have been a bad decision, because now I don't have
enough mbufs any more. But how many do I need? With 20 CA connections, I
might need more than a megabyte of mbufs, as the estimate below shows.
So I had better go back to 8k buffers.
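A back-of-envelope estimate (my own numbers; it assumes every connection
can fill its send queue and that each queued segment occupies a
2048-byte cluster, the largest data pool size in my settings below):

/* Rough worst case: 20 connections x 64 kB send queues, served
 * from 2048-byte clusters. Plain C, compiles anywhere. */
#include <stdio.h>

int main(void)
{
    const int connections   = 20;      /* CA clients */
    const int sndq_bytes    = 65536;   /* TCP_SND_SIZE_DFLT */
    const int cluster_bytes = 2048;    /* largest data cluster */

    int clusters = connections * (sndq_bytes / cluster_bytes);
    printf("%d clusters ~ %d kB of cluster memory\n",
           clusters, clusters * cluster_bytes / 1024);  /* 640 ~ 1280 kB */
    return 0;
}

With NUM_2048 at only 100 (see below), a handful of busy connections can
exhaust that pool.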
Has anyone changed TCP_SND_SIZE_DFLT or TCP_RCV_SIZE_DFLT in vxWorks?
My current settings are:
#define NUM_64 800 /* default 100 */
#define NUM_128 800 /* default 100 */
#define NUM_256 100 /* default 40 */
#define NUM_512 100 /* default 40 */
#define NUM_1024 100 /* default 25 */
#define NUM_2048 100 /* default 25 */
#define NUM_SYS_64 1024 /* default 64 */
#define NUM_SYS_128 1024 /* default 64 */
#define NUM_SYS_256 512 /* default 64 */
#define NUM_SYS_512 512 /* default 64 */
/* TCP queue sizes increased to improve array throughput */
#define TCP_SND_SIZE_DFLT 65536
#define TCP_RCV_SIZE_DFLT 65536
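By the way, a quick way to watch how close these pools come to
exhaustion (a sketch; it assumes INCLUDE_NET_SHOW is built into the
image so the netShow routines are available):

/* from the vxWorks target shell */
netStackDataPoolShow();   /* data pool usage: NUM_64 .. NUM_2048 */
netStackSysPoolShow();    /* system pool usage: NUM_SYS_64 .. NUM_SYS_512 */
inetstatShow();           /* per-socket send/receive queue fill */

The failure counters in the pool statistics show whether a pool has ever
run dry.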
Second incident:
Another IOC lost CA connectivity. This time the error message was
different:
CA beacon (send to "...") error was "ENOBUFS"
This one is similar to what SNS has seen over the years with mv2100 and
related boards with the DEC network driver. The precipitating event is a
temporary loss of the physical network layer (e.g. an unplugged network
cable to the IOC, or an edge network switch powered down). It looks like a
buggy network driver that cannot recover from this fault when there are
lots of UDP packets queuing in the outgoing buffers. This one does not
seem to be recoverable except by a reboot, so we have taken steps to
reduce the likelihood of losing the physical link. We do not see this
with other boards.
It was a user shift, so hopefully nobody had unplugged any cables.
CA search UDP packets are probably related, but the time resolution of
our network diagnostics is not good enough to tell whether the IOC
crashed because of search broadcasts or whether the clients broadcast
because the IOC crashed.
In my experience, WRS network drivers are always buggy. My dec21x40End.c
is from September 2005. Maybe I should ask WRS for a new driver.
Thanks for your reply
Dirk