Hi all,
We had some network problems over the weekend. Maybe someone knows what
to do. Here is what I observed:
First incident:
An IOC (vxWorks 5.5) lost CA connectivity. All medm panels went white.
Logging in on the IOC over the serial port showed that the database was
still running, but there were many messages like this:
"rsrv: system low on network buffers - send retry in 15 seconds"
These messages made debugging a bit tough, because they are interleaved
with the output of every debug tool.
But what I found was this: mbufShow showed 0 free buffers. Where were they?
inetstatShow showed three CA connections with full send queues. Following
the foreign address entries and using netstat -tp on the client
computers, I found one CA gateway and two Tcl/Tk clients. All had large
numbers in their receive queues. The gateway, at least, reported in its
log file that it had lost the connection to the IOC.
After restarting the gateway and killing the two clients, the IOC recovered.
This is not the first time this has happened. I have seen all kinds of
clients cause this problem: Tcl, medm, gateway, ...
Some time ago I had increased the queue sizes on vxWorks from the
default 8k to 64k. Was that a bad idea?
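To put rough numbers on the queue size question: if every stalled connection can fill its whole socket send queue, the amount of data the IOC has to hold grows linearly with the queue size. The sketch below is only that arithmetic. It assumes the 8k and 64k figures are per-socket send queue sizes in bytes; `pinned_bytes` is a hypothetical helper, and it ignores mbuf and cluster overhead as well as the actual size of the vxWorks buffer pool, which is not known here.

```c
#include <stdio.h>

/* Upper bound on the bytes held in socket send queues by clients that
 * have stopped reading. Assumption: each stalled connection fills its
 * whole send queue; mbuf/cluster overhead is not counted. */
static unsigned long pinned_bytes(unsigned int stalled_clients,
                                  unsigned long sndbuf_bytes)
{
    return (unsigned long)stalled_clients * sndbuf_bytes;
}

/* Print the bound for a given number of stalled clients and queue size. */
static void pinned_bytes_show(unsigned int stalled_clients,
                              unsigned long sndbuf_bytes)
{
    printf("%u stalled clients x %lu byte queues = %lu bytes\n",
           stalled_clients, sndbuf_bytes,
           pinned_bytes(stalled_clients, sndbuf_bytes));
}
```

For the three stalled connections seen in the first incident, `pinned_bytes(3, 8UL * 1024)` gives 24576 bytes and `pinned_bytes(3, 64UL * 1024)` gives 196608 bytes, so the larger queues let the same clients tie up eight times as much buffer space.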
How should an IOC behave if a CA client which subscribed for monitor
events does not handle input fast enough? Using up all network resources
on vxWorks is not the best thing that can happen.
What could have stopped the clients from handling their input?
Second incident:
Another IOC lost CA connectivity. This time the error message was
different:
CA beacon (send to "...") error was "ENOBUFS"
Again, inetstatShow showed two client connections with quite full send
queues. But this time, mbufShow still showed free buffers. And killing
the clients did not help!
Using the following function, which I once got from WindRiver, I found
that the network interface send queue was full (size: 50 entries).
void ifQValuesShow (char *name) {
    struct ifnet *ifp;
    ifp = ifunit(name);
    if (ifp == NULL) {
        printf("Could not find %s interface\n", name);
        return;
    }
    printf("%s drops = %d queue length = %d max_len = %d \n",
           name, ifp->if_snd.ifq_drops,
           ifp->if_snd.ifq_len, ifp->if_snd.ifq_maxlen);
    return;
}
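One possible reading of these counters, sketched as a small model: in BSD-derived stacks the output path refuses a packet with ENOBUFS when the interface send queue is already at its limit, independent of how many mbufs are free. Assuming vxWorks 5.5 follows that convention (an assumption, not something confirmed here), a queue that is full and no longer drained by the driver would make every send fail. `model_ifq`, `model_enqueue` and `model_tx_done` are hypothetical names for illustration, not vxWorks APIs.

```c
#include <errno.h>

/* Simplified stand-in for the interface send queue counters
 * (hypothetical type, not the vxWorks struct ifqueue). */
struct model_ifq {
    int len;     /* packets currently queued */
    int maxlen;  /* queue limit, 50 in the incident above */
    int drops;   /* packets rejected because the queue was full */
};

/* Try to queue one packet for transmission.
 * Returns 0 on success, ENOBUFS if the queue is already full. */
static int model_enqueue(struct model_ifq *q)
{
    if (q->len >= q->maxlen) {
        q->drops++;
        return ENOBUFS;
    }
    q->len++;
    return 0;
}

/* Called when the driver has taken one packet off the queue. */
static void model_tx_done(struct model_ifq *q)
{
    if (q->len > 0)
        q->len--;
}
```

In this model, sends only start succeeding again once `model_tx_done` runs. If the driver never dequeues, the queue length stays at the limit and the drop counter keeps growing, which would match a queue that stays full even after the clients are killed.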
The only way to recover from this problem seems to be a reboot.
Any idea what went wrong here?
I can increase the queue size, but how much? WindRiver never answers
questions like "why does the network not recover?".
Dirk