
Subject: CA monitors... (fwd)
From: zhang hao <[email protected]>
To: [email protected]
Date: Fri, 17 Dec 1999 15:05:30 +0800 (CST)

---------- Forwarded message ----------
Date: Mon, 13 Dec 1999 12:00:43 -0600 (CST)
From: Garrett D. Rinehart <[email protected]>
To: [email protected]
Subject: CA monitors...

Jeff (and anyone else who wants to chime in),

> I think I understand correctly that when the stalling situation occurs
> that clients on multiple hosts stop receiving monitors. If so, I think
> that we can safely assume that this is a server side problem or a network
> problem.
 
Yes, that is how I understand it as well. I'm a little unclear on the
terminology, so let me say that I assume "server side" means "the ioc that
is sending the data" and the "client" is the "receiving task" on the "host"
which is the "local ioc".

> Since your situation is quite repeatable and consistent I propose that
> you perform the following steps in order to help isolate the problem.
> 
> 1) do what ever is necessary to maximize the severity of the problem
> 2) type "dbel <record name>" for the records that are stalled. Let
> me know if it indicates that some of the monitor subscriptions are behind.
> 3) type "inetstatShow" on the IOC. Let me know if there are TCP virtual
> circuits that consistently have a large number of bytes lingering in the
> "Send-Q".
> The UNIX equivalent of this command is "netstat", but you will be looking
> for bytes lingering in the receive queue.
> 3) type "casr <interest level>" on the stalling IOC. Look for client
> connections
> that have a long delay since the last send when you expect that there should
> be
> regular monitor updates. Host names can be converted to IP addresses on UNIX
> with "nslookup".
> 4) type "i" on the IOC that is stalled and then send the output from
> "tt 0x<task id>" for several of the CA event tasks. The event task id can
> be determined by running "dbel <record name>".
> I will be looking for tasks that remain consistently over several
> calls to tt (task stack trace) in the same unusual part of the code.
> Compare the tt output from normal IOCs with the tt output from the
> stalling IOC. Also compare the tt output for the netTask on the stalling
> IOC to one that is behaving normally.
> 5) Finally, I am also interested in the output from typing "mbufShow",
> "ifShow", and "tcpstatShow" on the IOC.
 
Okay, I've tried to do as you requested, but I have to admit to not understanding
some of your directions (#4 specifically). Anyway, first I looked at the stalling
ioc when it was going directly to the 10/100base-T hub (the "fixed" condition). 

Using "dbel", I saw only zeros in the "behind by" counts.
Using "tt 0x{task id}" on the values returned by "dbel", I always got only one line
back of the form:

iocRCS_2> tt 0x1db54d8
 67dde _vxTaskEntry   +10 : _semQPut (1e22b74, 0, 0, 0, 0, 0, 0, 0, 0, 0)
value = 0 = 0x0
iocRCS_2>

"inetstatShow" returned a table with five or six tasks showing counts in the sendQ
at any given time. The numbers changed each time I executed the command and none of
the counts appeared to linger. The highest any of them ever went was about 1700, 
and that was only momentarily.

Then I put everything back on the thinwire.

Intermittently, "dbel" showed channels falling behind, and "tt" gave very different output:
iocRCS_2> dbel "Qacc"
 VAL VALUE LOG ALARM
 VAL VALUE LOG ALARM
List of events (monitors).
task 1c95a34 select 5 pfield 1f6eaf4 behind by 0
task 1db54d8 select 5 pfield 1f6eaf4 behind by 8
value = 0 = 0x0
iocRCS_2> tt 0x1db54d8
 67dde _vxTaskEntry   +10 : _event_task (1e22b74, 0, 0, 0, 0, 0, 0, 0, 0, 0)
1ec6450 _event_task    +a2 : 1ec65c6 (1d86fb0)
1ec6644 _event_task    +296: 1ed2818 (1d9b918, 1da802c, 0)
1ed29a8 _write_notify_reply+c02: _cas_send_msg (1df4cd4, 0)
1ecfb32 _cas_send_msg  +aa : _sendto (1e, 1df4d24, 140, 0, 1df4d14, 10)
 39f10 _sendto        +42 : _bsdSendto ([1e, 1df4d24, 140, 0, 1df4d14])
 3a7f6 _bsdSendto     +9e : _sosend ([1d19a1c, 1d1a980, 1df4d24, 8, 1db53a0])
 74110 _sosend        +1b2: _sbwait (1d19aac)
 75404 _sbwait        +10 : _semQPut ([1d19acc, 1db5370, 74116, 1d19aac, 0])
value = 0 = 0x0
iocRCS_2>

It should be noted that I NEVER saw this kind of response with the other hookup.
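(If it helps to show that the task stays in that same spot over several calls to
"tt", the stock vxWorks "period" shell routine can repeat the trace automatically;
the task id below is the event task from the "dbel" output above, and 10 seconds
is an arbitrary interval:

iocRCS_2> period 10, tt, 0x1db54d8

That spawns a task which prints a fresh stack trace every 10 seconds until it is
deleted with "td".)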

"inetstatShow" also had one task listed that would run up to 16384 in the sendQ
and seem to hold it for quite a while (relatively) before dumping it. Then the 
count would shoot right back up there again.

The output from "casr" indicated several seconds "since last send" to the display
ioc and the OPI workstation. Neither should have been over 1/30th of a second as
that is the data sampling rate and it changes EVERY time.
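(A cross-check from the workstation side would be to watch one of these channels
with a CA client that timestamps every update, for example something like:

workstation% camonitor Qacc

Gaps of several seconds between updates there would confirm, independently of
"casr", that the monitors really are stalling on the way out of the ioc.)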

These are typical results from mbufShow, ifShow, and tcpstatShow:
iocRCS_2> mbufShow
type        number
---------   ------
FREE    :    867
DATA    :    114
HEADER  :     19
SOCKET  :      0
PCB     :     55
RTABLE  :      2
HTABLE  :      0
ATABLE  :      0
SONAME  :      1
ZOMBIE  :      0
SOOPTS  :      0
FTABLE  :      0
RIGHTS  :      0
IFADDR  :      2
TOTAL   :    1060
number of mbufs: 1060
number of clusters: 39
number of interface pages: 0
number of free clusters: 32
number of times failed to find space: 0
number of times waited for space: 0
number of times drained protocols for space: 0
value = 47 = 0x2f = '/'
iocRCS_2> ifShow
ei (unit number 0):
     Flags: (0x63) UP BROADCAST ARP RUNNING 
     Internet address: 164.54.250.4
     Broadcast address: 164.54.251.255
     Netmask 0xffff0000 Subnetmask 0xfffffe00
     Ethernet address is 08:00:3e:29:5e:af
     Metric is 0
     Maximum Transfer Unit size is 1500
     81109763 packets received; 215219682 packets sent
     19888 input errors; 88009 output errors
     4431000 collisions
lo (unit number 0):
     Flags: (0x69) UP LOOPBACK ARP RUNNING 
     Internet address: 127.0.0.1
     Netmask 0xff000000 Subnetmask 0xff000000
     Metric is 0
     Maximum Transfer Unit size is 4096
     803264 packets received; 803264 packets sent
     0 input errors; 0 output errors
     0 collisions
value = 18 = 0x12
iocRCS_2> tcpstatShow
TCP:
        214996826 packets sent
                212457947 data packets (-1705954267 bytes)
                1104302 data packets (837884279 bytes) retransmitted
                1258107 ack-only packets (947311 delayed)
                0 URG only packet
                0 window probe packet
                88272 window update packets
                88218 control packets
        79774943 packets received
                57511503 acks (for -1698854679 bytes)
                3025215 duplicate acks
                0 ack for unsent data
                25019054 packets (532647771 bytes) received in-sequence
                66903 completely duplicate packets (1095116 bytes)
                64 packets with some dup. data (192 bytes duped)
                78092 out-of-order packets (4548764 bytes)
                0 packet (0 byte) of data after window
                0 window probe
                202970 window update packets
                1 packet received after close
                0 discarded for bad checksum
                0 discarded for bad header offset field
                0 discarded because packet too short
        25 connection requests
        87850 connection accepts
        87869 connections established (including accepts)
        87990 connections closed (including 171 drops)
        116 embryonic connections dropped
        56305160 segments updated rtt (of 57362745 attempts)
        222923 retransmit timeouts
                0 connection dropped by rexmit timeout
        1 persist timeout
        287 keepalive timeouts
                286 keepalive probes sent
                0 connection dropped by keepalive
value = 36 = 0x24 = '$'
iocRCS_2> 

Note that this ioc has been running for a week through all of my network
investigations. If the cumulative numbers appear high, that's probably why.
I had forgotten to get my "control set" of these before switching to the
bad conditions, so I did it when I put things back. "ifShow" and "tcpstatShow"
had basically the same info, except that most of the counts had gone up by only
a few. "mbufShow", however, was completely different:
iocDRO162b> mbufShow
type        number
---------   ------
FREE    :    248
DATA    :      8
HEADER  :     13
SOCKET  :      0
PCB     :     47
RTABLE  :      2
HTABLE  :      0
ATABLE  :      0
SONAME  :      0
ZOMBIE  :      0
SOOPTS  :      0
FTABLE  :      0
RIGHTS  :      0
IFADDR  :      2
TOTAL   :    320
number of mbufs: 320
number of clusters: 7
number of interface pages: 0
number of free clusters: 7
number of times failed to find space: 0
number of times waited for space: 0
number of times drained protocols for space: 0
value = 47 = 0x2f = '/'
iocDRO162b> 


> > Grasping at straws, I started killing MEDM displays on the
> > control workstations. One of them
> > was an XY plot of beam profiles (~75 channels total @ 30Hz rate).
> 
> How many data points are in each beam profile channel? Also,
> does stopping a large number of other displays attached to the
> stalling IOC while allowing the XY plot of beam profiles to
> continue running have any effect?

Sorry that wasn't as clear as I intended. There are 2 waveforms (32 points each)
and ten other numbers, for 74 values total at a 30 Hz rate.
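(In other words, 2 x 32 + 10 = 74 values, or roughly 74 x 30 = 2220 monitor
updates per second coming out of that ioc.)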

I tried loading the network with displays of a similar number of other channels,
but did not see the same effect. That led me to believe the problem was in the
leg PAST iocA and B (data gatherer and displayer) to iocC (beam profiler).
However, moving iocB (displayer) to the hub instead of the thinwire leg they all 
shared made the problem disappear.

> Was the sniffer attached to the thinwire segment of the network, or
> was it attached to a different tap off of the switch? Due to the
> nature of switches, sniffers will generally not see traffic
> (broadcasts are an exception) on other parts of the switched network.
 
Yes, it was attached to the thinwire segment all the iocs were sharing.

-- Garrett



