Garrett,
We have been looking at Ethernet errors such as collisions with
"ifShow" on vxWorks, and with "netstat -i" on UNIX. You can also
look for IP level errors with "ipstatShow" on vxWorks, and
"netstat -s" on UNIX. Are you using a switched Ethernet?
Jeff
> We suffer from this very same affliction. We drive a panel full of digital
> readouts (the old LED kind) with data from a remote IOC at 30Hz. The readouts
> freeze occasionally, for anywhere from a long blink to almost two minutes.
> MEDM screens with the same data also freeze, though not necessarily at the
> same times.
>
> As in Peregrine's case, it's mostly one IOC, though we see it occasionally on
> others. I can find no correlation to CPU load, file descriptor usage, network
> load, or memory usage. It does not seem to be related to "time since reboot" as
> it fluctuates constantly. It is sometimes seconds between stops and sometimes
> days... well, at least so infrequent no one notices 'em.
>
> I have not checked timestamps, but I can see the 30Hz pulse counts and toroid
> summations on the IOC continuing normally. An MEDM screen attached to those
> channels will also temporarily freeze, but all the numbers jump to where they
> ought to be when the pause is over.
>
> Another interesting observation is that one number on the DRO panel always
> updates, even when the rest of 'em freeze. It's the beam current number, which
> is calculated in the local IOC (the one that drives the panel) from a toroid
> value on the "stalled" IOC. The copy of that same toroid value that is
> displayed directly on the DROs is stalled, but the one linked between the
> databases is okay.
>
> We've had this through at least a couple of versions of EPICS (currently
> 3.13.0b11) and two CPUs (mv167 and now mv172). I don't think the CPU is the
> problem though; I think it's in the network (or the software that drives it).
> The problem has been especially bad in the last several days and I finally have
> access to a sniffer and the guy to run it, so I'm hoping to learn more in the
> next day or two. If anyone has an idea what to look for, I'm all ears.
>
> Garrett Rinehart
> Intense Pulsed Neutron Source
> Argonne National Laboratory
> 9700 S. Cass Ave
> Argonne, IL 60439
> (630)252-6561
>
> > Subject: Delays in receipt of CA monitors
> >
> > Here at the Low Energy Demonstration Accelerator we are seeing
> > intermittent significant delays in the receipt of CA monitors. As these
> > delays are large enough to trigger a hardware protection system that
> > places the accelerator in a safe mode (e.g. turn the beam off) we have
> > spent considerable time over the past months to understand this.
> >
> > We don't yet understand the root cause and are asking if anyone has seen
> > a similar effect. We realize that TCP is non-realtime but are concerned
> > since these delays can be as large as 3 to 10 seconds.
> >
> > Here's the context:
> >
> > In order to provide prompt detection of a failure in the control system
> > IOCs or network at timescales shorter than the default CA timeout (30
> > seconds), each IOC provides a heartbeat. This heartbeat is implemented
> > as a calculation record scanned at .5 second that toggles between 0 and
> > 1. On our run permit IOC we have a genSub record scanned at 1 second
> > that has an input CP link to that heartbeat. Thus the genSub record will
> > process both at 1 second intervals and when a monitor comes in from the
> > heartbeat. If a monitor is not received within a specific interval (1.5
> > seconds) we assume that there is a problem and shut down the
> > accelerator.
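
For readers following the thread, the timeout logic described in the paragraph above can be sketched in a few lines of Python. This is only an illustration of the idea, not the genSub code running on the run permit IOC: the class name `HeartbeatWatchdog` and its methods are hypothetical, and time is passed in explicitly so the sketch does not depend on any EPICS or Channel Access library. The 1.5 second limit is the figure quoted above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatWatchdog:
    """Hypothetical model of the run-permit check: trip if no monitor
    has arrived within `timeout` seconds."""
    timeout: float = 1.5                 # seconds, as quoted in the post
    last_seen: Optional[float] = None    # arrival time of the latest monitor

    def on_monitor(self, now: float) -> None:
        # Called whenever a heartbeat monitor is received.
        self.last_seen = now

    def is_healthy(self, now: float) -> bool:
        # Called on the periodic (1 second) scan.
        if self.last_seen is None:
            return False                 # nothing received yet
        return (now - self.last_seen) <= self.timeout


wd = HeartbeatWatchdog()
wd.on_monitor(0.0)
print(wd.is_healthy(1.0))   # a monitor arrived 1.0 s ago
print(wd.is_healthy(4.0))   # a 4 s gap, like the delays described
```

With a 0.5 second heartbeat, a single late monitor is tolerated, but a delivery delay of 3 to 10 seconds exceeds the limit even though the source record never stopped processing.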
> >
> > Here's what we see:
> >
> > For a small subset of IOCs, and mostly from one, we see several times a
> > day, delays of several seconds on the receipt of a monitor. The effect
> > is one of the monitors being buffered, but not lost. We can also observe
> > this behavior by running camonitor on an IOC heartbeat channel from a
> > workstation.
> >
> > The timestamps generated by record processing of the heartbeats are
> > always seen (by dbCaGetTimeStamp) at .5 second intervals.
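
One way to separate "the record stopped processing" from "the monitor was delivered late" is to compare the spacing of the source timestamps with the spacing of the arrival times on the client. The sketch below assumes you have already logged pairs of (source timestamp, local arrival time) in seconds, for example from a camonitor session; the function name `find_stalls` and the input layout are assumptions made for illustration, not part of any EPICS tool.

```python
def find_stalls(samples, max_gap=1.5):
    """Flag late deliveries in a list of (source_time, arrival_time) pairs.

    `samples` must be in arrival order, times in seconds. Returns a list of
    (index, source_gap, arrival_gap) for every sample that arrived more than
    `max_gap` seconds after the previous one.
    """
    stalls = []
    for i in range(1, len(samples)):
        source_gap = samples[i][0] - samples[i - 1][0]
        arrival_gap = samples[i][1] - samples[i - 1][1]
        if arrival_gap > max_gap:
            stalls.append((i, source_gap, arrival_gap))
    return stalls


# Made-up example data: the source ticks every 0.5 s, but three updates
# are held back and then delivered together.
log = [(0.0, 0.0), (0.5, 0.5), (1.0, 4.0), (1.5, 4.0), (2.0, 4.0), (2.5, 4.5)]
for index, source_gap, arrival_gap in find_stalls(log):
    print(index, source_gap, arrival_gap)
```

A large arrival gap paired with a normal 0.5 second source gap, followed by a burst of updates arriving together, matches the "buffered, but not lost" behavior described above rather than a stalled scan task.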
> >
> > Here's what we don't see:
> >
> > We don't see dependence on IOC architecture.
> > We don't see correlation with long-term resource (CPU, memory, FDs)
> > usage on either IOC.
> > We don't see correlation with network traffic.
> > We don't see correlation on short-term CPU usage (down to 1 second
> > sampling) on either IOC.
> > We don't see delays in the heartbeat timestamps generated at the target
> > IOC.
> >
> > Here's what we suspect:
> >
> > 1) Possible short term CPU loading on the target IOC by a task with a
> > priority less than the .5 second scan task but greater than CA or TCP.
> > 2) Possible buffering within CA or the TCP/IP stack.
> >
> > Aloha,
> > Peregrine
> > --
> > Peregrine M. McGehee Coordinator, Los Alamos Astrophysics
> > (505) 667-3273 MS H820, LANL, Los Alamos, NM 87545
>
>