Experimental Physics and Industrial Control System
We suffer from this very same affliction. We drive a panel full of digital
readouts (the old LED kind) with data from a remote IOC at 30 Hz. The readouts
freeze occasionally, for anywhere from a long blink to almost two minutes.
MEDM screens with the same data also freeze, though not necessarily at the same
times.
As in Peregrine's case, it's mostly one IOC, though we see it occasionally on
others. I can find no correlation to CPU load, file descriptor usage, network
load, or memory usage. It does not seem to be related to "time since reboot",
as the interval between stops fluctuates constantly: sometimes it is seconds
and sometimes days... well, at least so infrequent no one notices 'em.
I have not checked timestamps, but I can see the 30 Hz pulse counts and toroid
summations on the IOC continuing normally. An MEDM screen attached to those
channels will also temporarily freeze, but all the numbers jump to where they
ought to be when the pause is over.
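One way to check the timestamps is to log each update's IOC timestamp next to the time it arrived at the client, then look for updates that were stamped on schedule but delivered late. The sketch below is only an illustration under that assumption: find_stalls, its (ioc_timestamp, arrival_time) input format, and the max_gap threshold are hypothetical names, not part of EPICS or MEDM.

```python
def find_stalls(samples, max_gap=0.1):
    """Flag updates that arrived late although the IOC stamped them on time.

    samples: list of (ioc_timestamp, arrival_time) pairs in seconds,
             in the order the client received them (hypothetical log format).
    max_gap: largest gap, in seconds, still treated as normal (an assumption;
             at 30 Hz the nominal gap is about 0.033 s).
    Returns a list of (index, arrival_gap, source_gap) tuples.
    """
    stalls = []
    for i in range(1, len(samples)):
        source_gap = samples[i][0] - samples[i - 1][0]
        arrival_gap = samples[i][1] - samples[i - 1][1]
        # Late delivery with a regular source timestamp points at the
        # transport (CA, TCP/IP, network), not at record processing.
        if arrival_gap > max_gap and source_gap <= max_gap:
            stalls.append((i, arrival_gap, source_gap))
    return stalls
```

If the list comes back non-empty while the source gaps stay near 0.033 s, the data was produced on time and held up somewhere between the IOC and the client, which would match the numbers jumping to where they ought to be once the pause ends.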
Another interesting observation is that one number on the DRO panel always
updates, even when the rest of 'em freeze. It's the beam current number, which
is calculated in the local IOC (the one that drives the panel) from a toroid
value on the "stalled" IOC. The copy of that same toroid value that is displayed
directly on the DROs is stalled, but the one linked between the databases is
okay.
We've had this through at least a couple of versions of EPICS (currently
3.13.0b11) and two CPUs (mv167 and now mv172). I don't think the CPU is the
problem, though; I think it's in the network (or the software that drives it).
The problem has been especially bad in the last several days and I finally have
access to a sniffer and the guy to run it, so I'm hoping to learn more in the
next day or two. If anyone has an idea what to look for, I'm all ears.
Garrett Rinehart
Intense Pulsed Neutron Source
Argonne National Laboratory
9700 S. Cass Ave
Argonne, IL 60439
(630)252-6561
> Subject: Delays in receipt of CA monitors
>
> Here at the Low Energy Demonstration Accelerator we are seeing
> intermittent significant delays in the receipt of CA monitors. As these
> delays are large enough to trigger a hardware protection system that
> places the accelerator in a safe mode (i.e., turns the beam off), we have
> spent considerable time over the past months trying to understand this.
>
> We don't yet understand the root cause and are asking if anyone has seen
> a similar effect. We realize that TCP is non-realtime but are concerned
> since these delays can be as large as 3 to 10 seconds.
>
> Here's the context:
>
> In order to provide prompt detection of a failure in the control system
> IOCs or network at timescales shorter than the default CA timeout (30
> seconds), each IOC provides a heartbeat. This heartbeat is implemented
> as a calculation record scanned at .5 second that toggles between 0 and
> 1. On our run permit IOC we have a genSub record scanned at 1 second
> that has an input CP link to that heartbeat. Thus the genSub record will
> process both at 1 second intervals and when a monitor comes in from the
> heartbeat. If a monitor is not received within a specific interval (1.5
> seconds) we assume that there is a problem and shut down the
> accelerator.
>
> Here's what we see:
>
> For a small subset of IOCs, and mostly from one, we see several times a
> day, delays of several seconds on the receipt of a monitor. The effect
> is that the monitors are buffered, but not lost. We can also observe
> this behavior by running camonitor on an IOC heartbeat channel from a
> workstation.
>
> The timestamps generated by record processing of the heartbeats are
> always seen (by dbCaGetTimeStamp) at .5 second intervals.
>
> Here's what we don't see:
>
> We don't see dependence on IOC architecture.
> We don't see correlation with long-term resource (CPU, memory, FDs)
> usage on either IOC.
> We don't see correlation with network traffic.
> We don't see correlation with short-term CPU usage (down to 1 second
> sampling) on either IOC.
> We don't see delays in the heartbeat timestamps generated at the target
> IOC.
>
> Here's what we suspect:
>
> 1) Possible short term CPU loading on the target IOC by a task with a
> priority less than the .5 second scan task but greater than CA or TCP.
> 2) Possible buffering within CA or the TCP/IP stack.
>
> Aloha,
> Peregrine
> --
> Peregrine M. McGehee Coordinator, Los Alamos Astrophysics
> (505) 667-3273 MS H820, LANL, Los Alamos, NM 87545
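
The heartbeat check described in the quoted message can be modelled roughly as follows. This is a simplified sketch, not the actual genSub logic (which the message does not show): watchdog_trips and its arguments are hypothetical, and it looks only at the gaps between monitor arrival times against the 1.5 second window.

```python
def watchdog_trips(arrival_times, timeout=1.5):
    """Return the times at which a heartbeat watchdog would trip.

    arrival_times: monitor arrival times in seconds, ascending
                   (hypothetical input, e.g. taken from a camonitor log).
    timeout: longest allowed gap between monitors, in seconds
             (1.5 in the scheme described in the quoted message).
    """
    trips = []
    for prev, cur in zip(arrival_times, arrival_times[1:]):
        # A gap longer than the timeout means the watchdog expired at
        # prev + timeout, before the next monitor showed up.
        if cur - prev > timeout:
            trips.append(prev + timeout)
    return trips
```

With heartbeats arriving every 0.5 second the list stays empty. If a burst of buffered monitors arrives after a 3 second pause, the function reports one trip even though no monitor was lost, which is the failure mode described in the quoted message.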