EPICS Home

Experimental Physics and Industrial Control System


 

Subject: Re: Delays in receipt of CA monitors
From: "Garrett D. Rinehart" <[email protected]>
To: [email protected]
Cc: [email protected]
Date: Wed, 8 Dec 1999 10:20:52 -0600 (CST)
We suffer from this very same affliction. We drive a panel full of digital 
readouts (the old LED kind) with data from a remote IOC at 30Hz. The readouts 
freeze occasionally for as little time as a long blink to almost two minutes. 
MEDM screens with the same data also freeze, though not necessarily at the same 
times.

As in Peregrine's case, it's mostly one IOC, though we see it occasionally on 
others. I can find no correlation to CPU load, file descriptor usage, network 
load, or memory usage. It does not seem to be related to "time since reboot" as 
it fluctuates constantly. It is sometimes seconds between stops and sometimes 
days... well, at least so infrequent no one notices 'em.

I have not checked timestamps, but I can see the 30Hz pulse counts and toroid 
summations on the IOC continuing normally. An MEDM screen attached to those 
channels will also temporarily freeze, but all the numbers jump to where they 
ought to be when the pause is over.

Another interesting observation is that one number on the DRO panel always 
updates, even when the rest of 'em freeze. It's the beam current number which is 
calculated in the local ioc (the one that drives the panel) off a toroid value 
from the "stalled" ioc. The copy of that same toroid value that is displayed 
directly on the DROs is stalled, but the one linked between the databases is 
okay.

We've had this through at least a couple of versions of EPICS (currently 
3.13.0b11) and two CPUs (mv167 and now mv172). I don't think the cpu is the 
problem though, I think it's in the network (or the software that drives it). 
The problem has been especially bad in the last several days and I finally have 
access to a sniffer and the guy to run it, so I'm hoping to learn more in the 
next day or two. If anyone has an idea what to look for, I'm all ears.

Garrett Rinehart
Intense Pulsed Neutron Source
Argonne National Laboratory
9700 S. Cass Ave
Argonne, IL  60439
(630)252-6561
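
One way to tell a stall in delivery from a stall in record processing is to log each update's source timestamp alongside the time it arrived at the client, then compare the gaps. The sketch below is only an illustration under stated assumptions: the sample log is a plain list of (source_timestamp, arrival_time) pairs in seconds, and the names `Stall`, `find_buffered_stalls`, and `slack` are hypothetical, not part of EPICS or Channel Access.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Stall:
    index: int          # position of the late sample in the log
    arrival_gap: float  # seconds since the previous sample arrived
    source_gap: float   # seconds between the two source timestamps


def find_buffered_stalls(
    samples: Sequence[Tuple[float, float]],
    period: float,
    slack: float = 1.5,  # assumed tolerance, as a multiple of the period
) -> List[Stall]:
    """Flag samples that arrived late although the source kept time.

    A large arrival gap combined with a normal source gap suggests the
    update was held up after record processing (in CA or the network),
    not that the record stopped processing.
    """
    stalls = []
    for i in range(1, len(samples)):
        source_gap = samples[i][0] - samples[i - 1][0]
        arrival_gap = samples[i][1] - samples[i - 1][1]
        if arrival_gap > slack * period and source_gap <= slack * period:
            stalls.append(Stall(i, arrival_gap, source_gap))
    return stalls


# Made-up log: source ticks every 0.5 s, but the third update arrives
# about two seconds late and the following ones arrive in a burst.
log = [(0.0, 0.01), (0.5, 0.51), (1.0, 3.00),
       (1.5, 3.01), (2.0, 3.02), (2.5, 3.03)]
stalls = find_buffered_stalls(log, period=0.5)
```

A burst of closely spaced arrivals right after a flagged sample matches the "numbers jump to where they ought to be" behavior described above.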

> To: [email protected]
> Subject: Delays in receipt of CA monitors
> 
> Here at the Low Energy Demonstration Accelerator we are seeing
> intermittent significant delays in the receipt of CA monitors. As these
> delays are large enough to trigger a hardware protection system that
> places the accelerator in a safe mode (e.g. turn the beam off) we have
> spent considerable time over the past months to understand this.
> 
> We don't yet understand the root cause and are asking if anyone has seen
> a similar effect. We realize that TCP is non-realtime but are concerned
> since these delays can be as large as 3 to 10 seconds.
> 
> Here's the context: 
> 
> In order to provide prompt detection of a failure in the control system
> IOCs or network at timescales shorter than the default CA timeout (30
> seconds), each IOC provides a heartbeat. This heartbeat is implemented
> as a calculation record scanned at .5 second that toggles between 0 and
> 1. On our run permit IOC we have a genSub record scanned at 1 second
> that has an input CP link to that heartbeat. Thus the genSub record will
> process both at 1 second intervals and when a monitor comes in from the
> heartbeat. If a monitor is not received within a specific interval (1.5
> seconds) we assume that there is a problem and shut down the
> accelerator.
> 
> Here's what we see:
> 
> For a small subset of IOCs, and mostly from one, we see several times a
> day, delays of several seconds on the receipt of a monitor. The effect
> is one of the monitors being buffered, but not lost. We can also observe
> this behavior by running camonitor on an IOC heartbeat channel from a
> workstation.
> 
> The timestamps generated by record processing of the heartbeats are
> always seen (by dbCaGetTimeStamp) at .5 second intervals.
> 
> Here's what we don't see:
> 
> We don't see dependence on IOC architecture.
> We don't see correlation with long-term resource (CPU, memory, FDs)
> usage on either IOC.
> We don't see correlation with network traffic.
> We don't see correlation with short-term CPU usage (down to 1 second
> sampling) on either IOC.
> We don't see delays in the heartbeat timestamps generated at the target
> IOC.
> 
> Here's what we suspect:
> 
> 1) Possible short term CPU loading on the target IOC by a task with a
> priority less than the .5 second scan task but greater than CA or TCP.
> 2) Possible buffering within CA or the TCP/IP stack.
> 
> Aloha,
> 	Peregrine
> -- 
> Peregrine M. McGehee	Coordinator, Los Alamos Astrophysics
> (505) 667-3273 		MS H820, LANL, Los Alamos, NM 87545
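
The heartbeat check described in the quoted message (process on every incoming monitor and on a 1 second scan, trip if no monitor arrives within 1.5 seconds) can be sketched roughly as follows. This is not the genSub code used at LEDA, only a minimal model of the logic; the class name `HeartbeatWatchdog`, its methods, and the latching behavior are assumptions made for illustration. Time is passed in explicitly so the logic can be exercised without a live IOC.

```python
class HeartbeatWatchdog:
    """Latching timeout check on heartbeat monitor arrivals (illustrative)."""

    def __init__(self, timeout: float = 1.5, start: float = 0.0):
        self.timeout = timeout   # allowed gap between monitors, in seconds
        self.last_seen = start   # arrival time of the most recent monitor
        self.tripped = False     # latched once a gap exceeds the timeout

    def check(self, now: float) -> bool:
        """Periodic scan: return True while the heartbeat is healthy."""
        if now - self.last_seen > self.timeout:
            self.tripped = True
        return not self.tripped

    def on_monitor(self, now: float) -> bool:
        """Called when a heartbeat monitor arrives (the CP-link case)."""
        healthy = self.check(now)  # a late monitor still counts as a fault
        self.last_seen = now
        return healthy


# Made-up arrival times: monitors at 0.5 s and 1.0 s, then silence.
wd = HeartbeatWatchdog(timeout=1.5)
wd.on_monitor(0.5)
wd.on_monitor(1.0)
ok_at_2s = wd.check(2.0)   # 1.0 s since the last monitor
ok_at_3s = wd.check(3.0)   # 2.0 s since the last monitor
wd.on_monitor(3.1)         # the buffered monitor finally arrives
ok_after = wd.check(3.2)   # stays tripped because the fault is latched
```

With a check like this, a monitor that is buffered for 3 to 10 seconds trips the watchdog even though the heartbeat record itself never missed a scan.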



