Here at the Low Energy Demonstration Accelerator we are seeing
intermittent, significant delays in the receipt of CA monitors. Because
these delays are large enough to trigger a hardware protection system
that places the accelerator in a safe mode (e.g. turns the beam off), we
have spent considerable time over the past months trying to understand
them.
We don't yet understand the root cause and are asking if anyone has seen
a similar effect. We realize that TCP is non-realtime but are concerned
since these delays can be as large as 3 to 10 seconds.
Here's the context:
In order to provide prompt detection of a failure in the control system
IOCs or network at timescales shorter than the default CA timeout (30
seconds), each IOC provides a heartbeat. This heartbeat is implemented
as a calculation record scanned at 0.5 second that toggles between 0 and
1. On our run permit IOC we have a genSub record scanned at 1 second
that has an input CP link to that heartbeat. Thus the genSub record will
process both at 1 second intervals and when a monitor comes in from the
heartbeat. If a monitor is not received within a specific interval (1.5
seconds) we assume that there is a problem and shut down the
accelerator.
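
To make the timeout logic concrete, here is a rough Python sketch of the
check described above. It is not the genSub code itself: the function
name, the assumption that the link was healthy at t = 0, and the
simplification that the check runs only on the 1 second scan (ignoring
the extra processing triggered by the CP link) are all illustrative
assumptions.

```python
import bisect


def first_trip_time(arrivals, scan_period=1.0, timeout=1.5, end_time=10.0):
    """Return the first scan time at which the heartbeat looks stale.

    arrivals: sorted times (seconds) at which heartbeat monitors arrived.
    Returns None if no scan up to end_time sees a gap longer than timeout.
    Hypothetical helper for illustration only.
    """
    n_scans = int(end_time / scan_period)
    for n in range(1, n_scans + 1):
        now = n * scan_period
        # Most recent monitor received at or before this scan.
        i = bisect.bisect_right(arrivals, now)
        last = arrivals[i - 1] if i else 0.0  # assume healthy at t = 0
        if now - last > timeout:
            return now
    return None
```

With a steady 0.5 second heartbeat this returns None. If monitors stop
arriving after t = 2.0 s and are then delivered together at t = 5.0 s,
the check trips at the 4.0 s scan.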
Here's what we see:
For a small subset of IOCs, and mostly for one of them, we see delays of
several seconds in the receipt of a monitor, several times a day. The
monitors appear to be buffered, but not lost. We can also observe
this behavior by running camonitor on an IOC heartbeat channel from a
workstation.
The timestamps generated by record processing of the heartbeats are
always seen (by dbCaGetTimeStamp) at 0.5 second intervals.
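
One way to tell "buffered" from "lost" in logged data is to compare the
spacing of the record timestamps with the spacing of the receipt times.
The Python sketch below assumes each sample has been logged as a
(record timestamp, receipt time) pair in seconds; the function name and
the tolerance are hypothetical. Because only gaps are compared, the two
clocks do not need to be synchronized.

```python
def summarize_delivery(samples, period=0.5, tol=0.05):
    """Summarize a log of (record_timestamp, receipt_time) pairs.

    Returns a dict with:
      lost            - True if any record timestamp gap exceeds period + tol
      max_receipt_gap - longest interval between consecutive receipts
    Hypothetical helper for illustration only.
    """
    pairs = list(zip(samples, samples[1:]))
    stamp_gaps = [b[0] - a[0] for a, b in pairs]
    receipt_gaps = [b[1] - a[1] for a, b in pairs]
    return {
        "lost": any(gap > period + tol for gap in stamp_gaps),
        "max_receipt_gap": max(receipt_gaps) if receipt_gaps else 0.0,
    }
```

Evenly spaced record timestamps combined with a large receipt gap match
what we describe here: every monitor is delivered, but late.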
Here's what we don't see:
We don't see dependence on IOC architecture.
We don't see correlation with long-term resource (CPU, memory, FDs)
usage on either IOC.
We don't see correlation with network traffic.
We don't see correlation with short-term CPU usage (down to 1 second
sampling) on either IOC.
We don't see delays in the heartbeat timestamps generated at the target
IOC.
Here's what we suspect:
1) Possible short-term CPU loading on the target IOC by a task with a
priority less than the 0.5 second scan task but greater than CA or TCP.
2) Possible buffering within CA or the TCP/IP stack.
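
As a toy illustration of why either suspicion would fit the symptoms,
the Python sketch below models a sender that keeps processing the
heartbeat on schedule but cannot deliver anything during a stall window.
The function name and the stall times are made up for the example; this
is not a model of CA or of the TCP/IP stack, and it cannot distinguish
between the two causes.

```python
def simulate_stall(n_beats, period=0.5, stall_start=2.2, stall_end=5.0):
    """Return (record_timestamp, arrival_time) for each heartbeat.

    Records are processed every `period` seconds regardless of the stall.
    Any update stamped inside [stall_start, stall_end) is held and only
    arrives at stall_end. Hypothetical model for illustration only.
    """
    samples = []
    for k in range(1, n_beats + 1):
        stamp = k * period
        held = stall_start <= stamp < stall_end
        arrival = stall_end if held else stamp
        samples.append((stamp, arrival))
    return samples
```

For 12 heartbeats with the default stall, all 12 updates arrive and
their timestamps stay 0.5 second apart, but the receiver sees nothing
between 2.0 s and 5.0 s, which is long enough to exceed the 1.5 second
limit.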
Aloha,
Peregrine
--
Peregrine M. McGehee Coordinator, Los Alamos Astrophysics
(505) 667-3273 MS H820, LANL, Los Alamos, NM 87545