EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: "Heartbeat" databases or sequences?
From: Kay-Uwe Kasemir <[email protected]>
To: tech talk <[email protected]>
Date: Thu, 15 Feb 2007 10:26:51 -0500
Hello:

About two weeks ago I asked who is using 'heartbeat' type databases on top of ChannelAccess to monitor the health of connections.

The reason for this little survey:
At the SNS we have a 1 Hz sawtooth-ramp "heartbeat" record on each IOC, and database snippets to check the received heartbeat values for delays.
On IOCs that read PVs from other IOCs for the purpose of software interlocks, we are supposed to fold these heartbeat checks into the interlock: If the heartbeat is delayed, assume that the “OK” PV received from the other IOC might already be “not OK”.
When configured for a 3 second timeout, which due to database details might be closer to a 2 second timeout, those heartbeat checks occasionally fail and interrupt operation of the SNS linac.
So far we have too little data to really understand this. All might be fine for a month, then there could be a handful of trips over a few days, where several but not all of the IOCs that receive heartbeats from the same “sending” IOC observe delays. There is no clear correlation with CPU load or monitored network traffic.
For now, we increased the timeouts to 6, 10 or more seconds, and Pam Gurd started to monitor the delays across the site.
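
To make the idea of folding a heartbeat check into a software interlock concrete, here is a minimal Python sketch. It is not the SNS database logic, which lives in EPICS records; the class name `HeartbeatWatchdog`, the function `interlock_ok` and the injectable `clock` parameter are all hypothetical names chosen for illustration. The only parts taken from the description above are the changing heartbeat value, the timeout in seconds, and the rule that a stale heartbeat means the remote "OK" can no longer be trusted.

```python
import time


class HeartbeatWatchdog:
    """Remembers when the received heartbeat value last changed."""

    def __init__(self, timeout_s=3.0, clock=time.monotonic):
        self.timeout_s = timeout_s      # hypothetical default, matches the 3 s example
        self._clock = clock             # injectable so the logic can be tested
        self._last_value = None
        self._last_change = None        # None until the first heartbeat arrives

    def update(self, value):
        """Call for every heartbeat value received from the remote IOC."""
        if value != self._last_value:
            self._last_value = value
            self._last_change = self._clock()

    def is_alive(self):
        """True while the heartbeat changed within the timeout."""
        if self._last_change is None:
            return False
        return (self._clock() - self._last_change) <= self.timeout_s


def interlock_ok(remote_ok, watchdog):
    """Trust the remote "OK" only while the heartbeat is fresh."""
    return bool(remote_ok) and watchdog.is_alive()
```

A repeated, unchanged value does not count as a heartbeat here, which mirrors a sawtooth ramp that has stopped advancing. The trade-off discussed above shows up directly in `timeout_s`: a short timeout reacts quickly but trips on occasional delays, a long one tolerates delays but trusts a possibly outdated "OK" for longer.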


In the past, LEDA @ LANL also used 1 Hz signals, tolerance of a few seconds, folded into software interlocks. There were problems with network delays until upgrading to a switched network. Unclear if all was then really fine, since LEDA shut down.

So I asked who else is using 'heartbeats' at all, at what rate, and with what consequences in case of a trip.

Ralph Lange @ DESY
Using ALH and the default ChannelAccess timeout mechanism (~30 seconds), the INVALID severity is used to identify network problems.


Judy Rock @ SLAC
On selected IOCs, a database on purpose causes a MAJOR alarm every 10 minutes, and these end up in cmlog. If they are not found in cmlog every 10 minutes, automated emails are sent.
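
The check for missing periodic alarms can be sketched roughly as follows in Python. This does not use cmlog or any SLAC tooling; the function name `missing_intervals` and the `slack` allowance are assumptions made for illustration, and the email notification is left out. The sketch only shows the gap detection over a list of alarm timestamps.

```python
from datetime import datetime, timedelta


def missing_intervals(alarm_times,
                      period=timedelta(minutes=10),
                      slack=timedelta(minutes=1)):
    """Return (previous, next) timestamp pairs spaced more than period + slack apart.

    `slack` is a hypothetical tolerance for logging jitter.
    """
    ordered = sorted(alarm_times)
    return [(a, b) for a, b in zip(ordered, ordered[1:])
            if b - a > period + slack]


start = datetime(2007, 2, 15, 10, 0)
# Alarms at 10:00, 10:10, 10:20, then nothing until 10:45.
seen = [start + timedelta(minutes=m) for m in (0, 10, 20, 45)]
gaps = missing_intervals(seen)
```

With the sample timestamps, `gaps` holds the single pair (10:20, 10:45), which is where a notification would be sent.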


The following sites all add a 1 Hz signal to some or all IOCs, which is then just displayed, or monitored automatically with various timeouts, and maybe even triggering an automated reaction:

David Maden @ PSI
1 Hz signals displayed on screens for visual inspection.

Paul Sichta @ Princeton Plasma physics Lab
1 Hz signal, checked every 10 seconds, timeout after 10..20 seconds. Logged, but no action. Maybe one alarm per month.


At SNS, a set of soft IOCs monitors all IOC heartbeats, displayed on an IOC status screen, turning red after 30 seconds without heartbeats. It’s evolving into a soft-ioc-based alarm handler, with logging, manual reset, masking etc.

Matt Bickley, JLAB
1 Hz signal on each IOC, monitored by database on one IOC, (minor) alarm after 2 second timeout. No automated reaction. Found useful to detect database delays because of e.g. CAMAC timeouts.


John Faucett @ LANSCE, LANL
1 Hz signal (0/1 toggle), monitored by custom TCL client, displays green/yellow/red based on received rate. No further action.


Kazuro Furukawa, KEKB
1 Hz signal (0/1 toggle) on each of the ~100 IOCs, displayed on one MEDM screen, and also monitored by a central IOC with a 30 second timeout.
Faults are logged. For certain IOCs with known issues, automated re-activation procedures are initiated.
Was helpful during commissioning. Still detects occasional clock issues on 68k IOCs. Thinking about a replacement based on the TCP keep-alive timer, if supported by the OS.


Burkhard Kolb, HADES @ GSI.de
For experiment runs, 1 Hz signal is logged with data into Oracle. On data retrieval a database package sums over 60 seconds, the sum has to be greater than 58 to trust the stored data.
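
The retrieval-time check can be written down in a few lines of Python. This is a sketch and not the HADES Oracle package; it assumes that each heartbeat logged within the 60 second window contributes 1 to the sum, and the names `window_is_trusted` and `threshold` are hypothetical.

```python
def window_is_trusted(samples, threshold=58):
    """samples: heartbeat values logged during one 60 second window.

    Assumes every received heartbeat was stored as 1, so the sum counts
    how many of the expected 60 heartbeats actually arrived.
    """
    return sum(samples) > threshold
```

Under that assumption, a window with at most one missing heartbeat is still trusted, and two or more missing heartbeats mark the stored data as suspect.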






