Experimental Physics and
Industrial Control System


Subject:	Re: "Heartbeat" databases or sequences?
From:	Kay-Uwe Kasemir <[email protected]>
To:	tech talk <[email protected]>
Date:	Thu, 15 Feb 2007 10:26:51 -0500

Hello:

About two weeks ago I asked who is using 'heartbeat' type databases on top of ChannelAccess to monitor the health of connections.

The reason for this little survey: At the SNS we have a 1 Hz sawtooth-ramp "heartbeat" record on each IOC, and database snippets to check the received heartbeat values for delays. On IOCs that read PVs from other IOCs for the purpose of software interlocks, we are supposed to fold these heartbeat checks into the interlock: If the heartbeat is delayed, assume that the “OK” PV received from the other IOC might already be “not OK”.

When configured for a 3 second timeout, which due to database details might be closer to a 2 second timeout, those heartbeat checks occasionally fail and interrupt operation of the SNS linac. So far we have too little data to really understand this. All might be fine for a month, then there could be a handful of trips over a few days, where several but not all of the IOCs that receive heartbeats from the same “sending” IOC observe delays. There is no clear correlation with CPU load or monitored network traffic. For now, we increased the timeouts to 6, 10 or more seconds, and Pam Gurd started to monitor the delays across the site.
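
The receiving-side check described above can be sketched in Python. This is only an illustration of the idea, not the SNS database logic: the names `HeartbeatWatchdog` and `interlock_ok` and the explicit timestamps are hypothetical, and on a real IOC the same logic would live in database records rather than in client code.

```python
from typing import Optional


class HeartbeatWatchdog:
    """Tracks when a remote heartbeat value last changed (hypothetical helper)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_value: Optional[int] = None
        self.last_change: Optional[float] = None

    def update(self, value: int, now: float) -> None:
        # Only a changed value counts as a sign of life; a repeated value
        # means the sender's ramp did not advance.
        if value != self.last_value:
            self.last_value = value
            self.last_change = now

    def alive(self, now: float) -> bool:
        # Before the first heartbeat arrives, report "not alive".
        if self.last_change is None:
            return False
        return (now - self.last_change) <= self.timeout_s


def interlock_ok(remote_ok: bool, watchdog: HeartbeatWatchdog, now: float) -> bool:
    # A stale heartbeat means the received "OK" may be out of date,
    # so it is treated as "not OK".
    return remote_ok and watchdog.alive(now)
```

With `timeout_s=3.0`, a heartbeat that last changed at t=0 keeps the interlock satisfied at t=2.5 but trips it at t=3.5, which is the kind of trip reported above when delays exceed the configured timeout.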

In the past, LEDA @ LANL also used 1 Hz signals, tolerance of a few seconds, folded into software interlocks. There were problems with network delays until upgrading to a switched network. Unclear if all was then really fine, since LEDA shut down.

So I asked who else is using 'heartbeats' at all, at what rate, and with what consequences in case of a trip.

Ralph Lange @ DESY
Using ALH and the default ChannelAccess timeout mechanism (~30 seconds), the INVALID severity is used to identify network problems.

Judy Rock @ SLAC
On selected IOCs, a database deliberately raises a MAJOR alarm every 10 minutes, and these end up in cmlog. If they are not found in cmlog every 10 minutes, automated emails are sent.
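
The "expected alarm is missing from the log" check can be sketched as follows. This is a minimal Python illustration, not the SLAC implementation: the function `find_missing_intervals`, the `slack_s` tolerance, and the representation of log entries as plain timestamps in seconds are all assumptions, and nothing here uses the cmlog API.

```python
from typing import List, Tuple


def find_missing_intervals(
    event_times: List[float],
    start: float,
    end: float,
    period_s: float = 600.0,
    slack_s: float = 60.0,
) -> List[Tuple[float, float]]:
    """Return (from, to) gaps in which an expected periodic entry is missing.

    event_times are the timestamps of the deliberate alarms found in the
    log between start and end. slack_s is an assumed tolerance for jitter.
    """
    gaps = []
    previous = start
    # Appending end also catches a heartbeat that stopped near the end.
    for t in sorted(event_times) + [end]:
        if t - previous > period_s + slack_s:
            gaps.append((previous, t))
        previous = t
    return gaps
```

A non-empty result would be the trigger for the automated email; how that email is sent is outside this sketch.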

The following sites all add a 1 Hz signal to some or all IOCs, which is then just displayed, or monitored automatically with various timeouts, in some cases even triggering an automated reaction:

David Maden @ PSI
1 Hz signals displayed on screens for visual inspection.

Paul Sichta @ Princeton Plasma Physics Lab
1 Hz signal, checked every 10 seconds, timeout after 10 to 20 seconds. Logged, but no action. Maybe one alarm per month.

At SNS, a set of soft IOCs monitors all IOC heartbeats, displayed on an IOC status screen, turning red after 30 seconds without heartbeats. It’s evolving into a soft-IOC-based alarm handler, with logging, manual reset, masking etc.

Matt Bickley, JLAB
1 Hz signal on each IOC, monitored by database on one IOC, (minor) alarm after 2 second timeout. No automated reaction. Found useful to detect database delays because of e.g. CAMAC timeouts.

John Faucett @ LANSCE, LANL
1 Hz signal (0/1 toggle), monitored by custom TCL client, displays green/yellow/red based on received rate. No further action.

Kazuro Furukawa, KEKB
1 Hz signal (0/1 toggle) on each of the ~100 IOCs, displayed on one MEDM screen, and also monitored by central IOC with 30 second timeout. Faults are logged. For certain IOCs with known issues, automated re-activation procedures are initiated. Was helpful during commissioning. Still detects occasional clock issues on 68k IOCs. Thinking about replacement based on TCP keep-alive timer, if supported by OS.

Burkhard Kolb, HADES @ GSI.de
For experiment runs, 1 Hz signal is logged with data into Oracle. On data retrieval a database package sums over 60 seconds; the sum has to be greater than 58 to trust the stored data.
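
The retrieval-time check in the HADES entry can be illustrated with a short Python sketch. It assumes that each logged heartbeat contributes 1 to the sum and a missing one contributes 0, which the report above does not spell out; the function name `data_trusted` is hypothetical, and the real check runs as a database package, not in Python.

```python
from typing import List


def data_trusted(heartbeats: List[int], threshold: int = 58) -> bool:
    """Decide whether a 60-second window of stored data can be trusted.

    heartbeats holds one sample per second: 1 if the 1 Hz heartbeat was
    logged in that second, 0 if it was missing (an assumed encoding).
    """
    if len(heartbeats) != 60:
        raise ValueError("expected exactly 60 one-second samples")
    # The sum must be strictly greater than 58, so at most one
    # heartbeat may be missing in the window.
    return sum(heartbeats) > threshold
```

Under this assumption a window with one missed heartbeat (sum 59) is still accepted, while two missed heartbeats (sum 58) cause the data to be rejected.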


ANJ, 10 Nov 2011
