Experimental Physics and
About two weeks ago I asked who is using 'heartbeat' type databases on top of ChannelAccess to monitor the health of connections.

The reason for this little survey: At the SNS we have a 1 Hz sawtooth-ramp "heartbeat" record on each IOC, and database snippets to check the received heartbeat values for delays. On IOCs that read PVs from other IOCs for the purpose of software interlocks, we are supposed to fold these heartbeat checks into the interlock: If the heartbeat is delayed, assume that the “OK” PV received from the other IOC might already be “not OK”.

When configured for a 3 second timeout, which due to database details might be closer to a 2 second timeout, those heartbeat checks occasionally fail and interrupt operation of the SNS linac. So far we have too little data to really understand this. All might be fine for a month, then there could be a handful of trips over a few days, where several but not all of the IOCs that receive heartbeats from the same “sending” IOC observe delays. There is no clear correlation with CPU load or monitored network traffic. For now, we increased the timeouts to 6, 10 or more seconds, and Pam Gurd started to monitor the delays across the site.

In the past, LEDA @ LANL also used 1 Hz signals with a tolerance of a few seconds, folded into software interlocks. There were problems with network delays until upgrading to a switched network. It is unclear whether all was then really fine, since LEDA shut down.

So I asked who else is using 'heartbeats' at all, at what rate, and with what consequences in case of a trip.

Ralph Lange @ DESY
Using ALH and the default ChannelAccess timeout mechanism (~30 seconds), the INVALID severity is used to identify network problems.

Judy Rock @ SLAC
On selected IOCs, a database on purpose causes a MAJOR alarm every 10 minutes, and these end up in cmlog. If they are not found in cmlog every 10 minutes, automated emails are sent.
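
As an illustration only, the receiving-side check described for the SNS earlier in this message (a 1 Hz heartbeat tested against a timeout and folded into a software interlock) could be sketched in Python roughly as follows. This is not the actual SNS database logic: the class `HeartbeatWatchdog`, the function `interlock_ok`, and the idea of passing timestamps in explicitly are all assumptions made for the sketch, and a real implementation would live in EPICS database records rather than in a client script.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatWatchdog:
    """Hypothetical helper: remembers when the received heartbeat last changed."""
    timeout_s: float = 3.0
    last_value: Optional[int] = None
    last_change_s: Optional[float] = None

    def update(self, value: int, now_s: float) -> None:
        # Only a changed value shows that the sending IOC is still processing.
        if value != self.last_value:
            self.last_value = value
            self.last_change_s = now_s

    def is_delayed(self, now_s: float) -> bool:
        # Never having seen a heartbeat is treated as delayed.
        if self.last_change_s is None:
            return True
        return (now_s - self.last_change_s) > self.timeout_s


def interlock_ok(remote_ok: bool, watchdog: HeartbeatWatchdog, now_s: float) -> bool:
    # A delayed heartbeat means the received "OK" might already be "not OK".
    return remote_ok and not watchdog.is_delayed(now_s)


# Simulated sawtooth heartbeat arriving once per second for five seconds.
wd = HeartbeatWatchdog(timeout_s=3.0)
for t in range(5):
    wd.update(value=t % 10, now_s=float(t))

fresh = interlock_ok(True, wd, now_s=5.0)  # last change was 1 s ago
stale = interlock_ok(True, wd, now_s=8.5)  # last change was 4.5 s ago
```

In this sketch the interlock only passes while both the remote "OK" value and the heartbeat are current, so a short timeout trades faster detection against more spurious trips, which is the trade-off the message describes.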
The following sites all add a 1 Hz signal to some or all IOCs, which is then just displayed, or monitored automatically with various timeouts, and maybe even triggers an automated reaction:

David Maden @ PSI
1 Hz signals displayed on screens for visual inspection.

Paul Sichta @ Princeton Plasma Physics Lab
1 Hz signal, checked every 10 seconds, timeout after 10..20 seconds. Logged, but no action. Maybe one alarm per month.

At SNS, a set of soft IOCs monitors all IOC heartbeats, displayed on an IOC status screen, turning red after 30 seconds without heartbeats. It’s evolving into a soft-IOC-based alarm handler, with logging, manual reset, masking etc.

Matt Bickley, JLAB
1 Hz signal on each IOC, monitored by a database on one IOC, (minor) alarm after 2 second timeout. No automated reaction. Found useful to detect database delays caused by e.g. CAMAC timeouts.

John Faucett @ LANSCE, LANL
1 Hz signal (0/1 toggle), monitored by a custom TCL client, which displays green/yellow/red based on the received rate. No further action.

Kazuro Furukawa, KEKB
1 Hz signal (0/1 toggle) on each of the ~100 IOCs, displayed on one MEDM screen, and also monitored by a central IOC with a 30 second timeout. Faults are logged. For certain IOCs with known issues, automated re-activation procedures are initiated. Was helpful during commissioning. Still detects occasional clock issues on 68k IOCs. Thinking about a replacement based on the TCP keep-alive timer, if supported by the OS.

Burkhard Kolb, HADES @ GSI.de
For experiment runs, the 1 Hz signal is logged with the data into Oracle. On data retrieval a database package sums over 60 seconds; the sum has to be greater than 58 to trust the stored data.
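
The HADES rule quoted above (sum over 60 seconds, trusted only if the sum is greater than 58) can be illustrated with a small Python sketch. It assumes that each logged heartbeat sample contributes 1 to the sum, which the message does not state explicitly, and the function name `window_trusted` is hypothetical; the real check runs as a database package on the Oracle side.

```python
from typing import Iterable

WINDOW_S = 60  # window length quoted in the message
MIN_SUM = 58   # the sum has to be greater than this value


def window_trusted(heartbeat_times_s: Iterable[float], window_start_s: float) -> bool:
    """Assumption: every logged heartbeat sample adds 1 to the window sum."""
    window_end_s = window_start_s + WINDOW_S
    total = sum(1 for t in heartbeat_times_s if window_start_s <= t < window_end_s)
    return total > MIN_SUM


# A complete minute of 1 Hz samples, and one with a five-second gap.
complete = [float(t) for t in range(60)]
gappy = [float(t) for t in range(60) if not 20 <= t < 25]

complete_ok = window_trusted(complete, window_start_s=0.0)
gappy_ok = window_trusted(gappy, window_start_s=0.0)
```

Under this assumption the rule tolerates a single missing sample per minute (59 is greater than 58) but rejects a window with two or more missing samples.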
ANJ, 10 Nov 2011 |