The alive record by itself is not very useful: it needs a server that collects the heartbeat UDP packets sent by the alive records on IOCs. The server has to process the messages and keep a database in order to make use of the information.
The typical intended configuration is many IOCs sending heartbeats to a single server. This of course makes the server a single point of failure. The record doesn't allow for multiple packet recipients (although it could in theory), so if this is a problem, multiple alive records with different server targets (and different local TCP ports) can be loaded on an IOC, allowing for server redundancy.
This document describes how the data can be used. It is based on how the author has designed a server.
The first thing is to make sure that the alive record is sending heartbeat UDP packets to the server (set by the RHOST field) at the expected UDP port (set by the RPORT field), and at the expected rate, which is determined by the HPRD period.
At this point, UDP packets from the IOCs will arrive at the server port. UDP is by nature unreliable: packets can be dropped or delayed (and so may arrive out of order), and the packet handling has to allow for this.
The IP address of the sending IOC is not included in the heartbeat. This is because there might be several active network interfaces, which makes it unclear which one will be used for sending. When the UDP packet is received, the IP address of the sender is provided along with it, and this is what needs to be used as the IOC IP address. The IP address alone can't identify an IOC, as multiple IOCs can exist on one machine, which is why the IOC environment variable is used for identification.
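As a rough illustration of taking the IOC address from the packet's sender rather than from its payload, here is a minimal Python sketch. It assumes an IPv4 UDP socket; the function name `receive_heartbeat` and the 2048-byte buffer size are hypothetical choices, not part of the alive record. The payload is returned unparsed.

```python
import socket
import time
from typing import Tuple


def receive_heartbeat(sock: socket.socket) -> Tuple[bytes, str, float]:
    """Wait for one datagram and return (payload, sender_ip, received_at).

    Assumes an AF_INET (IPv4) socket, for which recvfrom() reports the
    sender as a (host, port) pair.
    """
    # 2048 is an assumed upper bound on the message size, not a documented one.
    payload, (sender_ip, _sender_port) = sock.recvfrom(2048)
    # Stamp the packet with the server's own clock as soon as it arrives.
    received_at = time.time()
    return payload, sender_ip, received_at
```

A server would bind a `socket.SOCK_DGRAM` socket to the port configured in RPORT and call this function in a loop, passing the sender IP and reception time on to whatever parses the payload.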
The following shows how the UDP message's fields can be used (they are shown in order), as well as the record field if one corresponds to that value. A data structure for each IOC should be made, and all the data structures should be made into a searchable construct (binary tree/list/etc.) where the key is the IOC name. Below, the values that should be recorded into a data structure are noted.
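As a rough illustration of the searchable construct mentioned above, a per-IOC entry kept in a dictionary keyed by IOC name might look like the following Python sketch. It only covers values named elsewhere in this document, not the full message. The names `IocEntry` and `record_heartbeat` are hypothetical, and the acceptance rule (treat the incarnation and the IOC's current time as increasing integers and reject anything not newer) is an assumption about how out-of-order packets could be filtered, not necessarily how the author's server does it.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class IocEntry:
    name: str             # value of the IOC environment variable (lookup key)
    ip_address: str       # taken from the UDP sender address, not the payload
    incarnation: int      # incarnation time reported in the heartbeat
    ioc_time: int         # IOC's current time from the last accepted heartbeat
    last_received: float  # server-local reception time of that heartbeat


def record_heartbeat(registry: Dict[str, IocEntry], name: str, sender_ip: str,
                     incarnation: int, ioc_time: int, received_at: float) -> bool:
    """Store a heartbeat; return False if it is stale and was ignored."""
    entry = registry.get(name)
    if entry is not None:
        if incarnation < entry.incarnation:
            return False  # left over from an earlier boot of this IOC
        if incarnation == entry.incarnation and ioc_time <= entry.ioc_time:
            return False  # delayed or duplicated packet from the current boot
    registry[name] = IocEntry(name, sender_ip, incarnation, ioc_time, received_at)
    return True
```

A dictionary is used here simply because it is the most direct name-keyed lookup in Python; a tree or list would serve the same purpose.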
When an IOC is turned off or crashes, there is no immediate detection of failure. The determination depends on the heartbeat period and the number of missed heartbeats. For an HPRD period of 15 seconds, declaring a failure after four missed heartbeats means it takes a minute to detect. This is fairly conservative; if you are certain that the network between the IOCs and the server doesn't drop many packets, the number of missed heartbeats can be reduced. The heartbeat rate could also be increased (by shortening HPRD), although that means more processing at the server.
The time value to use for determining failure should be how long it has been since the last accepted heartbeat was received (accepted, because heartbeats can arrive out of order), with the reception time measured locally by the server rather than taken from the IOC's current time. Measuring server local time might seem redundant when the IOC sends its own local time, but having both allows you to sense any packet delivery lag or any systematic difference in time (such as from time zone differences). One also has to remember that the EPICS time sent by IOCs has a negative offset from Linux time of 631152000 seconds (20 years), because the EPICS epoch is 1 January 1990 while the Unix epoch is 1 January 1970.
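The arithmetic above can be written out as a short Python sketch. The function names and the default of four missed heartbeats are illustrative assumptions; the silence is measured between two server-side timestamps.

```python
def failure_timeout(hprd_seconds: float, missed_heartbeats: int) -> float:
    """Seconds of silence after which an IOC is declared failed."""
    return hprd_seconds * missed_heartbeats


def is_failed(last_received: float, now: float, hprd_seconds: float,
              missed_heartbeats: int = 4) -> bool:
    """True if no heartbeat has been accepted for longer than the timeout.

    Both last_received and now are server-local timestamps in seconds.
    """
    return (now - last_received) > failure_timeout(hprd_seconds, missed_heartbeats)
```

With a 15 second period and four missed heartbeats, `failure_timeout(15, 4)` gives the one minute mentioned above. Lowering `missed_heartbeats` shortens detection time at the cost of more false alarms on a lossy network.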
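To make the epoch offset concrete, the following Python sketch converts an EPICS timestamp to Unix time and estimates the difference between the server's reception time and the time the IOC reported. The function names are hypothetical, and the sketch assumes the IOC time is a whole number of seconds since the EPICS epoch.

```python
from datetime import datetime, timezone

# Seconds between the Unix epoch (1970-01-01) and the EPICS epoch (1990-01-01).
EPICS_EPOCH_OFFSET = 631152000


def epics_to_unix(epics_seconds: int) -> int:
    """Convert seconds since the EPICS epoch to seconds since the Unix epoch."""
    return epics_seconds + EPICS_EPOCH_OFFSET


def epics_to_datetime(epics_seconds: int) -> datetime:
    """Convert an EPICS timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epics_to_unix(epics_seconds), tz=timezone.utc)


def clock_offset(ioc_epics_time: int, received_at_unix: float) -> float:
    """Server reception time minus the IOC's reported time, in seconds.

    A small positive value suggests delivery lag; a large or steady value
    suggests a systematic clock difference between server and IOC.
    """
    return received_at_unix - epics_to_unix(ioc_epics_time)
```

Failure detection itself should still rely only on the server's own reception timestamps; the offset is diagnostic information.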
A failure can be actively detected and acted on directly by the server, or the server can simply collect data and let polling clients determine the failures themselves (which allows for varying failure times and HPRD rates).
The up time for an active IOC is the time since the last heartbeat plus the difference between the current time reported in that heartbeat and the IOC's incarnation time.
The down time for a failed IOC is the time since the last heartbeat.
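These two definitions translate directly into code. In the Python sketch below the function and parameter names are hypothetical. `now` and `last_received` are server-local timestamps, while `ioc_time` and `incarnation` are the values reported by the IOC. Because only the difference of the two IOC values is used, no epoch conversion is needed.

```python
def uptime(now: float, last_received: float,
           ioc_time: int, incarnation: int) -> float:
    """Up time of an active IOC, in seconds.

    (now - last_received) is the server-measured time since the last
    heartbeat; (ioc_time - incarnation) is how long the IOC had been
    running when it sent that heartbeat.
    """
    return (now - last_received) + (ioc_time - incarnation)


def downtime(now: float, last_received: float) -> float:
    """Down time of a failed IOC: seconds since its last heartbeat."""
    return now - last_received
```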
The TCP callback is used to get static information from the alive record. If the IOC was not able to create the callback port, the value of the Return Port will be 0, and a callback can't be made.
The server can make a callback at any time, although it typically should be done when a new incarnation is seen in the heartbeat message or when the Read (Bit 0) flag is set in the heartbeat message. If the Blocked (Bit 1) flag is set in the heartbeat message, the callback will not work, as the record will not accept connections. A callback will also fail if the server tries it against an alive record that is sending its heartbeats to a different server.
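The callback conditions can be collected into a small decision function. The Python sketch below is one possible reading of the rules above, not a definitive policy; the names `should_callback` and `known_incarnation` are hypothetical, and `flags` is assumed to be the flag value from the heartbeat treated as an integer bit field.

```python
from typing import Optional

FLAG_READ = 0x01     # bit 0: the record asks for its information to be read
FLAG_BLOCKED = 0x02  # bit 1: the record will not accept connections


def should_callback(return_port: int, flags: int, incarnation: int,
                    known_incarnation: Optional[int]) -> bool:
    """Decide whether to attempt a TCP callback after a heartbeat.

    known_incarnation is the incarnation stored from the previous
    heartbeat of this IOC, or None if the IOC has not been seen before.
    """
    if return_port == 0:
        return False  # the IOC could not create its callback port
    if flags & FLAG_BLOCKED:
        return False  # the connection would be refused
    new_incarnation = known_incarnation is None or incarnation != known_incarnation
    return new_incarnation or bool(flags & FLAG_READ)
```

This function cannot detect the case of a record that reports to a different server, so the callback code still has to handle a failed connection.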
The information returned is static in nature (and the server doesn't really need it for running), so it should be recorded in a data structure, attached to the IOC data entry.