EPICS Re: Generating metrics for IOC applications

Experimental Physics and Industrial Control System

Hi Ralph,

From my point of view, the situation you describe is a monitoring problem. In our experience, we encounter similar issues:

- When should we schedule the replacement of our hardware, e.g. hard disks before they fail beyond repair?
- How can we be warned when our hardware is failing, e.g. when a fan fails on one of our machines?
- How can we detect if a machine's OS is too old?
- How can we track certain "common" metrics on our machines (RAM, CPU, disk space used, network activity, read/write operations, ...)?
- How can we spot and identify security vulnerabilities in our programs deployed on those machines?
- etc.

Your question is more focused on obsolescence, but I think our answer covers this concern. We mostly use four tools to address this kind of situation:

1. NetBox (https://docs.netbox.dev/en/stable/). This is a documentation tool that enables us to make an inventory of all our equipment. There's a bit of a learning curve, but you get used to it, and administering your own instance isn't that hard. Initially, the tool was designed for data centers, but in practice, it can be used for any installation. What's really interesting about NetBox is that you can integrate plugins (https://docs.netbox.dev/en/stable/plugins/development/): in your case, it's easy to imagine a plugin that would regularly scan the inventory to detect obsolete equipment (depending on your criteria).

2. Grafana (https://grafana.com/docs/grafana/latest/introduction/) + Prometheus (https://grafana.com/docs/grafana/latest/getting-started/get-started-grafana-prometheus/). These tools help us monitor most of our systems. You just have to install a "data exporter" on the system you want to monitor, and then you have access to a lot of different metrics. For example, you can check the Wikimedia ones (which are open): https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1 You can even configure alarms on the metrics you want.

3. SSH Monitor (an internal project, not open sourced yet but we're getting there; you can contact me if you want more details about it). This is an EPICS Top that sends small SSH commands to target machines in order to extract specific information. This might sound "insecure", but thanks to SSH (and restricted Bash) this can be done in a very safe way. It can monitor any data from one or multiple target machines, as long as the target machine hosts an SSH daemon. In fact, it is designed so that you can monitor any data for which you can come up with a command line that retrieves it. We use it in addition to Grafana + Prometheus for the following reasons:

   - Grafana + Prometheus cannot be connected to EPICS (as far as I know), so if you want EPICS PVs for the metrics you monitor (e.g. for "EPICS alarms" with phoebus-alarms, for "EPICS archives" with Archiver Appliance, ...), then SSH Monitor is more appropriate.
   - SSH Monitor can also be more appropriate when looking for very specific information, e.g. parsing very particular expressions (specific to your project) in some log files.
   - Grafana + Prometheus needs a data exporter to be installed on the target machine, which might be considered "intrusive" and a "NO-GO" for a lot of target machines. SSH Monitor requires very little to no intervention on the target machine.
   - Data exporters might not support the target machine's CPU architecture. In that case, if the target machine hosts an SSH daemon, then you won't have to care much about its CPU architecture with SSH Monitor.

4. Nix (https://nixos.org/manual/nix/stable/). This is just a package manager we use to package our programs (mostly EPICS stuff) before deploying them. If you remember me, then you very probably remember my colleague Rémi NICOLE, who is quite enthusiastic about Nix and NixOS. Since you mentioned "bitrot", Rémi suggested mentioning the bit-for-bit build reproducibility of the Nix package manager (which is a great guarantee). Nix can be installed on any Linux distro (even very old ones, such as old CentOS machines, most of the time), without conflicting with the local package manager (yum, dnf, apt, etc.). We also use it to check the full dependency tree of the programs we package with it: this way, we can track our dependencies, and Nix can even warn us about security issues in our dependencies! There are a lot of other advantages (rollbacks, garbage collection, distributed builds and cache, the declarative nature of Nix, etc.), but this is starting to be off-topic.

I hope this answer will help you! Do not hesitate to contact me if you have questions.

Cheers,

Stéphane
CEA / DRF / IRFU / DIS / LDISC

On Mon, 2024-02-26 at 11:23 +0100, Ralph Lange via Tech-talk wrote:
> Dear all,
>
> For large control system installations consisting of many different
> applications (largely independent units covering subsystems), we
> would like to have something that analyzes the application and
> generates some metrics, to give us a better grip on what we're
> dealing with.
>
> The context is "obsolescence management". More and more of ITER's
> expected ~170 subsystems are being delivered, integrated and start to
> bitrot.
>
> With our limited manpower, we need to make informed decisions about
> where to start updating and migrating. It would help to have a few
> comparable numbers describing the size, complexity, regularity, ...
> of the candidates.
>
> Did anyone ever start efforts in that direction?
> Any formulas that generated meaningful and reliable data?
>
> Thanks a lot for your help and ideas,
> ~Ralph
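
The SSH Monitor approach described in the message (run a small command on a target machine over SSH and turn its output into a metric) might be sketched roughly as follows. This is not the CEA tool, which is unreleased; the metric names, the parser functions and the `METRICS` table are hypothetical, and the sketch only assumes a standard OpenSSH client with key-based login. Server-side hardening such as a restricted shell is not shown.

```python
from __future__ import annotations

import subprocess


def parse_root_disk_percent(df_output: str) -> float:
    """Return the used percentage of / from the output of `df -P /`."""
    lines = [line for line in df_output.splitlines() if line.strip()]
    fields = lines[-1].split()          # last line describes the filesystem
    return float(fields[4].rstrip("%"))  # 5th column is "Capacity", e.g. "53%"


def parse_load_average(loadavg_output: str) -> float:
    """Return the 1-minute load average from the content of /proc/loadavg."""
    return float(loadavg_output.split()[0])


# Hypothetical metric table: metric name -> (remote command, output parser).
METRICS = {
    "root_disk_percent": ("df -P /", parse_root_disk_percent),
    "load_1min": ("cat /proc/loadavg", parse_load_average),
}


def build_ssh_command(host: str, remote_command: str, timeout_s: int = 5) -> list[str]:
    """Build a non-interactive ssh invocation that runs one remote command."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never prompt for a password
        "-o", f"ConnectTimeout={timeout_s}",  # give up quickly on dead hosts
        host,
        remote_command,
    ]


def poll(host: str, metric: str, timeout_s: int = 5) -> float:
    """Run the command for one metric on one host and parse the result."""
    remote_command, parser = METRICS[metric]
    completed = subprocess.run(
        build_ssh_command(host, remote_command, timeout_s),
        capture_output=True,
        text=True,
        timeout=timeout_s + 5,
        check=True,  # raise if ssh or the remote command fails
    )
    return parser(completed.stdout)
```

A periodic caller could loop over hosts and metric names, call `poll(host, metric)` for each, and write the returned value into whatever it publishes (an EPICS PV in the case described in the message). Keeping the command and its parser together in one table is one way to get the "any data you can retrieve with a command line" property mentioned there.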

Subject:	Re: Generating metrics for IOC applications
From:	TZVETKOV Stephane via Tech-talk <tech-talk at aps.anl.gov>
To:	"ralph.lange at gmx.de" <ralph.lange at gmx.de>, "tech-talk at aps.anl.gov" <tech-talk at aps.anl.gov>
Date:	Thu, 29 Feb 2024 16:26:02 +0000
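
The inventory scan suggested in the message (regularly checking the inventory for obsolete equipment "depending on your criteria") could start from logic like the sketch below. It deliberately does not use the NetBox plugin API: the `Device` record, its field names and the two criteria (hardware age and OS end of support) are assumptions chosen for illustration, and real criteria would come from your own inventory data.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Device:
    """Hypothetical inventory record; not a NetBox model."""
    name: str
    installed: date
    os_end_of_support: Optional[date] = None


def obsolescence_reasons(device: Device, today: date, max_age_years: float = 10) -> list[str]:
    """List the reasons (possibly none) why a device counts as obsolete."""
    reasons = []
    age_years = (today - device.installed).days / 365.25
    if age_years > max_age_years:
        reasons.append(f"hardware older than {max_age_years} years")
    if device.os_end_of_support is not None and device.os_end_of_support < today:
        reasons.append("OS past end of support")
    return reasons


def scan(inventory: list[Device], today: date, max_age_years: float = 10) -> dict[str, list[str]]:
    """Return only the devices that match at least one obsolescence criterion."""
    report = {}
    for device in inventory:
        reasons = obsolescence_reasons(device, today, max_age_years)
        if reasons:
            report[device.name] = reasons
    return report
```

Returning the reasons rather than a plain yes/no keeps the result usable for prioritisation: the number and kind of reasons per subsystem gives a first comparable figure of the sort asked for in the original question.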