EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Archiver Appliance stuck on initial sampling
From: Dennis Hilhorst via Tech-talk <tech-talk at aps.anl.gov>
To: "tech-talk at aps.anl.gov" <tech-talk at aps.anl.gov>
Date: Fri, 9 Jan 2026 10:13:57 +0000

Hi all,

 

We have been using the Archiver Appliance, but for some reason one deployment is getting stuck adding new PVs. Here’s the situation:

 

We are running version 2.1.2 of the Archiver Appliance (from the EPNix package https://github.com/epics-extensions/EPNix/blob/master/pkgs/by-name/archiver-appliance/package.nix) with Tomcat 9.0.108 (also through Nix) on NixOS 25.05. Previously we ran version 2.0.10 of the Archiver Appliance, which had been working fine but then also suddenly started hanging.

The services start normally and the web interface and API respond, but the archiver seems to get stuck trying to sample PVs. We submit a list of about 4500 PVs for it to archive; it gets to work and adds some of them within a couple of minutes. After a while it hangs, and it did not add any more PVs for an entire day. The metrics page of the management interface shows that the number of “PVs pending computation of meta info” is 1000 (which seems to be the maximum).

We then tried rebooting, after which some of the PVs that had been added became disconnected from the archiver (“Disconnected PV count” in the metrics page was no longer 0, and we found some variables that indeed were no longer being archived). Some PVs were still connected and it added some more, but it got stuck again after a while, with 1000 PVs pending computation of meta info. Other metrics do update, for example events/sec and MB/sec, and we can see data through the Grafana plugin, so the archiver is definitely doing something. The Tomcat logs don’t tell me much; they just repeat:

 

Jan 09 08:17:26 a3xr-control startup.sh[34080]: INFO Running the archive PV workflow with 1387 requests pending (org.epics.archiverappliance.mgmt.MgmtRuntimeState)
Jan 09 08:17:36 a3xr-control startup.sh[34080]: INFO Appliances that have loaded their PVsappliance0 (org.epics.archiverappliance.config.DefaultConfigService)
Jan 09 08:17:36 a3xr-control startup.sh[34080]: INFO Running the archive PV workflow with 1387 requests pending (org.epics.archiverappliance.mgmt.MgmtRuntimeState)
Jan 09 08:17:46 a3xr-control startup.sh[34080]: INFO Appliances that have loaded their PVsappliance0 (org.epics.archiverappliance.config.DefaultConfigService)
Jan 09 08:17:46 a3xr-control startup.sh[34080]: INFO Running the archive PV workflow with 1387 requests pending (org.epics.archiverappliance.mgmt.MgmtRuntimeState)
Jan 09 08:17:56 a3xr-control startup.sh[34080]: INFO Appliances that have loaded their PVsappliance0 (org.epics.archiverappliance.config.DefaultConfigService)
Jan 09 08:17:56 a3xr-control startup.sh[34080]: INFO Running the archive PV workflow with 1387 requests pending (org.epics.archiverappliance.mgmt.MgmtRuntimeState)
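
One way to confirm from logs like these that the workflow has stalled, rather than merely slowed down, is to pull the pending-request count out of each line and check whether it ever changes. The following is a minimal Python sketch, assuming journal lines in the format shown above. The names `pending_counts`, `is_stalled` and the `min_samples` threshold are illustrative and not part of the Archiver Appliance.

```python
import re

# Matches the MgmtRuntimeState message shown in the log excerpt above.
PENDING_RE = re.compile(r"Running the archive PV workflow with (\d+) requests pending")


def pending_counts(lines):
    """Return the pending-request count from every matching log line, in order."""
    counts = []
    for line in lines:
        match = PENDING_RE.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


def is_stalled(counts, min_samples=3):
    """True if at least min_samples counts were seen and all are the same nonzero value.

    min_samples is an arbitrary illustrative threshold, not an archiver setting.
    """
    return len(counts) >= min_samples and len(set(counts)) == 1 and counts[0] > 0


# Sample lines copied from the excerpt above (class names shortened out of the match).
sample_log = [
    "Jan 09 08:17:26 a3xr-control startup.sh[34080]: INFO Running the archive PV workflow with 1387 requests pending (org.epics.archiverappliance.mgmt.MgmtRuntimeState)",
    "Jan 09 08:17:36 a3xr-control startup.sh[34080]: INFO Appliances that have loaded their PVsappliance0 (org.epics.archiverappliance.config.DefaultConfigService)",
    "Jan 09 08:17:36 a3xr-control startup.sh[34080]: INFO Running the archive PV workflow with 1387 requests pending (org.epics.archiverappliance.mgmt.MgmtRuntimeState)",
    "Jan 09 08:17:46 a3xr-control startup.sh[34080]: INFO Running the archive PV workflow with 1387 requests pending (org.epics.archiverappliance.mgmt.MgmtRuntimeState)",
    "Jan 09 08:17:56 a3xr-control startup.sh[34080]: INFO Running the archive PV workflow with 1387 requests pending (org.epics.archiverappliance.mgmt.MgmtRuntimeState)",
]

counts = pending_counts(sample_log)
print(counts, is_stalled(counts))
```

Feeding it a longer stretch of the journal would show whether the count of 1387 ever moved during the day the archiver appeared to hang.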

 

I increased the verbosity by adding JAVA_OPTS=-verbose:class and the -v parameter to the startup script. From that I could tell that it struggled to detect DBR types for PVs hosted by a Python softioc, but there should only be about 250 of those, and at some point it would just give up on them (Aborting archive request for pv nbl_a:bpm1:x_avg Reason:  (org.epics.archiverappliance.mgmt.archivepv.ArchivePVState)), so I don’t think it is getting stuck on those. I didn’t see anything else indicating that it hangs on specific PVs, and eventually the logs consist only of the lines above, repeated again and again.

 

We tried completely wiping the archiver’s sts / mts / lts directories as well as the MySQL and Tomcat environments. This does seem to reset the archiver, but the same problem occurs right afterwards. The policies.py file we are using is the default one packaged in EPNix (I think it’s the same one that can be found here: https://gitlab.esss.lu.se/julianomurari/epicsarchiver-config/-/blob/master/policies/default_policies.py). I also tried increasing callbackSetQueueSize to 50000 in our IOC, hoping the issue was an excess of monitor callbacks, but to no avail. I am not sure where else to look: there don’t seem to be any more detailed logs for the Archiver Appliance that would show which PVs it is getting stuck on.

 

For completeness, our appliances.xml looks like this:

 

<appliances>
  <appliance>
    <identity>appliance0</identity>
    <cluster_inetport>localhost:16670</cluster_inetport>
    <mgmt_url>http://localhost:8080/mgmt/bpl</mgmt_url>
    <engine_url>http://localhost:8080/engine/bpl</engine_url>
    <etl_url>http://localhost:8080/etl/bpl</etl_url>
    <retrieval_url>http://localhost:8080/retrieval/bpl</retrieval_url>
    <data_retrieval_url>http://localhost:8080/retrieval</data_retrieval_url>
  </appliance>
</appliances>
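
Since the logs do not name the PVs involved, the management BPL at the mgmt_url above may be a more direct source. The following is a hedged Python sketch that reads the mgmt_url from an appliances.xml like this one and builds the URL for the getNeverConnectedPVs call. That endpoint name is taken from my reading of the Archiver Appliance scripting documentation and should be checked against the installed version. The helper names are illustrative, and the shape of the JSON response is not assumed here.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

# Copy of the appliances.xml shown above.
APPLIANCES_XML = """<appliances>
  <appliance>
    <identity>appliance0</identity>
    <cluster_inetport>localhost:16670</cluster_inetport>
    <mgmt_url>http://localhost:8080/mgmt/bpl</mgmt_url>
    <engine_url>http://localhost:8080/engine/bpl</engine_url>
    <etl_url>http://localhost:8080/etl/bpl</etl_url>
    <retrieval_url>http://localhost:8080/retrieval/bpl</retrieval_url>
    <data_retrieval_url>http://localhost:8080/retrieval</data_retrieval_url>
  </appliance>
</appliances>"""


def mgmt_urls(xml_text):
    """Map each appliance identity to its mgmt BPL base URL."""
    root = ET.fromstring(xml_text)
    return {
        appliance.findtext("identity"): appliance.findtext("mgmt_url")
        for appliance in root.iter("appliance")
    }


def never_connected_url(mgmt_url):
    """Build the URL of the (assumed) getNeverConnectedPVs BPL call."""
    return mgmt_url.rstrip("/") + "/getNeverConnectedPVs"


def fetch_json(url, timeout=10):
    """GET a BPL URL and decode the JSON body; only useful on a host that can reach the archiver."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


urls = mgmt_urls(APPLIANCES_XML)
query_url = never_connected_url(urls["appliance0"])
print(query_url)
# On the archiver host one could then run:
#     for entry in fetch_json(query_url):
#         print(entry)
```

The network call is left commented out so the sketch runs anywhere; only the URL construction is exercised.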

 

And here are the full stats after it hung for the first time after a complete sts / mts / lts and mysql wipe:

 

Attribute                                                 Detail
Appliance Identity                                        appliance0
Total PV count                                            1449
Disconnected PV count                                     0
Connected PV count                                        1449
Paused PV count                                           0
Total channels                                            5435
Approx pending jobs in engine queue                       1
Event Rate (in events/sec)                                4.86
Data Rate (in bytes/sec)                                  58.8
Data Rate in (GB/day)                                     0
Data Rate in (GB/year)                                    1.73
Time consumed for writing samplebuffers to STS (in secs)  0
Benchmark - writing at (events/sec)                       10,674.23
Benchmark - writing at (MB/sec)                           0.12
PVs pending computation of meta info                      1000
Total number of CAJ channels                              12452
Channels with pending search requests                     7000 of 12452
Total number of ETL runs into MTS so far                  20
Average time spent in ETL into MTS (s/run)                0.04
Average percentage of time spent in ETL                   0
Approximate time taken by last ETL job (s)                0
Estimated weekly usage in ETL (%)                         0
Avg time spent by getETLStreams (s/run)                   0.01
Avg time spent by free space checks (s/run)               0
Avg time spent by prepareForNewPartition() (s/run)        0
Avg time spent by appendToETLAppendData() (s/run)         0.03
Avg time spent by commitETLAppendData() (s/run)           0
Avg time spent by markForDeletion() in ETL (s/run)        0
Avg time spent by runPostProcessors() in ETL (s/run)      0
Avg time spent by executePostETLTasks() in ETL (s/run)    0
Estimated bytes transferred in ETL (MTS)(MB)              4.54
Number of Retrieval Requests                              23
Time of last Retrieval Request                            Jan/07/2026 09:25:22 GMT
Number of unique users                                    2
PVs in archive workflow                                   3063
Capacity planning last update                             Jan/06/2026 13:42:44 GMT
Engine write thread usage                                 0
Aggregated appliance storage rate (in GB/year)            17.57
Aggregated appliance event rate (in events/sec)           41.32
Aggregated appliance PV count                             1,838
Incremental appliance storage rate (in GB/year)           17.57
Incremental appliance event rate (in events/sec)          41.32
Incremental appliance PV count                            1,838
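
A few of the figures above can be cross-checked with simple arithmetic. The Python sketch below assumes a 365-day year and binary gigabytes (2^30 bytes), neither of which is stated on the metrics page. Under those assumptions the 58.8 bytes/sec data rate reproduces the reported 1.73 GB/year and a GB/day value that rounds to 0, and "Total PV count" plus "PVs in archive workflow" comes to 4512, close to the roughly 4500 PVs submitted.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365        # assumption: not stated on the metrics page
BYTES_PER_GB = 2 ** 30     # assumption: binary gigabytes

# Values copied from the metrics table above.
data_rate_bytes_per_sec = 58.8
total_pv_count = 1449
pvs_in_workflow = 3063

# Convert the byte rate into the per-day and per-year figures.
gb_per_day = data_rate_bytes_per_sec * SECONDS_PER_DAY / BYTES_PER_GB
gb_per_year = gb_per_day * DAYS_PER_YEAR

# PVs already archived plus PVs still in the archive workflow.
accounted_for = total_pv_count + pvs_in_workflow

print(f"{gb_per_day:.2f} GB/day, {gb_per_year:.2f} GB/year")
print(f"{accounted_for} PVs accounted for")
```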

 

We haven’t tried a complete reinstall of the system, but of course we would prefer to figure out what is going wrong so that we can prevent it from happening in the future. We would greatly appreciate help trying to debug / fix this issue!

 

 

Sincerely,

 

Dennis Hilhorst

 

