Hi all,
We have been using the Archiver Appliance, but one deployment keeps getting stuck while adding new PVs. Here is the situation:
We are running version 2.1.2 of the Archiver Appliance (using the EPNix package https://github.com/epics-extensions/EPNix/blob/master/pkgs/by-name/archiver-appliance/package.nix) with Tomcat 9.0.108 (also through Nix) on NixOS 25.05. Before that, we were running version 2.0.10 of the Archiver Appliance, which had been working fine but then suddenly started hanging as well.
The services start fine and the web interface and API respond, but the archiver seems to get stuck trying to sample PVs. We submit a list of about 4500 PVs for it to archive; it gets to work and adds some of them after a couple of minutes. Then after a while it hangs, and it did not add any more PVs for an entire day. The management / metrics page shows that the number of “PVs pending computation of meta info” is 1000 (which seems to be the maximum).
We then tried rebooting, after which some of the PVs that had made it through became disconnected from the archiver (“Disconnected PV count” in the metrics page was no longer 0, and we found some variables that were indeed no longer being archived). Some PVs were still connected, and it added some more PVs, but it got stuck again after a while, with 1000 PVs pending computation of meta info.
Other metrics do update, for example the events / sec or MB / sec metrics, and we can see data through the Grafana plugin, so the archiver is definitely doing something. The Tomcat logs don’t make me any wiser; they just repeat:
Jan 09 08:17:26 a3xr-control startup.sh[34080]: INFO Running the archive PV workflow with 1387 requests pending (org.epics.archiverappliance.mgmt.MgmtRuntimeState)
Jan 09 08:17:36 a3xr-control startup.sh[34080]: INFO Appliances that have loaded their PVsappliance0 (org.epics.archiverappliance.config.DefaultConfigService)
Jan 09 08:17:36 a3xr-control startup.sh[34080]: INFO Running the archive PV workflow with 1387 requests pending (org.epics.archiverappliance.mgmt.MgmtRuntimeState)
Jan 09 08:17:46 a3xr-control startup.sh[34080]: INFO Appliances that have loaded their PVsappliance0 (org.epics.archiverappliance.config.DefaultConfigService)
Jan 09 08:17:46 a3xr-control startup.sh[34080]: INFO Running the archive PV workflow with 1387 requests pending (org.epics.archiverappliance.mgmt.MgmtRuntimeState)
Jan 09 08:17:56 a3xr-control startup.sh[34080]: INFO Appliances that have loaded their PVsappliance0 (org.epics.archiverappliance.config.DefaultConfigService)
Jan 09 08:17:56 a3xr-control startup.sh[34080]: INFO Running the archive PV workflow with 1387 requests pending (org.epics.archiverappliance.mgmt.MgmtRuntimeState)
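To make the stall easier to quantify, the pending count can be pulled out of these lines and checked for change. This is only a minimal sketch: it assumes the journal line format quoted above, and the helper names (pending_counts, is_stalled) are made up for illustration and are not part of the Archiver Appliance.

```python
import re

# Assumes the journal format quoted above: "Jan 09 08:17:26 host unit[pid]: INFO ..."
PENDING_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) .*"
    r"Running the archive PV workflow with (?P<n>\d+) requests pending"
)


def pending_counts(lines):
    """Return (timestamp, pending) pairs for every archive PV workflow line."""
    out = []
    for line in lines:
        m = PENDING_RE.search(line)
        if m:
            out.append((m.group("ts"), int(m.group("n"))))
    return out


def is_stalled(counts, min_samples=3):
    """True if the last min_samples pending counts are all identical."""
    tail = [n for _, n in counts[-min_samples:]]
    return len(tail) >= min_samples and len(set(tail)) == 1


# A few of the lines from the excerpt above, used as sample input.
SAMPLE = [
    "Jan 09 08:17:26 a3xr-control startup.sh[34080]: INFO Running the archive PV workflow with 1387 requests pending (org.epics.archiverappliance.mgmt.MgmtRuntimeState)",
    "Jan 09 08:17:36 a3xr-control startup.sh[34080]: INFO Appliances that have loaded their PVsappliance0 (org.epics.archiverappliance.config.DefaultConfigService)",
    "Jan 09 08:17:36 a3xr-control startup.sh[34080]: INFO Running the archive PV workflow with 1387 requests pending (org.epics.archiverappliance.mgmt.MgmtRuntimeState)",
    "Jan 09 08:17:46 a3xr-control startup.sh[34080]: INFO Appliances that have loaded their PVsappliance0 (org.epics.archiverappliance.config.DefaultConfigService)",
    "Jan 09 08:17:46 a3xr-control startup.sh[34080]: INFO Running the archive PV workflow with 1387 requests pending (org.epics.archiverappliance.mgmt.MgmtRuntimeState)",
]

counts = pending_counts(SAMPLE)
print(counts, "stalled" if is_stalled(counts) else "progressing")
```

Fed with a longer stretch of the journal, the same two functions would show whether the 1387 figure ever moves or stays flat for hours.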
I increased the verbosity by adding JAVA_OPTS=-verbose:class and passing the -v parameter to the startup script. From that I could tell that it struggled to detect DBR types for Python softioc-hosted PVs, but there should only be about 250 of those, and at some point it would just give up on them (Aborting archive request for pv nbl_a:bpm1:x_avg Reason: (org.epics.archiverappliance.mgmt.archivepv.ArchivePVState)), so I don’t think it is getting stuck on those. I didn’t see anything else that indicates it hanging on specific PVs, and at some point the logs would just be what is shown above, repeated again and again.
We tried completely wiping the archiver’s sts / mts / lts directories as well as the mysql and tomcat environments. This does seem to reset the archiver, but the same problem occurs right afterwards. The policies.py file we are using is the default one packaged in EPNix (I think it’s the same one that can be found here: https://gitlab.esss.lu.se/julianomurari/epicsarchiver-config/-/blob/master/policies/default_policies.py). I also tried increasing callbackSetQueueSize to 50000 in our IOC, hoping the issue was an excess of monitor callbacks, but to no avail. I am not sure where else to look; there don’t seem to be any more detailed logs for the archiver appliance that would show me which PVs it is getting stuck on.
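One thing that might narrow this down is asking the mgmt BPL directly instead of relying on the logs. The sketch below is untested against our deployment and rests on assumptions: that the getNeverConnectedPVs and getPVStatus endpoints behave as described in the Archiver Appliance scripting documentation, and that each returned entry carries a "pvName" key. The helper names (bpl_url, fetch_json, count_by_prefix) are made up for illustration.

```python
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

# mgmt_url as configured in our appliances.xml
MGMT_URL = "http://localhost:8080/mgmt/bpl"


def bpl_url(endpoint, base=MGMT_URL, **params):
    """Build a mgmt BPL URL, e.g. .../getPVStatus?pv=<name>."""
    url = f"{base.rstrip('/')}/{endpoint}"
    return f"{url}?{urlencode(params)}" if params else url


def fetch_json(url, timeout=30):
    """GET a BPL endpoint and decode its JSON body (needs a running appliance)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def count_by_prefix(entries, key="pvName"):
    """Count entries per PV name prefix (text before the first ':').

    Assumption: every entry is a dict with a "pvName" key; entries
    without that key are skipped.
    """
    return Counter(e[key].split(":", 1)[0] for e in entries if key in e)


# Intended use against a live appliance (not executed here):
#   stuck = fetch_json(bpl_url("getNeverConnectedPVs"))
#   print(count_by_prefix(stuck).most_common(10))
#   print(fetch_json(bpl_url("getPVStatus", pv="nbl_a:bpm1:x_avg")))
```

Grouping by prefix is just one guess at a useful view: if most never-connected PVs share a prefix, that would point at a single IOC rather than at the archiver itself.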
For completeness, our appliances.xml looks like this:
<appliances>
  <appliance>
    <identity>appliance0</identity>
    <cluster_inetport>localhost:16670</cluster_inetport>
    <mgmt_url>http://localhost:8080/mgmt/bpl</mgmt_url>
    <engine_url>http://localhost:8080/engine/bpl</engine_url>
    <etl_url>http://localhost:8080/etl/bpl</etl_url>
    <retrieval_url>http://localhost:8080/retrieval/bpl</retrieval_url>
    <data_retrieval_url>http://localhost:8080/retrieval</data_retrieval_url>
  </appliance>
</appliances>
And here are the full stats after it hung for the first time after a complete sts / mts / lts and mysql wipe:
| Attribute | Detail |
| Appliance Identity | appliance0 |
| Total PV count | 1449 |
| Disconnected PV count | 0 |
| Connected PV count | 1449 |
| Paused PV count | 0 |
| Total channels | 5435 |
| Approx pending jobs in engine queue | 1 |
| Event Rate (in events/sec) | 4.86 |
| Data Rate (in bytes/sec) | 58.8 |
| Data Rate in (GB/day) | 0 |
| Data Rate in (GB/year) | 1.73 |
| Time consumed for writing samplebuffers to STS (in secs) | 0 |
| Benchmark - writing at (events/sec) | 10,674.23 |
| Benchmark - writing at (MB/sec) | 0.12 |
| PVs pending computation of meta info | 1000 |
| Total number of CAJ channels | 12452 |
| Channels with pending search requests | 7000 of 12452 |
| Total number of ETL runs into MTS so far | 20 |
| Average time spent in ETL into MTS (s/run) | 0.04 |
| Average percentage of time spent in ETL | 0 |
| Approximate time taken by last ETL job (s) | 0 |
| Estimated weekly usage in ETL (%) | 0 |
| Avg time spent by getETLStreams (s/run) | 0.01 |
| Avg time spent by free space checks (s/run) | 0 |
| Avg time spent by prepareForNewPartition() (s/run) | 0 |
| Avg time spent by appendToETLAppendData() (s/run) | 0.03 |
| Avg time spent by commitETLAppendData() (s/run) | 0 |
| Avg time spent by markForDeletion() in ETL (s/run) | 0 |
| Avg time spent by runPostProcessors() in ETL (s/run) | 0 |
| Avg time spent by executePostETLTasks() in ETL (s/run) | 0 |
| Estimated bytes transferred in ETL (MTS)(MB) | 4.54 |
| Number of Retrieval Requests | 23 |
| Time of last Retrieval Request | Jan/07/2026 09:25:22 GMT |
| Number of unique users | 2 |
| PVs in archive workflow | 3063 |
| Capacity planning last update | Jan/06/2026 13:42:44 GMT |
| Engine write thread usage | 0 |
| Aggregated appliance storage rate (in GB/year) | 17.57 |
| Aggregated appliance event rate (in events/sec) | 41.32 |
| Aggregated appliance PV count | 1,838 |
| Incremental appliance storage rate (in GB/year) | 17.57 |
| Incremental appliance event rate (in events/sec) | 41.32 |
| Incremental appliance PV count | 1,838 |
We haven’t tried a complete reinstall of the system, but of course we would prefer to figure out what is going wrong so that we can prevent it from happening in the future. We would greatly appreciate any help debugging or fixing this issue!
Sincerely,
Dennis Hilhorst