Subject: Archiver Appliance configuration questions
From: "Wilson, Andy (DLSLtd, RAL, LSCI) via Tech-talk" <tech-talk at aps.anl.gov>
To: EPICS tech-talk <tech-talk at aps.anl.gov>
Cc: "Williams, Rebecca (OBS, RAL, LSCI)" <rebecca.williams at diamond.ac.uk>, "Gaughran, Martin (DLSLtd, RAL, LSCI)" <Martin.Gaughran at diamond.ac.uk>
Date: Thu, 4 Nov 2021 11:18:36 +0000
Hello,
I am looking for some advice on the details of configuration for the EPICS Archiver Appliance (AA).
At Diamond, our AA installation was configured in 2017 and the config hasn't changed much since then, although we regularly update the software.
We have recently started seeing a problem where the AA takes unusually long (sometimes days) to establish new connections to PVs and to re-establish connections that are interrupted, e.g. by IOC restarts.
We have determined that a large number of the PVs requested for archiving do not exist (around 6%), and that this is likely the root cause, so we are currently working to reduce these to a manageable level.
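(In case it is useful to others: these can be counted via the management BPL. Below is a minimal sketch assuming the standard /mgmt/bpl/getNeverConnectedPVs endpoint; the hostname is a placeholder, and the exact fields in the response may vary by release.)

#!/usr/bin/env python3
# Minimal sketch: list PVs requested for archiving that never connected.
# Assumes the standard mgmt BPL endpoint getNeverConnectedPVs; the
# MGMT_URL host/port below is a placeholder for your mgmt instance.
import requests

MGMT_URL = "http://archiver.example.org:17665/mgmt/bpl"  # placeholder

resp = requests.get(MGMT_URL + "/getNeverConnectedPVs")
resp.raise_for_status()
never_connected = resp.json()  # one JSON object per never-connected PV

print(len(never_connected), "PVs requested but never connected")
for entry in never_connected[:10]:  # print a small sample
    print(" ", entry.get("pvName"))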
We would like to rule out any other factors. We have identified two possibilities.
1. Configuration options in archappl.properties
We have archivePVWorkflowBatchSize set to 30,000. archivePVWorkflowTickSeconds is not defined, so it must be using the default. We do not think that these two are causing our problem because we are not close to having that many PVs pending.
We also see the issue when re-establishing interrupted connections to PVs that are already being archived, so I think it is more likely something at the channel access client library level.
The only parameter that looks like it could be relevant is
org.epics.archiverappliance.engine.epics.commandThreadCount
The default is 10. Our site configuration has it set to 1. I do not have a record of the reason for this value.
The comment for this in archappl.properties says:
What precise effects would we observe if this value were not high enough?
Why might a value of 1 have been chosen? It is likely that we wanted to throttle the rate of channel access search requests, since excessive broadcast traffic from the AA has caused problems at DLS in the past. Are there other potential downsides to increasing it?
Are there any other parameters we haven't thought of that may be relevant to this problem?
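For context, the kind of check we have in mind for watching the reconnect backlog after an IOC restart is sketched below. It assumes the standard mgmt BPL endpoint getCurrentlyDisconnectedPVs, and the hostname is again a placeholder.

#!/usr/bin/env python3
# Sketch: poll the mgmt BPL and report how many PVs are currently
# disconnected, to watch how quickly the backlog drains after an
# IOC restart. Assumes the standard getCurrentlyDisconnectedPVs
# endpoint; the host/port is a placeholder.
import time
import requests

MGMT_URL = "http://archiver.example.org:17665/mgmt/bpl"  # placeholder

while True:
    pvs = requests.get(MGMT_URL + "/getCurrentlyDisconnectedPVs").json()
    print(time.strftime("%H:%M:%S"), len(pvs), "PVs currently disconnected")
    time.sleep(60)  # poll once a minute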
2. Distribution of PVs between appliances
We have a collective memory that "a maximum of 80,000 PVs per appliance is optimal". The source of this statement is not remembered, however; perhaps it was in a talk at a collaboration meeting. We now have more PVs than this (an average of around 100k on each of our three servers).
Can anybody explain where this 80k number came from? Is it expressing a limit on system resources, or some limit in the application? We find that system resources such as RAM, CPU and network are all well within capacity on our servers. I would like to understand whether there is some other factor we are not taking into account.
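For what it is worth, per-appliance PV counts and rates can be pulled from the mgmt BPL; a sketch assuming the getApplianceMetrics endpoint (the one behind the mgmt UI's metrics page) is below. The host is a placeholder, and field names such as "pvCount" and "eventRate" may differ between releases, so check what your version returns.

#!/usr/bin/env python3
# Sketch: print per-appliance PV counts and event rates for the cluster.
# Assumes the standard mgmt BPL endpoint getApplianceMetrics; host/port
# is a placeholder and the exact field names may vary between releases.
import requests

MGMT_URL = "http://archiver.example.org:17665/mgmt/bpl"  # placeholder

for appliance in requests.get(MGMT_URL + "/getApplianceMetrics").json():
    print(appliance.get("instance"),
          "PVs:", appliance.get("pvCount"),
          "events/sec:", appliance.get("eventRate"))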
Many thanks,
Andy