Subject: | Re: Archiver Appliance configuration questions |
From: | "Shankar, Murali via Tech-talk" <tech-talk at aps.anl.gov> |
To: | "tech-talk at aps.anl.gov" <tech-talk at aps.anl.gov> |
Date: | Thu, 4 Nov 2021 15:56:15 +0000 |
The archivePVWorkflowBatchSize and archivePVWorkflowTickSeconds control the rate at which we add new PVs into the archiver, so these are probably not the root cause of your issue.
The commandThreadCount is probably related. It determines the number of CAJContexts that are created; each CAJContext has a separate search thread. In practice, this seems to help with reconnects and reconnect speed. You could try increasing this slowly to see if it helps.
>> What precise effects would we observe if this value were not high enough?
Longer reconnect times, higher thread CPU usage for some of the engine threads. Larger commandThreadCount increases the parallelism somewhat.
>> Why may a value of 1 have been chosen?
It is likely that we wanted to throttle the level of channel access search requests.
This is the reason for lowering the value; more search threads definitely mean more broadcast traffic (CA searches are broadcast).
But the root cause of this is probably that there are a lot of PVs in the search queues of the CAJContexts. These are mainly "PVs requested for archiving that do not exist" and "disconnected PVs that do not exist". You should abort the former and pause the latter to eliminate them from the search queues. See the admin guide's hints for maintaining a clean system.
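A minimal sketch of that cleanup, using the mgmt BPL endpoints (getNeverConnectedPVs, getCurrentlyDisconnectedPVs, abortArchivingPV, pauseArchivingPV); the appliance URL is a placeholder, and you should verify the endpoint names and JSON field names against your installed Archiver Appliance version:

```python
# Sketch: shrink the CAJContext search queues by aborting PVs that never
# connected and pausing currently disconnected ones, via the mgmt BPL.
# MGMT is a hypothetical host; "pvName" is the assumed field name in the
# JSON responses.
import json
import urllib.parse
import urllib.request

MGMT = "http://archappl.example.org:17665/mgmt/bpl"  # placeholder host


def bpl_url(endpoint, pv=None):
    """Build a mgmt BPL URL, escaping the PV name if one is given."""
    url = f"{MGMT}/{endpoint}"
    if pv is not None:
        url += "?pv=" + urllib.parse.quote(pv, safe="")
    return url


def clean_search_queues():
    # PVs requested for archiving that never connected: abort them.
    with urllib.request.urlopen(bpl_url("getNeverConnectedPVs")) as r:
        for entry in json.load(r):
            urllib.request.urlopen(bpl_url("abortArchivingPV", entry["pvName"])).read()
    # Currently disconnected PVs: pause them so they leave the search queues.
    with urllib.request.urlopen(bpl_url("getCurrentlyDisconnectedPVs")) as r:
        for entry in json.load(r):
            urllib.request.urlopen(bpl_url("pauseArchivingPV", entry["pvName"])).read()
```

clean_search_queues() could be run after reviewing the candidate lists; note that pausing is reversible via resumeArchivingPV, while aborting removes the PV from the workflow.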
We run into this here occasionally as well. We have a script that checks the liveness of disconnected PVs and sends out an email if a live PV has been disconnected for a long time (getCurrentlyDisconnectedPVs includes a time of disconnect to help with this). Pausing and resuming a PV will tear down the CA channel and rebuild it, and this should almost always result in an immediate reconnection. A good strategy seems to be
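A sketch of such a script, assuming the getCurrentlyDisconnectedPVs response carries the PV name and a disconnect timestamp; the field names ("pvName", "connectionLostAt") and the appliance URL are assumptions to adapt to your installation:

```python
# Sketch: find PVs disconnected longer than a threshold, then bounce them
# (pause + resume) to tear down and rebuild the CA channel.
from datetime import datetime, timedelta, timezone
import json
import urllib.parse
import urllib.request

MGMT = "http://archappl.example.org:17665/mgmt/bpl"  # placeholder host


def long_disconnected(entries, now, max_age):
    """Return PV names whose disconnect time is older than max_age."""
    stale = []
    for e in entries:
        lost_at = datetime.fromisoformat(e["connectionLostAt"])
        if now - lost_at > max_age:
            stale.append(e["pvName"])
    return stale


def bounce(pv):
    """Pause then resume a PV, forcing the CA channel to be rebuilt."""
    for action in ("pauseArchivingPV", "resumeArchivingPV"):
        url = f"{MGMT}/{action}?pv=" + urllib.parse.quote(pv, safe="")
        urllib.request.urlopen(url).read()


def main():
    with urllib.request.urlopen(f"{MGMT}/getCurrentlyDisconnectedPVs") as r:
        entries = json.load(r)
    now = datetime.now(timezone.utc)
    for pv in long_disconnected(entries, now, timedelta(days=7)):
        print("bouncing", pv)  # a real script might email a report instead
        bounce(pv)
```

main() could run periodically from cron; only PVs that are actually alive (e.g. reachable via caget) should be bounced, otherwise the pause/resume just puts them back in the search queues.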
>> We have determined that we have a large number of PVs requested for archiving that do not exist (around 6 %),
BTW, I recently worked around a bug (in a Java collection class, of all things) that impacted this. In a working system, these PVs should have been kicked out of the system within 24 hours. Still watching this issue though; not sure if it is completely gone.
>> We have a collective memory that "a maximum of 80 000 PVs per appliance is optimal."
This is probably just a ballpark used for capacity planning. The Engine write thread(s) and Max ETL(%) in the Metrics page are probably a decent measure of capacity.
Hope this helps.
Regards,
Murali