archivePVWorkflowBatchSize and archivePVWorkflowTickSeconds control the rate at which new PVs are added to the archiver, so these are probably not the root cause of your issue.
The commandThreadCount is probably related. This determines the number of CAJContexts that are created; each CAJContext has its own search thread. In practice, a higher value seems to help with reconnects and reconnect speed. You could try increasing it slowly to see if this helps.
>> What precise effects would we observe if this value were not high enough?
Longer reconnect times and higher CPU usage on some of the engine threads. A larger commandThreadCount increases the parallelism somewhat.
>> Why may a value of 1 have been chosen? It is likely that we wanted to throttle the level of channel access search requests.
This is the reason for lowering the value; more search threads definitely means more broadcast traffic (CA searches are broadcast).
But the root cause for this is probably that there are a lot of PVs in the search queues of the CAJContexts. These are mainly "PVs requested for archiving that do not exist" and "disconnected PVs that do not exist". You should abort the former and pause the latter to eliminate them from the search queues. See the admin guide for hints on maintaining a clean system.
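A rough sketch of the "abort the former" step, in case it is useful. It assumes the mgmt BPL is reachable at a made-up URL (archiver.example.org below is a placeholder), that getNeverConnectedPVs returns a JSON list of objects with a pvName field, and that abortArchivingPV takes the name as a pv parameter. Please check all of these against your version before running it for real.

```python
import json
import urllib.parse
import urllib.request

# Placeholder URL (assumption); replace with your appliance's mgmt BPL URL.
MGMT_BPL = "http://archiver.example.org:17665/mgmt/bpl"


def bpl_url(base, endpoint, **params):
    """Build a BPL URL such as .../abortArchivingPV?pv=NAME (name is URL-encoded)."""
    query = urllib.parse.urlencode(params)
    return f"{base}/{endpoint}?{query}" if query else f"{base}/{endpoint}"


def fetch_text(url):
    """GET a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def abort_never_connected(base=MGMT_BPL, fetch=fetch_text):
    """Abort the archive request for every PV that has never connected.

    Assumes each entry in the report carries a 'pvName' field.
    Returns the list of PV names for which an abort was sent.
    """
    report = json.loads(fetch(bpl_url(base, "getNeverConnectedPVs")))
    names = [entry["pvName"] for entry in report]
    for name in names:
        fetch(bpl_url(base, "abortArchivingPV", pv=name))
    return names
```

You would probably want to review the returned list (or print it first and abort afterwards) rather than aborting blindly, since a PV whose IOC is merely switched off also shows up as never connected.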
We run into this here occasionally as well. We have a script that checks for the liveness of disconnected PVs and sends out an email if a live PV has been disconnected for a long time (getCurrentlyDisconnectedPVs reports the time of disconnect to help with this). Pausing and resuming a PV will tear down the CA channel and rebuild it, and this should almost always result in an immediate reconnection. A good strategy seems to be
- Pause all PVs that are live but disconnected.
- Resume these PVs in small batches of a few hundred or more.
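The two steps above could be scripted roughly as follows. This is only a sketch: the base URL is a placeholder, the 30 second gap between batches is an arbitrary choice of mine, and it assumes you have already worked out which disconnected PVs are actually live (for example by checking them with a CA client). It uses the pauseArchivingPV and resumeArchivingPV BPL calls with the PV name passed as the pv parameter.

```python
import time
import urllib.parse
import urllib.request

# Placeholder URL (assumption); replace with your appliance's mgmt BPL URL.
MGMT_BPL = "http://archiver.example.org:17665/mgmt/bpl"


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bpl_call(endpoint, pv, base=MGMT_BPL):
    """Invoke a per-PV BPL endpoint, e.g. pauseArchivingPV?pv=NAME."""
    url = f"{base}/{endpoint}?{urllib.parse.urlencode({'pv': pv})}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def pause_then_resume(pvs, call=bpl_call, batch_size=200, delay_s=30.0,
                      sleep=time.sleep):
    """Pause every PV first, then resume them batch by batch.

    `pvs` is a list of PVs already known to be live but disconnected.
    `delay_s` (an arbitrary assumption) spaces out the batches so the
    resulting CA searches do not all hit the network at once.
    """
    for pv in pvs:
        call("pauseArchivingPV", pv)
    for index, batch in enumerate(batches(pvs, batch_size)):
        if index > 0:
            sleep(delay_s)
        for pv in batch:
            call("resumeArchivingPV", pv)
```

Something like pause_then_resume(live_but_disconnected, batch_size=200) would then do the whole cycle; tune the batch size and delay to whatever your network tolerates.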
>> We have determined that we have a large number of PVs requested for archiving that do not exist (around 6 %),
BTW, I recently worked around a bug (in a Java collection class, of all things) that impacted this. In a working system, these PVs should have been kicked out of the system within 24 hours. Still watching this issue though; not sure if it is completely gone.
>> We have a collective memory that "a maximum of 80 000 PVs per appliance is optimal."
This is probably just a ballpark figure used for capacity planning. The Engine write thread(s) and Max ETL(%) values on the Metrics page are probably a decent measure of capacity.
Hope this helps.
Regards,
Murali