On 10/9/20 8:51 AM, Kasemir, Kay via Core-talk wrote:
For unfortunate reasons, our EPICS_CA_BEACON_PERIOD is set to 2 seconds instead of 15, and EPICS_CA_CONN_TMO=5. The idea was that clients like EDM should show disconnects after 5 seconds instead of looking at stale data for the default 30 seconds, and that IOCs with CA links should consider them disconnected after 5 seconds as well.
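For reference, the setup described above amounts to the following in the clients' and IOCs' environments (these are the standard CA environment variable names; the values are the ones quoted in the message):

```shell
# Beacon period the server advertises, in seconds (default 15).
export EPICS_CA_BEACON_PERIOD=2
# How long a client tolerates an unresponsive circuit before
# marking its channels disconnected, in seconds (default 30).
export EPICS_CA_CONN_TMO=5
```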
This seems excessive. The reduced timeout I can understand, but reducing the beacon period I'm less sure about, and 2 seconds seems excessive. Is this left over from the days when UDP beacons were used to timeout TCP connections?
IIRC, on virtual circuits that have no regular traffic from the client to the server, the C++ client implementation takes about twice the beacon period to recognize that an IOC is no longer responsive. It then sends an "are you there" over TCP, and if that doesn't get responded to within some period it will mark those channels as disconnected. Using UDP beacons this way reduces the amount of network traffic (and the corresponding server workload) that would be needed if the server sent periodic beacons over each TCP circuit.
Were you thinking that has changed? Has it?
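As I understand the behaviour described above, the liveness decision amounts to something like the following. This is a simplified illustrative sketch, not the actual C++ client code; the function name and the 5-second echo timeout are assumptions for illustration:

```python
def circuit_state(last_traffic, echo_sent, now, beacon_period, echo_timeout=5.0):
    """Decide what a CA client should do with a quiet virtual circuit.

    last_traffic: time the last message was seen from the server
    echo_sent:    time an "are you there" echo was sent, or None
    Returns one of 'ok', 'send_echo', 'disconnect'.
    """
    if echo_sent is not None:
        # Already probing: give up if the echo goes unanswered too long
        if now - echo_sent > echo_timeout:
            return 'disconnect'
        return 'ok'
    # No traffic for about two beacon periods: probe the server over TCP
    if now - last_traffic > 2 * beacon_period:
        return 'send_echo'
    return 'ok'
```

With the default 15-second beacon period, a circuit that has been silent for just over 30 seconds triggers an echo, and an unanswered echo marks the channels disconnected; with a 2-second beacon period the probe fires after only about 4 seconds of silence.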
Just one such IOC tricks the CA client into restarting the name searches for disconnected channels. Add archive setups with 4000 missing channels (why are there so many missing channels? other issue...), physics apps that look for "all BPMs" while some are currently offline, ... and you get a lot of broadcasts.
What to do?
At APS we have occasional campaigns that look at which clients are searching for names that don't connect, and we get the client owners to clean up their screens or software. If you have any CA Gateways with their server side connected to the machine network, they can tell you what your current CA search rate is; that's always worth keeping an eye on.
I would suggest trying to increase the beacon periods on your IOCs to something more reasonable; would 5 seconds be acceptable to your users instead? That should give you 10-15 seconds for disconnect notifications. Maybe now that you know the cost of aiming for 5 seconds you can persuade management to let you increase it?
I've long thought that this approach of trying to model the timing of beacons was too clever. Maybe a simpler model with a timeout at 3x the beacon period, or if the beacon count jumps by >3, then reset search timers?
The client doesn’t really try to model the timing of the beacons from each server; it just regards a significant change in the measured period as its trigger, although I’m not sure how lenient it is. It does have to adapt to a different period from each server: the PCAS used a different beacon period than the IOC for a long time, and it may still do that.
With PVXS, I use linear back-off for search retry instead of exponential, in an attempt to mitigate the effects of this sort of situation. I also have a 30 second hold-off after each beacon anomaly before another will be recognized.
The 30 second hold-off certainly sounds like a good idea that might be implementable in the Java client. I’m not sure whether the C++ CA client has anything like that in it; it may not need it.
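The linear back-off and the beacon-anomaly hold-off described above could be sketched like this. This is an illustrative sketch, not the PVXS or Java client code; the step sizes, caps, and class name are made up for the example:

```python
def linear_intervals(n, step=1.0, cap=30.0):
    """Linear back-off: the retry interval grows by a fixed step each try."""
    return [min(step * (i + 1), cap) for i in range(n)]

def exponential_intervals(n, base=0.03, cap=30.0):
    """Exponential back-off doubles each try, shown for comparison."""
    return [min(base * 2 ** i, cap) for i in range(n)]

class AnomalyHoldoff:
    """Ignore further beacon anomalies for hold_off seconds after one fires."""
    def __init__(self, hold_off=30.0):
        self.hold_off = hold_off
        self.last = None

    def trigger(self, now):
        # True means this anomaly should restart the name searches;
        # anomalies inside the hold-off window are swallowed.
        if self.last is None or now - self.last >= self.hold_off:
            self.last = now
            return True
        return False
```

The point of the linear schedule is that channels which have been disconnected for a while keep retrying at a moderate, bounded rate rather than backing off so far that a burst of anomalies resets them all to the fast end of an exponential schedule at once.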