Hello all,
It seems that we are experiencing similar issues with the NSLS-II gateway instances (thanks to Zhijian Yin for finding this discussion). On multiple occasions, separated by weeks or months, a few gateways spontaneously became stuck, with their processes hanging and consuming an abnormal share of CPU.
The problem was obscured by several factors:
- it appeared only rarely and with no known cause;
- the logs seemingly contained no distinguishable error message pointing to the cause;
- even where clues were present (e.g. client host:port in the logs), tracing them back didn't yield much information.
In the context of the clues provided in this discussion, some things are still not clear to me:
- if there is a relevant bug in the CAS code, shouldn't all servers receive the invalid zero-length name request and be affected, not just the gateways?
- when the issue appears, why are only some gateways affected and not all of them?
- why do some gateways seem to choke right away, while other instances withstand multiple occurrences of the same problem?
From one of our logs:
Sep 18 11:52:48 !!! Errlog message received (message is above)
zero length PV name in UDP search request?
Sep 18 11:52:48 !!! Errlog message received (message is above)
CAS Request: ? on box64-1.cs.nsls2.local:37525: cmd=6 cid=712 typ=5 cnt=11 psz=32 avail=2c8
CAS Request: ? on box64-1.cs.nsls2.local:37525: cmd=6 cid=712 typ=5 cnt=11 psz=32 avail=2c8
CAS:
Sep 18 11:52:48 !!! Errlog message received (message is above)
...
<a dozen errors like that>
...
Sep 27 10:54:40 !!! Errlog message received (message is above)
CAS Request: ? on box64-1.cs.nsls2.local:59857: cmd=6 cid=433 typ=5 cnt=11 psz=32 avail=1b1
CAS:
Sep 27 18:02:19 !!! Errlog message received (message is above)
zero length PV name in UDP search request?
Sep 27 18:02:19 !!! Errlog message received (message is above)
@@@ Restarting child
So it was a while before users complained and the gateway was restarted. Is it possible that, when the issue occurs, existing connections persist but new ones cannot be established?
For now, it looks like there is a good chance that a gateway restart will be required should any client send an invalid query...
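In case it is useful for comparing logs across gateways, here is a minimal Python sketch that counts the "zero length PV name" messages and tallies the client host:port pairs from the "CAS Request" lines. It is only an illustration: the function name summarize and the regular expression are assumptions based on the log lines quoted above, not part of the gateway or of EPICS, and it expects each request on a single line (wrapped lines would have to be joined first).

```python
import re
from collections import Counter

# Hypothetical pattern modelled on the "CAS Request" lines in the log above.
# The avail field looks hexadecimal there (e.g. 2c8, 1b1), so hex digits are allowed.
REQUEST_RE = re.compile(
    r"CAS Request: \? on (?P<host>[^:\s]+):(?P<port>\d+): "
    r"cmd=(?P<cmd>\d+) cid=(?P<cid>\d+) typ=(?P<typ>\d+) "
    r"cnt=(?P<cnt>\d+) psz=(?P<psz>\d+) avail=(?P<avail>[0-9a-fA-F]+)"
)
ZERO_LEN_MARKER = "zero length PV name in UDP search request?"


def summarize(lines):
    """Return (number of zero-length-name messages, Counter of (host, port))."""
    zero_len = 0
    clients = Counter()
    for line in lines:
        if ZERO_LEN_MARKER in line:
            zero_len += 1
        match = REQUEST_RE.search(line)
        if match:
            clients[(match.group("host"), int(match.group("port")))] += 1
    return zero_len, clients


if __name__ == "__main__":
    # Lines copied from the log excerpt above.
    sample = [
        "Sep 27 10:54:40 !!! Errlog message received (message is above)",
        "CAS Request: ? on box64-1.cs.nsls2.local:59857: "
        "cmd=6 cid=433 typ=5 cnt=11 psz=32 avail=1b1",
        "Sep 27 18:02:19 !!! Errlog message received (message is above)",
        "zero length PV name in UDP search request?",
    ]
    zero_len, clients = summarize(sample)
    print("zero-length messages:", zero_len)
    for (host, port), count in clients.most_common():
        print(f"{host}:{port} -> {count} request line(s)")
```

To run it against a real log, pass an open file instead of the sample list, e.g. summarize(open(path)) with path pointing at the gateway log. Comparing the tallies between a gateway that hung and one that did not might show whether both actually saw the same requests.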
Regards,
Anton.