EPICS Re: CA gatway runs away when zero length PV name in UDP search request


Hello all,

It seems that we are experiencing similar issues with NSLS-II gateway instances (thanks to Zhijian Yin for discovering this discussion). On multiple occasions (separated by weeks and months), a few gateways spontaneously became stuck, with processes just hanging and consuming an abnormal share of CPU.

The problem was obscured by several factors:

- it appeared on rare occasions and for no known reason;

- there was seemingly no distinguishable error message in the logs pointing to the cause;

- even for clues which were present (e.g. client host:port in logs), backtracking didn't yield much information.

In the context of the clues provided in this discussion, some things are not clear to me:

- if there is a relevant bug in CAS code, shouldn't all servers receive the invalid zero-length name request and be affected, and not just gateways?

- when the issue appears, why are only some gateways affected rather than all of them?

- why do some gateways seem to choke right away, while other instances withstand multiple occurrences of the same problem?

From one of our logs:

Sep 18 11:52:48 !!! Errlog message received (message is above)
zero length PV name in UDP search request?

Sep 18 11:52:48 !!! Errlog message received (message is above)
CAS Request: ? on box64-1.cs.nsls2.local:37525: cmd=6 cid=712 typ=5 cnt=11 psz=32 avail=2c8
CAS Request: ? on box64-1.cs.nsls2.local:37525: cmd=6 cid=712 typ=5 cnt=11 psz=32 avail=2c8
CAS:
Sep 18 11:52:48 !!! Errlog message received (message is above)

...

...

Sep 27 10:54:40 !!! Errlog message received (message is above)
CAS Request: ? on box64-1.cs.nsls2.local:59857: cmd=6 cid=433 typ=5 cnt=11 psz=32 avail=1b1
CAS:
Sep 27 18:02:19 !!! Errlog message received (message is above)
zero length PV name in UDP search request?

Sep 27 18:02:19 !!! Errlog message received (message is above)
@@@ Restarting child

So it took a while before users complained and the gateway was restarted. Is it possible that, when the issue occurs, existing connections persist but new ones are not established?

It looks like, for now, there is a good chance that a gateway restart will be required should any client perform an invalid query...
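Until the cause is understood, one way to notice the condition earlier might be to scan the gateway logs for the message. Below is a minimal sketch in Python, not a tested tool: the patterns are inferred only from the log excerpt above, and the function name `summarize` and the command-line usage are hypothetical. It counts the "zero length PV name" messages and tallies the client host:port pairs that appear in "CAS Request" lines.

```python
import re
import sys
from collections import Counter

# Both patterns are assumptions derived from the log excerpt in this message;
# other gateway versions or log setups may format these lines differently.
ZERO_LEN_MSG = "zero length PV name in UDP search request"
REQUEST_RE = re.compile(
    r"CAS Request: \S+ on (?P<host>[^\s:]+):(?P<port>\d+): cmd=(?P<cmd>\d+)"
)


def summarize(lines):
    """Return (number of zero-length-name messages, Counter of (host, port))."""
    zero_len_hits = 0
    clients = Counter()
    for line in lines:
        if ZERO_LEN_MSG in line:
            zero_len_hits += 1
        match = REQUEST_RE.search(line)
        if match:
            clients[(match.group("host"), int(match.group("port")))] += 1
    return zero_len_hits, clients


if __name__ == "__main__":
    # Hypothetical usage: python scan_gateway_log.py /path/to/gateway.log
    with open(sys.argv[1], errors="replace") as log_file:
        hits, clients = summarize(log_file)
    print(f"zero-length PV name messages: {hits}")
    for (host, port), count in clients.most_common():
        print(f"{host}:{port} appeared in {count} CAS Request line(s)")
```

The regular expression stops at the `cmd=` field, so it still matches request lines that were wrapped later in the line, as in the excerpt above. The host:port it reports is only whatever the gateway logged next to the request, which, as noted, may not lead back to the offending client.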

Regards,

Anton.

Subject:	Re: CA gatway runs away when zero length PV name in UDP search request
From:	Anton Derbenev <[email protected]>
To:	[email protected]
Date:	Fri, 29 Sep 2017 17:36:12 -0400

Experimental Physics and Industrial Control System