Experimental Physics and Industrial Control System
The Gateway, or rather the underlying CAS code, has a problem when
EPICS_CA_MAX_ARRAY_BYTES is set. In short, setting it causes code that
processes UDP search requests from clients (such as MEDM) to run that
is not run otherwise, and this uncovers a bug: one of the server UDP
sockets blocks when it should be non-blocking. This is known to
happen on Solaris and Linux, through 3.14.7.
I don't usually put bugs in tech-talk but this one proved rather
difficult to track down, and this may help someone else.
At the APS the main Gateways started crashing after operating system
patches were applied at the end of December. The actual cause of the
failure was that sometime in the fall,
EPICS_CA_MAX_ARRAY_BYTES=140000 was added to everyone's .login. This
did not affect the Gateways, however, until they were restarted after
the patches. The first efforts were to check for machine problems,
since the executable had not changed. The problem was caused by a
"reverse" Gateway, one that had its server side on the machine
network. The hanging in this one caused the others to crash, though
it did not crash itself.
The bug is that the CAS socket that receives UDP broadcasts blocks
waiting for input and can delay the server, so it does not process its
queues. For the Gateway, this in turn delays the Gateway client and
also keeps it from processing.
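
To illustrate the difference between a blocking and a non-blocking UDP
receive, here is a minimal sketch in Python. It is not the CAS code and
not the fix from the bug report; the function names, the loopback
address, and the timeout value are all assumptions made for the
example. The point is only that a non-blocking socket returns at once
when nothing is queued, so the caller can go on to service its other
queues.

```python
import select
import socket


def make_nonblocking_udp_socket(host="127.0.0.1", port=0):
    """Create a UDP socket that never blocks in recvfrom (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    # Without this call, recvfrom() would wait until a datagram arrives.
    sock.setblocking(False)
    return sock


def drain_datagrams(sock, max_bytes=1024):
    """Read every datagram currently queued, then return without waiting."""
    datagrams = []
    while True:
        try:
            data, _addr = sock.recvfrom(max_bytes)
        except BlockingIOError:
            # Nothing left to read; a blocking socket would hang here instead.
            break
        datagrams.append(data)
    return datagrams


def service_once(sock, timeout=0.05):
    """Wait at most `timeout` seconds for input, then drain what is there."""
    readable, _, _ = select.select([sock], [], [], timeout)
    return drain_datagrams(sock) if readable else []
```

A server loop built this way would call service_once() and then move on
to its other work whether or not a search request arrived.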
The hanging Gateway does not crash, but because it is not processing
properly, it apparently sends bad messages to other clients and
Gateways. The other Gateways are not robust to this and crash. This
is a second bug. Both are characterized by "bad message id" messages
in the Gateway logs.
The problem is compounded by the new feature in CAC, where it
disconnects if you do not call ca_poll every 30 sec, even when there
is nothing wrong with the actual connection. The thrashing caused by
the continual connects and disconnects contributes to the Gateway
interaction problem.
This "feature" also causes other problems and is mentioned here
because there are no apparent plans to change it. Programs that used
to work no longer do. (Probe was an example. When it was not
monitoring, there was no reason to call ca_poll, and this worked fine.
It has now been fixed to call ca_poll every 100 ms, even when it
doesn't need information.) There are many other programs that used to
work and either need to be changed or are not amenable to being
changed. I know of at least one developer who has set
EPICS_CA_CONN_TMO to a huge number to get around this problem. This
is bad. [End of commercial]
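
The kind of change made to Probe can be sketched as a loop that calls a
poll routine at a fixed interval whether or not the program needs data.
This is only an illustration in Python: the poll argument is a
placeholder for whatever wraps ca_poll in a real client, and the
function name, interval, and duration are assumptions, not part of any
EPICS API.

```python
import time


def poll_periodically(poll, interval=0.1, duration=1.0):
    """Call `poll` about every `interval` seconds for `duration` seconds.

    `poll` is a stand-in for a routine that calls ca_poll; it is called
    even when the program has nothing to read, so that the client
    library gets a chance to run.
    """
    deadline = time.monotonic() + duration
    calls = 0
    while time.monotonic() < deadline:
        poll()
        calls += 1
        time.sleep(interval)
    return calls
```

A GUI program would more likely hang the same call on a toolkit timer
than on a sleep loop, but the effect is the same: the poll routine is
called far more often than the 30 sec limit requires.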
If you want more information, these bugs are described in more detail,
along with a fix for the hanging, in the EPICS bug tracking system,
accessed from the EPICS pages:
http://www.aps.anl.gov/epics/mantis/login_page.php
The bugs are #176 and #177.