Experimental Physics and Industrial Control System
EPICS experts:
Many of you are aware that Jefferson Lab (formerly CEBAF) has had problems with
IOCs mysteriously crashing, and in fact with multiple IOCs crashing because
one crashed. I think I now have a better understanding of the problem,
so here is a little more information and an analysis. Note
that I do not yet have the fix -- perhaps Jeff Hill can comment?
Symptoms of the problem:
1) one IOC is in a funny state; screens which are already up continue
to work for signals NOT on that IOC
2) any MEDM or DM session which attempts to connect to ANY signal on
that IOC will become COMPLETELY UNUSABLE, and will in fact not
be able to connect to ANY signal on ANY IOC; its existing screens
continue to function for signals on other IOCs.
Further investigation reveals:
1) a scan task is consuming all available CPU resources
Analysis:
1) the CAMAC serial highway is probably at fault, and the error
handling for a fault is sufficiently expensive that if all I/O to
the highway fails, there are not enough spare CPU cycles
to keep up with the increased load due to error handling
(this needs further study to verify and fix, probably by taking
the offending serial highway offline and forcing all further
I/O to fail immediately with no handler invocation).
(IMPORTANT ASIDE: As some of you may remember, we run our name
resolution task at an elevated priority so that when we bring up
a screen with 2000 channels on it, it resolves in an acceptable
amount of time. Without this adjustment in priorities, that
screen would take 5 minutes or more to completely resolve. This
is because some IOCs are running 75-80% busy in
steady state, and the remaining 20% is not enough CPU time to
resolve 2000 names before channel access times out)
2) the name resolution task is at a higher priority than the scan
task now using all available CPU cycles, so name resolution
can respond, BUT the channel access client tasks are completely
starved
3) So, here's the scenario: client C broadcasts a list of names to
target IOC T. T responds with the subset that it is willing to serve.
C takes the next step in connecting to those channels, but T does
not respond. C waits forever.
It appears that the channel access client library never times out
on this error. Most sites never see this, because the odds of an
IOC dying between the name resolution response and the connection
establishment are vanishingly small, and if the IOC crashes all the
way and reboots, the CA library probably unhangs and reconnects OK. So the
error is only revealed if the IOC hangs or dies without rebooting
between the name resolution reply and the connection establishment.
This may also be the reason that one IOC brings another
down: IOC A hangs, B attempts to reconnect and its CA library hangs,
causing another IOC to attempt to reconnect to B and hang, causing ...
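
The starvation described in item 2 of the analysis can be illustrated with a toy model of strict fixed-priority scheduling. This is only a sketch: the task names, priority numbers, and CPU demands below are illustrative assumptions, not measurements or settings from a real IOC. The only convention taken from vxWorks is that a lower priority number means a more urgent task.

```python
# Toy model of strict fixed-priority scheduling. All task names, priorities
# and demands are hypothetical values chosen for illustration.

def simulate(tasks, ticks):
    """Run a strict-priority scheduler for `ticks` time slices.

    `tasks` maps a task name to (priority, demand). A lower priority number
    is more urgent, and `demand` is the number of ticks of work the task
    wants in every 100-tick period. Returns the ticks each task received.
    """
    received = {name: 0 for name in tasks}
    pending = {name: 0 for name in tasks}
    for tick in range(ticks):
        if tick % 100 == 0:
            # New period: every task becomes ready with its full demand.
            for name, (_, demand) in tasks.items():
                pending[name] = demand
        ready = [name for name in tasks if pending[name] > 0]
        if not ready:
            continue  # idle tick, nobody wants the CPU
        # The most urgent ready task always wins; nothing below it runs.
        winner = min(ready, key=lambda name: tasks[name][0])
        pending[winner] -= 1
        received[winner] += 1
    return received

# Steady state: the scan task uses about 75% of the CPU.
healthy = {
    "name_resolution": (10, 5),   # raised priority, small demand
    "scan": (60, 75),
    "ca_client_tasks": (80, 15),  # lowest priority of the three
}
# Fault case: error handling makes the scan task want the whole CPU.
faulted = dict(healthy, scan=(60, 100))

healthy_result = simulate(healthy, 1000)
faulted_result = simulate(faulted, 1000)
print("healthy:", healthy_result)
print("faulted:", faulted_result)
```

In the faulted run the name resolution task still gets all the time it asks for, so the IOC keeps answering searches, while the lower-priority tasks that would complete the connection receive nothing.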
Fixes available:
(1) improve the CAMAC driver so starvation does
not occur; this does not really fix the problem, since anything that
causes complete CPU starvation on one IOC will affect our whole control
system if it persists long enough
(2) reduce the priority of the name resolution task to the default;
this is simply not acceptable -- EPICS would be slow as a dog for
operators
(3) change the channel access library to correctly time out in this
scenario; this assumes that I have analyzed the problem correctly
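
As a rough illustration of what fix (3) asks for, here is a minimal sketch of a client that bounds its wait for the first reply after connecting. It uses plain Python sockets, not the channel access library, and the function name, message bytes, and timeout value are all hypothetical. The "hung" server is modeled as a socket that listens but never accepts or answers, which is roughly how a starved server task looks from the outside.

```python
import socket

def connect_with_deadline(host, port, request, timeout_s):
    """Connect, send `request`, and wait at most `timeout_s` seconds for
    the first reply bytes.

    Returns the bytes received (empty if the peer closed the connection),
    or None on a timeout or any other socket error. Hypothetical helper,
    not part of any real channel access API.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.settimeout(timeout_s)  # also bound the wait for the reply
            sock.sendall(request)
            return sock.recv(4096)
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return None

# Model a hung IOC: the kernel completes the TCP handshake from the listen
# backlog, but no task ever accepts the connection or sends a reply.
hung = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
hung.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
hung.listen(1)
port = hung.getsockname()[1]

# The request bytes are a placeholder, not the real wire protocol.
result = connect_with_deadline("127.0.0.1", port, b"claim channels", 0.2)
print("reply:", result)
hung.close()
```

With a deadline in place the caller gets control back and can mark the channels as disconnected and retry later, instead of blocking every other connection in the same session.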
Regards,
Chip
-----------------------------------------------------------------------------
Chip Watson
Internet: [email protected] Thomas Jefferson National Accelerator Facility *
Tel: (757) 269-7101 12000 Jefferson Avenue, MS 12A2
FAX: (757) 269-5024 Newport News, VA 23606
WWW: http://www.jlab.org/~watson/
* (formerly CEBAF, the Continuous Electron Beam Accelerator Facility)