Chip,
First, let me say that I think we are all relieved now that we better understand
the nature of the problems that have been occurring at CEBAF. In summary,
those of us at other sites have most likely not seen this problem because:
1) We have not experienced the CAMAC driver failure that is occurring
at CEBAF. To the best of my knowledge CEBAF is using a different CAMAC driver
than is used at other sites. This is perhaps also related to the average available
idle time. The CEBAF IOCs are (according to Chip) 75-80% busy on average.
Therefore I expect that peak CPU consumption is much higher.
It appears that a high CAMAC IO error rate caused the driver to consume all available
CPU (perhaps spinning on a transaction completion flag). Others may disagree;
however, IMHO it is a failure of the driver (or the system design) if a driver is allowed
to consume all available CPU at high priority when the hardware is failing.
When a driver fails in this fashion at high priority we can easily imagine
that any number of critical functions will fail (because of CPU starvation).
When writing drivers it is always best to avoid consuming significantly
more CPU in off-normal situations than is required to process the record.
2) CEBAF has elevated the priority of one of the tasks in the CA server.
3) The IP kernel TCP/IP virtual circuit "connect()" timeout parameter
may be different at CEBAF.
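
To illustrate the point in item 1 about a driver spinning on a completion
flag, here is a minimal sketch of a bounded wait. It is not taken from any
CAMAC driver; wait_bounded(), the done/yield callbacks, and the fake_hw
stand-in are all hypothetical names invented for this example. The idea is
only that a failed transaction should cost a fixed, small amount of CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes; not taken from any real CAMAC driver. */
enum wait_status { WAIT_OK = 0, WAIT_TIMEOUT = -1 };

/* Hypothetical callbacks: done() reads the completion flag, yield()
 * gives up the CPU (for example by delaying the task for one tick). */
typedef bool (*done_fn)(void *ctx);
typedef void (*yield_fn)(void *ctx);

/* Poll at most max_polls times, yielding between polls, so that failed
 * hardware costs a bounded amount of CPU instead of an endless spin. */
enum wait_status wait_bounded(done_fn done, yield_fn yield, void *ctx,
                              unsigned max_polls)
{
    unsigned i;

    assert(done != NULL && yield != NULL);
    for (i = 0; i < max_polls; i++) {
        if (done(ctx))
            return WAIT_OK;
        yield(ctx);
    }
    return WAIT_TIMEOUT;   /* caller reports the error and moves on */
}

/* Stand-in for hardware, for illustration only. */
struct fake_hw {
    unsigned ready_after;  /* polls that see "not done" before completion */
    unsigned polls;        /* how many times the flag was read */
    unsigned yields;       /* how many times the CPU was given up */
};

bool fake_done(void *ctx)
{
    struct fake_hw *hw = ctx;
    return ++hw->polls > hw->ready_after;
}

void fake_yield(void *ctx)
{
    struct fake_hw *hw = ctx;
    hw->yields++;
}
```

In a real driver the yield callback would presumably block or delay the task
rather than count, and a WAIT_TIMEOUT result would be reported as an IO error
for that record instead of being retried in a tight loop.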
Chip wrote:
>
> (IMPORTANT ASIDE: As some of you may remember, we run our name
> resolution task at an elevated priority so that when we bring up
> a screen with 2000 channels on it, it resolves in an acceptable
> amount of time. Without this adjustment in priorities, that
> screen would take 5 minutes or more to completely resolve. This
> is due to the fact that some IOC's are running 75-80% busy in
> steady state, and the remaining 20% is not enough CPU time to
> resolve 2000 names before channel access times out)
>
Note that the efficiency of the name resolution activity was improved
in 3.12 at some point (by changing a time constant). Perhaps Marty's numbers
do not agree with what CEBAF has observed because he is using a more recent
version of the code.
Marty wrote:
> What happened at TJNAF is the following:
>
> 1)TJNAF raised the priority of CA UDP above that of the scan tasks
> 2)CAMAC failed causing a scan task to use all available cpu time.
> This caused all tasks of lower priority to be starved.
> 3)A CA client issued search requests.
> 4)CA UDP received the request and sent a reply to the client.
> 5)The client sent a message to CA TCP and waited forever for a response.
> 6)CA TCP never got a chance to process the message.
This is a correct summary of the CA activity occurring except that
I must clarify steps 5 and 6. When the CA client library receives a
search response over UDP from a new IOC it attempts to establish a TCP/IP
virtual circuit to the IOC using the "connect()" call in the
socket library. This call has no timeout parameter and the kernel
default timeout is quite long. Many of you may have experienced this long
timeout when you typed "telnet xxxx" when "xxxx" was not present on
the net (you may have typed ^C instead of waiting the full duration
of the timeout). The default connect timeout on our Sun systems is about 80 sec.
Note that the CA client lib _WILL_ recover in the rare circumstances when
this occurs if the operator is willing to wait for the full duration of
the timeout.
There is no portable call for establishing a TCP/IP circuit other than
"connect()". The vxWorks OS does supply "connectWithTimeout()", however.
I am not using non-blocking IO at the time that "connect()" is called.
If I did, it would add some (perhaps substantial) complication to the client
lib but would also avoid stalling the CA client library for the duration
of the timeout. The net effect would be faster connects for clients that
connect to multiple IOCs when one of the IOCs has failed under the rare
circumstances described above. Of course the client would still never
connect to any IOCs that have failed in the way that Chip has described.
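
For readers unfamiliar with the technique, the following is a minimal sketch
of a non-blocking connect with a caller-chosen timeout, written against the
POSIX socket calls. It is not the CA client library code; connect_bounded()
and connect_ipv4_bounded() are hypothetical names, and the sketch assumes a
platform that provides fcntl(), select(), and the SO_ERROR socket option.
The socket is put into non-blocking mode, connect() is started, and select()
waits for the socket to become writable for at most timeout_sec seconds.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of CA. Returns 0 on success, or -1 with
 * errno set (ETIMEDOUT if the circuit was not up within timeout_sec). */
int connect_bounded(int fd, const struct sockaddr *addr, socklen_t len,
                    int timeout_sec)
{
    int flags, rc, err = 0;
    socklen_t errlen = sizeof(err);
    fd_set wfds;
    struct timeval tv;

    assert(timeout_sec >= 0);
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    if (connect(fd, addr, len) < 0) {
        err = errno;
        if (err == EINPROGRESS) {
            /* Connect is under way; wait until writable or timed out. */
            FD_ZERO(&wfds);
            FD_SET(fd, &wfds);
            tv.tv_sec = timeout_sec;
            tv.tv_usec = 0;
            rc = select(fd + 1, NULL, &wfds, NULL, &tv);
            if (rc == 0)
                err = ETIMEDOUT;
            else if (rc < 0)
                err = errno;
            else if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0)
                err = errno;   /* otherwise err holds the connect result */
        }
    }
    (void)fcntl(fd, F_SETFL, flags);   /* restore the original mode */
    if (err != 0) {
        errno = err;
        return -1;
    }
    return 0;
}

/* Hypothetical convenience wrapper: returns a connected fd, or -1. */
int connect_ipv4_bounded(const char *ip, unsigned short port, int timeout_sec)
{
    struct sockaddr_in sa;
    int fd, saved;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &sa.sin_addr) != 1) {
        errno = EINVAL;
        return -1;
    }
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (connect_bounded(fd, (struct sockaddr *)&sa, sizeof(sa),
                        timeout_sec) < 0) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

This sketch still blocks the caller for up to timeout_sec, so it only
shortens the stall. Avoiding the stall entirely would mean leaving the socket
non-blocking and folding the writable check into the client's existing
select() loop, which is where the added complication comes from. The sketch
also treats an interrupted select() (EINTR) as a failure rather than retrying.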
I will be looking at the level of effort required to install
this (non-blocking connect) change. It would be interesting to hear from all
of the sites that consider this to be an important issue (and therefore would
like to see non-blocking connect() installed).
I am also examining the situations reported by Rolf Keitel and Bill Brown
in more detail.
Chip wrote:
> It may also be that this is the reason that 1 ioc brings another
> down: ioc A hangs, B attempts to reconnect and its ca library hangs,
> causing ioc to attempt to reconnect to B and hang, causing ...
>
I doubt that this is occurring unless the vxWorks "connect()" implementation is
consuming too much CPU (we have not seen this).
> (2) reduce the priority of the name resolution task to the default;
> this is simply not acceptable -- EPICS would be slow as a dog for
> operators
>
Perhaps newer versions of 3.12 (or a faster CPU that is less than 80% loaded)
will connect 2000 channels faster. No doubt the CA connect
algorithm could also be optimized (this would perhaps remove some
idle delays when the CPU isn't working, but I suspect it would not
reduce the total CPU consumed by much).
Jeff
--
______________________________________________________________________
Jeffrey O. Hill Internet [email protected]
LANL MS H820 Voice 505 665 1831
Los Alamos, NM 87545 USA FAX 505 665 5107