EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: flaky IOC problems at Jefferson Lab
From: Marty Kraimer <[email protected]>
To: [email protected]
Date: Tue, 07 Jan 1997 10:02:24 -0600
I have the following comments:
 
1) When an IOC CPU has 0 or very little idle time, it is impossible
   to prevent problems. However, see 3.
 
2) Raising the priority of the name resolution task above the scan
   tasks gives me an uneasy feeling. An ioc should always give
   highest priority to local affairs before it pays attention to
   CA clients.
 
3) It is always a good idea to stress test software, i.e. push it
   until it fails. In the present case we are talking about Channel
   Access Connection management.
 
The situation reported by Chip provides an additional way of making
CA fail, i.e. a new stress test for CA. Let me restate what Chip
said to make sure I understand the problem.
 
CA has two tasks related to client search requests: CA UDP and
CA TCP. CA UDP listens for the client search requests. When
it receives a new packet of requests it looks to see if it has
any of the requested channels. For each channel it has, it sends
a udp message back to the client. If the client does not already
have a TCP connection to that ioc it sends a connection request
to CA TCP.
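
To make the search step concrete, here is a minimal sketch of what CA UDP does with one packet of search requests, as restated above. It is only an illustration, not the CA server code: the function name, the channel names, and the address tuple are all made up for the example.

```python
def handle_search_packet(requested_names, local_channels, ioc_address):
    """Return one reply per requested channel that this ioc hosts.

    Hypothetical helper: channels the ioc does not have get no reply
    at all in this sketch.
    """
    replies = []
    for name in requested_names:
        if name in local_channels:
            # One udp reply per channel found, telling the client where it lives.
            replies.append((name, ioc_address))
    return replies


# Illustrative data only; none of these names come from a real database.
local_channels = {"magnet:current", "magnet:voltage"}
replies = handle_search_packet(
    ["magnet:current", "bpm:x"], local_channels, ("ioc1", 5064)
)
```

Only the client that receives a reply goes on to contact CA TCP, which is the step that matters in the failure described next.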

What happened at TJNAF is the following:
 
1) TJNAF raised the priority of CA UDP above that of the scan tasks.
2) CAMAC failed, causing a scan task to use all available cpu time.
   This caused all tasks of lower priority to be starved.
3) A CA client issued search requests.
4) CA UDP received the request and sent a reply to the client.
5) The client sent a message to CA TCP and waited forever for a response.
6) CA TCP never got a chance to process the message.
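
The sequence above can be imitated with a toy strict-priority scheduler. This is a sketch under assumptions, not a model of the real kernel: the tick counts are invented, and "smaller number means higher priority" is just the convention chosen for the example.

```python
import math


def simulate(tasks, ticks):
    """Give every tick to the ready task with the highest priority.

    tasks maps a task name to (priority, ticks_of_work). Returns the
    number of ticks each task actually received.
    """
    remaining = {name: work for name, (_, work) in tasks.items()}
    received = {name: 0 for name in tasks}
    for _ in range(ticks):
        ready = [name for name in tasks if remaining[name] > 0]
        if not ready:
            break
        # Assumed convention: smaller number means higher priority.
        chosen = min(ready, key=lambda name: tasks[name][0])
        received[chosen] += 1
        remaining[chosen] -= 1
    return received


tasks = {
    "CA UDP": (1, 5),         # raised above the scan task
    "scan": (2, math.inf),    # stuck after the CAMAC failure, always ready
    "CA TCP": (3, 5),         # still below the scan task
}
usage = simulate(tasks, ticks=1000)
```

In this run CA UDP finishes its work and the scan task takes every remaining tick, so CA TCP never runs, no matter how long the simulation is allowed to continue.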
 
To give more proof that this is what caused the problem, TJNAF could
also raise the priority of CA TCP and see if the client connects.
(NOTE that this violates the second comment)
If this is the situation then Jeff could make CA more robust by
changing step 5) above to wait with a timeout. Jeff, what do you think?
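
As an illustration of "wait with a timeout", here is a minimal sketch using plain sockets. It is not the CA client library; wait_for_reply is a hypothetical helper, and the socket pair merely stands in for a client talking to a CA TCP task that never gets to run.

```python
import socket


def wait_for_reply(sock, timeout_seconds):
    """Return the reply bytes, or None if nothing arrives in time."""
    sock.settimeout(timeout_seconds)
    try:
        return sock.recv(1024)
    except socket.timeout:
        return None


# The server end is never read or answered, like a starved CA TCP.
client_end, server_end = socket.socketpair()
client_end.sendall(b"connection request")
reply = wait_for_reply(client_end, timeout_seconds=0.2)
client_end.close()
server_end.close()
```

With a None result the client can report the failure or retry later instead of hanging forever.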
 
Now let me discuss the issue of raising the priority of CA UDP.
TJNAF did this because it took CA clients with many channels too long
to connect if an IOC had little idle time. Chip's message stated
that connecting to 2000 channels on an ioc that only had 20% idle
time took 5 minutes.
 
I will claim that raising the priority of CA UDP is the wrong solution
to this problem. As TJNAF discovered, this only led to other problems.
Perhaps a solution is to see if the search algorithm can be improved.
For those who have not thought about this algorithm, this is a HARD
problem. Let me give the first thing to think about. Take the case
of a client with many channels running on a processor that is faster
than the ioc containing the channels. If the client broadcasts the
requests as fast as possible, the ioc networking software will receive
packets faster than they can be processed. In this case the networking
software silently throws away packets without CA UDP even knowing what
happened. Thus CA must use a back-off algorithm, which makes things difficult.
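
For readers who have not seen a back-off scheme, here is one common form, exponential back-off with a ceiling. It is a sketch only and makes no claim about what CA actually does; the function name and all the numbers are assumptions chosen for the example.

```python
def search_delays_ms(initial_ms, factor, maximum_ms, attempts):
    """Delay to wait before each successive search attempt, in milliseconds."""
    delays = []
    delay = initial_ms
    for _ in range(attempts):
        delays.append(delay)
        # Grow the delay after every attempt, but never past the ceiling.
        delay = min(delay * factor, maximum_ms)
    return delays


delays = search_delays_ms(initial_ms=50, factor=2, maximum_ms=1000, attempts=7)
```

The difficulty mentioned above shows up in choosing these numbers: back off too slowly and the ioc keeps dropping packets, back off too quickly and a healthy ioc takes longer than necessary to connect.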
 
I have a test case consisting of the following:
 
1) A 5000 record database (generated by a C program).
2) A CA client that attaches to all 5000 records. It then
   loops writing to all 5000 records and waiting a specified time
   before issuing another set of writes.
3) A second CA client that sets monitors on all 5000 records.
 
When I start the CA clients each connects to all 5000 records in about
20 seconds. Thus even if the CPU was only 20% idle we should expect
the connections to be made in < 100 seconds. From what TJNAF
has experienced this must not be true.
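
The arithmetic behind that estimate, and a comparison with the numbers Chip reported, can be written out as follows. This is a rough sketch: it assumes the 20 second figure was measured on a mostly idle ioc, that connection time scales linearly with the number of channels, and that it scales inversely with idle time. It also ignores any hardware differences between the two sites.

```python
def scaled_connect_time(baseline_seconds, idle_percent):
    """Connection time if only idle_percent of the CPU is available."""
    return baseline_seconds * 100 / idle_percent


# Test case: 5000 records connect in about 20 seconds.
test_case_at_20_percent = scaled_connect_time(20, 20)

# Same per-channel cost applied to 2000 channels, then scaled to 20% idle.
baseline_2000_channels = 20 * 2000 / 5000
predicted_tjnaf = scaled_connect_time(baseline_2000_channels, 20)

reported_tjnaf = 5 * 60   # 5 minutes, from Chip's message
slowdown = reported_tjnaf / predicted_tjnaf
```

Under these assumptions the 2000 channel case should take well under a minute, so the reported 5 minutes is several times longer than simple scaling predicts.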

Perhaps this test case could be modified to reproduce the TJNAF
problem. One suggestion is to create a subroutine record that has
a "busy" loop that eats cpu time. One of its fields could be the
loop iteration value so that it is easy to adjust how much time it uses.
The get and put clients could be started for different cpu idle times
to see how the connection times change. In addition the priorities
of CA UDP and CA TCP could be changed to see how connection times
change.
This could be a good test case for testing modifications to the
CA connection algorithm.
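
The idea of an adjustable busy loop can be sketched as below. On an ioc this would be a C subroutine attached to the subroutine record, with the iteration count taken from one of the record's fields; the Python here is only a stand-in for the idea, and both function names are hypothetical.

```python
import time


def busy_loop(iterations):
    """Eat cpu time; the iteration count is the adjustable knob."""
    total = 0
    for i in range(iterations):
        total += i
    return total


def time_busy_loop(iterations):
    """Measure how long one pass of the loop takes, in seconds."""
    start = time.perf_counter()
    busy_loop(iterations)
    return time.perf_counter() - start


elapsed = time_busy_loop(100000)
```

Timing one pass for a few iteration counts gives a rough calibration, so the count can be picked to leave a chosen amount of idle time at a given scan rate.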
 
No matter what improvements are made to the CA connection algorithm,
it will not solve the problem of an IOC having little or no idle time.
 

Marty Kraimer



