Experimental Physics and Industrial Control System


Subject: Re: Gateway
From: "Ned D. Arnold" <nda@aps.anl.gov>
To: Marty Kraimer <mrk@aps.anl.gov>
Cc: "Kenneth Evans, Jr." <evans@aps.anl.gov>, Andrew Johnson <anj@aps.anl.gov>, Ralph.Lange@mail.bessy.de, dalesio@lanl.gov, johill@lanl.gov
Date: Wed, 27 Nov 2002 09:14:09 -0600

Ken has an excellent test bed for the PV Gateway (and Portable Channel Access Server). The PV Gateway on Hydra is used heavily and is a great test for HEAVY LOAD conditions.

His observation of the increased CPU usage (his e-mail to Jeff on 10-8-02 is included below) seems to be much more significant than typical "resource creep". The loop rate for the R3.13 version was over 100 Hz. When this was stopped and the R3.14 version was started, the loop rate dropped to under 10 Hz. Same hardware, same number of PV requests, same version of Solaris, different version of PCAS.

Since the response time was noticeably slower for the users, we backed out of the 3.14 version just before the user run began.

Ken and I thought this was a significant discovery that may affect many applications attempting to move to R3.14. Now (before the R3.14 release) is the time to be thorough and investigate whether this is a typical case or not. Such a performance degradation will affect numerous systems, and buying faster hardware is not always a solution.

According to recent talks at JLAB, the PCAS is used extensively, and in many situations performance is critical (imaging systems, LabVIEW server, etc.).


Marty Kraimer wrote:
> At the EPICS Core Working Group meeting at JLAB the problem of running
> the gateway on 3.14 was discussed. My understanding is that when we
> build the gateway against 3.14 it uses so much cpu time that it doesn't
> work correctly. Someone mentioned that on 3.13 it already uses 75% of
> the cpu.
> Some questions.
> Is this true? Ken should know the answers. Do you have some actual
> performance numbers?
> If this is true then it sounds like only a matter of time until even the
> 3.13 version will fail.
> I assume this only applies to the gateway for ASD not the gateways for
> the CATS.
> For the ASD gateway can't we have another solution?
> Some possibilities.
> Run separate gateways for phoebus and oxygen.
> Get a more powerful gateway machine.
> Marty

Re: Gateway Status 10-8-02
Kenneth Evans, Jr. wrote:

     We have been running the latest Gateway 2.0 built with Base 3.14 on
Hydra as our main Gateway since Oct. 1.  This is the version that doesn't
print the many errlog messages, though there are quite a few left.  It
crashed (only) once on Oct. 8 with "Pure virtual function called".  Otherwise,
it seems to be working properly.  It is doing what a Gateway is supposed to
do as far as I can tell.

     The problem is that it is inefficient and using too much CPU.  The
Gateway CPU has consistently been at around 95%, and the loop rate has been
just above 10 Hz, the limit if ca_poll() is to be called once every 100 ms.
This is on a 440 MHz UltraSparc-IIi with 1 processor.  It is "on the edge".
There are complaints of slow response, and if you try to do anything on
Hydra, the response is slow (as would be expected for a machine using 100%
CPU).  We did not feel we could continue to run it, as user operations
recommence tomorrow.

     The attached StripTool plot shows what happened when we changed back to
the latest Gateway 1.3 version.  The CPU goes down and the loop
rate goes up.  It is no longer "on the edge" and has quite a bit of
headroom.  The graph to the left of where it was changed is typical of the
load over the last week.  I have been watching it, and it has pretty much
looked like that during the whole period.  It is now handling the same load
but using fewer resources.  Note that the loop rate is now over 100 Hz.
fdManager is called with a 10 ms timeout, so this means fdManager is
returning early.  (The loop consists of calls to fdManager, then ca_poll,
then Gateway stuff).  Note that both versions are "keeping up" in that the
ServerEventRate is equal to the ServerPostRate.  The threshold where this no
longer happens is much higher.
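
The loop-rate reasoning above can be sketched with a little arithmetic. This is only an illustrative model, not Gateway code: it assumes each pass through the loop waits some time in fdManager (using no CPU) and then spends the rest in ca_poll plus Gateway processing (all CPU). The function names and the 5 ms / 95 ms split in the usage note are hypothetical and chosen only to match the rough figures quoted in this message.

```python
def loop_rate_hz(wait_s, work_s):
    """Loop iterations per second for one fdManager wait plus one work phase.

    wait_s: time spent blocked in the fdManager call (assumed to use no CPU).
    work_s: time spent in ca_poll and Gateway processing (assumed all CPU).
    """
    return 1.0 / (wait_s + work_s)


def cpu_fraction(wait_s, work_s):
    """Fraction of wall-clock time spent working under the same assumptions."""
    return work_s / (wait_s + work_s)


def must_be_returning_early(measured_rate_hz, timeout_s):
    """True if the measured rate is only possible when the wait ends early.

    A loop that always waited the full timeout could not exceed 1/timeout_s
    iterations per second, even with zero work per pass.
    """
    return measured_rate_hz > 1.0 / timeout_s


# Hypothetical numbers, for illustration only.
FD_TIMEOUT_S = 0.010          # the 10 ms fdManager timeout mentioned above
idle_rate = loop_rate_hz(FD_TIMEOUT_S, 0.0)
loaded_rate = loop_rate_hz(0.005, 0.095)
loaded_cpu = cpu_fraction(0.005, 0.095)
```

Under this model a full 10 ms wait with no work gives a ceiling of 100 Hz, so a measured rate above 100 Hz implies the wait is ending early. A loop that spends about 95 ms of every 100 ms working would run at about 10 Hz with about 95% CPU, which is the "on the edge" condition described earlier.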

     It now runs on Linux and the behavior, while better, seems commensurate
given that the Linux box is 2 Pentium IIIs at 930 MHz each.  It also runs
on Windows, but the performance seems much worse there (even though it is 1
Pentium at 800 MHz).  It appears to use very little CPU on WIN32, even when
loaded.  It just stops "keeping up".  In addition, the threshold for
"keeping up" is lower than for the other two.  That is, it doesn't seem to
be utilizing the available CPU.

     It needs to be fixed before we can use it for production.


