Subject: | Re: Gateway |
From: | "Ned D. Arnold" <[email protected]> |
To: | Marty Kraimer <[email protected]> |
Cc: | "Kenneth Evans, Jr." <[email protected]>, Andrew Johnson <[email protected]>, [email protected], [email protected], [email protected] |
Date: | Wed, 27 Nov 2002 09:14:09 -0600 |
Marty,

Ken has an excellent test bed for the PV Gateway (and Portable Channel Access Server). The PV Gateway on Hydra is used heavily and is a great test for HEAVY LOAD conditions.
His observation of the increased CPU usage (his e-mail to Jeff on 10-8-02 is included below) seems to be much more significant than typical "resource creep". The loop rate for the R3.13 version was over 100 Hz. When this was stopped and the R3.14 version was started, the loop rate dropped to under 10 Hz. Same hardware, same number of PV requests, same version of Solaris, different version of PCAS.
Since the response time was noticeably slower for the users, we backed out of the 3.14 version just before the user run began.
Ken and I thought this was a significant discovery that may affect many applications attempting to move to R3.14. Now (before the R3.14 release) is the time to be thorough and investigate whether this is a typical case or not. Such a performance degradation will affect numerous systems, and buying faster hardware is not always a solution.
According to recent talks at JLAB, the PCAS is used extensively and in many situations performance is critical (imaging systems, LabView server, etc).
Ned

Marty Kraimer wrote:
> At the EPICS Core Working Group meeting at JLAB the problem of running
> the gateway on 3.14 was discussed. My understanding is that when we
> build the gateway against 3.14 it uses so much cpu time that it doesn't
> work correctly. Someone mentioned that on 3.13 it already uses 75% of
> the cpu.
>
> Some questions.
>
> Is this true? Ken should know the answers. Do you have some actual
> performance numbers?
>
> If this is true then it sounds like only a matter of time until even the
> 3.13 version will fail.
>
> I assume this only applies to the gateway for ASD not the gateways for
> the CATS.
>
> For the ASD gateway can't we have another solution?
>
> Some possibilities.
>
> Run separate gateways for phoebus and oxygen.
> Get a more powerful gateway machine.
>
> Marty
>

Re: Gateway Status 10-8-02

Kenneth Evans, Jr. wrote:
Jeff,

We have been running the latest Gateway 2.0 built with Base 3.14 on Hydra as our main Gateway since Oct. 1. This is the version that doesn't print the many errlog messages, though there are quite a few left. It crashed (only) once on Oct. 8 with "Pure virtual function called". Otherwise, it seems to be working properly. It is doing what a Gateway is supposed to do as far as I can tell.

The problem is that it is inefficient and using too much CPU. The Gateway CPU has consistently been at around 95%, and the loop rate has been just above 10 Hz, the limit if ca_poll() is to be called once every 100 ms. This is on a 440 MHz UltraSparc-IIi with 1 processor. It is "on the edge". There are complaints of slow response, and if you try to do anything on Hydra, the response is slow (as would be expected for a machine using 100% CPU). We did not feel we could continue to run it, as user operations recommence tomorrow.

The attached StripTool plot shows what happened when we changed back to 1.3.3.4, the latest Gateway 1.3 version. The CPU goes down and the loop rate goes up. It is no longer "on the edge" and has quite a bit of headroom. The graph to the left of where it was changed is typical of the load over the last week. I have been watching it, and it has pretty much looked like that during the whole period. It is now handling the same load but using fewer resources.

Note that the loop rate is now over 100 Hz. fdManager is called with a 10 ms timeout, so this means fdManager is returning early. (The loop consists of calls to fdManager, then ca_poll, then Gateway stuff.)

Note that both versions are "keeping up" in that the ServerEventRate is equal to the ServerPostRate. The threshold where this no longer happens is much higher.

It now runs on Linux and the behavior, while better, seems commensurate given that the Linux box is 2 Pentium III's at 930 MHz each. It also runs on Windows, but the performance seems much worse there (even though it is 1 Pentium at 800 MHz). It appears to use very little CPU on WIN32, even when loaded. It just stops "keeping up". In addition, the threshold for "keeping up" is lower than for the other two. That is, it doesn't seem to be utilizing the available CPU.

It needs to be fixed before we can use it for production.

-Ken
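
The loop-rate reasoning in Ken's message can be checked with a little arithmetic. The sketch below is only an illustration and does not use the Gateway or EPICS code: the function names are hypothetical, and the only numbers taken from the message are the 10 ms fdManager timeout and the 100 ms ca_poll() interval. It assumes a loop that waits once and then does its work on every pass.

```python
# Values quoted in the message above.
FD_TIMEOUT_S = 0.010      # fdManager is called with a 10 ms timeout
CA_POLL_PERIOD_S = 0.100  # ca_poll() should run at least once every 100 ms


def loop_rate_hz(wait_s, work_s=0.0):
    """Passes per second for a loop that waits wait_s, then works for work_s."""
    return 1.0 / (wait_s + work_s)


def wait_returned_early(observed_hz, timeout_s=FD_TIMEOUT_S):
    """True if the observed rate is only possible when the wait ends before its timeout.

    A loop that always sits out the full timeout cannot run faster than
    1 / timeout_s, even if the rest of the pass costs nothing.
    """
    return observed_hz > loop_rate_hz(timeout_s)


def meets_poll_deadline(observed_hz, period_s=CA_POLL_PERIOD_S):
    """True if one ca_poll() per pass is frequent enough for the required period."""
    return observed_hz >= 1.0 / period_s


if __name__ == "__main__":
    # Illustrative rates only, NOT measurements from Hydra.
    for rate in (110.0, 10.5, 9.0):
        print(rate, wait_returned_early(rate), meets_poll_deadline(rate))
```

With a 10 ms timeout the ceiling for a loop that never returns early is 100 Hz, which is why a rate above 100 Hz implies fdManager is returning before its timeout. With one ca_poll() per pass, 10 Hz is the floor, which is why a loop rate just above 10 Hz is "on the edge".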