Subject: Re: Epics Archiver Appliance and Network Storage slowdown
From: "Shankar, Murali via Tech-talk" <tech-talk at aps.anl.gov>
To: "Manoussakis, Adamandios" <manoussakis1 at llnl.gov>, "tech-talk at aps.anl.gov" <tech-talk at aps.anl.gov>
Date: Tue, 15 Mar 2022 16:47:43 +0000
I did some testing locally as well.
>> I do not think it has to do much with the network
I'm inclined to disagree (though I may be wrong). The getDataAtTime API is a secondary API (the primary one being the getData API call) and is largely targeted at quality control for save/restore applications. The PB data formats are not necessarily optimized for this call, so we use a lot of parallelism to extract whatever performance we can. As a result, the latency of the file system calls (not the throughput) is a significant factor in the performance of this API call.
For example, on my test system (using NFS as my MTS), I can get 18000 PVs in 42 seconds. On one of my production systems (with some real data variance), I can get 9000 PVs in 9 seconds. In all of these systems, there is probably an upper limit after which we run out of some resource further downstream.
The most obvious one is the size of the ForkJoin common thread pool, and I tested this in a different production system (with a slightly older GPFS). I was able to improve performance on the same system by increasing the ForkJoin common thread pool size. This is a 12-core/24-thread CPU. With the default ForkJoin common thread pool size, I get 9000 PVs in a minute. By increasing this to 48
( -Djava.util.concurrent.ForkJoinPool.common.parallelism=48 )
and then to 64
( -Djava.util.concurrent.ForkJoinPool.common.parallelism=64 )
I was able to reduce this to 30 seconds or less. Based on your setup, you could try something similar and see if it makes a difference. Alternatively, you could also try reducing this value to see whether the performance is the result of starvation somewhere.
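For reference, a minimal sketch of how that JVM flag might be applied, assuming the appliance's retrieval webapp is started via a script that honors a JAVA_OPTS-style variable (the variable name and path are deployment-specific assumptions):

```shell
# Hypothetical: raise the ForkJoin common pool parallelism for the JVM
# running the retrieval webapp. Adjust the variable name and startup
# script to match your own deployment.
export JAVA_OPTS="$JAVA_OPTS -Djava.util.concurrent.ForkJoinPool.common.parallelism=48"
```

The default common-pool size is one less than the number of available processors, so on a 12-core/24-thread box the flag above roughly doubles the parallelism available to getDataAtTime.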
>> file size for the 3000 PVs is about 40MB only
I still do not understand this; 18000 scalar PVs for me are 2MB. How do 3000 PVs become 40MB in your situation (unless you have some large waveforms in the mix)?
Hope that helps.
Regards,
Murali
From: Manoussakis, Adamandios <manoussakis1 at llnl.gov>
Sent: Monday, March 14, 2022 5:46 PM
To: Manoussakis, Adamandios <manoussakis1 at llnl.gov>; Shankar, Murali <mshankar at slac.stanford.edu>; tech-talk at aps.anl.gov <tech-talk at aps.anl.gov>
Subject: RE: Epics Archiver Appliance and Network Storage slowdown

Hi Murali,
I have some data back from testing: the HTTP POST request asks for about 3000 PVs from the AA, which stores about 4500 PVs total in the LTS.
I ran the benchmark test to transfer a single PB file and was getting around 100MB/s, so I do not think it has much to do with the network, since the file size for the 3000 PVs is only about 40MB. I then reduced the request to about 1200 PVs and it definitely sped up quite a bit: around 15 seconds, down from the 2.5 minutes it was taking. Someone mentioned the lookup time for each PV was log(n) or n·log(n) for the AA, but that does not seem like it would explain the slowdown from going from 1200 to 3000 PVs in the POST.
I am curious how many PVs you usually request in a single HTTP call when requesting from the AA; is there an upper limit I should stick to? You mentioned possibly breaking up the requests into smaller chunks.
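[Editor's note: breaking a large request into smaller chunks, as discussed above, could look like the sketch below. The appliance host, port, and endpoint path are assumptions based on the standard Archiver Appliance layout; verify them against your own deployment.]

```python
# Hypothetical sketch: split a large PV list into smaller getDataAtTime
# POST requests and merge the per-chunk results. The appliance URL is a
# placeholder; the endpoint takes a JSON list of PV names in the body.
import json
from urllib import request

APPLIANCE = "http://archiver.example.org:17665"  # hypothetical host


def chunked(seq, size):
    """Yield successive slices of at most `size` items from seq."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]


def get_data_at_time(pvs, iso_time, chunk_size=500):
    """Query getDataAtTime in chunks and merge the JSON responses."""
    merged = {}
    for chunk in chunked(pvs, chunk_size):
        req = request.Request(
            f"{APPLIANCE}/mgmt/bpl/getDataAtTime?at={iso_time}",
            data=json.dumps(chunk).encode(),
            headers={"Content-Type": "application/json"},
        )
        with request.urlopen(req) as resp:
            merged.update(json.load(resp))
    return merged
```

A smaller `chunk_size` trades a few extra HTTP round trips for bounding how much parallel file-system work any single request triggers on the server.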
Thanks,
Adam
From: Tech-talk <tech-talk-bounces at aps.anl.gov> On Behalf Of Manoussakis, Adamandios via Tech-talk
Thanks Murali,
I will try to run some more tests, including the one that was mentioned, to make sure the transfer rates look correct. I can't imagine local vs. NAS performance on a setup this small should be vastly different.
From: Tech-talk <tech-talk-bounces at aps.anl.gov> On Behalf Of Shankar, Murali via Tech-talk
>> I think for this experiment it's only 6000 PVs

I think that should not take this long. Will look into this a bit here as well.
Regards, Murali
From: Shankar, Murali
>> We are using the getDataAtTime endpoint
>> 40MB file from the archive
From this I'd guess that you are getting data for several hundred thousand PVs? The getDataAtTime API call has to look at all of those files (with perhaps cache misses for most of them) and then do a binary search to determine the data point. Your NAS needs to support quite a high rate of IOPS for this to come back quickly, and this is a use case where even the smallest latency tends to accumulate quickly. Perhaps you can consider breaking down your request into smaller chunks when using the NAS?
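[Editor's note: the per-PV binary search described above can be sketched as follows, assuming each PB file holds samples sorted by timestamp. Function and variable names are illustrative, not the appliance's actual internals.]

```python
# Sketch of the per-PV lookup getDataAtTime performs conceptually:
# given time-sorted samples, binary-search for the last sample whose
# timestamp is at or before the requested time.
import bisect


def sample_at_time(samples, when):
    """samples: list of (epoch_seconds, value) tuples sorted by time.
    Returns the latest sample with timestamp <= when, or None."""
    # bisect_right with an "infinite" value picks the insertion point
    # just past any sample stamped exactly at `when`.
    idx = bisect.bisect_right(samples, (when, float("inf")))
    return samples[idx - 1] if idx > 0 else None
```

The search itself is O(log n) per file; the cost Murali describes comes from having to open and touch one file per PV, which is where file-system latency (rather than throughput) dominates.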
Regards, Murali