EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: Using all the cores available on modern processors
From: Keith Thorne <[email protected]>
To: <[email protected]>
Cc: [email protected]
Date: Fri, 21 Jun 2013 09:43:13 -0500
Dear Nick and Matt
	The option LIGO took here was to use a Linux "CPU Idle" patch that lets you run a single process on each core; that process cannot be interrupted because the OS thinks the core is idle.
This gives us microsecond-level stability. However, you have to implement the real-time code as a kernel module and use shared memory to communicate with it.
The EPICS IOC still runs in user space.
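
As a rough illustration of the shared-memory hand-off described above, here is a minimal user-space sketch. It is not the LIGO implementation: there the producer is a kernel module, whereas this sketch uses Python's `multiprocessing.shared_memory` on both sides. The record layout (`RECORD`) and the function names are hypothetical, and nothing here guards against a reader seeing a half-written record.

```python
import struct
from multiprocessing import shared_memory

# Hypothetical record layout: 64-bit sequence counter followed by one double.
RECORD = struct.Struct("<Qd")


def publish(shm, seq, value):
    """Producer side: overwrite the single record at offset 0."""
    RECORD.pack_into(shm.buf, 0, seq, value)


def read_latest(name):
    """Consumer side (e.g. an IOC in user space): attach by name and read."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return RECORD.unpack_from(shm.buf, 0)
    finally:
        shm.close()


def demo():
    """Create a segment, publish one record, read it back, then clean up."""
    shm = shared_memory.SharedMemory(create=True, size=RECORD.size)
    try:
        publish(shm, 1, 3.5)
        return read_latest(shm.name)
    finally:
        shm.close()
        shm.unlink()
```

A real system would also need some agreed way for the reader to detect a record that changed while it was being read, for example by checking the sequence counter before and after.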
	
					Keith Thorne

On Jun 21, 2013, at 9:08 AM, <[email protected]> wrote:

> Hi Matt,
> 
> That sounds great. I am copying this to tech-talk since you have made some interesting points.
> 
> I remember talking to you about this while you were working on it, but at that stage you couldn't summarise your findings as well as you have now. The number of variables we can play with is rather large: the underlying architecture, SCHED_FIFO, a pre-emptible kernel and task CPU affinities. It was much simpler when all we had was a single-core VxWorks CPU.
> 
> If there are 25-30k context switches/sec and top isn't accounting for all the time then that would explain a lot of things. It may be that setting CPU affinities is the way to go, and is more general because it doesn't require a real time kernel. I looked at the MCoreUtils module yesterday and it contains most of what we need, but the 3.15 dependency is a bit of a pain.
> 
> Cheers,
> 
> Nick Rees
> Principal Software Engineer           Phone: +44 (0)1235-778430
> Diamond Light Source                  Fax:   +44 (0)1235-446713
> 
> -----Original Message-----
> From: Pearson, Matthew R. [mailto:[email protected]] 
> Sent: 20 June 2013 15:11
> To: Rees, Nick (DLSLtd,RAL,DIA)
> Subject: Re: Using all the cores available on modern processors
> 
> 
> Hi Nick,
> 
> I looked into this briefly in the past. I had exactly the same issue when running multiple areaDetector threads. There was a data readout thread (which was actually a simulation, taking up most of a CPU core), various processing threads and visualization threads. I found that running multiple threads caused the CPU load per core to drop, and to drop dramatically once the number of threads exceeded the number of physical cores (which is expected). If there were fewer threads than cores, it was much better. However, even with 3-4 threads running (fewer than the number of cores), the data readout thread's CPU utilization still dropped off, despite multiple cores being idle.
> 
> Setting the CPU affinity seemed to help a lot, at least when the number of threads was less than the number of cores. The problem was that the other threads could still pre-empt the data readout thread. I didn't really get around to experimenting with real-time priorities, except that I did try using SCHED_FIFO (with priority 50) and it made things much worse. It seemed to cause the thread to be switched out to other cores more frequently (which isn't what it was supposed to do).
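
For readers who want to try the two knobs mentioned above, here is a minimal Linux-only sketch using Python's `os` scheduling calls. It is not the areaDetector or MCoreUtils mechanism; `readout_worker` and `run_pinned_readout` are hypothetical stand-ins, and the priority of 50 is simply the value quoted in the message. Requesting SCHED_FIFO normally needs root or a suitable rtprio limit, so the sketch reports failure instead of raising.

```python
import os
import threading


def pin_current_thread(cpu):
    """Restrict the calling thread to one CPU and return its new affinity set."""
    # On Linux, pid 0 means "the calling thread".
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


def try_sched_fifo(priority=50):
    """Request SCHED_FIFO for the calling thread; return True if it was granted."""
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return True
    except OSError:
        # Typically PermissionError when run without real-time privileges.
        return False


def readout_worker(cpu, result):
    """Hypothetical stand-in for a data readout thread."""
    result["affinity"] = pin_current_thread(cpu)
    result["realtime"] = try_sched_fifo(50)
    # ... the real readout loop would go here ...


def run_pinned_readout():
    """Start the worker on the lowest-numbered CPU this process may use."""
    cpu = min(os.sched_getaffinity(0))
    result = {}
    worker = threading.Thread(target=readout_worker, args=(cpu, result))
    worker.start()
    worker.join()
    return cpu, result
```

Pinning only constrains where this one thread may run. Other threads are still free to be scheduled on the same core, which matches the pre-emption problem described above.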
> 
> Given that modifying the CPU affinity and scheduler had such a big effect, I assumed the issue was with the Linux scheduler (or, more likely, I hadn't set it up right for that kind of work) and perhaps context switch time (as I had >100 threads all running at the same priority). I'm not sure if 'top' accounts for time spent switching. I found I was running with ~25-30K context switches per second, with all the threads active. 
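
One way to obtain a context-switch rate like the figure quoted above is to sample the cumulative `ctxt` counter in `/proc/stat` twice and take the difference. The sketch below assumes Linux and measures the system-wide rate; the message does not say how the original number was obtained, so this is only one plausible method. The function names are illustrative.

```python
import time


def parse_ctxt(stat_text):
    """Return the cumulative context-switch count from /proc/stat text."""
    for line in stat_text.splitlines():
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no 'ctxt' line found")


def switches_per_second(before, after, seconds):
    """Convert two cumulative counts into a rate."""
    return (after - before) / seconds


def sample_system_rate(interval=1.0, path="/proc/stat"):
    """Sample the system-wide context-switch rate over `interval` seconds."""
    with open(path) as f:
        before = parse_ctxt(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_ctxt(f.read())
    return switches_per_second(before, after, interval)
```

Calling `sample_system_rate()` with and without the IOC running gives a rough idea of how much of the switching is due to the IOC's threads.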
> 
> I believe EPICS Base 3.15 will have support for setting CPU affinity and real-time thread priorities (MCoreUtils).
> 
> Cheers,
> Matt
> 
> 
> On Jun 20, 2013, at 6:54 AM, [email protected] wrote:
> 
>> This is probably aimed at Mark Rivers (sorry Mark!) but I am hoping other people may have seen the same symptoms. As I mentioned at the last EPICS meeting, on a number of development projects over the past year (mostly with area detector systems, but with some others) we have run into what superficially looks like a common problem.
>> 
>> When we configure a system simply (e.g. an areaDetector configuration which just reads out data and writes it to disk and does nothing else apart from scalar status callbacks) then the main CPU load is on a single core and it can use up to 90% of that core quite happily. 
>> 
>> When we make things more complicated and add processing plugins, for example, the data throughput to disk drops off dramatically and frames start buffering up (and some may be dropped). The CPU load is distributed across all cores, but they are only loaded at the 20-30% level, so the system seems largely idle with plenty of processing power available, but it can't be utilised.
>> 
>> Typically we have seen these problems on dual socket NUMA architecture Intel systems, with a non pre-emptively scheduled Windows or Linux OS.
>> 
>> In theorising about this we have conjectured a number of scenarios (in some sort of priority order):
>> 
>> 1. AreaDetector has a locking issue and processing is held up by tasks holding locks that they shouldn't have.
>> 2. Because this has happened on non-preemptively scheduled systems, the high priority tasks aren't getting CPU for some reason (despite there being CPU available). This may be related to (1) so maybe the system isn't overcoming priority inversions properly.
>> 3. There is some other resource bottleneck - such as a bottleneck in the QPI because of the NUMA architecture.
>> 4. There is a problem with top or the Windows performance monitor which doesn't account for overheads properly in a busy multi-core system.
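
Scenario 4 can be cross-checked on Linux by computing per-core utilisation directly from the raw tick counters in `/proc/stat` and comparing the result with what `top` reports. The sketch below is one way to do that, under the assumption that idle plus iowait ticks count as "not busy"; it does nothing for Windows, and the function names are illustrative.

```python
def parse_cpu_times(stat_text):
    """Map 'cpu0', 'cpu1', ... to (idle_ticks, total_ticks) from /proc/stat text."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Skip the aggregate 'cpu' line; keep only the per-core 'cpuN' lines.
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            ticks = [int(x) for x in fields[1:9]]  # user .. steal
            idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)  # idle + iowait
            times[fields[0]] = (idle, sum(ticks))
    return times


def busy_fraction(before, after):
    """Per-core busy fraction between two parse_cpu_times() snapshots."""
    busy = {}
    for cpu, (idle1, total1) in after.items():
        idle0, total0 = before[cpu]
        elapsed = total1 - total0
        busy[cpu] = 0.0 if elapsed == 0 else 1.0 - (idle1 - idle0) / elapsed
    return busy
```

Taking two snapshots of `/proc/stat` a second or so apart and passing the parsed results to `busy_fraction` gives figures that can be compared core by core with the 20-30% levels described above.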
>> 
>> So, at this stage we have not looked at this problem in any detail; we have just noticed the symptoms. Has anyone else come across it, and does anyone have any pointers before we invest a serious amount of effort in looking at it? I suspect that if we are going to understand this it may take some time, and it may have some design implications for EPICS on modern architectures.
>> 
>> Cheers,
>> 
>> Nick Rees
>> Principal Software Engineer           Phone: +44 (0)1235-778430
>> Diamond Light Source                  Fax:   +44 (0)1235-446713
>> 
>> 
>> -- 
>> This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail.
>> Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. 
>> Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message.
>> Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom


