Experimental Physics and Industrial Control System
This seemed to crash the simDetector on my system, even on trivial examples;
sometimes on the first file, sometimes after 10.
I will have to rebuild the code with debugging enabled, although that *really*
slows down write speeds on my systems.
-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Thursday, June 29, 2017 9:34 AM
To: [email protected]; [email protected]
Cc: [email protected]
Subject: RE: Area Detector and high performance NVME devices
Hi Mark R and Mark E,
> You load 1 NDPluginScatter plugin and 3 NDFileHDF5 plugins. Each HDF5
plugin is writing to a different file.
This would use the HDF5 library in 3 separate threads. The threadsafe
mechanism in the HDF5 library is a crude mutex lock on each library call. I
don't think performance will scale with this solution.
Cheers,
Ulrik
-----Original Message-----
From: Mark S. Engbretson [mailto:[email protected]]
Sent: 29 June 2017 14:36
To: 'Mark Rivers'; Pedersen, Ulrik (DLSLtd,RAL,TEC)
Cc: [email protected]
Subject: RE: Area Detector and high performance NVME devices
That is what I intend to try now just to see what happens.
Thanks to both Mark Rivers and Ulrik Pedersen for a road map of what to try
next.
Me
-----Original Message-----
From: Mark Rivers [mailto:[email protected]]
Sent: Thursday, June 29, 2017 8:01 AM
To: [email protected]; [email protected]
Cc: [email protected]
Subject: RE: Area Detector and high performance NVME devices
Mark,
In your case it appears that the underlying file system is fast enough to
keep up with the detector, but the HDF5 plugin is not. In this case you
could take advantage of the NDPluginScatter feature that was added in ADCore
R3-0.
Let's say that the detector and file system are somewhat less than 3 times
faster than the HDF5 file writing plugin. You can then build the following
pipeline. You load 1 NDPluginScatter plugin and 3 NDFileHDF5 plugins. Each
HDF5 plugin is writing to a different file.
Driver
|
NDPluginScatter
| | |
HDF5#1 HDF5#2 HDF5#3
NDPluginScatter distributes the NDArrays on a round-robin basis, i.e. the
first array goes to HDF5#1, the second to HDF5#2, etc. So you will end up
with 3 files, where each contains 1/3 of the data. As long as the HDF5
plugins can keep up, it will be deterministic that file #1 has arrays 1,
4, 7, ..., file #2 has arrays 2, 5, 8, ..., and file #3 has arrays 3, 6, 9,
.... If arrays are dropped, you can tell which file has which arrays via
the UniqueId, which is always stored for each array in the file.
Obviously you can scale this to more NDFileHDF5 plugins if you need more
than a factor of 3 performance gain.
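The round-robin bookkeeping described above can be sketched in a few lines of plain Python (the writer count and UniqueIds are purely illustrative; this only models the distribution, not the plugins themselves):

```python
# Sketch of NDPluginScatter's round-robin fan-out: array N goes to
# writer N % n_writers. UniqueIds here are illustrative.

def scatter(unique_ids, n_writers=3):
    """Return a list of per-writer UniqueId lists, in round-robin order."""
    files = [[] for _ in range(n_writers)]
    for i, uid in enumerate(unique_ids):
        files[i % n_writers].append(uid)
    return files

# 9 arrays with UniqueIds 1..9 distributed over 3 HDF5 writers:
print(scatter(range(1, 10)))
# -> [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
```

This also shows why the UniqueId matters: if a writer drops an array, the gap is visible in that file's UniqueId sequence.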
Mark
________________________________________
From: [email protected] [[email protected]]
Sent: Thursday, June 29, 2017 7:10 AM
To: [email protected]
Cc: Mark Rivers; [email protected]
Subject: RE: Area Detector and high performance NVME devices
Hi Mark,
Apologies, I'll just use this opportunity for a bit of shameless promotion:
Using HDF5 with high performance detectors is a very wide and complex topic
that can be discussed at great length. So come along and do just that at
the HDF5 workshop at ICALEPCS 2017 in sunny Barcelona:
http://www.icalepcs2017.org/index.php/program/workshops#HDF5
> I know that I could probably create multiple HDF file plugins - just
> no
idea of what would actually happen if they all look at the same image buffer
stream. Would they lock so that each one would get a unique NDArray, or
might the same image appear in multiple output files?
There is no point in creating multiple instances of the HDF5 file writer
plugin: the HDF5 library implements thread-safety with a crude global lock.
So you don't get a performance increase.
> In theory, under Cygwin or Linux, one can build the Open-MPI Libs that
PHDF5 requires, but I would suspect that the HDF file Plugin would still not
automagically become multithreaded and vastly faster running the same code.
I did some work a few years ago using parallel HDF5 to try to increase
performance. MPI scales across processes, not threads: each process runs
single-threaded HDF5. You can't build the parallel HDF5 library with MPI
into areaDetector, as areaDetector runs in a single process. If you receive
the data stream in an areaDetector driver, you would need to split and fan
out the data stream over some IPC mechanism to the MPI/pHDF5 file-writer
processes.
In my experience this does not scale or perform as well as one would expect,
and I would not advise going this way, at least not unless you scale to 100s
or 1000s of writer nodes. Writing from multiple independent (i.e. not MPI)
processes to individual files is much faster. You can then tie these
datasets together using the new HDF5 Virtual Dataset (VDS) feature to
provide a single, coherent dataset 'view' for reading/processing. VDS is
available from HDF5 version 1.10. This is what we are doing at Diamond
now for new fast/parallel detectors.
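The VDS stitching can be sketched with h5py (requires h5py >= 2.9 built against HDF5 >= 1.10; all file and dataset names below are made up for the example):

```python
import h5py
import numpy as np

# Sketch: 3 independent writers each wrote every 3rd frame; a Virtual
# Dataset stitches them back into one interleaved stack for readers.
n_files, frames_per_file, h, w = 3, 4, 8, 8

# Simulate the writers: writer i holds global frames i, i+3, i+6, ...
for i in range(n_files):
    frames = np.empty((frames_per_file, h, w), np.uint16)
    for k in range(frames_per_file):
        frames[k] = i + k * n_files      # stamp the global frame number
    with h5py.File(f"writer{i}.h5", "w") as f:
        f["data"] = frames

# One virtual layout that interleaves the three sources round-robin.
layout = h5py.VirtualLayout(shape=(n_files * frames_per_file, h, w),
                            dtype=np.uint16)
for i in range(n_files):
    src = h5py.VirtualSource(f"writer{i}.h5", "data",
                             shape=(frames_per_file, h, w))
    layout[i::n_files] = src             # every n_files-th global frame

with h5py.File("full.h5", "w", libver="latest") as f:
    f.create_virtual_dataset("data", layout)

# Readers now see a single coherent 12-frame dataset.
with h5py.File("full.h5", "r") as f:
    print(int(f["data"][5, 0, 0]))       # -> 5
```

The key point is that the VDS file holds only the mapping; the bulk data stays in the per-writer files, so the writers never contend for one file.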
> "Direct Chunk". Hmmm, don't supposed that This is already in the HDF5
> libs
and one could tweak some code to make use out of it, would be interested to
see a spped test on the nvme hardware.
Yes, Direct Chunk Write is available from HDF5 1.8.11 onwards. Even h5py now
have support for that so you can script some benchmarking in python!
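For reference, h5py exposes this as `write_direct_chunk` on the low-level dataset id. A minimal sketch (file and dataset names are illustrative):

```python
import h5py
import numpy as np

# Sketch of Direct Chunk Write: hand one pre-formed chunk's raw bytes
# straight to the file, bypassing the HDF5 filter pipeline entirely.
# Here one chunk == one frame. The dataset must be chunked.
frame = np.arange(16, dtype=np.uint16).reshape(4, 4)

with h5py.File("direct.h5", "w") as f:
    dset = f.create_dataset("data", shape=(1, 4, 4), dtype=np.uint16,
                            chunks=(1, 4, 4))
    # filter_mask=0 means "no filters applied to these bytes".
    dset.id.write_direct_chunk((0, 0, 0), frame.tobytes(), filter_mask=0)

# The data reads back through the normal API as usual.
with h5py.File("direct.h5", "r") as f:
    print(int(f["data"][0, 3, 3]))   # -> 15
```

For a benchmark you would loop this over many frames and time it against a normal `dset[i] = frame` slice assignment.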
Cheers,
Ulrik
-----Original Message-----
From: Mark S. Engbretson [mailto:[email protected]]
Sent: 28 June 2017 23:05
To: Pedersen, Ulrik (DLSLtd,RAL,TEC)
Cc: [email protected]; [email protected]
Subject: RE: Area Detector and high performance NVME devices
The impressive hardware is probably the Euresys Coaxlink frame grabber with
a 2.5 GB/s readout rate and the current NVMe disk technology, which claims
write benchmarks of 8 GB/s (in a RAID-0 configuration). But I have heard
people talking about cameras with 17 GB/s acquire rates, and even some in
the 100s that are/will be available Real Soon Now.
Yes, everyone wants HDF5-formatted data, but as you pointed out, the single
threaded HDF5 pipeline appears to choke before it hits write limits on modern
hardware. Raw binary file writing on this hardware is actually easy, with
similar caveats as Lustre has (i.e., writes must be a multiple of the NVMe
sector size, and buffers might need to be memory-page aligned). But when the
hardware runs at twice the write speed of the camera, lots of issues go
away.
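The sector-size caveat above can be sketched with the standard library alone. The 4 KiB sector size is an assumption for the example (real code would query the device, and an O_DIRECT-style raw write would additionally need page-aligned buffers, which this sketch omits):

```python
import os

# Sketch of the raw-write caveat: pad every write out to a multiple of
# the device sector size. SECTOR is an assumed 4 KiB NVMe sector.
SECTOR = 4096

def padded_size(nbytes, sector=SECTOR):
    """Round a write length up to the next sector boundary."""
    return -(-nbytes // sector) * sector      # ceiling division

def write_padded(fd, payload, sector=SECTOR):
    """Zero-pad payload to a sector multiple and write it out."""
    pad = padded_size(len(payload), sector) - len(payload)
    return os.write(fd, payload + b"\0" * pad)

fd = os.open("frames.raw", os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
print(write_padded(fd, b"x" * 5000))   # 5000 bytes -> 8192 on disk
os.close(fd)
```

A real raw-file writer would of course track the payload length separately so the padding can be stripped on read-back.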
I know that I could probably create multiple HDF file plugins - just no idea
of what would actually happen if they all look at the same image buffer
stream. Would they lock so that each one would get a unique NDArray, or
might the same image appear in multiple output files?
In theory, under Cygwin or Linux, one can build the Open-MPI Libs that PHDF5
requires, but I would suspect that the HDF file Plugin would still not
automagically become multithreaded and vastly faster running the same code.
"Direct Chunk". Hmmm, don't supposed that This is already in the HDF5 libs
and one could tweak some code to make use out of it, would be interested to
see a spped test on the nvme hardware.
Me
-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Wednesday, June 28, 2017 3:40 PM
To: [email protected]
Cc: [email protected]; [email protected]
Subject: Re: Area Detector and high performance NVME devices
Hi Marks,
So this is quite an impressive piece of detector! We have some fast
detectors here at Diamond but we have not used the areaDetector HDF5 file
writer beyond what can fit through a 10gbps Ethernet pipe. Writing faster
than that presents a number of challenges as you have noticed.
Writing 'raw' binary files will probably always be the most performant
option (if you do it right and tune the I/O pattern for the file system).
However, you lose a lot of goodness by not having a container like HDF5
around it.
There are a lot of tuning parameters available in the HDF5 library that I
assume you have played around with: chunking and flushing parameters,
boundary alignment, etc. The first thing you have to figure out about your
file system is what size of 'chunks' (i.e. individual IO writes) it likes in
order to perform best - and does it require write operations to start on
specified boundaries? Our Lustre and GPFS file systems like 1MB and 4MB
boundaries for example.
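The boundary-alignment tuning mentioned above maps to HDF5's `H5Pset_alignment`, reachable from h5py's low-level API. A sketch (the 1 MiB threshold and 4 MiB alignment are example values; tune them to what your file system likes):

```python
import h5py
import numpy as np

# Sketch: ask HDF5 to start any file allocation >= 1 MiB on a 4 MiB
# boundary, via a file-access property list.
MIB = 1024 * 1024
fapl = h5py.h5p.create(h5py.h5p.FILE_ACCESS)
fapl.set_alignment(1 * MIB, 4 * MIB)         # threshold, alignment

fid = h5py.h5f.create(b"aligned.h5", flags=h5py.h5f.ACC_TRUNC, fapl=fapl)
with h5py.File(fid) as f:
    # A 2 MiB contiguous dataset: large enough to trigger the alignment,
    # so its raw data starts on a 4 MiB boundary in the file.
    dset = f.create_dataset("data", data=np.zeros((512, 2048), np.uint16))
    print(dset.id.get_offset() % (4 * MIB))  # -> 0
```

Recent h5py versions (3.5+) also accept `alignment_threshold` and `alignment_interval` keyword arguments directly on `h5py.File`, avoiding the low-level property list.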
The HDF5 library operates a pipeline when doing file I/O: all read/write
operations pass the data through this pipeline by default in order to do
things like compression or datatype conversion. However, even when you're
not using these features (like when you're "just" streaming a lot of uint16
pixels to an image dataset), the data passes through the pipeline, and that
has a certain performance overhead - i.e. a CPU or perhaps even a memory I/O
bottleneck.
Fortunately there is a way to circumvent this internal HDF5 pipeline using a
feature that was developed for Dectris because they wanted to be able to
write out HDF5 datasets consisting of pre-compressed images. The
single-threaded HDF5 pipeline was too slow for their compression
requirements. The Direct Chunk Write [1] feature can be used with or without
compressed datasets and because it circumvents the pipeline it basically
just does a simple write under the hood.
From my tests (a good while ago now) the Direct Chunk Write from multiple
processes performed much better than the parallel HDF5 (which btw is not
supported in the areaDetector HDF5 file writer).
The Direct Chunk Write functionality should be added to the areaDetector
HDF5 file writer. I will raise a ticket on github to discuss how best to do
that. That should enable nearly the same performance as your 'raw binary'
write.
If that turns out to not be enough, the next step would be to parallelise
the problem by splitting the stream into multiple file writer processes.
This is also something we are working on at Diamond for our parallel
high-performance detector systems.
Cheers,
Ulrik
[1]: Direct Chunk Write
https://support.hdfgroup.org/HDF5/doc/Advanced/DirectChunkWrite/
> On 27 Jun 2017, at 21:07, Mark S. Engbretson <[email protected]> wrote:
>
> Pete suggested something like that: have the HDF file hold a pointer
> to each raw image file. But I cannot sustain the write rate with
> single files.
> That is why I was originally trying to tweak this file writer into
> something that would acquire all the data but would then generate
> reasonable output after the fact.
>
> 2BM seems to be willing to have a starting point of perhaps one
> 15-minute acquire per hour, with the other 45 minutes spent either
> processing the data or getting it off the computer to prepare for
> another. That may just be enough if the raw data is buffered and the
> HDF write goes out at whatever rate it can keep up at.
>
> I hadn't thought of this raw plugin abstracting a much smaller image
> that could be a placeholder in an HDF file, mostly since someone would
> still have to post-process both of these files after the fact. No file
> plugins have implemented the read functions yet.
>
> -----Original Message-----
> From: Mark Rivers [mailto:[email protected]]
> Sent: Tuesday, June 27, 2017 2:48 PM
> To: Mark S. Engbretson <[email protected]>; [email protected]
> Subject: RE: Area Detector and high performance NVME devices
>
> Hi Mark,
>
> I just tried with simDetector on a Windows 7 system with 8 cores, 15K
> RPM SAS Raid-0 disk, 96 GB RAM.
>
> 4096 x 3078 Int8 images = 12 MB/image.
>
> The simDetector is generating about 150 frames/s = 1.8 GB/s.
>
> This is the output of camonitor on the ArrayRate_RBV and WriteFile_RBV
> PVs in the HDF5 plugin:
>
> corvette:simDetectorIOC/iocBoot/iocSimDetector>camonitor -tc
> 13SIM1:HDF1:ArrayRate_RBV 13SIM1:HDF1:WriteFile_RBV
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:21.809029) 0
> 13SIM1:HDF1:WriteFile_RBV (2017-06-27 14:15:21.809162) Done
> 13SIM1:HDF1:WriteFile_RBV (2017-06-27 14:15:31.451841) Writing STATE
> MINOR 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:33.410624) 34
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:34.411782) 54
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:35.410801) 55
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:37.408966) 52
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:38.408085) 54
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:39.407156) 48
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:40.408254) 53
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:41.409449) 55
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:42.410515) 45
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:43.411498) 50
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:44.411601) 47
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:45.410666) 53
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:46.409790) 50
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:47.408919) 54
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:48.407977) 53
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:49.407138) 52
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:50.408088) 60
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:51.409265) 62
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:52.410457) 52
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:53.411547) 59
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:54.411524) 62
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:55.411724) 55
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:56.410814) 60
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:57.409946) 57
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:58.408960) 59
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:15:59.407069) 60
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:00.408095) 42
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:01.409358) 17
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:02.410576) 27
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:03.411677) 25
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:04.413487) 27
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:05.411605) 25
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:06.411612) 24
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:07.409803) 23
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:11.410220) 19
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:12.411320) 18
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:13.412261) 21
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:15.412525) 20
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:16.411553) 22
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:17.410808) 24
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:20.409031) 28
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:21.411129) 25
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:22.411199) 24
> 13SIM1:HDF1:WriteFile_RBV (2017-06-27 14:16:22.818068) Done
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:23.412388) 1
> 13SIM1:HDF1:ArrayRate_RBV (2017-06-27 14:16:24.413361) 0
>
> So for the first 25 seconds or so it is writing at about 55 frames/s =
> 660 MB/s. This is probably filling the Windows file cache. It then
> slows down to about 23 frames/s = 280 MB/s, which is probably the
> steady state write speed of the disks.
>
> So I agree that it will probably be difficult to write HDF5 files at
> the full rate of your camera, which is 190 frames/s = 2.3 GB/s.
>
> One possible solution would be to write your RAW files that can keep
> up, and also write HDF5 files of "thumbnail" data that is either
> cropped or binned to 512x384 for example. You can then store all the
> metadata in the HDF file and the images in the RAW file. Later on you
> can either merge these 2 files into a large HDF5 file, or just keep
> them separate.
>
> Mark
>
> ________________________________
> From: Mark S. Engbretson [[email protected]]
> Sent: Tuesday, June 27, 2017 1:51 PM
> To: Mark Rivers; [email protected]
> Subject: RE: Area Detector and high performance NVME devices
>
> I do not have enough memory to create a queue large enough to buffer
> all the images. 2BM wants to acquire the camera stream for at least
> 15 minutes, and ideally as long as possible, so we are talking about
> 2-4 TB, or larger if they buy huge NVMe drives or multiple Turbo Z
> units; the computer supports having 3 of them.
>
> I have allocated very large buffers, but for a 4096 by 3078 image
> being created at 190 FPS, the file plugins would have to be able to
> keep up . . . and they don't.
>
> Using the simDetector, I can create such images at ~260 FPS. HDF5 is
> only writing the file out at about 60 FPS, using the default settings.
> netCDF writes about 30 FPS. A buffer of 4000 in both cases only lasted
> for about a minute. The raw file plugin slows the simDetector acquire
> rate to about 170-180 FPS, which the plugin can keep up with.
> The standard arrays plugin by itself also slows simDetector to about
> the same rate.
>
>
> From: Mark Rivers [mailto:[email protected]]
> Sent: Tuesday, June 27, 2017 1:06 PM
> To: 'Mark S. Engbretson' <[email protected]>; [email protected]
> Subject: RE: Area Detector and high performance NVME devices
>
> Hi Mark,
>
> I am surprised that a raw file plugin is significantly faster than
> netCDF or HDF5. I would like to see the tests, and figure out what is
> actually slowing them down, i.e. is it CPU bound, waiting for a
> semaphore, etc.? Can you post actual benchmark results for the
> different plugins, i.e. frames/s and MB/s?
>
> You should not need to do anything special to create a FIFO to buffer
> images while the disk is busy. Every areaDetector plugin comes with
> such a FIFO, i.e. its input queue. Just increase the QueueSize to be
> large enough to buffer all the images you need to store in one
> "burst". You can also use the CircularBuffer plugin to do this, but it
> should really not be necessary, that is intended more for "triggered"
> applications where the buffer is emptied when a trigger condition is
> satisfied.
>
> Mark
>
>
> From: Mark S. Engbretson [mailto:[email protected]]
> Sent: Tuesday, June 27, 2017 12:24 PM
> To: Mark Rivers; [email protected]<mailto:[email protected]>
> Subject: Area Detector and high performance NVME devices
>
> Mark -
>
> I have the Adimec camera, which generates data at ~2.5 GB/s. I
> recently got my hands on a newer HP 840 with an HP Turbo Z NVMe drive
> which claims a sustained write speed of 6 GB/s. None of the existing
> file plugins sees any performance increase when writing to this device
> - I do not think that any are actually write limited. I have modified
> a raw binary file plugin that I obtained from Keenan Lang that easily
> sustains the camera's write rate until the device is full.
>
> Problem is - Raw data really doesn't do anyone much good. I was
> thinking that perhaps a quick solution to my problem might be to
> change this Raw File plugin to look/act like a disk based fifo or
> circular buffer. This could collect to the limit of the hardware at
> full speed, and if someone wanted HDF output, they would just drain
> this queue at the speed that HDF5 files are generated. Or is there an
> easier/better solution? I.e., is there any way that file plugins can
> use the new multi-thread model of AD 3.0?
>
> I know that HDF5 files can be generated at very high speeds on Lustre
> file systems, but this seems to be done using parallel HDF5. Is this
> something that areaDetector supports?
>
----------------------------------------------
Ulrik Kofoed Pedersen
Head of Beamline Controls
Diamond Light Source Ltd
Phone: +44 1235 77 8580