Subject: Re: archive server discussion
From: Patrick Thomas <[email protected]>
Cc: "[email protected] Techtalk" <[email protected]>
Date: Wed, 04 Aug 2010 04:53:27 -0700
Hi Matt,

Would it be possible to extend this, or something similar, to monitor and record changes for tens or hundreds of thousands of PVs in a MySQL database?

-Patrick

Matt Newville wrote:
Hi James,

You didn't list my EpicsArchiver -- perhaps it's not very well known,
but it is used at a few beamlines at the APS.  You can see it in
action at
   http://millenia.cars.aps.anl.gov/cgi-bin/pvarch/
and there is some documentation and detail at
   http://millenia.cars.aps.anl.gov/cgi-bin/pvarch/help.py
I started this many years ago, in the early days of ChannelArchiver I
believe, mostly because I wanted data stored in a relational database.

I don't know much about the other archivers you listed, but I'll give
some details on how the EpicsArchiver works, and what improvements
might be made.   Perhaps the large number of similar implementations
shows how easy it is to do....

I am not an RDBMS expert at all, and I'm not sure I can comment much on
using "noSQL" databases.  I do believe that the data is sort of
relational, though the relation is almost always based on time. And,
as you say, the data is being archived, which requires writing the
latest value very quickly, and allowing for complex and relatively
slow lookups.

The implementation for EpicsArchiver uses MySQL on a Linux system with
SCSI disks (non-RAID), though I don't think it would be impossible to
switch to PostgreSQL or Oracle.

EPICS Process Variables are monitored and the "latest value" for each
variable is stored in a cache table.  This holds "PV Name, Value,
Timestamp" and not much more.  The table is small, and constantly
updated.  I'm currently using single-threaded CA, and a single
process too, but have little difficulty keeping up with ~5000
variables, as most PVs of interest to us move slowly, and only a few
hundred really dominate the access to this cache.  The system I run
has quad-core CPUs, and one of them is essentially dedicated to
running MySQL at 90+% usage.  Again, a multi-threaded or multi-process
setup would not be a difficult upgrade, and could definitely improve
performance for a larger set of variables.
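
(A minimal sketch of that idea, assuming pyepics for the CA monitors and
MySQLdb for the cache table; the table and column names here are invented,
this is not the actual EpicsArchiver code:)

    import time
    import epics       # pyepics, for Channel Access monitors
    import MySQLdb     # assumed MySQL binding; any DB-API driver would do

    conn = MySQLdb.connect(host='localhost', user='archiver',
                           passwd='secret', db='pvcache')
    conn.autocommit(True)
    cur = conn.cursor()

    def on_change(pvname=None, value=None, timestamp=None, **kws):
        """CA monitor callback: overwrite the 'latest value' row for this PV."""
        # a real version would buffer these and write from the main loop
        cur.execute("UPDATE cache SET value=%s, ts=%s WHERE pvname=%s",
                    (str(value), timestamp, pvname))

    pvnames = open('pvlist.txt').read().split()   # hypothetical list of PV names
    pvs = [epics.PV(name, callback=on_change) for name in pvnames]

    while True:        # the monitor callbacks do all the work
        time.sleep(0.1)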

With one process (or one set of threads/processes) keeping a cache of
"current values", the actual archiving is fairly easy, and is a
non-EPICS task: it needs only to copy data from one table to archive
tables.  I think this is a key feature: it separates fetching a PV's
values from archiving them.  In addition, a "read from the cache" (say,
for a status web page) also does not need to be a CA client, and so
can be very fast, as reads from the local disk are much faster than
connecting to and reading with CA for stateless web apps.
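
(Again only as a sketch, with invented table names: the archiving side is
just a periodic sweep over the cache table, with no CA involved at all:)

    import time
    import MySQLdb

    cache = MySQLdb.connect(db='pvcache', user='archiver', passwd='secret')
    arch = MySQLdb.connect(db='pvarch_current', user='archiver', passwd='secret')
    ccur, acur = cache.cursor(), arch.cursor()

    last_sweep = 0.0
    while True:
        # pick up only the cache rows that changed since the previous sweep
        ccur.execute("SELECT pvname, value, ts FROM cache WHERE ts > %s",
                     (last_sweep,))
        rows = ccur.fetchall()
        last_sweep = time.time()
        for pvname, value, ts in rows:
            # per-PV deadband / maximum-rate checks would go here
            acur.execute("INSERT INTO pv_data (pvname, ts, value) "
                         "VALUES (%s, %s, %s)", (pvname, ts, value))
        arch.commit()
        time.sleep(10)     # archiving sweep period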

For archiving, EpicsArchiver keeps a database with one table of PV
information, and 128 different data tables of PV hashkey, Timestamp,
Value.   That hashing distributes the data, and reduces data lookup
time.  The archiving process just fetches the latest values from the
cache database and decides what to put in the archive database.

I rotate these databases out every month (through a cron job), and
keep one "Master Database" that holds the time-ranges for each of the
databases.  That way, neither lookups of last week's data nor of older
historical data are painfully slow.  The lookup just has to figure out
which databases to look in for a time range, then which table to look
in for a PV's data, and then extract data by PV hashkey and timestamp.
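
(Roughly, and with guessed-at names, the two lookups involved look like this:)

    import zlib

    NTABLES = 128

    def data_table(pvname):
        """Pick one of the 128 archive data tables for a PV (any stable hash works)."""
        return 'pvdat%3.3i' % (zlib.crc32(pvname.encode()) % NTABLES)

    def databases_for(mcur, t0, t1):
        """Ask the master database which monthly archives cover the time range."""
        mcur.execute("SELECT db_name FROM runs "
                     "WHERE stop_time >= %s AND start_time <= %s", (t0, t1))
        return [row[0] for row in mcur.fetchall()]

    def get_data(conn, mcur, pvname, t0, t1):
        """Collect (timestamp, value) pairs for one PV over [t0, t1]."""
        out = []
        for dbname in databases_for(mcur, t0, t1):
            conn.select_db(dbname)      # switch to that month's archive database
            cur = conn.cursor()
            cur.execute("SELECT ts, value FROM %s WHERE pvname=%%s "
                        "AND ts BETWEEN %%s AND %%s" % data_table(pvname),
                        (pvname, t0, t1))
            out.extend(cur.fetchall())
        return sorted(out)
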
The EpicsArchiver also includes "Alerts" to send email messages on
user-defined changes in PV status, etc.  It also includes a simple
template system to change what is shown on the status web pages.  It's
all fairly simple database stuff, and very low maintenance.
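
(The alert check is then just another reader of the cache; something like the
following, with made-up names and a made-up address:)

    import smtplib
    from email.mime.text import MIMEText

    def check_alert(ccur, pvname, ok, address):
        """Test the cached value against a user-defined condition; mail on failure."""
        ccur.execute("SELECT value FROM cache WHERE pvname=%s", (pvname,))
        value = float(ccur.fetchone()[0])
        if not ok(value):
            msg = MIMEText("Alert: %s = %g" % (pvname, value))
            msg['Subject'] = "EpicsArchiver alert: %s" % pvname
            msg['To'] = address
            smtplib.SMTP('localhost').sendmail('archiver@localhost',
                                               [address], msg.as_string())

    # e.g. warn when a (hypothetical) vacuum gauge PV reads above 1e-6:
    #   check_alert(ccur, 'XX:VAC:ig1', lambda v: v < 1.e-6, 'someone@example.com')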

Cheers,

--Matt Newville <newville at cars.uchicago.edu> 630-252-0431

On Mon, Aug 2, 2010 at 9:59 AM, <[email protected]> wrote:
Hi

At Diamond we are upgrading our archiver hardware; before the EPICS meeting
I'd like to open the discussion on archive server software. The obvious
question is: Is there a standard EPICS archiver server suitable for
everyone? And if not, why not?

There are a few servers in the community (apologies to their authors if I
get the details wrong):

Channel Archiver v1
index?
data in file system, files contain per-channel blocks of time-sorted data up
to 8k samples long

Channel Archiver v2
Hashtable + RTree index
data as above

CZAR
Index in RDB
data storage ?
compression using arithmetic encoding

MYA
RDB replacement for CZAR

BEAUTY
Partitioned RDB for index and data

Hyperarchiver
RDB for configuration
Hypertable for index and data (zlib-style compression)

I only have experience with Channel Archiver v2. Looking at our installation,
which archives 3 GB a day, a great deal of time is spent in random write
operations to the individual data blocks. We buffer for 30 seconds, then
write approximately 30 samples to each of 30,000 channels. Because of the
way the data is stored on disk (channel1[8192], channel2[2048],
channel3[1024] etc.) that's a lot of random writes. The RAID array copes
well but the large write-op load reduces read performance to the point that
backups and large data-mining jobs are very slow.
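
(Back of the envelope, assuming roughly one block write per channel per flush:
30,000 channels every 30-second buffer period is about 1,000 random writes per
second sustained, before any read traffic at all.)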

Our typical archiver access pattern is append-only sample writes, plus
reads of channel data over a contiguous time range for up to 100 channels.
Assuming magnetic disks will store our 10TB+ datasets for the next few
years, an ideal database would reduce the number of slow random IO
operations and store data on disk reflecting the typical access pattern. A
suitable sort key might be (channel, timestamp) or (month, channel,
timestamp) to partition the data by date. This is the way that Hypertable
and some real-time relational databases get their single-node performance:
an in-memory table for recently arrived data, a sequential logfile for
durability, periodic sorting and flushing of the in-memory table to disk
in a contiguous write, and periodic compaction into larger sorted tables.
The commercial real-time relational databases KDB and Vertica use this
architecture; if any Oracle or MySQL expert knows whether something similar
can be done with them I'd be very interested. There are alternative storage
back-ends for MySQL for data warehousing but they only support efficient
writes in bulk.
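
(For what it's worth, one hedged way this could look in MySQL: InnoDB clusters
rows on the primary key, so declaring the sort key as the primary key stores
each channel's samples contiguously in (channel, time) order, and RANGE
partitioning can give the by-month split. A rough sketch only, with an
illustrative schema:)

    import MySQLdb

    conn = MySQLdb.connect(db='archive', user='archiver', passwd='secret')
    cur = conn.cursor()
    # InnoDB stores rows in primary-key order, so this clusters each channel's
    # samples together on disk; monthly partitions give the date split.
    cur.execute("""
        CREATE TABLE sample (
            channel_id  INT UNSIGNED NOT NULL,
            smpl_time   TIMESTAMP    NOT NULL,
            nanosecs    INT UNSIGNED NOT NULL DEFAULT 0,
            float_val   DOUBLE,
            PRIMARY KEY (channel_id, smpl_time, nanosecs)
        ) ENGINE=InnoDB
        PARTITION BY RANGE (UNIX_TIMESTAMP(smpl_time)) (
            PARTITION p2010_08 VALUES LESS THAN (UNIX_TIMESTAMP('2010-09-01')),
            PARTITION p2010_09 VALUES LESS THAN (UNIX_TIMESTAMP('2010-10-01')),
            PARTITION pmax     VALUES LESS THAN MAXVALUE
        )""")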

At Diamond we plan to run the v2 archiver with 64 bit index files to allow
us to merge our indices into one file, and benchmark for performance on some
updated hardware. I also need to merge the 10TB of data files from a primary
and standby archive server used for high availability, either by combining
the data files or adding a merging reader to the storage library. I would
like to know what other facilities are using and if there are any other
current archiver developments.

James



