EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: [Bug 1830957] [NEW] pcas deadlocks in casEventSys
From: Till Straumann via Core-talk <[email protected]>
To: [email protected]
Date: Wed, 29 May 2019 17:42:00 -0000
Public bug reported:

We observe a deadlock situation in the pcas server:

Indentation represents the call stack; 1) and 2) are separate threads.

1) Application calls casPV::postEvent(); 
     casPVI::postEvent() takes casPVI::.mutex
        ...
          casEventSys::postEvent() takes casEventSys::.mutex


2) server thread runs fileDescriptorManager.process(..)
     ...
       casEventSys::process() takes casEventSys::.mutex
          ...
             casAsyncWriteIOI::cbFuncAsyncIO()
                this->chan.uninstallIO()
                    ..
                        casPVI::uninstallIO() takes casPVI::.mutex


Thus, we have the classical case of two threads trying to acquire two locks in opposite order.
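
The inversion can be reproduced in isolation. The following is only a sketch using standard C++ primitives, not pcas code: pvMutex and evSysMutex are hypothetical stand-ins for casPVI::.mutex and casEventSys::.mutex, and try_lock() is used instead of lock() so that the demonstration reports the conflict rather than hanging.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>

// Simple spin barrier: returns once `expected` threads have arrived.
static void arriveAndWait(std::atomic<int>& arrived, int expected) {
    arrived.fetch_add(1);
    while (arrived.load() < expected) {
        std::this_thread::yield();
    }
}

// Returns true if neither thread could obtain its second lock, i.e. both
// would have blocked forever had they called lock() instead of try_lock().
bool bothThreadsBlocked() {
    std::mutex pvMutex;     // hypothetical stand-in for casPVI::.mutex
    std::mutex evSysMutex;  // hypothetical stand-in for casEventSys::.mutex
    std::atomic<int> holdingFirst{0};
    std::atomic<int> triedSecond{0};
    bool posterGotSecond = true;
    bool serverGotSecond = true;

    auto worker = [&](std::mutex& first, std::mutex& second, bool& gotSecond) {
        std::lock_guard<std::mutex> guard(first);
        arriveAndWait(holdingFirst, 2);  // both threads now hold their first lock
        gotSecond = second.try_lock();   // must fail: the other thread owns it
        if (gotSecond) {
            second.unlock();
        }
        arriveAndWait(triedSecond, 2);   // keep `first` held until both have tried
    };

    // Thread 1 (postEvent path): PV mutex first, then event system mutex.
    std::thread poster(worker, std::ref(pvMutex), std::ref(evSysMutex),
                       std::ref(posterGotSecond));
    // Thread 2 (server path): event system mutex first, then PV mutex.
    std::thread server(worker, std::ref(evSysMutex), std::ref(pvMutex),
                       std::ref(serverGotSecond));
    poster.join();
    server.join();
    return !posterGotSecond && !serverGotSecond;
}
```

The barriers only make the bad interleaving deterministic for the demonstration; in the real server the same interleaving happens by chance, which is why the deadlock is intermittent.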

Note that this bug has already been experienced and discussed on tech-talk
(though I could not find a Launchpad bug report for it):

  https://epics.anl.gov/tech-talk/2016/msg01930.php
  https://github.com/paulscherrerinstitute/pcaspy/issues/29

and a "solution" to the particular race condition reported then has been put in place.
This "solution" is, IMHO, but a mere hack which works around one particular scenario.

(Another potential race condition is casPVI::updateEnumStringTableAsyncCompletion()
when called from casAsyncReadIOI::cbFuncAsyncIO(), and there may be more.)

The deeper problem is -- again IMHO -- a design flaw in the event processing loop which
holds on to the casEventSys::.mutex while working on the callbacks.

It is not unreasonable (and quite common in other event processing systems I have seen)
for an application to post to an asynchronous facility from a guarded code section
and for callbacks to be synchronized using the same (application) lock:

{ guard( myLock );
  POST_TO_ASYNC_FACILITY( somewhere, myCallback );
  other_guarded_business();
}

and

myCallback()
{ guard( myLock );
  do_something();
}

This pattern is not possible with pcas.

-> I believe the casEventSys::process() loop should be reviewed:
    - release casEventSys::.mutex while working on the callback
    - remove the epicsGuard< evSysMutex > & argument from casEvent::cbFunc()
      (this is super ugly anyway; a callback should not have to know about
      the locking semantics of the event loop)
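
To illustrate the first point, here is a minimal sketch of an event loop that drops its own lock around each callback. EventQueue, post(), process() and runExample() are hypothetical names built on the C++ standard library; this is not the casEventSys implementation, and it ignores the object lifetime questions a real change to pcas would have to address.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical miniature event system, not the pcas casEventSys class.
class EventQueue {
public:
    using Callback = std::function<void()>;

    // Queue a callback; only the queue itself is protected by mutex_.
    void post(Callback cb) {
        std::lock_guard<std::mutex> guard(mutex_);
        queue_.push_back(std::move(cb));
    }

    // Run all queued callbacks and return how many were executed.
    int process() {
        int executed = 0;
        std::unique_lock<std::mutex> guard(mutex_);
        while (!queue_.empty()) {
            Callback cb = std::move(queue_.front());
            queue_.pop_front();
            guard.unlock();  // the callback runs without the queue lock held
            cb();
            ++executed;
            guard.lock();    // re-acquire before touching the queue again
        }
        return executed;
    }

private:
    std::mutex mutex_;
    std::deque<Callback> queue_;
};

// The application pattern from the report: post while holding myLock, and
// take the same myLock inside the callback. Returns the callback count.
int runExample() {
    EventQueue events;
    std::mutex myLock;
    int done = 0;
    {
        std::lock_guard<std::mutex> guard(myLock);
        events.post([&] {
            std::lock_guard<std::mutex> inner(myLock);
            ++done;
            // Posting from inside a callback works because process()
            // released the queue lock before calling us.
            events.post([&] {
                std::lock_guard<std::mutex> innermost(myLock);
                ++done;
            });
        });
    }
    events.process();
    return done;
}
```

Because no callback ever runs with the queue lock held, the application lock and the queue lock are never nested in the order "queue lock, then application lock", so the opposite-order acquisition described above cannot arise in this sketch.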

** Affects: epics-base
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of EPICS
Core Developers, which is subscribed to EPICS Base.
Matching subscriptions: epics-core-list-subscription
https://bugs.launchpad.net/bugs/1830957

Title:
  pcas deadlocks in casEventSys

Status in EPICS Base:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/epics-base/+bug/1830957/+subscriptions

ANJ, 31 May 2019