Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: RE: Possible deadlock issue with asyn layer?
From: Mark Rivers <rivers@cars.uchicago.edu>
To: Mark Davis <davism50@msu.edu>, "tech-talk@aps.anl.gov" <tech-talk@aps.anl.gov>
Date: Mon, 16 Apr 2018 17:10:42 +0000
Hi Mark,

What you suggest is certainly possible.  The READBACK tag is not heavily used, and there could be a race condition.  I suggest you remove the READBACK tag since you don't need it, and see if the problem goes away.

There is an asyn test application, testOutputCallbackApp, that may be able to reproduce your problem.  It has a thread that does callbacks both to input records and to output records with the READBACK tag.  If your diagnosis is correct, it seems that application should deadlock if you try to write to one of those output records while the callbacks are happening.  Can you see if you can reproduce the problem with that test application?

Mark



-----Original Message-----
From: Mark Davis [mailto:davism50@msu.edu] 
Sent: Monday, April 16, 2018 10:33 AM
To: tech-talk@aps.anl.gov; Mark Rivers <rivers@cars.uchicago.edu>
Subject: Possible deadlock issue with asyn layer?

We have been experiencing a deadlock problem with one of our custom drivers based on the asynPortDriver class.  After reviewing the code and the information gained via the gdb debugger, I have a theory regarding the cause that I would like others to review and tell me if there are any flaws they can spot (either in my logic or my assumptions).


First, a summary of the driver that is exhibiting the problem:

    - Non-blocking driver based on asynPortDriver class
    - One device per driver instance
    - Communication with device handled by a thread launched by each instance of the driver
    - All asyn calls by driver thread surrounded by lock()/unlock() calls
    - info(asyn:READBACK, "1") tag used in some of the output records
    - writeXxx() calls don't do any I/O: they save the new value to be processed later (and sometimes call the setXxxParam(), setParamStatus(), and callParamCallbacks() functions)

    The driver thread runs periodically to update readings and process new settings as follows:
       - reads values from the controller
       - calls lock()
       - calls setXxxParam() and setParamStatus() as needed
       - calls callParamCallbacks() to update the records affected by the changes (acquiring the scan lock for each record while it is processed)
       - calls unlock()
       - sends new settings to the controller (using lock()/unlock() when calling any asyn functions or touching driver values that are shared with the writeXxx() functions)

    Concurrently with the driver thread's usual processing, writes to output records occur.
    The processing of each write (I assume) includes the following sequence of events:
       - The thread that did the write (a CA client thread, in this case) gets the scan lock for the record and begins processing the record
       - Processing of the record includes calls to asyn layer functions, which require acquiring the asyn lock for the driver instance before calling one of its writeXxx() functions
       - The asyn lock for the driver is released
       - The scan lock for the record is released

What is happening:

    The driver thread sometimes blocks indefinitely waiting to acquire the scan lock for an output record it needs to update


My theory as to what can cause this:

    What we know:
       - The driver thread blocks indefinitely waiting for the scan lock for an output record that includes the info(asyn:READBACK, "1") tag (see the extract from gdb output below and the template for the record)
       - The lockup of the driver thread is always preceded by a CA client write to an output record for one of the driver's parameters (we haven't yet confirmed it is the same record that the driver is waiting for when it hangs)

    What I believe is happening:
       - Sometime between when the driver thread calls lock() and when it blocks waiting for the scan lock for an output record, a CA client thread writes to the same record.  The client thread successfully acquires the scan lock for the record, but then blocks waiting to acquire the asyn lock for the driver instance (which the driver thread already holds).


Obviously, if I am correct, the easiest way to avoid the problem is to eliminate the use of the info(asyn:READBACK, "1") tag (at least for any case where this could happen).  We don't actually need those tags for this database anymore, so that is something we will be trying shortly.

But can anyone point out a mistake in my reasoning or my assumptions?  Is this something we need to be aware of when using the info tag?

Mark Davis
NSCL/FRIB Control Systems Software Engineer
davism50@msu.edu

========================== from gdb ===================================

#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f294a554494 in _L_lock_952 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f294a5542f6 in __GI___pthread_mutex_lock (mutex=0x18db030) at ../nptl/pthread_mutex_lock.c:114
#3  0x00007f294adc17a3 in mutexLock (id=0x18db030) at ../../../src/libCom/osi/os/posix/osdMutex.c:46
#4  0x00007f294adc197f in epicsMutexOsdLock (pmutex=0x18db030) at ../../../src/libCom/osi/os/posix/osdMutex.c:130
#5  0x00007f294adb85fb in epicsMutexLock (pmutexNode=0x18db070) at ../../../src/libCom/osi/epicsMutex.cpp:143
#6  0x00007f294b019733 in dbScanLock (precord=0xe9d7d0) at ../../../src/ioc/db/dbLock.c:265
#7  0x00007f294bf8e18d in interruptCallbackOutput (drvPvt=0x1d08640, pasynUser=0x1d08f98, value=0) at ../../asyn/devEpics/devAsynUInt32Digital.c:500
#8  0x00007f294bf74540 in paramList::uint32Callback (this=0xb3cd50, command=358, addr=0, interruptMask=4294967295) at ../../asyn/asynPortDriver/asynPortDriver.cpp:628
#9  0x00007f294bf74adc in paramList::callCallbacks (this=0xb3cd50, addr=0) at ../../asyn/asynPortDriver/asynPortDriver.cpp:750
#10 0x00007f294bf76417 in asynPortDriver::callParamCallbacks (this=0xb3ce60, list=0, addr=0) at ../../asyn/asynPortDriver/asynPortDriver.cpp:1510
#11 0x00007f294bf763a7 in asynPortDriver::callParamCallbacks (this=0xb3ce60) at ../../asyn/asynPortDriver/asynPortDriver.cpp:1496
...
_________________________________________________________________________________
Jump to any frame in the callstack to examine arguments (frame 6, here):

(gdb) f 6
#6  0x00007f294b019733 in dbScanLock (precord=0xe9d7d0) at ../../../src/ioc/db/dbLock.c:265
265     ../../../src/ioc/db/dbLock.c: No such file or directory.
_________________________________________________________________________________
(gdb) p precord
$1 = (dbCommon *) 0xe9d7d0
(gdb) p *(dbCommon *)precord
$2 = {name = "MPS_FPS:MSTR_N0001:Slave11_MaskBit42-Sel", '\000' <repeats 20 times>,
   desc = '\000' <repeats 40 times>, asg = '\000' <repeats 28 times>, scan = 0, pini = 0, phas = 0, <snipped>

_________________________________________________________________________________

record(bo, "$(SYS)_$(SUB):$(DEV)_$(INST):Slave$(SLAVE)_MaskBit$(BIT)-Sel")
{
   field(DTYP, "asynUInt32Digital")
   field(OUT,  "@asynMask($(ASYN_PORT),0,$(ASYN_MASK),$(TIMEOUT)) slave$(SLAVE)Mask$(MASK_REG)_Set")
   field(ZNAM,"Mask")
   field(ONAM,"Unmask")
   info(asyn:READBACK, "1")
   info(autosaveFields, "VAL")
}

