NOTE: Mark Rivers seems to have just about confirmed my assumptions,
but since I wrote all this, I am sending it anyways in case it is of use
to someone else in the future who stumbles across this discussion and is
looking for a bit more detail on the issue that was discussed (and why
it can happen).
Mark Davis
---------------------------------------------------------------------------
Hi Matt,
The ASYN_CANBLOCK flag is definitely NOT set (on purpose).
As I understand it, the purpose of that flag is to tell the asyn layer
that calls to driver functions from the asyn layer (e.g. the writeXxx()
functions) might block. In general, this is used when the writeXxx()
functions attempt to communicate with one or more devices directly,
using some medium/interface that may cause the calling thread to block
waiting for some I/O operations to complete.
When set, this causes the asyn layer to create a queue for write
operations and a thread to service that queue (i.e. remove a value from
the queue and call the appropriate writeXxx() function in the driver),
so if one of the driver's writeXxx() functions does block, the only
thread that is affected is the one servicing the queue for that driver.
Our driver's writeXxx() functions never do any I/O so they never block:
They simply update the value to be processed by the thread spawned by
the driver's constructor. And as with most (all?) calls to the driver
functions, our driver's writeXxx() functions will never be called until
the asyn layer has acquired the asyn lock for the driver instance (so as
long as our driver always obtains the same lock before calling asyn
layer functions, there are no concurrency issues).
The key assumption in my theory is when the calls to dbScanLock() occur.
For writes, I am assuming the following sequence (all done by a
CAC-client thread in our case):
- write to the output record
- call dbScanLock(pRec)
- call pRec->process(pRec)
- call to the write func specified in the Device Support Entry Table
(which, in this case, calls an asyn layer function to process the new
setting)
- call function in the asyn layer (which calls it own lock() func to
get the asyn-layer lock for a specific instance of the driver)
- call our driver's writeXxx() function
For reads:
- driver reads new values from device
- driver calls lock() to get asyn-layer lock for itself)
- driver calls setXxxParam() and/or setParamStatus() for each new
parameter value
- driver calls callParamCallbacks(), which calls functions in asyn's
the devEpics interface which calls dbScanLock(pRec) and posts the new
value to the record
If all this is correct, then writes require getting the scan lock 1st
and the asyn lock 2nd, while reads require getting the asyn lock 1st and
the scan lock 2nd. And telling the asyn layer that our driver might
block wouldn't change the order the locks are acquired: It would just
create a queue and another thread to manage the calls to our driver's
writeXxx() functions.
So far, I have found nothing that contradicts these assumptions, but I
have not dug deep enough in to all the code to be sure all my
assumptions are correct.
Mark
On 4/16/2018 11:57 AM, Pearson, Matthew R. wrote:
Hi,
Is it possible that the ASYN_CANBLOCK flag is not set when the asynPortDriver object is created?
Cheers,
Matt
On Apr 16, 2018, at 11:32 AM, Mark Davis <[email protected]> wrote:
We have been experiencing a deadlock problem with one of our custom drivers based on the asynPortDriver class, and after reviewing the code and information gained via the gdb debugger, I have a theory regarding the cause that I would like others to review and tell me if there are any flaws they can spot (either in my logic or my assumptions).
First, a summary of the driver that is exhibiting the problem:
- Non-blocking driver based on asynPortDriver class
- One device per driver instance
- Communication with device handled by a thread launched by each instance of the driver
- All asyn calls by driver thread surrounded by lock()/unlock() calls
- info(asyn:READBACK, "1") tag used in some of the output records
- writeXxx() calls don't do any I/O: They save the new value to be processed later
(and sometimes call setXxxParam(), setParamStatus(), and callParamCallbacks() functions)
The driver thread runs periodically to update readings and process new settings as follows:
- reads values form the controller
- calls lock()
- calls setXxxParam() and setParamStatus() as needed
- calls callParamCallbacks() to update the records affected by the changes (acquiring the scan lock for each record while it is processed)
- calls unlock()
- Send new settings to the controller (using lock()/unlock() when calling any asyn functions or driver values that are touched by the writeXxx() functions)
Concurrently with the driver threads usual processing, writes to output records occur.
The processing of each write (I assume) includes the following sequence of events:
- The thread that did the write (a CAC-client thread, in this case) gets the scan lock for the record and begins processing the record
- Processing of the record includes calls to asyn layer functions, which requires acquiring the asyn lock for the driver instance before calling one of its writeXxx() functions
- The asyn lock for the driver is released
- The scan lock for the record is released
What is happening:
The driver thread sometimes blocks indefinitely waiting to acquire the scan lock for an output record it needs to update
My theory as to what can cause this:
What we know:
- The driver thread blocks indefinitely waiting for the scan lock for an output record that includes the info(asyn:READBACK, "1") tag
(see extract from gdb output below and the template for the record)
- The lockup of the driver thread is always preceded by a CA client write to an output record for one of the driver's parameters
(haven't yet confirmed it is the same record that the driver is waiting for when it hangs)
What I believe is happening:
- Sometime after the driver calls lock() and when it blocks waiting for the scan lock for an output record, a CA client thread writes to the same record. It successfully acquires the scan lock for the record, but then blocks waiting to acquire the asyn lock for the driver instance (which the driver thread already has).
Obviously, if I am correct, the easiest way to avoid the problem is to eliminate the use of the info(asyn:READBACK, "1") tag (at least for any case where this could happen). We don't actually need those tags for this database anymore, so that is something we will be trying shortly.
But can anyone point out a mistake in my reasoning or my assumptions? IS this something we need to be aware of when using the info tag?
Mark Davis
NSCL/FRIB Control Systems Software Engineer
[email protected]
========================== from gdb ===================================
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f294a554494 in _L_lock_952 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f294a5542f6 in __GI___pthread_mutex_lock (mutex=0x18db030) at ../nptl/pthread_mutex_lock.c:114
#3 0x00007f294adc17a3 in mutexLock (id=0x18db030) at ../../../src/libCom/osi/os/posix/osdMutex.c:46
#4 0x00007f294adc197f in epicsMutexOsdLock (pmutex=0x18db030)
at ../../../src/libCom/osi/os/posix/osdMutex.c:130
#5 0x00007f294adb85fb in epicsMutexLock (pmutexNode=0x18db070)
at ../../../src/libCom/osi/epicsMutex.cpp:143
#6 0x00007f294b019733 in dbScanLock (precord=0xe9d7d0) at ../../../src/ioc/db/dbLock.c:265
#7 0x00007f294bf8e18d in interruptCallbackOutput (drvPvt=0x1d08640, pasynUser=0x1d08f98, value=0)
at ../../asyn/devEpics/devAsynUInt32Digital.c:500
#8 0x00007f294bf74540 in paramList::uint32Callback (this=0xb3cd50, command=358, addr=0,
interruptMask=4294967295) at ../../asyn/asynPortDriver/asynPortDriver.cpp:628
#9 0x00007f294bf74adc in paramList::callCallbacks (this=0xb3cd50, addr=0)
at ../../asyn/asynPortDriver/asynPortDriver.cpp:750
#10 0x00007f294bf76417 in asynPortDriver::callParamCallbacks (this=0xb3ce60, list=0, addr=0)
at ../../asyn/asynPortDriver/asynPortDriver.cpp:1510
#11 0x00007f294bf763a7 in asynPortDriver::callParamCallbacks (this=0xb3ce60)
at ../../asyn/asynPortDriver/asynPortDriver.cpp:1496
...
_________________________________________________________________________________
Jump to any frame in the callstack to examine arguments (frame 6, here):
(gdb) f 6
#6 0x00007f294b019733 in dbScanLock (precord=0xe9d7d0) at ../../../src/ioc/db/dbLock.c:265
265 ../../../src/ioc/db/dbLock.c: No such file or directory.
_________________________________________________________________________________
(gdb) p precord
$1 = (dbCommon *) 0xe9d7d0
(gdb) p *(dbCommon *)precord
$2 = {name = "MPS_FPS:MSTR_N0001:Slave11_MaskBit42-Sel", '\000' <repeats 20 times>,
desc = '\000' <repeats 40 times>, asg = '\000' <repeats 28 times>, scan = 0, pini = 0, phas = 0, <snipped>
_________________________________________________________________________________
record(bo, "$(SYS)_$(SUB):$(DEV)_$(INST):Slave$(SLAVE)_MaskBit$(BIT)-Sel")
{
field(DTYP, "asynUInt32Digital")
field(OUT, "@asynMask($(ASYN_PORT),0,$(ASYN_MASK),$(TIMEOUT)) slave$(SLAVE)Mask$(MASK_REG)_Set")
field(ZNAM,"Mask")
field(ONAM,"Unmask")
info(asyn:READBACK, "1")
info(autosaveFields, "VAL")
}
- Replies:
- Re: Possible deadlock issue with asyn layer? Pearson, Matthew R.
- References:
- Possible deadlock issue with asyn layer? Mark Davis
- Re: Possible deadlock issue with asyn layer? Pearson, Matthew R.
- Navigate by Date:
- Prev:
Re: Problem with autosave module making it crash at random basis. Konrad, Martin
- Next:
Re: Problem with autosave module making it crash at random basis. Mooney, Tim M.
- Index:
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
<2018>
2019
2020
2021
2022
2023
2024
- Navigate by Thread:
- Prev:
Re: Possible deadlock issue with asyn layer? Pearson, Matthew R.
- Next:
Re: Possible deadlock issue with asyn layer? Pearson, Matthew R.
- Index:
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
<2018>
2019
2020
2021
2022
2023
2024
|