EPICS Home

Experimental Physics and Industrial Control System


 

Subject: Java CAJ/JCA Deadlock on CAJChannel.destroy
From: Ryan Slominski via Tech-talk <tech-talk@aps.anl.gov>
To: "tech-talk@aps.anl.gov" <tech-talk@aps.anl.gov>
Date: Fri, 11 Jan 2019 23:09:47 +0000

Hi,

I'm seeing a deadlock reported with the EPICS Java CAJ/JCA library. It is extremely rare, but the CAJChannel.destroy method will sometimes throw a CAException because the ReferenceCountingLock cannot acquire a lock before the timeout (I believe 20 seconds). In my application, many threads concurrently create and destroy channels and add and remove monitors on the same shared context.


It appears this is caused by inconsistent lock ordering, as seen in the source of the create and destroy channel methods CAJContext.createChannel and CAJChannel.destroy (https://github.com/epics-base/jca/blob/master/src/core/com/cosylab/epics/caj/). Consider the NamedLockPattern > ReferenceCountingLock > ReentrantLock lock (let's call it A) and the CAJChannel intrinsic synchronization lock (let's call it B). In the CAJContext.createChannel method, lock A is obtained first and then B. However, in the CAJChannel.destroy method the locks are obtained in the order B and then A. If one thread is attempting to close a channel while another thread is attempting to create a channel of the same name, a deadlock may occur, and I believe this may be what I am seeing in the following stack trace:
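To make the ordering problem concrete, here is a minimal, self-contained sketch of the A/B inversion described above. It is not CAJ code: lockA and lockB are stand-ins for the per-name ReentrantLock and the channel's intrinsic lock, the class and method names are made up for illustration, and latches force the unlucky interleaving so the timeout is reproducible.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the lock-order inversion; none of these names exist in CAJ.
class LockOrderInversionDemo {
    // Stand-in for the per-name ReentrantLock ("A").
    static final ReentrantLock lockA = new ReentrantLock();
    // Stand-in for the channel object's intrinsic monitor ("B").
    static final Object lockB = new Object();

    /** Returns true if the destroy-like path timed out waiting for lock A. */
    static boolean run(long timeoutMillis) {
        CountDownLatch holdsA = new CountDownLatch(1);
        CountDownLatch holdsB = new CountDownLatch(1);
        boolean[] destroyTimedOut = {false};

        // Create-like path: A first, then B.
        Thread creator = new Thread(() -> {
            lockA.lock();
            try {
                holdsA.countDown();
                await(holdsB);           // make sure the other thread already owns B
                synchronized (lockB) {   // blocks until the destroyer gives up
                    // create-like work would go here
                }
            } finally {
                lockA.unlock();
            }
        });

        // Destroy-like path: B first, then A with a timeout.
        Thread destroyer = new Thread(() -> {
            synchronized (lockB) {
                holdsB.countDown();
                await(holdsA);           // make sure the other thread already owns A
                try {
                    if (lockA.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                        lockA.unlock();
                    } else {
                        destroyTimedOut[0] = true; // analogous to the CAException above
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        creator.start();
        destroyer.start();
        join(creator);
        join(destroyer);
        return destroyTimedOut[0];
    }

    static void await(CountDownLatch latch) {
        try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    static void join(Thread t) {
        try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

In this model the destroy-like thread gives up after the timeout, and only then can the create-like thread proceed, which matches the behavior described in this message.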


java.io.IOException: Unable to close channel
        at org.jlab.epics2web.epics.ChannelMonitor.close(ChannelMonitor.java:126)
        at org.jlab.epics2web.epics.ChannelManager.removeFromChannels(ChannelManager.java:333)
        at org.jlab.epics2web.epics.ChannelManager.removeListener(ChannelManager.java:353)
        at org.jlab.epics2web.websocket.WebSocketSessionManager.removeClient(WebSocketSessionManager.java:169)
        at org.jlab.epics2web.Application$1.run(Application.java:86)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: gov.aps.jca.CAException: Failed to obtain synchronization lock for 'R011PROT.F', possible deadlock.
        at com.cosylab.epics.caj.CAJContext.destroyChannel(CAJContext.java:1017)
        at com.cosylab.epics.caj.CAJChannel.destroy(CAJChannel.java:379)
        at com.cosylab.epics.caj.CAJChannel.destroy(CAJChannel.java:366)
        at org.jlab.epics2web.epics.ChannelMonitor.close(ChannelMonitor.java:124)
        ... 9 more


There appears to be an attempt to avoid this scenario, as CAJContext.createChannel checks for an existing channel twice: first while holding no lock, and then again while holding lock A. This reduces the opportunity for deadlock but does not prevent it. In the rare case in which two or more threads are attempting to create the same channel, one must wait for lock A while the other is free to obtain lock B, create the channel, drop both locks, and then immediately ask to destroy the channel and thus obtain lock B again. A thread context switch occurs. Now the second thread obtains lock A but cannot obtain lock B, while the first thread holds lock B but cannot obtain lock A, so neither thread can continue. Eventually the lock A acquisition times out, thread one fails to destroy the channel, and an exception is thrown. Thread two, which was waiting on the channel object's synchronization lock, is finally able to obtain it and continue in the CAJChannel.addConnectionListenerAndFireIfConnected method.
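For readers unfamiliar with the check-twice idiom mentioned above, here is a rough sketch of its general shape. It is an illustration only and is not taken from CAJContext; the class, field, and method names are hypothetical. The unlocked first check shortens the window in which the lock is needed, but it does nothing about the order in which other locks are taken afterwards.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical registry showing "check, lock, check again"; not CAJ code.
class DoubleCheckedRegistry {
    private final Map<String, Object> channels = new ConcurrentHashMap<>();
    private final ReentrantLock lock = new ReentrantLock(); // plays the role of lock A

    Object getOrCreate(String name) {
        // First check: no lock held, so the common case stays cheap.
        Object existing = channels.get(name);
        if (existing != null) {
            return existing;
        }
        lock.lock();
        try {
            // Second check: another thread may have created it while we waited.
            existing = channels.get(name);
            if (existing != null) {
                return existing;
            }
            Object created = new Object(); // stand-in for a new channel
            channels.put(name, created);
            return created;
        } finally {
            lock.unlock();
        }
    }
}
```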


In the epics2web application this scenario (create and then immediately destroy) might occur if a user requests to view a screen and decides to close their web browser before the page fully loads.  Meanwhile another user is requesting the same screen.


I think the fix is to remove the "synchronized" keyword from the CAJChannel.destroy methods (both of them; one calls the other). The methods already ultimately call CAJContext.destroyChannel, which obtains lock A and then calls CAJChannel.destroyChannel, which obtains lock B. This would make the lock acquisition order consistent between the create and destroy methods. A workaround in the meantime is to avoid the CAJChannel.destroy method and instead call CAJContext.destroyChannel directly.
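The effect of the proposed ordering can be sketched with a small model in which both the create-like and destroy-like paths take A and then B. Again, this is a hypothetical stand-in and not a patch to CAJ: the class, the 5 second timeout, and the counter are assumptions chosen for the illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model: create and destroy both acquire A, then B.
class ConsistentOrderSketch {
    private final ReentrantLock nameLock = new ReentrantLock(); // "A"
    private final Object channel = new Object();                // "B"
    private final AtomicInteger timeouts = new AtomicInteger();
    private int openCount; // stand-in for real work; guarded by the channel monitor

    private void withLocks(Runnable body) {
        try {
            // Timeout value is an assumption for this sketch.
            if (!nameLock.tryLock(5, TimeUnit.SECONDS)) {
                timeouts.incrementAndGet();
                return;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        try {
            synchronized (channel) { // B is only ever taken while A is held
                body.run();
            }
        } finally {
            nameLock.unlock();
        }
    }

    void create()  { withLocks(() -> openCount++); }
    void destroy() { withLocks(() -> openCount--); }

    /** Runs create/destroy pairs on several threads; returns the number of lock timeouts. */
    static int stress(int threads, int iterations) {
        ConsistentOrderSketch sketch = new ConsistentOrderSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < iterations; i++) {
                    sketch.create();
                    sketch.destroy();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return sketch.timeouts.get();
    }
}
```

Because no thread ever waits for A while holding B, the circular wait cannot form in this model. Whether the same holds for the real library depends on every other code path that takes these two locks.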


More info:


https://github.com/JeffersonLab/epics2web/issues/2


This is hard to verify, so could someone with more CAJ/JCA experience please take a look?


Ryan Slominski


