|Subject:||Java CAJ/JCA Deadlock on CAJChannel.destroy|
|From:||Ryan Slominski via Tech-talk <firstname.lastname@example.org>|
|Date:||Fri, 11 Jan 2019 23:09:47 +0000|
I'm seeing a deadlock reported with the EPICS Java CAJ/JCA library. It is extremely rare, but the CAJChannel.destroy method will sometimes throw a CAException because the ReferenceCountingLock cannot acquire a lock before the timeout (I believe 20 seconds). In my application there are many threads creating and destroying channels and adding and removing monitors concurrently on the same shared context.
It appears this is caused by inconsistent lock ordering, as seen in the source of the create and destroy channel methods CAJContext.createChannel and CAJChannel.destroy (https://github.com/epics-base/jca/blob/master/src/core/com/cosylab/epics/caj/). Consider the NamedLockPattern > ReferenceCountingLock > ReentrantLock (let's call it A) and the CAJChannel intrinsic synchronization lock (let's call it B). In the CAJContext.createChannel method, lock A is obtained first and then B. In the CAJChannel.destroy method, however, the locks are obtained in the opposite order: B, then A. If one thread is attempting to close a channel while another thread is attempting to create a channel of the same name, a deadlock may occur, and I believe this may be what I am seeing in the following stack trace:
There appears to be an attempt to avoid this scenario: CAJContext.createChannel checks for an existing channel twice, first while holding no lock and then again while holding lock A. This reduces the opportunity for deadlock but does not prevent it. In the rare case in which two or more threads are attempting to create the same channel, one must wait for lock A while the other is free to obtain lock B, create the channel, drop both locks, and then immediately ask to destroy the channel, thereby obtaining lock B again. A thread context switch occurs. Now the second thread obtains lock A but cannot obtain lock B, while the first thread holds lock B and cannot obtain lock A, so neither thread can continue. Eventually the lock A acquisition timeout expires, thread one fails to destroy the channel, and an Exception is thrown. Thread two, waiting on the channel object's synchronization lock, is finally able to obtain it and continue in the CAJChannel.addConnectionListenerAndFireIfConnected method.
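To make the interleaving concrete, here is a minimal, self-contained sketch of the pattern described above. It does not use the CAJ/JCA classes: `lockA` is only a stand-in for the named ReentrantLock, `lockB` is a stand-in for the channel's intrinsic lock, and the class name, latches, and short timeout are all assumptions chosen to force the ordering deterministically.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderDemo {
    // Returns {destroyTimedOut, createCompleted}.
    public static boolean[] run(long timeoutMillis) throws InterruptedException {
        final ReentrantLock lockA = new ReentrantLock(); // stand-in for the named lock (A)
        final Object lockB = new Object();               // stand-in for the channel monitor (B)
        final CountDownLatch bHeld = new CountDownLatch(1);
        final CountDownLatch aHeld = new CountDownLatch(1);
        final AtomicBoolean destroyTimedOut = new AtomicBoolean(false);
        final AtomicBoolean createCompleted = new AtomicBoolean(false);

        // "destroy" path: takes B first, then tries A with a timeout.
        Thread destroyer = new Thread(() -> {
            synchronized (lockB) {
                bHeld.countDown();
                try {
                    aHeld.await(); // wait until the other thread owns A
                    if (lockA.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                        lockA.unlock();
                    } else {
                        destroyTimedOut.set(true); // analogous to the CAException
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        // "create" path: takes A first, then B.
        Thread creator = new Thread(() -> {
            try {
                bHeld.await(); // wait until the other thread owns B
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            lockA.lock();
            try {
                aHeld.countDown();
                synchronized (lockB) { // blocks until the destroyer gives up
                    createCompleted.set(true);
                }
            } finally {
                lockA.unlock();
            }
        });

        destroyer.start();
        creator.start();
        destroyer.join();
        creator.join();
        return new boolean[] { destroyTimedOut.get(), createCompleted.get() };
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] result = run(200);
        System.out.println("destroy timed out: " + result[0]
                + ", create completed: " + result[1]);
    }
}
```

In this sketch the destroying thread always times out and the creating thread only proceeds afterwards, which mirrors the behavior reported above, where the timeout turns a permanent deadlock into a failed destroy.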
In the epics2web application this scenario (create and then immediately destroy) might occur if a user requests to view a screen and decides to close their web browser before the page fully loads. Meanwhile another user is requesting the same screen.
I think the fix is to remove the "synchronized" keyword from the CAJChannel.destroy methods (both of them, one calls the other). The methods already ultimately call CAJContext.destroyChannel, which obtains lock A and then calls CAJChannel.destroyChannel, which obtains lock B. This would make the lock acquisition ordering consistent between create and destroy methods. A workaround in the meantime is to not use the CAJChannel.destroy method and instead call CAJContext.destroyChannel directly.
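As a rough illustration of why consistent ordering helps, the following sketch has both the "create" and "destroy" paths take A and then B. Again, this is a toy model and not the CAJ/JCA code: the class, field, and method names are hypothetical, and the two locks are plain stand-ins for the named lock and the channel monitor.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class ConsistentOrderDemo {
    private final ReentrantLock lockA = new ReentrantLock(); // stand-in for the named lock (A)
    private final Object lockB = new Object();               // stand-in for the channel monitor (B)
    private final AtomicInteger timeouts = new AtomicInteger();
    private int operations; // guarded by lockB

    // Used by both paths: always A first, then B.
    private void lockedOperation(long timeoutMillis) throws InterruptedException {
        if (!lockA.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            timeouts.incrementAndGet();
            return;
        }
        try {
            synchronized (lockB) {
                operations++;
            }
        } finally {
            lockA.unlock();
        }
    }

    // Returns {timeouts, completedOperations}.
    public static int[] run(int iterations, long timeoutMillis) throws InterruptedException {
        ConsistentOrderDemo demo = new ConsistentOrderDemo();
        Runnable task = () -> {
            try {
                for (int i = 0; i < iterations; i++) {
                    demo.lockedOperation(timeoutMillis);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        Thread creator = new Thread(task, "create");
        Thread destroyer = new Thread(task, "destroy");
        creator.start();
        destroyer.start();
        creator.join();
        destroyer.join();
        synchronized (demo.lockB) {
            return new int[] { demo.timeouts.get(), demo.operations };
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] result = run(1000, 5000);
        System.out.println("timeouts: " + result[0] + ", operations: " + result[1]);
    }
}
```

Because no thread ever holds B while waiting for A, a thread that owns A can always go on to take B, so the circular wait from the earlier scenario cannot form in this model.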
This is hard to verify so someone with more CAJ/JCA experience please take a look.