Subject: Re: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp
From: Ralph Lange <[email protected]>
To: Jeff Hill <[email protected]>
Cc: EPICS Core Talk <[email protected]>
Date: Tue, 17 Aug 2010 12:07:37 -0400
Hi Jeff,

thanks for the fix! I'm not sure about that race condition, though. In my fix, setting iiuExistenceCount to circuitList.count() (line ~290) happens in the cac::~cac destructor while the lock is held, immediately followed by calling unlinkAllChannels() on all circuits in that list.
The only place where the lock is given up is one block down (line ~307), when waiting for the TCP threads to exit.
What race condition are you seeing? I looked at the 11786 change and bug#541362, and I see perfectly well why a counter fixes the issue that appears when using a simple flag. But I don't see where my fix would break the fixed behavior: it still uses a counter, decrementing iiuExistenceCount still happens at the end of cac::destroyIIU(), and incrementing iiuExistenceCount (in the original fix) happens exactly when a circuit is added to the list, so that using circuitList.count() to set iiuExistenceCount (my fix) should yield the same result.
I'm perfectly fine with using the original (increment) fix, but I would like to understand where the difference is.
I think I also see another new race condition introduced by ca-over-tcp, which probably breaks the shutdown behavior for clients using ca-over-tcp, and needs to be fixed.
When the lock is given up in cac::~cac() and the loop (line ~307) waits for termination of all TCP tasks, an incoming name-resolution reply from a still-active TCP circuit (the UDP task has exited by that time in the cac::~cac destructor) could open a new circuit (tcpiiu.c #2188), which would then be added to the circuitList (and iiuExistenceCount be incremented). For that new circuit, which the loop would then also be waiting for, unlinkAllChannels() is never called, so its threads have no reason to terminate.
Maybe we still need a flag, to avoid creating new circuits while the cac::~cac destructor runs?!
What do you think?

Thanks a lot,
Ralph

On 16.08.2010 19:46, Jeff Hill wrote:
> +    this->iiuExistenceCount = this->circuitList.count();

Examining this change a bit closer, I see that it will introduce a race condition when circuits are being created and destroyed at close to the same instant in time. See revision 11786 of cac.cpp, which fixes mantis 334, if you are interested in what this code does.

After running the regression tests, I pushed in a fix (which removes the above-mentioned change and restores the increment of iiuExistenceCount in cac::findOrCreateVirtCircuit).

Jeff

-----Original Message-----
From: Ralph Lange [mailto:[email protected]]
Sent: Monday, August 16, 2010 10:03 AM
To: Andrew Johnson
Cc: Jeff Hill; EPICS Core Talk
Subject: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp

Hi Andrew,

it looks as if I have found the reason for the segfaults, which I think is a bug in CAC.

In the cac::~cac() destructor, a private member "this->iiuExistenceCount" is used to determine if cac should wait for the TCP threads to shut down. While this counter gets correctly decremented in cac::destroyIIU(), it never gets incremented or set (except being initialized to 0 in the constructor). So cac::~cac() was never waiting for the TCP threads to shut down, and happily kept tearing down the timer/mutex infrastructure. If the threads were not fast enough, they were hitting destroyed semaphores when cancelling their watchdog timers - that was causing the exceptions and segfaults.

This one-line fix will take care of the problem:

=== modified file 'src/ca/cac.cpp'
--- src/ca/cac.cpp	2010-04-15 21:06:16 +0000
+++ src/ca/cac.cpp	2010-08-16 15:24:13 +0000
@@ -287,6 +287,7 @@
     //
     // shutdown all tcp circuits
     //
+    this->iiuExistenceCount = this->circuitList.count();
     tsDLIter< tcpiiu> iter = this->circuitList.firstIter ();
     while ( iter.valid() ) {
         // this causes a clean shutdown to occur

Maybe it actually takes care of several shutdown related issues... Yay!
Ralph