Hi Jeff,
On 17.08.2010 20:05, Jeff Hill wrote:
Hi Ralph,
I'm not sure about that race condition, though.
But don't see where my fix would break the fixed behavior, as it is
still using a countre, decrementing the iiuExistenceCount is still
happening at the end of cac::destroyIIU(), and incrementing
iiuExistenceCount (original fix) happens exactly when it is added to
the
list, so that (my fix) using circuitList.count() to set
iiuExistenceCount yields the same result.
The cac::destroyIIU has roughly three sections of code. In the first section there is a lock protected remove of the per-tcp-circuit data structure from cac's list etc, in the second section the per-tcp-circuit destructor is run without holding the lock so we will not deadlock, and in the third section we are protected by the lock when returning the per-tcp-circuit's memory to the free list and decrementing the iiuExistenceCount.
So, of course the business at hand is that we need to prevent the cac destructor from finishing until each of the tcp circuits is completely shut down. We do not make the cac destructor wait for the list count to become zero because we have to wait for stuff that happens after the per-tcp-circuit is removed from the list. Hence the need for the iiuExistenceCount.
Note that _multiple_ threads, each managing private TCP circuits, can all be shutting down simultaneously and that the lock is released briefly between removing the circuit from the list, and decrementing the iiuExistenceCount.
Thanks for the explanation.
But - as I said in my mail - I perfectly understand bug #541362 and see
why a counter fixes the issue that appears when using a simple flag.
My point is: my fix did not change the code in cac::destroyIIU at all. I
also did not switch back to looping over the circuitList.count() instead
of looping over iiuExistenceCount.
My fix just switched from gradually increasing iiuExistenceCount
whenever a new circuit gets added to the list, towards setting
iiuExistenceCount to circuitList.count() under the lock before shutting
down the circuits.
As (in your fix) iiuExistenceCount++ happens immediately after adding
the circuit to the circuitList(), I still think these two versions are
equivalent.
Sorry about insisting here ... This is really not about trying to always
be right, I just want to understand why I'm wrong.
Furthermore, at the same instant in time it is possible that a new TCP circuits might also be created.
How?
The only way a new circuit gets created is after an incoming result of
the name resolution process. The UDP task has been shut down earlier in
the cac dtor. Who is creating the new circuit?
So setting the count to the number of items on the list can seriously scramble the proper state of the iiuExistenceCount even before the cac dtor runs.
iiuExistenceCount is not used outside the cac dtor. I am setting it
inside the dtor with the lock being held. I don't see how it can be
scrambled after that, and a variable being scrambled before it is set is
not relevant.
And of course if that happens then we can have a crash because the cac dtor might not wait for all of the threads managing the circuits to shutdown, or alternatively hang forever waiting for phantom circuits to shutdown.
Using your fix, if circuits are added after the unlinkAllChannels block
(line ~290), they will be waited for in the line ~307 loop (as
iiuExistenceCount is increased), but unlinkAllChannels has never been
called for them, so the loop will hang forever.
Afaik this behavior has never been observed, which I would take as a
hint that shutting down the UDP task first reliably keeps new circuits
from being created at that time.
(This has changed with the introduction of TCP name requests - but
that's the other mail thread.)
Cheers,
Ralph
-----Original Message-----
From: Ralph Lange [mailto:[email protected]]
Sent: Tuesday, August 17, 2010 10:08 AM
To: Jeff Hill
Cc: EPICS Core Talk
Subject: Re: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp
Hi Jeff,
thanks for the fix!
I'm not sure about that race condition, though.
In my fix:
Setting iiuExistenceCount to the circuitList.count() (line ~290)
happens
in the cac::~cac destructor while the lock is being held, immediately
followed by calling unlinkAllChannels() on all circuits in that list.
The only place where the lock is given up is one block down (line
~307),
when waiting for the tcp threads to exit.
What race condition are you seeing? I looked at the 11786 change and
bug#541362. I perfectly see why a counter fixes the issue that appears
when using a simple flag.
But don't see where my fix would break the fixed behavior, as it is
still using a countre, decrementing the iiuExistenceCount is still
happening at the end of cac::destroyIIU(), and incrementing
iiuExistenceCount (original fix) happens exactly when it is added to
the
list, so that (my fix) using circuitList.count() to set
iiuExistenceCount yields the same result.
I'm perfectly fine with using the original (increment) fix, but I would
like to understand where the difference is.
I think I also see another new race condition introduced by ca-over-
tcp,
which probably breaks the shutdown behavior for clients using
ca-over-tcp, and needs to be fixed.
When the lock is given up in cac::~cac() and the loop (line ~307) waits
for termination of all TCP tasks, an incoming name-res reply from a
still active TCP circuit (the UDP task has exited by that time in the
cac::~cac destructor) could open a new circuit (tcpiiu.c #2188), which
would then be added to the circuitList (and iiuExistenceCount be
incremented). For that new circuit, that the loop would be waiting for,
unlinkAllChannels() is not being called, so it does not have a reason
to
terminate its threads.
Maybe we still need a flag: to avoid creating new circuits while the
cac:~cac destructor runs?!
What do you think?
Thanks a lot,
Ralph
On 16.08.2010 19:46, Jeff Hill wrote:
+ this->iiuExistenceCount = this->circuitList.count();
Examining this change a bit closer I see that it will introduce a
race condition when circuits are being created and destroyed at close
to the same instant in time. See revision 11786 of cac.cpp which fixes
mantis 334 if you are interested in what this code does. After running
the regression tests, I pushed in a fix (which is removing above
mentioned change and restoring the increment of iiuExistenceCount in
cac::findOrCreateVirtCircuit.
Jeff
-----Original Message-----
From: Ralph Lange [mailto:[email protected]]
Sent: Monday, August 16, 2010 10:03 AM
To: Andrew Johnson
Cc: Jeff Hill; EPICS Core Talk
Subject: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp
Hi Andrew,
it looks as I have found the reason for the segfaults, which I think
is
a bug in CAC.
In the cac::~cac() destructor, a private member
"this->iiuExistenceCount" is used to determine if cac should wait
for
TCP threads to shutdown. While this counter gets correctly decreased
in
cac::destroyIIU(), it never gets increased or set (except
initialized to
0 in the cpp constructor).
So cac::~cac() was never waiting for the TCP threads to shutdown,
and
happily kept tearing down the timer/mutex infrastructure. If the
threads
were not fast enough, they were hitting on destroyed semaphores when
cancelling their watchdog timers - that was causing the exceptions
and
segfaults.
This one-line fix will take care of the problem:
=== modified file 'src/ca/cac.cpp'
--- src/ca/cac.cpp 2010-04-15 21:06:16 +0000
+++ src/ca/cac.cpp 2010-08-16 15:24:13 +0000
@@ -287,6 +287,7 @@
//
// shutdown all tcp circuits
//
+ this->iiuExistenceCount = this->circuitList.count();
tsDLIter< tcpiiu> iter = this-
circuitList.firstIter ();
while ( iter.valid() ) {
// this causes a clean shutdown to occur
Maybe it actually takes care of several shutdown related issues...
Yay!
Ralph
- Replies:
- RE: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Jeff Hill
- References:
- Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Ralph Lange
- RE: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Jeff Hill
- Re: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Ralph Lange
- RE: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Jeff Hill
- Navigate by Date:
- Prev:
RE: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Jeff Hill
- Next:
Re: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Ralph Lange
- Index:
2002
2003
2004
2005
2006
2007
2008
2009
<2010>
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
- Navigate by Thread:
- Prev:
RE: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Jeff Hill
- Next:
RE: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Jeff Hill
- Index:
2002
2003
2004
2005
2006
2007
2008
2009
<2010>
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
|