Subject: | Segmentation fault when Gateway calls ca_clear_channel: removing node from empty createReqPend linked list |
From: | "Paduan Donadio, Marcio via Tech-talk" <tech-talk at aps.anl.gov> |
To: | EPICS tech-talk <tech-talk at aps.anl.gov> |
Date: | Fri, 2 Jul 2021 21:22:38 +0000 |
We’ve been seeing occasional crashes on our EPICS gateways at SLAC due to segmentation faults. For details, including the IOC shell messages and some gdb data, please see the issue description here:

https://github.com/epics-extensions/ca-gateway/issues/1

I’m studying the case, and it seems to me that the problem is in the Channel Access client code, not in the gateway. When the gateway calls ca_clear_channel(), one of the internal steps is to remove the channel from the createReqPend linked list.

File ca/src/client/tcpiiu.cpp:

    tcpiiu::uninstallChan (…) {
        (…)
        this->createReqPend.remove ( chan );
        (…)
    }

When you inspect the core dump in gdb, in the frame executing tsDLList.h, you see that createReqPend->pLast = 0 and createReqPend->pFirst = 0. It looks to me like the list is empty. “item” and “theNode” correspond to the item that is to be removed (i.e. tsDLNode<T> &theNode = item), and “this” is the pointer to the linked list.

    (gdb) info args
    item = @0x757cdd0
    this = 0x7f911c0077e0
    (gdb) p prevNode
    $1 = (tsDLNode<nciu> &) @0x0: <error reading variable>
    (gdb) p theNode
    $2 = (tsDLNode<nciu> &) @0x757cdf0: {pNext = 0x757d250, pPrev = 0x0}
    (gdb) p this->pLast
    $3 = (nciu *) 0x0
    (gdb) p this->pFirst
    $4 = (nciu *) 0x0

There are 5 functions that can be used to empty a tsDLList: remove(), removeAll(), get(), pop(), and the constructor itself, which calls clear().

In addition to tcpiiu::uninstallChan(…), which was called in the chain that triggered this seg fault, I found these other places that call createReqPend.get(): tcpiiu::disconnectAllChannels, tcpiiu::unlinkAllChannels, and tcpSendThread::run(). The last one calls createReqPend.get() inside its while(true) loop as part of its ordinary work:

    while ( nciu * pChan = this->iiu.createReqPend.get () )
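To make the gdb state above concrete, here is a minimal sketch of how a node could end up with pPrev = 0 and a non-null pNext while the list itself is empty. It is a simplified model, not the EPICS sources: Node, List, removeWouldFault() and drainedNodeWouldFault() are hypothetical names, and the model assumes that remove() neither checks list membership nor resets the node's own pointers.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for tsDLNode<T> / tsDLList<T>; NOT the EPICS code.
struct Node {
    Node *pNext = nullptr;
    Node *pPrev = nullptr;
};

struct List {
    Node *pFirst = nullptr;
    Node *pLast = nullptr;
    std::size_t itemCount = 0;

    void add(Node &item) {                    // append at the tail
        item.pNext = nullptr;
        item.pPrev = pLast;
        if (pLast) pLast->pNext = &item; else pFirst = &item;
        pLast = &item;
        ++itemCount;
    }

    // Assumption of this model: no membership check, and the removed
    // node keeps its old pNext / pPrev values.
    void remove(Node &item) {
        if (pLast == &item) pLast = item.pPrev;
        else item.pNext->pPrev = item.pPrev;
        if (pFirst == &item) pFirst = item.pNext;
        else item.pPrev->pNext = item.pNext;  // faults if item.pPrev is null
        --itemCount;
    }

    Node *get() {                             // pop the head, nullptr if empty
        Node *p = pFirst;
        if (p) remove(*p);
        return p;
    }
};

// True when remove(item) would dereference a null neighbour pointer.
bool removeWouldFault(const List &list, const Node &item) {
    bool nextFault = (list.pLast != &item) && (item.pNext == nullptr);
    bool prevFault = (list.pFirst != &item) && (item.pPrev == nullptr);
    return nextFault || prevFault;
}

// Replays one hypothetical sequence: drain the list with get(), then
// ask whether a later remove() of the old head would fault.
bool drainedNodeWouldFault() {
    List list;
    Node a, b;
    list.add(a);
    list.add(b);
    while (list.get()) {}
    assert(list.pFirst == nullptr && list.pLast == nullptr);
    return removeWouldFault(list, a);
}
```

In this model, a node that was at the head when get() took it off the list is left with a null pPrev and a stale pNext, which resembles the theNode values in the dump. That is only consistent with the hypothesis that the channel had already been taken off createReqPend; it does not prove it.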
Additionally, in cac.cpp, destroyIIU calls disconnectAllChannels and ~cac calls unlinkAllChannels.

Finally, in tcpiiu.cpp, tcpRecvThread::run() calls

    if (! connectSuccess) {
        (…) destroyIIU (…)
    }

and tcpSendThread::run() calls this->iiu.cacRef.destroyIIU ( this->iiu ) after leaving the while(true) loop.

Do you see a condition where tcpSendThread::run calls createReqPend.get(), or tcpRecvThread::run calls destroyIIU(), “at the same time” as the gateway calls ca_clear_channel(), without this being protected by a mutex?

Do you have any insight that would help me proceed with the investigation?

Thank you very much.
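For reference, the kind of protection the question above has in mind can be sketched as follows. This is an illustration only and not how the CA client library is implemented: Chan, GuardedList, removeIfPresent() and the onList flag are hypothetical names. The idea is that get() and the removal path take the same mutex, and that the node records whether it is still on the list, so a removal that loses the race becomes a no-op instead of an unlink on stale pointers.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for a channel node; NOT the EPICS nciu class.
struct Chan {
    Chan *pNext = nullptr;
    Chan *pPrev = nullptr;
    bool onList = false;          // only read or written under the mutex
};

class GuardedList {
public:
    void add(Chan &c) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (c.onList) return;     // already queued, nothing to do
        c.pNext = nullptr;
        c.pPrev = pLast_;
        if (pLast_) pLast_->pNext = &c; else pFirst_ = &c;
        pLast_ = &c;
        c.onList = true;
    }

    // What a sender loop would call: pop the head, nullptr when empty.
    Chan *get() {
        std::lock_guard<std::mutex> guard(mutex_);
        Chan *p = pFirst_;
        if (p) unlink(*p);
        return p;
    }

    // What a "clear channel" path would call: safe even if get()
    // has already taken the channel off the list.
    bool removeIfPresent(Chan &c) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (!c.onList) return false;
        unlink(c);
        return true;
    }

private:
    void unlink(Chan &c) {        // caller must hold mutex_
        assert(c.onList);
        if (c.pNext) c.pNext->pPrev = c.pPrev; else pLast_ = c.pPrev;
        if (c.pPrev) c.pPrev->pNext = c.pNext; else pFirst_ = c.pNext;
        c.pNext = nullptr;
        c.pPrev = nullptr;
        c.onList = false;
    }

    std::mutex mutex_;
    Chan *pFirst_ = nullptr;
    Chan *pLast_ = nullptr;
};
```

With this shape, one thread can loop on get() while another calls removeIfPresent() for the same channel; whichever takes the lock second sees onList == false and leaves the list untouched.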