Hello, all.
I originally sent this to tech-talk, but core-talk may be the right place for it. I still haven’t gotten to the bottom of this issue. My next step is to try to reproduce it in a controlled way.
My best,
Márcio
From: "Paduan Donadio, Marcio" <marcio at slac.stanford.edu>
Date: Friday, July 2, 2021 at 2:22 PM
To: EPICS tech-talk <tech-talk at aps.anl.gov>
Subject: Segmentation fault when Gateway calls ca_clear_channel: removing node from empty createReqPend linked list
We’ve been seeing occasional crashes on our EPICS gateways at SLAC due to segmentation faults. For details, including the IOC shell messages and some gdb data, please see the issue description here: https://github.com/epics-extensions/ca-gateway/issues/1
I’m studying the case and it seems to me that the problem is in the channel access client code, not in the gateway. When the gateway calls ca_clear_channel(), one of the internal steps is to remove the channel from the createReqPend linked list.
File ca/src/client/tcpiiu.cpp:
tcpiiu::uninstallChan (…) {
(…)
this->createReqPend.remove ( chan );
(…) }
When you inspect the core dump in gdb, in the frame executing tsDLList.h, you see that createReqPend->pLast = 0 and createReqPend->pFirst = 0. It looks to me like the list is empty. “item” and “theNode” both refer to the item that is to be removed (i.e. tsDLNode<T> &theNode = item), and “this” is the pointer to the linked list.
(gdb) info args
item = @0x757cdd0
this = 0x7f911c0077e0
(gdb) p prevNode
$1 = (tsDLNode<nciu> &) @0x0: <error reading variable>
(gdb) p theNode
$2 = (tsDLNode<nciu> &) @0x757cdf0: {pNext = 0x757d250, pPrev = 0x0}
(gdb) p this->pLast
$3 = (nciu *) 0x0
(gdb) p this->pFirst
$4 = (nciu *) 0x0
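To make the gdb state above easier to reason about, here is a minimal sketch of an intrusive doubly linked list. It is not the tsDLList source; the names Node, List and removeWouldDereferenceNull are hypothetical and exist only for illustration. It assumes that an unchecked removal has to patch the neighbour on each side whenever the item is not the list's first or last element, which is what the prevNode reference at @0x0 suggests.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified intrusive list; NOT the EPICS tsDLList code.
struct Node {
    Node *pNext = nullptr;
    Node *pPrev = nullptr;
};

struct List {
    Node *pFirst = nullptr;
    Node *pLast = nullptr;
    std::size_t count = 0;

    // Append a node at the tail of the list.
    void add(Node &n) {
        n.pNext = nullptr;
        n.pPrev = pLast;
        if (pLast) {
            pLast->pNext = &n;
        } else {
            pFirst = &n;
        }
        pLast = &n;
        ++count;
    }

    // Reports whether an unchecked removal of n would have to write
    // through a null neighbour pointer. Assumption: the removal patches
    // n.pNext->pPrev when n is not the last element, and n.pPrev->pNext
    // when n is not the first element.
    bool removeWouldDereferenceNull(const Node &n) const {
        const bool needsNext = (pLast != &n);
        const bool needsPrev = (pFirst != &n);
        return (needsNext && n.pNext == nullptr) ||
               (needsPrev && n.pPrev == nullptr);
    }
};
```

With pFirst and pLast both null, the item is neither first nor last, so both neighbours would be touched. A node with pPrev = 0x0, as in the dump, then leads to a write through a null pointer.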
There are 5 functions that can be used to empty a tsDLList: remove(), removeAll(), get(), pop(), and the constructor itself that calls clear().
In addition to tcpiiu::uninstallChan(…), which was called in the chain that triggered this seg fault, I found these other callers of createReqPend.get(): tcpiiu::disconnectAllChannels, tcpiiu::unlinkAllChannels, and tcpSendThread::run().
This last one calls createReqPend.get() inside its while(true) loop as part of its ordinary work:
while ( nciu * pChan = this->iiu.createReqPend.get () )
Additionally, we have this in cac.cpp: destroyIIU calls disconnectAllChannels and ~cac calls unlinkAllChannels.
Finally, in tcpiiu.cpp, tcpRecvThread::run() calls
if (! connectSuccess) {
(…)
destroyIIU
(…) }
and tcpSendThread::run() calls this->iiu.cacRef.destroyIIU ( this->iiu ) after the while(true) loop is left.
Do you see a condition where tcpSendThread::run calls createReqPend.get(), or tcpRecvThread::run calls destroyIIU(), “at the same time” as the gateway calls ca_clear_channel(), without this being protected by a mutex? Do you have any insight that would help me proceed with the investigation?
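For reference, this is the kind of serialization I have in mind, as a minimal sketch only. It uses std::mutex and std::list instead of the EPICS primitives, and PendingList and raceDrainAgainstRemove are hypothetical names. It makes no claim about whether the CA client already holds such a lock on these paths; it only shows a get()-style drain and a remove()-style call that cannot see a half-updated list, with remove() checking membership under the same lock.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a pending-request list shared by two threads.
class PendingList {
public:
    void add(int id) {
        std::lock_guard<std::mutex> guard(mutex_);
        items_.push_back(id);
    }

    // Analogue of get(): pops the first item, returns false when empty.
    bool get(int &id) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (items_.empty()) {
            return false;
        }
        id = items_.front();
        items_.pop_front();
        return true;
    }

    // Analogue of remove(): membership is checked under the same lock,
    // so an item already taken by get() is simply reported as absent.
    bool remove(int id) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = std::find(items_.begin(), items_.end(), id);
        if (it == items_.end()) {
            return false;
        }
        items_.erase(it);
        return true;
    }

private:
    std::mutex mutex_;
    std::list<int> items_;
};

// Runs a draining thread against a removing thread and returns how many
// items the two took in total. Each item must be taken exactly once.
int raceDrainAgainstRemove(int n) {
    PendingList pending;
    for (int i = 0; i < n; ++i) {
        pending.add(i);
    }
    int drained = 0;
    int removed = 0;
    std::thread drainer([&] {
        int id = 0;
        while (pending.get(id)) {
            ++drained;
        }
    });
    std::thread remover([&] {
        for (int i = 0; i < n; ++i) {
            if (pending.remove(i)) {
                ++removed;
            }
        }
    });
    drainer.join();
    remover.join();
    return drained + removed;
}
```

If either path touched the list outside the lock, the remover could act on an item the drainer had already unlinked, which is the situation the core dump seems to show.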
Thank you very much,
Márcio Paduan Donadio | Control Systems Engineer
Advanced Control Systems Department
SLAC National Accelerator Laboratory | Menlo Park, CA
p: 650.926.5007 | w: slac.stanford.edu