EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Segmentation fault when Gateway calls ca_clear_channel: removing node from empty createReqPend linked list
From: "Paduan Donadio, Marcio via Core-talk" <core-talk at aps.anl.gov>
To: "core-talk at aps.anl.gov" <core-talk at aps.anl.gov>, "Johnson, Andrew N." <anj at anl.gov>, "Kim, Kukhee" <khkim at slac.stanford.edu>
Date: Wed, 14 Jul 2021 00:31:19 +0000

Hello, all.

 

I first sent this to tech-talk, but core-talk may be the right place for it. I still haven’t gotten to the bottom of this issue. My next step is to try to reproduce it in a controlled way.

 

My best,

 

Márcio

 

From: "Paduan Donadio, Marcio" <marcio at slac.stanford.edu>
Date: Friday, July 2, 2021 at 2:22 PM
To: EPICS tech-talk <tech-talk at aps.anl.gov>
Subject: Segmentation fault when Gateway calls ca_clear_channel: removing node from empty createReqPend linked list

 

We’ve been seeing occasional segmentation-fault crashes on our EPICS gateways at SLAC. For details, including the IOC shell messages and some gdb data, please see the issue description here: https://github.com/epics-extensions/ca-gateway/issues/1

 

I’m studying the case, and it seems to me that the problem is in the Channel Access client code, not in the gateway. When the gateway calls ca_clear_channel(), one of the internal steps is to remove the channel from the createReqPend linked list.

 

File ca/src/client/tcpiiu.cpp:

 

tcpiiu::uninstallChan (…) {

(…)

  this->createReqPend.remove ( chan );

(…) }

 

When you inspect the core dump in gdb, in the frame executing tsDLList.h, you see that createReqPend.pLast = 0 and createReqPend.pFirst = 0, so the list looks empty to me. In the output below, “this” is the pointer to the linked list, and “item” and “theNode” both refer to the item that is to be removed (i.e. tsDLNode<T> &theNode = item). Note that theNode.pPrev is 0 while theNode.pNext is not, and that prevNode is a reference to address 0.

 

(gdb) info args

item = @0x757cdd0

this = 0x7f911c0077e0

(gdb) p prevNode

$1 = (tsDLNode<nciu> &) @0x0: <error reading variable>

(gdb) p theNode

$2 = (tsDLNode<nciu> &) @0x757cdf0: {pNext = 0x757d250, pPrev = 0x0}

(gdb) p this->pLast

$3 = (nciu *) 0x0

(gdb) p this->pFirst

$4 = (nciu *) 0x0
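
To make the failure mode concrete, here is a minimal sketch of an intrusive doubly linked list. It is a simplified model with hypothetical names (Node, List, removeChecked, staleNodeIsRejected), not the EPICS tsDLList source. It assumes that an unchecked remove() unlinks through the node's own neighbor pointers, which would explain why prevNode is a reference to address 0 when the item is not the head of the list and its pPrev is null. The checked variant refuses to unlink an item that is not on the list instead of dereferencing the null pointer.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified intrusive list. Field names mirror the gdb
// output, but this is NOT the EPICS tsDLList implementation.
struct Node {
    Node* pNext = nullptr;
    Node* pPrev = nullptr;
};

struct List {
    Node* pFirst = nullptr;
    Node* pLast = nullptr;
    std::size_t itemCount = 0;

    // Append at the tail.
    void add(Node& item) {
        item.pNext = nullptr;
        item.pPrev = pLast;
        if (pLast) pLast->pNext = &item; else pFirst = &item;
        pLast = &item;
        ++itemCount;
    }

    // Walk the list to confirm membership. O(n), intended for diagnostics.
    bool contains(const Node& item) const {
        for (const Node* p = pFirst; p; p = p->pNext)
            if (p == &item) return true;
        return false;
    }

    // An unchecked unlink would follow item.pPrev whenever item is not the
    // head; with pPrev == nullptr that is a null dereference. This variant
    // returns false instead of touching the neighbors of a foreign node.
    bool removeChecked(Node& item) {
        if (!contains(item)) return false;
        if (pLast == &item) pLast = item.pPrev;
        else item.pNext->pPrev = item.pPrev;
        if (pFirst == &item) pFirst = item.pNext;
        else item.pPrev->pNext = item.pNext;
        item.pNext = item.pPrev = nullptr;
        --itemCount;
        return true;
    }
};

// Recreates the state seen in the core dump: an empty list and a node whose
// pPrev is null while its pNext still points somewhere.
inline bool staleNodeIsRejected() {
    List pending;
    Node other;
    Node stale;
    stale.pNext = &other;  // stale forward link, pPrev stays null
    return !pending.removeChecked(stale) && pending.itemCount == 0;
}
```

A membership walk like this is too slow for production use, but as a temporary diagnostic it would turn the crash into a detectable condition at the point where the inconsistent state is first observed.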

 

Five functions can empty a tsDLList: remove(), removeAll(), get(), pop(), and the constructor itself, which calls clear().

 

In addition to tcpiiu::uninstallChan(…), which was in the call chain that triggered this seg fault, I found these other places that call createReqPend.get(): tcpiiu::disconnectAllChannels, tcpiiu::unlinkAllChannels, and tcpSendThread::run().

 

The last one calls createReqPend.get() inside its while(true) loop as part of its ordinary work:

while ( nciu * pChan = this->iiu.createReqPend.get () )

 

Additionally, we have this in cac.cpp: destroyIIU calls disconnectAllChannels and ~cac calls unlinkAllChannels.

 

Finally, in tcpiiu.cpp, tcpRecvThread::run() calls

 

if (! connectSuccess) {

(…)

destroyIIU

(…) }

 

and tcpSendThread::run() calls this->iiu.cacRef.destroyIIU ( this->iiu ) after leaving the while(true) loop.

 

 

Do you see a condition where tcpSendThread::run() calls createReqPend.get(), or tcpRecvThread::run() calls destroyIIU(), “at the same time” as the gateway calls ca_clear_channel(), without that condition being protected by a mutex? Do you have any insight that could help me proceed with the investigation?
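
To illustrate what I mean by “protected”, here is a minimal sketch of the pattern I would expect to be safe. Everything in it is hypothetical (Channel, PendingQueue, onPendingList, eachChannelRemovedOnce); it uses the C++ standard library rather than libca types and is not a description of how the CA client actually works. The idea is that the draining get() and the targeted remove() take the same mutex, and that a per-channel membership flag is updated under that mutex, so remove() never tries to unlink a channel that get() has already taken off the list.

```cpp
#include <cassert>
#include <list>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical model: every name here is illustrative, not a libca API.
struct Channel {
    bool onPendingList = false;  // protected by the PendingQueue mutex
};

class PendingQueue {
public:
    void add(Channel& c) {
        std::lock_guard<std::mutex> g(mtx);
        c.onPendingList = true;
        items.push_back(&c);
    }

    // Analogue of get(): pops the head, or returns nullptr when empty.
    Channel* get() {
        std::lock_guard<std::mutex> g(mtx);
        if (items.empty()) return nullptr;
        Channel* c = items.front();
        items.pop_front();
        c->onPendingList = false;  // flag changes under the same lock
        return c;
    }

    // Analogue of remove(): succeeds only if the channel is still queued.
    bool remove(Channel& c) {
        std::lock_guard<std::mutex> g(mtx);
        if (!c.onPendingList) return false;
        items.remove(&c);
        c.onPendingList = false;
        return true;
    }

private:
    std::mutex mtx;
    std::list<Channel*> items;
};

// Runs a draining thread against a clearing thread. Each channel must be
// taken off the queue exactly once, by one side or the other.
inline bool eachChannelRemovedOnce(int n) {
    std::vector<Channel> chans(n);
    PendingQueue q;
    for (auto& c : chans) q.add(c);

    int drained = 0;
    int cleared = 0;
    std::thread sender([&] { while (q.get()) ++drained; });
    std::thread clearer([&] {
        for (auto& c : chans)
            if (q.remove(c)) ++cleared;
    });
    sender.join();
    clearer.join();
    return drained + cleared == n;
}
```

If either the lock or the flag update were missing from one of the two paths in this model, the clearing thread could attempt to unlink a channel that is no longer on the list, which is the kind of state the core dump shows.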

 

Thank you very much,

 

-- 



Márcio Paduan Donadio | Control Systems Engineer

Advanced Control Systems Department

SLAC National Accelerator Laboratory | Menlo Park, CA

p: 650.926.5007 | w: slac.stanford.edu

 

 

