Subject: | Bug# 1279147 - Gateway sigsegv's when cleaning up channels using ca_clear_channel |
From: | "Shankar, Murali" <[email protected]> |
To: | "[email protected]" <[email protected]> |
Date: | Tue, 11 Feb 2014 18:08:24 -0800 |
I’ve opened a bug for this, but I also have an exception-related question.

At LCLS, the archiver appliances connect to the IOCs through a CA gateway. The gateway crashes once in a while. This does not seem to be an “out-of-memory” issue or a “Gateway has been running for a long time” issue. Instead, it seems to be related to an IOC that is CPU overloaded and keeps disconnecting:

    Unexpected problem with CA circuit to server "eioc-und1-mp01.slac.stanford.edu:5068" was "Connection reset by peer" - disconnecting
    Feb 07 02:21:23 Warning: Virtual circuit disconnect eioc-und1-mp01.slac.stanford.edu:5068
    Feb 07 02:21:23 !!! Errlog message received (message is above)
    Unexpected problem with CA circuit to server "eioc-und1-mp01.slac.stanford.edu:5068" was "Connection reset by peer" - disconnecting
    Feb 07 02:41:49 !!! Errlog message received (message is above)
    Feb 07 02:41:49 Warning: Virtual circuit disconnect eioc-und1-mp01.slac.stanford.edu:5068
    Feb 07 04:42:32 PV Gateway Aborting (SIGSEGV)

I have core dumps and can examine the variables, and the gateway is indeed trying to clean up the PVs from this IOC using ca_clear_channel. However, the crash occurs in a fundamental place in EPICS base (tsDLList.h:238). What seems to be happening is that an element in the linked list has a previous-node pointer (pPrev) of 0 but is not itself the pFirst element. I can provide more details and core(s) if needed.

This does not seem to be a gateway bug; it seems to be an issue in ca_clear_channel. However, I don’t want to change EPICS base; perhaps I can catch the exception at gatePv.cc:240 and then move on. Should I consider patching the gateway code this way? I know this would leak memory, but the crash does not happen often.

Any help is appreciated.
Regards,
Murali

(gdb) bt
#0  0x0016c410 in __kernel_vsyscall ()
#1  0x0086de30 in raise () from /lib/libc.so.6
#2  0x0086f741 in abort () from /lib/libc.so.6
#3  0x080513a4 in sig_end (sig=11) at ../gateway.cc:300
#4  <signal handler called>
#5  0x0075a8c9 in remove (this=0xaf728260, guard=..., chan=...) at ../../../include/tsDLList.h:238
#6  tcpiiu::uninstallChan (this=0xaf728260, guard=..., chan=...) at ../tcpiiu.cpp:1981
#7  0x007512b7 in nciu::destroy (this=0x17e24b88, guard=...) at ../nciu.cpp:93
#8  0x00768347 in oldChannelNotify::destructor (this=0x17e179f0, guard=...) at ../oldChannelNotify.cpp:71
#9  0x00749039 in ca_clear_channel (pChan=0x17e179f0) at ../access.cpp:386
#10 0x080582e0 in gatePvData::~gatePvData (this=0x157f79b0, __in_chrg=<value optimized out>) at ../gatePv.cc:240
#11 0x08062064 in gatePvNode::destroy (this=0x1ca02110) at ../gateServer.h:69
#12 0x0805d6e7 in gateServer::inactiveDeadCleanup (this=0x925af40) at ../gateServer.cc:1490
#13 0x08060fc8 in gateServer::mainLoop (this=0x925af40) at ../gateServer.cc:285
#14 0x0804ef18 in startEverything (prefix=0xbfd7bbe2 "GWLCLSARCH") at ../gateway.cc:656
#15 0x080511a8 in main (argc=16, argv=0xbfd7b494) at ../gateway.cc:1299
……
(gdb) up
#4  <signal handler called>
(gdb) up
#5  0x0075a8c9 in remove (this=0xaf728260, guard=..., chan=...) at ../../../include/tsDLList.h:238
238         prevNode.pNext = theNode.pNext;
(gdb) print theNode
$1 = (tsDLNode<nciu> &) @0x17e24b98: {pNext = 0x17d44d68, pPrev = 0x0}
(gdb) up
#6  tcpiiu::uninstallChan (this=0xaf728260, guard=..., chan=...) at ../tcpiiu.cpp:1981
1981        this->createReqPend.remove ( chan );
(gdb) print chan
$2 = (nciu &) @0x17e24b88: {<cacChannel> = {_vptr.cacChannel = 0x781168,
    static priorityMax = 99, static priorityMin = 0, static priorityDefault = 0,
    static priorityLinksDB = 99, static priorityArchive = 49, static priorityOPI = 0,
    callback = @0x17e179f0},
  <chronIntIdRes<nciu>> = {<chronIntId> = {<intId<unsigned int, 8u, 32u>> = { id = 833073}, <No data fields>},
    <tsSLNode<nciu>> = {pNext = 0x0}, <No data fields>},
  <channelNode> = {<tsDLNode<nciu>> = {pNext = 0x17d44d68, pPrev = 0x0}, listMember = cs_createReqPend},
  <privateInterfaceForIO> = {_vptr.privateInterfaceForIO = 0x7811d8},
  eventq = {pFirst = 0x0, pLast = 0x0, itemCount = 0},
  accessRightState = {f_readPermit = false, f_writePermit = false, f_operatorConfirmationRequest = false},
  cacCtx = @0x925e2d8, pNameStr = 0x1c5838a8 "BLM:UND1:MP01:XILINX_CELS.LOW",
  piiu = 0xaf728260, sid = 4294967295, count = 0, retry = 1, nameLength = 30,
  typeCode = 65535, priority = 0 '\000'}
(gdb) quit
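
For illustration only, the failure mode seen in frame #5 can be reproduced with a small stand-in list. The sketch below is not the EPICS tsDLList code: `Node`, `List`, `linksLookSane`, `removeChecked`, and `corruptNodeIsRejected` are hypothetical names, and the unchecked `remove` is only assumed to resemble the real one, based on the single crashing line `prevNode.pNext = theNode.pNext;`. It shows why a node with `pPrev = 0x0` that is not `pFirst` leads to a null dereference, and how a sanity check before removal could detect that state. Note that the backtrace shows the failure arriving as signal 11 through `sig_end`, so a C++ `try`/`catch` around the `ca_clear_channel` call would probably not intercept it. A check like this one would also only skip the bad node; it would not explain how the list became inconsistent.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an intrusive doubly linked list node.
struct Node {
    Node *pNext = nullptr;
    Node *pPrev = nullptr;
};

// Hypothetical stand-in for the list that owns the nodes.
struct List {
    Node *pFirst = nullptr;
    Node *pLast = nullptr;
    std::size_t itemCount = 0;

    // Append an item at the tail.
    void add(Node &item) {
        item.pNext = nullptr;
        item.pPrev = pLast;
        if (pLast) {
            pLast->pNext = &item;
        } else {
            pFirst = &item;
        }
        pLast = &item;
        ++itemCount;
    }

    // A null pPrev is only valid for the head, a null pNext only for the tail.
    bool linksLookSane(const Node &item) const {
        const bool headOk = (item.pPrev != nullptr) || (pFirst == &item);
        const bool tailOk = (item.pNext != nullptr) || (pLast == &item);
        return headOk && tailOk;
    }

    // Unchecked removal: trusts the links, so it dereferences pPrev
    // whenever the item is not the head.
    void remove(Node &item) {
        if (pLast == &item) {
            pLast = item.pPrev;
        } else {
            item.pNext->pPrev = item.pPrev;
        }
        if (pFirst == &item) {
            pFirst = item.pNext;
        } else {
            item.pPrev->pNext = item.pNext;  // crashes when pPrev is null
        }
        --itemCount;
    }

    // Checked removal: leaves the list untouched if the links are inconsistent.
    bool removeChecked(Node &item) {
        if (!linksLookSane(item)) {
            return false;
        }
        remove(item);
        return true;
    }
};

// Rebuilds the shape reported by gdb (pNext set, pPrev = 0, not the head)
// and confirms the checked removal refuses to touch it.
bool corruptNodeIsRejected() {
    List list;
    Node a, b, c;
    list.add(a);
    list.add(b);
    list.add(c);
    b.pPrev = nullptr;  // simulate the corruption seen in the core dump
    assert(list.pFirst != &b);
    return !list.removeChecked(b) && list.itemCount == 3;
}
```

Calling `corruptNodeIsRejected()` from a test program exercises the corrupt case without crashing, whereas calling `remove` directly on the same node would dereference a null pointer. Skipping the removal leaves the node on the list, which matches the memory-leak trade-off mentioned in the message.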