I’ve opened a bug for this, but I also have an exception-related question.
At LCLS, the archiver appliances connect to the IOCs through a CA gateway. The gateway crashes once in a while. This does not seem to be related to an “out-of-memory” issue or a “Gateway has been running for a long time” issue. Instead, it seems to be related to an IOC that is CPU-overloaded and keeps disconnecting.
Unexpected problem with CA circuit to server "eioc-und1-mp01.slac.stanford.edu:5068" was "Connection reset by peer" - disconnecting
Feb 07 02:21:23 Warning: Virtual circuit disconnect eioc-und1-mp01.slac.stanford.edu:5068
Feb 07 02:21:23 !!! Errlog message received (message is above)
Unexpected problem with CA circuit to server "eioc-und1-mp01.slac.stanford.edu:5068" was "Connection reset by peer" - disconnecting
Feb 07 02:41:49 !!! Errlog message received (message is above)
Feb 07 02:41:49 Warning: Virtual circuit disconnect eioc-und1-mp01.slac.stanford.edu:5068
Feb 07 04:42:32 PV Gateway Aborting (SIGSEGV)
I have core dumps and can examine the variables; the gateway is indeed trying to clean up the PVs from this IOC using ca_clear_channel. However, the crash occurs at a fundamental place in EPICS base (tsDLList.h:238). What seems to be happening is that an element in the linked list has a pPrev of 0 but is not itself the list's pFirst element, so the removal dereferences a null previous node. I can provide more details/core(s) if needed.
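To make the suspected inconsistency concrete, here is a minimal sketch of a doubly linked list with a guarded remove. It is only loosely modeled on what the backtrace suggests tsDLList does; the `Node` and `List` types and the `linksConsistent` check are hypothetical names for illustration, not EPICS base code, and the real implementation may differ.

```cpp
#include <cstddef>

// Hypothetical stand-ins for tsDLNode / tsDLList; not the EPICS base types.
struct Node {
    Node *pNext = nullptr;
    Node *pPrev = nullptr;
};

struct List {
    Node *pFirst = nullptr;
    Node *pLast = nullptr;
    std::size_t itemCount = 0;

    // Append a node at the tail of the list.
    void add(Node &n) {
        n.pNext = nullptr;
        n.pPrev = pLast;
        if (pLast) {
            pLast->pNext = &n;
        } else {
            pFirst = &n;
        }
        pLast = &n;
        ++itemCount;
    }

    // A node with a null pPrev must be the head, and a node with a
    // null pNext must be the tail. The core dump shows a node that
    // violates the first condition.
    bool linksConsistent(const Node &n) const {
        const bool headOk = (n.pPrev != nullptr) || (pFirst == &n);
        const bool tailOk = (n.pNext != nullptr) || (pLast == &n);
        return headOk && tailOk;
    }

    // Unlink n. Returns false, leaving the list untouched, rather than
    // dereferencing a null neighbour when the links are inconsistent.
    bool remove(Node &n) {
        if (!linksConsistent(n)) {
            return false;
        }
        if (pLast == &n) {
            pLast = n.pPrev;
        } else {
            n.pNext->pPrev = n.pPrev;
        }
        if (pFirst == &n) {
            pFirst = n.pNext;
        } else {
            n.pPrev->pNext = n.pNext;
        }
        n.pNext = nullptr;
        n.pPrev = nullptr;
        --itemCount;
        return true;
    }
};
```

In this sketch, a node in the state seen in the core (pPrev of 0 while not being pFirst) makes `remove` return false instead of writing through a null pointer. Without such a check, the unguarded `else` branch is where a segmentation fault would occur.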
This does not seem to be a gateway bug; it seems to be some issue in ca_clear_channel. However, I don’t want to change EPICS base; perhaps I can catch the exception in gatePv.cc:240 and then move on. Should I consider patching this like so in the gateway code? I know this has memory leaks but this does not happen often.
Any help is appreciated.
Regards,
Murali
(gdb) bt
#0 0x0016c410 in __kernel_vsyscall ()
#1 0x0086de30 in raise () from /lib/libc.so.6
#2 0x0086f741 in abort () from /lib/libc.so.6
#3 0x080513a4 in sig_end (sig=11) at ../gateway.cc:300
#4 <signal handler called>
#5 0x0075a8c9 in remove (this=0xaf728260, guard=..., chan=...) at ../../../include/tsDLList.h:238
#6 tcpiiu::uninstallChan (this=0xaf728260, guard=..., chan=...) at ../tcpiiu.cpp:1981
#7 0x007512b7 in nciu::destroy (this=0x17e24b88, guard=...) at ../nciu.cpp:93
#8 0x00768347 in oldChannelNotify::destructor (this=0x17e179f0, guard=...) at ../oldChannelNotify.cpp:71
#9 0x00749039 in ca_clear_channel (pChan=0x17e179f0) at ../access.cpp:386
#10 0x080582e0 in gatePvData::~gatePvData (this=0x157f79b0, __in_chrg=<value optimized out>) at ../gatePv.cc:240
#11 0x08062064 in gatePvNode::destroy (this=0x1ca02110) at ../gateServer.h:69
#12 0x0805d6e7 in gateServer::inactiveDeadCleanup (this=0x925af40) at ../gateServer.cc:1490
#13 0x08060fc8 in gateServer::mainLoop (this=0x925af40) at ../gateServer.cc:285
#14 0x0804ef18 in startEverything (prefix=0xbfd7bbe2 "GWLCLSARCH") at ../gateway.cc:656
#15 0x080511a8 in main (argc=16, argv=0xbfd7b494) at ../gateway.cc:1299
……
(gdb) up
#4 <signal handler called>
(gdb) up
#5 0x0075a8c9 in remove (this=0xaf728260, guard=..., chan=...) at ../../../include/tsDLList.h:238
238 prevNode.pNext = theNode.pNext;
(gdb) print theNode
$1 = (tsDLNode<nciu> &) @0x17e24b98: {pNext = 0x17d44d68, pPrev = 0x0}
(gdb) up
#6 tcpiiu::uninstallChan (this=0xaf728260, guard=..., chan=...) at ../tcpiiu.cpp:1981
1981 this->createReqPend.remove ( chan );
(gdb) print chan
$2 = (nciu &) @0x17e24b88: {<cacChannel> = {_vptr.cacChannel = 0x781168, static priorityMax = 99, static priorityMin = 0, static priorityDefault = 0, static priorityLinksDB = 99,
static priorityArchive = 49, static priorityOPI = 0, callback = @0x17e179f0}, <chronIntIdRes<nciu>> = {<chronIntId> = {<intId<unsigned int, 8u, 32u>> = {
id = 833073}, <No data fields>}, <tsSLNode<nciu>> = {pNext = 0x0}, <No data fields>}, <channelNode> = {<tsDLNode<nciu>> = {pNext = 0x17d44d68, pPrev = 0x0},
listMember = cs_createReqPend}, <privateInterfaceForIO> = {_vptr.privateInterfaceForIO = 0x7811d8}, eventq = {pFirst = 0x0, pLast = 0x0, itemCount = 0}, accessRightState = {
f_readPermit = false, f_writePermit = false, f_operatorConfirmationRequest = false}, cacCtx = @0x925e2d8, pNameStr = 0x1c5838a8 "BLM:UND1:MP01:XILINX_CELS.LOW", piiu = 0xaf728260,
sid = 4294967295, count = 0, retry = 1, nameLength = 30, typeCode = 65535, priority = 0 '\000'}
(gdb) quit