EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System

2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  <20202021  2022  2023  2024  Index 2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  <20202021  2022  2023  2024 
<== Date ==> <== Thread ==>

Subject: [Bug 1868486] Re: epicsMessageQueue lost messages
From: rivers via Core-talk <core-talk at aps.anl.gov>
To: core-talk at aps.anl.gov
Date: Mon, 23 Mar 2020 19:25:31 -0000
I understand that Michael does not have the time to fix this problem.
But I really think it should be fixed.  While there is no code in base
7.0.3.1 that uses the timeout variant of epicsMessageQueue::receive()
there are plenty of places in epics-modules and areaDetector that do:

corvette:~/devel>find . -name '*.c*' -type f -exec grep -H ReceiveWithTimeout {} \;
./mca-7-8/mcaApp/CanberraSrc/nmc_comm_subs_1.c:        len = epicsMessageQueueReceiveWithTimeout(m->responseQ, pkt, sizeof(*pkt), (double)(i->timeout_time/1000.));
./asyn-4-39/asyn/drvAsynUSBTMC/drvAsynUSBTMC.c:            s = epicsMessageQueueReceiveWithTimeout(
./ipUnidig-2-11/ipUnidigApp/src/drvIpUnidig.cpp:    status = epicsMessageQueueReceiveWithTimeout(msgQId_,
./autosave-5-10/asApp/src/save_restore.c:               while (epicsMessageQueueReceiveWithTimeout(opMsgQueue, (void*) &msg, OP_MSG_SIZE, (double)MIN_DELAY) >= 0) {
./caputRecorder-1-7-2/caputRecorderApp/src/caputRecorder.c:             msg_size = epicsMessageQueueReceiveWithTimeout(caputRecorderMsgQueue, (void *)pmsg, MSG_SIZE, 5.0);
./softGlue-2-8-2/softGlueApp/src/drvIP_EP201.c:         if (epicsMessageQueueReceiveWithTimeout(pPvt->msgQId,
./areaDetector-3-9/aravisGigE/aravisGigEApp/src/aravisCamera.cpp:        if (epicsMessageQueueReceiveWithTimeout(this->msgQId, &buffer, sizeof(&buffer), 0.005) == -1) {
./areaDetector-3-9/ADAravis/aravisApp/src/ADAravis.cpp:        if (epicsMessageQueueReceiveWithTimeout(this->msgQId, &buffer, sizeof(&buffer), 0.005) == -1) {

This is where I believe epicsMessageQueue::receive() with a timeout is
used:

corvette:~/devel>find . -name '*.c*' -type f -exec grep -H 'receive(' {} \;
./dante-1-0/danteApp/danteSrc/dante.cpp:    numRecv = msgQ_->receive(&message, sizeof(message), MESSAGE_TIMEOUT);
./areaDetector-3-9/ADCore/ADApp/pluginSrc/NDPluginDriver.cpp:        numBytes = pFromThreadMsgQ_->receive(&fromMsg, sizeof(fromMsg), 2.0);
./areaDetector-3-9/ADCore/ADApp/pluginSrc/NDPluginDriver.cpp:            numBytes = pFromThreadMsgQ_->receive(&fromMsg, sizeof(fromMsg), 2.0);


The bug is such that if the time between messages is ever the same as the timeout then there is significant probability that the message will be lost.  This can lead to unreliable behavior and difficult to track down problems.

I spent 2 days isolating this problem to EPICS base.  Since the symptom
was that frames were being dropped on a FLIR camera my first suspects
were the hardware, the areaDetector driver, and the vendor library
respectively.  It took a long time to realize the problem was with EPICS
base.

-- 
You received this bug notification because you are a member of EPICS
Core Developers, which is subscribed to EPICS Base.
Matching subscriptions: epics-core-list-subscription
https://bugs.launchpad.net/bugs/1868486

Title:
  epicsMessageQueue lost messages

Status in EPICS Base:
  Confirmed
Status in EPICS Base 3.15 series:
  New
Status in EPICS Base 7.0 series:
  Confirmed

Bug description:
  https://epics.anl.gov/core-talk/2020/msg00396.php

  Mark Rivers observed epicsMessageQueue losing messages.

  https://epics.anl.gov/core-talk/2020/msg00408.php

  > I think I see the logic error in how the eventSent flag is handled,
  > specifically related to the fact that epicsEvent is a semaphore as
  > opposed to a condition variable.
  > 
  > This allows a "race" to occur if the first/only waiting receiver
  > times out, and epicsEventWaitWithTimeout() returns, while a sender
  > is in epicsMessageQueueSend() preparing to wake up a receiver.
  > 
  > This results in a situation where the sender has set the eventSent,
  > and indeed copied a message to the buffer of, a thread which has
  > decided to abort.
  > 
  > After timing out, the receiver sees the timeout and returns -1
  > even through eventSent has been set.  This can be trapped with:
  > 
  > > b osdMessageQueue.cpp:358 if status!=0 && threadNode.eventSent
  > 
  > So here is your lost message.
  > 
  > Now when epicsMessageQueueReceiveWithTimeout() is called again, no message
  > is waiting in the queue, so epicsEventWaitWithTimeout() is called.
  > Since the semaphore is already set, this returns immediately with status==0,
  > but this is a spurious wakeup and the eventSent flag is not set.
  > 
  > And here is the second "timeout".

  Line numbers circa 7.0.3.1

To manage notifications about this bug go to:
https://bugs.launchpad.net/epics-base/+bug/1868486/+subscriptions

References:
[Bug 1868486] [NEW] epicsMessageQueue lost messages mdavidsaver via Core-talk

Navigate by Date:
Prev: Re: write to a single element of an array field Ben Franksen via Core-talk
Next: reload of acf files Timo Korhonen via Core-talk
Index: 2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  <20202021  2022  2023  2024 
Navigate by Thread:
Prev: [Bug 1868486] Re: epicsMessageQueue lost messages mdavidsaver via Core-talk
Next: [Bug 1868486] Re: epicsMessageQueue lost messages mdavidsaver via Core-talk
Index: 2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  <20202021  2022  2023  2024 
ANJ, 24 Mar 2020 Valid HTML 4.01! · Home · News · About · Base · Modules · Extensions · Distributions · Download ·
· Search · EPICS V4 · IRMIS · Talk · Bugs · Documents · Links · Licensing ·