EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System

2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  <20202021  2022  2023  2024  Index 2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  <20202021  2022  2023  2024 
<== Date ==> <== Thread ==>

Subject: RE: Bug in epicsMessageQueue?
From: Mark Rivers via Core-talk <core-talk at aps.anl.gov>
To: 'Michael Davidsaver' <mdavidsaver at gmail.com>, "'core-talk at aps.anl.gov'" <core-talk at aps.anl.gov>
Date: Sat, 21 Mar 2020 00:14:52 +0000

I have now rolled back to base 7.0.3.

 

The problem persists:

 

2020/03/20 19:11:50.024 received value=156

2020/03/20 19:11:50.124     sent value=157

2020/03/20 19:11:50.124 received value=157

2020/03/20 19:11:50.224     sent value=158

2020/03/20 19:11:50.224 receive timeout

2020/03/20 19:11:50.224 receive timeout

2020/03/20 19:11:50.324     sent value=159

2020/03/20 19:11:50.324 received value=159

2020/03/20 19:11:50.424     sent value=160

2020/03/20 19:11:50.425 received value=160

Loops=200, numSent=200, numReceived=195

 

Message 158 and 4 others were never received.

 

Mark

 

 

From: Mark Rivers
Sent: Friday, March 20, 2020 7:00 PM
To: 'Michael Davidsaver' <mdavidsaver at gmail.com>; 'core-talk at aps.anl.gov' <core-talk at aps.anl.gov>
Subject: RE: Bug in epicsMessageQueue?

 

I misspoke in my previous post.  I said I was working with 7.0.3.1 with the commit for epicsTimeMonotonic reverted.  However, I had only built for windows-x64-static at that point, not for linux-x86_64.  However, I have now built and tested with that commit reverted and the results are the same on linux-x86_64.

 

 

For example:

 

2020/03/20 18:47:53.619 received value=188

2020/03/20 18:47:53.719     sent value=189

2020/03/20 18:47:53.719 receive timeout

2020/03/20 18:47:53.719 receive timeout

2020/03/20 18:47:53.819     sent value=190

2020/03/20 18:47:53.819 received value=190

 

So message 189 was never received.

 

Ø  From a very quick glance, you might start looking on the receive side.  What is visible as epicsMessageQueueReceiveWithTimeout()

Ø  returns -1 for conditions other than timeout.

 

Here is the code:

 

if (threadNode.eventSent && (threadNode.size <= size) &&

        status == epicsEventOK)

        return threadNode.size;

    return -1;

 

So it can return -1 if the allocated size for the message is wrong, but I believe that is impossible in this case.

 

It can fail if status is not epicsEventOK, which means epicsEventWaitWithTimeout returned something other than epicsEventOK, i.e. timeout.

 

I am not sure when threadNode.evenSent would be null.

 

This really looks like an issue with epicsEventWaitWithTimeout, but I’m not sure.  I thought those problems would go away when reverting that commit.

 

But I am not really assuming the problem is a timeout, except for the error message.  If epicsMessageQueueReceiveWithTimeout returns -1 for any reason I just try again.  But note that I never receive those missing messages.

 

This seems like a serious problem, making 7.0.3.1 not really usable.

 

Mark

 

 

-----Original Message-----

From: Michael Davidsaver <mdavidsaver at gmail.com>

Sent: Friday, March 20, 2020 4:13 PM

To: Mark Rivers <rivers at cars.uchicago.edu>; 'core-talk at aps.anl.gov' <core-talk at aps.anl.gov>

Subject: Re: Bug in epicsMessageQueue?

 

From a very quick glance, you might start looking on the receive side.  What is visible as

epicsMessageQueueReceiveWithTimeout()

returns -1 for conditions other than timeout.

 

https://github.com/epics-base/epics-base/blob/25bb966cbc4be1f2a47116a8b9ec940bc2270b12/modules/libcom/src/osi/os/default/osdMessageQueue.cpp#L355-L358

 

 

On 3/20/20 1:56 PM, Mark Rivers via Core-talk wrote:

> I just wrote a little test program for epicsMessageQueue.  I think it demonstrates a bug, but perhaps it just means I am doing something wrong.

>

>  

>

> The program is attached.  The program takes 2 arguments, numLoops and delayTime.

>

>  

>

> The main program just loops numLoops times, waiting for delayTime and then sending a 32-bit integer message which is the loop counter.

>

>  

>

> There is a messageTask thread that just calls epicsMessageQueueReceiveWithTimeout with a 0.1 second timeout in an infinite loop.  It prints a message it if receives a message, or if it times out.  It also increments a numReceived counter.

>

>  

>

> If the program is called with numLoops=200 and delayTime=0.11 it works fine.  This is the final output, numLoops=numSent=numReceived=200.

>

>  

>

> Loops=200, numSent=200, numReceived=200

>

>  

>

> If the program is called with numLoops=200 and delayTime=0.09 it also works fine.  This is the final output, numLoops=numSent=numReceived=200.

>

>  

>

> Loops=200, numSent=200, numReceived=200

>

>  

>

> However, if it is called with numLoops=200 and delayTime=0.10 (the same value as the timeout in epicsMessageQueueReceiveWithTimeout) it fails:

>

>  

>

> Loops=200, numSent=200, numReceived=188

>

>  

>

> Here is some of the output:

>

>  

>

> 2020/03/20 15:47:45.239     sent value=86

>

> 2020/03/20 15:47:45.239 received value=86

>

> 2020/03/20 15:47:45.339 receive timeout

>

> 2020/03/20 15:47:45.339     sent value=87

>

> 2020/03/20 15:47:45.339 received value=87

>

> 2020/03/20 15:47:45.439 receive timeout

>

> 2020/03/20 15:47:45.439 received value=88

>

> 2020/03/20 15:47:45.439     sent value=88

>

> 2020/03/20 15:47:45.539 receive timeout

>

> 2020/03/20 15:47:45.539 received value=89

>

> 2020/03/20 15:47:45.539     sent value=89

>

> 2020/03/20 15:47:45.639     sent value=90

>

> 2020/03/20 15:47:45.639 received value=90

>

> 2020/03/20 15:47:45.739     sent value=91

>

> 2020/03/20 15:47:45.739 received value=91

>

> 2020/03/20 15:47:45.839     sent value=92

>

> 2020/03/20 15:47:45.839 received value=92

>

> 2020/03/20 15:47:45.939     sent value=93

>

> 2020/03/20 15:47:45.939 received value=93

>

> 2020/03/20 15:47:46.039     sent value=94

>

> 2020/03/20 15:47:46.039 received value=94

>

> 2020/03/20 15:47:46.139     sent value=95

>

> 2020/03/20 15:47:46.139 received value=95

>

> 2020/03/20 15:47:46.239     sent value=96

>

> 2020/03/20 15:47:46.240 receive timeout

>

> 2020/03/20 15:47:46.240 receive timeout

>

> 2020/03/20 15:47:46.340     sent value=97

>

> 2020/03/20 15:47:46.340 receive timeout

>

> 2020/03/20 15:47:46.340 receive timeout

>

> 2020/03/20 15:47:46.440     sent value=98

>

> 2020/03/20 15:47:46.440 received value=98

>

> 2020/03/20 15:47:46.540     sent value=99

>

> 2020/03/20 15:47:46.540 received value=99

>

> 2020/03/20 15:47:46.640     sent value=100

>

> 2020/03/20 15:47:46.640 received value=100

>

>  

>

> Note that occasional timeouts are expected because the delay time between messages is “exactly” the same as the timeout.  We see that after messages 86 and 87, which is fine.  What is wrong is what happens after messages 96 and 97.  in both cases there are then 2 timeouts in less than 1 ms, and messages 96 and 97 are never received.  This seems like a bug?

>

>  

>

> This test is on linux-x86_64 with base 7.0.3.1 with commit  4f2228fb1d7527fb5ebc8b2d747c309f1dd7698d reverted.

>

>  

>

> Mark

>

>  

>

>  

>

>  

>

> *From:* Mark Rivers

> *Sent:* Friday, March 20, 2020 1:07 PM

> *To:* 'core-talk at aps.anl.gov' <core-talk at aps.anl.gov>

> *Subject:* RE: C++ question

>

>  

>

> I reverted commit  4f2228fb1d7527fb5ebc8b2d747c309f1dd7698d in EPICS base that uses epicsTimeMonotonic.  That did not fix the problem.

>

>  

>

> I still see things like this:

>

>  

>

> 2020/03/20 12:52:34.510 ImageEventHandler::OnImageEvent sending

> uniqueId=80

>

> 2020/03/20 12:52:34.511 ADSpinnaker::grabImage received uniqueID 80

> from message queue

>

> 2020/03/20 12:52:34.625 ImageEventHandler::OnImageEvent sending

> uniqueId=81

>

> 2020/03/20 12:52:34.626 ADSpinnaker::grabImage received uniqueID 81

> from message queue

>

> 2020/03/20 12:52:34.740 ImageEventHandler::OnImageEvent sending

> uniqueId=82

>

> 2020/03/20 12:52:34.741 ADSpinnaker::grabImage timeout receiving from

> message queue

>

> 2020/03/20 12:52:34.742 ADSpinnaker::grabImage timeout receiving from

> message queue

>

> 2020/03/20 12:52:34.842 ADSpinnaker::grabImage timeout receiving from

> message queue

>

> 2020/03/20 12:52:34.855 ImageEventHandler::OnImageEvent sending

> uniqueId=83

>

> 2020/03/20 12:52:34.855 ADSpinnaker::grabImage received uniqueID 83

> from message queue

>

> 2020/03/20 12:52:34.970 ImageEventHandler::OnImageEvent sending

> uniqueId=84

>

> 2020/03/20 12:52:34.970 ADSpinnaker::grabImage timeout receiving from

> message queue

>

> 2020/03/20 12:52:34.971 ADSpinnaker::grabImage received uniqueID 84

> from message queue

>

>  

>

> and this

>

>  

>

> 2020/03/20 12:52:38.305 ImageEventHandler::OnImageEvent sending

> uniqueId=113

>

> 2020/03/20 12:52:38.305 ADSpinnaker::grabImage received uniqueID 113

> from message queue

>

> 2020/03/20 12:52:38.419 ADSpinnaker::grabImage timeout receiving from

> message queue

>

> 2020/03/20 12:52:38.420 ImageEventHandler::OnImageEvent sending

> uniqueId=114

>

> 2020/03/20 12:52:38.421 ADSpinnaker::grabImage received uniqueID 114

> from message queue

>

> 2020/03/20 12:52:38.535 ImageEventHandler::OnImageEvent sending

> uniqueId=115

>

> 2020/03/20 12:52:38.535 ADSpinnaker::grabImage received uniqueID 115

> from message queue

>

> 2020/03/20 12:52:38.650 ImageEventHandler::OnImageEvent sending

> uniqueId=116

>

> 2020/03/20 12:52:38.650 ADSpinnaker::grabImage timeout receiving from

> message queue

>

> 2020/03/20 12:52:38.652 ADSpinnaker::grabImage timeout receiving from

> message queue

>

> 2020/03/20 12:52:38.754 ADSpinnaker::grabImage timeout receiving from

> message queue

>

> 2020/03/20 12:52:38.765 ImageEventHandler::OnImageEvent sending

> uniqueId=117

>

> 2020/03/20 12:52:38.765 ADSpinnaker::grabImage received uniqueID 117

> from message queue

>

> 2020/03/20 12:52:38.880 ImageEventHandler::OnImageEvent sending

> uniqueId=118

>

> 2020/03/20 12:52:38.880 ADSpinnaker::grabImage received uniqueID 118

> from message queue

>

>  

>

> Messages 82 and 116 were sent on the messageQueue but never received.  In both cases there are 2 timeouts within 1-2 ms, when the timeout is 100 ms.

>

>  

>

> Mark

>

>  

>

>  

>

>  

>

>  

>

> *From:* Mark Rivers

> *Sent:* Friday, March 20, 2020 11:56 AM

> *To:* 'core-talk at aps.anl.gov' <core-talk at aps.anl.gov

> <mailto:core-talk at aps.anl.gov>>

> *Subject:* RE: C++ question

>

>  

>

> Folks,

>

>  

>

> I have added debugging print statements when sending a message on the

> epicsMessageQueue and when receiving the message on the epicsMessageQueue.  Each time a message is sent I print the UniqueID of the image that was sent.  When the message is received I also print the UniqueID of the message that was received.  The time between message is about 0.115 seconds, while the epicsMessageQueue receive timeout is 0.1 seconds.

>

>  

>

> Here is what I observed:

>

>  

>

> 2020/03/20 11:37:37.098 ImageEventHandler::OnImageEvent sending

> uniqueId=85

>

> 2020/03/20 11:37:37.098 ADSpinnaker::grabImage received uniqueID 85

> from message queue

>

> 2020/03/20 11:37:37.213 ImageEventHandler::OnImageEvent sending

> uniqueId=86

>

> 2020/03/20 11:37:37.213 ADSpinnaker::grabImage timeout receiving from

> message queue

>

> 2020/03/20 11:37:37.214 ADSpinnaker::grabImage received uniqueID 86

> from message queue

>

> 2020/03/20 11:37:37.328 ImageEventHandler::OnImageEvent sending

> uniqueId=87

>

> 2020/03/20 11:37:37.328 ADSpinnaker::grabImage received uniqueID 87

> from message queue

>

> 2020/03/20 11:37:37.443 ImageEventHandler::OnImageEvent sending

> uniqueId=88

>

> 2020/03/20 11:37:37.443 ADSpinnaker::grabImage timeout receiving from

> message queue

>

> 2020/03/20 11:37:37.445 ADSpinnaker::grabImage timeout receiving from

> message queue

>

> 2020/03/20 11:37:37.546 ADSpinnaker::grabImage timeout receiving from

> message queue

>

> 2020/03/20 11:37:37.557 ImageEventHandler::OnImageEvent sending

> uniqueId=89

>

> 2020/03/20 11:37:37.558 ADSpinnaker::grabImage received uniqueID 89

> from message queue

>

> 2020/03/20 11:37:37.672 ImageEventHandler::OnImageEvent sending

> uniqueId=90

>

> 2020/03/20 11:37:37.672 ADSpinnaker::grabImage received uniqueID 90

> from message queue

>

> 2020/03/20 11:37:37.784 ADSpinnaker::grabImage timeout receiving from

> message queue

>

> 2020/03/20 11:37:37.787 ImageEventHandler::OnImageEvent sending

> uniqueId=91

>

> 2020/03/20 11:37:37.787 ADSpinnaker::grabImage received uniqueID 91

> from message queue

>

>  

>

> Messages with UniqueID 85, 86, and 87 were received fine.  When reading messages 85 and 87 there was no timeout, but when reading 86 there was a timeout.  However,  0.001 second after that timeout the receive succeeded.

>

>  

>

> The problem is message 88.

>

>  

>

> 2020/03/20 11:37:37.443 ImageEventHandler::OnImageEvent sending

> uniqueId=88

>

> 2020/03/20 11:37:37.443 ADSpinnaker::grabImage timeout receiving from

> message queue

>

> 2020/03/20 11:37:37.445 ADSpinnaker::grabImage timeout receiving from

> message queue

>

> 2020/03/20 11:37:37.546 ADSpinnaker::grabImage timeout receiving from

> message queue

>

> 2020/03/20 11:37:37.557 ImageEventHandler::OnImageEvent sending

> uniqueId=89

>

> 2020/03/20 11:37:37.558 ADSpinnaker::grabImage received uniqueID 89

> from message queue

>

>  

>

> Note that it was sent at 37.443.  There were then 3 timeouts trying to read that message, at 37.443, 37.445, and 37.546.  Message 88 was never received!

>

>  

>

> Something seems seriously wrong here.  It is making me wonder if this is related to the problem with epicsTimeMonotonic that I observed with callbackRequestDelay and Andrew observed with epicsEventWaitWithTimeout?  What could cause that message #88 to never be received from the message queue?  Why are there 2 timeouts within 0.002 seconds, when the timeout is 0.1 second?

>

>  

>

> Thanks,

>

> Mark

>

>  

>

>  

>

>  

>

> *From:* Mark Rivers

> *Sent:* Friday, March 20, 2020 9:52 AM

> *To:* core-talk at aps.anl.gov <mailto:core-talk at aps.anl.gov>

> *Subject:* C++ question

>

>  

>

> Folks,

>

>  

>

> I am having a problem with the areaDetector ADSpinnaker driver.  I am quite sure the problem is that I am doing something wrong in my C++ callback code.

>

>  

>

> The code in question is quite simple.

>

>  

>

> The following is a class that implements callbacks from the vendor library when a new image is available.  It inherits from the vendor’s ImageEvent class, and implments a method they define called OnImageEvent.  That method is passed a smart pointer of type ImagePtr.  I made a copy of that smart pointer and pass its address on an epicsMessageQueue.  The copy is deleted in the receiving function.

>

>  

>

> ***************************************

>

> class ImageEventHandler : public ImageEvent

>

> {

>

> public:

>

>  

>

>     ImageEventHandler(epicsMessageQueue *pMsgQ)

>

>      : pMsgQ_(pMsgQ)

>

>     {}

>

>     ~ImageEventHandler() {}

>

>  

>

>     void OnImageEvent(ImagePtr image) {

>

>         ImagePtr *imagePtrAddr = new ImagePtr(image);

>

>  

>

>         if (pMsgQ_->send(&imagePtrAddr, sizeof(imagePtrAddr)) != 0) {

>

>             printf("OnImageEvent error calling pMsgQ_->send()\n");

>

>         }

>

>     }

>

>  

>

> private:

>

>     epicsMessageQueue *pMsgQ_;

>

>  

>

> };

>

> ***************************************

>

>  

>

> This is the receiving function.

>

>  

>

> ***************************************

>

> asynStatus ADSpinnaker::grabImage()

>

> {

>

>     asynStatus status = asynSuccess;

>

>     size_t nRows, nCols;

>

>     NDDataType_t dataType;

>

>     NDColorMode_t colorMode;

>

>     int timeStampMode;

>

>     int uniqueIdMode;

>

>     int convertPixelFormat;

>

>     bool imageConverted = false;

>

>     int numColors;

>

>     size_t dims[3];

>

>     ImageStatus imageStatus;

>

>     PixelFormatEnums pixelFormat;

>

>     int pixelSize;

>

>     size_t dataSize, dataSizePG;

>

>     void *pData;

>

>     int nDims;

>

>     ImagePtr pImage;

>

>     ImagePtr *imagePtrAddr=0;

>

>     static const char *functionName = "grabImage";

>

>  

>

>     try {

>

>         while(1) {

>

>             unlock();

>

>             int recvSize = pCallbackMsgQ_->receive(&imagePtrAddr,

> sizeof(imagePtrAddr), 0.1);

>

>             lock();

>

>             if (recvSize == sizeof(imagePtrAddr)) {

>

>                 break;

>

>             } else if (recvSize == -1) {

>

>                 // Timeout

>

>                 int acquire;

>

>                 getIntegerParam(ADAcquire, &acquire);

>

>                 if (acquire == 0) {

>

>                     return asynError;

>

>                 } else {

>

>                     continue;

>

>                 }

>

>             } else {

>

>                 asynPrint(pasynUserSelf, ASYN_TRACE_ERROR,

>

>                         "%s::%s error receiving from message queue\n",

>

>                         driverName, functionName);

>

>                 return asynError;

>

>             }

>

>         }

>

>         pImage = *imagePtrAddr;

>

>         // Delete the ImagePtr that was passed to us

>

>         delete imagePtrAddr;

>

> …

>

> Finishing processing pImage …

>

> ***************************************

>

>  

>

> Note that I have a timeout of 0.1 second on the msgQ->receive() call.  if the call times out it tries again if it is still acquiring.  This allows the function to return quickly if acquisition is stopped after it is called.

>

>  

>

> This code generally works fine when the time between images is 0.1

> second or less, or 0.12 seconds or more.  However, if the time between images is 0.115 seconds, for example, I “lose” about 2-3% of the images.  There are no error messages.  If I increase the 0.1 second timeout to 0.2 seconds, then images are lost if the time between images is about 0.215 seconds.

>

>  

>

> It seems to me that the problem must be that another callback happens before the previous image has been processed.  I thought that making a copy of the smart pointer on the stack would prevent this, but I must be doing something wrong.  Perhaps I am deleting the smart pointer too soon, or perhaps the whole approach I am using is unsafe?

>

>  

>

> Any advice is much appreciated!

>

>  

>

> Thanks,

>

> Mark

>

>  

>

 


References:
Bug in epicsMessageQueue? Mark Rivers via Core-talk
Re: Bug in epicsMessageQueue? Michael Davidsaver via Core-talk
RE: Bug in epicsMessageQueue? Mark Rivers via Core-talk

Navigate by Date:
Prev: RE: Bug in epicsMessageQueue? Mark Rivers via Core-talk
Next: Re: C++ question Mark Rivers via Core-talk
Index: 2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  <20202021  2022  2023  2024 
Navigate by Thread:
Prev: RE: Bug in epicsMessageQueue? Mark Rivers via Core-talk
Next: Re: Bug in epicsMessageQueue? Michael Davidsaver via Core-talk
Index: 2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  <20202021  2022  2023  2024 
ANJ, 20 Mar 2020 Valid HTML 4.01! · Home · News · About · Base · Modules · Extensions · Distributions · Download ·
· Search · EPICS V4 · IRMIS · Talk · Bugs · Documents · Links · Licensing ·