EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System

2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  <20212022  2023  2024  Index 2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  <20212022  2023  2024 
<== Date ==> <== Thread ==>

Subject: RE: valgrind errors in base 7.0.5
From: Mark Rivers via Core-talk <core-talk at aps.anl.gov>
To: 'Michael Davidsaver' <mdavidsaver at gmail.com>
Cc: EPICS core-talk <core-talk at aps.anl.gov>
Date: Wed, 7 Apr 2021 23:05:46 +0000
I have an IOC that appears to start fine.  

However, as soon as I open an medm screen the IOC dies with this stack trace.

(gdb) bt
#0  0x00007fa7e2247c80 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x0000000000eb6826 in mutexLock (id=0x0) at ../osi/os/posix/osdMutex.c:116
#2  epicsMutexOsdLock (pmutex=0x0) at ../osi/os/posix/osdMutex.c:163
#3  0x0000000000e3a13d in db_event_enable (event=0x7fa6bc1d9ed0) at ../db/dbEvent.c:493
#4  0x0000000000e630fd in event_add_action (mp=0x7fa6d24d6d10, pPayload=0x7fa6c0038d98, client=0x7fa6c4000f88) at ../rsrv/camessage.c:1886
#5  0x0000000000e6475d in camessage (client=client@entry=0x7fa6c4000f88) at ../rsrv/camessage.c:2547
#6  0x0000000000e61490 in camsgtask (pParm=0x7fa6c4000f88) at ../rsrv/camsgtask.c:117
#7  0x0000000000eb40ac in start_routine (arg=0x7fa6c0040e40) at ../osi/os/posix/osdThread.c:412
#8  0x00007fa7e2245e25 in start_thread () from /lib64/libpthread.so.0
#9  0x00007fa7e10febad in clone () from /lib64/libc.so.6

When I run the IOC under valgrind it reports no problems and the crash did not occur, possibly because it is a race condition and valgrind changes the timing.

This is base 7.0.5 on Linux.

What is strange is that this is now 100% reproducible, which does not sound like a race condition.

My IOC application has not changed significantly from when it was working.  I have installed a new version of the vendor library, so I guess it is possible that that is trashing memory?

Mark



-----Original Message-----
From: Michael Davidsaver <mdavidsaver at gmail.com> 
Sent: Friday, April 2, 2021 8:50 PM
To: Mark Rivers <rivers at cars.uchicago.edu>
Cc: EPICS core-talk <core-talk at aps.anl.gov>; Johnson, Andrew N. <anj at anl.gov>
Subject: Re: valgrind errors in base 7.0.5

On 4/2/21 5:43 PM, Michael Davidsaver wrote:
> Mark,
> 
> On 4/2/21 4:28 PM, Johnson, Andrew N. via Core-talk wrote:
>> Hi Mark,
>>
>> On Apr 2, 2021, at 4:33 PM, Mark Rivers via Core-talk <core-talk at aps.anl.gov> wrote:
>>>
>>> I was tracking down some errors in an IOC with valgrind.  I found and fixed the errors in my drivers.  But I am seeing some errors in CAS-client and CAS-event.
>>
>> The valgrind tools are very useful, but they can throw up a number of false positives inside EPICS, and the event you posted below is likely to be one of those since it’s talking about an area of memory that was allocated by the EPICS freeList module which manages memory allocations that are all the same size. This technique was essential with older versions of the VxWorks memory allocator which fragmented the free memory pool and thus prevented IOCs from running for long periods of time.
>>
>> Michael Davidsaver may have some additional advice, but unless you’re seeing crashes associated with these I would not be too concerned about them. He has annotated many of the internal routines to teach Valgrind about the routines in Base that allocate and deallocate memory, but an “Invalid read” might be related to thread ownership and I don’t know whether there are annotations for that kind of event.
> 
> I think this is a false positive, but not one I've seen before.  So thanks for reporting it.

Actually, this looks like a race leading to a real use-after-free which is being "covered up" by the free list.

>>> ==146728== Thread 25 CAS-client:
>>> ==146728== Invalid read of size 8
>>> ==146728==    at 0xE37113: db_close_events (dbEvent.c:378)

In dbEvent.c  db_close_events()

>     /* notify the waiting task */
>     epicsEventSignal(evUser->ppendsem);
> 
>     if(evUser->taskid)  // <- line 378
>         epicsThreadMustJoin(evUser->taskid);
>     /* evUser has been deleted by the worker */

Since event_task() is deleting evUser, it looks like everything from the epicsEventSignal() onwards is a race.  event_task() also deletes ppendsem, which could happen before
epicsEventSignal() returns.  So I don't think it would be enough to save 'evUser->taskid'
before signaling.


git-blame's me for this with 37a76b433a9e7d5a8d26a13fd21ad62f20a0c1c1 prior to 7.0.3.1

I recall doing this as a clever way to handle epicsThreadMustJoin() being a no-op on older vxWorks.


Details on the valgrind error

From memcheck/mc_errors.c above where "recently re-allocated block" is printed.

>    /* Blocks allocated by memcheck malloc functions are either
>       on the recently freed list or on the malloc-ed list.
>       Custom blocks can be on both : a recently freed block might
>       have been just re-allocated.
>       So, first search the malloc-ed block, as the most recent
>       block is the probable cause of error.
>       We however detect and report that this is a recently re-allocated
>       block. */
>    /* -- Search for a currently malloc'd block which might bracket it. 
> -- */


>>> ==146728==  Address 0x1638ee68 is 12,120 bytes inside a recently 
>>> re-allocated block of size 19,464 alloc'd

From dbEvent.c

> freeListInitPvt(&dbevEventUserFreeList,
>             sizeof(struct event_user),8);

On amd64 sizeof(event_user)==2424 and offsetof(event_user, taskid)==2384

From freeListLib.c

>        ptemp = (void 
> *)malloc(pfl->nmalloc*(pfl->size+REDZONE)+REDZONE);

So each free list block is 8*(2424+8)+8 = 19464 bytes, so that tracks.

And 4*(2424+8)+8 + 2384 = 12120  so this matches an access of evUser->taskid


This access is in the 4th slot of this block.  The mention of QSRV is coincidental.
That happens to be the place where the block was allocated, and so where the 1st slot was (and probably still is) used.




> I don't suppose you can replicate this with softIocPVA?
> 
> 
> details...
> 
> "Invalid read" is an access to memory which valgrind considers 
> unallocated.  This is often followed by "Address ... is not stack'd, malloc'd or (recently) free'd", and then a SIGSEGV.
> 
> "... inside a recently re-allocated block ..." is not a message I've 
> seen before, and I don't find many references to it.  I also don't see 
> it myself, at least with softIocPVA and some camonitor/pvmonitor.
> 
> My initial guess is that this is something to do with how the 
> annotations in freeListLib are being (mis)handled [1].  Possibly related to valgrind version.  I currently have:
> 
>> $ valgrind --version
>> valgrind-3.14.0
> 
> 
> 
> [1] 
> https://github.com/epics-base/epics-base/blob/c829ca39d873d59947d851a7
> b22d74d14748e336/modules/libcom/src/freeList/freeListLib.c#L88-L94
> 
>  Also see the VALGRIND_MEMPOOL_*() manipulations further down.
> 
> 
> 
>> - Andrew
>>
>>
>>
>>>  
>>> ==146728== Thread 25 CAS-client:
>>> ==146728== Invalid read of size 8
>>> ==146728==    at 0xE37113: db_close_events (dbEvent.c:378)
>>> ==146728==    by 0xE5DB92: destroy_tcp_client (caservertask.c:1217)
>>> ==146728==    by 0xE5E77D: camsgtask (camsgtask.c:161)
>>> ==146728==    by 0xEB131B: start_routine (osdThread.c:412)
>>> ==146728==    by 0x5EC7E24: start_thread (in /usr/lib64/libpthread-2.17.so)
>>> ==146728==    by 0x704BBAC: clone (in /usr/lib64/libc-2.17.so)
>>> ==146728==  Address 0x1638ee68 is 12,120 bytes inside a recently re-allocated block of size 19,464 alloc'd
>>> ==146728==    at 0x4C29C23: malloc (vg_replace_malloc.c:299)
>>> ==146728==    by 0xE9FBA1: freeListMalloc (freeListLib.c:95)
>>> ==146728==    by 0xE9FCBD: freeListCalloc (freeListLib.c:68)
>>> ==146728==    by 0xE36F71: db_init_events (dbEvent.c:297)
>>> ==146728==    by 0x64E0CB: PDBProvider::PDBProvider(std::tr1::shared_ptr<epics::pvAccess::Configuration const> const&) (pdb.cpp:542)
>>> ==146728==    by 0x64665D: epics::pvAccess::SingletonChannelProviderFactory<PDBProvider>::sharedInstance() (pvAccess.h:1412)
>>> ==146728==    by 0x7073A5: epics::pvAccess::ChannelProviderRegistry::getProvider(std::string const&) (ChannelAccessFactory.cpp:48)
>>> ==146728==    by 0x7495AE: epics::pvAccess::ServerContextImpl::loadConfiguration() (serverContext.cpp:162)
>>> ==146728==    by 0x74A3FA: epics::pvAccess::ServerContext::create(epics::pvAccess::ServerContext::Config const&) (serverContext.cpp:597)
>>> ==146728==    by 0x6E27B9: (anonymous namespace)::startitup() (PVAServerRegister.cpp:46)
>>> ==146728==    by 0x6E31F4: startPVAServer (PVAServerRegister.cpp:71)
>>> ==146728==    by 0x6E31F4: void epics::detail::call1<char const*, &(anonymous namespace)::startPVAServer>(iocshArgBuf const*) (iocshelper.h:83)
>>> ==146728==    by 0xEA1C1F: iocshBody (iocsh.cpp:938)
>>> ==146728==
>>>  
>>> ==146728== Thread 21 CAS-event:
>>> ==146728== Syscall param socketcall.sendto(msg) points to uninitialised byte(s)
>>> ==146728==    at 0x5ECECFB: send (in /usr/lib64/libpthread-2.17.so)
>>> ==146728==    by 0xE67479: cas_send_bs_msg (caserverio.c:68)
>>> ==146728==    by 0xE67A98: cas_copy_in_header (caserverio.c:287)
>>> ==146728==    by 0xE5F74B: read_reply (camessage.c:513)
>>> ==146728==    by 0xE38084: event_read (dbEvent.c:959)
>>> ==146728==    by 0xE38084: event_task (dbEvent.c:1038)
>>> ==146728==    by 0xEB131B: start_routine (osdThread.c:412)
>>> ==146728==    by 0x5EC7E24: start_thread (in /usr/lib64/libpthread-2.17.so)
>>> ==146728==    by 0x704BBAC: clone (in /usr/lib64/libc-2.17.so)
>>> ==146728==  Address 0x1d768584 is 36 bytes inside a block of size 135,168 alloc'd
>>> ==146728==    at 0x4C29C23: malloc (vg_replace_malloc.c:299)
>>> ==146728==    by 0xE5E029: casExpandBuffer (caservertask.c:1341)
>>> ==146728==    by 0xE5E029: casExpandSendBuffer (caservertask.c:1392)
>>> ==146728==    by 0xE67A4A: cas_copy_in_header (caserverio.c:275)
>>> ==146728==    by 0xE5F74B: read_reply (camessage.c:513)
>>> ==146728==    by 0xE38084: event_read (dbEvent.c:959)
>>> ==146728==    by 0xE38084: event_task (dbEvent.c:1038)
>>> ==146728==    by 0xEB131B: start_routine (osdThread.c:412)
>>> ==146728==    by 0x5EC7E24: start_thread (in /usr/lib64/libpthread-2.17.so)
>>> ==146728==    by 0x704BBAC: clone (in /usr/lib64/libc-2.17.so)
>>> ==146728==
>>>  
>>> Base 7.0.5
>>> Centos 7
>>> gcc 4.8.5
>>>  
>>> Mark
>>
> 


Replies:
Re: valgrind errors in base 7.0.5 Michael Davidsaver via Core-talk
References:
valgrind errors in base 7.0.5 Mark Rivers via Core-talk
Re: valgrind errors in base 7.0.5 Johnson, Andrew N. via Core-talk
Re: valgrind errors in base 7.0.5 Michael Davidsaver via Core-talk
Re: valgrind errors in base 7.0.5 Michael Davidsaver via Core-talk

Navigate by Date:
Prev: Build failed in Jenkins: 3.15-pvDatabase-win64 #354 APS Jenkins via Core-talk
Next: Re: valgrind errors in base 7.0.5 Michael Davidsaver via Core-talk
Index: 2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  <20212022  2023  2024 
Navigate by Thread:
Prev: Re: valgrind errors in base 7.0.5 Michael Davidsaver via Core-talk
Next: Re: valgrind errors in base 7.0.5 Michael Davidsaver via Core-talk
Index: 2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  <20212022  2023  2024 
ANJ, 07 Apr 2021 Valid HTML 4.01! · Home · News · About · Base · Modules · Extensions · Distributions · Download ·
· Search · EPICS V4 · IRMIS · Talk · Bugs · Documents · Links · Licensing ·