EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System

1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  <20222023  2024  Index 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  <20222023  2024 
<== Date ==> <== Thread ==>

Subject: Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules
From: Till Straumann via Tech-talk <tech-talk at aps.anl.gov>
To: Mark Rivers <rivers at cars.uchicago.edu>, Michael Davidsaver <mdavidsaver at gmail.com>, Torsten Bögershausen <Torsten.Bogershausen at ess.eu>, Benjamin Franksen <benjamin.franksen at helmholtz-berlin.de>
Cc: "tech-talk at aps.anl.gov" <tech-talk at aps.anl.gov>
Date: Wed, 1 Jun 2022 16:34:17 +0200
Hi Mark.

I'm sorry I missed the earlier post where you had indeed added the 'ti' information.

Could you please provide 'dbEvent.o' (from commit 85822f305) ?

Things I would try:

 - check if the bus error is 'real', i.e., try to access the reported VME address e.g., from the console
   (keeping in mind that A32 as seen from the CPU may be at an offset - depending on OS/BSP).
 - if this is the case then you may deal (as suggested earlier by  Michael, iirc) that the error is reported
   in a delayed fashion. I would try to insert some 'no-op' just before the call to freeListFree() (make
   sure it is not optimized away) and see where the exception happens.
 - add a 'sanity-check' of 'pfl' and 'pfl->u.r.dtor' in 'db_delete_field_log()' and report if any of these
   addresses is within a VME window.

I have seen similar things (late-reported bus error) but often we would then find the fault address
in one of the registers (or a combination of some 'base' + 'offset' in a couple of them) but I can't
see that in your register dump. It is still possible that it was in one of the caller-saved registers
and got wiped when the call to 'freeListFree()' was set up.

Cheers
- T.


On 6/1/22 14:50, Mark Rivers via Tech-talk wrote:

I agree that a debugger would be very helpful.  I don't have that tool, but I think others at the APS do.  I'll find out when Andrew returns next week.  Meanwhile git bisect can help point in the right direction.

 

I tested this commit 85822f3051d2236144bb46dc2c24b7e38143e531 with soft records overnight.  It ran for 32,000 seconds without failing before I stopped it.

 

The next commit, 56f05d722dee4b8ca2968b8bface2737a3a9b185, fails in less than 10 seconds in each of 3 tests.

 

I just tested with the IPAC hardware again with commit 85822f3051d2236144bb46dc2c24b7e38143e531 (which does not fail with soft records).  It failed after 960 seconds with this error:

 

VME Bus Error accessing A32: 0xbffcf67c

machine check

Exception next instruction address: 0x03326a90

Machine Status Register: 0x0008b032

Condition Register: 0x42000884

Task: 0x3559330 "CAS-event"

0x3559330 (CAS-event): task 0x3559330 has had a failure and has been stopped.

0x3559330 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.

 

This is the task trace:

 

iocexample> tt  0x3559330

0x0012489c vxTaskEntry  +0x48 : epicsThreadEntry ()

0x0334501c epicsThreadEntry+0x80 : 0x032a17f0 ()

0x032a1acc db_start_events+0x450: db_delete_field_log ()

0x032a0fe4 db_delete_field_log+0x54 : freeListFree ()

value = 0 = 0x0

 

This tells us pretty clearly that the failure as in freeListFree, called by db_delete_field_log.  That has been a very common stack trace for these errors.

 

Till complained that vxWorks did not provide a register dump, but that is not true, it is available with the “ti” command.  I had included that in several previous messages.  Here it is for this task.

 

iocexample> ti  0x3559330

 

  NAME         ENTRY       TID    PRI   STATUS      PC       SP     ERRNO  DELAY

----------  ------------ -------- --- ---------- -------- -------- ------- -----

CAS-event   epicsThread>  3559330 133 STOP+I      3326a90  3559240       0     0

 

full task name : CAS-event

task entry     : epicsThreadEntry

process        : kernel

options        : 0x1009001

VX_SUPERVISOR_MODE  VX_DEALLOC_TCB      VX_FP_TASK          VX_DEALLOC_EXC_STACK

 

STACK      BASE     END       SP      SIZE    HIGH   MARGIN

--------- -------- -------- -------- ------- ------- -------

execution  3559330  3556450  3559240   12000    2800    9200

exception  2700210  26ff3a0             3696    1344    2352

 

VxWorks Events

--------------

Events Pended on    : Not Pended

Received Events     : 0x0

Options             : N/A

 

r0         = 0x40000000   sp         = 0x03559240   r2         = 0x002f1ac0

r3         = 0x03492ce0   r4         = 0x0357bec8   r5         = 0x00000001

r6         = 0x00000001   r7         = 0x00000000   r8         = 0xeefceefd

r9         = 0x03326a90   r10        = 0x00000000   r11        = 0x035591b0

r12        = 0x20000884   r13        = 0x00321e60   r14        = 0x033b6098

r15        = 0x03373c64   r16        = 0x03289478   r17        = 0x033424f0

r18        = 0x03373bdc   r19        = 0x03373c1c   r20        = 0x032a0ff8

r21        = 0x032a0f90   r22        = 0x00000000   r23        = 0x00000001

r24        = 0x033b609c   r25        = 0x03339340   r26        = 0x033390e0

r27        = 0x03443b60   r28        = 0x0357e748   r29        = 0x03567d78

r30        = 0x032dad44   r31        = 0x0357bec8   msr        = 0x0008b032

lr         = 0x032a0fe4   ctr        = 0x03326a90   pc         = 0x03326a90

cr         = 0x42000884   xer        = 0x00000000   pgTblPtr   = 0x02310000

scSrTblPtr = 0x0230fe74   srTblPtr   = 0x0230fe74

 

fpcsr  =        0

fr0    = -1.81018   fr1    =      NaN   fr2    =      NaN   fr3    =      NaN

fr4    =      NaN   fr5    =      NaN   fr6    =      NaN   fr7    =      NaN

fr8    =      NaN   fr9    =      NaN   fr10   =      NaN   fr11   =      NaN

fr12   =      NaN   fr13   =      NaN   fr14   =      NaN   fr15   =      NaN

fr16   =      NaN   fr17   =      NaN   fr18   =      NaN   fr19   =      NaN

fr20   =      NaN   fr21   =      NaN   fr22   =      NaN   fr23   =      NaN

fr24   =      NaN   fr25   =      NaN   fr26   =      NaN   fr27   =      NaN

fr28   =      NaN   fr29   =      NaN   fr30   =      NaN   fr31   =      NaN

 

machine checkvalue =

Exception next instruction address: 0x003326a90

Machine Status Register: 0x = 0008b032

Condition Register: 0x420008840x

0

 

And here is the disassembly for freeListFree, where I have highlighted in red the "next instruction" above.

 

                        freeListFree:

0x3326a90  9421fff0    stwu        r1,-16(r1)

0x3326a94  7c0802a6    mfspr       r0,LR

0x3326a98  3d200334    lis         r9,0x334 # 820

0x3326a9c  392990e0    addi        r9,r9,0x90e0 # -28448

0x3326aa0  7d2903a6    mtspr       CTR,r9

0x3326aa4  90010014    stw         r0,20(r1)

0x3326aa8  93c10008    stw         r30,8(r1)

0x3326aac  7c9e2378    or          r30,r4,r4

0x3326ab0  93e1000c    stw         r31,12(r1)

0x3326ab4  7c7f1b78    or          r31,r3,r3

0x3326ab8  80630014    lwz         r3,20(r3)

0x3326abc  4e800421    bcctrl      0x14,0

0x3326ac0  2f830000    cmpi        crf7,0,r3,0x0 # 0

0x3326ac4  41be002c    bc          0xd,30, 0x3326af0 # 0x03326af0

0x3326ac8  3d200334    lis         r9,0x334 # 820

0x3326acc  3c600338    lis         r3,0x338 # 824

0x3326ad0  3ca00338    lis         r5,0x338 # 824

0x3326ad4  392924f0    addi        r9,r9,0x24f0 # 9456

0x3326ad8  386335a4    addi        r3,r3,0x35a4 # 13732

0x3326adc  38a535c0    addi        r5,r5,0x35c0 # 13760

0x3326ae0  7d2903a6    mtspr       CTR,r9

0x3326ae4  3880008f    li          r4,0x8f # 143

0x3326ae8  38c00000    li          r6,0x0 # 0

0x3326aec  4e800421    bcctrl      0x14,0

0x3326af0  813f0010    lwz         r9,16(r31)

0x3326af4  801f0008    lwz         r0,8(r31)

0x3326af8  39290001    addi        r9,r9,0x1 # 1

0x3326afc  807f0014    lwz         r3,20(r31)

0x3326b00  913f0010    stw         r9,16(r31)

0x3326b04  3d200334    lis         r9,0x334 # 820

0x3326b08  39299340    addi        r9,r9,0x9340 # -27840

0x3326b0c  901e0000    stw         r0,0(r30)

0x3326b10  7d2903a6    mtspr       CTR,r9

0x3326b14  93df0008    stw         r30,8(r31)

0x3326b18  4e800421    bcctrl      0x14,0

0x3326b1c  80010014    lwz         r0,20(r1)

0x3326b20  83c10008    lwz         r30,8(r1)

0x3326b24  7c0803a6    mtspr       LR,r0

0x3326b28  83e1000c    lwz         r31,12(r1)

0x3326b2c  38210010    addi        r1,r1,0x10 # 16

0x3326b30  4e800020    blr

 

 

This is the source code for freeListFree in this tag of base.

 

 

LIBCOM_API void epicsStdCall freeListFree(void *pvt,void*pmem)

{

    FREELISTPVT *pfl = pvt;

#   ifdef EPICS_FREELIST_DEBUG

    memset ( pmem, 0xdd, pfl->size );

    free(pmem);

#   else

    void        **ppnext;

 

    VALGRIND_MEMPOOL_FREE(pvt, pmem);

    VALGRIND_MEMPOOL_ALLOC(pvt, pmem, sizeof(void*));

 

    epicsMutexMustLock(pfl->lock);

    ppnext = pmem;

    *ppnext = pfl->head;

    pfl->head = pmem;

    pfl->nBlocksAvailable++;

    epicsMutexUnlock(pfl->lock);

#   endif

}

 

The last thing I will try is to git bisect from this commit 85822f3051d2236144bb46dc2c24b7e38143e531 to R7.0.5 to see where the hardware tests first begin to fail.

 

Mark

 

 

-----Original Message-----
From: Michael Davidsaver <mdavidsaver at gmail.com>
Sent: Wednesday, June 1, 2022 1:40 AM
To: Torsten Bögershausen <Torsten.Bogershausen at ess.eu>; Mark Rivers <rivers at cars.uchicago.edu>; Benjamin Franksen <benjamin.franksen at helmholtz-berlin.de>
Cc: tech-talk at aps.anl.gov
Subject: Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules

 

On 5/31/22 07:00, Torsten Bögershausen wrote:

> Hej Mark,

>

> So R7.0.5 is good, and f9ea6a5bff695c5f88bb95dce38a3fd349738907 is bad ?

>

> There are some "real" commits, and merges:

> git log R7.0.5..f9ea6a5bff695c5f88bb95dce38a3fd349738907

>

> Then it could make sense, to bisect between those 2?

>

> Another question:

> Coluld it make sense to run the SW (even more stripped may be) under

> Linux instead with valgrind ?

 

An RTOS without a debugger is just about the worst situation I can imagine to troubleshoot apparent memory corruption.  (excepting maybe having no console, or it being in orbit)

 

I've made an attempt to run the database Mark describes with

softIoc+valgrind, and a spam-y python script.  Valgrind doesn't

flag any access violates.

 

The threading checker ("valgrind --tool=helgrind") does flag a possible race which might be relevant.  Or it might be a false positive.  I don't feel that I understand dbEvent.c well enough to say anything conclusive this late at night.

 

 

> Possible data race during write of size 8 at 0x5197520 by thread #27

> Locks held: 3, at addresses 0x51413A0 0x5143A60 0x5197A10

>    at 0x48CA60A: db_queue_event_log (dbEvent.c:824)

>    by 0x48CA732: db_post_events (dbEvent.c:892)

>    by 0x48595EE: monitor (aoRecord.c:532)

>    by 0x48595EE: process (aoRecord.c:232)

>    by 0x48B2C6B: dbProcess (dbAccess.c:608)

>    by 0x48B4BFF: dbPutField (dbAccess.c:1278)

>    by 0x48CE9DD: dbChannel_put (db_access.c:923)

>    by 0x48F6C9A: write_action (camessage.c:799)

>    by 0x48F7727: camessage (camessage.c:2546)

>    by 0x48F351F: camsgtask (camsgtask.c:116)

>    by 0x495E094: start_routine (osdThread.c:439)

>    by 0x483F876: mythread_wrapper (hg_intercepts.c:387)

>    by 0x4F1EEA6: start_thread (pthread_create.c:477)

>

> This conflicts with a previous read of size 8 by thread #21 Locks

> held: none

>    at 0x48CA99D: event_read (dbEvent.c:999)

>    by 0x48CA99D: event_task (dbEvent.c:1078)

>    by 0x495E094: start_routine (osdThread.c:439)

>    by 0x483F876: mythread_wrapper (hg_intercepts.c:387)

>    by 0x4F1EEA6: start_thread (pthread_create.c:477)

>    by 0x4CBADEE: clone (clone.S:95)

>  Address 0x5197520 is 18,416 bytes inside a block of size 19,528 alloc'd

>    at 0x48397CF: malloc (vg_replace_malloc.c:307)

>    by 0x494DBE1: freeListMalloc (freeListLib.c:95)

>    by 0x494DCE8: freeListCalloc (freeListLib.c:68)

>    by 0x48C95B2: db_init_events (dbEvent.c:304)

>    by 0x48D653C: dbContext::subscribe(epicsGuard<epicsMutex>&, dbChannel*, dbChannelIO&, unsigned int, unsigned long, unsigned int, cacStateNotify&, unsigned int*) (dbContext.cpp:221)

>    by 0x48D7468: dbChannelIO::subscribe(epicsGuard<epicsMutex>&, unsigned int, unsigned long, unsigned int, cacStateNotify&, unsigned int*) (dbChannelIO.cpp:119)

>    by 0x4EF7771: oldSubscription::oldSubscription(epicsGuard<epicsMutex>&, oldChannelNotify&, cacChannel&, unsigned int, unsigned long, unsigned int, void (*)(event_handler_args), void*, oldSubscription**) (oldSubscription.cpp:43)

>    by 0x4EF7099: ca_create_subscription (oldChannelNotify.cpp:573)

>    by 0x48D2BE8: dbCaTask (dbCa.c:1249)

>    by 0x495E094: start_routine (osdThread.c:439)

>    by 0x483F876: mythread_wrapper (hg_intercepts.c:387)

>    by 0x4F1EEA6: start_thread (pthread_create.c:477)  Block was

> alloc'd by thread #9

 

https://github.com/epics-base/epics-base/blob/3fadf4a26cfe33dcb9eb9e4620634e1d3a7b9763/modules/database/src/ioc/db/dbEvent.c#L824

 

https://github.com/epics-base/epics-base/blob/3fadf4a26cfe33dcb9eb9e4620634e1d3a7b9763/modules/database/src/ioc/db/dbEvent.c#L999

 

 

> $ cat mr-regress.db

> record(ao,"testAo_0") {}

> record(ao,"testAo_1") {}

> record(ao,"testAo_2") {}

> record(ao,"testAo_3") {}

> record(ao,"testAo_4") {}

> record(ao,"testAo_5") {}

> record(ao,"testAo_6") {}

> record(ao,"testAo_7") {}

> record(ao,"testAo_8") {}

> record(ao,"testAo_9") {}

> record(ai,"testAi_0") {field(INP, "testAo_0 CP") }

> record(ai,"testAi_1") {field(INP, "testAo_1 CP") }

> record(ai,"testAi_2") {field(INP, "testAo_2 CP") }

> record(ai,"testAi_3") {field(INP, "testAo_3 CP") }

> record(ai,"testAi_4") {field(INP, "testAo_4 CP") }

> record(ai,"testAi_5") {field(INP, "testAo_5 CP") }

> record(ai,"testAi_6") {field(INP, "testAo_6 CP") }

> record(ai,"testAi_7") {field(INP, "testAo_7 CP") }

> record(ai,"testAi_8") {field(INP, "testAo_8 CP") }

> record(ai,"testAi_9") {field(INP, "testAo_9 CP") }

 

 

> $ valgrind --tool=helgrind ./bin/linux-x86_64/softIoc -d mr-regress.db

 

 

> $ ./bin/linux-x86_64/caput testAo_0 1.0

> $ ./bin/linux-x86_64/caput testAo_0 2.0



Replies:
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
References:
Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Michael Davidsaver via Tech-talk
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Michael Davidsaver via Tech-talk
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Michael Davidsaver via Tech-talk
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Michael Davidsaver via Tech-talk
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Torsten Bögershausen via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Michael Davidsaver via Tech-talk
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk

Navigate by Date:
Prev: RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Next: Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Michael Davidsaver via Tech-talk
Index: 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  <20222023  2024 
Navigate by Thread:
Prev: RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Next: RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Index: 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  <20222023  2024 
ANJ, 14 Sep 2022 Valid HTML 4.01! · Home · News · About · Base · Modules · Extensions · Distributions · Download ·
· Search · EPICS V4 · IRMIS · Talk · Bugs · Documents · Links · Licensing ·