EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System

1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  <20222023  2024  2025  Index 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  <20222023  2024  2025 
<== Date ==> <== Thread ==>

Subject: RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules
From: Mark Rivers via Tech-talk <tech-talk at aps.anl.gov>
To: Michael Davidsaver <mdavidsaver at gmail.com>, Torsten Bögershausen <Torsten.Bogershausen at ess.eu>, Benjamin Franksen <benjamin.franksen at helmholtz-berlin.de>
Cc: "tech-talk at aps.anl.gov" <tech-talk at aps.anl.gov>
Date: Wed, 1 Jun 2022 17:54:25 +0000

Ø  The last thing I will try is to git bisect from this commit 85822f3051d2236144bb46dc2c24b7e38143e531 to R7.0.5 to see where the hardware tests first begin to fail.

 

I have now done that.  I started with this:

 

git bisect start 85822f3051d2236144bb46dc2c24b7e38143e531 R7.0.5

 

The results were [good, bad, good].  I defined good as running for at least 3600 seconds, while the bad failed in 128 seconds.

 

The fail on the second pass was this:

 

iocexample>

VME Bus Error accessing A32: 0xbfa8c418

machine check

Exception next instruction address: 0x03326ae8

Machine Status Register: 0x0008b032

Condition Register: 0x40000882

Task: 0x3585870 "CAS-event"

0x3585870 (CAS-event): task 0x3585870 has had a failure and has been stopped.

0x3585870 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.

 

iocexample>

iocexample> tt 0x3585870

0x0012489c vxTaskEntry  +0x48 : epicsThreadEntry ()

0x03345014 epicsThreadEntry+0x80 : 0x032a17e8 ()

0x032a1ac4 db_start_events+0x450: db_delete_field_log ()

0x032a0fdc db_delete_field_log+0x54 : freeListFree ()

value = 0 = 0x0

iocexample>

iocexample> ti 0x3585870

 

  NAME         ENTRY       TID    PRI   STATUS      PC       SP     ERRNO  DELAY

----------  ------------ -------- --- ---------- -------- -------- ------- -----

CAS-event   epicsThread>  3585870 133 STOP+I      3326ae8  3585770       0     0

 

full task name : CAS-event

task entry     : epicsThreadEntry

process        : kernel

options        : 0x1009001

VX_SUPERVISOR_MODE  VX_DEALLOC_TCB      VX_FP_TASK          VX_DEALLOC_EXC_STACK

 

STACK      BASE     END       SP      SIZE    HIGH   MARGIN

--------- -------- -------- -------- ------- ------- -------

execution  3585870  3582990  3585770   12000    2800    9200

exception  35869c0  3585b50             3696    1344    2352

 

VxWorks Events

--------------

Events Pended on    : Not Pended

Received Events     : 0x0

Options             : N/A

 

r0         = 0x03326ab8   sp         = 0x03585770   r2         = 0x002f1ac0

r3         = 0x00000000   r4         = 0x0033b448   r5         = 0x00000001

r6         = 0x00000001   r7         = 0x00000001   r8         = 0x0353a530

r9         = 0xffffffff   r10        = 0x00000028   r11        = 0x035856f0

r12        = 0x20000884   r13        = 0x00321e60   r14        = 0x033b6088

r15        = 0x03373c5c   r16        = 0x03289470   r17        = 0x033424e8

r18        = 0x03373bd4   r19        = 0x03373c14   r20        = 0x032a0ff0

r21        = 0x032a0f88   r22        = 0x00000000   r23        = 0x00000001

r24        = 0x033b608c   r25        = 0x03339338   r26        = 0x033390d8

r27        = 0x034426f8   r28        = 0x0230aa00   r29        = 0x03555be8

r30        = 0x0356a558   r31        = 0x03492cf0   msr        = 0x0008b032

lr         = 0x03326ab8   ctr        = 0x0026ecc0   pc         = 0x03326ae8

cr         = 0x40000882   xer        = 0x00000000   pgTblPtr   = 0x02310000

scSrTblPtr = 0x0230fe74   srTblPtr   = 0x0230fe74

 

fpcsr  =        0

fr0    = -0.0483711   fr1    =      NaN   fr2    =      NaN   fr3    =      NaN

fr4    =      NaN   fr5    =      NaN   fr6    =      NaN   fr7    =      NaN

fr8    =      NaN   fr9    =      NaN   fr10   =      NaN   fr11   =      NaN

fr12   =      NaN   fr13   =      NaN   fr14   =      NaN   fr15   =      NaN

fr16   =      NaN   fr17   =      NaN   fr18   =      NaN   fr19   =      NaN

fr20   =      NaN   fr21   =      NaN   fr22   =      NaN   fr23   =      NaN

fr24   =      NaN   fr25   =      NaN   fr26   =      NaN   fr27   =      NaN

fr28   =      NaN   fr29   =      NaN   fr30   =      NaN   fr31   =      NaN

 

machine checkvalue =

Exception next instruction address: 0x003326ae8

= Machine Status Register: 0x0x0008b0320

Condition Register: 0x

40000882

 

git bisect reported this as the final result:

 

corvette:local/epics-devel/base-7.0.6>git bisect good

27fe3e4468ec62d4d17f775c4aced939a666ba36 is the first bad commit

commit 27fe3e4468ec62d4d17f775c4aced939a666ba36

Author: Ben Franksen <benjamin.franksen at helmholtz-berlin.de>

Date:   Mon Mar 30 21:34:32 2020 +0200

 

    refactor db_field_log and filters to get rid of dbfl_type_rec

 

    This refactor simplifies and streamlines the code associated with server

    side filters. Apart from immediate benefits (clearer code, less duplication)

    it is also hoped that this will make it easier to add write filters.

 

    The data pointer dbfl_ref.field can now either point to a copy owned by a

    filter, or it can point to the original data owned by a record. In the

    latter case, the dbfl_ref.dtor is NULL.

 

    The dbExtractArray* functions are unified to the single function

    dbExtractArray and stripped of conversion functionality. This is redundant

    because we always call dbGet after applying filters, which takes care of

    conversion. Accordingly, dbChannelMakeArrayCopy is now obsolete and its

    single use (in the ts filter) replaced with dbExtractArray. Instead, we add

    the helper function dbChannelGetArrayInfo to wrap the common boilerplate

    around calls to the get_array_info method, used in both arr.c and ts.c.

 

:040000 040000 eb0e8326fbd16b9bfe5c9316ebaa105ab74f91f0 3c6d6c0236ffe92a118ebf23a184affffc0cd055 M      modules

 

So my git bisect tests show the following:

-          27fe3e4468ec62d4d17f775c4aced939a666ba36 is the first commit that causes the tests with hardware device support to fail

-          56f05d722dee4b8ca2968b8bface2737a3a9b185 is the first commit that causes the tests with soft device support to fail

 

Both of those commits were part of this PR https://github.com/epics-base/epics-base/pull/99

 

One of the comments on that PR (January 27, 2021) was:

We expect to be nervous about this change because of the complex area it's working in, so we will merge it after the next release which should be before the Codeathon in March, but we don't see anything obviously wrong with it.

 

Mark

 

 

From: Mark Rivers
Sent: Wednesday, June 1, 2022 7:50 AM
To: Michael Davidsaver <mdavidsaver at gmail.com>; Torsten Bögershausen <Torsten.Bogershausen at ess.eu>; Benjamin Franksen <benjamin.franksen at helmholtz-berlin.de>
Cc: tech-talk at aps.anl.gov
Subject: RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules

 

I agree that a debugger would be very helpful.  I don't have that tool, but I think others at the APS do.  I'll find out when Andrew returns next week.  Meanwhile git bisect can help point in the right direction.

 

I tested this commit 85822f3051d2236144bb46dc2c24b7e38143e531 with soft records overnight.  It ran for 32,000 seconds without failing before I stopped it.

 

The next commit, 56f05d722dee4b8ca2968b8bface2737a3a9b185, fails in less than 10 seconds in each of 3 tests.

 

I just tested with the IPAC hardware again with commit 85822f3051d2236144bb46dc2c24b7e38143e531 (which does not fail with soft records).  It failed after 960 seconds with this error:

 

VME Bus Error accessing A32: 0xbffcf67c

machine check

Exception next instruction address: 0x03326a90

Machine Status Register: 0x0008b032

Condition Register: 0x42000884

Task: 0x3559330 "CAS-event"

0x3559330 (CAS-event): task 0x3559330 has had a failure and has been stopped.

0x3559330 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.

 

This is the task trace:

 

iocexample> tt  0x3559330

0x0012489c vxTaskEntry  +0x48 : epicsThreadEntry ()

0x0334501c epicsThreadEntry+0x80 : 0x032a17f0 ()

0x032a1acc db_start_events+0x450: db_delete_field_log ()

0x032a0fe4 db_delete_field_log+0x54 : freeListFree ()

value = 0 = 0x0

 

This tells us pretty clearly that the failure as in freeListFree, called by db_delete_field_log.  That has been a very common stack trace for these errors.

 

Till complained that vxWorks did not provide a register dump, but that is not true, it is available with the “ti” command.  I had included that in several previous messages.  Here it is for this task.

 

iocexample> ti  0x3559330

 

  NAME         ENTRY       TID    PRI   STATUS      PC       SP     ERRNO  DELAY

----------  ------------ -------- --- ---------- -------- -------- ------- -----

CAS-event   epicsThread>  3559330 133 STOP+I      3326a90  3559240       0     0

 

full task name : CAS-event

task entry     : epicsThreadEntry

process        : kernel

options        : 0x1009001

VX_SUPERVISOR_MODE  VX_DEALLOC_TCB      VX_FP_TASK          VX_DEALLOC_EXC_STACK

 

STACK      BASE     END       SP      SIZE    HIGH   MARGIN

--------- -------- -------- -------- ------- ------- -------

execution  3559330  3556450  3559240   12000    2800    9200

exception  2700210  26ff3a0             3696    1344    2352

 

VxWorks Events

--------------

Events Pended on    : Not Pended

Received Events     : 0x0

Options             : N/A

 

r0         = 0x40000000   sp         = 0x03559240   r2         = 0x002f1ac0

r3         = 0x03492ce0   r4         = 0x0357bec8   r5         = 0x00000001

r6         = 0x00000001   r7         = 0x00000000   r8         = 0xeefceefd

r9         = 0x03326a90   r10        = 0x00000000   r11        = 0x035591b0

r12        = 0x20000884   r13        = 0x00321e60   r14        = 0x033b6098

r15        = 0x03373c64   r16        = 0x03289478   r17        = 0x033424f0

r18        = 0x03373bdc   r19        = 0x03373c1c   r20        = 0x032a0ff8

r21        = 0x032a0f90   r22        = 0x00000000   r23        = 0x00000001

r24        = 0x033b609c   r25        = 0x03339340   r26        = 0x033390e0

r27        = 0x03443b60   r28        = 0x0357e748   r29        = 0x03567d78

r30        = 0x032dad44   r31        = 0x0357bec8   msr        = 0x0008b032

lr         = 0x032a0fe4   ctr        = 0x03326a90   pc         = 0x03326a90

cr         = 0x42000884   xer        = 0x00000000   pgTblPtr   = 0x02310000

scSrTblPtr = 0x0230fe74   srTblPtr   = 0x0230fe74

 

fpcsr  =        0

fr0    = -1.81018   fr1    =      NaN   fr2    =      NaN   fr3    =      NaN

fr4    =      NaN   fr5    =      NaN   fr6    =      NaN   fr7    =      NaN

fr8    =      NaN   fr9    =      NaN   fr10   =      NaN   fr11   =      NaN

fr12   =      NaN   fr13   =      NaN   fr14   =      NaN   fr15   =      NaN

fr16   =      NaN   fr17   =      NaN   fr18   =      NaN   fr19   =      NaN

fr20   =      NaN   fr21   =      NaN   fr22   =      NaN   fr23   =      NaN

fr24   =      NaN   fr25   =      NaN   fr26   =      NaN   fr27   =      NaN

fr28   =      NaN   fr29   =      NaN   fr30   =      NaN   fr31   =      NaN

 

machine checkvalue =

Exception next instruction address: 0x003326a90

Machine Status Register: 0x = 0008b032

Condition Register: 0x420008840x

0

 

And here is the disassembly for freeListFree, where I have highlighted in red the "next instruction" above.

 

                        freeListFree:

0x3326a90  9421fff0    stwu        r1,-16(r1)

0x3326a94  7c0802a6    mfspr       r0,LR

0x3326a98  3d200334    lis         r9,0x334 # 820

0x3326a9c  392990e0    addi        r9,r9,0x90e0 # -28448

0x3326aa0  7d2903a6    mtspr       CTR,r9

0x3326aa4  90010014    stw         r0,20(r1)

0x3326aa8  93c10008    stw         r30,8(r1)

0x3326aac  7c9e2378    or          r30,r4,r4

0x3326ab0  93e1000c    stw         r31,12(r1)

0x3326ab4  7c7f1b78    or          r31,r3,r3

0x3326ab8  80630014    lwz         r3,20(r3)

0x3326abc  4e800421    bcctrl      0x14,0

0x3326ac0  2f830000    cmpi        crf7,0,r3,0x0 # 0

0x3326ac4  41be002c    bc          0xd,30, 0x3326af0 # 0x03326af0

0x3326ac8  3d200334    lis         r9,0x334 # 820

0x3326acc  3c600338    lis         r3,0x338 # 824

0x3326ad0  3ca00338    lis         r5,0x338 # 824

0x3326ad4  392924f0    addi        r9,r9,0x24f0 # 9456

0x3326ad8  386335a4    addi        r3,r3,0x35a4 # 13732

0x3326adc  38a535c0    addi        r5,r5,0x35c0 # 13760

0x3326ae0  7d2903a6    mtspr       CTR,r9

0x3326ae4  3880008f    li          r4,0x8f # 143

0x3326ae8  38c00000    li          r6,0x0 # 0

0x3326aec  4e800421    bcctrl      0x14,0

0x3326af0  813f0010    lwz         r9,16(r31)

0x3326af4  801f0008    lwz         r0,8(r31)

0x3326af8  39290001    addi        r9,r9,0x1 # 1

0x3326afc  807f0014    lwz         r3,20(r31)

0x3326b00  913f0010    stw         r9,16(r31)

0x3326b04  3d200334    lis         r9,0x334 # 820

0x3326b08  39299340    addi        r9,r9,0x9340 # -27840

0x3326b0c  901e0000    stw         r0,0(r30)

0x3326b10  7d2903a6    mtspr       CTR,r9

0x3326b14  93df0008    stw         r30,8(r31)

0x3326b18  4e800421    bcctrl      0x14,0

0x3326b1c  80010014    lwz         r0,20(r1)

0x3326b20  83c10008    lwz         r30,8(r1)

0x3326b24  7c0803a6    mtspr       LR,r0

0x3326b28  83e1000c    lwz         r31,12(r1)

0x3326b2c  38210010    addi        r1,r1,0x10 # 16

0x3326b30  4e800020    blr

 

 

This is the source code for freeListFree in this tag of base.

 

 

LIBCOM_API void epicsStdCall freeListFree(void *pvt,void*pmem)

{

    FREELISTPVT *pfl = pvt;

#   ifdef EPICS_FREELIST_DEBUG

    memset ( pmem, 0xdd, pfl->size );

    free(pmem);

#   else

    void        **ppnext;

 

    VALGRIND_MEMPOOL_FREE(pvt, pmem);

    VALGRIND_MEMPOOL_ALLOC(pvt, pmem, sizeof(void*));

 

    epicsMutexMustLock(pfl->lock);

    ppnext = pmem;

    *ppnext = pfl->head;

    pfl->head = pmem;

    pfl->nBlocksAvailable++;

    epicsMutexUnlock(pfl->lock);

#   endif

}

 

The last thing I will try is to git bisect from this commit 85822f3051d2236144bb46dc2c24b7e38143e531 to R7.0.5 to see where the hardware tests first begin to fail.

 

Mark

 

 


Replies:
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Michael Davidsaver via Tech-talk
References:
Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Michael Davidsaver via Tech-talk
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Michael Davidsaver via Tech-talk
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Michael Davidsaver via Tech-talk
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Michael Davidsaver via Tech-talk
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Michael Davidsaver via Tech-talk
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Torsten Bögershausen via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Michael Davidsaver via Tech-talk
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk

Navigate by Date:
Prev: RE: Issue or misunderstanding with long string support Mark Rivers via Tech-talk
Next: Alarm-Server within a container Larregui, Julian via Tech-talk
Index: 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  <20222023  2024  2025 
Navigate by Thread:
Prev: Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Till Straumann via Tech-talk
Next: Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Michael Davidsaver via Tech-talk
Index: 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  <20222023  2024  2025 
ANJ, 14 Sep 2022 Valid HTML 4.01! · Home · News · About · Base · Modules · Extensions · Distributions ·
· Download · Search · IRMIS · Talk · Documents · Links · Licensing ·