Ø
The last thing I will try is to git bisect from this commit 85822f3051d2236144bb46dc2c24b7e38143e531 to R7.0.5 to see where the hardware tests first begin to fail.
I have now done that. I started with this:
git bisect start 85822f3051d2236144bb46dc2c24b7e38143e531 R7.0.5
The results were [good, bad, good]. I defined good as running for at least 3600 seconds, while the bad failed in 128 seconds.
The fail on the second pass was this:
iocexample>
VME Bus Error accessing A32: 0xbfa8c418
machine check
Exception next instruction address: 0x03326ae8
Machine Status Register: 0x0008b032
Condition Register: 0x40000882
Task: 0x3585870 "CAS-event"
0x3585870 (CAS-event): task 0x3585870 has had a failure and has been stopped.
0x3585870 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.
iocexample>
iocexample> tt 0x3585870
0x0012489c vxTaskEntry +0x48 : epicsThreadEntry ()
0x03345014 epicsThreadEntry+0x80 : 0x032a17e8 ()
0x032a1ac4 db_start_events+0x450: db_delete_field_log ()
0x032a0fdc db_delete_field_log+0x54 : freeListFree ()
value = 0 = 0x0
iocexample>
iocexample> ti 0x3585870
NAME ENTRY TID PRI STATUS PC SP ERRNO DELAY
---------- ------------ -------- --- ---------- -------- -------- ------- -----
CAS-event epicsThread> 3585870 133 STOP+I 3326ae8 3585770 0 0
full task name : CAS-event
task entry : epicsThreadEntry
process : kernel
options : 0x1009001
VX_SUPERVISOR_MODE VX_DEALLOC_TCB VX_FP_TASK VX_DEALLOC_EXC_STACK
STACK BASE END SP SIZE HIGH MARGIN
--------- -------- -------- -------- ------- ------- -------
execution 3585870 3582990 3585770 12000 2800 9200
exception 35869c0 3585b50 3696 1344 2352
VxWorks Events
--------------
Events Pended on : Not Pended
Received Events : 0x0
Options : N/A
r0 = 0x03326ab8 sp = 0x03585770 r2 = 0x002f1ac0
r3 = 0x00000000 r4 = 0x0033b448 r5 = 0x00000001
r6 = 0x00000001 r7 = 0x00000001 r8 = 0x0353a530
r9 = 0xffffffff r10 = 0x00000028 r11 = 0x035856f0
r12 = 0x20000884 r13 = 0x00321e60 r14 = 0x033b6088
r15 = 0x03373c5c r16 = 0x03289470 r17 = 0x033424e8
r18 = 0x03373bd4 r19 = 0x03373c14 r20 = 0x032a0ff0
r21 = 0x032a0f88 r22 = 0x00000000 r23 = 0x00000001
r24 = 0x033b608c r25 = 0x03339338 r26 = 0x033390d8
r27 = 0x034426f8 r28 = 0x0230aa00 r29 = 0x03555be8
r30 = 0x0356a558 r31 = 0x03492cf0 msr = 0x0008b032
lr = 0x03326ab8 ctr = 0x0026ecc0 pc = 0x03326ae8
cr = 0x40000882 xer = 0x00000000 pgTblPtr = 0x02310000
scSrTblPtr = 0x0230fe74 srTblPtr = 0x0230fe74
fpcsr = 0
fr0 = -0.0483711 fr1 = NaN fr2 = NaN fr3 = NaN
fr4 = NaN fr5 = NaN fr6 = NaN fr7 = NaN
fr8 = NaN fr9 = NaN fr10 = NaN fr11 = NaN
fr12 = NaN fr13 = NaN fr14 = NaN fr15 = NaN
fr16 = NaN fr17 = NaN fr18 = NaN fr19 = NaN
fr20 = NaN fr21 = NaN fr22 = NaN fr23 = NaN
fr24 = NaN fr25 = NaN fr26 = NaN fr27 = NaN
fr28 = NaN fr29 = NaN fr30 = NaN fr31 = NaN
machine checkvalue =
Exception next instruction address: 0x003326ae8
= Machine Status Register: 0x0x0008b0320
Condition Register: 0x
40000882
git bisect reported this as the final result:
corvette:local/epics-devel/base-7.0.6>git bisect good
27fe3e4468ec62d4d17f775c4aced939a666ba36 is the first bad commit
commit 27fe3e4468ec62d4d17f775c4aced939a666ba36
Author: Ben Franksen <benjamin.franksen at helmholtz-berlin.de>
Date: Mon Mar 30 21:34:32 2020 +0200
refactor db_field_log and filters to get rid of dbfl_type_rec
This refactor simplifies and streamlines the code associated with server
side filters. Apart from immediate benefits (clearer code, less duplication)
it is also hoped that this will make it easier to add write filters.
The data pointer dbfl_ref.field can now either point to a copy owned by a
filter, or it can point to the original data owned by a record. In the
latter case, the dbfl_ref.dtor is NULL.
The dbExtractArray* functions are unified to the single function
dbExtractArray and stripped of conversion functionality. This is redundant
because we always call dbGet after applying filters, which takes care of
conversion. Accordingly, dbChannelMakeArrayCopy is now obsolete and its
single use (in the ts filter) replaced with dbExtractArray. Instead, we add
the helper function dbChannelGetArrayInfo to wrap the common boilerplate
around calls to the get_array_info method, used in both arr.c and ts.c.
:040000 040000 eb0e8326fbd16b9bfe5c9316ebaa105ab74f91f0 3c6d6c0236ffe92a118ebf23a184affffc0cd055 M modules
So my git bisect tests show the following:
-
27fe3e4468ec62d4d17f775c4aced939a666ba36 is the first commit that causes the tests with hardware device support to fail
-
56f05d722dee4b8ca2968b8bface2737a3a9b185 is the first commit that causes the tests with soft device support to fail
Both of those commits were part of this PR
https://github.com/epics-base/epics-base/pull/99
One of the comments on that PR (January 27, 2021) was:
We expect to be nervous about this change because of the complex area it's working in, so we will merge it after the
next release which should be before the Codeathon in March, but we don't see anything obviously wrong with it.
Mark
From: Mark Rivers
Sent: Wednesday, June 1, 2022 7:50 AM
To: Michael Davidsaver <mdavidsaver at gmail.com>; Torsten Bögershausen <Torsten.Bogershausen at ess.eu>; Benjamin Franksen <benjamin.franksen at helmholtz-berlin.de>
Cc: tech-talk at aps.anl.gov
Subject: RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules
I agree that a debugger would be very helpful. I don't have that tool, but I think others at the APS do. I'll find out when Andrew returns next week. Meanwhile git
bisect can help point in the right direction.
I tested this commit 85822f3051d2236144bb46dc2c24b7e38143e531 with soft records overnight. It ran for 32,000 seconds without failing before I stopped it.
The next commit, 56f05d722dee4b8ca2968b8bface2737a3a9b185, fails in less than 10 seconds in each of 3 tests.
I just tested with the IPAC hardware again with commit 85822f3051d2236144bb46dc2c24b7e38143e531 (which does not fail with soft records). It failed after 960 seconds
with this error:
VME Bus Error accessing A32: 0xbffcf67c
machine check
Exception next instruction address: 0x03326a90
Machine Status Register: 0x0008b032
Condition Register: 0x42000884
Task: 0x3559330 "CAS-event"
0x3559330 (CAS-event): task 0x3559330 has had a failure and has been stopped.
0x3559330 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.
This is the task trace:
iocexample> tt 0x3559330
0x0012489c vxTaskEntry +0x48 : epicsThreadEntry ()
0x0334501c epicsThreadEntry+0x80 : 0x032a17f0 ()
0x032a1acc db_start_events+0x450: db_delete_field_log ()
0x032a0fe4 db_delete_field_log+0x54 : freeListFree ()
value = 0 = 0x0
This tells us pretty clearly that the failure as in freeListFree, called by db_delete_field_log. That has been a very common stack trace for these errors.
Till complained that vxWorks did not provide a register dump, but that is not true, it is available with the “ti” command. I had included that in several previous messages.
Here it is for this task.
iocexample> ti 0x3559330
NAME ENTRY TID PRI STATUS PC SP ERRNO DELAY
---------- ------------ -------- --- ---------- -------- -------- ------- -----
CAS-event epicsThread> 3559330 133 STOP+I 3326a90 3559240 0 0
full task name : CAS-event
task entry : epicsThreadEntry
process : kernel
options : 0x1009001
VX_SUPERVISOR_MODE VX_DEALLOC_TCB VX_FP_TASK VX_DEALLOC_EXC_STACK
STACK BASE END SP SIZE HIGH MARGIN
--------- -------- -------- -------- ------- ------- -------
execution 3559330 3556450 3559240 12000 2800 9200
exception 2700210 26ff3a0 3696 1344 2352
VxWorks Events
--------------
Events Pended on : Not Pended
Received Events : 0x0
Options : N/A
r0 = 0x40000000 sp = 0x03559240 r2 = 0x002f1ac0
r3 = 0x03492ce0 r4 = 0x0357bec8 r5 = 0x00000001
r6 = 0x00000001 r7 = 0x00000000 r8 = 0xeefceefd
r9 = 0x03326a90 r10 = 0x00000000 r11 = 0x035591b0
r12 = 0x20000884 r13 = 0x00321e60 r14 = 0x033b6098
r15 = 0x03373c64 r16 = 0x03289478 r17 = 0x033424f0
r18 = 0x03373bdc r19 = 0x03373c1c r20 = 0x032a0ff8
r21 = 0x032a0f90 r22 = 0x00000000 r23 = 0x00000001
r24 = 0x033b609c r25 = 0x03339340 r26 = 0x033390e0
r27 = 0x03443b60 r28 = 0x0357e748 r29 = 0x03567d78
r30 = 0x032dad44 r31 = 0x0357bec8 msr = 0x0008b032
lr = 0x032a0fe4 ctr = 0x03326a90 pc = 0x03326a90
cr = 0x42000884 xer = 0x00000000 pgTblPtr = 0x02310000
scSrTblPtr = 0x0230fe74 srTblPtr = 0x0230fe74
fpcsr = 0
fr0 = -1.81018 fr1 = NaN fr2 = NaN fr3 = NaN
fr4 = NaN fr5 = NaN fr6 = NaN fr7 = NaN
fr8 = NaN fr9 = NaN fr10 = NaN fr11 = NaN
fr12 = NaN fr13 = NaN fr14 = NaN fr15 = NaN
fr16 = NaN fr17 = NaN fr18 = NaN fr19 = NaN
fr20 = NaN fr21 = NaN fr22 = NaN fr23 = NaN
fr24 = NaN fr25 = NaN fr26 = NaN fr27 = NaN
fr28 = NaN fr29 = NaN fr30 = NaN fr31 = NaN
machine checkvalue =
Exception next instruction address: 0x003326a90
Machine Status Register: 0x = 0008b032
Condition Register: 0x420008840x
0
And here is the disassembly for freeListFree, where I have highlighted in red the "next instruction" above.
freeListFree:
0x3326a90 9421fff0 stwu r1,-16(r1)
0x3326a94 7c0802a6 mfspr r0,LR
0x3326a98 3d200334 lis r9,0x334 # 820
0x3326a9c 392990e0 addi r9,r9,0x90e0 # -28448
0x3326aa0 7d2903a6 mtspr CTR,r9
0x3326aa4 90010014 stw r0,20(r1)
0x3326aa8 93c10008 stw r30,8(r1)
0x3326aac 7c9e2378 or r30,r4,r4
0x3326ab0 93e1000c stw r31,12(r1)
0x3326ab4 7c7f1b78 or r31,r3,r3
0x3326ab8 80630014 lwz r3,20(r3)
0x3326abc 4e800421 bcctrl 0x14,0
0x3326ac0 2f830000 cmpi crf7,0,r3,0x0 # 0
0x3326ac4 41be002c bc 0xd,30, 0x3326af0 # 0x03326af0
0x3326ac8 3d200334 lis r9,0x334 # 820
0x3326acc 3c600338 lis r3,0x338 # 824
0x3326ad0 3ca00338 lis r5,0x338 # 824
0x3326ad4 392924f0 addi r9,r9,0x24f0 # 9456
0x3326ad8 386335a4 addi r3,r3,0x35a4 # 13732
0x3326adc 38a535c0 addi r5,r5,0x35c0 # 13760
0x3326ae0 7d2903a6 mtspr CTR,r9
0x3326ae4 3880008f li r4,0x8f # 143
0x3326ae8 38c00000 li r6,0x0 # 0
0x3326aec 4e800421 bcctrl 0x14,0
0x3326af0 813f0010 lwz r9,16(r31)
0x3326af4 801f0008 lwz r0,8(r31)
0x3326af8 39290001 addi r9,r9,0x1 # 1
0x3326afc 807f0014 lwz r3,20(r31)
0x3326b00 913f0010 stw r9,16(r31)
0x3326b04 3d200334 lis r9,0x334 # 820
0x3326b08 39299340 addi r9,r9,0x9340 # -27840
0x3326b0c 901e0000 stw r0,0(r30)
0x3326b10 7d2903a6 mtspr CTR,r9
0x3326b14 93df0008 stw r30,8(r31)
0x3326b18 4e800421 bcctrl 0x14,0
0x3326b1c 80010014 lwz r0,20(r1)
0x3326b20 83c10008 lwz r30,8(r1)
0x3326b24 7c0803a6 mtspr LR,r0
0x3326b28 83e1000c lwz r31,12(r1)
0x3326b2c 38210010 addi r1,r1,0x10 # 16
0x3326b30 4e800020 blr
This is the source code for freeListFree in this tag of base.
LIBCOM_API void epicsStdCall freeListFree(void *pvt,void*pmem)
{
FREELISTPVT *pfl = pvt;
# ifdef EPICS_FREELIST_DEBUG
memset ( pmem, 0xdd, pfl->size );
free(pmem);
# else
void **ppnext;
VALGRIND_MEMPOOL_FREE(pvt, pmem);
VALGRIND_MEMPOOL_ALLOC(pvt, pmem, sizeof(void*));
epicsMutexMustLock(pfl->lock);
ppnext = pmem;
*ppnext = pfl->head;
pfl->head = pmem;
pfl->nBlocksAvailable++;
epicsMutexUnlock(pfl->lock);
# endif
}
The last thing I will try is to git bisect from this commit 85822f3051d2236144bb46dc2c24b7e38143e531 to R7.0.5 to see where the hardware tests first begin to fail.
Mark