I am now convinced that the A32 addresses which generate VME bus errs are in fact the double values that are written to the ao records.
My test program writes floating point numbers to each of the 10 ao records. The ai records get these values via CP links.
These numbers increment by 0.1 on each loop, and the loop runs at 100 Hz.
If the range of the numbers is -2 to -0.5 the IOC gets an A32 bus error within 5 seconds. IEEE doubles in the range -2 to -0.5 have hex representation of 0xbf…. This means the most significant 32 bits, when interpreted as an address,
is in the A32 address space on my crate (0x80000000-0xbfffffff). The last 3 VME bus errors are at addresses 0xbffb3330, 0xbfeffffc, 0xbffccccc. If these are interpreted as the 4 most significant bytes of an IEEE double then the values are: -1.6999969482421875,
-0.9999980926513672, -1.7999992370605469. Within rounding errors these are -1.7, -1.0, and -1.8 which are 3 of the values written to the ao records when my test program loops from -2 to -0.5.
If I change my test program so that it loops from -3 to -2 then it runs for over 4200 seconds with no failure. Those floating point values do not translate into the A32 address space on my IOC, and so don’t generate bus errors.
If the test program loops from -0.7 to -0.5 the VME bus error address is 0xbfe66664. That is -0.6999988555908203 when interpreted as a double with that address as the most significant 32 bits. That is within the now more restricted range
of the output values.
This is pretty definitive proof that the VME bus errors are occurring because the ai or ao record values are being treated as a memory address, and when that address falls in the A32 address space we get a VME bus error.
Till said:
1.
I would join Steven Hartman in recommending to dump program memory; I'd look at
the address in R9/CTR (target of the jump) and also at the address where we jump from
(if the PC is different from CTR as in the above report then dump from the address at the PC,
if the PC is identical to CTR then dump from the address held in LR).
VME Bus Error accessing A16: 0x347e
machine check
Exception next instruction address: 0x03326a90
Machine Status Register: 0x0008b032
Condition Register: 0x42000884
Task: 0x3593ed0 "CAS-event"
0x3593ed0 (CAS-event): task 0x3593ed0 has had a failure and has been stopped.
0x3593ed0 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.
iocexample> tt
0x0012489c vxTaskEntry +0x48 : epicsThreadEntry ()
0x0334501c epicsThreadEntry+0x80 : 0x032a17f0 ()
0x032a1acc db_start_events+0x450: db_delete_field_log ()
0x032a0fe4 db_delete_field_log+0x54 : freeListFree ()
value = 0 = 0x0
iocexample>
iocexample>
iocexample> ti
NAME ENTRY TID PRI STATUS PC SP ERRNO DELAY
---------- ------------ -------- --- ---------- -------- -------- ------- -----
CAS-event epicsThread> 3593ed0 133 STOP+I 3326a90 3593de0 0 0
full task name : CAS-event
task entry : epicsThreadEntry
process : kernel
options : 0x1009001
VX_SUPERVISOR_MODE VX_DEALLOC_TCB VX_FP_TASK VX_DEALLOC_EXC_STACK
STACK BASE END SP SIZE HIGH MARGIN
--------- -------- -------- -------- ------- ------- -------
execution 3593ed0 3590ff0 3593de0 12000 2800 9200
exception 3595020 35941b0 3696 1344 2352
VxWorks Events
--------------
Events Pended on : Not Pended
Received Events : 0x0
Options : N/A
r0 = 0x40000000 sp = 0x03593de0 r2 = 0x002f1ac0
r3 = 0x03492c80 r4 = 0x0357f318 r5 = 0x00000001
r6 = 0x00000001 r7 = 0x00000000 r8 = 0x00140001
r9 = 0x03326a90 r10 = 0x00000000 r11 = 0x03593d50
r12 = 0x20000884 r13 = 0x00321e60 r14 = 0x033b6098
r15 = 0x03373c64 r16 = 0x03289478 r17 = 0x033424f0
r18 = 0x03373bdc r19 = 0x03373c1c r20 = 0x032a0ff8
r21 = 0x032a0f90 r22 = 0x00000000 r23 = 0x00000001
r24 = 0x033b609c r25 = 0x03339340 r26 = 0x033390e0
r27 = 0x03443170 r28 = 0x03443170 r29 = 0x03569e18
r30 = 0x032dad44 r31 = 0x0357f318 msr = 0x0008b032
lr = 0x032a0fe4 ctr = 0x03326a90 pc = 0x03326a90
cr = 0x42000884 xer = 0x00000000 pgTblPtr = 0x02310000
scSrTblPtr = 0x0230fe74 srTblPtr = 0x0230fe74
fpcsr = 0
fr0 = -1.1635 fr1 = NaN fr2 = NaN fr3 = NaN
fr4 = NaN fr5 = NaN fr6 = NaN fr7 = NaN
fr8 = NaN fr9 = NaN fr10 = NaN fr11 = NaN
fr12 = NaN fr13 = NaN fr14 = NaN fr15 = NaN
fr16 = NaN fr17 = NaN fr18 = NaN fr19 = NaN
fr20 = NaN fr21 = NaN fr22 = NaN fr23 = NaN
fr24 = NaN fr25 = NaN fr26 = NaN fr27 = NaN
fr28 = NaN fr29 = NaN fr30 = NaN fr31 = NaN
machine checkvalue =
Exception next instruction address: 0x003326a90
Machine Status Register: 0x = 0008b032
Condition Register: 0x420008840x
0
In this case R9=CTR=PC=0x03326a90.
iocexample> l 0x03326a90
freeListFree:
0x3326a90 9421fff0 stwu r1,-16(r1)
0x3326a94 7c0802a6 mfspr r0,LR
0x3326a98 3d200334 lis r9,0x334 # 820
0x3326a9c 392990e0 addi r9,r9,0x90e0 # -28448
0x3326aa0 7d2903a6 mtspr CTR,r9
0x3326aa4 90010014 stw r0,20(r1)
0x3326aa8 93c10008 stw r30,8(r1)
0x3326aac 7c9e2378 or r30,r4,r4
0x3326ab0 93e1000c stw r31,12(r1)
0x3326ab4 7c7f1b78 or r31,r3,r3
value = 0 = 0x0
iocexample> d 0x03326a90
NOTE: memory values are displayed in hexadecimal.
0x03326a90: 9421 fff0 7c08 02a6 3d20 0334 3929 90e0 *.!..|...= .49)..*
0x03326aa0: 7d29 03a6 9001 0014 93c1 0008 7c9e 2378 *})..........|.#x*
0x03326ab0: 93e1 000c 7c7f 1b78 8063 0014 4e80 0421 *....|..x.c..N..!*
0x03326ac0: 2f83 0000 41be 002c 3d20 0334 3c60 0338 */...A..,= .4<`.8*
0x03326ad0: 3ca0 0338 3929 24f0 3863 35a4 38a5 35c0 *<..89)$.8c5.8.5.*
0x03326ae0: 7d29 03a6 3880 008f 38c0 0000 4e80 0421 *})..8...8...N..!*
0x03326af0: 813f 0010 801f 0008 3929 0001 807f 0014 *.?......9)......*
0x03326b00: 913f 0010 3d20 0334 3929 9340 901e 0000 *.?..= .49).@....*
0x03326b10: 7d29 03a6 93df 0008 4e80 0421 8001 0014 *})......N..!....*
0x03326b20: 83c1 0008 7c08 03a6 83e1 000c 3821 0010 *....|.......8!..*
0x03326b30: 4e80 0020 9421 ffe0 7c08 02a6 9001 0024 *N.. .!..|......$*
0x03326b40: 93e1 001c 9381 0010 7c7c 1b78 83e3 000c *........||.x....*
0x03326b50: 93a1 0014 2f9f 0000 93c1 0018 419e 0074 *..../.......A..t*
0x03326b60: 3d20 001e 3bc9 02fc 807f 0004 7fc9 03a6 *= ..;...........*
0x03326b70: 83bf 0000 4e80 0421 7fe3 fb78 7fc9 03a6 *....N..!...x....*
0x03326b80: 7fbf eb78 4e80 0421 2f9d 0000 409e ffdc
*...xN..!/...@...*
LR = 0x032a0fe4
This is the disassembly with LR address highlighted in red.
iocexample> l 0x032a0f90,30
db_delete_field_log:
0x32a0f90 9421fff0 stwu r1,-16(r1)
0x32a0f94 7c0802a6 mfspr r0,LR
0x32a0f98 93e1000c stw r31,12(r1)
0x32a0f9c 7c7f1b79 or. r31,r3,r3
0x32a0fa0 90010014 stw r0,20(r1)
0x32a0fa4 41820040 bc 0xc,2, 0x32a0fe4 # 0x032a0fe4
0x32a0fa8 801f0000 lwz r0,0(r31)
0x32a0fac 2f800000 cmpi crf7,0,r0,0x0 # 0
0x32a0fb0 409c0018 bc 0x4,28, 0x32a0fc8 # 0x032a0fc8
0x32a0fb4 801f0018 lwz r0,24(r31)
0x32a0fb8 2f800000 cmpi crf7,0,r0,0x0 # 0
0x32a0fbc 419e000c bc 0xc,30, 0x32a0fc8 # 0x032a0fc8
0x32a0fc0 7c0903a6 mtspr CTR,r0
0x32a0fc4 4e800421 bcctrl 0x14,0
0x32a0fc8 3d20033b lis r9,0x33b # 827
0x32a0fcc 7fe4fb78 or r4,r31,r31
0x32a0fd0 80696098 lwz r3,24728(r9)
0x32a0fd4 3d200332 lis r9,0x332 # 818
0x32a0fd8 39296a90 addi r9,r9,0x6a90 # 27280
0x32a0fdc 7d2903a6 mtspr CTR,r9
0x32a0fe0 4e800421 bcctrl 0x14,0
0x32a0fe4 80010014 lwz r0,20(r1)
0x32a0fe8 83e1000c lwz r31,12(r1)
0x32a0fec 38210010 addi r1,r1,0x10 # 16
0x32a0ff0 7c0803a6 mtspr LR,r0
0x32a0ff4 4e800020 blr
0x32a0ff8 9421fff0 stwu r1,-16(r1)
0x32a0ffc 7c0802a6 mfspr r0,LR
0x32a1000 39240090 addi r9,r4,0x90 # 144
0x32a1004 5484103a rlwinm r4,r4,2,0,29
iocexample> d 0x032a0fe4
NOTE: memory values are displayed in hexadecimal.
0x032a0fe0: 8001 0014 83e1 000c 3821 0010 * ......8!..*
0x032a0ff0: 7c08 03a6 4e80 0020 9421 fff0 7c08 02a6 *|...N.. .!..|...*
0x032a1000: 3924 0090 5484 103a 5529 103a 7c83 2214 *9$..T..:U).:|.".*
0x032a1010: 7d23 4a14 9001 0014 3800 0000 93c1 0008 *}#J.....8.......*
0x032a1020: 93e1 000c 7c7f 1b78 83c9 0004 9004 0004 *....|..x........*
0x032a1030: 817e 001c 90a9 0004 2f8b 0001 419e 0074 *.~....../...A..t*
0x032a1040: 2b8b 0001 409d 0090 a13f 0492 2f89 0000
*+...@....?../...*
0x032a1050: 419e 0030 817e 001c 3809 ffff b01f 0492 *A..0.~..8.......*
0x032a1060: 380b ffff 83e1 000c 901e 001c 8001 0014 *8...............*
0x032a1070: 83c1 0008 3821 0010 7c08 03a6 4e80 0020 *....8!..|...N.. *
0x032a1080: 3d20 0334 3c60 0337 3929 24f0 3ca0 0337 *= .4<`.79)$.<..7*
0x032a1090: 7d29 03a6 3863 3bdc 38a5 3c00 3880 0216 *})..8c;.8.<.8...*
0x032a10a0: 38c0 0000 4e80 0421 a13f 0492 4bff ffa8 *8...N..!.?..K...*
0x032a10b0: 901e 0018 380b ffff 901e 001c 8001 0014 *....8...........*
0x032a10c0: 83c1 0008 7c08 03a6 83e1 000c 3821 0010 *....|.......8!..*
0x032a10d0: 4e80 0020 3d20 0334 3c60 0337 3929 24f0 *N.. = .4<`.79)$.*
0x032a10e0: 3ca0 0337 *<..7............*
value = 0 = 0x0
Thanks,
Mark
Hi Mark.
Some comments in line...
On 6/7/22 22:39, Mark Rivers wrote:
Hi Till,
I have checked out tag 85822f3 and only changed configure/CONFIG_SITE to build for vxWorks.
I ran the test that uses the IPAC hardware and got this crash after about 13 minutes. This time it crashed with an A16 address error, in the address space for the IP330 module. The crash is in db_delete_field_log, the same as the crash
you analyzed.
Actually: not quite. In this report the PC / next instruction address has not been loaded from the CTR yet as it had in the previous
report.
iocexample>
VME Bus Error accessing A16: 0x345e
machine check
Exception next instruction address: 0x032a0fe0
Machine Status Register: 0x0008b032
Condition Register: 0x42000884
Task: 0x3570090 "CAS-event"
0x3570090 (CAS-event): task 0x3570090 has had a failure and has been stopped.
0x3570090 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.
This is the stack trace:
iocexample> tt 0x3570090
0x0012489c vxTaskEntry +0x48 : epicsThreadEntry ()
0x0334501c epicsThreadEntry+0x80 : 0x032a17f0 ()
0x032a1acc db_start_events+0x450: db_delete_field_log ()
value = 0 = 0x0
This is the register dump:
iocexample> ti 0x3570090
NAME ENTRY TID PRI STATUS PC SP ERRNO DELAY
---------- ------------ -------- --- ---------- -------- -------- ------- -----
CAS-event epicsThread> 3570090 133 STOP+I 32a0fe0 356ffa0 0 0
full task name : CAS-event
task entry : epicsThreadEntry
process : kernel
options : 0x1009001
VX_SUPERVISOR_MODE VX_DEALLOC_TCB VX_FP_TASK VX_DEALLOC_EXC_STACK
STACK BASE END SP SIZE HIGH MARGIN
--------- -------- -------- -------- ------- ------- -------
execution 3570090 356d1b0 356ffa0 12000 2800 9200
exception 26f7ac0 26f6c50 3696 1344 2352
VxWorks Events
--------------
Events Pended on : Not Pended
Received Events : 0x0
Options : N/A
r0 = 0x40000000 sp = 0x0356ffa0 r2 = 0x002f1ac0
r3 = 0x03492cc0 r4 = 0x0356ab90 r5 = 0x00000001
r6 = 0x00000001 r7 = 0x00000000 r8 = 0x000a0008
r9 = 0x03326a90 r10 = 0x00000000 r11 = 0x0356ff10
r12 = 0x20000884 r13 = 0x00321e60 r14 = 0x033b6098
r15 = 0x03373c64 r16 = 0x03289478 r17 = 0x033424f0
r18 = 0x03373bdc r19 = 0x03373c1c r20 = 0x032a0ff8
r21 = 0x032a0f90 r22 = 0x00000000 r23 = 0x00000001
r24 = 0x033b609c r25 = 0x03339340 r26 = 0x033390e0
r27 = 0x03442b90 r28 = 0x0356c870 r29 = 0x03556108
r30 = 0x032dad44 r31 = 0x0356ab90 msr = 0x0008b032
lr = 0x032a1acc ctr = 0x03326a90 pc = 0x032a0fe0
cr = 0x42000884 xer = 0x00000000 pgTblPtr = 0x02310000
scSrTblPtr = 0x0230fe74 srTblPtr = 0x0230fe74
fpcsr = 0
fr0 = -0.85893 fr1 = NaN fr2 = NaN fr3 = NaN
fr4 = NaN fr5 = NaN fr6 = NaN fr7 = NaN
fr8 = NaN fr9 = NaN fr10 = NaN fr11 = NaN
fr12 = NaN fr13 = NaN fr14 = NaN fr15 = NaN
fr16 = NaN fr17 = NaN fr18 = NaN fr19 = NaN
fr20 = NaN fr21 = NaN fr22 = NaN fr23 = NaN
fr24 = NaN fr25 = NaN fr26 = NaN fr27 = NaN
fr28 = NaN fr29 = NaN fr30 = NaN fr31 = NaN
machine checkvalue =
Exception next instruction address: 0x0032a0fe0
Machine Status Register: 0x = 0008b032
Condition Register: 0x420008840x
This is the memory dump of the address in R31:
iocexample> d 0x0356ab90
NOTE: memory values are displayed in hexadecimal.
Looks quite normal
0x0356ab90: 4000 0000 3d01 163d 3848 818a 0000 0000
*@...=..=8H......*
ctx bit set | timestamp | stat | sevr |
0x0356aba0: 000a 0008 0000 0001 bfeb 7c5b 7c5b 7c5b *..........|[|[|[*
type |size | no_elems | dbfl_val |
Note that the first 32-bit of the dbfl_val again would point to a VME A32 address;
note also that the ls-bits (0x7c5b) differ from the reported A16 address (0x345e).
I start to wonder if pfl->u.r is not a red herring...
I would join Steven Hartman in recommending to dump program memory; I'd look at
the address in R9/CTR (target of the jump) and also at the address where we jump from
(if the PC is different from CTR as in the above report then dump from the address at the PC,
if the PC is identical to CTR then dump from the address held in LR).
0x0356abb0: 0000 0000 0000 0000 0356 ac80 3d01 163d *.........V..=..=*
0x0356abc0: 3848 818a 0000 0000 000a 0008 0000 0001 *8H..............*
0x0356abd0: bfeb 405b 405b 405b 0000 0000 0000 0000
*..@[@[@[........*
0x0356abe0: 0356 a988 3d01 163d 3848 818a 0000 0000 *.V..=..=8H......*
0x0356abf0: 000a 0008 0000 0001 bfeb 2edb 2edb 2edb *................*
0x0356ac00: 0000 0000 0000 0000 0356 a640 3d01 163d
*.........V.@=..=*
0x0356ac10: 3848 818a 0000 0000 000a 0008 0000 0001 *8H..............*
0x0356ac20: bfeb 635b 635b 635b 0000 0000 0000 0000 *..c[c[c[........*
0x0356ac30: 0356 a5f0 3d01 163d 3848 818a 0000 0000 *.V..=..=8H......*
0x0356ac40: 000a 0008 0000 0001 bfeb 4f5b 4f5b 4f5b *..........O[O[O[*
0x0356ac50: 0000 0000 0000 0000 0356 a618 3d01 163d *.........V..=..=*
0x0356ac60: 3848 818a 0000 0000 000a 0008 0000 0001 *8H..............*
0x0356ac70: bfeb 60db 60db 60db 0000 0000 0000 0000 *..`.`.`.........*
0x0356ac80: 0356 ab18 3d01 163d 3848 818a 0000 0000 *.V..=..=8H......*
value = 0 = 0x0
iocexample>
Another time I tested the same version it failed with an A32 address error in freeListFree called from db_delete_field_log
VME Bus Error accessing A32: 0xbfe26ed0
machine check
Exception next instruction address: 0x03326a90
Machine Status Register: 0x0008b032
Condition Register: 0x40000884
Task: 0x26fccd0 "CAS-event"
0x26fccd0 (CAS-event): task 0x26fccd0 has had a failure and has been stopped.
0x26fccd0 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.
iocexample> tt 0x26fccd0
0x0012489c vxTaskEntry +0x48 : epicsThreadEntry ()
0x0334501c epicsThreadEntry+0x80 : 0x032a17f0 ()
0x032a1acc db_start_events+0x450: db_delete_field_log ()
0x032a0fe4 db_delete_field_log+0x54 : freeListFree ()
value = 0 = 0x0
iocexample> ti
NAME ENTRY TID PRI STATUS PC SP ERRNO DELAY
---------- ------------ -------- --- ---------- -------- -------- ------- -----
CAS-event epicsThread> 26fccd0 133 STOP+I 3326a90 26fcbe0 0 0
full task name : CAS-event
task entry : epicsThreadEntry
process : kernel
options : 0x1009001
VX_SUPERVISOR_MODE VX_DEALLOC_TCB VX_FP_TASK VX_DEALLOC_EXC_STACK
STACK BASE END SP SIZE HIGH MARGIN
--------- -------- -------- -------- ------- ------- -------
execution 26fccd0 26f9df0 26fcbe0 12000 2800 9200
exception 3444540 34436d0 3696 1344 2352
VxWorks Events
--------------
Events Pended on : Not Pended
Received Events : 0x0
Options : N/A
r0 = 0x40000000 sp = 0x026fcbe0 r2 = 0x002f1ac0
r3 = 0x03492ce0 r4 = 0x03575a90 r5 = 0x00000001
r6 = 0x00000001 r7 = 0x00000001 r8 = 0x0354a4d0
r9 = 0x03326a90 r10 = 0x00000028 r11 = 0x026fcb50
r12 = 0x22000884 r13 = 0x00321e60 r14 = 0x033b6098
r15 = 0x03373c64 r16 = 0x03289478 r17 = 0x033424f0
r18 = 0x03373bdc r19 = 0x03373c1c r20 = 0x032a0ff8
r21 = 0x032a0f90 r22 = 0x00000000 r23 = 0x00000001
r24 = 0x033b609c r25 = 0x03339340 r26 = 0x033390e0
r27 = 0x03443078 r28 = 0x03443078 r29 = 0x03561a80
r30 = 0x032dad44 r31 = 0x03575a90 msr = 0x0008b032
lr = 0x032a0fe4 ctr = 0x03326a90 pc = 0x03326a90
cr = 0x40000884 xer = 0x00000000 pgTblPtr = 0x02310000
scSrTblPtr = 0x0230fe74 srTblPtr = 0x0230fe74
fpcsr = 0
fr0 = -0.576028 fr1 = NaN fr2 = NaN fr3 = NaN
fr4 = NaN fr5 = NaN fr6 = NaN fr7 = NaN
fr8 = NaN fr9 = NaN fr10 = NaN fr11 = NaN
fr12 = NaN fr13 = NaN fr14 = NaN fr15 = NaN
fr16 = NaN fr17 = NaN fr18 = NaN fr19 = NaN
fr20 = NaN fr21 = NaN fr22 = NaN fr23 = NaN
fr24 = NaN fr25 = NaN fr26 = NaN fr27 = NaN
fr28 = NaN fr29 = NaN fr30 = NaN fr31 = NaN
machine checkvalue =
Exception next instruction address: 0x003326a90
Machine Status Register: 0x = 0008b032
Condition Register: 0x0x40000884
iocexample> l 0x003326a90
freeListFree:
0x3326a90 9421fff0 stwu r1,-16(r1)
0x3326a94 7c0802a6 mfspr r0,LR
0x3326a98 3d200334 lis r9,0x334 # 820
0x3326a9c 392990e0 addi r9,r9,0x90e0 # -28448
0x3326aa0 7d2903a6 mtspr CTR,r9
0x3326aa4 90010014 stw r0,20(r1)
0x3326aa8 93c10008 stw r30,8(r1)
0x3326aac 7c9e2378 or r30,r4,r4
0x3326ab0 93e1000c stw r31,12(r1)
0x3326ab4 7c7f1b78 or r31,r3,r3
value = 0 = 0x0
I hope this helps.
I will work on the quick test program you sent next.
Thanks,
Mark
On 6/2/22 16:21, Mark Rivers via Tech-talk wrote:
The problem can be explained if pfl->type is occasionally set to dbfl_type_ref when it should not be. In that case the union will contain data, perhaps floats from the ADC as Michael suggested. Because the ->type is wrong the value is being treated as an address, which leads to the A32 bus address error.
No - it actually can't explained by this. In the scenario you describe
there would be a read access to pfl->u.r.dtor when the function call
is executed. If that were the case then we should see the fault address
in the PC (and in 'r0' as well as in 'ctr', see analysis below).
In the disassembly which I quote here (comments apply
to preceding instructions):
00000034 <db_delete_field_log>:
34: 94 21 ff f0 stwu r1,-16(r1)
38: 7c 08 02 a6 mflr r0
3c: 93 e1 00 0c stw r31,12(r1)
40: 7c 7f 1b 79 mr. r31,r3
copy 'pfl' arg (passed in r3) into r31 and set condition-code bits
44: 90 01 00 14 stw r0,20(r1)
48: 41 82 00 40 beq 88 <db_delete_field_log+0x54>
if ( pfl == NULL ) goto cleanup/leave
4c: 80 1f 00 00 lwz r0,0(r31)
load pfl->type into 'r0'
50: 2f 80 00 00 cmpwi cr7,r0,0
test most-significant bit (this is the 'type' 1-bit bitfield)
54: 40 9c 00 18 bge cr7,6c <db_delete_field_log+0x38>
if type is 0 (*not* dbfl_ref) then skip the 'dtor' call and jump to address 0x6c
58: 80 1f 00 18 lwz r0,24(r31)
we get here if pfl->type == dbfl_type_ref and this instruction
loads pfl->u.r.dtor into register r0
(note that this is still a valid access even if r0 now contains a bad address)
****** r0 at this point holds the 'dtor' address ******
5c: 2f 80 00 00 cmpwi cr7,r0,0
comparison: pfl->u.r.dtor == 0
60: 41 9e 00 0c beq cr7,6c <db_delete_field_log+0x38>
get here if pfl->u.r.dtor is NOT NULL
64: 7c 09 03 a6 mtctr r0
copy r0 to counter register
68: 4e 80 04 21 bctrl
***** jump to the 'dtor' routine
***** this jump would cause a bus error
***** HOWEVER: the faulting address would still be in 'r0' (and 'ctr')
***** and we would crash *here* not a few lines below
6c: 3d 20 00 00 lis r9,0
70: 7f e4 fb 78 mr r4,r31
copy r31 into r4, both hold 'pfl'
74: 80 69 00 00 lwz r3,0(r9)
load 'dbevFieldLogFreeList' into r3
78: 3d 20 00 00 lis r9,0
upper 16-bit of 'freeListFree' into r9
7c: 39 29 00 00 addi r9,r9,0
merge lower 16-bit of 'freeListFree' into r9
80: 7d 29 03 a6 mtctr r9
copy r9 into 'ctr'
84: 4e 80 04 21 bctrl
***** OBSERVED EXCEPTION happens during execution of this jump
88: 80 01 00 14 lwz r0,20(r1)
8c: 83 e1 00 0c lwz r31,12(r1)
90: 38 21 00 10 addi r1,r1,16
94: 7c 08 03 a6 mtlr r0
98: 4e 80 00 20 blr
Note that we see
r0: 0x40000000
ctr: 0x03326a90
pc: 0x03326a90 (next instruction address)
=> if the fault were caused by loading an instruction from *pfl->u.r.dtor then
pc and ctr should hold the bad address.
The register context is actually consistent with
pfl->type == dbfl_type_val
i.e., from 0x54 we would jump to 0x6c with r0 still holding the 'type' and 'ctx' bit-fields.
Here the call to 'freeListFree()' is prepared. Note that nothing looks out of order:
r9 == ctr == pc == 0x03326a90 = freeListFree
r31 == r4 =='pfl'
r0 == 0x40000000
Note how the type (msbit) is '0' and the 'ctx' is 1 (dbfl_context_event).
In conclusion: the register context gives no indication that anything went wrong.
In order to gather more data I would recommend to also dump the memory at 'pfl',
i.e., next time please post:
- register dump
- stack trace
- other info you have provided in the past
- memory (~100bytes should suffice) from the location reported in R31
If possible please always work with the 85822f305 commit so that we stay on the same page...
Cheers
- Till
Mark
Sent from my iPhone
On Jun 1, 2022, at 11:07 PM, Mark Rivers via Tech-talk <tech-talk at aps.anl.gov> wrote:
We need to explain why adding 2 lines of code to examine the contents of pfl eliminates the failures completely.
It’s like a new type of quantum computer - making an observation changes the outcome!
Mark
Sent from my iPhone
On Jun 1, 2022, at 10:24 PM, Michael Davidsaver <mdavidsaver at gmail.com> wrote:
On 6/1/22 14:50, Mark Rivers wrote:
I checked out R7.0.6.1 and made the following change to dbEvent.c It checks whether pfl or pfl->u.r.dtor are in the VME A32 address space. If they are it prints an error and returns.
So, these addresses appear in pfl->u.r.dtor or eg. pfl->u.v.field.dbf_float
My question is then which part(s) of the db_field_log struct have incorrect values?
(eg. "\xbf\x3e\x00\x1e" as float32 is -0.74)
One straightforward, though a bit tedious, think to do would be to print a
dump of the contents of the db_field_log. eg. I'd be interested to know
if db_field_log::time appears valid.
...
After making that change then one time when I started the IOC I got these messages right after iocInit, which is not when it usually fails.
This makes me wonder if these "addresses" are eg. somehow valid ADC values,
which just happen to track through the VME window sometimes. Though if you've
only loading soft records, I don't see how this could be.
Further, I still don't see how these values aren't appearing in the register dump.
I don't think PPC has any instructions which do a double indirection.
Another odd ball idea which occurs to me is that maybe the compiler is mishandling
the db_field_log::type bit field. The dbEvent.o which you send to Till and I has:
4c: 80 1f 00 00 lwz r0,0(r31)
50: 2f 80 00 00 cmpwi cr7,r0,0
If I read things correctly, I think this is meant to correspond to
'pfl->u.r.type==dbfl_type_val==0", however it is testing the whole word, and would
mis-trigger because 'pfl->u.r.ctx == dbfl_context_event == 1` for a monitor update.
When I look at the result of a local RTEMS build, I think the corresponding is:
8: 89 24 00 00 lbz r9,0(r4)
c: 71 29 00 80 andi. r9,r9,128
Which is testing the high bit as it should.
For completeness. Disassembly of db_delete_field_log() from dbEvent.o.
Note that this is object code, with incomplete relocations.
00000034 <db_delete_field_log>:
34: 94 21 ff f0 stwu r1,-16(r1)
38: 7c 08 02 a6 mflr r0
3c: 93 e1 00 0c stw r31,12(r1)
40: 7c 7f 1b 79 mr. r31,r3
44: 90 01 00 14 stw r0,20(r1)
48: 41 82 00 40 beq 88 <db_delete_field_log+0x54>
4c: 80 1f 00 00 lwz r0,0(r31)
50: 2f 80 00 00 cmpwi cr7,r0,0
54: 40 9c 00 18 bge cr7,6c <db_delete_field_log+0x38>
58: 80 1f 00 18 lwz r0,24(r31)
5c: 2f 80 00 00 cmpwi cr7,r0,0
60: 41 9e 00 0c beq cr7,6c <db_delete_field_log+0x38>
64: 7c 09 03 a6 mtctr r0
68: 4e 80 04 21 bctrl
6c: 3d 20 00 00 lis r9,0
70: 7f e4 fb 78 mr r4,r31
74: 80 69 00 00 lwz r3,0(r9)
78: 3d 20 00 00 lis r9,0
7c: 39 29 00 00 addi r9,r9,0
80: 7d 29 03 a6 mtctr r9
84: 4e 80 04 21 bctrl
88: 80 01 00 14 lwz r0,20(r1)
8c: 83 e1 00 0c lwz r31,12(r1)
90: 38 21 00 10 addi r1,r1,16
94: 7c 08 03 a6 mtlr r0
98: 4e 80 00 20 blr
From my local build for RTEMS-mvme3100
void db_delete_field_log (db_field_log *pfl)
{
if (pfl) {
0: 7c 64 1b 79 mr. r4,r3
4: 4d 82 00 20 beqlr
/* Free field if reference type field log and dtor is set */
if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
8: 89 24 00 00 lbz r9,0(r4)
c: 71 29 00 80 andi. r9,r9,128
10: 40 82 00 10 bne 20 <db_delete_field_log+0x20>
/* Free the field log chunk */
freeListFree(dbevFieldLogFreeList, pfl);
14: 3d 20 00 00 lis r9,0
18: 80 69 00 00 lwz r3,0(r9)
1c: 48 00 00 00 b 1c <db_delete_field_log+0x1c>
if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
20: 81 24 00 50 lwz r9,80(r4)
24: 2f 89 00 00 cmpwi cr7,r9,0
28: 41 be ff ec beq cr7,14 <db_delete_field_log+0x14>
{
2c: 94 21 ff f0 stwu r1,-16(r1)
30: 7c 08 02 a6 mflr r0
if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
34: 7d 29 03 a6 mtctr r9
{
38: 90 01 00 14 stw r0,20(r1)
3c: 90 81 00 08 stw r4,8(r1)
if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
40: 4e 80 04 21 bctrl
}
}
44: 80 01 00 14 lwz r0,20(r1)
freeListFree(dbevFieldLogFreeList, pfl);
48: 3d 20 00 00 lis r9,0
if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
4c: 80 81 00 08 lwz r4,8(r1)
freeListFree(dbevFieldLogFreeList, pfl);
50: 80 69 00 00 lwz r3,0(r9)
}
54: 7c 08 03 a6 mtlr r0
58: 38 21 00 10 addi r1,r1,16
freeListFree(dbevFieldLogFreeList, pfl);
5c: 48 00 00 00 b 5c <db_delete_field_log+0x5c>