I have now tested using base 7.0.5 rather than 7.0.6.1. The problem does not occur with base 7.0.5. I have switched back and forth twice, being certain to completely clean and rebuild all
synApps modules and my own application first when doing so.
I am running 7 VME crates with vxWorks 6.9.4.1. I am logging the IOC output for each of them with procServ.
I was running base 7.0.6.1 between about Wed May 18 15:00 and Thu May 19 15:00. During that time I observed the following VME bus errors.
/home/epics/logs/13BMD.log:[Wed May 18 15:46:12 2022] VME Bus Error accessing A16: 0x3316
/home/epics/logs/13IDE.log:[Wed May 18 16:07:12 2022] VME Bus Error accessing A16: 0x317e
/home/epics/logs/13IDA.log:[Wed May 18 16:50:32 2022] VME Bus Error accessing A16: 0x3146
/home/epics/logs/13IDE.log:[Wed May 18 17:23:08 2022] VME Bus Error accessing A16: 0x317e
/home/epics/logs/13IDC.log:[Wed May 18 18:11:58 2022] VME Bus Error accessing A32: 0xbf57c014
/home/epics/logs/13BMD.log:[Wed May 18 18:12:43 2022] VME Bus Error accessing A16: 0x317e
/home/epics/logs/13BMA.log:[Wed May 18 18:16:04 2022] VME Bus Error accessing A16: 0x3136
/home/epics/logs/13IDE.log:[Wed May 18 18:20:22 2022] VME Bus Error accessing A32: 0xbf140014
/home/epics/logs/13IDA.log:[Wed May 18 20:23:06 2022] VME Bus Error accessing A16: 0x300e
/home/epics/logs/13BMD.log:[Thu May 19 11:13:34 2022] VME Bus Error accessing A16: 0x3046
/home/epics/logs/13IDC.log:[Thu May 19 12:03:54 2022] VME Bus Error accessing A32: 0xbf7db01c
/home/epics/logs/13BMD.log:[Thu May 19 12:42:30 2022] VME Bus Error accessing A32: 0xbffcac70
/home/epics/logs/13IDA.log:[Thu May 19 12:44:11 2022] VME Bus Error accessing A16: 0x300e
/home/epics/logs/13IDC.log:[Thu May 19 13:01:55 2022] VME Bus Error accessing A32: 0xbf554014
/home/epics/logs/13IDE.log:[Thu May 19 13:11:16 2022] VME Bus Error accessing A32: 0xbf2e001c
/home/epics/logs/13IDE.log:[Thu May 19 13:11:45 2022] VME Bus Error accessing A16: 0x315e
/home/epics/logs/13IDD.log:[Thu May 19 13:18:03 2022] VME Bus Error accessing A32: 0xbf390018
/home/epics/logs/13IDE.log:[Thu May 19 13:25:25 2022] VME Bus Error accessing A32: 0xbf140014
/home/epics/logs/13BMD.log:[Thu May 19 13:40:38 2022] VME Bus Error accessing A16: 0x317e
/home/epics/logs/13IDE.log:[Thu May 19 14:09:56 2022] VME Bus Error accessing A32: 0xbf140014
So there were about 20 errors in 24 hours. I did reboot the IOCs a few times during this period, and the probability of an error is definitely higher in the few minutes after a reboot, perhaps
as OPI displays reconnect their CA links.
At about Thu May 19 15:00 I switched to base 7.0.5 and ran for 3 hours, until about 18:00. There were no bus errors.
At Thu May 19 17:45 I switched back to 7.0.6.1 for about 3 hours until 21:00. I observed the following errors during this time.
/home/epics/logs/13IDE.log:[Thu May 19 17:58:59 2022] VME Bus Error accessing A32: 0xbf2e001c
/home/epics/logs/13BMD.log:[Thu May 19 18:09:48 2022] VME Bus Error accessing A16: 0x315e
/home/epics/logs/13IDD.log:[Thu May 19 19:16:00 2022] VME Bus Error accessing A16: 0x3206
At about Thu May 19 21:00 I switched to base 7.0.5 again. That was 11 hours ago, and there have been no bus errors in that time.
It is clear that my systems fail with 7.0.6.1 and work fine with 7.0.5, all other software being identical. It is certainly possible that 7.0.6.1 is exposing a problem with the other software
which 7.0.5 does not. I cannot afford to experiment or test more with these 7 VME systems, since beam comes back soon. I do have one other VME crate that can be used for testing. It is not currently running, so I don’t know if it fails with 7.0.6.1, but
I will turn it on today.
Note that prior to this week I had been running base 7.0.4 on these systems for over a year with no issues. This week I did update to the latest versions of the synApps modules, but I did
not change my IOC specific code at all.
Thanks,
Mark
From: Mark Rivers
Sent: Thursday, May 19, 2022 8:34 AM
To: Johnson, Andrew N. <anj at anl.gov>
Cc: tech-talk at aps.anl.gov
Subject: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules
Folks,
I updated my VME crates from base 7.0.4 to 7.0.6.1, and also updated to the latest version of synApps modules. 5 of the 7 crates are now giving VME bus errors within a few minutes of booting.
Since it is happening on 5 crates it is highly unlikely to be a hardware problem.
The errors are all for a small set of addresses:
A16: 0x300e, 0x3136, 0x3172
These are in the address space of the first Industry Pack module on a VIPC616_01 or TVME200 IP carrier card
A32: 0xbf57c014
I am not sure what device this is. The Joerger scalars are configured to start at address 0xB0000000, but I don’t think they use that much memory space.
The task generating the error is always “CAS-event”.
These are the errors:
13IDA crate:
[Wed May 18 20:23:06 2022] VME Bus Error accessing A16: 0x300e
[Wed May 18 20:23:06 2022] machine check
[Wed May 18 20:23:06 2022] Exception next instruction address: 0x03604600
[Wed May 18 20:23:06 2022] Machine Status Register: 0x0008b032
[Wed May 18 20:23:06 2022] Condition Register: 0x42000884
[Wed May 18 20:23:06 2022] Task: 0x55d8aa0 "CAS-event"
[Wed May 18 20:23:06 2022] 0x55d8aa0 (CAS-event): task 0x55d8aa0 has had a failure and has been stopped.
[Wed May 18 20:23:06 2022] 0x55d8aa0 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.
[Wed May 18 20:48:19 2022] 0x4d50400 (timerQueue): callbackRequest: cbLow ring buffer full
13BMA crate
[Wed May 18 18:16:04 2022] VME Bus Error accessing A16: 0x3136
[Wed May 18 18:16:04 2022] machine check
[Wed May 18 18:16:04 2022] Exception next instruction address: 0x0141b5b8
[Wed May 18 18:16:04 2022] Machine Status Register: 0x0008b032
[Wed May 18 18:16:04 2022] Condition Register: 0x42000884
[Wed May 18 18:16:04 2022] Task: 0x3545520 "CAS-event"
[Wed May 18 18:16:04 2022] 0x3545520 (CAS-event): task 0x3545520 has had a failure and has been stopped.
[Wed May 18 18:16:04 2022] 0x3545520 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.
13IDC crate:
[Wed May 18 18:11:58 2022] VME Bus Error accessing A32: 0xbf57c014
[Wed May 18 18:11:58 2022] machine check
[Wed May 18 18:11:58 2022] Exception next instruction address: 0x013962a0
[Wed May 18 18:11:58 2022] Machine Status Register: 0x0008b032
[Wed May 18 18:11:58 2022] Condition Register: 0x42000884
[Wed May 18 18:11:58 2022] Task: 0x395e4b0 "CAS-event"
[Wed May 18 18:11:58 2022] 0x395e4b0 (CAS-event): task 0x395e4b0 has had a failure and has been stopped.
[Wed May 18 18:11:58 2022] 0x395e4b0 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.
13BMD crate:
[Wed May 18 18:12:43 2022] VME Bus Error accessing A16: 0x317e
[Wed May 18 18:12:43 2022] machine check
[Wed May 18 18:12:43 2022] Exception next instruction address: 0x01395e40
[Wed May 18 18:12:43 2022] Machine Status Register: 0x0008b032
[Wed May 18 18:12:44 2022] Condition Register: 0x48000884
[Wed May 18 18:12:44 2022] Task: 0x3c73ae0 "CAS-event"
[Wed May 18 18:12:44 2022] 0x3c73ae0 (CAS-event): task 0x3c73ae0 has had a failure and has been stopped.
[Wed May 18 18:12:44 2022] 0x3c73ae0 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.
13IDE crate:
[Wed May 18 16:07:12 2022] VME Bus Error accessing A16: 0x317e
[Wed May 18 16:07:12 2022] machine check
[Wed May 18 16:07:12 2022] Exception next instruction address: 0x01395c20
[Wed May 18 16:07:12 2022] Machine Status Register: 0x0008b032
[Wed May 18 16:07:12 2022] Condition Register: 0x48000884
[Wed May 18 16:07:12 2022] Task: 0x3a13c60 "CAS-event"
[Wed May 18 16:07:12 2022] 0x3a13c60 (CAS-event): task 0x3a13c60 has had a failure and has been stopped.
[Wed May 18 16:07:12 2022] 0x3a13c60 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.
[Wed May 18 18:20:22 2022] VME Bus Error accessing A32: 0xbf140014
[Wed May 18 18:20:22 2022] machine check
[Wed May 18 18:20:22 2022] Exception next instruction address: 0x013959c0
[Wed May 18 18:20:22 2022] Machine Status Register: 0x0008b032
[Wed May 18 18:20:22 2022] Condition Register: 0x48000884
[Wed May 18 18:20:22 2022] Task: 0x3a2d780 "CAS-event"
[Wed May 18 18:20:22 2022] 0x3a2d780 (CAS-event): task 0x3a2d780 has had a failure and has been stopped.
[Wed May 18 18:20:23 2022] 0x3a2d780 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.
This is the result of “tt” and “ti” on a couple of the crashed tasks:
ioc13ida> tt 0x55d8aa0
0x0012489c vxTaskEntry +0x48 : epicsThreadEntry ()
0x036a6afc epicsThreadEntry+0x80 : 0x03604e20 ()
0x036050f8 db_start_events+0x458: db_delete_field_log ()
value = 0 = 0x0
ioc13ida> ti 0x55d8aa0
NAME ENTRY TID PRI STATUS PC SP ERRNO DELAY
---------- ------------ -------- --- ---------- -------- -------- ------- -----
CAS-event epicsThread> 55d8aa0 139 STOP+I 3604600 55d89b0 0 0
full task name : CAS-event
task entry : epicsThreadEntry
process : kernel
options : 0x1009001
VX_SUPERVISOR_MODE VX_DEALLOC_TCB VX_FP_TASK VX_DEALLOC_EXC_STACK
STACK BASE END SP SIZE HIGH MARGIN
--------- -------- -------- -------- ------- ------- -------
execution 55d8aa0 55d5bc0 55d89b0 12000 3568 8432
exception 55d9bf0 55d8d80 3696 1344 2352
VxWorks Events
--------------
Events Pended on : Not Pended
Received Events : 0x0
Options : N/A
r0 = 0x40050000 sp = 0x055d89b0 r2 = 0x002f1ac0
r3 = 0x04e56c70 r4 = 0x05293e20 r5 = 0x00000001
r6 = 0x00000001 r7 = 0x00000000 r8 = 0x00000000
r9 = 0x0368a8b8 r10 = 0x00000000 r11 = 0x055d8920
r12 = 0x22000884 r13 = 0x00321e60 r14 = 0x036a5e98
r15 = 0x03fb9a44 r16 = 0x035ec8c0 r17 = 0x03fb99fc
r18 = 0x00000000 r19 = 0x036a3f3c r20 = 0x03fb99bc
r21 = 0x0360461c r22 = 0x00000001 r23 = 0x036045b4
r24 = 0x04341544 r25 = 0x0369d438 r26 = 0x0369d698
r27 = 0x05156b28 r28 = 0x03639dc8 r29 = 0x05b15c98
r30 = 0x05293e20 r31 = 0x05293e20 msr = 0x0008b032
lr = 0x036050f8 ctr = 0x036045b4 pc = 0x03604600
cr = 0x42000884 xer = 0x00000000 pgTblPtr = 0x02310000
scSrTblPtr = 0x0230fe74 srTblPtr = 0x0230fe74
fpcsr = 0
fr0 = -1 fr1 = NaN fr2 = NaN fr3 = NaN
fr4 = NaN fr5 = NaN fr6 = NaN fr7 = NaN
fr8 = NaN fr9 = NaN fr10 = NaN fr11 = NaN
fr12 = NaN fr13 = NaN fr14 = NaN fr15 = NaN
fr16 = NaN fr17 = NaN fr18 = NaN fr19 = NaN
fr20 = NaN fr21 = NaN fr22 = NaN fr23 = NaN
fr24 = NaN fr25 = NaN fr26 = NaN fr27 = NaN
fr28 = NaN fr29 = NaN fr30 = NaN fr31 = NaN
machine check
Exception next instruction address: 0xva0lue = 3604600
Machine Status Register: 0x00008b032
Condition Register: 0x = 420008840x
ioc13ide> tt 0x3a2d780
0x00121e7c vxTaskEntry +0x48 : epicsThreadEntry ()
0x01437ebc epicsThreadEntry+0x80 : 0x013961e0 ()
0x013964b8 db_start_events+0x458: db_delete_field_log ()
value = 0 = 0x0
ioc13ide> ti 0x3a2d780
NAME ENTRY TID PRI STATUS PC SP ERRNO DELAY
---------- ------------ -------- --- ---------- -------- -------- ------- -----
CAS-event epicsThread> 3a2d780 133 STOP+I 13959c0 3a2d690 0 0
full task name : CAS-event
task entry : epicsThreadEntry
process : kernel
options : 0x1009001
VX_SUPERVISOR_MODE VX_DEALLOC_TCB VX_FP_TASK VX_DEALLOC_EXC_STACK
STACK BASE END SP SIZE HIGH MARGIN
--------- -------- -------- -------- ------- ------- -------
execution 3a2d780 3a2a8a0 3a2d690 12000 2800 9200
exception 3a2e8d0 3a2da60 3696 1344 2352
VxWorks Events
--------------
Events Pended on : Not Pended
Received Events : 0x0
Options : N/A
r0 = 0x40050000 sp = 0x03a2d690 r2 = 0x002e5010
r3 = 0x02dc8530 r4 = 0x0366ac00 r5 = 0x00000001
r6 = 0x00000001 r7 = 0x00000000 r8 = 0x00000002
r9 = 0x0141bc78 r10 = 0x00000000 r11 = 0x03a2d600
r12 = 0x28000884 r13 = 0x00313850 r14 = 0x01437258
r15 = 0x01d4ae04 r16 = 0x0137dc80 r17 = 0x01d4adbc
r18 = 0x00000000 r19 = 0x014352fc r20 = 0x01d4ad7c
r21 = 0x013959dc r22 = 0x00000001 r23 = 0x01395974
r24 = 0x020d2904 r25 = 0x0142e7f8 r26 = 0x0142ea58
r27 = 0x0351b4c8 r28 = 0x013cb188 r29 = 0x03a98e70
r30 = 0x0366ac00 r31 = 0x0366ac00 msr = 0x0008b032
lr = 0x013964b8 ctr = 0x01395974 pc = 0x013959c0
cr = 0x48000884 xer = 0x00000000 pgTblPtr = 0x00af7000
scSrTblPtr = 0x00af65c4 srTblPtr = 0x00af65c4
fpcsr = 0
fr0 = -7.62951e-05 fr1 = NaN fr2 = NaN fr3 = NaN
fr4 = NaN fr5 = NaN fr6 = NaN fr7 = NaN
fr8 = NaN fr9 = NaN fr10 = NaN fr11 = NaN
fr12 = NaN fr13 = NaN fr14 = NaN fr15 = NaN
fr16 = NaN fr17 = NaN fr18 = NaN fr19 = NaN
fr20 = NaN fr21 = NaN fr22 = NaN fr23 = NaN
fr24 = NaN fr25 = NaN fr26 = NaN fr27 = NaN
fr28 = NaN fr29 = NaN fr30 = NaN fr31 = NaN
machine checkvalue =
Exception next instruction address: 0x0013959c0
Machine Status Register: 0x = 0008b032
Condition Register: 0x048000884x
Any ideas?
Thanks,
Mark