1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 <2022> 2023 2024 | Index | 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 <2022> 2023 2024 |
<== Date ==> | <== Thread ==> |
---|
Subject: | RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules |
From: | Mark Rivers via Tech-talk <tech-talk at aps.anl.gov> |
To: | "tech-talk at aps.anl.gov" <tech-talk at aps.anl.gov> |
Date: | Sat, 21 May 2022 16:42:03 +0000 |
My production VME crates are now running base 7.0.5 and working fine. I have one test crate which I just started with base 7.0.6.1. Fortunately it also fails, so it should be possible to debug this issue.
This crate has 7 Industry Pack modules, including an IP300 ADC. That module uses interrupts at 2 kHz to read the ADC values. There was a VME bus error as soon as I opened
an medm screen to look at the ai records for this device. This was the error: [Sat May 21 10:06:18 2022] VME Bus Error accessing A16: 0x325e [Sat May 21 10:06:18 2022] machine check [Sat May 21 10:06:18 2022] Exception next instruction address: 0x03606bd4 [Sat May 21 10:06:18 2022] Machine Status Register: 0x0008b032 [Sat May 21 10:06:18 2022] Condition Register: 0x48000884 [Sat May 21 10:06:18 2022] Task: 0x58e8e10 "CAS-event" [Sat May 21 10:06:18 2022] 0x58e8e10 (CAS-event): task 0x58e8e10 has had a failure and has been stopped. [Sat May 21 10:06:18 2022] 0x58e8e10 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10. This is the Industry Pack configuration on that crate. I have highlighted in red the module whose A16 address corresponds to the above VME bus error.
So as soon as I open an medm screen to show the IP330 ai records it updates the values for about 1 second and then the CAS-event task gets an A16 bus error at address 0x325e,
which is in the address range for the IP330. Why would the CAS-event task be doing VME I/O, since I was not doing a CA put, I just opened an medm screen that would be doing CA monitors? I tested this several times. It always fails within 1 second, but not always with the same error. Sometimes I get the same A16 bus error with the same 0x325e address.
But sometimes I get this error: VME Bus Error accessing A32: 0xbfa1ee10 machine check Exception next instruction address: 0x0368cec0 Machine Status Register: 0x0008b032 Condition Register: 0x42000884 Task: 0x5529e10 "CAS-event" 0x5529e10 (CAS-event): task 0x5529e10 has had a failure and has been stopped. 0x5529e10 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10. I also tested opening an MEDM screen to adjust the D/A values on the DAC128V. After a few seconds of doing that I got this error: [Sat May 21 11:08:25 2022] VME Bus Error accessing A32: 0xbfd76a40 [Sat May 21 11:08:25 2022] machine check [Sat May 21 11:08:25 2022] Exception next instruction address: 0x0368cea0 [Sat May 21 11:08:25 2022] Machine Status Register: 0x0008b032 [Sat May 21 11:08:25 2022] Condition Register: 0x48000884 [Sat May 21 11:08:25 2022] Task: 0x564bf20 "CAS-event" [Sat May 21 11:08:25 2022] 0x564bf20 (CAS-event): task 0x564bf20 has had a failure and has been stopped. [Sat May 21 11:08:25 2022] 0x564bf20 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10. This error is in the A32 address space. I am not sure if the DAC128V driver uses the A32 address space or not. However, I have not been able to reproduce this error, usually
I can adjust the DAC for a long time with no crash, so this could have been a coincidence. Mark From: Mark Rivers The A16 VME bus errors are all in the 0x3xxx range, which is the address of the first IP carrier card. That card holds 4 modules, and these each have 256 bytes of A16 address
space.
There are 4 VME crates that are generating VME bus errors, and 3 which are not. These have different Industry Pack modules installed in each slot, listed in this table.
I have highlighted in red the modules corresponding to the bus failure addresses.
The failures are only on the tyGSOctal and the IP300.
The tyGSOctal is an 8-channel RS-232 module which has lots of activity polling serial devices. The IP330 is an interrupt driven ADC which is being read at 1 kHz. This also has lots of VME bus activity. The IpUnidig is an interrupt driver digital I/O module. Because its inputs are not changing, and there are no clients writing the outputs there is very little VME bus activity. The DAC128V is an 8-channel D/A. Since there are no clients writing to it there is very little VME bus activity. The failures appear to be limited to the addresses of modules with lots of VME bus activity, and not to a single module since the tyGSOctal, IP330, and IpUnidig all seem
to fail. I have not yet looked at the A32 VME bus address errors. Mark From: Mark Rivers I have now tested using base 7.0.5 rather than 7.0.6.1. The problem does not occur with base 7.0.5. I have switched back and forth twice, being certain to completely clean
and rebuild all synApps modules and my own application first when doing so. I am running 7 VME crates with vxWorks 6.9.4.1. I am logging the IOC output for each of them with procServ. I was running base 7.0.6.1 between about Wed May 18 15:00 and Thu May 19 15:00. During that time I observed the following VME bus errors.
/home/epics/logs/13BMD.log:[Wed May 18 15:46:12 2022] VME Bus Error accessing A16: 0x3316 /home/epics/logs/13IDE.log:[Wed May 18 16:07:12 2022] VME Bus Error accessing A16: 0x317e /home/epics/logs/13IDA.log:[Wed May 18 16:50:32 2022] VME Bus Error accessing A16: 0x3146 /home/epics/logs/13IDE.log:[Wed May 18 17:23:08 2022] VME Bus Error accessing A16: 0x317e /home/epics/logs/13IDC.log:[Wed May 18 18:11:58 2022] VME Bus Error accessing A32: 0xbf57c014 /home/epics/logs/13BMD.log:[Wed May 18 18:12:43 2022] VME Bus Error accessing A16: 0x317e /home/epics/logs/13BMA.log:[Wed May 18 18:16:04 2022] VME Bus Error accessing A16: 0x3136 /home/epics/logs/13IDE.log:[Wed May 18 18:20:22 2022] VME Bus Error accessing A32: 0xbf140014 /home/epics/logs/13IDA.log:[Wed May 18 20:23:06 2022] VME Bus Error accessing A16: 0x300e /home/epics/logs/13BMD.log:[Thu May 19 11:13:34 2022] VME Bus Error accessing A16: 0x3046 /home/epics/logs/13IDC.log:[Thu May 19 12:03:54 2022] VME Bus Error accessing A32: 0xbf7db01c /home/epics/logs/13BMD.log:[Thu May 19 12:42:30 2022] VME Bus Error accessing A32: 0xbffcac70 /home/epics/logs/13IDA.log:[Thu May 19 12:44:11 2022] VME Bus Error accessing A16: 0x300e /home/epics/logs/13IDC.log:[Thu May 19 13:01:55 2022] VME Bus Error accessing A32: 0xbf554014 /home/epics/logs/13IDE.log:[Thu May 19 13:11:16 2022] VME Bus Error accessing A32: 0xbf2e001c /home/epics/logs/13IDE.log:[Thu May 19 13:11:45 2022] VME Bus Error accessing A16: 0x315e /home/epics/logs/13IDD.log:[Thu May 19 13:18:03 2022] VME Bus Error accessing A32: 0xbf390018 /home/epics/logs/13IDE.log:[Thu May 19 13:25:25 2022] VME Bus Error accessing A32: 0xbf140014 /home/epics/logs/13BMD.log:[Thu May 19 13:40:38 2022] VME Bus Error accessing A16: 0x317e /home/epics/logs/13IDE.log:[Thu May 19 14:09:56 2022] VME Bus Error accessing A32: 0xbf140014 So there were about 20 errors in 24 hours. I did reboot the IOCs a few times during this period, and the probability of an error is definitely higher in the few minutes
after a reboot, perhaps as OPI displays reconnect their CA links. At about Thu May 19 15:00 I switched to base 7.0.5 and ran for 3 hours, until about 18:00. There were no bus errors. At Thu May 19 17:45 I switched back to 7.0.6.1 for about 3 hours until 21:00. I observed the following errors during this time. /home/epics/logs/13IDE.log:[Thu May 19 17:58:59 2022] VME Bus Error accessing A32: 0xbf2e001c /home/epics/logs/13BMD.log:[Thu May 19 18:09:48 2022] VME Bus Error accessing A16: 0x315e /home/epics/logs/13IDD.log:[Thu May 19 19:16:00 2022] VME Bus Error accessing A16: 0x3206 At about Thu May 19 21:00 I switched to base 7.0.5 again. That was 11 hours ago, and there have been no bus errors in that time. It is clear that my systems fail with 7.0.6.1 and work fine with 7.0.5, all other software being identical. It is certainly possible that 7.0.6.1 is exposing a problem with
the other software which 7.0.5 does not. I cannot afford to experiment or test more with these 7 VME systems, since beam comes back soon. I do have one other VME crate that can be used for testing. It is not currently running, so I don’t know if it fails
with 7.0.6.1, but I will turn it on today. Note that prior to this week I had been running base 7.0.4 on these systems for over a year with no issues. This week I did update to the latest versions of the synApps
modules, but I did not change my IOC specific code at all. Thanks, Mark From: Mark Rivers Folks, I updated my VME crates from base 7.0.4 to 7.0.6.1, and also updated to the latest version of synApps modules. 5 of the 7 crates are now giving VME bus errors within a few
minutes of booting. Since it is happening on 5 crates it is highly unlikely to be a hardware problem. The errors are all for a small set of addresses: A16: 0x300e, 0x3136, 0x3172 These are in the address space of the first Industry Pack module on a VIPC616_01 or TVME200 IP carrier card A32: 0xbf57c014 I am not sure what device this is. The Joerger scalars are configured to start at address 0xB0000000, but I don’t think they use that much memory space. The task generating the error is always “CAS-event”. These are the errors: 13IDA crate: [Wed May 18 20:23:06 2022] VME Bus Error accessing A16: 0x300e [Wed May 18 20:23:06 2022] machine check [Wed May 18 20:23:06 2022] Exception next instruction address: 0x03604600 [Wed May 18 20:23:06 2022] Machine Status Register: 0x0008b032 [Wed May 18 20:23:06 2022] Condition Register: 0x42000884 [Wed May 18 20:23:06 2022] Task: 0x55d8aa0 "CAS-event" [Wed May 18 20:23:06 2022] 0x55d8aa0 (CAS-event): task 0x55d8aa0 has had a failure and has been stopped. [Wed May 18 20:23:06 2022] 0x55d8aa0 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10. [Wed May 18 20:48:19 2022] 0x4d50400 (timerQueue): callbackRequest: cbLow ring buffer full 13BMA crate [Wed May 18 18:16:04 2022] VME Bus Error accessing A16: 0x3136 [Wed May 18 18:16:04 2022] machine check [Wed May 18 18:16:04 2022] Exception next instruction address: 0x0141b5b8 [Wed May 18 18:16:04 2022] Machine Status Register: 0x0008b032 [Wed May 18 18:16:04 2022] Condition Register: 0x42000884 [Wed May 18 18:16:04 2022] Task: 0x3545520 "CAS-event" [Wed May 18 18:16:04 2022] 0x3545520 (CAS-event): task 0x3545520 has had a failure and has been stopped. [Wed May 18 18:16:04 2022] 0x3545520 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10. 13IDC crate: [Wed May 18 18:11:58 2022] VME Bus Error accessing A32: 0xbf57c014 [Wed May 18 18:11:58 2022] machine check [Wed May 18 18:11:58 2022] Exception next instruction address: 0x013962a0 [Wed May 18 18:11:58 2022] Machine Status Register: 0x0008b032 [Wed May 18 18:11:58 2022] Condition Register: 0x42000884 [Wed May 18 18:11:58 2022] Task: 0x395e4b0 "CAS-event" [Wed May 18 18:11:58 2022] 0x395e4b0 (CAS-event): task 0x395e4b0 has had a failure and has been stopped. [Wed May 18 18:11:58 2022] 0x395e4b0 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10. 13BMD crate: [Wed May 18 18:12:43 2022] VME Bus Error accessing A16: 0x317e [Wed May 18 18:12:43 2022] machine check [Wed May 18 18:12:43 2022] Exception next instruction address: 0x01395e40 [Wed May 18 18:12:43 2022] Machine Status Register: 0x0008b032 [Wed May 18 18:12:44 2022] Condition Register: 0x48000884 [Wed May 18 18:12:44 2022] Task: 0x3c73ae0 "CAS-event" [Wed May 18 18:12:44 2022] 0x3c73ae0 (CAS-event): task 0x3c73ae0 has had a failure and has been stopped. [Wed May 18 18:12:44 2022] 0x3c73ae0 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10. 13IDE crate: [Wed May 18 16:07:12 2022] VME Bus Error accessing A16: 0x317e [Wed May 18 16:07:12 2022] machine check [Wed May 18 16:07:12 2022] Exception next instruction address: 0x01395c20 [Wed May 18 16:07:12 2022] Machine Status Register: 0x0008b032 [Wed May 18 16:07:12 2022] Condition Register: 0x48000884 [Wed May 18 16:07:12 2022] Task: 0x3a13c60 "CAS-event" [Wed May 18 16:07:12 2022] 0x3a13c60 (CAS-event): task 0x3a13c60 has had a failure and has been stopped. [Wed May 18 16:07:12 2022] 0x3a13c60 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.
[Wed May 18 18:20:22 2022] VME Bus Error accessing A32: 0xbf140014 [Wed May 18 18:20:22 2022] machine check [Wed May 18 18:20:22 2022] Exception next instruction address: 0x013959c0 [Wed May 18 18:20:22 2022] Machine Status Register: 0x0008b032 [Wed May 18 18:20:22 2022] Condition Register: 0x48000884 [Wed May 18 18:20:22 2022] Task: 0x3a2d780 "CAS-event" [Wed May 18 18:20:22 2022] 0x3a2d780 (CAS-event): task 0x3a2d780 has had a failure and has been stopped. [Wed May 18 18:20:23 2022] 0x3a2d780 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10. This is the result of “tt” and “ti” on a couple of the crashed tasks: ioc13ida> tt 0x55d8aa0 0x0012489c vxTaskEntry +0x48 : epicsThreadEntry () 0x036a6afc epicsThreadEntry+0x80 : 0x03604e20 () 0x036050f8 db_start_events+0x458: db_delete_field_log () value = 0 = 0x0 ioc13ida> ti 0x55d8aa0 NAME ENTRY TID PRI STATUS PC SP ERRNO DELAY ---------- ------------ -------- --- ---------- -------- -------- ------- ----- CAS-event epicsThread> 55d8aa0 139 STOP+I 3604600 55d89b0 0 0 full task name : CAS-event task entry : epicsThreadEntry process : kernel options : 0x1009001 VX_SUPERVISOR_MODE VX_DEALLOC_TCB VX_FP_TASK VX_DEALLOC_EXC_STACK STACK BASE END SP SIZE HIGH MARGIN --------- -------- -------- -------- ------- ------- ------- execution 55d8aa0 55d5bc0 55d89b0 12000 3568 8432 exception 55d9bf0 55d8d80 3696 1344 2352 VxWorks Events -------------- Events Pended on : Not Pended Received Events : 0x0 Options : N/A r0 = 0x40050000 sp = 0x055d89b0 r2 = 0x002f1ac0 r3 = 0x04e56c70 r4 = 0x05293e20 r5 = 0x00000001 r6 = 0x00000001 r7 = 0x00000000 r8 = 0x00000000 r9 = 0x0368a8b8 r10 = 0x00000000 r11 = 0x055d8920 r12 = 0x22000884 r13 = 0x00321e60 r14 = 0x036a5e98 r15 = 0x03fb9a44 r16 = 0x035ec8c0 r17 = 0x03fb99fc r18 = 0x00000000 r19 = 0x036a3f3c r20 = 0x03fb99bc r21 = 0x0360461c r22 = 0x00000001 r23 = 0x036045b4 r24 = 0x04341544 r25 = 0x0369d438 r26 = 0x0369d698 r27 = 0x05156b28 r28 = 0x03639dc8 r29 = 0x05b15c98 r30 = 0x05293e20 r31 = 0x05293e20 msr = 0x0008b032 lr = 0x036050f8 ctr = 0x036045b4 pc = 0x03604600 cr = 0x42000884 xer = 0x00000000 pgTblPtr = 0x02310000 scSrTblPtr = 0x0230fe74 srTblPtr = 0x0230fe74 fpcsr = 0 fr0 = -1 fr1 = NaN fr2 = NaN fr3 = NaN fr4 = NaN fr5 = NaN fr6 = NaN fr7 = NaN fr8 = NaN fr9 = NaN fr10 = NaN fr11 = NaN fr12 = NaN fr13 = NaN fr14 = NaN fr15 = NaN fr16 = NaN fr17 = NaN fr18 = NaN fr19 = NaN fr20 = NaN fr21 = NaN fr22 = NaN fr23 = NaN fr24 = NaN fr25 = NaN fr26 = NaN fr27 = NaN fr28 = NaN fr29 = NaN fr30 = NaN fr31 = NaN machine check Exception next instruction address: 0xva0lue = 3604600 Machine Status Register: 0x00008b032 Condition Register: 0x = 420008840x ioc13ide> tt 0x3a2d780 0x00121e7c vxTaskEntry +0x48 : epicsThreadEntry () 0x01437ebc epicsThreadEntry+0x80 : 0x013961e0 () 0x013964b8 db_start_events+0x458: db_delete_field_log () value = 0 = 0x0 ioc13ide> ti 0x3a2d780 NAME ENTRY TID PRI STATUS PC SP ERRNO DELAY ---------- ------------ -------- --- ---------- -------- -------- ------- ----- CAS-event epicsThread> 3a2d780 133 STOP+I 13959c0 3a2d690 0 0 full task name : CAS-event task entry : epicsThreadEntry process : kernel options : 0x1009001 VX_SUPERVISOR_MODE VX_DEALLOC_TCB VX_FP_TASK VX_DEALLOC_EXC_STACK STACK BASE END SP SIZE HIGH MARGIN --------- -------- -------- -------- ------- ------- ------- execution 3a2d780 3a2a8a0 3a2d690 12000 2800 9200 exception 3a2e8d0 3a2da60 3696 1344 2352 VxWorks Events -------------- Events Pended on : Not Pended Received Events : 0x0 Options : N/A r0 = 0x40050000 sp = 0x03a2d690 r2 = 0x002e5010 r3 = 0x02dc8530 r4 = 0x0366ac00 r5 = 0x00000001 r6 = 0x00000001 r7 = 0x00000000 r8 = 0x00000002 r9 = 0x0141bc78 r10 = 0x00000000 r11 = 0x03a2d600 r12 = 0x28000884 r13 = 0x00313850 r14 = 0x01437258 r15 = 0x01d4ae04 r16 = 0x0137dc80 r17 = 0x01d4adbc r18 = 0x00000000 r19 = 0x014352fc r20 = 0x01d4ad7c r21 = 0x013959dc r22 = 0x00000001 r23 = 0x01395974 r24 = 0x020d2904 r25 = 0x0142e7f8 r26 = 0x0142ea58 r27 = 0x0351b4c8 r28 = 0x013cb188 r29 = 0x03a98e70 r30 = 0x0366ac00 r31 = 0x0366ac00 msr = 0x0008b032 lr = 0x013964b8 ctr = 0x01395974 pc = 0x013959c0 cr = 0x48000884 xer = 0x00000000 pgTblPtr = 0x00af7000 scSrTblPtr = 0x00af65c4 srTblPtr = 0x00af65c4 fpcsr = 0 fr0 = -7.62951e-05 fr1 = NaN fr2 = NaN fr3 = NaN fr4 = NaN fr5 = NaN fr6 = NaN fr7 = NaN fr8 = NaN fr9 = NaN fr10 = NaN fr11 = NaN fr12 = NaN fr13 = NaN fr14 = NaN fr15 = NaN fr16 = NaN fr17 = NaN fr18 = NaN fr19 = NaN fr20 = NaN fr21 = NaN fr22 = NaN fr23 = NaN fr24 = NaN fr25 = NaN fr26 = NaN fr27 = NaN fr28 = NaN fr29 = NaN fr30 = NaN fr31 = NaN machine checkvalue = Exception next instruction address: 0x0013959c0 Machine Status Register: 0x = 0008b032 Condition Register: 0x048000884x Any ideas? Thanks, Mark |