Here is further evidence that the A16 bus error messages are erroneous. The bus errors are always at offset 0x5e or 0x7e in the IP330 register address space.
This is the register map of the IP330 from the source code, with my annotation of the register offset added.
typedef struct ip330ADCregs {                      /* Register offset */
    unsigned short control;                        /* 0 */
#if BYTE_ORDER==__BIG_ENDIAN
    unsigned char timePrescale;                    /* 2 */
    unsigned char intVector;                       /* 3 */
#else
    unsigned char intVector;                       /* 2 */
    unsigned char timePrescale;                    /* 3 */
#endif
    unsigned short conversionTime;                 /* 4 */
#if BYTE_ORDER==__BIG_ENDIAN
    unsigned char endChanVal;                      /* 6 */
    unsigned char startChanVal;                    /* 7 */
#else
    unsigned char startChanVal;                    /* 6 */
    unsigned char endChanVal;                      /* 7 */
#endif
    unsigned short newData[2];                     /* 8 */
    unsigned short missedData[2];                  /* 12 */
    unsigned short startConvert;                   /* 16 */
    unsigned char pad[0x0E];                       /* 18 */
    unsigned char gain[MAX_IP330_CHANNELS];        /* 32 */
    unsigned short mailBox[MAX_IP330_CHANNELS];    /* 64-127 (0x40-0x7f) */
} ip330ADCregs;
Offsets 0x5e and 0x7e are in the mailbox registers, which is where the driver reads the ADC values. The driver reads 16 registers every 500 microseconds in an ISR. Since the hardware is known to be working fine, and since those are definitely valid A16 addresses, I conclude that the VME bus error messages are erroneous, and the real problem is in dbEvent.c.
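The offset arithmetic above can be checked mechanically. The following is only a sketch: it assumes MAX_IP330_CHANNELS is 32 (inferred from the annotated offsets, where gain[] spans 32-63 and mailBox[] spans 64-127), it keeps only the big-endian field order (the byte swaps do not move any offsets), and the helper names ip330_mailbox_index() and ip330_check_reported_offsets() are hypothetical, not part of the IP330 driver.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 32 channels, inferred from the annotated offsets above. */
#define MAX_IP330_CHANNELS 32

/* Simplified copy of the register map, big-endian field order only. */
typedef struct ip330ADCregs {
    unsigned short control;
    unsigned char  timePrescale;
    unsigned char  intVector;
    unsigned short conversionTime;
    unsigned char  endChanVal;
    unsigned char  startChanVal;
    unsigned short newData[2];
    unsigned short missedData[2];
    unsigned short startConvert;
    unsigned char  pad[0x0E];
    unsigned char  gain[MAX_IP330_CHANNELS];
    unsigned short mailBox[MAX_IP330_CHANNELS];
} ip330ADCregs;

/* Compile-time checks that the layout matches the annotated offsets
 * (these assume a 2-byte unsigned short). */
_Static_assert(offsetof(ip330ADCregs, gain) == 0x20, "gain[] should start at 32");
_Static_assert(offsetof(ip330ADCregs, mailBox) == 0x40, "mailBox[] should start at 64");
_Static_assert(sizeof(ip330ADCregs) == 0x80, "register block should be 128 bytes");

/* Hypothetical helper: map a byte offset within the register block to a
 * mailBox[] index, or return -1 if the offset is not an aligned mailbox
 * register. */
int ip330_mailbox_index(size_t offset)
{
    size_t base = offsetof(ip330ADCregs, mailBox);

    if (offset < base || offset >= sizeof(ip330ADCregs) || (offset & 1) != 0)
        return -1;
    return (int)((offset - base) / sizeof(unsigned short));
}

/* Hypothetical self-check for the two offsets seen in the bus error reports. */
void ip330_check_reported_offsets(void)
{
    assert(ip330_mailbox_index(0x5e) == 15);
    assert(ip330_mailbox_index(0x7e) == 31);
}
```

Under these assumptions, offset 0x5e is mailBox[15] and offset 0x7e is mailBox[31], so both reported addresses fall on aligned 16-bit mailbox registers inside the 128-byte block.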
Every time CAS-event crashes, the stack trace (vxWorks tt command) is the same. I just rebooted with 7.0.6.1 and CAS-event crashed again when the A/D did lots of monitor callbacks. This is the output:
VME Bus Error accessing A16: 0x347e
machine check
Exception next instruction address: 0x0368ce90
Machine Status Register: 0x0008b032
Condition Register: 0x48000884
Task: 0x2702130 "CAS-event"
0x2702130 (CAS-event): task 0x2702130 has had a failure and has been stopped.
0x2702130 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.
ioc13lab2> tt 0x2702130
0x0012489c vxTaskEntry +0x48 : epicsThreadEntry ()
0x036a90d4 epicsThreadEntry+0x80 : 0x036073f8 ()
0x036076d0 db_start_events+0x458: db_delete_field_log ()
0x03606be0 db_delete_field_log+0x54 : freeListFree ()
value = 0 = 0x0
It looks like freeListFree() is probably generating an access violation.
Mark
From: Mark Rivers
Sent: Thursday, May 26, 2022 3:17 PM
To: Michael Davidsaver <mdavidsaver at gmail.com>
Cc: tech-talk at aps.anl.gov
Subject: RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules
Hi Michael,
I have now stripped my VME crate down to just 2 cards, the MVME5100 CPU and a TVME220 IP carrier card. The carrier card has a DAC128V D/A and an IP330 A/D. Channel 0 of the D/A is connected to channel 0 of the A/D.
The startup script is very simple:
**********************************
# vxWorks startup file
< cdCommands
nfsAuthUnixSet("corvette", 849601092, 849600513, 0, 0)
# Mount drives with NFS
nfsMount("corvette","/home","/corvette/home")
nfsMount("corvette","/home","/home")
cd topbin
load("CARSTest.munch")
cd startup
dbLoadDatabase("$(CARS)/dbd/CARSTestVX.dbd")
CARSTestVX_registerRecordDeviceDriver(pdbbase)
ipacAddTVME200("342FA2")
initDAC128V("DAC1", 0, 1)
dbLoadTemplate "DAC.template"
initIp330("Ip330_1",0,0,"D","-10to10",0,15,120)
configIp330("Ip330_1", 3,"Input", 500,0)
dbLoadTemplate "Ip330_ADC.template"
iocInit
**********************************
The IPAC driver, DAC128V driver, and IP330 driver have not changed in years.
When I build this IOC with base 7.0.5 it works fine, there are no bus errors.
When I build this IOC with base 7.0.6.1 and adjust the D/A quickly with a slider for a few seconds while a couple of CA clients are receiving monitors from the A/D, I get the following failure:
VME Bus Error accessing A16: 0x347e
machine check
Exception next instruction address: 0x0368ce90
Machine Status Register: 0x0008b032
Condition Register: 0x48000884
Task: 0x27011d0 "CAS-event"
0x27011d0 (CAS-event): task 0x27011d0 has had a failure and has been stopped.
0x27011d0 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.
This is the task trace on that task:
ioc13lab2> tt 0x27011d0
0x0012489c vxTaskEntry +0x48 : epicsThreadEntry ()
0x036a90d4 epicsThreadEntry+0x80 : 0x036073f8 ()
0x036076d0 db_start_events+0x458: db_delete_field_log ()
0x03606be0 db_delete_field_log+0x54 : freeListFree ()
value = 0 = 0x0
You said:
> So I'm more confident in claiming the mention of "CAS-event" is false.
> The faulting instruction probably originates on some other scan/driver thread, then there is a context switch to "CAS-event" because of a call to db_post_events().
I now strongly suspect the problem is the opposite of that. I think that the task that is failing is indeed CAS-event, and what is incorrect is the report of a bus error.
The reason I think this is:
- The IPAC, DAC, and IP330 drivers are very well debugged.
- The errors do not happen with base 7.0.5.
- The bus error messages only happen when there are lots of CA monitor events being passed to CA clients. The errors never occur if there are no CA clients receiving monitors. That makes no sense in terms of actual bus errors.
- The code in dbEvent.c relating to db_field_log has changed significantly between base 7.0.5 and 7.0.6.1.
Mark
-----Original Message-----
From: Michael Davidsaver <mdavidsaver at gmail.com>
Sent: Saturday, May 21, 2022 10:04 PM
To: Mark Rivers <rivers at cars.uchicago.edu>
Cc: tech-talk at aps.anl.gov
Subject: Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules
On 5/21/22 15:34, Mark Rivers wrote:
> > Ok, so all powerpc. PPC Machine Check exception is asynchronous.
> >
> > So I'm more confident in claiming the mention of "CAS-event" is false.
> >
> > The faulting instruction probably originates on some other scan/driver thread, then there is a context switch to "CAS-event" because of a call to db_post_events().
>
> I’m not sure I understand the logic. The other scan/driver threads are always running. The IP330 is always interrupting at 2 kHz, and doing callbacks to device support. It runs with no VME bus errors at all until I open an medm screen or run camonitor. So it seems that the problem must be caused by having the CAS-event task do CA monitors, and it is not just that the CAS-event task is being incorrectly blamed for the problem?
Oh, what I say is far from a concrete explanation. There is clearly something more going on here.
I think it could only be true if the faulting operation were a "posted write". Meaning the VME bridge buffers the operation, letting the CPU proceed before the VME bus cycle has actually completed.
This is why you may sometimes find drivers with a "dummy" load after an important store (e.g. interrupt acknowledge). Waiting for the load instruction will also wait for the preceding store to complete.
See, e.g., an admonishment by Till in the RTEMS universe2 bridge driver:
https://github.com/RTEMS/rtems/blob/a316a9ddaeaa8f6316b2a2d29ca82b3ad40d2d22/bsps/powerpc/shared/vme/vmeUniverse.c#L2187-L2189
"posted writes" are a configuration option for each address window.
Disabling this may give a more accurate address with the exception, at the expense of some slow down.
Of course, none of this would explain why these particular addresses are being accessed, nor why they fault.
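As an illustration of the "dummy load after store" idiom described above, here is a minimal sketch. The function name write_then_read_back() is hypothetical and does not come from the IPAC, IP330, or RTEMS sources; it assumes the register can safely be read back, and it omits any CPU barrier instructions that a real driver for a particular bridge might also need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store to a device register, then load from the same
 * register. If the bridge posts (buffers) the write, the load cannot return
 * until the buffered write has gone out on the bus, so a failure of the
 * write is more likely to surface here than at some unrelated later point. */
uint16_t write_then_read_back(volatile uint16_t *reg, uint16_t value)
{
    assert(reg != NULL);

    *reg = value;   /* store: may be posted by the bridge */
    return *reg;    /* dummy load: waits for the preceding store to complete */
}
```

The volatile qualifier keeps the compiler from dropping or merging the two accesses. Whether the returned value equals the value written depends on the register, so callers of a helper like this would normally ignore the result for write-only or self-clearing registers.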
On 5/21/22 16:45, Mark Rivers wrote:
> > I will also try to make a thin vxWorks IOC application with basically just Industry Pack module support in case it is some strange interaction with another module.
>
> A thin IOC with only seq, asyn, iocStats, ipac, ip330, dac128V, and ipUnidig does not fail with base 7.0.6.1.
>
> I will add things back in one at a time and see what is actually causing the problem.
>
> Luckily it is a gray and rainy day in Chicago. :)
Some people have all the luck. Today was terribly sunny here in CA :)
(I've been here 4.5 years, and I still can't get over the weather!)
> I will try 7.0.6.
>
> I will also try to make a thin vxWorks IOC application with basically just Industry Pack module support in case it is some strange interaction with another module.
>
> Mark
>
> -----Original Message-----
> From: Michael Davidsaver <mdavidsaver at gmail.com>
> Sent: Saturday, May 21, 2022 4:31 PM
> To: Mark Rivers <rivers at cars.uchicago.edu>
> Cc: tech-talk at aps.anl.gov
> Subject: Re: Bus errors accessing VME with base 7.0.6.1 and latest
> synApps modules
>
> On 5/21/22 11:05, Mark Rivers wrote:
>
> > > What specific board is involved? (e.g. mvme3100?)
>
> >
>
> > The test crate is an MVME5100. But the production crates that were also failing include several MVME2700 boards as well as some MVME5100.
>
> Ok, so all powerpc. PPC Machine Check exception is asynchronous.
>
> So I'm more confident in claiming the mention of "CAS-event" is false. The faulting instruction probably originates on some other scan/driver thread, then there is a context switch to "CAS-event" because of a call to db_post_events().
>
> It still seems odd to me that the CPU could get all the way into
> db_post_events() and wake up "CAS-event" before a VME cycle completes.
> (maybe there are VME timeouts happening?)
>
> In addition to Base 7.0.5 and 7.0.6.1, could you test with 7.0.6?
>
> This might narrow things down a little.
>
> Since you already have a version range to suspect, you could try to narrow down further with git-bisect.
>
> (although I can't honestly recommend this as a good way to pass what
> for me is a nice Saturday afternoon.)
>
> https://git-scm.com/docs/git-bisect
>
> > git bisect start R7.0.6.1 R7.0.5
>