EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System

1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  <20222023  2024  Index 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  <20222023  2024 
<== Date ==> <== Thread ==>

Subject: RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules
From: Mark Rivers via Tech-talk <tech-talk at aps.anl.gov>
To: Till Straumann <till.straumann at psi.ch>
Cc: "tech-talk at aps.anl.gov" <tech-talk at aps.anl.gov>
Date: Thu, 9 Jun 2022 21:49:01 +0000

Hi Till,

 

I tried your test program where I expanded to 36 NOPs.

 

void address_test() {

   *(volatile uint16_t*)0xfbff345e = 0;

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */

   /* copy/paste ~30 of these NOPs here */

}

 

Much to my surprise the program does not generate a bus error.  The two A16 addresses that it most frequently failed with were 0x345e and 0x347e.  Your program tests the first one.

 

iocexample> address_test

value = 0 = 0x0

 

I then tried writing to those A16 addresses manually from the vxWorks shell.  I can write to both of them with no bus error.  So I was wrong, those addresses are both readable and writeable.

 

iocexample> *0xfbff345e = 0

address_test + 0xf8a6281e = 0xfbff345e: value = 0 = 0x0

iocexample> *0xfbff347e = 0

address_test + 0xf8a6283e = 0xfbff347e: value = 65353 = 0xff49 = 'I'

 

However, if I try the next address after 7e I do get a bus error, because that is beyond the range of the IP330 module.

 

iocexample> *0xfbff3480 = 0

 

VME Bus Error accessing A16: 0x3480

machine check

Exception next instruction address: 0x001fe4f4

Machine Status Register: 0x0008b032

Condition Register: 0x88000244

 

0x0012489c vxTaskEntry  +0x48 : 0x0020c554 ()

0x0020c554 shellTask    +0x4a8: shellExec ()

0x0020c020 shellExec    +0x11c: 0x00201a78 ()

0x00201ca0 shellInterpCInit+0x640: shellInterpCparse ()

0x00201384 shellInterpCparse+0xa3c: 0x001fe9f4 ()

0x001fea24 shellInterpClex+0x1d6c: 0x001fe6dc ()

0x001fe7a0 shellInterpClex+0x1ae8: 0x001fe490 ()

 

Shell task 'tShell0' restarted...

 

So this leaves the question of why we get those A16 bus errors at all!

 

Mark

 

 

From: Till Straumann <till.straumann at psi.ch>
Sent: Wednesday, June 8, 2022 4:46 PM
To: Mark Rivers <rivers at cars.uchicago.edu>
Cc: tech-talk at aps.anl.gov
Subject: Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules

 

Thanks Mark.

Hmm; all the data look normal. I.e., there is nothing that could explain why there is a VME Bus Error.
In summary:

 - the program text has not been modified
 - given the register context and (disassembled) program it seems impossible that a bus error could have
   interrupted the program at the indicated/reported location.

One thing that puzzles me is the observation that A16 errors only are observed when the read-only locations
0x345e/0x347e are involved. That hints at a write-access causing this. If this were a posted write then an error
would be reported (if it is reported at all) with a longer delay, i.e., we'd have to look a bit further into the 'past'.

A very simple test could check if write errors are reported: just try to write something to 0x345e@A16 with
some code like this

void test() {
   *(volatile uint16_t*)0xfbff345e = 0;
   asm volatile( "   addi 3, 3, 0"); /* I hope I remember this correctly; this is just a NOP */
   /* copy/paste ~30 of these NOPs here */
}

copy/paste a generous amount of NOP statements. We can then see
 - if a bus error is reported
 - how long the reporting delay is (if any)

One more thing:  please compile with -O2 or greater.

Another idea I have is that you could try a data-access breakpoint (bh() with access type 1) on 0xfbff345c
(must be word-aligned IIRC). Maybe that reveals something.

Best
- Till

On 6/8/22 22:13, Mark Rivers wrote:

Till said:

 

1.      I would join Steven Hartman in recommending to dump program memory; I'd look at
the address in R9/CTR (target of the jump) and also at the address where we jump from
(if the PC is different from CTR as in the above report then dump from the address at the PC,
if the PC is identical to CTR then dump from the address held in LR).


VME Bus Error accessing A16: 0x347e

machine check

Exception next instruction address: 0x03326a90

Machine Status Register: 0x0008b032

Condition Register: 0x42000884

Task: 0x3593ed0 "CAS-event"

0x3593ed0 (CAS-event): task 0x3593ed0 has had a failure and has been stopped.

0x3593ed0 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.

 

iocexample> tt

0x0012489c vxTaskEntry  +0x48 : epicsThreadEntry ()

0x0334501c epicsThreadEntry+0x80 : 0x032a17f0 ()

0x032a1acc db_start_events+0x450: db_delete_field_log ()

0x032a0fe4 db_delete_field_log+0x54 : freeListFree ()

value = 0 = 0x0

iocexample>

iocexample>

iocexample> ti

 

  NAME         ENTRY       TID    PRI   STATUS      PC       SP     ERRNO  DELAY

----------  ------------ -------- --- ---------- -------- -------- ------- -----

CAS-event   epicsThread>  3593ed0 133 STOP+I      3326a90  3593de0       0     0

 

full task name : CAS-event

task entry     : epicsThreadEntry

process        : kernel

options        : 0x1009001

VX_SUPERVISOR_MODE  VX_DEALLOC_TCB      VX_FP_TASK          VX_DEALLOC_EXC_STACK

 

STACK      BASE     END       SP      SIZE    HIGH   MARGIN

--------- -------- -------- -------- ------- ------- -------

execution  3593ed0  3590ff0  3593de0   12000    2800    9200

exception  3595020  35941b0             3696    1344    2352

 

VxWorks Events

--------------

Events Pended on    : Not Pended

Received Events     : 0x0

Options             : N/A

 

r0         = 0x40000000   sp         = 0x03593de0   r2         = 0x002f1ac0

r3         = 0x03492c80   r4         = 0x0357f318   r5         = 0x00000001

r6         = 0x00000001   r7         = 0x00000000   r8         = 0x00140001

r9         = 0x03326a90   r10        = 0x00000000   r11        = 0x03593d50

r12        = 0x20000884   r13        = 0x00321e60   r14        = 0x033b6098

r15        = 0x03373c64   r16        = 0x03289478   r17        = 0x033424f0

r18        = 0x03373bdc   r19        = 0x03373c1c   r20        = 0x032a0ff8

r21        = 0x032a0f90   r22        = 0x00000000   r23        = 0x00000001

r24        = 0x033b609c   r25        = 0x03339340   r26        = 0x033390e0

r27        = 0x03443170   r28        = 0x03443170   r29        = 0x03569e18

r30        = 0x032dad44   r31        = 0x0357f318   msr        = 0x0008b032

lr         = 0x032a0fe4   ctr        = 0x03326a90   pc         = 0x03326a90

cr         = 0x42000884   xer        = 0x00000000   pgTblPtr   = 0x02310000

scSrTblPtr = 0x0230fe74   srTblPtr   = 0x0230fe74

 

fpcsr  =        0

fr0    =  -1.1635   fr1    =      NaN   fr2    =      NaN   fr3    =      NaN

fr4    =      NaN   fr5    =      NaN   fr6    =      NaN   fr7    =      NaN

fr8    =      NaN   fr9    =      NaN   fr10   =      NaN   fr11   =      NaN

fr12   =      NaN   fr13   =      NaN   fr14   =      NaN   fr15   =      NaN

fr16   =      NaN   fr17   =      NaN   fr18   =      NaN   fr19   =      NaN

fr20   =      NaN   fr21   =      NaN   fr22   =      NaN   fr23   =      NaN

fr24   =      NaN   fr25   =      NaN   fr26   =      NaN   fr27   =      NaN

fr28   =      NaN   fr29   =      NaN   fr30   =      NaN   fr31   =      NaN

 

machine checkvalue =

Exception next instruction address: 0x003326a90

Machine Status Register: 0x = 0008b032

Condition Register: 0x420008840x

0

 

 

In this case R9=CTR=PC=0x03326a90.

 

iocexample> l 0x03326a90

                        freeListFree:

0x3326a90  9421fff0    stwu        r1,-16(r1)

0x3326a94  7c0802a6    mfspr       r0,LR

0x3326a98  3d200334    lis         r9,0x334 # 820

0x3326a9c  392990e0    addi        r9,r9,0x90e0 # -28448

0x3326aa0  7d2903a6    mtspr       CTR,r9

0x3326aa4  90010014    stw         r0,20(r1)

0x3326aa8  93c10008    stw         r30,8(r1)

0x3326aac  7c9e2378    or          r30,r4,r4

0x3326ab0  93e1000c    stw         r31,12(r1)

0x3326ab4  7c7f1b78    or          r31,r3,r3

value = 0 = 0x0

 

iocexample> d 0x03326a90

NOTE: memory values are displayed in hexadecimal.

0x03326a90:  9421 fff0 7c08 02a6 3d20 0334 3929 90e0  *.!..|...= .49)..*

0x03326aa0:  7d29 03a6 9001 0014 93c1 0008 7c9e 2378  *})..........|.#x*

0x03326ab0:  93e1 000c 7c7f 1b78 8063 0014 4e80 0421  *....|..x.c..N..!*

0x03326ac0:  2f83 0000 41be 002c 3d20 0334 3c60 0338  */...A..,= .4<`.8*

0x03326ad0:  3ca0 0338 3929 24f0 3863 35a4 38a5 35c0  *<..89)$.8c5.8.5.*

0x03326ae0:  7d29 03a6 3880 008f 38c0 0000 4e80 0421  *})..8...8...N..!*

0x03326af0:  813f 0010 801f 0008 3929 0001 807f 0014  *.?......9)......*

0x03326b00:  913f 0010 3d20 0334 3929 9340 901e 0000  *.?..= .49).@....*

0x03326b10:  7d29 03a6 93df 0008 4e80 0421 8001 0014  *})......N..!....*

0x03326b20:  83c1 0008 7c08 03a6 83e1 000c 3821 0010  *....|.......8!..*

0x03326b30:  4e80 0020 9421 ffe0 7c08 02a6 9001 0024  *N.. .!..|......$*

0x03326b40:  93e1 001c 9381 0010 7c7c 1b78 83e3 000c  *........||.x....*

0x03326b50:  93a1 0014 2f9f 0000 93c1 0018 419e 0074  *..../.......A..t*

0x03326b60:  3d20 001e 3bc9 02fc 807f 0004 7fc9 03a6  *= ..;...........*

0x03326b70:  83bf 0000 4e80 0421 7fe3 fb78 7fc9 03a6  *....N..!...x....*

0x03326b80:  7fbf eb78 4e80 0421 2f9d 0000 409e ffdc  *...xN..!/...@...*

 

LR = 0x032a0fe4

This is the disassembly with LR address highlighted in red.

 

iocexample> l 0x032a0f90,30

                        db_delete_field_log:

0x32a0f90  9421fff0    stwu        r1,-16(r1)

0x32a0f94  7c0802a6    mfspr       r0,LR

0x32a0f98  93e1000c    stw         r31,12(r1)

0x32a0f9c  7c7f1b79    or.         r31,r3,r3

0x32a0fa0  90010014    stw         r0,20(r1)

0x32a0fa4  41820040    bc          0xc,2, 0x32a0fe4 # 0x032a0fe4

0x32a0fa8  801f0000    lwz         r0,0(r31)

0x32a0fac  2f800000    cmpi        crf7,0,r0,0x0 # 0

0x32a0fb0  409c0018    bc          0x4,28, 0x32a0fc8 # 0x032a0fc8

0x32a0fb4  801f0018    lwz         r0,24(r31)

0x32a0fb8  2f800000    cmpi        crf7,0,r0,0x0 # 0

0x32a0fbc  419e000c    bc          0xc,30, 0x32a0fc8 # 0x032a0fc8

0x32a0fc0  7c0903a6    mtspr       CTR,r0

0x32a0fc4  4e800421    bcctrl      0x14,0

0x32a0fc8  3d20033b    lis         r9,0x33b # 827

0x32a0fcc  7fe4fb78    or          r4,r31,r31

0x32a0fd0  80696098    lwz         r3,24728(r9)

0x32a0fd4  3d200332    lis         r9,0x332 # 818

0x32a0fd8  39296a90    addi        r9,r9,0x6a90 # 27280

0x32a0fdc  7d2903a6    mtspr       CTR,r9

0x32a0fe0  4e800421    bcctrl      0x14,0

0x32a0fe4  80010014    lwz         r0,20(r1)

0x32a0fe8  83e1000c    lwz         r31,12(r1)

0x32a0fec  38210010    addi        r1,r1,0x10 # 16

0x32a0ff0  7c0803a6    mtspr       LR,r0

0x32a0ff4  4e800020    blr

0x32a0ff8  9421fff0    stwu        r1,-16(r1)

0x32a0ffc  7c0802a6    mfspr       r0,LR

0x32a1000  39240090    addi        r9,r4,0x90 # 144

0x32a1004  5484103a    rlwinm      r4,r4,2,0,29

 

iocexample> d 0x032a0fe4

NOTE: memory values are displayed in hexadecimal.

0x032a0fe0:            8001 0014 83e1 000c 3821 0010  *      ......8!..*

0x032a0ff0:  7c08 03a6 4e80 0020 9421 fff0 7c08 02a6  *|...N.. .!..|...*

0x032a1000:  3924 0090 5484 103a 5529 103a 7c83 2214  *9$..T..:U).:|.".*

0x032a1010:  7d23 4a14 9001 0014 3800 0000 93c1 0008  *}#J.....8.......*

0x032a1020:  93e1 000c 7c7f 1b78 83c9 0004 9004 0004  *....|..x........*

0x032a1030:  817e 001c 90a9 0004 2f8b 0001 419e 0074  *.~....../...A..t*

0x032a1040:  2b8b 0001 409d 0090 a13f 0492 2f89 0000  *+...@....?../...*

0x032a1050:  419e 0030 817e 001c 3809 ffff b01f 0492  *A..0.~..8.......*

0x032a1060:  380b ffff 83e1 000c 901e 001c 8001 0014  *8...............*

0x032a1070:  83c1 0008 3821 0010 7c08 03a6 4e80 0020  *....8!..|...N.. *

0x032a1080:  3d20 0334 3c60 0337 3929 24f0 3ca0 0337  *= .4<`.79)$.<..7*

0x032a1090:  7d29 03a6 3863 3bdc 38a5 3c00 3880 0216  *})..8c;.8.<.8...*

0x032a10a0:  38c0 0000 4e80 0421 a13f 0492 4bff ffa8  *8...N..!.?..K...*

0x032a10b0:  901e 0018 380b ffff 901e 001c 8001 0014  *....8...........*

0x032a10c0:  83c1 0008 7c08 03a6 83e1 000c 3821 0010  *....|.......8!..*

0x032a10d0:  4e80 0020 3d20 0334 3c60 0337 3929 24f0  *N.. = .4<`.79)$.*

0x032a10e0:  3ca0 0337                                *<..7............*

value = 0 = 0x0

 

Thanks,

Mark

 

 

From: Till Straumann <till.straumann at psi.ch>
Sent: Wednesday, June 8, 2022 1:24 PM
To: Mark Rivers <rivers at cars.uchicago.edu>
Cc: tech-talk at aps.anl.gov
Subject: Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules

 

Hi Mark.

Some comments in line...

On 6/7/22 22:39, Mark Rivers wrote:

Hi Till,

 

I have checked out tag 85822f3 and only changed configure/CONFIG_SITE to build for vxWorks.

 

I ran the test that uses the IPAC hardware and got this crash after about 13 minutes.  This time it crashed with an A16 address error, in the address space for the IP330 module.  The crash is in db_delete_field_log, the same as the crash you analyzed.

Actually: not quite. In this report the PC / next instruction address has not been loaded from the CTR yet as it had in the previous
report.



 

iocexample>

VME Bus Error accessing A16: 0x345e

machine check

Exception next instruction address: 0x032a0fe0

Machine Status Register: 0x0008b032

Condition Register: 0x42000884

Task: 0x3570090 "CAS-event"

0x3570090 (CAS-event): task 0x3570090 has had a failure and has been stopped.

0x3570090 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.

 

This is the stack trace:

 

iocexample> tt 0x3570090

0x0012489c vxTaskEntry  +0x48 : epicsThreadEntry ()

0x0334501c epicsThreadEntry+0x80 : 0x032a17f0 ()

0x032a1acc db_start_events+0x450: db_delete_field_log ()

value = 0 = 0x0

 

This is the register dump:

 

iocexample> ti 0x3570090

 

  NAME         ENTRY       TID    PRI   STATUS      PC       SP     ERRNO  DELAY

----------  ------------ -------- --- ---------- -------- -------- ------- -----

CAS-event   epicsThread>  3570090 133 STOP+I      32a0fe0  356ffa0       0     0

 

full task name : CAS-event

task entry     : epicsThreadEntry

process        : kernel

options        : 0x1009001

VX_SUPERVISOR_MODE  VX_DEALLOC_TCB      VX_FP_TASK          VX_DEALLOC_EXC_STACK

 

STACK      BASE     END       SP      SIZE    HIGH   MARGIN

--------- -------- -------- -------- ------- ------- -------

execution  3570090  356d1b0  356ffa0   12000    2800    9200

exception  26f7ac0  26f6c50             3696    1344    2352

 

VxWorks Events

--------------

Events Pended on    : Not Pended

Received Events     : 0x0

Options             : N/A

 

r0         = 0x40000000   sp         = 0x0356ffa0   r2         = 0x002f1ac0

r3         = 0x03492cc0   r4         = 0x0356ab90   r5         = 0x00000001

r6         = 0x00000001   r7         = 0x00000000   r8         = 0x000a0008

r9         = 0x03326a90   r10        = 0x00000000   r11        = 0x0356ff10

r12        = 0x20000884   r13        = 0x00321e60   r14        = 0x033b6098

r15        = 0x03373c64   r16        = 0x03289478   r17        = 0x033424f0

r18        = 0x03373bdc   r19        = 0x03373c1c   r20        = 0x032a0ff8

r21        = 0x032a0f90   r22        = 0x00000000   r23        = 0x00000001

r24        = 0x033b609c   r25        = 0x03339340   r26        = 0x033390e0

r27        = 0x03442b90   r28        = 0x0356c870   r29        = 0x03556108

r30        = 0x032dad44   r31        = 0x0356ab90   msr        = 0x0008b032

lr         = 0x032a1acc   ctr        = 0x03326a90   pc         = 0x032a0fe0

cr         = 0x42000884   xer        = 0x00000000   pgTblPtr   = 0x02310000

scSrTblPtr = 0x0230fe74   srTblPtr   = 0x0230fe74

 

fpcsr  =        0

fr0    = -0.85893   fr1    =      NaN   fr2    =      NaN   fr3    =      NaN

fr4    =      NaN   fr5    =      NaN   fr6    =      NaN   fr7    =      NaN

fr8    =      NaN   fr9    =      NaN   fr10   =      NaN   fr11   =      NaN

fr12   =      NaN   fr13   =      NaN   fr14   =      NaN   fr15   =      NaN

fr16   =      NaN   fr17   =      NaN   fr18   =      NaN   fr19   =      NaN

fr20   =      NaN   fr21   =      NaN   fr22   =      NaN   fr23   =      NaN

fr24   =      NaN   fr25   =      NaN   fr26   =      NaN   fr27   =      NaN

fr28   =      NaN   fr29   =      NaN   fr30   =      NaN   fr31   =      NaN

 

machine checkvalue =

Exception next instruction address: 0x0032a0fe0

Machine Status Register: 0x = 0008b032

Condition Register: 0x420008840x

 

This is the memory dump of the address in R31:

 

iocexample> d 0x0356ab90

NOTE: memory values are displayed in hexadecimal.

Looks quite normal


0x0356ab90:  4000 0000 3d01 163d 3848 818a 0000 0000  *@...=..=8H......*

                 ctx bit set | timestamp             | stat | sevr |


0x0356aba0:  000a 0008 0000 0001 bfeb 7c5b 7c5b 7c5b  *..........|[|[|[*

                 type |size  | no_elems | dbfl_val                |

Note that the first 32-bit of the dbfl_val again would point to a VME A32 address;
note also that the ls-bits (0x7c5b) differ from the reported A16 address (0x345e).

I start to wonder if pfl->u.r is not a red herring...

I would join Steven Hartman in recommending to dump program memory; I'd look at
the address in R9/CTR (target of the jump) and also at the address where we jump from
(if the PC is different from CTR as in the above report then dump from the address at the PC,
if the PC is identical to CTR then dump from the address held in LR).

0x0356abb0:  0000 0000 0000 0000 0356 ac80 3d01 163d  *.........V..=..=*

0x0356abc0:  3848 818a 0000 0000 000a 0008 0000 0001  *8H..............*

0x0356abd0:  bfeb 405b 405b 405b 0000 0000 0000 0000  *..@[@[@[........*

0x0356abe0:  0356 a988 3d01 163d 3848 818a 0000 0000  *.V..=..=8H......*

0x0356abf0:  000a 0008 0000 0001 bfeb 2edb 2edb 2edb  *................*

0x0356ac00:  0000 0000 0000 0000 0356 a640 3d01 163d  *.........V.@=..=*

0x0356ac10:  3848 818a 0000 0000 000a 0008 0000 0001  *8H..............*

0x0356ac20:  bfeb 635b 635b 635b 0000 0000 0000 0000  *..c[c[c[........*

0x0356ac30:  0356 a5f0 3d01 163d 3848 818a 0000 0000  *.V..=..=8H......*

0x0356ac40:  000a 0008 0000 0001 bfeb 4f5b 4f5b 4f5b  *..........O[O[O[*

0x0356ac50:  0000 0000 0000 0000 0356 a618 3d01 163d  *.........V..=..=*

0x0356ac60:  3848 818a 0000 0000 000a 0008 0000 0001  *8H..............*

0x0356ac70:  bfeb 60db 60db 60db 0000 0000 0000 0000  *..`.`.`.........*

0x0356ac80:  0356 ab18 3d01 163d 3848 818a 0000 0000  *.V..=..=8H......*

value = 0 = 0x0

iocexample>

 

Another time I tested the same version it failed with an A32 address error in freeListFree called from db_delete_field_log

 

VME Bus Error accessing A32: 0xbfe26ed0

machine check

Exception next instruction address: 0x03326a90

Machine Status Register: 0x0008b032

Condition Register: 0x40000884

Task: 0x26fccd0 "CAS-event"

0x26fccd0 (CAS-event): task 0x26fccd0 has had a failure and has been stopped.

0x26fccd0 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.

 

iocexample> tt 0x26fccd0

0x0012489c vxTaskEntry  +0x48 : epicsThreadEntry ()

0x0334501c epicsThreadEntry+0x80 : 0x032a17f0 ()

0x032a1acc db_start_events+0x450: db_delete_field_log ()

0x032a0fe4 db_delete_field_log+0x54 : freeListFree ()

value = 0 = 0x0

 

iocexample> ti

 

  NAME         ENTRY       TID    PRI   STATUS      PC       SP     ERRNO  DELAY

----------  ------------ -------- --- ---------- -------- -------- ------- -----

CAS-event   epicsThread>  26fccd0 133 STOP+I      3326a90  26fcbe0       0     0

 

full task name : CAS-event

task entry     : epicsThreadEntry

process        : kernel

options        : 0x1009001

VX_SUPERVISOR_MODE  VX_DEALLOC_TCB      VX_FP_TASK          VX_DEALLOC_EXC_STACK

 

STACK      BASE     END       SP      SIZE    HIGH   MARGIN

--------- -------- -------- -------- ------- ------- -------

execution  26fccd0  26f9df0  26fcbe0   12000    2800    9200

exception  3444540  34436d0             3696    1344    2352

 

VxWorks Events

--------------

Events Pended on    : Not Pended

Received Events     : 0x0

Options             : N/A

 

r0         = 0x40000000   sp         = 0x026fcbe0   r2         = 0x002f1ac0

r3         = 0x03492ce0   r4         = 0x03575a90   r5         = 0x00000001

r6         = 0x00000001   r7         = 0x00000001   r8         = 0x0354a4d0

r9         = 0x03326a90   r10        = 0x00000028   r11        = 0x026fcb50

r12        = 0x22000884   r13        = 0x00321e60   r14        = 0x033b6098

r15        = 0x03373c64   r16        = 0x03289478   r17        = 0x033424f0

r18        = 0x03373bdc   r19        = 0x03373c1c   r20        = 0x032a0ff8

r21        = 0x032a0f90   r22        = 0x00000000   r23        = 0x00000001

r24        = 0x033b609c   r25        = 0x03339340   r26        = 0x033390e0

r27        = 0x03443078   r28        = 0x03443078   r29        = 0x03561a80

r30        = 0x032dad44   r31        = 0x03575a90   msr        = 0x0008b032

lr         = 0x032a0fe4   ctr        = 0x03326a90   pc         = 0x03326a90

cr         = 0x40000884   xer        = 0x00000000   pgTblPtr   = 0x02310000

scSrTblPtr = 0x0230fe74   srTblPtr   = 0x0230fe74

 

fpcsr  =        0

fr0    = -0.576028   fr1    =      NaN   fr2    =      NaN   fr3    =      NaN

fr4    =      NaN   fr5    =      NaN   fr6    =      NaN   fr7    =      NaN

fr8    =      NaN   fr9    =      NaN   fr10   =      NaN   fr11   =      NaN

fr12   =      NaN   fr13   =      NaN   fr14   =      NaN   fr15   =      NaN

fr16   =      NaN   fr17   =      NaN   fr18   =      NaN   fr19   =      NaN

fr20   =      NaN   fr21   =      NaN   fr22   =      NaN   fr23   =      NaN

fr24   =      NaN   fr25   =      NaN   fr26   =      NaN   fr27   =      NaN

fr28   =      NaN   fr29   =      NaN   fr30   =      NaN   fr31   =      NaN

 

machine checkvalue =

Exception next instruction address: 0x003326a90

Machine Status Register: 0x = 0008b032

Condition Register: 0x0x40000884

 

 

iocexample> l  0x003326a90

                        freeListFree:

0x3326a90  9421fff0    stwu        r1,-16(r1)

0x3326a94  7c0802a6    mfspr       r0,LR

0x3326a98  3d200334    lis         r9,0x334 # 820

0x3326a9c  392990e0    addi        r9,r9,0x90e0 # -28448

0x3326aa0  7d2903a6    mtspr       CTR,r9

0x3326aa4  90010014    stw         r0,20(r1)

0x3326aa8  93c10008    stw         r30,8(r1)

0x3326aac  7c9e2378    or          r30,r4,r4

0x3326ab0  93e1000c    stw         r31,12(r1)

0x3326ab4  7c7f1b78    or          r31,r3,r3

value = 0 = 0x0

 

I hope this helps.

 

I will work on the quick test program you sent next.

 

Thanks,

Mark

 

From: Tech-talk <tech-talk-bounces at aps.anl.gov> On Behalf Of Till Straumann via Tech-talk
Sent: Thursday, June 2, 2022 11:34 AM
To: tech-talk at aps.anl.gov
Subject: Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules

 

On 6/2/22 16:21, Mark Rivers via Tech-talk wrote:

The problem can be explained if pfl->type is occasionally set to dbfl_type_ref when it should not be. In that case the union will contain data, perhaps floats from the ADC as Michael suggested. Because the ->type is wrong the value is being treated as an address, which leads to the A32 bus address error.

No - it actually can't explained by this. In the scenario you describe
there would be a read access to pfl->u.r.dtor when the function call
is executed. If that were the case then we should see the fault address
in the PC (and in 'r0' as well as in 'ctr', see analysis below).

In the disassembly which I quote here (comments apply
to preceding instructions):




00000034 <db_delete_field_log>:
    34:       94 21 ff f0     stwu    r1,-16(r1)
    38:       7c 08 02 a6     mflr    r0
    3c:       93 e1 00 0c     stw     r31,12(r1)
    40:       7c 7f 1b 79     mr.     r31,r3
copy 'pfl' arg (passed in r3) into r31 and set condition-code bits
    44:       90 01 00 14     stw     r0,20(r1)
    48:       41 82 00 40     beq     88 <db_delete_field_log+0x54>
if ( pfl == NULL ) goto cleanup/leave
    4c:       80 1f 00 00     lwz     r0,0(r31)
load pfl->type into 'r0'
    50:       2f 80 00 00     cmpwi   cr7,r0,0
test most-significant bit (this is the 'type' 1-bit bitfield)
    54:       40 9c 00 18     bge     cr7,6c <db_delete_field_log+0x38>
if type is 0 (*not* dbfl_ref) then skip the 'dtor' call and jump to address 0x6c
    58:       80 1f 00 18     lwz     r0,24(r31)
we get here if pfl->type == dbfl_type_ref and this instruction
loads pfl->u.r.dtor into register r0
(note that this is still a valid access even if r0 now contains a bad address)
****** r0 at this point holds the 'dtor' address ******
    5c:       2f 80 00 00     cmpwi   cr7,r0,0
comparison: pfl->u.r.dtor == 0 
    60:       41 9e 00 0c     beq     cr7,6c <db_delete_field_log+0x38>
get here if pfl->u.r.dtor is NOT NULL
    64:       7c 09 03 a6     mtctr   r0
copy r0 to counter register
    68:       4e 80 04 21     bctrl
***** jump to the 'dtor' routine
***** this jump would cause a bus error
***** HOWEVER: the faulting address would still be in 'r0' (and 'ctr')
*****          and we would crash *here* not a few lines below
    6c:       3d 20 00 00     lis     r9,0
    70:       7f e4 fb 78     mr      r4,r31
copy r31 into r4, both hold 'pfl'
    74:       80 69 00 00     lwz     r3,0(r9)
load 'dbevFieldLogFreeList' into r3
    78:       3d 20 00 00     lis     r9,0
upper 16-bit of 'freeListFree' into r9
    7c:       39 29 00 00     addi    r9,r9,0
merge lower 16-bit of 'freeListFree' into r9
    80:       7d 29 03 a6     mtctr   r9
copy r9 into 'ctr'
    84:       4e 80 04 21     bctrl
***** OBSERVED EXCEPTION happens during execution of this jump
    88:       80 01 00 14     lwz     r0,20(r1)
    8c:       83 e1 00 0c     lwz     r31,12(r1)
    90:       38 21 00 10     addi    r1,r1,16
    94:       7c 08 03 a6     mtlr    r0
    98:       4e 80 00 20     blr
 
Note that we see
 
r0:  0x40000000
ctr: 0x03326a90
pc:  0x03326a90 (next instruction address)
 
=> if the fault were caused by loading an instruction from *pfl->u.r.dtor then
   pc and ctr should hold the bad address.
 
The register context is actually consistent with
 
 pfl->type == dbfl_type_val
 
i.e., from 0x54 we would jump to 0x6c with r0 still holding the 'type' and 'ctx' bit-fields.
Here the call to 'freeListFree()' is prepared. Note that nothing looks out of order:
 
 r9 == ctr == pc == 0x03326a90 = freeListFree
 r31 == r4 =='pfl'
 
 r0 == 0x40000000
 
Note how the type (msbit) is '0' and the 'ctx' is 1 (dbfl_context_event).
 
In conclusion: the register context gives no indication that anything went wrong.
 
In order to gather more data I would recommend to also dump the memory at 'pfl',
i.e., next time please post:
 
 - register dump
 - stack trace
 - other info you have provided in the past
 - memory (~100bytes should suffice) from the location reported in R31
 
If possible please always work with the 85822f305 commit so that we stay on the same page...
 
Cheers
- Till
 
 
 
Mark
 
Sent from my iPhone
 
On Jun 1, 2022, at 11:07 PM, Mark Rivers via Tech-talk <tech-talk at aps.anl.gov> wrote:
 
We need to explain why adding 2 lines of code to examine the contents of pfl eliminates the failures completely.
 
It’s like a new type of quantum computer - making an observation changes the outcome!
 
Mark
 
 
Sent from my iPhone
 
On Jun 1, 2022, at 10:24 PM, Michael Davidsaver <mdavidsaver at gmail.com> wrote:
 
On 6/1/22 14:50, Mark Rivers wrote:
I checked out R7.0.6.1 and made the following change to dbEvent.c  It checks whether pfl or pfl->u.r.dtor are in the VME A32 address space.  If they are it prints an error and returns.
 
So, these addresses appear in pfl->u.r.dtor or eg. pfl->u.v.field.dbf_float
 
My question is then which part(s) of the db_field_log struct have incorrect values?
(eg. "\xbf\x3e\x00\x1e" as float32 is -0.74)
 
One straightforward, though a bit tedious, think to do would be to print a
dump of the contents of the db_field_log.  eg. I'd be interested to know
if db_field_log::time appears valid.
 
...
After making that change then one time when I started the IOC I got these messages right after iocInit, which is not when it usually fails.
 
This makes me wonder if these "addresses" are eg. somehow valid ADC values,
which just happen to track through the VME window sometimes.  Though if you've
only loading soft records, I don't see how this could be.
 
Further, I still don't see how these values aren't appearing in the register dump.
I don't think PPC has any instructions which do a double indirection.
 
 
Another odd ball idea which occurs to me is that maybe the compiler is mishandling
the db_field_log::type bit field.  The dbEvent.o which you send to Till and I has:
 
    4c:       80 1f 00 00     lwz     r0,0(r31)
    50:       2f 80 00 00     cmpwi   cr7,r0,0
 
If I read things correctly, I think this is meant to correspond to
'pfl->u.r.type==dbfl_type_val==0", however it is testing the whole word, and would
mis-trigger because 'pfl->u.r.ctx == dbfl_context_event == 1` for a monitor update.
 
When I look at the result of a local RTEMS build, I think the corresponding is:
 
 8:   89 24 00 00     lbz     r9,0(r4)
 c:   71 29 00 80     andi.   r9,r9,128
 
Which is testing the high bit as it should.
 
 
 
For completeness.  Disassembly of db_delete_field_log() from dbEvent.o.
Note that this is object code, with incomplete relocations.
 
00000034 <db_delete_field_log>:
    34:       94 21 ff f0     stwu    r1,-16(r1)
    38:       7c 08 02 a6     mflr    r0
    3c:       93 e1 00 0c     stw     r31,12(r1)
    40:       7c 7f 1b 79     mr.     r31,r3
    44:       90 01 00 14     stw     r0,20(r1)
    48:       41 82 00 40     beq     88 <db_delete_field_log+0x54>
    4c:       80 1f 00 00     lwz     r0,0(r31)
    50:       2f 80 00 00     cmpwi   cr7,r0,0
    54:       40 9c 00 18     bge     cr7,6c <db_delete_field_log+0x38>
    58:       80 1f 00 18     lwz     r0,24(r31)
    5c:       2f 80 00 00     cmpwi   cr7,r0,0
    60:       41 9e 00 0c     beq     cr7,6c <db_delete_field_log+0x38>
    64:       7c 09 03 a6     mtctr   r0
    68:       4e 80 04 21     bctrl
    6c:       3d 20 00 00     lis     r9,0
    70:       7f e4 fb 78     mr      r4,r31
    74:       80 69 00 00     lwz     r3,0(r9)
    78:       3d 20 00 00     lis     r9,0
    7c:       39 29 00 00     addi    r9,r9,0
    80:       7d 29 03 a6     mtctr   r9
    84:       4e 80 04 21     bctrl
    88:       80 01 00 14     lwz     r0,20(r1)
    8c:       83 e1 00 0c     lwz     r31,12(r1)
    90:       38 21 00 10     addi    r1,r1,16
    94:       7c 08 03 a6     mtlr    r0
    98:       4e 80 00 20     blr
 
 
From my local build for RTEMS-mvme3100
 
void db_delete_field_log (db_field_log *pfl)
{
  if (pfl) {
 0:   7c 64 1b 79     mr.     r4,r3
 4:   4d 82 00 20     beqlr
      /* Free field if reference type field log and dtor is set */
      if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
 8:   89 24 00 00     lbz     r9,0(r4)
 c:   71 29 00 80     andi.   r9,r9,128
10:   40 82 00 10     bne     20 <db_delete_field_log+0x20>
      /* Free the field log chunk */
      freeListFree(dbevFieldLogFreeList, pfl);
14:   3d 20 00 00     lis     r9,0
18:   80 69 00 00     lwz     r3,0(r9)
1c:   48 00 00 00     b       1c <db_delete_field_log+0x1c>
      if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
20:   81 24 00 50     lwz     r9,80(r4)
24:   2f 89 00 00     cmpwi   cr7,r9,0
28:   41 be ff ec     beq     cr7,14 <db_delete_field_log+0x14>
{
2c:   94 21 ff f0     stwu    r1,-16(r1)
30:   7c 08 02 a6     mflr    r0
      if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
34:   7d 29 03 a6     mtctr   r9
{
38:   90 01 00 14     stw     r0,20(r1)
3c:   90 81 00 08     stw     r4,8(r1)
      if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
40:   4e 80 04 21     bctrl
  }
}
44:   80 01 00 14     lwz     r0,20(r1)
      freeListFree(dbevFieldLogFreeList, pfl);
48:   3d 20 00 00     lis     r9,0
      if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
4c:   80 81 00 08     lwz     r4,8(r1)
      freeListFree(dbevFieldLogFreeList, pfl);
50:   80 69 00 00     lwz     r3,0(r9)
}
54:   7c 08 03 a6     mtlr    r0
58:   38 21 00 10     addi    r1,r1,16
      freeListFree(dbevFieldLogFreeList, pfl);
5c:   48 00 00 00     b       5c <db_delete_field_log+0x5c>

 

 

 


Replies:
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
References:
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Michael Davidsaver via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Till Straumann via Tech-talk
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Till Straumann via Tech-talk
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Till Straumann via Tech-talk

Navigate by Date:
Prev: Re: [EXTERNAL] Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Hartman, Steven via Tech-talk
Next: Re: Removing battery/crystal from MVME 5100 Peter Milne via Tech-talk
Index: 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  <20222023  2024 
Navigate by Thread:
Prev: Re: [EXTERNAL] Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Hartman, Steven via Tech-talk
Next: RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Index: 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  <20222023  2024 
ANJ, 14 Sep 2022 Valid HTML 4.01! · Home · News · About · Base · Modules · Extensions · Distributions · Download ·
· Search · EPICS V4 · IRMIS · Talk · Bugs · Documents · Links · Licensing ·