EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System

1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  <20222023  2024  Index 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  <20222023  2024 
<== Date ==> <== Thread ==>

Subject: Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules
From: Till Straumann via Tech-talk <tech-talk at aps.anl.gov>
To: <tech-talk at aps.anl.gov>
Date: Thu, 2 Jun 2022 18:33:51 +0200
On 6/2/22 16:21, Mark Rivers via Tech-talk wrote:
The problem can be explained if pfl->type is occasionally set to dbfl_type_ref when it should not be. In that case the union will contain data, perhaps floats from the ADC as Michael suggested. Because the ->type is wrong the value is being treated as an address, which leads to the A32 bus address error.
No - it actually can't explained by this. In the scenario you describe
there would be a read access to pfl->u.r.dtor when the function call
is executed. If that were the case then we should see the fault address
in the PC (and in 'r0' as well as in 'ctr', see analysis below).

In the disassembly which I quote here (comments apply
to preceding instructions):

00000034 <db_delete_field_log>:
    34:       94 21 ff f0     stwu    r1,-16(r1)
    38:       7c 08 02 a6     mflr    r0
    3c:       93 e1 00 0c     stw     r31,12(r1)
    40:       7c 7f 1b 79     mr.     r31,r3
copy 'pfl' arg (passed in r3) into r31 and set condition-code bits
    44:       90 01 00 14     stw     r0,20(r1)
    48:       41 82 00 40     beq     88 <db_delete_field_log+0x54>
if ( pfl == NULL ) goto cleanup/leave
    4c:       80 1f 00 00     lwz     r0,0(r31)
load pfl->type into 'r0'
    50:       2f 80 00 00     cmpwi   cr7,r0,0
test most-significant bit (this is the 'type' 1-bit bitfield)
    54:       40 9c 00 18     bge     cr7,6c <db_delete_field_log+0x38>
if type is 0 (*not* dbfl_ref) then skip the 'dtor' call and jump to address 0x6c
    58:       80 1f 00 18     lwz     r0,24(r31)
we get here if pfl->type == dbfl_type_ref and this instruction
loads pfl->u.r.dtor into register r0
(note that this is still a valid access even if r0 now contains a bad address)
****** r0 at this point holds the 'dtor' address ******
    5c:       2f 80 00 00     cmpwi   cr7,r0,0
comparison: pfl->u.r.dtor == 0 
    60:       41 9e 00 0c     beq     cr7,6c <db_delete_field_log+0x38>
get here if pfl->u.r.dtor is NOT NULL
    64:       7c 09 03 a6     mtctr   r0
copy r0 to counter register
    68:       4e 80 04 21     bctrl
***** jump to the 'dtor' routine
***** this jump would cause a bus error
***** HOWEVER: the faulting address would still be in 'r0' (and 'ctr')
*****          and we would crash *here* not a few lines below
    6c:       3d 20 00 00     lis     r9,0
    70:       7f e4 fb 78     mr      r4,r31
copy r31 into r4, both hold 'pfl'
    74:       80 69 00 00     lwz     r3,0(r9)
load 'dbevFieldLogFreeList' into r3
    78:       3d 20 00 00     lis     r9,0
upper 16-bit of 'freeListFree' into r9
    7c:       39 29 00 00     addi    r9,r9,0
merge lower 16-bit of 'freeListFree' into r9
    80:       7d 29 03 a6     mtctr   r9
copy r9 into 'ctr'
    84:       4e 80 04 21     bctrl
***** OBSERVED EXCEPTION happens during execution of this jump
    88:       80 01 00 14     lwz     r0,20(r1)
    8c:       83 e1 00 0c     lwz     r31,12(r1)
    90:       38 21 00 10     addi    r1,r1,16
    94:       7c 08 03 a6     mtlr    r0
    98:       4e 80 00 20     blr

Note that we see

r0:  0x40000000
ctr: 0x03326a90
pc:  0x03326a90 (next instruction address)

=> if the fault were caused by loading an instruction from *pfl->u.r.dtor then
   pc and ctr should hold the bad address.

The register context is actually consistent with

 pfl->type == dbfl_type_val

i.e., from 0x54 we would jump to 0x6c with r0 still holding the 'type' and 'ctx' bit-fields.
Here the call to 'freeListFree()' is prepared. Note that nothing looks out of order:

 r9 == ctr == pc == 0x03326a90 = freeListFree
 r31 == r4 =='pfl'

 r0 == 0x40000000

Note how the type (msbit) is '0' and the 'ctx' is 1 (dbfl_context_event).

In conclusion: the register context gives no indication that anything went wrong.

In order to gather more data I would recommend to also dump the memory at 'pfl',
i.e., next time please post:

 - register dump
 - stack trace
 - other info you have provided in the past
 - memory (~100bytes should suffice) from the location reported in R31

If possible please always work with the 85822f305 commit so that we stay on the same page...

Cheers
- Till


Mark

Sent from my iPhone

On Jun 1, 2022, at 11:07 PM, Mark Rivers via Tech-talk <tech-talk at aps.anl.gov> wrote:

We need to explain why adding 2 lines of code to examine the contents of pfl eliminates the failures completely.

It’s like a new type of quantum computer - making an observation changes the outcome!

Mark


Sent from my iPhone

On Jun 1, 2022, at 10:24 PM, Michael Davidsaver <mdavidsaver at gmail.com> wrote:

On 6/1/22 14:50, Mark Rivers wrote:
I checked out R7.0.6.1 and made the following change to dbEvent.c  It checks whether pfl or pfl->u.r.dtor are in the VME A32 address space.  If they are it prints an error and returns.
So, these addresses appear in pfl->u.r.dtor or eg. pfl->u.v.field.dbf_float

My question is then which part(s) of the db_field_log struct have incorrect values?
(eg. "\xbf\x3e\x00\x1e" as float32 is -0.74)

One straightforward, though a bit tedious, think to do would be to print a
dump of the contents of the db_field_log.  eg. I'd be interested to know
if db_field_log::time appears valid.

...
After making that change then one time when I started the IOC I got these messages right after iocInit, which is not when it usually fails.
This makes me wonder if these "addresses" are eg. somehow valid ADC values,
which just happen to track through the VME window sometimes.  Though if you've
only loading soft records, I don't see how this could be.

Further, I still don't see how these values aren't appearing in the register dump.
I don't think PPC has any instructions which do a double indirection.


Another odd ball idea which occurs to me is that maybe the compiler is mishandling
the db_field_log::type bit field.  The dbEvent.o which you send to Till and I has:

    4c:       80 1f 00 00     lwz     r0,0(r31)
    50:       2f 80 00 00     cmpwi   cr7,r0,0
If I read things correctly, I think this is meant to correspond to
'pfl->u.r.type==dbfl_type_val==0", however it is testing the whole word, and would
mis-trigger because 'pfl->u.r.ctx == dbfl_context_event == 1` for a monitor update.

When I look at the result of a local RTEMS build, I think the corresponding is:

 8:   89 24 00 00     lbz     r9,0(r4)
 c:   71 29 00 80     andi.   r9,r9,128
Which is testing the high bit as it should.



For completeness.  Disassembly of db_delete_field_log() from dbEvent.o.
Note that this is object code, with incomplete relocations.

00000034 <db_delete_field_log>:
    34:       94 21 ff f0     stwu    r1,-16(r1)
    38:       7c 08 02 a6     mflr    r0
    3c:       93 e1 00 0c     stw     r31,12(r1)
    40:       7c 7f 1b 79     mr.     r31,r3
    44:       90 01 00 14     stw     r0,20(r1)
    48:       41 82 00 40     beq     88 <db_delete_field_log+0x54>
    4c:       80 1f 00 00     lwz     r0,0(r31)
    50:       2f 80 00 00     cmpwi   cr7,r0,0
    54:       40 9c 00 18     bge     cr7,6c <db_delete_field_log+0x38>
    58:       80 1f 00 18     lwz     r0,24(r31)
    5c:       2f 80 00 00     cmpwi   cr7,r0,0
    60:       41 9e 00 0c     beq     cr7,6c <db_delete_field_log+0x38>
    64:       7c 09 03 a6     mtctr   r0
    68:       4e 80 04 21     bctrl
    6c:       3d 20 00 00     lis     r9,0
    70:       7f e4 fb 78     mr      r4,r31
    74:       80 69 00 00     lwz     r3,0(r9)
    78:       3d 20 00 00     lis     r9,0
    7c:       39 29 00 00     addi    r9,r9,0
    80:       7d 29 03 a6     mtctr   r9
    84:       4e 80 04 21     bctrl
    88:       80 01 00 14     lwz     r0,20(r1)
    8c:       83 e1 00 0c     lwz     r31,12(r1)
    90:       38 21 00 10     addi    r1,r1,16
    94:       7c 08 03 a6     mtlr    r0
    98:       4e 80 00 20     blr

>From my local build for RTEMS-mvme3100

void db_delete_field_log (db_field_log *pfl)
{
  if (pfl) {
 0:   7c 64 1b 79     mr.     r4,r3
 4:   4d 82 00 20     beqlr
      /* Free field if reference type field log and dtor is set */
      if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
 8:   89 24 00 00     lbz     r9,0(r4)
 c:   71 29 00 80     andi.   r9,r9,128
10:   40 82 00 10     bne     20 <db_delete_field_log+0x20>
      /* Free the field log chunk */
      freeListFree(dbevFieldLogFreeList, pfl);
14:   3d 20 00 00     lis     r9,0
18:   80 69 00 00     lwz     r3,0(r9)
1c:   48 00 00 00     b       1c <db_delete_field_log+0x1c>
      if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
20:   81 24 00 50     lwz     r9,80(r4)
24:   2f 89 00 00     cmpwi   cr7,r9,0
28:   41 be ff ec     beq     cr7,14 <db_delete_field_log+0x14>
{
2c:   94 21 ff f0     stwu    r1,-16(r1)
30:   7c 08 02 a6     mflr    r0
      if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
34:   7d 29 03 a6     mtctr   r9
{
38:   90 01 00 14     stw     r0,20(r1)
3c:   90 81 00 08     stw     r4,8(r1)
      if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
40:   4e 80 04 21     bctrl
  }
}
44:   80 01 00 14     lwz     r0,20(r1)
      freeListFree(dbevFieldLogFreeList, pfl);
48:   3d 20 00 00     lis     r9,0
      if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
4c:   80 81 00 08     lwz     r4,8(r1)
      freeListFree(dbevFieldLogFreeList, pfl);
50:   80 69 00 00     lwz     r3,0(r9)
}
54:   7c 08 03 a6     mtlr    r0
58:   38 21 00 10     addi    r1,r1,16
      freeListFree(dbevFieldLogFreeList, pfl);
5c:   48 00 00 00     b       5c <db_delete_field_log+0x5c>


Replies:
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
References:
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Michael Davidsaver via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk

Navigate by Date:
Prev: Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Next: Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Index: 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  <20222023  2024 
Navigate by Thread:
Prev: Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Next: Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules Mark Rivers via Tech-talk
Index: 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  <20222023  2024 
ANJ, 14 Sep 2022 Valid HTML 4.01! · Home · News · About · Base · Modules · Extensions · Distributions · Download ·
· Search · EPICS V4 · IRMIS · Talk · Bugs · Documents · Links · Licensing ·