Hi Till,
Thanks for the detailed analysis! I will go back to that commit and provide the information you asked for when I get back to work on June 7.
Thanks,
Mark
Sent from my iPhone
On Jun 2, 2022, at 11:34 AM, Till Straumann via Tech-talk <tech-talk at aps.anl.gov> wrote:
On 6/2/22 16:21, Mark Rivers via Tech-talk wrote:
The problem can be explained if pfl->type is occasionally set to dbfl_type_ref when it should not be. In that case the union will contain data, perhaps floats from the ADC as Michael suggested. Because the ->type is wrong the value is being treated as an address, which leads to the A32 bus address error.
No - it actually can't explained by this. In the scenario you describe
there would be a read access to pfl->u.r.dtor when the function call
is executed. If that were the case then we should see the fault address
in the PC (and in 'r0' as well as in 'ctr', see analysis below).
In the disassembly which I quote here (comments apply
to preceding instructions):
00000034 <db_delete_field_log>:
34: 94 21 ff f0 stwu r1,-16(r1)
38: 7c 08 02 a6 mflr r0
3c: 93 e1 00 0c stw r31,12(r1)
40: 7c 7f 1b 79 mr. r31,r3
copy 'pfl' arg (passed in r3) into r31 and set condition-code bits
44: 90 01 00 14 stw r0,20(r1)
48: 41 82 00 40 beq 88 <db_delete_field_log+0x54>
if ( pfl == NULL ) goto cleanup/leave
4c: 80 1f 00 00 lwz r0,0(r31)
load pfl->type into 'r0'
50: 2f 80 00 00 cmpwi cr7,r0,0
test most-significant bit (this is the 'type' 1-bit bitfield)
54: 40 9c 00 18 bge cr7,6c <db_delete_field_log+0x38>
if type is 0 (*not* dbfl_ref) then skip the 'dtor' call and jump to address 0x6c
58: 80 1f 00 18 lwz r0,24(r31)
we get here if pfl->type == dbfl_type_ref and this instruction
loads pfl->u.r.dtor into register r0
(note that this is still a valid access even if r0 now contains a bad address)
****** r0 at this point holds the 'dtor' address ******
5c: 2f 80 00 00 cmpwi cr7,r0,0
comparison: pfl->u.r.dtor == 0
60: 41 9e 00 0c beq cr7,6c <db_delete_field_log+0x38>
get here if pfl->u.r.dtor is NOT NULL
64: 7c 09 03 a6 mtctr r0
copy r0 to counter register
68: 4e 80 04 21 bctrl
***** jump to the 'dtor' routine
***** this jump would cause a bus error
***** HOWEVER: the faulting address would still be in 'r0' (and 'ctr')
***** and we would crash *here* not a few lines below
6c: 3d 20 00 00 lis r9,0
70: 7f e4 fb 78 mr r4,r31
copy r31 into r4, both hold 'pfl'
74: 80 69 00 00 lwz r3,0(r9)
load 'dbevFieldLogFreeList' into r3
78: 3d 20 00 00 lis r9,0
upper 16-bit of 'freeListFree' into r9
7c: 39 29 00 00 addi r9,r9,0
merge lower 16-bit of 'freeListFree' into r9
80: 7d 29 03 a6 mtctr r9
copy r9 into 'ctr'
84: 4e 80 04 21 bctrl
***** OBSERVED EXCEPTION happens during execution of this jump
88: 80 01 00 14 lwz r0,20(r1)
8c: 83 e1 00 0c lwz r31,12(r1)
90: 38 21 00 10 addi r1,r1,16
94: 7c 08 03 a6 mtlr r0
98: 4e 80 00 20 blr
Note that we see
r0: 0x40000000
ctr: 0x03326a90
pc: 0x03326a90 (next instruction address)
=> if the fault were caused by loading an instruction from *pfl->u.r.dtor then
pc and ctr should hold the bad address.
The register context is actually consistent with
pfl->type == dbfl_type_val
i.e., from 0x54 we would jump to 0x6c with r0 still holding the 'type' and 'ctx' bit-fields.
Here the call to 'freeListFree()' is prepared. Note that nothing looks out of order:
r9 == ctr == pc == 0x03326a90 = freeListFree
r31 == r4 =='pfl'
r0 == 0x40000000
Note how the type (msbit) is '0' and the 'ctx' is 1 (dbfl_context_event).
In conclusion: the register context gives no indication that anything went wrong.
In order to gather more data I would recommend to also dump the memory at 'pfl',
i.e., next time please post:
- register dump
- stack trace
- other info you have provided in the past
- memory (~100bytes should suffice) from the location reported in R31
If possible please always work with the 85822f305 commit so that we stay on the same page...
Cheers
- Till
Mark
Sent from my iPhone
On Jun 1, 2022, at 11:07 PM, Mark Rivers via Tech-talk <tech-talk at aps.anl.gov> wrote:
We need to explain why adding 2 lines of code to examine the contents of pfl eliminates the failures completely.
It’s like a new type of quantum computer - making an observation changes the outcome!
Mark
Sent from my iPhone
On Jun 1, 2022, at 10:24 PM, Michael Davidsaver <mdavidsaver at gmail.com> wrote:
On 6/1/22 14:50, Mark Rivers wrote:
I checked out R7.0.6.1 and made the following change to dbEvent.c It checks whether pfl or pfl->u.r.dtor are in the VME A32 address space. If they are it prints an error and returns.
So, these addresses appear in pfl->u.r.dtor or eg. pfl->u.v.field.dbf_float
My question is then which part(s) of the db_field_log struct have incorrect values?
(eg. "\xbf\x3e\x00\x1e" as float32 is -0.74)
One straightforward, though a bit tedious, think to do would be to print a
dump of the contents of the db_field_log. eg. I'd be interested to know
if db_field_log::time appears valid.
...
After making that change then one time when I started the IOC I got these messages right after iocInit, which is not when it usually fails.
This makes me wonder if these "addresses" are eg. somehow valid ADC values,
which just happen to track through the VME window sometimes. Though if you've
only loading soft records, I don't see how this could be.
Further, I still don't see how these values aren't appearing in the register dump.
I don't think PPC has any instructions which do a double indirection.
Another odd ball idea which occurs to me is that maybe the compiler is mishandling
the db_field_log::type bit field. The dbEvent.o which you send to Till and I has:
4c: 80 1f 00 00 lwz r0,0(r31)
50: 2f 80 00 00 cmpwi cr7,r0,0
If I read things correctly, I think this is meant to correspond to
'pfl->u.r.type==dbfl_type_val==0", however it is testing the whole word, and would
mis-trigger because 'pfl->u.r.ctx == dbfl_context_event == 1` for a monitor update.
When I look at the result of a local RTEMS build, I think the corresponding is:
8: 89 24 00 00 lbz r9,0(r4)
c: 71 29 00 80 andi. r9,r9,128
Which is testing the high bit as it should.
For completeness. Disassembly of db_delete_field_log() from dbEvent.o.
Note that this is object code, with incomplete relocations.
00000034 <db_delete_field_log>:
34: 94 21 ff f0 stwu r1,-16(r1)
38: 7c 08 02 a6 mflr r0
3c: 93 e1 00 0c stw r31,12(r1)
40: 7c 7f 1b 79 mr. r31,r3
44: 90 01 00 14 stw r0,20(r1)
48: 41 82 00 40 beq 88 <db_delete_field_log+0x54>
4c: 80 1f 00 00 lwz r0,0(r31)
50: 2f 80 00 00 cmpwi cr7,r0,0
54: 40 9c 00 18 bge cr7,6c <db_delete_field_log+0x38>
58: 80 1f 00 18 lwz r0,24(r31)
5c: 2f 80 00 00 cmpwi cr7,r0,0
60: 41 9e 00 0c beq cr7,6c <db_delete_field_log+0x38>
64: 7c 09 03 a6 mtctr r0
68: 4e 80 04 21 bctrl
6c: 3d 20 00 00 lis r9,0
70: 7f e4 fb 78 mr r4,r31
74: 80 69 00 00 lwz r3,0(r9)
78: 3d 20 00 00 lis r9,0
7c: 39 29 00 00 addi r9,r9,0
80: 7d 29 03 a6 mtctr r9
84: 4e 80 04 21 bctrl
88: 80 01 00 14 lwz r0,20(r1)
8c: 83 e1 00 0c lwz r31,12(r1)
90: 38 21 00 10 addi r1,r1,16
94: 7c 08 03 a6 mtlr r0
98: 4e 80 00 20 blr
From my local build for RTEMS-mvme3100
void db_delete_field_log (db_field_log *pfl)
{
if (pfl) {
0: 7c 64 1b 79 mr. r4,r3
4: 4d 82 00 20 beqlr
/* Free field if reference type field log and dtor is set */
if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
8: 89 24 00 00 lbz r9,0(r4)
c: 71 29 00 80 andi. r9,r9,128
10: 40 82 00 10 bne 20 <db_delete_field_log+0x20>
/* Free the field log chunk */
freeListFree(dbevFieldLogFreeList, pfl);
14: 3d 20 00 00 lis r9,0
18: 80 69 00 00 lwz r3,0(r9)
1c: 48 00 00 00 b 1c <db_delete_field_log+0x1c>
if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
20: 81 24 00 50 lwz r9,80(r4)
24: 2f 89 00 00 cmpwi cr7,r9,0
28: 41 be ff ec beq cr7,14 <db_delete_field_log+0x14>
{
2c: 94 21 ff f0 stwu r1,-16(r1)
30: 7c 08 02 a6 mflr r0
if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
34: 7d 29 03 a6 mtctr r9
{
38: 90 01 00 14 stw r0,20(r1)
3c: 90 81 00 08 stw r4,8(r1)
if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
40: 4e 80 04 21 bctrl
}
}
44: 80 01 00 14 lwz r0,20(r1)
freeListFree(dbevFieldLogFreeList, pfl);
48: 3d 20 00 00 lis r9,0
if (pfl->type == dbfl_type_ref && pfl->u.r.dtor) pfl->u.r.dtor(pfl);
4c: 80 81 00 08 lwz r4,8(r1)
freeListFree(dbevFieldLogFreeList, pfl);
50: 80 69 00 00 lwz r3,0(r9)
}
54: 7c 08 03 a6 mtlr r0
58: 38 21 00 10 addi r1,r1,16
freeListFree(dbevFieldLogFreeList, pfl);
5c: 48 00 00 00 b 5c <db_delete_field_log+0x5c>
|