1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 <2022> 2023 2024 2025 | Index | 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 <2022> 2023 2024 2025 |
<== Date ==> | <== Thread ==> |
---|
Subject: | RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules |
From: | Mark Rivers via Tech-talk <tech-talk at aps.anl.gov> |
To: | Michael Davidsaver <mdavidsaver at gmail.com>, Torsten Bögershausen <Torsten.Bogershausen at ess.eu>, Benjamin Franksen <benjamin.franksen at helmholtz-berlin.de> |
Cc: | "tech-talk at aps.anl.gov" <tech-talk at aps.anl.gov> |
Date: | Wed, 1 Jun 2022 17:54:25 +0000 |
Ø
The last thing I will try is to git bisect from this commit 85822f3051d2236144bb46dc2c24b7e38143e531 to R7.0.5 to see where the hardware tests first begin to fail. I have now done that. I started with this: git bisect start 85822f3051d2236144bb46dc2c24b7e38143e531 R7.0.5 The results were [good, bad, good]. I defined good as running for at least 3600 seconds, while the bad failed in 128 seconds. The fail on the second pass was this: iocexample> VME Bus Error accessing A32: 0xbfa8c418 machine check Exception next instruction address: 0x03326ae8 Machine Status Register: 0x0008b032 Condition Register: 0x40000882 Task: 0x3585870 "CAS-event" 0x3585870 (CAS-event): task 0x3585870 has had a failure and has been stopped. 0x3585870 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10. iocexample> iocexample> tt 0x3585870 0x0012489c vxTaskEntry +0x48 : epicsThreadEntry () 0x03345014 epicsThreadEntry+0x80 : 0x032a17e8 () 0x032a1ac4 db_start_events+0x450: db_delete_field_log () 0x032a0fdc db_delete_field_log+0x54 : freeListFree () value = 0 = 0x0 iocexample> iocexample> ti 0x3585870 NAME ENTRY TID PRI STATUS PC SP ERRNO DELAY ---------- ------------ -------- --- ---------- -------- -------- ------- ----- CAS-event epicsThread> 3585870 133 STOP+I 3326ae8 3585770 0 0 full task name : CAS-event task entry : epicsThreadEntry process : kernel options : 0x1009001 VX_SUPERVISOR_MODE VX_DEALLOC_TCB VX_FP_TASK VX_DEALLOC_EXC_STACK STACK BASE END SP SIZE HIGH MARGIN --------- -------- -------- -------- ------- ------- ------- execution 3585870 3582990 3585770 12000 2800 9200 exception 35869c0 3585b50 3696 1344 2352 VxWorks Events -------------- Events Pended on : Not Pended Received Events : 0x0 Options : N/A r0 = 0x03326ab8 sp = 0x03585770 r2 = 0x002f1ac0 r3 = 0x00000000 r4 = 0x0033b448 r5 = 0x00000001 r6 = 0x00000001 r7 = 0x00000001 r8 = 0x0353a530 r9 = 0xffffffff r10 = 0x00000028 r11 = 0x035856f0 r12 = 0x20000884 r13 = 0x00321e60 r14 = 0x033b6088 r15 = 0x03373c5c r16 = 0x03289470 r17 = 0x033424e8 r18 = 0x03373bd4 r19 = 0x03373c14 r20 = 0x032a0ff0 r21 = 0x032a0f88 r22 = 0x00000000 r23 = 0x00000001 r24 = 0x033b608c r25 = 0x03339338 r26 = 0x033390d8 r27 = 0x034426f8 r28 = 0x0230aa00 r29 = 0x03555be8 r30 = 0x0356a558 r31 = 0x03492cf0 msr = 0x0008b032 lr = 0x03326ab8 ctr = 0x0026ecc0 pc = 0x03326ae8 cr = 0x40000882 xer = 0x00000000 pgTblPtr = 0x02310000 scSrTblPtr = 0x0230fe74 srTblPtr = 0x0230fe74 fpcsr = 0 fr0 = -0.0483711 fr1 = NaN fr2 = NaN fr3 = NaN fr4 = NaN fr5 = NaN fr6 = NaN fr7 = NaN fr8 = NaN fr9 = NaN fr10 = NaN fr11 = NaN fr12 = NaN fr13 = NaN fr14 = NaN fr15 = NaN fr16 = NaN fr17 = NaN fr18 = NaN fr19 = NaN fr20 = NaN fr21 = NaN fr22 = NaN fr23 = NaN fr24 = NaN fr25 = NaN fr26 = NaN fr27 = NaN fr28 = NaN fr29 = NaN fr30 = NaN fr31 = NaN machine checkvalue = Exception next instruction address: 0x003326ae8 = Machine Status Register: 0x0x0008b0320 Condition Register: 0x 40000882 git bisect reported this as the final result: corvette:local/epics-devel/base-7.0.6>git bisect good 27fe3e4468ec62d4d17f775c4aced939a666ba36 is the first bad commit commit 27fe3e4468ec62d4d17f775c4aced939a666ba36 Author: Ben Franksen <benjamin.franksen at helmholtz-berlin.de> Date: Mon Mar 30 21:34:32 2020 +0200 refactor db_field_log and filters to get rid of dbfl_type_rec This refactor simplifies and streamlines the code associated with server side filters. Apart from immediate benefits (clearer code, less duplication) it is also hoped that this will make it easier to add write filters. The data pointer dbfl_ref.field can now either point to a copy owned by a filter, or it can point to the original data owned by a record. In the latter case, the dbfl_ref.dtor is NULL. The dbExtractArray* functions are unified to the single function dbExtractArray and stripped of conversion functionality. This is redundant because we always call dbGet after applying filters, which takes care of conversion. Accordingly, dbChannelMakeArrayCopy is now obsolete and its single use (in the ts filter) replaced with dbExtractArray. Instead, we add the helper function dbChannelGetArrayInfo to wrap the common boilerplate around calls to the get_array_info method, used in both arr.c and ts.c. :040000 040000 eb0e8326fbd16b9bfe5c9316ebaa105ab74f91f0 3c6d6c0236ffe92a118ebf23a184affffc0cd055 M modules So my git bisect tests show the following: -
27fe3e4468ec62d4d17f775c4aced939a666ba36 is the first commit that causes the tests with hardware device support to fail -
56f05d722dee4b8ca2968b8bface2737a3a9b185 is the first commit that causes the tests with soft device support to fail Both of those commits were part of this PR
https://github.com/epics-base/epics-base/pull/99 One of the comments on that PR (January 27, 2021) was: We expect to be nervous about this change because of the complex area it's working in, so we will merge it after the
next release which should be before the Codeathon in March, but we don't see anything obviously wrong with it. Mark From: Mark Rivers I agree that a debugger would be very helpful. I don't have that tool, but I think others at the APS do. I'll find out when Andrew returns next week. Meanwhile git
bisect can help point in the right direction. I tested this commit 85822f3051d2236144bb46dc2c24b7e38143e531 with soft records overnight. It ran for 32,000 seconds without failing before I stopped it. The next commit, 56f05d722dee4b8ca2968b8bface2737a3a9b185, fails in less than 10 seconds in each of 3 tests. I just tested with the IPAC hardware again with commit 85822f3051d2236144bb46dc2c24b7e38143e531 (which does not fail with soft records). It failed after 960 seconds
with this error: VME Bus Error accessing A32: 0xbffcf67c machine check Exception next instruction address: 0x03326a90 Machine Status Register: 0x0008b032 Condition Register: 0x42000884 Task: 0x3559330 "CAS-event" 0x3559330 (CAS-event): task 0x3559330 has had a failure and has been stopped. 0x3559330 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10. This is the task trace: iocexample> tt 0x3559330 0x0012489c vxTaskEntry +0x48 : epicsThreadEntry () 0x0334501c epicsThreadEntry+0x80 : 0x032a17f0 () 0x032a1acc db_start_events+0x450: db_delete_field_log () 0x032a0fe4 db_delete_field_log+0x54 : freeListFree () value = 0 = 0x0 This tells us pretty clearly that the failure as in freeListFree, called by db_delete_field_log. That has been a very common stack trace for these errors. Till complained that vxWorks did not provide a register dump, but that is not true, it is available with the “ti” command. I had included that in several previous messages.
Here it is for this task. iocexample> ti 0x3559330 NAME ENTRY TID PRI STATUS PC SP ERRNO DELAY ---------- ------------ -------- --- ---------- -------- -------- ------- ----- CAS-event epicsThread> 3559330 133 STOP+I 3326a90 3559240 0 0 full task name : CAS-event task entry : epicsThreadEntry process : kernel options : 0x1009001 VX_SUPERVISOR_MODE VX_DEALLOC_TCB VX_FP_TASK VX_DEALLOC_EXC_STACK STACK BASE END SP SIZE HIGH MARGIN --------- -------- -------- -------- ------- ------- ------- execution 3559330 3556450 3559240 12000 2800 9200 exception 2700210 26ff3a0 3696 1344 2352 VxWorks Events -------------- Events Pended on : Not Pended Received Events : 0x0 Options : N/A r0 = 0x40000000 sp = 0x03559240 r2 = 0x002f1ac0 r3 = 0x03492ce0 r4 = 0x0357bec8 r5 = 0x00000001 r6 = 0x00000001 r7 = 0x00000000 r8 = 0xeefceefd r9 = 0x03326a90 r10 = 0x00000000 r11 = 0x035591b0 r12 = 0x20000884 r13 = 0x00321e60 r14 = 0x033b6098 r15 = 0x03373c64 r16 = 0x03289478 r17 = 0x033424f0 r18 = 0x03373bdc r19 = 0x03373c1c r20 = 0x032a0ff8 r21 = 0x032a0f90 r22 = 0x00000000 r23 = 0x00000001 r24 = 0x033b609c r25 = 0x03339340 r26 = 0x033390e0 r27 = 0x03443b60 r28 = 0x0357e748 r29 = 0x03567d78 r30 = 0x032dad44 r31 = 0x0357bec8 msr = 0x0008b032 lr = 0x032a0fe4 ctr = 0x03326a90 pc = 0x03326a90 cr = 0x42000884 xer = 0x00000000 pgTblPtr = 0x02310000 scSrTblPtr = 0x0230fe74 srTblPtr = 0x0230fe74 fpcsr = 0 fr0 = -1.81018 fr1 = NaN fr2 = NaN fr3 = NaN fr4 = NaN fr5 = NaN fr6 = NaN fr7 = NaN fr8 = NaN fr9 = NaN fr10 = NaN fr11 = NaN fr12 = NaN fr13 = NaN fr14 = NaN fr15 = NaN fr16 = NaN fr17 = NaN fr18 = NaN fr19 = NaN fr20 = NaN fr21 = NaN fr22 = NaN fr23 = NaN fr24 = NaN fr25 = NaN fr26 = NaN fr27 = NaN fr28 = NaN fr29 = NaN fr30 = NaN fr31 = NaN machine checkvalue = Exception next instruction address: 0x003326a90 Machine Status Register: 0x = 0008b032 Condition Register: 0x420008840x 0 And here is the disassembly for freeListFree, where I have highlighted in red the "next instruction" above. freeListFree: 0x3326a90 9421fff0 stwu r1,-16(r1) 0x3326a94 7c0802a6 mfspr r0,LR 0x3326a98 3d200334 lis r9,0x334 # 820 0x3326a9c 392990e0 addi r9,r9,0x90e0 # -28448 0x3326aa0 7d2903a6 mtspr CTR,r9 0x3326aa4 90010014 stw r0,20(r1) 0x3326aa8 93c10008 stw r30,8(r1) 0x3326aac 7c9e2378 or r30,r4,r4 0x3326ab0 93e1000c stw r31,12(r1) 0x3326ab4 7c7f1b78 or r31,r3,r3 0x3326ab8 80630014 lwz r3,20(r3) 0x3326abc 4e800421 bcctrl 0x14,0 0x3326ac0 2f830000 cmpi crf7,0,r3,0x0 # 0 0x3326ac4 41be002c bc 0xd,30, 0x3326af0 # 0x03326af0 0x3326ac8 3d200334 lis r9,0x334 # 820 0x3326acc 3c600338 lis r3,0x338 # 824 0x3326ad0 3ca00338 lis r5,0x338 # 824 0x3326ad4 392924f0 addi r9,r9,0x24f0 # 9456 0x3326ad8 386335a4 addi r3,r3,0x35a4 # 13732 0x3326adc 38a535c0 addi r5,r5,0x35c0 # 13760 0x3326ae0 7d2903a6 mtspr CTR,r9 0x3326ae4 3880008f li r4,0x8f # 143 0x3326ae8 38c00000 li r6,0x0 # 0 0x3326aec 4e800421 bcctrl 0x14,0 0x3326af0 813f0010 lwz r9,16(r31) 0x3326af4 801f0008 lwz r0,8(r31) 0x3326af8 39290001 addi r9,r9,0x1 # 1 0x3326afc 807f0014 lwz r3,20(r31) 0x3326b00 913f0010 stw r9,16(r31) 0x3326b04 3d200334 lis r9,0x334 # 820 0x3326b08 39299340 addi r9,r9,0x9340 # -27840 0x3326b0c 901e0000 stw r0,0(r30) 0x3326b10 7d2903a6 mtspr CTR,r9 0x3326b14 93df0008 stw r30,8(r31) 0x3326b18 4e800421 bcctrl 0x14,0 0x3326b1c 80010014 lwz r0,20(r1) 0x3326b20 83c10008 lwz r30,8(r1) 0x3326b24 7c0803a6 mtspr LR,r0 0x3326b28 83e1000c lwz r31,12(r1) 0x3326b2c 38210010 addi r1,r1,0x10 # 16 0x3326b30 4e800020 blr This is the source code for freeListFree in this tag of base. LIBCOM_API void epicsStdCall freeListFree(void *pvt,void*pmem) { FREELISTPVT *pfl = pvt; # ifdef EPICS_FREELIST_DEBUG memset ( pmem, 0xdd, pfl->size ); free(pmem); # else void **ppnext; VALGRIND_MEMPOOL_FREE(pvt, pmem); VALGRIND_MEMPOOL_ALLOC(pvt, pmem, sizeof(void*)); epicsMutexMustLock(pfl->lock); ppnext = pmem; *ppnext = pfl->head; pfl->head = pmem; pfl->nBlocksAvailable++; epicsMutexUnlock(pfl->lock); # endif } The last thing I will try is to git bisect from this commit 85822f3051d2236144bb46dc2c24b7e38143e531 to R7.0.5 to see where the hardware tests first begin to fail. Mark |