EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules
From: Michael Davidsaver via Tech-talk <tech-talk at aps.anl.gov>
To: Mark Rivers <rivers at cars.uchicago.edu>, Benjamin Franksen <benjamin.franksen at helmholtz-berlin.de>
Cc: "tech-talk at aps.anl.gov" <tech-talk at aps.anl.gov>
Date: Tue, 31 May 2022 22:38:57 -0700
On 5/31/22 20:17, Mark Rivers wrote:
The git bisect pinpointed commit 56f05d722dee4b8ca2968b8bface2737a3a9b185 as the first one that caused the problem with soft device support.

This commit was part of Ben's PR #99, which was the first merge after 7.0.5.

https://github.com/epics-base/epics-base/pull/99


This is the difference between that commit and the previous one:

corvette:local/epics-devel/base-7.0.6>git diff 85822f3051d2236144bb46dc2c24b7e38143e531 56f05d722dee4b8ca2968b8bface2737a3a9b185
diff --git a/modules/database/src/ioc/db/dbAccess.c b/modules/database/src/ioc/db/dbAccess.c
index 3f7554a..d50b256 100644
--- a/modules/database/src/ioc/db/dbAccess.c
+++ b/modules/database/src/ioc/db/dbAccess.c
@@ -952,7 +952,7 @@ long dbGet(DBADDR *paddr, short dbrType,
             goto done;
         }
-        if (!pfl) {
+        if (!dbfl_has_copy(pfl)) {
             status = dbFastGetConvertRoutine[field_type][dbrType]
                 (paddr->pfield, pbuf, paddr);
         } else {
@@ -1000,7 +1000,7 @@ long dbGet(DBADDR *paddr, short dbrType,
         /* convert data into the caller's buffer */
         if (n <= 0) {
             ;                           /*do nothing */
-        } else if (!pfl) {
+        } else if (!dbfl_has_copy(pfl)) {
             status = convert(paddr, pbuf, n, capacity, offset);
         } else {
             DBADDR localAddr = *paddr; /* Structure copy */

That commit was not included in base R7.0.5 but it was included in R7.0.6.
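
For readers following the diff, the behavioural change can be sketched in a few lines of C. This is a simplified model, not the EPICS source: `struct field_log`, `fl_has_copy()`, `source_old()` and `source_new()` are hypothetical names, and the real `dbfl_has_copy()` may test further conditions. It only illustrates that the two forms of the test differ when `pfl` is non-NULL but carries no copy of the data.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for an EPICS db_field_log; NOT the real definition. */
enum fl_kind { FL_VALUE, FL_REF };

struct field_log {
    enum fl_kind kind;                 /* value held inline, or a reference */
    void (*dtor)(struct field_log *);  /* assumed non-NULL when a referenced copy is owned */
};

/* Hypothetical model of the new predicate: true only when the log holds its own data. */
static int fl_has_copy(const struct field_log *pfl)
{
    return pfl != NULL && (pfl->kind == FL_VALUE || pfl->dtor != NULL);
}

/* Where the data would be read from under each form of the test. */
enum source { FROM_RECORD, FROM_FIELD_LOG };

/* Old test: any non-NULL field log is used as the data source. */
static enum source source_old(const struct field_log *pfl)
{
    return !pfl ? FROM_RECORD : FROM_FIELD_LOG;
}

/* New test: the field log is used only when it carries a copy. */
static enum source source_new(const struct field_log *pfl)
{
    return !fl_has_copy(pfl) ? FROM_RECORD : FROM_FIELD_LOG;
}
```

In this model, a non-NULL field log without a copy takes the field-log path under the old test and the record path under the new one; for a NULL `pfl` both tests agree.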

I ran 3 tests with the above commit with the soft device records, and it failed in [3, 6, 3] seconds.

I then tested with the previous commit (85822f3051d2236144bb46dc2c24b7e38143e531) and it ran for 5700 seconds without failing before I stopped it.

When running the tests with the soft records 100% of the VME bus errors were A32 address space, and the failing task was always CAC-event.

Now that I am quite sure that commit 85822f3051d2236144bb46dc2c24b7e38143e531 does not show the problems with soft device support, I tested that same commit with the Ip330 and dac128V hardware devices.  That commit predates Michael’s commit that added the AMSG and NAMSG fields to dbCommon, so those fields cannot be the cause.

In 3 tests it failed after [60, 1227, 844] seconds.

The test with hardware first fails with an A16 error, not A32.

VME Bus Error accessing A16: 0x347e
machine check
Exception next instruction address: 0x032a1a40
Machine Status Register: 0x0008b032
Condition Register: 0x28000884
Task: 0x26fccd0 "CAS-event"
0x26fccd0 (CAS-event): task 0x26fccd0 has had a failure and has been stopped.
0x26fccd0 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.

The task trace is not very informative, but note that it is not in db_delete_field_log(), where all previous failures have been, with both soft and hard device support.

iocexample> tt 0x26fccd0
0x0012489c vxTaskEntry  +0x48 : epicsThreadEntry ()
0x0334501c epicsThreadEntry+0x80 : 0x032a17f0 ()
value = 0 = 0x0

This was the second failure and its tt output:

VME Bus Error accessing A32: 0xbffb3330
machine check
Exception next instruction address: 0x032e1654
Machine Status Register: 0x0008b032
Condition Register: 0x48002882
Task: 0x26fcd40 "CAS-event"
0x26fcd40 (CAS-event): task 0x26fcd40 has had a failure and has been stopped.
0x26fcd40 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.

iocexample> tt 0x26fcd40
0x0012489c vxTaskEntry  +0x48 : epicsThreadEntry ()
0x0334501c epicsThreadEntry+0x80 : 0x032a17f0 ()
0x032a1a8c db_start_events+0x410: 0x032dad44 ()
0x032dae54 rsrvFreePutNotify+0xb44: cas_copy_in_header ()
value = 0 = 0x0

This was the third failure and its tt output:

VME Bus Error accessing A16: 0x347e
machine check
Exception next instruction address: 0x032a0fe0
Machine Status Register: 0x0008b032
Condition Register: 0x48000884
Task: 0x230d720 "CAS-event"
0x230d720 (CAS-event): task 0x230d720 has had a failure and has been stopped.
0x230d720 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.

iocexample> tt 0x230d720
0x0012489c vxTaskEntry  +0x48 : epicsThreadEntry ()
0x0334501c epicsThreadEntry+0x80 : 0x032a17f0 ()
0x032a1acc db_start_events+0x450: db_delete_field_log ()
value = 0 = 0x0

So this is interesting.  Commit 85822f3051d2236144bb46dc2c24b7e38143e531 appears to be free of the CAC-event errors seen with soft device support.  But hardware device support still fails at that commit, so there must be an earlier commit that is causing those problems.

Mark

*From:* Mark Rivers
*Sent:* Tuesday, May 31, 2022 3:58 PM
*To:* Torsten Bögershausen <Torsten.Bogershausen at ess.eu>; Michael Davidsaver <mdavidsaver at gmail.com>
*Cc:* tech-talk at aps.anl.gov
*Subject:* RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules

I did a test with the latest 3.15 base, 3.15.9.  It works fine.

Torsten suggested:

> So R7.0.5 is good, and f9ea6a5bff695c5f88bb95dce38a3fd349738907 is bad?
> There are some "real" commits, and merges:
> git log R7.0.5..f9ea6a5bff695c5f88bb95dce38a3fd349738907
> Then it could make sense to bisect between those 2?

That was a good idea.  I did:

git bisect start f9ea6a5bff695c5f88bb95dce38a3fd349738907 R7.0.5

The results were [bad, good, bad, good, bad, bad].  This is the final output:

corvette:local/epics-devel/base-7.0.6>git bisect bad
56f05d722dee4b8ca2968b8bface2737a3a9b185 is the first bad commit
commit 56f05d722dee4b8ca2968b8bface2737a3a9b185
Author: Ben Franksen <benjamin.franksen at helmholtz-berlin.de>
Date:   Thu Jan 14 17:38:58 2021 +0100

    fix in dbGet: decide use of db_field_log based on whether it has copy or not

:040000 040000 c097595692ed4936a3c90b5583fe78d635635e07 62d947b9bb7f1bc2c21375dfaf13917f4500d59f M      modules

Since every single failure, whether in CAS-event or in CAC-event, occurs in db_delete_field_log, this commit does seem very suspicious.  I will run long tests on the commit before this one, and on this commit.
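
The search that `git bisect` performs can be sketched as a binary search. The C below is a minimal sketch assuming a linear history indexed from a known-good commit to a known-bad one; real history contains merges, which git handles differently, and `first_bad_commit()`, `is_bad_fn` and `bad_from()` are hypothetical names. It also assumes each verdict is reliable, which is why an intermittent failure like this one needs long test runs before a commit is marked good.

```c
#include <assert.h>
#include <stddef.h>

/* Callback that builds and tests one commit; returns nonzero if it is bad. */
typedef int (*is_bad_fn)(size_t commit_index, void *ctx);

/*
 * Return the index of the first bad commit, given that index `good` is
 * known good and index `bad` is known bad. If `steps` is non-NULL it
 * receives the number of commits that had to be tested.
 */
static size_t first_bad_commit(size_t good, size_t bad,
                               is_bad_fn is_bad, void *ctx, unsigned *steps)
{
    assert(good < bad);
    if (steps)
        *steps = 0;
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;  /* halve the remaining range */
        if (steps)
            ++*steps;
        if (is_bad(mid, ctx))
            bad = mid;    /* the culprit is at mid or earlier */
        else
            good = mid;   /* the culprit is after mid */
    }
    return bad;
}

/* Hypothetical predicate for illustration: every commit at or after *ctx fails. */
static int bad_from(size_t commit_index, void *ctx)
{
    return commit_index >= *(const size_t *)ctx;
}
```

With endpoints 64 commits apart, this search takes six test runs to isolate the culprit.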

Torsten asked:

> Could it be that a stack has become too short under vxWorks?

I checked that with checkStack after it failed running 7.0.6.  None of the tasks have run out of stack space.

iocexample> checkStack
   NAME        ENTRY        TID       SIZE   CUR  HIGH  MARGIN
------------ ------------ ---------- ----- ----- ----- ------
tJobTask     0x1c6b00     0x23b8fa0   8000   224   608   7392
(Exception Stack)                     4000     0    96   3904
tExcTask     0x1c61f8     0x33a5d0    8192   256  1248   6944
(Exception Stack)                     4000     0    96   3904
tLogTask     logTask      0x23bc340   5008   368  1232   3776
(Exception Stack)                     4000     0    96   3904
tShell0      shellTask    0x351a600  65536   864  7328  58208
(Exception Stack)                     3696     0    96   3600
tWdbTask     0x261f48     0x26e0a30   8192   272   320   7872
(Exception Stack)                     3696     0    96   3600
ipcom_tickd  0x285b88     0x26c6ea0   6144   256   576   5568
(Exception Stack)                     4000     0    96   3904
tVxdbgTask   0x13185c     0x26da910   8192   208   256   7936
(Exception Stack)                     3696     0    96   3600
tNet0        ipcomNetTask 0x23c1060  10000   240  1904   8096
(Exception Stack)                     4000     0   192   3808
ipcom_syslog 0x16bf48     0x268d220   6144   480   800   5344
(Exception Stack)                     4000     0    96   3904
tNetConf     0x1b74b8     0x26b9a50   6144   640  1488   4656
(Exception Stack)                     4000     0    96   3904
ipcom_telnet ipcom_telnet 0x26cdcc0   6144   512  1152   4992
(Exception Stack)                     4000     0    96   3904
ipsntps      0x1ba330     0x26d1450   6144   416  1056   5088
(Exception Stack)                     4000     0    96   3904
tPortmapd    portmapd     0x26d5570  10000   640  1072   8928
(Exception Stack)                     4000     0   192   3808
cbHigh       epicsThreadE 0x3464310  22000   288   576  21424
(Exception Stack)                     3696     0    96   3600
timerQueue   epicsThreadE 0x3449ff0  12000   448   704  11296
(Exception Stack)                     3696     0    96   3600
scanOnce     epicsThreadE 0x3479e60  22000   320   864  21136
(Exception Stack)                     3696     0   352   3344
scan-0.1     epicsThreadE 0x34a8cb0  22000   432   688  21312
(Exception Stack)                     3696     0   192   3504
scan-0.2     epicsThreadE 0x34a2180  22000   432   688  21312
(Exception Stack)                     3696     0    96   3600
cbMedium     epicsThreadE 0x345b7b0  22000   288   576  21424
(Exception Stack)                     3696     0    96   3600
scan-0.5     epicsThreadE 0x349b650  22000   432   688  21312
(Exception Stack)                     3696     0   192   3504
scan-1       epicsThreadE 0x3494b20  22000   432   688  21312
(Exception Stack)                     3696     0    96   3600
scan-2       epicsThreadE 0x348dff0  22000   432   688  21312
(Exception Stack)                     3696     0    96   3600
scan-5       epicsThreadE 0x34874c0  22000   432   688  21312
(Exception Stack)                     3696     0    96   3600
scan-10      epicsThreadE 0x3480990  22000   432   688  21312
(Exception Stack)                     3696     0    96   3600
cbLow        epicsThreadE 0x3452b00  22000   288   576  21424
(Exception Stack)                     3696     0    96   3600
CAC-event    epicsThreadE 0x34d4ae0  12000   240  2944   9056
(Exception Stack)                     3696   656  1344   2352
dbCaLink     epicsThreadE 0x346b0c0  22000   336  2976  19024
(Exception Stack)                     3696     0    96   3600
poolPoll     epicsThreadE 0x34b9e30   8000   272   560   7440
(Exception Stack)                     3696     0    96   3600
CAS-client   epicsThreadE 0x3560ef0  22000   784  1056  20944
(Exception Stack)                     3696     0    96   3600
CAS-client   epicsThreadE 0x357c910  22000   992  1232  20768
(Exception Stack)                     3696     0   352   3344
CAS-client   epicsThreadE 0x3586b10  22000   784  1056  20944
(Exception Stack)                     3696     0    96   3600
CAS-event    epicsThreadE 0x230c4b0  12000   320  2784   9216
(Exception Stack)                     3696     0   176   3520
CAS-event    epicsThreadE 0x3577040  12000   320   608  11392
(Exception Stack)                     3696     0    96   3600
CAS-event    epicsThreadE 0x357ff60  12000   320  2784   9216
(Exception Stack)                     3696     0   192   3504
CAS-TCP      epicsThreadE 0x34b2670  12000   768  1632  10368
(Exception Stack)                     3696     0   144   3552
CAS-beacon   epicsThreadE 0x34ce050   8000   464   928   7072
(Exception Stack)                     3696     0    96   3600
CAS-UDP      epicsThreadE 0x34b69e0  12000   864  1648  10352
(Exception Stack)                     3696     0   144   3552
errlog       epicsThreadE 0x343b740   8000   320   592   7408
(Exception Stack)                     3696     0    96   3600
taskwd       epicsThreadE 0x3445470   8000   416   560   7440
(Exception Stack)                     3696     0    96   3600
INTERRUPT                             5008     0   928   4080
value = 63 = 0x3f = '?'
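
In the checkStack table, MARGIN is SIZE minus HIGH for the rows shown, so the stack check amounts to confirming that no margin is close to zero. A minimal C sketch of that arithmetic follows; `struct stack_row`, `stack_margin()` and `stack_is_tight()` are hypothetical helpers, not part of vxWorks, and the threshold is an arbitrary policy choice.

```c
#include <assert.h>

/* One row of checkStack-style output (all sizes in bytes). */
struct stack_row {
    const char *name;
    unsigned size;   /* SIZE: total stack size */
    unsigned high;   /* HIGH: high-water mark of stack usage */
};

/* MARGIN as it appears in the table: bytes that were never used. */
static unsigned stack_margin(const struct stack_row *row)
{
    assert(row->high <= row->size);
    return row->size - row->high;
}

/* Nonzero when the margin is below a caller-chosen threshold (hypothetical policy). */
static int stack_is_tight(const struct stack_row *row, unsigned min_margin)
{
    return stack_margin(row) < min_margin;
}
```

For the CAC-event row, 12000 - 2944 gives 9056 bytes, matching the MARGIN column.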

Thanks,

Mark

-----Original Message-----
From: Torsten Bögershausen <Torsten.Bogershausen at ess.eu>
Sent: Tuesday, May 31, 2022 9:01 AM
To: Mark Rivers <rivers at cars.uchicago.edu>; Michael Davidsaver <mdavidsaver at gmail.com>
Cc: tech-talk at aps.anl.gov
Subject: Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules

Hej Mark,

So R7.0.5 is good, and f9ea6a5bff695c5f88bb95dce38a3fd349738907 is bad?

There are some "real" commits, and merges:

git log R7.0.5..f9ea6a5bff695c5f88bb95dce38a3fd349738907

Then it could make sense to bisect between those 2?

Another question:

Could it make sense to run the SW (maybe even more stripped down) under Linux instead, with valgrind?

A 3rd one:

Could it be that a stack has become too short under vxWorks?


