I did a test with the latest 3.15 base, 3.15.9. It works fine.
Torsten suggested:
Ø
So R7.0.5 is good, and f9ea6a5bff695c5f88bb95dce38a3fd349738907 is bad ?
Ø
There are some "real" commits, and merges:
Ø
git log R7.0.5..f9ea6a5bff695c5f88bb95dce38a3fd349738907
Ø
Then it could make sense, to bisect between those 2?
That was a good idea. I did:
git bisect start f9ea6a5bff695c5f88bb95dce38a3fd349738907 R7.0.5
The results were [bad, good, bad, good, bad, bad]. This is the final output:
corvette:local/epics-devel/base-7.0.6>git bisect bad
56f05d722dee4b8ca2968b8bface2737a3a9b185 is the first bad commit
commit 56f05d722dee4b8ca2968b8bface2737a3a9b185
Author: Ben Franksen <benjamin.franksen at helmholtz-berlin.de>
Date: Thu Jan 14 17:38:58 2021 +0100
fix in dbGet: decide use of db_field_log based on whether it has copy or not
:040000 040000 c097595692ed4936a3c90b5583fe78d635635e07 62d947b9bb7f1bc2c21375dfaf13917f4500d59f M modules
Since every single failure, whether in CAS-event or in CAC-event is failing in db_delete_field_log this commit does seem very suspicious. I will run long tests on the
commit before this one, and on this commit.
Torsten asked:
Ø
Could it be that a stack has become too short under vxWorks ?
I checked that with checkStack after it failed running 7.0.6. None of the tasks have run out of stack space.
iocexample> checkStack
NAME ENTRY TID SIZE CUR HIGH MARGIN
------------ ------------ ---------- ----- ----- ----- ------
tJobTask 0x1c6b00 0x23b8fa0 8000 224 608 7392
(Exception Stack) 4000 0 96 3904
tExcTask 0x1c61f8 0x33a5d0 8192 256 1248 6944
(Exception Stack) 4000 0 96 3904
tLogTask logTask 0x23bc340 5008 368 1232 3776
(Exception Stack) 4000 0 96 3904
tShell0 shellTask 0x351a600 65536 864 7328 58208
(Exception Stack) 3696 0 96 3600
tWdbTask 0x261f48 0x26e0a30 8192 272 320 7872
(Exception Stack) 3696 0 96 3600
ipcom_tickd 0x285b88 0x26c6ea0 6144 256 576 5568
(Exception Stack) 4000 0 96 3904
tVxdbgTask 0x13185c 0x26da910 8192 208 256 7936
(Exception Stack) 3696 0 96 3600
tNet0 ipcomNetTask 0x23c1060 10000 240 1904 8096
(Exception Stack) 4000 0 192 3808
ipcom_syslog 0x16bf48 0x268d220 6144 480 800 5344
(Exception Stack) 4000 0 96 3904
tNetConf 0x1b74b8 0x26b9a50 6144 640 1488 4656
(Exception Stack) 4000 0 96 3904
ipcom_telnet ipcom_telnet 0x26cdcc0 6144 512 1152 4992
(Exception Stack) 4000 0 96 3904
ipsntps 0x1ba330 0x26d1450 6144 416 1056 5088
(Exception Stack) 4000 0 96 3904
tPortmapd portmapd 0x26d5570 10000 640 1072 8928
(Exception Stack) 4000 0 192 3808
cbHigh epicsThreadE 0x3464310 22000 288 576 21424
(Exception Stack) 3696 0 96 3600
timerQueue epicsThreadE 0x3449ff0 12000 448 704 11296
(Exception Stack) 3696 0 96 3600
scanOnce epicsThreadE 0x3479e60 22000 320 864 21136
(Exception Stack) 3696 0 352 3344
scan-0.1 epicsThreadE 0x34a8cb0 22000 432 688 21312
(Exception Stack) 3696 0 192 3504
scan-0.2 epicsThreadE 0x34a2180 22000 432 688 21312
(Exception Stack) 3696 0 96 3600
cbMedium epicsThreadE 0x345b7b0 22000 288 576 21424
(Exception Stack) 3696 0 96 3600
scan-0.5 epicsThreadE 0x349b650 22000 432 688 21312
(Exception Stack) 3696 0 192 3504
scan-1 epicsThreadE 0x3494b20 22000 432 688 21312
(Exception Stack) 3696 0 96 3600
scan-2 epicsThreadE 0x348dff0 22000 432 688 21312
(Exception Stack) 3696 0 96 3600
scan-5 epicsThreadE 0x34874c0 22000 432 688 21312
(Exception Stack) 3696 0 96 3600
scan-10 epicsThreadE 0x3480990 22000 432 688 21312
(Exception Stack) 3696 0 96 3600
cbLow epicsThreadE 0x3452b00 22000 288 576 21424
(Exception Stack) 3696 0 96 3600
CAC-event epicsThreadE 0x34d4ae0 12000 240 2944 9056
(Exception Stack) 3696 656 1344 2352
dbCaLink epicsThreadE 0x346b0c0 22000 336 2976 19024
(Exception Stack) 3696 0 96 3600
poolPoll epicsThreadE 0x34b9e30 8000 272 560 7440
(Exception Stack) 3696 0 96 3600
CAS-client epicsThreadE 0x3560ef0 22000 784 1056 20944
(Exception Stack) 3696 0 96 3600
CAS-client epicsThreadE 0x357c910 22000 992 1232 20768
(Exception Stack) 3696 0 352 3344
CAS-client epicsThreadE 0x3586b10 22000 784 1056 20944
(Exception Stack) 3696 0 96 3600
CAS-event epicsThreadE 0x230c4b0 12000 320 2784 9216
(Exception Stack) 3696 0 176 3520
CAS-event epicsThreadE 0x3577040 12000 320 608 11392
(Exception Stack) 3696 0 96 3600
CAS-event epicsThreadE 0x357ff60 12000 320 2784 9216
(Exception Stack) 3696 0 192 3504
CAS-TCP epicsThreadE 0x34b2670 12000 768 1632 10368
(Exception Stack) 3696 0 144 3552
CAS-beacon epicsThreadE 0x34ce050 8000 464 928 7072
(Exception Stack) 3696 0 96 3600
CAS-UDP epicsThreadE 0x34b69e0 12000 864 1648 10352
(Exception Stack) 3696 0 144 3552
errlog epicsThreadE 0x343b740 8000 320 592 7408
(Exception Stack) 3696 0 96 3600
taskwd epicsThreadE 0x3445470 8000 416 560 7440
(Exception Stack) 3696 0 96 3600
INTERRUPT 5008 0 928 4080
value = 63 = 0x3f = '?'
Thanks,
Mark
-----Original Message-----
From: Torsten Bögershausen <Torsten.Bogershausen at ess.eu>
Sent: Tuesday, May 31, 2022 9:01 AM
To: Mark Rivers <rivers at cars.uchicago.edu>; Michael Davidsaver <mdavidsaver at gmail.com>
Cc: tech-talk at aps.anl.gov
Subject: Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules
Hej Mark,
So R7.0.5 is good, and f9ea6a5bff695c5f88bb95dce38a3fd349738907 is bad ?
There are some "real" commits, and merges:
git log R7.0.5..f9ea6a5bff695c5f88bb95dce38a3fd349738907
Then it could make sense, to bisect between those 2?
Another question:
Coluld it make sense to run the SW (even more stripped may be) under Linux instead with valgrind ?
A 3rd one:
Could it be that a stack has become too short under vxWorks ?
From: Tech-talk <tech-talk-bounces at aps.anl.gov> on behalf of Mark Rivers via Tech-talk <tech-talk at aps.anl.gov>
Reply-To: Mark Rivers <rivers at cars.uchicago.edu>
Date: Tuesday, 31 May 2022 at 15:35
To: Michael Davidsaver <mdavidsaver at gmail.com>
Cc: "tech-talk at aps.anl.gov" <tech-talk at aps.anl.gov>
Subject: RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules
➢ I have rebuilt the example application with soft records with base 7.0.5 with the DAC128V enabled. It has currently run for 2500 seconds without failing, while base 7.0.6
always failed within 138 seconds. I will let it run overnight.
Base 7.0.5 ran fine overnight, no failures in over 15 hours.
I previously reported that when I removed the AMSG and PAMSG fields from dbCommon in base 7.0.6 it did not fail when I was testing with IPAC hardware. However, removing those fields did NOT fix the problem when running with records
with soft device support.
I thus decided to run “git bisect” again, testing with the soft device support records. I bisected from R7.0.6 to R7.0.5. The results were [good, bad, bad, bad]. The bisect completed successfully with this output:
corvette:local/epics-devel/base-7.0.6>git bisect bad The merge base f9ea6a5bff695c5f88bb95dce38a3fd349738907 is bad.
This means the bug has been fixed between f9ea6a5bff695c5f88bb95dce38a3fd349738907 and [1c6b02b9be2a1c778f6395dd677ef425b04410be 7f142e03f5cef74b5516e76494f29c3a0f143f4f].
This looks like a merge commit, so I am not sure how to proceed.
Mark
From: Mark Rivers
Sent: Monday, May 30, 2022 3:02 PM
To: Michael Davidsaver <mdavidsaver at gmail.com>
Cc: tech-talk at aps.anl.gov
Subject: RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules
Hi Michael,
I have now reproduced the problem with no VME bus activity, i.e. no VME bus drivers initialized and no records using the VME bus. This is the configuration:
- Base 7.0.6, only change is to configure/CONFIG_SITE to build for vxWorks-ppc32
- asyn, ipac, ip330, and dac128V modules built from new clones, master branch
- IOC application newly created using makeBaseApp.pl from base 7.0.6
These are the changes to the example application created with makeBaseApp.
configure/RELEASE has these changes:
# Variables and paths to dependent modules:
MODULES = /corvette/home/epics/devel_test #MYMODULE = $(MODULES)/my-module
# If using the sequencer, point SNCSEQ at its top directory:
#SNCSEQ = $(MODULES)/seq-ver
ASYN = $(MODULES)/asyn-4-42
IPAC = $(MODULES)/ipac-2-16
IP330 = $(MODULES)/ip330-2-10
DAC128V = $(MODULES)/dac128V-2-10-1
exampleApp/src/Makefile was changed to this. Modules can be added and removed from the application by editing configure/RELEASE. I removed the example records, etc. that is normally built in this file.
TOP=../..
include $(TOP)/configure/CONFIG
#----------------------------------------
# ADD MACRO DEFINITIONS BELOW HERE
# Build the IOC application
PROD_IOC = example
# example.dbd will be created and installed DBD += example.dbd
# example.dbd will include these files:
example_DBD += base.dbd
# example_registerRecordDeviceDriver.cpp derives from example.dbd example_SRCS += example_registerRecordDeviceDriver.cpp
# Build the main IOC entry point where needed example_SRCS_DEFAULT += exampleMain.cpp example_SRCS_vxWorks += -nil-
ifdef IPAC
example_LIBS += Ipac
example_DBD += drvIpac.dbd
endif
ifdef IP330
example_LIBS += ip330
example_DBD += ip330Support.dbd
endif
ifdef DAC128V
example_LIBS += dac128V
example_DBD += dac128VSupport.dbd
endif
ifdef ASYN
example_LIBS += asyn
example_DBD += asyn.dbd
endif
# Finally link IOC to the EPICS Base libraries example_LIBS += $(EPICS_BASE_IOC_LIBS)
include $(TOP)/configure/RULES
#----------------------------------------
# ADD EXTRA GNUMAKE RULES BELOW HERE
In the iocBoot/iocexample directory I added 2 new database files containing records with soft device support only.
aoTest.template contains 10 ao records:
corvette:example/iocBoot/iocexample>more aoTest.template
record(ao,"testAo_0") {}
record(ao,"testAo_1") {}
record(ao,"testAo_2") {}
record(ao,"testAo_3") {}
record(ao,"testAo_4") {}
record(ao,"testAo_5") {}
record(ao,"testAo_6") {}
record(ao,"testAo_7") {}
record(ao,"testAo_8") {}
record(ao,"testAo_9") {}
aiTest.template contains 10 ai records with an INP link to the corresponding ao record, using CP
corvette:example/iocBoot/iocexample>more aiTest.template
record(ai,"testAi_0") {field(INP, "testAo_0 CP") }
record(ai,"testAi_1") {field(INP, "testAo_1 CP") }
record(ai,"testAi_2") {field(INP, "testAo_2 CP") }
record(ai,"testAi_3") {field(INP, "testAo_3 CP") }
record(ai,"testAi_4") {field(INP, "testAo_4 CP") }
record(ai,"testAi_5") {field(INP, "testAo_5 CP") }
record(ai,"testAi_6") {field(INP, "testAo_6 CP") }
record(ai,"testAi_7") {field(INP, "testAo_7 CP") }
record(ai,"testAi_8") {field(INP, "testAo_8 CP") }
record(ai,"testAi_9") {field(INP, "testAo_9 CP") }
The st.cmd file just loads these 2 databases, it does not initialize any VME or IPAC drivers.
corvette:example/iocBoot/iocexample>more st.cmd
#- Example vxWorks startup file
#- The following is needed if your board support package doesn't at boot time
#- automatically cd to the directory containing its startup script #cd "/home/epics/devel_test/example/iocBoot/iocexample"
< cdCommands
# Mount drives with NFS
nfsAuthUnixSet("corvette", 849601092, 849600513, 0, 0)
nfsMount("corvette","/home","/corvette/home")
nfsMount("corvette","/home","/home")
cd topbin
load("example.munch")
cd startup
dbLoadDatabase "$(TOP)/dbd/example.dbd"
example_registerRecordDeviceDriver pdbbase
## Load record instances
dbLoadRecords "./aoTest.template"
dbLoadRecords "./aiTest.template"
iocInit
When the IOC boots this is the output of dbior and asynReport. This indicates that these drivers are present, but there are no IPAC carrier cards or asyn port drivers initialized.
iocexample> dbior
Driver: drvIpac
Driver: drvAsyn
value = 0 = 0x0
iocexample> asynReport
value = 0 = 0x0
Once the IOC is running I run the following scripts.
This script runs camonitor on all 10 ai records and all 10 ao records.
corvette:example/iocBoot/iocexample>more test_monitor camonitor -tcr \
testAi_0 \
testAi_1 \
testAi_2 \
testAi_3 \
testAi_4 \
testAi_5 \
testAi_6 \
testAi_7 \
testAi_8 \
testAi_9 \
testAo_0 \
testAo_1 \
testAo_2 \
testAo_3 \
testAo_4 \
testAo_5 \
testAo_6 \
testAo_7 \
testAo_8 \
testAo_9 \
I run 2 instances of that script on a Linux machine to cause lots of CA monitor events.
I run this IDL script that writes new values to all 10 ao records in a loop at 100 Hz. That causes the ai records to process since they have CP links. There will be 20 records * 100 Hz = 2000 new record values per second. Since there
are 2 camonitor clients that is 4000 monitor events/s.
corvette:example/iocBoot/iocexample>more test_ao.pro out = -10.
while 1 do begin
t = caput('testAo_0', out)
t = caput('testAo_1', out)
t = caput('testAo_2', out)
t = caput('testAo_3', out)
t = caput('testAo_4', out)
t = caput('testAo_5', out)
t = caput('testAo_6', out)
t = caput('testAo_7', out)
t = caput('testAo_8', out)
t = caput('testAo_9', out)
wait, .01
out = out + 0.1
if (out gt 10.0) then out = -10.0
endwhile
end
If the example application is built with configure/RELEASE as shown above, defining ASYN, IPAC, IP330, and DAC128V, then it fails. In 4 tests it failed after running for [42, 132, 32, 138] seconds respectively.
VME Bus Error accessing A32: 0xbff66664
machine check
Exception next instruction address: 0x0332133c Machine Status Register: 0x0008b032 Condition Register: 0x48000844
Task: 0x34d4950 "CAC-event"
0x34d4950 (CAC-event): task 0x34d4950 has had a failure and has been stopped.
0x34d4950 (CAC-event): The task has been terminated because it triggered an exception that raised the signal 10.
Note that the task in error is CAC-event. In the tests I did previously when using the VME hardware it was failing in CAS-event.
This is the task trace:
iocexample> tt 0x34d4950
0x0012489c vxTaskEntry +0x48 : epicsThreadEntry () 0x0333fe9c epicsThreadEntry+0x80 : 0x0329b884 () 0x0329bb5c db_start_events+0x458: db_delete_field_log () 0x0329b06c db_delete_field_log+0x54 : freeListFree () value = 0 = 0x0
I have found that if I build the application with DAC128V commented out in configure/RELEASE then the application does not crash within the time I tested for. I ran it twice and it ran for [630, 1400] seconds before I killed it.
I cannot explain why building without DAC128V allows the application to run, since the DAC128V driver is never being initialized. It could just be a memory layout artifact if there are uninitialized pointers or something.
When I was doing previous tests with the dac128V and ip330 hardware enabled and running I made some changes to base 7.0.6 and observed the following:
- If I comment out all lines referencing dbCommon.amsg and dbCommon.namsg in dbAccess.c, dbDbLink.c, dbEvent.c, and recGbl.c then it did fail, but in 2 tests it took [4317, 22400] seconds to fail.
- If I also comment out the AMSG and NAMSG fields in dbCommon.dbd.pod then it did not fail in 18000 seconds.
However, with the soft record example described here the application failed in [3, 171] seconds in CAC-event even with all of those changes to base.
I have rebuilt the example application with soft records with base 7.0.5 with the DAC128V enabled. It has currently run for 2500 seconds without failing, while base 7.0.6 always failed within 138 seconds. I will let it run overnight.
Thanks,
Mark
From: Mark Rivers
Sent: Friday, May 27, 2022 3:17 PM
To: Michael Davidsaver <mailto:mdavidsaver at gmail.com>
Cc: mailto:tech-talk at aps.anl.gov
Subject: RE: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules
My recent tests have been done with the following:
- A very "thin" startup script that only loads the ip330 and dac128V drivers and records.
- A "thick" application that is built with most of the synApps modules.
I just stripped it down so the application is only built with base, asyn, ipac, ip330, dac128v. I made fresh git clones of asyn, ipac, ip330, dac128v.
The application Makefile is very simple:
**********************
corvette:CARS/CARSApp/testSrc>more Makefile TOP=../..
include $(TOP)/configure/CONFIG
#----------------------------------------
# ADD MACRO DEFINITIONS AFTER THIS LINE
PROD_NAME = CARSTest
PROD_IOC_vxWorks = $(PROD_NAME)
DBD_NAME = $(PROD_NAME)VX
DBD += $(DBD_NAME).dbd
$(DBD_NAME)_DBD += base.dbd
$(DBD_NAME)_DBD += asyn.dbd
$(PROD_NAME)_LIBS := asyn $($(PROD_NAME)_LIBS)
$(DBD_NAME)_DBD += dac128VSupport.dbd
$(PROD_NAME)_LIBS := dac128V $($(PROD_NAME)_LIBS)
$(DBD_NAME)_DBD += ip330Support.dbd
$(PROD_NAME)_LIBS := ip330 $($(PROD_NAME)_LIBS)
$(DBD_NAME)_DBD += drvIpac.dbd
$(PROD_NAME)_LIBS := Ipac $($(PROD_NAME)_LIBS)
$(PROD_NAME)_LIBS += $(EPICS_BASE_IOC_LIBS)
$(PROD_NAME)_SRCS += $(DBD_NAME)_registerRecordDeviceDriver.cpp
$(PROD_NAME)_OBJS_vxWorks += $(EPICS_BASE_BIN)/vxComLibrary
#===========================
include $(TOP)/configure/RULES
#----------------------------------------
# ADD RULES AFTER THIS LINE
**********************
These are the commands that build the dbd file and link the application.
Creating dbd file CARSTestVX.dbd
perl -CSD /corvette/usr/local/epics-devel/base-7.0.6/bin/linux-x86_64/dbdExpand.pl -I. -I.. -I../O.Common -I../../../dbd -I/corvette/home/epics/devel/asyn-4-42/dbd -I/corvette/home/epics/devel/CARS/dbd -I/corvette/home/epics/devel/dac128V-2-10-1/dbd
-I/corvette/home/epics/devel/ipac-2-16/dbd -I/corvette/home/epics/devel/ip330-2-10/dbd -I/corvette/usr/local/epics-devel/base-7.0.6/dbd -o CARSTestVX.dbd base.dbd asyn.dbd dac128VSupport.dbd ip330Support.dbd drvIpac.dbd Installing created dbd file ../../../dbd/CARSTestVX.dbd
/usr/local/vw/vxWorks-6.9/gnu/4.3.3-vxworks-6.9/x86-linux2/bin/ldppc -r -o CARSTest -L/corvette/home/epics/devel/asyn-4-42/lib/vxWorks-ppc32/ -L/corvette/home/epics/devel/dac128V-2-10-1/lib/vxWorks-ppc32/ -L/corvette/home/epics/devel/ip330-2-10/lib/vxWorks-ppc32/
-L/corvette/home/epics/devel/ipac-2-16/lib/vxWorks-ppc32/ -L/corvette/usr/local/epics-devel/base-7.0.6/lib/vxWorks-ppc32/ /corvette/usr/local/epics-devel/base-7.0.6/bin/vxWorks-ppc32/vxComLibrary.o CARSTestVX_registerRecordDeviceDriver.o
-lIpac -lip330 -ldac128V -lasyn -ldbRecStd -ldbCore –lca
I have a script that is writing new values to DAC channel 0 at 100 Hz. That causes all 16 ADC records to post monitors at about 10 Hz.
I closed the medm screen and stopped camonitor on the ADC channels. I started the IOC and it ran fine for 5 minutes with the script writing to the DAC.
I then opened the medm screen and started the camonitor on the ADC channels. It failed in a few seconds with the A32 address error:
ioc13lab2>
VME Bus Error accessing A32: 0xbff99994
machine check
Exception next instruction address: 0x0329b0e0 Machine Status Register: 0x0008b032 Condition Register: 0x48000884
Task: 0x3560930 "CAS-event"
0x3560930 (CAS-event): task 0x3560930 has had a failure and has been stopped.
0x3560930 (CAS-event): The task has been terminated because it triggered an exception that raised the signal 10.
Mark
-----Original Message-----
From: Michael Davidsaver <mailto:mdavidsaver at gmail.com>
Sent: Friday, May 27, 2022 1:49 PM
To: Mark Rivers <mailto:rivers at cars.uchicago.edu>
Cc: mailto:tech-talk at aps.anl.gov
Subject: Re: Bus errors accessing VME with base 7.0.6.1 and latest synApps modules
> On 5/27/22 06:17, Mark Rivers wrote:
>> Hi Michael,
>>
>> I commented out all lines in recGbl.c that reference the new amsg and namsg fields. That did not fix the problem.
I take this to mean both db_post_events() and epicsVsnprintf() calls, which means that both AMSG and NAMSG should always be an empty string.
(aka. char[40] all nils)
https://github.com/epics-base/epics-base/commit/892a361de7002bf2b3f375f24bc5bf61858de2e5
> On 5/27/22 10:33, Mark Rivers wrote:
>> I am now using the unmodified 7.0.6 release.
...
> Before I build in that directory I run
I don't know how difficult it is with vxWorks @APS, but it might be a good idea to go further and start from fresh git-clone of all modules (base and synapps and application). This should rule out build leftovers entirely.
On 5/27/22 10:33, Mark Rivers wrote:
> What could possibly cause this?
This is a very good question...
As I think about it, adding/removing AMSG+NAMSG would have exposed some partial rebuild problems with Base and parts of synapps. The generated record type headers include a check that the number of fields at load time is the same as
the number at build time.
If not a build/ABI issue, I don't see an easy way forward in troubleshooting.
We've passed the point where I would be trying to setup a (remote) debugger to inspect the stack of a faulting thread. This should identify one of the records involved, and a dump of that record struct may show evidence of corruption.
For that matter, "dbpr <name> 5" could show evidence, if you can find one of the records.