Experimental Physics and Industrial Control System
Hello,
at BESSY we currently have two IOCs which perform GPIB I/O over an HP
E2050A LAN/GPIB gateway, using the EPICS GPIB support (currently R0-3)
and EPICS base 3.13.6.
Both IOCs occasionally hang and need to be rebooted.
The problem is at least one or two years old and was already present
with the old lanGpib2.4 and EPICS R3.13.2. It just shows up more often
now, probably because recent additions to the databases do more GPIB
I/O.
Each of the IOCs controls exactly one gateway and therefore one GPIB
segment. We already stripped them of any other task. The only records
not GPIB related are the ones that monitor the IOC using the devVxStats
device support (part of base).
I am not aware of anyone using the EPICS GPIB support who reported a
similar problem.
When the IOC goes down, we see this:
iocIOC2X250C>
Access Fault
Program Counter: 0x00069a16
Status Register: 0x3000
Access Address : 0x5f3f046f
Special Status : 0x0525
Task: 0xeff324 "cbLow"
filename="../taskwd.c" line number=175
task eff324 cbLow suspended
CAS: request from 192.168.21.99:51810 => "put call back time out"
At this point the IOC hangs so completely that the shell no longer
accepts commands. Interrupting a command with Ctrl-C gives the following:
iocIOC2X250C> memShow
(nothing happens, so I enter Ctrl-C)
7865c _vxTaskEntry +10 : _shell (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
41538 _shell +138: 41556 ([1, 0, 0, 4135c, 0])
416d8 _shell +2d8: _execute (d248de)
417fc _execute +ac : _yyparse ([0, 1, 0, d248de, 0])
45568 _yyparse +16 : _malloc (960)
69476 _malloc +e : _memPartAlloc ([a72ea, 960, d248a0, 4556a, 960])
6910a _memPartAlloc +4a : _semMTake ([a72ea, 960, 4, d24854, 6947a])
tShell restarted.
IMHO, this can only mean that the suspended cbLow task still holds the
mutex that protects the VxWorks memory allocator's internal structures.
A very unfortunate situation: I can see no way to delete the task and
free the semaphore without shell interaction, which in turn needs to
allocate memory and thus waits forever... It seems that every effort to
diagnose the situation post mortem is doomed to fail. I can't even get a
stack trace for the cbLow task!
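To illustrate the situation I suspect, here is a minimal sketch in plain Python (not VxWorks code). The names alloc_lock, faulting_task and shell_command are made up for the illustration: one thread takes a lock and is then never resumed, and a second thread that needs the same lock cannot proceed. On the real IOC semMTake waits forever; the timeout below exists only so that the sketch terminates.

```python
import threading

# Hypothetical stand-in for the mutex guarding the allocator's internals.
alloc_lock = threading.Lock()
holder_has_lock = threading.Event()
never_resumed = threading.Event()  # never set: models a task that stays suspended


def faulting_task():
    """Models cbLow: takes the lock and is 'suspended' before releasing it."""
    alloc_lock.acquire()
    holder_has_lock.set()
    never_resumed.wait()  # blocks forever while still holding the lock


def shell_command(timeout_s=0.2):
    """Models the shell calling malloc: it needs the same lock to proceed."""
    got_it = alloc_lock.acquire(timeout=timeout_s)
    if got_it:
        alloc_lock.release()
    return got_it


# Daemon thread, so the interpreter can still exit at the end of the sketch.
threading.Thread(target=faulting_task, daemon=True).start()
holder_has_lock.wait()

shell_blocked = not shell_command()
print("shell blocked:", shell_blocked)
```

Any command that allocates memory ends up in the position of shell_command here, which would explain why even memShow never returns.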
BTW, although there are no memory leaks in a running IOC, "memShow"
reports a steadily and quickly rising number (~1400 bytes/sec) for the
*cumulative* allocation. This is due to the RPC library calls, where
structures are constantly allocated and freed again to support
conversion between host and network format. AFAIK, this is inherent in
the RPC library interface - memory is never explicitly allocated by the
GPIB support after initialization is done.
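The difference between the two counters can be shown with a small sketch. This is plain Python with a made-up AllocStats class, not the VxWorks memory partition code; the 1400-byte block size and one allocation per simulated second are assumptions chosen only to match the observed rate.

```python
class AllocStats:
    """Hypothetical bookkeeping with a current and a cumulative counter."""

    def __init__(self):
        self.current_bytes = 0     # allocated and not yet freed
        self.cumulative_bytes = 0  # everything ever allocated; never decreases

    def alloc(self, nbytes):
        self.current_bytes += nbytes
        self.cumulative_bytes += nbytes

    def free(self, nbytes):
        self.current_bytes -= nbytes


stats = AllocStats()

# Assumed pattern: one 1400-byte temporary structure per simulated second,
# allocated for a conversion and freed again right afterwards.
for _ in range(1000):
    stats.alloc(1400)
    stats.free(1400)

bytes_per_day = 1400 * 24 * 60 * 60

print("current:", stats.current_bytes)
print("cumulative:", stats.cumulative_bytes)
print("cumulative growth per day at 1400 bytes/sec:", bytes_per_day)
```

The current figure stays at zero while the cumulative one keeps growing, by about 121 MB per day at the observed rate, so the rising number alone does not indicate a leak.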
The first access fault message above hints at the location where the
access fault happened, which is inside the vxWorks routine
memPartInfoGet:
iocIOC2X250C> lkAddr 0x00069a16
0x0006997c _memPartInfoGet text
0x00069a94 _mmu40LibInit text
.....
which is used by devVxStats to compute relative memory consumption.
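For reference, the lookup that lkAddr does here can be sketched as follows. This is plain Python, and the two-entry table holds only the addresses printed above. It assumes a routine extends up to the next listed symbol, which need not hold if code without a global symbol sits in between.

```python
import bisect

# (start address, name) pairs taken from the lkAddr output, sorted by address.
symbols = [
    (0x0006997C, "_memPartInfoGet"),
    (0x00069A94, "_mmu40LibInit"),
]


def containing_symbol(addr, table):
    """Return (name, offset) of the last symbol starting at or below addr."""
    starts = [start for start, _ in table]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None  # address lies below the first known symbol
    start, name = table[i]
    return name, addr - start


name, offset = containing_symbol(0x00069A16, symbols)
print(f"{name} +0x{offset:x}")  # prints "_memPartInfoGet +0x9a"
```

So the faulting program counter is 0x9a bytes into memPartInfoGet, well before the start of the next routine.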
As a test, I disabled the record that does the memory check. (Note that
even though devVxStats is synchronous, the records are I/O interrupt
scanned and are therefore processed by a callback task.)
The IOC has now been running since Jul 31 15:32:42 and has not hung
again. The previous runs lasted 25, 22, and then 4 hours.
I have a bad feeling that, even in the unlikely and very fortunate case
that the IOC keeps running, we have merely cured the symptom, not the
cause :-(
Any suggestions about what might be at the root of our problem, or what
we could do to analyze it further, would be greatly appreciated.
Ben
PS: Somehow this reminds me of the curious memory freelist bug that once
appeared in our control system network: an IOC that did nothing at all
besides loading the kernel (empty startup file) produced an access
fault when memShow was called from the shell. Strangely, this happened
only in our controls network - the identical setup with the *same* CPU
board worked fine in our development network.