Hi Mark,
I'm running into a recurring IOC lockup with the Dante driver
during long scans, and I managed to capture a full backtrace
(thread apply all bt via gdb attached to the running process)
right at the moment of the freeze. I believe this is a genuine
deadlock in the driver. Here's what I found:
Thread 137 "DANTE1" (the asyn portThread):
#5 Dante::waitReply(callId=244, caller="getFirmware") at
dante.cpp:399
#6 Dante::writeInt32(value=0) at dante.cpp:567
#7 asynPortDriver::writeInt32
#8 processCallbackOutput (devAsynInt32.c)
#9 portThread (asynManager.c)
Thread 129 "acquisitionTask":
#2 epicsMutexLock
#3 asynPortDriver::lock() at asynPortDriver.cpp:949
#4 Dante::acquisitionTask() at dante.cpp:1290
What appears to be happening:
The dedicated asyn portThread (Thread 137) is executing a
getFirmware command via writeInt32(), which internally calls
waitReply() waiting for a response from the board (callId=244).
That reply never arrives, so waitReply() blocks indefinitely —
while still holding the asynPortDriver lock, since the port thread
typically holds it for the duration of the I/O operation.
Meanwhile, acquisitionTask (Thread 129) needs the same lock to
process newly acquired data at dante.cpp:1290, and blocks forever
waiting for it.
Since the portThread is the single thread serializing all asyn
requests on that port, the entire IOC becomes unresponsive to any
further command (including StopAll/Reset) — only a full IOC
restart recovers it.
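For illustration only, here is a minimal C++ sketch of the lock interaction described above. It is not the Dante driver code: `portLock` is a plain `std::timed_mutex` standing in for the asynPortDriver lock, the "reply" is just a flag, and the function name is made up for this example. The calling thread plays the role of acquisitionTask and gives up after a bounded wait, so the example terminates instead of hanging.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical model, not driver code. Returns true if the "acquisition task"
// managed to take the port lock while the "port thread" was still waiting for
// a reply with that lock held.
bool acquisitionGotLockWhilePortThreadWaits(std::chrono::milliseconds patience) {
    assert(patience.count() > 0);
    std::timed_mutex portLock;            // stands in for the asynPortDriver lock
    std::atomic<bool> lockHeld{false};
    std::atomic<bool> replyArrived{false};

    // "Port thread": takes the lock, then polls for a reply that does not come.
    std::thread portThread([&] {
        std::lock_guard<std::timed_mutex> guard(portLock);
        lockHeld = true;
        while (!replyArrived) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    });

    // Make sure the port thread owns the lock before going on.
    while (!lockHeld) {
        std::this_thread::yield();
    }

    // "Acquisition task" (the calling thread): needs the same lock. The real
    // task would block forever; here it gives up after `patience`.
    bool got = portLock.try_lock_for(patience);
    if (got) {
        portLock.unlock();
    }

    replyArrived = true;   // end the demonstration so the thread can be joined
    portThread.join();
    return got;
}
```

Calling `acquisitionGotLockWhilePortThreadWaits(std::chrono::milliseconds(50))` returns false: as long as the waiting thread keeps the lock, the other thread cannot make progress.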
Trigger condition:
This appears to be a race condition: it happened after ~300
acquisitions on one run, then again after only 21 and 33
acquisitions on subsequent runs — at variable scan steps, not a
fixed one — which points to a timing-dependent race rather than a
resource leak. We suspect the getFirmware command is being
triggered by a periodic PV poll from a Phoebus/CS-Studio
diagnostic screen running concurrently with the scan, colliding
with an in-progress acquisition cycle.
Possible fix directions:
1) Add a timeout to waitReply() so a missing reply from the board
doesn't block forever
2) Avoid issuing getFirmware (or other non-acquisition commands)
while an acquisition is in progress, or ensure they don't compete
for the same lock held by acquisitionTask
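To make fix direction 1 concrete, here is a rough C++ sketch of a reply wait with a timeout, using only the standard library. The class `ReplyWaiter` and the method `postReply()` are hypothetical names for this example, and this `waitReply()` does not have the signature of the real Dante::waitReply(); I have not checked how the driver tracks replies internally. The asynPortDriver lock is not modelled here.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <set>

// Hypothetical stand-in for the driver's reply bookkeeping; not the real code.
class ReplyWaiter {
public:
    // Would be called from the reply callback when the board answers callId.
    void postReply(unsigned callId) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            arrived_.insert(callId);
        }
        cond_.notify_all();
    }

    // Waits until the reply for callId arrives or the timeout expires.
    // Returns true if the reply arrived, false on timeout.
    bool waitReply(unsigned callId, std::chrono::milliseconds timeout) {
        assert(timeout.count() >= 0);
        std::unique_lock<std::mutex> guard(mutex_);
        bool ok = cond_.wait_for(guard, timeout, [&] {
            return arrived_.count(callId) != 0;
        });
        if (ok) {
            arrived_.erase(callId);   // consume the reply
        }
        return ok;
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    std::set<unsigned> arrived_;      // call IDs whose replies have arrived
};
```

With this shape, a caller such as the getFirmware path could report an error when `waitReply()` returns false, instead of blocking the port thread forever. How the driver should recover after such a timeout is a separate question that this sketch does not answer.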
Happy to provide the full backtrace file, the exact PollTime
settings, channel/trigger configuration (4096 channels, single
mode, free-running, no trigger), or any other diagnostic info that
would help track this down.
Thanks,
Dariush
P.S. Thanks also to Claude, which helped me with the debugging.
On 24/06/2026 16:19, Mark Rivers wrote:
All of my testing was done with PollTime=0.01 and I did not have
a problem. What type of computer are you using, i.e. how many
cores and what clock speed?
When you acquire with PollTime=0.01 and you run "top -H" on the
Linux machine, what do you see?
PollTime=0.1 will reduce performance if you are trying to
acquire quickly. You could see if 0.02 is sufficient to fix the
problem.
Please send a screenshot of the Dante screen in the
configuration where you saw the problem.
Mark
Hi Mark,
the IOC doesn't exit; it simply stays in the "Acquiring" state and
never leaves it (even the StopAll command doesn't work).
I tried increasing PollTime (from 0.01 to 0.1), as Claude
suggested, and it apparently works.
Tomorrow I'll start a complete session (with many acquisitions)
to see whether the problem remains.
Two days ago I attached a complete report on the threads…
Dariush
************************************
Dr. Dariush Hampai, PhD
INFN - LNF
X-Lab Frascati
Via E. Fermi, 54 (ex 40)
I-00044 Frascati (RM)
Italy
Mail Address:
XLab-Frascati
LNF-INFN
Casella Postale 13
Frascati (RM)
Italy
Room: +39.06.9403.5248
Lab.: +39.06.9403.2286
Mob.: +39.06.9403.8025
Fax.: +39.06.9403.2597
************************************
Hi Dariush,
I am a bit confused. In your original message you said:
That led me to understand that the IOC was crashing, i.e.
exiting with an error.
I then asked you to run the IOC under gdb and send me the
backtrace when it crashed. You did that, and it showed the
error was in the readline library.
But now I am wondering if it really crashed. What did you do
to get to the gdb prompt and type "backtrace"? Did you type
^C, i.e. did you force it to crash?
You need to be clear on whether it is "crashing" or "hanging".
To make progress on this you need to do the following:
- Tell us exactly what versions of each module you are using
  (base, asyn, mca, dante).
- Send a screenshot of the Dante screen when it crashes so we
  know how you have configured it.
- Run the IOC under gdb and send the complete output from when
  you first start gdb to when it crashes and you type "backtrace".
Mark
Hi Mark,
Hi Andrew,
Using Claude, I may have found the problem (but not the
solution...). In answer to what Claude asked: I use the Dante in
4096-channel mode, not in mapping mode, and without a trigger.
Claude's answer to me:
There are two key threads:
Thread 17 "acquisitionTask":
#2 epicsThreadSleep
#3 Dante::acquisitionTask (this=0x594560) at
../dante.cpp:1289
Thread 8 "DANTE1":
#2 epicsThreadSleep
#3 Dante::writeInt32 (this=0x594560, value=1) at
../dante.cpp:461
#4 asynPortDriver::writeInt32
#5 processCallbackOutput (devAsynInt32.c)
#6 portThread (asynManager.c)
1) Thread 8 "DANTE1" is the dedicated thread for the asyn port
(portThread in asynManager.c) — the one EPICS uses to
serialize all read/write requests on that asyn device. It is
blocked inside Dante::writeInt32 with value=1, which with very
high probability corresponds exactly to the EraseStart command
you sent. This function internally calls epicsThreadSleep — so
writeInt32 is waiting in a blocking, synchronous manner for
something to happen (most likely for the acquisition to signal
"done").
2) Thread 17 "acquisitionTask" is the internal thread of the
Dante driver that should handle the acquisition loop, also
stopped on epicsThreadSleep inside a loop at dante.cpp:1289.
This is almost certainly a polling loop that periodically
checks the acquisition status (while acquiring → sleep → check
status).
The critical point: if acquisitionTask never sees the
"acquisition completed" condition — due to a lost event, a
race condition, or a status flag not being correctly updated
by the firmware/board — it remains stuck in this loop forever.
And since writeInt32 (called by the EraseStart command) most
likely waits for this very completion before returning, the
asyn port thread also remains blocked.
Why the IOC appears "frozen" and no commands work:
This is the most important detail: in asyn, a single thread
(portThread) serializes all requests on that port. If that
thread is blocked inside writeInt32, no other command can be
processed on that device — not even StopAll or Reset, because
they end up queued behind the blocked request. This is why the
system appears completely frozen and unresponsive to anything
until you restart the IOC.
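As a rough illustration of that serialization (a toy model only, not the asynManager implementation), the following C++ sketch has one worker thread draining a FIFO queue of requests. `SerialPort`, `queueRequest()` and `stopStuckBehindBlockedRequest()` are hypothetical names for this example. The first request blocks until it is released, and the second request, standing in for StopAll, cannot run until the first one returns.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical model of a port served by a single worker thread.
class SerialPort {
public:
    SerialPort() : worker_([this] { run(); }) {}
    ~SerialPort() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            stopping_ = true;
        }
        cond_.notify_all();
        worker_.join();              // waits for queued requests to finish
    }
    void queueRequest(std::function<void()> job) {
        assert(job);                 // an empty request makes no sense
        {
            std::lock_guard<std::mutex> guard(mutex_);
            jobs_.push_back(std::move(job));
        }
        cond_.notify_all();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> guard(mutex_);
                cond_.wait(guard, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;   // stopping and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop_front();
            }
            job();   // a request that never returns stalls everything behind it
        }
    }
    std::mutex mutex_;
    std::condition_variable cond_;
    std::deque<std::function<void()>> jobs_;
    bool stopping_ = false;
    std::thread worker_;   // declared last so the members above exist first
};

// Returns true if the "StopAll" request stayed unprocessed while an earlier
// request was blocked, and was processed only after that request returned.
bool stopStuckBehindBlockedRequest() {
    std::atomic<bool> release{false};
    std::atomic<bool> stopProcessed{false};
    bool stuck = false;
    {
        SerialPort port;
        port.queueRequest([&] {      // blocked request (stands in for writeInt32)
            while (!release) std::this_thread::sleep_for(std::chrono::milliseconds(1));
        });
        port.queueRequest([&] { stopProcessed = true; });   // "StopAll"
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        stuck = !stopProcessed;
        release = true;              // unblock so the destructor can join
    }
    return stuck && stopProcessed;
}
```

In this model `stopStuckBehindBlockedRequest()` returns true. In the real IOC nothing plays the role of `release`, which would explain why only a restart helps.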
Confirmation from official documentation:
This matches exactly what I found in the Performance page: in
mapping mode with 2048 and 1024 channels, above a certain
trigger frequency spectra are lost and the EPICS IOC must be
restarted. It's very likely you're running into a variant of
the same problem — a race condition in the driver where an
acquisition completion event is lost, leaving acquisitionTask
(and therefore writeInt32/EraseStart) waiting indefinitely.
What you can do:
1) There is no documented clean software reset for this
scenario, because the thread that should process any reset
command is precisely the one that is blocked. Restarting the
IOC remains, according to the documentation itself, the
official solution.
2) Preventive mitigation — since the problem seems related to
a lost event/state during acquisition, try to:
a) Slightly increase PollTime (e.g., from 0.01 to
0.02-0.05s) to give the firmware more margin to communicate
the status
b) If you are in mapping mode with 1024/2048 channels and
high trigger frequencies, evaluate whether you can switch to
4096 channels, where (according to the Performance table) the
limits are much higher
c) Verify that the Dante board firmware is at the latest
version — bugs of this type are often fixed on the firmware
side
3) If the problem persists, it's worth reporting it directly
to the EPICS community or to Dante/XGLab support: the backtrace
you produced is exactly the kind of evidence needed for a bug
report, because it precisely identifies the lines
(dante.cpp:1289 and dante.cpp:461) where the driver gets
stuck.
Dariush
On 23/06/2026 14:51, Dariush Hampai via Tech-talk wrote:
Hi Mark,
Hi Andrew,
after carefully reading the output of "(gdb) thread apply all bt",
I found a strange thread almost at the end...
Thread 256 (Thread 0x7fff669a2640 (LWP 1412432)
"save_restore"):
#0 0x00007ffff58884da in __futex_abstimed_wait_common ()
from /lib64/libc.so.6
#1 0x00007ffff588acaf in
pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libc.so.6
#2 0x00007ffff6c8578d in epicsEventWaitWithTimeout
(pevent=0x7fff78020330, timeout=<optimized out>) at
../osi/os/posix/osdEvent.c:131
#3 0x00007ffff6c86c27 in myReceive (timeout=<optimized
out>, size=512, message=0x200, pmsg=0xa724a0) at
../osi/os/default/osdMessageQueue.cpp:369
#4 epicsMessageQueueReceiveWithTimeout (pmsg=0xa724a0,
message=message@entry=0x7fff669a1a40, size=size@entry=512,
timeout=1) at ../osi/os/default/osdMessageQueue.cpp:404
#5 0x00007ffff73d5796 in save_restore () at
../save_restore.c:1226
#6 0x00007ffff6c82795 in start_routine (arg=0x3384670) at
../osi/os/posix/osdThread.c:442
#7 0x00007ffff588b4f9 in start_thread () from /lib64/libc.so.6
#8 0x00007ffff59106e0 in clone3 () from /lib64/libc.so.6
This is the only place where the "save_restore" thread appears,
and the only place where "epicsMessageQueueReceiveWithTimeout"
is called. Could this be the problem?
Dariush
On 23/06/2026 14:05, Dariush Hampai via Tech-talk wrote:
Hi Mark,
Hi Andrew,
any ideas?
Dariush
On 22/06/2026 11:25, Dariush Hampai wrote:
Dear Mark,
Dear Andrew,
I reproduced the crash and it seems to be the same.
#0 0x00007ffff5904b92 in pselect () from /lib64/libc.so.6
#1 0x00007ffff6b143bb in rl_getc () from /lib64/libreadline.so.8
#2 0x00007ffff6b13cd1 in rl_read_key () from /lib64/libreadline.so.8
#3 0x00007ffff6af8497 in readline_internal_char ()
from /lib64/libreadline.so.8
#4 0x00007ffff6b01535 in readline () from /lib64/libreadline.so.8
#5 0x00007ffff6c85cd2 in osdReadline
(context=0x444dc0, prompt=0x7ffff6c9c183 "epics> ")
at ../osi/os/default/gnuReadline.c:70
#6 epicsReadline (prompt=0x7ffff6c9c183 "epics> ",
context=0x444dc0) at ../osi/epicsReadline.c:68
#7 0x00007ffff6c77aea in iocshBody
(pathname=<optimized out>, commandLine=0x0,
macros=0x0) at ../iocsh/iocsh.cpp:1143
#8 0x000000000040a616 in main (argc=<optimized
out>, argv=<optimized out>) at
../mcaDanteAppMain.cpp:20
I am attaching the output of "thread apply all backtrace".
Awaiting your replies...
Dariush
On 19/06/2026 17:13, Johnson, Andrew N. wrote:
Hi Dariush,
Please also include any messages output just before
and announcing the crash. Instead of just the gdb
command "backtrace", first run "set height 0" to disable
the pager and then "thread apply all backtrace", which will
produce lots of output that may help Mark diagnose the
problem.
- Andrew
--
Complexity comes for free, Simplicity you have to
work for.
What did you do that triggered the crash this time?
Please continue to run the IOC using gdb. Each time
it crashes save the output of backtrace. We need to
see if it is always crashing in the readline
library.
Mark
Dear Mark,
I don't know if it is the same for all the previous
crashes... however the effects are the same...
Dariush
On 19/06/2026 16:42, Mark Rivers wrote:
Hi Dariush,
The gdb backtrace says that the crash is actually
in the Linux readline library. Was this crash
caused by the same sequence of events as previous
crashes you observed?
Mark
Hi Mark,
Are you using a Dante1 or a Dante8?
I'm using Dante8
Does this happen every time you start, or just
occasionally. If it is occasionally, then how
frequently does it happen?
Occasionally; more often when two records are
processed in quick succession.
Are there any error messages on the IOC?
No; however, some records remain in the "Acquire" state (such as
$(P)$(M).ACQG in the mca window).
Are you running on Linux or Windows?
Linux (Centos 9)
If you are running on Linux then please run the
IOC in the GNU debugger. You can do that with the
following commands from the iocDante1 directory:
gdb ../../bin/linux-x86_64/mcaDanteApp
run st.cmd
When it crashes type this command at the debugger
prompt:
backtrace
(gdb) backtrace
#0 0x00007ffff5904b92 in pselect () from /lib64/libc.so.6
#1 0x00007ffff6b143bb in rl_getc () from /lib64/libreadline.so.8
#2 0x00007ffff6b13cd1 in rl_read_key () from /lib64/libreadline.so.8
#3 0x00007ffff6af8497 in readline_internal_char ()
from /lib64/libreadline.so.8
#4 0x00007ffff6b01535 in readline () from /lib64/libreadline.so.8
#5 0x00007ffff6c85cd2 in osdReadline
(context=0x444dc0, prompt=0x7ffff6c9c183
"epics> ") at
../osi/os/default/gnuReadline.c:70
#6 epicsReadline (prompt=0x7ffff6c9c183
"epics> ", context=0x444dc0) at
../osi/epicsReadline.c:68
#7 0x00007ffff6c77aea in iocshBody
(pathname=<optimized out>, commandLine=0x0,
macros=0x0) at ../iocsh/iocsh.cpp:1143
#8 0x000000000040a616 in main (argc=<optimized
out>, argv=<optimized out>) at
../mcaDanteAppMain.cpp:20
Thank you in advance
Dariush and Maurizio
On 16/06/2026 18:18, Mark Rivers wrote:
Hi Dariush,
Are you using a Dante1 or a Dante8?
Does this happen every time you start, or just
occasionally. If it is occasionally, then how
frequently does it happen?
Are there any error messages on the IOC?
Are you running on Linux or Windows?
If you are running on Linux then please run the
IOC in the GNU debugger. You can do that with
the following commands from the iocDante1
directory:
gdb ../../bin/linux-x86_64/mcaDanteApp
run st.cmd
When it crashes type this command at the
debugger prompt:
backtrace
Send me the output.
Mark
Hi Community,
Hi Mark,
I have almost finished implementing the Dante EPICS driver in our
system, but I have a big problem (maybe a bug?).
When I start a few acquisitions (from the caput command or from
Phoebus), the system seems to crash.
Dante:dante2:ElapsedRealTime stops (before reaching the target).
Moreover, in Phoebus the status text stays at "Collecting",
Acquire Busy stays at "Acquiring", and the IOC does not respond to
any command that I send.
Up to now, the only solution is to stop the IOC and restart it.
What is the problem?
Is it possible to force a reset/reinitialization of the driver
without stopping and restarting the IOC?
Awaiting your (precious) help,
Dariush