Experimental Physics and
Industrial Control System


Subject:	Re: Problems with Dante (XGLab) Driver
From:	Dariush Hampai via Tech-talk <tech-talk at aps.anl.gov>
To:	Mark Rivers <rivers at cars.uchicago.edu>
Cc:	"tech-talk at aps.anl.gov" <tech-talk at aps.anl.gov>
Date:	Thu, 25 Jun 2026 12:36:34 +0200

Hi Mark,
I'm running into a recurring IOC lockup with the Dante driver during long scans, and I managed to capture a full backtrace (thread apply all bt via gdb attached to the running process) right at the moment of the freeze. I believe this is a genuine deadlock in the driver. Here's what I found:
Thread 137 "DANTE1" (the asyn portThread):
#5 Dante::waitReply(callId=244, caller="getFirmware") at dante.cpp:399
#6 Dante::writeInt32(value=0) at dante.cpp:567
#7 asynPortDriver::writeInt32
#8 processCallbackOutput (devAsynInt32.c)
#9 portThread (asynManager.c)
Thread 129 "acquisitionTask":
#2 epicsMutexLock
#3 asynPortDriver::lock() at asynPortDriver.cpp:949
#4 Dante::acquisitionTask() at dante.cpp:1290
What appears to be happening:

1) The dedicated asyn portThread (Thread 137) is executing a getFirmware command via writeInt32(), which internally calls waitReply() to wait for a response from the board (callId=244).
2) That reply never arrives, so waitReply() blocks indefinitely while still holding the asynPortDriver lock, since the port thread typically holds it for the duration of the I/O operation.
3) Meanwhile, acquisitionTask (Thread 129) needs the same lock to process newly acquired data at dante.cpp:1290, and blocks forever waiting for it.
4) Since the portThread is the single thread serializing all asyn requests on that port, the entire IOC becomes unresponsive to any further command (including StopAll/Reset); only a full IOC restart recovers it.
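
The lock pattern described above can be reproduced in isolation. The sketch below is an illustration only and is not taken from dante.cpp: acquisitionGetsLock, portLock and the other names are hypothetical, and a std::timed_mutex stands in for the asynPortDriver lock so that the blocked thread can give up after a bounded wait instead of hanging the demo.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the situation in the backtrace; not the asyn API.
// Returns true if the "acquisition" side obtains the port lock within
// `patience`, false if it is still blocked when `patience` runs out.
bool acquisitionGetsLock(bool replyArrives, std::chrono::milliseconds patience)
{
    assert(patience.count() > 0);

    std::timed_mutex portLock;            // plays the role of the driver lock
    std::mutex replyMutex;
    std::condition_variable replyCv;
    bool replied = false;                 // set when the "board" answers
    bool release = false;                 // only used to end the demo cleanly
    std::atomic<bool> portHasLock{false};

    // "Port thread": takes the lock, then waits for a reply while holding it.
    std::thread port([&] {
        std::lock_guard<std::timed_mutex> hold(portLock);
        portHasLock.store(true);
        std::unique_lock<std::mutex> lk(replyMutex);
        replyCv.wait(lk, [&] { return replied || release; });
    });

    // Make sure the port thread really owns the lock before competing for it.
    while (!portHasLock.load()) std::this_thread::yield();

    if (replyArrives) {
        { std::lock_guard<std::mutex> lk(replyMutex); replied = true; }
        replyCv.notify_one();
    }

    // "Acquisition thread" (here the caller): needs the same lock.
    bool got = portLock.try_lock_for(patience);
    if (got) portLock.unlock();

    // Unblock the port thread so that join() returns in both cases.
    { std::lock_guard<std::mutex> lk(replyMutex); release = true; }
    replyCv.notify_one();
    port.join();
    return got;
}
```

Calling acquisitionGetsLock(false, std::chrono::milliseconds(100)) models the frozen IOC: the reply never comes, the lock is never released, and the second thread cannot proceed. In the real driver there is no bounded try-lock, so that thread waits forever.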

Trigger condition:
This appears to be a race condition: it happened after ~300 acquisitions on one run, then again after only 21 and 33 acquisitions on subsequent runs — at variable scan steps, not a fixed one — which points to a timing-dependent race rather than a resource leak. We suspect the getFirmware command is being triggered by a periodic PV poll from a Phoebus/CS-Studio diagnostic screen running concurrently with the scan, colliding with an in-progress acquisition cycle.

Possible fix directions:
1) Add a timeout to waitReply() so a missing reply from the board doesn't block forever
2) Avoid issuing getFirmware (or other non-acquisition commands) while an acquisition is in progress, or ensure they don't compete for the same lock held by acquisitionTask
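
Fix direction 1 could look roughly like the following. This is a hedged sketch, not a patch: the thread does not show how waitReply() is implemented, so the ReplyWaiter class, postReply() and the assumption that replies are delivered by a callback are all hypothetical. It only shows the general shape of a bounded wait.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical reply mailbox; names and structure are illustrative only.
class ReplyWaiter {
public:
    // Assumed to be called from the callback that delivers the board's answer.
    void postReply(unsigned callId) {
        {
            std::lock_guard<std::mutex> lk(m_);
            lastCallId_ = callId;
            haveReply_ = true;
        }
        cv_.notify_all();
    }

    // Waits at most `timeout` for the reply to `callId`.
    // Returns true if it arrived, false on timeout, so the caller can
    // report an error and return instead of blocking forever.
    bool waitReply(unsigned callId, std::chrono::milliseconds timeout) {
        assert(timeout.count() >= 0);
        std::unique_lock<std::mutex> lk(m_);
        bool ok = cv_.wait_for(lk, timeout, [&] {
            return haveReply_ && lastCallId_ == callId;
        });
        if (ok) haveReply_ = false;   // consume the reply
        return ok;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    bool haveReply_ = false;
    unsigned lastCallId_ = 0;
};
```

With a bounded wait, writeInt32() could treat the timeout as an error and return, which would release the port lock and let acquisitionTask continue. Inside the driver the natural primitive would probably be an EPICS event wait with a timeout rather than the standard-library types used here; that mapping is an assumption.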

Happy to provide the full backtrace file, the exact PollTime settings, channel/trigger configuration (4096 channels, single mode, free-running, no trigger), or any other diagnostic info that would help track this down.
Thanks,
Dariush

P.S.: Thanks also to Claude, which helped me with the debugging.

On 24/06/2026 16:19, Mark Rivers wrote:

All of my testing was done with PollTime=0.01 and I did not have a problem. What type of computer are you using, i.e. how many cores and what clock speed?

When you acquire with PollTime=0.01 and you run "top -H" on the Linux machine, what do you see?

PollTime=0.1 will reduce performance if you are trying to acquire quickly. You could see if 0.02 is sufficient to fix the problem.

Please send a screenshot of the Dante screen in the configuration where you saw the problem.

Mark

From: Dariush Hampai <dariush.hampai at lnf.infn.it>
Sent: Wednesday, June 24, 2026 9:06 AM
To: Mark Rivers <rivers at cars.uchicago.edu>
Cc: Johnson, Andrew N. <anj at anl.gov>; tech-talk at aps.anl.gov <tech-talk at aps.anl.gov>
Subject: Re: Problems with Dante (XGLab) Driver

Hi Mark,

The IOC doesn't exit; it simply remains in the "Acquiring" state (even the StopAll command doesn't work).

I tried increasing PollTime (from 0.01 to 0.1) as Claude suggested, and it apparently works.

Tomorrow I'll start a complete session (so there will be many acquisitions) to see whether the problem remains.

In two days I'll attach a complete report on the threads.

Dariush

************************************

Dr. Dariush Hampai, PhD

INFN - LNF
X-Lab Frascati
Via E. Fermi, 54 (ex 40)
I-00044 Frascati (RM)
Italy

Mail Address:

XLab-Frascati

LNF-INFN

Casella Postale 13

Frascati (RM)

Italy

Room: +39.06.9403.5248
Lab.: +39.06.9403.2286
Mob.: +39.06.9403.8025
Fax.: +39.06.9403.2597

************************************
On 24 Jun 2026, at 15:47, Mark Rivers <rivers at cars.uchicago.edu> wrote:

Hi Dariush,

I am a bit confused. In your original message you said:

When I start a few acquisitions (from a caput command or from Phoebus), the system seems to crash.

That led me to understand that the IOC was crashing, i.e. exiting with an error.

I then asked you to run the IOC under gdb and send me the backtrace when it crashed. You did that, and it showed the error was in the readline library.

But now I am wondering if it really crashed. What did you do to get to the gdb prompt and type "backtrace"? Did you type ^C, i.e. did you force it to stop?

You need to be clear on whether it is "crashing" or "hanging".

To make progress on this you need to do the following:

Tell us exactly what versions of each module you are using (base, asyn, mca, dante).

Send a screenshot of the Dante screen when it crashes so we know how you have configured it.

Run the IOC under gdb and send the complete output from when you first start gdb to when it crashes and you type "backtrace".

Mark

From: Dariush Hampai <dariush.hampai at lnf.infn.it>
Sent: Wednesday, June 24, 2026 4:10 AM
To: Johnson, Andrew N. <anj at anl.gov>; Mark Rivers <rivers at cars.uchicago.edu>; tech-talk at aps.anl.gov <tech-talk at aps.anl.gov>
Subject: Re: Problems with Dante (XGLab) Driver

Hi Mark,
Hi Andrew,
Using Claude, I may have found the problem (but not the solution...). To answer what Claude asked: I use the Dante in 4096-channel mode, not in mapping mode, and without a trigger.

Claude's answer:

There are two key threads:
Thread 17 "acquisitionTask":
#2 epicsThreadSleep
#3 Dante::acquisitionTask (this=0x594560) at ../dante.cpp:1289
Thread 8 "DANTE1":
#2 epicsThreadSleep
#3 Dante::writeInt32 (this=0x594560, value=1) at ../dante.cpp:461
#4 asynPortDriver::writeInt32
#5 processCallbackOutput (devAsynInt32.c)
#6 portThread (asynManager.c)

1) Thread 8 "DANTE1" is the dedicated thread for the asyn port (portThread in asynManager.c) — the one EPICS uses to serialize all read/write requests on that asyn device. It is blocked inside Dante::writeInt32 with value=1, which with very high probability corresponds exactly to the EraseStart command you sent. This function internally calls epicsThreadSleep — so writeInt32 is waiting in a blocking, synchronous manner for something to happen (most likely for the acquisition to signal "done").

2) Thread 17 "acquisitionTask" is the internal thread of the Dante driver that should handle the acquisition loop, also stopped on epicsThreadSleep inside a loop at dante.cpp:1289. This is almost certainly a polling loop that periodically checks the acquisition status (while acquiring → sleep → check status).

The critical point: if acquisitionTask never sees the "acquisition completed" condition — due to a lost event, a race condition, or a status flag not being correctly updated by the firmware/board — it remains stuck in this loop forever. And since writeInt32 (called by the EraseStart command) most likely waits for this very completion before returning, the asyn port thread also remains blocked.

Why the IOC appears "frozen" and no commands work:

This is the most important detail: in asyn, a single thread (portThread) serializes all requests on that port. If that thread is blocked inside writeInt32, no other command can be processed on that device — not even StopAll or Reset, because they end up queued behind the blocked request. This is why the system appears completely frozen and unresponsive to anything until you restart the IOC.

Confirmation from official documentation:

This matches exactly what I found in the Performance page: in mapping mode with 2048 and 1024 channels, above a certain trigger frequency spectra are lost and the EPICS IOC must be restarted. It's very likely you're running into a variant of the same problem: a race condition in the driver where an acquisition completion event is lost, leaving acquisitionTask (and therefore writeInt32/EraseStart) waiting indefinitely.

What you can do:

1) There is no documented clean software reset for this scenario, because the thread that should process any reset command is precisely the one that is blocked. Restarting the IOC remains, according to the documentation itself, the official solution.

2) Preventive mitigation — since the problem seems related to a lost event/state during acquisition, try to:
a) Slightly increase PollTime (e.g., from 0.01 to 0.02-0.05s) to give the firmware more margin to communicate the status
b) If you are in mapping mode with 1024/2048 channels and high trigger frequencies, evaluate whether you can switch to 4096 channels, where (according to the Performance table) the limits are much higher
c) Verify that the Dante board firmware is at the latest version — bugs of this type are often fixed on the firmware side

3) If the problem persists, it's worth reporting it directly to the EPICS community or to Dante/XGLab support. The backtrace you produced is exactly the kind of evidence needed for a bug report, because it precisely identifies the lines (dante.cpp:1289 and dante.cpp:461) where the driver gets stuck.

Dariush

On 23/06/2026 14:51, Dariush Hampai via Tech-talk wrote:
Hi Mark
Hi Andrew,
After carefully reading the output of "(gdb) thread apply all bt" I found a strange thread almost at the end:

Thread 256 (Thread 0x7fff669a2640 (LWP 1412432) "save_restore"):
#0 0x00007ffff58884da in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1 0x00007ffff588acaf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libc.so.6
#2 0x00007ffff6c8578d in epicsEventWaitWithTimeout (pevent=0x7fff78020330, timeout=<optimized out>) at ../osi/os/posix/osdEvent.c:131
#3 0x00007ffff6c86c27 in myReceive (timeout=<optimized out>, size=512, message=0x200, pmsg=0xa724a0) at ../osi/os/default/osdMessageQueue.cpp:369
#4 epicsMessageQueueReceiveWithTimeout (pmsg=0xa724a0, message=message@entry=0x7fff669a1a40, size=size@entry=512, timeout=1) at ../osi/os/default/osdMessageQueue.cpp:404
#5 0x00007ffff73d5796 in save_restore () at ../save_restore.c:1226
#6 0x00007ffff6c82795 in start_routine (arg=0x3384670) at ../osi/os/posix/osdThread.c:442
#7 0x00007ffff588b4f9 in start_thread () from /lib64/libc.so.6
#8 0x00007ffff59106e0 in clone3 () from /lib64/libc.so.6

This is the only place where the "save_restore" thread appears, and the only place where epicsMessageQueueReceiveWithTimeout is called.
Could this be the problem?

Dariush

On 23/06/2026 14:05, Dariush Hampai via Tech-talk wrote:
Hi Mark,
Hi Andrew,
any idea?

Dariush

On 22/06/2026 11:25, Dariush Hampai wrote:
Dear Mark
Dear Andrew,

I reproduced the crash and it seems to be the same.

#0 0x00007ffff5904b92 in pselect () from /lib64/libc.so.6
#1 0x00007ffff6b143bb in rl_getc () from /lib64/libreadline.so.8
#2 0x00007ffff6b13cd1 in rl_read_key () from /lib64/libreadline.so.8
#3 0x00007ffff6af8497 in readline_internal_char () from /lib64/libreadline.so.8
#4 0x00007ffff6b01535 in readline () from /lib64/libreadline.so.8
#5 0x00007ffff6c85cd2 in osdReadline (context=0x444dc0, prompt=0x7ffff6c9c183 "epics> ") at ../osi/os/default/gnuReadline.c:70
#6 epicsReadline (prompt=0x7ffff6c9c183 "epics> ", context=0x444dc0) at ../osi/epicsReadline.c:68
#7 0x00007ffff6c77aea in iocshBody (pathname=<optimized out>, commandLine=0x0, macros=0x0) at ../iocsh/iocsh.cpp:1143
#8 0x000000000040a616 in main (argc=<optimized out>, argv=<optimized out>) at ../mcaDanteAppMain.cpp:20

I attach the output of "thread apply all backtrace".

awaiting your replies...

Dariush

On 19/06/2026 17:13, Johnson, Andrew N. wrote:
Hi Dariush,

Please also include any messages output just before and announcing the crash. Instead of just the gdb command "backtrace", first run "set height 0" to disable the pager and then "thread apply all backtrace", which will produce lots of output that may help Mark diagnose the problem.

- Andrew

--

Complexity comes for free, Simplicity you have to work for.
On 6/19/26, 9:50 AM, "Tech-talk" <tech-talk-bounces at aps.anl.gov> wrote:

What did you do that triggered the crash this time?

Please continue to run the IOC using gdb. Each time it crashes save the output of backtrace. We need to see if it is always crashing in the readline library.

Mark

From: Dariush Hampai <dariush.hampai at lnf.infn.it>
Sent: Friday, June 19, 2026 9:45 AM
To: Mark Rivers <rivers at cars.uchicago.edu>; tech-talk at aps.anl.gov <tech-talk at aps.anl.gov>
Subject: Re: Problems with Dante (XGLab) Driver

Dear Mark,

I don't know if it is the same for all the previous crashes... however the effects are the same...

Dariush

On 19/06/2026 16:42, Mark Rivers wrote:
Hi Dariush,

The gdb backtrace says that the crash is actually in the Linux readline library. Was this crash caused by the same sequence of events as previous crashes you observed?

Mark

From: Dariush Hampai <dariush.hampai at lnf.infn.it>
Sent: Friday, June 19, 2026 9:17 AM
To: Mark Rivers <rivers at cars.uchicago.edu>; tech-talk at aps.anl.gov <tech-talk at aps.anl.gov>
Subject: Re: Problems with Dante (XGLab) Driver

Hi Mark,

> Are you using a Dante1 or a Dante8?
I'm using a Dante8.

> Does this happen every time you start, or just occasionally? If it is occasionally, then how frequently does it happen?
Occasionally, and more often when two records are processed in close succession.

> Are there any error messages on the IOC?
No; however, some records remain in the "acquire" state (such as $(P)$(M).ACQG in the mca window).

> Are you running on Linux or Windows?
Linux (CentOS 9).

> If you are running on Linux then please run the IOC in the GNU debugger. You can do that with the following commands from the iocDante1 directory:
>
> gdb ../../bin/linux-x86_64/mcaDanteApp
> run st.cmd
>
> When it crashes type this command at the debugger prompt:
>
> backtrace

(gdb) backtrace
#0 0x00007ffff5904b92 in pselect () from /lib64/libc.so.6
#1 0x00007ffff6b143bb in rl_getc () from /lib64/libreadline.so.8
#2 0x00007ffff6b13cd1 in rl_read_key () from /lib64/libreadline.so.8
#3 0x00007ffff6af8497 in readline_internal_char () from /lib64/libreadline.so.8
#4 0x00007ffff6b01535 in readline () from /lib64/libreadline.so.8
#5 0x00007ffff6c85cd2 in osdReadline (context=0x444dc0, prompt=0x7ffff6c9c183 "epics> ") at ../osi/os/default/gnuReadline.c:70
#6 epicsReadline (prompt=0x7ffff6c9c183 "epics> ", context=0x444dc0) at ../osi/epicsReadline.c:68
#7 0x00007ffff6c77aea in iocshBody (pathname=<optimized out>, commandLine=0x0, macros=0x0) at ../iocsh/iocsh.cpp:1143
#8 0x000000000040a616 in main (argc=<optimized out>, argv=<optimized out>) at ../mcaDanteAppMain.cpp:20

Thank you in advance
Dariush and Maurizio

On 16/06/2026 18:18, Mark Rivers wrote:

Hi Dariush,

Are you using a Dante1 or a Dante8?

Does this happen every time you start, or just occasionally? If it is occasionally, then how frequently does it happen?

Are there any error messages on the IOC?

Are you running on Linux or Windows?

If you are running on Linux then please run the IOC in the GNU debugger. You can do that with the following commands from the iocDante1 directory:

gdb ../../bin/linux-x86_64/mcaDanteApp

run st.cmd

When it crashes type this command at the debugger prompt:

backtrace

Send me the output.

Mark

From: Dariush Hampai <dariush.hampai at lnf.infn.it>
Sent: Tuesday, June 16, 2026 10:23 AM
To: Mark Rivers <rivers at cars.uchicago.edu>; tech-talk at aps.anl.gov <tech-talk at aps.anl.gov>
Subject: Problems with Dante (XGLab) Driver

Hi Community,
Hi Mark,

I have almost finished integrating the Dante EPICS driver into our
system, but I have a big problem (maybe a bug?).
When I start a few acquisitions (from a caput command or from Phoebus), the
system seems to crash.
Dante:dante2:ElapsedRealTime stops (not at the target value). Moreover, in
Phoebus the status text stays at "Collecting", Acquire Busy stays at
"Acquiring", and the IOC does not respond to any command that I send.
So far, the only solution is to stop the IOC and restart it.
What is the problem?
Is it possible to force a reset/reinitialization of the driver without
stopping and restarting the IOC?

Awaiting your (precious) help,

Dariush
