|
Hi Evan,
Does this IOC load any other EPICS drivers or device support at all (other than the standard ones that come with Base)? Your configuration with multiple callback threads is unusual, and a lot of device/driver support might not
have been written or tested with the thought that there could more than one such callback in mind, so any such support code might be contributing to this issue.
- Andrew
Hi Michael,
Please see:
https://gist.github.com/daykin/6ab54c10e46d119785a89f3cf88d424c (debugger output)
Driver code is here:
https://github.com/frib-bim/caen-blm-software
-----Original Message-----
From: Michael Davidsaver < [email protected]>
Sent: Thursday, August 5, 2021 11:14 AM
To: Daykin, Evan < [email protected]>
Cc: [email protected]; Mark Rivers < [email protected]>; Johnson, Andrew N. < [email protected]>
Subject: Re: cbLow consumption randomly stopping
[EXTERNAL] This email originated from outside of FRIB
Even,
... Should I file a bug report in base? ...
This is almost certainly a driver bug, with nothing to fix in Base.
... If so, is there any other info you'd like me to add before submitting?
The output of "thread apply all backtrace" from GDB when an IOC is "stuck".
On 8/4/21 12:33 PM, Daykin, Evan wrote:
Hello again,
Just reviving this thread as I was able to do some more digging, but still no solution. Some more data points:
The PICO module is not stopping. debugPICO PICO<n> 5 spams output like always.
In bugged state, cbLow threads are stopped in FUTEX_WAIT_PRIVATE for some reason, but I'm having trouble figuring out why. If I run strace -fp <pid> and search for the LWP ID's associated with cbLow-1...cbLow-8, I get exactly
one hit for each cbLow thread:
I don't think that 'strace' is going to help you here. All this is says is that 'cbLow' is
waiting on some condition (epicsMutex or epicsEvent). This happens normally all the time.
'strace' doesn't provide enough context to distinguish normal vs. erroneous operation.
It might be better for you to focus on trying to investigate how to trigger the issue
on demand. Personally, I find these sort of issues much easier to troubleshoot when I don't
have to wait hours or days between iterations.
1669 futex(0x55a511f78924, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
OTOH, during normal operation, I see what I'd expect:
21864 futex(0x55cc299855f0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
...
21864 <... futex resumed> )
21864 futex(0x55cc299855f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
Andrew, I saw you'd mentioned it could be in the callback subsystem. Should I file a bug report in base? If so, is there any other info you'd like me to add before submitting?
Have a nice day,
Evan
-----Original Message-----
From: Daykin, Evan
Sent: Tuesday, June 15, 2021 5:30 PM
To: Michael Davidsaver <[email protected]>; Mark Rivers <[email protected]>
Cc: [email protected]
Subject: RE: cbLow consumption randomly stopping
The issue just happened again on another device. So as promised, here is the github mirror:
https://github.com/frib-bim/caen-blm-software
And some more debug information - epicsMutexShowAll 1, info threads, thread apply all bt:
https://gist.github.com/daykin/6ab54c10e46d119785a89f3cf88d424c
Cheers,
Evan
-----Original Message-----
From: Michael Davidsaver <[email protected]>
Sent: Tuesday, June 8, 2021 3:31 PM
To: Daykin, Evan <[email protected]>; Mark Rivers <[email protected]>
Cc: [email protected]
Subject: Re: cbLow consumption randomly stopping
[EXTERNAL] This email originated from outside of FRIB
On 6/8/21 12:19 PM, Daykin, Evan wrote:
, if you think there is a deadlock, I would suggest attaching GDB to a running IOC process and capture all the stack.
So should I run `bt full` on the cbLow thread?
I should have written "... all of the stacks". So all threads, or "thread apply all backtrace".
-----Original Message-----
From: Michael Davidsaver <[email protected]>
Sent: Tuesday, June 8, 2021 2:20 PM
To: Daykin, Evan <[email protected]>; Mark Rivers <[email protected]>
Cc: [email protected]
Subject: Re: cbLow consumption randomly stopping
[EXTERNAL] This email originated from outside of FRIB
On 6/8/21 11:03 AM, Daykin, Evan via Tech-talk wrote:
...
Is the driver or device support using asyn? If so, what version?
No, it seems to be using epicspci/devlib2 to communicate with hardware.
I think this driver was one that I did back in 2017.
@Andrew, Mark hardware access is partly through a Linux driver [1], and partly
through MMIO for a custom register range. The Linux driver is public, but the
EPICS driver was FRIB internal. (it would be nice if this could be public as well)
@Evan, if you think there is a deadlock, I would suggest attaching GDB to a running
IOC process and capture all the stack.
Looking back, I also added a per-device logging mask (a la asyn). cf. "debugPico()"
at the IOC shell.
All cbLow threads show same backtrace.
These threads are expected to be in epicsEventWait() when idling.
So this by itself does not indicate a deadlock.
[1] https://github.com/CAENels/amc-pico8-driver and/or https://github.com/mdavidsaver/amc-pico8-driver
@Andrew, here is some output from attaching gdb:
(gdb) info threads
Id Target Id Frame
1 Thread 0x7fd5a60a5c80 (LWP 22198) "st.cmd" 0x00007fd5a9d62151 in __pselect (nfds=1, readfds=0x7fff0eb98db0, writefds=0x0, exceptfds=0x0, timeout=<optimized out>, sigmask=<optimized out>) at ../sysdeps/unix/sysv/linux/pselect.c:69
2 Thread 0x7fd5a5899700 (LWP 22200) "errlog" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d2538054) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
3 Thread 0x7fd5a5818700 (LWP 22202) "Reader" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d2612600) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
4 Thread 0x7fd5a5797700 (LWP 22203) "capture" __GI___libc_read (nbytes=176, buf=0x7fd598000b20, fd=9) at ../sysdeps/unix/sysv/linux/read.c:26
5 Thread 0x7fd5a5716700 (LWP 22204) "Reader" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d26144d0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
6 Thread 0x7fd5a5695700 (LWP 22205) "capture" __GI___libc_read (nbytes=176, buf=0x7fd59c000b20, fd=13) at ../sysdeps/unix/sysv/linux/read.c:26
7 Thread 0x7fd5a5614700 (LWP 22206) "Reader" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d26163c0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
8 Thread 0x7fd5a5593700 (LWP 22207) "capture" __GI___libc_read (nbytes=176, buf=0x7fd590000b20, fd=17) at ../sysdeps/unix/sysv/linux/read.c:26
9 Thread 0x7fd5a5512700 (LWP 22208) "Reader" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d2618350) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
10 Thread 0x7fd5a5491700 (LWP 22209) "capture" __GI___libc_read (nbytes=176, buf=0x7fd594000b20, fd=21) at ../sysdeps/unix/sysv/linux/read.c:26
11 Thread 0x7fd5a5410700 (LWP 22217) "taskwd" futex_abstimed_wait_cancelable (private=0, abstime=0x7fd5a540fcf0, expected=0, futex_word=0x55c1d2e56e00) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
12 Thread 0x7fd5a538f700 (LWP 22218) "timerQueue" futex_abstimed_wait_cancelable (private=0, abstime=0x7fd5a538ecc0, expected=0, futex_word=0x55c1d2619b74) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
* 13 Thread 0x7fd5a609a700 (LWP 22219) "cbLow-0" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d321ba40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
14 Thread 0x7fd5a528e700 (LWP 22220) "cbLow-1" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d321ba40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
15 Thread 0x7fd5a508d700 (LWP 22221) "cbLow-2" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d321ba40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
16 Thread 0x7fd5a4e8c700 (LWP 22222) "cbLow-3" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d321ba40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
17 Thread 0x7fd5a4c8b700 (LWP 22223) "cbLow-4" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d321ba40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
18 Thread 0x7fd5a4a8a700 (LWP 22224) "cbLow-5" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d321ba40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
19 Thread 0x7fd5a4889700 (LWP 22225) "cbLow-6" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d321ba40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
20 Thread 0x7fd5a4688700 (LWP 22226) "cbLow-7" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d321ba40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
21 Thread 0x7fd5a4487700 (LWP 22227) "cbMedium" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d31ed8c0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
22 Thread 0x7fd5a4286700 (LWP 22228) "cbHigh" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d31edac0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
23 Thread 0x7fd567fff700 (LWP 22229) "dbCaLink" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d31edeb4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
24 Thread 0x7fd567dfe700 (LWP 22230) "timerQueue" futex_abstimed_wait_cancelable (private=0, abstime=0x7fd567dfdcc0, expected=0, futex_word=0x55c1d31eef44) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
25 Thread 0x7fd55095c700 (LWP 22233) "scanOnce" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d3883b80) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
26 Thread 0x7fd55075b700 (LWP 22234) "scan-10" futex_abstimed_wait_cancelable (private=0, abstime=0x7fd55075acd0, expected=0, futex_word=0x55c1d3883360) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
27 Thread 0x7fd55055a700 (LWP 22235) "scan-5" futex_abstimed_wait_cancelable (private=0, abstime=0x7fd550559cd0, expected=0, futex_word=0x55c1d38834a0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
28 Thread 0x7fd550359700 (LWP 22236) "scan-2" futex_abstimed_wait_cancelable (private=0, abstime=0x7fd550358cd0, expected=0, futex_word=0x55c1d38835e0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
29 Thread 0x7fd543fff700 (LWP 22237) "scan-1" futex_abstimed_wait_cancelable (private=0, abstime=0x7fd543ffecd0, expected=0, futex_word=0x55c1d3883720) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
30 Thread 0x7fd543dfe700 (LWP 22238) "scan-0.5" futex_abstimed_wait_cancelable (private=0, abstime=0x7fd543dfdcd0, expected=0, futex_word=0x55c1d3883860) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
31 Thread 0x7fd543bfd700 (LWP 22239) "scan-0.2" futex_abstimed_wait_cancelable (private=0, abstime=0x7fd543bfccd0, expected=0, futex_word=0x55c1d38839a0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
32 Thread 0x7fd5439fc700 (LWP 22240) "scan-0.1" futex_abstimed_wait_cancelable (private=0, abstime=0x7fd5439fbcd0, expected=0, futex_word=0x55c1d3883ae0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
33 Thread 0x7fd5437fb700 (LWP 22241) "asCaTask" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d3927c70) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
34 Thread 0x7fd5435fa700 (LWP 22242) "ipToAsciiProxy" futex_wait_cancelable (private=0, expected=0, futex_word=0x7fd52c00fa34) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
35 Thread 0x7fd56418c700 (LWP 22243) "timerQueue" futex_abstimed_wait_cancelable (private=0, abstime=0x7fd56418bcc0, expected=0, futex_word=0x7fd52c010224) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
36 Thread 0x7fd550158700 (LWP 22244) "CAC-UDP" 0x00007fd5a9d6b742 in __libc_recvfrom (fd=26, buf=0x7fd52c0127d8, len=65551, flags=0, addr=..., addrlen=0x7fd550157ca8) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
37 Thread 0x7fd5433f9700 (LWP 22245) "CAC-TCP-recv" __libc_recv (flags=0, len=16384, buf=0x7fd54307b04c, fd=28) at ../sysdeps/unix/sysv/linux/recv.c:28
38 Thread 0x7fd5431f8700 (LWP 22246) "CAC-TCP-send" futex_wait_cancelable (private=0, expected=0, futex_word=0x7fd52000e830) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
39 Thread 0x7fd543076700 (LWP 22256) "CAS-TCP" 0x00007fd5a9d6b437 in __libc_accept (fd=29, addr=..., len=0x7fd543075cec) at ../sysdeps/unix/sysv/linux/accept.c:26
40 Thread 0x7fd542f75700 (LWP 22257) "CAS-UDP" 0x00007fd5a9d6b742 in __libc_recvfrom (fd=30, buf=0x7fd50c001a50, len=65551, flags=0, addr=..., addrlen=0x7fd542f74ce8) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
41 Thread 0x7fd542e74700 (LWP 22258) "CAS-beacon" 0x00007fd5a9d37720 in __GI___nanosleep (requested_time=0x7fd542e73c50, remaining=0x7fd542e73c60) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
42 Thread 0x7fd542df3700 (LWP 22259) "CAC-event" futex_wait_cancelable (private=0, expected=0, futex_word=0x7fd56008ccd0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
43 Thread 0x7fd542cf2700 (LWP 22260) "logRestart" futex_abstimed_wait_cancelable (private=0, abstime=0x7fd542cf1d30, expected=0, futex_word=0x55c1d39310d0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
44 Thread 0x7fd542c71700 (LWP 22261) "caPutLog" futex_wait_cancelable (private=0, expected=0, futex_word=0x7fd4f8000b90) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
45 Thread 0x7fd542bf0700 (LWP 22262) "logRestart" futex_abstimed_wait_cancelable (private=0, abstime=0x7fd542befd30, expected=0, futex_word=0x55c1d393f270) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
46 Thread 0x7fd542b6f700 (LWP 22263) "save_restore" futex_abstimed_wait_cancelable (private=0, abstime=0x7fd542b6e880, expected=0, futex_word=0x7fd4fc0204e0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
47 Thread 0x7fd54296e700 (LWP 22264) "reccaster-etcd" 0x00007fd5a9d37720 in __GI___nanosleep (requested_time=0x7fd54296dc40, remaining=0x7fd54296dc50) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
48 Thread 0x7fd5428ed700 (LWP 22265) "recEtcdConsumer" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d61a30d0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
49 Thread 0x7fd54286c700 (LWP 22266) "recEtcdConsumer" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d61a3380) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
50 Thread 0x7fd5427eb700 (LWP 22267) "recEtcdConsumer" futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d61a3630) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
-----
(gdb) thread 13
[Switching to thread 13 (Thread 0x7fd5a609a700 (LWP 22219))]
#0 futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d321ba40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88
../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
------------
(gdb) bt
#0 futex_wait_cancelable (private=0, expected=0, futex_word=0x55c1d321ba40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x55c1d321b9f0, cond=0x55c1d321ba18) at pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=0x55c1d321ba18, mutex=0x55c1d321b9f0) at pthread_cond_wait.c:655
#3 0x00007fd5a9f3051b in epicsEventWait () from /lib/x86_64-linux-gnu/libCom.so.3.15.8
#4 0x00007fd5a9f28c99 in epicsEventMustWait () from /lib/x86_64-linux-gnu/libCom.so.3.15.8
#5 0x00007fd5a9f9c251 in ?? () from /lib/x86_64-linux-gnu/libdbCore.so.3.15.8
#6 0x00007fd5a9f2da7b in ?? () from /lib/x86_64-linux-gnu/libCom.so.3.15.8
#7 0x00007fd5a98e9fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#8 0x00007fd5a9d6a4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
All cbLow threads show same backtrace.
Thanks for the help,
Evan
-----Original Message-----
From: Mark Rivers <[email protected]>
Sent: Tuesday, June 8, 2021 1:27 PM
To: Daykin, Evan <[email protected]>
Cc: [email protected]
Subject: Re: cbLow consumption randomly stopping
[EXTERNAL] This email originated from outside of FRIB One possibility is that the thread the pulls the message out of the queue has locked up for some reason.One possibility is a deadlock. When this happens you could try typingepicsMutexShowAll
1several times. If the same mutexes are locked each time you type that, then there is likely a deadlock.What OS is this?What version of EPICS base?Is the driver or device support using asyn? If so, what version?Mark________________________________From: Tech-talk
<[email protected]> on behalf of Daykin, Evan via Tech-talk <[email protected]>Sent: Tuesday, June 8, 2021 12:10 PMTo: EPICS tech-talkSubject: cbLow consumption randomly stoppingHello tech-talk,We have an IOC repeatedly locking up with the
infamous "cbLow ring buffer full" message. The IOC consumes data from our CAEN Pico-8 card via an EPICS driver at 100Hz (about 1200-1215 events per second total) , and at some point on the order of hours to days, seems to stop consuming callbacks from the
queue. I wrote a script to track this, and it seems to all be happening in one shot. Below is some data I captured from the last time this happened. Before this, everything was happy for about 4 hours, but sometimes it takes up to a week for this to happen.TimeHigh-waterItems
in qQ capacity%Oflws6/7/2021 19:22:4013012000006/7/2021 19:22:4113012000006/7/2021 19:22:4213012000006/7/2021 19:22:4413012000006/7/2021 19:22:45571571120004.806/7/2021 19:22:46177817781200014.806/7/2021 19:22:47299129911200024.906/7/2021 19:22:4842044204120003506/7/2021
19:22:4954065406120004506/7/2021 19:22:50661966191200055.206/7/2021 19:22:51783478341200065.306/7/2021 19:22:52904890481200075.406/7/2021 19:22:5310249102491200085.406/7/2021 19:22:5411462114621200095.506/7/2021 19:22:5612000120001200010016/7/2021 19:22:581200012000120001001Has
anyone else experienced this? If so, what is the remedy? I can't seem to correlate it to anything unusual happening in the OS/kernel.Thanks,Evan
--
Complexity comes for free, simplicity you have to work for.
|