1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 <2018> 2019 2020 2021 2022 2023 2024 | Index | 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 <2018> 2019 2020 2021 2022 2023 2024 |
<== Date ==> | <== Thread ==> |
---|
Subject: | Re: IOC Crash with No Exception Generated |
From: | Ricardo Cardenes via Tech-talk <[email protected]> |
To: | [email protected] |
Cc: | Talk EPICS Tech <[email protected]> |
Date: | Thu, 9 Aug 2018 12:50:39 -1000 |
Ok... I run spy (sampling for 2 seconds) just before initializing the sequencer, to see what's going on, and it starts pretty much normal, mostly idle, as I would expect:PID PRI STATE %CPU %STK NAME09010001 255 READY 97.0 0 IDLE0a010014 136 Wmutex 1.9 7 scan10a01001e 148 READY 0.4 19 timerQueue0a010021 148 Wmutex 0.1 14 CAC-event0a010004 10 Wevnt 0.1 21 ntwk0a010018 132 Wmutex 0.0 20 scan0.050a010022 108 READY 0.0 20 SPY0a010017 133 Wmutex 0.0 13 scan0.10a010010 129 Wmutex 0.0 18 scanOnce0a010016 134 Wmutex 0.0 21 scan0.20a01001a 182 DELAY 0.0 17 CAS-beacon0a010020 189 Wevnt 0.0 20 CAC-repeater0a010015 135 Wmutex 0.0 16 scan0.50a01001b 183 Wevnt 0.0 20 CAS-UDP0a01000e 149 Wmutex 0.0 16 dbCaLink0a010013 137 Wmutex 0.0 1 scan20a01001f 147 Wevnt 0.0 21 CAC-UDP0a01001d 189 Wmutex 0.0 23 ipToAsciiProxy0a01001c 189 DELAY 0.0 16 bcYearMonitor0a010019 181 Wevnt 0.0 20 CAS-TCP0a010012 138 Wmutex 0.0 23 scan50a010011 139 Wmutex 0.0 23 scan100a01000f 141 Wmutex 0.0 23 timerQueue0a01000d 128 Wmutex 0.0 23 cbHigh0a01000c 135 Wmutex 0.0 23 cbMedium0a01000b 140 Wmutex 0.0 23 cbLow0a01000a 129 Wmutex 0.0 23 timerQueue0a010009 189 Wmutex 0.0 23 taskwd0a010008 109 Wmutex 0.0 23 ClockTimeSync0a010007 109 Wmutex 0.0 22 NTPTimeSync0a010006 10 Wevnt 0.0 18 RPCd0a010005 10 Wevnt 0.0 21 MVEd0a010002 100 Wmsg 0.0 0 ImsgDaemon0a010003 189 Wmutex 0.0 22 errlog0a010001 108 Wmutex 0.0 0 _main_And then, after a bunch of iterations...PID PRI STATE %CPU %STK NAME0a01000f 141 READY 63.1 20 timerQueue09010001 255 READY 32.7 0 IDLE0a010022 108 READY 1.9 19 SPY...Press <return> to terminate.PID PRI STATE %CPU %STK NAME0a01000f 141 READY 95.4 20 timerQueue0a010014 136 Wmutex 2.2 7 scan10a010022 108 READY 1.9 19 SPYtimerQueue stays high from there on. If I stop spying after that happens, the system is already hanging up. If I stop *before*, while the sytem is still idling, then it will continue for a while. timerQueue is not of particularly high priority, but it's higher than any CA task, which is probably significant. Yesterday, tweaking EPICS base a little bit (essentially, I put a couple of probe callbacks in dbScan.c/scanList), I discovered that when the system starts locking up, it is processing a bo (scan rate: 1s) trying to pull its value via DOL that has been labeled CA MS.To test this I'm temporarily turning that link to NPP NMS, because right now I'm testing against simulated subsystem that run as part of the IOC itself, meaning that the access will be to the local database. So far I've rebooted a dozen times and the system does not seem to lock up, or at least it doesn't fail in the same way as before!cheers,RicardoOn Thu, Aug 9, 2018 at 5:31 AM Andrew Johnson <[email protected]> wrote:On 08/08/2018 06:46 PM, Ricardo Cardenes wrote:
> This is a capture from one of the times where the IOC booted all the way
> through:
>
> tc2-sim-ioc> epicsThreadShowAll
> PRIORITY
> ID EPICS RTEMS STATE WAIT NAME
> +--------+-----------+--------+--------+---------------------+
...
> 0a01000a 70 129 Wmtx 1a010258 timerQueue
> 0a01000b 59 140 Wmtx 1a01025d cbLow
> 0a01000c 64 135 Wmtx 1a01025e cbMedium
> 0a01000d 71 128 Wmtx 1a01025f cbHigh
> 0a01000e 50 149 Wmtx 1a010262 dbCaLink
> 0a01000f 58 141 Wmtx 1a010273 timerQueue
> 0a010010 70 129 Wmtx 1a013619 scanOnce
> 0a010011 60 139 Wmtx 1a01361b scan10
> 0a010012 61 138 Wmtx 1a01361d scan5
> 0a010013 62 137 Wmtx 1a01361f scan2
> 0a010014 63 136 Wmtx 1a013621 scan1
> 0a010015 64 135 Wmtx 1a013623 scan0.5
> 0a010016 65 134 Wmtx 1a013625 scan0.2
> 0a010017 66 133 Wmtx 1a013627 scan0.1
> 0a010018 67 132 Wmtx 1a013629 scan0.05
So based on your earlier story about scan thread priorities the problem
might also be related to the medium priority callback thread cbMedium,
which is a general facility and could be used for any number of things
inside either the IOC or the application. If you can run spy during
boot-up that might confirm that one way of the other.
Whichever thread it turns out to be, getting a stack trace of the thread
at the time is probably the next most useful thing you could do.
- Andrew
--
Arguing for surveillance because you have nothing to hide is no
different than making the claim, "I don't care about freedom of
speech because I have nothing to say." -- Edward Snowdon