EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: IOC Crash with No Exception Generated
From: Ricardo Cardenes via Tech-talk <[email protected]>
To: [email protected]
Cc: Talk EPICS Tech <[email protected]>
Date: Thu, 9 Aug 2018 12:50:39 -1000
Oh well, it was too early to say "it doesn't fail the same way". After a few more reboots it still gets stuck. That is no surprise, given that this particular timerQueue's priority is higher than that of tcsSeqInit, for example. Well, at least there's a lead. Now, let's see what is spawning this one (there are two others, both with higher priorities).

On Thu, Aug 9, 2018 at 12:34 PM Ricardo Cardenes <[email protected]> wrote:
Ok... I ran spy (sampling for 2 seconds) just before initializing the sequencer, to see what's going on, and it starts out pretty much normal, mostly idle, as I would expect:

   PID   PRI STATE   %CPU %STK  NAME
09010001 255 READY   97.0    0  IDLE
0a010014 136 Wmutex   1.9    7  scan1
0a01001e 148 READY    0.4   19  timerQueue
0a010021 148 Wmutex   0.1   14  CAC-event
0a010004  10 Wevnt    0.1   21  ntwk
0a010018 132 Wmutex   0.0   20  scan0.05
0a010022 108 READY    0.0   20  SPY 
0a010017 133 Wmutex   0.0   13  scan0.1
0a010010 129 Wmutex   0.0   18  scanOnce
0a010016 134 Wmutex   0.0   21  scan0.2
0a01001a 182 DELAY    0.0   17  CAS-beacon
0a010020 189 Wevnt    0.0   20  CAC-repeater
0a010015 135 Wmutex   0.0   16  scan0.5
0a01001b 183 Wevnt    0.0   20  CAS-UDP
0a01000e 149 Wmutex   0.0   16  dbCaLink
0a010013 137 Wmutex   0.0    1  scan2
0a01001f 147 Wevnt    0.0   21  CAC-UDP
0a01001d 189 Wmutex   0.0   23  ipToAsciiProxy
0a01001c 189 DELAY    0.0   16  bcYearMonitor
0a010019 181 Wevnt    0.0   20  CAS-TCP
0a010012 138 Wmutex   0.0   23  scan5
0a010011 139 Wmutex   0.0   23  scan10
0a01000f 141 Wmutex   0.0   23  timerQueue
0a01000d 128 Wmutex   0.0   23  cbHigh
0a01000c 135 Wmutex   0.0   23  cbMedium
0a01000b 140 Wmutex   0.0   23  cbLow
0a01000a 129 Wmutex   0.0   23  timerQueue
0a010009 189 Wmutex   0.0   23  taskwd
0a010008 109 Wmutex   0.0   23  ClockTimeSync
0a010007 109 Wmutex   0.0   22  NTPTimeSync
0a010006  10 Wevnt    0.0   18  RPCd
0a010005  10 Wevnt    0.0   21  MVEd
0a010002 100 Wmsg     0.0    0  ImsgDaemon
0a010003 189 Wmutex   0.0   22  errlog
0a010001 108 Wmutex   0.0    0  _main_

And then, after a bunch of iterations...

   PID   PRI STATE   %CPU %STK  NAME
0a01000f 141 READY   63.1   20  timerQueue
09010001 255 READY   32.7    0  IDLE
0a010022 108 READY    1.9   19  SPY 
...

Press <return> to terminate.
   PID   PRI STATE   %CPU %STK  NAME
0a01000f 141 READY   95.4   20  timerQueue
0a010014 136 Wmutex   2.2    7  scan1
0a010022 108 READY    1.9   19  SPY 

timerQueue stays high from there on. If I stop spying after that happens, the system is already hung. If I stop *before*, while the system is still idling, then it will continue for a while. This timerQueue is not of particularly high priority, but it's higher than any CA task, which is probably significant. Yesterday, tweaking EPICS base a little bit (essentially, I put a couple of probe callbacks in dbScan.c/scanList), I discovered that when the system starts locking up, it is processing a bo (scan rate: 1s) that is trying to pull its value via a DOL link flagged CA MS.
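The priority argument above can be checked mechanically against the spy output. The following is only a minimal sketch in Python: the helper names `parse_spy` and `lower_priority_than` are made up for illustration, and the sample contains just a few rows copied from the first table. It relies on the RTEMS convention that a numerically smaller PRI means a higher priority (IDLE sits at 255).

```python
import re

# A few rows copied from the first spy sample quoted above.
SPY_SAMPLE = """\
   PID   PRI STATE   %CPU %STK  NAME
09010001 255 READY   97.0    0  IDLE
0a010014 136 Wmutex   1.9    7  scan1
0a01001e 148 READY    0.4   19  timerQueue
0a010021 148 Wmutex   0.1   14  CAC-event
0a01001a 182 DELAY    0.0   17  CAS-beacon
0a01001f 147 Wevnt    0.0   21  CAC-UDP
0a010019 181 Wevnt    0.0   20  CAS-TCP
0a01000f 141 Wmutex   0.0   23  timerQueue
0a01000d 128 Wmutex   0.0   23  cbHigh
"""

# One data row: 8 hex digits, priority, state, %CPU, %STK, task name.
ROW = re.compile(
    r"^\s*([0-9a-f]{8})\s+(\d+)\s+(\S+)\s+([\d.]+)\s+(\d+)\s+(\S+)\s*$"
)


def parse_spy(text):
    """Turn spy output into a list of dicts; the header line is skipped."""
    tasks = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            pid, pri, state, cpu, stk, name = m.groups()
            tasks.append({"pid": pid, "pri": int(pri), "state": state,
                          "cpu": float(cpu), "stk": int(stk), "name": name})
    return tasks


def lower_priority_than(tasks, pid):
    """Names of tasks that cannot run while task `pid` never blocks.

    RTEMS: a numerically larger PRI is a lower priority.
    IDLE is left out because it only runs when nothing else is ready.
    """
    hog = next(t for t in tasks if t["pid"] == pid)
    return sorted(t["name"] for t in tasks
                  if t["pri"] > hog["pri"] and t["name"] != "IDLE")


if __name__ == "__main__":
    tasks = parse_spy(SPY_SAMPLE)
    print(lower_priority_than(tasks, "0a01000f"))
```

With the timerQueue at PRI 141 spinning, every CA task in the sample falls into the returned list, while scan1 (136) and cbHigh (128) do not.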

To test this I'm temporarily changing that link to NPP NMS, because right now I'm testing against simulated subsystems that run as part of the IOC itself, meaning that the access will be to the local database. So far I've rebooted a dozen times and the system does not seem to lock up, or at least it doesn't fail in the same way as before!
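Why the flag change matters can be sketched as follows. As far as I know, a link whose target record lives in the same IOC is resolved as a direct database link unless one of the CA, CP or CPP flags forces Channel Access. This Python sketch only mimics that decision; `link_kind` and the record names are hypothetical, and constant links are not handled.

```python
# Flags that force a Channel Access link even for a local target
# (assumption based on the usual EPICS link rules, not on this thread).
CA_FORCING_FLAGS = {"CA", "CP", "CPP"}


def link_kind(link_value, local_records):
    """Classify a link field value such as 'sim:demand CA MS'.

    Returns "CA" for a Channel Access link and "DB" for a direct
    database link. `local_records` is the set of record names loaded
    in this IOC. Constant links are deliberately not handled.
    """
    parts = link_value.split()
    target, flags = parts[0], set(parts[1:])
    record = target.split(".", 1)[0]  # drop an optional .FIELD suffix
    if flags & CA_FORCING_FLAGS or record not in local_records:
        return "CA"
    return "DB"


if __name__ == "__main__":
    local = {"sim:demand"}  # hypothetical simulated-subsystem record
    print(link_kind("sim:demand CA MS", local))    # CA forced by the flag
    print(link_kind("sim:demand NPP NMS", local))  # local database access
```

Under that assumption, switching the DOL from CA MS to NPP NMS takes the CA client tasks out of the path entirely for as long as the simulated subsystems run inside the same IOC.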

cheers,
Ricardo


On Thu, Aug 9, 2018 at 5:31 AM Andrew Johnson <[email protected]> wrote:
On 08/08/2018 06:46 PM, Ricardo Cardenes wrote:
> This is a capture from one of the times where the IOC booted all the way
> through:
>
> tc2-sim-ioc> epicsThreadShowAll
>             PRIORITY
>     ID    EPICS RTEMS   STATE    WAIT         NAME
> +--------+-----------+--------+--------+---------------------+
...
>  0a01000a   70 129        Wmtx 1a010258 timerQueue
>  0a01000b   59 140        Wmtx 1a01025d cbLow
>  0a01000c   64 135        Wmtx 1a01025e cbMedium
>  0a01000d   71 128        Wmtx 1a01025f cbHigh
>  0a01000e   50 149        Wmtx 1a010262 dbCaLink
>  0a01000f   58 141        Wmtx 1a010273 timerQueue
>  0a010010   70 129        Wmtx 1a013619 scanOnce
>  0a010011   60 139        Wmtx 1a01361b scan10
>  0a010012   61 138        Wmtx 1a01361d scan5
>  0a010013   62 137        Wmtx 1a01361f scan2
>  0a010014   63 136        Wmtx 1a013621 scan1
>  0a010015   64 135        Wmtx 1a013623 scan0.5
>  0a010016   65 134        Wmtx 1a013625 scan0.2
>  0a010017   66 133        Wmtx 1a013627 scan0.1
>  0a010018   67 132        Wmtx 1a013629 scan0.05

So based on your earlier story about scan thread priorities the problem
might also be related to the medium priority callback thread cbMedium,
which is a general facility and could be used for any number of things
inside either the IOC or the application. If you can run spy during
boot-up that might confirm that one way or the other.
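The two priority columns in the quoted epicsThreadShowAll output are related in a simple way: in every row shown, the EPICS and RTEMS values add up to 199. The sketch below, in Python, checks that against the quoted rows and groups tasks by RTEMS priority. The offset 199 is inferred from these rows only, and the function names are made up for illustration.

```python
from collections import defaultdict

# (name, EPICS priority, RTEMS priority), copied from the quoted table.
ROWS = [
    ("timerQueue", 70, 129), ("cbLow", 59, 140), ("cbMedium", 64, 135),
    ("cbHigh", 71, 128), ("dbCaLink", 50, 149), ("timerQueue", 58, 141),
    ("scanOnce", 70, 129), ("scan10", 60, 139), ("scan5", 61, 138),
    ("scan2", 62, 137), ("scan1", 63, 136), ("scan0.5", 64, 135),
    ("scan0.2", 65, 134), ("scan0.1", 66, 133), ("scan0.05", 67, 132),
]


def epics_to_rtems(epics_priority):
    """Mapping inferred from the rows above: the two columns sum to 199."""
    return 199 - epics_priority


def tasks_by_rtems_priority(rows):
    """Group task names by RTEMS priority, keeping table order."""
    groups = defaultdict(list)
    for name, _epics, rtems in rows:
        groups[rtems].append(name)
    return dict(groups)


if __name__ == "__main__":
    assert all(epics_to_rtems(e) == r for _, e, r in ROWS)
    shared = {p: names for p, names in tasks_by_rtems_priority(ROWS).items()
              if len(names) > 1}
    print(shared)
```

The grouping shows that cbMedium shares RTEMS priority 135 with scan0.5, which is why the callback thread is a candidate alongside the scan threads.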

Whichever thread it turns out to be, getting a stack trace of the thread
at the time is probably the next most useful thing you could do.

- Andrew

--
Arguing for surveillance because you have nothing to hide is no
different than making the claim, "I don't care about freedom of
speech because I have nothing to say." -- Edward Snowden
