EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: IOC Crash with No Exception Generated
From: "Johnson, Andrew N." <[email protected]>
To: Ricardo Cardenes <[email protected]>
Cc: Talk EPICS Tech <[email protected]>
Date: Wed, 8 Aug 2018 15:50:41 +0000
Hi Ricardo,

Do you have Eric Norum’s spy code for RTEMS available? I haven’t used it myself but if it works like the VxWorks one maybe you could try starting spy early on in your startup script and just watch the output while it boots? The spy reporting task has to run at a fairly high priority so I think it should work and give you more information about what’s happening.

- Andrew

-- 
Sent from my iPad

On Aug 8, 2018, at 5:19 AM, Ricardo Cardenes <[email protected]> wrote:



On Tue, Aug 7, 2018 at 7:56 PM Michael Davidsaver <[email protected]> wrote:
On 08/07/2018 04:25 PM, Ricardo Cardenes wrote:
> Hi everyone,
>
> I've got a bit more info about this problem. Or maybe not. Given that we can't really test our 2700s right now, we're trying to ship the system on an MVME6100 instead, and I've been testing it in the lab. This board has given us quite some trouble in the past (mainly due to overheating, it seems), so we're testing it thoroughly in the lab.
>
> Along with it, I've written a soft watchdog to probe some idea that I've been toying around with. Now, the watchdog operates in a very simple way:
>
>   * An RTEMS timer is created, and set to fire after 1 second (this time is arbitrary). If it fires, it will simply inform us that the system is stalling.
>   * An EPICS record processes every 0.2 seconds, and the associated SNAM resets the timer. If it stops processing, the timer will (eventually) fire.
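The two bullets above can be modelled on a host machine with no RTEMS or EPICS calls. This is only a sketch of the logic: `soft_watchdog`, `wd_reset`, `wd_poll` and `wd_simulate` are hypothetical names, time is a plain millisecond counter, and the real implementation in the thread uses an RTEMS timer and a record subroutine instead.

```c
#include <stdbool.h>

/* Hypothetical host-side model of the soft watchdog; times are in ms. */
typedef struct {
    unsigned timeout_ms;   /* 1000 ms in the thread (an arbitrary choice) */
    unsigned deadline_ms;  /* absolute time at which the timer would fire */
    bool     fired;
} soft_watchdog;

void wd_init(soft_watchdog *wd, unsigned now_ms, unsigned timeout_ms)
{
    wd->timeout_ms  = timeout_ms;
    wd->deadline_ms = now_ms + timeout_ms;
    wd->fired       = false;
}

/* Stands in for the record's SNAM routine: push the deadline back. */
void wd_reset(soft_watchdog *wd, unsigned now_ms)
{
    wd->deadline_ms = now_ms + wd->timeout_ms;
}

/* Stands in for the timer service routine: latch "fired" at the deadline. */
bool wd_poll(soft_watchdog *wd, unsigned now_ms)
{
    if (!wd->fired && now_ms >= wd->deadline_ms)
        wd->fired = true;
    return wd->fired;
}

/*
 * Run for run_ms. The record processes every period_ms until it stalls at
 * stall_at_ms. Returns the time at which the watchdog fired, or -1 if it
 * never did.
 */
long wd_simulate(unsigned run_ms, unsigned period_ms, unsigned stall_at_ms)
{
    soft_watchdog wd;
    wd_init(&wd, 0, 1000);
    for (unsigned t = 1; t <= run_ms; t++) {
        if (t < stall_at_ms && t % period_ms == 0)
            wd_reset(&wd, t);
        if (wd_poll(&wd, t))
            return (long)t;
    }
    return -1;
}
```

With a 200 ms period and no stall the model never fires. If processing stops after the reset at 2000 ms, it fires one timeout later, at 3000 ms.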
>
> So far so good. Even if the watchdog was intended to test the situation with the 2700, I included it in the build for the 6100, just for future use. Now, about 50% of the time, everything works without (visible) problems. But when it doesn't, I observe the following which, at least on the surface, seems close enough to what I could see in the 2700, but I can't confirm until I get the same set of debugging tools on it:
>
>  1. The system starts booting and going through the startup script
>  2. At a certain point, while initializing the sequencer tasks (from the seq supp module), the system stops going through the initialization.
>      1. This always happens at one specific transition. I still need to look into it.
>      2. The system stops responding to CA events at this point, which is consistent with 2.1, because the transition depends on an external (simulated, in this case, but accessed through CA) PV
>      3. The watchdog is not firing at this point, meaning that RTEMS is still working and, at least, scan0.2 is processing!

When things are (apparently) working, what does epicsThreadShowAll() show?

I can't really tell. At that point I can't interact with the iocsh console without making the system crash (see point #3), and CA is lost.
 
Are you changing the priority of any sequencer threads?  If not, then last I checked,
sequencer threads default to epicsThreadPriorityMedium (50).  This is lower than the periodic
scan threads (60), but higher than the CA server threads (20).  So I think what you describe
would be explainable if a sequencer program were getting stuck in a tight loop.
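The priority argument above can be checked with a toy strict-priority scheduler. This is a host-side sketch, not EPICS or RTEMS code: `sim_thread`, `simulate` and `cpu_ticks_for` are hypothetical names, the periods and work amounts are invented, and only the priorities (60, 50, 20, plus 10 for the errlog thread as stated elsewhere in this thread) come from the discussion.

```c
#include <stddef.h>

typedef struct {
    const char *name;
    int      priority;  /* higher value wins, as with EPICS priorities */
    unsigned period;    /* ticks between wake-ups; 0 = always ready (tight loop) */
    unsigned work;      /* ticks of CPU needed per wake-up */
    unsigned pending;   /* ticks of work still owed */
    unsigned ran;       /* ticks of CPU actually received */
} sim_thread;

/* Each tick, the highest-priority thread with pending work gets the CPU. */
void simulate(sim_thread *t, size_t n, unsigned ticks)
{
    for (unsigned now = 0; now < ticks; now++) {
        sim_thread *best = NULL;
        for (size_t i = 0; i < n; i++) {
            if (t[i].period == 0)
                t[i].pending = 1;            /* a spinner is always ready */
            else if (now % t[i].period == 0)
                t[i].pending += t[i].work;   /* periodic wake-up */
            if (t[i].pending && (!best || t[i].priority > best->priority))
                best = &t[i];
        }
        if (best) {
            best->pending--;
            best->ran++;
        }
    }
}

enum { SCAN = 0, SEQ = 1, CA = 2, ERRLOG = 3 };

/* CPU ticks received by one thread over 1000 ticks; spin != 0 makes the
 * sequencer thread loop forever instead of waking periodically. */
unsigned cpu_ticks_for(int which, int spin)
{
    sim_thread t[] = {
        { "scan0.2",   60, 200,            5, 0, 0 },
        { "sequencer", 50, spin ? 0 : 100, 2, 0, 0 },
        { "CA server", 20, 50,             3, 0, 0 },
        { "errlog",    10, 50,             1, 0, 0 },
    };
    simulate(t, 4, 1000);
    return t[which].ran;
}
```

In this model a spinning sequencer thread leaves the scan thread untouched but gives the CA server and errlog threads no CPU at all, which matches the symptoms described: records still process, CA is lost, and queued log messages never appear.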

We're not changing any priority. Actually, after sending the email I started looking at priorities and poked around a little bit longer, playing with my watchdog reset task, making it process at lower scan rates. Placing it at scan0.5 didn't make a difference, but then I moved it to scan1 (increasing the timer to 2 seconds, to ensure that it would be reset properly), and here's where I hit the possible culprit. Now, we have tons of records processing at that scan rate in this particular IOC (our largest and most complex), so I'm trying some creative approach to the problem...
 
Does your sequencer program wait for all PVs to become connected?  "50% of the time" sounds
like a race, and PV (dis)connection is a prime candidate.  As a quick test, you might try
putting an epicsThreadSleep() in your startup script between iocInit() and when the sequencer
program(s) are started.

Actually, that "50% of the time" may have been too optimistic. Sometimes I can get the system to hang several times in a row. We also suspect a race somewhere. The sequencer controls about a dozen separate state sets, one per subsystem, to track their status.
 

>  3. When I get to this point, if I interact with the iocsh console (eg. I just press Enter), something else locks up, scan0.2 stops processing and the watchdog fires.
>      1. Interestingly, iocsh registers the first Enter (a newline is echoed back), but not subsequent ones.

I can't explain this.  The "main" thread should be running with high priority (91).
It could hang if an iocsh function blocks locking a mutex held by a hung thread.
But simply pressing "return" shouldn't do this.

I can't either. The high priority of the "_main_" thread flies in the face of the "something is hogging the CPU for everything low-priority" theory. iocsh would need to stall itself in the same way for everything else to die, and I'm not sure what could cause this. I've had a look at the code that reads commands from the prompt and it's relatively straightforward.

> NB: when discussing this with my team, someone suggested that maybe the whole system was blocking until I hit Enter, not just (parts of) EPICS. I have confirmed that this is *not* the case by introducing a periodic printf in the function that resets the timer. Indeed, it keeps being called while the sequencer and CA are stalled.
>
> *But*, /and this is a very big but/, I had to use printf to put out those messages. Initially I used errlogMessage, and this made the messages not show up. But all this time the watchdog was still not being fired. Now, errlogPrintf (called by errlogMessage) does not print things out to console unless this has been explicitly enabled using eltc (which is not our case), meaning that the messages are being queued to the message buffer to be printed out later by the errlogThread. *But errlogThread is not being called*...
>
> errlogThread is run as a low priority task (priority 10, one of the lowest). It won't be given CPU time if other, higher priority tasks, are hogging the CPU, which suggests to me that something *is*, and most probably this is the problem I'm facing right now, possibly (but not 100% sure) related to the initial one. *But that something is not high priority enough to override scan0.2* (priority 65), and certainly not the timer, which is activated via an ISR and thus should be at the highest priority.
>
> I'll keep trying to isolate the priority of the task that is blocking everything else. If things get hairy, I've got a support module that I've been using to measure the system behavior at the RTEMS thread level and I'll try to merge it in to get some info on what's going on.
>
> In the meantime, if someone has any idea, you're welcome to chime in!

A situation like this, which is probably a mis-behaving thread, would be an excellent
case for Till Straumann's in-process GDB stub.

Sure. I'll have to resort to it if I can't find anything over the next couple of days. Thanks!
 

http://www.slac.stanford.edu/~strauman/rtems/gdb/

The version I've used

https://github.com/epicsdeb/rtems-gdbstub

If you don't want to use the whole loadable object infrastructure, I have a recipe
for linking the gdbstub into a regular IOC executable.

https://github.com/mdavidsaver/epics-base/commit/16209d34a1a594cb800362bb15024d8ce8e694bd


> Cheers,
> Ricardo
>
> On Fri, Jul 27, 2018 at 10:09 AM Ricardo Cardenes <[email protected]> wrote:
>
>     Hi Michael,
>
>
>     On Thu, Jul 26, 2018 at 9:57 AM Michael Davidsaver <[email protected]> wrote:
>
>         On 07/26/2018 12:16 PM, Ricardo Cardenes via Tech-talk wrote:
>         > Hi,
>         >
>         > Thanks both Michael and Andrew for your answers. I also have hardware error as the top cause, but we're trying other ideas first because:
>         >
>         >  1. This has happened on two different boards (one rather old, the other a more recent purchase, both MVME2700s) within 4 days. This is an RTEMS system replacing a VxWorks one, also on a 2700.
>
>         Is there a second VME master in these crates?  Any use of the VME inbound windows or DMA?
>         It could also be something "simple" like an infinite loop with interrupts disabled.
>
>
>     There's another one, but my understanding is that this is a split crate. The two systems share a power supply, but nothing else. There are no VME inbound windows, nor DMA. The only boards along with the main IOC are a Bancomm 637 and a Xycom-240.
>      
>
>         >  2. To account for that, I have speculated that an external factor (another board in the same crate, or maybe a failing power supply) is making this happen, but the VxWorks based system doesn't seem to be experiencing the problem at all
>         >
>         > To make things worse, the system is at a separate location and we can only test it in real time by sending an engineer to spend the night at 13800' just waiting for a failure that may happen, or not. It's quite a frustrating situation, because it seems site-dependent, as the same boards have been running for weeks in our lab without a glitch! :-(
>
>         I'd recommend getting an IP camera or two.  I made use of these in the past.
>         A great way to keep an eye on equipment remotely.  The two axis motorized
>         type is good as you don't have to spend as much time positioning.
>
>         I don't remember the exact models I've used, but there are a number of similar products.
>
>         https://www.amazon.com/Amcrest-1920TVL-Security-Wireless-IP2M-841B/dp/B0145OQTPG
>
>
>     I could look into this (we have IP cameras for other uses), but that would require getting clearance to run a known unstable system during regular operations. Which won't happen, because every reboot requires tuning, meaning ~20-30 minutes of downtime. But thanks for the suggestion :-)
>      
>
>          
>
>         > Andrew: the watchdog to trigger the abort looks like a great idea. I'll look into your interrupt latency suggestion, too. Our problem now would be to convince the users to put a known unstable system in operations just to see it fail :-)
>         >
>         > Regards,
>         > Ricardo
>         >
>         > On Thu, Jul 26, 2018 at 5:38 AM Andrew Johnson <[email protected]> wrote:
>         >
>         >     Hi Matt,
>         >
>         >     On 07/25/2018 08:52 PM, Matt Rippa via Tech-talk wrote:
>         >     > Is there a way to force an exception (or stack trace), for example with
>         >     > watchdog?
>         >     >
>         >     > RTEMS 4.10.2/EPICS 3.14.12.7 MVME2307 BSP
>         >
>         >     Back in 2000 we were having this kind of issue on some of our VxWorks
>         >     systems, and I wrote some code for a couple of different CPU boards
>         >     (MVME167 and 172) which connected an interrupt handler to the Abort
>         >     button interrupt. When the button was pressed this routine dumped the
>         >     status of a selected set of tasks into an area of memory that was
>         >     configured to survive a reboot, allowing us to show that status after
>         >     bringing the system back up.
>         >
>         >     My code would be no use to you on a different CPU board and OS, but the
>         >     idea of connecting something up to an Abort button interrupt if your
>         >     boards have one might help. You'll need the hardware manual for the CPU
>         >     board to work out how to enable and connect the abort interrupt, but
>         >     most Motorola/Emerson/Whoever boards do have such a button.
>         >
>         >
>         >     I will add that the PowerPC CPUs seem to be a bit more prone to the CPU
>         >     completely hanging up than the 68Ks were, which I think tends to happen
>         >     if they get a Bus Error (PCIbus Target Abort) from code running inside
>         >     an interrupt handler/ISR. If you have any ISRs that do VMEbus I/O you
>         >     might want to look at whether they can be converted into high priority
>         >     threads that sit waiting on a semaphore, and have the ISR do nothing but
>         >     trigger that semaphore. This will increase the interrupt latency and
>         >     jitter, but would prevent hangups if something goes wrong with the VME I/O.
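The restructuring suggested above (the ISR only triggers a semaphore, a high priority thread does the bus I/O) can be sketched as a host-side model. Nothing here is RTEMS, VxWorks or devLib API: the semaphore is reduced to a plain counter, and `deferred_io`, `isr`, `worker_drain`, `good_read` and `bad_read` are hypothetical names used only for illustration.

```c
#include <stdbool.h>

/* Model only: a counting semaphore reduced to a plain counter. */
typedef struct { unsigned count; } model_sem;

typedef struct {
    model_sem wake;        /* given by the ISR, taken by the worker */
    unsigned  handled;     /* interrupts fully serviced */
    unsigned  bus_errors;  /* failed bus accesses seen at thread level */
} deferred_io;

/* Hypothetical stand-in for a VME register read; false means bus error. */
typedef bool (*bus_read_fn)(unsigned *value);

/* ISR body: no bus access at all, it only wakes the worker. */
void isr(deferred_io *d)
{
    d->wake.count++;
}

/* One pass of the high-priority worker: service every pending wake-up. */
void worker_drain(deferred_io *d, bus_read_fn read_reg)
{
    while (d->wake.count > 0) {
        unsigned value;
        d->wake.count--;
        if (read_reg(&value))
            d->handled++;
        else
            d->bus_errors++;  /* a countable failure in thread context */
    }
}

/* Two fake devices for exercising the model. */
bool good_read(unsigned *value) { *value = 42; return true; }
bool bad_read(unsigned *value)  { (void)value; return false; }
```

The point of the split is that a failing bus access now happens in thread context, where it can be detected and reported, instead of inside the interrupt handler. The cost, as noted above, is extra interrupt latency and jitter.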
>         >
>         >     HTH,
>         >
>         >     - Andrew
>         >
>         >     --
>         >     Arguing for surveillance because you have nothing to hide is no
>         >     different than making the claim, "I don't care about freedom of
>         >     speech because I have nothing to say." -- Edward Snowdon
>         >
>

