On 07/26/2018 12:16 PM, Ricardo Cardenes via Tech-talk wrote:
> Hi,
>
> Thanks both Michael and Andrew for your answers. I also have hardware error as the top cause, but we're trying other ideas first because:
>
> 1. This has happened on two different boards (one rather old, the other a more recent purchase, both MVME2700s) within 4 days. This is an RTEMS system replacing a VxWorks one, also on a 2700.
Is there a second VME master in these crates? Any use of the VME inbound windows or DMA?
It could also be something "simple" like an infinite loop with interrupts disabled.
> 2. To account for that, I have speculated that an external factor (another board in the same crate, or maybe a failing power supply) is making this happen, but the VxWorks-based system doesn't seem to be experiencing the problem at all.
>
> To make things worse, the system is at a separate location and we can only test it in real time by sending an engineer to spend the night at 13800' just waiting for a failure that may or may not happen. It's quite a frustrating situation, because it seems site-dependent, as the same boards have been running for weeks in our lab without a glitch! :-(
I'd recommend getting an IP camera or two. I've made use of these in the past;
they are a great way to keep an eye on equipment remotely. The two-axis motorized
type is good, as you don't have to spend as much time positioning it.
I don't remember the exact models I've used, but there are a number of similar products.
https://www.amazon.com/Amcrest-1920TVL-Security-Wireless-IP2M-841B/dp/B0145OQTPG
> Andrew: the watchdog to trigger the abort looks like a great idea. I'll look into your interrupt latency suggestion, too. Our problem now would be to convince the users to put a known unstable system in operations just to see it fail :-)
>
> Regards,
> Ricardo
>
> On Thu, Jul 26, 2018 at 5:38 AM Andrew Johnson wrote:
>
> Hi Matt,
>
> On 07/25/2018 08:52 PM, Matt Rippa via Tech-talk wrote:
> > Is there a way to force an exception (or stack trace), for example with
> > watchdog?
> >
> > RTEMS 4.10.2/EPICS 3.14.12.7 MVME2307 BSP
>
> Back in 2000 we were having this kind of issue on some of our VxWorks
> systems, and I wrote some code for a couple of different CPU boards
> (MVME167 and 172) which connected an interrupt handler to the Abort
> button interrupt. When the button was pressed this routine dumped the
> status of a selected set of tasks into an area of memory that was
> configured to survive a reboot, allowing us to show that status after
> bringing the system back up.
>
> My code would be no use to you on a different CPU board and OS, but the
> idea of connecting something up to an Abort button interrupt if your
> boards have one might help. You'll need the hardware manual for the CPU
> board to work out how to enable and connect the abort interrupt, but
> most Motorola/Emerson/Whoever boards do have such a button.
>
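The board-specific part (hooking the Abort interrupt, and finding RAM that is left alone across a reboot) has to come from the hardware manual and the BSP, but the record-keeping side of Andrew's idea is portable. A minimal sketch, assuming a fixed-size record guarded by a marker word and a checksum so that leftover garbage is not mistaken for a saved status; every name here (crash_record, crash_record_save, and so on) is hypothetical and not taken from Andrew's code, EPICS, or RTEMS:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CRASH_MAGIC    0x43524153u  /* arbitrary marker word (hypothetical) */
#define CRASH_TEXT_LEN 64

static_assert(CRASH_TEXT_LEN > 1, "text buffer needs room for a terminator");

/* Layout of the record kept in the reboot-surviving memory area. */
typedef struct {
    uint32_t magic;      /* CRASH_MAGIC once the record is complete */
    uint32_t checksum;   /* covers sequence and text */
    uint32_t sequence;   /* e.g. a boot or event counter */
    char     text[CRASH_TEXT_LEN];
} crash_record;

/* Simple multiplicative checksum; enough to reject random memory contents. */
static uint32_t crash_checksum(const crash_record *r)
{
    uint32_t sum = r->sequence;
    for (size_t i = 0; i < CRASH_TEXT_LEN; i++)
        sum = sum * 31u + (unsigned char)r->text[i];
    return sum;
}

/* Called from the abort handler: fill in the record, marker word last. */
void crash_record_save(crash_record *r, uint32_t sequence, const char *text)
{
    memset(r, 0, sizeof *r);
    r->sequence = sequence;
    strncpy(r->text, text, CRASH_TEXT_LEN - 1);  /* buffer stays terminated */
    r->checksum = crash_checksum(r);
    r->magic = CRASH_MAGIC;  /* written last so a half-written record fails the check */
}

/* Called after reboot: nonzero only if a complete, intact record is present. */
int crash_record_valid(const crash_record *r)
{
    return r->magic == CRASH_MAGIC && r->checksum == crash_checksum(r);
}
```

On a real target the crash_record would live at whatever address the BSP leaves untouched at boot, and the startup code would print and then clear it if crash_record_valid() says it is good.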
>
> I will add that the PowerPC CPUs seem to be a bit more prone to the CPU
> completely hanging up than the 68Ks were, which I think tends to happen
> if they get a Bus Error (PCIbus Target Abort) from code running inside
> an interrupt handler/ISR. If you have any ISRs that do VMEbus I/O you
> might want to look at whether they can be converted into high priority
> threads that sit waiting on a semaphore, and have the ISR do nothing but
> trigger that semaphore. This will increase the interrupt latency and
> jitter, but would prevent hangups if something goes wrong with the VME I/O.
>
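The ISR-to-thread conversion Andrew describes can be sketched roughly as below. This is only an illustration using POSIX threads and semaphores so that it can be tried on a workstation; on the IOC itself you would more likely use the EPICS epicsEvent calls or the native RTEMS semaphore API, and create the thread at a high priority. The names deferred_io, fake_isr and do_bus_io are hypothetical stand-ins, and do_bus_io just counts calls where the real code would touch the VMEbus.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    sem_t      wake;      /* posted by the ISR, waited on by the worker */
    atomic_int pending;   /* interrupts not yet serviced */
    atomic_int stop;      /* set to ask the worker to exit */
    atomic_int serviced;  /* bus accesses performed so far */
    pthread_t  thread;
} deferred_io;

/* Stand-in for the VME access that used to live in the ISR. */
static void do_bus_io(deferred_io *d)
{
    atomic_fetch_add(&d->serviced, 1);
}

/* Worker thread: sleeps on the semaphore and does the bus I/O in thread context. */
static void *worker_main(void *arg)
{
    deferred_io *d = arg;
    for (;;) {
        int rc;
        do { rc = sem_wait(&d->wake); } while (rc != 0 && errno == EINTR);
        if (rc != 0)
            break;                       /* unexpected semaphore failure */
        if (atomic_load(&d->pending) > 0) {
            atomic_fetch_sub(&d->pending, 1);
            do_bus_io(d);                /* a bus error here hits a thread, not an ISR */
        } else if (atomic_load(&d->stop)) {
            break;                       /* woken only to shut down */
        }
    }
    return NULL;
}

/* Create the semaphore and start the worker; returns 0 on success. */
int deferred_io_start(deferred_io *d)
{
    assert(d != NULL);
    atomic_init(&d->pending, 0);
    atomic_init(&d->stop, 0);
    atomic_init(&d->serviced, 0);
    if (sem_init(&d->wake, 0, 0) != 0)
        return -1;
    if (pthread_create(&d->thread, NULL, worker_main, d) != 0) {
        sem_destroy(&d->wake);
        return -1;
    }
    return 0;
}

/* What the ISR shrinks to: note the event and wake the worker, no bus access. */
void fake_isr(deferred_io *d)
{
    atomic_fetch_add(&d->pending, 1);
    sem_post(&d->wake);
}

/* Ask the worker to finish outstanding work and exit; returns the serviced count. */
int deferred_io_stop(deferred_io *d)
{
    atomic_store(&d->stop, 1);
    sem_post(&d->wake);
    pthread_join(d->thread, NULL);
    sem_destroy(&d->wake);
    return atomic_load(&d->serviced);
}
```

The cost is the one Andrew mentions: each interrupt now pays for a context switch before the hardware is touched, so latency and jitter go up.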
> HTH,
>
> - Andrew
>
> --
> Arguing for surveillance because you have nothing to hide is no
> different than making the claim, "I don't care about freedom of
> speech because I have nothing to say." -- Edward Snowden
>
- Replies:
- Re: IOC Crash with No Exception Generated Ricardo Cardenes via Tech-talk
- References:
- IOC Crash with No Exception Generated Matt Rippa via Tech-talk
- Re: IOC Crash with No Exception Generated Andrew Johnson
- Re: IOC Crash with No Exception Generated Ricardo Cardenes via Tech-talk