Experimental Physics and Industrial Control System
Hi Michael,
On 07/26/2018 12:16 PM, Ricardo Cardenes via Tech-talk wrote:
> Hi,
>
> Thanks both Michael and Andrew for your answers. I also have hardware error as the top cause, but we're trying other ideas first because:
>
> 1. This has happened on two different boards (one rather old, the other a more recent purchase, both MVME2700s) within 4 days. This is an RTEMS system replacing a VxWorks one, also on a 2700.
Is there a second VME master in these crates? Any use of the VME inbound windows or DMA?
It could also be something "simple" like an infinite loop with interrupts disabled.
There's another one, but my understanding is that this is a split crate. The two systems share a power supply, but nothing else. There are no VME inbound windows, nor DMA. The only boards along with the main IOC are a Bancomm 637 and a Xycom-240.
> 2. To account for that, I have speculated that an external factor (another board in the same crate, or maybe a failing power supply) is making this happen, but the VxWorks-based system doesn't seem to be experiencing the problem at all.
>
> To make things worse, the system is at a separate location and we can only test it in real time by sending an engineer to spend the night at 13800' just waiting for a failure that may or may not happen. It's quite a frustrating situation, because it seems site-dependent: the same boards have been running for weeks in our lab without a glitch! :-(
I'd recommend getting an IP camera or two. I made use of these in the past.
A great way to keep an eye on equipment remotely. The two-axis motorized
type is good as you don't have to spend as much time positioning.
I don't remember the exact models I've used, but there are a number of similar products.
https://www.amazon.com/Amcrest-1920TVL-Security-Wireless-IP2M-841B/dp/B0145OQTPG
I could look into this (we have IP cameras for other uses), but that would require getting clearance to run a known unstable system during regular operations. Which won't happen, because every reboot requires tuning, meaning ~20-30 minutes of downtime. But thanks for the suggestion :-)
> Andrew: the watchdog to trigger the abort looks like a great idea. I'll look into your interrupt latency suggestion, too. Our problem now would be to convince the users to put a known unstable system in operations just to see it fail :-)
>
> Regards,
> Ricardo
>
> On Thu, Jul 26, 2018 at 5:38 AM Andrew Johnson wrote:
>
> Hi Matt,
>
> On 07/25/2018 08:52 PM, Matt Rippa via Tech-talk wrote:
> > Is there a way to force an exception (or stack trace), for example with
> > watchdog?
> >
> > RTEMS 4.10.2/EPICS 3.14.12.7 MVME2307 BSP
>
> Back in 2000 we were having this kind of issue on some of our VxWorks
> systems, and I wrote some code for a couple of different CPU boards
> (MVME167 and 172) which connected an interrupt handler to the Abort
> button interrupt. When the button was pressed this routine dumped the
> status of a selected set of tasks into an area of memory that was
> configured to survive a reboot, allowing us to show that status after
> bringing the system back up.
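
For illustration only, here is a minimal sketch of what such a reboot-surviving status record might look like. It is not Andrew's code. All names (`crash_record`, `crash_record_save`, `crash_record_load`, `CRASH_MAGIC`) are hypothetical, and the record is an ordinary struct here; on a real board it would have to live at an address that the boot firmware and OS leave untouched across a reset, which is board-specific and not shown.

```c
#include <stdint.h>
#include <string.h>

#define CRASH_MAGIC    0x43524153u  /* arbitrary marker chosen for this sketch */
#define CRASH_TEXT_MAX 256

/* Hypothetical layout of the preserved memory area. */
typedef struct {
    uint32_t magic;     /* CRASH_MAGIC when a complete dump is present */
    uint32_t length;    /* bytes of text, excluding the terminator */
    uint32_t checksum;  /* simple checksum over text[0..length) */
    char     text[CRASH_TEXT_MAX];
} crash_record;

static uint32_t crash_checksum(const char *p, uint32_t n)
{
    uint32_t sum = 0;
    uint32_t i;
    for (i = 0; i < n; i++)
        sum = sum * 31u + (unsigned char)p[i];
    return sum;
}

/* Called from the abort handler: store a status string in the record.
 * The magic is written last so a half-written dump is not trusted.
 * (Real code would also need barriers or cache flushes here.) */
void crash_record_save(crash_record *rec, const char *status)
{
    size_t n = strlen(status);
    if (n >= CRASH_TEXT_MAX)
        n = CRASH_TEXT_MAX - 1;     /* truncate rather than overflow */
    memcpy(rec->text, status, n);
    rec->text[n] = '\0';
    rec->length = (uint32_t)n;
    rec->checksum = crash_checksum(rec->text, rec->length);
    rec->magic = CRASH_MAGIC;
}

/* Called after reboot: returns 1 and copies the text out if a valid
 * dump is present, 0 otherwise. A valid dump is consumed so that it
 * is not reported again on the next boot. */
int crash_record_load(crash_record *rec, char *out, size_t out_size)
{
    if (rec->magic != CRASH_MAGIC)
        return 0;
    if (rec->length >= CRASH_TEXT_MAX || rec->length >= out_size)
        return 0;
    if (crash_checksum(rec->text, rec->length) != rec->checksum)
        return 0;
    memcpy(out, rec->text, rec->length);
    out[rec->length] = '\0';
    rec->magic = 0;
    return 1;
}
```

The magic number and checksum are there because memory that "survives" a reset may still hold garbage after a power cycle, so the startup code should only report a dump it can validate.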
>
> My code would be no use to you on a different CPU board and OS, but the
> idea of connecting something up to an Abort button interrupt if your
> boards have one might help. You'll need the hardware manual for the CPU
> board to work out how to enable and connect the abort interrupt, but
> most Motorola/Emerson/Whoever boards do have such a button.
>
>
> I will add that the PowerPC CPUs seem to be a bit more prone to the CPU
> completely hanging up than the 68Ks were, which I think tends to happen
> if they get a Bus Error (PCIbus Target Abort) from code running inside
> an interrupt handler/ISR. If you have any ISRs that do VMEbus I/O you
> might want to look at whether they can be converted into high priority
> threads that sit waiting on a semaphore, and have the ISR do nothing but
> trigger that semaphore. This will increase the interrupt latency and
> jitter, but would prevent hangups if something goes wrong with the VME I/O.
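
As a rough sketch of the pattern described above (the ISR only signals, a thread does the bus I/O), the following uses POSIX threads and semaphores as stand-ins so it can be run on a desktop. On RTEMS or VxWorks you would use the OS's own interrupt-connect and semaphore calls and give the worker a high priority. All names here (`fake_isr`, `io_worker`, `do_bus_io`, `run_deferred_io_demo`) are hypothetical, and `do_bus_io` merely counts calls instead of touching the VMEbus.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>

static sem_t irq_sem;        /* signalled once per "interrupt" */
static atomic_int pending;   /* interrupts not yet serviced */
static int events_handled;   /* written only by the worker thread */

/* Stand-in for the VMEbus access that used to run inside the ISR. */
static void do_bus_io(void)
{
    events_handled++;
}

/* The "ISR": no bus access at all, it only wakes the worker. */
static void fake_isr(void)
{
    atomic_fetch_add(&pending, 1);
    sem_post(&irq_sem);
}

/* Worker thread: sleeps on the semaphore and does the I/O in thread
 * context, where a bus error can be handled as an ordinary exception. */
static void *io_worker(void *arg)
{
    (void)arg;
    for (;;) {
        while (sem_wait(&irq_sem) != 0) {
            /* retry if interrupted by a signal */
        }
        if (atomic_load(&pending) == 0)
            break;               /* woken with no work: shutdown request */
        atomic_fetch_sub(&pending, 1);
        do_bus_io();
    }
    return NULL;
}

/* Fire n simulated interrupts and return how many the worker serviced. */
int run_deferred_io_demo(int n_interrupts)
{
    pthread_t tid;
    int i, rc;

    atomic_store(&pending, 0);
    events_handled = 0;
    rc = sem_init(&irq_sem, 0, 0);
    assert(rc == 0);
    rc = pthread_create(&tid, NULL, io_worker, NULL);
    assert(rc == 0);

    for (i = 0; i < n_interrupts; i++)
        fake_isr();

    sem_post(&irq_sem);          /* extra wake-up with nothing pending */
    pthread_join(tid, NULL);
    sem_destroy(&irq_sem);
    return events_handled;
}
```

Calling `run_deferred_io_demo(5)` should report five serviced events. The cost is the one Andrew mentions: each event now waits for a context switch before the I/O happens.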
>
> HTH,
>
> - Andrew
>
> --
> Arguing for surveillance because you have nothing to hide is no
> different than making the claim, "I don't care about freedom of
> speech because I have nothing to say." -- Edward Snowden
>
ANJ, 07 Aug 2018