EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: RE: VME Bus Error handling on MVME3100 and 6100 boards
From: Till Straumann <[email protected]>
To: "Thompson, David H." <[email protected]>
Cc: [email protected]
Date: Mon, 21 Aug 2006 11:21:13 -0700
On Thu, 2006-08-17 at 11:43 -0400, Thompson, David H. wrote:
> It seems to me that detecting a bus error and suspending the offending
> task may not always be what you want.  We have one card that does not
> have a driver thread, it has an ISR and device support for several
> record types.  This card occasionally has a transient lost clock
> condition that causes a bus time out and a bus error during record
> processing.  This causes a scan task to suspend, silently disabling all
> of the records in that scan task and any other records in the current
> lock set.   At that point the scan task is not recoverable.  In the
> ideal world, a bus error should cause an exception that invalidates the
> affected record and leaves the scan task running.  Then at least you
> have the ability to handle the error in some reasonable way and maybe
> even recover from it.  

You'd be better off eliminating the cause of the bus error. A bus
timeout stalls the CPU for a long time, which is unacceptable in a
real-time environment.


-- T.
> 
> One possible approach would be to send a signal to the current task or
> just poll the exception register after each record is processed. EPICS
> would need to provide a hook for polling hardware status after each
> record is scanned.
> 
> Another possibility would be to add a task switch hook routine to
> monitor the exception register. You may not be able to tell which
> instruction caused the error but you should be able to identify the task
> for the coupled reads at least. Admittedly it might take a long time to
> notice the error but it would eventually be noticed and assigned to the
> correct task.

> 
> Doing both these things together combined with a coding standard that
> requires a driver to do a read after the last write would provide
> coverage for bus errors without delving into the device drivers too
> much.
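
A minimal sketch of the "poll after each record" idea quoted above, written
in C against a simulated status bit rather than real hardware. Every name
here (sim_record, sim_bus_error_latched, sim_scan_once and so on) is
hypothetical and is not part of EPICS or of any BSP. A real hook would read
and clear the VME bridge's error status instead of a global flag, and would
mark the record through the usual alarm mechanism.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the bridge's latched bus-error status bit. */
volatile bool sim_bus_error_latched = false;

/* Hypothetical record: only the fields this sketch needs. */
typedef struct sim_record {
    void (*process)(struct sim_record *rec);
    bool invalid;   /* set when a bus error was seen during processing */
} sim_record;

/* Read and clear the latched status in one step. */
bool sim_take_bus_error(void)
{
    bool seen = sim_bus_error_latched;
    sim_bus_error_latched = false;
    return seen;
}

/* Process every record once, polling the status after each one.
 * A record that faulted is flagged invalid and the scan keeps running.
 * Returns the number of records flagged. */
size_t sim_scan_once(sim_record *recs, size_t n)
{
    size_t flagged = 0;

    assert(recs != NULL || n == 0);
    (void)sim_take_bus_error();   /* drop status left over from earlier */

    for (size_t i = 0; i < n; i++) {
        recs[i].invalid = false;
        recs[i].process(&recs[i]);
        if (sim_take_bus_error()) {
            recs[i].invalid = true;
            flagged++;
        }
    }
    return flagged;
}

/* Two process routines for trying the sketch out. */
void sim_process_ok(sim_record *rec) { (void)rec; }

void sim_process_faulty(sim_record *rec)
{
    (void)rec;
    sim_bus_error_latched = true;   /* pretend a VME read timed out */
}
```

With a list of three records where only the middle one uses
sim_process_faulty, sim_scan_once flags that record alone and still
processes the third. The sketch cannot attribute an error raised by a
posted write that completes after the poll, which is why the quoted text
pairs the hook with a read after the last write.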

> 
> > -----Original Message-----
> > From: Ernest L. Williams Jr. [mailto:[email protected]]
> > Sent: Wednesday, August 16, 2006 6:42 PM
> > To: Andrew Johnson
> > Cc: Till Straumann; EPICS tech-talk
> > Subject: Re: VME Bus Error handling on MVME3100 and 6100 boards
> > 
> > On Wed, 2006-08-16 at 17:22 -0500, Andrew Johnson wrote:
> > > Till Straumann wrote:
> > > > On Thu, 2006-08-10 at 14:58 -0500, Andrew Johnson wrote:
> > > >
> > > >> Interrupts may not be as quick at actually getting to the CPU as a
> > > >> Target Abort - I don't know whether modern CPUs finish off any/all
> > > >> instructions that they've already started running before they
> > > >> actually switch to processing the exception, but it's likely that
> > > >> there will be a number of instructions pending.  This also supposes
> > > >> that interrupts are enabled at the time the bus error gets flagged.
> > > >
> > > > Yes, the latter is true. However, VME access is so slow that having
> > > > interrupts disabled around longer manipulations is not a good idea
> > > > anyways.
> > >
> > > It is sometimes impossible to write code that has to manipulate the
> > > interrupt registers of a VME slave card without disabling interrupts
> > > to the CPU.
> > >
> > > > Note that the 'machine check' generated by the target abort is also
> > > > just an external interrupt line. I can't see how that differs much
> > > > from using EE. On board designs using the universe, the target
> > > > abort is generated by the host bridge and propagated via the MCP or
> > > > TSA line to the CPU and therefore inherently asynchronous to
> > > > instruction execution also.
> > >
> > > The Machine Check exception generated by the Target Abort is
> > > synchronous with the termination of the read cycle that caused the
> > > VME bus error, and it is thus possible to determine the instruction
> > > that caused the fault.  For example, on an MVME2700 (Universe-2) with
> > > my BSP:
> > >
> > > mv2700> d 0xf0000000
> > > f0000000:
> > > VME Bus Error accessing A24: 0x000000
> > > machine check
> > > Exception next instruction address: 0x001ba5e0
> > > Machine Status Register: 0x0008b030
> > > Condition Register: 0x20004084
> > > Task: 0x1d3b1d0 "tShell"
> > >
> > > A disassembly shows the exception instruction:
> > >
> > > mv2700> l 0x001ba5d0
> > > 0x1ba5d0  3c60001f    lis         r3,0x1f # 31
> > > 0x1ba5d4  3ba10030    addi        r29,r1,0x30 # 48
> > > 0x1ba5d8  386302b4    addi        r3,r3,0x2b4 # 692
> > > 0x1ba5dc  a0090000    lhz         r0,0(r9)
> > > 0x1ba5e0  901e0004    stw         r0,4(r30)
> > > 0x1ba5e4  93010030    stw         r24,48(r1)
> > > 0x1ba5e8  a09e0006    lhz         r4,6(r30)
> > > 0x1ba5ec  4cc63182    crxor       crb6,crb6,crb6
> > > 0x1ba5f0  4bfe1b61    bl          0x19c150 # printf
> > >
> > > The instruction at 0x001ba5dc is the lhz instruction that tried to
> > > read the location at A24:000000
> > >
> > > >> If the Bus Error occurs inside an interrupt service routine,
> > > >
> > > > I consider this a fatal, nonrecoverable error.
> > >
> > > I also consider it a pretty fatal error, but I want my hardware and
> > > OS to be able to tell me where it was when the problem occurred so I
> > > can quickly figure out what actually happened.
> > >
> > > > you're invariably gonna see more of this as CPUs get faster ;-)
> > >
> > > Actually CPUs have pretty much stopped getting faster nowadays
> > > (although the highest speeds haven't filtered through to the VME
> > > world yet); we're just putting them in parallel to achieve speedups
> > > now...
> > >
> > > > In any case, IMO, a bus error should be considered a serious error
> > > > that must be avoided (except for 'probing' during initialization)
> > > > because of the significant latencies that can be introduced
> > > > by a VME bus timeout.
> > >
> > > I'm not disputing that we should avoid bus errors, but they are a
> > > fact of life in a failing VME system.  Unfortunately the Tempe chip's
> > > flawed design makes the system's response to one much less than
> > > ideal, given that the Target Abort mechanism is available on the
> > > PCIbus and Tundra have already managed to implement the necessary
> > > circuitry to use it in the Universe-2 chip.
> > >
> > > > Of course, write operations are completely asynchronous
> > > > and in that case, the only thing that can be done is reporting
> > > > that an error happened but there is no way to relate it
> > > > to a particular task/PC.
> > > >
> > > > Note that this is also true for the Universe (with write-posting
> > > > enabled).
> > >
> > > I am less concerned about write posting (I enable this myself) and
> > > even bus errors from write cycles, since they don't directly affect
> > > the operation of the running task and will almost always be
> > > surrounded by read cycles anyway so a card that develops a fault will
> > > soon signal its problem by faulting a read operation.
> > >
> > > What I object to is the completion of a failing read cycle with an
> > > all-1s bitpattern, because this can and probably will break any
> > > existing device drivers.  In the past a driver was guaranteed that a
> > > bus error on a read cycle would stop it immediately at the read
> > > instruction and thus prevent further operation, whereas now drivers
> > > will have to be very defensive about all the data they read from the
> > > VMEbus.
> > >
> > > That's not going to be good for performance or portability,
> > > especially where all-1's is a valid bitpattern from a register that
> > > must be read inside an ISR (how can the ISR tell whether the value it
> > > read was real or not?  The only way to find out is to ask the Tempe
> > > chip, so the code is no longer portable).
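
The ambiguity described above can be illustrated with a small C sketch
that runs against a simulated bus. All names (sim_vme, sim_vme_read16,
sim_checked_read16) are hypothetical, and the err_latched field is only
a stand-in for whatever error status the bridge exposes. It is not the
Tsi148 register layout. The point is that a returned 0xFFFF can only be
trusted after a second, bridge-specific check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simulated bus with a single 16-bit register behind it. */
typedef struct {
    uint16_t value;        /* what a healthy card would return */
    bool     card_dead;    /* simulate a card that no longer responds */
    bool     err_latched;  /* stand-in for the bridge's error status */
} sim_vme;

/* Raw read: a failed cycle completes with all-1s and latches the error,
 * mimicking the behaviour objected to in the text. */
uint16_t sim_vme_read16(sim_vme *bus)
{
    if (bus->card_dead) {
        bus->err_latched = true;
        return 0xFFFFu;
    }
    return bus->value;
}

/* Defensive read: returns false if the cycle failed, true otherwise.
 * On success the value is stored in *out, even when it is 0xFFFF. */
bool sim_checked_read16(sim_vme *bus, uint16_t *out)
{
    uint16_t v;

    assert(bus != NULL && out != NULL);

    bus->err_latched = false;          /* clear stale status first */
    v = sim_vme_read16(bus);

    /* All-1s alone proves nothing; only the bridge status can tell. */
    if (v == 0xFFFFu && bus->err_latched)
        return false;

    *out = v;
    return true;
}
```

A healthy register holding 0xFFFF is returned as valid, while the same
bit pattern from a dead card is rejected. The extra status access on
every suspicious read is the performance and portability cost the quoted
paragraph is worried about.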
> > >
> > > > However, in contrast to the universe, write posting cannot be
> > > > disabled on the Tsi148 and that introduces problems with VME ISRs:
> > > ...
> > > > The only remedy here is reading something back from
> > > > the device prior to letting the ISR return (reading anything
> > > > flushes the tsi148's write-FIFO)
> > >
> > > This is actually something that all VME ISRs should be doing anyway,
> > > since even the VMEchip2 (as used on the MVME167 et al) implemented
> > > write posting.
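
The read-back rule discussed above, sketched in C with a simulated
one-entry write FIFO. The names (sim_dev, sim_posted_write, sim_read,
sim_isr) and the register layout are hypothetical. On real hardware the
write and read would be volatile accesses to the card's mapped registers,
and the bridge rather than this code would do the flushing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device behind a bridge that posts writes. */
typedef struct {
    uint16_t irq_ack;        /* register contents as the card sees them */
    uint16_t pending_value;  /* one-deep stand-in for the write FIFO */
    bool     pending;        /* true while a write has not reached the card */
} sim_dev;

/* A posted write returns at once; the card has not seen it yet. */
void sim_posted_write(sim_dev *d, uint16_t v)
{
    assert(d != NULL);
    d->pending_value = v;
    d->pending = true;
}

/* Any read forces queued writes out to the card before it completes. */
uint16_t sim_read(sim_dev *d)
{
    assert(d != NULL);
    if (d->pending) {
        d->irq_ack = d->pending_value;
        d->pending = false;
    }
    return d->irq_ack;
}

/* ISR shape: acknowledge the interrupt, then read something back so the
 * acknowledge has reached the card before the ISR returns. */
void sim_isr(sim_dev *d)
{
    sim_posted_write(d, 1u);   /* clear the interrupt source on the card */
    (void)sim_read(d);         /* flush: without this the IRQ may re-fire */
}
```

After sim_isr returns, the pending flag is clear and irq_ack holds the
written value. Leaving out the sim_read call leaves the write queued,
which is the situation where a real card would still be asserting its
interrupt when the CPU re-enables it.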
> > >
> > > > => IMO,  the Tsi148's  new features
> > > >      (fast 2eVME and SST transfers among others)
> > > >      outweigh the disadvantage that write-posting
> > > >      cannot be disabled.
> > > >         I don't share your negative assessment and
> > > >      recommendation to stay away from 6100s.
> > >
> > > If you need the new features and speed then you'll probably be
> > > willing to recode any existing drivers or just accept that random
> > > things may happen in the event that some card fails.  For operational
> > > sites like the APS with 224 different types of VME card used in our
> > > IOCs, revisiting all our device drivers isn't something we want to
> > > have to do...
> > 
> > 
> > We are concerned about the VMEBus issues that you raised.
> > The MVME2100 will be End-of-Life (EOL) soon.  We are counting on the
> > MVME3100 and MVME6100 as successors.
> > 
> > So, I have filed a technical concern with both Motorola and WindRiver.
> > Of course, this will lead to TUNDRA but we need to get this resolved
> > if we want to move forward and have reliability.
> > 
> > I will post the results back here hopefully in the near future.
> > 
> > 
> > Thanks,
> > Ernest
> > SNS Control Systems Group
> > ORNL
> > 
> > 
> > 
> > >
> > > - Andrew
> 


