On Thu, 2006-08-17 at 11:43 -0400, Thompson, David H. wrote:
> It seems to me that detecting a bus error and suspending the offending
> task may not always be what you want. We have one card that does not
> have a driver thread, it has an ISR and device support for several
> record types. This card occasionally has a transient lost clock
> condition that causes a bus timeout and a bus error during record
> processing. This causes a scan task to suspend, silently disabling all
> of the records in that scan task and any other records in the current
> lock set. At that point the scan task is not recoverable. In an
> ideal world, a bus error should cause an exception that invalidates the
> affected record and leaves the scan task running. Then at least you
> have the ability to handle the error in some reasonable way and maybe
> even recover from it.
You'd rather eliminate the cause of the bus error. A bus timeout
stalls the CPU for a long time - this is unacceptable in a real-time
environment.
-- T.
>
> One possible approach would be to send a signal to the current task or
> just poll the exception register after each record is processed. EPICS
> would need to provide a hook for polling hardware status after each
> record is scanned.
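>
> A rough sketch of what such a hook could look like is below. Neither the
> hook itself nor readVmeExceptionStatus()/clearVmeExceptionStatus() exist
> today; they are hypothetical names for the EPICS hook and for whatever
> board-specific register reports a pending bus error:
>
>   #include <dbCommon.h>
>   #include <recGbl.h>
>   #include <alarm.h>
>
>   /* Hypothetical board-specific accessors (not real APIs): */
>   extern int  readVmeExceptionStatus(void);
>   extern void clearVmeExceptionStatus(void);
>
>   /* Hypothetical hook, called by the scan task after each record is
>    * processed.  If the hardware reports a pending bus error, only the
>    * record just processed is invalidated; the scan task keeps running. */
>   static void postRecordProcessHook(struct dbCommon *precord)
>   {
>       if (readVmeExceptionStatus() != 0) {
>           recGblSetSevr(precord, READ_ALARM, INVALID_ALARM);
>           clearVmeExceptionStatus();
>       }
>   }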
>
> Another possibility would be to add a task switch hook routine to
> monitor the exception register. You may not be able to tell which
> instruction caused the error, but you should be able to identify the
> responsible task, at least for coupled reads. Admittedly it might take
> a long time to notice the error, but it would eventually be noticed and
> assigned to the correct task.
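>
> For illustration only, such a switch hook on vxWorks might be shaped
> like this; readVmeExceptionStatus()/clearVmeExceptionStatus() are again
> stand-ins for whatever the bridge actually provides:
>
>   #include <vxWorks.h>
>   #include <taskLib.h>
>   #include <taskHookLib.h>
>   #include <logLib.h>
>
>   extern int  readVmeExceptionStatus(void);   /* hypothetical */
>   extern void clearVmeExceptionStatus(void);  /* hypothetical */
>
>   /* Runs at every context switch: if a bus error has been latched
>    * since the last switch, charge it to the task being switched out. */
>   static void vmeErrSwitchHook(WIND_TCB *pOldTcb, WIND_TCB *pNewTcb)
>   {
>       if (readVmeExceptionStatus() != 0) {
>           logMsg("VME bus error while TCB %#x was running\n",
>                  (int)pOldTcb, 0, 0, 0, 0, 0);
>           clearVmeExceptionStatus();
>       }
>   }
>
>   /* Installed once at startup with:
>    *     taskSwitchHookAdd((FUNCPTR)vmeErrSwitchHook);
>    */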
>
> Doing both of these things, combined with a coding standard that
> requires a driver to do a read after its last write, would provide
> coverage for bus errors without delving into the device drivers too
> much.
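>
> As a sketch of that coding standard (the register names are made up),
> the dummy read-back after the last write is what drags any posted write
> out onto the VMEbus, so a failure surfaces in the driver that caused it:
>
>   #include <epicsTypes.h>
>
>   #define CARD_CSR   0          /* hypothetical register offset */
>   #define CSR_ARM    0x0001     /* hypothetical bit definition  */
>
>   static void cardArm(volatile epicsUInt16 *regs)
>   {
>       regs[CARD_CSR] = CSR_ARM;     /* last write of the sequence      */
>       (void)regs[CARD_CSR];         /* dummy read flushes posted write */
>   }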
>
> > -----Original Message-----
> > From: Ernest L. Williams Jr. [mailto:[email protected]]
> > Sent: Wednesday, August 16, 2006 6:42 PM
> > To: Andrew Johnson
> > Cc: Till Straumann; EPICS tech-talk
> > Subject: Re: VME Bus Error handling on MVME3100 and 6100 boards
> >
> > On Wed, 2006-08-16 at 17:22 -0500, Andrew Johnson wrote:
> > > Till Straumann wrote:
> > > > On Thu, 2006-08-10 at 14:58 -0500, Andrew Johnson wrote:
> > > >
> > > >> Interrupts may not be as quick at actually getting to the CPU as a
> > > >> Target Abort - I don't know whether modern CPUs finish off any/all
> > > >> instructions that they've already started running before they
> > > >> actually switch to processing the exception, but it's likely that
> > > >> there will be a number of instructions pending. This also supposes
> > > >> that interrupts are enabled at the time the bus error gets flagged.
> > > >
> > > > Yes, the latter is true. However, VME access is so slow that
> > > > having interrupts disabled around longer manipulations is not a
> > > > good idea anyways.
> > >
> > > It is sometimes impossible to write code that has to manipulate the
> > > interrupt registers of a VME slave card without disabling interrupts
> > > to the CPU.
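> > >
> > > For example, a read-modify-write of a slave card's interrupt enable
> > > register has to be protected roughly like this on vxWorks (the
> > > register and bit names are only illustrative):
> > >
> > >   #include <vxWorks.h>
> > >   #include <intLib.h>
> > >   #include <epicsTypes.h>
> > >
> > >   #define CH3_IRQ_ENABLE 0x0008    /* hypothetical enable bit */
> > >
> > >   static void enableChannel3(volatile epicsUInt16 *intEnableReg)
> > >   {
> > >       int key = intLock();              /* lock CPU interrupts     */
> > >       epicsUInt16 mask = *intEnableReg; /* slow VME read           */
> > >       *intEnableReg = mask | CH3_IRQ_ENABLE;
> > >       intUnlock(key);                   /* restore interrupt state */
> > >   }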
> > >
> > > > Note that the 'machine check' generated by the target abort is
> > > > also just an external interrupt line. I can't see how that differs
> > > > much from using EE. On board designs using the universe, the target
> > > > abort is generated by the host bridge and propagated via the MCP or
> > > > TSA line to the CPU and therefore inherently asynchronous to
> > > > instruction execution also.
> > >
> > > The Machine Check exception generated by the Target Abort is
> > > synchronous with the termination of the read cycle that caused the
> > > VME bus error, and it is thus possible to determine the instruction
> > > that caused the fault. For example, on an MVME2700 (Universe-2) with
> > > my BSP:
> > >
> > > mv2700> d 0xf0000000
> > > f0000000:
> > > VME Bus Error accessing A24: 0x000000
> > > machine check
> > > Exception next instruction address: 0x001ba5e0
> > > Machine Status Register: 0x0008b030
> > > Condition Register: 0x20004084
> > > Task: 0x1d3b1d0 "tShell"
> > >
> > > A disassembly shows the exception instruction:
> > >
> > > mv2700> l 0x001ba5d0
> > > 0x1ba5d0 3c60001f lis r3,0x1f # 31
> > > 0x1ba5d4 3ba10030 addi r29,r1,0x30 # 48
> > > 0x1ba5d8 386302b4 addi r3,r3,0x2b4 # 692
> > > 0x1ba5dc a0090000 lhz r0,0(r9)
> > > 0x1ba5e0 901e0004 stw r0,4(r30)
> > > 0x1ba5e4 93010030 stw r24,48(r1)
> > > 0x1ba5e8 a09e0006 lhz r4,6(r30)
> > > 0x1ba5ec 4cc63182 crxor crb6,crb6,crb6
> > > 0x1ba5f0 4bfe1b61 bl 0x19c150 # printf
> > >
> > > The instruction at 0x001ba5dc is the lhz instruction that tried to
> > > read the location at A24:000000.
> > >
> > > >> If the Bus Error occurs inside an interrupt service routine,
> > > >
> > > > I consider this a fatal, nonrecoverable error.
> > >
> > > I also consider it a pretty fatal error, but I want my hardware and
> > > OS to be able to tell me where it was when the problem occurred so I
> > > can quickly figure out what actually happened.
> > >
> > > > you're invariably gonna see more of this as CPUs get faster ;-)
> > >
> > > Actually CPUs have pretty much stopped getting faster nowadays
> > > (although the highest speeds haven't filtered through to the VME
> > > world yet); we're just putting them in parallel to achieve speedups
> > > now...
> > >
> > > > In any case, IMO, a bus error should be considered a serious error
> > > > that must be avoided (except for 'probing' during initialization)
> > > > because of the significant latencies that can be introduced by a
> > > > VME bus timeout.
> > >
> > > I'm not disputing that we should avoid bus errors, but they are a
> > > fact of life in a failing VME system. Unfortunately the Tempe chip's
> > > flawed design makes the system's response to one much less than
> > > ideal, given that the Target Abort mechanism is available on the
> > > PCIbus and Tundra have already managed to implement the necessary
> > > circuitry to use it in the Universe-2 chip.
> > >
> > > > Of course, write operations are completely asynchronous
> > > > and in that case, the only thing that can be done is reporting
> > > > that an error happened but there is no way to relate it
> > > > to a particular task/PC.
> > > >
> > > > Note that this is also true for the Universe (with write-posting
> > > > enabled).
> > >
> > > I am less concerned about write posting (I enable this myself) and
> > > even bus errors from write cycles, since they don't directly affect
> > > the operation of the running task and will almost always be
> > > surrounded by read cycles anyway, so a card that develops a fault
> > > will soon signal its problem by faulting a read operation.
> > >
> > > What I object to is the completion of a failing read cycle with an
> > > all-1s bitpattern, because this can and probably will break any
> > > existing device drivers. In the past a driver was guaranteed that a
> > > bus error on a read cycle would stop it immediately at the read
> > > instruction and thus prevent further operation, whereas now drivers
> > > will have to be very defensive about all the data they read from the
> > > VMEbus.
> > >
> > > That's not going to be good for performance or portability,
> > > especially where all-1s is a valid bitpattern from a register that
> > > must be read inside an ISR (how can the ISR tell whether the value it
> > > read was real or not? The only way to find out is to ask the Tempe
> > > chip, so the code is no longer portable).
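> > >
> > > In other words, a "defensive" read would have to look something like
> > > the following, where tsi148BusErrorOccurred() is a made-up name for
> > > exactly the kind of bridge-specific query that makes the code
> > > non-portable:
> > >
> > >   #include <epicsTypes.h>
> > >
> > >   extern int tsi148BusErrorOccurred(void);   /* hypothetical query */
> > >
> > >   /* Returns -1 if the read actually bus-erred, else the value. */
> > >   static int defensiveRead(volatile epicsUInt16 *reg)
> > >   {
> > >       epicsUInt16 value = *reg;
> > >       if (value == 0xFFFF && tsi148BusErrorOccurred())
> > >           return -1;              /* all-1s was not real data */
> > >       return value;
> > >   }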
> > >
> > > > However, in contrast to the universe, write posting cannot be
> > > > disabled on the Tsi148 and that introduces problems with VME ISRs:
> > > ...
> > > > The only remedy here is reading something back from
> > > > the device prior to letting the ISR return (reading anything
> > > > flushes the tsi148's write-FIFO)
> > >
> > > This is actually something that all VME ISRs should be doing anyway,
> > > since even the VMEchip2 (as used on the MVME167 et al.) implemented
> > > write posting.
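> > >
> > > The usual shape of such an ISR is something like this (register
> > > offsets are illustrative): acknowledge the interrupt with a write,
> > > then read anything back so the posted write is on the bus before the
> > > ISR returns:
> > >
> > >   #include <epicsTypes.h>
> > >
> > >   #define IRQ_STATUS 1     /* hypothetical register offsets */
> > >   #define IRQ_CLEAR  2
> > >
> > >   static void cardIsr(void *pvt)
> > >   {
> > >       volatile epicsUInt16 *regs = (volatile epicsUInt16 *)pvt;
> > >
> > >       epicsUInt16 source = regs[IRQ_STATUS]; /* identify the cause    */
> > >       regs[IRQ_CLEAR] = source;              /* posted write          */
> > >       (void)regs[IRQ_STATUS];                /* read back: flush FIFO */
> > >       /* ...queue the work to a driver thread, post I/O intr scan... */
> > >   }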
> > >
> > > > => IMO, the Tsi148's new features
> > > > (fast 2eVME and SST transfers among others)
> > > > outweigh the disadvantage that write-posting
> > > > cannot be disabled.
> > > > I don't share your negative assessment and
> > > > recommendation to stay away from 6100s.
> > >
> > > If you need the new features and speed then you'll probably be
> > > willing to recode any existing drivers or just accept that random
> > > things may happen in the event that some card fails. For operational
> > > sites like the APS with 224 different types of VME card used in our
> > > IOCs, revisiting all our device drivers isn't something we want to
> > > have to do...
> >
> >
> > We are concerned about the VMEBus issues that you raised.
> > The MVME2100 will be End-of-Life (EOL) soon. We are counting on the
> > MVME3100 and MVME6100 as successors.
> >
> > So, I have filed a technical concern with both Motorola and WindRiver.
> > Of course, this will lead to TUNDRA, but we need to get this resolved
> > if we want to move forward and have reliability.
> >
> > I will post the results back here hopefully in the near future.
> >
> >
> > Thanks,
> > Ernest
> > SNS Control Systems Group
> > ORNL
> >
> >
> >
> > >
> > > - Andrew
>