On Wed, 2006-08-16 at 18:42 -0400, Ernest L. Williams Jr. wrote:
> On Wed, 2006-08-16 at 17:22 -0500, Andrew Johnson wrote:
> > Till Straumann wrote:
> > > On Thu, 2006-08-10 at 14:58 -0500, Andrew Johnson wrote:
> > >
> > >> Interrupts may not be as quick at actually getting to the CPU as a
> > >> Target Abort - I don't know whether modern CPUs finish off any/all
> > >> instructions that they've already started running before they actually
> > >> switch to processing the exception, but it's likely that there will be
> > >> a number of instructions pending. This also supposes that interrupts
> > >> are enabled at the time the bus error gets flagged.
> > >
> > > Yes, the latter is true. However, VME access is so slow that having
> > > interrupts disabled around longer manipulations is not a good idea
> > > anyways.
> >
> > It is sometimes impossible to write code that has to manipulate the
> > interrupt registers of a VME slave card without disabling interrupts to
> > the CPU.
> >
> > > Note that the 'machine check' generated by the target abort is also
> > > just an external interrupt line. I can't see how that differs much
> > > from using EE. On board designs using the universe, the target abort
> > > is generated by the host bridge and propagated via the MCP or TSA
> > > line to the CPU and therefore inherently asynchronous to instruction
> > > execution also.
> >
> > The Machine Check exception generated by the Target Abort is synchronous
> > with the termination of the read cycle that caused the VME bus error,
> > and it is thus possible to determine the instruction that caused the
> > fault. For example, on an MVME2700 (Universe-2) with my BSP:
> >
> > mv2700> d 0xf0000000
> > f0000000:
> > VME Bus Error accessing A24: 0x000000
> > machine check
> > Exception next instruction address: 0x001ba5e0
> > Machine Status Register: 0x0008b030
> > Condition Register: 0x20004084
> > Task: 0x1d3b1d0 "tShell"
> >
> > A disassembly shows the exception instruction:
> >
> > mv2700> l 0x001ba5d0
> > 0x1ba5d0 3c60001f lis r3,0x1f # 31
> > 0x1ba5d4 3ba10030 addi r29,r1,0x30 # 48
> > 0x1ba5d8 386302b4 addi r3,r3,0x2b4 # 692
> > 0x1ba5dc a0090000 lhz r0,0(r9)
> > 0x1ba5e0 901e0004 stw r0,4(r30)
> > 0x1ba5e4 93010030 stw r24,48(r1)
> > 0x1ba5e8 a09e0006 lhz r4,6(r30)
> > 0x1ba5ec 4cc63182 crxor crb6,crb6,crb6
> > 0x1ba5f0 4bfe1b61 bl 0x19c150 # printf
> >
> > The instruction at 0x001ba5dc is the lhz instruction that tried to read
> > the location at A24:000000
> >
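For anyone following the arithmetic in the dump above, here is a minimal sketch in C. It assumes the reported "next instruction address" simply follows the faulting instruction (no branch in between) and that instructions are 4 bytes long. The names `faulting_pc`, `decode_dform` and `dform_t` are made up for illustration and do not come from any BSP.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a PowerPC D-form load/store instruction word. */
typedef struct {
    unsigned opcode;  /* primary opcode, top 6 bits (40 = lhz, 36 = stw) */
    unsigned rt;      /* target (load) or source (store) register */
    unsigned ra;      /* base address register */
    int16_t  disp;    /* signed 16-bit displacement */
} dform_t;

/* Assumption: the exception reports the address of the instruction after
 * the one that faulted, and every instruction is 4 bytes long. */
uint32_t faulting_pc(uint32_t next_insn_addr)
{
    assert(next_insn_addr >= 4u);
    return next_insn_addr - 4u;
}

/* Split a D-form instruction word into its fields. */
dform_t decode_dform(uint32_t insn)
{
    dform_t d;
    d.opcode = (insn >> 26) & 0x3fu;
    d.rt     = (insn >> 21) & 0x1fu;
    d.ra     = (insn >> 16) & 0x1fu;
    d.disp   = (int16_t)(insn & 0xffffu);
    return d;
}
```

With the values from the dump, `faulting_pc(0x001ba5e0)` gives 0x001ba5dc, and decoding the word 0xa0090000 found there gives opcode 40 with rt = 0, ra = 9 and displacement 0, which matches `lhz r0,0(r9)`.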
> > >> If the Bus Error occurs inside an interrupt service routine,
> > >
> > > I consider this a fatal, nonrecoverable error.
> >
> > I also consider it a pretty fatal error, but I want my hardware and OS
> > to be able to tell me where it was when the problem occurred so I can
> > quickly figure out what actually happened.
> >
> > > you're invariably gonna see more of this as CPUs get faster ;-)
> >
> > Actually CPUs have pretty much stopped getting faster nowadays (although
> > the highest speeds haven't filtered through to the VME world yet); we're
> > just putting them in parallel to achieve speedups now...
> >
> > > In any case, IMO, a bus error should be considered a serious error that
> > > must be avoided (except for 'probing' during initialization)
> > > because of the significant latencies that can be introduced
> > > by a VME bus timeout.
> >
> > I'm not disputing that we should avoid bus errors, but they are a fact
> > of life in a failing VME system. Unfortunately the Tempe chip's flawed
> > design makes the system's response to one much less than ideal, given
> > that the Target Abort mechanism is available on the PCIbus and Tundra
> > have already managed to implement the necessary circuitry to use it in
> > the Universe-2 chip.
> >
> > > Of course, write operations are completely asynchronous
> > > and in that case, the only thing that can be done is reporting
> > > that an error happened but there is no way to relate it
> > > to a particular task/PC.
> > >
> > > Note that this is also true for the Universe (with write-posting
> > > enabled).
> >
> > I am less concerned about write posting (I enable this myself) and even
> > bus errors from write cycles, since they don't directly affect the
> > operation of the running task and will almost always be surrounded by
> > read cycles anyway so a card that develops a fault will soon signal its
> > problem by faulting a read operation.
> >
> > What I object to is the completion of a failing read cycle with an
> > all-1s bitpattern, because this can and probably will break any existing
> > device drivers. In the past a driver was guaranteed that a bus error on
> > a read cycle would stop it immediately at the read instruction and thus
> > prevent further operation, whereas now drivers will have to be very
> > defensive about all the data they read from the VMEbus.
> >
> > That's not going to be good for performance or portability, especially
> > where all-1's is a valid bitpattern from a register that must be read
> > inside an ISR (how can the ISR tell whether the value it read was real
> > or not? The only way to find out is to ask the Tempe chip, so the code
> > is no longer portable).
> >
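To make the "defensive" read concrete, here is a rough sketch in C of what a driver might have to do. It is only an illustration under assumptions: `berr_status` and `berr_mask` stand for whatever register and bit the bridge uses to latch a VMEbus error, and are hypothetical names, not taken from the Tsi148 manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { VME_READ_OK = 0, VME_READ_BERR = 1 } vme_read_status;

/*
 * Read a 16-bit register and, only when the value is all-1s, consult a
 * bridge error flag to decide whether the value is real.
 *
 * berr_status and berr_mask are hypothetical placeholders for the
 * bridge-specific error latch.
 */
vme_read_status vme_read16_checked(const volatile uint16_t *reg,
                                   const volatile uint32_t *berr_status,
                                   uint32_t berr_mask,
                                   uint16_t *out)
{
    assert(reg != NULL && berr_status != NULL && out != NULL);

    uint16_t value = *reg;

    /* Any value other than all-1s cannot be the filler pattern. */
    if (value == 0xffffu && (*berr_status & berr_mask) != 0u)
        return VME_READ_BERR;   /* *out is left untouched */

    *out = value;
    return VME_READ_OK;
}
```

The sketch does not clear the error latch, so a flag left set by some unrelated access (a posted write, for instance) would be reported as a failed read here. It also shows the portability problem: the second and third arguments depend on the bridge chip.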
> > > However, in contrast to the universe, write posting cannot be disabled
> > > on the Tsi148 and that introduces problems with VME ISRs:
> > ...
> > > The only remedy here is reading something back from
> > > the device prior to letting the ISR return (reading anything
> > > flushes the tsi148's write-FIFO)
> >
> > This is actually something that all VME ISRs should be doing anyway,
> > since even the VMEchip2 (as used on the MVME167 et al) implemented write
> > posting.
> >
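The read-back described above might look roughly like this in C. This is a sketch only: the register pointers and the name `isr_ack_and_flush` are hypothetical, and a real PowerPC driver may also need an I/O ordering barrier between the write and the read, which is left out here because it is platform-specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Acknowledge a card's interrupt, then read something back from the same
 * card so that a posted write has reached it before the ISR returns.
 * Returns the value read back in case the caller wants to inspect it.
 */
uint16_t isr_ack_and_flush(volatile uint16_t *ack_reg,
                           uint16_t ack_value,
                           const volatile uint16_t *readback_reg)
{
    assert(ack_reg != NULL && readback_reg != NULL);

    *ack_reg = ack_value;    /* may sit in the bridge's write FIFO */
    return *readback_reg;    /* a read cannot complete until the FIFO drains */
}
```

Any readable register on the card will do for `readback_reg`, provided that reading it has no side effects.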
> > > => IMO, the Tsi148's new features
> > > (fast 2eVME and SST transfers among others)
> > > outweigh the disadvantage that write-posting
> > > cannot be disabled.
> > > I don't share your negative assessment and
> > > recommendation to stay away from 6100s.
> >
> > If you need the new features and speed then you'll probably be willing
> > to recode any existing drivers or just accept that random things may
> > happen in the event that some card fails. For operational sites like
> > the APS with 224 different types of VME card used in our IOCs,
> > revisiting all our device drivers isn't something we want to have to do...
>
>
> We are concerned about the VMEBus issues that you raised.
> The MVME2100 will be End-of-Life (EOL) soon. We are counting on the
> MVME3100 and MVME6100 as successors.
>
> So, I have filed a technical concern with both Motorola and WindRiver.
> Of course, this will lead to TUNDRA but we need to get this resolved if
> we want to move forward and have reliability.
>
> I will post the results back here hopefully in the near future.
========== Response from Motorola below ===============================
Motorola has received the following response from an engineer who was
involved with the TEMPE and from Tundra:
Tundra states that "the error handling with respect to a BERR
termination as a VMEbus master works as described in the Users Manual.
Tundra has no plans of changing this." If the customer doesn't feel
that signaling via an interrupt is sufficient then we have no
recourse. Please note that both Mac and Tundra agree that a failed read
of a valid VME address would likely occur only as a result of a
catastrophic system failure, i.e. the target board smoked. The
probability that a failed read, when (or if) retried would return data
is essentially zero.
========================================================================
I am not sure what to make of that initial response.
>
>
> Thanks,
> Ernest
> SNS Control Systems Group
> ORNL
>
>
>
> >
> > - Andrew
>