On Wed, 2006-08-16 at 17:22 -0500, Andrew Johnson wrote:
> Till Straumann wrote:
> > On Thu, 2006-08-10 at 14:58 -0500, Andrew Johnson wrote:
> >
> >> Interrupts may not be as quick at actually getting to the CPU as a
> >> Target Abort - I don't know whether modern CPUs finish off any/all
> >> instructions that they've already started running before they actually
> >> switch to processing the exception, but it's likely that there will be
> >> a number of instructions pending. This also supposes that interrupts
> >> are enabled at the time the bus error gets flagged.
> >
> > Yes, the latter is true. However, VME access is so slow that having
> > interrupts disabled around longer manipulations is not a good idea
> > anyway.
>
> It is sometimes impossible to write code that has to manipulate the
> interrupt registers of a VME slave card without disabling interrupts to
> the CPU.
>
> > Note that the 'machine check' generated by the target abort is also
> > just an external interrupt line. I can't see how that differs much
> > from using EE. On board designs using the Universe, the target abort
> > is generated by the host bridge and propagated via the MCP or TSA
> > line to the CPU, and is therefore also inherently asynchronous to
> > instruction execution.
>
> The Machine Check exception generated by the Target Abort is synchronous
> with the termination of the read cycle that caused the VME bus error,
> and it is thus possible to determine the instruction that caused the
> fault. For example, on an MVME2700 (Universe-2) with my BSP:
>
> mv2700> d 0xf0000000
> f0000000:
> VME Bus Error accessing A24: 0x000000
> machine check
> Exception next instruction address: 0x001ba5e0
> Machine Status Register: 0x0008b030
> Condition Register: 0x20004084
> Task: 0x1d3b1d0 "tShell"
>
> A disassembly shows the exception instruction:
>
> mv2700> l 0x001ba5d0
> 0x1ba5d0 3c60001f lis r3,0x1f # 31
> 0x1ba5d4 3ba10030 addi r29,r1,0x30 # 48
> 0x1ba5d8 386302b4 addi r3,r3,0x2b4 # 692
> 0x1ba5dc a0090000 lhz r0,0(r9)
> 0x1ba5e0 901e0004 stw r0,4(r30)
> 0x1ba5e4 93010030 stw r24,48(r1)
> 0x1ba5e8 a09e0006 lhz r4,6(r30)
> 0x1ba5ec 4cc63182 crxor crb6,crb6,crb6
> 0x1ba5f0 4bfe1b61 bl 0x19c150 # printf
>
> The instruction at 0x001ba5dc, one word before the reported next
> instruction address, is the lhz instruction that tried to read the
> location at A24:000000.
>
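The listing above can be cross-checked by hand. The following is a minimal sketch in C, assuming only the standard PowerPC D-form layout (6-bit primary opcode, two 5-bit register fields, 16-bit signed displacement). The names `dform`, `decode_dform` and `faulting_pc` are made up for illustration and are not part of any BSP.

```c
#include <stdint.h>

/* Fields of a PowerPC D-form instruction (the layout used by lhz and stw). */
struct dform {
    unsigned opcode;  /* primary opcode: top 6 bits (40 = lhz, 36 = stw) */
    unsigned rt;      /* target or source register: next 5 bits */
    unsigned ra;      /* base address register: next 5 bits */
    int16_t  d;       /* signed 16-bit displacement: low 16 bits */
};

/* Split a 32-bit instruction word into its D-form fields. */
static struct dform decode_dform(uint32_t insn)
{
    struct dform f;
    f.opcode = (insn >> 26) & 0x3f;
    f.rt     = (insn >> 21) & 0x1f;
    f.ra     = (insn >> 16) & 0x1f;
    f.d      = (int16_t)(insn & 0xffff);
    return f;
}

/* Assumption taken from the trace above: the reported "next instruction
 * address" is one 4-byte instruction past the load that faulted. */
static uint32_t faulting_pc(uint32_t next_insn_addr)
{
    return next_insn_addr - 4;
}
```

Applied to the trace, `faulting_pc(0x001ba5e0)` gives 0x001ba5dc, and `decode_dform(0xa0090000)` gives opcode 40, rt 0, ra 9, d 0, which matches `lhz r0,0(r9)` in the disassembly.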
> >> If the Bus Error occurs inside an interrupt service routine,
> >
> > I consider this a fatal, nonrecoverable error.
>
> I also consider it a pretty fatal error, but I want my hardware and OS
> to be able to tell me where it was when the problem occurred so I can
> quickly figure out what actually happened.
>
> > you're invariably gonna see more of this as CPUs get faster ;-)
>
> Actually CPUs have pretty much stopped getting faster nowadays (although
> the highest speeds haven't filtered through to the VME world yet); we're
> just putting them in parallel to achieve speedups now...
>
> > In any case, IMO, a bus error should be considered a serious error that
> > must be avoided (except for 'probing' during initialization)
> > because of the significant latencies that can be introduced
> > by a VME bus timeout.
>
> I'm not disputing that we should avoid bus errors, but they are a fact
> of life in a failing VME system. Unfortunately the Tempe chip's flawed
> design makes the system's response to one much less than ideal, given
> that the Target Abort mechanism is available on the PCIbus and Tundra
> have already managed to implement the necessary circuitry to use it in
> the Universe-2 chip.
>
> > Of course, write operations are completely asynchronous
> > and in that case, the only thing that can be done is reporting
> > that an error happened but there is no way to relate it
> > to a particular task/PC.
> >
> > Note that this is also true for the Universe (with write-posting
> > enabled).
>
> I am less concerned about write posting (I enable this myself) and even
> bus errors from write cycles, since they don't directly affect the
> operation of the running task and will almost always be surrounded by
> read cycles anyway so a card that develops a fault will soon signal its
> problem by faulting a read operation.
>
> What I object to is the completion of a failing read cycle with an
> all-1s bitpattern, because this can and probably will break any existing
> device drivers. In the past a driver was guaranteed that a bus error on
> a read cycle would stop it immediately at the read instruction and thus
> prevent further operation, whereas now drivers will have to be very
> defensive about all the data they read from the VMEbus.
>
> That's not going to be good for performance or portability, especially
> where all-1s is a valid bitpattern from a register that must be read
> inside an ISR. How can the ISR tell whether the value it read was real
> or not? The only way to find out is to ask the Tempe chip, so the code
> is no longer portable.
>
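To make the cost of that defensiveness concrete, here is a minimal host-side sketch in C of what a "checked read" might look like. Everything in it is hypothetical: `mock_bridge` stands in for the bridge chip, and `error_latched` models some status bit that the bridge sets when a read cycle ends in a bus error. It does not use real Tsi148 register names or any real driver API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated bridge and card state so the sketch runs on a host.
 * All names here are hypothetical, not a real bridge interface. */
struct mock_bridge {
    bool     card_present;   /* false models a failed or missing card */
    uint16_t card_register;  /* value a healthy card would return */
    bool     error_latched;  /* models a bridge "bus error seen" status bit */
};

/* Raw read: a failed cycle raises no exception; it completes with
 * all-1s data and only latches the error flag in the bridge. */
static uint16_t raw_read16(struct mock_bridge *b)
{
    if (!b->card_present) {
        b->error_latched = true;
        return 0xffff;
    }
    return b->card_register;
}

/* Checked read: returns false when the bridge reports that the cycle
 * failed. The bridge is consulted only when the data is all-1s, since
 * that is the only value a failed read can produce in this model. */
static bool checked_read16(struct mock_bridge *b, uint16_t *value)
{
    b->error_latched = false;        /* clear any stale status first */
    *value = raw_read16(b);
    if (*value == 0xffff && b->error_latched)
        return false;                /* the all-1s came from a bus error */
    return true;                     /* genuine data, possibly 0xffff */
}
```

Even in this toy form the driver has to know how to query the bridge, which is the portability problem described above. A real version would also need to consider other tasks or ISRs touching the same status bit between the clear and the check.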
> > However, in contrast to the Universe, write posting cannot be disabled
> > on the Tsi148 and that introduces problems with VME ISRs:
> ...
> > The only remedy here is reading something back from
> > the device prior to letting the ISR return (reading anything
> > flushes the tsi148's write-FIFO)
>
> This is actually something that all VME ISRs should be doing anyway,
> since even the VMEchip2 (as used on the MVME167 et al) implemented write
> posting.
>
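The read-back rule can be illustrated with a small host-side model in C. This is only a sketch: `posted_bus`, `mock_card` and the single `irq_ack` register are invented for the example, and the only behaviour borrowed from the discussion above is that a read flushes the posted writes ahead of it.

```c
#include <stdint.h>

#define FIFO_DEPTH 4

/* Hypothetical card with one register that acknowledges its interrupt. */
struct mock_card {
    uint16_t irq_ack;
};

/* Hypothetical bridge model: writes are queued (posted), not delivered. */
struct posted_bus {
    struct mock_card *card;
    uint16_t pending[FIFO_DEPTH];
    int count;
};

/* Deliver all queued writes to the card, oldest first. */
static void drain(struct posted_bus *bus)
{
    for (int i = 0; i < bus->count; i++)
        bus->card->irq_ack = bus->pending[i];
    bus->count = 0;
}

/* Posted write: returns at once, the card has not seen the value yet. */
static void bus_write16(struct posted_bus *bus, uint16_t value)
{
    if (bus->count == FIFO_DEPTH)
        drain(bus);                  /* model: a full FIFO is emptied first */
    bus->pending[bus->count++] = value;
}

/* Read: the bridge flushes its write FIFO before the read completes. */
static uint16_t bus_read16(struct posted_bus *bus)
{
    drain(bus);
    return bus->card->irq_ack;
}

/* ISR sketch: acknowledge the interrupt, then read back so the write
 * has reached the card before the ISR returns. */
static void isr_sketch(struct posted_bus *bus)
{
    bus_write16(bus, 1);             /* posted: card may still assert IRQ */
    (void)bus_read16(bus);           /* read-back forces the write through */
}
```

Without the read-back, the ISR in this model could return while the acknowledge is still sitting in the FIFO, so the card would still be requesting the interrupt and the ISR would be entered again.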
> > => IMO, the Tsi148's new features
> > (fast 2eVME and SST transfers among others)
> > outweigh the disadvantage that write-posting
> > cannot be disabled.
> > I don't share your negative assessment and
> > recommendation to stay away from 6100s.
>
> If you need the new features and speed then you'll probably be willing
> to recode any existing drivers or just accept that random things may
> happen in the event that some card fails. For operational sites like
> the APS with 224 different types of VME card used in our IOCs,
> revisiting all our device drivers isn't something we want to have to do...

We are concerned about the VMEbus issues that you raised.
The MVME2100 will reach End-of-Life (EOL) soon, and we are counting on the
MVME3100 and MVME6100 as its successors.
So I have filed a technical concern with both Motorola and Wind River.
Of course, this will lead to Tundra, but we need to get it resolved if
we want to move forward and have reliability.
I will post the results back here, hopefully in the near future.

Thanks,
Ernest
SNS Control Systems Group
ORNL
>
> - Andrew