Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: "The problem admits of no explanation...."
From: wbrown@csg.lbl.gov (Bill Brown)
Date: Wed, 8 Jun 94 10:12:19 PDT
I've been fighting with a weird kind of problem.  I'll try to explain what's going
on, and what I've done.

Ok - we have one system, which I'll call the "dev system", on subnet i.j.70,
and another system, which I'll call the "prod system", on subnet i.j.71.  The
server is on the '70 subnet, and a "gateway" is specified in the boot params
of the prod system.  Several routes and hosts are added during the startup process.
A couple of file systems on the server are also nfsMounted.

Each system consists of an 8-meg mv167, a VMIVME-2534 parallel i/o board (I wrote
the drv and dev stuff for it), and a N.I. GPIB-1014 board.

In the dev system, everything seems to boot and operate normally.  In the prod
system, I get a large burst of "error interrupts" when the GPIB board is
initialized or serviced.  Seems simple enough.  But there's more.

I carry the dev system (Tracewell bin and all) out to the ring, hook up the
e-net drop, and reconfigure the boot params so it "thinks" it's the prod system.
When it boots, it behaves _exactly_ the same as the "prod" system.  I carry it
back to my office, reconfigure the boot params, and it behaves normally.

I have interchanged all of the boards in both systems, in just about all
combinations and permutations.  In all cases the dev system works normally
and the prod system gets error interrupts from the GPIB interface.

I have removed the vmivme2534 board; it made no difference in either system.
I've checked for addressing conflict; according to the relevant chapter in
my EPICS manual there is no conflict.  If there were, I would expect to see
the problem in both systems.  Oh yes - the '2534 does not generate interrupts;
it lacks hardware to do so.

Both systems are loading the same vxWorks kernel _from_the_same_file_on_the
_same_server_; this is also the case with EPICS (v3.11.1) and the database.

I have tried the experiment with one other vme bin in the ring, and it behaves
in the same manner as the prod system.

There are a couple of side issues which seem to have little to do with the
problem, but I'll throw them in.

The server seems to be "bogging down" at various times, judging from various
delays in the startup process.  Sometimes I get a timeout while trying to
connect to the log server on the server machine.  This is true from both
nets, but I have the feeling that it happens more frequently from the prod
system.

When loading the database, both systems report errors trying to open the
"default.sdrSum".  The file exists, and the (read) permissions look ok.
I don't know what this means, or what the file is/does.  But the behavior
is the same on both systems.  Unfortunately, an error return of "-1" doesn't
tell me a lot about what happened.

The i.j.71 subnet is a fairly recent creation.  I suspect that the gateway
is not configured quite right, judging by some fairly long times reported
by "ping" when the messages must pass thru it.  I'm by no means really savvy
about network issues however, so this may be taken with a grain of salt.

On the dev system, I seem to be getting GPIB timeouts after it has been
running for a while.  I don't know the cause of this, but it does go away
(for a while) with a re-boot.  The instrument is a G-P 307 vacuum gauge
controller and we're using the device support included with the release.
I'll try to hunt down another instrument and see if that fixes it.  Hmmm -
if I could get the prod system to work, there are _lots_ of vacuum gauge
controllers out in the ring!  And they're even connected to vacuum gauges.

The only real difference I see between the systems is that the prod system
has a router between it and its server.  I don't see how that can be the
source of the problem since the prod system can talk to its server well
enough to boot properly.

I'm at a loss for what to try next, aside from retirement, which I'm not quite
old enough for yet.

Any and all ideas will be appreciated.  Thank you for your help.


Disclaimer:  Any opinions are my own and have       |
    nothing to do with the official policy or the   |  -bill
    management of L.B.L, who probably couldn't      |   wlbrown@lbl.gov
    care less about employees who play with trains. |
