Jeff,
Could this be the beacon port network order problem that you fixed recently?
That might explain why she's seeing no beacons from the gateway.
- Andrew
On Thursday 12 March 2009 11:12:56 Jeff Hill wrote:
> Emma,
>
> > There doesn't seem to be any other obvious problems that I can see (CPU
> > usage very low) - I've attached some of the console output. I did a "tt"
> > on the dbca link thread but wasn't sure where to go from there - is there
> > anything else I should try before I reboot the IOC?
>
> First try to determine if it's an IP kernel related issue (you should see
> some aspects of TCP/UDP that are not working using protocols that are not
> CA if it's an IP kernel issue). Does telnet (verifying TCP) and ping
> (verifying IP) work with the IOC when it is in this state? If your vxWorks
> system has an echo server (listening on port seven) you could test UDP with
> that.
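
The checks described above (telnet for TCP, an echo server for UDP) can also be scripted from a host on the same subnet. The following is only a minimal sketch using Python's standard socket module; the function names are made up for illustration, and it assumes the IOC really does run a telnet server on port 23 and an echo service on UDP port 7.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def udp_echo_ok(host, port=7, payload=b"udp-probe", timeout=3.0):
    """Send one datagram to an echo service and check that it comes back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, (host, port))
            data, _addr = sock.recvfrom(len(payload) + 64)
        except OSError:
            # No reply within the timeout, or an ICMP error was reported.
            return False
    return data == payload
```

For example, tcp_port_open("ioc-address", 23) exercises TCP and udp_echo_ok("ioc-address") exercises UDP, where "ioc-address" stands for the IOC's own address. A failed UDP probe on its own proves little, since datagrams may be dropped legitimately; repeated failures while TCP still works would point towards UDP.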
>
> Here is a talk by Dave Thomson (which has some info on diagnosing vxWorks
> buffer starvation related issues).
>
> http://www.diamond.ac.uk/CMSWeb/Downloads/diamond/Events/EPICS/MBUF_Problems.ppt
>
> This one might help also.
>
> http://www.xs4all.nl/~borkhuis/vxworks/troubleshooting.txt
>
> And here is some info on how to configure vxWorks to run EPICS.
>
> http://www.aps.anl.gov/epics/base/tornado.php?format=printer
>
>
> The output from ifShow, endPoolShow("name", 0), netStackDataPoolShow(),
> netStackSysPoolShow(), and maybe also udpStatShow is most likely
> to provide some hints at the cause of your problems if you are experiencing
> troubles with the vxWorks IP kernel (or below). The output from ifShow can
> be very interesting if there are low level media transmission errors.
>
> Look at the output from inetStatShow. In particular, look at TCP circuits
> that consistently indicate the same large number of bytes pending in their
> buffers (in multiple samples dumped with inetStatShow). Pending output
> bytes can indicate congestion problems with the IP kernel, network, routing
> system, and/or the server (possibly a CA server (GW or IOC) this IOC is
> connected to). Pending input bytes usually indicate issues with the code
> consuming bytes from the socket (in this case the CA client library).
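
Comparing several inetStatShow dumps by eye is tedious, so the comparison described above could be scripted on a host once the console output has been captured. This is a sketch only: it assumes each row has the column order PCB, Proto, Recv-Q, Send-Q, Local Address, Foreign Address (check this against your vxWorks version), and the 1024-byte threshold is an arbitrary illustrative value.

```python
import re

# Assumed row layout: PCB  Proto  Recv-Q  Send-Q  Local  Foreign  (state)
ROW = re.compile(
    r"^\s*(?P<pcb>[0-9a-fA-F]+)\s+(?P<proto>TCP|UDP)\s+"
    r"(?P<recvq>\d+)\s+(?P<sendq>\d+)\s+"
    r"(?P<local>\S+)\s+(?P<foreign>\S+)"
)


def parse_sample(text):
    """Map (proto, local, foreign) -> (recv_q, send_q) for one dump."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            key = (m["proto"], m["local"], m["foreign"])
            rows[key] = (int(m["recvq"]), int(m["sendq"]))
    return rows


def stuck_circuits(samples, threshold=1024):
    """Circuits whose queue depth is identical and >= threshold in every dump."""
    parsed = [parse_sample(s) for s in samples]
    if len(parsed) < 2:
        return {}  # one dump cannot show that a queue is not draining
    stuck = {}
    for key, (recv_q, send_q) in parsed[0].items():
        if not all(key in p for p in parsed[1:]):
            continue  # circuit was closed or reopened between dumps
        same_recv = recv_q >= threshold and all(p[key][0] == recv_q for p in parsed[1:])
        same_send = send_q >= threshold and all(p[key][1] == send_q for p in parsed[1:])
        if same_recv or same_send:
            stuck[key] = {
                "recv_q": recv_q if same_recv else None,
                "send_q": send_q if same_send else None,
            }
    return stuck
```

A circuit reported with a send_q value suggests congestion somewhere between this IOC's IP kernel and the peer; one reported with a recv_q value suggests the local consumer (here the CA client library) is not reading from its socket.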
>
> > > I would also look very closely at the output from dbcar at higher
> > > interest levels. As the interest level increases you should be able to
> > > see if CA thinks that the channel is connected or not (the output from
> > > void nciu::show ()). Of particular interest would be any situations
> > > where CA thinks the channel is connected, but the DB CA link code does
> > > not.
> > > Also look for situations where the DB CA Link code thinks that it's a
> > > CA link, but the CA channel hasn't been created (yet).
>
> I would definitely dump the output of dbcar when specifying a very high
> magnitude interest level (a level of 1000 should be sufficient) so that you
> see all of the gory details. We need to fault isolate so look for
> situations where the CA client library marks a particular channel as being
> connected, but the db ca link facility marks this channel as being
> disconnected. Also look for situations where a channel hasn't been created
> in the CA client library, but the db ca link facility considers the link to
> be a CA link, and of course the third possibility would be that the channel
> exists in the CA client library and both the CA client library and the DB
> CA link facility consider the channel to be disconnected.
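
The fault-isolation cases listed above can be written down as a small decision table. The sketch below is not part of EPICS; the three boolean inputs are assumed to be read off the dbcar output by hand for each link, and the two verdicts beyond the three cases named above (HEALTHY and DBCA_ONLY_CONNECTED) are added here only to make the table complete.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "both layers report connected"
    CA_ONLY_CONNECTED = "CA client connected, DB CA link disconnected"
    CHANNEL_MISSING = "DB CA link exists, but no CA channel was created"
    BOTH_DISCONNECTED = "channel exists, both layers report disconnected"
    DBCA_ONLY_CONNECTED = "DB CA link connected, CA client disconnected"


def classify(channel_exists, ca_connected, dbca_connected):
    """Classify one link from three facts transcribed from dbcar output."""
    if not channel_exists:
        return Verdict.CHANNEL_MISSING
    if ca_connected and dbca_connected:
        return Verdict.HEALTHY
    if ca_connected:
        return Verdict.CA_ONLY_CONNECTED
    if not dbca_connected:
        return Verdict.BOTH_DISCONNECTED
    return Verdict.DBCA_ONLY_CONNECTED
```

CA_ONLY_CONNECTED and CHANNEL_MISSING would point at the db ca link facility, while BOTH_DISCONNECTED would point at the CA client library, the network, or the server.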
>
> If you can somehow capture the entire output from dbcar at interest level
> 1000, and send it to me in an email, I would be happy to have a look. One
> possibility would be to forward the output of the vxWorks command to a
> file. Also send the name of the channels that should be connected, but
> aren't.
>
> It will be time consuming, but you might also capture a tt from the thread
> running the db ca link facility, and hopefully also all of the threads
> managing the CA client context created for the db ca link facility. If you
> could send that information I might be able to determine what has happened.
> The Tornado host-based debugging system might help to automate the stack
> trace collection process.
>
> Jeff
>
> > -----Original Message-----
> > From: Shepherd, EL (Emma) [mailto:[email protected]]
> > Sent: Tuesday, March 10, 2009 9:12 AM
> > To: Jeff Hill
> > Subject: RE: Inter-IOC link problems
> >
> > Hi Jeff,
> >
> > You may remember this problem I reported on tech-talk a little while ago.
> > It has occurred again, and I have managed to do a little more debugging.
> > I loaded a standalone CA client as you suggested and it works fine, so it
> > appears that it is not a global CA issue.
> >
> > There doesn't seem to be any other obvious problems that I can see (CPU
> > usage very low) - I've attached some of the console output. I did a "tt"
> > on the dbca link thread but wasn't sure where to go from there - is there
> > anything else I should try before I reboot the IOC?
> >
> > Thanks again for your help,
> >
> > Emma
> >
> > Emma Shepherd
> > Software Systems Engineer
> > Beamline Controls - I06, I07, I24
> >
> > +44 (0)1235-778235
> > http://www.diamond.ac.uk
> >
> > > -----Original Message-----
> > > From: Jeff Hill [mailto:[email protected]]
> > > Sent: 20 October 2008 17:24
> > > To: Shepherd, EL (Emma); [email protected]
> > > Subject: RE: Inter-IOC link problems
> > >
> > >
> > > Presumably, the IP stack on this IOC is operating correctly when this
> > > happens - as verified by {telnet, ping, ifShow, ...}?
> > >
> > > When this occurs, you might try running a small standalone CA client
> > > > that you have dynamically loaded into vxWorks. It's best to spawn this
> > > type of client so that a CA context will not end up getting attached
> > > to the vxWorks shell. The intent of course would be to isolate between
> > > a global CA issue, and one that is isolated to the CA client / DB CA
> > > Link code combination.
> > >
> > > I would also look very closely at the output from dbcar at higher
> > > interest levels. As the interest level increases you should be able to
> > > see if CA thinks that the channel is connected or not (the output from
> > > void nciu::show ()). Of particular interest would be any situations
> > > where CA thinks the channel is connected, but the DB CA link code does
> > > not.
> > > Also look for situations where the DB CA Link code thinks that it's a
> > > CA link, but the CA channel hasn't been created (yet).
> > >
> > > Also, do a "tt" on the DBCA Link thread, and the satellite threads for
> > > its CA context. Look for any situations where threads are hanging
> > > around in unusual places which might indicate some form of deadlock.
> > > If you see anything out of the ordinary please send the tt output and
> > > > I will have a look. In lightly loaded situations, "out of the
> > > > ordinary" usually means a thread that isn't parked in the normal
> > > > place (as seen by snapshots with tt) for an extended length of time.
> > > > One of course needs to compare tt output from when the IOC is normal
> > > > to tt output from when the IOC is misbehaving.
> > > Needless to say, a CPU starvation situation on this IOC would also
> > > cause issues (could be the cause of your issue).
> > >
> > > In the past, quite some years back actually, I have seen UDP issues if
> > > there were too many machines on a network with the wrong subnet mask
> > > configuration. I think that there used to be some issues in particular
> > > with HP workstations because they would reply with "ICMP network
> > > unreachable" if their network mask was set incorrectly and this could
> > > cause the IOC's search response to be discarded off the end of the
> > > finite length UDP input queue (depending on which response got there
> > > first and how many bogus ICMP messages are sent in response to each
> > > search request). ICMP traffic can be seen with Ethernet snoopers like
> > > wireshark or tcpdump. However, on modern switched networks, it may be
> > > best to be on the same hub (not a switch) with the IOC so that you can
> > > see unicast traffic that the switch sends only between the IOC and its
> > > message peers. Admittedly, this is perhaps contraindicated based on
> > > your not seeing any search traffic from the IOC in casnooper.
> > >
> > > > You might have a look at the output from udpStatShow (presuming that
> > > something is wrong with UDP and not IP).
> > > Also, have a look at ifShow and verify that the broadcast address
> > > remains correctly configured, and that there are not high error rates.
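
To check the broadcast address that ifShow reports, the expected value can be computed from the interface address and netmask. This is a minimal sketch using Python's standard ipaddress module; the helper names are made up, and the hexadecimal netmask form (for example 0xffffff00) is an assumption about how ifShow prints it on your system.

```python
import ipaddress


def _dotted(mask):
    """Accept a dotted netmask, or a hex form such as '0xffffff00'."""
    if mask.lower().startswith("0x"):
        return str(ipaddress.IPv4Address(int(mask, 16)))
    return mask


def expected_broadcast(ip, netmask):
    """Directed broadcast address implied by an address and its netmask."""
    iface = ipaddress.ip_interface(f"{ip}/{_dotted(netmask)}")
    return str(iface.network.broadcast_address)


def broadcast_matches(ip, netmask, reported_broadcast):
    """True if the reported broadcast address is the one the netmask implies."""
    return expected_broadcast(ip, netmask) == str(ipaddress.ip_address(reported_broadcast))
```

For a hypothetical interface 172.23.106.40 with netmask 255.255.255.0 the expected broadcast address is 172.23.106.255. A mismatch would mean that CA search requests sent to the configured broadcast address may never reach the other servers on the subnet.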
> > >
> > > Jeff
> > >
> > > > -----Original Message-----
> > > > From: [email protected]
> > > > [mailto:[email protected]]
> > > > On Behalf Of Shepherd, EL (Emma)
> > > > Sent: Friday, October 17, 2008 9:12 AM
> > > > To: [email protected]
> > > > Subject: RE: Inter-IOC link problems
> > > >
> > > > > I've done a little more investigation and I think that in this case
> > > > > the gateway is not to blame. It seems that other CA links on this IOC
> > > > > are also not working, and they are not all going through the gateway
> > > > > (some are on other IOCs on the same network).
> > > > >
> > > > > I setup caSnooper to monitor connection requests on one of the PVs my
> > > > > IOC is failing to link to. When I change the link to a constant and
> > > > > change it back again, caSnooper does not report any new requests for
> > > > > the PV. However when I do the same on a 'healthy' IOC which has
> > > > > working links, I see the new request on caSnooper when I put the link
> > > > > back.
> > > > >
> > > > > I'm not sure what that tells me except that it looks like the IOC has
> > > > > somehow stopped broadcasting search requests..?
> > > >
> > > > Emma
> > > >
> > > > > -----Original Message-----
> > > > > From: [email protected]
> > > > > [mailto:[email protected]] On Behalf Of Shepherd, EL
> > > > > (Emma)
> > > > > Sent: 17 October 2008 12:28
> > > > > To: Ralph Lange
> > > > > Cc: [email protected]
> > > > > Subject: RE: Inter-IOC link problems
> > > > >
> > > > >
> > > > > Hi there,
> > > > >
> > > > > Thanks for the replies, it seems that the 'undefined' entry might
> > > > > have been a red herring.
> > > > >
> > > > > > The IOC I am looking at is the client of the PV connection, and the
> > > > > > IP address listed is the server side of the CA gateway. There are
> > > > > > in fact two gateways on this machine - one for each direction as you
> > > > > > suggested. The configuration is really very simple, it is setup to
> > > > > > allow read access for all PVs. Do you need to know anything more
> > > > > > specific?
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Emma
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Ralph Lange [mailto:[email protected]]
> > > > > > Sent: 17 October 2008 08:52
> > > > > > To: Shepherd, EL (Emma)
> > > > > > Cc: [email protected]
> > > > > > Subject: Re: Inter-IOC link problems
> > > > > >
> > > > > >
> > > > > > Hi Emma,
> > > > > >
> > > > > > > I would need a bit more information about your setup to be able to
> > > > > > > fully understand your report.
> > > > > > >
> > > > > > > You are looking at the CA client side of an IOC. When you are
> > > > > > > losing connections between IOCs, is the IOC you're looking at the
> > > > > > > server or the client of that PV connection?
> > > > > > > It seems there are no beacons coming from the CA Gateway
> > > > > > > (172.23.106.35). Is that the client side or the server side of the
> > > > > > > CA Gateway? Are two (or more) Gateway processes running on that
> > > > > > > machine (i.e. one for each direction)? What is the CA configuration
> > > > > > > for the Gateway(s) on that machine?
> > > > > > >
> > > > > > > CA configuration of a Gateway is difficult and subtle. There are a
> > > > > > > lot of environment variables for CA server and client (see the CA
> > > > > > > Manual) which influence the behaviour of a CA application. Some
> > > > > > > variables are using other variables' values as default, which
> > > > > > > simplifies configuration of pure CA client or server apps, but may
> > > > > > > lead to unwanted behaviour for a CA Gateway (which is one of the few
> > > > > > > apps that is both a CA server and a CA client). E.g., it is quite
> > > > > > > easy to create a setup where the Gateway is sending out beacons on
> > > > > > > the wrong (i.e. client) side.
> > > > > >
> > > > > > Cheers,
> > > > > > Ralph
> > > > > >
> > > > > > Shepherd, EL (Emma) wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > > > We still seem to suffer quite a bit from problems with database
> > > > > > > > links between IOCs, particularly when a gateway is involved. For
> > > > > > > > some reason the links can become disconnected and a reboot is
> > > > > > > > usually necessary to get them working again. I have just had an
> > > > > > > > opportunity to do some diagnosis on one such problem and found a
> > > > > > > > clue in the CA beacon hashtable part of the dbcar report. The entry
> > > > > > > > for the gateway (172.23.106.35) is 'undefined', although the
> > > > > > > > gateway itself seems to be working just fine and I can use caget
> > > > > > > > through the gateway as normal.
> > > > > > > >
> > > > > > > > Any ideas what could cause this to happen, or how to fix it when it
> > > > > > > > does? None of the tasks are suspended, CPU usage is low and
> > > > > > > > everything else looks fine.
> > > > > > >
> > > > > > > > CA beacon hash entry for 172.23.106.32:5064 with period estimate 15.000521
> > > > > > > >     beacon number 168436, on THU OCT 16 2008 14:27:46
> > > > > > > > CA beacon hash entry for 172.23.106.35:5064 <no period estimate>
> > > > > > > >     beacon number 0, on <undefined>
> > > > > > > > CA beacon hash entry for 172.23.106.97:5064 with period estimate 14.988265
> > > > > > > >     beacon number 76356, on THU OCT 16 2008 14:27:52
> > > > > > > > CA beacon hash entry for 172.23.106.96:5064 with period estimate 14.988637
> > > > > > > >     beacon number 39491, on THU OCT 16 2008 14:27:53
> > > > > > > > CA beacon hash entry for 172.23.106.98:5064 with period estimate 14.980477
> > > > > > > >     beacon number 58989, on THU OCT 16 2008 14:27:47
> > > > > > > > CA beacon hash entry for 172.23.106.102:5064 with period estimate 14.990867
> > > > > > > >     beacon number 39993, on THU OCT 16 2008 14:27:53
> > > > > > > > entries per bucket: mean = 0.011719 std dev = 0.107617 max = 1
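
Picking the silent server out of a long beacon report can be automated once the dbcar output has been captured to a file. The sketch below assumes the entry wording shown in the report above ("CA beacon hash entry for ADDRESS with period estimate N" or "<no period estimate>", followed by "beacon number N"); the function names are made up for illustration and this is not an EPICS tool.

```python
import re

# Whitespace between words is matched loosely so wrapped lines still parse.
ENTRY = re.compile(
    r"CA\s+beacon\s+hash\s+entry\s+for\s+(?P<addr>[\d.]+:\d+)\s+"
    r"(?:with\s+period\s+estimate\s+(?P<period>[\d.]+)|<no\s+period\s+estimate>)\s+"
    r"beacon\s+number\s+(?P<number>\d+)"
)


def parse_beacons(report):
    """Return {address: (period or None, beacon number)}; repeats collapse."""
    entries = {}
    for m in ENTRY.finditer(report):
        period = float(m["period"]) if m["period"] else None
        entries[m["addr"]] = (period, int(m["number"]))
    return entries


def silent_servers(entries):
    """Addresses for which no beacon period has been estimated."""
    return sorted(addr for addr, (period, _n) in entries.items() if period is None)
```

Applied to the report above, this should single out 172.23.106.35:5064, the gateway. As a cross-check, the reported mean of 0.011719 entries per bucket is what six entries in a 512-bucket table would give (6/512); the bucket count is inferred from that figure, not taken from the CA source.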
> > > > > > >
> > > > > > >
> > > > > > > Thanks in advance....
> > > > > > >
> > > > > > > Emma
> > > > >
> > > > > > This e-mail and any attachments may contain confidential, copyright
> > > > > > and or privileged material, and are for the use of the intended
> > > > > > addressee only. If you are not the intended addressee or an
> > > > > > authorised recipient of the addressee please notify us of receipt by
> > > > > > returning the e-mail and do not use, copy, retain, distribute or
> > > > > > disclose the information in or attached to the e-mail. Any opinions
> > > > > > expressed within this e-mail are those of the individual and not
> > > > > > necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd.
> > > > > > cannot guarantee that this e-mail or any attachments are free from
> > > > > > viruses and we cannot accept liability for any damage which you may
> > > > > > sustain as a result of software viruses which may be transmitted in
> > > > > > or with the message. Diamond Light Source Limited (company no.
> > > > > > 4375679). Registered in England and Wales with its registered office
> > > > > > at Diamond House, Harwell Science and Innovation Campus, Didcot,
> > > > > > Oxfordshire, OX11 0DE, United Kingdom
> > > >
> >
--
The best FOSS code is written to be read by other humans -- Harold Welte