Experimental Physics and
Industrial Control System


Subject:	RE: channel access
From:	"Ernest L. Williams Jr." <[email protected]>
To:	Mark Rivers <[email protected]>
Cc:	Jeff Hill <[email protected]>, Dirk Zimoch <[email protected]>, EPICS tech-talk <[email protected]>
Date:	Wed, 11 Jan 2006 15:55:51 -0500

On Wed, 2006-01-11 at 13:41 -0600, Mark Rivers wrote:
> Folks,
> 
> > > we have a problem with CA since we upgraded our MV2300 IOCs to Tornado2.
> > >
> > > After a reboot, often channel access links don't connect immediately to
> > > the server. They connect a few minutes later when this message is printed:
> > >
> > > 
> > > CAC: Unable to connect port 5064 on "172.19.157.20:5064" because
> > >   22="S_errno_EINVAL"
> 
> This is not just a problem with IOC to IOC sockets, but with any vxWorks
> to vxWorks sockets.
> 
> We recently purchased a Newport XPS motor controller.  It communicates
> over Ethernet, and uses vxWorks as its operating system.  We control
> the XPS from a vxWorks IOC. When we reboot our vxWorks IOC the XPS will
> not communicate again after the IOC reboots, because it does not know
> the IOC rebooted, and the same ports are being used.  It is thus
> necessary to also reboot the XPS when rebooting the IOC.  But rebooting
> the XPS requires re-homing all of the motors, which is sometimes almost
> impossible because of installed equipment!  This is a real pain.
> 
> This problem goes away if we control the XPS with a non-vxWorks IOC,
> such as Linux, probably because Linux closes the sockets when killing
> the IOC.
> 
> On a related topic, I am appending an exchange I had with Jeff Hill and
> others on this topic in October 2003, that was not posted to tech-talk.
> 
> Cheers,
> Mark Rivers
> 
> 
> 
> Folks,
> 
> I'd like to revisit the problem of CA disconnects when rebooting a
> vxWorks client IOC that has CA links to a vxWorks server IOC (that does
> not reboot).
> 
> The EPICS 3.14.3 Release Notes say:
> 
> "Recent versions of vxWorks appear to experience a connect failure if
> the vxWorks IP kernel reassigns the same ephemeral TCP port number as
> was assigned during a previous lifetime. The IP kernel on the vxWorks
> system hosting the CA server might have a stale entry for this ephemeral
> port that has not yet timed out which prevents the client from
> connecting with the ephemeral port assigned by the IP kernel.
> Eventually, after EPICS_CA_CONN_TMO seconds, the TCP connect sequence is
> aborted and the client library closes the socket, opens a new socket,
> receives a new ephemeral port assignment, and successfully connects."
> 
> The last sentence is only partially correct.  The problem is that:
> - vxWorks assigns these ephemeral port numbers in ascending numerical
> order
> - It takes a very long time for the server IOC to kill the stale entries
> 
> Thus, if I reboot the client many times in a row, it does not just
> result in one disconnect before the successful connection, but many.  I
> just did a test where I rebooted a vxWorks client IOC 11 times, as one
> might do when debugging IOC software.  This IOC is running Marty's
> example sequence program, with 2 PVs connecting to a remote vxWorks
> server IOC. 
> 
> Here is the amount of time elapsed before the sequence program PVs
> connected:
> Reboot #  Time (sec)
> 1           0.1
> 2           5.7
> 3            30
> 4            60
> 5            90
> 6           120
> 7            30
> 8           150
> 9           150
> 10          180
> 11          210
> 
> Here is the output of "casr" on the vxWorks server IOC (which was never
> rebooted), taken after client reboot #11.
> Channel Access Server V4.11
> 164.54.160.74:1067(ioc13bma): User="iocboot", V4.11, Channel Count=1 Priority=80
> 164.54.160.100:4453(miata): User="dac_user", V4.8, Channel Count=461 Priority=0
> 164.54.160.75:1027(ioc13ida): User="iocboot", V4.11, Channel Count=1 Priority=80
> 164.54.160.101:3379(lebaron): User="dac_user", V4.8, Channel Count=18 Priority=0
> 164.54.160.73:1025(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> 164.54.160.73:1027(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> 164.54.160.73:1028(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> 164.54.160.73:1029(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> 164.54.160.73:1026(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> 164.54.160.73:1030(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> 164.54.160.73:1031(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> 164.54.160.111:55807(millenia.cars.aps.anl.gov): User="webmaster", V4.8, Channel Count=291 Priority=0
> 164.54.160.73:1032(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> 
> There should only be one connection from the client, 164.54.160.73
> (ioc13lab).  All but the highest numbered port (1032) are stale.  
> 
> The connection times do not increase by 30 seconds every single time,
> because for some reason every once in a while one of the old port
> connections times out (?) and is reused.  You can see that 1026 was
> reused in the above test. But in general they do increase by 30 seconds
> on each reboot.  
> 
> This situation makes it very difficult to do software development under
> vxWorks in the case where CA connections to other vxWorks IOCs are used.
> It starts to take 4 or 5 minutes for the CA connections to get
> established.  Rebooting the server IOC is often not an option.
> 
> Here is a proposal for Jeff:
> 
> Would it be possible to create a new function named something like
> vxCAClientStopAll.  This command would call close() on the CA
> connections for all vxWorks CA clients, and hence would gracefully close
> all of the socket connections on the server IOC.
> 
> We could then make another new vxWorks command, "restart" which does
> vxCAClientStopAll();
> reboot();

This is very awesome!!!

Jeff can you implement this for the next EPICS RELEASE???


Ernest







> 
> This would not solve the problem for hard reboots, but it would make it
> possible in many cases to avoid these long delays in cases where an IOC
> is being deliberately rebooted under software control.
> 
> Cheers,
> Mark
> 
> Jeff's reply was:
> Mark,
> 
> 
> > - vxWorks assigns these ephemeral port numbers in ascending numerical
> > order
> 
> That's correct; there could be several of these stale circuits, and the
> system will sequentially step through ephemeral port assignments, timing
> out each one until an open slot is found. One solution would be for WRS
> to store the last ephemeral port assignment in non-volatile RAM between
> boots.
> 
> It's also true that this problem is mostly a development issue and not an
> operational issue, because during operations machines typically stay in a
> booted operational state for much longer than the stale circuit timeout
> interval.
> 
> > - It takes a very long time for the server IOC to kill the stale entries
> 
> Yes, that's true. I do turn on the keep-alive timer, but it has a very
> long delay by default. This delay *can*, however, be changed globally for
> all circuits.
> 
> I don't know what RTEMS does, but I strongly suspect that Windows, UNIX,
> and VMS systems hang up all connected circuits when the system is
> software rebooted.
> 
> Therefore, we have a vxWorks and possibly an RTEMS specific problem. 
> 
> > Would it be possible to create a new function named something like
> > vxCAClientStopAll.  This command would call close() on the CA
> > connections for all vxWorks CA clients, and hence would 
> > gracefully close all of the socket connections on the server IOC.
> >
> 
> Of course ca_context_destroy() and ca_task_exit() fulfill a similar, but
> context-specific, role. They do, however, shut down only one context at a
> time, and the context identifier is private to the context.
> 
> So perhaps we should do this:
> 
> Implement an iocCore shutdown module where subsystems register for a
> callback when iocCore is shut down. There would be a command-line function
> that users call to shut down an IOC gracefully. This command would call
> all of the callbacks in LIFO order. The sequencer and the database links
> would of course call ca_context_destroy() in their iocCore shutdown
> callbacks.
> 
> Jeff
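
The shutdown module Jeff describes above could be sketched roughly as follows. This is a minimal, self-contained illustration and not EPICS code: iocShutdownRegister(), iocShutdownRunAll(), logSubsystem() and shutdownLog are hypothetical names, and the example callback only records a string where a real subsystem would call ca_context_destroy().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* All names below are hypothetical; none of them exist in EPICS base. */
typedef void (*shutdownFunc)(void *arg);

typedef struct shutdownNode {
    shutdownFunc func;
    void *arg;
    struct shutdownNode *next;
} shutdownNode;

/* Head of a singly linked list; the newest registration sits in front. */
static shutdownNode *shutdownHead = NULL;

/* Register a callback to run at shutdown.
 * Returns 0 on success, -1 if memory could not be allocated. */
int iocShutdownRegister(shutdownFunc func, void *arg)
{
    shutdownNode *node;

    assert(func != NULL);
    node = malloc(sizeof(*node));
    if (node == NULL)
        return -1;
    node->func = func;
    node->arg = arg;
    node->next = shutdownHead;   /* push on the front, which gives LIFO order */
    shutdownHead = node;
    return 0;
}

/* Call every registered callback, last registered first, and empty the
 * list. Returns the number of callbacks that were run. */
int iocShutdownRunAll(void)
{
    int count = 0;

    while (shutdownHead != NULL) {
        shutdownNode *node = shutdownHead;
        shutdownHead = node->next;
        node->func(node->arg);
        free(node);
        count++;
    }
    return count;
}

/* Example callback: appends the subsystem name to a log buffer.
 * A real CA client subsystem would close its circuits here instead. */
static char shutdownLog[128];

static void logSubsystem(void *arg)
{
    strncat(shutdownLog, (const char *)arg,
            sizeof(shutdownLog) - strlen(shutdownLog) - 1);
}
```

If "dbCa;" is registered first and "sequencer;" second, iocShutdownRunAll() runs the sequencer callback first, so shutdownLog ends up as "sequencer;dbCa;". A "restart" command along the lines Mark proposed would call iocShutdownRunAll() and then reboot, so the server sees the client's sockets close before the client goes down.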

Replies: orderly shutdown, Jeff Hill

References: RE: channel access, Mark Rivers


ANJ, 02 Sep 2010
