On Wed, 2006-01-11 at 13:41 -0600, Mark Rivers wrote:
> Folks,
>
> > > we have a problem with CA since we upgraded our MV2300 IOCs
> > > to Tornado2.
> > >
> > > After a reboot, often channel access links don't connect
> > > immediately to the server. They connect a few minutes later
> > > when this message is printed:
> > >
> > > CAC: Unable to connect port 5064 on "172.19.157.20:5064" because
> > > 22="S_errno_EINVAL"
>
> This is not just a problem with IOC to IOC sockets, but with any vxWorks
> to vxWorks sockets.
>
> We recently purchased a Newport XPS motor controller. It communicates
> over Ethernet, and uses vxWorks as its operating system. We control
> the XPS from a vxWorks IOC. When we reboot our vxWorks IOC the XPS will
> not communicate again after the IOC reboots, because it does not know
> the IOC rebooted, and the same ports are being used. It is thus
> necessary to also reboot the XPS when rebooting the IOC. But rebooting
> the XPS requires re-homing all of the motors, which is sometimes almost
> impossible because of installed equipment! This is a real pain.
>
> This problem goes away if we control the XPS with a non-vxWorks IOC,
> such as Linux, probably because Linux closes the sockets when killing
> the IOC.
>
> On a related topic, I am appending an exchange I had with Jeff Hill and
> others on this topic in October 2003, that was not posted to tech-talk.
>
> Cheers,
> Mark Rivers
>
>
>
> Folks,
>
> I'd like to revisit the problem of CA disconnects when rebooting a
> vxWorks client IOC that has CA links to a vxWorks server IOC (that does
> not reboot).
>
> The EPICS 3.14.3 Release Notes say:
>
> "Recent versions of vxWorks appear to experience a connect failure if
> the vxWorks IP kernel reassigns the same ephemeral TCP port number as
> was assigned during a previous lifetime. The IP kernel on the vxWorks
> system hosting the CA server might have a stale entry for this ephemeral
> port that has not yet timed out which prevents the client from
> connecting with the ephemeral port assigned by the IP kernel.
> Eventually, after EPICS_CA_CONN_TMO seconds, the TCP connect sequence is
> aborted and the client library closes the socket, opens a new socket,
> receives a new ephemeral port assignment, and successfully connects."
>
> The last sentence is only partially correct. The problem is that:
> - vxWorks assigns these ephemeral port numbers in ascending numerical
> order
> - It takes a very long time for the server IOC to kill the stale entries
>
> Thus, if I reboot the client many times in a row, it does not just
> result in one disconnect before the successful connection, but many. I
> just did a test where I rebooted a vxWorks client IOC 11 times, as one
> might do when debugging IOC software. This IOC is running Marty's
> example sequence program, with 2 PVs connecting to a remote vxWorks
> server IOC.
>
> Here is the amount of time elapsed before the sequence program PVs
> connected:
>   Reboot #   Time (sec)
>       1         0.1
>       2         5.7
>       3        30
>       4        60
>       5        90
>       6       120
>       7        30
>       8       150
>       9       150
>      10       180
>      11       210
>
> Here is the output of "casr" on the vxWorks server IOC (which was never
> rebooted), taken after client reboot #11.
> Channel Access Server V4.11
> 164.54.160.74:1067(ioc13bma): User="iocboot", V4.11, Channel Count=1 Priority=80
> 164.54.160.100:4453(miata): User="dac_user", V4.8, Channel Count=461 Priority=0
> 164.54.160.75:1027(ioc13ida): User="iocboot", V4.11, Channel Count=1 Priority=80
> 164.54.160.101:3379(lebaron): User="dac_user", V4.8, Channel Count=18 Priority=0
> 164.54.160.73:1025(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> 164.54.160.73:1027(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> 164.54.160.73:1028(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> 164.54.160.73:1029(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> 164.54.160.73:1026(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> 164.54.160.73:1030(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> 164.54.160.73:1031(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> 164.54.160.111:55807(millenia.cars.aps.anl.gov): User="webmaster", V4.8, Channel Count=291 Priority=0
> 164.54.160.73:1032(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
>
> There should only be one connection from the client, 164.54.160.73
> (ioc13lab). All but the highest numbered port (1032) are stale.
>
> The connection times do not increase by 30 seconds every single time,
> because for some reason every once in a while one of the old port
> connections times out (?) and is reused. You can see that 1026 was
> reused in the above test. But in general they do increase by 30 seconds
> on each reboot.
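
To make the arithmetic above concrete, here is a rough model in C of the behaviour described: the client is handed ephemeral ports in ascending order, and every port that still has a stale entry on the server costs one connect timeout before the next port is tried. This is only a sketch. The function name, the starting port (1025, the lowest ioc13lab port in the casr listing), and the 30-second timeout (taken from the 30-second steps in the table and assumed to be the EPICS_CA_CONN_TMO in effect) are my assumptions, and it ignores that a freshly booted IOC may open more than one socket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIRST_EPHEMERAL_PORT 1025u /* assumed: lowest client port seen in casr */
#define CONN_TMO_SEC         30.0  /* assumed: EPICS_CA_CONN_TMO in effect */

/*
 * stale[i] is true when the server still holds a stale circuit for port
 * FIRST_EPHEMERAL_PORT + i. Returns the seconds spent timing out before a
 * connect succeeds and stores the winning port in *port_out (if non-NULL).
 * Returns -1.0 when every modelled port is stale.
 */
double connect_delay_sec(const bool *stale, size_t n_ports, unsigned *port_out)
{
    double delay = 0.0;
    size_t i;

    assert(stale != NULL);
    for (i = 0; i < n_ports; ++i) {
        if (!stale[i]) {
            if (port_out != NULL)
                *port_out = FIRST_EPHEMERAL_PORT + (unsigned)i;
            return delay;
        }
        delay += CONN_TMO_SEC; /* one full connect timeout per stale port */
    }
    return -1.0;
}
```

With three stale ports the model gives a 90-second delay and a connection on port 1028. If the server happens to drop the entry for an earlier port, the delay falls back accordingly, which is consistent with the occasional short time in the table.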
>
> This situation makes it very difficult to do software development under
> vxWorks in the case where CA connections to other vxWorks IOCs are used.
> It starts to take 4 or 5 minutes for the CA connections to get
> established. Rebooting the server IOC is often not an option.
>
> Here is a proposal for Jeff:
>
> Would it be possible to create a new function named something like
> vxCAClientStopAll? This command would call close() on the CA
> connections for all vxWorks CA clients, and hence would gracefully close
> all of the socket connections on the server IOC.
>
> We could then make another new vxWorks command, "restart", which does
>     vxCAClientStopAll();
>     reboot();
This is very awesome!!!
Jeff, can you implement this for the next EPICS release?
Ernest
>
> This would not solve the problem for hard reboots, but in many cases it
> would avoid these long delays when an IOC is being deliberately rebooted
> under software control.
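
The reason an explicit close() helps is that an orderly close tells the peer the circuit is gone, so the peer does not have to wait for a timeout to find out. The following C sketch illustrates just that point on a POSIX host. It uses a local socketpair() as a stand-in for a TCP circuit between two IOCs, which is my simplification, and the function name is made up for the example. It is not the proposed vxCAClientStopAll.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Returns 1 if the surviving end of a connected stream socket pair reads
 * end-of-file after the other end is closed, 0 if it does not, -1 on error.
 */
int peer_sees_eof_after_close(void)
{
    int sv[2];
    char buf[1];
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    close(sv[0]);                      /* orderly close of the "client" end */
    n = read(sv[1], buf, sizeof(buf)); /* "server" end: 0 means end-of-file */
    close(sv[1]);

    if (n < 0)
        return -1;
    return n == 0;
}
```

A client that simply vanishes on reboot never performs the close(), so the server side is left holding the circuit until some timer expires.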
>
> Cheers,
> Mark
>
> Jeff's reply was:
> Mark,
>
>
> > - vxWorks assigns these ephemeral port numbers in ascending numerical
> > order
>
> That's correct; there could be several of these stale circuits, and the
> system will sequentially step through ephemeral port assignments, timing
> out each one until an open slot is found. One solution would be for WRS
> to store the last ephemeral port assignment in non-volatile RAM between
> boots.
>
> It's also true that this problem is mostly a development issue and not
> an operational issue, because during operations machines typically stay
> in a booted operational state for much longer than the stale circuit
> timeout interval.
>
> > - It takes a very long time for the server IOC to kill the stale
> > entries
>
> Yes, that's true. I do turn on the keep-alive timer, but it has a very
> long delay by default. This delay *can*, however, be changed globally for
> all circuits.
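
For readers who have not used it, turning on the keep-alive timer for one circuit looks roughly like the C sketch below, written against the standard BSD/POSIX socket calls. The helper names are mine, not anything from the CA server source. The sketch only switches keep-alive on and reads the setting back. It does not change the idle delay, which as noted above is a system-wide setting whose name and default vary by platform.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Enable TCP keep-alive probes on one socket. Returns 0 on success, -1 on
 * failure (errno is set by setsockopt). */
int enable_keepalive(int sock)
{
    int on = 1;
    return setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}

/* Returns 1 if keep-alive is enabled on the socket, 0 if not, -1 on error. */
int keepalive_enabled(int sock)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &value, &len) != 0)
        return -1;
    return value != 0;
}
```

With keep-alive on, the server eventually probes an idle circuit and drops it when the peer does not answer, but only after the (long) default idle delay has passed.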
>
> I don't know what RTEMS does, but I strongly suspect that Windows, UNIX,
> and VMS systems hang up all connected circuits when the system is
> software rebooted.
>
> Therefore, we have a vxWorks and possibly an RTEMS specific problem.
>
> > Would it be possible to create a new function named something like
> > vxCAClientStopAll. This command would call close() on the CA
> > connections for all vxWorks CA clients, and hence would
> > gracefully close all of the socket connections on the server IOC.
> >
>
> Of course ca_context_destroy() and ca_task_exit() are fulfilling a
> similar, but context-specific, role. They do, however, shut down only one
> context at a time, and the context identifier is private to the context.
>
> So perhaps we should do this:
>
> Implement an iocCore shutdown module where subsystems register for a
> callback when iocCore is shut down. There would be a command line
> function that users call to shut down an IOC gracefully. This command
> would call all of the callbacks in LIFO order. The sequencer and the
> database links would of course call ca_context_destroy() in their
> iocCore shutdown callbacks.
>
> Jeff
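
A shutdown module along the lines Jeff describes could be sketched in C as follows. This is only an illustration of the register-then-call-in-LIFO-order idea. All of the names (shutdown_register, shutdown_run_all, log_shutdown) and the fixed table size are hypothetical, there is no locking, and in a real IOC the callbacks for the sequencer and the database links would call ca_context_destroy() rather than append to a log string.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SHUTDOWN_HOOKS 32 /* arbitrary limit for the sketch */

typedef void (*shutdown_fn)(void *arg);

static struct {
    shutdown_fn fn;
    void *arg;
} hooks[MAX_SHUTDOWN_HOOKS];
static size_t hook_count = 0;

/* Register a callback to run at shutdown. Returns 0 on success, -1 if the
 * callback is NULL or the table is full. Not thread-safe. */
int shutdown_register(shutdown_fn fn, void *arg)
{
    if (fn == NULL || hook_count >= MAX_SHUTDOWN_HOOKS)
        return -1;
    hooks[hook_count].fn = fn;
    hooks[hook_count].arg = arg;
    hook_count++;
    return 0;
}

/* Call every registered callback, most recently registered first (LIFO),
 * removing each one from the table before it runs. */
void shutdown_run_all(void)
{
    while (hook_count > 0) {
        hook_count--;
        hooks[hook_count].fn(hooks[hook_count].arg);
    }
}

/* Demonstration callback: appends the first character of its string
 * argument to shutdown_log so the call order can be inspected. */
char shutdown_log[MAX_SHUTDOWN_HOOKS + 1] = "";

void log_shutdown(void *arg)
{
    size_t n = strlen(shutdown_log);

    if (n < MAX_SHUTDOWN_HOOKS) {
        shutdown_log[n] = *(const char *)arg;
        shutdown_log[n + 1] = '\0';
    }
}
```

If the database links register first (tag "d") and the sequencer second (tag "s"), shutdown_run_all() runs the sequencer's callback before the database's, so whatever started last is torn down first.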