On Wed, 2006-01-11 at 13:41 -0600, Mark Rivers wrote:
> > > we have a problem with CA since we upgraded our MV2300 IOCs
> > to Tornado2.
> > >
> > > After a reboot, often channel access links don't connect
> > immediately to
> > > the server. They connect a few minutes later when this
> > message is printed:
> > >
> > > CAC: Unable to connect port 5064 on "172.19.157.20:5064" because
> > > 22="S_errno_EINVAL"
> This is not just a problem with IOC to IOC sockets, but with any vxWorks
> to vxWorks sockets.
> We recently purchased a Newport XPS motor controller. It communicates
> over Ethernet, and uses vxWorks as its operating system. We control
> the XPS from a vxWorks IOC. When we reboot our vxWorks IOC the XPS will
> not communicate again after the IOC reboots, because it does not know
> the IOC rebooted, and the same ports are being used. It is thus
> necessary to also reboot the XPS when rebooting the IOC. But rebooting
> the XPS requires re-homing all of the motors, which is sometimes almost
> impossible because of installed equipment! This is a real pain.
> This problem goes away if we control the XPS with a non-vxWorks IOC,
> such as Linux, probably because Linux closes the sockets when killing
> the IOC.
> On a related topic, I am appending an exchange I had with Jeff Hill and
> others on this topic in October 2003, that was not posted to tech-talk.
> Mark Rivers
> I'd like to revisit the problem of CA disconnects when rebooting a
> vxWorks client IOC that has CA links to a vxWorks server IOC (that does
> not reboot).
> The EPICS 3.14.3 Release Notes say:
> "Recent versions of vxWorks appear to experience a connect failure if
> the vxWorks IP kernel reassigns the same ephemeral TCP port number as
> was assigned during a previous lifetime. The IP kernel on the vxWorks
> system hosting the CA server might have a stale entry for this ephemeral
> port that has not yet timed out which prevents the client from
> connecting with the ephemeral port assigned by the IP kernel.
> Eventually, after EPICS_CA_CONN_TMO seconds, the TCP connect sequence is
> aborted and the client library closes the socket, opens a new socket,
> receives a new ephemeral port assignment, and successfully connects."
> The last sentence is only partially correct. The problem is that:
> - vxWorks assigns these ephemeral port numbers in ascending numerical
> order
> - It takes a very long time for the server IOC to kill the stale entries
> Thus, if I reboot the client many times in a row, it does not just
> result in one disconnect before the successful connection, but many. I
> just did a test where I rebooted a vxWorks client IOC 11 times, as one
> might do when debugging IOC software. This IOC is running Marty's
> example sequence program, with 2 PVs connecting to a remote vxWorks
> server IOC.
> Here is the amount of time elapsed before the sequence program PVs
> connected:
> Reboot #   Time (sec)
>     1          0.1
>     2          5.7
>     3         30
>     4         60
>     5         90
>     6        120
>     7         30
>     8        150
>     9        150
>    10        180
>    11        210
> Here is the output of "casr" on the vxWorks server IOC (which never
> rebooted), taken after client reboot #11.
> Channel Access Server V4.11
> 220.127.116.11:1067(ioc13bma): User="iocboot", V4.11, Channel Count=1
> 18.104.22.168:4453(miata): User="dac_user", V4.8, Channel Count=461
> 22.214.171.124:1027(ioc13ida): User="iocboot", V4.11, Channel Count=1
> 126.96.36.199:3379(lebaron): User="dac_user", V4.8, Channel Count=18
> 188.8.131.52:1025(ioc13lab): User="iocboot", V4.11, Channel Count=2
> 184.108.40.206:1027(ioc13lab): User="iocboot", V4.11, Channel Count=2
> 220.127.116.11:1028(ioc13lab): User="iocboot", V4.11, Channel Count=2
> 18.104.22.168:1029(ioc13lab): User="iocboot", V4.11, Channel Count=2
> 22.214.171.124:1026(ioc13lab): User="iocboot", V4.11, Channel Count=2
> 126.96.36.199:1030(ioc13lab): User="iocboot", V4.11, Channel Count=2
> 188.8.131.52:1031(ioc13lab): User="iocboot", V4.11, Channel Count=2
> 184.108.40.206:55807(millenia.cars.aps.anl.gov): User="webmaster", V4.8,
> Channel Count=291 Priority=0
> 220.127.116.11:1032(ioc13lab): User="iocboot", V4.11, Channel Count=2
> There should only be one connection from the client, 18.104.22.168
> (ioc13lab). All but the highest numbered port (1032) are stale.
> The connection times do not increase by 30 seconds every single time,
> because for some reason every once in a while one of the old port
> connections times out (?) and is reused. You can see that 1026 was
> reused in the above test. But in general they do increase by 30 seconds
> on each reboot.
> This situation makes it very difficult to do software development under
> vxWorks in the case where CA connections to other vxWorks IOCs are used.
> It starts to take 4 or 5 minutes for the CA connections to get
> established. Rebooting the server IOC is often not an option.
> Here is a proposal for Jeff:
> Would it be possible to create a new function named something like
> vxCAClientStopAll. This command would call close() on the CA
> connections for all vxWorks CA clients, and hence would gracefully close
> all of the socket connections on the server IOC.
> We could then make another new vxWorks command, "restart", which would
> call vxCAClientStopAll and then reboot the IOC.
> This would not solve the problem for hard reboots, but it would make it
> possible in many cases to avoid these long delays in cases where an IOC
> is being deliberately rebooted under software control.
> Jeff's reply was:
> > - vxWorks assigns these ephemeral port numbers in ascending numerical
> > order
> That's correct. There could be several of these stale circuits, and the
> client will sequentially step through ephemeral port assignments, timing
> out on each one until an open slot is found. One solution would be for
> WRS to store the last ephemeral port assignment in non-volatile RAM
> between boots.
> It's also true that this problem is mostly a development issue and not an
> operational issue, because during operations machines typically stay in a
> booted operational state for much longer than the stale circuit timeout.
> > - It takes a very long time for the server IOC to kill the stale
> > entries
> Yes, that's true. I do turn on the keep-alive timer, but it has a very
> long delay by default. This delay *can* however be changed globally for
> all sockets.
> I don't know what RTEMS does, but I strongly suspect that Windows, UNIX,
> and VMS systems hang up all connected circuits when the system is
> software rebooted.
> Therefore, we have a vxWorks and possibly an RTEMS specific problem.
> > Would it be possible to create a new function named something like
> > vxCAClientStopAll. This command would call close() on the CA
> > connections for all vxWorks CA clients, and hence would
> > gracefully close all of the socket connections on the server IOC.
> Of course ca_context_destroy() and ca_task_exit() are fulfilling a
> similar, but context-specific, role. They do however shut down only one
> context at a time, and the context identifier is private to the context.
> So perhaps we should do this:
> Implement an iocCore shutdown module where subsystems register callbacks
> for when iocCore is shut down. There would be a command line function
> that one would call to shut down an IOC gracefully. This command would
> call all of the callbacks in LIFO order. The sequencer and the database
> links would of course call ca_context_destroy() in their iocCore
> shutdown callbacks.