On Wed, 2006-01-11 at 14:40 -0700, Jeff Hill wrote:
> > > Here is a proposal for Jeff:
> > >
> > > Would it be possible to create a new function named something like
> > > vxCAClientStopAll. This command would call close() on the CA
> > > connections for all vxWorks CA clients, and hence would gracefully close
> > > all of the socket connections on the server IOC.
> > >
> > > We could then make another new vxWorks command, "restart" which does
> > > vxCAClientStopAll();
> > > reboot();
> >
> > This is very awesome!!!
> >
> > Jeff, can you implement this for the next EPICS RELEASE???
> >
> >
> > Ernest
> >
>
> What Mark suggests is certainly a possible fix. If such a function were
> written, a more accurate name than vxCAClientStopAll() might be
> ca_close_circuits_but_dont_shut_anything_else_down(), because if the rest
> of the CA infrastructure is not left in place, the db threads that are
> still using it will crash and could disrupt the orderly shutdown.
>
> There are different perspectives on this. One perspective is that CA already
> has such functions, ca_clear_channel and ca_context_destroy, and that all
> that is needed is a function called dbStopAll that calls them ;-). There
> would be many advantages to such an approach. One of them would be that
> devices could also be shut down. For example, the Allen Bradley TCP/IP
> circuits might also need to be shut down gracefully.
>
> Jeff
Jeff, here is some code that WindRiver sent me:
===========================================================================
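/* tcpRstAll.c - send RST on all open TCP connections (prior to system reset)
 * Anton Langebner ([email protected])
 *
 * $Header:$
 *
 * $Log:$
 *
 * Usage: call tcpRstAllInit() once at startup so that tcpRstAll() runs as a
 * reboot hook, or call tcpRstAllNow() to reset connections and reboot.
 */
#include "vxWorks.h"
#include "sys/types.h"
#include "netinet/in.h"
#include "netinet/in_pcb.h"
#include "netinet/in_systm.h"
#include "netinet/ip.h"
#include "netinet/ip_var.h"
#include "netinet/tcp.h"
#include "netinet/tcp_fsm.h"
#include "netinet/tcp_seq.h"
#include "netinet/tcp_timer.h"
#include "netinet/tcp_var.h"
#include "netinet/tcpip.h"
#include "net/route.h"
#include "rebootLib.h"   /* rebootHookAdd(), reboot() */
#include "errno.h"
#include "string.h"
#include "stdio.h"

/* Head of the stack's list of TCP protocol control blocks (an internal
 * vxWorks network-stack symbol, not a documented API). */
IMPORT struct inpcbhead *_pTcpPcbHead;

/* Send an RST on every ESTABLISHED TCP connection. When run as a reboot
 * hook it is called with the reboot startType, which is not used here. */
STATUS tcpRstAll(int startType)
{
    int s;
    int count = 0;
    struct inpcb *inp;    /* TCP: current PCB */
    struct tcpcb *pTcpCb; /* TCP: current TCP PCB */

    s = splnet();         /* lock the network stack while walking the list */
    if (_pTcpPcbHead == NULL)
    {
        splx(s);
        printf("Reset TCP: no connections found!\n");
        return (ERROR);
    }
    for (inp = _pTcpPcbHead->lh_first; inp != NULL; inp = inp->inp_list.le_next)
    {
        pTcpCb = (struct tcpcb *)inp->inp_ppcb;
        if (pTcpCb == NULL || pTcpCb->t_state != TCPS_ESTABLISHED)
            continue;
        /* tcp_output() takes the TCP header flags from the connection
         * state; in the BSD table TCPS_CLOSED maps to RST|ACK, so this
         * sends a reset to the peer. */
        pTcpCb->t_state = TCPS_CLOSED;
        tcp_output(pTcpCb);
        count++;
    }
    splx(s);
    /* Report only after releasing the lock: console output may itself
     * need the network stack (e.g. a telnet or rlogin shell). */
    printf("Reset TCP connections: %d reset, done\n", count);
    return (OK);
}

/* Install tcpRstAll() so that it runs on every call to reboot(). */
STATUS tcpRstAllInit(void)
{
    printf("Adding tcpRstAll() Reboot Hook\n");
    return (rebootHookAdd((FUNCPTR)tcpRstAll));
}

/* Reset connections, then reboot. If tcpRstAllInit() was also called, the
 * hook runs again inside reboot(); connections reset here are no longer
 * ESTABLISHED, so it skips them. */
void tcpRstAllNow(int startType)
{
    tcpRstAll(startType);
    reboot(startType);
}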
=====================================================================
Thanks,
Ernest
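Wind River's tcpRstAll() reaches into the internals of the vxWorks BSD stack. On systems with a standard BSD/POSIX sockets API, a comparable per-socket effect is usually obtained with SO_LINGER and a zero linger time: close() then aborts the connection with an RST instead of the normal FIN exchange, so the peer can discard its entry at once. The following is a minimal sketch under that assumption; set_abortive_close() and abort_connection() are hypothetical names, not part of EPICS or vxWorks, and whether a given vxWorks release handles a zero linger time the same way has not been checked here.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Make close() on a connected TCP socket abort the connection with an RST
 * instead of the normal FIN exchange: SO_LINGER on, with a zero timeout.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int set_abortive_close(int fd)
{
    struct linger lg;
    lg.l_onoff = 1;  /* linger on close ... */
    lg.l_linger = 0; /* ... for zero seconds: drop unsent data, send RST */
    return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg);
}

/* Abort one connection: set the option, then close the descriptor.
 * The descriptor is closed even if setsockopt() fails. */
int abort_connection(int fd)
{
    int rc = set_abortive_close(fd);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A client would call abort_connection() on each of its circuits just before rebooting. Note that this resets the circuits rather than closing them gracefully as the proposed vxCAClientStopAll() would, which is the same trade-off tcpRstAll() makes.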
>
> > -----Original Message-----
> > From: Ernest L. Williams Jr. [mailto:[email protected]]
> > Sent: Wednesday, January 11, 2006 1:56 PM
> > To: Mark Rivers
> > Cc: Jeff Hill; Dirk Zimoch; EPICS tech-talk
> > Subject: RE: channel access
> >
> > On Wed, 2006-01-11 at 13:41 -0600, Mark Rivers wrote:
> > > Folks,
> > >
> > > > > we have a problem with CA since we upgraded our MV2300 IOCs to
> > > > > Tornado2.
> > > > >
> > > > > After a reboot, often channel access links don't connect immediately
> > > > > to the server. They connect a few minutes later when this message is
> > > > > printed:
> > > > >
> > > > > CAC: Unable to connect port 5064 on "172.19.157.20:5064" because
> > > > > 22="S_errno_EINVAL"
> > >
> > > This is not just a problem with IOC to IOC sockets, but with any vxWorks
> > > to vxWorks sockets.
> > >
> > > We recently purchased a Newport XPS motor controller. It communicates
> > > over Ethernet, and uses vxWorks as its operating system. We control
> > > the XPS from a vxWorks IOC. When we reboot our vxWorks IOC the XPS will
> > > not communicate again after the IOC reboots, because it does not know
> > > the IOC rebooted, and the same ports are being used. It is thus
> > > necessary to also reboot the XPS when rebooting the IOC. But rebooting
> > > the XPS requires re-homing all of the motors, which is sometimes almost
> > > impossible because of installed equipment! This is a real pain.
> > >
> > > This problem goes away if we control the XPS with a non-vxWorks IOC,
> > > such as Linux, probably because Linux closes the sockets when killing
> > > the IOC.
> > >
> > > On a related topic, I am appending an exchange I had with Jeff Hill and
> > > others on this topic in October 2003, that was not posted to tech-talk.
> > >
> > > Cheers,
> > > Mark Rivers
> > >
> > >
> > >
> > > Folks,
> > >
> > > I'd like to revisit the problem of CA disconnects when rebooting a
> > > vxWorks client IOC that has CA links to a vxWorks server IOC (that does
> > > not reboot).
> > >
> > > The EPICS 3.14.3 Release Notes say:
> > >
> > > "Recent versions of vxWorks appear to experience a connect failure if
> > > the vxWorks IP kernel reassigns the same ephemeral TCP port number as
> > > was assigned during a previous lifetime. The IP kernel on the vxWorks
> > > system hosting the CA server might have a stale entry for this ephemeral
> > > port that has not yet timed out which prevents the client from
> > > connecting with the ephemeral port assigned by the IP kernel.
> > > Eventually, after EPICS_CA_CONN_TMO seconds, the TCP connect sequence is
> > > aborted and the client library closes the socket, opens a new socket,
> > > receives a new ephemeral port assignment, and successfully connects."
> > >
> > > The last sentence is only partially correct. The problem is that:
> > > - vxWorks assigns these ephemeral port numbers in ascending numerical
> > > order
> > > - It takes a very long time for the server IOC to kill the stale entries
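The two points above can be illustrated with a small, idealized model. It assumes (simplifications, not measured vxWorks behavior) that stale entries never expire on the server, that the client IOC starts allocating ephemeral ports at the same base port after every boot, and that each port still held by a stale entry costs one full connect timeout before the client library closes the socket and receives the next port. ServerModel and simulate_reboot() are hypothetical names.

```c
#include <stdbool.h>

#define MAX_PORTS 64 /* arbitrary size of the modeled port range */

/* held[i] is true while the server still has a circuit on port base+i. */
typedef struct {
    bool held[MAX_PORTS];
} ServerModel;

/* Simulate one client reboot. Returns the seconds spent before the circuit
 * connects and stores the offset of the port used in *port_out (-1 if the
 * whole modeled range is blocked). */
double simulate_reboot(ServerModel *srv, double conn_tmo, int *port_out)
{
    double delay = 0.0;
    int i = 0;
    while (i < MAX_PORTS && srv->held[i]) {
        delay += conn_tmo; /* connect times out; socket closed and reopened */
        i++;               /* the IP kernel hands out the next ascending port */
    }
    if (i == MAX_PORTS) {
        *port_out = -1;
        return delay;
    }
    /* The new circuit is held by the server; after the next client reboot
     * it becomes one more stale entry. */
    srv->held[i] = true;
    *port_out = i;
    return delay;
}
```

With conn_tmo = 30, the nth reboot waits 30*(n-1) seconds in this model, close to the roughly 30-second steps in the measurements reported in this message; clearing one held[] entry, i.e. letting a stale entry expire, reproduces the occasional short delay.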
> > >
> > > Thus, if I reboot the client many times in a row, it does not just
> > > result in one disconnect before the successful connection, but many. I
> > > just did a test where I rebooted a vxWorks client IOC 11 times, as one
> > > might do when debugging IOC software. This IOC is running Marty's
> > > example sequence program, with 2 PVs connecting to a remote vxWorks
> > > server IOC.
> > >
> > > Here is the amount of time elapsed before the sequence program PVs
> > > connected:
> > >   Reboot #   Time (sec)
> > >   --------   ----------
> > >          1          0.1
> > >          2          5.7
> > >          3           30
> > >          4           60
> > >          5           90
> > >          6          120
> > >          7           30
> > >          8          150
> > >          9          150
> > >         10          180
> > >         11          210
> > >
> > > Here is the output of "casr" on the vxWorks server IOC that never
> > > rebooted after client reboot #11.
> > > Channel Access Server V4.11
> > > 164.54.160.74:1067(ioc13bma): User="iocboot", V4.11, Channel Count=1 Priority=80
> > > 164.54.160.100:4453(miata): User="dac_user", V4.8, Channel Count=461 Priority=0
> > > 164.54.160.75:1027(ioc13ida): User="iocboot", V4.11, Channel Count=1 Priority=80
> > > 164.54.160.101:3379(lebaron): User="dac_user", V4.8, Channel Count=18 Priority=0
> > > 164.54.160.73:1025(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > 164.54.160.73:1027(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > 164.54.160.73:1028(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > 164.54.160.73:1029(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > 164.54.160.73:1026(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > 164.54.160.73:1030(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > 164.54.160.73:1031(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > 164.54.160.111:55807(millenia.cars.aps.anl.gov): User="webmaster", V4.8, Channel Count=291 Priority=0
> > > 164.54.160.73:1032(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > >
> > > There should only be one connection from the client, 164.54.160.73
> > > (ioc13lab). All but the highest numbered port (1032) are stale.
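Counting stale circuits by eye gets tedious, and the rule applied here (for one client address, every circuit except the one with the highest port is stale) is easy to automate. This sketch assumes casr lines shaped like the output above, with each circuit's "address:port(host):" at the start of a line; count_stale() is a hypothetical helper, not an EPICS function. The rule can misfire right after a lower stale port has been reused (as happened with 1026), because the live circuit is then not the highest-numbered one.

```c
#include <stdio.h>
#include <string.h>

/* Scan casr-style lines such as
 *   164.54.160.73:1032(ioc13lab): User="iocboot", V4.11, ...
 * and count the circuits from client address `ip`. Assuming the highest
 * port is the live circuit, returns the number of stale circuits and
 * stores the live port in *live_port (0 if the client has no circuits). */
int count_stale(const char *lines[], int n, const char *ip,
                unsigned *live_port)
{
    int count = 0;
    unsigned max_port = 0;
    for (int i = 0; i < n; i++) {
        char addr[64];
        unsigned port;
        /* Header and "Priority=..." lines have no "addr:port" prefix. */
        if (sscanf(lines[i], "%63[^:]:%u(", addr, &port) != 2)
            continue;
        if (strcmp(addr, ip) != 0)
            continue;
        count++;
        if (port > max_port)
            max_port = port;
    }
    *live_port = max_port;
    return count > 0 ? count - 1 : 0;
}
```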
> > >
> > > The connection times do not increase by 30 seconds every single time,
> > > because for some reason every once in a while one of the old port
> > > connections times out (?) and is reused. You can see that 1026 was
> > > reused in the above test. But in general they do increase by 30 seconds
> > > on each reboot.
> > >
> > > This situation makes it very difficult to do software development under
> > > vxWorks in the case where CA connections to other vxWorks IOCs are used.
> > > It starts to take 4 or 5 minutes for the CA connections to get
> > > established. Rebooting the server IOC is often not an option.
> > >
> > > Here is a proposal for Jeff:
> > >
> > > Would it be possible to create a new function named something like
> > > vxCAClientStopAll. This command would call close() on the CA
> > > connections for all vxWorks CA clients, and hence would gracefully close
> > > all of the socket connections on the server IOC.
> > >
> > > We could then make another new vxWorks command, "restart" which does
> > > vxCAClientStopAll();
> > > reboot();
> >
> > This is very awesome!!!
> >
> > Jeff, can you implement this for the next EPICS RELEASE???
> >
> >
> > Ernest
> >
> > >
> > > This would not solve the problem for hard reboots, but it would make it
> > > possible in many cases to avoid these long delays in cases where an IOC
> > > is being deliberately rebooted under software control.
> > >
> > > Cheers,
> > > Mark
> > >
> > > Jeff's reply was:
> > > Mark,
> > >
> > >
> > > > - vxWorks assigns these ephemeral port numbers in ascending numerical
> > > > order
> > >
> > > That's correct; there could be several of these stale circuits, and the
> > > system will sequentially step through ephemeral port assignments, timing
> > > out each one until an open slot is found. One solution would be for WRS
> > > to store the last ephemeral port assignment in non-volatile RAM between
> > > boots.
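Jeff's suggestion amounts to making the ephemeral port counter survive a reboot. A minimal sketch, assuming a FILE stream stands in for non-volatile RAM and an illustrative port range (the real vxWorks ephemeral range and any NVRAM API are not shown in this thread); next_ephemeral_port(), save_last_port() and load_last_port() are hypothetical names. It only models the idea: the real allocator lives inside the IP kernel, which is why the suggestion is addressed to WRS.

```c
#include <stdio.h>

/* Next port after `last`, wrapping within [lo, hi]. Starts at lo when
 * `last` is outside the range, e.g. when nothing was saved yet. */
unsigned next_ephemeral_port(unsigned last, unsigned lo, unsigned hi)
{
    if (last < lo || last >= hi)
        return lo;
    return last + 1;
}

/* Store the last assigned port; `nv` stands in for non-volatile RAM.
 * Returns 0 on success, -1 on error. */
int save_last_port(FILE *nv, unsigned port)
{
    rewind(nv);
    return (fprintf(nv, "%u\n", port) > 0 && fflush(nv) == 0) ? 0 : -1;
}

/* Read back the port saved by save_last_port(). Returns 0 on success. */
int load_last_port(FILE *nv, unsigned *port)
{
    rewind(nv);
    return fscanf(nv, "%u", port) == 1 ? 0 : -1;
}
```

After a reboot, the client would load the saved port and start allocating at next_ephemeral_port(saved, lo, hi), skipping the ports that the server may still hold as stale circuits.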
> > >
> > > It's also true that this problem is mostly a development issue and not
> > > an operational issue because during operations machines typically stay
> > > in a booted operational state for much longer than the stale circuit
> > > timeout interval.
> > >
> > > > - It takes a very long time for the server IOC to kill the stale
> > > > entries
> > >
> > > Yes, that's true. I do turn on the keep-alive timer, but it has a very
> > > long delay by default. This delay *can* however be changed globally for
> > > all circuits.
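Jeff notes that on vxWorks the keep-alive delay can be changed globally. For comparison, Linux also exposes the keep-alive timers per socket, which lets a server shorten them for its own circuits only. The sketch assumes Linux: TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux socket options, not portable POSIX and not the vxWorks mechanism. enable_short_keepalive() is a hypothetical name.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Turn on TCP keep-alive for one socket and shorten its timers (Linux).
 * Returns 0 on success, -1 on the first failing setsockopt(). */
int enable_short_keepalive(int fd, int idle_s, int intvl_s, int count)
{
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) != 0)
        return -1;
    /* Idle seconds before the first probe is sent. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) != 0)
        return -1;
    /* Seconds between unanswered probes. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof intvl_s) != 0)
        return -1;
    /* Unanswered probes before the connection is dropped. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof count) != 0)
        return -1;
    return 0;
}
```

For example, enable_short_keepalive(fd, 60, 10, 3) would drop a circuit to a silent peer after roughly 60 + 10*3 = 90 seconds, instead of waiting out the Linux default of two hours of idle time before probing even begins.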
> > >
> > > I don't know what RTEMS does, but I strongly suspect that Windows, UNIX,
> > > and VMS systems hang up all connected circuits when the system is
> > > software rebooted.
> > >
> > > Therefore, we have a vxWorks-specific and possibly an RTEMS-specific
> > > problem.
> > >
> > > > Would it be possible to create a new function named something like
> > > > vxCAClientStopAll. This command would call close() on the CA
> > > > connections for all vxWorks CA clients, and hence would
> > > > gracefully close all of the socket connections on the server IOC.
> > > >
> > >
> > > Of course ca_context_destroy() and ca_task_exit() are fulfilling a
> > > similar, but context-specific, role. They do, however, shut down only
> > > one context at a time, and the context identifier is private to the
> > > context.
> > >
> > > So perhaps we should do this:
> > >
> > > Implement an iocCore shutdown module where subsystems register for a
> > > callback when iocCore is shut down. There would be a command-line
> > > function that users call to shut down an IOC gracefully. This command
> > > would call all of the callbacks in LIFO order. The sequencer and the
> > > database links would of course call ca_context_destroy() in their
> > > iocCore shutdown callbacks.
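The shutdown module described here is essentially a LIFO registry of callbacks. A minimal sketch follows; shutdownRegister(), shutdownRunAll() and logShutdown() are hypothetical names rather than an existing iocCore API, the table has a fixed size, and there is no locking, so a real version would need a mutex if subsystems register from several threads. logShutdown() only records the call order; in the proposed scheme the sequencer's and database links' callbacks would call ca_context_destroy() instead.

```c
#include <stddef.h>
#include <string.h>

typedef void (*shutdownFunc)(void *arg);

#define MAX_SHUTDOWN_HOOKS 32

static struct { shutdownFunc func; void *arg; } hooks[MAX_SHUTDOWN_HOOKS];
static int nHooks = 0;

/* Register a callback; returns 0 on success, -1 if func is NULL or the
 * table is full. */
int shutdownRegister(shutdownFunc func, void *arg)
{
    if (func == NULL || nHooks >= MAX_SHUTDOWN_HOOKS)
        return -1;
    hooks[nHooks].func = func;
    hooks[nHooks].arg = arg;
    nHooks++;
    return 0;
}

/* Call every registered callback, newest first (LIFO). Entries are removed
 * as they run, so calling this a second time does nothing. */
void shutdownRunAll(void)
{
    while (nHooks > 0) {
        nHooks--;
        hooks[nHooks].func(hooks[nHooks].arg);
    }
}

/* Example callback: appends its argument (a subsystem name) to a log so
 * the call order can be observed. */
static char shutdownLog[128];
void logShutdown(void *arg)
{
    strncat(shutdownLog, (const char *)arg,
            sizeof shutdownLog - strlen(shutdownLog) - 1);
}
```

Mark's proposed "restart" command could then be sketched as shutdownRunAll() followed by reboot().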
> > >
> > > Jeff
>