EPICS Home

Experimental Physics and Industrial Control System


 
1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  <20092010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022  2023  2024  Index 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  <20092010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022  2023  2024 
<== Date ==> <== Thread ==>

Subject: RE: CA reconnection problem
From: "Jeff Hill" <[email protected]>
To: "'Rees, NP \(Nick\)'" <[email protected]>, "'Dirk Zimoch'" <[email protected]>
Cc: "'tech-talk'" <[email protected]>
Date: Mon, 10 Aug 2009 19:02:10 -0600
I had another look at this issue this evening and suspect now that the timer
queue entries with negative delays reflect only the time spent (on the order
of 20 seconds) holding the lock while dumping "dbcar" at high intrest levels
to the slow 9600 baud serial port.

So perhaps its best to step back a bit and look at what Diamond and SLS have
in common. It seems that both sites have a db ca link connecting through the
gateway when this occurs. Dirk mentioned that R3.13 CA links connect fine at
SLS, but R3.14 ones don?t. That can probably be explained by the fact that
the R3.13 ca client library disconnects both the circuit and the application
when a circuit is unresponsive. In contrast, the R3.14 ca client library
disconnects only the application, and not the circuit when the circuit is
unresponsive relying instead on TCP to disconnect broken circuits.

So what probably is occurring is an issue with the gateway where PCAS isn?t
properly scheduling itself to interact with the R3.14 ca client's circuit,
but the circuit remains connected at TCP's level. In Dirk's situation I seem
to recall that a power outage was the precipitating event. A precipitating
event at Diamond maybe isn?t identified?

There are an assortment of PCAS, and also gateway improvements, coming soon
in R3.14.11 RC1 but I'm not certain that this one has been fixed. I will see
if I can reproduce it.

Jeff
______________________________________________________
Jeffrey O. Hill           Email        [email protected]
LANL MS H820              Voice        505 665 1831
Los Alamos NM 87545 USA   FAX          505 665 5107

Message content: TSPA


> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Jeff Hill
> Sent: Monday, August 10, 2009 11:05 AM
> To: 'Rees, NP (Nick)'; 'Dirk Zimoch'
> Cc: 'tech-talk'
> Subject: RE: CA reconnection problem
> 
> Nick, Dirk,
> 
> > I think this is possibly the same problem we described in the thread:
> >
> > "Inter-IOC link problems"
> > http://www.aps.anl.gov/epics/tech-talk/2008/msg01103.php
> >
> 
> I vaguely recall that there was a mantis entry for the above, but have had
> no success finding it. I did find 164 which is somewhat close, but
> probably
> not the same issue. Also 330 (which will probably soon be assigned not-a-
> bug
> status) might be of some interest if you are seeing unexpectedly long
> reconnect times with R3.14.
> 
> Nevertheless, working only from my memory, it's not clear that Emma
> Shepherd's and Dirk's issues have the same symptoms. In Dirk's situation
> dbcar (at high interest levels) shows entries in the CA client library's
> timer queue with negative expiration which, as I recall, was not
> identified
> as a symptom with Emma Shepherd's situation? The tech-talk thread does
> indicate some close examination of the dbcar output occurred, which
> probably
> would have immediately identified the negative expiration times on the
> timer
> queue as a problem?
> 
> In any case you now have a common symptom to look for with dbcar the next
> time that it happens.
> 
> Stack traces from _all_ of the dbca, CA client library, and timer threads
> involved will be the information I need to determine a cause. When
> deciding
> on which threads traces to send, look at the threads operating at close to
> the same scheduling priority as dbca in the vxWorks "i" output. Emphasis
> on
> all because if there is a deadlock issue they can be tricky to identify
> w/o
> clear understanding of the lock hierarchy interactions between the two or
> more threads involved. Of course, if we have a deadlock situation, then
> any
> threads that are continuously blocking on a mutex are of particular
> interest.
> 
> Thanks for your help,
> 
> Jeff
> ______________________________________________________
> Jeffrey O. Hill           Email        [email protected]
> LANL MS H820              Voice        505 665 1831
> Los Alamos NM 87545 USA   FAX          505 665 5107
> 
> Message content: TSPA
> 
> 
> > -----Original Message-----
> > From: Rees, NP (Nick) [mailto:[email protected]]
> > Sent: Thursday, August 06, 2009 6:29 AM
> > To: Dirk Zimoch; Jeff Hill
> > Cc: tech-talk
> > Subject: RE: CA reconnection problem
> >
> > Hi Dirk, Jeff,
> >
> > I think this is possibly the same problem we described in the thread:
> >
> > "Inter-IOC link problems"
> > http://www.aps.anl.gov/epics/tech-talk/2008/msg01103.php
> >
> > That thread seemed to peter out because we basically developed reboot
> > workarounds that meant it didn't bother us as much. However, I don't
> > think it was ever solved. I don't know whether there is a Mantis entry
> > for it.
> >
> > Cheers,
> >
> > Nick Rees
> > Principal Software Engineer           Phone: +44 (0)1235-778430
> > Diamond Light Source                  Fax:   +44 (0)1235-446713
> >
> > > -----Original Message-----
> > > From: [email protected]
> > > [mailto:[email protected]] On Behalf Of Dirk Zimoch
> > > Sent: 06 August 2009 10:22
> > > To: Jeff Hill
> > > Cc: 'tech-talk'
> > > Subject: Re: CA reconnection problem
> > >
> > > Hi Jeff,
> > >
> > > Time advances normally on the IOC.
> > > Bug 299 is not related because I have no suspended tasks.
> > > Bug 330 is also not related because my channels never
> > > reconnect, even after days.
> > >
> > > The IOC connects to the gateway with normal broadcasts on the
> > > local network. I
> > > can see beacons from all IOCs and gateway and I can see the
> > > beacon anomaly when
> > > I start a softioc. Still the IOC does not send any search
> > > requests to reconnect
> > > the links. I can see the search requests from 3.13 IOCs,
> > > 3.14.8 medms and the
> > > gateway after a beacon anomaly, but not from 3.14.8 IOCs. It
> > > is definitely not
> > > the case that the 3.14.8 IOC never stops searching.
> > >
> > > I do not want to restart the gateway at the moment because we
> > > have user
> > > operation. I also think it's not necessary because I can see
> > > the channels
> > > through the gateway with caget. I doubt that it is a gateway
> > > problem at all.
> > >
> > > Rebooting the IOC helps. Writing the record name to the link
> > > again helps too,
> > > but only for this link. All other links on the same IOC to
> > > the same target
> > > record stay disconnected. I do not understand this.
> > >
> > > UDP package size is not an issue because its only 8 broken CA
> > > links to 2
> > > records. So it should easily fit into one package. But as I
> > > said, I see no
> > > package at all (except when I rewnew the link).
> > >
> > > I would send a stack trace of the CAC timer thread if I could
> > > find it. What's
> > > the name?
> > >
> > > Here is the full dbcar output, mybe it helps:
> > >
> > > X04SA-VME-ID2 > dbcar "",100
> > >      connected X04SA-VME-ID2 X04SA-ID-LMD:PMD.INPB
> > > X04SA-ID-LMD:SCALER.S3
> > >      connected X04SA-VME-ID2 X04SA-ID-LMD:PMD-INJ.INPB
> > > X04SA-ID-LMD:PMD
> > >      connected X04SA-VME-ID2 X04SA-ID-LMD:PMD-NOINJ.INPB
> > > X04SA-ID-LMD:PMD
> > >      connected X04SA-VME-ID2 X04SA-ID-LMD:PMU.INPB
> > > X04SA-ID-LMD:SCALER.S2
> > >      connected X04SA-VME-ID2 X04SA-ID-LMU:PMD.INPB
> > > X04SA-ID-LMU:SCALER.S3
> > >      connected X04SA-VME-ID2 X04SA-ID-LMU:PMU.INPB
> > > X04SA-ID-LMU:SCALER.S2
> > >      connected 129.129.104.31:5064 X04SA-ID-FBPMD:ALARM.INPD
> > > X04SA-ID-GAP:ISOPEN
> > > not_connected X04SA-ID-FBPMD:X-COMP.INPE ACOAU-ACCU:OP-MODE
> > > not_connected X04SA-ID-FBPMD:X-COMP.INPI ARIDI-PCT:CURRENT
> > > not_connected X04SA-ID-FBPMD:Y-COMP.INPE ACOAU-ACCU:OP-MODE
> > > not_connected X04SA-ID-FBPMD:Y-COMP.INPI ARIDI-PCT:CURRENT
> > >      connected 129.129.104.31:5064 X04SA-ID-FBPMU:ALARM.INPD
> > > X04SA-ID-GAP:ISOPEN
> > > not_connected X04SA-ID-FBPMU:X-COMP.INPE ACOAU-ACCU:OP-MODE
> > > not_connected X04SA-ID-FBPMU:X-COMP.INPI ARIDI-PCT:CURRENT
> > > not_connected X04SA-ID-FBPMU:Y-COMP.INPE ACOAU-ACCU:OP-MODE
> > > not_connected X04SA-ID-FBPMU:Y-COMP.INPI ARIDI-PCT:CURRENT
> > >      connected 129.129.104.31:5064 X04SA-ID:FBPM-ILK.OUT
> > > X04SA-ID:ILK-BITS.B1
> > >      connected X04SA-VME-ID2 X04SA-VME-ID2:HBTSYNC.INPA
> > > X04SA-VME-ID2:HBT.VAL
> > >      connected X04SA-VME-ID2 X04SA-VME-ID2:MEMALLOC.INPA
> > > X04SA-VME-ID2:MEMALLOCRA
> > > W
> > >      connected X04SA-VME-ID2 X04SA-VME-ID2:MEMFREE.INPA
> > > X04SA-VME-ID2:MEMFREE-RAW
> > >      connected X04SA-VME-ID2 X04SA-VME-ID2:MEMMAX.INPA
> > > X04SA-VME-ID2:MEMMAX-RAW
> > > ncalinks 21 not connected 8 no_read_access 0 no_write_access 0
> > >   nDisconnect 0 nNoWrite 0
> > > ca_client_context at 0x7ca6c90 pndRecvCnt=0 ioSeqNo=0
> > > dbContext at 0x7ca6a80
> > >          event call back cache location 0x7bac130, and its size 24
> > > dbContextReadNotifyCache
> > > dbContextReadNotifyCacheAlocator
> > >          count 1 and size 88
> > > epicsMutexId 0x7ca6c20 source
> > > ../../../src/libCom/osi/epicsMutex.cpp line 197
> > >
> > > Semaphore Id        : 0x7ca6c50
> > > Semaphore Type      : MUTEX
> > > Task Queuing        : PRIORITY
> > > Pended Tasks        : 3
> > > Owner               : 0x7a331e0    (tShell)
> > > Options             : 0xd       SEM_Q_PRIORITY
> > >                                  SEM_DELETE_SAFE
> > >                                  SEM_INVERSION_SAFE
> > >
> > > VxWorks Events
> > > --------------
> > > Registered Task     : NONE
> > > Event(s) to Send    : N/A
> > > Options             : N/A
> > >
> > > Pended Tasks
> > > ------------
> > >     NAME      TID    PRI TIMEOUT
> > > ---------- -------- --- -------
> > > timerQueue  7bf5d20 148       0
> > > CAC-event   7bb5a90 148       0
> > > CAC-TCP-recv 7a5e3b0 150       0
> > >
> > > Channel Access Client Context at 0x7bf8cc0 for user ioc
> > >          revision "@(#) EPICS 3.14.8.2, CA Client Library Nov 29 2006"
> > > Hash table with 512 buckets and 2 items of type 6tcpiiu installed
> > > Virtual circuit to "129.129.104.31:5064" at version V4.8 state 1
> > >          current data cache pointer = 0x7b88c20 current data
> > > cache size = 16384
> > >          contiguous receive message count=0, busy detect
> > > bool=0, flow control boo
> > > l=0
> > > eceive thread is busy=0
> > >          virtual circuit socket identifier 20
> > >          send thread flush signal:
> > >
> > > Semaphore Id        : 0x7b88bc0
> > > Semaphore Type      : BINARY
> > > Task Queuing        : FIFO
> > > Pended Tasks        : 1
> > > State               : EMPTY
> > > Options             : 0x0       SEM_Q_FIFO
> > >
> > > VxWorks Events
> > > --------------
> > > Registered Task     : NONE
> > > Event(s) to Send    : N/A
> > > Options             : N/A
> > >
> > > Pended Tasks
> > > ------------
> > >     NAME      TID    PRI TIMEOUT
> > > ---------- -------- --- -------
> > > CAC-TCP-send 7b8fba0 148       0
> > >
> > >          send thread:
> > >          recv thread:
> > >          echo pending bool = 0
> > > IO identifier hash table:
> > > Connected channels
> > > Channel "X04SA-ID-GAP:ISOPEN", connected to server
> > > 129.129.104.31:5064, native t
> > > ype DBF_ENUM, native element count 1, read access, write access
> > >          network IO pointer = 0x7b96cd0
> > >          server identifier 392
> > >          search retry number=1
> > >          name length=20
> > > Channel "X04SA-ID-GAP:ISOPEN", connected to server
> > > 129.129.104.31:5064, native t
> > > ype DBF_ENUM, native element count 1, read access, write access
> > >          network IO pointer = 0x7b96cd0
> > >          server identifier 393
> > >          search retry number=1
> > >          name length=20
> > > Channel "X04SA-ID:ILK-BITS.B1", connected to server
> > > 129.129.104.31:5064, native
> > > type DBF_CHAR, native element count 1, read access, write access
> > >          network IO pointer = 0x7b96cd0
> > >          server identifier 394
> > >          search retry number=1
> > >          name length=21
> > > Virtual circuit to "129.129.104.14:5064" at version V4.11 state 1
> > >          current data cache pointer = 0x7af62b0 current data
> > > cache size = 16384
> > >          contiguous receive message count=0, busy detect
> > > bool=0, flow control boo
> > > l=0
> > > eceive thread is busy=0
> > >          virtual circuit socket identifier 22
> > >          send thread flush signal:
> > >
> > > Semaphore Id        : 0x7a1f670
> > > Semaphore Type      : BINARY
> > > Task Queuing        : FIFO
> > > Pended Tasks        : 1
> > > State               : EMPTY
> > > Options             : 0x0       SEM_Q_FIFO
> > >
> > > VxWorks Events
> > > --------------
> > > Registered Task     : NONE
> > > Event(s) to Send    : N/A
> > > Options             : N/A
> > >
> > > Pended Tasks
> > > ------------
> > >     NAME      TID    PRI TIMEOUT
> > > ---------- -------- --- -------
> > > CAC-TCP-send 7a61590 148       0
> > >
> > >          send thread:
> > >          recv thread:
> > >          echo pending bool = 0
> > > IO identifier hash table:
> > > Unresponsive circuit channels
> > > Channel "ARIDI-PCT:CURRENT" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 1
> > >          search retry number=1
> > >          name length=18
> > > Channel "ARIDI-PCT:CURRENT" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 2
> > >          search retry number=1
> > >          name length=18
> > > Channel "ARIDI-PCT:CURRENT" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 3
> > >          search retry number=1
> > >          name length=18
> > > Channel "ARIDI-PCT:CURRENT" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 4
> > >          search retry number=1
> > >          name length=18
> > > Channel "ACOAU-ACCU:OP-MODE" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 5
> > >          search retry number=1
> > >          name length=19
> > > Channel "ACOAU-ACCU:OP-MODE" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 6
> > >          search retry number=1
> > >          name length=19
> > > Channel "ACOAU-ACCU:OP-MODE" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 7
> > >          search retry number=1
> > >          name length=19
> > > Channel "ACOAU-ACCU:OP-MODE" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 8
> > >          search retry number=1
> > >          name length=19
> > > Virtual circuit to "129.129.104.31:5064" at version V4.8 state 1
> > >          current data cache pointer = 0x7b88c20 current data
> > > cache size = 16384
> > >          contiguous receive message count=0, busy detect
> > > bool=0, flow control boo
> > > l=0
> > > eceive thread is busy=0
> > >          virtual circuit socket identifier 20
> > >          send thread flush signal:
> > >
> > > Semaphore Id        : 0x7b88bc0
> > > Semaphore Type      : BINARY
> > > Task Queuing        : FIFO
> > > Pended Tasks        : 1
> > > State               : EMPTY
> > > Options             : 0x0       SEM_Q_FIFO
> > >
> > > VxWorks Events
> > > --------------
> > > Registered Task     : NONE
> > > Event(s) to Send    : N/A
> > > Options             : N/A
> > >
> > > Pended Tasks
> > > ------------
> > >     NAME      TID    PRI TIMEOUT
> > > ---------- -------- --- -------
> > > CAC-TCP-send 7b8fba0 148       0
> > >
> > >          send thread:
> > >          recv thread:
> > >          echo pending bool = 0
> > > IO identifier hash table:
> > > Connected channels
> > > Channel "X04SA-ID-GAP:ISOPEN", connected to server
> > > 129.129.104.31:5064, native t
> > > ype DBF_ENUM, native element count 1, read access, write access
> > >          network IO pointer = 0x7b96cd0
> > >          server identifier 392
> > >          search retry number=1
> > >          name length=20
> > > Channel "X04SA-ID-GAP:ISOPEN", connected to server
> > > 129.129.104.31:5064, native t
> > > ype DBF_ENUM, native element count 1, read access, write access
> > >          network IO pointer = 0x7b96cd0
> > >          server identifier 393
> > >          search retry number=1
> > >          name length=20
> > > Channel "X04SA-ID:ILK-BITS.B1", connected to server
> > > 129.129.104.31:5064, native
> > > type DBF_CHAR, native element count 1, read access, write access
> > >          network IO pointer = 0x7b96cd0
> > >          server identifier 394
> > >          search retry number=1
> > >          name length=21
> > > Virtual circuit to "129.129.104.14:5064" at version V4.11 state 1
> > >          current data cache pointer = 0x7af62b0 current data
> > > cache size = 16384
> > >          contiguous receive message count=0, busy detect
> > > bool=0, flow control boo
> > > l=0
> > > eceive thread is busy=0
> > >          virtual circuit socket identifier 22
> > >          send thread flush signal:
> > >
> > > Semaphore Id        : 0x7a1f670
> > > Semaphore Type      : BINARY
> > > Task Queuing        : FIFO
> > > Pended Tasks        : 1
> > > State               : EMPTY
> > > Options             : 0x0       SEM_Q_FIFO
> > >
> > > VxWorks Events
> > > --------------
> > > Registered Task     : NONE
> > > Event(s) to Send    : N/A
> > > Options             : N/A
> > >
> > > Pended Tasks
> > > ------------
> > >     NAME      TID    PRI TIMEOUT
> > > ---------- -------- --- -------
> > > CAC-TCP-send 7a61590 148       0
> > >
> > >          send thread:
> > >          recv thread:
> > >          echo pending bool = 0
> > > IO identifier hash table:
> > > Unresponsive circuit channels
> > > Channel "ARIDI-PCT:CURRENT" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 1
> > >          search retry number=1
> > >          name length=18
> > > Channel "ARIDI-PCT:CURRENT" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 2
> > >          search retry number=1
> > >          name length=18
> > > Channel "ARIDI-PCT:CURRENT" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 3
> > >          search retry number=1
> > >          name length=18
> > > Channel "ARIDI-PCT:CURRENT" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 4
> > >          search retry number=1
> > >          name length=18
> > > Channel "ACOAU-ACCU:OP-MODE" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 5
> > >          search retry number=1
> > >          name length=19
> > > Channel "ACOAU-ACCU:OP-MODE" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 6
> > >          search retry number=1
> > >          name length=19
> > > Channel "ACOAU-ACCU:OP-MODE" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 7
> > >          search retry number=1
> > >          name length=19
> > > Channel "ACOAU-ACCU:OP-MODE" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 8
> > >          search retry number=1
> > >          name length=19
> > > entries per bucket: mean = 0.003906 std dev = 0.062378 max = 1
> > >          connection time out watchdog period 30.000000
> > > Datagram IO circuit (and disconnected channel repository)
> > >          repeater port 5065
> > >          default server port 5064
> > > Channel Access Address List
> > > 129.129.104.255:5064
> > >          socket identifier 18
> > >          bytes in xmit buffer 16
> > >          shut down command bool 0
> > >          recv thread exit signal:
> > > repeater subscribe timer: attempts=2 registered=1 once=0
> > > disconnect governor timer:
> > > search timer delay 0.032000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > search timer delay 0.064000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > search timer delay 0.128000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > search timer delay 0.256000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > search timer delay 0.512000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > search timer delay 1.024000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > search timer delay 2.048000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > search timer delay 4.096000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > search timer delay 8.192000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > search timer delay 16.384000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > search timer delay 32.768000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > search timer delay 65.536000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > search timer delay 131.072000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > search timer delay 262.144000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > Program begin time:
> > > epicsTime: MON MAY 11 2009 16:53:29.047623310
> > > epicsTime: revision "@(#) EPICS 3.14.8.2, Common Utilities
> > > Library Nov 29 2006"
> > > Channel identifier hash table:
> > > Hash table with 512 buckets and 11 items of type 4nciu installed
> > > Channel "X04SA-ID-GAP:ISOPEN", connected to server
> > > 129.129.104.31:5064, native t
> > > ype DBF_ENUM, native element count 1, read access, write access
> > >          network IO pointer = 0x7b96cd0
> > >          server identifier 392
> > >          search retry number=1
> > >          name length=20
> > > Channel "ACOAU-ACCU:OP-MODE" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 5
> > >          search retry number=1
> > >          name length=19
> > > Channel "ARIDI-PCT:CURRENT" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 1
> > >          search retry number=1
> > >          name length=18
> > > Channel "ACOAU-ACCU:OP-MODE" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 6
> > >          search retry number=1
> > >          name length=19
> > > Channel "ARIDI-PCT:CURRENT" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 2
> > >          search retry number=1
> > >          name length=18
> > > Channel "X04SA-ID-GAP:ISOPEN", connected to server
> > > 129.129.104.31:5064, native t
> > > ype DBF_ENUM, native element count 1, read access, write access
> > >          network IO pointer = 0x7b96cd0
> > >          server identifier 393
> > >          search retry number=1
> > >          name length=20
> > > Channel "ACOAU-ACCU:OP-MODE" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 7
> > >          search retry number=1
> > >          name length=19
> > > Channel "ARIDI-PCT:CURRENT" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 3
> > >          search retry number=1
> > >          name length=18
> > > Channel "ACOAU-ACCU:OP-MODE" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 8
> > >          search retry number=1
> > >          name length=19
> > > Channel "ARIDI-PCT:CURRENT" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 4
> > >          search retry number=1
> > >          name length=18
> > > Channel "X04SA-ID:ILK-BITS.B1", connected to server
> > > 129.129.104.31:5064, native
> > > type DBF_CHAR, native element count 1, read access, write access
> > >          network IO pointer = 0x7b96cd0
> > >          server identifier 394
> > >          search retry number=1
> > >          name length=21
> > > Channel "X04SA-ID-GAP:ISOPEN", connected to server
> > > 129.129.104.31:5064, native t
> > > ype DBF_ENUM, native element count 1, read access, write access
> > >          network IO pointer = 0x7b96cd0
> > >          server identifier 392
> > >          search retry number=1
> > >          name length=20
> > > Channel "ACOAU-ACCU:OP-MODE" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 5
> > >          search retry number=1
> > >          name length=19
> > > Channel "ARIDI-PCT:CURRENT" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 1
> > >          search retry number=1
> > >          name length=18
> > > Channel "ACOAU-ACCU:OP-MODE" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 6
> > >          search retry number=1
> > >          name length=19
> > > Channel "ARIDI-PCT:CURRENT" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 2
> > >          search retry number=1
> > >          name length=18
> > > Channel "X04SA-ID-GAP:ISOPEN", connected to server
> > > 129.129.104.31:5064, native t
> > > ype DBF_ENUM, native element count 1, read access, write access
> > >          network IO pointer = 0x7b96cd0
> > >          server identifier 393
> > >          search retry number=1
> > >          name length=20
> > > Channel "ACOAU-ACCU:OP-MODE" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 7
> > >          search retry number=1
> > >          name length=19
> > > Channel "ARIDI-PCT:CURRENT" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 3
> > >          search retry number=1
> > >          name length=18
> > > Channel "ACOAU-ACCU:OP-MODE" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 8
> > >          search retry number=1
> > >          name length=19
> > > Channel "ARIDI-PCT:CURRENT" is disconnected
> > >          network IO pointer = 0x7b96f58
> > >          server identifier 4
> > >          search retry number=1
> > >          name length=18
> > > Channel "X04SA-ID:ILK-BITS.B1", connected to server
> > > 129.129.104.31:5064, native
> > > type DBF_CHAR, native element count 1, read access, write access
> > >          network IO pointer = 0x7b96cd0
> > >          server identifier 394
> > >          search retry number=1
> > >          name length=21
> > > entries per bucket: mean = 0.021484 std dev = 0.144992 max = 1
> > > IO identifier hash table:
> > > Hash table with 512 buckets and 10 items of type 8baseNMIU installed
> > > event subscription IO at 0x7ae2ef0, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae2f1c, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae2f48, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae2f74, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae2fa0, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae2fcc, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae2ff8, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae3024, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae3050, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae307c, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae2ef0, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae2f1c, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae2f48, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae2f74, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae2fa0, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae2fcc, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae2ff8, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae3024, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae3050, type DBF_invalid, element
> > > count 1, mask 5
> > > event subscription IO at 0x7ae307c, type DBF_invalid, element
> > > count 1, mask 5
> > > entries per bucket: mean = 0.019531 std dev = 0.138383 max = 1
> > > Beacon source identifier hash table:
> > > Hash table with 512 buckets and 11 items of type 3bhe installed
> > > CA beacon hash entry for 129.129.104.13:32771 with period
> > > estimate 15.049622
> > >          beacon number 718291, on MON JUN 29 2009 13:59:03
> > > CA beacon hash entry for 129.129.104.33:5064 with period
> > > estimate 14.849221
> > >          beacon number 0, on THU AUG 06 2009 10:56:40
> > > CA beacon hash entry for 129.129.104.34:5064 with period
> > > estimate 14.891577
> > >          beacon number 0, on THU AUG 06 2009 10:56:30
> > > CA beacon hash entry for 129.129.104.36:5064 with period
> > > estimate 14.936137
> > >          beacon number 0, on THU AUG 06 2009 10:56:36
> > > CA beacon hash entry for 129.129.104.37:5064 with period
> > > estimate 14.915017
> > >          beacon number 26678, on THU AUG 06 2009 10:56:33
> > > CA beacon hash entry for 129.129.104.39:5064 with period
> > > estimate 14.838267
> > >          beacon number 0, on THU AUG 06 2009 10:56:39
> > > CA beacon hash entry for 129.129.104.13:5064 with period
> > > estimate 20.991435
> > >          beacon number 51, on THU AUG 06 2009 10:56:33
> > > CA beacon hash entry for 129.129.104.14:5064 with period
> > > estimate 14.935088
> > >          beacon number 218331, on THU AUG 06 2009 10:56:37
> > > CA beacon hash entry for 129.129.104.15:5064 with period
> > > estimate 14.881638
> > >          beacon number 218058, on THU AUG 06 2009 10:56:29
> > > CA beacon hash entry for 129.129.104.30:5064 with period
> > > estimate 14.953183
> > >          beacon number 0, on THU AUG 06 2009 10:56:39
> > > CA beacon hash entry for 129.129.104.31:5064 with period
> > > estimate 14.925085
> > >          beacon number 0, on THU AUG 06 2009 10:56:34
> > > CA beacon hash entry for 129.129.104.13:32771 with period
> > > estimate 15.049622
> > >          beacon number 718291, on MON JUN 29 2009 13:59:03
> > > CA beacon hash entry for 129.129.104.33:5064 with period
> > > estimate 14.849221
> > >          beacon number 0, on THU AUG 06 2009 10:56:40
> > > CA beacon hash entry for 129.129.104.34:5064 with period
> > > estimate 14.891577
> > >          beacon number 0, on THU AUG 06 2009 10:56:30
> > > CA beacon hash entry for 129.129.104.36:5064 with period
> > > estimate 14.936137
> > >          beacon number 0, on THU AUG 06 2009 10:56:36
> > > CA beacon hash entry for 129.129.104.37:5064 with period
> > > estimate 14.915017
> > >          beacon number 26678, on THU AUG 06 2009 10:56:33
> > > CA beacon hash entry for 129.129.104.39:5064 with period
> > > estimate 14.838267
> > >          beacon number 0, on THU AUG 06 2009 10:56:39
> > > CA beacon hash entry for 129.129.104.13:5064 with period
> > > estimate 20.991435
> > >          beacon number 51, on THU AUG 06 2009 10:56:33
> > > CA beacon hash entry for 129.129.104.14:5064 with period
> > > estimate 14.935088
> > >          beacon number 218331, on THU AUG 06 2009 10:56:37
> > > CA beacon hash entry for 129.129.104.15:5064 with period
> > > estimate 14.881638
> > >          beacon number 218058, on THU AUG 06 2009 10:56:29
> > > CA beacon hash entry for 129.129.104.30:5064 with period
> > > estimate 14.953183
> > >          beacon number 0, on THU AUG 06 2009 10:56:39
> > > CA beacon hash entry for 129.129.104.31:5064 with period
> > > estimate 14.925085
> > >          beacon number 0, on THU AUG 06 2009 10:56:34
> > > entries per bucket: mean = 0.021484 std dev = 0.144992 max = 1
> > > Timer queue:
> > > EPICS threaded timer queue at 0x7bf6240
> > > epicsTimerQueue with 15 items pending
> > > 11searchTimer, state = pending, delay = -19.894333
> > > search timer delay 0.064000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > 11searchTimer, state = pending, delay = -20.000999
> > > search timer delay 1.024000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > 11searchTimer, state = pending, delay = -20.163666
> > > search timer delay 0.128000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > 11searchTimer, state = pending, delay = -20.235666
> > > search timer delay 0.256000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > 11searchTimer, state = pending, delay = -20.296333
> > > search timer delay 0.512000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > 11searchTimer, state = pending, delay = -18.926999
> > > search timer delay 2.048000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > 11searchTimer, state = pending, delay = -17.095666
> > > search timer delay 4.096000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > 23disconnectGovernorTimer, state = pending, delay = -16.191666
> > > disconnect governor timer:
> > > 11searchTimer, state = pending, delay = -13.316332
> > > search timer delay 8.192000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > 11searchTimer, state = pending, delay = -5.424332
> > > search timer delay 16.384000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > 11searchTimer, state = pending, delay = -5.590332
> > > search timer delay 32.768000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > 15tcpRecvWatchdog, state = pending, delay = 1.191668
> > > Receive virtual circuit watchdog at 0x7b96df0, period 30.000000
> > >
> > > 11searchTimer, state = pending, delay = 6.361002
> > > search timer delay 65.536000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > 11searchTimer, state = pending, delay = 51.530337
> > > search timer delay 131.072000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > 11searchTimer, state = pending, delay = 236.852334
> > > search timer delay 262.144000
> > > 0 channels with search request pending
> > > 0 channels with search response pending
> > > reschedule event
> > >
> > > Semaphore Id        : 0x7bf6070
> > > Semaphore Type      : BINARY
> > > Task Queuing        : FIFO
> > > Pended Tasks        : 0
> > > State               : EMPTY
> > > Options             : 0x0       SEM_Q_FIFO
> > >
> > > VxWorks Events
> > > --------------
> > > Registered Task     : NONE
> > > Event(s) to Send    : N/A
> > > Options             : N/A
> > >
> > > exit event
> > >
> > > Semaphore Id        : 0x7bf6030
> > > Semaphore Type      : BINARY
> > > Task Queuing        : FIFO
> > > Pended Tasks        : 0
> > > State               : EMPTY
> > > Options             : 0x0       SEM_Q_FIFO
> > >
> > > VxWorks Events
> > > --------------
> > > Registered Task     : NONE
> > > Event(s) to Send    : N/A
> > > Options             : N/A
> > >
> > > exitFlag = F, terminateFlag = F
> > > IP address to name conversion engine:
> > > ipAddrToAsciiEngine at 0x7bf86c0 with 0 requests pending
> > > mutex:
> > > epicsMutexId 0x7bf8650 source
> > > ../../../src/libCom/osi/epicsMutex.cpp line 197
> > >
> > > Semaphore Id        : 0x7bf8680
> > > Semaphore Type      : MUTEX
> > > Task Queuing        : PRIORITY
> > > Pended Tasks        : 0
> > > Owner               : 0x7a331e0    (tShell)
> > > Options             : 0xd       SEM_Q_PRIORITY
> > >                                  SEM_DELETE_SAFE
> > >                                  SEM_INVERSION_SAFE
> > >
> > > VxWorks Events
> > > --------------
> > > Registered Task     : NONE
> > > Event(s) to Send    : N/A
> > > Options             : N/A
> > >
> > > laborEvent:
> > >
> > > Semaphore Id        : 0x7bf8610
> > > Semaphore Type      : BINARY
> > > Task Queuing        : FIFO
> > > Pended Tasks        : 1
> > > State               : EMPTY
> > > Options             : 0x0       SEM_Q_FIFO
> > >
> > > VxWorks Events
> > > --------------
> > > Registered Task     : NONE
> > > Event(s) to Send    : N/A
> > > Options             : N/A
> > >
> > > Pended Tasks
> > > ------------
> > >     NAME      TID    PRI TIMEOUT
> > > ---------- -------- --- -------
> > > ipToAsciiProxy 7bf82c0 189       0
> > >
> > > exitFlag  boolean = 0
> > > exit event:
> > > Default mutex:
> > > epicsMutexId 0x7ca6c20 source
> > > ../../../src/libCom/osi/epicsMutex.cpp line 197
> > >
> > > Semaphore Id        : 0x7ca6c50
> > > Semaphore Type      : MUTEX
> > > Task Queuing        : PRIORITY
> > > Pended Tasks        : 5
> > > Owner               : 0x7a331e0    (tShell)
> > > Options             : 0xd       SEM_Q_PRIORITY
> > >                                  SEM_DELETE_SAFE
> > >                                  SEM_INVERSION_SAFE
> > >
> > > VxWorks Events
> > > --------------
> > > Registered Task     : NONE
> > > Event(s) to Send    : N/A
> > > Options             : N/A
> > >
> > > Pended Tasks
> > > ------------
> > >     NAME      TID    PRI TIMEOUT
> > > ---------- -------- --- -------
> > > scan10      7c2d530 136       0
> > > CAC-UDP     7be23d0 147       0
> > > timerQueue  7bf5d20 148       0
> > > CAC-event   7bb5a90 148       0
> > > CAC-TCP-recv 7a5e3b0 150       0
> > >
> > > mutex:
> > > epicsMutexId 0x7ca6c20 source
> > > ../../../src/libCom/osi/epicsMutex.cpp line 197
> > >
> > > Semaphore Id        : 0x7ca6c50
> > > Semaphore Type      : MUTEX
> > > Task Queuing        : PRIORITY
> > > Pended Tasks        : 5
> > > Owner               : 0x7a331e0    (tShell)
> > > Options             : 0xd       SEM_Q_PRIORITY
> > >                                  SEM_DELETE_SAFE
> > >                                  SEM_INVERSION_SAFE
> > >
> > > VxWorks Events
> > > --------------
> > > Registered Task     : NONE
> > > Event(s) to Send    : N/A
> > > Options             : N/A
> > >
> > > Pended Tasks
> > > ------------
> > >     NAME      TID    PRI TIMEOUT
> > > ---------- -------- --- -------
> > > scan10      7c2d530 136       0
> > > CAC-UDP     7be23d0 147       0
> > > timerQueue  7bf5d20 148       0
> > > CAC-event   7bb5a90 148       0
> > > CAC-TCP-recv 7a5e3b0 150       0
> > >
> > >          preemptive callback is enabled
> > >          there are 0 unsatisfied IO operations blocking ca_pend_io()
> > >          the current io sequence number is 0
> > > IO done event:
> > >
> > > Semaphore Id        : 0x7ca6b70
> > > Semaphore Type      : BINARY
> > > Task Queuing        : FIFO
> > > Pended Tasks        : 0
> > > State               : EMPTY
> > > Options             : 0x0       SEM_Q_FIFO
> > >
> > > VxWorks Events
> > > --------------
> > > Registered Task     : NONE
> > > Event(s) to Send    : N/A
> > > Options             : N/A
> > >
> > > Synchronous group identifier hash table:
> > > Hash table with 0 buckets and 0 items of type 4CASG installed
> > > value = 0 = 0x0
> > > X04SA-VME-ID2 >
> > >
> > >
> > >
> > > Jeff Hill wrote:
> > > > Dirk,
> > > >
> > > > The R3.14 design is more conservative from a perspective of avoiding
> > > > positive congestion feedback. Here are the primary differences.
> > > >
> > > > O When a circuit times out the CAC library disconnects the
> > > application, but
> > > > not the circuit. We rely on TCP to decide when to give up
> > > on a circuit, and
> > > > disconnect it.
> > > >
> > > > O When the CAC library sees a beacon anomaly the search
> > > rate receives a
> > > > boost, but this boost is only to once every 5 seconds. In
> > > contrast, newly
> > > > created channels are initially searched for at a rate which
> > > is closer to the
> > > > average measured round trip time. The CAC library sends
> > > multiple search
> > > > requests per UDP frame, and depending on past success rates
> > > also multiple
> > > > UDP frames per periodic search attempt (that part isn't
> > > different between
> > > > R3.14 and R3.15). Nevertheless, if there many channels it
> > > can take a little
> > > > while to connect all of them if the IOC (or GW in this
> > > situation has been
> > > > off for a long time).
> > > >
> > > > O In R3.14 the CAC library never stops searching for
> > > disconnected channels,
> > > > but the maximum search rate is adjustable with an
> > > environment variable.
> > > >
> > > >> I cannot see any search requests from that IOC.
> > > >
> > > > That doesn't sound right, but bear in mind that if the IOC
> > > is sending
> > > > unicasts, and not broadcasts, to the gateway on a switched
> > > network then the
> > > > network sniffer may or may not see them if it isn't
> > > listening at the source
> > > > or the destination cables. Directed broadcasts may also not
> > > be seen by the
> > > > network sniffer if it isn't sniffing directly on the
> > > directed broadcast's
> > > > destination subnet.
> > > >
> > > > The output from dbcar when its interest level argument is
> > > set very high (say
> > > > to 100) will show many things about the internals of the
> > > CAC library. Also,
> > > > there is a timer thread created by the CAC library that is
> > > scheduling the
> > > > search requests, and you might send a stack trace (if it
> > > isn't simply
> > > > waiting on its time delay event flag).
> > > >
> > > > Another possibility is that something is amiss in the GW.
> > > Does restarting
> > > > the GW improve the situation?
> > > >
> > > > Jeff
> > > > ______________________________________________________
> > > > Jeffrey O. Hill           Email        [email protected]
> > > > LANL MS H820              Voice        505 665 1831
> > > > Los Alamos NM 87545 USA   FAX          505 665 5107
> > > >
> > > > Message content: TSPA
> > > >
> > > >
> > > >> -----Original Message-----
> > > >> From: [email protected]
> > > [mailto:[email protected]]
> > > >> On Behalf Of Dirk Zimoch
> > > >> Sent: Wednesday, August 05, 2009 8:09 AM
> > > >> To: tech-talk
> > > >> Subject: CA reconnection problem
> > > >>
> > > >> Hi all (Jeff in particular)
> > > >>
> > > >> I have seen a problem with some of our IOCs today. Several
> > > beamline IOCs
> > > >> running
> > > >> 3.14.8 have disconnected CA input links to a central IOC
> > > (3.13.10) even
> > > >> though
> > > >> the channels are there. The beamline IOCs run behind ca
> > > gateways. It may
> > > >> be that
> > > >> the gateways or the central IOC had been offline for more
> > > than some hours
> > > >> recently in the last shutdown.
> > > >>
> > > >> Even though the channels are back and I can read them from
> > > the beamline
> > > >> network
> > > >> with caget the IOCs don't reconnect.
> > > >>
> > > >> I have started a softioc in the beamline network to
> > > generate a beacon
> > > >> anomaly
> > > >> but even then the links do not reconnect. I cannot see any
> > > search requests
> > > >> from
> > > >> that IOC.
> > > >>
> > > >> Only 3.14.8 systems show this problem. 3.13. systems
> > > don't. Is there a bug
> > > >> in
> > > >> the 3.14.8 implementation of CAC I should know of?
> > > >>
> > > >> Dirk
> > > >>
> > > >> --
> > > >> Dr. Dirk Zimoch
> > > >> Paul Scherrer Institut, WBGB/006
> > > >> 5232 Villigen PSI, Switzerland
> > > >> Phone +41 56 310 5182
> > > >
> > > >
> > >
> > > --
> > > Dr. Dirk Zimoch
> > > Paul Scherrer Institut, WBGB/006
> > > 5232 Villigen PSI, Switzerland
> > > Phone +41 56 310 5182
> > >
> > --
> > Scanned by iCritical.




Replies:
RE: CA reconnection problem Shepherd, Emma (DLSLtd,RAL,DIA)
References:
CA reconnection problem Dirk Zimoch
RE: CA reconnection problem Jeff Hill
Re: CA reconnection problem Dirk Zimoch
RE: CA reconnection problem Rees, NP (Nick)
RE: CA reconnection problem Jeff Hill

Navigate by Date:
Prev: Re: Building ASYN on Windows Andrew Johnson
Next: RE: CA reconnection problem Shepherd, Emma (DLSLtd,RAL,DIA)
Index: 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  <20092010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022  2023  2024 
Navigate by Thread:
Prev: RE: CA reconnection problem Jeff Hill
Next: RE: CA reconnection problem Shepherd, Emma (DLSLtd,RAL,DIA)
Index: 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  <20092010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022  2023  2024