EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System

1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  <20182019  2020  2021  2022  2023  2024  Index 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  <20182019  2020  2021  2022  2023  2024 
<== Date ==> <== Thread ==>

Subject: Re: caget randomly returns Channel connect timed out
From: Michael Davidsaver <[email protected]>
To: Ignacio <[email protected]>
Cc: [email protected]
Date: Fri, 13 Jul 2018 13:19:28 -0700
On 07/12/2018 02:03 PM, Ignacio wrote:
> Hello everyone,
> Thanks for the responses.
> Just to add more info, we have another system (with IP 172.17.102.130) that's reading tcs:ak:astCtx with a caget. Whenever we see the "not found" message on our test, we see this message on the TCS ioc console, 
> 
> Jul 12 12:40:59  E) PORT: tcs_vme, MSG: CAS: request from 172.17.102.130:57113 <http://172.17.102.130:57113> => bad resource ID                                                                 
> Jul 12 12:40:59  E) PORT: tcs_vme, MSG: CAS: Request from 172.17.102.130:57113 <http://172.17.102.130:57113> => cmmd=12 cid=0xffffffff type=0 count=0 postsize=0 
> Jul 12 12:40:59  E) PORT: tcs_vme, MSG: CAS: Request from 172.17.102.130:57113 <http://172.17.102.130:57113> =>   available=0x1       N=0 paddr=0x0 
> Jul 12 12:40:59  E) PORT: tcs_vme, MSG: CAS: forcing disconnect from 172.17.102.130:57113 <http://172.17.102.130:57113> 

What you're seeing is the IOC (aka. RSRV) flagging a protocol violation, and forcing the TCP connection closed.

"cmmd=12" is a CA_PROTO_CLEAR_CHANNEL message.  And the the other fields look valid, except that "cid=0xffffffff"...

https://epics.anl.gov/base/R3-16/0-docs/CAproto/index.html

The labeling of cid= and available= are not specific to this message.  The header is:

000c 0000 0000 0000
ffff ffff 0000 0001

Despite the label "cid=", in this message the 0xffffffff is the Server provided SID.
So the CID is 1, which makes sense as the client is 'caget' requesting a single PV.
RSRV and libca pick CID/SID numbers sequentially starting with 1.

Have a look at https://github.com/mdavidsaver/cashark and run

> wireshark -X lua_script:ca.lua test/catest.pcap

The first exchange (a caget) has CID=1 and SID=2, so it's obvious which is which.


Anyway, this does appear to be a bad SID sent by the client.  And 0xffffffff is quite suggestive.
Like '-1' suggestive.  And in fact the SID is initialized to -1 when a new channel structure (class nciu)
is allocated, and returned to -1 in nciu::setServerAddressUnknown().

https://github.com/epics-base/epics-base/search?q=sid+%3D+UINT_MAX&unscoped_q=sid+%3D+UINT_MAX

Given the circumstances of a CA_PROTO_CLEAR_CHANNEL message, I suspect that the a message
is being sent after the 'sid' member has been cleared.

Would it be possible for me to see a capture of the TCP traffic resulting from a 'caget' which triggers this error?



> 
> The other system uses EPICS3.14.8.2
> 
> Thanks again,
> Saludos,
> Ignacio A. 
> 
> 
> 
> On Thu, Jul 12, 2018 at 4:47 PM Michael Davidsaver <[email protected] <mailto:[email protected]>> wrote:
> 
>     On 07/12/2018 01:12 PM, Andrew Johnson wrote:
>     > Hi Matt,
>     >
>     > On 07/12/2018 02:54 PM, Matt Rippa wrote:
>     >> We ran a simple caget (once per second) from a linux host to an upgraded
>     >> IOC. With the new rtems/3.14.12.7 <http://3.14.12.7> <http://3.14.12.7> IOC, it often
>     >> experiences timeouts as shown below. With our legacy vxworks/3.13.9 IOC
>     >> there are no timeouts logged. Relevant info is shown below. The
>     >> caRepeater is running on the linux host.
>     >
>     > If your IOC is busy the caget default search timeout of 1 second may be
>     > too short, have you tried using 'caget -w 5 tcs:ak:astCtx' say?
> 
>     @Matt, If this doesn't have an effect I'd suggest taking a packet capture
>     to see whether these is a search reply coming back.
> 
>     As it happens, I had an symptom similar to this while playing around
>     with an mvme31000 emulator.  The issue turned out to be dropped RX
>     interrupts due to a bug in the emulator.  The effect was due to
>     a hard coded 1 second timeout in CA during TCP connect().
> 
>     Have you tried the simple test of letting 'ping' run for a while?
>     If the reported latencies aren't stable, it may be a sign of a
>     lower level issue.
> 
>     >> I'm thinking that setting the linux host EPICS_CA_ADDR_LIST=YES may
>     >> help. I'm running that test now. 
>     >
>     > I have never seen an EPICS_CA_ADDR_LIST as long as yours, but given that
>     > you seem to be searching multiple subnets maybe you do need that.
> 
>     Agreed.  For this test I'd suggest setting EPICS_CA_ADDR_LIST with only
>     the IP of the IOC you're testing.  Along with EPICS_CA_AUTO_ADDR_LIST=NO
> 
>     > I don't think setting EPICS_CA_AUTO_ADDR_LIST to YES will make any
>     > difference to this behaviour (your epicsPrtEnvParams showed it as YES
>     > already, so unless you have explicitly changed it for the client you're
>     > probably already searching your local subnet as well).
> 
>     Also if both IOC and client tools are 3.14.12.x then you can try
> 
>     EPICS_CA_AUTO_ADDR_LIST=NO EPICS_CA_ADDR_LIST= EPICS_CA_NAME_SERVERS=<IP>
> 
>     This will skip the UDP search altogether and do a TCP search.
> 
> 
>     > HTH,
>     >
>     > - Andrew
>     >
>     >
>     >> EPICS_CA_ADDR_LIST=172.17.2.255 172.17.3.255 172.17.102.130
>     >> 172.17.105.20 172.16.102.130 aom-vme sbfmcao01.cl.gemini.edu <http://sbfmcao01.cl.gemini.edu>
>     >> mcao-stealth.cl.gemini.edu <http://mcao-stealth.cl.gemini.edu> 172.17.106.111 172.17.105.37
>     >> 172.17.107.50 172.17.55.101 172.17.101.101 172.17.65.255
>     >> 172.17.102.139 172.17.102.138
>     >>
>     >> linux> caget tcs:ak:astCtx
>     >> tcs:ak:astCtx                  0
>     >> linux> caget tcs:ak:astCtx
>     >> Channel connect timed out: 'tcs:ak:astCtx' not found.
>     >> linux> caget tcs:ak:astCtx
>     >> Channel connect timed out: 'tcs:ak:astCtx' not found.
>     >> linux> caget tcs:ak:astCtx
>     >> tcs:ak:astCtx                  0
>     >
> 


References:
caget randomly returns Channel connect timed out Matt Rippa
Re: caget randomly returns Channel connect timed out Andrew Johnson
Re: caget randomly returns Channel connect timed out Michael Davidsaver
Re: caget randomly returns Channel connect timed out Ignacio

Navigate by Date:
Prev: Re: caget randomly returns Channel connect timed out Matt Rippa
Next: GeSys+Cexp on RTEMS 4.10+ Anton Derbenev
Index: 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  <20182019  2020  2021  2022  2023  2024 
Navigate by Thread:
Prev: Re: caget randomly returns Channel connect timed out Matt Rippa
Next: Re: caget randomly returns Channel connect timed out Matt Rippa
Index: 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  <20182019  2020  2021  2022  2023  2024 
ANJ, 13 Jul 2018 Valid HTML 4.01! · Home · News · About · Base · Modules · Extensions · Distributions · Download ·
· Search · EPICS V4 · IRMIS · Talk · Bugs · Documents · Links · Licensing ·