EPICS Home

Experimental Physics and Industrial Control System


Subject: Re: Stalled CA connection (IOC to CS-Studio archiver)
From: Michael Davidsaver <[email protected]>
To: "Kasemir, Kay" <[email protected]>, Ralph Lange <[email protected]>, EPICS Core Talk <[email protected]>
Date: Thu, 15 Jun 2017 17:53:14 +0200
On the RSRV side, my best guess is that the sender thread is in a
blocking send() with the client lock held (cf. cas_send_bs_msg() with
lock_needed=true), while the recv thread is stuck trying to take the
same client lock.

A cursory look at src/com/cosylab/epics/caj/impl/CATransport.java
suggests that CAJ also holds a lock around some send() calls.  So it
may be the same situation there.

libca at least claims not to do this (in tcpiiu::sendThreadFlush()).
If true, then libca would time out if RSRV got into this situation.
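
To make the suspected pattern concrete, here is a minimal Python sketch
of one end of such a connection.  It is not RSRV or CAJ code; the names
(demo_stall, client_lock, PAYLOAD) are made up for illustration.  A
sender thread holds a lock while in a blocking send to a peer that is
not reading, and a second thread that needs the same lock before it can
process incoming data fails to get it.  If both ends of a connection
behave this way, neither would drain its receive queue.

```python
import socket
import threading

# Illustrative payload, chosen to be much larger than typical socket buffers.
PAYLOAD = bytes(16 * 1024 * 1024)


def demo_stall(lock_wait=0.5):
    """Return True if the 'recv' thread could not take the client lock."""
    server, peer = socket.socketpair()  # 'peer' does not read at first
    client_lock = threading.Lock()      # hypothetical per-client lock
    lock_taken = threading.Event()
    result = {}

    def sender():
        # Blocking send performed with the lock held.
        with client_lock:
            lock_taken.set()
            server.sendall(PAYLOAD)     # blocks once the buffers fill up

    def receiver():
        # Needs the same lock before it can process incoming data.
        got = client_lock.acquire(timeout=lock_wait)
        if got:
            client_lock.release()
        result["receiver_stalled"] = not got

    send_thread = threading.Thread(target=sender)
    send_thread.start()
    lock_taken.wait()                   # sender now owns the lock

    recv_thread = threading.Thread(target=receiver)
    recv_thread.start()
    recv_thread.join()

    # Clean up: drain the peer so the blocked send can finish.
    remaining = len(PAYLOAD)
    while remaining:
        chunk = peer.recv(1 << 16)
        if not chunk:
            break
        remaining -= len(chunk)
    send_thread.join()
    server.close()
    peer.close()
    return result["receiver_stalled"]
```

In the sketch the stall is resolved by draining the peer at the end.  In
the scenario described above nothing would do that, because the peer's
own receive path would be stuck in the same way.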




On 06/15/2017 05:29 PM, Kasemir, Kay wrote:
> Hi:
> 
> 
> No real clue.
> 
> 
> On the archive engine VM, you would issue
> 
> 
>   kill -QUIT {PID of the java process}
> 
> 
> which causes Java to dump a stack trace of all threads to its console,
> including locks that each thread has taken or is trying to take.
> 
> (You can also use "jps" to list all java processes, then "jstack {PID}"
> to fetch a stack trace.)
> 
> 
> Maybe do that again 5 minutes later and compare to see if there's one
> thread that's blocked by a lock, or stuck in some function call and not
> progressing for some other reason.
> 
> 
> -Kay
> 
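
The comparison of two dumps could be scripted roughly as follows.  This
is only a sketch: it assumes the usual HotSpot text layout (a quoted
thread name on the first line of each thread, a
"java.lang.Thread.State:" line, then "at ..." frames, with blank lines
between threads), and the function names are made up here.  It reports
threads that are BLOCKED with an identical stack in both dumps.

```python
import re

STATE = re.compile(r"java\.lang\.Thread\.State: (\w+)")


def parse_dump(text):
    """Map thread name -> (state, tuple of stack and lock lines)."""
    threads = {}
    for block in re.split(r"\n\s*\n", text):
        lines = [ln.strip() for ln in block.strip().splitlines()]
        if not lines or not lines[0].startswith('"'):
            continue  # not a thread entry (e.g. the dump header)
        name = lines[0].split('"')[1]
        match = STATE.search(block)
        state = match.group(1) if match else "UNKNOWN"
        frames = tuple(ln for ln in lines[1:] if ln.startswith(("at ", "- ")))
        threads[name] = (state, frames)
    return threads


def suspects(dump_a, dump_b):
    """Names of threads that are BLOCKED and unchanged between two dumps."""
    first, second = parse_dump(dump_a), parse_dump(dump_b)
    return sorted(
        name
        for name, (state, frames) in first.items()
        if state == "BLOCKED" and second.get(name) == (state, frames)
    )
```

Threads that are merely idle also keep the same stack between dumps,
which is why the sketch only reports BLOCKED ones.  A thread stuck in a
native call such as a socket write typically shows as RUNNABLE and would
still need to be checked by eye.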
> 
> 
> ------------------------------------------------------------------------
> *From:* [email protected] <[email protected]> on
> behalf of Ralph Lange <[email protected]>
> *Sent:* Thursday, June 15, 2017 5:37 AM
> *To:* EPICS Core Talk
> *Subject:* Stalled CA connection (IOC to CS-Studio archiver)
>  
> Hi all,
> 
> We have an ongoing issue in a test setup that includes a Linux "Fast
> Controller" (IP...37) running IOCs (40k records each) on one end and a
> CS-Studio BEAUTY archiver on a VM (IP...41) on the other end. IOCs are
> running Base 3.15.5, BEAUTY uses a current JCA/CAJ client.
> 
> The CA TCP connection is up, but blocked in both directions:
> 
> On the fast controller (...37) , netstat shows
> 
> tcp        0      0 IP...37:5064   0.0.0.0:*      LISTEN      29499/MAG-CYSI
> tcp    86888 178656 IP...37:5064   IP...41:40147  ESTABLISHED 29499/MAG-CYSI
> 
> On the archiver VM (...41), we see
> 
> tcp   495144  70184 IP...41:40147  IP...37:5064   ESTABLISHED 9164/java
> tcp        0      0 IP...41:40691  IP...49:5064   ESTABLISHED 9164/java
> 
> tcpdump shows no traffic on that connection.
> 
> The archive engine logs things like:
> 
> 2017-06-12 22:17:53.047 WARNING [Thread 30]
> com.cosylab.epics.caj.impl.CATransport (noSyncSend) - Failed to send
> message to /IP...37:5064 - buffer full, will retry.
> 
> and has not written data to the archive from this IOC for a long time.
> It is happily archiving data from other connections (e.g. the one shown
> in line 2 of the netstat output above).
> 
> Obviously the TCP connection is blocked and backed up to the other host
> in both directions.
> 
> The IOC is alive and casr shows all channels as connected.
> 
> Why are both sides not taking data out of their receive-Qs?
> 
> This is not the first time this has happened to us in this test
> setup. Has anyone seen such situations before? Any ideas for how to
> proceed in trying to find out what's happening?
> 
> Thanks a lot
> ~Ralph
> 
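
The netstat symptom quoted above (an ESTABLISHED connection with both
Recv-Q and Send-Q non-zero) can be screened for with a few lines of
Python.  This is a sketch that assumes the Linux netstat column order
shown in the post (Proto, Recv-Q, Send-Q, local address, foreign
address, state); the function name backed_up is made up here.

```python
def backed_up(netstat_text, threshold=0):
    """Return (local, foreign, recv_q, send_q) for ESTABLISHED TCP
    connections whose Recv-Q and Send-Q both exceed the threshold."""
    hits = []
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # header or non-TCP line
        recv_q, send_q = int(fields[1]), int(fields[2])
        if fields[5] == "ESTABLISHED" and recv_q > threshold and send_q > threshold:
            hits.append((fields[3], fields[4], recv_q, send_q))
    return hits
```

Non-zero queues in a single snapshot can be normal on a busy link, so
this is only meaningful when the same connection shows up with
unchanged or growing queues across several snapshots taken some time
apart.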


Replies:
Re: Stalled CA connection (IOC to CS-Studio archiver) Ralph Lange
References:
Stalled CA connection (IOC to CS-Studio archiver) Ralph Lange
Re: Stalled CA connection (IOC to CS-Studio archiver) Kasemir, Kay
