Subject: Re: Stalled CA connection (IOC to CS-Studio archiver)
From: "Kasemir, Kay" <[email protected]>
To: Ralph Lange <[email protected]>, EPICS Core Talk <[email protected]>
Date: Thu, 15 Jun 2017 15:29:18 +0000
Hi:
No real clue.
On the archive engine VM, you would issue
kill -QUIT {PID of the java process}
which causes Java to dump a stack trace of all threads to its console, including locks that each thread has taken or is trying to take. (You can also use "jps" to list all java processes, then "jstack {PID}" to fetch a stack trace.)
Maybe do that again 5 minutes later and compare to see if there's one thread that's blocked by a lock, or stuck in some function call and not progressing for some other reason.
-Kay
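The comparison of two thread dumps suggested above could be sketched roughly as follows. This is a minimal illustration, not part of CS-Studio or the JDK: the function names `parse_dump` and `unchanged_threads` are hypothetical, the parser assumes the usual HotSpot text layout (a quoted thread name on the header line, a `java.lang.Thread.State:` line, then `at ...` frames, with a blank line between threads), and it assumes thread names are unique within a dump.

```python
import re

# Header line of one thread entry, e.g.
# "Thread 30" #30 prio=5 tid=0x... nid=0x... waiting for monitor entry [0x...]
_HEADER = re.compile(r'^"(?P<name>.+?)"')
_STATE = re.compile(r'java\.lang\.Thread\.State:\s+(\w+)')


def parse_dump(text):
    """Return {thread name: (state, tuple of 'at ...' frames)} for one dump."""
    threads = {}
    name = None
    for line in text.splitlines():
        header = _HEADER.match(line)
        if header:
            name = header.group("name")
            threads[name] = [None, []]
            continue
        if name is None:
            continue  # text outside of any thread entry
        if not line.strip():
            name = None  # a blank line ends the current thread entry
            continue
        state = _STATE.search(line)
        if state:
            threads[name][0] = state.group(1)
        elif line.strip().startswith("at "):
            # Only stack frames are kept; lock annotation lines
            # ("- waiting to lock ...") are ignored to keep the comparison simple.
            threads[name][1].append(line.strip())
    return {n: (s, tuple(f)) for n, (s, f) in threads.items()}


def unchanged_threads(first, second):
    """Return {name: state in second dump} for threads whose stack did not change."""
    before, after = parse_dump(first), parse_dump(second)
    result = {}
    for name, (state, frames) in after.items():
        if frames and name in before and before[name][1] == frames:
            result[name] = state
    return result
```

One would feed it the text of two dumps taken a few minutes apart (for example, saved `jstack {PID}` output) and look first at entries whose state is BLOCKED. Idle threads that are simply parked waiting for work will also show up as unchanged, so the result is a list of candidates to inspect rather than a diagnosis.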
From: [email protected] <[email protected]> on behalf of Ralph Lange <[email protected]>
Sent: Thursday, June 15, 2017 5:37 AM
To: EPICS Core Talk
Subject: Stalled CA connection (IOC to CS-Studio archiver)
Hi all,
We have an ongoing issue in a test setup that includes a Linux "Fast Controller" (IP...37) running IOCs (40k records each) on one end and a CS-Studio BEAUTY archiver on a VM (IP...41) on the other end. IOCs are running Base 3.15.5, BEAUTY uses a current JCA/CAJ client.
The CA TCP connection is up, but blocked in both directions:
On the fast controller (...37), netstat shows
tcp 0 0 IP...37:5064 0.0.0.0:* LISTEN 29499/MAG-CYSI
On the archiver VM (...41), we see
tcp 495144 70184 IP...41:40147 IP...37:5064 ESTABLISHED 9164/java
tcpdump shows no traffic on that connection.
The archive engine logs things like:
2017-06-12 22:17:53.047 WARNING [Thread 30] com.cosylab.epics.caj.impl.CATransport (noSyncSend) - Failed to send message to /IP...37:5064 - buffer full, will retry.
and has not written data to the archive from this IOC for a long time. It is happily archiving data from other connections (e.g. the one shown in line 2 of the netstat output above).
Obviously the TCP connection is blocked and backed up to the other host in both directions. The IOC is alive and casr shows all channels as connected.
Why are both sides not taking data out of their receive-Qs?
In this test setup, this is not happening to us for the first time. Has anyone seen such situations before? Any ideas for how to proceed trying to find out what's happening?
Thanks a lot