On the RSRV side, my best guess is that the sender thread is stuck in a
blocking send() with the client lock held (cf. cas_send_bs_msg() with
lock_needed=true), while the recv thread is stuck trying to take the
client lock. With the recv thread blocked, nothing drains the receive queue.
A cursory look at src/com/cosylab/epics/caj/impl/CATransport.java
suggests that CAJ also holds a lock around some send() calls, so it may be
the same situation there.
libca at least claims not to do this (in tcpiiu::sendThreadFlush()).
If true, then libca would time out if RSRV got into this situation.
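
To make the suspected pattern concrete, here is a minimal Java sketch. It is not taken from RSRV or CAJ: the class name, the clientLock object, and the "sender"/"receiver" thread names are hypothetical, and a latch stands in for the send() that never returns. It only shows one thread holding a lock while stalled, a second thread blocked on that lock, and how the JVM's standard ThreadMXBean reports the owner. A jstack dump should show the same relationship.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class LockedSendSketch {
    // Hypothetical stand-in for the per-client lock.
    private static final Object clientLock = new Object();

    /** Name of the thread owning the monitor that 'waiter' is blocked on, or null. */
    static String lockOwnerOf(Thread waiter) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ThreadInfo info = mx.getThreadInfo(waiter.getId());
        return info == null ? null : info.getLockOwnerName();
    }

    /** Reproduces the pattern and returns who is blocking the receiver. */
    static String demo() {
        CountDownLatch holding = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread sender = new Thread(() -> {
            synchronized (clientLock) {
                holding.countDown();
                try {
                    // Assumption: this wait models a send() stalled on a full socket buffer.
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "sender");

        Thread receiver = new Thread(() -> {
            // The receiver needs the same lock before it can process input.
            synchronized (clientLock) {
                // Nothing to do; acquiring the lock is the point.
            }
        }, "receiver");

        sender.setDaemon(true);
        receiver.setDaemon(true);
        try {
            sender.start();
            holding.await();          // sender now owns clientLock
            receiver.start();
            while (receiver.getState() != Thread.State.BLOCKED) {
                Thread.sleep(10);     // wait until the receiver is stuck on the lock
            }
            String owner = lockOwnerOf(receiver);
            release.countDown();      // let the "send" finish so both threads exit
            sender.join();
            receiver.join();
            return owner;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("receiver is blocked by: " + demo());
    }
}
```

In a real stall, the interesting detail in the dump would be whether the lock owner is itself sitting in a socket write.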
On 06/15/2017 05:29 PM, Kasemir, Kay wrote:
> Hi:
>
>
> No real clue.
>
>
> On the archive engine VM, you would issue
>
>
> kill -QUIT {PID of the java process}
>
>
> which causes Java to dump a stack trace of all threads to its console,
> including locks that each thread has taken or is trying to take.
>
> (You can also use "jps" to list all java processes, then "jstack {PID}"
> to fetch a stack trace.)
>
>
> Maybe do that again 5 minutes later and compare to see if there's one
> thread that's blocked by a lock, or stuck in some function call and not
> progressing for some other reason.
>
>
> -Kay
>
>
>
> ------------------------------------------------------------------------
> *From:* [email protected] <[email protected]> on
> behalf of Ralph Lange <[email protected]>
> *Sent:* Thursday, June 15, 2017 5:37 AM
> *To:* EPICS Core Talk
> *Subject:* Stalled CA connection (IOC to CS-Studio archiver)
>
> Hi all,
>
> We have an ongoing issue in a test setup that includes a Linux "Fast
> Controller" (IP...37) running IOCs (40k records each) on one end and a
> CS-Studio BEAUTY archiver on a VM (IP...41) on the other end. IOCs are
> running Base 3.15.5, BEAUTY uses a current JCA/CAJ client.
>
> The CA TCP connection is up, but blocked in both directions:
>
> On the fast controller (...37) , netstat shows
>
> tcp 0 0 IP...37:5064 0.0.0.0:* LISTEN 29499/MAG-CYSI
> tcp 86888 178656 IP...37:5064 IP...41:40147 ESTABLISHED 29499/MAG-CYSI
>
> On the archiver VM (...41), we see
>
> tcp 495144 70184 IP...41:40147 IP...37:5064 ESTABLISHED 9164/java
> tcp 0 0 IP...41:40691 IP...49:5064 ESTABLISHED 9164/java
>
> tcpdump shows no traffic on that connection.
>
> The archive engine logs things like:
>
> 2017-06-12 22:17:53.047 WARNING [Thread 30]
> com.cosylab.epics.caj.impl.CATransport (noSyncSend) - Failed to send
> message to /IP...37:5064 - buffer full, will retry.
>
> and has not written data to the archive from this IOC for a long time.
> It is happily archiving data from other connections (e.g. the one shown
> in line 2 of the netstat output above).
>
> Obviously the TCP connection is blocked and backed up to the other host
> in both directions.
>
> The IOC is alive and casr shows all channels as connected.
>
> Why are both sides not taking data out of their receive-Qs?
>
> In this test setup, this is not happening to us for the first time. Has
> anyone seen such situations before? Any ideas for how to proceed trying
> to find out what's happening?
>
> Thanks a lot
> ~Ralph
>