Experimental Physics and
Industrial Control System


Subject:	Re: EPICS CAS errors
From:	"Oleg A. Makarov" <[email protected]>
To:	Ralph Lange <[email protected]>, EPICS Tech Talk <[email protected]>
Date:	Tue, 10 Apr 2018 13:16:54 -0500

Ralph,

Thank you for your suggestions. I agree: such problems are very hard to debug, since they are intermittent and most likely caused by memory corruption.
According to my observations, these errors happen at the start of a beamline control application, which establishes connections and monitors to a large number of EPICS PVs.
I don't think these errors are caused by access rights changes, since our beamline control applications do not change any access rights.

Regards,
Oleg

On 4/10/2018 4:44 AM, Ralph Lange wrote:

Hi Oleg,

Remote diagnosis of an unknown system is always more of a guessing game than anything else.

So, first and most important suggestion: refer to a local expert.

Nevertheless, some thoughts:

Statistically, many if not most weird errors on the IOC are caused by memory corruption.

In your case, the thread suspensions happen when the CA server on the IOC calls db_event_enable (line 477) or db_event_disable (line 493), and the attempt to acquire the monitor lock fails with an error.

The routines db_event_enable / db_event_disable are called from within the CA server when access rights change for a record or when a client sets up / cancels a monitor.

Were there access rights changes happening on the IOC at 07-Apr 19:59:10 and 08-Apr 08:11:10 (at the "line 493" events)?

Some "line 477" thread suspensions happen with intervals of a few minutes. That could match a client repeatedly getting ungracefully disconnected (because of the server-side thread being suspended) and then reconnecting, provoking another attempt to lock an invalid monitor lock and get disconnected again.
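
One way to check whether the "line 477" suspensions follow such a reconnect cycle is to look at the gaps between consecutive events, grouped by source line. The Python sketch below assumes the suspension messages have already been reduced to (timestamp, line) pairs. The helper name gaps_by_line and the three line-477 timestamps are invented for illustration; the two line-493 timestamps are the ones mentioned above, with the year taken from this thread.

```python
from collections import defaultdict
from datetime import datetime


def gaps_by_line(events):
    """Return {source_line: [seconds between consecutive events]}.

    `events` is an iterable of (timestamp_string, source_line) pairs.
    The timestamp format used here is an assumption, not the IOC's log format.
    """
    times = defaultdict(list)
    for stamp, line in events:
        times[line].append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"))

    gaps = {}
    for line, stamps in times.items():
        stamps.sort()
        # Difference between each event and the next one for the same line.
        gaps[line] = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return gaps


events = [
    ("2018-04-07 19:59:10", 493),  # quoted in this thread
    ("2018-04-08 08:11:10", 493),  # quoted in this thread
    ("2018-04-08 09:00:00", 477),  # invented for illustration
    ("2018-04-08 09:03:00", 477),  # invented for illustration
    ("2018-04-08 09:06:30", 477),  # invented for illustration
]

gaps = gaps_by_line(events)
for line in sorted(gaps):
    print(line, gaps[line])
```

Roughly constant gaps of a few minutes for one line number would be consistent with the disconnect-and-reconnect picture; widely scattered gaps would point elsewhere.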

The semaphore locking code is used everywhere, all the time, all over EPICS Base. Not an obvious candidate for a bug.

So ... I think what you see may be consistent with a memory corruption that affects at least one record (i.e. the pointer to its monitor lock semaphore) or the memory area where the semaphore structures have been allocated.

Too bad that the error messages don't show the record involved. That would give valuable information.

Memory corruption issues (if that is what this is) are not easy to track down; strategies and tools depend on the operating system. Which closes the loop to my first and most important suggestion: refer to a local expert.

Cheers,
~Ralph

On Mon, Apr 9, 2018 at 10:40 PM, Oleg A. Makarov <[email protected]> wrote:

Ralph,

Could you please suggest how to diagnose what is causing the suspension of the CAS-client threads?

Thank you,
Oleg


ANJ, 24 May 2018
