Argonne National Laboratory

Experimental Physics and
Industrial Control System

Subject: Re: EPICS CAS errors
From: "Martin L. Smith" <mls@aps.anl.gov>
To: <tech-talk@aps.anl.gov>
Date: Tue, 10 Apr 2018 05:28:09 -0500

Hi Oleg,

I'm not sure whether this matters, but I can tell you that the IP address
in the original message belongs to a Keithley device named bl1keithley. This
might help to narrow down which record(s) are involved.

Marty

On 04/10/2018 04:44 AM, Ralph Lange wrote:
Hi Oleg,

Remote diagnosis of an unknown system is always more of a guessing game than anything else.
So, first and most important suggestion: refer to a local expert.

Nevertheless, some thoughts:

Statistically, many if not most weird errors on the IOC are caused by memory corruption.

In your case, the thread suspensions happen when the CA server on the IOC calls db_event_enable (line 477) or db_event_disable (line 493) and the attempt to acquire the monitor lock fails with an error.

The routines db_event_enable and db_event_disable are called from within the CA server when access rights change for a record or when a client sets up or cancels a monitor.

Were there access rights changes happening on the IOC at 07-Apr 19:59:10 and 08-Apr 08:11:10 (at the "line 493" events)?

Some "line 477" thread suspensions happen at intervals of a few minutes. That could match a client that is repeatedly disconnected ungracefully (because the server-side thread is suspended) and then reconnects, provoking another attempt to take an invalid monitor lock and another disconnect.
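
Sorting suspension times into "a few minutes apart" and "hours apart" can be done by hand or with a short script. The following is a minimal sketch in Python, not a tool from EPICS Base. It assumes the timestamps have been copied by hand in the "07-Apr 19:59:10" form used above, that the year (2018) is supplied separately, and that the month abbreviations are parsed in an English locale. The helper names and the 10-minute threshold are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta

# Assumption: timestamps look like "07-Apr 19:59:10" (no year), as quoted
# in this thread. "%b" expects English month abbreviations.
STAMP_FORMAT = "%Y %d-%b %H:%M:%S"


def parse_stamp(text, year=2018):
    """Turn '07-Apr 19:59:10' into a datetime (hypothetical helper)."""
    return datetime.strptime(f"{year} {text}", STAMP_FORMAT)


def gaps(stamps, year=2018):
    """Return the time differences between consecutive events."""
    times = sorted(parse_stamp(s, year) for s in stamps)
    return [later - earlier for earlier, later in zip(times, times[1:])]


def split_by_gap(stamps, threshold=timedelta(minutes=10), year=2018):
    """Separate short gaps (reconnect-loop candidates) from long ones.

    The 10-minute default is an arbitrary reading of "a few minutes".
    """
    short, long_ = [], []
    for gap in gaps(stamps, year):
        (short if gap <= threshold else long_).append(gap)
    return short, long_


# The two "line 493" event times quoted above.
line_493 = ["07-Apr 19:59:10", "08-Apr 08:11:10"]
print([str(g) for g in gaps(line_493)])
```

For the two "line 493" times the gap is 12 hours 12 minutes, so they land in the "long" group. The "line 477" times from the IOC console would be passed to split_by_gap in the same way.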

The semaphore locking code is used everywhere, all the time, all over EPICS Base. Not an obvious candidate for a bug.

So ... I think what you see may be consistent with memory corruption that affects at least one record (i.e. the pointer to its monitor lock semaphore) or the memory area where the semaphore structures are allocated.

Too bad that the error messages don't show the record involved. That would give valuable information.
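
As an illustration of the failure mode described above, and of the kind of message that would name the record, here is a toy model in Python. It is not EPICS Base code: the Record class, the lock_record and unlock_record helpers, the status strings, and the record names are all made up for this sketch. The "corruption" is simulated by overwriting one record's lock handle with a meaningless value.

```python
import threading


class Record:
    """Toy stand-in for a record: a name plus a handle to its monitor lock."""

    def __init__(self, name):
        self.name = name
        self.mlok = threading.Lock()


def lock_record(record):
    """Try to take the record's lock and report failure instead of raising.

    Returns "ok" or "error"; these status strings are invented for the sketch.
    """
    handle = record.mlok
    if not hasattr(handle, "acquire"):
        # The handle no longer refers to a lock: the analogue of a
        # corrupted semaphore pointer. Include the record name in the message.
        print(f"lock failed for record '{record.name}': "
              f"invalid lock handle {handle!r}")
        return "error"
    handle.acquire()
    return "ok"


def unlock_record(record):
    """Release a lock previously taken with lock_record."""
    record.mlok.release()


# Record names are hypothetical.
records = [Record("demo:ai1"), Record("demo:ai2")]
records[1].mlok = 0xDEADBEEF  # simulate a stray write over the handle

results = {}
for rec in records:
    status = lock_record(rec)
    results[rec.name] = status
    if status == "ok":
        unlock_record(rec)
```

Only the record whose handle was overwritten fails; the other one locks normally. That is the pattern to look for: errors tied to one record, or to a few, rather than to locking in general.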

Memory corruption issues (if there is one) are not easy to track down; strategies and tools depend on the operating system. Which closes the loop to my first and most important suggestion: refer to a local expert.

Cheers,
~Ralph


On Mon, Apr 9, 2018 at 10:40 PM, Oleg A. Makarov <makarov@anl.gov> wrote:

    Ralph,

    could you please suggest how to diagnose what is causing the
    suspension of CAS-client threads?

    Thank you,
    Oleg


