Subject:	Re: Mutex error on recent Redhat systems
From:	Andrew Johnson <[email protected]>
To:	Sergey Stepanov <[email protected]>
Cc:	EPICS core-talk <[email protected]>
Date:	Fri, 18 May 2007 11:45:22 -0500

Hi Sergey,

Sergey Stepanov wrote:
> Hi Andrew,
> we wonder if you have seen this: we are often observing
> the following error when using caget and caput (the EPICS
> base programs) on Redhat EL5 and Fedora-6:

I have not seen this yet as I haven't had either of those OSs available 
before now -- I've been asking IT to upgrade my PC to Fedora-6 for some 
time, and it finally looks like they're able to do it so I will 
investigate this problem as soon as I can.

Has anyone else on core-talk seen this problem?

> gmca@bl1ws3:~ 43> caget 23i:GO:Om:Start.DESC
> 23i:GO:Om:Start.DESC           Gonio Omega
> epicsThreadOnceOsd epicsMutexLock failed.
> 
> 
> gmca@bl1ws3:~ 47> caput 23i:GO:Om:Start.DESC "Gonio Omega"
> Old : 23i:GO:Om:Start.DESC           Gonio Omega
> New : 23i:GO:Om:Start.DESC           Gonio Omega
> epicsThreadOnceOsd epicsMutexLock failed.
> 
> This is happening at least with EPICS BASE-3.14.9 and BASE-3.14.8.2.
> At the same time the error does not show up on Redhat EL3 and
> Redhat EL4 (with the same versions of EPICS BASE). The rate of
> the error is about 5-10%.
> 
> Does it ring the bell for you? We did not find any note on this
> problem in the EPICS Mantis Bug Reports list. Also we do not
> have an estimate how harmful it is. In case of the error the caput
> program still does what it is asked to do (i.e. sets the new value),
> but we are not sure how the other EPICS software would behave.
> 
> We also tried caget and caput from EPICS extensions and got the
> same problem.

The error message comes from src/libCom/osi/os/posix/osdThread.c, at the
top of the routine epicsThreadOnceOsd(), and is printed if
mutexLock(&onceLock) returns a non-zero status.  The mutexLock() routine
is one of ours; it calls pthread_mutex_lock() and retries if that
returns EINTR (a return value that violates SUSv3).
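
For anyone not familiar with that code, the retry pattern can be sketched
roughly as below. This is only an illustration of the idea, not the
actual osdThread.c source: the names lock_retrying_eintr() and
lock_or_exit() are made up for this example, and the message text is
simply copied from Sergey's output.

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the mutexLock() helper described above:
 * keep calling pthread_mutex_lock() while it reports EINTR, and hand
 * any other status (0 on success) back to the caller. */
static int lock_retrying_eintr(pthread_mutex_t *m)
{
    int status;

    do {
        status = pthread_mutex_lock(m);
    } while (status == EINTR);
    return status;
}

/* Hypothetical caller showing the failure path described above:
 * any non-zero status is treated as fatal. */
static void lock_or_exit(pthread_mutex_t *m)
{
    if (lock_retrying_eintr(m) != 0) {
        fprintf(stderr, "epicsThreadOnceOsd epicsMutexLock failed.\n");
        exit(1);
    }
}
```

With a correctly initialized mutex the helper returns 0 and the caller
carries on; the message only appears when pthread_mutex_lock() reports
some error other than EINTR.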

The result of a bad status (other than EINTR) from pthread_mutex_lock() 
is for epicsThreadOnceOsd() to call exit(1).  However, since Sergey's 
caput and caget operations seem to be completing normally, it looks like 
this may be some kind of shutdown issue.  The onceLock mutex does get 
destroyed after myAtExit() has run epicsExitCallAtExits() and called 
pthread_cancel() on all threads other than itself and main, and this 
might explain the issue -- if a thread other than main invokes exit() 
and is used to perform the atExit() cleanups itself, the main thread 
might still try to run something that checks the onceLock before the 
process breathes its very last breath.

I'll add some more instrumentation and try to replicate this on Fedora-6 
but any assistance is welcome...

- Andrew
-- 
The right to be heard does not automatically include
the right to be taken seriously. -- Hubert H. Humphrey
_______________________________________________
Core-talk mailing list
[email protected]
http://www.aps.anl.gov/mailman/listinfo/core-talk
