EPICS Home

Experimental Physics and Industrial Control System


 
1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  <20182019  2020  2021  2022  2023  2024  Index 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  <20182019  2020  2021  2022  2023  2024 
<== Date ==> <== Thread ==>

Subject: Re: EPICS CAS errors
From: Dirk Zimoch <[email protected]>
To: <[email protected]>
Date: Thu, 24 May 2018 18:11:04 +0200
Hi Oleg,

I have observed a similar crash today with EPICS 3.14.12.7.

CAS: request from 172.26.120.14:57787 => bad resource ID
CAS: Request from 172.26.120.14:57787 => cmmd=15 cid=0x26f7 type=10 count=1 postsize=0 CAS: Request from 172.26.120.14:57787 => available=0x76635 ^IN=0 paddr=(nil)
CAS: forcing disconnect from 172.26.120.14:57787

A call to 'assert(status == epicsMutexLockOK)'
    by thread 'CAS-client' failed in ../dbEvent.c line 531.
EPICS Release EPICS R3.14.12.7-DEV-2018-04.
Local time is 2018-05-24 15:45:05.435638450 CEST
Please E-mail this message to the author or to [email protected]
Calling epicsThreadSuspendSelf()
Thread CAS-client (0xb5c08258) suspended

Here the crash seems related to a CA failure as well. The same thread 'CAS-client' crashed.

In my case the crash is rare. Many "bad resource ID" errors can be generated without problem before one error causes the crash.

Meanwhile I suspect the following race condition:
While CAS-client is repeatedly locking mutexes in a loop, another thread destroys them. This makes epicsMutexMustLock() fail. At least a test that I wrote creates the same type of error message when doing exactly that:

creating mutex
locking mutex
creating thread
taking mutex
unlocking and destroying mutex now
destroying mutex done
unblocked
taking mutex

A call to 'assert(status == epicsMutexLockOK)'
    by thread 'blocked' failed in ../testmutex.c line 11.

In your case LOCKREC = epicsMutexMustLock((RECPTR)->mlok) fails.
In my case LOCKEVQUE = epicsMutexMustLock((EV_QUE)->writelock) fails.

I had first suspected that epicsMutexDestroy might destroy the mutex while the other thread is waiting on it, but in that case pthread_mutex_destroy would fail. (What I am missing in epicsMutexOsdDestroy() is to wait and retry as long as pthread_mutex_destroy returns EBUSY.) But that seems not to be the problem. Indeed if I change the test to destroy the locked mutex I get:

creating mutex
locking mutex
creating thread
taking mutex
destroying mutex now
epicsMutex pthread_mutex_destroy failed: error Device or resource busy
destroying mutex done

Now epicsMutexDestroy prints an error and finishes while the other thread is stuck in epicsMutexMustLock() and never wakes up. Also not the best behavior in my opinion. epicsMutexDestroy should wait until the mutex is free. This needs some more code in epicsMutexOsdDestroy. But that is not the problem here.

I think the problem is that mutexes used in dbEvent.c may be destroyed before the CAS-client thread has finished its work.

Maybe I will start digging a bit deeper into the dbEvent.c, posix/osdMutex.c and CAS-client code...

Dirk


On 09.04.2018 18:19, Oleg A. Makarov wrote:
Hi Jeffrey,

yesterday EPICS PVs updates stalled on one of our IOCs. IOC reboot solve the problem.
I am posting log messages from IOC console below. EPICS version is 3.15.5.

Should some corrections to the CAS be made, so that erroneous request does not cause suspended threads?

Oleg

CAS: request from 164.54.103.102:38866 => CAS: Missaligned protocol rejected
CAS: Request from 164.54.103.102:38866 => cmmd=29797 cid=0x73000000 type=29793 count=29813 postsiz
e=29267
CAS: Request from 164.54.103.102:38866 =>   available=0x0       N=2 paddr=0x0
CAS: invalid (damaged?) UDP request from 164.54.103.102:38866 ?
CAS: message received at 2018-03-22 20:18:21


CAS: request from 164.54.103.102:54641 => CAS: Missaligned protocol rejected
CAS: Request from 164.54.103.102:54641 => cmmd=29797 cid=0x73000000 type=29793 count=29813 postsiz
e=29267
CAS: Request from 164.54.103.102:54641 =>   available=0x0       N=2 paddr=0x0
CAS: invalid (damaged?) UDP request from 164.54.103.102:54641 ?
CAS: message received at 2018-03-31 13:36:44


A call to 'assert(status == epicsMutexLockOK)'
     by thread 'CAS-client' failed in ../../../src/ioc/db/dbEvent.c line 493.
EPICS Release EPICS R3.15.5.
Local time is 2018-04-07 19:59:10.227314800 CST/CDT
Please E-mail this message to the author or [email protected]
Calling epicsThreadSuspendSelf()
Thread CAS-client (0x492c090) suspended


A call to 'assert(status == epicsMutexLockOK)'
     by thread 'CAS-client' failed in ../../../src/ioc/db/dbEvent.c line 477.
EPICS Release EPICS R3.15.5.
Local time is 2018-04-08 08:04:00.242762000 CST/CDT
Please E-mail this message to the author or [email protected]
Calling epicsThreadSuspendSelf()
Thread CAS-client (0x4c573d0) suspended



A call to 'assert(status == epicsMutexLockOK)'
     by thread 'CAS-client' failed in ../../../src/ioc/db/dbEvent.c line 477.
EPICS Release EPICS R3.15.5.
Local time is 2018-04-08 08:06:45.226095334 CST/CDT
Please E-mail this message to the author or [email protected]
Calling epicsThreadSuspendSelf()
Thread CAS-client (0x4b810a0) suspended



A call to 'assert(status == epicsMutexLockOK)'
     by thread 'CAS-client' failed in ../../../src/ioc/db/dbEvent.c line 493.
EPICS Release EPICS R3.15.5.
Local time is 2018-04-08 08:11:10.776095352 CST/CDT
Please E-mail this message to the author or [email protected]
Calling epicsThreadSuspendSelf()
Thread CAS-client (0x4962b00) suspended



A call to 'assert(status == epicsMutexLockOK)'
     by thread 'CAS-client' failed in ../../../src/ioc/db/dbEvent.c line 477.
EPICS Release EPICS R3.15.5.
Local time is 2018-04-08 08:13:56.288060135 CST/CDT
Please E-mail this message to the author or [email protected]
Calling epicsThreadSuspendSelf()
Thread CAS-client (0x4b8bf40) suspended



A call to 'assert(status == epicsMutexLockOK)'
     by thread 'CAS-client' failed in ../../../src/ioc/db/dbEvent.c line 477.
EPICS Release EPICS R3.15.5.
Local time is 2018-04-08 08:17:01.421393463 CST/CDT
Please E-mail this message to the author or [email protected]
Calling epicsThreadSuspendSelf()
Thread CAS-client (0x4bfbbc0) suspended



A call to 'assert(status == epicsMutexLockOK)'
     by thread 'CAS-client' failed in ../../../src/ioc/db/dbEvent.c line 477.
EPICS Release EPICS R3.15.5.
Local time is 2018-04-08 08:19:30.938060149 CST/CDT
Please E-mail this message to the author or [email protected]
Calling epicsThreadSuspendSelf()
Thread CAS-client (0x25d37f0) suspended



A call to 'assert(status == epicsMutexLockOK)'
     by thread 'CAS-client' failed in ../../../src/ioc/db/dbEvent.c line 477.
EPICS Release EPICS R3.15.5.
Local time is 2018-04-08 08:22:36.554726791 CST/CDT
Please E-mail this message to the author or [email protected]
Calling epicsThreadSuspendSelf()
Thread CAS-client (0x4c40d70) suspended



A call to 'assert(status == epicsMutexLockOK)'
     by thread 'CAS-client' failed in ../../../src/ioc/db/dbEvent.c line 477.
EPICS Release EPICS R3.15.5.
Local time is 2018-04-08 08:23:58.421393463 CST/CDT
Please E-mail this message to the author or [email protected] Calling epicsThreadSuspendSelf()
Thread CAS-client (0x4c4a810) suspended
A call to 'assert(status == epicsMutexLockOK)'
     by thread 'CAS-client' failed in ../../../src/ioc/db/dbEvent.c line 477.
EPICS Release EPICS R3.15.5.
Local time is 2018-04-08 08:28:55.938060149 CST/CDT
Please E-mail this message to the author or [email protected] Calling epicsThreadSuspendSelf()
Thread CAS-client (0x4c85be0) suspended
A call to 'assert(status == epicsMutexLockOK)'
     by thread 'CAS-client' failed in ../../../src/ioc/db/dbEvent.c line 477.
EPICS Release EPICS R3.15.5.
Local time is 2018-04-08 08:40:05.300141941 CST/CDT
Please E-mail this message to the author or [email protected]
Calling epicsThreadSuspendSelf()
Thread CAS-client (0x4c90690) suspended

23i:ioc2>
23i:ioc2>
23i:ioc2>
23i:ioc2>
23i:ioc2>
23i:ioc2>
23i:ioc2>
23i:ioc2>
23i:ioc2>
23i:ioc2> reboot


References:
EPICS CAS errors Oleg A. Makarov

Navigate by Date:
Prev: RE: EDM X/Y Plot segfaults Baily, Scott A
Next: caQtDM questions Baily, Scott A
Index: 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  <20182019  2020  2021  2022  2023  2024 
Navigate by Thread:
Prev: Re: EPICS CAS errors Oleg A. Makarov
Next: EPICS Meeting Web Site Issues Paul Chu
Index: 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  <20182019  2020  2021  2022  2023  2024