
Subject: Re: Can't create mutex semaphore: too many
From: Michael Davidsaver via Tech-talk <tech-talk at aps.anl.gov>
To: "Luchini, Kristi L." <luchini at slac.stanford.edu>, Till Straumann <till.straumann at psi.ch>
Cc: EPICS Tech-Talk <tech-talk at aps.anl.gov>, Joel Sherrill <joel.sherrill at gmail.com>, "Mock, Jeremy A" <jmock at slac.stanford.edu>
Date: Wed, 25 May 2022 21:19:49 -0700
I'm CC'ing tech-talk as I think others may be interested in these numbers.
(though likely not the 11MB screenlog.0 attachment they derive from)

On 5/25/22 19:07, Luchini, Kristi L. wrote:
Hi Everyone,

Unfortunately, I had to reboot the IOC because it was causing problems for ops. However, I performed the epicsMutexShowAll(0,1) as Till recommended. I've attached the screenlog file.

While not conclusive, nor perhaps even relevant, there are some interesting bits in here.

First, the total number of mutexes in use:


$ grep epicsMutexId /tmp/screenlog.0 |wc -l
1409


Line numbers below assume a minimum of SLAC-specific patching...

$ grep epicsMutexId /tmp/screenlog.0 |awk '{print $4, $6}'|sort|uniq -c|sort -nr
    318 ../db/dbScan.c 597

scanIoInit() has been called 106 times by whatever drivers are loaded.
Each call creates one scan list per callback priority, each with its own
mutex, so three mutexes per call: 106 * 3 = 318. (A sketch of typical
driver usage follows the link below.)

https://github.com/epics-base/epics-base/blob/34834dfe973a577f342503441e7b33b7fe322de5/modules/database/src/ioc/db/dbScan.c#L596
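
For context, a minimal sketch of typical driver device support usage (the
function names here are hypothetical, and this is not the code actually
loaded on this IOC):

    /* Hypothetical driver support, for illustration only. */
    #include <dbScan.h>   /* IOSCANPVT, scanIoInit(), scanIoRequest() */

    static IOSCANPVT ioscanpvt;

    static void myDrvInit(void)
    {
        /* Each scanIoInit() call sets up one scan list per callback
         * priority (low/medium/high), each guarded by its own mutex,
         * so 3 mutexes per call: 106 calls * 3 = 318. */
        scanIoInit(&ioscanpvt);
    }

    static void myDrvInterrupt(void)
    {
        /* Records with SCAN="I/O Intr" attached to ioscanpvt get queued
         * for processing at their configured priority. */
        scanIoRequest(ioscanpvt);
    }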

    246 ../misc/iocInit.c 500

There are 246 records loaded.

https://github.com/epics-base/epics-base/blob/34834dfe973a577f342503441e7b33b7fe322de5/modules/database/src/ioc/misc/iocInit.c#L500

    246 ../db/dbLock.c 87

These 246 records are in 246 lock sets, which means there are no DB links
between records (records connected by DB links would share a lock set).

https://github.com/epics-base/epics-base/blob/34834dfe973a577f342503441e7b33b7fe322de5/modules/database/src/ioc/db/dbLock.c#L87

    193 ../dbStatic/dbPvdLib.c 119

These 246 records occupy 193 buckets in the hash table
used during PV name lookup.

https://github.com/epics-base/epics-base/blob/34834dfe973a577f342503441e7b33b7fe322de5/modules/database/src/ioc/dbStatic/dbPvdLib.c#L119

    126 ../../src/remote/beaconHandler.cpp 24

Beacons from 126 servers have been received to this point.
With the pvAccessCPP issue I highlighted, this number
will grow without bound as more servers (re)start.
(at least for soft IOCs where the port number changes)

     46 ../db/dbEvent.c 379

There are active subscriptions to 46 PVs.
This number will go up and down as clients connect and disconnect.

https://github.com/epics-base/epics-base/blob/34834dfe973a577f342503441e7b33b7fe322de5/modules/database/src/ioc/db/dbEvent.c#L379

     23 ../freeList/freeListLib.c 55

23 free lists are active.

     21 ../osi/epicsThread.cpp 208

There are 21 C++ threads running.
This number will go up/down as PVA clients (dis)connect.

I'm getting bored now, and won't go through the long tail.

     12 ../../src/remote/blockingUDPTransport.cpp 69
     10 ../db/dbScan.c 897
     10 ../db/dbEvent.c 300
     10 ../db/dbEvent.c 290
      9 ../rsrv/caservertask.c 1267
      9 ../rsrv/caservertask.c 1266
      9 ../rsrv/caservertask.c 1265
      9 ../rsrv/caservertask.c 1264
      9 ../cxxTemplates/tsFreeList.h 114
      7 ../../src/pva/pv/pvaDefs.h 23
      7 ../../../../../include/tsFreeList.h 114
      4 ../timer/timerQueue.cpp 28
      4 ../../src/remoteClient/clientContextImpl.cpp 3982
      4 ../devSeqCar.c 154
      3 ../../src/remote/channelSearchManager.cpp 86
      3 ../gpHash/gpHashLib.c 57
      3 ../db/dbCa.c 326
      3 ../ca_client_context.cpp 63
      3 ../ca_client_context.cpp 62
      2 ../../src/remote/pv/transportRegistry.h 68
      2 ../../src/misc/timer.cpp 34
      2 ../../src/client/pv/pvAccess.h 1530
      2 ../../src/client/client.cpp 238
      2 ../repeaterSubscribeTimer.cpp 40
      2 ../log/logClient.c 466
      2 ../hostNameCache.cpp 34
      1 ../tpool.cpp 20
      1 ../timer/timerQueueActiveMgr.cpp 23
      1 ../taskwd/taskwd.c 151
      1 ../taskwd/taskwd.c 150
      1 ../taskwd/taskwd.c 149
      1 ../../src/utils/configuration.cpp 244
      1 ../../src/server/serverContext.cpp 48
      1 ../../src/remote/blockingTCPAcceptor.cpp 38
      1 ../../src/misc/reftrack.cpp 31
      1 ../../src/misc/reftrack.cpp 164
      1 ../../src/factory/FieldCreateFactory.cpp 1627
      1 ../../src/factory/ChannelAccessFactory.cpp 100
      1 ../seq_cmd.c 38
      1 ../save_restore.c 2244
      1 ../rsrv/caservertask.c 485
      1 ../PVAServerRegister.cpp 42
      1 ../pvalink_channel.cpp 18
      1 ../../p2pApp/weakmap.h 74
      1 ../osi/os/posix/osdSock.c 38
      1 ../osi/osiNTPTime.c 91
      1 ../osi/osiClockTime.c 82
      1 ../osi/os/default/osdThreadHooks.c 38
      1 ../osi/epicsGeneralTime.c 94
      1 ../osi/epicsGeneralTime.c 91
      1 ../osi/devLibVME.c 711
      1 ../misc/ipAddrToAsciiAsynchronous.cpp 174
      1 ../misc/epicsExit.c 80
      1 ../iocsh/iocsh.cpp 81
      1 ../iocsh/initHooks.c 40
      1 ../evrTime.c 456
      1 ../error/errlog.c 485
      1 ../error/errlog.c 482
      1 ../error/errlog.c 481
      1 ../drvMrfEr.c 929
      1 ../drvEvr.c 406
      1 ../devIocStatsAnalog.c 465
      1 ../dbmf/dbmf.c 80
      1 ../db/dbScan.c 552
      1 ../db/dbNotify.c 320
      1 ../db/dbLock.c 63
      1 ../db/dbCa.c 278
      1 ../db/dbBkpt.c 259
      1 ../cxxTemplates/epicsSingletonMutex.cpp 35
      1 ../ca_client_context.cpp 53
      1 ../as/asTrapWrite.c 71
      1 ../as/asLibRoutines.c 83
      1 ../as/asCa.c 235
      1 /afs/slac.stanford.edu/g/lcls/vol8/epics/base/R7.0.3.1-1.0/include/pv/pvAccess.h 1401



Thanks,
-Kristi

-----Original Message-----
From: Michael Davidsaver <mdavidsaver at gmail.com>
Sent: Wednesday, May 25, 2022 2:15 PM
To: Till Straumann <till.straumann at psi.ch>; Luchini, Kristi L. <luchini at slac.stanford.edu>
Cc: Talk EPICS Tech <tech-talk at aps.anl.gov>; Joel Sherrill <joel.sherrill at gmail.com>
Subject: Re: Can't create mutex semaphore: too many

On 5/25/22 08:35, Michael Davidsaver wrote:
On 5/25/22 01:03, Till Straumann wrote:
Hi Kristi.

I'm sorry that I only get to look at this after a long time.

The list of mutexes is indeed very, very long and therefore the
message that no more can be created seems legit.

I believe that your problem could be caused by an algorithm that
leaks mutexes, i.e., a mutex is created, some operation fails, and
this process is repeated without destroying the mutex.

Indeed this appears to be the case.  There is an std::map used to track which PVA servers have sent beacons.  Entries are added, but never removed.
Each entry has a Mutex (this code does very granular locking...).

I can't be certain that this is the cause of what Kristi reports, but it seems a likely culprit.

https://github.com/epics-base/pvAccessCPP/issues/184


It may be a while before I have time to dig into this.  If someone else wants to have a go first, my suggested starting point would be to add LRU behavior.
BeaconHandler::beaconNotify() is already passed (but ignores) the current time.
Then remove old entries as new beacons arrive, subject to a maximum size, and/or periodically remove entries for servers which haven't sent a beacon in some arbitrary time, like PVXS does (see the sketch after the link below).

https://github.com/mdavidsaver/pvxs/blob/f22ab94458abb9ed58af0582e88cf1b2b40ed52a/src/client.cpp#L1084-L1121
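
Not a patch against pvAccessCPP, just a rough sketch of the kind of bounded,
age-based pruning I have in mind; the container and type names here
(beacon_map_t, BeaconEntry, pruneBeacons) are hypothetical, not the actual
pvAccessCPP classes:

    // Rough sketch only; assumes a map keyed by server identity with a
    // last-beacon timestamp kept per entry.
    #include <map>
    #include <string>
    #include <epicsTime.h>

    struct BeaconEntry {
        epicsTime lastSeen;   // updated on every beacon from this server
        // ... per-server state ...
    };
    typedef std::map<std::string, BeaconEntry> beacon_map_t;

    // Could be called from beaconNotify(), which already receives 'now'.
    void pruneBeacons(beacon_map_t& beacons, const epicsTime& now,
                      size_t maxEntries, double maxAgeSec)
    {
        // Drop servers which have been silent for longer than maxAgeSec.
        for (beacon_map_t::iterator it = beacons.begin(); it != beacons.end();) {
            if (now - it->second.lastSeen > maxAgeSec)
                beacons.erase(it++);
            else
                ++it;
        }
        // Hard size cap as a backstop: evict the stalest entry until under limit.
        while (beacons.size() > maxEntries) {
            beacon_map_t::iterator oldest = beacons.begin();
            for (beacon_map_t::iterator it = beacons.begin(); it != beacons.end(); ++it) {
                if (it->second.lastSeen < oldest->second.lastSeen)
                    oldest = it;
            }
            beacons.erase(oldest);
        }
    }

Dropping an entry would also free its per-entry Mutex, which is what would keep
the semaphore count bounded.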


I was unable to find the code that produces the 'Error on UDP RX xxx -> xxx'.
It is not in any of the RTEMS or EPICS sources that I still have on
my computer.

This particular message is coming from pvAccessCPP.  Those seen
earlier in the log are not.  I'm somewhat surprised that reception of a
UDP packet would cause a mutex/semaphore to be allocated.  However,
that code is hard to follow (waaaay too many virtual function calls for my taste).

https://github.com/epics-base/pvAccessCPP/blob/f72c7e653c607f95daa818d14dfc928d59cfe8ff/src/remote/blockingUDPTransport.cpp#L283-L284


It could be helpful if you could identify the code that generates said message.
If analyzing it does not produce any clues then you might need a bigger hammer,
such as hacking the epicsMutex code to attach to every mutex a stack trace
recording how it was created. Once the problem occurs you can look at these
stack traces and see whether my hypothesis holds, i.e., there is indeed some
sort of loop/repeated pattern, and the traces will also point you to the code
that deserves scrutiny.

The first place to start is running "epicsMutexShowAll(0,1)" (e.g. from
iocsh). Hopefully this will give some hint.

The epicsMutex code tries to keep track of the file and line where each
mutex was allocated; this is why epicsMutexCreate() is a C macro.
However, this can be obscured by wrapper code.

#define epicsMutexCreate() epicsMutexOsiCreate(__FILE__,__LINE__)
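
For illustration only (create_locked_thing() below is a hypothetical helper,
not real EPICS or SLAC code): because __FILE__/__LINE__ expand where the macro
is used, any convenience wrapper makes every mutex created through it report
the wrapper's own location rather than its callers':

    #include <epicsMutex.h>

    /* Hypothetical wrapper, e.g. somewhere in common support code. */
    static epicsMutexId create_locked_thing(void)
    {
        /* epicsMutexShowAll() will attribute every mutex created here to
         * this file and line, regardless of which driver called the wrapper. */
        return epicsMutexMustCreate();
    }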


HTH
- Till

On 5/10/22 20:44, Luchini, Kristi L. wrote:

Hi Till,

I should point out that the VME IOC is running RTEMS 4.10.2. I understood from Shantha that the RTEMS changes in this version, or maybe a previous one, were done by someone at SLAC other than you. That version does have an unrelated known bug regarding the 2nd LAN port, and we haven't been able to move with RF due to that error. So I don't know if this is a newly introduced problem, but I don't recall it occurring in the RTEMS 4.9.4 version; I'll double check with Sonya since this is her IOC. It's odd because this IOC doesn't do much: it loads the BsaCore EPICS module, and there was a bug that caused a crash when your IOC didn't have any BSA PVs or an EVR. This IOC doesn't have BSA PVs, but it does have an EVR. In any case, this binary includes the bug fix for BsaCore.

I've attached the screenlog.0 file with the output of the rtemsMonitor() command you recommended.

Regards,

   * Kristi

*From:* Till Straumann <till.straumann at psi.ch>
*Sent:* Monday, May 9, 2022 11:45 PM
*To:* Joel Sherrill <joel.sherrill at gmail.com>; Michael Davidsaver
<mdavidsaver at gmail.com>; Luchini, Kristi L.
<luchini at slac.stanford.edu>
*Cc:* Talk EPICS Tech <tech-talk at aps.anl.gov>
*Subject:* Re: Can't create mutex semaphore: too many

The GeSys application is configured for an unlimited number of
semaphores and, in the case of beatnik/mvme6100, 15MB of CONFIGURE_EXECUTIVE_RAM_SIZE.

(https://github.com/till-s/rtems-gesys/blob/master/config.c, for beatnik MEMORY_HUGE is defined).
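
For reference, a generic sketch of what such an RTEMS 4.x application
configuration looks like (illustrative only, not a quote of the rtems-gesys
config.c):

    /* Generic RTEMS 4.x configuration sketch -- not the actual GeSys config.c. */
    #define CONFIGURE_EXECUTIVE_RAM_SIZE   (15*1024*1024)

    /* "Unlimited" semaphores: the object table grows on demand in blocks of 30,
     * but the objects still come out of the executive workspace, so a steady
     * mutex leak can eventually exhaust it and creates fail with "too many". */
    #define CONFIGURE_MAXIMUM_SEMAPHORES   rtems_resource_unlimited(30)

    /* ... other CONFIGURE_* options ... */
    #define CONFIGURE_INIT
    #include <rtems/confdefs.h>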

If the RTEMS monitor is loaded (make sure the 'monitor.obj' module
is loaded before you run into the problem, as the monitor creates at
least one semaphore for its own use; IIRC the standard startup
scripts at SLAC do in fact load the monitor) then you can invoke the
monitor from the cexp shell and dump info about semaphores:

cexpsh> rtemsMonitor()
monitor> sema

(note that the monitor has  its own interpreter with syntax that
differs from cexpsh's; leave the monitor by typing 'exit')

monitor> exit
cepxsh>

Strange - in my many years at SLAC I have never seen this problem...

HTH
- Till

On 5/10/22 04:14, Joel Sherrill via Tech-talk wrote:

     On Mon, May 9, 2022, 7:41 PM Michael Davidsaver <mdavidsaver at gmail.com> wrote:

         On 5/9/22 16:42, Joel Sherrill wrote:
         >
         >
         > On Mon, May 9, 2022, 5:45 PM Michael Davidsaver via Tech-talk <tech-talk at aps.anl.gov <mailto:tech-talk at aps.anl.gov>> wrote:
         >
         >     On 5/9/22 15:08, Luchini, Kristi L. via Tech-talk wrote:
         >      > Hello,
         >      >
         >      > I have an MVME6100 ioc running RTEMS that appears to be having a resource issue.  Does anyone recognize these error messages below and happen to know a function call that I can add to the st.cmd to increase that resource? Do I need to increase the number of file descriptors?
         >      >
         >      > If I reboot the ioc the message will go away and the ioc works fine, but after some period of time, a month or so, this message reappears and you can't access PVs over CA, which I assume is because the ioc seems to be too busy spewing error messages.
         >      >
         >      > Any help would be appreciated.
         >
         >     Sounds like there is a resource leak somewhere, at least of RTEMS semaphores.
         >     I'm not sure if the mention of "file descriptor" is real, or a case of overloading the meaning of an errno code.
         >
         >     Which drivers and other code is being loaded?
         >
         >     Are any of these creating worker threads after iocInit() ?
         >
         >     Are there any earlier error messages being logged?
         >
         >     You could try running "epicsThreadShowAll()" to see if some driver threads are hanging around when they shouldn't.
         >
         >     Unfortunately, there is no "epicsMutexShowAll" implemented for RTEMS 4.X, and I don't know if your RTEMS builds include the RTEMS shell and its associated diagnostics functions.
         >
         >
         > I don't know what RTEMS configure options are used but there is an option for unlimited and another for unified workspace. If they both are configured, you have one pool of memory to allocate everything from. This includes RTEMS objects and things you allocate via malloc() and new. In this configuration, any leak would be bad and could lead to this. It would just be a matter of which create or allocate failed. Malloc() will return NULL in this situation and that often isn't error checked.
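
To make that concrete, a generic sketch of that combination (illustrative
only, not EPICS base's actual rtems_config.c):

    /* Illustrative only. */
    #define CONFIGURE_UNIFIED_WORK_AREAS   /* heap and RTEMS workspace share one pool */
    #define CONFIGURE_UNLIMITED_OBJECTS    /* object counts grow on demand */

    /* ... other CONFIGURE_* options ... */
    #define CONFIGURE_INIT
    #include <rtems/confdefs.h>

    /* With this combination, RTEMS object creates, malloc() and C++ 'new' all
     * draw from the same pool, so a leak anywhere makes whichever allocation
     * runs out first fail -- and an unchecked NULL from malloc() can then
     * crash instead of reporting the real problem. */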

         @joel, I'm 90% certain that Kristi is using Till's GESYS application, which has a different RTEMS configuration.  So the following isn't relevant for @SLAC.

         Still.  fyi.  The current 7.0 default RTEMS configurations for EPICS applications are:

         for RTEMS  <= 4.x which uses separate pools.

https://github.com/epics-base/epics-base/blob/cbae8d37b3da486ac8a68ad6ef9d2028cd98cca0/modules/libcom/RTEMS/score/rtems_config.c#L29-L38

     I think that for 4.10, the cpp logic is setting unified workspace. If I remember correctly that's the first version with that feature.



         for RTEMS >= 5.x which sets CONFIGURE_UNIFIED_WORK_AREAS

https://github.com/epics-base/epics-base/blob/cbae8d37b3da486ac8a68ad6ef9d2028cd98cca0/modules/libcom/RTEMS/posix/rtems_config.c#L64-L72

     This uses 64 as the maximum file descriptors while the previous had a much larger number. But 64 is still a lot. You shouldn't run out. :)

     Kristi, what are the configure settings for maximums/unlimited and unified workspace?

     I am pretty sure unified workspace was first an option around 4.10.

     --joel




         > --joel
         >
         >
         >
         >      > Thanks,
         >      >
         >      > Kristi
         >      >
         >      > Message from RTEMS ioc:
         >      >
         >      > Can't create mutex semaphore: too many
         >      >
         >      > Can't create mutex semaphore: too many
         >      >
         >      > Can't create mutex semaphore: too many
         >      >
         >      > Can't create mutex semaphore: too many
         >      >
         >      > Cexp@ioc-in10-mp01>
         >      >
         >      > Cexp@ioc-in10-mp01>Error on UDP RX 172.27.72.43:47591 -> 172.27.75.255:5076 at 46 : Can't create mutex semaphore: too many
         >      >
         >      > epicsMutex::mutexCreateFailed()
         >      >
         >      > 0x00 ca024000 27000000 d5b37962 00000000 ..@. '... ..yb ....
         >      >
         >      > 0x10 c84f9922 00740000 00000000 00000000  .O." .t.. .... ....
         >      >
         >      > 0x20 0000ffff 00000000 d3130374 6370ff    .... .... ...t cp.
         >      >
         >      > CAS: Client accept error: Too many open files in system (16)
         >      >
         >      > Error on UDP RX 172.27.72.99:37311 -> 172.27.75.255:5076 at 46 : Can't create mutex semaphore: too many
         >      >
         >      > epicsMutex::mutexCreateFailed()
         >      >
         >      > 0x00 ca024000 27000000 a5a77862 00000000 ..@. '... ..xb ....
         >      >
         >      > 0x10 ea850c38 004f0000 00000000 00000000  ...8 .O.. .... ....
         >      >
         >      > 0x20 0000ffff 00000000 d3130374 6370ff    .... .... ...t cp.
         >      >
         >      >   * Kristi
         >      >
         >






