EPICS Home

Experimental Physics and Industrial Control System


 

Subject: Re: Matlab 2020b crashes with labCA 3.7.2
From: Till Straumann via Tech-talk <tech-talk at aps.anl.gov>
To: Miroslaw Dach <mdach at lbl.gov>
Cc: Gregory Portmann <gjportmann at lbl.gov>, "Corbett, William J." <corbett at slac.stanford.edu>, EPICS Techtalk <tech-talk at aps.anl.gov>
Date: Thu, 22 Apr 2021 16:08:50 +0200
I have been contacted by MathWorks regarding this issue. I supplied them with
my analysis and my non-EPICS/non-labCA example (mexJoin.cc, see earlier message),
which reproduces the problem (to our current understanding it occurs only under
RHEL7 with matlab 2020b).

The answer I received from MathWorks is not very satisfactory but to some extent
understandable, given that RHEL is notoriously outdated.

MathWorks claims that they had to back-port a feature required by matlab (the glibc-2.18
__cxa_thread_atexit_impl runtime support, see their response below) to the glibc-2.17 library.
This apparently conflicted with a work-around for the bug we observe (lockup if a library
loaded by dlopen() uses pthread_join() during static initialization) and re-introduced that
bug into glibc-2.17.

matlab distributes a proprietary version of glibc-2.17 (glibc-2.17_shim, which is LD_PRELOADed
by the matlab script) and this proprietary version contains the bug. According to MathWorks
it is not possible to port the fix that is present in the native glibc-2.17 to their 'shim' version.

Consequently, it is not possible to use any MEX file under RHEL7/2020b which depends on
a library that joins threads during static initialization.

MathWorks claims that it is not their 'fault' (I see that a little differently, since their proprietary
modification of glibc-2.17 for RHEL7 clearly introduces a bug). I do see, however, that it is
not easy to stay backwards compatible with the notoriously old RHEL while using much more modern
compilers and library versions on other Linux systems.

The 'solution' proposed by MathWorks is as simple as 'modify all libraries used by toolboxes
to not use threads during initialization'. That is not very helpful and, I'd say, quite unrealistic.
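A minimal sketch of what that advice means in practice, assuming a C library whose constructor currently starts and joins a helper thread: the work moves into a function guarded by pthread_once() and runs on the first API call, i.e. from ordinary code after dlopen() has returned, which the analysis further down this thread found to work. The names (example_lib_ensure_init, do_init, worker) are hypothetical and not taken from EPICS or labCA.

```c
#include <pthread.h>

/* pthread_once guarantees do_init() runs exactly once, on first use. */
static pthread_once_t init_once = PTHREAD_ONCE_INIT;
static int workers_joined = 0;

static void *worker(void *arg)
{
    (void)arg;          /* placeholder for the real background work */
    return NULL;
}

/* Work that would otherwise sit in a constructor (static initializer).
 * Here it runs on the first API call, i.e. after dlopen() has returned. */
static void do_init(void)
{
    pthread_t tid;
    if (pthread_create(&tid, NULL, worker, NULL) == 0) {
        pthread_join(tid, NULL);   /* joining from ordinary code was reported to work */
        workers_joined++;
    }
}

/* Hypothetical entry point: each public function of the library calls this first. */
int example_lib_ensure_init(void)
{
    pthread_once(&init_once, do_init);
    return workers_joined;
}
```

The catch is that every public entry point must call the init function first, which is why retrofitting this into every library a toolbox depends on is unrealistic.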

MathWorks has closed this issue.

Therefore, for the foreseeable future I recommend one of the following approaches:
 1. avoid RHEL7 + 2020b (use a newer RHEL or an older matlab) if at all possible
 2. use a version of EPICS base that was compiled with posix RT scheduling disabled
    (but it might only be a matter of time until some other part of EPICS hits this bug)
 3. use LD_PRELOAD to load EPICS base before matlab is started (see earlier posts)

Best regards
- Till

PS: For the record, I paste MW's response below; it includes a few links to glibc discussions that touch on the issue(s).

The issue you discovered appears to be a well-known issue with glibc, which is moreover documented in the following bug report,

https://bugzilla.redhat.com/show_bug.cgi?id=1223055

While the referenced article talks about glibc versions greater than 2.21, the underlying issue was in fact introduced with the glibc 2.18 __cxa_thread_atexit_impl C++ compiler runtime feature. Accordingly, we needed to backport the latter runtime feature to glibc 2.17 via the shim library in order to be able to continue supporting RHEL7 (which is in fact a 7-year-old operating system) with newer compilers and libraries.

As a matter of fact the glibc maintainers are still trying to fix the set of issues involved here and have been working on it for 6+ years, as the following threads attest,

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e400f3ccd36fe91d432cc7d45b4ccc799dece763
https://sourceware.org/bugzilla/show_bug.cgi?id=19329
https://libc-alpha.sourceware.narkive.com/baoZQBHf/patch-bz-19329-fix-race-between-tls-allocation-at-thread-creation-and-dlopen
https://sourceware.org/pipermail/libc-alpha/2021-February/122626.html
https://sourceware.org/pipermail/libc-alpha/2021-February/122634.html

Moreover, while the original issue was "worked around" in glibc-2.23 by changing how TLS (thread local storage) is allocated for this set of operations, that is unfortunately not a strategy MathWorks can employ in shim layer.

The best recommendation that we can offer you for the time being is to avoid creating and/or destroying threads while loading shared libraries until the glibc maintainers' community eventually gets to the bottom of this set of issues.

I hope that this information helps you to proceed and I apologize once again for the inconvenience.

I am closing this Service Request tentatively, but please feel free to come back to this communication anytime for similar questions or concerns and I will reopen this Service Request for you accordingly.


On 3/5/21 3:50 PM, Till Straumann via Tech-talk wrote:
Hi All.

It seems my original answer never made it to tech-talk. It can still be found
below, but here is some more information I gathered:

Even with the suggested work-arounds I could not get labca-3.7.2 and
epics-7.0.4.1 or 7.0.5 to work with matlab 2020b. It would either hang or crash when
quitting matlab.

I cut a new [labca 3.8.0 release](https://github.com/till-s/epics-labca/releases/tag/labca_3_8_0),
which addresses this problem, but you still need one (not both) of the following
two work-arounds:

a) use a build of epics base with posix priority scheduling disabled. In configure/CONFIG_SITE set
       USE_POSIX_THREAD_PRIORITY_SCHEDULING=NO
    then 'make clean' and 'make'. Obviously, make sure no real-time systems are using this new build.
b) use LD_PRELOAD to load and initialize libCom before starting matlab

    LD_PRELOAD=<path_to_base>/lib/<arch>/libCom.so  matlab <options>
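For sites that prefer a compiled launcher over a shell wrapper, work-around b) could look roughly like the sketch below. It is an illustration under assumptions, not part of labCA: build_preload() and launch_matlab_with_libcom() are hypothetical names, and it assumes 'matlab' is on PATH. Because the matlab driver script appends its shim with "$LD_PRELOAD:...", prepending libCom keeps it ahead of the shim, and the dynamic linker loads and initializes it at process start rather than through dlopen().

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Writes "lib" or "lib:existing" into out.
 * Returns 0 on success, -1 if out is too small. */
int build_preload(char *out, size_t n, const char *lib, const char *existing)
{
    int len;
    if (existing != NULL && existing[0] != '\0')
        len = snprintf(out, n, "%s:%s", lib, existing);
    else
        len = snprintf(out, n, "%s", lib);
    return (len >= 0 && (size_t)len < n) ? 0 : -1;
}

/* Hypothetical launcher equivalent to
 *   LD_PRELOAD=<path_to_base>/lib/<arch>/libCom.so matlab <options>
 * Only returns if something fails. */
int launch_matlab_with_libcom(const char *libcom_path, char *const argv[])
{
    char buf[4096];
    if (build_preload(buf, sizeof buf, libcom_path, getenv("LD_PRELOAD")) != 0)
        return -1;
    if (setenv("LD_PRELOAD", buf, 1) != 0)
        return -1;
    execvp("matlab", argv);   /* replaces this process on success */
    perror("execvp matlab");
    return -1;
}
```

A caller would pass an argv whose first element is "matlab", e.g. char *args[] = {"matlab", "-nodesktop", NULL};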

If someone has good connections to MathWorks (or a MW engineer is reading this)
they could use the attached simple (and standalone) 'mex' file to reproduce the problem
(w/o any epics).

Cheers
- Till


On 3/1/21 5:56 PM, Miroslaw Dach wrote:
Hi Till,

Thank you very much for the in-depth study of the problem. It looks like MathWorks has changed something in the code and, even worse, they have introduced an "unwanted feature" which affects Matlab 2020b and labCA users on RHEL7.

We will try one of your suggestions and let you know how things are.

Many Thanks
Mirek



On Mon, Mar 1, 2021 at 2:01 AM Till Straumann <till.straumann at psi.ch> wrote:
Hi Mirek.

I have investigated this deadlock and come to the conclusion that it is a problem
with matlab, probably under RHEL7 (I have no way to test under other systems, in particular Windows,
though).

When I debugged the deadlock I found that some matlab threads deadlock in a library called

glibc-2.17_shim.so

This (mathworks proprietary) library is LD_PRELOADed from the 'matlab' driver script where we find a comment:

    # Preload glibc_shim in case of RHLE7 variants
    test -e /usr/bin/ldd &&  ldd --version |  grep -q "(GNU libc) 2\.17"  \
            && export LD_PRELOAD="$LD_PRELOAD:$MATLAB/bin/glnxa64/glibc-2.17_shim.so" \
            && export MW_GLIBC_SHIM="$MATLAB/bin/glnxa64/glibc-2.17_shim.so"
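A small C diagnostic can report the same two conditions from inside a process: whether the running glibc is 2.17 and whether the shim named in the script above is in LD_PRELOAD. This assumes a glibc-based system (gnu_get_libc_version() is glibc-specific), and the helper names are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <gnu/libc-version.h>   /* glibc-specific */

/* 1 if the LD_PRELOAD value mentions the shim file name used by the matlab script. */
int shim_in_preload(const char *ld_preload)
{
    return ld_preload != NULL && strstr(ld_preload, "glibc-2.17_shim.so") != NULL;
}

/* 1 if the version string is exactly "2.17", the case the script tests for. */
int is_glibc_2_17(const char *version)
{
    return version != NULL && strcmp(version, "2.17") == 0;
}

/* Prints the running glibc version and whether the shim is preloaded. */
void report_shim_status(void)
{
    const char *ver = gnu_get_libc_version();
    const char *pre = getenv("LD_PRELOAD");
    printf("glibc %s (2.17: %s), shim in LD_PRELOAD: %s\n", ver,
           is_glibc_2_17(ver) ? "yes" : "no",
           shim_in_preload(pre) ? "yes" : "no");
}
```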

which suggests that only RHEL7 may be affected.

The deadlock happens when matlab
 - loads a shared object (or library)
 - AND the shared object executes some initialization code (e.g., constructors of global objects defined in the library)
 - AND the initialization code calls 'pthread_join()'. 'pthread_join()' then never returns.

Note that if 'ordinary' code in the shared object (i.e., as opposed to initialization code) uses 'pthread_join()' then
that works fine.
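The pattern described above can be sketched in C, with a GCC/Clang constructor attribute standing in for a C++ global-object constructor. This is not the attached mex file, only a hypothetical illustration: compiled into a shared object and loaded with dlopen() under the affected shim, the pthread_join() in lib_init() would be expected to hang; linked into an ordinary executable, the constructor runs before main() and completes normally.

```c
#include <pthread.h>

/* Flags so a caller can confirm the initializer and the worker actually ran. */
static volatile int worker_ran = 0;
static volatile int init_done  = 0;

static void *worker(void *arg)
{
    (void)arg;
    worker_ran = 1;   /* trivial work; the thread exits immediately */
    return NULL;
}

/* Runs automatically when the object is loaded: at program start for an
 * executable, or inside dlopen() for a shared object. This is the
 * problematic pattern: initialization code that calls pthread_join(). */
__attribute__((constructor))
static void lib_init(void)
{
    pthread_t tid;
    if (pthread_create(&tid, NULL, worker, NULL) == 0) {
        /* Under glibc-2.17_shim, when reached from dlopen(),
         * this call was observed never to return. */
        pthread_join(tid, NULL);
    }
    init_done = 1;
}

int example_init_done(void)  { return init_done; }
int example_worker_ran(void) { return worker_ran; }
```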

A simple example mex file (attached), which uses neither labca nor epics, reproduces the described behaviour.

EPICS' libCom does use 'pthread_join()' during initialization and is therefore affected.

At this point I can suggest two possible work-arounds (using one of them is sufficient):

1.) Use an EPICS-base build with posix priority scheduling disabled. This avoids a section of initialization
     code which calls 'pthread_join()'

     E.g., in configure/CONFIG_SITE:

    USE_POSIX_THREAD_PRIORITY_SCHEDULING = NO

2.) LD_PRELOAD EPICS' libCom.so *before* starting matlab

    LD_PRELOAD=<path_to_my_epics_lib>/libCom.so  matlab
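To illustrate what work-around 1.) changes, here is a hypothetical sketch (not EPICS source) of an initialization routine whose thread spawn-and-join is compiled only when a priority-scheduling switch is on. The macro EXAMPLE_PRIORITY_SCHEDULING and the priority-probing step are assumptions standing in for the real EPICS code; the point is only that with the switch off, no pthread_join() is left on the initialization path.

```c
#include <pthread.h>
#include <sched.h>

/* Hypothetical stand-in for the EPICS build option:
 * 1 mimics USE_POSIX_THREAD_PRIORITY_SCHEDULING=YES, 0 mimics =NO. */
#ifndef EXAMPLE_PRIORITY_SCHEDULING
#define EXAMPLE_PRIORITY_SCHEDULING 0
#endif

#if EXAMPLE_PRIORITY_SCHEDULING
/* Helper thread that probes the SCHED_FIFO priority range (assumed step). */
static void *probe_priorities(void *arg)
{
    int *max_prio = arg;
    *max_prio = sched_get_priority_max(SCHED_FIFO);
    return NULL;
}
#endif

/* Initialization routine; returns how many helper threads it joined. */
int example_thread_init(int *max_prio)
{
    *max_prio = -1;               /* -1: priority range not probed */
#if EXAMPLE_PRIORITY_SCHEDULING
    pthread_t tid;
    if (pthread_create(&tid, NULL, probe_priorities, max_prio) != 0)
        return 0;
    pthread_join(tid, NULL);      /* the kind of call that hangs under the shim
                                     when reached from a static initializer */
    return 1;
#else
    return 0;                     /* no thread created, nothing to join */
#endif
}
```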

HTH
- Till



On 2/19/21 4:20 AM, Miroslaw Dach wrote:
Hi Till,

We have crossed each other. You came to PSI from the US and I did the opposite. I moved to work in LBL.

Are you still maintaining the LabCa?
We are facing a problem with Matlab 2020b crashes when using labCa 3.7.2.
It looks like an incompatibility between Matlab 2020b and the latest official labCA version.
labCA 3.7.2 seems to be the latest version, unless you have a newer one?

Best Regards
Mirek




