I have been contacted by MathWorks regarding this issue. I have supplied them with my analysis and my non-EPICS/non-labCA example (mexJoin.cc, see earlier message), which reproduces the problem. To our current understanding, the problem occurs only under RHEL7 with matlab 2020b.
The answer I received from MathWorks is not very satisfactory, but to some extent understandable given that RHEL is notoriously outdated.
MathWorks claims that they had to back-port a certain feature which matlab requires to the glibc-2.17 library. This apparently conflicted with a work-around for the bug we observe (lockup if a library loaded by dlopen() uses pthread_join during static initialization) and re-introduced that bug into glibc-2.17.
matlab distributes a proprietary version of glibc-2.17 (glibc-2.17_shim, which is LD_PRELOADed by the matlab script) and this proprietary version contains the bug. According to MathWorks it is not possible to port the fix that is present in the native glibc-2.17 to their 'shim' version.
Consequently, it is not possible to use any MEX file under RHEL7/2020b which depends on a library that joins threads during static initialization.
MathWorks claims that it is not their 'fault' (I'd see that a little differently, since their proprietary modification of glibc-2.17 for RHEL7 clearly introduces a bug). I do see, however, that it is not easy to stay backwards compatible with the notoriously old RHEL while using much more modern compilers and library versions on other Linux systems.
The 'solution' proposed by MathWorks is as simple as 'modify all libraries used by toolboxes to not use threads during initialization'. That is not very helpful and quite unrealistic, I'd say.
MathWorks has closed this issue.
Therefore, for the foreseeable future I recommend one of the following approaches:
1. Avoid RHEL7 + 2020b (use a newer RHEL release or an older matlab) if at all possible.
2. Use a version of EPICS base that was compiled with posix RT scheduling disabled (but it might only be a matter of time until some other part of EPICS hits this bug).
3. Use LD_PRELOAD to load libCom from EPICS base before matlab is started (see earlier posts).
Best regards
- Till
PS: For the record, I paste MW's response with a few links to glibc discussions that touch on the issue(s):
The issue you discovered appears to be a
well-known issue with glibc, which is moreover documented in the
following bug report,
https://bugzilla.redhat.com/show_bug.cgi?id=1223055
While the referenced article talks about glibc versions greater
than 2.21, the underlying issue was in fact introduced with
the glibc 2.18 __cxa_thread_atexit_impl C++ compiler runtime
feature. Accordingly, we needed to backport the latter runtime
feature to glibc 2.17 via the shim library in order to be able
to continue supporting RHEL7 (which is in fact a 7 years old
operating system) with newer compilers and libraries.
As a matter of fact the glibc maintainers are still trying to
fix the set of issues involved here and have been working on it
for 6+ years, as the following threads attest,
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e400f3ccd36fe91d432cc7d45b4ccc799dece763
https://sourceware.org/bugzilla/show_bug.cgi?id=19329
https://libc-alpha.sourceware.narkive.com/baoZQBHf/patch-bz-19329-fix-race-between-tls-allocation-at-thread-creation-and-dlopen
https://sourceware.org/pipermail/libc-alpha/2021-February/122626.html
https://sourceware.org/pipermail/libc-alpha/2021-February/122634.html
Moreover, while the original issue was "worked around"
in glibc-2.23 by changing how TLS (thread local storage) is
allocated for this set of operations, that is unfortunately not
a strategy MathWorks can employ in shim layer.
The best recommendation that we can offer you for the
time being is to avoid creating and/or destroying threads while
loading shared libraries until the glibc maintainer's community
eventually gets to the bottom of this set of issues.
I hope that this information helps you to proceed and I
apologize once again for the inconvenience.
I am closing this Service Request tentatively, but please feel
free to come back to this communication anytime for similar
questions or concerns and I will reopen this Service Request for
you accordingly.
On 3/5/21 3:50 PM, Till Straumann via Tech-talk wrote:
Hi All.
It seems my original answer never made it to tech-talk. It can still be found below, but here are some more data I gathered:
Even with the suggested work-arounds I could not get labca-3.7.2 and epics-7.0.4.1 or 7.0.5 to work with matlab 2020b. It would either hang or crash when quitting matlab.
I cut a new [labca 3.8.0 release](https://github.com/till-s/epics-labca/releases/tag/labca_3_8_0) which addresses this problem, but you still need one (no need for both) of the following two work-arounds:
a) Use a build of epics base with posix priority scheduling disabled. In configure/CONFIG_SITE set
    USE_POSIX_THREAD_PRIORITY_SCHEDULING=NO
   then run 'make clean' and 'make'. Obviously, make sure no real-time systems are using this new build.
b) Use LD_PRELOAD to load and initialize libCom before starting matlab:
    LD_PRELOAD=<path_to_base>/lib/<arch>/libCom.so matlab <options>
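Work-around b) could be wrapped in a small launcher along the following lines. This is only a sketch: it assumes that EPICS_BASE and EPICS_HOST_ARCH are set in the environment and point at the base build in question, and the function names are made up for illustration. libCom.so is placed first so that entries already present in LD_PRELOAD are kept.

```shell
#!/bin/sh
# Sketch of a launcher for work-around b).
# Assumption: EPICS_BASE and EPICS_HOST_ARCH are set in the environment.
# The function names below are hypothetical.

# Print an LD_PRELOAD value that puts libCom.so first and keeps any
# entries that were already present.
#   $1 = EPICS base directory, $2 = host architecture, $3 = current LD_PRELOAD
libcom_preload() {
    lib="$1/lib/$2/libCom.so"
    if [ -n "$3" ]; then
        printf '%s:%s\n' "$lib" "$3"
    else
        printf '%s\n' "$lib"
    fi
}

# Start matlab with libCom.so preloaded; all arguments are passed to matlab.
start_matlab_with_libcom() {
    preload=$(libcom_preload "$EPICS_BASE" "$EPICS_HOST_ARCH" "${LD_PRELOAD:-}")
    LD_PRELOAD="$preload" matlab "$@"
}
```

Calling start_matlab_with_libcom with the usual matlab options would then take the place of invoking matlab directly.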
If someone has good connections to MathWorks (or a MW engineer is reading this), they could use the attached simple (and standalone) 'mex' file to reproduce the problem (w/o any epics).
Cheers
- Till
On 3/1/21 5:56 PM, Miroslaw Dach wrote:
Hi Till,
Thank you very much for the in-depth study of the problem. It looks like MathWorks has changed something in the code and, even worse, introduced an "unwanted feature" which affects Matlab 2020b and labCA users on RHEL7.
We will try one of your suggestions and let you know how things are.
Many Thanks
Mirek
Hi Mirek.
I have investigated this deadlock and come to the conclusion that it is a problem with matlab, probably under RHEL7 (I have no way to test under other systems, in particular Windows, though).
When I debugged the deadlock I found that some matlab threads deadlock in a library called
    glibc-2.17_shim.so
This (MathWorks proprietary) library is LD_PRELOADed from the 'matlab' driver script, where we find:
    # Preload glibc_shim in case of RHLE7 variants
    test -e /usr/bin/ldd && ldd --version | grep -q "(GNU libc) 2\.17" \
        && export LD_PRELOAD="$LD_PRELOAD:$MATLAB/bin/glnxa64/glibc-2.17_shim.so" \
        && export MW_GLIBC_SHIM="$MATLAB/bin/glnxa64/glibc-2.17_shim.so"
which leads to the hypothesis that only RHEL7 may be affected.
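To find out whether a given host falls into this category, the test from the driver script can be repeated by hand. The following is a rough sketch; the function names are made up for illustration, and the last helper assumes a Linux /proc filesystem.

```shell
#!/bin/sh
# Sketch: does the condition from the matlab driver script hold here?
# The function names are hypothetical.

# Read `ldd --version` output on stdin; succeed if it reports glibc 2.17.
# The pattern is the one used in the driver script.
reports_glibc_217() {
    grep -q "(GNU libc) 2\.17"
}

# Succeed if the driver script's test would pass on this host,
# i.e. if it would preload the shim.
shim_would_be_preloaded() {
    test -e /usr/bin/ldd && ldd --version | reports_glibc_217
}

# Succeed if the shim is mapped into a running process.
#   $1 = process id (Linux only; reads /proc/<pid>/maps)
shim_is_mapped() {
    grep -q 'glibc-2\.17_shim\.so' "/proc/$1/maps"
}
```

For example, `shim_would_be_preloaded && echo affected` run on the host in question, or `shim_is_mapped <pid>` with the process id of a running matlab.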
The deadlock happens when matlab
- loads a shared object (or library)
- AND the shared object executes some initialization code (e.g., constructors of global objects defined in the library)
- AND the initialization code calls 'pthread_join()'.
'pthread_join()' then never returns.
Note that if 'ordinary' code in the shared object (i.e., as opposed to initialization code) uses 'pthread_join()' then that works fine.
A simple example mex file (attached), which does not use labca or epics, reproduces the described behaviour.
EPICS' libCom does use 'pthread_join()' during initialization and is therefore affected.
At this point I can suggest two possible work-arounds (using one of them is sufficient):
1.) Use an EPICS-base build with posix priority scheduling disabled. This avoids a section of initialization code which calls 'pthread_join()'. E.g., in configure/CONFIG_SITE:
    USE_POSIX_THREAD_PRIORITY_SCHEDULING = NO
2.) LD_PRELOAD EPICS' libCom.so *before* starting matlab:
    LD_PRELOAD=<path_to_my_epics_lib>/libCom.so matlab
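For work-around 1.) it may be worth checking which value a given base tree is actually configured with before rebuilding. The sketch below only inspects the one file it is given and treats the last plain '=' assignment as the effective one; overrides in other files (for example a CONFIG_SITE.local, if one is used) are not considered, and the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: check the posix priority scheduling setting in a CONFIG_SITE file.
# The function name is hypothetical.

# Succeed if the last assignment of USE_POSIX_THREAD_PRIORITY_SCHEDULING
# in the given file is NO. Commented-out lines are ignored.
#   $1 = path to a CONFIG_SITE-style file
posix_priority_disabled() {
    value=$(sed -n 's/^[[:space:]]*USE_POSIX_THREAD_PRIORITY_SCHEDULING[[:space:]]*=[[:space:]]*\([A-Za-z]*\).*$/\1/p' "$1" | tail -n 1)
    [ "$value" = "NO" ]
}
```

For example, `posix_priority_disabled <path_to_base>/configure/CONFIG_SITE && echo disabled`.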
HTH
- Till
On 2/19/21 4:20 AM, Miroslaw Dach wrote:
Hi Till,
We have crossed each other. You came to PSI from the US and I did the opposite: I moved to work at LBL.
Are you still maintaining labCA?
We are facing crashes of Matlab 2020b when using labCA 3.7.2. It looks like an incompatibility between Matlab 2020b and the latest official labCA version.
labCA 3.7.2 seems to be the latest version, unless you have a newer one?