Jeff,
Thanks for moving this to core-talk. I've removed our
SLAC help email from the cc list, as it's no longer an issue
for our IT group.
I'm also satisfied for now with our fix, which is to use
the default stack size for all our linux architecture targets
by putting the def in CONFIG_SITE.
That has fixed the stack overflow in the nss_ldap lib for our
CA tools that run on linux, and our only embedded target is
RTEMS which doesn't use OSITHREAD_USE_DEFAULT_STACK.
I don't think this would be the right fix for sites with
embedded posix targets, whether linux or others.
Does this point to a need for embedded versions of the
configure/os/CONFIG.linux* files?
This issue should also go to the full tech-talk list soon, as
there will likely be other RHEL5 users that will be getting
these crashes as they update their nss_ldap libs.
Regards,
- Bruce
On 12/12/2011 03:31 PM, Jeff Hill wrote:
My question is this (not having written the EPICS posix interface layer and
not claiming to understand all of the issues involved): should the system
have different defaults for OSITHREAD_USE_DEFAULT_STACK specified in the
build system depending on whether it's an embedded linux arch or not?
I will keep https://bugs.launchpad.net/epics-base/+bug/903448 open a bit
longer.
Jeff
______________________________________________________
Jeffrey O. Hill Email [email protected]
LANL MS H820 Voice 505 665 1831
Los Alamos NM 87545 USA FAX 505 665 5107
Message content: TSPA
With sufficient thrust, pigs fly just fine. However, this is
not necessarily a good idea. It is hard to be sure where they
are going to land, and it could be dangerous sitting under them
as they fly overhead. -- RFC 1925
-----Original Message-----
From: Bruce Hill [mailto:[email protected]]
Sent: Monday, December 12, 2011 3:58 PM
To: pcds-help
Cc: Jeff Hill
Subject: Re: [SLAC #351542] caget crashing on psusr*
It seems to me that there's no good reason for us to use the
stack size feature in the CA lib for our linux based apps and tools,
so I defined OSITHREAD_USE_DEFAULT_STACK to YES
in the EPICS CONFIG_SITE file and rebuilt.
I did a couple of loops on psusr121 using the new caget and
nss_ldap version 42.el5_7.4 with over 1100 caget's and no
crashes.
EPICS 3.14.9-0.3.0, the one used by our current caget path,
is now rebuilt using default stack sizes.
I think we can close this now.
Thanks all for your help!
Regards,
- Bruce
On 12/12/2011 2:32 PM, Jeff Hill via RT wrote:
Queue/Owner: PCDS-Help [open] Nobody
Requestors: Hill, Bruce<[email protected]> x4752 901/131B [PPA Eng EE]
Ticket: https://www-rt.slac.stanford.edu/rt3/Ticket/Display.html?id=351542
Transaction: Correspondence added by [email protected]
Hi Bruce,
> We've been having a problem lately with caget and other CA clients
> crashing due to stack overflows in the nss_ldap library.
The synchronous DNS name lookup is only used for CA diagnostic messages.
It's handled using an asynchronous callback from a single auxiliary thread so
that the CA client library never blocks.
> there's a change in the latest nss_ldap library that puts
> a 128K buffer on the stack.
That's a pretty large buffer to be instantiating as a C automatic variable
on the stack. As for the advantages and disadvantages of specifying a posix
pthreads stack size on Linux and/or on embedded Linux, I don't claim to
understand all of the issues involved at this time. Certainly it seems that
on virtual memory Linux it might be best to let the virtual paging take
care of stack expansion.
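As an illustration of the point about large automatic variables, here is a minimal sketch (not nss_ldap or EPICS code) of the usual alternative: taking a large scratch buffer from the heap, so that the stack use of the function does not depend on the buffer size. The function name `sketch_process_with_heap_buffer` and the macro `SKETCH_BUF_BYTES` are hypothetical; only the 128K figure comes from the discussion above.

```c
#include <stdlib.h>
#include <string.h>

/* 128K, the buffer size mentioned in the thread. */
#define SKETCH_BUF_BYTES (128u * 1024u)

/*
 * Hypothetical worker: copies the input into a large scratch buffer and
 * returns the number of bytes copied, or -1 if the allocation fails.
 * The scratch buffer lives on the heap, so this function needs only a
 * few bytes of stack no matter how large SKETCH_BUF_BYTES is.
 */
long sketch_process_with_heap_buffer(const char *input)
{
    size_t len;
    char *scratch = malloc(SKETCH_BUF_BYTES);

    if (scratch == NULL)
        return -1;

    len = strlen(input);
    if (len >= SKETCH_BUF_BYTES)
        len = SKETCH_BUF_BYTES - 1;   /* truncate to fit, leave room for NUL */
    memcpy(scratch, input, len);
    scratch[len] = '\0';

    free(scratch);
    return (long)len;
}
```

Declaring the same buffer as `char scratch[SKETCH_BUF_BYTES];` inside the function would instead require at least 128K of stack in whichever thread happens to call it.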
I created a bug entry. You can find it at this URL.
https://bugs.launchpad.net/epics-base/+bug/903448
Jeff
-----Original Message-----
From: Bruce Hill [mailto:[email protected]]
Sent: Monday, December 12, 2011 1:59 PM
To: Jeff Hill
Cc: pcds-help
Subject: Re: [SLAC #351542] caget crashing on psusr*
Hi Jeff,
We've been having a problem lately with caget and other CA clients crashing
due to stack overflows in the nss_ldap library. We're running RHEL5, and
there's a change in the latest nss_ldap library that puts a 128K buffer on
the stack.
The change happened between nss_ldap version 42.el5 and the newer 42.el5_7.4.
We're mostly running EPICS 3.14.9, which by default for linux is allocating a
small stack for this in src/libCom/osi/os/posix/osdThread.c. Thus, it appears
that the library is overwriting the stack, leading to random crashes. I've
checked 3.14.12, and it appears this is still the default setting for linux.
Have you had any other reports of this crash?
Any reason why we shouldn't just use the default stack size?
Are there any plans to change this in upcoming EPICS releases?
Thanks,
- Bruce
On 12/12/2011 12:17 PM, Amedeo Perazzo via RT wrote:
Queue/Owner: PCDS-Help [open] Nobody
Requestors: Hill, Bruce<[email protected]> x4752 901/131B [PPA Eng EE]
Ticket: https://www-rt.slac.stanford.edu/rt3/Ticket/Display.html?id=351542
Transaction: Correspondence added by perazzo
I agree with Michael that having 128KB on the stack is _not_ a good idea, and
I agree with Booker that a 128KB stack size on a modern Linux system is
probably too small.
My guess is that EPICS is trying to reduce the footprint as much as
possible given that it must run on embedded systems which can have very
limited resources.
Bruce, should we ask the EPICS community how they plan to handle this?
If RHEL6 has the same nss_ldap code as the one that broke EPICS, the
community will be forced to handle this problem eventually.
On 12/12/11 11:55, [email protected] via RT wrote:
Queue/Owner: PCDS-Help [open] Nobody
Requestors: Hill, Bruce<[email protected]> x4752 901/131B [PPA Eng EE]
Ticket: https://www-rt.slac.stanford.edu/rt3/Ticket/Display.html?id=351542
Transaction: Correspondence added by mcbrowne
Well, it's the code that we're running... I'm not willing to say it's correct
though! You're absolutely right... these seem like very small stack sizes.
Proof that this is what is running: the full routine without ellipses is:
unsigned int epicsThreadGetStackSize (epicsThreadStackSizeClass stackSizeClass)
{
#if ! defined (_POSIX_THREAD_ATTR_STACKSIZE)
    return 0;
#elif defined (OSITHREAD_USE_DEFAULT_STACK)
    return 0;
#else
    static const unsigned stackSizeTable[epicsThreadStackBig+1] =
        {128*ARCH_STACK_FACTOR, 256*ARCH_STACK_FACTOR, 512*ARCH_STACK_FACTOR};
    if (stackSizeClass<epicsThreadStackSmall) {
        errlogPrintf("epicsThreadGetStackSize illegal argument (too small)");
        return stackSizeTable[epicsThreadStackBig];
    }
    if (stackSizeClass>epicsThreadStackBig) {
        errlogPrintf("epicsThreadGetStackSize illegal argument (too large)");
        return stackSizeTable[epicsThreadStackBig];
    }
    return stackSizeTable[stackSizeClass];
#endif /*_POSIX_THREAD_ATTR_STACKSIZE*/
}
Running gdb on psusr117:
psusr117% gdb caget
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-37.el5_7.1)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /reg/g/pcds/package/epics/3.14/base/R3.14.9-0.3.0/bin/linux-x86_64/caget...done.
(gdb) break main
Breakpoint 1 at 0x401d00: file ../caget.c, line 329.
(gdb) run
Starting program: /reg/g/pcds/package/epics/3.14/base/R3.14.9-0.3.0/bin/linux-x86_64/caget
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaac7000
[Thread debugging using libthread_db enabled]
Breakpoint 1, main (argc=1, argv=0x7fffffffdf68) at ../caget.c:329
329 {
(gdb) x/20i epicsThreadGetStackSize
0x2aaaaaf5e670<epicsThreadGetStackSize>: sub $0x8,%rsp
0x2aaaaaf5e674<epicsThreadGetStackSize+4>: cmp $0x2,%edi
0x2aaaaaf5e677<epicsThreadGetStackSize+7>: ja 0x2aaaaaf5e690 <epicsThreadGetStackSize+32>
0x2aaaaaf5e679<epicsThreadGetStackSize+9>: lea 0xebfc(%rip),%rax # 0x2aaaaaf6d27c<stackSizeTable.4846>
0x2aaaaaf5e680<epicsThreadGetStackSize+16>: mov %edi,%edx
0x2aaaaaf5e682<epicsThreadGetStackSize+18>: mov (%rax,%rdx,4),%eax
0x2aaaaaf5e685<epicsThreadGetStackSize+21>: add $0x8,%rsp
0x2aaaaaf5e689<epicsThreadGetStackSize+25>: retq
0x2aaaaaf5e68a<epicsThreadGetStackSize+26>: nopw 0x0(%rax,%rax,1)
0x2aaaaaf5e690<epicsThreadGetStackSize+32>: lea 0xe969(%rip),%rdi # 0x2aaaaaf6d000
0x2aaaaaf5e697<epicsThreadGetStackSize+39>: xor %eax,%eax
0x2aaaaaf5e699<epicsThreadGetStackSize+41>: callq 0x2aaaaaf47940 <errlogPrintf@plt>
0x2aaaaaf5e69e<epicsThreadGetStackSize+46>: mov $0x80000,%eax
0x2aaaaaf5e6a3<epicsThreadGetStackSize+51>: add $0x8,%rsp
0x2aaaaaf5e6a7<epicsThreadGetStackSize+55>: retq
0x2aaaaaf5e6a8: nopl 0x0(%rax,%rax,1)
0x2aaaaaf5e6b0<epicsThreadPrivateSet>: push %rbp
0x2aaaaaf5e6b1<epicsThreadPrivateSet+1>: mov %rdi,%rbp
0x2aaaaaf5e6b4<epicsThreadPrivateSet+4>: push %rbx
0x2aaaaaf5e6b5<epicsThreadPrivateSet+5>: mov %rsi,%rbx
(gdb) x/3d 0x2aaaaaf6d27c
0x2aaaaaf6d27c<stackSizeTable.4846>: 131072 262144 524288
(gdb)
In any event, it isn't just returning 0, which would be the case if we were
using OSITHREAD_USE_DEFAULT_STACK.
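To make the arithmetic concrete, here is a small standalone sketch (not the EPICS source) that rebuilds the three table values printed by `x/3d` above and computes how much of each stack would remain after a single 128K automatic buffer. The value 1024 for the stack factor is an assumption inferred from the dump (131072 = 128 * 1024); all `sketch_` and `SKETCH_` names are hypothetical.

```c
/*
 * Assumption: the stack factor is 1024 in this build, inferred from the
 * gdb dump (131072 262144 524288). The real ARCH_STACK_FACTOR definition
 * is not shown in this thread.
 */
#define SKETCH_ARCH_STACK_FACTOR 1024u

/* Stand-ins for the three stack size classes used as table indices. */
enum sketch_stack_class { SKETCH_SMALL = 0, SKETCH_MEDIUM = 1, SKETCH_BIG = 2 };

/* Stack size in bytes for a class, mirroring the 128/256/512 table. */
unsigned sketch_stack_size(enum sketch_stack_class c)
{
    static const unsigned table[3] = {
        128 * SKETCH_ARCH_STACK_FACTOR,
        256 * SKETCH_ARCH_STACK_FACTOR,
        512 * SKETCH_ARCH_STACK_FACTOR
    };
    return table[c];
}

/*
 * Bytes left on a stack of the given class after one automatic buffer of
 * buf_bytes. Zero or negative means the buffer alone uses up the stack,
 * before any call frames are counted.
 */
long sketch_stack_margin(enum sketch_stack_class c, unsigned buf_bytes)
{
    return (long)sketch_stack_size(c) - (long)buf_bytes;
}
```

Under these assumptions `sketch_stack_margin(SKETCH_SMALL, 128u * 1024u)` is 0: a 128K buffer by itself consumes an entire small-class stack, leaving nothing for the frames already on it.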
--Mike
Booker Bense via RT wrote:
On Mon, 12 Dec 2011, [email protected] via RT wrote:
/reg/g/pcds/package/epics/3.14/base/current/src/libCom/osi/os/posix/osdThread.c,
you will see that:
Is this the correct code? Does anyone know why you are setting
the stacksize? It's generally not recommended.
http://www.cognitus.net/html/howto/pthreadSemiFAQ_5.html
Can you just recompile with OSITHREAD_USE_DEFAULT_STACK?
#if defined (_POSIX_THREAD_ATTR_STACKSIZE)
#if ! defined (OSITHREAD_USE_DEFAULT_STACK)
    status = pthread_attr_setstacksize(
        &pthreadInfo->attr,(size_t)stackSize);
    checkStatusOnce(status,"pthread_attr_setstacksize");
#endif /*OSITHREAD_USE_DEFAULT_STACK*/
#endif /*_POSIX_THREAD_ATTR_STACKSIZE*/
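For readers who want to try the two behaviours outside EPICS, the following is a minimal sketch using only standard pthreads calls. It is not the osdThread.c implementation: the names `sketch_thread_body` and `sketch_run_thread` are hypothetical, and passing 0 for the size is used here to stand in for the case where the `pthread_attr_setstacksize` call in the fragment above is compiled out.

```c
#include <pthread.h>
#include <stddef.h>

/* Trivial thread body for the sketch: returns its argument unchanged. */
static void *sketch_thread_body(void *arg)
{
    return arg;
}

/*
 * Create and join one thread.
 * If stack_bytes is nonzero, request that stack size explicitly.
 * If stack_bytes is zero, leave the attribute object untouched so the
 * library default applies.
 * On success, stores the stack size recorded in the attribute object in
 * *size_out and returns 0; otherwise returns the pthreads error code.
 */
int sketch_run_thread(size_t stack_bytes, size_t *size_out)
{
    pthread_attr_t attr;
    pthread_t tid;
    int status;

    status = pthread_attr_init(&attr);
    if (status != 0)
        return status;

    if (stack_bytes != 0)
        status = pthread_attr_setstacksize(&attr, stack_bytes);
    if (status == 0)
        status = pthread_attr_getstacksize(&attr, size_out);
    if (status == 0)
        status = pthread_create(&tid, &attr, sketch_thread_body, NULL);
    if (status == 0)
        status = pthread_join(tid, NULL);

    pthread_attr_destroy(&attr);
    return status;
}
```

Calling `sketch_run_thread(128 * 1024, &size)` asks for the same 128K as the smallest table entry, while `sketch_run_thread(0, &size)` reports whatever default the platform provides, which on desktop Linux is typically much larger.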
I don't know all the details, but 128K seems very tiny compared
to current memory sizes. If I'm reading that page correctly,
all the local variables for the thread need to fit on the stack.
Another solution might be to simply remove ldap from the
nsswitch file for hosts.
- Booker C. Bense
Core was generated by `caget UND:R02:IOC:10:BAT:Fiducial'.
Program terminated with signal 11, Segmentation fault.
#0 0x00002aaaab2b7812 in _nss_ldap_readconfig () from /lib64/libnss_ldap.so.2
(gdb) bt
#0 0x00002aaaab2b7812 in _nss_ldap_readconfig () from /lib64/libnss_ldap.so.2
#1 0x00002aaaab2ad298 in ?? () from /lib64/libnss_ldap.so.2
#2 0x00002aaaab2af530 in _nss_ldap_search_s () from /lib64/libnss_ldap.so.2
#3 0x00002aaaab2b02f8 in _nss_ldap_getbyname () from /lib64/libnss_ldap.so.2
#4 0x00002aaaab2b30d9 in _nss_ldap_gethostbyaddr_r () from /lib64/libnss_ldap.so.2
#5 0x00002b4c98528055 in gethostbyaddr_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
#6 0x00002b4c98527e41 in gethostbyaddr () from /lib64/libc.so.6
#7 0x00002b4c9719d348 in ipAddrToHostName (pAddr=0x419f5f34, pBuf=0x653e600 "", bufSize=1024) at ../../../src/libCom/osi/os/posix/osdSock.c:148
#8 0x00002b4c9719d6d9 in ipAddrToA (paddr=0x419f5f30, pBuf=0x419f43f0 "X¿v«ª*", bufSize=0) at ../../../src/libCom/osi/osiSock.c:99
#9 0x00002b4c971981d2 in ipAddrToAsciiEnginePrivate::run (this=0x653e5f0) at ../../../src/libCom/misc/ipAddrToAsciiAsynchronous.cpp:289
#10 0x00002b4c97199a2d in epicsThreadCallEntryPoint (pPvt=<value optimized out>) at ../../../src/libCom/osi/epicsThread.cpp:59
#11 0x00002b4c9719f731 in start_routine (arg=<value optimized out>) at ../../../src/libCom/osi/os/posix/osdThread.c:322
#12 0x00002b4c973f373d in start_thread () from /lib64/libpthread.so.0
#13 0x00002b4c985124bd in clone () from /lib64/libc.so.6
(gdb) quit
It's intermittent, and sometimes crashes before printing the results and
sometimes after. I ran caget 10 times, and got 4 core dumps, and 7 successful
printouts of the value. I've done the stack trace many times and each time
it's in the same nss_ldap_readconfig() call. Does anyone have any idea why
nss ldap may have changed on the psusr* machines in the last few weeks? Is
anyone else seeing similar crashes?
Thanks,
- Bruce