EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: RE: [SLAC #351542] caget crashing on psusr*
From: "Jeff Hill" <[email protected]>
To: "'Bruce Hill'" <[email protected]>
Cc: "'pcds-help'" <[email protected]>, "'EPICS'" <[email protected]>
Date: Mon, 12 Dec 2011 15:32:19 -0700
Hi Bruce,

> We've been having  a problem lately with caget and other CA clients
> crashing due to stack overflows in the nss_ldap library.

The synchronous DNS name lookup is only used for CA diagnostic messages. It's
handled using an asynchronous callback from a single auxiliary thread so
that the CA client library never blocks.
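
The pattern described above can be sketched in a few lines of C. This is only an illustration, not the EPICS implementation: the names lookup_req, store_name and lookup_start are hypothetical, it starts one thread per request instead of feeding a single long-lived engine thread, and it calls getnameinfo() where the backtrace quoted below shows EPICS reaching gethostbyaddr().

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical callback type: receives the resolved name on the worker thread. */
typedef void (*name_cb)(const char *name, void *ctx);

/* Hypothetical request record; it must stay alive until the callback has run. */
struct lookup_req {
    struct sockaddr_in addr;  /* IPv4 address to translate */
    int flags;                /* getnameinfo() flags, e.g. NI_NUMERICHOST */
    name_cb cb;               /* invoked once with the result */
    void *ctx;                /* passed through to cb untouched */
};

/* Simple result holder plus a callback that copies the name into it. */
struct name_buf { char text[256]; };

void store_name(const char *name, void *ctx)
{
    struct name_buf *out = ctx;
    snprintf(out->text, sizeof out->text, "%s", name);
}

/* Runs on the auxiliary thread: the potentially slow lookup happens here,
 * so the thread that queued the request is never blocked by it. */
static void *lookup_worker(void *arg)
{
    struct lookup_req *req = arg;
    char host[1025];

    if (getnameinfo((const struct sockaddr *)&req->addr, sizeof req->addr,
                    host, sizeof host, NULL, 0, req->flags) != 0) {
        /* Lookup failed: fall back to the dotted-quad form. */
        if (inet_ntop(AF_INET, &req->addr.sin_addr, host, sizeof host) == NULL)
            strcpy(host, "?");
    }
    req->cb(host, req->ctx);
    return NULL;
}

/* Starts the lookup and returns immediately; returns 0 on success. */
int lookup_start(pthread_t *tid, struct lookup_req *req)
{
    assert(tid != NULL && req != NULL && req->cb != NULL);
    return pthread_create(tid, NULL, lookup_worker, req);
}
```

With flags set to 0 the worker goes through whatever name services the host is configured to use, which is where a library such as nss_ldap would end up running on the worker's stack. Passing NI_NUMERICHOST skips the name service and makes the result deterministic.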

> there's a change in the latest nss_ldap library that puts 
> a 128K buffer on the stack.

That's a pretty large buffer to be instantiating as a C automatic variable
on the stack. As for the advantages and disadvantages of specifying a POSIX
pthreads stack size on Linux and/or on embedded Linux, I don't claim to
understand all of the issues involved at this time. Certainly it seems that
on virtual-memory Linux it might be best to let virtual-memory paging take
care of stack expansion.
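
For readers who want to see the two options side by side, here is a minimal sketch using only standard pthreads attribute calls. The helper name effective_stack_size is hypothetical. Passing 0 leaves the attribute at the platform default, while a nonzero value requests an explicit size the way pthread_attr_setstacksize() is used in the osdThread.c fragment quoted below.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical helper: reports the stack size a thread created with these
 * attributes would be given. requested == 0 means "use the default".
 * Returns 0 if the attributes could not be set up. */
size_t effective_stack_size(size_t requested)
{
    pthread_attr_t attr;
    size_t size = 0;
    int rc;

    if (pthread_attr_init(&attr) != 0)
        return 0;

    /* Only override the size when the caller asked for one. */
    if (requested != 0 && pthread_attr_setstacksize(&attr, requested) != 0) {
        pthread_attr_destroy(&attr);
        return 0;
    }

    rc = pthread_attr_getstacksize(&attr, &size);
    assert(rc == 0);
    (void)rc;  /* keeps builds with NDEBUG free of unused-variable warnings */

    pthread_attr_destroy(&attr);
    return size;
}
```

Comparing effective_stack_size(0) with effective_stack_size(128 * 1024) on a given host shows how much room a library routine would have under each policy. The default value is platform dependent, so no particular number is assumed here.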

I created a bug entry. You can find it at this URL.

https://bugs.launchpad.net/epics-base/+bug/903448

Jeff
______________________________________________________
Jeffrey O. Hill           Email        [email protected]
LANL MS H820              Voice        505 665 1831
Los Alamos NM 87545 USA   FAX          505 665 5107

Message content: TSPA


> -----Original Message-----
> From: Bruce Hill [mailto:[email protected]]
> Sent: Monday, December 12, 2011 1:59 PM
> To: Jeff Hill
> Cc: pcds-help
> Subject: Re: [SLAC #351542] caget crashing on psusr*
> 
> Hi Jeff,
> We've been having a problem lately with caget and other CA clients crashing
> due to stack overflows in the nss_ldap library. We're running RHEL5, and
> there's a change in the latest nss_ldap library that puts a 128K buffer on
> the stack.
> 
> The change happened between nss_ldap version 42.el5 and the newer
> 42.el5_7.4.
> 
> We're mostly running EPICS 3.14.9, which by default for Linux is allocating
> a small stack for this in src/libCom/osi/os/posix/osdThread.c. Thus, it
> appears that the library is overwriting the stack, leading to random crashes.
> I've checked 3.14.12, and it appears this is still the default setting for
> Linux.
> 
> Have you had any other reports of this crash?
> 
> Any reason why we shouldn't just use the default stack size?
> 
> Are there any plans to change this in upcoming EPICS releases?
> 
> Thanks,
> - Bruce
> 
> On 12/12/2011 12:17 PM, Amedeo Perazzo via RT wrote:
> > Queue/Owner: PCDS-Help [open] Nobody
> >   Requestors: Hill, Bruce<[email protected]>  x4752 901/131B [PPA Eng EE]
> >       Ticket: https://www-rt.slac.stanford.edu/rt3/Ticket/Display.html?id=351542
> >
> > Transaction: Correspondence added by perazzo
> >
> > I agree with Michael having 128KB on the stack is _not_ a good idea and
> > I agree with Booker that a 128KB stack size on a modern Linux system is
> > probably too small.
> >
> > My guess is that EPICS is trying to reduce the footprint as much as
> > possible given that it must run on embedded systems which can have very
> > limited resources.
> >
> > Bruce, should we ask the EPICS community how they plan to handle this?
> > If RHEL6 has the same nss_ldap code as the one that broke EPICS, the
> > community will be forced to handle this problem eventually.
> >
> >
> > On 12/12/11 11:55, [email protected] via RT wrote:
> >> Queue/Owner: PCDS-Help [open] Nobody
> >>    Requestors: Hill, Bruce<[email protected]>   x4752 901/131B [PPA Eng EE]
> >>        Ticket: https://www-rt.slac.stanford.edu/rt3/Ticket/Display.html?id=351542
> >>
> >> Transaction: Correspondence added by mcbrowne
> >>
> >> Well, it's the code that we're running... I'm not willing to say it's
> >> correct though! You're absolutely right... these seem like very small
> >> stack sizes.
> >>
> >> Proof that this is what is running: the full routine without ellipses is:
> >>
> >>     unsigned int epicsThreadGetStackSize (epicsThreadStackSizeClass stackSizeClass)
> >>     {
> >>     #if ! defined (_POSIX_THREAD_ATTR_STACKSIZE)
> >>         return 0;
> >>     #elif defined (OSITHREAD_USE_DEFAULT_STACK)
> >>         return 0;
> >>     #else
> >>         static const unsigned stackSizeTable[epicsThreadStackBig+1] =
> >>             {128*ARCH_STACK_FACTOR, 256*ARCH_STACK_FACTOR, 512*ARCH_STACK_FACTOR};
> >>         if (stackSizeClass<epicsThreadStackSmall) {
> >>             errlogPrintf("epicsThreadGetStackSize illegal argument (too small)");
> >>             return stackSizeTable[epicsThreadStackBig];
> >>         }
> >>
> >>         if (stackSizeClass>epicsThreadStackBig) {
> >>             errlogPrintf("epicsThreadGetStackSize illegal argument (too large)");
> >>             return stackSizeTable[epicsThreadStackBig];
> >>         }
> >>
> >>         return stackSizeTable[stackSizeClass];
> >>     #endif /*_POSIX_THREAD_ATTR_STACKSIZE*/
> >>     }
> >>
> >> Running gdb on psusr117:
> >>
> >>     psusr117% gdb caget
> >>     GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-37.el5_7.1)
> >>     Copyright (C) 2009 Free Software Foundation, Inc.
> >>     License GPLv3+: GNU GPL version 3 or later
> >>     <http://gnu.org/licenses/gpl.html>
> >>     This is free software: you are free to change and redistribute it.
> >>     There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> >>     and "show warranty" for details.
> >>     This GDB was configured as "x86_64-redhat-linux-gnu".
> >>     For bug reporting instructions, please see:
> >>     <http://www.gnu.org/software/gdb/bugs/>...
> >>     Reading symbols from
> >>     /reg/g/pcds/package/epics/3.14/base/R3.14.9-0.3.0/bin/linux-x86_64/caget...done.
> >>     (gdb) break main
> >>     Breakpoint 1 at 0x401d00: file ../caget.c, line 329.
> >>     (gdb) run
> >>     Starting program:
> >>     /reg/g/pcds/package/epics/3.14/base/R3.14.9-0.3.0/bin/linux-x86_64/caget
> >>     warning: no loadable sections found in added symbol-file system-supplied
> >>     DSO at 0x2aaaaaac7000
> >>     [Thread debugging using libthread_db enabled]
> >>
> >>     Breakpoint 1, main (argc=1, argv=0x7fffffffdf68) at ../caget.c:329
> >>     329 {
> >>     (gdb) x/20i epicsThreadGetStackSize
> >>     0x2aaaaaf5e670<epicsThreadGetStackSize>: sub $0x8,%rsp
> >>     0x2aaaaaf5e674<epicsThreadGetStackSize+4>: cmp $0x2,%edi
> >>     0x2aaaaaf5e677<epicsThreadGetStackSize+7>: ja 0x2aaaaaf5e690
> >>     <epicsThreadGetStackSize+32>
> >>     0x2aaaaaf5e679<epicsThreadGetStackSize+9>:
> >>     lea 0xebfc(%rip),%rax # 0x2aaaaaf6d27c<stackSizeTable.4846>
> >>     0x2aaaaaf5e680<epicsThreadGetStackSize+16>: mov %edi,%edx
> >>     0x2aaaaaf5e682<epicsThreadGetStackSize+18>: mov (%rax,%rdx,4),%eax
> >>     0x2aaaaaf5e685<epicsThreadGetStackSize+21>: add $0x8,%rsp
> >>     0x2aaaaaf5e689<epicsThreadGetStackSize+25>: retq
> >>     0x2aaaaaf5e68a<epicsThreadGetStackSize+26>: nopw 0x0(%rax,%rax,1)
> >>     0x2aaaaaf5e690<epicsThreadGetStackSize+32>: lea 0xe969(%rip),%rdi #
> >>     0x2aaaaaf6d000
> >>     0x2aaaaaf5e697<epicsThreadGetStackSize+39>: xor %eax,%eax
> >>     0x2aaaaaf5e699<epicsThreadGetStackSize+41>: callq 0x2aaaaaf47940
> >>     <errlogPrintf@plt>
> >>     0x2aaaaaf5e69e<epicsThreadGetStackSize+46>: mov $0x80000,%eax
> >>     0x2aaaaaf5e6a3<epicsThreadGetStackSize+51>: add $0x8,%rsp
> >>     0x2aaaaaf5e6a7<epicsThreadGetStackSize+55>: retq
> >>     0x2aaaaaf5e6a8: nopl 0x0(%rax,%rax,1)
> >>     0x2aaaaaf5e6b0<epicsThreadPrivateSet>: push %rbp
> >>     0x2aaaaaf5e6b1<epicsThreadPrivateSet+1>: mov %rdi,%rbp
> >>     0x2aaaaaf5e6b4<epicsThreadPrivateSet+4>: push %rbx
> >>     0x2aaaaaf5e6b5<epicsThreadPrivateSet+5>: mov %rsi,%rbx
> >>     (gdb) x/3d 0x2aaaaaf6d27c
> >>     0x2aaaaaf6d27c<stackSizeTable.4846>: 131072 262144 524288
> >>     (gdb)
> >>
> >> In any event, it isn't just returning 0, which would be the case if we
> >> were using OSITHREAD_USE_DEFAULT_STACK.
> >> --Mike
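
The table dumped by gdb above can be checked with a few lines of arithmetic. The sketch below assumes ARCH_STACK_FACTOR is 1024 on this build, which is only inferred from the dumped values (131072 = 128 * 1024) and not read from the EPICS headers. The enum and function names are hypothetical stand-ins for the EPICS ones.

```c
#include <assert.h>

/* Assumption: inferred from the gdb dump (131072 / 128), not from EPICS headers. */
#define ARCH_STACK_FACTOR 1024u

/* Hypothetical stand-in for epicsThreadStackSizeClass. */
enum stack_class { STACK_SMALL = 0, STACK_MEDIUM = 1, STACK_BIG = 2 };

/* Mirrors the lookup seen in the disassembly: any out-of-range class
 * falls back to the largest entry (0x80000 bytes). */
unsigned stack_bytes_for(enum stack_class c)
{
    static const unsigned table[] = {
        128 * ARCH_STACK_FACTOR,
        256 * ARCH_STACK_FACTOR,
        512 * ARCH_STACK_FACTOR
    };
    /* Guard against the table and the enum drifting apart. */
    assert(sizeof table / sizeof table[0] == STACK_BIG + 1);

    if ((unsigned)c > STACK_BIG)
        return table[STACK_BIG];
    return table[c];
}

/* A single frame of frame_bytes cannot fit if it is at least as large
 * as the whole stack; in practice other frames need room as well. */
int frame_overflows(enum stack_class c, unsigned frame_bytes)
{
    return frame_bytes >= stack_bytes_for(c);
}
```

Under that assumption a 128K automatic buffer fills the small class completely before any other frame is counted, while the medium and big classes leave some headroom.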
> >>
> >>
> >>
> >> Booker Bense via RT wrote:
> >>
> >>     On Mon, 12 Dec 2011,   [email protected]   via RT wrote:
> >>
> >>
> >>
> >>
> >>     /reg/g/pcds/package/epics/3.14/base/current/src/libCom/osi/os/posix/osdThread.c,
> >>       you will see that:
> >>
> >>
> >>
> >>     Is this the correct code? Does anyone know why you are setting
> >>     the stacksize? It's generally not recommended.
> >>     http://www.cognitus.net/html/howto/pthreadSemiFAQ_5.html
> >>     Can you just recompile with OSITHREAD_USE_DEFAULT_STACK?
> >>
> >>
> >>     #if defined (_POSIX_THREAD_ATTR_STACKSIZE)
> >>     #if ! defined (OSITHREAD_USE_DEFAULT_STACK)
> >>         status = pthread_attr_setstacksize(&pthreadInfo->attr, (size_t)stackSize);
> >>         checkStatusOnce(status,"pthread_attr_setstacksize");
> >>     #endif /*OSITHREAD_USE_DEFAULT_STACK*/
> >>     #endif /*_POSIX_THREAD_ATTR_STACKSIZE*/
> >>
> >>     I don't know all the details, but 128K seems very tiny compared
> >>     to current memory sizes. If I'm reading that page correctly,
> >>     all the local variables for the thread need to fit on the stack.
> >>
> >>     Another solution might be to simply remove ldap from the
> >>     nsswitch file for hosts.
> >>
> >>     - Booker C. Bense
> >>
> >>
> >>
> >>
> >>
> >>
> >> Core was generated by `caget UND:R02:IOC:10:BAT:Fiducial'.
> >> Program terminated with signal 11, Segmentation fault.
> >> #0 0x00002aaaab2b7812 in _nss_ldap_readconfig () from /lib64/libnss_ldap.so.2
> >> (gdb) bt
> >> #0 0x00002aaaab2b7812 in _nss_ldap_readconfig () from /lib64/libnss_ldap.so.2
> >> #1 0x00002aaaab2ad298 in ?? () from /lib64/libnss_ldap.so.2
> >> #2 0x00002aaaab2af530 in _nss_ldap_search_s () from /lib64/libnss_ldap.so.2
> >> #3 0x00002aaaab2b02f8 in _nss_ldap_getbyname () from /lib64/libnss_ldap.so.2
> >> #4 0x00002aaaab2b30d9 in _nss_ldap_gethostbyaddr_r () from /lib64/libnss_ldap.so.2
> >> #5 0x00002b4c98528055 in gethostbyaddr_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
> >> #6 0x00002b4c98527e41 in gethostbyaddr () from /lib64/libc.so.6
> >> #7 0x00002b4c9719d348 in ipAddrToHostName (pAddr=0x419f5f34, pBuf=0x653e600 "", bufSize=1024) at ../../../src/libCom/osi/os/posix/osdSock.c:148
> >> #8 0x00002b4c9719d6d9 in ipAddrToA (paddr=0x419f5f30, pBuf=0x419f43f0 "X¿v«ª*", bufSize=0) at ../../../src/libCom/osi/osiSock.c:99
> >> #9 0x00002b4c971981d2 in ipAddrToAsciiEnginePrivate::run (this=0x653e5f0) at ../../../src/libCom/misc/ipAddrToAsciiAsynchronous.cpp:289
> >> #10 0x00002b4c97199a2d in epicsThreadCallEntryPoint (pPvt=<value optimized out>) at ../../../src/libCom/osi/epicsThread.cpp:59
> >> #11 0x00002b4c9719f731 in start_routine (arg=<value optimized out>) at ../../../src/libCom/osi/os/posix/osdThread.c:322
> >> #12 0x00002b4c973f373d in start_thread () from /lib64/libpthread.so.0
> >> #13 0x00002b4c985124bd in clone () from /lib64/libc.so.6
> >> (gdb) quit
> >>
> >> It's intermittent, and sometimes crashes before printing the results and
> >> sometimes after. I ran caget 10 times, and got 4 core dumps, and 7
> >> successful printouts of the value. I've done the stack trace many times
> >> and each time it's in the same nss_ldap_readconfig() call.
> >>
> >> Does anyone have any idea why nss ldap may have changed on the psusr*
> >> machines in the last few weeks? Is anyone else seeing similar crashes?
> >>
> >> Thanks,
> >> - Bruce
> >
> >


