Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: RE: [SLAC #351542] caget crashing on psusr*
From: "Jeff Hill" <johill@lanl.gov>
To: "'Bruce Hill'" <bhill@slac.stanford.edu>, "'pcds-help'" <pcds-help@slac.stanford.edu>
Cc: "'EPICS core-talk'" <core-talk@aps.anl.gov>
Date: Mon, 12 Dec 2011 16:31:17 -0700
My question is this (not having written the EPICS posix interface layer and
not claiming to understand all of the issues involved): should the build
system specify different defaults for OSITHREAD_USE_DEFAULT_STACK depending
on whether or not the target is an embedded Linux arch?

I will keep https://bugs.launchpad.net/epics-base/+bug/903448 open a bit
longer.

Jeff
______________________________________________________
Jeffrey O. Hill           Email        johill@lanl.gov
LANL MS H820              Voice        505 665 1831
Los Alamos NM 87545 USA   FAX          505 665 5107

Message content: TSPA

With sufficient thrust, pigs fly just fine. However, this is
not necessarily a good idea. It is hard to be sure where they
are going to land, and it could be dangerous sitting under them
as they fly overhead. -- RFC 1925


> -----Original Message-----
> From: Bruce Hill [mailto:bhill@slac.stanford.edu]
> Sent: Monday, December 12, 2011 3:58 PM
> To: pcds-help
> Cc: Jeff Hill
> Subject: Re: [SLAC #351542] caget crashing on psusr*
> 
> It seems to me that there's no good reason for us to use the
> stack size feature in the CA lib for our linux based apps and tools,
> so I defined OSITHREAD_USE_DEFAULT_STACK to YES
> in the EPICS CONFIG_SITE file and rebuilt.
> 
> I did a couple of loops on psusr121 using the new caget and
> nss_ldap version 42.el5_7.4 with over 1100 caget's and no
> crashes.
> 
> EPICS 3.14.9-0.3.0, the one used by our current caget path,
> is now rebuilt using default stack sizes.
> 
> I think we can close this now.
> 
> Thanks all for your help!
> 
> Regards,
> - Bruce
> 
> 
> On 12/12/2011 2:32 PM, Jeff Hill via RT wrote:
> > Queue/Owner: PCDS-Help [open] Nobody
> >   Requestors: Hill, Bruce<bhill@slac.stanford.edu>  x4752 901/131B [PPA Eng EE]
> >       Ticket: https://www-rt.slac.stanford.edu/rt3/Ticket/Display.html?id=351542
> >
> > Transaction: Correspondence added by johill@lanl.gov
> >
> > Hi Bruce,
> >
> >> We've been having  a problem lately with caget and other CA clients
> >> crashing due to stack overflows in the nss_ldap library.
> > The synchronous DNS name lookup is only used for CA diagnostic messages.
> > It's handled using an asynchronous callback from a single auxiliary thread
> > so that the CA client library never blocks.
> >
> >> there's a change in the latest nss_ldap library that puts
> >> a 128K buffer on the stack.
> > That's a pretty large buffer to be instantiating as a C automatic variable
> > on the stack. As for the advantages and disadvantages of specifying a posix
> > pthreads stack size on Linux and/or on embedded Linux, I don't claim to
> > understand all of the issues involved at this time. Certainly it seems that
> > on virtual memory Linux it might be best to let the virtual paging take
> > care of stack expansion.
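
For readers who want to see what "the default" amounts to on a given host, here is a minimal sketch, not taken from EPICS, that asks pthreads for the stack size a thread receives when none is requested. The function name sketch_default_stack_bytes is hypothetical. The sketch relies on the behaviour of common Linux C libraries, where a freshly initialised attribute object reports the default size.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical helper: returns the stack size (in bytes) that a thread gets
   when no size is requested, or 0 if the query fails. */
size_t sketch_default_stack_bytes(void)
{
    pthread_attr_t attr;
    size_t bytes = 0;
    int rc;

    if (pthread_attr_init(&attr) != 0) {
        return 0;
    }
    /* On a freshly initialised attribute object this reports the default. */
    if (pthread_attr_getstacksize(&attr, &bytes) != 0) {
        bytes = 0;
    }
    rc = pthread_attr_destroy(&attr);
    assert(rc == 0); /* destroying a valid attribute object should not fail */
    (void)rc;
    return bytes;
}
```

On typical desktop Linux systems the value follows the process stack limit (ulimit -s) and is commonly several megabytes, which is far more than the 128K discussed in this thread.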
> >
> > I created a bug entry. You can find it at this URL.
> >
> > https://bugs.launchpad.net/epics-base/+bug/903448
> >
> > Jeff
> > ______________________________________________________
> > Jeffrey O. Hill           Email        johill@lanl.gov
> > LANL MS H820              Voice        505 665 1831
> > Los Alamos NM 87545 USA   FAX          505 665 5107
> >
> > Message content: TSPA
> >
> >
> >> -----Original Message-----
> >> From: Bruce Hill [mailto:bhill@slac.stanford.edu]
> >> Sent: Monday, December 12, 2011 1:59 PM
> >> To: Jeff Hill
> >> Cc: pcds-help
> >> Subject: Re: [SLAC #351542] caget crashing on psusr*
> >>
> >> Hi Jeff,
> >> We've been having a problem lately with caget and other CA clients
> >> crashing due to stack overflows in the nss_ldap library. We're running
> >> RHEL5, and there's a change in the latest nss_ldap library that puts a
> >> 128K buffer on the stack.
> >>
> >> The change happened between nss_ldap version 42.el5 and the newer
> >> 42.el5_7.4.
> >>
> >> We're mostly running EPICS 3.14.9, which by default for linux is
> >> allocating a small stack for this in src/libCom/osi/os/posix/osdThread.c.
> >> Thus, it appears that the library is overwriting the stack leading to
> >> random crashes. I've checked 3.14.12, and it appears this is still the
> >> default setting for linux.
> >>
> >> Have you had any other reports of this crash?
> >>
> >> Any reason why we shouldn't just use the default stack size?
> >>
> >> Are there any plans to change this in upcoming EPICS releases?
> >>
> >> Thanks,
> >> - Bruce
> >>
> >> On 12/12/2011 12:17 PM, Amedeo Perazzo via RT wrote:
> >>> Queue/Owner: PCDS-Help [open] Nobody
> >>>    Requestors: Hill, Bruce<bhill@slac.stanford.edu>   x4752 901/131B [PPA Eng EE]
> >>>        Ticket: https://www-rt.slac.stanford.edu/rt3/Ticket/Display.html?id=351542
> >>> Transaction: Correspondence added by perazzo
> >>>
> >>> I agree with Michael that having 128KB on the stack is _not_ a good
> >>> idea, and I agree with Booker that a 128KB stack size on a modern Linux
> >>> system is probably too small.
> >>>
> >>> My guess is that EPICS is trying to reduce the footprint as much as
> >>> possible given that it must run on embedded systems which can have very
> >>> limited resources.
> >>>
> >>> Bruce, should we ask the EPICS community how they plan to handle this?
> >>> If RHEL6 has the same nss_ldap code as the one that broke EPICS, the
> >>> community will be forced to handle this problem eventually.
> >>>
> >>>
> >>> On 12/12/11 11:55, mcbrowne@slac.stanford.edu via RT wrote:
> >>>> Queue/Owner: PCDS-Help [open] Nobody
> >>>>     Requestors: Hill, Bruce<bhill@slac.stanford.edu>    x4752 901/131B [PPA Eng EE]
> >>>>         Ticket: https://www-rt.slac.stanford.edu/rt3/Ticket/Display.html?id=351542
> >>>> Transaction: Correspondence added by mcbrowne
> >>>>
> >>>> Well, it's the code that we're running... I'm not willing to say it's
> >>>> correct though! You're absolutely right... these seem like very small
> >>>> stack sizes.
> >>>> Proof that this is what is running: the full routine without ellipses is:
> >>>>      unsigned int epicsThreadGetStackSize (epicsThreadStackSizeClass stackSizeClass)
> >>>>      {
> >>>>      #if ! defined (_POSIX_THREAD_ATTR_STACKSIZE)
> >>>>          return 0;
> >>>>      #elif defined (OSITHREAD_USE_DEFAULT_STACK)
> >>>>          return 0;
> >>>>      #else
> >>>>          static const unsigned stackSizeTable[epicsThreadStackBig+1] =
> >>>>              {128*ARCH_STACK_FACTOR, 256*ARCH_STACK_FACTOR, 512*ARCH_STACK_FACTOR};
> >>>>          if (stackSizeClass<epicsThreadStackSmall) {
> >>>>              errlogPrintf("epicsThreadGetStackSize illegal argument (too small)");
> >>>>              return stackSizeTable[epicsThreadStackBig];
> >>>>          }
> >>>>
> >>>>          if (stackSizeClass>epicsThreadStackBig) {
> >>>>              errlogPrintf("epicsThreadGetStackSize illegal argument (too large)");
> >>>>              return stackSizeTable[epicsThreadStackBig];
> >>>>          }
> >>>>
> >>>>          return stackSizeTable[stackSizeClass];
> >>>>      #endif /*_POSIX_THREAD_ATTR_STACKSIZE*/
> >>>>      }
> >>>>
> >>>> Running gdb on psusr117:
> >>>>
> >>>>      psusr117% gdb caget
> >>>>      GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-37.el5_7.1)
> >>>>      Copyright (C) 2009 Free Software Foundation, Inc.
> >>>>      License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> >>>>      This is free software: you are free to change and redistribute it.
> >>>>      There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> >>>>      and "show warranty" for details.
> >>>>      This GDB was configured as "x86_64-redhat-linux-gnu".
> >>>>      For bug reporting instructions, please see:
> >>>>      <http://www.gnu.org/software/gdb/bugs/>...
> >>>>      Reading symbols from /reg/g/pcds/package/epics/3.14/base/R3.14.9-0.3.0/bin/linux-x86_64/caget...done.
> >>>>      (gdb) break main
> >>>>      Breakpoint 1 at 0x401d00: file ../caget.c, line 329.
> >>>>      (gdb) run
> >>>>      Starting program: /reg/g/pcds/package/epics/3.14/base/R3.14.9-0.3.0/bin/linux-x86_64/caget
> >>>>      warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaac7000
> >>>>      [Thread debugging using libthread_db enabled]
> >>>>
> >>>>      Breakpoint 1, main (argc=1, argv=0x7fffffffdf68) at ../caget.c:329
> >>>>      329 {
> >>>>      (gdb) x/20i epicsThreadGetStackSize
> >>>>      0x2aaaaaf5e670<epicsThreadGetStackSize>: sub $0x8,%rsp
> >>>>      0x2aaaaaf5e674<epicsThreadGetStackSize+4>: cmp $0x2,%edi
> >>>>      0x2aaaaaf5e677<epicsThreadGetStackSize+7>: ja 0x2aaaaaf5e690 <epicsThreadGetStackSize+32>
> >>>>      0x2aaaaaf5e679<epicsThreadGetStackSize+9>: lea 0xebfc(%rip),%rax # 0x2aaaaaf6d27c<stackSizeTable.4846>
> >>>>      0x2aaaaaf5e680<epicsThreadGetStackSize+16>: mov %edi,%edx
> >>>>      0x2aaaaaf5e682<epicsThreadGetStackSize+18>: mov (%rax,%rdx,4),%eax
> >>>>      0x2aaaaaf5e685<epicsThreadGetStackSize+21>: add $0x8,%rsp
> >>>>      0x2aaaaaf5e689<epicsThreadGetStackSize+25>: retq
> >>>>      0x2aaaaaf5e68a<epicsThreadGetStackSize+26>: nopw 0x0(%rax,%rax,1)
> >>>>      0x2aaaaaf5e690<epicsThreadGetStackSize+32>: lea 0xe969(%rip),%rdi # 0x2aaaaaf6d000
> >>>>      0x2aaaaaf5e697<epicsThreadGetStackSize+39>: xor %eax,%eax
> >>>>      0x2aaaaaf5e699<epicsThreadGetStackSize+41>: callq 0x2aaaaaf47940 <errlogPrintf@plt>
> >>>>      0x2aaaaaf5e69e<epicsThreadGetStackSize+46>: mov $0x80000,%eax
> >>>>      0x2aaaaaf5e6a3<epicsThreadGetStackSize+51>: add $0x8,%rsp
> >>>>      0x2aaaaaf5e6a7<epicsThreadGetStackSize+55>: retq
> >>>>      0x2aaaaaf5e6a8: nopl 0x0(%rax,%rax,1)
> >>>>      0x2aaaaaf5e6b0<epicsThreadPrivateSet>: push %rbp
> >>>>      0x2aaaaaf5e6b1<epicsThreadPrivateSet+1>: mov %rdi,%rbp
> >>>>      0x2aaaaaf5e6b4<epicsThreadPrivateSet+4>: push %rbx
> >>>>      0x2aaaaaf5e6b5<epicsThreadPrivateSet+5>: mov %rsi,%rbx
> >>>>      (gdb) x/3d 0x2aaaaaf6d27c
> >>>>      0x2aaaaaf6d27c<stackSizeTable.4846>: 131072 262144 524288
> >>>>      (gdb)
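
As a cross-check of the dump above, the following sketch reproduces the three table entries and the two return paths of the quoted routine. It is an illustration only. The factor of 1024 is an assumption inferred from the dumped value 131072 (128 * 1024), not copied from osdThread.c, and every name prefixed with sketch_ or SKETCH_ is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: inferred from the gdb dump (131072 / 128), not from osdThread.c. */
#define SKETCH_STACK_FACTOR 1024u

/* Hypothetical stand-ins for the epicsThreadStackSizeClass values. */
enum { SKETCH_SMALL = 0, SKETCH_MEDIUM = 1, SKETCH_BIG = 2 };

/* Returns the stack size to request in bytes. A result of 0 means "leave the
   platform default alone", mirroring the OSITHREAD_USE_DEFAULT_STACK branch. */
unsigned sketch_stack_size(int stack_class, int use_default_stack)
{
    static const unsigned table[SKETCH_BIG + 1] = {
        128 * SKETCH_STACK_FACTOR,
        256 * SKETCH_STACK_FACTOR,
        512 * SKETCH_STACK_FACTOR
    };
    assert(sizeof table / sizeof table[0] == SKETCH_BIG + 1);

    if (use_default_stack) {
        return 0;
    }
    /* Out-of-range classes fall back to the biggest entry, as in the quoted code. */
    if (stack_class < SKETCH_SMALL || stack_class > SKETCH_BIG) {
        return table[SKETCH_BIG];
    }
    return table[stack_class];
}
```

Under that assumption the three entries come out as 131072, 262144 and 524288 bytes. The fallback value 524288 is 0x80000, the constant loaded at epicsThreadGetStackSize+46 in the disassembly.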
> >>>>
> >>>> In any event, it isn't just returning 0, which would be the case if we
> >>>> were using OSITHREAD_USE_DEFAULT_STACK.
> >>>> --Mike
> >>>>
> >>>>
> >>>>
> >>>> Booker Bense via RT wrote:
> >>>>
> >>>>      On Mon, 12 Dec 2011,   mcbrowne@slac.stanford.edu   via RT wrote:
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>      /reg/g/pcds/package/epics/3.14/base/current/src/libCom/osi/os/posix/osdThread.c,
> >>>>        you will see that:
> >>>>
> >>>>
> >>>>
> >>>>      Is this the correct code? Does anyone know why you are setting
> >>>>      the stacksize? It's generally not recommended.
> >>>>      http://www.cognitus.net/html/howto/pthreadSemiFAQ_5.html
> >>>>      Can you just recompile with OSITHREAD_USE_DEFAULT_STACK?
> >>>>
> >>>>
> >>>>      #if defined (_POSIX_THREAD_ATTR_STACKSIZE)
> >>>>      #if ! defined (OSITHREAD_USE_DEFAULT_STACK)
> >>>>           status = pthread_attr_setstacksize(&pthreadInfo->attr,(size_t)stackSize);
> >>>>           checkStatusOnce(status,"pthread_attr_setstacksize");
> >>>>      #endif /*OSITHREAD_USE_DEFAULT_STACK*/
> >>>>      #endif /*_POSIX_THREAD_ATTR_STACKSIZE*/
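
To make the interaction concrete, here is a small standalone sketch, not EPICS code, of a thread whose stack size is set explicitly with pthread_attr_setstacksize and whose worker keeps a 128 KiB automatic buffer, the size reported for the newer nss_ldap. The names sketch_worker and sketch_run_with_stack are hypothetical. Passing 0 keeps the platform default, which is the effect OSITHREAD_USE_DEFAULT_STACK has in the fragment above.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define SKETCH_BUF_BYTES (128u * 1024u) /* on-stack buffer size discussed in the thread */

/* Worker with a 128 KiB automatic buffer; touches it page by page so the
   stack space is really used. Stores 2 through arg when it completes. */
static void *sketch_worker(void *arg)
{
    volatile char buf[SKETCH_BUF_BYTES];
    size_t i;

    for (i = 0; i < sizeof buf; i += 4096) {
        buf[i] = 1;
    }
    buf[sizeof buf - 1] = 1;
    *(int *)arg = buf[0] + buf[sizeof buf - 1];
    return NULL;
}

/* Runs the worker on a thread with a stack of stack_bytes (0 = platform
   default). Returns 0 on success, otherwise a pthreads error number. */
int sketch_run_with_stack(size_t stack_bytes, int *result)
{
    pthread_attr_t attr;
    pthread_t tid;
    int status;

    assert(result != NULL);
    status = pthread_attr_init(&attr);
    if (status != 0) {
        return status;
    }
    if (stack_bytes != 0) {
        status = pthread_attr_setstacksize(&attr, stack_bytes);
    }
    if (status == 0) {
        status = pthread_create(&tid, &attr, sketch_worker, result);
    }
    pthread_attr_destroy(&attr);
    if (status != 0) {
        return status;
    }
    return pthread_join(tid, NULL);
}
```

Calling sketch_run_with_stack(512 * 1024, &r) or sketch_run_with_stack(0, &r) leaves room for the buffer. Requesting only 128 * 1024 bytes would not, because the buffer alone is as large as the whole stack, so that case should be expected to crash in the way the backtrace in this thread shows.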
> >>>>
> >>>>      I don't know all the details, but 128K seems very tiny compared
> >>>>      to current memory sizes. If I'm reading that page correctly,
> >>>>      all the local variables for the thread need to fit on the stack.
> >>>>
> >>>>      Another solution might be to simply remove ldap from the
> >>>>      nsswitch file for hosts.
> >>>>
> >>>>      - Booker C. Bense
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Core was generated by `caget UND:R02:IOC:10:BAT:Fiducial'.
> >>>> Program terminated with signal 11, Segmentation fault.
> >>>> #0 0x00002aaaab2b7812 in _nss_ldap_readconfig () from /lib64/libnss_ldap.so.2
> >>>> (gdb) bt
> >>>> #0 0x00002aaaab2b7812 in _nss_ldap_readconfig () from /lib64/libnss_ldap.so.2
> >>>> #1 0x00002aaaab2ad298 in ?? () from /lib64/libnss_ldap.so.2
> >>>> #2 0x00002aaaab2af530 in _nss_ldap_search_s () from /lib64/libnss_ldap.so.2
> >>>> #3 0x00002aaaab2b02f8 in _nss_ldap_getbyname () from /lib64/libnss_ldap.so.2
> >>>> #4 0x00002aaaab2b30d9 in _nss_ldap_gethostbyaddr_r () from /lib64/libnss_ldap.so.2
> >>>> #5 0x00002b4c98528055 in gethostbyaddr_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
> >>>> #6 0x00002b4c98527e41 in gethostbyaddr () from /lib64/libc.so.6
> >>>> #7 0x00002b4c9719d348 in ipAddrToHostName (pAddr=0x419f5f34, pBuf=0x653e600 "", bufSize=1024) at ../../../src/libCom/osi/os/posix/osdSock.c:148
> >>>> #8 0x00002b4c9719d6d9 in ipAddrToA (paddr=0x419f5f30, pBuf=0x419f43f0 "X¿v«ª*", bufSize=0) at ../../../src/libCom/osi/osiSock.c:99
> >>>> #9 0x00002b4c971981d2 in ipAddrToAsciiEnginePrivate::run (this=0x653e5f0) at ../../../src/libCom/misc/ipAddrToAsciiAsynchronous.cpp:289
> >>>> #10 0x00002b4c97199a2d in epicsThreadCallEntryPoint (pPvt=<value optimized out>) at ../../../src/libCom/osi/epicsThread.cpp:59
> >>>> #11 0x00002b4c9719f731 in start_routine (arg=<value optimized out>) at ../../../src/libCom/osi/os/posix/osdThread.c:322
> >>>> #12 0x00002b4c973f373d in start_thread () from /lib64/libpthread.so.0
> >>>> #13 0x00002b4c985124bd in clone () from /lib64/libc.so.6
> >>>> (gdb) quit
> >>>>
> >>>> It's intermittent, and sometimes crashes before printing the results and
> >>>> sometimes after. I ran caget 10 times, and got 4 core dumps, and 7
> >>>> successful printouts of the value. I've done the stack trace many times
> >>>> and each time it's in the same nss_ldap_readconfig() call.
> >>>>
> >>>> Does anyone have any idea why nss ldap may have changed on the psusr*
> >>>> machines in the last few weeks? Is anyone else seeing similar crashes?
> >>>>
> >>>> Thanks,
> >>>> - Bruce
> >>>
> >
> >
> >



Replies:
Re: [SLAC #351542] caget crashing on psusr* Bruce Hill

ANJ, 02 Feb 2012