Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: [SLAC #351542] caget crashing on psusr*
From: Bruce Hill <bhill@slac.stanford.edu>
To: Jeff Hill <johill@lanl.gov>
Cc: "'EPICS core-talk'" <core-talk@aps.anl.gov>, "Browne, Michael C." <mcbrowne@slac.stanford.edu>
Date: Mon, 12 Dec 2011 16:26:31 -0800
Jeff,
Thanks for moving this to core-talk.   I've removed our
SLAC help email from the cc list, as it's no longer an issue
for our IT group.

I'm also satisfied for now with our fix, which is to use
the default stack size for all our linux architecture targets
by defining OSITHREAD_USE_DEFAULT_STACK in CONFIG_SITE.

That has fixed the stack overflow in the nss_ldap lib for our
CA tools that run on linux, and our only embedded target is
RTEMS, which doesn't use OSITHREAD_USE_DEFAULT_STACK.

I don't think this would be the right fix for sites with
embedded posix targets, whether linux or others.
Does this point to a need for embedded versions of the
configure/os/CONFIG.linux* files?

This issue should also go to the full tech-talk list soon, as
there will likely be other RHEL5 users that will be getting
these crashes as they update their nss_ldap libs.

Regards,
- Bruce


On 12/12/2011 03:31 PM, Jeff Hill wrote:
My question is this (not having written the EPICS posix interface layer and
not claiming to understand all of the issues involved): should the build
system specify different defaults for OSITHREAD_USE_DEFAULT_STACK
depending on whether the target is an embedded linux arch or not?

I will keep https://bugs.launchpad.net/epics-base/+bug/903448 open a bit
longer.

Jeff
______________________________________________________
Jeffrey O. Hill           Email        johill@lanl.gov
LANL MS H820              Voice        505 665 1831
Los Alamos NM 87545 USA   FAX          505 665 5107

Message content: TSPA

With sufficient thrust, pigs fly just fine. However, this is
not necessarily a good idea. It is hard to be sure where they
are going to land, and it could be dangerous sitting under them
as they fly overhead. -- RFC 1925


-----Original Message-----
From: Bruce Hill [mailto:bhill@slac.stanford.edu]
Sent: Monday, December 12, 2011 3:58 PM
To: pcds-help
Cc: Jeff Hill
Subject: Re: [SLAC #351542] caget crashing on psusr*

It seems to me that there's no good reason for us to use the
stack size feature in the CA lib for our linux based apps and tools,
so I defined OSITHREAD_USE_DEFAULT_STACK to YES
in the EPICS CONFIG_SITE file and rebuilt.

I did a couple of loops on psusr121 using the new caget and
nss_ldap version 42.el5_7.4, with over 1100 caget runs and no
crashes.

EPICS 3.14.9-0.3.0, the one used by our current caget path,
is now rebuilt using default stack sizes.

I think we can close this now.

Thanks all for your help!

Regards,
- Bruce


On 12/12/2011 2:32 PM, Jeff Hill via RT wrote:
Queue/Owner: PCDS-Help [open] Nobody
 Requestors: Hill, Bruce<bhill@slac.stanford.edu>   x4752 901/131B [PPA Eng EE]
     Ticket: https://www-rt.slac.stanford.edu/rt3/Ticket/Display.html?id=351542
Transaction: Correspondence added by johill@lanl.gov

Hi Bruce,

We've been having a problem lately with caget and other CA clients
crashing due to stack overflows in the nss_ldap library.

The synchronous DNS name lookup is only used for CA diagnostic messages.
It's handled using an asynchronous callback from a single auxiliary thread
so that the CA client library never blocks.

there's a change in the latest nss_ldap library that puts
a 128K buffer on the stack.

That’s a pretty large buffer to be instantiating as a C automatic variable
on the stack. As for the advantages and disadvantages of specifying a posix
pthreads stack size on Linux and/or on embedded Linux, I don’t claim to
understand all of the issues involved at this time. Certainly it seems that
on virtual memory Linux it might be best to let the virtual paging take
care of stack expansion.
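
As a rough illustration of the default-versus-explicit choice discussed above, here is a minimal sketch, assuming a POSIX threads platform. The helper names default_thread_stack_size and explicit_thread_stack_size are hypothetical and not part of EPICS. The first reads the stack size a freshly initialised attribute object reports, and the second reads back an explicitly requested size.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical helper: stack size reported by a freshly initialised
 * attribute object, i.e. what a thread gets when nothing is requested. */
size_t default_thread_stack_size(void)
{
    pthread_attr_t attr;
    size_t size = 0;
    int status = pthread_attr_init(&attr);
    assert(status == 0);
    status = pthread_attr_getstacksize(&attr, &size);
    assert(status == 0);
    pthread_attr_destroy(&attr);
    return size;
}

/* Hypothetical helper: request an explicit stack size and read back what
 * the attribute object stored. Returns 0 if the request is rejected. */
size_t explicit_thread_stack_size(size_t requested)
{
    pthread_attr_t attr;
    size_t size = 0;
    int status = pthread_attr_init(&attr);
    assert(status == 0);
    if (pthread_attr_setstacksize(&attr, requested) == 0) {
        status = pthread_attr_getstacksize(&attr, &size);
        assert(status == 0);
    }
    pthread_attr_destroy(&attr);
    return size;
}
```

Calling explicit_thread_stack_size(128 * 1024) should report the requested 128 KiB, while default_thread_stack_size() on a desktop Linux with glibc usually reports a much larger value derived from the process stack limit.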

I created a bug entry. You can find it at this URL.

https://bugs.launchpad.net/epics-base/+bug/903448

Jeff
______________________________________________________
Jeffrey O. Hill           Email        johill@lanl.gov
LANL MS H820              Voice        505 665 1831
Los Alamos NM 87545 USA   FAX          505 665 5107

Message content: TSPA


-----Original Message-----
From: Bruce Hill [mailto:bhill@slac.stanford.edu]
Sent: Monday, December 12, 2011 1:59 PM
To: Jeff Hill
Cc: pcds-help
Subject: Re: [SLAC #351542] caget crashing on psusr*

Hi Jeff,
We've been having a problem lately with caget and other CA clients crashing
due to stack overflows in the nss_ldap library.  We're running RHEL5, and
there's a change in the latest nss_ldap library that puts a 128K buffer on
the stack.

The change happened between nss_ldap version 42.el5 and the newer
42.el5_7.4.

We're mostly running EPICS 3.14.9, which by default for linux is allocating
a small stack for this in src/libCom/osi/os/posix/osdThread.c.  Thus, it
appears that the library is overwriting the stack leading to random crashes.
I've checked 3.14.12, and it appears this is still the default setting for
linux.

Have you had any other reports of this crash?

Any reason why we shouldn't just use the default stack size?

Are there any plans to change this in upcoming EPICS releases?

Thanks,
- Bruce

On 12/12/2011 12:17 PM, Amedeo Perazzo via RT wrote:
Queue/Owner: PCDS-Help [open] Nobody
 Requestors: Hill, Bruce<bhill@slac.stanford.edu>    x4752 901/131B [PPA Eng EE]
     Ticket: https://www-rt.slac.stanford.edu/rt3/Ticket/Display.html?id=351542
Transaction: Correspondence added by perazzo

I agree with Michael that having 128KB on the stack is _not_ a good idea,
and I agree with Booker that a 128KB stack size on a modern Linux system is
probably too small.

My guess is that EPICS is trying to reduce the footprint as much as
possible given that it must run on embedded systems which can have very
limited resources.

Bruce, should we ask the EPICS community how they plan to handle this?
If RHEL6 has the same nss_ldap code as the one that broke EPICS, the
community will be forced to handle this problem eventually.


On 12/12/11 11:55, mcbrowne@slac.stanford.edu via RT wrote:
Queue/Owner: PCDS-Help [open] Nobody
 Requestors: Hill, Bruce<bhill@slac.stanford.edu>     x4752 901/131B [PPA Eng EE]
     Ticket: https://www-rt.slac.stanford.edu/rt3/Ticket/Display.html?id=351542
Transaction: Correspondence added by mcbrowne

Well, it's the code that we're running... I'm not willing to say it's correct
though! You're absolutely right... these seem like very small stack sizes.
Proof that this is what is running: the full routine without ellipses is:
      unsigned int epicsThreadGetStackSize (epicsThreadStackSizeClass stackSizeClass)
      {
      #if ! defined (_POSIX_THREAD_ATTR_STACKSIZE)
          return 0;
      #elif defined (OSITHREAD_USE_DEFAULT_STACK)
          return 0;
      #else
          static const unsigned stackSizeTable[epicsThreadStackBig+1] =
              {128*ARCH_STACK_FACTOR, 256*ARCH_STACK_FACTOR, 512*ARCH_STACK_FACTOR};
          if (stackSizeClass<epicsThreadStackSmall) {
              errlogPrintf("epicsThreadGetStackSize illegal argument (too small)");
              return stackSizeTable[epicsThreadStackBig];
          }

          if (stackSizeClass>epicsThreadStackBig) {
              errlogPrintf("epicsThreadGetStackSize illegal argument (too large)");
              return stackSizeTable[epicsThreadStackBig];
          }

          return stackSizeTable[stackSizeClass];
      #endif /*_POSIX_THREAD_ATTR_STACKSIZE*/
      }

Running gdb on psusr117:

      psusr117% gdb caget
      GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-37.el5_7.1)
      Copyright (C) 2009 Free Software Foundation, Inc.
      License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law. Type "show copying"
      and "show warranty" for details.
      This GDB was configured as "x86_64-redhat-linux-gnu".
      For bug reporting instructions, please see:
      <http://www.gnu.org/software/gdb/bugs/>...
      Reading symbols from /reg/g/pcds/package/epics/3.14/base/R3.14.9-0.3.0/bin/linux-x86_64/caget...done.
      (gdb) break main
      Breakpoint 1 at 0x401d00: file ../caget.c, line 329.
      (gdb) run
      Starting program: /reg/g/pcds/package/epics/3.14/base/R3.14.9-0.3.0/bin/linux-x86_64/caget
      warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaac7000
      [Thread debugging using libthread_db enabled]

      Breakpoint 1, main (argc=1, argv=0x7fffffffdf68) at ../caget.c:329
      329 {
      (gdb) x/20i epicsThreadGetStackSize
      0x2aaaaaf5e670<epicsThreadGetStackSize>: sub $0x8,%rsp
      0x2aaaaaf5e674<epicsThreadGetStackSize+4>: cmp $0x2,%edi
      0x2aaaaaf5e677<epicsThreadGetStackSize+7>: ja 0x2aaaaaf5e690 <epicsThreadGetStackSize+32>
      0x2aaaaaf5e679<epicsThreadGetStackSize+9>: lea 0xebfc(%rip),%rax # 0x2aaaaaf6d27c<stackSizeTable.4846>
      0x2aaaaaf5e680<epicsThreadGetStackSize+16>: mov %edi,%edx
      0x2aaaaaf5e682<epicsThreadGetStackSize+18>: mov (%rax,%rdx,4),%eax
      0x2aaaaaf5e685<epicsThreadGetStackSize+21>: add $0x8,%rsp
      0x2aaaaaf5e689<epicsThreadGetStackSize+25>: retq
      0x2aaaaaf5e68a<epicsThreadGetStackSize+26>: nopw 0x0(%rax,%rax,1)
      0x2aaaaaf5e690<epicsThreadGetStackSize+32>: lea 0xe969(%rip),%rdi # 0x2aaaaaf6d000
      0x2aaaaaf5e697<epicsThreadGetStackSize+39>: xor %eax,%eax
      0x2aaaaaf5e699<epicsThreadGetStackSize+41>: callq 0x2aaaaaf47940 <errlogPrintf@plt>
      0x2aaaaaf5e69e<epicsThreadGetStackSize+46>: mov $0x80000,%eax
      0x2aaaaaf5e6a3<epicsThreadGetStackSize+51>: add $0x8,%rsp
      0x2aaaaaf5e6a7<epicsThreadGetStackSize+55>: retq
      0x2aaaaaf5e6a8: nopl 0x0(%rax,%rax,1)
      0x2aaaaaf5e6b0<epicsThreadPrivateSet>: push %rbp
      0x2aaaaaf5e6b1<epicsThreadPrivateSet+1>: mov %rdi,%rbp
      0x2aaaaaf5e6b4<epicsThreadPrivateSet+4>: push %rbx
      0x2aaaaaf5e6b5<epicsThreadPrivateSet+5>: mov %rsi,%rbx
      (gdb) x/3d 0x2aaaaaf6d27c
      0x2aaaaaf6d27c<stackSizeTable.4846>: 131072 262144 524288
      (gdb)

In any event, it isn't just returning 0, which would be the case if we were
using OSITHREAD_USE_DEFAULT_STACK.
--Mike
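
The table values in the gdb dump can be checked against the 128K buffer with a little arithmetic. The sketch below assumes ARCH_STACK_FACTOR is 1024 on this build (inferred from 131072 / 128 in the dump) and that the nss_ldap buffer is exactly 128 * 1024 bytes. The enum and function names are hypothetical stand-ins, not the EPICS identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: inferred from the dumped table, 131072 / 128 = 1024. */
#define ARCH_STACK_FACTOR 1024

/* Hypothetical stand-ins for the three EPICS stack size classes. */
enum stack_class { STACK_SMALL, STACK_MEDIUM, STACK_BIG };

/* Same arithmetic as the quoted stackSizeTable initialiser. */
static const unsigned stack_size_table[STACK_BIG + 1] = {
    128 * ARCH_STACK_FACTOR, 256 * ARCH_STACK_FACTOR, 512 * ARCH_STACK_FACTOR
};

/* Bytes left in a stack of the given class after a single automatic
 * buffer of buffer_bytes. Zero or negative means the buffer alone
 * already consumes the whole stack, before any call frames. */
long stack_headroom(enum stack_class cls, unsigned buffer_bytes)
{
    return (long)stack_size_table[cls] - (long)buffer_bytes;
}
```

Under these assumptions a 128 KiB buffer leaves no headroom at all in the smallest class, 131072 bytes in the middle class and 393216 bytes in the largest, and that is before counting the frames of the calling code.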



Booker Bense via RT wrote:

      On Mon, 12 Dec 2011, mcbrowne@slac.stanford.edu via RT wrote:

      /reg/g/pcds/package/epics/3.14/base/current/src/libCom/osi/os/posix/osdThread.c,
        you will see that:



      Is this the correct code? Does anyone know why you are setting
      the stacksize? It's generally not recommended.
      http://www.cognitus.net/html/howto/pthreadSemiFAQ_5.html
      Can you just recompile with OSITHREAD_USE_DEFAULT_STACK?


      #if defined (_POSIX_THREAD_ATTR_STACKSIZE)
      #if ! defined (OSITHREAD_USE_DEFAULT_STACK)
           status = pthread_attr_setstacksize(&pthreadInfo->attr, (size_t)stackSize);
           checkStatusOnce(status,"pthread_attr_setstacksize");
      #endif /*OSITHREAD_USE_DEFAULT_STACK*/
      #endif /*_POSIX_THREAD_ATTR_STACKSIZE*/

      I don't know all the details, but 128K seems very tiny compared
      to current memory sizes. If I'm reading that page correctly,
      all the local variables for the thread need to fit on the stack.

      Another solution might be to simply remove ldap from the
      nsswitch file for hosts.

      - Booker C. Bense
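
Since every automatic variable of a thread has to fit on its stack, one common way to stay within a small explicit stack is to take large scratch buffers from the heap. The following is a minimal sketch, assuming a POSIX threads platform; run_worker, worker, SCRATCH_BYTES and SMALL_STACK are hypothetical names, and treating a stack size of 0 as "use the platform default" is a convention of this sketch only, not a statement about EPICS or nss_ldap.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define SCRATCH_BYTES (128 * 1024) /* assumed size of the buffer discussed here */
#define SMALL_STACK   (128 * 1024) /* smallest table entry quoted in this thread */

/* Worker needing a large scratch buffer: it is taken from the heap, so it
 * does not count against the thread's stack. */
static void *worker(void *arg)
{
    int *ok = arg;
    char *scratch = malloc(SCRATCH_BYTES);
    if (scratch != NULL) {
        memset(scratch, 0xA5, SCRATCH_BYTES);
        *ok = ((unsigned char)scratch[SCRATCH_BYTES - 1] == 0xA5);
        free(scratch);
    }
    return NULL;
}

/* Run the worker on its own thread and return 1 if it completed its work.
 * stack_bytes == 0 leaves the platform default stack size in place. */
int run_worker(size_t stack_bytes)
{
    pthread_attr_t attr;
    pthread_t tid;
    int ok = 0;
    int status = pthread_attr_init(&attr);
    assert(status == 0);
    if (stack_bytes != 0) {
        status = pthread_attr_setstacksize(&attr, stack_bytes);
        assert(status == 0);
    }
    status = pthread_create(&tid, &attr, worker, &ok);
    assert(status == 0);
    status = pthread_join(tid, NULL);
    assert(status == 0);
    pthread_attr_destroy(&attr);
    return ok;
}
```

Both run_worker(SMALL_STACK) and run_worker(0) are expected to succeed, because the worker's stack use stays small whichever stack size is chosen.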






Core was generated by `caget UND:R02:IOC:10:BAT:Fiducial'.
Program terminated with signal 11, Segmentation fault.
#0 0x00002aaaab2b7812 in _nss_ldap_readconfig () from /lib64/libnss_ldap.so.2
(gdb) bt
#0 0x00002aaaab2b7812 in _nss_ldap_readconfig () from /lib64/libnss_ldap.so.2
#1 0x00002aaaab2ad298 in ?? () from /lib64/libnss_ldap.so.2
#2 0x00002aaaab2af530 in _nss_ldap_search_s () from /lib64/libnss_ldap.so.2
#3 0x00002aaaab2b02f8 in _nss_ldap_getbyname () from /lib64/libnss_ldap.so.2
#4 0x00002aaaab2b30d9 in _nss_ldap_gethostbyaddr_r () from /lib64/libnss_ldap.so.2
#5 0x00002b4c98528055 in gethostbyaddr_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
#6 0x00002b4c98527e41 in gethostbyaddr () from /lib64/libc.so.6
#7 0x00002b4c9719d348 in ipAddrToHostName (pAddr=0x419f5f34, pBuf=0x653e600 "", bufSize=1024) at ../../../src/libCom/osi/os/posix/osdSock.c:148
#8 0x00002b4c9719d6d9 in ipAddrToA (paddr=0x419f5f30, pBuf=0x419f43f0 "X¿v«ª*", bufSize=0) at ../../../src/libCom/osi/osiSock.c:99
#9 0x00002b4c971981d2 in ipAddrToAsciiEnginePrivate::run (this=0x653e5f0) at ../../../src/libCom/misc/ipAddrToAsciiAsynchronous.cpp:289
#10 0x00002b4c97199a2d in epicsThreadCallEntryPoint (pPvt=<value optimized out>) at ../../../src/libCom/osi/epicsThread.cpp:59
#11 0x00002b4c9719f731 in start_routine (arg=<value optimized out>) at ../../../src/libCom/osi/os/posix/osdThread.c:322
#12 0x00002b4c973f373d in start_thread () from /lib64/libpthread.so.0
#13 0x00002b4c985124bd in clone () from /lib64/libc.so.6
(gdb) quit

It's intermittent, and sometimes crashes before printing the results and
sometimes after. I ran caget 10 times, and got 4 core dumps, and 7
successful printouts of the value. I've done the stack trace many times and
each time it's in the same _nss_ldap_readconfig() call.

Does anyone have any idea why nss ldap may have changed on the psusr*
machines in the last few weeks? Is anyone else seeing similar crashes?

Thanks,
- Bruce




Replies:
Re: [SLAC #351542] caget crashing on psusr* Andrew Johnson
References:
RE: [SLAC #351542] caget crashing on psusr* Jeff Hill
