Experimental Physics and Industrial Control System
My question is this (not having written the EPICS posix interface layer and
not claiming to understand all of the issues involved): should the system
have different defaults for OSITHREAD_USE_DEFAULT_STACK specified in the
build system depending on whether it's an embedded Linux arch or not?
I will keep https://bugs.launchpad.net/epics-base/+bug/903448 open a bit
longer.
Jeff
______________________________________________________
Jeffrey O. Hill Email [email protected]
LANL MS H820 Voice 505 665 1831
Los Alamos NM 87545 USA FAX 505 665 5107
Message content: TSPA
With sufficient thrust, pigs fly just fine. However, this is
not necessarily a good idea. It is hard to be sure where they
are going to land, and it could be dangerous sitting under them
as they fly overhead. -- RFC 1925
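
To make the question above concrete, here is a minimal sketch of how a build-time switch such as OSITHREAD_USE_DEFAULT_STACK can decide whether an explicit pthread stack size is applied. The helper names sketch_attr_init and sketch_effective_stack_size are hypothetical and are not part of EPICS base; only the two macros and the pthread calls come from the thread or from POSIX. It assumes a platform that defines _POSIX_THREAD_ATTR_STACKSIZE.

```c
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>
#include <assert.h>

/* Hypothetical helper (not EPICS code): initialise thread attributes and
 * apply an explicit stack size only when one is requested (non-zero) and
 * the build has not opted into the platform default stack. */
int sketch_attr_init(pthread_attr_t *attr, size_t requestedStackSize)
{
    int status = pthread_attr_init(attr);
    if (status != 0) {
        return status;
    }
#if defined(_POSIX_THREAD_ATTR_STACKSIZE) && !defined(OSITHREAD_USE_DEFAULT_STACK)
    if (requestedStackSize != 0) {
        status = pthread_attr_setstacksize(attr, requestedStackSize);
        if (status != 0) {
            pthread_attr_destroy(attr);
        }
    }
#else
    (void)requestedStackSize; /* the platform default is left untouched */
#endif
    return status;
}

/* Hypothetical helper: report the stack size that a thread created with
 * these attributes would receive; returns 0 if the attributes cannot be
 * initialised or queried. */
size_t sketch_effective_stack_size(size_t requestedStackSize)
{
    pthread_attr_t attr;
    size_t size = 0;

    if (sketch_attr_init(&attr, requestedStackSize) != 0) {
        return 0;
    }
    if (pthread_attr_getstacksize(&attr, &size) != 0) {
        size = 0;
    }
    pthread_attr_destroy(&attr);
    return size;
}
```

Building the same file with -DOSITHREAD_USE_DEFAULT_STACK makes the requested size a no-op, which is the behaviour a per-arch default in the build system would select.
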
> -----Original Message-----
> From: Bruce Hill [mailto:[email protected]]
> Sent: Monday, December 12, 2011 3:58 PM
> To: pcds-help
> Cc: Jeff Hill
> Subject: Re: [SLAC #351542] caget crashing on psusr*
>
> It seems to me that there's no good reason for us to use the
> stack size feature in the CA lib for our linux based apps and tools,
> so I defined OSITHREAD_USE_DEFAULT_STACK to YES
> in the EPICS CONFIG_SITE file and rebuilt.
>
> I did a couple of loops on psusr121 using the new caget and
> nss_ldap version 42.el5_7.4 with over 1100 caget's and no
> crashes.
>
> EPICS 3.14.9-0.3.0, the one used by our current caget path,
> is now rebuilt using default stack sizes.
>
> I think we can close this now.
>
> Thanks all for your help!
>
> Regards,
> - Bruce
>
>
> On 12/12/2011 2:32 PM, Jeff Hill via RT wrote:
> > Queue/Owner: PCDS-Help [open] Nobody
> > Requestors: Hill, Bruce<[email protected]> x4752 901/131B [PPA Eng EE]
> > Ticket: https://www-rt.slac.stanford.edu/rt3/Ticket/Display.html?id=351542
> >
> > Transaction: Correspondence added by [email protected]
> >
> > Hi Bruce,
> >
> >> We've been having a problem lately with caget and other CA clients
> >> crashing due to stack overflows in the nss_ldap library.
> > The synchronous DNS name lookup is only used for CA diagnostic messages.
> > It's handled using an asynchronous callback from a single auxiliary thread
> > so that the CA client library never blocks.
> >
> >> there's a change in the latest nss_ldap library that puts
> >> a 128K buffer on the stack.
> > That's a pretty large buffer to be instantiating as a C automatic variable
> > on the stack. As for the advantages and disadvantages of specifying a posix
> > pthreads stack size on Linux and/or on embedded Linux, I don't claim to
> > understand all of the issues involved at this time. Certainly it seems that
> > on virtual memory Linux it might be best to let the virtual paging take
> > care of stack expansion.
> >
> > I created a bug entry. You can find it at this URL.
> >
> > https://bugs.launchpad.net/epics-base/+bug/903448
> >
> > Jeff
> > ______________________________________________________
> > Jeffrey O. Hill Email [email protected]
> > LANL MS H820 Voice 505 665 1831
> > Los Alamos NM 87545 USA FAX 505 665 5107
> >
> > Message content: TSPA
> >
> >
> >> -----Original Message-----
> >> From: Bruce Hill [mailto:[email protected]]
> >> Sent: Monday, December 12, 2011 1:59 PM
> >> To: Jeff Hill
> >> Cc: pcds-help
> >> Subject: Re: [SLAC #351542] caget crashing on psusr*
> >>
> >> Hi Jeff,
> >> We've been having a problem lately with caget and other CA clients
> >> crashing due to stack overflows in the nss_ldap library. We're running
> >> RHEL5, and there's a change in the latest nss_ldap library that puts a
> >> 128K buffer on the stack.
> >>
> >> The change happened between nss_ldap version 42.el5 and the newer
> >> 42.el5_7.4.
> >>
> >> We're mostly running EPICS 3.14.9, which by default for linux is
> >> allocating a small stack for this in
> >> src/libCom/osi/os/posix/osdThread.c. Thus, it appears that the library
> >> is overwriting the stack leading to random crashes. I've checked
> >> 3.14.12, and it appears this is still the default setting for linux.
> >>
> >> Have you had any other reports of this crash?
> >>
> >> Any reason why we shouldn't just use the default stack size?
> >>
> >> Are there any plans to change this in upcoming EPICS releases?
> >>
> >> Thanks,
> >> - Bruce
> >>
> >> On 12/12/2011 12:17 PM, Amedeo Perazzo via RT wrote:
> >>> Queue/Owner: PCDS-Help [open] Nobody
> >>> Requestors: Hill, Bruce<[email protected]> x4752 901/131B [PPA Eng EE]
> >>> Ticket: https://www-rt.slac.stanford.edu/rt3/Ticket/Display.html?id=351542
> >>> Transaction: Correspondence added by perazzo
> >>>
> >>> I agree with Michael that having 128KB on the stack is _not_ a good
> >>> idea, and I agree with Booker that a 128KB stack size on a modern
> >>> Linux system is probably too small.
> >>>
> >>> My guess is that EPICS is trying to reduce the footprint as much as
> >>> possible given that it must run on embedded systems which can have
> >>> very limited resources.
> >>>
> >>> Bruce, should we ask the EPICS community how they plan to handle this?
> >>> If RHEL6 has the same nss_ldap code as the one that broke EPICS, the
> >>> community will be forced to handle this problem eventually.
> >>>
> >>>
> >>> On 12/12/11 11:55, [email protected] via RT wrote:
> >>>> Queue/Owner: PCDS-Help [open] Nobody
> >>>> Requestors: Hill, Bruce<[email protected]> x4752 901/131B [PPA Eng EE]
> >>>> Ticket: https://www-rt.slac.stanford.edu/rt3/Ticket/Display.html?id=351542
> >>>> Transaction: Correspondence added by mcbrowne
> >>>>
> >>>> Well, it's the code that we're running... I'm not willing to say it's
> >>>> correct though! You're absolutely right... these seem like very small
> >>>> stack sizes.
> >>>> Proof that this is what is running: the full routine without ellipses is:
> >>>> unsigned int epicsThreadGetStackSize (epicsThreadStackSizeClass stackSizeClass)
> >>>> {
> >>>> #if ! defined (_POSIX_THREAD_ATTR_STACKSIZE)
> >>>>     return 0;
> >>>> #elif defined (OSITHREAD_USE_DEFAULT_STACK)
> >>>>     return 0;
> >>>> #else
> >>>>     static const unsigned stackSizeTable[epicsThreadStackBig+1] =
> >>>>         {128*ARCH_STACK_FACTOR, 256*ARCH_STACK_FACTOR, 512*ARCH_STACK_FACTOR};
> >>>>     if (stackSizeClass<epicsThreadStackSmall) {
> >>>>         errlogPrintf("epicsThreadGetStackSize illegal argument (too small)");
> >>>>         return stackSizeTable[epicsThreadStackBig];
> >>>>     }
> >>>>
> >>>>     if (stackSizeClass>epicsThreadStackBig) {
> >>>>         errlogPrintf("epicsThreadGetStackSize illegal argument (too large)");
> >>>>         return stackSizeTable[epicsThreadStackBig];
> >>>>     }
> >>>>
> >>>>     return stackSizeTable[stackSizeClass];
> >>>> #endif /*_POSIX_THREAD_ATTR_STACKSIZE*/
> >>>> }
> >>>>
> >>>> Running gdb on psusr117:
> >>>>
> >>>> psusr117% gdb caget
> >>>> GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-37.el5_7.1)
> >>>> Copyright (C) 2009 Free Software Foundation, Inc.
> >>>> License GPLv3+: GNU GPL version 3 or later
> >>>> <http://gnu.org/licenses/gpl.html>
> >>>> This is free software: you are free to change and redistribute
> it.
> >>>> There is NO WARRANTY, to the extent permitted by law. Type "show
> >> copying"
> >>>> and "show warranty" for details.
> >>>> This GDB was configured as "x86_64-redhat-linux-gnu".
> >>>> For bug reporting instructions, please see:
> >>>> <http://www.gnu.org/software/gdb/bugs/>...
> >>>> Reading symbols from
> >>>> /reg/g/pcds/package/epics/3.14/base/R3.14.9-0.3.0/bin/linux-
> >> x86_64/caget...done.
> >>>> (gdb) break main
> >>>> Breakpoint 1 at 0x401d00: file ../caget.c, line 329.
> >>>> (gdb) run
> >>>> Starting program:
> >>>> /reg/g/pcds/package/epics/3.14/base/R3.14.9-0.3.0/bin/linux-
> >> x86_64/caget
> >>>> warning: no loadable sections found in added symbol-file system-
> >> supplied
> >>>> DSO at 0x2aaaaaac7000
> >>>> [Thread debugging using libthread_db enabled]
> >>>>
> >>>> Breakpoint 1, main (argc=1, argv=0x7fffffffdf68) at
> ../caget.c:329
> >>>> 329 {
> >>>> (gdb) x/20i epicsThreadGetStackSize
> >>>> 0x2aaaaaf5e670<epicsThreadGetStackSize>: sub $0x8,%rsp
> >>>> 0x2aaaaaf5e674<epicsThreadGetStackSize+4>: cmp $0x2,%edi
> >>>> 0x2aaaaaf5e677<epicsThreadGetStackSize+7>: ja 0x2aaaaaf5e690
> >>>> <epicsThreadGetStackSize+32>
> >>>> 0x2aaaaaf5e679<epicsThreadGetStackSize+9>:
> >>>> lea 0xebfc(%rip),%rax # 0x2aaaaaf6d27c<stackSizeTable.4846>
> >>>> 0x2aaaaaf5e680<epicsThreadGetStackSize+16>: mov %edi,%edx
> >>>> 0x2aaaaaf5e682<epicsThreadGetStackSize+18>: mov
> (%rax,%rdx,4),%eax
> >>>> 0x2aaaaaf5e685<epicsThreadGetStackSize+21>: add $0x8,%rsp
> >>>> 0x2aaaaaf5e689<epicsThreadGetStackSize+25>: retq
> >>>> 0x2aaaaaf5e68a<epicsThreadGetStackSize+26>: nopw
> 0x0(%rax,%rax,1)
> >>>> 0x2aaaaaf5e690<epicsThreadGetStackSize+32>: lea
> 0xe969(%rip),%rdi #
> >>>> 0x2aaaaaf6d000
> >>>> 0x2aaaaaf5e697<epicsThreadGetStackSize+39>: xor %eax,%eax
> >>>> 0x2aaaaaf5e699<epicsThreadGetStackSize+41>: callq 0x2aaaaaf47940
> >>>> <errlogPrintf@plt>
> >>>> 0x2aaaaaf5e69e<epicsThreadGetStackSize+46>: mov $0x80000,%eax
> >>>> 0x2aaaaaf5e6a3<epicsThreadGetStackSize+51>: add $0x8,%rsp
> >>>> 0x2aaaaaf5e6a7<epicsThreadGetStackSize+55>: retq
> >>>> 0x2aaaaaf5e6a8: nopl 0x0(%rax,%rax,1)
> >>>> 0x2aaaaaf5e6b0<epicsThreadPrivateSet>: push %rbp
> >>>> 0x2aaaaaf5e6b1<epicsThreadPrivateSet+1>: mov %rdi,%rbp
> >>>> 0x2aaaaaf5e6b4<epicsThreadPrivateSet+4>: push %rbx
> >>>> 0x2aaaaaf5e6b5<epicsThreadPrivateSet+5>: mov %rsi,%rbx
> >>>> (gdb) x/3d 0x2aaaaaf6d27c
> >>>> 0x2aaaaaf6d27c<stackSizeTable.4846>: 131072 262144 524288
> >>>> (gdb)
> >>>>
> >>>> In any event, it isn't just returning 0, which would be the case if
> we
> >> were
> >>>> using OSITHREAD_USE_DEFAULT_STACK.
> >>>> --Mike
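
The numbers in the gdb dump above can be cross-checked with a short calculation. The sketch below assumes ARCH_STACK_FACTOR is 1024, which is inferred from the dumped table (128 * 1024 = 131072) rather than read from the source, and it uses hypothetical names (sketchStackSizeTable, sketch_buffer_fits) that are not part of EPICS. The 4096-byte allowance for other stack frames is an arbitrary illustrative figure, and the thread does not say which stack class the name-lookup thread actually uses.

```c
#include <assert.h>

/* Assumption: inferred from the gdb output (131072 / 128), not from source. */
#define SKETCH_ARCH_STACK_FACTOR 1024u

/* Hypothetical mirror of the three stack size classes in the quoted routine. */
enum sketch_stack_class { sketchStackSmall, sketchStackMedium, sketchStackBig };

static const unsigned sketchStackSizeTable[sketchStackBig + 1] = {
    128 * SKETCH_ARCH_STACK_FACTOR,
    256 * SKETCH_ARCH_STACK_FACTOR,
    512 * SKETCH_ARCH_STACK_FACTOR
};

/* Returns 1 if a local buffer of bufferBytes still fits on a stack of
 * stackBytes once reservedBytes are set aside for every other frame,
 * and 0 otherwise. */
int sketch_buffer_fits(unsigned stackBytes, unsigned bufferBytes,
                       unsigned reservedBytes)
{
    if (reservedBytes > stackBytes) {
        return 0;
    }
    return bufferBytes <= stackBytes - reservedBytes;
}
```

Under these assumptions a 128K automatic buffer cannot fit on a thread created with the smallest class, since the stack is exactly 128K before any other frame is counted, while the medium and big classes leave room for it.
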
> >>>>
> >>>>
> >>>>
> >>>> Booker Bense via RT wrote:
> >>>>
> >>>> On Mon, 12 Dec 2011, [email protected] via RT
> wrote:
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> /reg/g/pcds/package/epics/3.14/base/current/src/libCom/osi/os/posix/osdThread.c,
> >>>> you will see that:
> >>>>
> >>>>
> >>>>
> >>>> Is this the correct code? Does anyone know why you are setting
> >>>> the stacksize? It's generally not recommended.
> >>>> http://www.cognitus.net/html/howto/pthreadSemiFAQ_5.html
> >>>> Can you just recompile with OSITHREAD_USE_DEFAULT_STACK?
> >>>>
> >>>>
> >>>> #if defined (_POSIX_THREAD_ATTR_STACKSIZE)
> >>>> #if ! defined (OSITHREAD_USE_DEFAULT_STACK)
> >>>> status = pthread_attr_setstacksize(
> >>>> &pthreadInfo->attr,(size_t)stackSize);
> >>>> checkStatusOnce(status,"pthread_attr_setstacksize");
> >>>> #endif /*OSITHREAD_USE_DEFAULT_STACK*/
> >>>> #endif /*_POSIX_THREAD_ATTR_STACKSIZE*/
> >>>>
> >>>> I don't know all the details, but 128K seems very tiny compared
> >>>> to current memory sizes. If I'm reading that page correctly,
> >>>> all the local variables for the thread need to fit on the stack.
> >>>>
> >>>> Another solution might be to simply remove ldap from the
> >>>> nsswitch file for hosts.
> >>>>
> >>>> - Booker C. Bense
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Core was generated by `caget UND:R02:IOC:10:BAT:Fiducial'.
> >>>> Program terminated with signal 11, Segmentation fault.
> >>>> #0 0x00002aaaab2b7812 in _nss_ldap_readconfig () from /lib64/libnss_ldap.so.2
> >>>> (gdb) bt
> >>>> #0 0x00002aaaab2b7812 in _nss_ldap_readconfig () from /lib64/libnss_ldap.so.2
> >>>> #1 0x00002aaaab2ad298 in ?? () from /lib64/libnss_ldap.so.2
> >>>> #2 0x00002aaaab2af530 in _nss_ldap_search_s () from /lib64/libnss_ldap.so.2
> >>>> #3 0x00002aaaab2b02f8 in _nss_ldap_getbyname () from /lib64/libnss_ldap.so.2
> >>>> #4 0x00002aaaab2b30d9 in _nss_ldap_gethostbyaddr_r () from /lib64/libnss_ldap.so.2
> >>>> #5 0x00002b4c98528055 in gethostbyaddr_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
> >>>> #6 0x00002b4c98527e41 in gethostbyaddr () from /lib64/libc.so.6
> >>>> #7 0x00002b4c9719d348 in ipAddrToHostName (pAddr=0x419f5f34, pBuf=0x653e600 "", bufSize=1024) at ../../../src/libCom/osi/os/posix/osdSock.c:148
> >>>> #8 0x00002b4c9719d6d9 in ipAddrToA (paddr=0x419f5f30, pBuf=0x419f43f0 "X¿v«ª*", bufSize=0) at ../../../src/libCom/osi/osiSock.c:99
> >>>> #9 0x00002b4c971981d2 in ipAddrToAsciiEnginePrivate::run (this=0x653e5f0) at ../../../src/libCom/misc/ipAddrToAsciiAsynchronous.cpp:289
> >>>> #10 0x00002b4c97199a2d in epicsThreadCallEntryPoint (pPvt=<value optimized out>) at ../../../src/libCom/osi/epicsThread.cpp:59
> >>>> #11 0x00002b4c9719f731 in start_routine (arg=<value optimized out>) at ../../../src/libCom/osi/os/posix/osdThread.c:322
> >>>> #12 0x00002b4c973f373d in start_thread () from /lib64/libpthread.so.0
> >>>> #13 0x00002b4c985124bd in clone () from /lib64/libc.so.6
> >>>> (gdb) quit
> >>>>
> >>>> It's intermittent, and sometimes crashes before printing the results
> >>>> and sometimes after. I ran caget 10 times, and got 4 core dumps, and
> >>>> 7 successful printouts of the value. I've done the stack trace many
> >>>> times and each time it's in the same nss_ldap_readconfig() call.
> >>>>
> >>>> Does anyone have any idea why nss ldap may have changed on the psusr*
> >>>> machines in the last few weeks? Is anyone else seeing similar crashes?
> >>>>
> >>>> Thanks,
> >>>> - Bruce
> >>>
> >
> >
> >