EPICS ca library fork() problem

Experimental Physics and Industrial Control System


Subject:	ca library fork() problem
From:	Gerry Swislow <[email protected]>
Date:	Mon, 11 Apr 94 22:55:25 -0400

Hi Folks,

This weekend I finally added some simple channel access
capability to "spec".  To my surprise, I found that when I
would quit spec and then restart it, I'd get an error
message saying a spec lock file was still locked.  I hunted
through the source code to the channel access library and
found the source of the problem, which is related to how
the library spawns a "repeater" process using fork().  I
can anticipate other problems with this use of fork(),
which I point out below.  Perhaps other beamline controls
people will sympathize and the epics authors will be able
to provide a fix in a future release. 


The code in question is responsible for spawning a
"repeater" task if the library doesn't find a repeater
process currently running on the host.  The repeater
process is involved in listening to a TCP port or some such
thing and making the messages available to all the other
processes on the host.  Only one "repeater" process runs
at a time.

The code below is excerpted from R3.11.3/share/src/ca/access.c:

LOCAL void spawn_repeater() {
#ifdef UNIX
        {
                int     status;

        /*
         * use of fork makes ca and repeater images larger than
         * required but a path is not required in the code.
         *
         * This method of spawn places a path name in the
         * code so it has been avoided for now:
         *      system("/path/repeater &");
         */
        status = fork();
...

The author seems to understand one of the problems with
the fork() approach: the fork() creates a second image of the
application process, doubling the consumption of most of the
system resources used by the first process. There are at least
two other major problems with this scheme, though, and a minor
third. They are:


 1) Open file descriptors are not closed after the fork(),
    with the consequences: 


   a) File locks (made with lockf()) on open file descriptors
      before the fork() will stay in place after the parent
      process exits and until the child repeater process is
      killed.  (A Posix-compliant OS supposedly will do
      otherwise, but a child process on SunOS 4.1.x does
      inherit the locks.) 


   b) Files opened before the fork() but that are closed and
      unlinked after the fork() will not go away until the
      repeater process dies, as the blocks that belonged to an
      unlinked file don't disappear until all the processes
      that have the file open close it.  Not a big deal, except
      when there may be limited space on a file system and the
      file is very big.

   c) Kernel device driver close() routines won't be called
      until the last process that has the device open closes the
      file descriptor or exits. If the application is closing
      the device to ensure hardware gets reset, that won't happen
      until the repeater process is killed.  Also, more importantly,
      if the driver is set for exclusive use, no other process can
      open it until the repeater process goes away.


 2) Caught signals are not reset after the fork(), with the
    consequence:

      Signal catching routines that do hardware control may be
      called by both the parent and child (repeater) process.

      For example, if a user's ^C calls a move-abort routine,
      that routine could be run in both processes.  If
      SIGTERM is being caught, and the parent's handler routine
      writes out a file before exiting, the child process could
      corrupt the file when it is killed and writes something to
      the file.


 3) The "ps" command displays the child process with the same name
    as the parent.  Not a major problem, but misleading to users.
    Let's say the user starts "spec", which spawns a repeater, then
    exits spec and starts "super".  The "ps" command will show that
    spec is still running, even though it's just in the form of the
    repeater process.  I'm sure that will confuse some people.
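
Points 1) and 2) can be checked with a small stand-alone test.  The
sketch below is not taken from the channel access library; the names
fork_inheritance_mask() and on_usr1() are made up for illustration.  It
forks without any cleanup, as spawn_repeater() does, and lets the child
report whether it can still use a descriptor the parent opened and
whether the parent's signal handler is still installed.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical handler standing in for an application's own routine. */
static void on_usr1(int sig) { (void)sig; }

/*
 * Fork with no cleanup and report what the child inherited:
 *   bit 0 set: a descriptor opened by the parent was usable in the child
 *   bit 1 set: the parent's SIGUSR1 handler was still installed
 * Returns -1 on error.
 */
int fork_inheritance_mask(void)
{
    int fds[2];
    int result = -1;
    char report;
    pid_t pid;

    if (pipe(fds) < 0)
        return -1;
    if (signal(SIGUSR1, on_usr1) == SIG_ERR)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: the write below only succeeds on an inherited fd. */
        char mask = 1;

        /* signal() returns the disposition that was in place. */
        if (signal(SIGUSR1, SIG_DFL) == on_usr1)
            mask |= 2;
        if (write(fds[1], &mask, 1) != 1)
            _exit(1);
        _exit(0);
    }

    /* Parent: collect the child's one-byte report. */
    close(fds[1]);
    if (read(fds[0], &report, 1) == 1)
        result = report;
    close(fds[0]);
    waitpid(pid, NULL, 0);
    signal(SIGUSR1, SIG_DFL);
    return result;
}
```

On a system where both problems are present the function returns 3.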

I thought I had a workaround by making a little program named
"repeater" that just calls ca_task_initialize(), and running this
program with system(), say, when spec starts up (requiring a built-in
path name!).  However, if for any reason this repeater process dies,
the code in the channel access library will automatically call the
spawn_repeater() code to fork() and spawn a new process anyway,
beyond my control.

The fix would be to use fork() and execl(), either called directly
or through the system() C library call.  The execl() will free most
of the resources of the child process when it overlays the fork()ed
process with the image of a hopefully much smaller dedicated
"repeater" process.  That means a path name would have to be built
into the channel access library for the UNIX variant, but I don't
see how that can be avoided.  The path could be looked for first
in a subdirectory of an EPICS_HOME environment variable, and if
that is not set, in a subdirectory of paths in a standard built-in
list, such as
/usr/epics:/usr/local/epics:/usr/lib/epics:/usr/local/lib/epics,
say.  Without doing the execl(), we are stuck with the possibility
of getting the parent application process duplicated.
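
A rough sketch of that fork() and execl() approach follows.  It is only
an illustration under stated assumptions, not code from the library:
the subdirectory name "bin", the program name "repeater", and the
function names find_repeater() and spawn_repeater_exec() are all
hypothetical.  EPICS_HOME is consulted first, and the built-in list is
used only when it is not set.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define REPEATER_SUBDIR "bin"       /* hypothetical subdirectory */
#define REPEATER_NAME   "repeater"  /* hypothetical program name */
#define DEFAULT_DIRS \
    "/usr/epics:/usr/local/epics:/usr/lib/epics:/usr/local/lib/epics"

/* Search a colon-separated list of directories for an executable
 * repeater.  Returns 0 with the full path in buf, or -1 if not found. */
static int search_dirs(const char *dirs, char *buf, size_t len)
{
    while (*dirs != '\0') {
        size_t n = strcspn(dirs, ":");

        if (n > 0) {
            int w = snprintf(buf, len, "%.*s/%s/%s", (int)n, dirs,
                             REPEATER_SUBDIR, REPEATER_NAME);
            if (w > 0 && (size_t)w < len && access(buf, X_OK) == 0)
                return 0;
        }
        dirs += n;
        if (*dirs == ':')
            dirs++;
    }
    return -1;
}

/* Use EPICS_HOME if it is set, otherwise the built-in list. */
int find_repeater(char *buf, size_t len)
{
    const char *home = getenv("EPICS_HOME");

    if (home != NULL && *home != '\0')
        return search_dirs(home, buf, len);
    return search_dirs(DEFAULT_DIRS, buf, len);
}

/* Fork, then overlay the child with the dedicated repeater image.
 * Returns the child's pid, or -1 if no repeater was found or the
 * fork() failed. */
pid_t spawn_repeater_exec(void)
{
    char path[1024];
    pid_t pid;

    if (find_repeater(path, sizeof path) != 0)
        return -1;
    pid = fork();
    if (pid == 0) {
        execl(path, REPEATER_NAME, (char *)NULL);
        _exit(127);     /* reached only if execl() failed */
    }
    return pid;
}
```

Note that descriptors without the close-on-exec flag stay open across
execl(), so the child would still want to close inherited descriptors
before the execl() call.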

A minimal, but less desirable, fix would use just the fork(),
sticking in code just after the fork() to close files and reset
signals, as in:

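if ((status = fork()) == 0) {
        int     i;

        /* child: close everything past stdin, stdout and stderr */
        for (i = 3; i < 50; i++)
                close(i);

        /* child: ignore terminal signals, reset the rest to default */
        for (i = 1; i < NSIG; i++)
                switch (i) {
                 case SIGHUP:
                 case SIGINT:
                 case SIGQUIT:
                        signal(i, SIG_IGN);
                        break;
                 default:
                        signal(i, SIG_DFL);
                }
....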

That closes all file descriptors past the standard three, up to
some high, but arbitrary limit of 50.  The real maximum number can
be determined in various OS-dependent ways, but I don't bother in
my own code, since I know I never go anywhere near 50, and if the
close() returns an error, who cares?
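
One of those OS-dependent ways is sysconf(_SC_OPEN_MAX), where the
system provides it.  The sketch below is an assumption about how the
loop could use it, not the author's code: close_fds_from() is a made-up
name, the fallback of 50 is the arbitrary limit from above, and the
upper cap is a hypothetical guard against systems that report a very
large limit.

```c
#define _POSIX_C_SOURCE 200809L
#include <unistd.h>

#define FALLBACK_MAX_FD 50      /* the arbitrary limit used above */
#define HARD_CAP_FD     65536   /* hypothetical cap to keep the loop short */

/*
 * Close every descriptor from lowfd up to the system limit.
 * Errors from close() on descriptors that were never open are ignored.
 * Returns the number of descriptors that were actually closed.
 */
int close_fds_from(int lowfd)
{
    long maxfd = sysconf(_SC_OPEN_MAX);
    int fd;
    int closed = 0;

    if (maxfd < 0)
        maxfd = FALLBACK_MAX_FD;    /* limit is indeterminate */
    if (maxfd > HARD_CAP_FD)
        maxfd = HARD_CAP_FD;

    for (fd = lowfd; fd < maxfd; fd++)
        if (close(fd) == 0)
            closed++;
    return closed;
}
```

In the child branch of the fork() this would be called as
close_fds_from(3) in place of the fixed loop.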

As to signals, SIGHUP, SIGINT, SIGQUIT should probably be set to
SIG_IGN for a background process.  My application catches SIGPIPE,
SIGALRM, SIGUSR1, SIGUSR2 and SIGTERM at various points, so they
should probably be reset to SIG_DFL.  Other applications may catch
other signals, so perhaps all other signals should be reset to the
default as in the example.  I don't reset all signals in my code
like the above, so I can't say if there are any undesirable
side-effects.

Of course, doing an execl() would automatically reset all caught
signals to their defaults anyway, and that is probably the better
way to go.


Gerry Swislow
-  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - -
Certified Scientific Software  Internet : [email protected]
PO Box 390640                     Phone : +1 (617) 576-1610
Cambridge, MA  02139                Fax : +1 (617) 497-4242
