-----Original Message-----
From: Dennis Nicklaus [mailto:[email protected]]
Sent: Wednesday, January 03, 2007 4:10 PM
To: [email protected]
Subject: caRepeater must run before casr
We recently ran into a very puzzling problem here using the EPICS casr
(channel access save restore) tool. The problem showed up in one of two
ways after you push the casrSave or casrRestore button.
Sometimes the Tcl/Tk casr interface would give an error dialog saying,
"error waiting for process to exit: child process lost (is SIGCHLD
ignored or trapped?)"
and other times it would just hang forever after you push
casrSave/casrRestore, without the error dialog (though the save/restore
would be processed).
The short solution is that you must have caRepeater running before
running casr.
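
One way to apply that workaround from a launcher script is sketched
below. This is only a sketch under assumptions: it assumes the
caRepeater executable is on your PATH, and start_repeater is a
hypothetical helper name, not part of casr or EPICS. The repeater is
given /dev/null for its standard streams and its own session, so it
cannot hold on to a pipe belonging to whatever launched it.

```python
import subprocess


def start_repeater(command=("caRepeater",)):
    """Start a long-lived helper detached from the caller's stdio.

    The default command assumes caRepeater is on PATH (an assumption,
    adjust for your installation).
    """
    return subprocess.Popen(
        list(command),
        stdin=subprocess.DEVNULL,   # do not share the caller's stdin
        stdout=subprocess.DEVNULL,  # do not inherit a pipe on stdout
        stderr=subprocess.DEVNULL,  # nor on stderr
        start_new_session=True,     # own session, survives the launcher
    )
```

Calling start_repeater() once before bringing up the casr GUI should
mean casave finds a repeater already there and has no reason to start
its own.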
A brief summary of the gory details: when one presses the Tk casrSave
button, that causes Tcl to exec the casave program. casave in turn
starts caRepeater if caRepeater isn't already there. caRepeater, in
trying to be a nice forked process, closes all its file descriptors
except stdin, stdout, and stderr. This is part of where the problem
starts, because the pipe open between the top-level wish (Tcl) shell
and the casave program gets dup-ed to stdout of casave; then when
casave clones/forks off caRepeater, the same stdout remains open in
caRepeater. Then when casave finishes, it's dead, but the higher-level
Tcl is still trying to read() on the pipe, which is being held open by
caRepeater. This wouldn't be a problem if the high-level Tcl shell
were getting a SIGCHLD from the casave process, but by sifting through
trace output, we saw that the casave process was being started with
the clone() system call without specifying SIGCHLD in the flags, and,
as the clone() man page says, "If no signal is specified, then the
parent process is not signaled when the child terminates." We don't
know if this is a mistake in the version of Tcl we have or something
with the version of Linux and TLS we happen to be running, though it
happens on multiple Linux kernel versions we have.
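
The inherited-pipe part of this can be reproduced without EPICS or Tcl.
The sketch below is an illustration only, not the casr code: a Python
parent stands in for wish, a short-lived child stands in for casave,
and a 2-second sleeper stands in for caRepeater. All names here
(child_source, run_and_time) are made up for the example. The parent's
read() only sees end-of-file once every process holding the write end
of the pipe has closed it, so the read lasts as long as the sleeper
does unless the sleeper's stdout is redirected away from the pipe.

```python
import subprocess
import sys
import time


def child_source(detach):
    """Source for a stand-in 'casave' that starts a 2 s helper, then exits."""
    redirect = (
        ", stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL"
        if detach else ""
    )
    return (
        "import subprocess, sys\n"
        "subprocess.Popen([sys.executable, '-c', "
        f"'import time; time.sleep(2)']{redirect})\n"
        "print('save done', flush=True)\n"
    )


def run_and_time(detach):
    """Run the child with stdout on a pipe; return (output, seconds to EOF)."""
    start = time.monotonic()
    proc = subprocess.Popen(
        [sys.executable, "-c", child_source(detach)],
        stdout=subprocess.PIPE,
        text=True,
    )
    output = proc.stdout.read()  # returns only when all writers have closed
    proc.wait()
    return output, time.monotonic() - start


if __name__ == "__main__":
    for detach in (False, True):
        out, elapsed = run_and_time(detach)
        print(f"detach={detach}: read {out!r} after {elapsed:.1f}s")
```

In the first run the child is long gone while the read is still
blocked, which is the same shape as the hang we saw. In our case the
helper never exits, so the read never finishes.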
YMMV widely depending on your versions of Unix and Tcl.
I'm not suggesting anything necessarily needs to change in casr or
caRepeater, just trying to point out a bizarre problem someone else may
bump into along the way.
Many thanks to Ron Rechenmacher, who spent many hours puzzling over
this one.
Dennis