I wrote a program that creates a lot of child processes. These child processes may die of their own accord, or they might be killed by the parent process. Consequently, I had (a) few options:
- Let zombie processes accumulate;
- Reap them by periodically calling wait(2);
- Preclude zombification by ignoring SIGCLD (the System V name for what POSIX calls SIGCHLD; see signal(2)).
Oh! Apparently this is non-POSIX behavior.
One child process itself spawns a child process (i.e., a grandchild of the SIGCLD-ignoring main process). This grandchild used the library function system(3), which would often fail with the error ECHILD (i.e., "No child processes").
I decided to do a web search, which led me to http://fixunix.com/unix/84125-system-fails-errno%3Dechild.html. The author offers the following throwaway comment: "I know one reason for this is that SIGCHLD is set to SIG_IGN."
"Whoa," I thought, "that's exactly what I did!"
Here's the thing: the implementation of system(3) creates a child shell process, which runs the command in question; the parent process calls waitpid(2) or something like it, which collects the status of the child (the shell). Most of the time, the parent calls waitpid(2) before the child process terminates, and thus has no problem reaping the child's status.
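To make that concrete, here is a stripped-down sketch of what a system(3)-like function does. The name simple_system is hypothetical, and this is not the real libc implementation: the real one also ignores SIGINT and SIGQUIT and blocks SIGCHLD while the command runs, all of which is left out here.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical, simplified stand-in for system(3).
 * Returns the wait status of the shell, or -1 with errno set. */
int simple_system(const char *command)
{
    int status;
    pid_t pid;

    assert(command != NULL);

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: become the shell that runs the command. */
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127); /* reached only if execl() failed */
    }

    /* Parent: collect the shell's status, retrying if a signal
     * interrupts the wait. This is the call that fails with ECHILD
     * when there is no child status left to collect. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }

    return status;
}
```

With the default SIGCHLD disposition, simple_system("exit 3") returns a status for which WIFEXITED() is true and WEXITSTATUS() is 3.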
But what if the child completes before the parent calls waitpid(2)? Well, normally the child process becomes a zombie until the parent can reap it. But according to sigaction(2):
"POSIX.1-1990 disallowed setting the action for SIGCHLD to SIG_IGN. POSIX.1-2001 allows this possibility, so that ignoring SIGCHLD can be used to prevent the creation of zombies (see wait(2)). …"
Since this grandchild's ancestor had set SIGCHLD to SIG_IGN, and nothing had changed it back to SIG_DFL, zombification was getting preempted; thus, when system(3) tried to reap the status of its already-terminated child (shell) process, nothing was there, not even a zombie, and ECHILD resulted. Once I learned the cause of the problem, the fix was easy:
    signal(SIGCLD, SIG_DFL);
before the first call to system(3).
Really, though, I should have coded the main process in a more POSIXly-correct manner, and avoided this whole mess. Live and learn.