Friday, January 03, 2014

system(3) gives ECHILD, and all I did was ignore SIGCLD (I hate zombies)

This was a little puzzle, and the tracks are rather obscure, so I thought I'd tell you about it.

I wrote a program that creates a lot of child processes. These child processes may die of their own accord, or they might be killed by the parent process. Consequently, I had (a) few options:

I chose the last option, ignoring SIGCLD in the main process, because it seemed easiest. In hindsight, though… well, I'm getting ahead of myself.

One child process itself spawns a child process (i.e., a grandchild of the SIGCLD-ignoring main process). This grandchild process used the library function system(3), which would often fail with errno set to ECHILD (i.e., "No child processes").

I decided to do a web search, something like this, which led me to http://fixunix.com/unix/84125-system-fails-errno%3Dechild.html. The author offers the following throwaway comment: "I know one reason for this is that SIGCHLD is set to SIG_IGN."

"Whoa," I thought, "that's exactly what I did!"

Here's the thing: the implementation of system(3) creates a child shell process, which runs the command in question; the parent process calls waitpid(2) or something like this, which gets the status from the child (the shell). "Most" of the time, the parent calls waitpid(2) before the child process terminates, and thus has no problem reaping the child's status.
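To make that concrete, here is a rough sketch of what system(3) does. It is not the real libc code: my_system is a made-up name, and a real implementation also ignores SIGINT and SIGQUIT and blocks SIGCHLD while it waits, which I've left out.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for system(3): run `command` via /bin/sh and
 * return the wait status, or -1 with errno set on failure. */
int my_system(const char *command)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;                      /* fork failed */

    if (pid == 0) {
        /* Child: become the shell that runs the command. */
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);                     /* only reached if exec failed */
    }

    /* Parent: collect the shell's status, retrying if a signal interrupts. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;                  /* e.g. ECHILD if there is nothing to reap */
    }
    return status;
}
```

The waitpid(2) call in the parent is the step that matters for the rest of this story: if there is no child left to collect, it fails, and the caller sees -1.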

But what if the child completes before the parent calls waitpid(2)? Well, normally the child process becomes a zombie until the parent can reap it. But according to sigaction(2):

  POSIX.1-1990 disallowed setting the action for SIGCHLD to
  SIG_IGN. POSIX.1-2001 allows this possibility, so that ignoring
  SIGCHLD can be used to prevent the creation of zombies (see
  wait(2)). …

Since this grandchild's ancestor had set SIGCHLD to SIG_IGN, and nothing had changed it back to SIG_DFL, zombification was getting preempted; thus, when system(3) tried to reap the status of its already-terminated child (shell) process, nothing was there, not even a zombie, and ECHILD resulted.

Once I learned the cause of the problem, the fix was easy:

  signal(SIGCLD, SIG_DFL);

before the first call to system(3).

Really, though, I should have coded the main process in a more POSIXly-correct manner, and avoided this whole mess. Live and learn.
