[Buildbot-devel] Failed shell command and failed to kill the process
warner-buildbot at lothar.com
Thu Apr 20 17:23:08 UTC 2006
> Anyone has any idea what happened here?
That's really really weird.
> I found that the command "ls" looks like " 6211 ttyr5- ZW
> 0:00.00 (ls)". There are bracket surround "ls" , which means the "ls" is
> already finished.
The "Z" means that it has finished and is waiting for the parent process (in
this case the buildslave) to reap it. Processes in this state are called
"zombies", because they're dead, but just won't go away until the parent
reads their exit status.
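To make the zombie state concrete, here is a minimal sketch, not taken from the buildslave code. It assumes a Linux-like system where Python 3's os.waitid() is available; the exit code 7 and the variable names are made up for illustration. The child exits straight away, the parent confirms that the process-table entry is still there, and the entry disappears only once the parent reads the exit status.

```python
import os

pid = os.fork()
if pid == 0:
    os._exit(7)  # child: terminate immediately with an arbitrary exit code

# Wait until the child has exited, but leave it unreaped (WNOWAIT),
# so it stays in the process table as a zombie.
os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

# Signal 0 delivers nothing; it only checks that the pid still exists.
# No exception here means the zombie entry is still present.
os.kill(pid, 0)

# Reading the exit status is what reaps the zombie.
reaped_pid, status = os.waitpid(pid, 0)
exit_code = os.WEXITSTATUS(status)

# After reaping, the pid no longer refers to any process.
try:
    os.kill(pid, 0)
    gone_after_reap = False
except ProcessLookupError:
    gone_after_reap = True
```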
> command timed out: 30 seconds without output, killing pid 6116
This suggests that the buildslave process never saw the SIGCHLD that
indicates a child process has terminated and needs to be reaped.
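For reference, the usual SIGCHLD pattern looks roughly like the sketch below. This is a generic illustration and not how Twisted actually implements it; the handler name, the `reaped` dictionary and the exit code 5 are all hypothetical. If the signal never arrived, the handler would never run and the child would stay a zombie.

```python
import os
import signal
import time

reaped = {}  # pid -> exit code, filled in by the handler

def on_sigchld(signum, frame):
    # Reap every child that has exited, without blocking.
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left at all
        if pid == 0:
            return  # children exist, but none has exited yet
        reaped[pid] = os.WEXITSTATUS(status)

previous = signal.signal(signal.SIGCHLD, on_sigchld)

child = os.fork()
if child == 0:
    os._exit(5)  # child: exit immediately

# Give the signal up to five seconds to arrive and be handled.
deadline = time.monotonic() + 5
while child not in reaped and time.monotonic() < deadline:
    time.sleep(0.01)

signal.signal(signal.SIGCHLD, previous)  # restore the old handler
```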
> SIGKILL failed to kill process
This might just be what normally happens when you try to kill a zombie: it is
already dead, so SIGKILL has nothing left to kill.
I'm not sure where to go from here. It sounds like either SIGCHLD is not being
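A small experiment along these lines can check that guess. It is only a sketch, assuming a Linux-like system with os.waitid(); the exit code 3 is arbitrary. It sends SIGKILL to a child that is already a zombie and then looks at the status the parent collects.

```python
import os
import signal

pid = os.fork()
if pid == 0:
    os._exit(3)  # child: exit normally with an arbitrary code

# Wait for the child to exit while leaving it unreaped, i.e. a zombie.
os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

# The kill() call itself succeeds, because the pid still exists...
os.kill(pid, signal.SIGKILL)

# ...but the status recorded at exit time is what the parent gets back.
_, status = os.waitpid(pid, 0)
exited_normally = os.WIFEXITED(status)
killed_by_signal = os.WIFSIGNALED(status)
exit_code = os.WEXITSTATUS(status)
```

On the systems I'd expect, the status still reports a normal exit rather than death by signal, and the zombie only goes away at the waitpid() call.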
delivered properly, or it is getting ignored or eaten somewhere.
I'd want to add some instrumentation to
twisted.internet.process.reapAllProcesses() and Process.reapProcess() and
PTYProcess.reapProcess() to do a log.msg() each time they're called, so we'd
know when/if SIGCHLD was being delivered. I'd also set 'debug=True' in
buildbot.slave.commands.ShellCommandPP, which will turn on additional logging
when twisted thinks the process has ended. (If the process is still a zombie,
then ShellCommandPP.processEnded will probably not have been called, but it's
still worth turning on the debug messages.) I'd also try running the
buildslave under strace to see if the signal is being delivered or not.
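One way to add that kind of instrumentation without editing each function by hand is a small logging wrapper. The sketch below is an assumption-laden illustration: `log_calls`, `instrument` and `FakeProcess` are hypothetical names, and it uses the standard logging module instead of Twisted's log.msg() so that it runs on its own. In a real buildslave you would point instrument() at the Twisted functions named above, assuming they can be patched this way in your Twisted version; editing the Twisted source directly is the more certain route.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("reap-trace")

def log_calls(func):
    """Wrap func so that every call is logged before it runs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.info("called %s", func.__qualname__)
        return func(*args, **kwargs)
    return wrapper

def instrument(owner, name):
    """Replace the attribute `name` on a module or class with a logging wrapper."""
    setattr(owner, name, log_calls(getattr(owner, name)))

# Stand-in for a class with a reapProcess() method (hypothetical, for the demo).
class FakeProcess:
    def reapProcess(self):
        return "reaped"

instrument(FakeProcess, "reapProcess")
result = FakeProcess().reapProcess()  # logs the call, then runs the original
```

If no log line ever shows up for the reap functions while a zombie is sitting there, that points at the signal not being delivered at all.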
What version of Twisted are you using? Which Unix are you running on? Do the
buildbot unit tests pass on your box? (From the top of the buildbot source
tree, run 'PYTHONPATH=. trial buildbot.test'.)