[Buildbot-devel] zombie processes on Solaris
Dustin J. Mitchell
dustin at zmanda.com
Wed Jun 13 19:17:33 UTC 2007
I'm having some trouble with builds hanging on a new Solaris 10 buildslave I
just put together. The step in question is:
f.addStep(my.steps.TarballSourceNFS)
# where
class TarballSourceNFS(Source):
    """
    Get a tarball from $HOME, as left there by KeepTarballNFS.
    """
    def __init__(self, workdir, **kwargs):
        self.workdir = workdir
        Source.__init__(self, workdir, **kwargs)
    def startVC(self, branch, revision, patch):
        branch = zmanda.base.branch_to_str(self.getProperty("branch"))
        revision = self.getProperty("revision")
        cmd = """
        gunzip -c $HOME/dist/tarballs/amanda-%(branch)s-%(revision)s.tar.gz |
            tar -xf - &&
        rm -rf %(workdir)s &&
        mv amanda* %(workdir)s""" % {
            'branch' : branch,
            'revision' : revision,
            'workdir' : os.path.basename(self.workdir),
        }
        # run the command in the *parent* of the workdir
        rsc = RemoteShellCommand(os.path.dirname(self.workdir), cmd)
        self.startCommand(rsc)
The class basically just writes a shell command and hands it off to the parent
class. I realize it's a hack, but I think that's immaterial to this question.
When it runs, it succeeds, as evidenced by looking at workdir on the slave. On
the slave, I also see (ps -ef):
     UID   PID  PPID  C    STIME TTY      TIME CMD
buildsla 23580 17211  0 09:17:02 ?        0:00 python /home/buildslave/bin/twistd --no_save -y buildbot.tac
buildsla 26817 23580  0        - ?        0:01 <defunct>
so it looks like spawnProcess isn't catching its child's exit?
I have also seen this step actually succeed, but the subsequent step
(configure) failed, and hung the same way (including the defunct process).
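For reference, a <defunct> entry is what ps shows for a child that has exited but whose parent has not yet collected its exit status. The following is a minimal sketch of that mechanism in plain Python, not Buildbot or Twisted code; the function names pid_exists and zombie_demo are made up for illustration, and it assumes a POSIX system and a reasonably recent Python.

```python
import os


def pid_exists(pid):
    """Return True if the kernel still has a process-table entry for pid."""
    try:
        os.kill(pid, 0)  # signal 0 sends nothing; it only checks existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def zombie_demo():
    """Fork a child that exits at once, then report whether its
    process-table entry exists before and after the parent reaps it."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately, skipping any cleanup inherited
        # from the parent. Exiting closes the child's pipe ends.
        os._exit(0)
    os.close(write_end)
    # Blocks until every copy of the write end is closed, which for
    # the child's copy happens when the child exits.
    os.read(read_end, 1)
    os.close(read_end)

    before_reap = pid_exists(pid)  # entry is still there until we wait
    os.waitpid(pid, 0)             # collect the exit status (reap)
    after_reap = pid_exists(pid)
    return before_reap, after_reap
```

Between the child's exit and the waitpid call, ps would list the child as <defunct>. If the parent never gets around to the waitpid (for example because it never notices the child has exited), the entry stays that way for as long as the parent lives.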
I see in the slave's logs:
2007/06/13 11:52 -0700 [Broker,client] <SlaveBuilder 'archtest-sparc-solaris-10' at 5573728>.startBuild
2007/06/13 11:52 -0700 [Broker,client] startCommand:shell [id 2]
2007/06/13 11:52 -0700 [Broker,client] ShellCommand._startCommand
2007/06/13 11:52 -0700 [Broker,client] /bin/sh -c
        gunzip -c $HOME/dist/tarballs/amanda-trunk-6674.tar.gz |
            tar -xf - &&
        rm -rf build &&
        mv amanda* build
2007/06/13 11:52 -0700 [Broker,client] in dir /tmp/buildslave-9989/archtest-sparc-solaris-10/ (timeout 1200 secs)
2007/06/13 11:52 -0700 [Broker,client] watching logfiles {}
2007/06/13 11:52 -0700 [Broker,client] argv: ['/bin/sh', '-c', '\n\t\tgunzip -c $HOME/dist/tarballs/amanda-trunk-6674.tar.gz |\n\t\t\ttar -xf - &&\n\t\trm -rf build &&\n\t\tmv amanda* build']
2007/06/13 11:52 -0700 [Broker,client] environment: { ... }
2007/06/13 11:55 -0700 [-] sending app-level keepalive
2007/06/13 12:05 -0700 [-] sending app-level keepalive
2007/06/13 12:12 -0700 [-] command timed out: 1200 seconds without output, killing pid 26817
2007/06/13 12:12 -0700 [-] trying os.kill(-pid, 9)
2007/06/13 12:12 -0700 [-] signal 9 sent successfully
2007/06/13 12:12 -0700 [-] we tried to kill the process, and it wouldn't die.. finish anyway
2007/06/13 12:12 -0700 [-] ShellCommand.failed: command failed: SIGKILL failed to kill process
2007/06/13 12:12 -0700 [-] SlaveBuilder.commandFailed <buildbot.slave.commands.SlaveShellCommand instance at 0x553440>
2007/06/13 12:12 -0700 [-] Unhandled Error
Traceback (most recent call last):
Failure: buildbot.slave.commands.TimeoutError: SIGKILL failed to kill process
Pinging the builder after the failure works just fine, so the Twisted event
loop seems fine. I don't see the debugging message from 'processEnded'
anywhere in the logs.
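The "trying os.kill(-pid, 9)" line in the log is a signal sent to a whole process group: a negative pid addresses the group whose ID is the absolute value. The sketch below shows that pattern with the standard subprocess module rather than Twisted's spawnProcess, so it illustrates the idea and is not the slave's actual code path. The function name kill_group_and_reap is hypothetical, and it assumes a POSIX system with /bin/sh.

```python
import os
import signal
import subprocess


def kill_group_and_reap(shell_command, timeout=5):
    """Run shell_command in its own process group, SIGKILL the whole
    group, then reap the shell and return its return code."""
    # start_new_session=True makes the child call setsid(), so the
    # child's pid is also its process group ID.
    proc = subprocess.Popen(
        ["/bin/sh", "-c", shell_command],
        start_new_session=True,
    )
    # Same effect as os.kill(-proc.pid, 9): signal every process in
    # the group, including both sides of a shell pipeline.
    os.killpg(proc.pid, signal.SIGKILL)
    # The signal only ends the processes; wait() is what reaps the
    # shell, so it does not linger as a defunct entry.
    return proc.wait(timeout=timeout)
```

Calling kill_group_and_reap("sleep 30 | cat") returns a negative value, which is how subprocess reports that the child was terminated by a signal. The point of the sketch is that sending the signal and reaping are separate steps: a kill that is "sent successfully" says nothing about whether the parent later collected the exit status.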
I'm running:
Python 2.3.5 (#1, Nov 30 2005, 10:43:26) [C] on sunos5
Twisted-2.5.0
zope.interface-3.3.0
buildbot-0.7.5
Any suggestions?
Dustin
--
Dustin J. Mitchell
Storage Software Engineer, Zmanda, Inc.
http://www.zmanda.com/