[Buildbot-devel] Log polling errors
Jean-Paul Calderone
exarkun at divmod.com
Sat Jul 29 19:20:27 UTC 2006
I recently started seeing this one a build slave:
2006/07/28 15:15 CDT [-] Unhandled error in Deferred:
2006/07/28 15:15 CDT [-] Traceback (most recent call last):
File "/usr/lib/python2.2/site-packages/twisted/application/app.py", line 128, in runReactorWithLogging
reactor.run()
File "/usr/lib/python2.2/site-packages/twisted/internet/posixbase.py", line 200, in run
self.mainLoop()
File "/usr/lib/python2.2/site-packages/twisted/internet/posixbase.py", line 208, in mainLoop
self.runUntilCurrent()
File "/usr/lib/python2.2/site-packages/twisted/internet/base.py", line 533, in runUntilCurrent
call.func(*call.args, **call.kw)
--- <exception caught here> ---
File "/usr/lib/python2.2/site-packages/twisted/internet/task.py", line 84, in __call__
self.f(*self.a, **self.kw)
File "buildbot/slave/commands.py", line 171, in poll
self.f = open(self.logfile, "rb")
exceptions.IOError: [Errno 2] No such file or directory: 'path/to/log/file'
2006/07/28 15:47 CDT [-] command finished with signal None, exit code 1
2006/07/28 15:47 CDT [-] Traceback (most recent call last):
File "/usr/lib/python2.2/site-packages/twisted/internet/base.py", line 510, in runUntilCurrent
f(*a, **kw)
File "/usr/lib/python2.2/site-packages/twisted/internet/process.py", line 44, in reapAllProcesses
process.reapProcess()
File "/usr/lib/python2.2/site-packages/twisted/internet/process.py", line 752, in reapProcess
self.processEnded(status)
File "/usr/lib/python2.2/site-packages/twisted/internet/process.py", line 774, in processEnded
self.maybeCallProcessEnded()
--- <exception caught here> ---
File "/usr/lib/python2.2/site-packages/twisted/internet/process.py", line 810, in maybeCallProcessEnded
self.proto.processEnded(failure.Failure(e))
File "buildbot/slave/commands.py", line 134, in processEnded
self.command.finished(sig, rc)
File "buildbot/slave/commands.py", line 381, in finished
w.stop()
File "buildbot/slave/commands.py", line 158, in stop
self.poller.stop()
File "/usr/lib/python2.2/site-packages/twisted/internet/task.py", line 68, in stop
assert self.running
exceptions.AssertionError:
2006/07/28 16:07 CDT [-] command timed out: 1200 seconds without output, killing pid 16724
2006/07/28 16:07 CDT [-] trying os.kill(-pid, 9)
2006/07/28 16:07 CDT [-] trying process.signalProcess('KILL')
2006/07/28 16:07 CDT [-] signalProcess/os.kill failed both times
2006/07/28 16:07 CDT [-] we tried to kill the process, and it wouldn't die.. finish anyway
2006/07/28 16:07 CDT [-] ShellCommand.failed: command failed: SIGKILL failed to kill process
2006/07/28 16:07 CDT [-] SlaveBuilder.commandFailed <buildbot.slave.commands.SlaveShellCommand instance at 0x84a012c>
2006/07/28 16:07 CDT [-] Traceback (most recent call last):
Failure: buildbot.slave.commands.TimeoutError: SIGKILL failed to kill process
I don't know why the file being polled does not exist on this one slave; I suppose it is a race of some sort. However, since the exception prevents the LoopingCall used by the poller from being started, when the loop is eventually stopped, another exception occurs. This seems to confuse the buildmaster and the slave becomes indefinitely wedged.
This change seems to help things move along:
--- commands.orig 2006-07-29 14:19:50.000000000 -0500
+++ commands.py 2006-07-29 14:19:26.000000000 -0500
@@ -151,11 +151,16 @@
self.poller = task.LoopingCall(self.poll)
def start(self):
- self.poller.start(self.POLL_INTERVAL)
+ self.poller.start(self.POLL_INTERVAL).addErrback(self._cleanupPoll)
+
+ def _cleanupPoll(self, err):
+ log.err(err, msg="Polling error")
+ self.poller = None
def stop(self):
self.poll()
- self.poller.stop()
+ if self.poller is not None:
+ self.poller.stop()
def statFile(self):
if os.path.exists(self.logfile):
Jean-Paul
More information about the devel
mailing list