[Buildbot-devel] Non-fatal disconnects?

Brian Warner warner-buildbot at lothar.com
Sun Feb 18 00:31:05 UTC 2007


I think Markus' analysis is correct, that the ping-the-builder code needs to
be improved. There is an open ticket on this (SF#1500669) and I know I have a
tree called 'sharedping' lying around somewhere where I started to work on
this. One reason that I've been so slow about it is that the "right" fix is
at a much lower level, in the wire protocol itself (PB). I've been thinking
about how I want to add this sort of keepalive facility to Foolscap (the
successor to PB that I also happen to work on), but I've not yet figured out
a good way to add it to PB. Fixing SF#1500669 will be a start, though: ping
the buildslave as a whole rather than just the individual SlaveBuilders
within it, so that if we hear from the buildslave at all, we assume that it
is safe to start a new build on it.
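
To make the "ping the buildslave as a whole" idea concrete, here is a rough
sketch in plain Python. It is not the code from the 'sharedping' tree or
anything in Buildbot itself: SlaveLiveness, SlaveBuilderStub, max_silence and
can_start_build are all made-up names, and the real thing would be driven by
PB traffic rather than explicit method calls.

```python
import time


class SlaveLiveness:
    """Hypothetical: remembers when we last heard anything from one buildslave."""

    def __init__(self, max_silence=60.0, clock=time.monotonic):
        self.max_silence = max_silence  # assumed threshold, in seconds
        self.clock = clock              # injectable so the sketch is testable
        self.last_heard = None

    def heard_from_slave(self):
        # Called for ANY traffic from the slave: a ping reply, a log chunk, etc.
        self.last_heard = self.clock()

    def is_alive(self):
        if self.last_heard is None:
            return False
        return (self.clock() - self.last_heard) <= self.max_silence


class SlaveBuilderStub:
    """Hypothetical stand-in for one SlaveBuilder on that buildslave."""

    def __init__(self, name, liveness):
        self.name = name
        self.liveness = liveness  # shared by every builder on the same slave

    def can_start_build(self):
        # No per-builder ping: trust the slave-wide liveness record.
        return self.liveness.is_alive()
```

All the SlaveBuilders attached to one buildslave would share a single
SlaveLiveness object, so one reply from the slave vouches for all of them.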

The longer-term solution is the feature request that many people have asked
for: builds should not be interrupted when the master/slave connection is
lost. This is decidedly non-trivial, since it amounts to creating a
higher-level notion of "connection" than we're currently using, and each side
must distinguish between disconnections and the other end actually
terminating. The buildbot used to work this way, but it never quite worked
right, so I removed the code in the interests of simplicity. The basic
approach is that all the output from build steps (stdout, stderr, exit codes,
etc.) gets queued for delivery to the buildmaster when and if there is a
connection available, and each such queue is associated with a (builder name,
build number) tuple rather than a PB RemoteReference (which evaporates when
the connection is lost).
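
As a rough illustration of that queueing approach (again a sketch, not the
code that was removed and not a design for the ticket), the slave side could
hold something like the following. PendingUpdates, add, flush and the send
callback are invented names, and a failed delivery is assumed to show up as
a ConnectionError.

```python
from collections import defaultdict, deque


class PendingUpdates:
    """Hypothetical per-build queues of step output awaiting delivery."""

    def __init__(self):
        # Keyed by (builder name, build number), not by a connection object,
        # so the queues survive a lost connection.
        self._queues = defaultdict(deque)

    def add(self, builder_name, build_number, update):
        self._queues[(builder_name, build_number)].append(update)

    def flush(self, send):
        """Deliver queued updates with send(key, update).

        Stops at the first ConnectionError and keeps the undelivered
        updates queued. Returns the number of updates delivered.
        """
        delivered = 0
        for key in list(self._queues):
            queue = self._queues[key]
            while queue:
                try:
                    send(key, queue[0])
                except ConnectionError:
                    return delivered
                queue.popleft()  # drop only after a successful send
                delivered += 1
            del self._queues[key]
        return delivered
```

Each time a connection comes up, the slave would call flush() with whatever
function actually sends to the buildmaster; anything not delivered stays
queued for the next attempt.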

I've created ticket #25 to track this one
(http://buildbot.net/trac/ticket/25). Feel free to add comments and ideas to
it.


cheers,
 -Brian



