[Buildbot-devel] Various bugs with builder level locking

Wed May 27 17:45:53 UTC 2009

In buildbot.process.base, the code does not notify the status code
that a build has started until all of its locks have been acquired.
While the build is waiting on those locks, the waterfall status page
misbehaves in several ways.  The most notable bug is that existing
builds for the same builder disappear from the waterfall list.  If the
buildbot master restarts at this time, the waterfall page will never
render correctly again until you delete all of the existing build data
and start over.

For testing purposes, I moved the buildStarted() call before the
acquireLocks() call, and that seems to resolve all of these issues.
The side effect is that the status code thinks the build has started
before it actually has, which messes up the estimated time code for
subsequent builds.  This roughly matches the behavior of using
buildstep locking when using per-build locks.  The current code is:

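        # ... end of the method that starts the build; d is defined earlier
        self.acquireLocks().addCallback(self._startBuild_2)
        return d

    def _startBuild_2(self, res):
        # the status code is only notified once the locks are held
        self.build_status.buildStarted(self)
        self.startNextStep()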
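
To make the ordering difference concrete, here is a rough toy model of
the two variants.  It is only a sketch: FakeStatus, FakeBuild, the
notify_early flag and locksAcquired() are names I made up for
illustration, and it uses a plain callback instead of a Twisted
Deferred, so it is not the real buildbot code.

```python
# Toy model only: FakeStatus, FakeBuild and notify_early are hypothetical
# stand-ins, not buildbot classes.

class FakeStatus:
    def __init__(self):
        self.events = []

    def buildStarted(self, build):
        self.events.append("buildStarted")


class FakeBuild:
    def __init__(self, status, notify_early):
        self.build_status = status
        self.notify_early = notify_early
        self._on_locks = None

    def acquireLocks(self, callback):
        # Pretend the locks are busy: remember the callback for later.
        self.build_status.events.append("waiting for locks")
        self._on_locks = callback

    def locksAcquired(self):
        # Simulates the moment the locks finally become available.
        self.build_status.events.append("locks acquired")
        self._on_locks(None)

    def startBuild(self):
        if self.notify_early:
            # Test variant: notify the status code before taking locks.
            self.build_status.buildStarted(self)
        self.acquireLocks(self._startBuild_2)

    def _startBuild_2(self, res):
        if not self.notify_early:
            # Current behavior: notify only after the locks are held.
            self.build_status.buildStarted(self)
        self.build_status.events.append("first step started")


def run(notify_early):
    status = FakeStatus()
    build = FakeBuild(status, notify_early)
    build.startBuild()
    build.locksAcquired()
    return status.events
```

With run(False) the status code hears nothing while the build waits on
its locks; with run(True) it is told about the build first, which is
also why the time spent waiting gets counted as build time.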

A related problem is that if you have multiple clients that are
capable of fulfilling the request and all of them are currently in a
locked state, it will frequently make a bad decision about which
client to issue the job to.  Often one of the clients will finish
within a relatively short time, but the job ends up going to a client
that isn't going to complete for a significantly longer time.  Looking
at the same code, it appears to pick which client to run the build on
before it knows which client is going to have capacity to run the job.
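
One possible direction, sketched very roughly, is to hold the request
and only bind it to a client once that client has actually become
free.  LateBindingQueue, submit() and client_freed() below are
hypothetical names for illustration, not anything that exists in
buildbot.

```python
from collections import deque


class LateBindingQueue:
    """Hypothetical sketch: hold requests until a client is really free."""

    def __init__(self):
        self.pending = deque()
        self.assignments = []

    def submit(self, request):
        # Do not choose a client yet; just remember the request.
        self.pending.append(request)

    def client_freed(self, client):
        # Called when a client releases its lock.  The oldest pending
        # request goes to whichever client frees up first.
        if not self.pending:
            return None
        request = self.pending.popleft()
        self.assignments.append((request, client))
        return request
```

For example, if "build-1" is submitted while every client is locked,
it goes to whichever client calls client_freed() first instead of to
one picked in advance.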

It seems like a single code change could potentially fix all three of
the above problems.  I am going to take a stab at fixing it (although
I am new to the buildbot code base), and I would appreciate feedback
on what the best solution would be.