[Buildbot-devel] Lock question: Multiple builders, load-balanced slaves
Brian Warner
warner-buildbot at lothar.com
Fri Nov 3 08:18:26 UTC 2006
"Roy S. Rapoport" <buildbot-devel at ols.inorganic.org> writes:
> Imagine this situation:
> 1. There are two slaves for a given os/hardware;
> 2. There are four builders that want to use these two slaves;
> 3. Each slave can only have one build happening at any given point
If I understand you correctly, then I think you're right: the existing Lock
mechanism won't work for you. Locks in the current release are an AND thing,
and you need an OR thing instead. You want to have a SlaveLock held around
the whole build (preventing two builders from running simultaneously on a
given buildslave), but you want the Builders to choose an idle slave rather
than committing to run on a slave which is running a build (and holding the
SlaveLock) on behalf of a different Builder.
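For reference, the kind of per-slave lock I have in mind gets declared in
master.cfg and attached to each Builder, something like the sketch below
(the names are placeholders, and I'm assuming your release accepts a
'locks' key in the builder dict and the maxCount= argument):

from buildbot import locks

# held for the duration of each Build: at most one build per slave
one_per_slave = locks.SlaveLock("one_per_slave", maxCount=1)

c['builders'] = []
for name in ['full-a', 'full-b', 'full-c', 'full-d']:
    c['builders'].append({'name': name,
                          'slavenames': ['slave1', 'slave2'],
                          'builddir': name,
                          'factory': f,   # whatever BuildFactory you already use
                          'locks': [one_per_slave]})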
The buildslave-choosing mechanism is unaware of Locks. It just picks a slave
that isn't already doing a build for the same builder at that moment. Each
builder maintains a separate list of buildbot.process.builder.SlaveBuilder
instances, one per buildslave. Each SlaveBuilder is either idle or doing a
build, and the buildslave-choosing code just grabs one that is idle.
Unbeknownst to this code, the SlaveBuilders that are in the "doing a build"
state may actually be stuck waiting for a Lock. Also, the SlaveBuilders that
are in the "idle" state might be associated with a buildslave that is doing a
build right now (and thus holding a SlaveLock), such that if we initiate our
own build on it, we'll just get stuck waiting for that SlaveLock to become
available.
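To make that concrete, the slave-choosing loop in Builder.maybeStartBuild()
currently amounts to roughly this (simplified from memory, so treat it as a
sketch rather than the exact code):

for sb in self.slaves:          # one SlaveBuilder per buildslave
    if sb.state == IDLE:
        break                   # take the first idle one; Locks are never consulted
else:
    return                      # nobody is idle: wait for some build to finish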
The SlaveLock (with maxCount=1) would be sufficient to protect the slaves
against multiple simultaneous builds. But it won't let you obtain full
utilization of the hardware: sometimes, one of the slaves will be idle when
there is still work to be done.
I can't think of an easy answer. As for hard answers, one possible direction
is to make that slave-choosing code aware of SlaveLocks. This code is in
buildbot/process/builder.py, Builder.maybeStartBuild(), where it walks
through self.slaves looking for one that is idle. You could add an extra test
for whether or not the slave was running another build, or if there was a
SlaveLock held that involved that slave.
for sb in self.slaves:
    if sb.state == IDLE and self.CheckForSlaveLock(sb):
        break  # found one
To implement that test, you need to find a Lock instance and ask it
whether anybody is holding it. The list of all real Locks (both kinds) is
kept in master.BotMaster and is accessed via getLockByID(). (The
SlaveLock and MasterLock instances in the config file are just markers,
used as keys in a dictionary whose values are the real locks; this avoids
problems when the config file is reloaded, since reloading creates
brand-new SlaveLock/MasterLock instances.) These BotMaster-held Locks are
the right thing to look at for MasterLocks, but for SlaveLocks you have
to take one more step (getLock(sb)) to get the lock for a given slave.
From the builder, you could do something like:
def CheckForSlaveLock(self, sb):
    # assumes: from buildbot.locks import SlaveLock
    lock = self.botmaster.getLockByID(SlaveLock("same name as in your master.cfg"))
    per_slave_lock = lock.getLock(sb)  # this is a BaseLock instance
    return per_slave_lock.isAvailable()
You could also avoid hard-wiring in the lock name (in exchange for
hard-wiring in some different behaviors) by searching through *all* locks
for any that are held by this slave. It's kind of gross, violates
abstraction boundaries left and right, and probably has some
side-effects, but I'm imagining something like:
def CheckForSlaveLock(self, sb):
    # assumes: from buildbot import locks
    for l in self.botmaster.locks.values():
        if isinstance(l, locks.SlaveLock):
            sl = l.getLock(sb)
            if not sl.isAvailable():
                return False
    return True
Doing either of these things instead of the usual "grab the first IDLE slave
we see" (and using Build-wide slavelocks on everything) will basically
prevent a build from being started on a slave that's already running a build.
If there's another slave that can be used, the build will start there. If
not, the build will wait until one of the slaves becomes idle.
That isn't an ideal solution (i.e. I don't think I'd want to put it into the
mainline code), but I think it might stand a chance of working for this
problem.
I don't know what a cleaner approach might look like. It would probably
involve some new kind of MasterLock (or new argument to the existing one)
that says "maxCount might be 2, but the number of builds that can run on any
given slave is still limited to 1". Orthogonal to this (and not really
something that would help here) would be some kind of OR operator, where you
say you want to claim either Lock1 OR Lock2. Such a mechanism might help
solve other problems, but probably not one in which the buildslaves are the
locus of the contention.
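Just to make those two ideas concrete, the config syntax might eventually
look something like this (purely hypothetical; none of these arguments or
classes exist today):

# hypothetical: allow two simultaneous builds overall, but still at most
# one on any given slave
svn_lock = locks.MasterLock("svn", maxCount=2, maxCountPerSlave=1)

# hypothetical: a build could claim either of two locks, whichever is free
either_lock = locks.LockOr(lock1, lock2)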
We could certainly use some better reporting interfaces to tell when
steps or builds are waiting on a Lock. (At the moment at work we've got
several builders sharing a MasterLock (maxCount=2) to limit contention
for a slow SVN server, and the resulting gaps in the waterfall page are
confusing to look at.) It would probably be sufficient to make the step's
status text say "waiting for Lock(name)" while it's in that state, but it
might also be nice to make it look like the step has already started
(since by that point we've committed to running the step eventually.. it's
just a matter of time).
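The step-level part would probably be a one-liner wherever the step sits
blocked on its locks; something like this sketch (the placement is
hypothetical, setText() is just the usual way a step updates its displayed
text, and lock_name is a placeholder):

# hypothetical sketch: tell the waterfall why the step is stalled
self.step_status.setText(["waiting for", "Lock(%s)" % lock_name])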
Showing Builds which are waiting on a lock (as opposed to individual
BuildSteps) could use improvement too. I'm not sure if there's a Builder
status or not.. if there is, we could put the "waiting for Lock(name)"
message there.
hope that helps,
-Brian