[Buildbot-devel] slave level locks

Brian Warner warner-buildbot at lothar.com
Tue Apr 17 07:55:45 UTC 2007


> Another question, then, is how to react to high loads. Using locks
> suggests simply serializing access to the machine, i.e. queuing builders.
> However, in a build farm multiple equivalent machines may be available,
> so a real dispatcher is needed that can send build requests to one among
> many hosts.

Good point.

I had been thinking of the slave-availability thing as being more advisory
than strict, i.e. put the heavily-loaded buildslave on the bottom of the
list, but not actually forbid using it unless the buildslave admin marks it
as being out-of-service. But I can see how a load-average-measuring process
could tell us that no builds should be started on that slave at all until the
load dropped.
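
A minimal sketch of the two behaviours described above, advisory ordering versus a strict cutoff. The names `SlaveInfo`, `order_candidates`, and `strict_load_limit` are hypothetical and are not part of Buildbot or of the attached patch:

```python
from dataclasses import dataclass


@dataclass
class SlaveInfo:
    name: str
    load: float                   # load average reported for the slave's host
    out_of_service: bool = False  # set explicitly by the buildslave admin


def order_candidates(slaves, strict_load_limit=None):
    """Return the usable slaves, least-loaded first.

    Advisory mode (strict_load_limit is None): a heavily loaded slave is
    only pushed to the bottom of the list, never excluded.
    Strict mode: slaves whose load exceeds the limit are dropped entirely.
    """
    usable = [s for s in slaves if not s.out_of_service]
    if strict_load_limit is not None:
        usable = [s for s in usable if s.load <= strict_load_limit]
    return sorted(usable, key=lambda s: s.load)
```

In advisory mode the list can only be empty if every slave is marked out-of-service. In strict mode it can be empty purely because of load, which is the case that needs some way to wake the build up later.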

Of course, we'd have to do something more clever than just waiting until the
load drops below some threshold: I can just imagine 10 builds ready to go,
all waiting on the same thing, then all being started at the same time,
swamping the machine. There would have to be some better mechanism to allow
only one build at a time on that machine (spread across all the slaves that
are running there), and that build could only start when the system load
(possibly caused by non-buildbot processes) goes low enough.
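
One way to avoid that stampede could look like the sketch below: a per-machine gate that queues requests and admits a single build per poll, and only while the load is under the limit. `MachineGate` and its methods are hypothetical names, and the load source is passed in as a plain callable so the example stays self-contained:

```python
from collections import deque


class MachineGate:
    """Admits at most one build at a time, and only while the load is low."""

    def __init__(self, max_load, get_load):
        self.max_load = max_load
        self.get_load = get_load  # callable returning the current load average
        self.waiting = deque()    # builds waiting for this machine, FIFO
        self.running = None       # the one build currently admitted, if any

    def request(self, build_name):
        self.waiting.append(build_name)

    def poll(self):
        """Call periodically. Returns the build just started, or None."""
        if self.running is not None or not self.waiting:
            return None
        if self.get_load() > self.max_load:
            return None
        self.running = self.waiting.popleft()
        return self.running

    def finished(self, build_name):
        if self.running == build_name:
            self.running = None
```

With ten builds queued, a drop in load releases only the first one; the rest stay queued until `finished()` is called and the load is checked again. On a Unix host, `get_load` could plausibly be `lambda: os.getloadavg()[0]`, which also accounts for load from non-buildbot processes.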

> Brian, any ideas on this? Have you started any work on this? Is there
> anything to be done to help?

I made the briefest of starts on this several months ago, and then got
distracted. I'll attach the patch, but it's really only an early sketch and
probably won't actually be useful to anybody.

Hm, what's the best way to help.. I'm not sure. There's a good bit of code to
write, but first there's a good bit of design to be done, and I haven't
learned how to outsource that yet :).

If the slave-availability flag is more of an advisory thing, then we mainly
need to decide upon some means to change that flag (which could be a
buildslave admin pushing some yet-to-be-determined button to say "stop using
my slave", or a cron job, or a load-average measurer, or something), and then
finish the code included in the patch below to make that flag affect the slave
selection that takes place when a build is started. The big issue is what
happens when there are no slaves available: in that case, you want a slave
becoming available to trigger the build. We have code to make a slave
*connecting* trigger a build that's been stuck for a while, but this
slave-availability flag is a new ball game, and we need some new sort of
subscribe-to-hear-about-state-changes mechanism for it.
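
As a rough illustration of such a subscribe-to-hear-about-state-changes mechanism, here is a small observer-style sketch. `SlaveAvailability` and its methods are hypothetical names, not existing Buildbot API and not the contents of the patch:

```python
class SlaveAvailability:
    """Availability flag that notifies subscribers when its value changes."""

    def __init__(self, available=True):
        self.available = available
        self._subscribers = []

    def subscribe(self, callback):
        # callback(flag, available) is invoked on every real state change
        self._subscribers.append(callback)

    def unsubscribe(self, callback):
        self._subscribers.remove(callback)

    def set_available(self, available):
        changed = available != self.available
        self.available = available
        if changed:
            # iterate over a copy so callbacks may unsubscribe themselves
            for callback in list(self._subscribers):
                callback(self, available)
```

A Builder with a stuck build would subscribe to the flag of each candidate slave and retry slave selection when a callback reports `available=True`. Whatever sets the flag (an admin button, a cron job, a load measurer) only ever calls `set_available()`.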

To properly handle the use case above, we need some kind of external
load-checking agent (one per machine, used by multiple buildmasters) that
gets to dole out builds, one at a time. To that end, the load-checking agent
needs to know exactly which builders want to run a build. We need a
subscription mechanism between the buildbot and this agent, so the
buildmaster can tell it "I have builder A that wants to run. Please remember
this and notify me when you have enough spare capacity to handle my job". The
buildmaster can subscribe to several such agents (if you have redundant
slaves), and the first one to give us a slot causes the other requests to be
withdrawn. There's all sorts of distributed locking and potential error
conditions to beware of.. fun! :).
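
The subscribe-and-withdraw flow might be sketched as follows. This is an in-process toy with hypothetical names (`LoadAgent`, `request_slot`); it ignores networking, timeouts, and lost connections, which is where the real error conditions would live:

```python
class LoadAgent:
    """Hypothetical per-machine agent that hands out one slot at a time."""

    def __init__(self, name):
        self.name = name
        self.pending = []  # (builder, callback) pairs in arrival order
        self.busy = False

    def subscribe(self, builder, callback):
        self.pending.append((builder, callback))

    def withdraw(self, builder):
        self.pending = [(b, cb) for (b, cb) in self.pending if b != builder]

    def capacity_available(self):
        """Called when the machine has spare capacity; grants one slot."""
        if self.busy or not self.pending:
            return
        builder, callback = self.pending.pop(0)
        self.busy = True
        callback(self)

    def release(self):
        self.busy = False


def request_slot(builder, agents, on_granted):
    """Ask several agents for a slot; the first grant withdraws the rest."""
    state = {"granted": False}

    def granted(agent):
        if state["granted"]:
            agent.release()  # lost the race: hand the slot straight back
            return
        state["granted"] = True
        for other in agents:
            if other is not agent:
                other.withdraw(builder)
        on_granted(builder, agent)

    for agent in agents:
        agent.subscribe(builder, granted)
```

The `state["granted"]` check covers the case where a second agent grants a slot before its withdrawal has been processed. In a distributed setting that window is real, so the release-on-late-grant path matters.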

So, let's start by nailing down a couple of useful ways for this availability
flag to be controlled. The main use case I've heard so far is "make the
buildslave on my workstation be lowest-priority or completely unavailable
(but still connected for some reason) between 9am and 5pm". Simplistic
load-average is a close second, something like "do not start any builds on my
machine while the load average is above 3". Both still require that
subscription thing though, so that the Builder can be notified when the slave
becomes available next.
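
Both policies are simple to express as predicates. The sketch below uses the 9am-5pm window and the load limit of 3 from the examples above; the function names are hypothetical, and the notification side is deliberately left out:

```python
from datetime import time


def workday_policy(now, start=time(9, 0), end=time(17, 0)):
    """True if builds may start, i.e. 'now' is outside the 9am-5pm window."""
    return not (start <= now < end)


def load_policy(load_average, limit=3.0):
    """True if builds may start, i.e. the load average is not above the limit."""
    return load_average <= limit


def may_start_build(now, load_average):
    # Both policies must agree before a build is started on this machine.
    return workday_policy(now) and load_policy(load_average)
```

For example, `may_start_build(time(12, 0), 1.0)` is False because of the time window, and `may_start_build(time(20, 0), 3.5)` is False because of the load. Something still has to re-evaluate these predicates and tell the Builder when the answer flips.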

And let's talk about how this hypothetical load-average-monitoring agent
would be run. How common is it that we get multiple buildslaves per host? Or
have buildslaves on machines that are heavily loaded by non-buildbot uses? (I
have a machine at work which suffers from the latter problem all the time).
If we have a daemon that watches CPU load and is configured to only allow one
build at a time to run, and only when the load is low enough, then how should
that daemon connect to the buildbots? The buildslave could talk to it
directly, and relay its decisions back to the buildmaster. Or the
buildmasters could all connect to the daemon (probably via PB), and somehow
associate that particular daemon with a certain buildslave, allowing that
buildslave's availability and job scheduling to be controlled by the
load-monitoring daemon.

I really want to get something like this in for 0.7.6. It's probably the
second largest change on the roadmap (behind the #11 buildstep-construction
change), so it's got priority.

hope that helps,
 -Brian


-------------- next part --------------
A non-text attachment was scrubbed...
Name: slave-availability.patch
Type: text/x-diff
Size: 3820 bytes
Desc: slave availability patch
URL: <http://buildbot.net/pipermail/devel/attachments/20070417/6eb45e80/attachment.bin>

