[Buildbot-devel] slave level locks

Stefan Seefeld seefeld at sympatico.ca
Tue Apr 17 14:01:37 UTC 2007


Hi Brian,

I'm very excited to see you pick up this discussion. Keep it going! :-)

Brian Warner wrote:
>> Another question, then, is how to react to high loads. Using locks
>> suggests simply serializing access to the machine, i.e. queuing builders.
>> However, in a build farm multiple equivalent machines may be available,
>> so a real dispatcher is needed that can send build requests to one among
>> many hosts.
> 
> Good point.
> 
> I had been thinking of the slave-availability thing as being more advisory
> than strict, i.e. put the heavily-loaded buildslave on the bottom of the
> list, but not actually forbid using it unless the buildslave admin marks it
> as being out-of-service. But I can see how a load-average-measuring process
> could tell us that no builds should be started on that slave at all until the
> load dropped.

Couldn't that become a configurable strategy, per buildslave? That may sound
more complex than it actually is. I'm really just thinking of letting the
buildmaster / scheduler ping the slave for its willingness to start a builder
at a given point in time. How the slave determines that should be pretty much
irrelevant in that context. The key changes are that a) there are potentially
multiple slaves a scheduler can dispatch to, and b) builders may be queued
if buildslave resources are limited.
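A rough Python sketch of that willingness ping. The names here (`Slave`, `is_willing`, `dispatch`) are illustrative assumptions, not actual Buildbot API; the point is only that the master sees a yes/no answer and queues the request when nobody says yes:

```python
import os

class Slave:
    """Hypothetical slave that decides for itself whether to take a build."""

    def __init__(self, name, max_load=2.0):
        self.name = name
        self.max_load = max_load

    def current_load(self):
        # one-minute load average; fall back to 0.0 where unavailable
        try:
            return os.getloadavg()[0]
        except (OSError, AttributeError):
            return 0.0

    def is_willing(self):
        # how the slave decides is its own business; the master only
        # ever sees the boolean answer
        return self.current_load() < self.max_load


def dispatch(build_request, slaves, queue):
    """Send the request to the first willing slave, else queue it."""
    for slave in slaves:
        if slave.is_willing():
            return slave              # start the build on this slave
    queue.append(build_request)       # no capacity: queue the builder
    return None
```

Note that the master never inspects the load itself; swapping in a time-of-day check or an admin-controlled flag on the slave side would need no master changes at all.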

> Of course, we'd have to do something more clever than just waiting until the
> load drops below some threshold: I can just imagine 10 builds ready to go,
> all waiting on the same thing, then all being started at the same time,
> swamping the machine. There would have to be some better mechanism to allow
> only one build at a time on that machine (spread across all the slaves that
> are running there), and that build could only start when the system load
> (possibly caused by non-buildbot processes) goes low enough.

Well, builders should be started one at a time, allowing the load to be
measured before each individual attempt to start a new one. However, the load
may obviously only climb in the course of a build, not right at its start.
I'd suggest not getting too clever here, and instead letting users implement
any additional logic themselves, for example by configuring schedulers so
that certain builds run only sequentially, if they so wish.
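The "run only sequentially" idea can be sketched as a per-machine slot that every build must acquire before starting. This is a minimal illustration with an assumed `HostSlot` helper, not Buildbot's actual lock API:

```python
import threading

class HostSlot:
    """Caps concurrent builds on one machine, whoever requested them."""

    def __init__(self, max_builds=1):
        self._sem = threading.Semaphore(max_builds)

    def run_build(self, build):
        # blocks until a slot is free, so at most `max_builds`
        # builds run at once on this host
        with self._sem:
            return build()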

>> Brian, any ideas on this ? Have you started any work on this ? Is there
>> anything to be done to help ?
> 
> I made the briefest of starts on this several months ago, and then got
> distracted. I'll attach the patch, but it's really only an early sketch and
> probably won't actually be useful to anybody.
> 
> Hm, what's the best way to help.. I'm not sure. There's a good bit of code to
> write, but first there's a good bit of design to be done, and I haven't
> learned how to outsource that yet :).

:-)

I agree it's important to make sure we are all on the same page design-wise
before we jump into the code. It's quite easy to make a mess of things otherwise.

> If the slave-availability flag is more of an advisory thing, then we mainly
> need to decide upon some means to change that flag (which could be a
> buildslave admin pushing some yet-to-be-determined button to say "stop using
> my slave", or a cron job, or a load-average measurer, or something), and then
> finish the code included in the patch below to make that flag affect the slave
> selection that takes place when a build is started. The big issue is what
> happens when there are no slaves available: in that case, you want a slave
> becoming available to trigger the build. We have code to make a slave
> *connecting* trigger a build that's been stuck for a while, but this
> slave-availability flag is a new ball game, and we need some new sort of
> subscribe-to-hear-about-state-changes mechanism for it.

I guess either the buildmaster or the slave needs to ping the other side. If
we were talking about discrete state changes, the slave could just notify the
master. However, if we are talking about measuring current loads, that is more
suitably done by a regular ping. I'd suggest letting buildslaves be responsible
for sending state updates. That way, how the state (e.g. load) is measured is
completely up to the slave, and the master only needs to take into account
whatever the slave tells it about itself.
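The slave-side reporting loop could be as simple as the following sketch, where `report` stands in for whatever RPC the master exposes (an assumed callback, not a real Buildbot call):

```python
import os
import time
import threading

def measure_load():
    """One-minute load average, or 0.0 where the OS can't tell us."""
    try:
        return os.getloadavg()[0]
    except (OSError, AttributeError):
        return 0.0

def report_loop(report, interval=30.0, stop=None):
    """Periodically push the local load to the master.

    `report` is the master-provided callback; `stop` is an optional
    threading.Event that ends the loop.
    """
    while stop is None or not stop.is_set():
        report(measure_load())
        time.sleep(interval)
```

Running this in its own thread on the slave keeps the measurement entirely local, exactly as argued above: the master only ever consumes the numbers it is sent.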

> To properly handle the use case above, we need some kind of external
> load-checking agent (one per machine, used by multiple buildmasters) that
> gets to dole out builds, one at a time. To that end, the load-checking agent
> needs to know exactly which builders want to run a build. We need a
> subscription mechanism between the buildbot and this agent, so the
> buildmaster can tell it "I have builder A that wants to run. Please remember
> this and notify me when you have enough spare capacity to handle my job". The
> buildmaster can subscribe to several such agents (if you have redundant
> slaves), and the first one to give us a slot causes the other requests to be
> withdrawn. There's all sorts of distributed locking and potential error
> conditions to beware of.. fun! :).

I'm not quite sure. Why do you think a 'load-checking agent' needs to be
a separate process? Can't the buildslave itself use some system calls
(if available) to measure the load?

Also, why do you expect this agent to know about builders? Why should it
talk to anybody other than the local buildslaves at all?

> So, let's start by nailing down a couple of useful ways for this availability
> flag to be controlled. The main use case I've heard so far is "make the
> buildslave on my workstation be lowest-priority or completely unavailable
> (but still connected for some reason) between 9am and 5pm". Simplistic
> load-average is a close second, something like "do not start any builds on my
> machine while the load average is above 3". Both still require that
> subscription thing though, so that the Builder can be notified when the slave
> becomes available next.

Right, there are a lot of different ways (and reasons) a buildslave may change its
availability over time. However, I maintain that all of this could be handled locally,
i.e. by individual slaves. (The slave admin may configure particular policies when
starting the slave.)
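The two use cases quoted above (a 9am-5pm blackout and a load-average threshold) can both be expressed as small, composable policies the slave admin picks at startup. The policy names and the `combine` helper are assumptions for illustration:

```python
import os
from datetime import datetime

def office_hours_policy(start=9, end=17):
    """Unavailable between `start` and `end` o'clock, local time."""
    def available(now=None):
        hour = (now or datetime.now()).hour
        return not (start <= hour < end)
    return available

def load_policy(threshold=3.0):
    """Unavailable while the one-minute load average is too high."""
    def available(now=None):
        try:
            return os.getloadavg()[0] < threshold
        except (OSError, AttributeError):
            return True   # no load information: assume available
    return available

def combine(*policies):
    """Slave is available only when every configured policy agrees."""
    def available(now=None):
        return all(p(now) for p in policies)
    return available
```

Since each policy is just a callable returning a boolean, the master-facing answer stays a plain yes/no while the decision logic remains entirely the slave's.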

> And let's talk about how this hypothetical load-average-monitoring agent
> would be run. How common is it that we get multiple buildslaves per host? Or
> have buildslaves on machines that are heavily loaded by non-buildbot uses? (I
> have a machine at work which suffers from the latter problem all the time).

I would expect it to be quite common. For example, think of a compile farm
serving multiple projects, not all of them using a buildbot, or at least not
all using the same buildbot. They shouldn't need to care about each other.

> If we have a daemon that watches CPU load and is configured to only allow one
> build at a time to run, and only when the load is low enough, then how should
> that daemon connect to the buildbots? The buildslave could talk to it
> directly, and relay its decisions back to the buildmaster. Or the
> buildmasters could all connect to the daemon (probably via PB), and somehow
> associate that particular daemon with a certain buildslave, allowing that
> buildslave's availability and job scheduling to be controlled by the
> load-monitoring daemon.

What's the rationale for suggesting a separate daemon, as opposed to letting
buildslaves obtain that number by themselves? (They may need a new thread
that does nothing but watch and report the load, but that should suffice.)

Thanks,
		Stefan

-- 

      ...ich hab' noch einen Koffer in Berlin...



