[Buildbot-devel] database-backed status/scheduler-state project
l10n.moz at googlemail.com
Thu Sep 3 10:07:19 UTC 2009
first of all, woot :-)
Let me chime in a bit with how I see our challenges.
1) Mozilla's buildbots run on some 270 hg repositories, with no common
2) We're running 8 branches, each on probably 5 platforms (win, linux, mac,
mobile arm linux, wince).
3) We're running 3 main compilation slave pools, plus a flock of dedicated
perf test pools. These pools contain dozens of slaves each.
4) Our main status interface is tinderbox.
5) We're running almost 80 localizations on three branches, some 30 on two
others, plus what momo does on their bits.
6) We have a flock of infrastructure code for the slaves.
Let me address 1) first, as the bulk of those repositories is mine (l10n). We have
a bunch of buildmasters running against most of those repos: moco does, I
do, momo does, and I don't know who else has daemons polling.
Moving to a changeservice per installation will not really solve our load
here, while adding centralized change management to the hg server will.
That's not rocket science either; I have a db for that in place. I
haven't gotten actual review on it yet, nor does it have hg hooks or a web
API so far. I just have a changesource that queries the db directly. I use
this db for other things as well, so "it's good to have".
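To make the shape of that concrete, here's a minimal sketch of a db-backed change store as described above: an hg hook appends one row per push, and each buildmaster's changesource polls the db with a cursor instead of polling 270 repositories. The table name, columns, and cursor scheme are all my assumptions for illustration, not the actual schema.

```python
# Hypothetical db-backed change store (schema and API are illustrative only).
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS pushes (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    repo TEXT NOT NULL,
    node TEXT NOT NULL,
    author TEXT NOT NULL
)
"""

def record_push(conn, repo, node, author):
    """What a pushlog-style hg hook would do: append one row per push."""
    conn.execute("INSERT INTO pushes (repo, node, author) VALUES (?, ?, ?)",
                 (repo, node, author))
    conn.commit()

def poll_changes(conn, last_seen_id):
    """What a changesource would do: fetch only rows newer than its cursor."""
    rows = conn.execute(
        "SELECT id, repo, node, author FROM pushes WHERE id > ? ORDER BY id",
        (last_seen_id,)).fetchall()
    new_cursor = rows[-1][0] if rows else last_seen_id
    return rows, new_cursor

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    record_push(conn, "l10n-central/de", "abc123", "axel")
    record_push(conn, "mozilla-central", "def456", "someone")
    rows, cursor = poll_changes(conn, 0)
    print(len(rows), cursor)  # prints "2 2": two pushes, cursor at id 2
```

The point of the cursor is that N masters polling this one table cost far less than N masters polling every repository.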
On to the load issues:
Not all load we're seeing on the master is "real". We're spending a
significant amount of resources just to keep tinderbox alive and not totally
insulting humanity. For one, we have a flock of build steps that do nothing
but add noise to the build, dumbed down enough for tinderbox to read it
out, aka the TinderboxPrint stuff. In regular nightly builds, that's 3 steps out
of 30; for the l10n repacks, it's 7 out of 40. The other part is the
build-done status mail, which gets all logs and headers for all builds,
cats them together, bzip2's the result, and sends it out as a mail.
Tinderbox just really needs to die.
Another load issue we're creating is that we're doing slave maintenance in
each and every build (including the 80 l10n builds per branch every night).
We have between 2 and 3 (I think) repos on hg that we clobber and clone
from scratch for every build. It doesn't take long, but it's overall system
load we're incurring because we're not using something good for slave
maintenance.
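The "something good" could be as simple as preferring an incremental pull over the clobber-and-clone the builds do today. A sketch, with hypothetical paths and repo URLs (nothing here is our actual step code):

```python
# Illustrative only: pick the cheap incremental path when a clone already
# exists, and fall back to the expensive full clone otherwise.
import os

def update_command(workdir, repo_url):
    """Return the hg command a maintenance step would run for this workdir."""
    if os.path.isdir(os.path.join(workdir, ".hg")):
        # cheap path: fetch only new changesets into the existing clone
        return ["hg", "pull", "-u", "-R", workdir, repo_url]
    # expensive fallback: full clone, which is what every build does today
    return ["hg", "clone", repo_url, workdir]
```

Multiply the saved clone by 80 l10n builds per branch per night and it adds up.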
A big load issue right now are the changesources, but I got those covered.
If there are more master load issues, why not spin up more masters? Riiiight,
let me detail that.
We don't have any load issues on the slaves. We may not have enough
slaves, and disks, but in general, each slave is fine. That's because we're
using generic slaves per platform that can do basically any build on that
platform, independent of branch. We're setting the builds limit to 1 per
slave, and our slaves build what needs to be built.
Adding masters breaks that.
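A toy model of the pooling described above, just to make the constraint explicit (slave and builder names are made up): any slave can take any build for its platform, and each slave runs at most one build at a time. Splitting the pool across masters breaks exactly this greedy matching, because no master sees the whole pool.

```python
# Toy scheduler: greedily map pending (builder, platform) requests onto
# idle slaves of the matching platform, at most one build per slave.
def assign(pending, idle_slaves):
    assignments = []
    for builder, platform in pending:
        for slave in list(idle_slaves):
            if slave["platform"] == platform:
                assignments.append((builder, slave["name"]))
                idle_slaves.remove(slave)  # the builds limit of 1 per slave
                break
    return assignments
```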
While I like the general ideas on how to make the masters more stable and
flexible, I think that the more immediately winning move for Mozilla would
be to make the slave pool support better.
Most importantly, making a single slave instance work with multiple masters,
and making the locks work across masters. I'd go as far as saying that we only
really need to make maxbuilds work right; I don't think that builders of the
same name on different masters should be considered the same build. There's a
small detail in how to create local work dirs on the slave, and the masters
would need unique and constant names, but that's all that'd be needed, IMHO.
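That "small detail" about local work dirs amounts to namespacing builder directories by master, which is why the masters would need unique and constant names. A sketch with hypothetical names:

```python
# Illustrative only: same-named builders on different masters must land in
# different directories on a shared slave, so key the path on the master name.
import os

def builder_workdir(basedir, master_name, builder_name):
    """Per-master namespace: <basedir>/<master>/<builder>."""
    return os.path.join(basedir, master_name, builder_name)
```

With that, `linux-nightly` from master-a and `linux-nightly` from master-b never collide on the same slave.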
That said, here's my wishlist:
- slave sharing
- status db, so that we can kill tinderbox
- build restarts with properties right (*)
- log server sounds great, but I'd need a richer API. Full support for
chunks, and for clients creating caches.
(*) is rather tricky right now, as every other feature invents its own
sources. I thought we could use those to figure out which properties are set
by the build and which are set at request time, but it doesn't really work
with triggers and whatnot.
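On the log-server wishlist item: here's a guess at what "full support for chunks, and for clients creating caches" could mean in practice. The API below is invented for illustration, not any proposed log server's interface: logs are ordered chunks, and a client polls with the number of chunks it already holds, so a cache only ever transfers the tail.

```python
# Hypothetical chunked-log API: the client tells the server how many chunks
# it has cached, and gets back only what it is missing.
def fetch_new_chunks(log_chunks, cached_count):
    """Return the chunks the client is missing and its new cache size."""
    new = log_chunks[cached_count:]
    return new, cached_count + len(new)

class LogCache:
    """Client-side cache that incrementally mirrors a growing log."""
    def __init__(self):
        self.chunks = []

    def sync(self, log_chunks):
        new, _ = fetch_new_chunks(log_chunks, len(self.chunks))
        self.chunks.extend(new)
        return len(new)  # how many chunks were actually transferred
```

Repeated syncs against a growing log then cost only the new chunks, which is what makes client caches worthwhile.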
Axel, who hopes that's digestible.