[Buildbot-devel] Mozilla moving away from buildbot?

Marc-Antoine Ruel maruel at chromium.org
Mon Mar 10 17:54:02 UTC 2014


Dan is right, we are doing something which sounds similar to what Mozilla
is investigating. I think it's worth collaborating if there is interest.

We took an approach similar to what is done internally at Google, but in a
way that would work efficiently on Windows. The steps involved are:
- List all the files needed to run each single step; a step should be an
individual test
- Archive these files into a remote cache (isolate server)
- Trigger a swarming task to run on a specific machine for each step. The
task itself can be split up if the task knows how to shard itself.
- Wait for results for each step
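
To make the first two steps concrete, here is a minimal sketch of listing files by content hash and archiving only what the cache does not already hold. It is not the actual isolate code: `build_manifest`, `archive` and the in-memory `store` dict standing in for the isolate server are all hypothetical names chosen for illustration.

```python
import hashlib
import os


def sha1_of_file(path):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(64 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, relative_paths):
    """Map each file a step needs to the hash of its content."""
    return {rel: sha1_of_file(os.path.join(root, rel)) for rel in relative_paths}


def archive(root, manifest, store):
    """Upload only content the store does not already hold.

    `store` is a plain dict standing in for the remote cache (an assumption).
    Returns the number of blobs actually uploaded.
    """
    uploaded = 0
    for rel, digest in manifest.items():
        if digest not in store:
            with open(os.path.join(root, rel), "rb") as f:
                store[digest] = f.read()
            uploaded += 1
    return uploaded
```

Because the store is keyed by content hash, two files with identical content are uploaded once, and archiving an unchanged tree a second time uploads nothing.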

On the swarming bot:
- Grab the list of files to download and their SHA-1 hashes
- Download the files not already in its local cache
- Create a temporary tree of hardlinks to recreate the tree (read only)
- Run the task
- Upload results back
- Delete the temporary tree
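
The bot-side loop above can be sketched roughly as follows. This is an illustration under assumptions, not the swarming bot's real code: `materialize`, `run_task`, the `fetch` callable standing in for the download, and the `work_dir` parameter are hypothetical, and checking the hash after download is my assumption. The cache and the temporary tree must live on the same filesystem for hardlinks to work.

```python
import hashlib
import os
import shutil
import stat
import subprocess
import tempfile


def materialize(manifest, cache_dir, work_dir, fetch):
    """Recreate the tree described by `manifest` as hardlinks into the cache.

    manifest: {relative_path: sha1_hex}
    fetch: callable(sha1_hex) -> bytes, a stand-in for the download (assumption)
    Returns (temporary_root, number_of_downloads).
    """
    os.makedirs(cache_dir, exist_ok=True)
    os.makedirs(work_dir, exist_ok=True)
    root = tempfile.mkdtemp(prefix="run_", dir=work_dir)
    downloads = 0
    for rel, digest in manifest.items():
        cached = os.path.join(cache_dir, digest)
        if not os.path.exists(cached):
            # Cache miss: download, check the content against its hash, store it.
            data = fetch(digest)
            if hashlib.sha1(data).hexdigest() != digest:
                raise ValueError("content does not match " + digest)
            with open(cached, "wb") as f:
                f.write(data)
            # Read-only, since every hardlink shares the same content.
            os.chmod(cached, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
            downloads += 1
        dest = os.path.join(root, rel)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        os.link(cached, dest)
    return root, downloads


def run_task(manifest, cache_dir, work_dir, fetch, command):
    """Map the files in, run the command in the tree, then delete the tree."""
    root, downloads = materialize(manifest, cache_dir, work_dir, fetch)
    try:
        proc = subprocess.run(command, cwd=root, capture_output=True)
        return proc.returncode, proc.stdout, downloads
    finally:
        # Only the links are removed; the cached content stays for the next task.
        shutil.rmtree(root)
```

Running the same manifest twice downloads each file only the first time; the second run is just hardlink creation, which is where the cache hit rate pays off.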

Performance depends on the number of fresh files that have to be downloaded
for each step. We're at around a 98% cache hit rate.

Swarming bots poll the server; there's no master process, so there is never
a master to restart. There's no build or step dependency. Every step has to
be idempotent. The slaves tolerate the server returning HTTP 500s for several
minutes without user-visible impact.
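
As an illustration of how a polling bot can ride out several minutes of server errors, here is a small retry loop with exponential backoff. It is a sketch under assumptions: `poll_for_task`, `ServerError` and the delay values are hypothetical and not taken from the swarming code; the `request` callable stands in for the HTTP call.

```python
import time


class ServerError(Exception):
    """Stand-in for an HTTP 5xx response (hypothetical)."""


def poll_for_task(request, max_wait=600.0, first_delay=1.0, max_delay=60.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Ask the server for work, backing off while it keeps failing.

    request: callable() -> task, raises ServerError on a 5xx (assumption)
    Returns the task, or None if the next retry would pass `max_wait` seconds.
    """
    deadline = clock() + max_wait
    delay = first_delay
    while True:
        try:
            return request()
        except ServerError:
            if clock() + delay > deadline:
                return None
            sleep(delay)
            # Double the wait each time, up to a ceiling.
            delay = min(delay * 2, max_delay)
```

Since every step is idempotent, retrying a request that may or may not have gone through is safe, which is what keeps a loop like this simple.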


Setting this up is involved, so it is only worth it with a very high
execution load.


We integrate with buildbot. The way it works is that all the control is
deferred to the buildbot slave. Upon a build, the slave:
- Triggers all the tests to run
- Tells the master the expected steps that will have output
- Asks the swarming server for results for all the steps simultaneously
- Fills stdio and marks the steps as closed as the tasks complete, out of
order

This is done through a hack that we call "annotated builds" where a slave
can fake steps dynamically.
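
The "ask for everything at once, close steps as they finish" part can be sketched with a thread pool. This is only an illustration: `collect_out_of_order`, `fetch_result` and `on_step_done` are hypothetical names, and the real integration goes through annotated builds rather than Python callbacks.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def collect_out_of_order(steps, fetch_result, on_step_done):
    """Wait on every step at once and report each one as soon as it finishes.

    steps: list of step names
    fetch_result: callable(name) -> output, blocks until that step is done
                  (a stand-in for asking the swarming server; assumption)
    on_step_done: callable(name, output), e.g. fill stdio and close the step
    Returns the step names in completion order.
    """
    finished = []
    with ThreadPoolExecutor(max_workers=max(1, len(steps))) as pool:
        pending = {pool.submit(fetch_result, name): name for name in steps}
        # as_completed yields futures in the order they finish,
        # not the order they were submitted.
        for future in as_completed(pending):
            name = pending[future]
            on_step_done(name, future.result())
            finished.append(name)
    return finished
```

A slow step listed first does not hold back the reporting of a fast step listed after it.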


The homepage of the swarming project without the chromium-specific stuff is
at https://code.google.com/p/swarming/

I made some graphs at
http://dev.chromium.org/developers/testing/isolated-testing/infrastructure
including the buildbot integration points.

We're using it with a few hundred slaves just fine.

Side notes:
- The native sharding is useful to reduce latency. For example, if you have
a test suite that takes 40 minutes to run but you can tell it to run
exactly 1/5 of the suite, then swarming can trigger 5 subtasks on 5 slaves
under the hood and aggregate the output, so you get the results in 8
minutes + overhead. Overhead is kept low by using a content-addressed cache.
- There is no stateful server (!); all the processing is done through HTTP
requests and direct writes to the DB.
- I'm toying with the idea of rewriting it in Go. I've already started with
the isolate server so that I could run it standalone and not be forced to
use AppEngine. This is not completed yet.
- This doesn't replace buildbot, it complements it. It's a test
acceleration mechanism, so checkout and compilation are still normally done
on the buildbot slave. One thing it enables is to easily do "build once,
run on multiple OS versions".
- We are still using IP whitelisting for slaves but are moving to
OAuth2-based authentication. Users can now authenticate via OAuth2 to
trigger tasks.
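
To illustrate the sharding note at the top of this list, here is one simple way a test suite could pick "exactly 1/N" of itself, plus the latency arithmetic. It is a sketch only: `shard` and `estimated_latency` are hypothetical helpers, and interleaving a sorted test list is just one possible splitting scheme, not necessarily the one our test runners use.

```python
def shard(tests, index, total):
    """Return the tests that shard `index` (0-based) of `total` should run.

    Sorting first makes every shard agree on the order, so the shards are
    disjoint and together cover the whole suite.
    """
    if not 0 <= index < total:
        raise ValueError("index must be in [0, total)")
    return sorted(tests)[index::total]


def estimated_latency(suite_minutes, shards, overhead_minutes=0.0):
    """Wall-clock estimate when the suite is split evenly across shards."""
    return suite_minutes / shards + overhead_minutes


tests = ["test_%02d" % i for i in range(10)]
parts = [shard(tests, i, 5) for i in range(5)]
print(parts[0])                   # ['test_00', 'test_05']
print(estimated_latency(40, 5))   # 8.0
```

The estimate assumes the tests take roughly equal time; in practice the slowest shard sets the latency.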

M-A


2014-03-05 12:58 GMT-05:00 Dan Kegel <dank at kegel.com>:

> I seem to recall Google did something similar.
> - Dan
>
> On Wed, Mar 5, 2014 at 9:55 AM, Dustin J. Mitchell <dustin at v.igoro.us>
> wrote:
> > I can fill in a little background here.  Mozilla's been using the same
> > version of Buildbot - 0.8.2, for something like four years now.  So,
> > in a way, Mozilla hasn't been using Buildbot for several years now.
> >
> > Work to move away from or beyond Buildbot is really just an effort to
> > modernize.  We're faced with the choice of contributing to a project
> > which serves a large pool of users, adding features in a way that they
> > can be accepted upstream, or starting fresh and building a tool that
> > solves exactly the problem you have and no more.  Framed that way, the
> > choice to try to implement a custom CI system makes a lot of sense
> > (much as I wish it was otherwise).  The result will, of course, be
> > open source, but as always, there's a big gulf between "you can modify
> > and distribute the source" and "this might actually be useful to you".
> >
> > It's important to note, too, that this is an experimental initiative
> > within Mozilla.  Sometimes things don't work out.  TaskCluster might
> > turn out to be impractical in the face of the crazy detailed
> > requirements release engineers are familiar with.  Time will tell.
> >
> > Dustin