[Buildbot-devel] RFC: Crowd-sourcing CI using volunteer computing

Wed Mar 27 20:15:51 UTC 2013

Hi

We solve a somewhat similar problem with the Isolate+Swarm project. We are
working on it to fully distribute the work items for Chromium Continuous
Integration masters and pre-commit masters. The goal here is to run all the
tests in parallel, so that getting the test results is O(1) as long as you
have enough slaves.

Swarm is doing automatic distribution (and even sharding) based on a *
dynamic* pool of slaves, where the selection is done depending on the
advertised slave properties, e.g. what OS is it, in which vlan is it
located, etc. Swarm's design is radically different from Buildbot framework
so Swarm can't be easily coupled inside buildbot. Swarm's philosophy is
based on Google's internal tools. Swarm is fully axed towards sharding
individual steps, e.g. running a single unit test on a single OS. The Swarm
slaves poll the master for jobs instead of buildbot's push mechanism. So we
use a pool of slaves that make the bridge between buildbot and Swarm. The
bridge is not super efficient but it works.

Isolate describes all the runtime dependencies, based on a set of variables
(OS, built with specific build flags changing which files are loaded) so it
is possible to package the tests in an efficient way. The server part is a
Content Addressed Storage LRU cache so each file is stored named by its
SHA-1. A json file named .isolated contain all the files necessary for the
work item, the command an dmore. Then the SHA-1 of the .isolated is the
only thing needed to download the full content. Isolate doesn't know about
Swarm so each can be used independently.

Both are under heavy development and we are rolling them on the Chromium
infrastructure.

Design docs
http://chromium.org/developers/testing/isolated-testing/design
http://chromium.org/developers/testing/isolated-testing/swarm/design

Client scripts:
http://git.chromium.org/gitweb/?p=chromium/tools/swarm_client.git under BSD

Isolate server (CAS cache)
https://src.chromium.org/viewvc/chrome/trunk/tools/isolate_server/ under BSD

Swarm Server
https://code.google.com/p/swarming/ under Apache 2.0

We don't have a proper a "*self-contained swarm slave package*" yet as
Stefan discusses, we just didn't take the time to do it yet since we still
tightly control our slaves for now but it's something we'd like to have.

Also, the code assumes Google AppEngine, even if it would totally be
possible to rip AE references out, it's not done at the moment. This would
be a significant amount of work. I'm not sure how much could be done with
the free AE quota, probably not much.

Overall, I don't think it would be easy to bake Swarm-equivalent
functionality into buildbot. The design differences are quite fundamental
as they solve different problems. Some nice things would be doable in
buildbot when taken independently:
- Accepting any slave without not requiring a predetermined list. This is
more complex than it looks like because buildbot needs to figure our to
which builder attach the slave.
- Tolerate TCP teardown. That's something that has bitten often so Swarm
doesn't require a TCP connection for the whole build process and works all
over HTTP.

Then anyone could just use the Isolate part and recreate something similar
within buildbot. One of the issue we faced is when we scale a master over
~500 slaves, it starts to require a relatively beefy server to run buildbot
on. Our Try Server runs at over 700 slaves and it suffers. Swarm, by its
simplistic nature, is designed to cope better with a large number of
slaves. But on the other end, it's *useless* to manage a
checkout-compile-test build.

If you have inquiries, you can email me directly or to the following group:
https://groups.google.com/a/chromium.org/forum/?fromgroups=#!forum/isolate. Be
sure to join before posting. I'd prefer to not pollute buildbot-dev@ any
more than that.

Thanks,

M-A

2013/3/12 Dustin J. Mitchell <dustin at v.igoro.us>

> Why the focus on offline builds?  I think that's actually the hardest
> part of this project.
>
> I don't think "pulling jobs" requires a pre-baked build script.  In
> some sense, it might work already: when a slave is available, it
> connects to the master, and if there's work to do the master will
> start a job.  There are some difficulties here in that a slave doesn't
> have a way to know that it's finished a job, since it only sees
> commands.  So the next least-complicated fix is to alter the protocol
> so that the slave can reason about builds, too -- a way to ask a
> connected slave "hey are you ready to start a build" and then
> notifications that a build is starting and ending.  Then the slave can
> schedule its jobs appropriately, even if it's connected to multiple
> masters.
>
> Another possibility is to implement status-receiver-only builds.  That
> is, an API for slaves to provide all of the relevant data about a
> build that they performed independently.  So that would include API
> calls to create a new build (including sourcestamps, properties,
> etc.), create a new step in that build, and create logs for each step.
>  Then slaves can run their own scheduling algorithm -- the simplest of
> which, used by Tinderbox, is 'while true; do ..; done'.
>
> Historically, we've gotten ourselves in a lot of trouble by making
> fundamental changes to Buildbot that are incomplete or have arbitrary
> limits.  For example, latent buildslaves have a number of nasty (and
> sometimes expensive) gotchas.  I think we got codebases (mostly)
> right, but the development and review for that took Harry several
> months of nearly full-time work.  So I think we should try to avoid
> fundamental changes as much as possible, and where they are made,
> implement them completely.
>
> Dustin
>
>
> ------------------------------------------------------------------------------
> Symantec Endpoint Protection 12 positioned as A LEADER in The Forrester
> Wave(TM): Endpoint Security, Q1 2013 and "remains a good choice" in the
> endpoint security space. For insight on selecting the right partner to
> tackle endpoint security challenges, access the full report.
> http://p.sf.net/sfu/symantec-dev2dev
> _______________________________________________
> Buildbot-devel mailing list
> Buildbot-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/buildbot-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://buildbot.net/pipermail/devel/attachments/20130327/941d293f/attachment.html>