[Buildbot-devel] How to implement some sort of "speculative builds"?

Mon Mar 2 11:20:29 UTC 2015

Hello,

there is a patch I'd like to make to my team's buildbot deployment,
and I come here to ask for advice.

This turned out to be a long-ish message, so here are the contents:

1. The problem I am trying to solve
2. The problem I am not trying to solve
3. The improvement I want to implement
4. Hints?

== 1. The problem I am trying to solve ==

* Our build plus testing takes about 90 minutes.
* Every incoming pull request passes through the build+test dance.
  If tests pass, it is then merged to our master repository.
  (Actually, the merge happens -before- testing.
  If all is green, it is then pushed to master)
* Pull requests are processed (merge -> build -> test -> possibly pushed)
  in a sequential fashion. This is because we like to know that
  each merge in the master repository passed the tests.
  If, say, we were to process multiple pull requests simultaneously,
  we could end up in the case where two heads work separately
  but their merge doesn't, and we could not say whose fault it is.
* At peak time our buildbot queue can have ~20 pull requests.

This all implies that after I submit my thing, it takes forever
to know if it's pushed or not (a day, sometimes two).
As a result everybody is pretty unhappy.
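
The "a day, sometimes two" figure follows from simple arithmetic. A small
sketch, assuming strictly sequential processing with no idle time between
builds (the function name `wait_hours` is only illustrative):

```python
BUILD_MINUTES = 90  # "about 90 minutes" per build plus test
QUEUE_LENGTH = 20   # "~20 pull requests" at peak time


def wait_hours(position: int, build_minutes: int = BUILD_MINUTES) -> float:
    """Hours until the pull request at a 1-based queue position is done,
    assuming one build at a time and no idle time in between."""
    return position * build_minutes / 60


# Last pull request in a full queue: 20 * 90 minutes = 30 hours.
print(wait_hours(QUEUE_LENGTH))
```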

== 2. The problem I am not trying to solve ==

At this point everybody steps up and says that
a non-regression test suite that lasts 1 hour isn't acceptable,
that it shouldn't last longer than its longest test scenario,
or something along these lines.

I certainly agree on all these points, and in the
background I am looking into ways to cut the fat out of that
monstrosity. But I also need to be pragmatic and
make sure that my team can keep releasing every day, now.

== 3. The improvement I want to implement ==

While I don't plan to speed up the tests for now,
I do observe that they consume almost no resources
(profiling shows flat graphs for CPU, memory and I/O;
the thing basically spends its time waiting).

So, I'd like to run multiple builds+tests at once,
without sacrificing the nice property "if it's broken
you can always tell what's the bad pull request: the last one".

Here is an example of the strategy I have in mind:

Say you have 7 pull requests in the queue,
PR1, PR2, PR3, PR4, PR5, PR6 and PR7 and 4 buildbot slaves.
What I want to do is to tell my slave1 to process PR1
(i.e., merge it with master and see what it gives).
Slave2 would -not- take PR2 alone, but the merge of PR1+PR2.
Slave3 will go with the merge PR1+PR2+PR3 and slave4 will
take the merge PR1+PR2+PR3+PR4.

slave1: master + PR1
slave2: master + PR1 + PR2
slave3: master + PR1 + PR2 + PR3
slave4: master + PR1 + PR2 + PR3 + PR4
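
The assignment above can be sketched in a few lines of Python. This is
not Buildbot code: the function name `plan_speculative_builds` and the
representation of pull requests as plain strings are assumptions made
for illustration only.

```python
from typing import List, Sequence


def plan_speculative_builds(queue: Sequence[str], n_slaves: int) -> List[List[str]]:
    """Return one cumulative list of pull requests per available slave.

    The i-th entry holds the first i+1 queued pull requests, i.e. what that
    slave should merge on top of master before building and testing.
    """
    width = min(len(queue), n_slaves)
    return [list(queue[: i + 1]) for i in range(width)]


queue = ["PR1", "PR2", "PR3", "PR4", "PR5", "PR6", "PR7"]
for slave, prs in enumerate(plan_speculative_builds(queue, 4), start=1):
    print(f"slave{slave}: master + " + " + ".join(prs))
```

If the queue is shorter than the number of slaves, the sketch simply
leaves the extra slaves idle.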

It's kind of like speculative execution in pipelined processors:
you do some work ahead of time assuming a given result
for "branching points".
Here, I am building PR1 + PR2 as if I knew that PR1 passes the tests
(which I don't). If I guessed right, the throughput of my system increases.
If not, I'll have to discard some of the work I did.

Now, after these "almost simultaneous" builds complete, say that results
turn out to be:

master + PR1                    :  pass
master + PR1 + PR2              :  pass
master + PR1 + PR2 + PR3        :  fail
master + PR1 + PR2 + PR3 + PR4  :  pass

Here my policy would be: push PR1+PR2, since it's the
"last uninterrupted success" in my series of builds
(I consider them to have "an order", even if they happen
simultaneously, and the order is "thinner to fatter"
in terms of the changes they carry).
Then I see that PR1+PR2+PR3 fails. It smells bad:
given the circumstances, the principal suspect is PR3.
I discard PR3 and inform the author that his change breaks the build.
I don't really care that PR1+PR2+PR3+PR4 passes;
I want to play safe and stop at the first failure. So PR4 goes back to the
queue.
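
As a rough sketch, the policy just described could be written as the
function below. The name `resolve_round` and the three-part return value
are hypothetical; the sketch assumes one pass/fail result per cumulative
build, ordered "thinner to fatter".

```python
from typing import List, Optional, Sequence, Tuple


def resolve_round(
    batch: Sequence[str], results: Sequence[bool]
) -> Tuple[List[str], Optional[str], List[str]]:
    """Apply the "stop at the first failure" policy to one round.

    batch[i] is the pull request added by the i-th build, and results[i]
    says whether the build of batch[0..i] merged onto master passed.
    Returns (to_push, rejected, to_requeue).
    """
    if len(batch) != len(results):
        raise ValueError("need exactly one result per build")
    for i, passed in enumerate(results):
        if not passed:
            # Everything before the first failure is pushed, the pull request
            # that introduced the failure is rejected, the rest is retried.
            return list(batch[:i]), batch[i], list(batch[i + 1:])
    return list(batch), None, []


to_push, rejected, to_requeue = resolve_round(
    ["PR1", "PR2", "PR3", "PR4"], [True, True, False, True]
)
print(to_push, rejected, to_requeue)  # ['PR1', 'PR2'] PR3 ['PR4']
```

Results after the first failure are deliberately ignored, which is why
the passing PR1+PR2+PR3+PR4 build does not get PR4 pushed.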

Now the queue looks like: PR4, PR5, PR6 and PR7.
My builders get back to work. They start from an updated master', i.e.
master' = master + PR1 + PR2
That is:

slave1: master' + PR4
slave2: master' + PR4 + PR5
slave3: master' + PR4 + PR5 + PR6
slave4: master' + PR4 + PR5 + PR6 + PR7

Now say all four builds pass, and my master can safely
advance to (master' + PR4 + PR5 + PR6 + PR7).
That is, I pushed 6 pull requests in the time of 2 builds.
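
The whole example can be replayed with a small simulation. It is only a
sketch under stated assumptions: `run_queue` is a hypothetical name, and
`build_passes` is a stand-in for a real build plus test run. The toy
oracle used here fails every build that contains PR3, which differs from
the example above (where PR1+PR2+PR3+PR4 passed), but builds after the
first failure are ignored anyway, so the outcome is the same.

```python
from typing import Callable, List, Sequence, Tuple


def run_queue(
    queue: Sequence[str],
    n_slaves: int,
    build_passes: Callable[[List[str]], bool],
) -> Tuple[List[str], List[str], int]:
    """Simulate speculative rounds until the queue is empty.

    build_passes(merged) stands in for building and testing master plus
    the given pull requests. Returns (pushed, rejected, rounds).
    """
    if n_slaves < 1:
        raise ValueError("need at least one slave")
    pending = list(queue)
    pushed: List[str] = []
    rejected: List[str] = []
    rounds = 0
    while pending:
        rounds += 1
        batch = pending[:n_slaves]
        # In a real deployment these builds would run in parallel,
        # one cumulative merge per slave.
        results = [build_passes(pushed + batch[: i + 1]) for i in range(len(batch))]
        failed_at = results.index(False) if False in results else None
        if failed_at is None:
            pushed.extend(batch)
            pending = pending[len(batch):]
        else:
            # Push the last uninterrupted success, reject the first
            # failure, and leave the rest in the queue for the next round.
            pushed.extend(batch[:failed_at])
            rejected.append(batch[failed_at])
            pending = pending[failed_at + 1:]
    return pushed, rejected, rounds


def oracle(merged: List[str]) -> bool:
    """Toy build result: any merge containing PR3 fails."""
    return "PR3" not in merged


queue = [f"PR{i}" for i in range(1, 8)]
print(run_queue(queue, 4, oracle))  # 2 rounds with 4 slaves
print(run_queue(queue, 1, oracle))  # 7 rounds when strictly sequential
```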

== 4. Hints? ==

I have been operating a buildbot instance for quite some time, but never
really looked into its internals. A few questions before I scratch my own itch:

1) Does what I write above make any sense to you?
2) Where should I start looking? Any class / feature in the code
   that can be a starting point for my development?
3) Any difficulty that Dunning-Kruger is making me overlook?

Cheers,
Giovanni Gherdovich