[Buildbot-devel] How to implement some sort of "speculative builds"?

Mon Mar 2 15:05:23 UTC 2015

We have a very similar process at Intel for our Android CI process.

We also serialize merge requests. The difference from your setup is that we
have a pre-integration builder that does pretty much the same work as the
merge request builder, but without merging. This was to reduce the risk on
the serialized merge request builder.

So, reusing your use case, at first we did:

PR1 -> BUILD -> TESTS -> MR \
PR2 -> BUILD -> TESTS -> MR  \
PR3 -> BUILD -> TESTS -> MR  /
PR4 -> BUILD -> TESTS -> MR /

followed by the serialized merges:

PR1 -> BUILD -> TESTS -> PR3 -> BUILD -> TESTS -> PR1 -> BUILD -> TESTS
    -> PR4 -> BUILD -> TESTS

After doing this for a while, we gathered metrics on the regressions found
at the various stages of the process, and we found that the serialized tests
*never* found any regression, because those had already been found by the
parallel pre-integration builders.

So we changed to:

PR1 -> BUILD -> TESTS -> MR \
PR2 -> BUILD -> TESTS -> MR  \
PR3 -> BUILD -> TESTS -> MR  /
PR4 -> BUILD -> TESTS -> MR /

with build-only serialized merges:

PR1 -> BUILD -> PR3 -> BUILD -> PR1 -> BUILD -> PR4 -> BUILD

We only do the build for each serialized merge. This gives us a build for
each step of the mainline, and if we detect a problem that was not caught
by the automated tests, we have builds ready for bisection testing.
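
To illustrate why one build per mainline step helps, here is a minimal
sketch of bisecting over stored builds. It is not our actual tooling: the
artifact names and the is_bad check are hypothetical, and it assumes the
first build is known good, the last is known bad, and there is a single
good-to-bad transition.

```python
from typing import Callable, Sequence


def first_bad_build(builds: Sequence[str], is_bad: Callable[[str], bool]) -> int:
    """Return the index of the first bad build.

    Assumes builds[0] is good, builds[-1] is bad, and that every build
    after the first bad one is also bad.
    """
    lo, hi = 0, len(builds) - 1  # invariant: builds[lo] good, builds[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical artifact names, one per serialized merge on the mainline.
builds = [f"mainline-build-{i}" for i in range(8)]

# Stand-in for a manual or slow check; here builds 5 and later are "bad".
culprit = first_bad_build(builds, lambda name: int(name.rsplit("-", 1)[1]) >= 5)
print(builds[culprit])
```

With N stored builds this needs about log2(N) checks, and no rebuild is
needed for any of them since the artifacts already exist.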

I find this method much easier than the optimistic pipelined tests that you
are proposing, and it has worked well for us.
Since we switched to this method, we have gotten rid of all our long-queue
problems.
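
For comparison, the optimistic pipelined policy from the quoted message
below (push the longest uninterrupted run of passing cumulative merges,
reject the first failing PR, requeue the rest) could be sketched roughly as
follows. This is only a simulation under assumptions: speculative_round and
passes are hypothetical names, not Buildbot APIs, and the "workers" are
evaluated one after another here rather than concurrently.

```python
from typing import Callable, List, Sequence, Tuple


def speculative_round(
    queue: Sequence[str],
    workers: int,
    passes: Callable[[Tuple[str, ...]], bool],
) -> Tuple[List[str], List[str], List[str]]:
    """Run one round; return (pushed, rejected, remaining_queue)."""
    batch = list(queue[:workers])
    # Worker i tests master plus the first i+1 pull requests of the batch.
    results = [passes(tuple(batch[: i + 1])) for i in range(len(batch))]
    # Stop at the first failure, even if a later, larger merge passed.
    first_fail = results.index(False) if False in results else len(batch)
    pushed = batch[:first_fail]
    rejected = batch[first_fail : first_fail + 1]
    remaining = batch[first_fail + 1 :] + list(queue[workers:])
    return pushed, rejected, remaining


# Reproduce the example from the quoted message: only PR1+PR2+PR3 fails.
failing = {("PR1", "PR2", "PR3")}
queue = [f"PR{i}" for i in range(1, 8)]

pushed1, rejected1, queue1 = speculative_round(queue, 4, lambda m: m not in failing)
print(pushed1, rejected1, queue1)

# Second round starts from master' = master + PR1 + PR2.
pushed2, rejected2, queue2 = speculative_round(queue1, 4, lambda m: m not in failing)
print(pushed2, rejected2, queue2)
```

The extra state to track (which merges were speculative, what to discard,
what to requeue) is the part I find harder to operate than plain
pre-integration builds.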

HTH
Pierre

On Mon, 2 Mar 2015 at 12:22, Giovanni Gherdovich <g.gherdovich at gmail.com>
wrote:

> Hello,
>
> there is a patch I'd like to make to my team's buildbot deployment,
> and I'm here to ask for advice.
>
> This turned out to be a long-ish message, so here are the contents:
>
> 1. The problem I am trying to solve
> 2. The problem I am not trying to solve
> 3. The improvement I want to implement
> 4. Hints?
>
> == 1. The problem I am trying to solve ==
>
> * Our build plus testing takes about 90 minutes.
> * Every incoming pull request passes through the build+test dance.
>   If tests pass, it is then merged to our master repository.
>   (Actually, the merge happens -before- testing.
>   If all is green, it is then pushed to master)
> * Pull requests are processed (merge -> build -> test -> possibly pushed)
>   in a sequential fashion. This is because we like to know that
>   each merge in the master repository passed the tests.
>   If, say, we were to process multiple pull requests simultaneously,
>   we could end up in the case where two heads work separately
>   but their merge doesn't, and we couldn't say whose fault that is.
> * At peak time our buildbot queue can have ~20 pull requests.
>
> This all implies that after I submit my thing, it takes forever
> to know if it's pushed or not (a day, sometimes two).
> As a result everybody is pretty unhappy.
>
>
> == 2. The problem I am not trying to solve ==
>
> Everybody at this point steps up and says that
> a non-regression test suite that lasts 1 hour isn't acceptable,
> it shouldn't last more than the length of its longest test scenario,
> or something along these lines.
>
> For sure I agree on all these points, and in the
> background I am looking into ways to cut the fat out of that
> monstrosity. But I also need to be pragmatic and
> keep assuring that my team can release every day, now.
>
>
> == 3. The improvement I want to implement ==
>
> While I don't plan to speed up the tests for now,
> I do observe that they consume almost no resources
> (profiling shows flat graphs for CPU, memory, I/O.
> The thing basically spends its time waiting).
>
> So, I'd like to run multiple builds+tests at once,
> without sacrificing the nice property "if it's broken
> you can always tell what's the bad pull request: the last one".
>
> Here is an example of the strategy I have in mind:
>
> Say you have 7 pull requests in the queue,
> PR1, PR2, PR3, PR4, PR5, PR6 and PR7 and 4 buildbot slaves.
> What I want to do is to tell my slave1 to process PR1
> (i.e., merge it with master and see what it gives).
> Slave2 would -not- take PR2 alone, but the merge of PR1+PR2.
> Slave3 will go with the merge PR1+PR2+PR3 and slave4 will
> take the merge PR1+PR2+PR3+PR4.
>
> slave1: master + PR1
> slave2: master + PR1 + PR2
> slave3: master + PR1 + PR2 + PR3
> slave4: master + PR1 + PR2 + PR3 + PR4
>
> It's kind of like speculative execution in pipelined processors:
> you do some work ahead of time, assuming a given result
> for the "branching points".
> Here, I am building PR1 + PR2 as if I knew that PR1 passes the tests
> (which I don't). If I guessed right, the throughput of my system increases.
> If not, I'll have to discard some of the work I did.
>
> Now, after these "almost simultaneous" builds complete, say that results
> turn out to be:
>
> master + PR1                    :  pass
> master + PR1 + PR2              :  pass
> master + PR1 + PR2 + PR3        :  fail
> master + PR1 + PR2 + PR3 + PR4  :  pass
>
> Here my policy would be: push PR1+PR2, since it's the
> "last uninterrupted success" in my series of builds
> (I consider them to have "an order", even if they happen
> simultaneously, and the order is "thinner to fatter"
> in terms of the changes they carry).
> Then I see that PR1+PR2+PR3 fails. It smells bad:
> given the circumstances, the principal suspect is PR3.
> I discard PR3 and inform the author that his change breaks the build.
> I don't really care that PR1+PR2+PR3+PR4 passes;
> I want to play it safe and stop at the first failure. So PR4 goes back
> to the queue.
>
> Now the queue looks like: PR4, PR5, PR6 and PR7.
> My builders get back to work. They start from an updated master', i.e.
> master' = master + PR1 + PR2
> Which is:
>
> slave1: master' + PR4
> slave2: master' + PR4 + PR5
> slave3: master' + PR4 + PR5 + PR6
> slave4: master' + PR4 + PR5 + PR6 + PR7
>
> Now say all four builds pass, and my master can safely
> advance to (master' + PR4 + PR5 + PR6 + PR7).
> That is, I pushed 6 pull requests in the time of 2 builds.
>
>
> == 4. Hints? ==
>
> I have operated a buildbot instance for quite some time, but never really
> looked into its internals. A few questions before I scratch my own itch:
>
> 1) Does what I write above make any sense to you?
> 2) Where should I start looking? Any class / feature in the code
>    that can be a starting point for my development?
> 3) Any difficulty that Dunning-Kruger is making me overlook?
>
> Cheers,
> Giovanni Gherdovich