[Buildbot-devel] consolidating builds for slow bots?

Sat Feb 7 07:08:23 UTC 2004

> Any ideas on this one?

Doh, sorry, I thought I'd responded to that one already.

> something like (mono-spaced font required for proper alignment):

Argh, or it would if there weren't those tabs in it :). Tabs evil. Spaces
good. Especially with python.

> Therefore, as soon as it finishes this new build, it will start another one
> even though it could have just done one build with all of the changes.

Ah. Yeah, that's annoying, but at the moment it's mostly working as intended.
The long answer follows. The short answer is that that statement isn't
strictly true: the latest changes haven't "stabilized" yet (i.e. the
tree-stable-timer hasn't expired yet), and the Buildbot is implementing one
of two possible solutions to what I'll term the "Bus Driver's Dilemma".

Here's the philosophy behind the design (possibly flawed):

 Changes tend to be lumped into groups, especially for VC systems like CVS
 that don't make it easy to do atomic multi-file commits. We don't want to
 do builds that only include half of a change-set.

 There may be some time delays between the change being made, the
 buildmaster's notification of it, and the CVS checkout actually taking
 place. We want the checkout to be stable: it should reflect the state of the
 code tree at a point in time comfortably distant from any other commits, to
 avoid races with other commits that we haven't found out about yet.

 The "tree-stable-timer" is intended to serve both of these purposes. Changes
 are treated as spans of time rather than discrete point events, starting at
 the notification time and ending tree-stable-timer later. The tree is stable
 when it hasn't changed for tree-stable-timer seconds. (i.e. when there are
 no spans which include the current time).

 Another goal is to have some kinds of builds just happen less frequently
 than others, because they take a long time. The tree-stable-timer can be
 extended to accomplish this too, but it isn't clear to me what exactly the
 policy should be, or how to quantify "less frequently" well enough to write
 code.

Here's the scoop on how it works at present:

 Each Build object has a list of Changes. The compile/test should be done
 against a tree that contains those Changes and does not contain any later
 ones. Notionally the list of Changes is really equivalent to a complete list
 of file/version for everything in the source tree. When we get the 'try'
 feature implemented, it will basically be equivalent to a list of
 file/version/diff for everything in the tree.

 Each Build object has a timer.

 There is always a single Build sitting in the Builder's .waiting slot.
 Incoming Changes are always appended to the .waiting Build's list
 (regardless of what the Builder is doing: idle or building), and its timer
 is restarted. You can keep the build from ever happening by making a new
 commit once every 4 minutes 59 seconds.

 When the .waiting Build's timer fires, a decision is made: if the .buildable
 slot is empty, the .waiting Build is moved into it. The Build in this slot
 is ready to go: all of its Changes are considered stable. If .currentBuild
 is empty (i.e. the Builder is idle), and the slave is attached, the build
 will start right away (it is moved from .buildable to .currentBuild).

 On the other hand, if there was already a .buildable Build ready, the
 .waiting Build is merged into it. The .buildable Build will then hold all
 the changes of both.

 Each time something interesting happens (a timer expires, a build finishes),
 .maybeStartBuild() called. If this sees a Build in the .buildable slot, and
 non in the .currentBuild slot, and a slave is attached, the .buildable one
 is started. Interlocked builds are handled here too.

The idea is that each Build will incorporate all Changes that were stable at
the time it began. Changes that are still waiting on their timers to expire
are not included. The build does not wait for them to expire, but rather
starts with whatever it can accomplish right now. This was a design choice
that I faced the last time I rewrote that code.. immediacy versus efficiency.
Kind of like a bus driver seeing someone running towards the bus stop.. do
they wait for the passenger, delaying everyone else, or do they take off
because they know there will be another bus coming in just a few minutes?

If the build time is short (another bus will arrive soon), it is better to
leave right away. If the build time becomes long, then it depends: is there
just one passenger running towards the stop? Or might there be another one a
block behind him? At some point the driver has to decide, and they might not
have enough information to do it right.. the guy might be running towards
something else, or there might be someone running for the bus who is out of
sight around the corner.

It seemed like "build what you can as soon as you can" was a better way to go
than have the Builder snoop the .waiting build to see if it could improve
efficiency by stalling for a while. Another reason for this decision was that
Interlocks can confuse the matter: the Change that you're waiting on might
not be useable yet because it depends upon another build succeeding, and you
don't know how long that other build is going to take, or whether it is going
to pass or fail.

So anyway, I think in your example, the change3 that "should have" gone into
the build was actually still waiting for the timer to expire. If change2 and
change3 happen within the tree-stable-time of each other, they should wind up
in the same Build.

Now there is a missing feature in the CVS buildstep which does indeed cause
the mis-blame problem you describe. The second purpose of the
tree-stable-timer is to give us a safe timestamp to do the CVS checkout, so
we can be sure we're getting just the Changes that are listed in the Build
and not some extra commit that occurred just before we started the checkout.
The CVS step is supposed to take a timestamp, picked from the dead center of
the tree-stable window, and use it in a -D option.

This feature is missing (along with the '-r branch' feature that really ought
to go in with it). Without it, the checkout just comes from HEAD, which could
be unstable, both because someone has done a commit which is known but not
yet stable, or because the notification-latency means the buildmaster hasn't
even heard about it yet. There's some vestigal support for it
(Build.startBuild computes a 'when' value which is then unused), which should
probably be ripped out.

So.. if you're looking to hack some code, what I'd suggest is the following:

 step.CVS.start should look at self.build.lastChangeTime(), which indicates
 the timestamp of the last Change claimed by this Build. It should add half
 of self.build.treeStableTimer, and turn the result into a 'cvs -D datestamp'
 -compatible string (use time.strftime). Remember to account for timezones,
 which might differ between the master and the slave and the repository. It
 may be easier to send a seconds-since-epoch to the slave and let it do the
 formatting, I'm not sure.

 The 'args' dict that gets sent to the slave should then include 'timestamp',
 and also 'branch' (which should come from a new __init__ parameter).

 The buildslave (slavecommand.CVS) needs to turn the timestamp and branch
 keys, if non-None, into -D and -r options.

I think that would solve the mis-blame problem. Note that other VC systems
would do this differently (which is why that .startBuild 'when' value should
be ripped out).. P4 and Arch use strictly sequential change numbers to index
Changes, and instead of computing a timestamp, the step would just look at
.changes[-1] and extract the change number.

It would also be possible to change the design decision that causes builds to
run as soon as possible (even if there are Changes waiting to stabilize). To
do that you'd need to change the behavior of Builder.maybeStartBuild, to have
it check self.waiting to see if its timer was running. If so, it could choose
to do nothing instead of starting the .buildable build. Later, when those
Changes stabilize, Builder.buildTimerFired() is called, which will merge that
new .waiting Build into .buildable, and then call .maybeStartBuild again,
which can see that there is nothing else upcoming and finally start the
build.

If you do that, I'd have it fall back to the old do-it-now behavior if there
are any Interlocks involved, to avoid the problems described above. (The
specific problem is a dependent-build-failure, which knocks the failed
Changes out of the running.. you'd have to make sure that the successful
Changes get built anyway).

If it works, let me know and we can make that behavior configurable at the
Builder level, with a .waitForStragglers attribute or something.

thanks,
 -Brian