[Buildbot-devel] issues with buildbot

Tue Aug 30 19:49:12 UTC 2005

> I'm in the process of setting up a new buildbot system for a large
> project.  It's a great tool and I've got it mostly running now, but I
> thought I'd give you some feedback on the few issues I've noticed.

Hey, welcome!

> 1. Contrary to what I would have thought, the 'Force Build' doesn't
> replace a build that is pending (due to changes with a stable timer), it
> instead starts a build of HEAD independently -  and then the build of
> the changed revision (usually the same) will begin immediately after.

Right, the "Force" button is really about "force a brand new build" rather
than "accelerate the currently pending build and skip its tree-stable timer".
It is most commonly used when something has changed out-of-band: you've just
added a new Builder and want to see if it works even though no source code
changes have happened, you just installed some library on the buildslave and
want to re-run a build that failed because it was missing before, etc.

With the new Scheduler design in the upcoming release, it will be easier to
implement a "skip the timeout" button (you would make a subclass of Scheduler
with a method that takes any pending build and resets its timer down to
zero). The "Force" button will still have the same behavior, though: it
bypasses the scheduler altogether, injecting a brand new build into the
Builder.
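
A rough sketch of the "skip the timeout" idea, as a toy model in plain Python.
This is not the real Scheduler class: PendingBuildTimer, change_arrived,
skip_timeout and ready are names made up for illustration, and the clock is
passed in explicitly instead of using a reactor timer.

```python
class PendingBuildTimer:
    """Toy stand-in for a tree-stable timer (hypothetical, not Buildbot code)."""

    def __init__(self, tree_stable_seconds):
        self.tree_stable_seconds = tree_stable_seconds
        self.fire_at = None  # None means no build is pending

    def change_arrived(self, now):
        # Every new change restarts the wait for the tree to settle.
        self.fire_at = now + self.tree_stable_seconds

    def skip_timeout(self, now):
        # Accelerate the pending build, if any. Unlike "Force", this never
        # creates a build when nothing is pending.
        if self.fire_at is not None:
            self.fire_at = now

    def ready(self, now):
        return self.fire_at is not None and now >= self.fire_at
```

The point of the sketch is the difference from "Force": skip_timeout only
touches a build that is already pending, so it cannot produce a second build.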

> This is a bit odd, and also is a problem in my setup where (for various
> reasons) I assume the revision each build is based on is newer than the
> previous.

Just so you know, the internal architecture is designed to make it possible
to build arbitrary revisions of the code (specifically the BuildRequest that
is handed to the Builder has a "SourceStamp" object, which provides a branch
and revision number, and ideally those can change arbitrarily). Someone could
force a build of a specific revision, then force a second build of an
*earlier* revision, etc. The usual build-upon-Change use case doesn't
exercise this functionality, but the 'buildbot try' and build-on-branch
features of the upcoming release do.

Apart from that, the only problem I see with a buildbot setup where you
assume revisions are monotonically increasing is that you wouldn't be able to
build the same source code twice, which is kind of a pity. The buildslaves
are supposed to behave a lot like a human developer doing checkouts and
compiles, and hopefully your human developers aren't limited to one compile
per source tree either.

> 2. The 'Stop Build' button sends a SIGKILL to the build process.  It
> really should use SIGTERM so that the process can cleanup any temporary
> files, locks or such.  It can still be followed by SIGKILL if it fails
> to exit in a timely manner.

Yeah, I've got a TODO somewhere in the code about doing it this way, but I
hadn't yet gotten around to implementing it. I think the "Stop Build" button
is under-tested. Currently the SIGKILL-sending code is mostly used to
implement the "your build process appears to be stuck" timeout, and it seemed
unlikely to me that such a wedged build would respond any better to a SIGTERM
than a SIGKILL. Also, most of my experience is with using 'make' as the
top-level build tool, and it doesn't do anything special with SIGTERM (and
can't really pass it on to its children).

But yeah, it should be implemented just as you said.
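
A minimal sketch of that SIGTERM-then-SIGKILL sequence, using Python's standard
subprocess module rather than the buildslave's actual process-handling code.
The function name stop_process and the grace_seconds parameter are made up for
illustration, and the signal behaviour described in the comments assumes a
POSIX system.

```python
import signal
import subprocess
import sys


def stop_process(proc, grace_seconds=5.0):
    """Ask proc to exit, then kill it if it has not exited in time.

    Returns the exit status of the process.
    """
    if proc.poll() is not None:
        return proc.returncode  # already gone, nothing to do
    proc.terminate()  # sends SIGTERM on POSIX, so the process can clean up
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX, which cannot be caught
        return proc.wait()


if __name__ == "__main__":
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit status:", stop_process(child))
```

Note that this only signals the immediate child. A build tool that spawns its
own children would need process-group handling on top of this.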

> 3. The PBChangeSource doesn't allow you to specify a prefix of more than
> one directory - if you do it just treats it as an always failed match.

The prefix is a strict string.startswith() comparison (actually comparison
and removal). The idea is that your project is paying attention to exactly
one source tree, defined by whatever tree you check out from the repository
in the first Step of your build. This prefix-stripping serves two purposes.

The first is that your source tree may be kept in a larger repository, one
that tracks other projects (or other components of whatever you're using the
buildbot on), so you might get VC notification for files that are outside the
tree of interest. Discarding filenames that don't have the prefix is
equivalent to ignoring files outside your tree of interest.

The second is that some BuildSteps find it useful to have a list of exactly
which files were changed for any particular build (this is stored as the
.files list in a Change object). For example, step_twisted.Trial (which runs
a unit-test utility) can be set up in a mode which only runs unit tests for
the files that were changed in this build. Specifically, you can put special
test-case-name tags in python source files, and /usr/bin/trial can be told to
look for these tags and only run the unit tests named by them. This is handy
for configuring "quick" Builders that provide immediate feedback on the stuff
that was just modified, but you usually follow it up with a full test suite
run a couple of minutes later.

But, for this list of changed file names to be useful, it needs to be
relative to your source tree, not relative to some remote VC repository's
internal directory structure. The prefix is stripped from the filenames to
make sure that changes[0].files[0] is a valid relative path from the top of
the tree that you've checked out to the file that was just changed.
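
To make the compare-and-remove behavior concrete, here is a small standalone
sketch. It is not the PBChangeSource code itself: the function name
strip_prefix and the example paths are made up for illustration.

```python
def strip_prefix(filenames, prefix):
    """Keep only the names under prefix, with the prefix removed.

    Names outside the prefix are dropped, which is the "ignore files
    outside the tree of interest" half of the job.
    """
    relative = []
    for name in filenames:
        if name.startswith(prefix):
            relative.append(name[len(prefix):])
    return relative


# Hypothetical repository layout with two projects side by side.
changed = [
    "trunk/myproject/src/main.c",
    "trunk/otherproject/README",
]
print(strip_prefix(changed, "trunk/myproject/"))
```

With a plain startswith() comparison like this one, a prefix that spans
several directories ("foo/bar/baz/") is handled the same way as a
single-directory prefix.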

So, changing PBChangeSource (or any of the change sources) to accept a list
of prefixes would invalidate the second purpose. Unless you've got a source
checkout step that does multiple checkouts and somehow merges them into the
same directory, I think you would wind up with filenames that don't line up
with anything in the builder's local tree.

What's the use case for multiple prefixes?

Or, having just written all that, I think I may have misunderstood you. When
you say "more than one directory", do you mean "foo/bar/baz" as opposed to
just "foo" ? In that case, it's just a bug. If you could provide me with an
example of a working and a non-working case, I'll write up a test case and
fix it. Oops.

> 4. It seems that the output to the build step logs is overly buffered,
> making it impossible to watch the build process output in anything close
> to real time.

Hm, my experience has been that it isn't delayed by more than a few seconds.
The chain of pipes looks like:

 child process writes to stdout/stderr
   -> maybe a libc FILE buffer ->
   -> pipe ->
 buildslave reads from pipe, immediately packages output and sends to master
   -> TCP socket ->
 master receives output, appends to logfile, publishes to status targets
   -> TCP socket (HTTP) ->
 web browser appends text to the bottom of the page

The "Nagle" algorithm in TCP will delay small amounts of data for small
amounts of time (I want to say that 100ms is common) in the hopes of sending
out one big packet instead of several small ones, but I doubt that's the
issue here. Unix pipes typically have some buffering (4k at most), but in my
experience they are usually flush-as-soon-as-possible rather than
as-late-as-possible. The libc buffered FILE object typically has
as-late-as-possible semantics, but most of the test processes I've seen don't
wind up with huge latencies because of this (possibly because they use lots
of small commands instead of one command that produces huge amounts of
output).

It's possible that your build process has a couple layers of child processes,
each doing their own buffering, which might cause the kinds of delays you're
talking about. How bad is it?

To track this one down, I would recommend adding some log.msg() calls in
buildbot.slave.command.ShellCommandPP.outReceived, which is called each time
the buildslave receives some stdout from the process it has just spawned.
Something like:

 import time
 log.msg("got stdout %d bytes time %s" % (len(data), time.time()))

If this is reporting frequent small updates, then the problem is somewhere in
buildbot. If it is reporting infrequent large updates, then the buffering is
happening somewhere in the child process.
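
The same measurement can be tried outside of buildbot. The sketch below is a
standalone variation that uses only the standard library (subprocess and
os.read), not the buildslave code. The name record_chunks is made up for
illustration. It runs a command and notes when each chunk of stdout arrives
and how big it is.

```python
import os
import subprocess
import sys
import time


def record_chunks(argv):
    """Run argv and return [(elapsed_seconds, nbytes), ...] per pipe read."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
    fd = proc.stdout.fileno()
    start = time.monotonic()
    chunks = []
    while True:
        # os.read returns as soon as any data is available, so each entry
        # reflects one burst of output as delivered by the pipe.
        data = os.read(fd, 65536)
        if not data:
            break  # end of file: the child closed its stdout
        chunks.append((time.monotonic() - start, len(data)))
    proc.stdout.close()
    proc.wait()
    return chunks


if __name__ == "__main__":
    for elapsed, nbytes in record_chunks(sys.argv[1:]):
        print("%8.3fs  %d bytes" % (elapsed, nbytes))
```

If the build command shows a few large, late chunks even here, with no
buildbot involved, the buffering is happening in the child process.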

I don't remember if the pipes that are created to the child process have
their buffering turned off or not. It's a tradeoff between immediacy and
efficiency, of course, so they may default to the more efficient buffered
setting. If you're interested, they get created in
twisted.internet.process.Process (just search for calls to os.pipe). This
might be related, but even with the default setting I see stdout messages
coming from a single 20-minute Trial process showing up every second or two.

> 5. The ShellCommand build step doesn't allow you to set the description
> via the init call parameters.  Maybe my python is bad, but I'm not sure
> how to set it aside from here, since the steps are actually in the tuple
> of s(step.ShellCommand, arg = x, arg2 = y, ...) - so there's no actually
> instance of the object to set the description attribute on.  (In my
> configuration I've created a subclass of ShellCommand that does
> understand a description argument to init and sets the attribute from
> there).

I think you can use s(step.ShellCommand, name="my description") to set it.
Most of the useful attributes of a BuildStep subclass will be copied from the
kwargs you pass to __init__ (see BuildStep.__init__ where it loops through
the names listed in BuildStep.parms).

Yeah, the BuildStep instances don't exist until the Build actually starts,
because each Build gets a separate copy of each BuildStep. There is state
kept in the BuildStep instance that is specific to a particular build, so
they can't be shared or pre-created in the config file.
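
A toy model of that arrangement, in plain Python. This is only a sketch of the
pattern, not the real BuildStep code: ToyStep, make_steps and the contents of
the parms list are made up for illustration, and rejecting unknown keyword
arguments is this sketch's own choice.

```python
class ToyStep:
    # Names of the keyword arguments that get copied onto the instance.
    parms = ["name", "command"]

    def __init__(self, build, **kwargs):
        self.build = build
        for p in self.parms:
            if p in kwargs:
                setattr(self, p, kwargs.pop(p))
        if kwargs:
            raise TypeError("unexpected arguments: %r" % sorted(kwargs))


def s(steptype, **kwargs):
    """Describe a step as (class, kwargs) without creating an instance."""
    return (steptype, kwargs)


def make_steps(build, specs):
    """Create a fresh set of step instances for one build."""
    return [steptype(build, **kwargs) for steptype, kwargs in specs]


specs = [s(ToyStep, name="compile", command=["make", "all"])]
first = make_steps("build-1", specs)
second = make_steps("build-2", specs)
```

Here the config only holds the (class, kwargs) pairs, and each build gets its
own instances, so per-build state on first[0] never leaks into second[0].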

> 6. (wishlist) In my environment I actually have only one supported
> platform and hence only a single builder.  But I do have multiple build
> trees of the same code (different released versions that are still
> actively maintained).  It would be nice if the same master could manage
> all the code branches (since the build process is identical).

It's your lucky day! :). The next release will include the build-on-branch
feature we've been talking about for the last few months. Take a look at the
user's manual on the web site (the CVS HEAD one, not the 0.6.6 version) and
see if the functionality described in there will meet your needs. If not, let
me know, I need more use cases.

> It would also be nice if the WaterFall display could show the status of all
> the different build trees in the one page, rather than having to create a
> separate html.WaterFall object using a different port for each.

If I understand you correctly, then I think the build-on-branch feature will
accommodate this desire (but I may not quite understand what you want to do).
In the next release, the main Waterfall display will have one column per
builder, just as we've got now, and you can either have each Builder handle
multiple branches (so the builds for those branches would be interleaved in a
single column), or you could create multiple Builders (each with the same
BuildFactory) and assign one branch per Scheduler and one Scheduler per
Builder (so the branches would be built in parallel).

Hmm, that description wasn't very clear. I'll make a note to try and write up
some use cases in the documentation.

But, in short, the next release will let you build multiple branches in a
single buildmaster, and thus display all their status on the same page.
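
The two layouts can be pictured with a toy model. This is plain Python, not
the configuration syntax of the upcoming release: ToyScheduler, dispatch, and
the branch and builder names are all made up for illustration.

```python
class ToyScheduler:
    """Watches one branch and feeds a list of builders (hypothetical)."""

    def __init__(self, branch, builder_names):
        self.branch = branch
        self.builder_names = list(builder_names)

    def wants(self, change_branch):
        return change_branch == self.branch


def dispatch(schedulers, change_branch):
    """Return the names of the builders that should build this change."""
    names = []
    for scheduler in schedulers:
        if scheduler.wants(change_branch):
            names.extend(scheduler.builder_names)
    return names


# Layout 1: one Builder per branch, so each branch gets its own column.
per_branch = [
    ToyScheduler("trunk", ["trunk-builder"]),
    ToyScheduler("release-1.0", ["release-builder"]),
]

# Layout 2: one Builder handles both branches, interleaved in one column.
shared = [
    ToyScheduler("trunk", ["shared-builder"]),
    ToyScheduler("release-1.0", ["shared-builder"]),
]
```

Either way there is a single master, so all of the columns end up on the same
Waterfall page.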

> 7. (wishlist) This may be already possible, but it would be nice if the
> builder could have access to the build output files easily.  My build
> steps include archiving successful builds and making them available
> to users, and it would be nice to include the build logs in the archive.

In the current code, each BuildStep has access to the logfiles that were
generated as it runs (see ShellCommand.createSummary). In addition, each
StatusTarget (like the Waterfall page, or the IRC bot) can get access to each
LogFile (see buildbot.status.mail.MailNotifier.buildMessage, at the end where
it uses build.getLogs() and log.getText() ). The idea is that the BuildStep
is responsible for process-specific things, like creating filtered versions
of the main log file (just the warnings, just the errors, or the pass/fail
test counts parsed from the output). The StatusTargets are responsible
for distributing the logs somewhere. Long-term archiving should be
implemented in a new StatusTarget, which can just pull all the logfiles from
the IBuildStatus object (along with build results, the SourceStamp, etc) and
stash them somewhere.
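
As a rough sketch of the "pull all the logfiles and stash them" part: the
helper below assumes only the two calls mentioned above, getLogs() on the
build status object and getText() on each log. The function name
archive_build_logs, the destination directory, and the log-NNN.txt naming
scheme are made up for illustration, and hooking it into a real StatusTarget
is left out.

```python
import os


def archive_build_logs(build_status, dest_dir):
    """Write every log of one finished build into dest_dir.

    build_status is assumed to provide getLogs(), and each log is assumed
    to provide getText(). Returns the list of files written.
    """
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for index, logfile in enumerate(build_status.getLogs()):
        # Number the files rather than guess at a naming method on the log.
        path = os.path.join(dest_dir, "log-%03d.txt" % index)
        with open(path, "w") as f:
            f.write(logfile.getText())
        paths.append(path)
    return paths
```

A StatusTarget that is told when a build finishes could call something like
this with the finished build's status object and a per-build directory.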

I'm working on a diagram of how Build/BuildStep/BuildStatus/BuildStepStatus
instances are related, but in general it is always possible to get from the
buildbot.process-side object (like Build or BuildStep) to the
buildbot.status-side object (like BuildStatus or BuildStepStatus). It is
*not* possible to go in the other direction. One reason is that the status
objects are persisted, so they can't hold references to things that shouldn't
be persisted. The other is that status objects are supposed to passively
accept data from the build process rather than influence the build, so
maintaining a unidirectional connection makes everything cleaner.

So, I'd recommend doing archiving in a StatusTarget, but if you really need
access to the logs from the BuildStep or the Build (or even the Builder, but
that would get kind of ugly), you can do it.

Hope that helps. Feel free to describe your use cases a bit more, and I'll do
what I can to accommodate them in the code.

cheers,
 -Brian