[Buildbot-devel] gc.collect in BuilderStatus.prune

Imran Hameed ihameed at imvu.com
Fri Jun 17 17:56:18 UTC 2011


Hi!
My company has a group of 80 buildslaves used to parallelize the
execution of our internal test suite. Each slave runs a subset of all
our tests; each slave is associated with a unique Builder object in
our configuration.
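
For concreteness, the relevant part of our configuration looks
roughly like the fragment below (the builder/slave names, the test
command, and test_factory are illustrative, not our exact
master.cfg):

    # master.cfg fragment; names and the test command are made up
    from buildbot.config import BuilderConfig
    from buildbot.process.factory import BuildFactory
    from buildbot.steps.shell import ShellCommand

    c = BuildmasterConfig = {}

    # Hypothetical factory; ours runs a per-slave subset of the suite.
    test_factory = BuildFactory()
    test_factory.addStep(ShellCommand(command=['run-test-subset']))

    c['builders'] = []
    for i in range(80):
        # One Builder per slave: a bijective slave<->builder mapping.
        c['builders'].append(BuilderConfig(
            name='tests-%02d' % i,
            slavenames=['test-slave-%02d' % i],
            factory=test_factory))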

Over the past few months, we've seen intermittent load spikes on our
buildmaster: the buildbot process pegs a CPU. We recently created a
test distribution system that balances test execution time across
the build cluster, so every slave finishes at about the same time
and builds aren't blocked on one poorly-scheduled slave running
every slow test in the suite.

With our current setup, Builder.buildFinished is invoked once for
each slave in our build, because of the bijective relationship
between buildslaves and Builder objects. And because every slave now
finishes at about the same time, all of those invocations land
within a very short window.

We profiled buildbot (with cProfile and with a hastily-written
sampling profiler, sketched below) and found that:
 - gc.collect, called from BuilderStatus.prune, consumes a significant
amount of CPU time, and
 - during periods of 100% CPU usage, we were typically evaluating
BuilderStatus.prune.
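
For reference, the sampling profiler was nothing fancy; it was
roughly this shape (a simplified sketch, with made-up names):

    import signal
    import traceback

    samples = {}

    def _sample(signum, frame):
        # Bucket samples by the formatted stack of the interrupted
        # frame; hot code paths accumulate the most samples.
        key = ''.join(traceback.format_stack(frame))
        samples[key] = samples.get(key, 0) + 1

    def start_sampling(interval=0.01):
        # ITIMER_PROF counts CPU time consumed by the process, so
        # samples land wherever we're actually burning cycles.
        signal.signal(signal.SIGPROF, _sample)
        signal.setitimer(signal.ITIMER_PROF, interval, interval)

    def dump_samples(limit=10):
        ranked = sorted(samples.items(), key=lambda kv: -kv[1])
        for key, count in ranked[:limit]:
            print '%d samples in:\n%s' % (count, key)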

Why is gc.collect called from BuilderStatus.prune? Is it necessary? I
looked through status/builder.py's commit history and read
http://trac.buildbot.net/ticket/459 (and
http://trac.buildbot.net/ticket/458), but I'm not yet familiar enough
with buildbot's internals to tell whether the commit message for
aea14d3f245badd92c5e197803cf56cc4e6d96be or the comments on ticket
#459 explain why gc.collect was added.
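
For anyone who wants to reproduce the cost outside buildbot:
gc.collect does a full collection, so its runtime grows with the
number of live tracked objects rather than with the amount of
garbage actually collected. A standalone snippet (not buildbot code;
the timing depends on your heap) that makes this visible:

    import gc
    import time

    # Simulate a long-lived process with a large live heap, like a
    # buildmaster retaining lots of build/status objects.
    live = [{'n': i} for i in xrange(1000000)]

    start = time.time()
    gc.collect()  # full collection: traverses every tracked object
    print 'gc.collect: %.3fs over %d tracked objects' % (
        time.time() - start, len(gc.get_objects()))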

Are there other gotchas that we may encounter with our current setup?
Has anyone else run into scalability issues with buildbot?



