[Buildbot-devel] Buildbot performance with 25+ slaves

John Wieland johnny.c.sparkles at gmail.com
Fri Nov 16 09:17:59 UTC 2012


Hi,

I'm currently setting up Buildbot to build on more than 25 different
platforms. So far things are going quite well. The buildslave scripts are
hosted in a virtualenv on an NFS mount, and the platforms that don't have
python-2.7.3 installed also access it over NFS. We think Buildbot is a huge
improvement over our previous system and a great product.

I've had around 20 builders running on roughly that many slaves for about a
week and performance has been fine. These are intended to be our 'sanity'
builds and run off a SingleBranchScheduler. Yesterday I added another ~20
builders running on the same slaves but with different configuration to
create a nightly 'release' process. Everything was fine. Today I also added
a Try server with three platforms/builders. Problems then began to occur:
every builder showed builds as pending, but they were never kicked off. The
python process on the buildmaster was also sitting at 99.9% CPU. I actually
stuffed up when I added the Try server and attached it to the release
builders by mistake. This was around the time we first started noticing the
pending problem.

None of the logs indicated any problems, and restarting the master and
slaves and cancelling all builds didn't seem to fix anything.

I see a number of areas for possible investigation:
1. Something got messed up in the DB when I screwed up the Try server. I'm
considering backing up the DB (state.sqlite or something?) somewhere and
starting afresh to see what happens.
2. A single Buildmaster was never intended to handle this many builders -
this seems unlikely as I read here:
http://atlee.ca/blog/2009/07/28/profiling-buildbot/ they are using quite a
few (although they could be split up over many masters). I could possibly
split up the builds between different buildmasters.
3. The buildslaves shouldn't have multiple builders assigned to them - this
one seems a possibility. I notice a lot of locking and unlocking of objects
going on in the twistd.log of the master.
4. The NFS hosting of the scripts is slowing things down - we actually do
this with a lot of tools on our network (Java for instance) and no one else
has reported a slowdown as far as I am aware.
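
For option 1, a minimal sketch of the backup step might look like the
following. It assumes the master keeps its state in a single SQLite file
(the state.sqlite guess above), that the master is stopped while the copy is
made, and that the paths passed in are placeholders for the real ones. The
function name backup_and_check is made up for illustration. The check uses
SQLite's own PRAGMA integrity_check, which only reports file-level
corruption; it says nothing about whether the build requests stored in the
DB make sense.

```python
import os
import shutil
import sqlite3


def backup_and_check(db_path, backup_dir):
    """Copy an SQLite file aside and run SQLite's integrity check on the copy.

    db_path and backup_dir are placeholders; the master should be stopped
    before calling this so the file is not modified mid-copy.
    """
    if not os.path.isdir(backup_dir):
        os.makedirs(backup_dir)
    backup_path = os.path.join(backup_dir, os.path.basename(db_path) + ".bak")
    shutil.copy2(db_path, backup_path)

    # Check the copy, so the original file is never opened by this script.
    conn = sqlite3.connect(backup_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    # A healthy database yields the single row "ok".
    return backup_path, [row[0] for row in rows]
```

If the returned report is anything other than a single "ok", the file itself
is damaged and starting from a fresh DB is probably the quicker path.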

I'm also going to attempt to set up cProfile to see what is happening and
will post that here when I get it going.
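
As a rough illustration of the cProfile approach, the sketch below profiles a
single callable and returns the hottest entries by cumulative time. It is
written for a current Python 3 and uses only the standard cProfile and pstats
modules. The helper name profile_call is hypothetical, and how to hook
something like this into the running buildmaster process is not shown here.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Render the 20 most expensive entries, sorted by cumulative time.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(20)
    return result, stream.getvalue()
```

Sorting by cumulative time tends to surface the high-level call that is
spinning, which is usually more useful than raw per-function time when a
process is pegged at full CPU.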

I'm on a bit of a tight schedule and would appreciate any advice anyone
might have so that I can target my investigation. What logs should I be
looking in for signs of trouble? Are there any diagnostics I can run on the
DB? Can I increase the debugging in the buildmaster logs?

Again, great product. For a team that develops on as many archaic platforms
as we do (AIX, HPUX, Solaris, VxWorks, Windows), the try server will be very
useful. It was greeted with 'oooohhh's when I first showed it off :)

Thanks,
John