[Buildbot-devel] Several scalability questions

Tue Jan 18 19:46:30 UTC 2011

On Tue, Jan 18, 2011 at 12:40 PM, David Coppit <dcoppit at nvidia.com> wrote:
> Question #1: We've set up Buildbot on enterprise-level hardware, running a
> single master with a custom scheduler and UI page, servicing 40 slave
> machines. We have a few hundred users. We currently have 88 builders, and
> plan to go up to a few hundred. The current builds feed into 18 custom
> schedulers (see question #3), a Try scheduler, and 4 Nightly schedulers.
> Nearly all of the builds can be handled by nearly all of the slave machines.
> Does this setup raise any scalability red flags?

Not particularly - it's a medium-sized installation.

> Question #2: Section 4.5.2 of the manual describes multi-master mode:
> http://buildbot.net/buildbot/docs/latest/full.html#Database-Specification
> It alludes to some limit on the number of slaves attached to a single
> master. Is 40 slaves considered to be "a lot"? What are the symptoms of
> having too many slaves?

This is hard to say for sure, but you'll know you have too many slaves
when the master stops being responsive to data from slaves, e.g., log
file data.  You'll see a lot of IO or a pegged CPU.

I just counted 120 slaves on a randomly-selected master here at
Mozilla, so 40 should be fine.

> Question #3: We've written a custom UI that has a few slow queries (total of
> about 400ms per page view) and a custom scheduler that has a few slow
> queries (total of about 300ms per execution of the "fileIsImportant"
> callable). We also have some build steps that run on the master that take a
> while.
> It appears from watching the system execute that the UI can slow down
> scheduling and vice-versa. We thought that BB was multi-threaded. Can it
> handle hundreds of pageviews along with scheduling, builders, and custom
> build steps on the master? Is it possible that we've serialized the threads
> through table locks or something? (See question #6 below)

Buildbot is not multi-threaded.  It uses Twisted Python, which runs
the master in a single event-loop thread, so any blocking operation
blocks everything.  Currently Buildbot does a number of queries in a
blocking fashion (this should be fixed in 0.8.4), and the web UI pages
are rendered synchronously.  So the blocking operations you're
mentioning are blocking any other action on the master - logs from
slaves, scheduler operations, new build steps, other web UI access.
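
To make the effect concrete, here is a minimal sketch of how one blocking
call stalls a single-threaded event loop, and how handing the same call
to a thread pool avoids the stall.  It uses asyncio from the Python
standard library as a stand-in for Twisted's reactor, so it is not
Buildbot code; slow_query, heartbeat and worst_stall are hypothetical
names invented for the illustration.

```python
import asyncio
import time


def slow_query(seconds):
    # Hypothetical stand-in for a blocking database query.
    time.sleep(seconds)
    return "rows"


async def heartbeat(gaps, interval=0.01):
    # Wakes up every `interval` seconds and records how long each wake-up
    # actually took.  A large gap means the event loop was blocked.
    last = time.monotonic()
    while True:
        await asyncio.sleep(interval)
        now = time.monotonic()
        gaps.append(now - last)
        last = now


async def blocking():
    # Runs the query on the event-loop thread: nothing else can run.
    slow_query(0.3)


async def offloaded():
    # Runs the query in the default thread pool; the loop stays free.
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, slow_query, 0.3)


async def measure(run_query):
    gaps = []
    task = asyncio.ensure_future(heartbeat(gaps))
    await asyncio.sleep(0.05)   # let the heartbeat tick a few times
    await run_query()
    await asyncio.sleep(0.05)   # let the heartbeat observe any stall
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return max(gaps)


def worst_stall(run_query):
    # Longest heartbeat gap, in seconds, seen while run_query executed.
    return asyncio.run(measure(run_query))


if __name__ == "__main__":
    print("blocking call, worst stall:  %.3fs" % worst_stall(blocking))
    print("offloaded call, worst stall: %.3fs" % worst_stall(offloaded))
```

In the blocking case the heartbeat cannot run until the query returns, so
its worst gap is at least as long as the query.  In the offloaded case the
gap stays close to the heartbeat interval.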

I would definitely not expose Buildbot to hundreds of pageviews.

We're working on moving the web UI out of process - in fact, moving it
to the browser.  That should happen in 0.9.0.

> Question #4: Over time we noticed that the system was getting slower. We
> correlated the problem with a large number of rows in scheduler_changes --
> over 100k. The schedulers table also seemed to have a large number of
> defunct schedulers. The master seemed to be processing all the
> scheduler_changes associated with the defunct schedulers, issuing tons of DB
> queries. When we cleared out the schedulers and scheduler_changes tables,
> the master re-created a much smaller number of schedulers and kept the
> scheduler_changes table small.
> How did these tables get so large? In case it's relevant, our custom
> scheduler subclasses AnyBranchScheduler, specifying a fileIsImportant
> callable. However, it doesn't specify anything for change_filter or
> branches/categories. Could that contribute to the cruft?
> Is the following an appropriate maintenance activity?
> DELETE FROM schedulers
> DELETE FROM scheduler_changes WHERE schedulerid NOT IN (SELECT schedulerid
> FROM schedulers)
> Is there any other "cruft" that might accumulate that we have to worry
> about? Or any periodic maintenance we should do?

I suspect that this is due to a misconfiguration in your custom
schedulers.  Earlier versions did have a problem where Nightly
schedulers would accumulate a large body of categorized changes
without ever flushing them.  That's been fixed in 0.8.3.  The large
schedulers table, however, makes me think that your custom schedulers
are getting a new schedulerid on every startup - that may be something
to look at.

Note that MySQL doesn't physically remove deleted rows right away, so
running 'DELETE FROM ..' can actually make your tables *slower*, as
MySQL still has to scan past the deleted rows.  You need to optimize
the table (OPTIMIZE TABLE in MySQL; "vacuum" is the term other
databases use) to reclaim that space.
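
As a rough sketch of the orphan cleanup from the question, the snippet
below runs the second DELETE against a toy schema.  It assumes SQLite
(via Python's sqlite3 module) as a stand-in for MySQL, and the two tables
carry only the columns needed for the illustration, so this is not
Buildbot's real schema.  Unlike the statements in the question, it leaves
the schedulers table alone and removes only the scheduler_changes rows
whose scheduler no longer exists.  purge_orphaned_changes and demo are
hypothetical names.

```python
import sqlite3


def purge_orphaned_changes(conn):
    # Delete scheduler_changes rows that point at a schedulerid with no
    # matching row in schedulers.  Returns the number of rows removed.
    cur = conn.execute(
        "DELETE FROM scheduler_changes "
        "WHERE schedulerid NOT IN (SELECT schedulerid FROM schedulers)"
    )
    conn.commit()
    return cur.rowcount


def demo():
    conn = sqlite3.connect(":memory:")
    # Toy schema: illustrative columns only.
    conn.execute(
        "CREATE TABLE schedulers (schedulerid INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.execute(
        "CREATE TABLE scheduler_changes (schedulerid INTEGER, changeid INTEGER)"
    )
    # Two live schedulers.
    conn.executemany(
        "INSERT INTO schedulers VALUES (?, ?)",
        [(1, "nightly"), (2, "try")],
    )
    # Changes for the live schedulers plus three rows left behind by
    # schedulers 3 and 4, which no longer exist.
    conn.executemany(
        "INSERT INTO scheduler_changes VALUES (?, ?)",
        [(1, 100), (2, 101), (3, 102), (4, 103), (4, 104)],
    )
    removed = purge_orphaned_changes(conn)
    remaining = conn.execute(
        "SELECT COUNT(*) FROM scheduler_changes"
    ).fetchone()[0]
    conn.close()
    return removed, remaining


if __name__ == "__main__":
    print(demo())
```

On MySQL the same DELETE would still need to be followed by a table
optimization, as described above, before the space is reclaimed.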

The current scheduler implementation eats CPU cycles and database
space for lunch -- and for no good reason.  I'm hoping to fix that in
0.8.5 or so.

> Question #5: Has anyone run into lost DB connections, segfaults in libc and
> libmysqlclient, and BB process crashes that don't leave anything in the
> logs? When we had our query-heavy UI and scheduler running we were plagued
> by crashes every hour or so. This was also before we cleaned up table cruft
> as described in question #4. We switched to the PyMySQL client library and
> the DB connection problems went away. We're on Ubuntu running the latest
> apt-fetchable versions of the various libraries and Python packages.
> We still have a DB connection deadlock problem (queries hang), but we
> suspect that's a bug in our custom scheduler. All of these issues could be
> what caused the segfaults problem--perhaps PyMySQL just shows the symptoms
> as deadlocks rather than crashes...

Mozilla's running a much larger installation backed by MySQL, and
hasn't seen that sort of problem.

The DB connection deadlock is probably due to the scheduler
implementation.  In that code it's very difficult to tell how to
execute sub-queries, which methods are called from worker threads,
etc.

For what it's worth, the current Buildbot implementation of schedulers
isn't particularly designed to be subclassed.  Buildbot-0.8.4 will
drastically change the Scheduler implementation, and that will happen
at least one *more* time before we declare a supported scheduler API
for subclassing.  It's fine to subclass, but it's not easy and as you
can see the subclasses are virtually guaranteed to break on upgrade
(until 0.10.0 or so).

> Question #6: Our custom scheduler and UI both make use of the
> synchronous+blocking runInteractionNow(). Is this just a really bad idea,
> and we should switch to runInteraction() and wait for the query to complete?
> Our rationale is that this would use the thread pool rather than a single
> synchronous DB connection.

Yes, calling runInteractionNow there is a bad idea, and this is
basically the crux of the current work to rewrite Buildbot's database
interfaces.  This is also likely what's causing your hangs - if you
call runInteractionNow from scheduler methods, which are already *in*
an interaction, Bad Things will happen (well, I don't expect
segfaults, but hangs and bogus results are expected).  In fact,
Mozilla made this same mistake a while back.

> This email message is for the sole use of the intended recipient(s) and may
> contain confidential information.  Any unauthorized review, use, disclosure
> or distribution is prohibited.  If you are not the intended recipient,
> please contact the sender by reply email and destroy all copies of the
> original message.

I was tempted to delete your message to avoid any legal liability for
having read it.  Consider using a non-work email address to post to
mailing lists.

Dustin