[Buildbot-devel] Several scalability questions

David Coppit dcoppit at nvidia.com
Tue Jan 18 18:40:21 UTC 2011


Greetings all,

Question #1: We've set up Buildbot on enterprise-level hardware, running a single master with a custom scheduler and UI page, servicing 40 slave machines. We have a few hundred users. We currently have 88 builders and plan to go up to a few hundred. Changes currently feed into 18 instances of our custom scheduler (see question #3), a Try scheduler, and 4 Nightly schedulers. Nearly all of the builders can run on nearly all of the slave machines. Does this setup raise any scalability red flags?

Question #2: Section 4.5.2 of the manual describes multi-master mode:

http://buildbot.net/buildbot/docs/latest/full.html#Database-Specification

It alludes to some limit on the number of slaves attached to a single master. Is 40 slaves considered to be "a lot"? What are the symptoms of having too many slaves?
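For context, our reading of that section is that multi-master mode mostly amounts to pointing every master at the same external database. A sketch of the relevant fragment of each master.cfg as we understand it, with an illustrative connection string (not our real credentials):

# The shared-database piece of multi-master mode, as we understand it:
# every master's master.cfg points at one external database.
c = BuildmasterConfig = {}
c['db_url'] = "mysql://buildbot:secret@dbhost/buildbot"  # illustrative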

Question #3: We've written a custom UI that has a few slow queries (totaling about 400ms per page view) and a custom scheduler that has a few slow queries (totaling about 300ms per execution of the "fileIsImportant" callable). We also have some build steps that run on the master and take a while.

Watching the system execute, it appears that the UI can slow down scheduling and vice versa. We thought that BB was multi-threaded. Can it handle hundreds of page views along with scheduling, builders, and custom build steps on the master? Is it possible that we've serialized the threads through table locks or something? (See question #6 below.)
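Our working theory is that our blocking queries run on the reactor thread and stall everything else. Assuming the usual Twisted fix applies, and assuming fileIsImportant may return a Deferred (which we haven't verified), the change would look something like this sketch, where slow_query stands in for our real query code:

from twisted.internet import threads

def slow_query(change):
    # Stand-in for our real database work (~300ms per change).
    return True

# What we do now: slow_query() runs on the reactor thread, so the web
# UI and scheduling stall until it returns.
def fileIsImportant(change):
    return slow_query(change)

# The usual Twisted fix: push the blocking call onto the thread pool
# and return a Deferred instead. Whether Buildbot honors a Deferred
# here is exactly what we're not sure about.
def fileIsImportantDeferred(change):
    return threads.deferToThread(slow_query, change)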

Question #4: Over time we noticed that the system was getting slower. We correlated the problem with a large number of rows in scheduler_changes -- over 100k. The schedulers table also seemed to have a large number of defunct schedulers. The master seemed to be processing all the scheduler_changes associated with the defunct schedulers, issuing tons of DB queries. When we cleared out the schedulers and scheduler_changes tables, the master re-created a much smaller number of schedulers and kept the scheduler_changes table small.

How did these tables get so large? In case it's relevant, our custom scheduler subclasses AnyBranchScheduler, specifying a fileIsImportant callable. However, it doesn't specify anything for change_filter or branches/categories. Could that contribute to the cruft?
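In case the concrete shape helps, our scheduler construction looks roughly like the first form below (names and numbers are illustrative), and the second form shows the change_filter we could add if leaving it unset is what accumulates the cruft:

from buildbot.schedulers.basic import AnyBranchScheduler
from buildbot.changes.filter import ChangeFilter

def our_file_is_important(change):
    # Stand-in for our real fileIsImportant callable.
    return True

# Roughly what we have today: no change_filter, no branches/categories.
s = AnyBranchScheduler(name="custom-sched",
                       treeStableTimer=60,
                       builderNames=["builder-a", "builder-b"],
                       fileIsImportant=our_file_is_important)

# What we could switch to, limiting the scheduler to one branch:
s2 = AnyBranchScheduler(name="custom-sched-filtered",
                        treeStableTimer=60,
                        builderNames=["builder-a", "builder-b"],
                        fileIsImportant=our_file_is_important,
                        change_filter=ChangeFilter(branch="trunk"))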

Is the following an appropriate maintenance activity?

DELETE FROM schedulers;
DELETE FROM scheduler_changes WHERE schedulerid NOT IN (SELECT schedulerid FROM schedulers);

Is there any other "cruft" that might accumulate that we have to worry about? Or any periodic maintenance we should do?
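(Note that as written, the first DELETE empties schedulers, so the second clears scheduler_changes entirely; that matches what we did by hand.) If this is a sane thing to do periodically, we'd probably wrap it in a small script and run it from cron while the master is stopped or idle. A sketch using PyMySQL, with illustrative credentials:

import pymysql

# Illustrative connection parameters, not our real credentials.
conn = pymysql.connect(host="dbhost", user="buildbot",
                       passwd="secret", db="buildbot")
try:
    cur = conn.cursor()
    # The master re-creates rows for its live schedulers at startup.
    cur.execute("DELETE FROM schedulers")
    # Drop change associations that no longer point at a scheduler.
    cur.execute("DELETE FROM scheduler_changes "
                "WHERE schedulerid NOT IN "
                "(SELECT schedulerid FROM schedulers)")
    conn.commit()
finally:
    conn.close()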

Question #5: Has anyone run into lost DB connections, segfaults in libc and libmysqlclient, and BB process crashes that leave nothing in the logs? When we had our query-heavy UI and scheduler running, we were plagued by crashes every hour or so. This was also before we cleaned up the table cruft described in question #4. We switched to the PyMySQL client library and the DB connection problems went away. We're on Ubuntu, running the latest apt-fetchable versions of the various libraries and Python packages.

We still have a DB connection deadlock problem (queries hang), but we suspect that's a bug in our custom scheduler. The same underlying issue may also explain the segfaults; perhaps PyMySQL just surfaces the symptom as a deadlock rather than a crash...

Question #6: Our custom scheduler and UI both use the synchronous, blocking runInteractionNow(). Is this just a bad idea? Should we switch to runInteraction() and wait for the query to complete? Our rationale for switching is that runInteraction() would use the thread pool rather than a single synchronous DB connection.
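Concretely, the switch we have in mind looks like this sketch (the query and the continuation are illustrative, not our real code):

from twisted.python import log

def count_pending(db):
    # db is the master's DBConnector; the query is illustrative.
    def thd(txn):
        txn.execute("SELECT COUNT(*) FROM scheduler_changes")
        return txn.fetchone()[0]

    # Today: n = db.runInteractionNow(thd) blocks right here.
    # Proposed: run thd on the DB thread pool and get a Deferred back.
    d = db.runInteraction(thd)

    def got_count(n):
        log.msg("pending scheduler_changes: %d" % n)
    d.addCallback(got_count)
    return d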
