[Buildbot-devel] Several scalability questions

Chris AtLee chris at atlee.ca
Wed Jan 19 19:08:59 UTC 2011


Just wanted to chip in my $0.02 CAD ( = $0.0201 USD!)

On Tue, Jan 18, 2011 at 2:46 PM, Dustin J. Mitchell <dustin at v.igoro.us> wrote:
> On Tue, Jan 18, 2011 at 12:40 PM, David Coppit <dcoppit at nvidia.com> wrote:
>> Question #1: We've set up Buildbot on enterprise-level hardware, running a
>> single master with a custom scheduler and UI page, servicing 40 slave
>> machines. We have a few hundred users. We currently have 88 builders, and
>> plan to go up to a few hundred. The current builds feed into 18 custom
>> schedulers (see question #3), a Try scheduler, and 4 Nightly schedulers.
>> Nearly all of the builds can be handled by nearly all of the slave machines.
>> Does this setup raise any scalability red flags?
>
> Not particularly - it's a medium-sized installation.

Yeah, agreed that this seems doable in a single master.

>> Question #2: Section 4.5.2 of the manual describes multi-master mode:
>> http://buildbot.net/buildbot/docs/latest/full.html#Database-Specification
>> It alludes to some limit on the number of slaves attached to a single
>> master. Is 40 slaves considered to be "a lot"? What are the symptoms of
>> having too many slaves?
>
> This is hard to say for sure, but you'll know you have too many slaves
> when the master stops being responsive to data from slaves, e.g., log
> file data.  You'll see a lot of IO or a pegged CPU.
>
> I just counted 120 slaves on a randomly-selected master here at
> Mozilla, so 40 should be fine.

Other symptoms we've noticed are:
- slow waterfall / web status
- slow FileDownload and related steps
- inter-step lag: e.g., look at the last line of a build step's log; it
will say something like elapsedTime=2.037679.  Then look at the elapsed
time for the step on the master.  These _should_ be the same value.
When your master is overloaded, there will often be a lag between when
the step finishes on the slave and when the master processes that
event.  We were seeing lags of several minutes between steps due to
this.
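To put a rough number on that lag, here's a sketch of a hypothetical
helper (the elapsedTime line format is the one quoted above; the
master-side timestamps would come from the build status):

```python
import re

def step_lag(log_last_line, master_started, master_finished):
    """Difference between the master-recorded step duration and the
    slave-reported elapsedTime; a persistently large gap suggests the
    master is backlogged processing slave events."""
    m = re.search(r"elapsedTime=([\d.]+)", log_last_line)
    if m is None:
        return None
    slave_elapsed = float(m.group(1))
    return round((master_finished - master_started) - slave_elapsed, 3)

# e.g. master saw the step run for 2.1 s, slave reported 2.037679 s
print(step_lag("program finished, elapsedTime=2.037679", 100.0, 102.1))
```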

>> Question #3: We've written a custom UI that has a few slow queries (total of
>> about 400ms per page view) and a custom scheduler that has a few slow
>> queries (total of about 300ms per execution of the "fileIsImportant"
>> callable). We also have some build steps that run on the master that take a
>> while.
>> It appears from watching the system execute that the UI can slow down
>> scheduling and vice-versa. We thought that BB was multi-threaded. Can it
>> handle hundreds of pageviews along with scheduling, builders, and custom
>> build steps on the master? Is it possible that we've serialized the threads
>> through table locks or something? (See question #6 below)

If your schedulers are expensive to run, they can be split out into a
separate master or masters.  This has the nice side effect of being
able to deploy new scheduler implementations without interrupting
running builds.
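As a sketch, a scheduler-only master's master.cfg (0.8.x style) might
look like the following; the db_url value is made up, and the key
points are sharing the database with the build master and leaving out
builders and slaves entirely:

```python
# Hypothetical master.cfg fragment for a scheduler-only master.
# Both masters share one database, so this one can be restarted to
# deploy new scheduler code without interrupting running builds.
c = BuildmasterConfig = {}
c['multiMaster'] = True      # we intentionally have no builders here
c['db_url'] = "mysql://buildbot:pw@dbhost/buildbot"  # same DB as the build master
c['schedulers'] = []         # your (possibly custom) schedulers go here
c['builders'] = []           # no builders or slaves on this master
c['slaves'] = []
c['slavePortnum'] = 0        # no slaves connect here
```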

> Buildbot is not multi-threaded.  It uses Twisted Python.  So any
> blocking operations will block everything.  Currently Buildbot does a
> number of queries in a blocking fashion (this should be fixed in
> 0.8.4), and the web UI pages are rendered synchronously.  So the
> blocking operations you're mentioning are blocking any other action on
> the master - logs from slaves, scheduler operations, new build steps,
> other web UI access.
>
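Twisted's reactor is a single-threaded event loop, so this rule holds
for any such loop.  Purely as an illustration (using stdlib asyncio
rather than Twisted), the fix for a blocking call is to push it onto a
thread pool -- the analogue of Twisted's deferToThread -- so the loop
keeps servicing other events:

```python
import asyncio
import time

def slow_query():
    # stands in for a blocking DB query or synchronous page render;
    # called directly on the loop thread, this would stall everything
    time.sleep(0.2)
    return ["row1", "row2"]

async def fetch_offloaded():
    # run_in_executor is asyncio's analogue of Twisted's deferToThread:
    # the blocking work runs on a worker thread, the event loop stays free
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, slow_query)

print(asyncio.run(fetch_offloaded()))
```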
> I would definitely not expose buildbot to 100's of pageviews.
>
> We're working on moving the web UI out of process - in fact, moving it
> to the browser.  That should happen in 0.9.0.
>
>> Question #4: Over time we noticed that the system was getting slower. We
>> correlated the problem with a large number of rows in scheduler_changes --
>> over 100k. The schedulers table also seemed to have a large number of
>> defunct schedulers. The master seemed to be processing all the
>> scheduler_changes associated with the defunct schedulers, issuing tons of DB
>> queries. When we cleared out the schedulers and scheduler_changes tables,
>> the master re-created a much smaller number of schedulers and kept the
>> scheduler_changes table small.
>> How did these tables get so large? In case it's relevant, our custom
>> scheduler subclasses AnyBranchScheduler, specifying a fileIsImportant
>> callable. However, it doesn't specify anything for change_filter or
>> branches/categories. Could that contribute to the cruft?
>> Is the following an appropriate maintenance activity?
>> DELETE FROM schedulers
>> DELETE FROM scheduler_changes WHERE schedulerid NOT IN (SELECT schedulerid
>> FROM schedulers)
>> Is there any other "cruft" that might accumulate that we have to worry
>> about? Or any periodic maintenance we should do?
>
> I suspect that this is due to a misconfiguration in your custom
> schedulers.  Earlier versions did have a problem where Nightly
> schedulers would accumulate a large body of categorized changes
> without ever flushing them.  That's been fixed in 0.8.3.  The large
> schedulers table, however, makes me think that your custom schedulers
> are getting a new schedulerid on every startup - that may be something
> to look at.
>
> Note that MySQL doesn't physically remove deleted rows right away, so
> running 'DELETE FROM ..' can actually make your tables *slower*: MySQL
> may still scan the dead rows until the table is rebuilt.  You need to
> run OPTIMIZE TABLE (MySQL's rough analogue of vacuum) to reclaim that
> space.
>
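Putting those two points together, a cleanup pass would delete the
orphaned rows and then rebuild the tables.  This is only a sketch
against the stock 0.8.x table names; run it on a DB-API cursor
(MySQLdb/PyMySQL) at your own risk:

```python
# Hypothetical maintenance sketch; table/column names assume the stock
# buildbot 0.8.x schema.  OPTIMIZE TABLE rebuilds the table so MySQL
# stops scanning the dead rows left behind by DELETE.
CLEANUP = [
    "DELETE FROM scheduler_changes"
    " WHERE schedulerid NOT IN (SELECT schedulerid FROM schedulers)",
    "OPTIMIZE TABLE scheduler_changes",
    "OPTIMIZE TABLE schedulers",
]

def run_cleanup(cursor):
    # cursor: any DB-API cursor, e.g. from MySQLdb or PyMySQL
    for stmt in CLEANUP:
        cursor.execute(stmt)
```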
> The current scheduler implementation eats CPU cycles and database
> space for lunch -- and for no good reason.  I'm hoping to fix that in
> 0.8.5 or so.
>
>> Question #5: Has anyone run into lost DB connections, segfaults in libc and
>> libmysqlclient, and BB process crashes that don't leave anything in the
>> logs? When we had our query-heavy UI and scheduler running we were plagued
>> by crashes every hour or so. This was also before we cleaned up table cruft
>> as described in question #4. We switched to the PyMySQL client library and
>> the DB connection problems went away. We're on Ubuntu running the latest
>> apt-fetchable versions of the various libraries and Python packages.
>> We still have a DB connection deadlock problem (queries hang), but we
>> suspect that's a bug in our custom scheduler. All of these issues could be
>> what caused the segfaults problem--perhaps PyMySQL just shows the symptoms
>> as deadlocks rather than crashes...

We used to get random crashes under high load, which turned out to be
a glibc bug.  Upgrading to a newer glibc fixed the problem for us.
Running buildbot directly via twisted with core dumps enabled helped
us track this down:

ulimit -c unlimited
twistd -n -l twistd.log -y buildbot.tac >> output.log 2>&1

HTH,
Chris

> Mozilla's running a much larger installation backed by MySQL, and
> hasn't seen that sort of problem.
>
> The DB connection deadlock is probably due to the scheduler
> implementation.  It's very difficult to see how sub-queries are
> executed, which methods are called from worker threads, and so on.
>
> For what it's worth, the current Buildbot implementation of schedulers
> isn't particularly designed to be subclassed.  Buildbot-0.8.4 will
> drastically change the Scheduler implementation, and that will happen
> at least one *more* time before we declare a supported scheduler API
> for subclassing.  It's fine to subclass, but it's not easy, and as you
> can see the subclasses are virtually guaranteed to break on upgrade
> (until 0.10.0 or so).
>
>> Question #6: Our custom scheduler and UI both make use of the
>> synchronous+blocking runInteractionNow(). Is this just a really bad idea,
>> and we should switch to runInteraction() and wait for the query to complete?
>> Our rationale is that this would use the thread pool rather than a single
>> synchronous DB connection.
>
> Yes, this is basically the crux of the current work to rewrite
> Buildbot's database interfaces.  This is also likely what's causing
> your hangs - if you call runInteractionNow from scheduler methods,
> which are already *in* an interaction, Bad Things will happen (well, I
> don't expect segfaults, but hangs and bogus results are expected).  In
> fact, Mozilla made this same mistake a while back.
>
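The safe pattern is to do all related queries inside one interaction
function and hand it to the asynchronous runInteraction, which returns
a Deferred -- and never re-enter runInteraction[Now] from inside it.  A
sketch, with a plain sqlite3 cursor standing in for the transaction
(the change_files schema here is made up for illustration):

```python
import sqlite3

def important_files(txn, changeid):
    # All sub-queries happen inside this one function; in Buildbot you
    # would pass it to db.runInteraction(important_files, changeid),
    # which runs it on a thread-pool transaction and returns a
    # Deferred, instead of calling the blocking runInteractionNow() --
    # and you must not start a second interaction from in here.
    txn.execute("SELECT filename FROM change_files WHERE changeid = ?",
                (changeid,))
    return [row[0] for row in txn.fetchall()]

# stand-alone demo with an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE change_files (changeid INTEGER, filename TEXT)")
conn.executemany("INSERT INTO change_files VALUES (?, ?)",
                 [(1, "sched.c"), (1, "README"), (2, "other.c")])
print(important_files(conn.cursor(), 1))
```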
>> This email message is for the sole use of the intended recipient(s) and may
>> contain confidential information.  Any unauthorized review, use, disclosure
>> or distribution is prohibited.  If you are not the intended recipient,
>> please contact the sender by reply email and destroy all copies of the
>> original message.
>
> I was tempted to delete your message to avoid any legal liability for
> having read it.  Consider using a non-work email address to post to
> mailing lists.
>
> Dustin
>
> _______________________________________________
> Buildbot-devel mailing list
> Buildbot-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/buildbot-devel
>