[Buildbot-devel] database-backed status/scheduler-state project

Thu Sep 3 14:26:12 UTC 2009

On 2 Sep, 10:56 pm, warner at lothar.com wrote:
>
>Hi everybody, it's me again.
>
>I've taken on a short-term contract with Mozilla to make some
>scaling/usability improvements on Buildbot that will be suitable for
>merging upstream. The basic pieces are:
>
>* persistent (database-backed) scheduling state
>* DB-backed status information
>* ability to split buildmaster into multiple load-balanced pieces
>
>I'll be working on this over the next few months, pushing features into
>trunk as we get them working (via my github repo). The result should be
>a buildbot which:
>
>* lets you bounce the buildmaster without losing queued builds or the
>   state of e.g. Dependent schedulers
>* bouncing a master or slave during a build should re-queue the
>   interrupted build

One of the main problems I currently encounter with BuildBot operation 
is that slaves will often lose their connection to the master as a 
*consequence* of attempting a build.  My concern with automatic 
re-queueing is that such a build would turn into a never-ending stream 
of work for a slave: it tries the build, fails, disconnects, reconnects, 
is again instructed to attempt the build, fails, disconnects, and so on.
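
A rough sketch of the kind of guard I have in mind, in Python.  All of 
the names here (BuildRequest, on_slave_lost, MAX_ATTEMPTS) are made up 
for illustration and are not existing Buildbot API; the cap of 3 is an 
arbitrary assumption.

```python
# Hypothetical sketch: none of these names exist in Buildbot.
MAX_ATTEMPTS = 3  # assumed cap; a real deployment would make this configurable


class BuildRequest:
    def __init__(self, name):
        self.name = name
        self.attempts = 0  # times a slave started this build and was lost


def on_slave_lost(request, queue, abandoned):
    """Re-queue an interrupted build unless it has used up its attempts.

    Returns True if the request was re-queued, False if it was abandoned.
    """
    request.attempts += 1
    if request.attempts < MAX_ATTEMPTS:
        queue.append(request)
        return True
    # Stop the try/fail/disconnect loop and leave the request for a human.
    abandoned.append(request)
    return False
```

The attempt count would have to live in the database along with the 
request, otherwise bouncing the master resets it.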
>
>* third-party tools can read or manipulate the scheduler state, to
>   insert builds, cancel requests, or accelerate requests, all by
>   fussing with the database

I'm curious about the motivation to use the database directly as the RPC 
mechanism here.  This makes the database schema part of the public API, 
something which impedes future improvements to the schema.  Also, if the 
data's constraints aren't *entirely* represented in the database schema, 
it opens up the possibility for invariants to be broken by data 
incautiously inserted by other tools.

It seems to me that providing a more constrained interface to the 
database would be a better solution all around.
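
To make that concrete, here is a minimal sketch of what I mean by a 
constrained interface, using the stdlib sqlite3 module.  The table 
layout, the SchedulerState class and its method names are all 
assumptions of mine, not anything from the proposal.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE buildrequests (
    id INTEGER PRIMARY KEY,
    buildername TEXT NOT NULL,
    priority INTEGER NOT NULL DEFAULT 0,
    state TEXT NOT NULL DEFAULT 'pending'
        CHECK (state IN ('pending', 'claimed', 'cancelled', 'done'))
)
"""


class SchedulerState:
    """The only supported way for outside tools to touch scheduler state."""

    def __init__(self, conn, known_builders):
        self.conn = conn
        self.known_builders = set(known_builders)

    def insert_request(self, buildername, priority=0):
        # An invariant the schema cannot express: the builder must exist
        # in the master's configuration.
        if buildername not in self.known_builders:
            raise ValueError("unknown builder: %r" % (buildername,))
        with self.conn:  # commits on success, rolls back on error
            cur = self.conn.execute(
                "INSERT INTO buildrequests (buildername, priority) "
                "VALUES (?, ?)",
                (buildername, priority))
        return cur.lastrowid

    def cancel_request(self, request_id):
        # Only pending requests may be cancelled; returns True on success.
        with self.conn:
            cur = self.conn.execute(
                "UPDATE buildrequests SET state = 'cancelled' "
                "WHERE id = ? AND state = 'pending'",
                (request_id,))
        return cur.rowcount == 1
```

Tools written against insert_request() and cancel_request() keep working 
when the table layout changes, and they cannot insert a row the 
buildmaster doesn't know how to handle.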
>
>* third-party tools can render status information (think PHP scripts
>   reading stuff out of the DB and generating a specialized waterfall)
>* multiple "build-process-master" processes (needs a better name) can
>   be run on separate CPUs, each handling some set of slaves. Each one
>   claims a buildrequest from the DB when it has a slave available, runs
>   the build, then marks the build as done. If one dies, others will
>   take over.

I hadn't thought about having multiple buildmasters before.  That's an 
interesting idea.  I wonder if there will really be a need to balance 
across multiple CPUs once the horrible inefficiencies of the current 
pickle data store are removed.  I suppose if you have ten thousand 
slaves it might still be beneficial... but then perhaps you just move 
the bottleneck into your database server?
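
For what it's worth, I'd guess the "claims a buildrequest" step comes 
down to a conditional UPDATE, something like the sketch below.  The 
buildrequests table and its claimed_by column are my own assumptions 
about the schema, and claim_next_request is a made-up name.

```python
import sqlite3


def claim_next_request(conn, master_name):
    """Try to claim the oldest unclaimed request.

    Returns the request id on success, or None if there was nothing to
    claim or another master got there first.
    """
    row = conn.execute(
        "SELECT id FROM buildrequests "
        "WHERE claimed_by IS NULL ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return None
    with conn:  # commits on success, rolls back on error
        # The "AND claimed_by IS NULL" guard is what makes this safe: if
        # another master claimed the row in the meantime, nothing matches.
        cur = conn.execute(
            "UPDATE buildrequests SET claimed_by = ? "
            "WHERE id = ? AND claimed_by IS NULL",
            (master_name, row[0]))
    return row[0] if cur.rowcount == 1 else None
```

Every master polling this table is load on the one database server, 
which is the bottleneck question above.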
>
>I'm hoping that the persistent scheduler-state code will be done by the
>end of the month, ready to put into a buildbot-0.8.0 release shortly
>thereafter.
>
>DATABASES:
>
>I'm planning to make the default config store the scheduler state in a
>SQLite file inside the buildmaster's base directory. To enable the
>scaling benefits, you'd need a real networked database, so I also plan
>to have connectors for MySQL and potentially others.

SQLite3, I assume (by way of the stdlib sqlite3 module or the 
third-party pysqlite2).

I'll encourage you to take a look at Axiom's custom lock timeout retry 
logic (mostly 
<http://divmod.org/trac/browser/trunk/Axiom/axiom/_pysqlite2.py#L75>) 
before you start trying to deal with concurrent access at all.  I think 
newer versions of SQLite3 have an even better mechanism for dealing with 
this problem, but I don't think it's exposed to Python yet.
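
To give the general shape of the idea (this is not Axiom's code, just a 
minimal sketch in current Python), a retry wrapper might look like the 
following.  The function name, the default timeout and delay, and the 
trick of recognising contention by looking for "locked" in the error 
message are all assumptions of mine.

```python
import sqlite3
import time


def execute_with_retry(conn, sql, params=(), timeout=10.0, delay=0.05):
    """Run one statement, retrying while the database is locked.

    Gives up and re-raises once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return conn.execute(sql, params)
        except sqlite3.OperationalError as e:
            # Assumption: lock contention shows up as an OperationalError
            # whose message contains "locked".  Anything else is re-raised
            # immediately.
            if "locked" not in str(e) or time.monotonic() >= deadline:
                raise
            time.sleep(delay)
```

The timeout argument to sqlite3.connect() covers the simple cases on its 
own; a wrapper like this only matters if you want control over how the 
waiting is done.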
>
>The plan is to have the schedulers make synchronous DB calls, rather
>than completely rewriting the scheduler/builder code to look more like a
>state machine with async calls (twisted.enterprise). This should let us
>finish the project sooner and with fewer instabilities, but also means
>that DB performance is an issue, since a slow DB will block everything
>else the buildmaster is doing. The Mozilla folks are ok with this, so
>we'll just build it and see how it goes.

As long as you're actually using SQLite3 and have suitable indexes and 
carefully constructed queries, this can definitely work out.  You may 
want to have a look at how Axiom supports unit testing query 
performance, too 
(<http://divmod.org/trac/browser/trunk/Axiom/axiom/test/util.py#L78>).
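
A different and much cruder check than what Axiom does, but one that 
needs nothing beyond the stdlib, is to ask SQLite for its query plan in 
a unit test and assert that the expected index shows up.  The table, 
index and helper names below are made up for the example.

```python
import sqlite3


def uses_index(conn, sql, params, index_name):
    """Return True if SQLite's query plan for `sql` mentions `index_name`."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is a human-readable description;
    # its exact wording varies between SQLite versions, so only look for
    # the index name.
    return any(index_name in str(row[-1]) for row in plan)


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE buildrequests (id INTEGER PRIMARY KEY, state TEXT)")
conn.execute("CREATE INDEX idx_state ON buildrequests (state)")

QUERY = "SELECT id FROM buildrequests WHERE state = ?"
```

A test can then assert uses_index(conn, QUERY, ("pending",), 
"idx_state") and fail as soon as a schema or query change turns the 
lookup into a full table scan.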
>
>It's very important to me that Buildbot is easy to get installed for all
>users, and installing a big database is not easy, so the default will be
>the no-effort-required entirely-local SQLite. Users will only have to
>set up a real database if they want the "distributed across multiple
>computers" scaling features.

The scheduler state doesn't seem like it's very hefty, though.  How much 
data are we talking about here?  Even with ten thousand slaves, I don't 
see there being much more than an order of magnitude more rows than that 
in the database, and the average database will be closer to... 20 rows? 
50?  What other data am I overlooking here?
>
>The statusdb (as opposed to the schedulerdb) may be implemented as a
>buildbot status plugin, leaving the existing pickle files alone, but
>exporting a copy of everything to an external database as the builds
>progress. This would reduce the work to be done (there's already some
>code to do much of this) and minimize the impact on the core code (we'd
>just be adding an extra file that could be enabled or not as people saw
>fit), but might not result in something that's as well integrated into
>the buildbot as it could be (and it might be nice to have a
>Waterfall/etc which read from the database, as things like
>filter-by-branchname would finally become efficient enough to use).

Since this is where almost all of the data BuildBot deals with will 
actually end up, it seems like this is where the concern for performance 
should really be focused.  I assume the Mozilla folks have provided you 
with some profile data, or will let you collect some to make sure 
efforts go to the right place.  Assuming the pickle handling really is a 
bottleneck, though, I hope there'll be an option to disable pickles and 
only have the new database.
>
>DEPENDENCIES:
>
>Buildbot-0.8.0 will need sqlite bindings. These come batteries-included
>(in the standard library) with python 2.5 and 2.6. Users running
>python2.4 will have to install the python-pysqlite2 package to run
>buildbot-0.8.0. I think this is a pretty minimal addition.

Only sort of related to this paragraph... but...  What's the 
relationship between BuildBot 0.7.x, BuildBot 0.8.0, and BuildBot 1.0?
>
>I'm examining SQLAlchemy to see if the features it offers would be worth
>the extra dependency load. I don't want to use a heavy ORM (because a
>big goal is to have a schema that's easy to query/manipulate from other
>languages), but it looks like it's got connection-pool-management and
>cross-DB support code that might be useful.

I hope you don't select SQLAlchemy.  It has a poor track record and 
there are a lot of other options out there.  I'll reiterate my point 
about not liking direct database access as the public API, too.
>
>What do people think about the 0.8.0 buildmaster potentially requiring
>sqlalchemy? Would that annoy you? Annoy new users? Make it hard to
>upgrade your environment?

It will annoy me, indeed.  I'll probably end up waiting a long time to 
let other people shake out whatever bugs come up.  I don't have the time 
or desire to track down obscure transaction bugs in BuildBot.  Assuming 
it eventually works reliably, I'll still probably upgrade, but 
I want a lot of other people's experience to be good before I think 
about it.

Jean-Paul