[Buildbot-devel] database-backed status/scheduler-state project

Brian Warner warner at lothar.com
Sat Sep 5 01:29:01 UTC 2009


[I've only got a few minutes to write this up.. I'll try to respond more
thoroughly tomorrow.. sorry]

"Dustin J. Mitchell" <dustin at zmanda.com> wrote:
> 
> Having written a Python daemon that's controlled from Access via ODBC
> and various silly "control" tables, I heartily agree.  Besides the
> reasons above, you also commit the daemon to "polling" the database
> for changes to the control tables, or (horrors!) writing DB triggers
> that somehow signal the daemon that something has changed.

My plan was to build a central "notification service", using a simple
line-oriented TCP pubsub protocol, that could run on the same server as the
database. Anybody who reads the database to discover what work they need to
do will subscribe to hear about changes to the table(s) of interest, anyone
who adds things to those tables will publish the fact that they've changed
something. When master.cfg says to use a local database, it uses a degenerate
in-process service.
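
Something like this is what I have in mind for the degenerate in-process
flavor (all names here are made up to illustrate the shape of it, not an
existing buildbot API); the networked version would carry the same
subscribe/publish vocabulary over a line-oriented TCP connection:

    # Hypothetical sketch of the in-process "notification service"
    # described above. The networked flavor would speak the same
    # messages over a line-oriented TCP connection, e.g.
    #   client -> server: "SUBSCRIBE buildrequests"
    #   server -> client: "CHANGED buildrequests"
    class Notifier:
        def __init__(self):
            self.subscribers = {}  # table name -> list of callbacks

        def subscribe(self, table, callback):
            """Ask to be told when 'table' changes."""
            self.subscribers.setdefault(table, []).append(callback)

        def publish(self, table):
            """Announce that we just wrote to 'table'."""
            for cb in self.subscribers.get(table, []):
                cb(table)

    # A scheduler subscribes instead of polling:
    #   notifier.subscribe("buildrequests", lambda t: scheduler.checkForWork())
    # and anything that inserts a request calls:
    #   notifier.publish("buildrequests")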

John-at-Mozilla seemed to think that polling would be good enough (and that
he'd want to control the rate of polling, over concern about load on the
reader), but like you I want low and reasonably deterministic latency.
Polling sucks.

> From: exarkun at twistedmatrix.com
> 
> One of the main problems I currently encounter with BuildBot operation 
> is that slaves will often lose their connection to the master as a 
> *consequence* of attempting a build.  Something that I'd be concerned 
> about with automatic re-queueing is that a build would result in a never 
> ending stream of work for a slave, as it tries the build, fails, 
> disconnects, reconnects, is again instructed to attempt the build, 
> fails, disconnects, etc.

Yeah, I don't know how to solve that yet. I think we've got consensus that
retrying a build that fails because of a transient buildbot problem (lost
slave, master bounce) is good, but of course telling the difference between a
transient failure and one genuinely caused by the build is generally
impossible. Maybe a retry limit? If the build experiences "transient"
failures more than N times, it gets shelved for manual triggering?
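
Roughly like this, maybe (the 'request' object and its methods are made up
for illustration, not real buildbot attributes):

    MAX_TRANSIENT_RETRIES = 3  # arbitrary; would presumably be configurable

    def handle_build_failure(request, failure_was_transient):
        # Hypothetical: 'request' is a queued build request record that
        # carries a counter of transient failures.
        if not failure_was_transient:
            request.mark_failed()
            return
        request.transient_failures += 1
        if request.transient_failures > MAX_TRANSIENT_RETRIES:
            # shelve it: leave the request around but stop auto-retrying,
            # so a human has to re-trigger it explicitly
            request.shelve()
        else:
            request.requeue()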

Although not necessarily a part of this project, I still want to have a less
connection-sensitive master-slave build protocol, one which can tolerate
connection loss without dropping messages as long as the processes on both
sides remain running. (i.e. stop treating buildsteps as pb.Referenceable, use
unique build/step id values, hold status updates in memory until they're
ACKed). I think this is feasible, and might help with the problem you've been
seeing.
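
The buffering part could look something like this (hypothetical names, not
the current master/slave protocol): every status update gets a sequence
number and stays in memory until the other side ACKs it, so a reconnect just
replays everything after the last acknowledged one.

    class StatusUpdateBuffer:
        """Hold outgoing status updates until the peer ACKs them.
        Sketch only; 'connection.sendUpdate' is an assumed method on
        whatever the reconnecting transport turns out to be."""

        def __init__(self, build_id, step_id):
            self.build_id = build_id    # unique ids survive reconnects
            self.step_id = step_id
            self.next_seq = 0
            self.unacked = {}           # seq -> update

        def send(self, connection, update):
            seq = self.next_seq
            self.next_seq += 1
            self.unacked[seq] = update
            if connection is not None:
                connection.sendUpdate(self.build_id, self.step_id, seq, update)

        def ack(self, seq):
            # peer confirmed receipt; safe to forget
            self.unacked.pop(seq, None)

        def resend_after_reconnect(self, connection):
            # connection came back: replay everything not yet ACKed
            for seq in sorted(self.unacked):
                connection.sendUpdate(self.build_id, self.step_id, seq,
                                      self.unacked[seq])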

> I'm curious about the motivation to use the database directly as the RPC 
> mechanism here.  This makes the database schema part of the public API, 
> something which impedes future improvements to the schema.

Speaking strictly to the scheduler-db (as opposed to the status-db), we want
two things:

 * persistence, so master reboots don't lose so much state

 * coordination of multiple buildmaster services on separate boxes

To achieve the former requires something to be stored on disk. To achieve the
latter requires something that speaks over the network. Typical databases do
both. So it seemed slightly easier to have these multiple services talk
through a database (plus "notification server") than, say, to have one
component be responsible for storing all state on disk, and have all the
other components speak PB to that piece.
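
To make that concrete, here's a guess at what a minimal scheduler-db table
and the "claim a request" step might look like (sqlite3 purely for
illustration; the real schema is still undecided):

    import sqlite3

    # Illustrative only: a guess at a minimal buildrequests table, not a
    # committed schema. Multiple masters coordinate by atomically claiming
    # a row before working on it.
    conn = sqlite3.connect("scheduler.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS buildrequests (
            id           INTEGER PRIMARY KEY,
            buildername  TEXT NOT NULL,
            submitted_at INTEGER NOT NULL,   -- unix timestamp
            claimed_by   TEXT,               -- which master grabbed it
            complete     INTEGER DEFAULT 0   -- 0=pending, 1=done
        )
    """)
    conn.commit()

    def claim_next_request(conn, master_name):
        # Atomic claim: the UPDATE only succeeds if nobody else got
        # there first, so two masters can't both run the same request.
        cur = conn.execute(
            "UPDATE buildrequests SET claimed_by=? "
            "WHERE id=(SELECT id FROM buildrequests "
            "          WHERE claimed_by IS NULL AND complete=0 LIMIT 1) "
            "AND claimed_by IS NULL",
            (master_name,))
        conn.commit()
        return cur.rowcount == 1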

DB-as-RPC also lets other tools get into the game. One example that John
mentioned was a web-driven PHP script that looks for any build requests that
you've submitted and cancels them. You should be able to hand the schema
definition to your average PHP developer and have it done in an hour. If they
have to speak PB to a "buildmaster-master", or if we have to implement an
XMLRPC protocol for each conceivable control knob, it will take a lot longer.
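
For example, against the hypothetical table sketched above, "cancel all of my
pending requests" is a single UPDATE (assuming the table also recorded who
submitted each request in a submitted_by column; a real schema would probably
want a distinct "cancelled" state rather than reusing 'complete'):

    import sqlite3

    def cancel_my_pending_requests(db_path, submitter):
        # Hypothetical tool: mark this person's unclaimed requests as
        # complete so no master will pick them up.
        conn = sqlite3.connect(db_path)
        cur = conn.execute(
            "UPDATE buildrequests SET complete=1 "
            "WHERE submitted_by=? AND claimed_by IS NULL AND complete=0",
            (submitter,))
        conn.commit()
        return cur.rowcount  # how many requests were cancelled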

OTOH, yes, exposing this makes the schema part of the public API. I think it
may be ok to say that any tools which touch the scheduler-db brains are
responsible for keeping up with buildbot changes, and that we won't bend over
backwards to provide them with a compatible interface. The alternative is to
define some set of operations that we hope will be sufficient, create some
other RPC mechanism to trigger them (PB? XMLRPC?), and then attempt to
provide backwards-compatibility for those (even if they become insufficient
or obsolete at some point). Seems like compatibility is hard and maybe
unnecessary, so we might as well make the job easier somehow.

Putting things in the status-db is less controversial, I think, because that's
basically write-only (no scheduling decisions are made on the basis of
status-db contents).

> I suppose if you have ten thousand slaves it might still be beneficial...
> but then perhaps you just move the bottleneck into your database server?

I believe the mozilla farm has something like 800 slaves. Also, I believe
that anyone with an installation that large will also have DBAs who can fix
the performance problems, which is certainly not the case with pickles. The
theme is moving the problem to a place where other people can solve it.

> The scheduler state doesn't seem like it's very hefty, though. How much
> data are we talking about here? Even with ten thousand slaves, I don't see
> there being much more than an order of magnitude more rows than that in the
> database, and the average database will be closer to... 20 rows? 50? What
> other data am I overlooking here?

I'd expect it to be just the queued/active builds (so yeah, fewer than 100
rows). But some of the discussions I've had with the mozilla folks suggest
that we should consider keeping historical scheduling information in the same
table, to answer questions like "how long was this buildrequest kept waiting"
and "why was this request submitted". We've talked about having two separate
tables (one for active requests, one for historical) and moving items from
one to the other after the request is retired. We've also talked about
putting this sort of information strictly in the status-db.

Some folks at the meeting seemed wary of the notion of ever deleting a row
from a table, and preferred to just mark it "done" instead. Still up in the
air.
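
Either way it's only a couple of statements. Continuing with the hypothetical
table from before (and assuming a buildrequests_history table with the same
columns), retiring a request could look like:

    def retire_request(conn, request_id):
        # Option 1 (two tables): copy the finished request into a
        # historical table, then remove it from the active one.
        conn.execute(
            "INSERT INTO buildrequests_history "
            "SELECT * FROM buildrequests WHERE id=?", (request_id,))
        conn.execute("DELETE FROM buildrequests WHERE id=?", (request_id,))
        conn.commit()

    def retire_request_in_place(conn, request_id):
        # Option 2 (one table): never delete, just flag the row done and
        # let every query filter on 'complete'.
        conn.execute(
            "UPDATE buildrequests SET complete=1 WHERE id=?", (request_id,))
        conn.commit()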

> I hope you don't select SQLAlchemy. It has a poor track record and there
> are a lot of other options out there.

/me nods. I respect your opinion a lot, so I'll consider it seriously. What
are some of the other options you're thinking of? I know what Axiom provides,
but we need a networked database and something that won't mangle the schema
so that other programs can play along.


ok, gotta run, more later. Thanks everybody!

 -Brian



