[Buildbot-devel] database-backed status/scheduler-state project

Mon Sep 7 12:01:57 UTC 2009

Sorry to reiterate, but...

I'm concerned about the complexity here. I don't think it's necessary to
solve Mozilla's problems, and I don't think that it's going to pay off
outside of Mozilla. I'm also concerned that the scheduling stuff seems to
trump the status report issues we're really having, which cost productivity
across all of Mozilla.

That's not to say the design goals wouldn't be nice to have, but I do think
the design has gone a good deal further than what's needed.

Axel

2009/9/5 Brian Warner <warner at lothar.com>

>
> [I've only got a few minutes to write this up.. I'll try to respond more
> thoroughly tomorrow.. sorry]
>
> "Dustin J. Mitchell" <dustin at zmanda.com> wrote:
> >
> > Having written a Python daemon that's controlled from Access via ODBC
> > and various silly "control" tables, I heartily agree.  Besides the
> > reasons above, you also commit the daemon to "polling" the database
> > for changes to the control tables, or (horrors!) writing DB triggers
> > that somehow signal the daemon that something has changed.
>
> My plan was to build a central "notification service", using a simple
> line-oriented TCP pubsub protocol, that could run on the same server as the
> database. Anybody who reads the database to discover what work they need to
> do will subscribe to hear about changes to the table(s) of interest, anyone
> who adds things to those tables will publish the fact that they've changed
> something. When master.cfg says to use a local database, it uses a
> degenerate in-process service.
>
> John-at-Mozilla seemed to think that polling would be good enough (and that
> he'd want to control the rate of polling, over concern about load on the
> reader), but like you I want low and reasonably deterministic latency.
> Polling sucks.
>
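
For illustration, the "degenerate in-process service" could look roughly like
the sketch below. The class name, method names, and the "buildrequests" table
name are assumptions made up for this example, not anything from buildbot; the
networked variant would put the same subscribe/publish calls behind the
line-oriented TCP protocol.

```python
class InProcessNotifier:
    """Hypothetical in-process stand-in for the notification service."""

    def __init__(self):
        # Maps a table name to the callbacks interested in its changes.
        self._subscribers = {}

    def subscribe(self, table, callback):
        # Readers register interest in a table they would otherwise poll.
        self._subscribers.setdefault(table, []).append(callback)

    def publish(self, table):
        # Writers announce that they changed something in the table.
        for callback in list(self._subscribers.get(table, ())):
            callback(table)


notifier = InProcessNotifier()
seen = []
notifier.subscribe("buildrequests", seen.append)
notifier.publish("buildrequests")  # delivered to the subscriber
notifier.publish("changes")        # nobody subscribed, silently ignored
```

A subscriber only learns that a table changed; it still reads the database to
find out what the new work is.
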
> > From: exarkun at twistedmatrix.com
> >
> > One of the main problems I currently encounter with BuildBot operation
> > is that slaves will often lose their connection to the master as a
> > *consequence* of attempting a build.  Something that I'd be concerned
> > about with automatic re-queueing is that a build would result in a never
> > ending stream of work for a slave, as it tries the build, fails,
> > disconnects, reconnects, is again instructed to attempt the build,
> > fails, disconnects, etc.
>
> Yeah, I don't know how to solve that yet. I think we've got consensus that
> retrying a build that fails because of a transient buildbot problem (lost
> slave, master bounce) is good, but of course telling the difference between
> transient and caused failures is generally impossible. Maybe a retry limit?
> If the build experiences "transient" failures more than N times it gets
> shelved for manual triggering?
>
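
A minimal sketch of the retry-limit idea, assuming the master keeps a count of
"transient" failures per build request. The function name, the return values,
and the default of 3 for N are all hypothetical.

```python
MAX_TRANSIENT_RETRIES = 3  # the "N" above; the value is an assumption


def next_action(transient_failures, limit=MAX_TRANSIENT_RETRIES):
    """Decide what to do with a build after a transient failure.

    Returns "requeue" while the build is within the limit, and "shelve"
    (wait for manual triggering) once it has failed more than `limit` times.
    """
    if transient_failures > limit:
        return "shelve"
    return "requeue"
```

This bounds the try/fail/disconnect loop without having to decide whether any
single failure was really transient.
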
> Although not necessarily a part of this project, I still want to have a
> less connection-sensitive master-slave build protocol, one which can
> tolerate connection loss without dropping messages as long as the
> processes on both sides remain running. (i.e. stop treating buildsteps as
> pb.Referenceable, use unique build/step id values, hold status updates in
> memory until they're ACKed). I think this is feasible, and might help with
> the problem you've been seeing.
>
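
The "hold status updates in memory until they're ACKed" part might be sketched
as below. This is only an illustration: the class, its methods, and the use of
a per-connection sequence number are assumptions, and nothing here touches the
real PB protocol.

```python
import itertools
from collections import OrderedDict


class UpdateBuffer:
    """Hypothetical slave-side buffer of status updates awaiting an ACK."""

    def __init__(self):
        self._next_seq = itertools.count(1)
        self._pending = OrderedDict()  # seq -> (build_id, step_id, payload)

    def queue(self, build_id, step_id, payload):
        # Updates are keyed by unique build/step ids rather than by a
        # live remote reference, so they survive a reconnect.
        seq = next(self._next_seq)
        self._pending[seq] = (build_id, step_id, payload)
        return seq

    def ack(self, seq):
        # The master confirmed receipt; the update can be forgotten.
        self._pending.pop(seq, None)

    def unacked(self):
        # After a reconnect, everything still here is sent again, in order.
        return list(self._pending.items())
```

On reconnect the sender would replay `unacked()`; the receiver has to tolerate
duplicates, which the build/step ids make possible.
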
> > I'm curious about the motivation to use the database directly as the RPC
> > mechanism here.  This makes the database schema part of the public API,
> > something which impedes future improvements to the schema.
>
> Speaking strictly to the scheduler-db (as opposed to the status-db), we
> want two things:
>
>  * persistence, so master reboots don't lose so much state
>
>  * coordination of multiple buildmaster services on separate boxes
>
> To achieve the former requires something to be stored on disk. To achieve
> the latter requires something that speaks over the network. Typical
> databases do both. So it seemed slightly easier to have these multiple
> services talk through a database (plus "notification server") than, say,
> to have one component be responsible for storing all state on disk, and
> have all the other components speak PB to that piece.
>
> DB-as-RPC also lets other tools get into the game. One example that John
> mentioned was a web-driven PHP script that looks for any build requests
> that you've submitted and cancels them. You should be able to hand the
> schema definition to your average PHP developer and have it done in an
> hour. If they have to speak PB to a "buildmaster-master", or if we have to
> implement an XMLRPC protocol for each conceivable control knob, it will
> take a lot longer.
>
> OTOH, yes, exposing this makes the schema part of the public API. I think
> it may be ok to say that any tools which touch the scheduler-db brains are
> responsible for keeping up with buildbot changes, and that we won't bend
> over backwards to provide them with a compatible interface. The
> alternative is to define some set of operations that we hope will be
> sufficient, create some other RPC mechanism to trigger them (PB? XMLRPC?),
> and then attempt to provide backwards-compatibility for those (even if
> they become insufficient or obsolete at some point). Seems like
> compatibility is hard and maybe unnecessary, so we might as well make the
> job easier somehow.
>
> Putting things in the status-db is less controversial, I think, because
> that's basically write-only (no scheduling decisions are made on the basis
> of status-db contents).
>
> > I suppose if you have ten thousand slaves it might still be beneficial...
> > but then perhaps you just move the bottleneck into your database server?
>
> I believe the mozilla farm has something like 800 slaves. Also, I believe
> that anyone with an installation that large will also have DBAs who can fix
> the performance problems, which is certainly not the case with pickles. The
> theme is moving the problem to a place where other people can solve it.
>
> > The scheduler state doesn't seem like it's very hefty, though. How much
> > data are we talking about here? Even with ten thousand slaves, I don't
> > see there being much more than an order of magnitude more rows than that
> > in the database, and the average database will be closer to... 20 rows?
> > 50? What other data am I overlooking here?
>
> I'd expect it to be just the queued/active builds (so yeah, fewer than 100
> rows). But some of the discussions I've had with the mozilla folks suggest
> that we should consider keeping historical scheduling information in the
> same table, to answer questions like "how long was this buildrequest kept
> waiting" and "why was this request submitted". We've talked about having
> two separate tables (one for active requests, one for historical) and
> moving items from one to the other after the request is retired. We've
> also talked about putting this sort of information strictly in the
> status-db.
>
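
The two-table variant could be sketched like this. The table and column names
are invented for the example, and sqlite3 is used only so the snippet is
self-contained; the proposal itself calls for a networked database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE active_requests (
    id INTEGER PRIMARY KEY, builder TEXT, submitted_at REAL);
CREATE TABLE historical_requests (
    id INTEGER PRIMARY KEY, builder TEXT, submitted_at REAL, retired_at REAL);
""")


def retire_request(conn, request_id, retired_at):
    """Move one request from the active table to the historical table."""
    # "with conn" commits both statements together, or rolls both back.
    with conn:
        conn.execute(
            "INSERT INTO historical_requests "
            "(id, builder, submitted_at, retired_at) "
            "SELECT id, builder, submitted_at, ? "
            "FROM active_requests WHERE id = ?",
            (retired_at, request_id),
        )
        conn.execute(
            "DELETE FROM active_requests WHERE id = ?", (request_id,)
        )
```

With both timestamps kept, "how long was this buildrequest kept waiting" is
a subtraction over the historical table, and the active table stays small.
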
> Some folks at the meeting seemed wary of the notion of ever deleting a row
> from a table, and preferred to just mark it "done" instead. Still up in the
> air.
>
> > I hope you don't select SQLAlchemy. It has a poor track record and there
> > are a lot of other options out there.
>
> /me nods. I respect your opinion a lot, so I'll consider it seriously. What
> are some of the other options you're thinking of? I know what Axiom
> provides, but we need a networked database and something that won't mangle
> the schema so that other programs can play along.
>
>
> ok, gotta run, more later. Thanks everybody!
>
>  -Brian