[Buildbot-devel] database-backed status/scheduler-state project

Thu Sep 3 09:17:13 UTC 2009

Brian Warner <warner at lothar.com> writes:

> I've taken on a short-term contract with Mozilla to make some
> scaling/usability improvements on Buildbot that will be suitable for
> merging upstream. The basic pieces are:

Sounds cool! For MariaDB we have for some time had similar needs/plans, though
at a lower level of ambition.

I did a tiny bit along these lines already. It is a Buildstep for running
mysql-test-run (MySQL test suite) which can insert results of test runs into
an external database. You might want to take a quick look, if for nothing else
to make sure that something like this will fit into your work. I.e. we would
want something like this to be able to use just additional tables in the same
status database:

    http://github.com/djmitche/buildbot/commit/00b2fac1f3f75c7431cd9aa6b06dfc993e56b5d6

We use this to be able to search in past test failures, see here:

    http://askmonty.org/buildbot/reports/cross_reference

I'd definitely be interested in following your progress on this, and should
also be able to assist with the implementation/maintenance work over time.

>  * persistent (database-backed) scheduling state
>  * DB-backed status information

Agree.

I did more or less the same for Pushbuild, the autobuilder used at MySQL
(in-house project). It was very successful.

>  * ability to split buildmaster into multiple load-balanced pieces

Really? What kind of load are you generating?!? :-)

If status pages (waterfall ...) are moved outside the twistd process, I have a
hard time imagining a single machine/cpu not being able to handle the master
load. Automatic failover would be nice of course. In any case, I guess this is
not for phase 1.

>  * lets you bounce the buildmaster without losing queued builds or the
>    state of i.e. Dependent schedulers
>  * bouncing a master or slave during a build should re-queue the
>    interrupted build

Yes. This is something we really need.

> I'm hoping that the persistent scheduler-state code will be done by the
> end of the month, ready to put into a buildbot-0.8.0 release shortly
> thereafter.

My guess would be that the main blocker for speedy release would be the need
to ensure backwards compatibility with all of the existing Buildbot setups out
there. But I don't know much about the scheduler code; you probably have a
much better idea than me of how hard this will be.

> I'm planning to make the default config store the scheduler state in a
> SQLite file inside the buildmaster's base directory. To enable the
> scaling benefits, you'd need a real networked database, so I also plan
> to have connectors for MySQL and potentially others.

We (as in MariaDB) could definitely do the MySQL connector.

> The plan is to have the schedulers make synchronous DB calls, rather
> than completely rewriting the scheduler/builder code to look more like a
> state machine with async calls (twisted.enterprise). This should let us

Hm, I disagree on this. At the _very_ least, you should isolate things so that
there is an easy path to switching to real asynchronous DB calls. Believe
me, I am fully aware of the pain this can be, but it's just the way you have
to work in frameworks like Twisted.

Are you aware of the option of using Inline Callbacks?

    http://twistedmatrix.com/documents/current/api/twisted.internet.defer.html#inlineCallbacks

They take away most of the pain. They are Python 2.5+ only though. Maybe the
"outside" of the scheduler could be implemented in two ways, with async db
calls using Inline Callbacks, and with sync db calls for compatibility with
<=2.4. And then the gut logic of the schedulers would be in method calls
shared between the two.

You mention yourself later that the scheduler could just fetch everything from
the database up-front. That really should not be hard to do asynchronously
even without Inline Callbacks. As for status updates, I think most/all of them
can be done asynchronously, as you don't need to wait for the results (at
least that is how it works in my mtrlogobserver.MTR buildstep).
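
A rough sketch of the split I have in mind, using only the standard library.
A thread pool stands in here for twisted.enterprise.adbapi (which also runs
the blocking DB-API calls in threads), and the table and function names are
made up for illustration, not taken from Buildbot:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def fetch_pending(db_path):
    """Blocking: read everything the scheduler needs in one go, up-front."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT id, branch FROM pending_changes ORDER BY id").fetchall()
    finally:
        conn.close()


def decide_builds(pending, branch):
    """The shared 'gut logic': pure, no I/O, identical for sync and async."""
    return [change_id for change_id, b in pending if b == branch]


def schedule_sync(db_path, branch):
    """Sync outside: blocks the caller while the database is read."""
    return decide_builds(fetch_pending(db_path), branch)


def schedule_async(executor, db_path, branch):
    """Async outside: returns a Future at once; the fetch and the decision
    both run in a worker thread (hypothetical stand-in for a Deferred)."""
    return executor.submit(schedule_sync, db_path, branch)
```

The point is only that decide_builds() never touches the database, so the
same code can sit behind either outer layer.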

> What do people think about the 0.8.0 buildmaster potentially requiring
> sqlalchemy? Would that annoy you? Annoy new users? Make it hard to
> upgrade your environment?

[Don't really know about dependencies. I just did `apt-get install buildbot`
to get all the dependencies pulled in, then installed newest version from
source.]

> I'm looking to hear about other folk's experiences with this sort of
> project. We've been talking about this for years, and some prototypes
> have been built, so I'd like to hear about them (I've been briefed on
> many of the mozilla efforts already).

My experience from Pushbuild is that there are big benefits for larger
setups. But I don't have any experience with how hard it would be to implement
in Buildbot.

Personally, I would choose to first implement a "phase 0" which handled
restarting builds with RESUBMIT state using the existing in-memory state
before attacking the larger problem of moving state to database.

>  * (probably) add "graceful shutdown" switch to the buildmaster. Once
>    the buildmaster is in this mode, new jobs will not be started, and
>    the buildmaster will shutdown once the last running job completes.
>    The switch may have an option to make the buildmaster restart itself
>    automatically upon shutdown. UI is uncertain.

For us at least, this is less useful than just having the Buildbot properly
retry all interrupted builds after restart. We have test runs that take 4-8
hours, and a graceful restart could see most of the other slaves idle for most
of this period, while an immediate restart would lose some work, but get all
of the slaves working again immediately.

>  * (maybe) add "graceful shutdown" switch to the buildslave, used in the
>    same way as the buildmaster's switch. UI is uncertain.

I think this is already there:

    http://djmitche.github.com/buildbot/docs/0.7.11/#Shutdown

>  * (probably) add "RESUBMIT" state to the overall Build object (along
>    with the existing SUCCESS, WARNING, FAILURE, EXCEPTION states). The
>    scheduling code will react to this by requeueing the BuildRequest.
>    Builds which stop because of a lost slave or restarted buildmaster
>    will be marked with this state, so they will be re-run when the
>    necessary resources come back.

Sounds good. Then Buildstep subclasses could report this state if they are
able to detect that a failure is caused by the external environment (disk
full?) rather than by a problem with the committed change.
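
Something along these lines, as a sketch only. The result names are plain
strings and classify_step()/handle_result() are hypothetical helpers, not
existing Buildbot API; the disk-full check is just one example of an
environment problem a step might detect:

```python
import errno

# Hypothetical result names; only RESUBMIT is the new state proposed above.
SUCCESS, FAILURE, RESUBMIT = "SUCCESS", "FAILURE", "RESUBMIT"


def classify_step(run_step):
    """Run a step callable and map its outcome to a result state."""
    try:
        run_step()
    except OSError as e:
        # Disk full is an environment problem, not the committer's fault.
        if e.errno == errno.ENOSPC:
            return RESUBMIT
        return FAILURE
    except Exception:
        return FAILURE
    return SUCCESS


def handle_result(queue, request, result):
    """Scheduler side: put the request back on the queue when asked to."""
    if result == RESUBMIT:
        queue.append(request)
    return result
```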

>  * dependency load must not increase significantly. I'm ok with
>    requiring SQLite because it's built-in to python2.5/2.6, and easy to
>    get for python2.4 . I'm not willing to require other database
>    bindings, nor to require all Buildbot users to install/configure an
>    e.g. MySQL database before they can run a buildmaster.

I think I was able to implement mtrlogobserver.MTR without any additional
dependencies, just ADBAPI which is included in Twisted? The user just passes
in a connection pool object, so they can handle any dependencies on database
bindings themselves.

But maybe you just meant that Buildbot should not require setting up an
external database, which makes sense.

> * databases: three databases, plus logserver
> ** Changes go in one database
> ** scheduling stuff (Scheduler state, builds ready-to-go/claimed/finished)
>    this includes BuildRequests and their properties
> ** status (steps, logids, results, properties)
>    the goal is for the buildmaster to never read from the status db, only
>    the status-rendering code (which will eventually live elsewhere)

What do you mean by "three databases"?

If you mean the *ability* to have the three different kinds of information in
three different independent places, then ok. That would mean eg. no foreign
keys among them etc.

However, it should be supported (and be the usual setup) to have everything in
one database, even the same schema, since it will be very useful to do
e.g. joins across the three different kinds of data. No unnecessary complexity
in the common case.

Also, as I mentioned above, it is important to have the ability for
users/Buildsteps to add additional status information in additional tables for
eg. failures in test suites.
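
To illustrate what the single-database case buys, here is a small sketch
with SQLite. The schema, table names and sample rows are all invented for
the example (this is not a proposal for the real schema); test_failure plays
the role of an extra table added by a Buildstep:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE changes (id INTEGER PRIMARY KEY, revision TEXT);
    CREATE TABLE buildrequests (id INTEGER PRIMARY KEY,
                                change_id INTEGER REFERENCES changes(id));
    CREATE TABLE builds (id INTEGER PRIMARY KEY,
                         request_id INTEGER REFERENCES buildrequests(id),
                         result TEXT);
    -- Extra table added by a Buildstep, next to the core ones.
    CREATE TABLE test_failure (build_id INTEGER REFERENCES builds(id),
                               test_name TEXT);
""")
conn.execute("INSERT INTO changes VALUES (1, 'rev-a')")
conn.execute("INSERT INTO changes VALUES (2, 'rev-b')")
conn.execute("INSERT INTO buildrequests VALUES (10, 1)")
conn.execute("INSERT INTO buildrequests VALUES (11, 2)")
conn.execute("INSERT INTO builds VALUES (100, 10, 'FAILURE')")
conn.execute("INSERT INTO builds VALUES (101, 11, 'SUCCESS')")
conn.execute("INSERT INTO test_failure VALUES (100, 'main.subselect')")

# Which revisions saw a given test fail?  One join across all kinds of data.
rows = conn.execute("""
    SELECT c.revision, f.test_name
      FROM test_failure f
      JOIN builds b ON b.id = f.build_id
      JOIN buildrequests r ON r.id = b.request_id
      JOIN changes c ON c.id = r.change_id
     WHERE f.test_name = ?
""", ("main.subselect",)).fetchall()
```

With the three kinds of data in separate databases, the same question would
need three queries and a join done by hand in Python.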

If you will post drafts of the database schema to the list, I will be happy to
review and provide comments.

> ** all schedulerdb calls should block, reconnect, retry */1s, log w/backoff

Really, you should reconsider whether it would be possible to use async db
calls. Otherwise just a single broken TCP connection could hang the entire
Buildbot for the duration of a TCP timeout :-(.
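
Even if the retry logic itself stays a plain blocking loop, it could be
pushed into a worker thread so that the main loop keeps running meanwhile. A
minimal sketch with the standard library only; the function names and the
choice of ConnectionError as the retryable error are my assumptions, and in
Twisted one would hand the loop to the reactor's thread pool instead:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_with_retry(db_call, attempts=5, first_delay=1.0, sleep=time.sleep):
    """Blocking retry loop with exponential backoff between attempts."""
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return db_call()
        except ConnectionError:
            if attempt == attempts:
                raise  # give up and let the caller see the error
            sleep(delay)
            delay *= 2


def call_in_background(executor, db_call, **kwargs):
    """Run the retry loop in a worker so the caller is not blocked.

    Returns a Future; the caller should attach a callback rather than
    wait on it, otherwise nothing is gained.
    """
    return executor.submit(call_with_retry, db_call, **kwargs)
```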

> ** config option to set DB type, connection arguments

Alternatively, let users just pass in an ADBAPI connection pool object for
maximum flexibility.

Good luck!

 - Kristian.