[Buildbot-devel] schedulerdb project status update

Thu Oct 15 19:07:06 UTC 2009

Buildbot-schedulerdb status report, 10/15/2009.

= Code =

Latest code is published on github:

 http://github.com/warner/buildbot/tree/schedulerdb-REBASES

  (the -REBASES suffix is a warning that the history of this branch will
   be rewritten frequently, as I rebase to latest trunk, and make the
   deltas easier to read. So if you write patches on top of this branch,
   be prepared to rebase them also, and do not merge this branch to
   trunk)

 Use "git clone -b schedulerdb-REBASES git://github.com/warner/buildbot.git"
 to grab a copy. "git diff master...schedulerdb-REBASES" will show you
 what's changed relative to trunk.

Schema is defined in buildbot/db.py:

 http://github.com/warner/buildbot/blob/schedulerdb-REBASES/buildbot/db.py

= Architecture =

Basic flow of control is:

 ChangeSource -> [changes tables]
 [changes] -> Scheduler -> [scheduler],[buildrequests]
 [buildrequests] -> Builder

A separate "Notification Loop" controls invocation of both Schedulers
and Builders. This ensures that only one Scheduler (or Builder) is
running at a time, and provides a place for Schedulers to wake up the
Builders after they've changed something. It also allows a Scheduler to
ask to be woken up later (e.g. for tree-stable-timer or Periodic). This
is the piece that will become a network-visible service, to enable a
distributed collection of buildmasters.
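
As an illustration only, the serializing behaviour described above
could look roughly like the following. This is a minimal sketch, not
the code on the branch: the class name, the method names and the
convention that a callback returns a delay in seconds (or None) are
all assumptions made for the example.

```python
import heapq


class NotificationLoop:
    """Hypothetical sketch: run registered callbacks strictly one at a time."""

    def __init__(self):
        self._callbacks = []    # Schedulers or Builders, as plain callables
        self._wakeups = []      # min-heap of requested wake-up times
        self._running = False   # True while a pass is in progress
        self._again = False     # set when trigger() arrives during a pass

    def add(self, callback):
        self._callbacks.append(callback)

    def trigger(self, now):
        # A trigger that arrives while a pass is running is remembered
        # and served by one more pass, never by a nested invocation.
        if self._running:
            self._again = True
            return
        self._running = True
        try:
            while True:
                self._again = False
                for callback in self._callbacks:
                    delay = callback(now)
                    if delay is not None:
                        # The callback asked to be woken up later.
                        heapq.heappush(self._wakeups, now + delay)
                if not self._again:
                    break
        finally:
            self._running = False

    def next_wakeup(self):
        """Earliest requested wake-up time, or None if nothing is pending."""
        return self._wakeups[0] if self._wakeups else None
```

A real implementation would hang this off the Twisted reactor and
schedule the wake-ups itself; the sketch only records them.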

The Scheduler loop is triggered each time a Change is added to
[changes]. Each Scheduler uses two separate transactions. In the first, it
examines [changes], classifies new ones as important or
unimportant, and writes its conclusions to the DB. In the second, it
reads these conclusions to decide if it is time to submit a
BuildRequest. If it needs to wait for time to pass, it returns a delay
value to the Notification loop. Schedulers must be idempotent, and keep
all of their state (but not their configuration) in the DB. Eventually,
the Scheduler adds a BuildRequest to [buildrequests], and wakes up the
Builder loop.
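
The two-transaction pattern can be sketched against an in-memory
sqlite DB as follows. Everything here is an assumption made for
illustration: the table and column names are not the schema in
buildbot/db.py, the "files ending in .txt are unimportant" rule is
invented, and TREE_STABLE_TIMER stands in for scheduler configuration.

```python
import sqlite3

TREE_STABLE_TIMER = 60  # seconds; hypothetical configuration value


def make_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE changes (changeid INTEGER PRIMARY KEY,
                              filename TEXT, at INTEGER);
        CREATE TABLE classified (changeid INTEGER PRIMARY KEY,
                                 important INTEGER, at INTEGER,
                                 pending INTEGER DEFAULT 1);
        CREATE TABLE buildrequests (id INTEGER PRIMARY KEY,
                                    submitted_at INTEGER);
    """)
    return db


def classify(db):
    """Transaction 1: store a verdict for every change not yet classified."""
    with db:  # commits on success, rolls back on an exception
        rows = db.execute(
            "SELECT changeid, filename, at FROM changes "
            "WHERE changeid NOT IN (SELECT changeid FROM classified)"
        ).fetchall()
        for changeid, filename, at in rows:
            important = 0 if filename.endswith(".txt") else 1
            db.execute(
                "INSERT INTO classified (changeid, important, at) "
                "VALUES (?, ?, ?)", (changeid, important, at))


def decide(db, now):
    """Transaction 2: submit a request, or return seconds left to wait."""
    with db:
        important, latest = db.execute(
            "SELECT SUM(important), MAX(at) FROM classified "
            "WHERE pending = 1").fetchone()
        if not important:
            return None          # nothing important is waiting
        remaining = latest + TREE_STABLE_TIMER - now
        if remaining > 0:
            return remaining     # tree not stable yet: ask for a wake-up
        db.execute("INSERT INTO buildrequests (submitted_at) VALUES (?)",
                   (now,))
        db.execute("UPDATE classified SET pending = 0 WHERE pending = 1")
        return None
```

Because all state lives in the tables, calling classify() or decide()
again with no new input changes nothing, which is the idempotence
property asked of Schedulers.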

The Builder loop invokes each Builder in turn. If the Builder has an
available slave and sees an unclaimed BuildRequest, it claims the breq
(by updating 'claimed_at' and 'claimed_by' in [buildrequests]), creates
the BuildRequest objects, and starts the build as usual. When the build
finishes, the request is removed from [buildrequests] (or at least made
inactive), and the Builder loop is retriggered, so it can look for more
work.

Builders are supposed to renew their claims at least once an hour. Other
buildmasters are allowed to take over a build whose claim has not been
renewed for over an hour. Each buildmaster can recognize its own (old)
claims, so if a buildmaster is bounced, it can reclaim builds immediately.
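
The claim rules in this paragraph reduce to a small predicate. The
sketch below is an assumption about how they might be written down,
not code from the branch: the function name, the use of None for "no
claim" and matching on a plain master name are all hypothetical.

```python
CLAIM_TIMEOUT = 60 * 60  # seconds; claims older than this count as stale


def can_claim(claimed_at, claimed_by, my_name, now):
    """Return True if this buildmaster may claim the build request."""
    if claimed_by is None:
        return True   # nobody holds a claim
    if claimed_by == my_name:
        return True   # our own (possibly pre-restart) claim
    # Someone else's claim: only if it has not been renewed for an hour.
    return now - claimed_at > CLAIM_TIMEOUT
```

For example, can_claim(1000, "masterA", "masterB", 2000) is False,
while the same call with now=5000 is True because the claim is more
than 3600 seconds old.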

= Progress =

About 90 hours on the clock so far.

Framing is 90% complete:
 * tables are mostly feature-complete
 * DB setup/access code works, "buildbot upgrade-master" creates sqlite DB
 * architecture seems sound
 * [todo] what history do we want to preserve? where should it be
   stored? separate tables?

Drywall is 50% complete. Still to do:
 * construct passing unit tests for Builder
 * update SQL to work with mysql, not just sqlite
 * port other Schedulers: Periodic/Nightly, Triggerable/Try
 * port Dependent scheduler: implement BuildSet, feedback from
   build-finished to buildset-complete to downstream-scheduler
 * figure out status interfaces for scheduler pieces:
   BuildRequestStatus, (current) BuilderStatus. Any "statusdb" work is
   explicitly deferred, but the existing interfaces need to keep working
   even if their backing store has been moved from RAM to the DB
 * periodically update the claim (held by Builders on BuildRequests)
 * avoid race condition in claiming buildrequests (only update if the
   claimed_at/claimed_by is unchanged)
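
One way to get the "only update if unchanged" behaviour from the last
item is a conditional UPDATE whose row count says whether the claim
won. This is a sketch under assumptions: the function name is
hypothetical, the table is reduced to the three columns it needs, and
the null-safe "IS" comparison is sqlite-specific (MySQL spells it
"<=>"), so it does not address the portability item above.

```python
import sqlite3


def try_claim(db, brid, seen_claimed_at, seen_claimed_by, me, now):
    """Claim request `brid` only if its claim is still what we last read.

    Expects a table like:
      buildrequests (id INTEGER PRIMARY KEY,
                     claimed_at INTEGER, claimed_by TEXT)
    with NULL in both claim columns meaning "unclaimed".
    """
    with db:  # commits on success, rolls back on an exception
        cursor = db.execute(
            "UPDATE buildrequests SET claimed_at = ?, claimed_by = ? "
            "WHERE id = ? AND claimed_at IS ? AND claimed_by IS ?",
            (now, me, brid, seen_claimed_at, seen_claimed_by))
    # Exactly one modified row means nobody changed the claim in between.
    return cursor.rowcount == 1
```

A caller that gets False back lost the race and should re-read the row
before deciding whether to try again.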

Painting is 10% complete:
 * retain compatibility with c['prioritizeBuilders']
 * update unit tests
 * figure out GC: keep DB from unbounded growth, remove old/unused
   SourceStamp/BuildRequest/Properties rows
 * construct buildmaster identity (for buildrequest claim):
   name+boottime

Problems/slowdowns:
 * lost about 15h trying (and failing) to clean up Builder.startBuild,
   to support RETRY cleanly, complicated by test_slaves
   (LatentBuildSlave, slaveping). This may or may not be necessary
   before merging this work into trunk.
 * spent about 5h accommodating surprising new features (nextSlave,
   nextBuild, latent slaves)
 * maybe 2h spent on other Mozilla RelEng support: machine
   allocation/utilization analysis

Likely remaining large changes:
 * move all DB code into a single file?
 * clean up mixed sync/async DB access: decide on what should not block,
   what can block, where the complexity of returning Deferreds is
   warranted
 * change allocate-next-highest-idnumber technique