[Buildbot-devel] database-backed status/scheduler-state project
Nicolas Sylvain
nsylvain at chromium.org
Thu Sep 3 00:11:58 UTC 2009
This sounds awesome!
With your new multi-process architecture, does this mean that we can have
WebStatus in one process, while the slave<->master communication runs in
another process? Will slaves stop dying when people make long waterfall
requests? (!!!)
Any plan to make the web server multi-threaded?
Where would the saved stdio live? All in the database? We have a lot of
scripts that parse them directly from the directory where they are saved.
Modifying these scripts to use SQL would make sense, as long as it's not too
hard to access the database.
Also, we collect more than a few GB of logs every day, and we need to prune
them often. It would be awesome if there were a way to tell buildbot when to
auto-prune old data. If not, it would be great if pruning old data in the
database were at least easy. For example, our current policy is: delete
build logs older than 21 days; keep build status for 28 days, then archive
it; keep the archive for 90 days.
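To make that concrete, here is the kind of nightly job I'd want to be able
to run (the table and column names are pure guesses on my part, since the
schema isn't defined yet; the archiving step is omitted):

    import sqlite3, time

    DAY = 24 * 60 * 60
    now = time.time()
    db = sqlite3.connect("status.sqlite")
    # guessed schema: builds(buildid, finished_at), logs(logid, buildid)
    db.execute("DELETE FROM logs WHERE buildid IN "
               "(SELECT buildid FROM builds WHERE finished_at < ?)",
               (now - 21 * DAY,))
    db.execute("DELETE FROM builds WHERE finished_at < ?",
               (now - 28 * DAY,))
    db.commit()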
Otherwise, that looks really good. I'll make sure to use these new features
on the Chromium buildbot as they land.
Thanks!
Nicolas
On Wed, Sep 2, 2009 at 3:56 PM, Brian Warner <warner at lothar.com> wrote:
>
> Hi everybody, it's me again.
>
> I've taken on a short-term contract with Mozilla to make some
> scaling/usability improvements on Buildbot that will be suitable for
> merging upstream. The basic pieces are:
>
> * persistent (database-backed) scheduling state
> * DB-backed status information
> * ability to split buildmaster into multiple load-balanced pieces
>
> I'll be working on this over the next few months, pushing features into
> trunk as we get them working (via my github repo). The result should be
> a buildbot which:
>
> * lets you bounce the buildmaster without losing queued builds or the
> state of e.g. Dependent schedulers
> * bouncing a master or slave during a build should re-queue the
> interrupted build
> * third-party tools can read or manipulate the scheduler state, to
> insert builds, cancel requests, or accelerate requests, all by
> fussing with the database
> * third-party tools can render status information (think PHP scripts
> reading stuff out of the DB and generating a specialized waterfall)
> * multiple "build-process-master" processes (needs a better name) can
> be run on separate CPUs, each handling some set of slaves. Each one
> claims a buildrequest from the DB when it has a slave available, runs
> the build, then marks the build as done. If one dies, others will
> take over.
>
> I'm hoping that the persistent scheduler-state code will be done by the
> end of the month, ready to put into a buildbot-0.8.0 release shortly
> thereafter.
>
> DATABASES:
>
> I'm planning to make the default config store the scheduler state in a
> SQLite file inside the buildmaster's base directory. To enable the
> scaling benefits, you'd need a real networked database, so I also plan
> to have connectors for MySQL and potentially others.
>
> The plan is to have the schedulers make synchronous DB calls, rather
> than completely rewriting the scheduler/builder code to look more like a
> state machine with async calls (twisted.enterprise). This should let us
> finish the project sooner and with fewer instabilities, but also means
> that DB performance is an issue, since a slow DB will block everything
> else the buildmaster is doing. The Mozilla folks are ok with this, so
> we'll just build it and see how it goes.
>
> It's very important to me that Buildbot remain easy to install for all
> users, and installing a big database is not easy, so the default will be
> the no-effort-required entirely-local SQLite. Users will only have to
> set up a real database if they want the "distributed across multiple
> computers" scaling features.
>
> The statusdb (as opposed to the schedulerdb) may be implemented as a
> buildbot status plugin, leaving the existing pickle files alone but
> exporting a copy of everything to an external database as the builds
> progress. This would reduce the work to be done (there's already some
> code to do much of this) and minimize the impact on the core code (we'd
> just be adding an extra file that could be enabled or not as people saw
> fit). On the other hand, the result might not be as well integrated into
> the buildbot as it could be, and it might be nice to have a
> Waterfall/etc which reads from the database, since things like
> filter-by-branchname would finally become efficient enough to use.
>
> DEPENDENCIES:
>
> Buildbot-0.8.0 will need sqlite bindings. These come batteries-included
> (in the standard library) with python 2.5 and 2.6. Users running
> python2.4 will have to install the python-pysqlite2 package to run
> buildbot-0.8.0. I think this is a pretty minimal addition.
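> For reference, the import fallback inside buildbot would look roughly
> like this:
>
>     try:
>         import sqlite3                  # stdlib in python 2.5 and 2.6
>     except ImportError:
>         # python 2.4: provided by the python-pysqlite2 package
>         from pysqlite2 import dbapi2 as sqlite3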
>
> I'm examining SQLAlchemy to see if the features it offers would be worth
> the extra dependency load. I don't want to use a heavy ORM (because a
> big goal is to have a schema that's easy to query/manipulate from other
> languages), but it looks like it's got connection-pool-management and
> cross-DB support code that might be useful.
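> If it does go in, the part of SQLAlchemy I'd actually use is just the
> engine/connection-pool layer, something like (not a commitment):
>
>     from sqlalchemy import create_engine
>
>     # the same call accepts "mysql://user:pw@host/buildbot", etc.
>     engine = create_engine("sqlite:///state.sqlite")
>     conn = engine.connect()
>     changes = conn.execute("SELECT * FROM changes").fetchall()
>     conn.close()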
>
> What do people think about the 0.8.0 buildmaster potentially requiring
> sqlalchemy? Would that annoy you? Annoy new users? Make it hard to
> upgrade your environment?
>
> HELP!:
>
> I'm looking to hear about other folks' experiences with this sort of
> project. We've been talking about this for years, and some prototypes
> have been built, so I'd like to hear about them (I've been briefed on
> many of the mozilla efforts already).
>
> I'll attach the proposal below, along with a file of notes that I made
> while walking through the code to see how this needs to work.
>
>
> cheers,
> -Brian
>
>
> ===== PROJECT PROPOSAL =====
>
> Buildbot Database project:
>
> The goal is to improve the usability and scalability of Buildbot to meet
> Mozilla's current needs, implemented in an appropriate fashion to get
> merged upstream. The primary "pain points" to be addressed are:
>
> * most buildmaster state is held in RAM, preventing process restarts
> for fear of losing queued builds and builds-in-progress. There is no
> "graceful shutdown" command, but even if there were, it could take
> hours or days to wait for everything in the queue to finish, losing
> valuable developer time.
>
> * buildmaster does many things in one process (build scheduling, build
> processing, status distribution), and CPU exhaustion has been
> observed
>
> * Waterfall display is very CPU-intensive. Current deployment does not
> share waterfall with outside world for fear of overload. Development
> of alternate status displays (which could run in separate processes)
> is hampered by the local-file pickle-based status storage format.
>
> The changes planned for this project are:
>
> * move build scheduling state out of RAM and into a persistent
> database, allowing buildmaster to be bounced without losing queued
> builds. Builders will claim builds from the database, perform the
> builds, then update the DB to mark the build state as done, allowing
> multiple buildmaster processes (on separate machines) to share the
> load, communicating mostly through the DB. New tools (written in
> arbitrary languages) can be used to manipulate the schedulerdb, to
> implement features like "accelerate build request", "cancel request",
> etc.
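>
> As a sketch of the kind of tool I mean (with illustrative table and
> column names, since the schema isn't final):
>
>     import sqlite3
>
>     req_id = 1234   # the buildrequest to operate on
>     db = sqlite3.connect("scheduler.sqlite")
>     # cancel: remove the request before any master claims it...
>     db.execute("DELETE FROM buildrequests WHERE id=? AND claimed_at=0",
>                (req_id,))
>     # ...or accelerate: jump to the front of the first-come-first-built
>     # ordering (these are alternatives, not a sequence)
>     db.execute("UPDATE buildrequests SET submit_time=0 WHERE id=?",
>                (req_id,))
>     db.commit()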
>
> * move build status out of pickle files into a database, to enable
> multiple processes (on separate hosts) to access the status. Database
> replication can then be used to allow a publicly visible Waterfall
> without threatening to overload the buildmaster. Status displaying
> tools (dashboards, etc) can be written in arbitrary languages and
> simply read the information they need from the statusdb.
>
> * add configuration options to switch on/off the four main buildmaster
> functions (ChangeMaster, Schedulers, Builder/Build processing, Status
> distribution), allowing these functions to be spread across multiple
> processes, using the state/status databases for coordination. The
> goal is to have one ChangeMaster/Schedulers process, multiple
> Builder/Build processing tasks (one "build-master" per "pod", with a
> set of slaves attached to each one), and multiple status distribution
> processes. This should help the scalability problem, by allowing the
> load to be spread across multiple computers.
>
> * the default database will be a local SQLite file, but master.cfg
> statements will allow flexible configuration of the database
> connection method. Postgres (or whatever mozilla's favorite DB is)
> will be tested. Others (at least MySQL) should be possible.
> Provisions will be made to tolerate the inevitable SQL dialect
> variations.
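>
> A minimal sketch of what those master.cfg statements might look like
> (syntax not final):
>
>     # default: zero-setup local SQLite file in the master's basedir
>     c['db_url'] = "sqlite:///state.sqlite"
>
>     # distributed setup: point every process at the same networked DB
>     #c['db_url'] = "postgres://buildbot:pw@db.example.com/buildbot"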
>
> * (probably) add "graceful shutdown" switch to the buildmaster. Once
> the buildmaster is in this mode, new jobs will not be started, and
> the buildmaster will shut down once the last running job completes.
> The switch may have an option to make the buildmaster restart itself
> automatically upon shutdown. UI is uncertain.
>
> * (maybe) add "graceful shutdown" switch to the buildslave, used in the
> same way as the buildmaster's switch. UI is uncertain.
>
> * (probably) add "RESUBMIT" state to the overall Build object (along
> with the existing SUCCESS, WARNINGS, FAILURE, EXCEPTION states). The
> scheduling code will react to this by requeueing the BuildRequest.
> Builds which stop because of a lost slave or restarted buildmaster
> will be marked with this state, so they will be re-run when the
> necessary resources come back.
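>
> In rough code, the reaction to RESUBMIT would be (names illustrative):
>
>     RESUBMIT = "resubmit"
>
>     def build_finished(db, request_id, result):
>         if result == RESUBMIT:
>             # lost slave or bounced master: release the claim so the
>             # request goes back into the queue
>             db.execute("UPDATE buildrequests SET claimed_at=0 "
>                        "WHERE id=?", (request_id,))
>         else:
>             db.execute("DELETE FROM buildrequests WHERE id=?",
>                        (request_id,))
>         db.commit()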
>
> * retain cancel-build capabilities (may require Builder to poll a DB to
> see if the build has been cancelled)
>
> Design restrictions imposed by Brian as Buildbot upstream developer:
>
> * dependency load must not increase significantly. I'm ok with
> requiring SQLite because it's built-in to python2.5/2.6, and easy to
> get for python2.4. I'm not willing to require other database
> bindings, nor to require all Buildbot users to install/configure an
> e.g. MySQL database before they can run a buildmaster.
>
> * existing 0.7.11 deployments must remain compatible with the new code.
> The default configuration must use SQLite in a local directory. Any
> state-migration steps that must be done will be handled by adding new
> code to the existing "buildbot upgrade-master" command.
>
> * all code must have clear User's Manual documentation (with examples)
> and adequate unit tests. All changes must be licensed compatibly with
> the upstream source (GPL2).
>
> The specific milestones we're planning are:
>
> * phase 1: Create the database connectors (initially only SQLite), move
> just the scheduler state into the database. This includes the output
> of the ChangeMaster, the internal state of all Schedulers, and the
> list of ready-to-go BuildRequests. All existing Scheduler classes and
> the Builder class will be changed to scan the database for work
> instead of looking at lists in RAM. The RESUBMIT state will be
> implemented and Builders updated to requeue such builds.
>
> This will allow the buildmaster to be bounced without loss of state
> (although any running builds will be abandoned and requeued). It will
> not yet enable the use of multiple processes. It will not touch the
> build status information (currently stored in pickle files).
>
> * phase 1.1: Implement the Postgres database connector, and the
> master.cfg options necessary to control which db type/location to
> use for scheduler state. Test a buildmaster running with a remote
> schedulerdb.
>
> * phase 1.2: Implement graceful-shutdown controls.
>
> * phase 2: Change the build-status code to store its state in a
> database, instead of in the current pickle files. Implement a "Log
> Server" to store/publish/stream logfile contents. Write a "buildbot
> upgrade-master" tool to non-destructively migrate old pickle data
> into the new database and logserver. Change the existing Status
> plugins (Waterfall, MailNotifier, IRCBot, etc) to read status from
> database. Add master.cfg options to control which db is used for
> status data.
>
> This will enable non-buildbot status-displaying frontends.
>
> * phase 3: Add master.cfg options to control which components are
> enabled in any given process. Provide mechanisms and examples to run
> e.g. multiple build-process-masters which coordinate through the
> database. Implement TCP/HTTP/polling -based "ping notifiers" to allow
> low-latency triggering between components in separate processes (i.e.
> Scheduler writes ready-to-build requests into DB, but the
> build-process-master on a separate host must be told to re-scan the
> DB for new work). Provide master.cfg options to control type/location
> of DB, ping-notifiers, and Log Server. build-process-master instances
> will have some configuration in common, other configuration unique to
> each instance.
>
> This will finally enable scaling through multiple buildbot processes,
> and multiple Waterfall renderers.
>
> I'm roughly targeting phase 1 to be incorporated into an upstream
> buildbot-0.8.0 release, and phase 2 in an 0.9.0 release shortly
> afterwards. Phase 3 may get into 0.9.0, or may go into a subsequent
> upstream release.
>
> Aggressive target is to get phase 1 done by the end of September, then
> evaluate schedule and progress made before beginning next phase. Overall
> goal is to complete project in 2-3 months.
>
> Sub-tasks which can be split out easily include:
>
> * database connector module: python "dbapi2" interface,
> reconnection-on-error (and log attempts w/backoff), cross-database
> compatibility code, blocking methods for scheduler state db,
> fire-and-forget (but retry for a little while) for status writes
>
> * "ping notifier" module: define HTTP POST / line-oriented TCP /
> polling protocol, implement client / server modules.
>
> * Log Server: writer-side PB interface, reader-side HTTP interface
>
> === DESIGN NOTES ===
> -*- org-mode -*-
>
> * databases: three databases, plus logserver
> ** Changes go in one database
> ** scheduling stuff (Scheduler state, builds ready-to-go/claimed/finished)
> this includes BuildRequests and their properties
> ** status (steps, logids, results, properties)
> the goal is for the buildmaster to never read from the status db, only
> the status-rendering code (which will eventually live elsewhere)
>
> * database connector
> ** all statusdb calls may raise DBUnavailableError
> renderer should deliver error to client
> ** all schedulerdb calls should block, reconnect, retry */1s, log w/backoff
> db is critical to this part
> ** config option to set DB type, connection arguments
> ** schema restrictions to get cross-db compatibility:
> - declare types explicitly (SQLite tolerates their absence; most DBs don't)
> - revision ids will be strings, SVN will deal
> - no binary strings. Unicode is ok(?).
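> a sketch of what "declare types" means for the changes table (columns
> from the Changemaster notes below):
>
>     import sqlite3
>     db = sqlite3.connect("scheduler.sqlite")
>     db.execute("""CREATE TABLE changes (
>                     changeid   INTEGER PRIMARY KEY, -- monotonically increasing
>                     branchname VARCHAR(256),
>                     revisionid VARCHAR(256),        -- strings, SVN will deal
>                     author     VARCHAR(256),
>                     timestamp  INTEGER,
>                     comment    TEXT,
>                     category   VARCHAR(256)
>                   )""")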
>
> * notification mechanism
> - first milestone (non-distributed) will be all in-process
> - distributed milestone will require pings
> HTTP POST (forwards), TCP line-oriented (either), or just polling
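> a sketch of the line-oriented TCP variant (twisted; port arbitrary):
>
>     from twisted.internet import reactor, protocol
>     from twisted.protocols.basic import LineReceiver
>
>     class PingReceiver(LineReceiver):
>         # each line names a table that changed; the only reaction
>         # needed is "go re-scan the DB for new work"
>         def lineReceived(self, line):
>             self.factory.rescan(line)
>
>     class PingFactory(protocol.ServerFactory):
>         protocol = PingReceiver
>         def rescan(self, table):
>             print "rescanning %s" % table  # stand-in for the real scan
>
>     reactor.listenTCP(9999, PingFactory())
>     reactor.run()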
>
> * persistent scheduler project
> ** Changemaster:
> - (changeid, branchname, revisionid, author, timestamp, comment,
> category?)
> changeids must be comparable and monotonically increasing
> - (changeid, filename)
> i.e. changes[changeid].filenames = []
> - (changeid, propertyname, propertyvaluestring)
> i.e. changes[changeid].properties = {name: value}
> *** add row to database, ping Schedulers (eventual-send)
> *** ping all schedulers at buildmaster startup
> ** Schedulers:
> - all state must be put in DB
> - each records last-change-number, only examines changes since then
> - each records list of changes, with important/unimportant flag
> - trickiest part will be relationships between Dependent schedulers
> *** when pinged, or timer wakeup:
> - loop over all Schedulers
> - scan for unchecked changeids
> - default Scheduler ignores changes on the wrong branch
> - check importance of each
> - add to changes table
> - arrange for tree-stable-timer wakeup
> - if all changes are old enough, and important, then submit build
> - AnyBranchScheduler processes changes one branch at a time
> *** Dependent (downstream):
> - configured with an upstream scheduler, by name
> - wants to be told when upstream BuildSet completes successfully,
> receive SourceStamp object
> - then submits a new BuildSet, using the same SourceStamp, with
> different buildernames and properties
> **** so, this scheduler ignores the changes table and watches active-builds
> - defer figuring it out until I build the active-build table
> *** Periodic
> - (schedulername, last-build-started-time, last-changeid-built)
> - if last-build-started-time + delay < now:
> make SS with recent changes, submit buildset, update
> last-build-started-time and last-changeid-built
> - consider checking active-builds, avoid overlaps
> - else: arrange for wakeup in (last-build-started-time + delay - now + epsilon)
> **** after a long downtime, this should start a build
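> in code, the wakeup computation is roughly:
>
>     import time
>
>     def periodic_wakeup(last_started, delay, now=None):
>         # returns 0 if a build is overdue, else seconds until wakeup
>         if now is None:
>             now = time.time()
>         if last_started + delay < now:
>             return 0                 # submit a buildset now
>         return (last_started + delay) - now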
> *** Nightly/Cron
> - like Periodic, but compute next build time differently
> **** after a long downtime, this should *not* start a build
> - maybe make that configurable, catchup=bool
> *** Try: ignores changetable, just submits buildsets
> *** schema:
> - changes table: (schedulerid, changenum, important_p)
> - timer table: (wakeup-time)
> if min(wakeup-time) < now: empty table, ping all schedulers
> **** default Scheduler
> - (schedulerid, schedulername, last-changeid-checked)
> **** Periodic
> - (schedulername, last-build-started-time, last-changeid-built)
> **** Triggerable
> - really just maps scheduler name + properties to buildernames
> - certain buildsteps can push the trigger, wait for completion
> - ignores changetable, ignores buildtable
> - does not use schedulerdb
> **** SourceStamps
> how to gc?
> - (sourcestampid, branch, revision/None, patchlevel, patch)
> - (sourcestampid, changeid)
> *** scheduler has properties, copied into BuildSet
> - doesn't need to be in the scheduler table, but might need to be in
> BuildSet table
> *** scheduler's output is a BuildSet, which has .waitUntilFinished()
> - buildernames, sourcestamp, properties
> ** BuildSet
> - have .waitUntilFinished(), used by downstream Dependent schedulers and
> Triggerable steps
> - (buildsetid, sourcestampid, reason, idstring, current-state)
> idstring comes from Try job, to associate with external tools
> - current-state in (hopeful, unhopeful, complete)
> (no failures seen yet, some failures seen, all builds finished)
> (idea is to notify early on first sign of failure)
> - (buildsetid, buildername, buildreqid)
> i.e. buildset.buildernames = []
> - (buildsetid, propertyname, valuestring)
> i.e. buildset.properties = {}
> *** when all buildrequests complete, aggregate the results
> - when each buildrequest completes, ping the buildsets
> - this may change the buildset state
> - buildset state changes should ping schedulers
> ** BuildRequest
> - created with reason, sourcestamp, buildername, properties
> - can be merged with other requests, if sourcestamps agree to it
> - given to Builder to add to the builder queue
> - can be started multiple times: updates status, informs watchers
> - can be finished once, informs watchers
> - IBuildRequestControl: subscribe/un, cancel, .submit_time
> not sure if anybody calls it.. words.py? a few tests?
> - "reqtable": (buildrequestid, reason, sourcestampid, buildername,
> claimed-at, claimed-by?)
> - (buildrequestid, propertyname, propertyvalue)
> ** Builder
> - .buildable, .building
> - submitBuildRequest adds to .buildable, pings maybeStartAllBuilds
> - what is __getstate__/__setstate__ doing there?
> *** so we need the Builder to scan the reqtable
> - this is the part that will get distributed
> - Builder A can claim any buildrequest that's for it and not yet claimed
> or was claimed but got orphaned by a dead buildmaster, maybe have
> a timestamp or two
> - "claimed-at" holds timestamp, starts at 0, updated when a buildmaster
> grabs it, refreshed every once in a while. req can be claimed by
> someone else when (now - claimed-at) > timeout.
> - when the build is done, the buildrequest is removed from the reqtable
> and the buildset is examined
> - to cancel a request: remove it from the table
> - add submit-time or submit-sequence, to provide first-come-first-built
> to accelerate a request, change that value
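>
> a sketch of the claim logic (reqtable as above; timeout illustrative):
>
>     import time
>     CLAIM_TIMEOUT = 3600  # unrefreshed for an hour == orphaned
>
>     def claim_request(db, me, buildername):
>         now = time.time()
>         row = db.execute(
>             "SELECT buildrequestid FROM reqtable WHERE buildername=? "
>             "AND (claimed_at=0 OR ?-claimed_at>?) "
>             "ORDER BY submit_time LIMIT 1",
>             (buildername, now, CLAIM_TIMEOUT)).fetchone()
>         if row is None:
>             return None
>         # the UPDATE re-checks the claim, so losing a race against
>         # another buildmaster just means rowcount==0, no harm done
>         cur = db.execute(
>             "UPDATE reqtable SET claimed_at=?, claimed_by=? "
>             "WHERE buildrequestid=? AND (claimed_at=0 OR ?-claimed_at>?)",
>             (now, me, row[0], now, CLAIM_TIMEOUT))
>         db.commit()
>         if cur.rowcount:
>             return row[0]
>         return None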
>
> * LogServer
> ** writer-side PB interface:
> - open(title) -> logid string
> - write(logid, channel, data)
> - close(logid)
> logfile is renamed (from LOGID.open to LOGID.closed) upon close
> - get_base_url()
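> a minimal twisted.spread sketch of that interface (port and paths
> illustrative):
>
>     import os
>     from twisted.internet import reactor
>     from twisted.spread import pb
>
>     class LogServer(pb.Root):
>         def __init__(self, basedir, baseurl):
>             self.basedir, self.baseurl = basedir, baseurl
>             self.open_logs, self.next_id = {}, 0
>         def remote_open(self, title):
>             logid = "log%d" % self.next_id
>             self.next_id += 1
>             f = open(os.path.join(self.basedir, logid + ".open"), "w")
>             f.write("title: %s\n" % title)
>             self.open_logs[logid] = f
>             return logid
>         def remote_write(self, logid, channel, data):
>             self.open_logs[logid].write("%s:%s" % (channel, data))
>         def remote_close(self, logid):
>             self.open_logs.pop(logid).close()
>             base = os.path.join(self.basedir, logid)
>             os.rename(base + ".open", base + ".closed")
>         def remote_get_base_url(self):
>             return self.baseurl
>
>     factory = pb.PBServerFactory(LogServer(".", "http://example.com/logs/"))
>     reactor.listenTCP(9998, factory)
>     reactor.run()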
> ** buildmaster sends async writes, queues a limited number of requests
> - fire-and-forget-after-30s, discard if queue grows too big
> - goal is to tolerate LogServer bounces but not consume lots of memory
> ** reader-side HTTP interface:
> *** logid URL shows title, filesize, options links, open/closed status
> - with/without headers
> - just stderr
> - last N lines (when closed), last N lines plus headers
> - reads when open do tail-f
> *** all option links are normal statically-computable URLs
>
> * DB-based status writer
> ** write logserver baseurl into DB each time the LogServer PB connection
> is made
> ** indirect this, to plan for multiple LogServers (logserverid=1 for now)
> - (stepid, logserverid, logid)
> - (logserverid, logserver_baseurl)
>
> * DB-based status renderer
> **
>
> * random ideas to keep in mind
> ** scheduler db is small
> - so rather than coming up with clever queries, just grab everything,
> sort it in memory
> - also useful to avoid doing multiple queries
>
>