[Buildbot-devel] Weird builder status with 0.7.12 [long]

Thu Aug 26 21:09:34 UTC 2010

Hi all --

I'm running BuildBot 0.7.12 with a lot of custom code to hook our
local build database to BuildBot.  I wrote a StatusReceiver that kicks
in whenever a build finishes.  It's supposed to do this:

  if all builders are idle:
    query our local build database for new builds
    if we found one, turn it into a BuildBot build request and start it

Our build system is a little bit "special" in that we use three
builders (linux, windows, mac) to do wildly different builds from the
same source tree.  "Build finished" in our context really means "all
three builders have completed a build".  They're tied together with a
custom build property, which is the build_id field from our build
database.

My problem is in the first step, where I check that all builders are
idle.  Right now I suspect a bug in BuildBot, for reasons I will
explain in due course.

Here are the relevant excerpts from my StatusReceiver (and related
code); skip to the end for the important bits if you are impatient:

"""
class InteleradNotifier(base.StatusReceiverMultiService):
    log = log                           # global -> class attr

    def __init__(self):
        base.StatusReceiverMultiService.__init__(self)
        self.log.info("InteleradNotifier: constructing %r", self)

        # Instance of IStatus, which has handy methods like getBuildbotURL(),
        # getURLForThing(), ....
        self.status = None

        # Map build_id to list of (buildStatus, stepStatus, results)
tuples, where
        #   buildStatus : buildbot.interfaces.IBuildStatus
        #   stepStatus : buildbot.interfaces.IBuildStepStatus
        #   results : (code : int, text : [str])
        # (These are just the values passed to IStatusReceiver.stepFinished().)
        #
        # We add an element to this list every time a build step *from any
        # build* finishes, and then query the list when a build fails.
        # This is tricky because it takes multiple builds to build
        # IntelePACS, and we only deem the pair finished when the Linux
        # build finishes ... but we need to report failures in the other
        # builds (Windows, Mac) too!  Either we have to go all the way up
        # to the root Buildbot object and traverse down to the other
        # builds' BuildStatus, or we track it ourselves as the builds
        # progress.
        self.finished_steps = {}

    [...]

    def stepFinished(self, buildStatus, stepStatus, results):
        build_id = self._get_build_id(buildStatus)
        if build_id is None:            # not one of our "official"
builds -- ignore
            return
        steps = self.finished_steps.setdefault(build_id, [])
        steps.append((buildStatus, stepStatus, results))

        # stepStatus is a buildbot.status.builder.BuildStepStatus, which does
        # not have a useful str() or repr()
        self.log.debug("InteleradNotifier.stepFinished(): "
                       "build_id=%d, buildStatus=%r, step=%r, results=%r",
                       build_id, buildStatus, stepStatus.getName(), results)

    def buildFinished(self, builderName, buildStatus, results):
        """
        builderName : str
        buildStatus : IBuildStatus (buildbot.status.builder.BuildStatus)
        results : int (although BB docs say tuple: hmmm)
        """
        self.log.info("InteleradNotifier.buildFinish(): "
                      "builderName=%r, buildStatus=%r, results=%r",
                      builderName, buildStatus, results)

        [...tricky logic to figure out if this means a "complete" build
            (all 3 builders) is actually done...]

        self.log.debug("finished_steps[%d] = %r",
                       build_id, self.finished_steps.get(build_id))

        reactor.callInThread(self._build_finished, buildStatus,
build_id, results)

    def _build_finished(self, buildStatus, build_id, results):
        """
        Record and announce that a build has finished.  Updates the
build database
        and then sends email to the addresses recorded in the build DB when
        the build was requested.

        This method does all sorts of blocking I/O (database, email),
        so must be run in a background thread!
        """
        [...stuff for updating our build database and sending email...]

        checkBuildQueue(self.parent, db)

def checkBuildQueue(master, db):
    """
    Check if there are any builds waiting in the build queue.  If
    so, start the next one, remove it from the queue, and update its
    state as appropriate.

    This does blocking I/O (database), so must not run in the reactor
    thread.
    """
    if not allBuildersIdle(master):
        log.info("Some builders are not idle: not checking build queue")
        return

    [...actually check the queue and convert a build request there into
        a BuildBot build request...]

def allBuildersIdle(master):
    status = master.getStatus()         # IStatus object
(buildbot.status.builder.Status)
    idle = True
    for builder_name in intelepacs.BUILDER_NAMES:
        builder_status = status.getBuilder(builder_name) # IBuilderStatus
        builder_state = builder_status.getState()[0]     # string
        log.debug("Builder %r: status = %r, state = %r",
                  builder_name, builder_status, builder_state)
        if builder_state != "idle":
            # excessive logging to figure out why nightly builds are sometimes
            # not all triggered
            log.debug("pending builds: %r", builder_status.getPendingBuilds())
            log.debug("current builds: %r", builder_status.getCurrentBuilds())
            log.debug("last finished build: %r",
builder_status.getLastFinishedBuild())
            idle = False

    return idle
"""

OK, that's all the code you need.  The important bit is
allBuildersIdle() -- the rest is just there to give you a sense of
context.  (Or to impress you with my 3l33t c0ding skillz. :-)

Also, that comment in allBuildersIdle() tells you what the problem is:
sometimes, not all of our nightly builds run.  They are all queued up
(in our build database) at 02:00; then, as each nightly build
finishes, the code above kicks in, checks that all builders are idle,
reads the next nightly build from the queue, and starts it.  Except
when it doesn't.  Today, at 13:08:57, something went wrong and the
next build in our queue did not run.  The relevant logging starts with
the end of the previous build:

"""
[2010-08-26 13:08:18.091 DEBUG] InteleradNotifier.stepFinished():
build_id=8123, buildStatus=<BuildStatus #1205>, step='make-rpms',
results=(0, [])
[2010-08-26 13:08:57.115 DEBUG] InteleradNotifier.stepFinished():
build_id=8123, buildStatus=<BuildStatus #1205>, step='copy-rpms',
results=(0, [])
[2010-08-26 13:08:57.158 DEBUG] InteleradNotifier.stepFinished():
build_id=8123, buildStatus=<BuildStatus #1205>, step='generic',
results=(0, [])
[2010-08-26 13:08:57.161 DEBUG] InteleradNotifier.stepFinished():
build_id=8123, buildStatus=<BuildStatus #1205>,
step='trigger-clean-linux', results=(0, [])
[2010-08-26 13:08:57.163  INFO] InteleradNotifier.buildFinish():
builderName='linux', buildStatus=<BuildStatus #1205>, results=0
"""

You can't tell from that output that a "complete" build (all 3
builders) finished, but trust me: our Linux build takes 60 min, our
Windows build 30 min, and our Mac build 5 min.  When the Linux build
is done, the whole 3-part build is done.  I can verify it by looking
at the waterfall page or digging farther back in the logs.  In other
words, buildFinished() (called by BuildBot when *one* builder
completes) calls _build_finished() (only called when all 3 builders
are done).  The log output continues:

"""
[2010-08-26 13:08:57.172 DEBUG] finished_steps[8123] = [...long list
of (BuildStatus, BuildStepStatus, result) tuples...]
[2010-08-26 13:08:57.204  INFO] Connecting to database 'build'
[...update our build database to write finish time, state, etc...]
[2010-08-26 13:08:57.214 DEBUG] Writing email message to
/tmp/bbot.email._DLhpj.txt
[2010-08-26 13:08:57.216  INFO] Sending email: /usr/sbin/sendmail
[...] < /tmp/bbot.email._DLhpj.txt
"""

All of which merely proves that _build_finished() was called and did
its first two tasks: update our build database and send email.  Its
final task is to call checkBuildQueue(), which immediately calls
allBuildersIdle(), which logs this:

"""
[2010-08-26 13:08:57.228 DEBUG] Builder 'linux': status =
<buildbot.status.builder.BuilderStatus instance at 0xa0899cc>, state =
'building'
[2010-08-26 13:08:57.228 DEBUG] pending builds: []
[2010-08-26 13:08:57.229 DEBUG] current builds: []
[2010-08-26 13:08:57.229 DEBUG] last finished build: <BuildStatus #1205>
[2010-08-26 13:08:57.229 DEBUG] Builder 'windows': status =
<buildbot.status.builder.BuilderStatus instance at 0x9e3f1cc>, state =
'idle'
[2010-08-26 13:08:57.229 DEBUG] Builder 'mac': status =
<buildbot.status.builder.BuilderStatus instance at 0x9e46d2c>, state =
'idle'
[2010-08-26 13:08:57.229  INFO] Some builders are not idle: not
checking build queue
"""

A-HA!!!  Something fishy is going on here: I *know* that the "linux"
builder is idle, because a) I just got a call to buildFinished() for
that builder and b) nobody has started another build in the 70 ms
since.  And yet the BuilderStatus object claims that it is in state
"building" ... even though it has no pending builds and no current
builds.

So either I'm doing something wrong (querying BuildBot objects from
another thread?), or BuildBot 0.7.12 is giving me inaccurate build
status information.  Any ideas?

Incidentally, this has only started happening in the last 3 or 4
weeks.  Recent events in our BuildBot history:

  early May: upgrade to BuildBot 0.7.12
  early Aug: several minor code changes to prepare for ...
  early Aug: migrate BuildBot master daemon to a different server

The old server and new server are both fast machines: the old one has
8 2.0 GHz Xeon CPUs and 8 GB of RAM, the new one has 4 2.33 GHz Xeons
and 8 GB of RAM.  So raw performance is probably not that different.
However, the new server is running a much more recent OS (Ubuntu 10.04
rather than RHEL 4.0), and it is much less heavily loaded.  So it's
possible that some obscure combination of hardware + software + load
is exposing a race condition, either in my code or in BuildBot itself.

If you have any ideas, thoughts, suggestions, or random advice... I'm all ears.

Thanks!

Greg