[Buildbot-devel] update on the next release

Brian Warner warner-buildbot at lothar.com
Sat Nov 4 00:52:34 UTC 2006


Just a status report on how the next release is looking. I've got some notes
on the "roadmap" page[1] with a list of features/work-in-progress for the
next release. The four things I have listed as blockers are:

    * buildslaves get stuck on reconfig
    * filetransfer still needs review/cleanup
    * pbstatus wants improvement
    * #twisted IRC bot is offline

I'm going to punt on the pbstatus changes for this one, and I'm also going
to stop fussing with the filetransfer code. (I wanted to be sure I was happy
with the remote API part of that code, since once that is stable the rest of
the code can be changed without causing compatibility issues.)

I haven't yet investigated why the twisted IRC bot is offline. Has anyone
else been having problems with their IRC bots? I know there's a problem with
dynamically-assigned (and changing) IP addresses that would need some sort of
keepalive or periodic ping to resolve, but I don't think that's what's
causing this particular problem. Maybe some sort of login failure? Bots
requiring passwords on freenode or something?

The big issue still on the list is that buildslaves can get stuck when you
reconfigure the builder. The symptom is that after you hit the "reload
config" button on the debugclient (or issue a SIGHUP), the builder looks like
it's been reconfigured normally, but the next time you start a build (perhaps
hours later), it begins the first step, but never finishes it, nor does it
ever emit any status for it. I've seen this bug in our work buildbot, but
I've been unable to reproduce it in a simple unit test. I'm not sure how many
other people use the reconfig-without-restarting feature on a regular basis,
so I don't know if anyone else has been bitten by this one yet.

I think I know where the bug must lie: in the admittedly weird code that
tries (and fails) to make reconfig more reliable by simulating a complete
slave disconnect/reconnect cycle. This seems to result in two
bot.SlaveBuilder objects on the slave side: a new one that knows it is
"running", and an old one that knows it is not running. The problem is that
the new build gets sent to the old SlaveBuilder, which dutifully starts the
step, but then refuses to send status to anybody because it "knows" that it
has been turned off.
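
To make the failure mode concrete, here is a toy model in Python. It is not
the actual bot.SlaveBuilder code: the class name, its `running` flag, and the
`start_step` method are all invented for illustration, and the sketch assumes
only the behaviour described above (a stale object that still accepts work
but refuses to report on it).

```python
class ToySlaveBuilder:
    """Hypothetical stand-in for a slave-side builder object."""

    def __init__(self, name):
        self.name = name
        self.running = True
        self.sent_status = []  # status updates that reached the master

    def shut_down(self):
        # What the simulated disconnect does to the old object.
        self.running = False

    def start_step(self, step):
        # The step is started regardless of the flag...
        started = f"{self.name}: started {step}"
        # ...but status is only reported while the object thinks it is live.
        if self.running:
            self.sent_status.append(started)
        return started


# Simulated reconfig: the old object is turned off and a new one is created.
old = ToySlaveBuilder("old")
old.shut_down()
new = ToySlaveBuilder("new")

# The bug: the master still holds a reference to the old object,
# so the next build is dispatched there instead of to `new`.
master_reference = old
master_reference.start_step("checkout")

print(old.sent_status)  # [] (the step ran but nothing was reported)
print(new.sent_status)  # [] (the live object never saw the build)
```

From the master's side both lists stay empty, which matches the symptom: the
first step begins, never finishes, and never emits any status.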

Rather than spending a lot of time isolating and resolving this problem, I'm
working on changing the builder-reconfig process completely. The new approach
removes the disconnect/reconnect cycle, and instead has the new Builder
attempt to suck the brain out of the old one: all of the builds queued in the
old builder get moved to the new one, all the buildslaves that the new
Builder still wants to use get reparented, etc.
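
As a rough sketch of what I mean by that handover (again a toy model, not the
real Builder class: `ToyBuilder`, `consume`, `wanted_slaves`, and the slave
names are hypothetical), the new builder drains the old one's queue and takes
over the connections it still wants. What should happen to slaves the new
config no longer lists is not settled here, so the sketch simply leaves them
attached to the old builder.

```python
from collections import deque


class ToyBuilder:
    """Hypothetical master-side builder: a build queue plus attached slaves."""

    def __init__(self, name, wanted_slaves):
        self.name = name
        self.wanted_slaves = set(wanted_slaves)  # slave names from the config
        self.slaves = {}        # slave name -> connection object
        self.queue = deque()    # pending build requests, oldest first

    def attach(self, slave_name, connection):
        self.slaves[slave_name] = connection

    def consume(self, old):
        """Take over the queue and the still-wanted slaves of `old`."""
        # Queued builds move across in their original order.
        while old.queue:
            self.queue.append(old.queue.popleft())
        # Reparent only the slaves the new config still lists; the
        # connections themselves are reused, never dropped and remade.
        for slave_name in list(old.slaves):
            if slave_name in self.wanted_slaves:
                self.slaves[slave_name] = old.slaves.pop(slave_name)


old = ToyBuilder("full", ["bot1", "bot2"])
old.attach("bot1", object())
old.attach("bot2", object())
old.queue.extend(["build-1", "build-2"])

new = ToyBuilder("full", ["bot1"])   # the new config no longer wants bot2
new.consume(old)

print(list(new.queue))     # ['build-1', 'build-2']
print(sorted(new.slaves))  # ['bot1']
print(sorted(old.slaves))  # ['bot2']
```

The point of the design is that no connection is torn down along the way, so
there is never a second slave-side object for a build to be misrouted to.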

There are real benefits to this approach: if it works, reconfiguring the
builder will not interrupt any existing builds, and everything that was
queued for the old builder will eventually get built by the new one. We'll
still have problems with builds getting dropped when the entire buildslave or
buildmaster is restarted, but reconfig should no longer be as destructive an
operation as it is in the current release.

I'm hoping to get this code done over the weekend, and then we can let it
burn in for a week or two before cutting the next release.


that's all the news from this end,
 -Brian


[1]: http://buildbot.sourceforge.net/roadmap.html



