[Buildbot-devel] new BuildBot "success story"

Thu Dec 9 02:35:43 UTC 2004

>     http://bugzilla.SpamAssassin.org:8010/

Excellent! I'll add it to the buildbot web page.

>   - I had to hack up svn commit-emails.pl support -- patch in the sf.net
>     patches queue

Cool, I'll take a look. Is this a notification script that ships with svn or
is it a third-party thing?

>   - Michael Parker's dynamic-IP-addressed slaves issue

I've got the first round of lost-slave-handling patches in CVS now, and I'm
testing it out on the Twisted buildbot (there's one slave which lives behind
a slow link that will timeout if you even look at it funny, so I'm trying to
make sure the buildmaster handles that part well before actually fixing the
slave side to not disconnect so readily).

>   - is there a way to pick up idle slaves, when the master is restarted?
>     it appears that they must also be restarted to show up as online.
>     (it'd be nice if they could poll the server, and reconnect gracefully
>     if the server conn dies.)

As Stephen pointed out, the slave reconnects with an exponential backoff, and
that backoff could really be clamped to a shorter maximum. The parameter is
named 'maxDelay' and defaults to one hour. If you'd like to clamp it lower
(say, 10 minutes), edit buildbot/slave/bot.py (about line 285) to set
BotFactory.maxDelay:

 class BotFactory(ReconnectingPBClientFactory):
     maxDelay = 10*60
     keepaliveTimeout = 30
     unsafeTracebacks = 1
     ...
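To see what the clamp does to the retry schedule, here is a rough sketch of
the delay sequence. It assumes the growth rule of Twisted's
ReconnectingClientFactory (start at initialDelay = 1.0 second, multiply by a
factor of e after each failed attempt, cap at maxDelay). The real factory also
adds random jitter to each delay, which this sketch leaves out, and
`backoff_delays` is just a hypothetical helper, not part of buildbot or Twisted.

```python
import math


def backoff_delays(attempts, initial_delay=1.0, factor=math.e, max_delay=3600.0):
    """Return the first `attempts` reconnect delays in seconds (no jitter).

    Modeled on ReconnectingClientFactory's growth rule: each retry multiplies
    the previous delay by `factor`, but never lets it exceed `max_delay`.
    """
    delays = []
    delay = initial_delay
    for _ in range(attempts):
        # grow geometrically, then clamp to the maximum
        delay = min(delay * factor, max_delay)
        delays.append(delay)
    return delays


if __name__ == "__main__":
    # With maxDelay = 10*60 the schedule levels off at 600 seconds.
    for d in backoff_delays(10, max_delay=10 * 60):
        print(f"{d:8.1f}")
```

With the one-hour default, the delay passes 2981 seconds on the eighth retry
and sits at 3600 from the ninth on; with maxDelay = 10*60 it hits the 600-second
ceiling on the seventh retry, so a slave whose master comes back should
reappear within about ten minutes.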

If you're seeing backoff delays of more than an hour, let me know, because
that would point to a bug somewhere. I think the slave logs each delay in
twistd.log, but I could be mistaken. Either way, the slave logs are pretty
valuable in this case.

I believe some versions of Twisted had a bug in which certain disconnects were
classified as a "UserError", which did not schedule a reconnection attempt.
That may or may not have any bearing on the long reconnect times you're seeing.

>   - and finally, it's shown up some bizarro FreeBSD locking bug in our
>     code; but I can't really blame BuildBot for that, it's just doing
>     its job ;)

Yes! That's exactly what it's meant to do: ferret out the cross-platform and
portability issues.

> No, they are running.  I just restarted one of them and it came right
> back up.  Maybe I should just put in a cron job that restarts them
> every so often.  It would be nice to put in some sort of HUP or other
> signal that I could send the slave to cause it to ping the master, or
> something like that.

Hmm, I can see how having SIGHUP trigger a reconnect might be useful, but if
you have to get that involved with the slave then you might as well restart
it. Ideally the timed reconnect should be enough.

That said, there are a handful of slave-side control buttons that I haven't
figured out how to expose properly. Ping and Force Build are things that the
slave admin should be able to easily do at any time. Maybe a small web page
served by the buildslave, maybe a SIGUSR1 or 2, maybe a local TCP port that
you just telnet into. Not sure.
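For the signal option, a minimal sketch might look like the following. It is
purely hypothetical: the buildslave has no such hook, and `SlaveControl`,
`on_ping`, and `poll` are made-up names. The handler only records the request,
and the actual ping runs later from the main loop, since signal handlers should
do as little as possible. SIGUSR1 exists only on POSIX systems, and a real
Twisted-based slave would need to hand the request to its event loop instead of
polling; that wiring is left out here.

```python
import os
import signal


class SlaveControl:
    """Hypothetical slave-side control: a signal asks the slave to ping the master."""

    def __init__(self, on_ping):
        self.on_ping = on_ping  # callback the (hypothetical) slave supplies
        self.pending = False

    def install(self, signum=signal.SIGUSR1):
        # Register the handler for the chosen signal (POSIX only).
        signal.signal(signum, self._handler)

    def _handler(self, signum, frame):
        # Keep the handler tiny: just note that a ping was requested.
        self.pending = True

    def poll(self):
        """Call from the main loop; runs the ping outside signal context."""
        if self.pending:
            self.pending = False
            self.on_ping()
            return True
        return False


if __name__ == "__main__":
    ctl = SlaveControl(lambda: print("pinging master"))
    ctl.install()
    os.kill(os.getpid(), signal.SIGUSR1)  # same effect as `kill -USR1 <pid>`
    while not ctl.poll():
        pass
```

From a shell, the admin would then send `kill -USR1` to the slave's pid (for
example, the one recorded in twistd.pid) instead of restarting it.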

 -Brian, back to figuring out the slave-disconnect exception bug..