[Buildbot-devel] Slave disconnected under heavy load?

Mon Jun 5 00:56:35 UTC 2006

On 2006-06-05, Brian Warner <warner-buildbot at lothar.com> wrote:
>> The time between 'sending ping' and 'ping timeout' is very short,
>> maybe 5 seconds or so. I believe that is too short a timeout for
>> declaring a slave dead, especially if the slave might be under heavy
>> load.
>
> Oh oh oh. I think I get it now.

Yeepie!

> So I added the pre-build quick ping to catch a fairly common problem where
> the buildslave is behind a NAT box which has forgotten about the connection.
> The problem was that nothing was trying to use the connection until the build
> started, so we wouldn't detect this silent-disconnect until annoyingly late.
> The symptom was that the build would start (usually with a VC checkout step)
> and then hang right away. About 5 to 60 minutes later (whenever TCP decided
> to give up), the step would fail with a "lost buildslave" disconnect error.
>
> Adding the quick timeout made it possible to flunk the build early, rather
> than making it look like it was the SVN checkout that was failing. It also
> made it possible to assign the build to a different (more functional)
> buildslave, because we decide fairly quickly whether or not the intended
> buildslave is really still available.
>
> Obviously when the same buildslave is being used for multiple builds, this is
> inappropriate. The "slave is alive" flag needs to be per-slave, not
> per-SlaveBuilder. (There is a SlaveBuilder instance for each buildslave that
> is eligible to perform builds for a given Builder.)
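
A minimal sketch of the per-slave idea, not Buildbot's actual code: the
class names SlaveLiveness and SlaveBuilderSketch, and the injectable
clock, are assumptions made for illustration. Several SlaveBuilder-like
objects share one record, so a message seen by any of them counts as
proof that the slave is alive.

```python
import time


class SlaveLiveness:
    """Hypothetical per-slave record of when we last heard from the slave."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so the sketch is easy to test
        self.last_heard = None   # None means no message has been seen yet

    def message_received(self):
        # Call this whenever *any* message arrives from this slave.
        self.last_heard = self._clock()

    def heard_within(self, seconds):
        # True if some message arrived no more than `seconds` ago.
        if self.last_heard is None:
            return False
        return self._clock() - self.last_heard <= seconds


class SlaveBuilderSketch:
    """Hypothetical stand-in for a SlaveBuilder; many share one record."""

    def __init__(self, builder_name, liveness):
        self.builder_name = builder_name
        self.liveness = liveness  # shared, per-slave object

    def on_message(self):
        # Traffic for this builder also proves the slave itself is alive.
        self.liveness.message_received()

    def slave_seems_alive(self, max_silence=5.0):
        return self.liveness.heard_within(max_silence)
```

With this shape, a slave that is busy answering one Builder is not
declared dead by another Builder that happens to share it.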

Yeah, I'm just getting started with reading the code, but it makes
sense now that you've explained it.

> The code that handles this is in buildbot/process/builder.py. Basically
> SlaveBuilder.ping() should really pass the request off to a separate
> BuildSlave object. Each time a message is received for *any* SlaveBuilder,
> this BuildSlave object should update its record of the last time we've heard
> from the buildslave. The "ping" method should be replaced with a "provoke a
> response from the slave if you haven't heard from them in a while, get back
> to me when you hear *any* message from them (in response to our provocation
> or otherwise), or when 5 seconds have passed" method.
>
> This will take a bit of work. PB doesn't make it very easy to get
> notification when arbitrary messages have been delivered to arbitrary places,
> so I think I'll have to add some sort of central is-alive-status collector
> object somewhere. In the meantime, just bump up that timeout. Or, if you
> like, disable it altogether: change builder.py:498 to say:
>
>   d = defer.succeed(True)
>
> instead of:
>
>   d = sb.ping(self.START_BUILD_TIMEOUT)
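
The "provoke a response, then wait for any message or a timeout" method
described above could look roughly like the sketch below. It uses
asyncio from the standard library rather than Twisted and PB, purely to
stay self-contained; SlaveActivity, check_alive, and the provoke
callback are hypothetical names, not Buildbot APIs.

```python
import asyncio
import time


class SlaveActivity:
    """Hypothetical central is-alive collector for one buildslave."""

    def __init__(self, provoke):
        # `provoke` is an assumed coroutine function that sends some
        # ping-like request to the slave.
        self._provoke = provoke
        self._last_heard = None
        self._heard = asyncio.Event()

    def message_received(self):
        # Call this for *any* message from the slave, ping reply or not.
        self._last_heard = time.monotonic()
        self._heard.set()

    async def check_alive(self, max_silence=5.0, timeout=5.0):
        # Recent traffic is proof enough; skip the provocation entirely.
        if (self._last_heard is not None
                and time.monotonic() - self._last_heard <= max_silence):
            return True
        self._heard.clear()
        await self._provoke()
        try:
            # Any message counts, not just the answer to our provocation.
            await asyncio.wait_for(self._heard.wait(), timeout)
        except asyncio.TimeoutError:
            return False
        return True
```

A caller would create the object inside a running event loop and await
check_alive() before starting a build, falling back to another slave
when it returns False.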

I went with the 'defer.succeed' option, since my slaves are all on the
same network, and it works great now.