[Buildbot-devel] Slave disconnected under heavy load?

Mon Jun 5 00:11:47 UTC 2006

> The time between 'sending ping' and 'ping timeout' is very short,
> maybe 5 seconds or so. I believe that is too short of a timeout for
> declaring a slave dead, especially if the slave might be under heavy
> load.

Oh oh oh. I think I get it now.

So I added the pre-build quick ping to catch a fairly common problem where
the buildslave is behind a NAT box which has forgotten about the connection.
The problem was that nothing was trying to use the connection until the build
started, so we wouldn't detect this silent-disconnect until annoyingly late.
The symptom was that the build would start (usually with a VC checkout step)
and then hang right away. About 5 to 60 minutes later (whenever TCP decided
to give up), the step would fail with a "lost buildslave" disconnect error.

Adding the quick timeout made it possible to flunk the build early, rather
than making it look like it was the SVN checkout that was failing. It also
made it possible to assign the build to a different (more functional)
buildslave, because we decide fairly quickly whether or not the intended
buildslave is really still available.

Obviously when the same buildslave is being used for multiple builds, this
quick ping is inappropriate. The "slave is alive" flag needs to be per-slave,
not per-SlaveBuilder. (There is a SlaveBuilder instance for each buildslave
that is eligible to perform builds for a given Builder.)

The code that handles this is in buildbot/process/builder.py. Basically
SlaveBuilder.ping() should really pass the request off to a separate
BuildSlave object. Each time a message is received for *any* SlaveBuilder,
this BuildSlave object should update its record of the last time we've heard
from the buildslave. The "ping" method should be replaced with a "provoke a
response from the slave if you haven't heard from them in a while, get back
to me when you hear *any* message from them (in response to our provocation
or otherwise), or when 5 seconds have passed" method.
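
A rough sketch of that idea, to make the shape concrete. This is not Buildbot
code: it uses Python's asyncio instead of Twisted/PB so that it runs on its
own, and the names SlaveLiveness, heard(), check_alive(), provoke and max_age
are all hypothetical. The 5-second default comes from the description above;
the 30-second "heard from them recently" window is an arbitrary assumption.

```python
import asyncio
import time


class SlaveLiveness:
    """Hypothetical per-slave record of when we last heard from the slave."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.last_heard = None   # no message seen yet
        self._waiters = []       # futures waiting for the next message

    def heard(self):
        """Call this whenever *any* message arrives from the slave."""
        self.last_heard = self._clock()
        waiters, self._waiters = self._waiters, []
        for fut in waiters:
            if not fut.done():   # skip waiters that already timed out
                fut.set_result(True)

    async def check_alive(self, provoke, max_age=30.0, timeout=5.0):
        """Return True if the slave was heard from within max_age seconds,
        or if any message arrives within timeout seconds of provoking it."""
        if (self.last_heard is not None
                and self._clock() - self.last_heard <= max_age):
            return True          # recent traffic, no ping needed
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)
        provoke()                # assumed to send some kind of ping
        try:
            return await asyncio.wait_for(fut, timeout)
        except asyncio.TimeoutError:
            return False
```

Every SlaveBuilder attached to the same buildslave would share one such object
and call heard() on any incoming message, so a slave that is busy talking to
another build is never declared dead just because one ping went unanswered.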

This will take a bit of work. PB doesn't make it very easy to get
notification when arbitrary messages have been delivered to arbitrary places,
so I think I'll have to add some sort of central is-alive-status collector
object somewhere. In the meantime, just bump up that timeout. Or, if you
like, disable it altogether: change builder.py:498 to say:

  d = defer.succeed(True)

instead of:

  d = sb.ping(self.START_BUILD_TIMEOUT)

That will bypass the ping completely and just let the build start right away.

I've opened a bug for this issue, SF#1500669, to prompt me to figure out a
solution soon. Feel free to add more information to that ticket[1].

Now, how to fix it...

ponderously,
 -Brian

[1]: https://sourceforge.net/tracker/index.php?func=detail&aid=1500669&group_id=73177&atid=537001