[Buildbot-commits] [Buildbot] #1780: Latent build slaves shut down uncleanly and get forgotten by the master
Buildbot
nobody at buildbot.net
Wed Jan 26 19:17:35 UTC 2011
#1780: Latent build slaves shut down uncleanly and get forgotten by the master
---------------------+-----------------------
Reporter: jacobian | Owner:
Type: defect | Status: new
Priority: major | Milestone: undecided
Version: 0.8.3p1 | Keywords:
---------------------+-----------------------
Occasionally, when the master shuts down a latent buildslave, the shutdown
fails in an odd way: the master decides that the latent buildslave is
broken and never tries to boot it again.
Unfortunately I don't have a lot of insight into what's actually
happening, but I'll provide as much detail as I can:
The buildmaster is http://buildbot.djangoproject.com/. All the code
running there lives at https://github.com/jacobian/django-buildmaster, and
you can see the specific latent buildslave implementation at
https://github.com/jacobian/django-buildmaster/blob/master/djangobotcfg/rsc_slave.py.
Here's what I see in the logs when this error occurs:
{{{
2011-01-26 10:22:41-0800 [-] disconnecting old slave bs1.jacobian.org now
2011-01-26 10:22:41-0800 [-] waiting for slave to finish disconnecting
2011-01-26 10:22:41-0800 [-] DjangoCloudserversBuildSlave bs1.jacobian.org deleting instance 572258
2011-01-26 10:22:41-0800 [Broker,2,204.232.209.196] BuildSlave.detached(bs1.jacobian.org)
2011-01-26 10:22:45-0800 [Broker,3,204.232.209.196] slave 'bs1.jacobian.org' attaching from IPv4Address(TCP, '204.232.209.196', 53732)
2011-01-26 10:22:45-0800 [Broker,3,204.232.209.196] Slave bs1.jacobian.org received connection while not trying to substantiate. Disconnecting.
2011-01-26 10:22:45-0800 [Broker,3,204.232.209.196] waiting for slave to finish disconnecting
2011-01-26 10:22:45-0800 [Broker,3,204.232.209.196] Peer will receive following PB traceback:
2011-01-26 10:22:45-0800 [Broker,3,204.232.209.196] Unhandled Error
Traceback (most recent call last):
Failure: exceptions.RuntimeError: Slave bs1.jacobian.org received connection while not trying to substantiate. Disconnecting.
2011-01-26 10:22:45-0800 [-] DjangoCloudserversBuildSlave bs1.jacobian.org deleted instance 572258
}}}
The lines "DjangoCloudserversBuildSlave bs1.jacobian.org deleting instance
572258" and "DjangoCloudserversBuildSlave bs1.jacobian.org deleted
instance 572258" are coming from my code; the rest are logged by Buildbot
itself.
The problem isn't the connection error: the slave gets shut down just a
few seconds later. But when this happens the master decides the slave is
somehow broken and never boots another instance. The only way to get it
working again is to restart the buildmaster.
That's all I know for sure, but here's my speculation on what I *think*
might be happening: it appears that the build master disconnects my latent
slave, then calls `stop_instance()` to shut it down. The master then
detaches the slave. If the shutdown hasn't finished quickly enough,
though, it looks like the slave tries to reconnect -- it's been kicked off
by the master, and not yet killed as part of the shutdown process. So it
looks like the master freaks out and decides that the slave's misbehaving
and never tries to boot it again.
It seems that the master should just ignore connections from the slave
while it's trying to unsubstantiate the slave. Otherwise, unless the slave
shuts down immediately upon the `stop_instance()` call, it seems like
this'll happen again and again.
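The suggested behaviour can be sketched as a toy state machine. This is
only an illustration under stated assumptions, not Buildbot's actual code:
the `ToyLatentSlave` class, the `State` enum, and every method name below
are hypothetical, and the real master may track this state quite
differently. The point is just that a reconnect arriving while the instance
is being stopped is dropped quietly, so the slave stays bootable afterwards.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"                          # no instance running
    SUBSTANTIATING = "substantiating"      # instance booting, connection expected
    ATTACHED = "attached"                  # slave connected and usable
    UNSUBSTANTIATING = "unsubstantiating"  # stop_instance() still in flight


class ToyLatentSlave:
    """Hypothetical model of a latent slave's lifecycle (not Buildbot's class)."""

    def __init__(self, name):
        self.name = name
        self.state = State.IDLE
        self.ignored = 0  # reconnects dropped while shutting down

    def substantiate(self):
        # Only an idle slave can start booting a new instance.
        if self.state is State.IDLE:
            self.state = State.SUBSTANTIATING

    def begin_unsubstantiate(self):
        # The master has disconnected the slave and asked the instance to stop.
        self.state = State.UNSUBSTANTIATING

    def instance_stopped(self):
        # The instance is gone; the slave may be booted again later.
        self.state = State.IDLE

    def connection_attempt(self):
        """Return True if the connection is accepted, False if it is dropped."""
        if self.state is State.SUBSTANTIATING:
            self.state = State.ATTACHED
            return True
        if self.state is State.UNSUBSTANTIATING:
            # Assumed fix: a dying instance reconnecting is expected, so drop
            # the connection without treating the slave as misbehaving.
            self.ignored += 1
            return False
        # Any other unexpected connection is still reported as an error.
        raise RuntimeError(
            "Slave %s received connection while not trying to substantiate."
            % self.name
        )


slave = ToyLatentSlave("bs1.jacobian.org")
slave.substantiate()
slave.connection_attempt()       # normal attach after boot
slave.begin_unsubstantiate()
slave.connection_attempt()       # late reconnect from the dying instance: dropped
slave.instance_stopped()
slave.substantiate()             # the slave can still be booted again
```

In this sketch the late reconnect leaves the state untouched, so once the
instance is reported stopped the slave returns to idle and the next
`substantiate()` call proceeds normally.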
--
Ticket URL: <http://trac.buildbot.net/ticket/1780>
Buildbot <http://buildbot.net/>
Buildbot: build/test automation