[Buildbot-commits] [Buildbot] #1780: Latent build slaves shut down uncleanly and get forgotten by the master

Buildbot nobody at buildbot.net
Wed Jan 26 19:17:35 UTC 2011


#1780: Latent build slaves shut down uncleanly and get forgotten by the master
---------------------+-----------------------
Reporter:  jacobian  |      Owner:
    Type:  defect    |     Status:  new
Priority:  major     |  Milestone:  undecided
 Version:  0.8.3p1   |   Keywords:
---------------------+-----------------------
 Occasionally when the master shuts down a latent buildslave it'll fail
 weirdly, and the master decides that the latent build slave is broken and
 never tries to reboot it.

 Unfortunately I don't have a lot of insight into what's actually
 happening, but I'll provide as much detail as I can:

 The buildmaster is http://buildbot.djangoproject.com/. All the code
 running there lives at https://github.com/jacobian/django-buildmaster, and
 you can see the specific latent buildslave implementation at
 https://github.com/jacobian/django-buildmaster/blob/master/djangobotcfg/rsc_slave.py.

 Here's what I see in the logs when this error occurs:

 {{{
 2011-01-26 10:22:41-0800 [-] disconnecting old slave bs1.jacobian.org now
 2011-01-26 10:22:41-0800 [-] waiting for slave to finish disconnecting
 2011-01-26 10:22:41-0800 [-] DjangoCloudserversBuildSlave bs1.jacobian.org deleting instance 572258
 2011-01-26 10:22:41-0800 [Broker,2,204.232.209.196] BuildSlave.detached(bs1.jacobian.org)
 2011-01-26 10:22:45-0800 [Broker,3,204.232.209.196] slave 'bs1.jacobian.org' attaching from IPv4Address(TCP, '204.232.209.196', 53732)
 2011-01-26 10:22:45-0800 [Broker,3,204.232.209.196] Slave bs1.jacobian.org received connection while not trying to substantiate.  Disconnecting.
 2011-01-26 10:22:45-0800 [Broker,3,204.232.209.196] waiting for slave to finish disconnecting
 2011-01-26 10:22:45-0800 [Broker,3,204.232.209.196] Peer will receive following PB traceback:
 2011-01-26 10:22:45-0800 [Broker,3,204.232.209.196] Unhandled Error
         Traceback (most recent call last):
         Failure: exceptions.RuntimeError: Slave bs1.jacobian.org received connection while not trying to substantiate.  Disconnecting.

 2011-01-26 10:22:45-0800 [-] DjangoCloudserversBuildSlave bs1.jacobian.org deleted instance 572258
 }}}

 The lines "DjangoCloudserversBuildSlave bs1.jacobian.org deleting instance
 572258" and "DjangoCloudserversBuildSlave bs1.jacobian.org deleted
 instance 572258" are coming from my code; the rest are logged by Buildbot
 itself.

 The connection error itself isn't the problem: the slave gets shut down
 just a few seconds later anyway. But when this happens, the master decides
 the slave is somehow broken and never boots another instance. The only way
 to get it working again is to restart the buildmaster.

 That's all I know for sure, but here's my speculation on what I *think*
 might be happening: the build master disconnects my latent slave, then
 calls `stop_instance()` to shut it down, and then detaches the slave. If
 the shutdown doesn't finish quickly enough, though, the slave tries to
 reconnect -- it's been kicked off by the master, but not yet killed as
 part of the shutdown process. The master then seems to freak out, decide
 that the slave is misbehaving, and never try to boot it again.

 It seems that the master should just ignore connections from the slave
 while it's trying to unsubstantiate the slave. Otherwise, unless the slave
 shuts down immediately upon the `stop_instance()` call, this will happen
 again and again.
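
 To make the suggestion concrete, here is a minimal toy model of the
 behaviour I'm proposing. It is only a sketch, not Buildbot's actual code:
 the names `ToyLatentSlave`, `State`, `begin_unsubstantiate()`,
 `instance_stopped()` and `attached()` are all made up for illustration, and
 I'm assuming the master can tell that it is in the middle of
 unsubstantiating a slave. The idea is that a reconnect arriving in that
 window gets dropped quietly instead of being treated as an error.

```python
import enum


class State(enum.Enum):
    STOPPED = "stopped"
    SUBSTANTIATING = "substantiating"
    RUNNING = "running"
    UNSUBSTANTIATING = "unsubstantiating"


class ToyLatentSlave:
    """Hypothetical stand-in for a latent slave's lifecycle on the master."""

    def __init__(self, name):
        self.name = name
        self.state = State.STOPPED
        self.ignored_connections = 0

    def substantiate(self):
        # The master asks for a new instance to be booted.
        if self.state is not State.STOPPED:
            raise RuntimeError("%s is already %s" % (self.name, self.state.value))
        self.state = State.SUBSTANTIATING

    def begin_unsubstantiate(self):
        # The master has kicked the slave off and asked for the instance
        # to be stopped, but the instance may still be alive for a while.
        self.state = State.UNSUBSTANTIATING

    def instance_stopped(self):
        # The (possibly slow) shutdown has finished.
        self.state = State.STOPPED

    def attached(self):
        """Handle an incoming connection; return True if it was accepted."""
        if self.state is State.SUBSTANTIATING:
            self.state = State.RUNNING
            return True
        if self.state is State.UNSUBSTANTIATING:
            # Proposed behaviour: the dying instance is just retrying its
            # connection, so drop it without changing the slave's state.
            self.ignored_connections += 1
            return False
        # Any other unexpected connection is still an error.
        raise RuntimeError(
            "Slave %s received connection while not trying to substantiate."
            % self.name
        )
```

 With this model, the sequence from the log above no longer wedges the
 slave:

```python
slave = ToyLatentSlave("bs1.jacobian.org")
slave.substantiate()
slave.attached()              # normal boot: connection accepted
slave.begin_unsubstantiate()  # master starts shutting the instance down
slave.attached()              # late reconnect: ignored, not an error
slave.instance_stopped()
slave.substantiate()          # the slave can be booted again
```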

-- 
Ticket URL: <http://trac.buildbot.net/ticket/1780>
Buildbot <http://buildbot.net/>
Buildbot: build/test automation

