[Buildbot-devel] Master shutting down when slave detaches?

Dan Kegel dank at kegel.com
Thu Dec 27 05:28:44 UTC 2012


I'm using the following gentle one-shot-slave patch:
--- a/master/buildbot/process/build.py
+++ b/master/buildbot/process/build.py
@@ -270,6 +270,9 @@ class Build(properties.PropertiesMixin):

         self.build_status.buildStarted(self)
         self.acquireLocks().addCallback(self._startBuild_2)
+        if (self.slavebuilder.slave.properties.getProperty('oneshot', False)):
+            log.msg("oneshot slave started, requesting graceful shutdown.")
+            self.slavebuilder.slave.slave_status.setGraceful(True)
         return d

     def acquireLocks(self, res=None):

On good machines, I see the following when a slave finishes a job:

2012-12-26 20:16:04-0800 [-]  <Build pyflakes-osx107-master>: build finished
2012-12-26 20:16:04-0800 [-]  setting expectations for next time
2012-12-26 20:16:04-0800 [-] new expectations: 22.0238649845 seconds
2012-12-26 20:16:04-0800 [-] releaseLocks(<BuildSlave
'pyflakes-dan-kegels-macbook-pro'>): []
2012-12-26 20:16:04-0800 [Broker,1,192.168.0.103]
BuildSlave.detached(pyflakes-dan-kegels-macbook-pro)
2012-12-26 20:16:04-0800 [Broker,1,192.168.0.103]
releaseLocks(<BuildSlave 'pyflakes-dan-kegels-macbook-pro'>): []
2012-12-26 20:16:04-0800 [Broker,1,192.168.0.103] Buildslave
pyflakes-dan-kegels-macbook-pro detached from pyflakes-osx107-master
2012-12-26 20:16:04-0800 [Broker,2,192.168.0.103] slave
'pyflakes-dan-kegels-macbook-pro' attaching from IPv4Address(TCP,
'192.168.0.103', 60049)
2012-12-26 20:16:04-0800 [Broker,2,192.168.0.103] Starting buildslave
keepalive timer for 'pyflakes-dan-kegels-macbook-pro'
2012-12-26 20:16:05-0800 [Broker,2,192.168.0.103] Got slaveinfo from
'pyflakes-dan-kegels-macbook-pro'
2012-12-26 20:16:05-0800 [Broker,2,192.168.0.103] bot attached
2012-12-26 20:16:05-0800 [Broker,2,192.168.0.103] Buildslave
pyflakes-dan-kegels-macbook-pro attached to pyflakes-osx107-master

That's fine.  But on bad machines, I see the following when a slave
finishes a job:

2012-12-26 21:50:54-0600 [-]  <Build pyflakes-ubu1204-master>: build finished
2012-12-26 21:50:54-0600 [-]  setting expectations for next time
2012-12-26 21:50:54-0600 [-] new expectations: 11.2170479298 seconds
2012-12-26 21:50:54-0600 [-] releaseLocks(<BuildSlave
'pyflakes-ubu1204x1-ubu1204'>): []
2012-12-26 21:50:54-0600 [Broker,0,10.0.3.67]
BuildSlave.detached(pyflakes-ubu1204x1-ubu1204)
2012-12-26 21:50:54-0600 [Broker,0,10.0.3.67] releaseLocks(<BuildSlave
'pyflakes-ubu1204x1-ubu1204'>): []
2012-12-26 21:50:54-0600 [Broker,0,10.0.3.67] Buildslave
pyflakes-ubu1204x1-ubu1204 detached from pyflakes-ubu1204-master
2012-12-26 21:50:54-0600 [-] Received SIGTERM, shutting down.
2012-12-26 21:50:54-0600 [buildbot.status.web.baseweb.RotateLogSite]
(TCP Port 8010 Closed)
...
2012-12-26 21:50:54-0600 [-] Server Shut Down.

WTF?  Why would a slave detaching cause the server to SIGTERM (itself)?
Well, on the bad machines, the slave is an ephemeral lxc container, and
its network goes down immediately after the slave exits, maybe there's
a RST being
sent somewhere.

tcpdump shows the slave gets the shutdown command, and that connection
gets nicely FIN'd and ACK'd, but the slave then sends a SYN to the
master, which does indeed reply with an RST.   (All this is observed
on the host, and probably before the network goes down.)
I don't know if this happens in the good case, but I will find out.

Why would the master reply with an RST unless it was shutting down?

In both cases, the slave log looks like this:

2012-12-27 04:24:55+0000 [-] command finished with signal None, exit
code 0, elapsedTime: 0.004477
2012-12-27 04:24:55+0000 [-] SlaveBuilder.commandComplete
<buildslave.commands.shell.SlaveShellCommand instance at 0x9dc610c>
2012-12-27 04:24:55+0000 [Broker,client] slave shutting down on
command from master
2012-12-27 04:24:55+0000 [Broker,client] lost remote
2012-12-27 04:24:55+0000 [Broker,client] Lost connection to ubu1204x1:9010
2012-12-27 04:24:55+0000 [Broker,client] Stopping factory
<buildslave.bot.BotFactory instance at 0x9db90cc>
2012-12-27 04:24:55+0000 [-] Main loop terminated.
2012-12-27 04:24:55+0000 [-] Server Shut Down.

I will continue looking at this, but if anyone has a bright idea about what
causes the master to shutdown after every build with that patch
applied, I'd love to hear it.

Happy new year!
- Dan




More information about the devel mailing list