[users at bb.net] One EC2 slave misbehaving

Yuval Hager yhager at yhager.com
Sun Feb 7 09:05:16 UTC 2016


I've got this on a test system. The ami was created from scratch.

Looks like the master does not see a 'detached' event, even though the instance
has terminated.

See log and config attached. It's "substantiating", but never "waiting for
instance to start" (my guess is since it's not detached yet). Only after 20
minutes there's a timeout, the detach is forced, but the pending job never runs
(I guess that's another bug).

Could it be some weird TCP situation that never shuts down the connection
properly? I tried tracing the code for what triggers the detached event but I
got lost..

Now that I've isolated this on a test system, I'm happy to debug this further
as needed with network dumps, debug prints etc..

I'm running latest eight on master and 0.8.12 on the slave.

Thanks,

--yuval

On 04/02/16 03:42 PM, Yuval Hager wrote:
> I've got a few EC2 slaves that I'd like each to run a single build and
> terminate. I've set up max_builds=1 and build_wait_timeout=0 for all of them.
> 
> They are all defined the same, except for the ami id, slave name and pwd.
> They all work well, except for one. That one would only run a single job, get
> terminated and never get started again.
> 
> After running its first job, it is about to start its second. I see in the log
> "substantiating slave", but I never see "waiting for instance to start". After
> about 20 minutes I get a timeout from twisted, but there is still no recovery
> and the build never starts, until I restart master.
> 
> I'm running 0.8.12. I have applied the patch in
> https://github.com/buildbot/buildbot/pull/1831, but it made no difference. I
> also tried to implement something like
> http://trac.buildbot.net/ticket/1001#comment:5, but it made no difference
> either (I may have done this one wrong though).
> 
> The only way I can get this slave to behave is to set build_wait_timeout=1.
> This way everything works properly, except for the fact that it might run more
> than one build before terminating. I'd like to avoid that, but it's the only
> way I got this to work so far.
> 
> Suspecting something in the master db, I've created another AMI off the first
> one, given it a new name, and it behaved the same way! The only other
> difference between this and the other images, is that this one is an Amazon
> Linux, and the others are Ubuntu and RHEL. I can't think why it would matter
> though. It takes this instance slightly longer to terminate than the others
> (30-40 seconds, the others are at 25-30), but again, I can't see how it would
> matter.
> 
> I would appreciate any insight on what this might be related to, or any
> suggestions on how to debug this further.
> 
> Thanks!
> 
> --y
> _______________________________________________
> users mailing list
> users at buildbot.net
> https://lists.buildbot.net/mailman/listinfo/users
> 

-------------- next part --------------
from buildbot.plugins import *
c = BuildmasterConfig = {}
c['slaves'] = [
    buildslave.EC2LatentBuildSlave('amazon-linux', 'pass', 
                                   'm3.medium', region='us-west-1', 
                                   max_builds=1, 
                                   build_wait_timeout=0,
                                   ami='ami-xxxxx')
]
c['protocols'] = {'pb': {'port': 9989}}
c['change_source'] = []
c['schedulers'] = []
c['schedulers'].append(schedulers.ForceScheduler(
                            name="force",
                            builderNames=["test"]))
factory = util.BuildFactory()
factory.addStep(steps.ShellCommand(command=["sleep", "10"]))

c['builders'] = []
c['builders'].append(
    util.BuilderConfig(name="test",
      slavenames=["amazon-linux"],
      factory=factory))
c['status'] = []

from buildbot.status import html
from buildbot.status.web import authz, auth

authz_cfg=authz.Authz(
    auth=auth.BasicAuth([("user","pass")]),
    gracefulShutdown = False,
    forceBuild = 'auth', # use this to test your slave once it is set up
    forceAllBuilds = 'auth',  # ..or this
    pingBuilder = False,
    stopBuild = False,
    stopAllBuilds = False,
    cancelPendingBuild = False,
)
c['status'].append(html.WebStatus(http_port=8010, authz=authz_cfg))

c['title'] = "test"
c['buildbotURL'] = "http://example.com:8010/"
c['db'] = { 'db_url' : "sqlite:///state.sqlite"}
-------------- next part --------------
2016-02-07 00:31:30-0800 [-] substantiating slave <LatentSlaveBuilder builder='test' slave='amazon-linux'>
2016-02-07 00:31:30-0800 [-] EC2LatentBuildSlave amazon-linux waiting for instance i-7cc326c9 to start
2016-02-07 00:31:34-0800 [-] added buildset 14 to database
2016-02-07 00:31:50-0800 [-] EC2LatentBuildSlave amazon-linux instance i-7cc326c9 started on ec2-184-72-22-91.us-west-1.compute.amazonaws.com in about 0 minutes 20 seconds (None)
2016-02-07 00:31:56-0800 [-] added buildset 15 to database
2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] slave 'amazon-linux' attaching from IPv4Address(TCP, '184.72.22.91', 58579)
2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] Got slaveinfo from 'amazon-linux'
2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] Starting buildslave keepalive timer for 'amazon-linux'
2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] bot attached
2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] Buildslave amazon-linux attached to test
2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] Slave amazon-linux substantiated \o/
2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] Firing amazon-linux substantiation deferred with success
2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] starting build <Build test>.. pinging the slave <LatentSlaveBuilder builder='test' slave='amazon-linux'>
2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] sending ping
2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] ping finished: success
2016-02-07 00:32:39-0800 [-] <Build test>.startBuild
2016-02-07 00:32:39-0800 [-] ShellCommand.startCommand(cmd=<RemoteShellCommand '['sleep', '10']'>)
2016-02-07 00:32:39-0800 [-]   cmd.args = {'workdir': 'build', 'env': None, 'want_stderr': 1, 'logEnviron': True, 'sigtermTime': None, 'logfiles': {}, 'timeout': 1200, 'want_stdout': 1, 'usePTY': 'slave-config', 'maxTime': None, 'initial_stdin': None}
2016-02-07 00:32:39-0800 [-] <RemoteShellCommand '['sleep', '10']'>: RemoteCommand.run [0]
2016-02-07 00:32:39-0800 [-] command '['sleep', '10']' in dir 'build'
2016-02-07 00:32:49-0800 [Broker,0,184.72.22.91] <RemoteShellCommand '['sleep', '10']'> rc=0
2016-02-07 00:32:49-0800 [-] closing log <buildbot.status.logfile.LogFile instance at 0x7fad18b1f098>
2016-02-07 00:32:49-0800 [-] releaseLocks(<buildbot.steps.shell.ShellCommand object at 0x7fad18a7f810>): []
2016-02-07 00:32:49-0800 [-]  step 'shell' complete: success
2016-02-07 00:32:49-0800 [-]  <Build test>: build finished
2016-02-07 00:32:49-0800 [-]  setting expectations for next time
2016-02-07 00:32:49-0800 [-] new expectations: 10.0162930489 seconds
2016-02-07 00:32:49-0800 [-] releaseLocks(<EC2LatentBuildSlave 'amazon-linux'>): []
2016-02-07 00:32:49-0800 [-] EC2LatentBuildSlave amazon-linux terminating instance i-7cc326c9
2016-02-07 00:33:20-0800 [-] EC2LatentBuildSlave amazon-linux instance i-7cc326c9 ('terminated',) after about 0 minutes 30 seconds
2016-02-07 00:33:20-0800 [-] starting build <Build test> using slave <LatentSlaveBuilder builder='test' slave='amazon-linux'>
2016-02-07 00:33:20-0800 [-] acquireLocks(slave <EC2LatentBuildSlave 'amazon-linux'>, locks [])
2016-02-07 00:33:20-0800 [-] substantiating slave <LatentSlaveBuilder builder='test' slave='amazon-linux'>
2016-02-07 00:53:20-0800 [-] disconnecting old slave amazon-linux now
2016-02-07 00:53:20-0800 [-] waiting for slave to finish disconnecting
2016-02-07 00:53:20-0800 [-] Buildslave amazon-linux detached from test
2016-02-07 00:53:20-0800 [-] while preparing slavebuilder:
	Traceback (most recent call last):
	Failure: twisted.internet.defer.TimeoutError: 
	
2016-02-07 00:53:20-0800 [-] slave <Build test> can't build <LatentSlaveBuilder builder='test'> after all; re-queueing the request
2016-02-07 00:53:20-0800 [-] while running <function run_cleanups at 0x7fad18ada050>
	Traceback (most recent call last):
	  File "/home/yuval/buildbot/sandbox/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1184, in gotResult
	    _inlineCallbacks(r, g, deferred)
	  File "/home/yuval/buildbot/sandbox/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
	    result = result.throwExceptionIntoGenerator(g)
	  File "/home/yuval/buildbot/sandbox/local/lib/python2.7/site-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
	    return g.throw(self.type, self.value, self.tb)
	  File "/home/yuval/buildbot/src/master/buildbot/process/builder.py", line 345, in _startBuildFor
	    run_cleanups()
	--- <exception caught here> ---
	  File "/home/yuval/buildbot/src/master/buildbot/process/builder.py", line 308, in run_cleanups
	    fn()
	  File "/home/yuval/buildbot/src/master/buildbot/process/builder.py", line 322, in <lambda>
	    cleanups.append(lambda: slavebuilder.slave.releaseLocks())
	exceptions.AttributeError: 'NoneType' object has no attribute 'releaseLocks'
	
2016-02-07 00:53:20-0800 [Broker,0,184.72.22.91] BuildSlave.detached(amazon-linux)
2016-02-07 00:53:20-0800 [Broker,0,184.72.22.91] releaseLocks(<EC2LatentBuildSlave 'amazon-linux'>): []


More information about the users mailing list