[users at bb.net] One EC2 slave misbehaving

Yuval Hager yhager at yhager.com
Thu Feb 4 23:42:43 UTC 2016


I've got a few EC2 latent slaves, each of which I'd like to run a single build
and then terminate. I've set max_builds=1 and build_wait_timeout=0 for all of them.

They are all defined the same, except for the AMI id, slave name and password.
They all work well, except for one. That one runs only a single job, gets
terminated and never gets started again.
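For reference, the slaves are all defined roughly along these lines (a minimal
sketch of my master.cfg; the AMI ids, slave names, passwords, keypair and
security group names below are placeholders, not my real values):

# master.cfg excerpt (buildbot 0.8.x) -- placeholder values throughout
from buildbot.buildslave.ec2 import EC2LatentBuildSlave

c['slaves'] = [
    EC2LatentBuildSlave(
        name, password, 'm3.medium',        # slave name, password, instance type
        ami=ami_id,                         # per-slave AMI (Ubuntu, RHEL, Amazon Linux)
        identifier='AWS_ACCESS_KEY_ID',     # placeholder credentials
        secret_identifier='AWS_SECRET_KEY',
        keypair_name='buildbot-keypair',    # placeholder keypair / security group
        security_name='buildbot-sg',
        max_builds=1,                       # one build per substantiation
        build_wait_timeout=0,               # terminate as soon as the build finishes
    )
    for name, password, ami_id in [
        ('ubuntu-slave', 'pass1', 'ami-xxxxxxxx'),
        ('rhel-slave',   'pass2', 'ami-yyyyyyyy'),
        ('amzn-slave',   'pass3', 'ami-zzzzzzzz'),  # the misbehaving one
    ]
]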

After it has run its first job and is about to start its second, I see
"substantiating slave" in the log, but I never see "waiting for instance to
start". After about 20 minutes I get a timeout from Twisted, but there is still
no recovery and the build never starts until I restart the master.

I'm running Buildbot 0.8.12. I have applied the patch in
https://github.com/buildbot/buildbot/pull/1831, but it made no difference. I
also tried to implement something like
http://trac.buildbot.net/ticket/1001#comment:5, but that made no difference
either (I may have done that one wrong, though).

The only way I can get this slave to behave is to set build_wait_timeout=1.
With that, everything works properly, except that it might run more than one
build before terminating. I'd like to avoid that, but it's the only way I've
gotten this to work so far.
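Concretely, the workaround is just flipping that one keyword in the slave
definition sketched above:

        build_wait_timeout=1,   # instead of 0; instance lingers briefly and may pick up more builds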

Suspecting something in the master db, I created another AMI off the first one,
gave it a new name, and it behaved the same way! The only other difference
between this and the other images is that this one is Amazon Linux, while the
others are Ubuntu and RHEL. I can't think why that would matter, though. It
also takes this instance slightly longer to terminate than the others (30-40
seconds versus 25-30), but again, I can't see how that would matter.

I would appreciate any insight into what this might be related to, or any
suggestions on how to debug it further.

Thanks!

--y
