[users at bb.net] Multi-master 0.9.3 anecdotes.

Neil Gilmore ngilmore at grammatech.com
Fri Feb 3 20:22:17 UTC 2017


Hi Everyone,

Well, I put 0.9.3 multi-master plus Pierre's reconfig patches into 
production Tues. afternoon. After running a few days, it mostly works.

Unfortunately, there are always problems. Our current problem is that 
we've added a few workers since then. And while the builders associated 
with those workers are having builds scheduled, those builds never 
start. Even forced builds do not start.

Here's what the worker log shows:

2017-02-03 12:39:08-0500 [-] Loading buildbot.tac...
2017-02-03 12:39:09-0500 [-] Loaded.
2017-02-03 12:39:09-0500 [-] twistd 16.2.0 (/usr/bin/python 2.7.6) starting up.
2017-02-03 12:39:09-0500 [-] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-02-03 12:39:09-0500 [-] Starting Worker -- version: 0.9.0rc2
2017-02-03 12:39:09-0500 [-] recording hostname in twistd.hostname
2017-02-03 12:39:09-0500 [-] Starting factory <buildbot_worker.pb.BotFactory instance at 0x7f5ed09fcd88>
2017-02-03 12:39:09-0500 [-] Connecting to buildbot:9984
2017-02-03 12:39:09-0500 [Broker,client] message from master: attached
2017-02-03 12:39:09-0500 [Broker,client] Connected to <host:port>; worker is ready
2017-02-03 12:39:09-0500 [Broker,client] sending application-level keepalives every 600 seconds

And here's what the master log shows (yes, I've redacted host names, 
etc.). The masters are pretty busy, so I hope I have the relevant 
entries here:

2017-02-03T12:39:09-0500 [Broker,28906,10.233.216.43] worker '<name>' attaching from IPv4Address(TCP, '<ip>', 35642)
2017-02-03T12:39:09-0500 [Broker,28906,10.233.216.43] Got workerinfo from '<name>'
2017-02-03T12:39:09-0500 [-] bot attached
2017-02-03T12:39:09-0500 [-] worker <name> cannot attach
         Traceback (most recent call last):
         Failure: twisted.internet.error.AlreadyCalled: Tried to cancel an already-called event.

This is consistent for all the added workers. The UI shows that the 
workers are attached, and the builds are scheduled as normal. They just 
never seem to start. Workers present when we started the 0.9.3 masters 
(using the same database as before) appear to be working correctly.
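
Since the masters are busy, a small script can help pull the affected 
workers out of a master log. The following is only a sketch: it assumes 
the "worker <name> cannot attach" line format shown above, and the 
function name workers_failing_to_attach is made up for illustration.

```python
import re

# Matches master log lines such as:
#   2017-02-03T12:39:09-0500 [-] worker <name> cannot attach
# The line format is taken from the log excerpt above.
CANNOT_ATTACH = re.compile(r"\[-\] worker (?P<name>\S+) cannot attach")


def workers_failing_to_attach(log_lines):
    """Return worker names with a 'cannot attach' entry, in first-seen order."""
    names = []
    for line in log_lines:
        match = CANNOT_ATTACH.search(line)
        if match and match.group("name") not in names:
            names.append(match.group("name"))
    return names
```

It can be fed an open log file directly, since it only iterates over 
lines.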

The 'cannot attach' entry comes from Worker.attached() after an 
exception in AbstractWorker.attached(). But it comes very late in 
AbstractWorker.attached(), as these are the only lines after the 'bot 
attached' entry is generated:

         self.messageReceivedFromWorker()
         self.stopMissingTimer()
         yield self.updateWorker()
         yield self.botmaster.maybeStartBuildsForWorker(self.name)

I have no clue which of these might be the problem, or which event was 
already called. When it's a forced build, the worker is on a different 
master, as the force schedulers are all on our UI master, but this 
happens with scheduled builds, too. And those schedulers should be on 
the same master as the builder and worker.
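
For reference, Twisted raises AlreadyCalled when cancel() is called on 
a delayed call that has already fired. The sketch below reproduces that 
behaviour with a stdlib-only stand-in, so it runs without Twisted. 
FakeDelayedCall and the two stop_timer_* helpers are hypothetical names, 
not Buildbot code. Whether stopMissingTimer() is actually the call that 
fails here is an assumption; the log above does not say.

```python
class AlreadyCalled(Exception):
    """Stand-in for twisted.internet.error.AlreadyCalled."""


class FakeDelayedCall:
    """Hypothetical stand-in that mimics a Twisted delayed call."""

    def __init__(self):
        self.called = False
        self.cancelled = False

    def fire(self):
        # Simulates the reactor having already run the scheduled function.
        self.called = True

    def active(self):
        # True only while the call is still pending.
        return not (self.called or self.cancelled)

    def cancel(self):
        # Like Twisted, refuse to cancel a call that has already fired.
        if self.called:
            raise AlreadyCalled("Tried to cancel an already-called event.")
        self.cancelled = True


def stop_timer_unguarded(timer):
    # Raises AlreadyCalled if the timer fired before we got here.
    timer.cancel()


def stop_timer_guarded(timer):
    # Only cancel a timer that is still pending; report whether we did.
    if timer is not None and timer.active():
        timer.cancel()
        return True
    return False
```

With a timer that has already fired, the unguarded helper raises the 
same error as in the master log, while the guarded one simply returns 
False.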

I've tried a number of things to correct this, short of just shutting 
everything down and/or using a new database.

We also had several builders showing 2 builds building at the same time. 
This appears to be benign, as going in through the manhole and looking 
at masters.botmaster.namedServices['<name>'].building shows only 1 build 
on a builder.
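
Checking every builder this way by hand is tedious, so a small helper 
along these lines could be pasted into the manhole. It is a sketch 
only: it assumes namedServices behaves like a dict whose builder entries 
have a 'building' list, as in the expression above, and the helper name 
is made up.

```python
def builders_with_extra_builds(named_services, limit=1):
    """Map builder name -> number of running builds, where that exceeds limit."""
    result = {}
    for name, service in named_services.items():
        # Skip any named service that has no 'building' attribute.
        building = getattr(service, "building", None)
        if building is not None and len(building) > limit:
            result[name] = len(building)
    return result
```

An empty result would agree with the UI's double builds being only a 
display issue.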

Neil Gilmore
grammatech.com



