[users at bb.net] multi-master mostly rc2, after the weekend.

Neil Gilmore ngilmore at grammatech.com
Mon Oct 10 19:36:15 UTC 2016


Good afternoon everyone, I have more anecdotes!

Well, we've had multi-master running since Thurs. or so. It's been a 
mixed bag, but I'll start with the good.

Having the UI and force schedulers on their own master is definitely a 
good thing. With a single master, it would sometimes take many minutes 
to populate a page. Now it may take a minute, tops.

Separating out our results process into its own master also seems to be 
good. That process is pretty stable, but people would complain if I had 
to take down the master when they needed it.

In general, having our 'real' builds separated across two masters seems 
to work at least as well as a single master. Maybe a touch better.

Now the neutral:

I had been using 4 identical copies of our master.cfg, 1 per master. 
This was a bit silly, but got things running quickly. I did a little 
experiment, and yes, you can have multiple masters pointing to the same 
master.cfg in their respective buildbot.tac files. In our case, using an 
absolute path works well. We only want a single copy because it's in our 
version control, and we don't want to have to remember all the places it 
turns up. Also, much of it, like the dictionaries containing the workers 
and their directories, is reused among the masters.
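
As a rough illustration of the shared setup, the relevant lines of each 
master's buildbot.tac might look like the sketch below. The path is 
hypothetical, and the rest of the generated buildbot.tac is assumed to 
be left as it was.

```python
# Fragment of one master's buildbot.tac (sketch, not our actual file).

# Each master keeps its own base directory for state and logs.
basedir = '.'

# Hypothetical absolute path to the single version-controlled master.cfg.
# Every master's buildbot.tac would name this same file.
configfile = '/srv/buildbot/shared/master.cfg'
```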

Unfortunately, converting from 4 copies to a single copy requires 
taking down each master, editing its buildbot.tac, then bringing it back 
up, because reconfig only works on master.cfg, not buildbot.tac. Oh well...

We're also having trouble with checkconfig. I rewrote our master.cfg to 
decide which master was being configured by comparing against the 
variable basedir. Unfortunately, basedir is usually '.' when doing a 
reconfig, and that's not in our dictionary of masters. So we get a 
KeyError. I'll need to fix that. I'd want to anyway because checking the 
config for a particular master will result in some things not getting 
checked. I'll work around this by disabling the code, called when 
builders, etc. are added, that compares the current master to the 
master that the object should belong to.
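
One possible way to avoid the KeyError, sketched under assumptions: 
resolve basedir to an absolute path before the lookup, and return None 
instead of raising when the directory is not a known master. The 
MASTERS table, its paths, and the select_master name are all made up 
for the example; they are not from our real master.cfg.

```python
import os

# Hypothetical table mapping each master's absolute base directory to a role.
MASTERS = {
    '/srv/buildbot/ui': 'ui',
    '/srv/buildbot/results': 'results',
    '/srv/buildbot/build1': 'build1',
    '/srv/buildbot/build2': 'build2',
}


def select_master(basedir, masters=MASTERS, cwd=None):
    """Return the role for basedir, or None if it is not a known master."""
    if cwd is None:
        cwd = os.getcwd()
    # A relative basedir such as '.' is resolved against the working
    # directory; an absolute basedir passes through os.path.join unchanged.
    resolved = os.path.normpath(os.path.join(cwd, os.path.expanduser(basedir)))
    # dict.get avoids the KeyError when the directory is unknown.
    return masters.get(resolved)
```

With this, a basedir of '.' still finds the right entry when the 
command is run from inside a master's directory, and the caller can 
decide what a None result should mean (for example, check everything).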

I was going to move the masters to the single master.cfg, but I had 
trouble shutting them down. I got logs that mostly looked like this:
2016-10-10T14:45:27-0400 [-] while publishing event org.buildbot.mq.steps.88804.logs.stdio.append
         Traceback (most recent call last):
           File "/usr/local/lib/python2.7/dist-packages/buildbot-0.9.0rc2-py2.7.egg/buildbot/mq/wamp.py", line 37, in produce
             d = self._produce(routingKey, data)
           File "/usr/local/lib/python2.7/dist-packages/buildbot-0.9.0rc2-py2.7.egg/buildbot/mq/wamp.py", line 57, in _produce
             return self.master.wamp.publish(self.messageTopic(routingKey), _data, options=options)
           File "/usr/local/lib/python2.7/dist-packages/Twisted-16.3.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 1274, in unwindGenerator
             return _inlineCallbacks(None, gen, Deferred())
           File "/usr/local/lib/python2.7/dist-packages/Twisted-16.3.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 1128, in _inlineCallbacks
             result = g.send(result)
         --- <exception caught here> ---
           File "/usr/local/lib/python2.7/dist-packages/buildbot-0.9.0rc2-py2.7.egg/buildbot/wamp/connector.py", line 109, in publish
             ret = yield service.publish(topic, data, options=options)
           File "/usr/local/lib/python2.7/dist-packages/autobahn-0.16.0-py2.7.egg/autobahn/wamp/protocol.py", line 1109, in publish
             raise exception.TransportLost()
         autobahn.wamp.exception.TransportLost:

It's worth noting that I'm told we had some network hiccups affecting 
things last night. This might be nothing. But plain old kill won't stop 
a buildbot if it doesn't want to stop.
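
For a master that ignores a plain kill, one generic fallback is to send 
SIGTERM, wait, and only then send SIGKILL. The sketch below is not a 
buildbot feature, just standard POSIX signalling; stop_master and its 
parameters are names invented here, and the pid is assumed to come from 
the master's twistd.pid file. SIGKILL gives the process no chance to 
clean up, so it is a last resort.

```python
import errno
import os
import signal
import time


def stop_master(pid, timeout=30.0, poll=0.5, kill=os.kill, sleep=time.sleep):
    """Send SIGTERM, then SIGKILL if pid is still alive after timeout seconds.

    Returns 'terminated' if the process exited on its own, 'killed' otherwise.
    kill and sleep are injectable so the logic can be exercised safely.
    """
    def alive():
        try:
            kill(pid, 0)  # signal 0 only checks whether the process exists
        except OSError as exc:
            # ESRCH means no such process; other errors (e.g. EPERM)
            # mean the process is still there.
            return exc.errno != errno.ESRCH
        return True

    kill(pid, signal.SIGTERM)
    waited = 0.0
    while waited < timeout:
        if not alive():
            return 'terminated'
        sleep(poll)
        waited += poll
    if alive():
        kill(pid, signal.SIGKILL)
        return 'killed'
    return 'terminated'
```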

The bad:

Using multi-master doesn't seem to have stopped our lost deferred/stuck 
build problems. Restarting the affected master seems to remedy the 
problem, but it's disappointing, especially since that was the major 
problem we were trying to deal with by going to multi-master. The 
only good news there is that when I have to stop a master because of 
stuck builds, it doesn't stop everything.

If I could have a single bugfix, it would be to stop losing deferred 
objects so builds wouldn't stall (assuming that's the problem). If I 
could have a single new feature, it would be a way to reset a worker 
completely without having to take down the master, so as to not need the 
bugfix.

Neil Gilmore
grammatech.com

