[users at bb.net] multi-master mostly rc2, after the weekend.
Neil Gilmore
ngilmore at grammatech.com
Mon Oct 10 19:36:15 UTC 2016
Good afternoon everyone, I have more anecdotes!
Well, we've had multi-master running since Thurs. or so. It's been a
mixed bag, but I'll start with the good.
Having the UI and force schedulers on their own master is definitely a
good thing. With a single master, it would sometimes take many minutes
to populate a page. Now it may take a minute, tops.
Separating out our results process into its own master also seems to be
good. That process is pretty stable, but people would complain if I had
to take down the master when they needed it.
In general, having our 'real' builds separated across two masters seems
to work at least as well as a single master. Maybe a touch better.
Now the neutral:
I had been using 4 identical copies of our master.cfg, 1 per master.
This was a bit silly, but got things running quickly. I did a little
experiment, and yes, you can have multiple masters pointing to the same
master.cfg in their respective buildbot.tac files. In our case, using an
absolute path works well. We only want a single copy because it's in our
version control, and we don't want to have to remember all the places it
turns up. Also, much of it, like the dictionaries containing the workers
and their directories, is reused among the masters.
Unfortunately, converting from 4 copies to a single copy requires
taking down each master, editing its buildbot.tac, then bringing it back
up, because reconfig only works on master.cfg, not buildbot.tac. Oh well...
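For anyone wanting to try the same thing, the edit in each buildbot.tac is roughly the following. This is only a sketch of the relevant assignments; the path is a made-up example, not our real layout, and the rest of the generated buildbot.tac is assumed to stay as it was.

```python
# Excerpt of a buildbot.tac (sketch). Only the configfile value changes.
# The path below is hypothetical; substitute wherever your shared master.cfg lives.
basedir = '.'
configfile = '/home/buildbot/shared-config/master.cfg'  # absolute, same file for every master
```

Each master keeps its own base directory and buildbot.tac; only the configfile value is the same across them.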
We're also having trouble with checkconfig. I rewrote our master.cfg to
decide which master was being configured by comparing against the
variable basedir. Unfortunately, basedir is usually '.' when doing a
reconfig, and that's not in our dictionary of masters. So we get a
KeyError. I'll need to fix that. I'd want to anyway because checking the
config for a particular master will result in some things not getting
checked. I'll work around this by disabling the code that gets called
when builders, etc. are added that compares the current master to the
master that the object should belong to.
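One way the lookup could be made tolerant of a relative basedir is sketched below. The dictionary contents, the directory names, and the select_master helper are all hypothetical, invented for illustration; the idea is just to resolve '.' against the working directory and to return None instead of raising KeyError when the directory is not a known master.

```python
import os

# Hypothetical mapping from each master's base directory to a short name.
MASTERS = {
    "/srv/buildbot/ui": "ui",
    "/srv/buildbot/results": "results",
    "/srv/buildbot/build1": "build1",
    "/srv/buildbot/build2": "build2",
}


def select_master(basedir, masters=MASTERS, cwd=None):
    """Return the master name for basedir, or None if it is not a known master."""
    if cwd is None:
        cwd = os.getcwd()
    # Resolve relative values such as '.' against the working directory;
    # an absolute basedir is left as it is by os.path.join.
    resolved = os.path.normpath(os.path.join(cwd, basedir))
    # dict.get avoids the KeyError for directories that are not masters.
    return masters.get(resolved)
```

In master.cfg this would be called with the basedir variable, and a None result could be treated as "no particular master", for example to skip the ownership comparison while checking the config.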
I was going to move the masters to the single master.cfg, but I had
trouble shutting them down. I got logs that mostly looked like this:
2016-10-10T14:45:27-0400 [-] while publishing event org.buildbot.mq.steps.88804.logs.stdio.append
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/buildbot-0.9.0rc2-py2.7.egg/buildbot/mq/wamp.py", line 37, in produce
    d = self._produce(routingKey, data)
  File "/usr/local/lib/python2.7/dist-packages/buildbot-0.9.0rc2-py2.7.egg/buildbot/mq/wamp.py", line 57, in _produce
    return self.master.wamp.publish(self.messageTopic(routingKey), _data, options=options)
  File "/usr/local/lib/python2.7/dist-packages/Twisted-16.3.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 1274, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
  File "/usr/local/lib/python2.7/dist-packages/Twisted-16.3.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 1128, in _inlineCallbacks
    result = g.send(result)
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/dist-packages/buildbot-0.9.0rc2-py2.7.egg/buildbot/wamp/connector.py", line 109, in publish
    ret = yield service.publish(topic, data, options=options)
  File "/usr/local/lib/python2.7/dist-packages/autobahn-0.16.0-py2.7.egg/autobahn/wamp/protocol.py", line 1109, in publish
    raise exception.TransportLost()
autobahn.wamp.exception.TransportLost:
It's worth noting that I'm told we had some network hiccups affecting
things last night. This might be nothing. But plain old kill won't stop
a buildbot if it doesn't want to stop.
The bad:
Using multi-master doesn't seem to have stopped our lost deferred/stuck
build problems. Restarting the master seems to have remedied the
problem, but it's disappointing. Especially since that's our major
problem that we were trying to deal with by going to multi-master. The
only good news there is that when I have to stop a master because of
stuck builds, it doesn't stop everything.
If I could have a single bugfix, it would be to stop losing deferred
objects so builds wouldn't stall (assuming that's the problem). If I
could have a single new feature, it would be a way to reset a worker
completely without having to take down the master, so as to not need the
bugfix.
Neil Gilmore
grammatech.com