[users at bb.net] Master failed to register workers after network disconnect
Yngve N. Pettersen
yngve at vivaldi.com
Sat Mar 2 14:33:17 UTC 2019
Hello all,
Yesterday we had a network event when some of our buildbot workers lost
the network connection to the master for about 10 minutes.
However, although the logs on both the master and the workers show that
the workers successfully reconnected within 10 minutes of the network
connection being restored, the status displays showed the workers as
missing. It eventually took a stop/start or reboot of the workers to get
them reconnected, an hour after the network connection was lost.
What I am seeing is that the master log has entries like this when a
worker ("arbeider") reconnected:
2019-03-01 11:24:05+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] worker 'arbeider' attaching from IPv4Address(type='TCP', host='1.2.3.4', port=51630)
2019-03-01 11:24:05+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] Got duplication connection from 'arbeider' starting arbitration procedure
2019-03-01 11:24:15+0000 [-] Connected worker 'arbeider' ping timed out after 10 seconds
2019-03-01 11:24:15+0000 [-] Old connection for 'arbeider' was lost, accepting new
2019-03-01 11:24:15+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] Got workerinfo from 'arbeider'
2019-03-01 11:24:15+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] worker arbeider cannot attach
    Traceback (most recent call last):
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
        _inlineCallbacks(None, g, status)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
        result = result.throwExceptionIntoGenerator(g)
      File "sandbox/lib/python3.6/site-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
        return g.throw(self.type, self.value, self.tb)
      File "sandbox/lib/python3.6/site-packages/buildbot/worker/base.py", line 638, in attached
        log.err(e, "worker %s cannot attach" % (self.name,))
    --- <exception caught here> ---
      File "sandbox/lib/python3.6/site-packages/buildbot/worker/base.py", line 636, in attached
        yield AbstractWorker.attached(self, bot)
    builtins.AssertionError:
Does anyone have any ideas about why the reconnects failed?
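For what it is worth, the empty text after "builtins.AssertionError:" in
the traceback is what a bare assert statement (one without a message)
produces, so the log by itself does not say which condition failed. Below
is a minimal sketch of that behaviour. It is not the Buildbot code;
failing_check and describe_failure are hypothetical names used only for
illustration.

```python
def failing_check(connection):
    # Hypothetical stand-in for a precondition check; NOT Buildbot code.
    # A bare assert (no message) raises AssertionError with an empty string.
    assert connection is None


def describe_failure(connection):
    """Return the exception line roughly as a log would print it."""
    try:
        failing_check(connection)
    except AssertionError as exc:
        # str(exc) is empty because the assert carried no message.
        return "%s: %s" % (type(exc).__name__, exc)
    return "ok"


if __name__ == "__main__":
    # A non-None value trips the assert; None passes.
    print(repr(describe_failure(object())))
    print(repr(describe_failure(None)))
```

If that reading is right, finding the failed condition means looking at
the assert statements in the attach code path rather than at the log
message.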
In one case, a job was started on one of the workers (which was shown as
"online"), and the master kept showing the task as "Pinging worker" for
more than 20 minutes, until we stopped the task (and even stopping it took
a while).
If this happens every time the network connection is lost (which
admittedly does not happen that frequently, but could happen in case of
network maintenance) it is going to be a serious inconvenience, since some
of the workers need special handling when being restarted.
Relevant information about the configuration:
* Buildbot v2.0.1
* The PB connections are TLS protected, using a workaround based on the
one from <https://github.com/buildbot/buildbot/issues/2866>
* Workers run Python 2
* The master is running the current Twisted version
* The workers are running Twisted 18.7.0 (pinned to this version because of
installation problems with the current version: on Windows the install goes
looking for a compiler and does not find one, even when one is installed)
--
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS