[users at bb.net] Master failed to register workers after network disconnect

Yngve N. Pettersen yngve at vivaldi.com
Sat Mar 2 14:33:17 UTC 2019


Hello all,

Yesterday we had a network event when some of our buildbot workers lost  
the network connection to the master for about 10 minutes.

However, while the logs on both the master and the workers show that the
workers successfully reconnected within 10 minutes of the network
connection being restored, the status displays showed the workers as
missing. It eventually took a stop/start or reboot of the workers to get
them reconnected, an hour after the network connection was lost.

What I am seeing is that the master log has entries like this when a
worker ("arbeider") reconnected:

2019-03-01 11:24:05+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] worker 'arbeider' attaching from IPv4Address(type='TCP', host='1.2.3.4', port=51630)
2019-03-01 11:24:05+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] Got duplication connection from 'arbeider' starting arbitration procedure
2019-03-01 11:24:15+0000 [-] Connected worker 'arbeider' ping timed out after 10 seconds
2019-03-01 11:24:15+0000 [-] Old connection for 'arbeider' was lost, accepting new
2019-03-01 11:24:15+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] Got workerinfo from 'arbeider'
2019-03-01 11:24:15+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] worker arbeider cannot attach
         Traceback (most recent call last):
           File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
             _inlineCallbacks(None, g, status)
           File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
             result = result.throwExceptionIntoGenerator(g)
           File "sandbox/lib/python3.6/site-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
             return g.throw(self.type, self.value, self.tb)
           File "sandbox/lib/python3.6/site-packages/buildbot/worker/base.py", line 638, in attached
             log.err(e, "worker %s cannot attach" % (self.name,))
         --- <exception caught here> ---
           File "sandbox/lib/python3.6/site-packages/buildbot/worker/base.py", line 636, in attached
             yield AbstractWorker.attached(self, bot)
         builtins.AssertionError:
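
To see which workers hit this, the master log can be scanned for the
"cannot attach" message. The following is only a minimal sketch, assuming
each entry is a single line in the actual log file and follows the format
of the entries quoted here; the function name find_failed_attaches and
the regular expression are my own and not part of Buildbot.

```python
import re

# Matches lines such as:
# 2019-03-01 11:24:15+0000 [Broker (...),236,1.2.3.4] worker arbeider cannot attach
# (pattern is an assumption based on the log entries quoted in this mail)
CANNOT_ATTACH = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[+-]\d{4}) "
    r"\[.*?\] worker (?P<name>\S+) cannot attach\s*$"
)


def find_failed_attaches(lines):
    """Return (timestamp, worker_name) pairs for every failed attach."""
    failures = []
    for line in lines:
        match = CANNOT_ATTACH.match(line)
        if match:
            failures.append((match.group("ts"), match.group("name")))
    return failures
```

It can be fed an open log file directly, for example
find_failed_attaches(open("twistd.log")), where the file name is whatever
the master's log is called in a given installation.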


Does anyone have any ideas about why the reconnects failed?
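
One observation: the empty "builtins.AssertionError:" at the end of the
traceback is what a bare assert statement without a message produces. The
sketch below only illustrates how that could arise. It assumes (this is a
guess, not something I have confirmed in the Buildbot source) that
attaching asserts that no connection is currently registered for the
worker; ToyWorker and try_reattach are hypothetical names.

```python
class ToyWorker:
    """Hypothetical stand-in for a worker record on the master."""

    def __init__(self, name):
        self.name = name
        self.conn = None  # currently registered connection, if any

    def attached(self, conn):
        # Assumed invariant: no connection may be registered already.
        # A bare assert raises AssertionError with an empty message.
        assert self.conn is None
        self.conn = conn

    def detached(self):
        # Clearing the old connection makes a later attach possible.
        self.conn = None


def try_reattach(worker, new_conn):
    """Attempt an attach; return the error text on failure, else None."""
    try:
        worker.attached(new_conn)
    except AssertionError as e:
        return "%s: %r" % (type(e).__name__, str(e))
    return None
```

If the old connection is never cleared, every later attach fails in the
same way until the record is reset, which would at least be consistent
with the workers staying "missing" until they were restarted.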

In one case, a job was started on one of the workers (which was shown as
"online"), and the master kept showing the task as "Pinging worker" for
20+ minutes, until we stopped the task (and even that took a while).

If this happens every time the network connection is lost (which  
admittedly does not happen that frequently, but could happen in case of  
network maintenance) it is going to be a serious inconvenience, since some  
of the workers need special handling when being restarted.


Relevant information about the configuration:

* Buildbot v2.0.1

* The PB connections are TLS protected, using a workaround based on the  
one from <https://github.com/buildbot/buildbot/issues/2866>

* Workers run Python 2

* The master is running the current Twisted version

* The workers are running Twisted 18.7.0 (pinned version, due to
installation problems with the current version; on Windows the install
goes looking for a compiler and does not find one, even when one is
installed)


-- 
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS

