[users at bb.net] Master failed to register workers after network disconnect

Pierre Tardy tardyp at gmail.com
Sun Mar 3 16:17:35 UTC 2019


Hi,

Thanks for reporting. I am not sure if your issue is related to the TLS
hack.
Please note that we now have official TLS support in mainline.

We do have tests which make sure the connection is restored after a network
breakdown, but there are so many different ways a TCP connection can break
that it is difficult to test them all.
We would be glad if you could test the new TLS support in this respect.
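
For what it is worth, here is a minimal sketch of what the master side might
look like, assuming the PB port is given as a Twisted server endpoint string.
The key and certificate paths are placeholders, and the exact settings should
be checked against the Buildbot documentation for your version.

```python
# master.cfg fragment (sketch only; the paths below are hypothetical placeholders)
c = BuildmasterConfig = {}

# Assumption: the 'pb' protocol port accepts a Twisted server endpoint
# description, so an "ssl:" endpoint makes the master listen with TLS.
c['protocols'] = {
    'pb': {
        'port': 'ssl:9989:privateKey=/path/to/master.key:certKey=/path/to/master.crt',
    },
}
```

The workers would then need to be pointed at the TLS port as well.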

In any case, the best thing to do for such an issue is to file a bug on GitHub.

Regards
Pierre

On Sat, Mar 2, 2019 at 3:33 PM Yngve N. Pettersen <yngve at vivaldi.com> wrote:

> Hello all,
>
> Yesterday we had a network event when some of our buildbot workers lost
> the network connection to the master for about 10 minutes.
>
> However, although the logs on both the master and the workers show that
> the workers successfully reconnected within 10 minutes of the network
> connection being restored, the status displays showed the workers as
> missing. It eventually took a stop/start or reboot of the workers to get
> them reconnected, an hour after the network connection was lost.
>
> What I am seeing is that the master log has entries like this when a
> worker ("arbeider") reconnected:
>
> 2019-03-01 11:24:05+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] worker 'arbeider' attaching from IPv4Address(type='TCP', host='1.2.3.4', port=51630)
> 2019-03-01 11:24:05+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] Got duplication connection from 'arbeider' starting arbitration procedure
> 2019-03-01 11:24:15+0000 [-] Connected worker 'arbeider' ping timed out after 10 seconds
> 2019-03-01 11:24:15+0000 [-] Old connection for 'arbeider' was lost, accepting new
> 2019-03-01 11:24:15+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] Got workerinfo from 'arbeider'
> 2019-03-01 11:24:15+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] worker arbeider cannot attach
>          Traceback (most recent call last):
>            File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
>              _inlineCallbacks(None, g, status)
>            File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
>              result = result.throwExceptionIntoGenerator(g)
>            File "sandbox/lib/python3.6/site-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
>              return g.throw(self.type, self.value, self.tb)
>            File "sandbox/lib/python3.6/site-packages/buildbot/worker/base.py", line 638, in attached
>              log.err(e, "worker %s cannot attach" % (self.name,))
>          --- <exception caught here> ---
>            File "sandbox/lib/python3.6/site-packages/buildbot/worker/base.py", line 636, in attached
>              yield AbstractWorker.attached(self, bot)
>          builtins.AssertionError:
>
>
> Does anyone have any ideas about why the reconnects failed?
>
> In one case, a job was started on one of the workers (which was shown as
> "online"), and the master kept showing the task as "Pinging worker" for
> 20+ minutes until we stopped the task (and even that took a while).
>
> If this happens every time the network connection is lost (which
> admittedly does not happen that frequently, but could happen in case of
> network maintenance) it is going to be a serious inconvenience, since
> some of the workers need special handling when being restarted.
>
>
> Relevant information about the configuration:
>
> * Buildbot v2.0.1
>
> * The PB connections are TLS protected, using a workaround based on the
> one from <https://github.com/buildbot/buildbot/issues/2866>
>
> * Workers run Python 2
>
> * The master is running the current Twisted version
>
> * The workers are running Twisted 18.7.0 (fixed version, due to
> installation problems with the current version; on Windows it goes
> looking for a compiler and does not find one, even when one is installed)
>
>
> --
> Sincerely,
> Yngve N. Pettersen
> Vivaldi Technologies AS