[users at bb.net] Master failed to register workers after network disconnect

Yngve N. Pettersen yngve at vivaldi.com
Sun Mar 3 17:15:16 UTC 2019


On Sun, 03 Mar 2019 17:17:35 +0100, Pierre Tardy <tardyp at gmail.com> wrote:

> Hi,
>
> Thanks for reporting. I am not sure if your issue is related to the TLS
> hack.
> Please note that we now have official TLS support in mainline.

Good to hear.

> We do have tests which make sure the connection is restored after a
> network breakdown, but there are so many different ways a TCP connection
> can break that it is difficult to test.

AFAICT the master had not detected that the connection went down. It
seemed to accept the new connection after the worker reconnected, but then
hit an assertion while processing the worker, before assigning it to a
builder or marking the worker as online. I have no idea whether the
information from the client was corrupted in some fashion or something went
wrong on the master side, but I suspect the master side.

Still, it is curious: the day before, I moved one of the workers to a new
machine. I switched by first stopping the old worker, then starting the
new one. The log for that also displayed a "ping old connection" line,
IIRC. The main differences from this case: different machines and a normal
connection shutdown, although the IP address would look the same.

It might be an idea to test a situation where the connection breaks
between the master and worker without any opportunity to send the normal
TCP messages for a connection close, followed by a reconnect after several
minutes. That is probably hard to do in a selftest, but it could perhaps be
simulated worker side by just "forgetting" the connection without closing
it, then starting a new one; that would probably require doing something
twisted deep inside Twisted.
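
To make the "forgetting" idea concrete, here is a minimal sketch using plain
asyncio on localhost rather than Twisted or Buildbot. Everything in it
(ToyMaster, PING_TIMEOUT, the one-line text protocol) is hypothetical and
only mimics the sequence visible in the master log: duplicate connection,
ping of the old connection, timeout, accept the new one. It is not how
Buildbot implements the arbitration.

```python
import asyncio

PING_TIMEOUT = 0.2  # hypothetical; shortened from the 10 seconds seen in the log


class ToyMaster:
    """Toy stand-in for a master: one connection per worker name."""

    def __init__(self):
        self.connections = {}  # worker name -> (reader, writer)
        self.events = []       # decisions taken, in order

    async def handle(self, reader, writer):
        name = (await reader.readline()).decode().strip()
        old = self.connections.get(name)
        if old is not None:
            self.events.append("duplicate")
            if await self.ping(*old):
                self.events.append("old alive, rejecting new")
                writer.close()
                return
            self.events.append("old lost, accepting new")
            old[1].close()
        self.connections[name] = (reader, writer)
        self.events.append("attached")
        writer.write(b"attached\n")
        await writer.drain()

    async def ping(self, reader, writer):
        """Return True only if the peer answers within PING_TIMEOUT."""
        try:
            writer.write(b"ping\n")
            await writer.drain()
            reply = await asyncio.wait_for(reader.readline(), PING_TIMEOUT)
        except (asyncio.TimeoutError, ConnectionError):
            return False
        return reply == b"pong\n"


async def simulate():
    master = ToyMaster()
    server = await asyncio.start_server(master.handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    # First connection: register, then "forget" it. The socket stays open,
    # so the master sees no close, but nothing will ever answer a ping.
    old_reader, old_writer = await asyncio.open_connection("127.0.0.1", port)
    old_writer.write(b"arbeider\n")
    await old_reader.readline()  # wait for "attached"

    # Second connection under the same name triggers the arbitration.
    new_reader, new_writer = await asyncio.open_connection("127.0.0.1", port)
    new_writer.write(b"arbeider\n")
    await new_reader.readline()  # returns once the master has decided

    # Tidy up every socket so the event loop can exit cleanly.
    for w in (old_writer, new_writer):
        w.close()
    for _, w in master.connections.values():
        w.close()
    server.close()
    return master.events


if __name__ == "__main__":
    print(asyncio.run(simulate()))
```

In this sketch the ping does reach the forgotten socket and is simply never
answered, whereas in a real outage it would not arrive at all; either way
the master only finds out through the timeout. A Twisted-based selftest
would need an equivalent way of leaving the old transport open but
unserviced.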

> We would be glad if you can test the new TLS support regarding this.

I'll probably take a look when I get some wheels to run.

Such an update would give me a chance to test the "shutdown and restart  
the buildbot in 60 seconds after doing a pip update" worker class  
implementation I added to my system.

> In any case, the best approach for such an issue is to file a bug on GitHub.

I'm not sure I have anything really specific to report, except what I  
reported below, and that is AFAICT missing important traceback info about  
where the assert came from.
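
For what it is worth, the empty "builtins.AssertionError:" line is what a
bare assert statement produces, since the exception then carries no message.
The sketch below, in plain Python 3, shows how the innermost traceback frame
identifies where such an assert fired. The names check and describe are
hypothetical, this is not Buildbot code, and whether that frame survives the
inlineCallbacks re-raise seen in the log is not something it demonstrates.

```python
import traceback


def check(value):
    # A bare assert: the resulting AssertionError has an empty message.
    assert value is not None


def describe(exc):
    """Return 'function (file:line)' for the frame that raised exc."""
    frame = traceback.extract_tb(exc.__traceback__)[-1]
    return "%s (%s:%d)" % (frame.name, frame.filename, frame.lineno)


try:
    check(None)
except AssertionError as e:
    message = str(e)      # empty, matching the bare log line
    origin = describe(e)  # names the function and line of the assert

print(repr(message), origin)
```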

> Regards
> Pierre
>
> On Sat, Mar 2, 2019 at 3:33 PM Yngve N. Pettersen <yngve at vivaldi.com> wrote:
>
>> Hello all,
>>
>> Yesterday we had a network event in which some of our buildbot workers
>> lost the network connection to the master for about 10 minutes.
>>
>> However, although the logs on both the master and the workers show that
>> the workers successfully reconnected within 10 minutes of the network
>> connection being restored, the status displays showed the workers as
>> missing. It eventually took a stop/start or reboot of the workers to get
>> them reconnected, an hour after the network connection was lost.
>>
>> What I am seeing is that master log has entries like this when a worker
>> ("arbeider") reconnected:
>>
>> 2019-03-01 11:24:05+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] worker 'arbeider' attaching from IPv4Address(type='TCP', host='1.2.3.4', port=51630)
>> 2019-03-01 11:24:05+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] Got duplication connection from 'arbeider' starting arbitration procedure
>> 2019-03-01 11:24:15+0000 [-] Connected worker 'arbeider' ping timed out after 10 seconds
>> 2019-03-01 11:24:15+0000 [-] Old connection for 'arbeider' was lost, accepting new
>> 2019-03-01 11:24:15+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] Got workerinfo from 'arbeider'
>> 2019-03-01 11:24:15+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] worker arbeider cannot attach
>>          Traceback (most recent call last):
>>            File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
>>              _inlineCallbacks(None, g, status)
>>            File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
>>              result = result.throwExceptionIntoGenerator(g)
>>            File "sandbox/lib/python3.6/site-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
>>              return g.throw(self.type, self.value, self.tb)
>>            File "sandbox/lib/python3.6/site-packages/buildbot/worker/base.py", line 638, in attached
>>              log.err(e, "worker %s cannot attach" % (self.name,))
>>          --- <exception caught here> ---
>>            File "sandbox/lib/python3.6/site-packages/buildbot/worker/base.py", line 636, in attached
>>              yield AbstractWorker.attached(self, bot)
>>          builtins.AssertionError:
>>
>>
>> Does anyone have any ideas about why the reconnects failed?
>>
>> In one case, a job was started on one of the workers (which was shown as
>> "online"), and the master was just registering the task as "Pinging
>> worker", for 20+ minutes until we stopped the task (and even that took a
>> while).
>>
>> If this happens every time the network connection is lost (which
>> admittedly does not happen that frequently, but could happen in case of
>> network maintenance) it is going to be a serious inconvenience, since
>> some of the workers need special handling when being restarted.
>>
>>
>> Relevant information about the configuration:
>>
>> * Buildbot v2.0.1
>>
>> * The PB connections are TLS protected, using a workaround based on the
>> one from <https://github.com/buildbot/buildbot/issues/2866>
>>
>> * Workers run Python 2
>>
>> * The master is running the current Twisted version
>>
>> * The workers are running Twisted 18.7.0 (fixed version, due to
>> installation problems with the current version; on Windows it goes
>> looking for a compiler and does not find one, even when one is installed)
>>
>>
>> --
>> Sincerely,
>> Yngve N. Pettersen
>> Vivaldi Technologies AS
>> _______________________________________________
>> users mailing list
>> users at buildbot.net
>> https://lists.buildbot.net/mailman/listinfo/users
>>
> --


-- 
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS

