[users at bb.net] A summary of first month issues using Buildbot 2.0

Yngve N. Pettersen yngve at vivaldi.com
Fri Mar 15 01:29:30 UTC 2019


Hi,

About a month ago we transferred our build system from the old
Chromium-developed buildbot system to one based on Buildbot 2.0. In that
period we have had a few major issues that I thought I'd summarize:

* We have had two crashes of the buildbot master process. I do not know
what causes the crashes, and the twisted.log does not contain any
information about what happened, so my guess is that either the Ubuntu 18
Python 3.6 interpreter crashed, or the Twisted/Buildbot scripts failed
without logging anything.

* We have had at least two cases where the master lost its connection to
the database server and did not recover; restarting the master was the
only option. The probable common factor in these cases is that the loss
seems to have happened when using the reconfigure/SIGHUP option to update
the buildbot configuration. In at least one case the log seemed to include
an exception regarding the database connection (the database is a remote
PostgreSQL server).

* We have had a couple of cases where the network connection between the
master and some of the workers has been interrupted. In the major case,
this led to having to restart the worker instances on all the affected
workers. This was the topic of an email to this list a few weeks ago. In
this case the logs show that the workers connected correctly, but that the
master then failed (due to an exception) to register the worker. The
master also failed to cut the connection to the worker (so that the worker
could try to reconnect), either when the registration process failed or
later when checking open connections (if it does that), and it apparently
still responded to pings from the worker. Nor did it detect that a worker
was not really connected when it pinged the worker while trying to assign
it a job.

This reconnect issue is such a major problem and hassle that, when we did
a restart of that network connection, we shut down the *master* instance
while taking down the network connection, and restarted it afterwards.
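
The worker registration failure described above can be illustrated with a
minimal sketch of the behaviour one would expect: if registration raises an
exception, the master drops the connection so that the worker notices the
disconnect and can retry. This is only an illustration under stated
assumptions. The Connection and Registry classes and their methods are
hypothetical names invented for this example; they are not Buildbot or
Twisted APIs and do not reflect how Buildbot 2.0 is implemented.

```python
class Connection:
    """Hypothetical stand-in for a worker connection (not a Buildbot class)."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        # A real implementation would tear down the transport here.
        self.closed = True


class Registry:
    """Hypothetical master-side bookkeeping of attached workers."""

    def __init__(self):
        self.attached = {}

    def register(self, conn, do_register):
        """Run the registration callback; close the connection if it fails."""
        try:
            do_register(conn)
        except Exception:
            # Registration failed: drop the connection so the worker sees
            # the disconnect and reconnects, instead of staying half-attached.
            conn.close()
            return False
        self.attached[conn.name] = conn
        return True

    def is_usable(self, name):
        """A worker is usable only if it is registered and still open."""
        conn = self.attached.get(name)
        return conn is not None and not conn.closed
```

With this arrangement a worker whose registration failed is never listed as
attached, so a later attempt to assign it a job would see it as unavailable
rather than as a connected worker.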

-- 
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS

