[users at bb.net] A summary of first month issues using Buildbot 2.0

Yngve N. Pettersen yngve at vivaldi.com
Sat Mar 16 18:22:14 UTC 2019


Hi again,

An update about one of the issues, the lost database connection.

This seems to affect the GitPoller instance. Other database activity, such
as forced_scheduler and triggered jobs, works as normal.

It seems the GitPoller (maybe all pollers) is not able to recover from a
lost database connection; a full shutdown and restart is needed to
recover. This looks similar to the worker reconnect failures I've
mentioned before: that code cannot recover from a failed worker
subscription, and the connection ends up as a zombie, a connection that
is still live but effectively dead.
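
My reading of the traceback below (this is only a guess) is a race: the
sighup detaches the poller service, clearing its master reference, while
a poll is still in flight, so the state write at the end of the poll
dereferences None. Here is a minimal sketch of that race, using asyncio
instead of Twisted for brevity; all the Fake* names are mine, only the
master/db/state/setState chain follows the traceback:

import asyncio


class FakeState:
    async def setState(self, objectid, key, value):
        print("stored %s=%s for object %s" % (key, value, objectid))


class FakeDb:
    state = FakeState()


class FakeMaster:
    db = FakeDb()


class FakePoller:
    def __init__(self, master):
        self.master = master

    async def poll(self):
        await asyncio.sleep(0.1)  # stands in for the actual git fetch
        # If a reconfigure detached us in the meantime, self.master is
        # None and the next line raises the AttributeError from the log:
        await self.master.db.state.setState(1, 'lastRev', 'abc123')

    def detach(self):
        self.master = None  # roughly what detaching the service does


async def main():
    poller = FakePoller(FakeMaster())
    task = asyncio.ensure_future(poller.poll())
    poller.detach()  # the sighup wins the race
    try:
        await task
    except AttributeError as e:
        print("poll failed: %s" % e)

asyncio.get_event_loop().run_until_complete(main())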

In the case earlier today, I got an exception during the sighup operation:

2019-03-16 13:02:31+0000 [-] while polling for changes
   Traceback (most recent call last):
     File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
       result = g.send(result)
     File "sandbox/lib/python3.6/site-packages/buildbot/changes/gitpoller.py", line 233, in poll
       yield self.setState('lastRev', self.lastRev)
     File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
       return _cancellableInlineCallbacks(gen)
     File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
       _inlineCallbacks(None, g, status)
   --- <exception caught here> ---
     File "sandbox/lib/python3.6/site-packages/buildbot/changes/gitpoller.py", line 233, in poll
       yield self.setState('lastRev', self.lastRev)
     File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
       result = g.send(result)
     File "sandbox/lib/python3.6/site-packages/buildbot/util/state.py", line 43, in setState
       yield self.master.db.state.setState(self._objectid, key, value)
   builtins.AttributeError: 'NoneType' object has no attribute 'db'

2019-03-16 13:02:31+0000 [-] Caught exception while deactivating ClusteredService(...)
   Traceback (most recent call last):
     File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
       current.result = callback(current.result, *args, **kw)
     File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1475, in gotResult
       _inlineCallbacks(r, g, status)
     File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
       result = g.send(result)
     File "sandbox/lib/python3.6/site-packages/buildbot/util/service.py", line 341, in stopService
       log.err(e, _why="Caught exception while deactivating ClusteredService(%s)" % self.name)
   --- <exception caught here> ---
     File "sandbox/lib/python3.6/site-packages/buildbot/util/service.py", line 339, in stopService
       yield self._unclaimService()
     File "sandbox/lib/python3.6/site-packages/buildbot/changes/base.py", line 51, in _unclaimService
       return self.master.data.updates.trySetChangeSourceMaster(self.serviceid,
   builtins.AttributeError: 'NoneType' object has no attribute 'data'
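
If that reading is right, a guard in the poll path would at least turn
the crash into a skipped state write. This is purely a sketch on my
part, not tested; only setState('lastRev', ...) and self.master are
taken from the traceback above, the surrounding structure is assumed:

from twisted.internet import defer


class PollGuardSketch:
    @defer.inlineCallbacks
    def poll(self):
        # ... fetch and process new revisions as before ...
        if self.master is None:
            # A sighup detached this poller mid-poll; skip the state
            # write instead of crashing on self.master.db. The next
            # successful poll will save lastRev again.
            return
        yield self.setState('lastRev', self.lastRev)

The cleaner fix is presumably for the reconfigure to wait for an
in-flight poll to finish before detaching the service, but I have not
dug into that code.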


On Fri, 15 Mar 2019 02:29:30 +0100, Yngve N. Pettersen <yngve at vivaldi.com>  
wrote:

> Hi,
>
> About a month ago we migrated our build system from the old
> Chromium-developed Buildbot system to one based on Buildbot 2.0. In
> that period we have had a couple of major issues that I thought I'd
> summarize:
>
> * We have had two crashes of the buildbot master process. I do not
> know what caused the crashes, and twisted.log does not contain any
> information about what happened, so my guess is that either the
> Ubuntu 18 Python 3.6 interpreter crashed, or the Twisted/Buildbot
> code did so without logging anything.
>
> * We have had at least two cases where the master lost its connection
> to the database server and did not recover; restarting the master was
> the only option. The probable commonality between these cases is that
> they seem to have happened when using the reconfigure/sighup option
> to update the Buildbot configuration. In at least one case the log
> included an exception regarding the database connection (which is to
> a remote PostgreSQL server).
>
> * We have had a couple of cases where the network connection between
> the master and some of the workers was interrupted. In the major
> case, this led to having to restart the worker instances on all the
> affected workers; that was the topic of an email to this list a few
> weeks ago. The logs show that the workers reconnected correctly, but
> the master then failed (due to an exception) to register the worker.
> It also failed to cut the connection to the worker (so that the
> worker could try to reconnect again), either when the registration
> process failed or later when checking open connections (if it does
> such a check), and it apparently kept responding to pings from the
> worker. It also did not detect that a worker was not really connected
> when it pinged the worker while trying to assign it a job.
>
> This reconnect issue is such a major problem and hassle that, when we
> later restarted that network connection, we shut down the *master*
> instance while the network connection was down, and restarted it
> afterwards.
>


-- 
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS

