[users at bb.net] A summary of first month issues using Buildbot 2.0

Yngve N. Pettersen yngve at vivaldi.com
Thu Mar 21 10:46:46 UTC 2019


Hi again,

A further update on that DB issue.

It seems that the last such incident caused two of the new branch  
schedulers I was adding to not get properly registered. The schedulers are  
listed, but AFAICT the git polling does not work, and it also seems like  
the restart I did to fix the DB issue did not fix the problem.
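
To make the guess concrete: my reading of the traceback quoted below is
that a poll() was still in flight when the reconfig detached the poller,
so self.master was already None by the time it tried to save lastRev.
Here is a minimal sketch of that race and of a possible guard. It is plain
asyncio with made-up stand-in classes (FakeMaster, FakeStateDB,
SketchPoller and run_once are all hypothetical), not Buildbot's actual
Twisted code, and I have not verified that this is what really happens
inside GitPoller.

```python
import asyncio


class FakeStateDB:
    """Hypothetical stand-in for the state table; just records writes."""

    def __init__(self):
        self.saved = {}

    async def set_state(self, key, value):
        self.saved[key] = value


class FakeMaster:
    """Hypothetical stand-in for the build master object."""

    def __init__(self):
        self.state = FakeStateDB()


class SketchPoller:
    """Toy poller; NOT the real GitPoller, only the shape of the race."""

    def __init__(self, master, guarded):
        self.master = master
        self.guarded = guarded
        self.last_rev = None

    async def poll(self):
        # Simulate the slow fetch step; a stop can happen in the meantime.
        await asyncio.sleep(0.01)
        self.last_rev = "abc123"
        if self.guarded and self.master is None:
            # The service was detached while fetching: skip the write.
            return "skipped"
        # Unguarded path: raises AttributeError if master is already None.
        await self.master.state.set_state("lastRev", self.last_rev)
        return "saved"

    def stop(self):
        # Mimics the poller being detached from its parent during reconfig.
        self.master = None


async def run_once(guarded):
    poller = SketchPoller(FakeMaster(), guarded)
    task = asyncio.ensure_future(poller.poll())
    await asyncio.sleep(0)  # give poll() a chance to start
    poller.stop()           # reconfig detaches the poller mid-poll
    try:
        return await task
    except AttributeError as exc:
        return "error: %s" % exc


if __name__ == "__main__":
    print(asyncio.run(run_once(guarded=False)))
    print(asyncio.run(run_once(guarded=True)))
```

The unguarded run ends in the same kind of AttributeError as in the log,
while the guarded run just skips the state write. A guard like this would
only hide the exception; it would not by itself explain why polling never
resumes afterwards.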

On Sat, 16 Mar 2019 19:22:14 +0100, Yngve N. Pettersen <yngve at vivaldi.com>  
wrote:

> Hi again,
>
> An update about one of the issues, the lost database connection.
>
> This seems to affect the GitPoller instance. Other database activity,
> such as forced_scheduler and triggered jobs, works as normal.
>
> It seems like the GitPoller (maybe all pollers) is not able to recover
> from a lost database connection. A full shutdown and start is needed to
> recover. This seems to be similar to the worker reconnect failures I've
> mentioned: that code is not able to recover from a failed worker
> subscription, and the connection ends up as a zombie, nominally live
> but effectively dead.
>
> In the case earlier today, I got an exception during the sighup  
> operation:
>
> 2019-03-16 13:02:31+0000 [-] while polling for changes
>    Traceback (most recent call last):
>      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
>        result = g.send(result)
>      File "sandbox/lib/python3.6/site-packages/buildbot/changes/gitpoller.py", line 233, in poll
>        yield self.setState('lastRev', self.lastRev)
>      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
>        return _cancellableInlineCallbacks(gen)
>      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
>        _inlineCallbacks(None, g, status)
>    --- <exception caught here> ---
>      File "sandbox/lib/python3.6/site-packages/buildbot/changes/gitpoller.py", line 233, in poll
>        yield self.setState('lastRev', self.lastRev)
>      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
>        result = g.send(result)
>      File "sandbox/lib/python3.6/site-packages/buildbot/util/state.py", line 43, in setState
>        yield self.master.db.state.setState(self._objectid, key, value)
>    builtins.AttributeError: 'NoneType' object has no attribute 'db'
>
> 2019-03-16 13:02:31+0000 [-] Caught exception while deactivating ClusteredService(...)
>    Traceback (most recent call last):
>      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
>        current.result = callback(current.result, *args, **kw)
>      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1475, in gotResult
>        _inlineCallbacks(r, g, status)
>      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
>        result = g.send(result)
>      File "sandbox/lib/python3.6/site-packages/buildbot/util/service.py", line 341, in stopService
>        log.err(e, _why="Caught exception while deactivating ClusteredService(%s)" % self.name)
>    --- <exception caught here> ---
>      File "sandbox/lib/python3.6/site-packages/buildbot/util/service.py", line 339, in stopService
>        yield self._unclaimService()
>      File "sandbox/lib/python3.6/site-packages/buildbot/changes/base.py", line 51, in _unclaimService
>        return self.master.data.updates.trySetChangeSourceMaster(self.serviceid,
>    builtins.AttributeError: 'NoneType' object has no attribute 'data'
>
>
> On Fri, 15 Mar 2019 02:29:30 +0100, Yngve N. Pettersen  
> <yngve at vivaldi.com> wrote:
>
>> Hi,
>>
>> About a month ago we transferred our build system from the old
>> Chromium-developed buildbot system to one based on Buildbot 2.0. In that
>> period we have had a couple of major issues that I thought I'd summarize:
>>
>> * We have had two crashes of the buildbot master process. I do not know  
>> what causes the crashes, and the twisted.log does not contain any  
>> information about what happened, so my guess is that it is either the  
>> Ubuntu 18 Python 3.6 that crashed, or the Twisted/buildbot scripts did  
>> so in a non-logging fashion.
>>
>> * We have had at least two cases where the master lost its connection
>> to the database server and did not recover, so restarting the master
>> was the only option. The probable common factor is that they seem to
>> have happened when using the reconfigure/sighup option to update the
>> buildbot configuration. In at least one case the log seemed to include
>> an exception regarding the database connection (which is to a remote
>> PostgreSQL server).
>>
>> * We have had a couple of cases where the network connection between  
>> the master and some of the workers has been interrupted. In the major
>> case, this led to having to restart the worker instances on all the
>> affected workers. This was the topic of an email to this list a few  
>> weeks ago. In this case logs show that the workers correctly connected,  
>> but that the master then failed (due to an exception) to correctly  
>> register the worker, and failed to cut the connection to the worker (so  
>> that it could try to reconnect again) either when the registration  
>> process failed, or later when checking open connections (if it does),  
>> and apparently also responded to pings from the worker. It also did not  
>> detect that a worker was not really connected when it tried to ping it  
>> when trying to assign it a job.
>>
>> This reconnect issue is such a major problem and hassle that, when we  
>> did a restart of that network connection, we shut down the *master*  
>> instance while taking down the network connection, and restarted it
>> afterwards.
>>
>
>


-- 
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS

