[users at bb.net] A summary of first month issues using Buildbot 2.0
Yngve N. Pettersen
yngve at vivaldi.com
Thu Mar 21 10:46:46 UTC 2019
Hi again,
A further update on that DB issue.
It seems that the last such incident kept two of the new branch
schedulers I was adding from being properly registered. The schedulers are
listed, but AFAICT the git polling does not work, and the restart I did to
fix the DB issue does not seem to have fixed this problem either.
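
Both tracebacks in the quoted message below end in an AttributeError on
self.master, which suggests the poller was still running after its master
reference had been cleared. The following is a minimal, self-contained
Python sketch of that pattern and of one possible guard. It is not Buildbot
code: FakeDB, FakeMaster, NaivePoller, GuardedPoller and detach() are
hypothetical names invented for illustration, and the idea that the
reconfig clears the reference is an assumption based only on the
tracebacks.

```python
class FakeDB:
    """Hypothetical stand-in for the master's state database."""

    def __init__(self):
        self.state = {}

    def set_state(self, key, value):
        self.state[key] = value


class FakeMaster:
    """Hypothetical stand-in for the build master object."""

    def __init__(self):
        self.db = FakeDB()


class NaivePoller:
    """Dereferences self.master unconditionally, as the traceback suggests."""

    def __init__(self, master):
        self.master = master
        self.last_rev = {}

    def detach(self):
        # Assumption: something during the reconfig clears the reference
        # while a poll can still be scheduled or in flight.
        self.master = None

    def poll(self, new_rev):
        self.last_rev = new_rev
        # Raises AttributeError if the poller has been detached.
        self.master.db.set_state('lastRev', self.last_rev)
        return True


class GuardedPoller(NaivePoller):
    """Same poller, but skips the state write once it has been detached."""

    def poll(self, new_rev):
        master = self.master  # read once so the check and the use agree
        if master is None:
            return False  # detached: do nothing instead of raising
        self.last_rev = new_rev
        master.db.set_state('lastRev', self.last_rev)
        return True


if __name__ == "__main__":
    poller = NaivePoller(FakeMaster())
    poller.detach()
    try:
        poller.poll({'master': 'abc123'})
    except AttributeError as exc:
        print("naive poller failed:", exc)

    guarded = GuardedPoller(FakeMaster())
    guarded.detach()
    print("guarded poller polled:", guarded.poll({'master': 'abc123'}))
```

A guard like this would only avoid the exception. It would not by itself
make a poller resume after the reconfig, which is the actual problem
reported here.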
On Sat, 16 Mar 2019 19:22:14 +0100, Yngve N. Pettersen <yngve at vivaldi.com>
wrote:
> Hi again,
>
> An update about one of the issues, the lost database connection.
>
> This seems to affect the GitPoller instance. Other database activity,
> such as forced_scheduler and triggered jobs, works as normal.
>
> It seems like the GitPoller (maybe all pollers) is not able to recover
> from a lost database connection. A full shutdown and start is needed to
> recover. This seems similar to the worker reconnect failures I've
> mentioned: that code is not able to recover from a failed worker
> subscription, and the connection ends up as a zombie, nominally live
> but effectively dead.
>
> In the case earlier today, I got an exception during the sighup
> operation:
>
> 2019-03-16 13:02:31+0000 [-] while polling for changes
> Traceback (most recent call last):
>   File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
>     result = g.send(result)
>   File "sandbox/lib/python3.6/site-packages/buildbot/changes/gitpoller.py", line 233, in poll
>     yield self.setState('lastRev', self.lastRev)
>   File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
>     return _cancellableInlineCallbacks(gen)
>   File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
>     _inlineCallbacks(None, g, status)
> --- <exception caught here> ---
>   File "sandbox/lib/python3.6/site-packages/buildbot/changes/gitpoller.py", line 233, in poll
>     yield self.setState('lastRev', self.lastRev)
>   File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
>     result = g.send(result)
>   File "sandbox/lib/python3.6/site-packages/buildbot/util/state.py", line 43, in setState
>     yield self.master.db.state.setState(self._objectid, key, value)
> builtins.AttributeError: 'NoneType' object has no attribute 'db'
>
> 2019-03-16 13:02:31+0000 [-] Caught exception while deactivating ClusteredService(...)
> Traceback (most recent call last):
>   File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
>     current.result = callback(current.result, *args, **kw)
>   File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1475, in gotResult
>     _inlineCallbacks(r, g, status)
>   File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
>     result = g.send(result)
>   File "sandbox/lib/python3.6/site-packages/buildbot/util/service.py", line 341, in stopService
>     log.err(e, _why="Caught exception while deactivating ClusteredService(%s)" % self.name)
> --- <exception caught here> ---
>   File "sandbox/lib/python3.6/site-packages/buildbot/util/service.py", line 339, in stopService
>     yield self._unclaimService()
>   File "sandbox/lib/python3.6/site-packages/buildbot/changes/base.py", line 51, in _unclaimService
>     return self.master.data.updates.trySetChangeSourceMaster(self.serviceid,
> builtins.AttributeError: 'NoneType' object has no attribute 'data'
>
>
> On Fri, 15 Mar 2019 02:29:30 +0100, Yngve N. Pettersen
> <yngve at vivaldi.com> wrote:
>
>> Hi,
>>
>> About a month ago we transferred our build system from the old
>> Chromium-developed buildbot system to one based on Buildbot 2.0. In
>> that period we have had a couple of major issues that I thought I'd
>> summarize:
>>
>> * We have had two crashes of the buildbot master process. I do not know
>> what causes the crashes, and the twisted.log does not contain any
>> information about what happened, so my guess is that it is either the
>> Ubuntu 18 Python 3.6 that crashed, or the Twisted/buildbot scripts did
>> so in a non-logging fashion.
>>
>> * We have had at least two cases where the master lost its connection
>> to the database server and did not recover, so restarting the master
>> was the only option. The probable common factor is that these cases
>> seem to have happened when using the reconfigure/sighup option to
>> update the buildbot configuration. In at least one case the log seemed
>> to include an exception regarding the database connection (which is to
>> a remote PostgreSQL server).
>>
>> * We have had a couple of cases where the network connection between
>> the master and some of the workers has been interrupted. In the major
>> case, this led to having to restart the worker instances on all the
>> affected workers. This was the topic of an email to this list a few
>> weeks ago. In this case the logs show that the workers connected
>> correctly, but that the master then failed (due to an exception) to
>> register the worker correctly. It also failed to cut the connection to
>> the worker (so that the worker could try to reconnect), either when the
>> registration process failed or later when checking open connections
>> (if it does), and it apparently still responded to pings from the
>> worker. It also did not detect that a worker was not really connected
>> when it tried to ping it when trying to assign it a job.
>>
>> This reconnect issue is such a major problem and hassle that, when we
>> did a restart of that network connection, we shut down the *master*
>> instance while taking down the network connection, and restarted it
>> afterwards.
>>
>
>
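
The worker reconnect behaviour described in the quoted summary above comes
down to a registration step that fails while the underlying connection
stays open, so the worker never gets a reason to reconnect. The sketch
below shows, in plain Python, the behaviour one would hope for instead:
close the connection when registration fails, and treat only open,
registered connections as usable. Everything here (FakeConnection,
WorkerRegistry, attach, is_usable, fail_for) is hypothetical and made up
for illustration. None of it is Buildbot's API or a description of how
Buildbot handles this internally.

```python
class FakeConnection:
    """Hypothetical stand-in for a worker's network connection."""

    def __init__(self, name):
        self.name = name
        self.open = True

    def close(self):
        self.open = False


class WorkerRegistry:
    """Hypothetical registry that never keeps a half-registered worker."""

    def __init__(self, fail_for=()):
        self.workers = {}
        # Worker names whose registration should fail, to simulate the
        # exception seen on the master side.
        self.fail_for = set(fail_for)

    def _register(self, conn):
        if conn.name in self.fail_for:
            raise RuntimeError("registration failed for %s" % conn.name)
        self.workers[conn.name] = conn

    def attach(self, conn):
        """Register conn; on failure, close it so the worker can retry."""
        try:
            self._register(conn)
        except Exception:
            # Cutting the connection avoids a "zombie": a socket that is
            # still open although the worker was never registered.
            conn.close()
            return False
        return True

    def is_usable(self, name):
        """A worker is usable only if it is registered and still open."""
        conn = self.workers.get(name)
        return conn is not None and conn.open


if __name__ == "__main__":
    registry = WorkerRegistry(fail_for={"worker1"})
    first = FakeConnection("worker1")
    print("first attach ok:", registry.attach(first))
    print("first connection still open:", first.open)

    # Simulate the transient failure going away, then a reconnect.
    registry.fail_for.clear()
    second = FakeConnection("worker1")
    print("second attach ok:", registry.attach(second))
    print("worker usable:", registry.is_usable("worker1"))
```

With the connection closed on failure, the worker sees a disconnect and can
retry on its own, instead of every affected worker needing a manual
restart.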
--
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS