[users at bb.net] A summary of first month issues using Buildbot 2.0
Yngve N. Pettersen
yngve at vivaldi.com
Sat Mar 16 18:22:14 UTC 2019
Hi again,
An update on one of the issues: the lost database connection.
This seems to affect the GitPoller instance. Other database activity, such
as forced_scheduler and triggered jobs, works as normal.
It seems like the GitPoller (maybe all pollers) is not able to recover
from a lost database connection; a full shutdown and start is needed to
recover. This looks similar to the worker reconnect failures I've
mentioned: that code is not able to recover from a failed worker
subscription either, and the connection ends up as a zombie, technically
live, but effectively dead.
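For context, the relevant parts of our master.cfg look roughly like this
(the repository URL and the builder/scheduler names below are placeholders,
not our real configuration):

# master.cfg fragment; URL and names are placeholders
from buildbot.plugins import changes, schedulers

c = BuildmasterConfig = {}

# The GitPoller is the part that stops working after the lost database
# connection. It polls the repository on an interval and records the
# last seen revision in the database via setState().
c['change_source'] = [
    changes.GitPoller(
        repourl='https://example.com/our/repo.git',
        branches=['master'],
        pollInterval=60,
    ),
]

# Forced and triggered builds keep working in the same situation, so
# the database connection itself still appears to be usable.
c['schedulers'] = [
    schedulers.ForceScheduler(name='force', builderNames=['build']),
    schedulers.Triggerable(name='trigger', builderNames=['build']),
]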
In the case earlier today, I got an exception during the SIGHUP operation:
2019-03-16 13:02:31+0000 [-] while polling for changes
    Traceback (most recent call last):
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
        result = g.send(result)
      File "sandbox/lib/python3.6/site-packages/buildbot/changes/gitpoller.py", line 233, in poll
        yield self.setState('lastRev', self.lastRev)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
        return _cancellableInlineCallbacks(gen)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
        _inlineCallbacks(None, g, status)
    --- <exception caught here> ---
      File "sandbox/lib/python3.6/site-packages/buildbot/changes/gitpoller.py", line 233, in poll
        yield self.setState('lastRev', self.lastRev)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
        result = g.send(result)
      File "sandbox/lib/python3.6/site-packages/buildbot/util/state.py", line 43, in setState
        yield self.master.db.state.setState(self._objectid, key, value)
    builtins.AttributeError: 'NoneType' object has no attribute 'db'

2019-03-16 13:02:31+0000 [-] Caught exception while deactivating ClusteredService(...)
    Traceback (most recent call last):
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1475, in gotResult
        _inlineCallbacks(r, g, status)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
        result = g.send(result)
      File "sandbox/lib/python3.6/site-packages/buildbot/util/service.py", line 341, in stopService
        log.err(e, _why="Caught exception while deactivating ClusteredService(%s)" % self.name)
    --- <exception caught here> ---
      File "sandbox/lib/python3.6/site-packages/buildbot/util/service.py", line 339, in stopService
        yield self._unclaimService()
      File "sandbox/lib/python3.6/site-packages/buildbot/changes/base.py", line 51, in _unclaimService
        return self.master.data.updates.trySetChangeSourceMaster(self.serviceid,
    builtins.AttributeError: 'NoneType' object has no attribute 'data'
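As far as I can tell, the traceback says that self.master had already been
set to None when the poll ran, so any attribute access on it fails. A
simplified sketch of my reading of it (illustrative only, not Buildbot's
actual code):

class Master:
    pass  # stands in for the real master object with .db, .data, etc.

class Poller:
    def __init__(self, master):
        self.master = master   # set when the service is activated

    def deactivate(self):
        self.master = None     # cleared during reconfig/shutdown

    def poll(self):
        # If deactivate() ran while this poll iteration was already
        # scheduled, self.master is None here and the attribute access
        # raises AttributeError: 'NoneType' object has no attribute 'db'
        return self.master.db

poller = Poller(Master())
poller.deactivate()  # e.g. as part of the SIGHUP reconfiguration
poller.poll()        # raises the AttributeError seen in the log

The second traceback looks like the same pattern hitting the deactivation
path itself, with self.master.data instead of self.master.db.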
On Fri, 15 Mar 2019 02:29:30 +0100, Yngve N. Pettersen <yngve at vivaldi.com> wrote:
> Hi,
>
> About a month ago we transferred our build system from the old
> Chromium-developed buildbot system to one based on Buildbot 2.0. In that
> period we have had a couple of major issues that I thought I'd summarize:
>
> * We have had two crashes of the buildbot master process. I do not know
> what caused the crashes, and the twisted.log does not contain any
> information about what happened, so my guess is that either the Ubuntu 18
> Python 3.6 crashed, or the Twisted/buildbot code crashed in a non-logging
> fashion.
>
> * We have had at least two cases where the master lost its connection to
> the database server and did not recover; restarting the master was the
> only option. The probable common factor is that it seems to have happened
> when using the reconfigure/SIGHUP option to update the buildbot
> configuration. In at least one case the log seemed to include an
> exception regarding the database connection (which is a remote
> PostgreSQL server).
>
> * We have had a couple of cases where the network connection between the
> master and some of the workers was interrupted. In the major case, this
> led to having to restart the worker instances on all the affected
> workers; that was the topic of an email to this list a few weeks ago. In
> that case the logs show that the workers correctly reconnected, but the
> master then failed (due to an exception) to correctly register the
> workers. It also failed to cut the connection to a worker (so that the
> worker could try to reconnect again), either when the registration
> process failed or later when checking open connections (if it does such
> a check), and it apparently still responded to pings from the worker.
> The master also did not detect that a worker was not really connected
> when it pinged the worker while trying to assign it a job.
>
> This reconnect issue is such a major problem and hassle that, when we
> later restarted that network connection, we shut down the *master*
> instance while taking the network connection down and restarted it
> afterwards.
>
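As a stopgap for the worker reconnect problem quoted above, one thing we
are considering is an external watchdog that asks the master which workers
it believes are connected, so that we can compare that against what the
workers themselves report. A rough sketch, assuming a default web setup on
localhost:8010 and the /api/v2/workers data API endpoint:

# Watchdog sketch (our own idea, not part of Buildbot). The URL is an
# assumption about a default web setup; adjust to your master.
import requests

resp = requests.get('http://localhost:8010/api/v2/workers')
resp.raise_for_status()

# Each worker entry lists the masters it is connected to; an empty
# list means the master considers that worker disconnected.
for worker in resp.json()['workers']:
    state = 'connected' if worker['connected_to'] else 'DISCONNECTED'
    print('%-30s %s' % (worker['name'], state))

This only shows the master's view, of course; the point would be to diff
it against a similar check run on the worker side.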
--
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS