[users at bb.net] More from the land of multi-master.
Neil Gilmore
ngilmore at grammatech.com
Wed Oct 4 15:12:20 UTC 2017
Hi everyone,
As you might expect, we're having troubles again. But first, some good
news.
As it turns out, and as we expected, our (second)
CPU-spiking/memory-fragmentation problem was in our builds. The
problem was that we could get enormous strings (one was 302MB) which
the log parser would then attempt to parse, dragging everything down.
Those seem to have been cleared up. At least, I haven't seen any
spikes other than one right after restarting that master, while it
was catching up.
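If any of those giant strings come through as step log output, a
belt-and-suspenders guard would be to cap log sizes on the master,
something like this in master.cfg (the numbers are only illustrative):

    # master.cfg -- cap runaway step logs (sizes here are just examples)
    c['logMaxSize'] = 10 * 1024 * 1024   # truncate any single log past ~10MB
    c['logMaxTailSize'] = 32768          # but keep the last 32KB of the tail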
The other bit of good news is that I have the OK to upgrade our masters
to the latest. I see that 0.9.12 is due out today, so I'll be waiting
for that.
But other news isn't as bright. We're having problems with our most
active master. Big problems.
One problem is obvious. We have a newer version of sqlalchemy than
0.9.3 likes, so we get a lot of deprecation warnings, enough to
significantly slow everything down, especially startup. Looking
through the 0.9.11 code, that appears to have been taken care of, so
upgrading will at least fix the least of our problems.
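Until the upgrade lands, a possible stopgap is to filter those
warnings at the top of master.cfg, assuming the noise really is
SQLAlchemy's SADeprecationWarning and the time is mostly going into
emitting them:

    # master.cfg -- stopgap: silence sqlalchemy deprecation warnings
    import warnings
    from sqlalchemy.exc import SADeprecationWarning

    # 0.9.3 trips a pile of these against newer sqlalchemy; ignoring them
    # avoids the cost of formatting and logging every occurrence.
    warnings.filterwarnings('ignore', category=SADeprecationWarning)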
We have also been getting a lot of errors apparently tied to build
collapsing, which we have turned on globally. If you've been following
along with the anecdotes, you'll know that we've also slightly modified
the circumstances under which a build will be collapsed to ignore
revision (in our case, we always want to use the latest -- we don't care
about building anything 'intermediate'). We'd been getting a lot of
'tried to complete N buildrequests, but only completed M' warnings. And I
left some builders' pages up in my browser long enough to see that every
build (except forced builds) was getting marked as SKIPPED eventually.
Forced builds were never getting claimed. Nor were the skipped builds
marked as claimed, which is odd, because the collapsing code claims
build requests before marking them skipped. And the comments indicate
that a prime suspect for that warning is build requests that were
already claimed.
If that's affecting everything else, it's a new effect, because things
ran fine for months.
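For anyone curious, the tweak is the callable form of
c['collapseRequests'], comparing sourcestamps but skipping the
revision check. A rough sketch along those lines (not our exact code;
the data-API details follow the documentation):

    # master.cfg -- collapse requests while ignoring revision (sketch)
    from twisted.internet import defer

    @defer.inlineCallbacks
    def collapseRequests(master, builder, req1, req2):
        # Fetch the buildset behind each request to compare sourcestamps.
        bs1, bs2 = yield defer.gatherResults([
            master.data.get(('buildsets', req1['buildsetid'])),
            master.data.get(('buildsets', req2['buildsetid'])),
        ])
        stamps1 = bs1['sourcestamps']
        stamps2 = bs2['sourcestamps']

        if len(stamps1) != len(stamps2):
            defer.returnValue(False)

        for s1, s2 in zip(stamps1, stamps2):
            # Same codebase/repository/branch is enough for us; we skip
            # the revision comparison since we only ever want the latest.
            if (s1['codebase'] != s2['codebase'] or
                    s1['repository'] != s2['repository'] or
                    s1['branch'] != s2['branch']):
                defer.returnValue(False)

        defer.returnValue(True)

    c['collapseRequests'] = collapseRequests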
The result of this is that our master is failing in its prime mission,
which is to run builds. I've been occasionally able to get a build to
happen by stopping the worker. When our process starts the worker back
up, and it connects, the master will look for a pending build and start
it. But no subsequent builds will start. And if nothing is queued at
that point, a build that gets queued later, while the worker is still
connected, never starts either. And the builder we use to start
workers, which is scheduled every half hour, didn't run for 18 hours
(though it seems to have just started a build).
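One way to see whether requests are simply piling up unclaimed is to
poke the REST API with a quick script along these lines (the endpoint
and field names are as documented for the data API; the URL is a
placeholder):

    # check_pending.py -- rough look at incomplete/unclaimed build requests
    # Assumes the master's www plugin serves the v2 REST API at this URL.
    import requests

    BASE = 'http://buildmaster.example.com:8010/api/v2'

    resp = requests.get(BASE + '/buildrequests')
    resp.raise_for_status()

    # On a busy master you'd want to filter server-side, but as a sanity
    # check this shows which requests nobody has claimed yet.
    for br in resp.json()['buildrequests']:
        if not br['complete'] and not br['claimed']:
            print('unclaimed request %s for builder %s, submitted at %s' % (
                br['buildrequestid'], br['builderid'], br['submitted_at']))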
And we had a problem where the master became completely unresponsive. It
wasn't producing log messages, nor was it accepting connections,
though the process still showed as running. Unfortunately, that run
was under a standard python instead of the one with debugging symbols
(a previous run had gone out of memory, and our cron job restarted the
master before we could start it manually), so we were a bit out of
luck. The workers' logs were all similar: the connection would be
lost, as would sometimes a remote step, any current commands would be
aborted, and the worker would keep trying to reconnect, being refused
every time.
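Next time it would be nice to at least get a stack dump out of a hung
master without the debug python. Something like this near the top of
master.cfg should do it, with the caveat that it only helps if the
process is still executing Python code at all:

    # master.cfg -- dump every thread's Python stack to a file on SIGUSR2
    # (SIGUSR2 rather than SIGUSR1, since the master already uses SIGUSR1
    # for clean shutdown, if I remember right)
    import signal
    import sys
    import traceback

    def _dump_stacks(signum, frame):
        lines = []
        for thread_id, stack in sys._current_frames().items():
            lines.append('--- thread %d ---\n' % thread_id)
            lines.extend(traceback.format_stack(stack))
        with open('/tmp/buildbot-stacks.txt', 'w') as f:
            f.writelines(lines)

    signal.signal(signal.SIGUSR2, _dump_stacks)

Then a 'kill -USR2 <master pid>' from outside gives us something to
look at even without symbols.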
Neil Gilmore
grammatech.com