[users at bb.net] More from the land of multi-master.
Neil Gilmore
ngilmore at grammatech.com
Wed Oct 4 15:12:20 UTC 2017
Hi everyone,
As you might expect, we're having troubles again. But first, some good
news.
As it turns out, and as we expected, our (second)
CPU-spiking/memory-fragmentation problem was in our builds. The
problem was that we could get enormous strings (one was 302MB) which
the log parser would then attempt to parse, dragging everything down.
Those seem to have been cleared up. At least, I haven't seen any
spikes other than one right after restarting that master, while it
was catching up.
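If any of those giant strings come through as step log output, a
belt-and-suspenders guard would be to cap log sizes on the master,
something like this in master.cfg (the numbers are only illustrative):

    # master.cfg -- cap runaway step logs (sizes here are just examples)
    c['logMaxSize'] = 10 * 1024 * 1024   # truncate any single log past ~10MB
    c['logMaxTailSize'] = 32768          # but keep the last 32KB of the tail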
The other bit of good news is that I have the OK to upgrade our masters
to the latest. I see that 0.9.12 is due out today, so I'll be waiting
for that.
But other news isn't as bright. We're having problems with our most
active master. Big problems.
One problem is obvious. We have a newer version of sqlalchemy than
0.9.3 likes, so we get a lot of deprecation warnings, enough to
significantly slow everything down, especially startup. Looking
through the 0.9.11 code, that appears to have been taken care of, so
upgrading will at least fix the least of our problems.
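Until the upgrade lands, a possible stopgap is to filter those
warnings at the top of master.cfg, assuming the noise really is
SQLAlchemy's SADeprecationWarning and the time is mostly going into
emitting them:

    # master.cfg -- stopgap: silence sqlalchemy deprecation warnings
    import warnings
    from sqlalchemy.exc import SADeprecationWarning

    # 0.9.3 trips a pile of these against newer sqlalchemy; ignoring them
    # avoids the cost of formatting and logging every occurrence.
    warnings.filterwarnings('ignore', category=SADeprecationWarning)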
We have also been getting a lot of errors apparently tied to build
collapsing, which we have turned on globally. If you've been following
along with the anecdotes, you'll know that we've also slightly modified
the circumstances under which a build will be collapsed to ignore
revision (in our case, we always want to use the latest -- we don't care
about building anything 'intermediate'). We'd been getting a lot of
'tried to complete N buildrequests, but only completed M' warnings. And I
left some builders' pages up in my browser long enough to see that every
build (except forced builds) was getting marked as SKIPPED eventually.
Forced builds were never getting claimed. Nor were the skipped builds
marked as claimed, which is odd, because the collapsing code claims
build requests before marking them skipped. And the comments indicate
that a prime suspect for that warning is build requests that were
already claimed.
If that's affecting everything else, it's a new effect, because things
ran fine for months.
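For anyone curious, the tweak is the callable form of
c['collapseRequests'], comparing sourcestamps but skipping the
revision check. A rough sketch along those lines (not our exact code;
the data-API details follow the documentation):

    # master.cfg -- collapse requests while ignoring revision (sketch)
    from twisted.internet import defer

    @defer.inlineCallbacks
    def collapseRequests(master, builder, req1, req2):
        # Fetch the buildset behind each request to compare sourcestamps.
        bs1, bs2 = yield defer.gatherResults([
            master.data.get(('buildsets', req1['buildsetid'])),
            master.data.get(('buildsets', req2['buildsetid'])),
        ])
        stamps1 = bs1['sourcestamps']
        stamps2 = bs2['sourcestamps']

        if len(stamps1) != len(stamps2):
            defer.returnValue(False)

        for s1, s2 in zip(stamps1, stamps2):
            # Same codebase/repository/branch is enough for us; we skip
            # the revision comparison since we only ever want the latest.
            if (s1['codebase'] != s2['codebase'] or
                    s1['repository'] != s2['repository'] or
                    s1['branch'] != s2['branch']):
                defer.returnValue(False)

        defer.returnValue(True)

    c['collapseRequests'] = collapseRequests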
The result of this is that our master is failing in its prime mission,
which is to run builds. I've been occasionally able to get a build to
happen by stopping the worker. When our process starts the worker back
up, and it connects, the master will look for a pending build and start
it. But no subsequent builds will start. And if nothing is queued at
that point, a build that gets queued later, while the worker is still
connected, never starts either. And the builder we use to start
workers, which is scheduled every half hour, didn't run for 18 hours
(though it seems to have just started a build).
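One way to see whether requests are simply piling up unclaimed is to
poke the REST API with a quick script along these lines (the endpoint
and field names are as documented for the data API; the URL is a
placeholder):

    # check_pending.py -- rough look at incomplete/unclaimed build requests
    # Assumes the master's www plugin serves the v2 REST API at this URL.
    import requests

    BASE = 'http://buildmaster.example.com:8010/api/v2'

    resp = requests.get(BASE + '/buildrequests')
    resp.raise_for_status()

    # On a busy master you'd want to filter server-side, but as a sanity
    # check this shows which requests nobody has claimed yet.
    for br in resp.json()['buildrequests']:
        if not br['complete'] and not br['claimed']:
            print('unclaimed request %s for builder %s, submitted at %s' % (
                br['buildrequestid'], br['builderid'], br['submitted_at']))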
And we had a problem where the master became completely unresponsive. It
wasn't producing log messages, nor was it accepting connections,
though the process still showed as running. Unfortunately, that run
was under a standard python instead of the one with debugging symbols
(a previous run had gone out of memory, and our cron job restarted the
master before we could start it manually), so we were a bit out of
luck. The workers' logs were all similar: the connection would be
lost, as would sometimes a remote step, any current commands would be
aborted, and the worker would keep trying to reconnect, being refused
every time.
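Next time it would be nice to at least get a stack dump out of a hung
master without the debug python. Something like this near the top of
master.cfg should do it, with the caveat that it only helps if the
process is still executing Python code at all:

    # master.cfg -- dump every thread's Python stack to a file on SIGUSR2
    # (SIGUSR2 rather than SIGUSR1, since the master already uses SIGUSR1
    # for clean shutdown, if I remember right)
    import signal
    import sys
    import traceback

    def _dump_stacks(signum, frame):
        lines = []
        for thread_id, stack in sys._current_frames().items():
            lines.append('--- thread %d ---\n' % thread_id)
            lines.extend(traceback.format_stack(stack))
        with open('/tmp/buildbot-stacks.txt', 'w') as f:
            f.writelines(lines)

    signal.signal(signal.SIGUSR2, _dump_stacks)

Then a 'kill -USR2 <master pid>' from outside gives us something to
look at even without symbols.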
Neil Gilmore
grammatech.com