[users at bb.net] More from the land of multi-master.

Pierre Tardy tardyp at gmail.com
Thu Oct 5 08:45:33 UTC 2017


On Wed, Oct 4, 2017 at 5:12 PM Neil Gilmore <ngilmore at grammatech.com> wrote:

> Hi everyone,
>
> As you might expect, we're having troubles again. But first, some good
> news.
>
> As it turns out, and as we expected, our (second) CPU spiking/memory
> fragmenting problem was in our builds. The problem was that we could get
> enormous strings (one was 302MB) which the log parser would then attempt
> to parse, dragging everything down. Those seem to have been cleared up.
> At least, I haven't seen any spikes other than one when I've restarted
> that master and it tries to catch up.
>

Good. We added some fixes in 0.9.11 to help with enormous strings.
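
If you want a belt-and-braces cap on the master side as well, the global
logMaxSize / logMaxTailSize options (if I remember the names right) limit how
much of each step log gets kept. A minimal sketch for master.cfg, the sizes
here are just placeholders:

    # master.cfg: cap each step log at ~10MB, keep only the last 32KB
    # of anything that overflows (both numbers are placeholders)
    c['logMaxSize'] = 10 * 1024 * 1024
    c['logMaxTailSize'] = 32768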


> The other bit of good news is that I have the OK to upgrade our masters
> to the latest. I see that 0.9.12 is due out today, so I'll be waiting
> for that.
>
Great. I didn't have time to do that yesterday, but I've set the deadline for
today, 4 pm CET.


>
> But other news isn't as bright. We're having problems with our most
> active master. Big problems.
>
> One problem is obvious. We have a newer version of sqlalchemy than 0.9.3
> likes, so we get a lot of deprecation warnings. Enough to significantly
> slow everything down, especially startup. Looking through the 0.9.11
> code, it looks like that's been taken care of. So updating will at least
> fix the least of our problems.
>
Indeed, 0.9.3 is now quite far behind.
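
Until you are on 0.9.12, one possible stop-gap (assuming the noise really is
SQLAlchemy's SADeprecationWarning) is to silence those warnings at the top of
master.cfg. Just a sketch:

    # master.cfg: temporary workaround until the upgrade; assumes the
    # slowdown comes from SQLAlchemy deprecation warnings being emitted
    import warnings
    from sqlalchemy.exc import SADeprecationWarning
    warnings.simplefilter("ignore", SADeprecationWarning)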


>
> We have also been getting a lot of errors apparently tied to build
> collapsing, which we have turned on globally. If you've been following
> along with the anecdotes, you'll know that we've also slightly modified
> the circumstances under which a build will be collapsed to ignore
> revision (in our case, we always want to use the latest -- we don't care
> about building anything 'intermediate'). We'd been getting a lot of
> 'tried to complete N buildrequests, but only completed M' warnings.

We have also seen other people hit those issues. I made a fix in 0.9.10, but
it looks like some people are still complaining about it, without much of a
clue as to what is wrong beyond what was already fixed.
The known problem was that the N buildrequests were not actually unique
buildrequests; the list contained duplicates.
So those warnings should be pretty harmless beyond the noise.
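
For reference, a custom collapser that ignores revisions would be plugged in
through collapseRequests, roughly along these lines (a sketch assuming the
0.9.x callable signature and data API, not your exact code, so please check
it against the collapsing docs):

    # master.cfg: collapse two pending requests whenever their sourcestamps
    # share codebase and branch, deliberately ignoring 'revision'
    from twisted.internet import defer

    @defer.inlineCallbacks
    def collapseIgnoringRevision(master, builder, req1, req2):
        bs1 = yield master.data.get(('buildsets', req1['buildsetid']))
        bs2 = yield master.data.get(('buildsets', req2['buildsetid']))
        ss1 = {ss['codebase']: ss for ss in bs1['sourcestamps']}
        ss2 = {ss['codebase']: ss for ss in bs2['sourcestamps']}
        if set(ss1) != set(ss2):
            defer.returnValue(False)
        for codebase, ss in ss1.items():
            # 'revision' is intentionally not compared here
            if ss['branch'] != ss2[codebase]['branch']:
                defer.returnValue(False)
        defer.returnValue(True)

    c['collapseRequests'] = collapseIgnoringRevision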


> And I
> left some builders' pages up in my browser long enough to see that every
> build (except forced builds) was getting marked as SKIPPED eventually.
> Forced builds were never getting claimed. Nor were the skipped builds
> marked as claimed, which is odd, because the collapsing code claims
> builds before marking them skipped. And the comments indicate that a
> prime suspect in getting that warning is builds that were already claimed.
>
Normally the buildrequest collapser is not supposed to mark *builds* as
skipped; it only marks buildrequests as skipped.
Could that be coming from something else in your steps?


> If that's affecting everything else, it's a new effect, because things
> ran fine for months.
>
> The result of this is that our master is failing in its prime mission,
> which is to run builds. I've been occasionally able to get a build to
> happen by stopping the worker. When our process starts the worker back
> up, and it connects, the master will look for a pending build and start
> it. But any subsequent builds will not start. And if there aren't any
> queued builds, a build that gets queued while the worker is running is
> not started. And the builder we use to start workers, which is scheduled
> every half hour, didn't run for 18 hours (though it seems to have just
> started a build).
>
Not sure exactly how to answer that. This is not normal, but there are many
reasons that could lead to that situation.
In my experience it is very often related to some customization code that is
failing.
Did the first build finish correctly? Is there a nextWorker function that is
misbehaving? Do you have custom workers?
I've seen people get good results using Manhole to debug those freezes:

https://docs.buildbot.net/current/manual/cfg-global.html#manhole

That could help you poke at the worker and workerforbuilder objects and look
at their states.
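
To give an idea, enabling it is only a couple of lines in master.cfg (a
sketch with placeholder credentials; the exact class names may differ in your
version, so check the manhole docs above):

    # master.cfg: SSH manhole bound to localhost only,
    # username/password here are placeholders
    from buildbot import manhole
    c['manhole'] = manhole.PasswordManhole(
        "tcp:1234:interface=127.0.0.1", "admin", "changeme")

Once connected, the manhole namespace should give you the running master
object, so you can walk down from there into the botmaster, the workers and
the workerforbuilders and see which state they are stuck in.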

>
>