<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Wed, Oct 4, 2017 at 5:12 PM Neil Gilmore <<a href="mailto:ngilmore@grammatech.com">ngilmore@grammatech.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi everyone,<br>

<br>

As you might expect, we're having troubles again. But first, some good<br>

news.<br>

<br>

As it turns out, and as we expected,  our (second) CPU spiking/memory<br>

fragmenting problem was in our builds. The problem was that we could get<br>

enormous strings (one was 302MB) which the log parser would then attempt<br>

to parse, dragging everything down. Those seem to have been cleared up.<br>

At least, I haven't seen any spikes other than one when I've restarted<br>

that master and it tries to catch up.<br></blockquote><div><br></div><div>Good. 0.9.11 included some fixes to help with enormous strings. </div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

The other bit of good news is that I have the OK to upgrade our masters<br>

to the latest. I see that 0.9.12 is due out today, so I'll be waiting<br>

for that.<br></blockquote><div>Great. I didn't have time to release it yesterday, but I've set the deadline to 4pm CET today.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

But other news isn't as bright. We're having problems with our most<br>

active master. Big problems.<br>

<br>

One problem is obvious. We have a newer version of sqlalchemy than 0.9.3<br>

likes, so we get a lot of deprecation warnings. Enough to significantly<br>

slow everything down, especially startup. Looking through the 0.9.11<br>

code, it looks like that's been taken care of. So updating will at least<br>

fix the least of our problems.<br></blockquote><div>Indeed, 0.9.3 is now quite far behind.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

We have also been getting a lot of errors apparently tied to build<br>

collapsing, which we have turned on globally. If you've been following<br>

along with the anecdotes, you'll know that we've also slightly modified<br>

the circumstances under which a build will be collapsed to ignore<br>

revision (in our case, we always want to use the latest -- we don't care<br>

about building anything 'intermediate'). We'd been getting a lot of<br>

'tried to complete N buildrequests, but only completed M' warnings.</blockquote><div>Other people have reported those warnings too. I made a fix in 0.9.10, but some people are still complaining about them, without much of a clue as to what is wrong beyond what was fixed.</div><div class="gmail_quote">The known problem was that the N buildrequests were not actually unique: the list contained duplicates.</div>So those warnings should be pretty harmless beyond the noise.<br class="inbox-inbox-Apple-interchange-newline"><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> And I<br>

left some builders' pages up in my browser long enough to see that every<br>

build (except forced builds) was getting marked as SKIPPED eventually.<br>

Forced builds were never getting claimed. Nor were the skipped builds<br>

marked as claimed, which is odd, because the collapsing code claims<br>

builds before marking them skipped. And the comments indicate that a<br>

prime suspect in getting that warning is builds that were already claimed.<br></blockquote><div>Normally the buildrequest collapser is not supposed to mark *builds* as skipped; it marks buildrequests as skipped.</div><div>So could something else in your steps be doing that?</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

If that's affecting everything else, it's a new effect, because things<br>

ran fine for months.<br>

<br>

The result of this is that our master is failing in its prime mission,<br>

which is to run builds. I've been occasionally able to get a build to<br>

happen by stopping the worker. When our process starts the worker back<br>

up, and it connects, the master will look for a pending build and start<br>

it. But any subsequent builds will not start. And if there aren't any<br>

queued builds, a build that gets queued while the worker is running is<br>

not started. And the builder we use to start workers, which is scheduled<br>

every half hour, didn't run for 18 hours (though it seems to have just<br>

started a build).<br></blockquote><div>I'm not sure exactly how to answer that. This is not normal, but there are many reasons that could lead to this situation. </div><div>In my experience, it is very often related to some customization code that is failing.</div><div>Did the first build finish correctly? Is there a nextWorker that is not behaving correctly? Do you have custom workers?</div><div>I've seen people get good results by using Manhole to debug those freezes.</div><div><br></div><div><a href="https://docs.buildbot.net/current/manual/cfg-global.html#manhole">https://docs.buildbot.net/current/manual/cfg-global.html#manhole</a><br></div><div>That could help you poke into the worker and workerforbuilder objects and look at their states.</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br></blockquote></div></div>
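<div>To illustrate the nextWorker point: here is a rough sketch of a defensive wrapper around a custom nextWorker callable, so that a bug in it shows up in the log instead of quietly preventing builds from starting. It assumes the callable takes (builder, workers, buildrequest) and returns one of the candidates or None, and that it is synchronous (it does not return a Deferred). The name safe_next_worker is made up for this example; it is not part of Buildbot.</div>

```python
import logging

log = logging.getLogger("nextworker")


def safe_next_worker(pick):
    """Wrap a nextWorker-style callable (hypothetical helper, not a Buildbot API).

    If `pick` raises or returns something that is not one of the candidates,
    log it and fall back to the first candidate. A None result is kept,
    since it means "do not start a build now", but it is logged.
    """
    def wrapper(builder, workers, buildrequest):
        if not workers:
            return None  # no candidate workers, nothing to pick
        try:
            choice = pick(builder, workers, buildrequest)
        except Exception:
            log.exception("custom nextWorker raised; falling back")
            return workers[0]
        if choice is None:
            log.warning("custom nextWorker declined all %d workers", len(workers))
            return None
        if choice not in workers:
            log.warning("custom nextWorker returned unknown worker %r", choice)
            return workers[0]
        return choice
    return wrapper
```

<div>If your builders are configured with something like nextWorker=my_picker, you could try nextWorker=safe_next_worker(my_picker) and then watch the master log for the warnings above the next time builds stop being started.</div>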