<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Wed, Oct 4, 2017 at 5:12 PM Neil Gilmore <<a href="mailto:ngilmore@grammatech.com">ngilmore@grammatech.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi everyone,<br>
<br>
As you might expect, we're having troubles again. But first, some good<br>
news.<br>
<br>
As it turns out, and as we expected, our (second) CPU spiking/memory<br>
fragmenting problem was in our builds. The problem was that we could get<br>
enormous strings (one was 302MB) which the log parser would then attempt<br>
to parse, dragging everything down. Those seem to have been cleared up.<br>
At least, I haven't seen any spikes other than one when I've restarted<br>
that master and it tries to catch up.<br></blockquote><div><br></div><div>Good. We had some fixes in 0.9.11 in order to help with enormous strings. </div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
The other bit of good news is that I have the OK to upgrade our masters<br>
to the latest. I see that 0.9.12 is due out today, so I'll be waiting<br>
for that.<br></blockquote><div>Great. I didn't have time to do the release yesterday, but I've set the deadline to today, 4pm CET.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
But other news isn't as bright. We're having problems with our most<br>
active master. Big problems.<br>
<br>
One problem is obvious. We have a newer version of sqlalchemy than 0.9.3<br>
likes, so we get a lot of deprecation warnings. Enough to significantly<br>
slow everything down, especially startup. Looking through the 0.9.11<br>
code, it looks like that's been taken care of. So updating will at least<br>
fix the least of our problems.<br></blockquote><div>Indeed, 0.9.3 is now quite far behind.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
We have also been getting a lot of errors apparently tied to build<br>
collapsing, which we have turned on globally. If you've been following<br>
along with the anecdotes, you'll know that we've also slightly modified<br>
the circumstances under which a build will be collapsed to ignore<br>
revision (in our case, we always want to use the latest -- we don't care<br>
about building anything 'intermediate'). We'd been getting a lot of<br>
'tried to complete N buildrequests, but only completed M' warnings.</blockquote><div>We have seen other people hit those issues as well. I made a fix in 0.9.10, but people are still reporting the warning, without much clue as to what is wrong beyond what was fixed.</div><div class="gmail_quote">The known problem was that the N buildrequests were not actually unique: the list contained duplicates.</div>So those warnings should be pretty harmless beyond the noise.<br class="inbox-inbox-Apple-interchange-newline"><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> And I<br>
left some builders' pages up in my browser long enough to see that every<br>
build (except forced builds) was getting marked as SKIPPED eventually.<br>
Forced builds were never getting claimed. Nor were the skipped builds<br>
marked as claimed, which is odd, because the collapsing code claims<br>
builds before marking them skipped. And the comments indicate that a<br>
prime suspect in getting that warning is builds that were already claimed.<br></blockquote><div>Normally the buildrequest collapser is not supposed to mark *builds* as skipped. It marks buildrequests as skipped.</div><div>So could it be something else, in your steps, that is marking those builds as skipped?</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
If that's affecting everything else, it's a new effect, because things<br>
ran fine for months.<br>
<br>
The result of this is that our master is failing in its prime mission,<br>
which is to run builds. I've been occasionally able to get a build to<br>
happen by stopping the worker. When our process starts the worker back<br>
up, and it connects, the master will look for a pending build and start<br>
it. But any subsequent builds will not start. And if there aren't any<br>
queued builds, a build that gets queued while the worker is running is<br>
not started. And the builder we use to start workers, which is scheduled<br>
every half hour, didn't run for 18 hours (though it seems to have just<br>
started a build).<br></blockquote><div>Not sure exactly how to answer that. This is not normal, but there are many reasons that could lead to that situation. </div><div>In my experience, it is very often related to some customization code that is failing.</div><div>Did the first build finish correctly? Is there a nextWorker that is not behaving correctly? Do you have custom workers?</div><div>I've seen people get good results by using Manhole to debug those freezes.</div><div><br></div><div><a href="https://docs.buildbot.net/current/manual/cfg-global.html#manhole">https://docs.buildbot.net/current/manual/cfg-global.html#manhole</a><br></div><div>That could help you inspect the workers and workerforbuilders objects and check their states.</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br></blockquote></div></div>
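<div><br></div><div>For anyone who wants to try the Manhole suggestion, here is a minimal sketch of enabling it in master.cfg. The port, username, password and host key directory are placeholders, and the exact arguments (in particular ssh_hostkey_dir) are an assumption that should be checked against the Manhole documentation for the installed Buildbot version:</div>
<pre>
# master.cfg fragment; c is the usual BuildmasterConfig dictionary.
from buildbot.plugins import util

# Bind to localhost only: anyone who can log in to the manhole can run
# arbitrary Python code inside the master process.
c['manhole'] = util.PasswordManhole(
    "tcp:9999:interface=127.0.0.1",         # placeholder port
    "admin",                                # placeholder username
    "passwd",                               # placeholder password
    ssh_hostkey_dir="data/ssh_host_keys/",  # placeholder; assumed to hold pre-generated SSH host keys
)
</pre>
<div>Once the master has picked up the new configuration, something like "ssh -p 9999 admin@127.0.0.1" should open a Python prompt inside the master, from which the worker and workerforbuilder objects can be examined while the builds are stuck.</div>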