<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    Pierre,<br>

    <br>

    As always, thanks for the reply and advice.<br>

    <br>

    Note that I've clipped items that were addressed and that I have no

    more comments on.<br>

    <br>

    <div class="moz-cite-prefix">On 10/5/2017 3:45 AM, Pierre Tardy

      wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAJ+soVfUwa-RSsG0F+Kw4+S1u4ZW2OsBsufJ-vT4JDhBSiioNA@mail.gmail.com">

      <div dir="ltr"><br>

        <br>

        <div class="gmail_quote">

          <div dir="ltr">On Wed, Oct 4, 2017 at 5:12 PM Neil Gilmore

            &lt;<a href="mailto:ngilmore@grammatech.com"

              moz-do-not-send="true">ngilmore@grammatech.com</a>&gt;

            wrote:<br>

          </div>

          We have also been getting a lot of errors apparently tied to

          build<br>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">

            collapsing, which we have turned on globally. If you've been

            following<br>

            along with the anecdotes, you'll know that we've also

            slightly modified<br>

            the circumstances under which a build will be collapsed to

            ignore<br>

            revision (in our case, we always want to use the latest --

            we don't care<br>

            about building anything 'intermediate'). We'd been getting a

            lot of<br>

            'tried to complete N buildrequests, but only completed M'

            warnings.</blockquote>

          <div>We have also seen other people hitting those issues. I made

            a fix in 0.9.10, but it looks like people are still

            complaining about it, without much clue as to what is wrong

            beyond what was fixed.</div>

          <div class="gmail_quote">The known problem was that the N

            buildrequests were not actually unique buildrequests; the

            list contained duplicates.</div>

          So those warnings should be pretty harmless beyond the noise.<br

            class="inbox-inbox-Apple-interchange-newline">

        </div>

      </div>

    </blockquote>

    <br>

    Even though the transaction involving those buildrequests gets

    cancelled, so that the original work of marking requests

    doesn't happen? Or would that just mean the requests don't get

    skipped?<br>

    <br>

    It does seem like the incidence of this warning has been going down,

    though we haven't done anything to fix it.<br>

    <br>

    <blockquote type="cite"

cite="mid:CAJ+soVfUwa-RSsG0F+Kw4+S1u4ZW2OsBsufJ-vT4JDhBSiioNA@mail.gmail.com">

      <div dir="ltr">

        <div class="gmail_quote">

          <div> </div>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex"> And I<br>

            left some builders' pages up in my browser long enough to

            see that every<br>

            build (except forced builds) was getting marked as SKIPPED

            eventually.<br>

            Forced builds were never getting claimed. Nor were the

            skipped builds<br>

            marked as claimed, which is odd, because the collapsing code

            claims<br>

            builds before marking them skipped. And the comments

            indicate that a<br>

            prime suspect in getting that warning is builds that were

            already claimed.<br>

          </blockquote>

          <div>Normally the buildrequest collapser is not supposed to

            mark *builds* skipped. It marks buildrequests as skipped.</div>

          <div>So could that be another thing in your steps?</div>

        </div>

      </div>

    </blockquote>

    <br>

    My mistake in using the wrong term here. The code appears to claim

    the request then mark it as skipped. But in the UI, I never see a

    skipped request marked as claimed.<br>

    <br>

    <blockquote type="cite"

cite="mid:CAJ+soVfUwa-RSsG0F+Kw4+S1u4ZW2OsBsufJ-vT4JDhBSiioNA@mail.gmail.com">

      <div dir="ltr">

        <div class="gmail_quote"><br>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">

            The result of this is that our master is failing in its

            prime mission,<br>

            which is to run builds. I've been occasionally able to get a

            build to<br>

            happen by stopping the worker. When our process starts the

            worker back<br>

            up, and it connects, the master will look for a pending

            build and start<br>

            it. But any subsequent builds will not start. And if there

            aren't any<br>

            queued builds, a build that gets queued while the worker is

            running is<br>

            not started. And the builder we use to start workers, which

            is scheduled<br>

            every half hour, didn't run for 18 hours (though it seems to

            have just<br>

            started a build).<br>

          </blockquote>

          <div>Not sure exactly how to answer that. This is not

            normal, but there are many reasons that could lead to

            that situation. </div>

          <div>In my experience, it is very often related to some

            customization code that is failing.</div>

          <div>Did the first build finish correctly? Is there a

            nextWorker that is not behaving correctly? Do you have

            custom workers?</div>

          <div>I've seen people having good results by using Manhole to

            debug those freezes.</div>

        </div>

      </div>

    </blockquote>

    <br>

    The only actual custom code we have is a pair of custom build steps

    that produce logs useful to us, and the modification to collapsing

    to ignore revision.<br>
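    <br>

    A revision-blind comparison of the kind described could be sketched
    roughly as below. This is only an illustration under stated
    assumptions, not the actual modification: the helper name
    same_source_ignoring_revision is hypothetical, the sourcestamps are
    assumed to be plain dicts with codebase, repository, branch, project
    and patch keys, and how the check would be hooked into collapsing is
    left out.<br>

```python
# Hypothetical helper: decide whether two sourcestamps are "the same"
# for collapsing purposes while deliberately ignoring the revision.
# Assumption: each sourcestamp is a plain dict with these keys.
COMPARED_KEYS = ("codebase", "repository", "branch", "project")


def same_source_ignoring_revision(ss1, ss2):
    # Never treat patched sourcestamps as equal: a patch changes what gets built.
    if ss1.get("patch") is not None or ss2.get("patch") is not None:
        return False
    # Everything except the revision has to match.
    return all(ss1.get(key) == ss2.get(key) for key in COMPARED_KEYS)


older = {"codebase": "", "repository": "repo", "branch": "master",
         "project": "proj", "revision": "abc123", "patch": None}
newer = dict(older, revision="def456")
other_branch = dict(older, branch="release")

print(same_source_ignoring_revision(older, newer))         # True
print(same_source_ignoring_revision(older, other_branch))  # False
```

    With a check like this, two requests that differ only in revision
    count as collapsible, so only the latest one would actually be
    built.<br>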

    <br>

    The builder we have to start workers does not use the custom steps,

    though we have collapsing turned on globally. I have not seen that

    builder having any skipped requests. It appears to be running

    normally since yesterday.<br>

    <br>

    For the builder that only wants to run once, the first build

    finishes correctly.<br>

    <br>

    We do not have custom workers.<br>

    <br>

    <blockquote type="cite"

cite="mid:CAJ+soVfUwa-RSsG0F+Kw4+S1u4ZW2OsBsufJ-vT4JDhBSiioNA@mail.gmail.com">

      <div dir="ltr">

        <div class="gmail_quote">

          <div><a

              href="https://docs.buildbot.net/current/manual/cfg-global.html#manhole"

              moz-do-not-send="true">https://docs.buildbot.net/current/manual/cfg-global.html#manhole</a><br>

          </div>

          <div>That could help you peek into the worker and

            workerforbuilder objects to look at their states.</div>


        </div>

      </div>

    </blockquote>

    <br>

    I've used the manhole before, but not for this. I've had to use it

    in the past to manually finish stuck builds, and to manually release

    locks when necessary (though I haven't had to do that in a long

    time).<br>

    <br>

    But we don't leave the manhole open, which means that I reconfig

    when I'm going to use it (and since we use the same master.cfg for

    all the masters, the manhole would try, and probably fail, to open

    for all of them). Lately, that hasn't been a good option, because

    when we were having the CPU spikes, the reconfig would never finish

    (it might run for 24 hours or more until we were going to restart

    the master anyway). It might work now, though, since we seem to have

    solved the CPU problem.<br>
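    <br>

    One way around the shared master.cfg problem might be to derive the
    manhole settings per master, so that only a named master opens it,
    and on its own port. A minimal sketch, assuming each master can be
    told its own name (here through a hypothetical MASTER_NAME
    environment variable); the master names, the base port and the
    helper manhole_endpoint are all made up for illustration, and the
    returned string would be handed to whichever manhole class
    master.cfg already uses.<br>

```python
import os

# Hypothetical names: one entry per master sharing this master.cfg.
MASTERS = ["master-a", "master-b", "master-c"]
BASE_PORT = 9990  # assumed to be free on the master hosts


def manhole_endpoint(master_name, enabled_for):
    """Return a localhost-only endpoint string, or None to keep the manhole closed."""
    if master_name not in enabled_for or master_name not in MASTERS:
        return None
    # Offset by the master's position so no two masters ask for the same port.
    port = BASE_PORT + MASTERS.index(master_name)
    return "tcp:%d:interface=127.0.0.1" % port


# Assumption: each master process is started with MASTER_NAME set.
this_master = os.environ.get("MASTER_NAME", "master-a")
endpoint = manhole_endpoint(this_master, enabled_for={"master-b"})
# In master.cfg the manhole would only be configured when endpoint is not None.
print(this_master, endpoint)
```

    Only the masters listed in enabled_for would try to open a manhole
    on reconfig; the others get None and skip it.<br>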

    <br>

    Neil Gilmore<br>

    grammatech.com<br>

  </body>

</html>