<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">Hi all,<br>
<br>
I think the API requests that Neil mentions may have been a red
herring. He credits the elimination of the API requests with the
reduced memory consumption, but what actually did the trick was
deleting 95% of the buildrequests from the database.<br>
<br>
I've been digging into the UI master's memory use a bit more using
a debug build of python, a heap profiler (tcmalloc), a
python-layer heap profiler (guppy/heapy), manhole, and gdb.
Python is built --without-pymalloc so that the heap profiler will
be useful.<br>
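<p>For reference, the python-layer sampling is roughly the following,
run from a manhole session (a minimal sketch of how we drive guppy;
setrelheap/heap are standard guppy calls, but when and where we run
them is specific to our setup):</p>
<pre wrap="">from guppy import hpy

hp = hpy()
hp.setrelheap()     # only count objects allocated after this point
# ... let the master run for a while, then, from the same session:
print(hp.heap())    # per-type breakdown, comparable to the tcmalloc view
</pre>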
<p>I don't think there are outright leaks. We see the heap
payload yo-yo between 1 and 10 GB frequently once things get
warmed up; the important thing is that it generally goes back
down. Fragmentation, though, is certainly getting worse with
time.</p>
<p>A heap profile taken when 10 GB were in use reveals the
following:</p>
<p>dict - 6 GB<br>
str - 2 GB<br>
list - 1 GB<br>
datetime - 300 MB<br>
unicode - 300 MB<br>
...</p>
<p><br>
</p>
<p>That's a lot of datetime objects.<br>
</p>
<p>guppy/heapy does not seem to detect nearly as many objects as
the malloc-level profiler. It is unclear whether that is because
we are running it at the wrong time or because it simply doesn't
find all the objects, so I'm unable to describe a random sampling
of the values of the objects responsible for using the space.<br>
</p>
<p>Using gdb to break into the process, I've noticed that it is
usually spending its time freeing lists of 2 million BrDict
objects, or freeing postgres result-set objects, also of length
2 million. In my opinion, the UI master has no business having
2 million build requests in memory at the same time. <br>
</p>
<p>2 million is the number of build requests in the database:<br>
</p>
<pre wrap="">select seq_tup_read+idx_tup_fetch as tup_fetch, relname, n_live_tup,
       seq_scan, seq_tup_read, idx_tup_fetch
  from pg_stat_user_tables
 order by seq_tup_read+idx_tup_fetch desc;

  tup_fetch   |        relname        | n_live_tup | seq_scan  | seq_tup_read | idx_tup_fetch
--------------+-----------------------+------------+-----------+--------------+---------------
 326697132142 | logchunks             |   16936743 |         3 |      7038305 |  326690093837
 129499345714 | buildrequests         |    2128108 |     48979 |  24905174984 |  104594170730
  17425488144 | tags                  |         55 | 300167451 |  16087109106 |    1338379038
  16679707813 | buildrequest_claims   |    1578498 |    272420 |  16207663053 |     472044760
   6279416129 | buildset_sourcestamps |     174852 |    519155 |   1035926078 |    5243490051
   6199684579 | buildsets             |     174852 |    175191 |   2851064037 |    3348620542
   3992967017 | builds                |     300202 |     48125 |   3328395644 |     664571373
   2505110400 | build_properties      |    3283964 |         1 |            0 |    2505110400
   1849345467 | builders              |        668 |   1149411 |    671123174 |    1178222293
   1451803528 | changes               |       4500 |    709374 |   1426480341 |      25323187
   1383266706 | sourcestamps          |       4502 |    301851 |    111653587 |    1271613119
</pre>
<p>The table above is sorted, approximately, by the amount of
reading performed against each table; n_live_tup is the
relation's cardinality (live row count).<br>
</p>
Only about 1,000 of those 2 million build requests are incomplete:<br>
<pre wrap=""># select count(*) from buildrequests where complete=0;
 count
-------
  1088
(1 row)
</pre>
<br>
<br>
I'm thinking the UI master is repeatedly fetching all
buildrequests going back to the beginning of time; based on the
timestamps in the buildrequests table, these rows never get
deleted. The nature of Twisted makes it difficult to see which
URLs are causing this work, since all I get are call stacks where
the results are being freed. This code in buildrequests.py is
sometimes at the end of the stack:<br>
<pre wrap="">        if resultSpec is not None:
            return resultSpec.thd_execute(
                conn, q,
                lambda r: self._brdictFromRow(r, self.db.master.masterid))
</pre>
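<br>
If I wanted to bound that query from the caller's side, I believe the
data layer would let me pass something like the following (just a
sketch; ResultSpec comes from buildbot.data.resultspec in 0.9.x, and
get_incomplete_brdicts is an illustrative name, not code from our
master, so treat the exact call as an assumption):<br>
<pre wrap="">from twisted.internet import defer
from buildbot.data.resultspec import ResultSpec

@defer.inlineCallbacks
def get_incomplete_brdicts(master):
    # ask the db layer for incomplete requests only, and cap the row count,
    # instead of pulling every buildrequest back to the beginning of time
    rs = ResultSpec(limit=5000)
    brdicts = yield master.db.buildrequests.getBuildRequests(
        complete=False, resultSpec=rs)
    defer.returnValue(brdicts)
</pre>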
<br>
<br>
This method is also popular:<br>
<br>
<pre wrap="">class Db2DataMixin(object):
    def db2data(self, dbdict):
        data = {
            'buildrequestid': dbdict['buildrequestid'],
            'buildsetid': dbdict['buildsetid'],
            'builderid': dbdict['builderid'],
            'priority': dbdict['priority'],
            'claimed': dbdict['claimed'],
            'claimed_at': dbdict['claimed_at'],
            'claimed_by_masterid': dbdict['claimed_by_masterid'],
            'complete': dbdict['complete'],
            'results': dbdict['results'],
            'submitted_at': dbdict['submitted_at'],
            'complete_at': dbdict['complete_at'],
            'waited_for': dbdict['waited_for'],
        }
        return defer.succeed(data)
</pre>
<br>
Above that point, it's mostly boring Twisted functions with names
like "run".<br>
<br>
I've also noticed that the /api/v2/builds URL produces a 100 MB
response and takes about 12 minutes and several GB of memory to do
so. I don't believe it is responsible, though, because the problem
continued to happen after those requests were discontinued.<br>
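<br>
If those consumers come back, paging the endpoint would at least
spread the cost out. Something like this is what I have in mind (a
sketch only; I'm assuming the v2 REST API honours the limit/offset/order
query parameters the way the docs describe, and fetch_builds is just an
illustrative name):<br>
<pre wrap="">import json
import urllib
import urllib2

def fetch_builds(base_url, page_size=500):
    # walk /api/v2/builds a page at a time instead of dumping the whole table
    builds, offset = [], 0
    while True:
        query = urllib.urlencode(
            {"order": "-buildid", "limit": page_size, "offset": offset})
        url = "%s/api/v2/builds?%s" % (base_url, query)
        chunk = json.load(urllib2.urlopen(url)).get("builds", [])
        builds.extend(chunk)
        if len(chunk) != page_size:
            return builds
        offset += page_size
</pre>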
<br>
I have added ON DELETE CASCADE to many of the foreign-key
references in the database and deleted all but the most recent
100,000 buildrequests (which in turn cleans out the corresponding
builds and logs). The performance issue in the UI master seems to
be worked around, and I'm happy enough to do this for the time
being.<br>
<br>
However, there is another bottleneck. It is frequently the case
that a build request gets put in and then sits for ~6 hours before
a worker is asked to start it (the configuration says that delay
should be about 10 minutes, not 6 hours). The 0.9 web UI doesn't
show timestamps for when builds start/finish, so it's hard to say
exactly what the timeline is. I'm pretty sure the master is at
fault for not dispatching these things promptly. Any ideas on how
to debug this? Could the huge number of build requests in the DB
also be responsible for this? postgres itself isn't breaking a
sweat.<br>
<br>
I noticed one of the non-UI masters was fully utilizing a CPU
for many hours last night. Unfortunately, it was running under a
stripped python binary, so I have limited insight into what it
was doing. It looks like it was running a regex substitution:<br>
#0 0x000000000054d6a3 in sre_charset.67717 ()<br>
#1 0x000000000045207c in sre_search.67817 ()<br>
#2 0x000000000047f714 in pattern_subx ()<br>
#3 0x000000000047fdbd in pattern_sub ()<br>
#4 0x0000000000536f94 in PyEval_EvalFrameEx ()<br>
<br>
<br>
My hunch is that this happens whenever someone makes a huge commit
(many changed files), like a branch merge. Do you folks have any
intuition about which regex this might be? This problem seems to
come and go depending on the commits of the day. I do not know if
this is related to the slow processing of build requests.<br>
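<br>
One thing I might try next time it happens on a master where I do
have manhole: wrap re.sub from the manhole session so that slow
substitutions get logged with their pattern. This is only a sketch of
the idea (it would miss calls made on pre-compiled pattern objects,
and the 1-second threshold is arbitrary):<br>
<pre wrap="">import re
import time

_orig_sub = re.sub

def _timed_sub(pattern, repl, string, *args, **kwargs):
    # time each module-level re.sub call and report the slow ones
    start = time.time()
    try:
        return _orig_sub(pattern, repl, string, *args, **kwargs)
    finally:
        elapsed = time.time() - start
        if elapsed > 1.0:
            print("slow re.sub: %.1fs pattern=%r input length %d"
                  % (elapsed, pattern, len(string)))

re.sub = _timed_sub
</pre>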
<br>
- Dave<br>
<br>
On 8/18/2017 10:41 AM, Neil Gilmore wrote:<br>
</div>
<blockquote type="cite"
cite="mid:b9cb4859-38db-7967-8b0e-b56762880ef6@grammatech.com">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Hi again,<br>
<br>
This morning's update, after spending some quality time trying to
figure out just what's going on...<br>
<br>
We stuffed in some of the google perftools stuff to try to see
where memory was going. That didn't tell us much, other than
there's a bunch of data in dictionaries (big surprise there).<br>
<br>
Top was showing a reserve much higher than the current memory
usage. Not surprising if python isn't actually giving memory back
and that's acting as a high water mark.<br>
<br>
Last night, when people were less likely to be around, my boss
attached gdb to our UI master. According to him, it spends quite a
bit of time dealing with lists of 2 million build requests. His
synopsis is that it makes a SQL query that results in 2 million
build requests, does stuff with them, then spends time freeing
them.<br>
<br>
We also spent a bit of time in the http logs. We were certainly
able to spike memory usage by hitting the REST api.<br>
<br>
Which, as it turns out, some of our users do. One of them says
that UI is 'too slow', so she's grabbing the info she needs every
10 minutes. Another has a python script that grabs various
combinations of builds and builders. But he had, at least last
night, 202 processes running that script, because it didn't check
to see if the previous run had finished. And he had it running in
cron.<br>
<br>
That's a bit like having an extra couple hundred users hammering
the UI at once. I haven't exactly explored that side of the code,
but I can envision scenarios that might spike memory to the point
where the system kills it.<br>
<br>
I doubt that any of this is new. It was probably people trying to
work around the limitations of our old, single-master, 0.8.x
installation.<br>
<br>
This morning, after killing those scripts and the cron job, top
looks a lot better. That master is using about 1.3G. Under our old
conditions, it blows past that in the first couple minutes it's
running.<br>
<br>
We'll see how it does the rest of today. I did have to restart
that other master again.<br>
<br>
Neil Gilmore<br>
grammatech.com<br>
<br>
<div class="moz-cite-prefix">On 8/17/2017 10:10 AM, Neil Gilmore
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:4a70aaca-5a2d-76a1-7ef9-d91eec1f0747@grammatech.com">
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8">
Hi everyone.<br>
<br>
This is a sort of extension to my original message.<br>
<br>
I had to restart one of our other masters this morning. It
apparently lost its WAMP connection, which happens on occasion.
The symptoms are that none of its builders show up in the UI.
The master itself was running, though it may depend on the
definition of 'running'. The process was present, but it sat for
minutes after its last log message, which is very unusual for
that master (it's the one that does most of the builds).<br>
<br>
I've also found that the stop command is nearly useless for our
masters. If they've been up for a while, doing things, they
don't ever seem to actually shut down. In this morning's case, I
did the stop command. The log showed the process getting the
SIGTERM. Then I waited for 10 minutes. The process didn't stop.
I finally killed it (and a regular kill won't do it, I have to
-9) because I have people asking me about it constantly when
it's not working right.<br>
<br>
I'll also point out that we've been accumulating integrity
errors in the database again. They look like this in the UI
master's log:<br>
<br>
sqlalchemy.exc.IntegrityError: (IntegrityError) update
or delete on table "changes" violates foreign key constraint
"changes_parent_changeids_fkey" on table "changes"<br>
DETAIL: Key (changeid)=(4188) is still referenced from
table "changes".<br>
'DELETE FROM changes WHERE changes.changeid IN
(%(changeid_1)s, %(changeid_2)s, %(changeid_3)s, %(changeid_4)s,
%(changeid_5)s, %(changeid_6)s, %(changeid_7)s, %(changeid_8)s,
%(changeid_9)s, %(changeid_10)s, %(changeid_11)s,
%(changeid_12)s, %(changeid_13)s, %(changeid_14)s,
%(changeid_15)s, %(changeid_16)s, %(changeid_17)s,
%(changeid_18)s, %(changeid_19)s, %(changeid_20)s,
%(changeid_21)s, %(changeid_22)s, %(changeid_23)s,
%(changeid_24)s, %(changeid_25)s, %(changeid_26)s,
%(changeid_27)s, %(changeid_28)s, %(changeid_29)s,
%(changeid_30)s, %(changeid_31)s, %(changeid_32)s,
%(changeid_33)s, %(changeid_34)s, %(changeid_35)s,
%(changeid_36)s, %(changeid_37)s, %(changeid_38)s,
%(changeid_39)s, %(changeid_40)s, %(changeid_41)s,
%(changeid_42)s, %(changeid_43)s, %(changeid_44)s,
%(changeid_45)s, %(changeid_46)s, %(changeid_47)s,
%(changeid_48)s, %(changeid_49)s, %(changeid_50)s,
%(changeid_51)s, %(changeid_52)s, %(changeid_53)s,
%(changeid_54)s, %(changeid_55)s, %(changeid_56)s,
%(changeid_57)s, %(changeid_58)s, %(changeid_59)s,
%(changeid_60)s, %(changeid_61)s, %(changeid_62)s,
%(changeid_63)s, %(changeid_64)s, %(changeid_65)s,
%(changeid_66)s, %(changeid_67)s, %(changeid_68)s,
%(changeid_69)s, %(changeid_70)s, %(changeid_71)s,
%(changeid_72)s, %(changeid_73)s, %(changeid_74)s,
%(changeid_75)s, %(changeid_76)s, %(changeid_77)s,
%(changeid_78)s, %(changeid_79)s, %(changeid_80)s,
%(changeid_81)s, %(changeid_82)s, %(changeid_83)s,
%(changeid_84)s, %(changeid_85)s, %(changeid_86)s,
%(changeid_87)s, %(changeid_88)s, %(changeid_89)s,
%(changeid_90)s, %(changeid_91)s, %(changeid_92)s,
%(changeid_93)s, %(changeid_94)s, %(changeid_95)s,
%(changeid_96)s, %(changeid_97)s, %(changeid_98)s,
%(changeid_99)s, %(changeid_100)s)' {'changeid_100': 4180,
'changeid_29': 4251, 'changeid_28': 4252, 'changeid_27': 4253,
'changeid_26': 4254, 'changeid_25': 4255, 'changeid_24': 4256,
'changeid_23': 4257, 'changeid_22': 4258, 'changeid_21': 4259,
'changeid_20': 4260, 'changeid_89': 4191, 'changeid_88': 4192,
'changeid_81': 4199, 'changeid_80': 4200, 'changeid_83': 4197,
'changeid_82': 4198, 'changeid_85': 4195, 'changeid_84': 4196,
'changeid_87': 4193, 'changeid_86': 4194, 'changeid_38': 4242,
'changeid_39': 4241, 'changeid_34': 4246, 'changeid_35': 4245,
'changeid_36': 4244, 'changeid_37': 4243, 'changeid_30': 4250,
'changeid_31': 4249, 'changeid_32': 4248, 'changeid_33': 4247,
'changeid_98': 4182, 'changeid_99': 4181, 'changeid_96': 4184,
'changeid_97': 4183, 'changeid_94': 4186, 'changeid_95': 4185,
'changeid_92': 4188, 'changeid_93': 4187, 'changeid_90': 4190,
'changeid_91': 4189, 'changeid_63': 4217, 'changeid_62': 4218,
'changeid_61': 4219, 'changeid_60': 4220, 'changeid_67': 4213,
'changeid_66': 4214, 'changeid_65': 4215, 'changeid_64': 4216,
'changeid_69': 4211, 'changeid_68': 4212, 'changeid_16': 4264,
'changeid_17': 4263, 'changeid_14': 4266, 'changeid_15': 4265,
'changeid_12': 4268, 'changeid_13': 4267, 'changeid_10': 4270,
'changeid_11': 4269, 'changeid_18': 4262, 'changeid_19': 4261,
'changeid_70': 4210, 'changeid_71': 4209, 'changeid_72': 4208,
'changeid_73': 4207, 'changeid_74': 4206, 'changeid_75': 4205,
'changeid_76': 4204, 'changeid_77': 4203, 'changeid_78': 4202,
'changeid_79': 4201, 'changeid_45': 4235, 'changeid_44': 4236,
'changeid_47': 4233, 'changeid_46': 4234, 'changeid_41': 4239,
'changeid_40': 4240, 'changeid_43': 4237, 'changeid_42': 4238,
'changeid_49': 4231, 'changeid_48': 4232, 'changeid_4': 4276,
'changeid_5': 4275, 'changeid_6': 4274, 'changeid_7': 4273,
'changeid_1': 4279, 'changeid_2': 4278, 'changeid_3': 4277,
'changeid_8': 4272, 'changeid_9': 4271, 'changeid_58': 4222,
'changeid_59': 4221, 'changeid_52': 4228, 'changeid_53': 4227,
'changeid_50': 4230, 'changeid_51': 4229, 'changeid_56': 4224,
'changeid_57': 4223, 'changeid_54': 4226, 'changeid_55': 4225}<br>
<br>
We did have the UI master get stopped again for being OOM,
somewhere north of 50G. It's currently at 46.8G, and it has been
running only a couple of days. My plan, when I get time, is to use
some stuff we have in our own product based on guppy to see what's
going on in memory, by attaching gdb to it and using
PyRun_SimpleString() to run the heap-checking stuff. But for the
time being, I can keep restarting things.<br>
<br>
My gut feeling is that I'm going to have to shut everything down
and start with a clean database.<br>
<br>
Neil Gilmore<br>
<a class="moz-txt-link-abbreviated"
href="mailto:raito@raito.com" moz-do-not-send="true">raito@raito.com</a><br>
<br>
<br>
<div class="moz-cite-prefix">On 8/10/2017 3:39 PM, Pierre Tardy
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAJ+soVe7Z980tGD5XVJKABGqaD-62j=BPUhds06eE+wtF0C9LQ@mail.gmail.com">
<div dir="ltr">Mmh.. sounds like you found a memory leak in
the UI.
<div><br>
<div>Several points to check that could be out of control
for the UI:</div>
<div>Messages could be queuing up over hung (or badly
closed) websockets.</div>
<div>REST query data could be keeping references around in
some way, which would make them just accumulate in memory.</div>
<div>Somebody could be doing huge REST API queries. It is
unfortunately quite easy to just hit /api/v2/builds and
dump the whole build db.</div>
<div>This wouldn't go all the way to 45GB IMO.</div>
<div><br>
</div>
<div>During my perf tests, I saw that the logs system is quite
memory hungry, and that buildbot trades memory for CPU if the
master can't keep up with recording all the logs in the
database (nothing fancy: the twisted queues will just keep
growing while waiting for callbacks to be called). </div>
<div>Due to several rounds of encoding and decoding, the memory
usage in the log queue could be 2-4 times the amount of
actual log bytes processed. Anyway, this is probably not
your problem.</div>
<div><br>
</div>
<div>You can look at <a
href="https://github.com/tardyp/buildbot_profiler"
moz-do-not-send="true">https://github.com/tardyp/buildbot_profiler</a> and
maybe improve it to add a memory profiler.</div>
<div><br>
</div>
<div><br>
</div>
<div>Pierre</div>
<div> </div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr">Le jeu. 10 août 2017 à 21:52, Neil Gilmore
<<a href="mailto:ngilmore@grammatech.com"
moz-do-not-send="true">ngilmore@grammatech.com</a>> a
écrit :<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all,<br>
<br>
A couple of times in the last couple weeks, the master
running our UI<br>
has crashed/exited/ceased to run. This most recent time it
was a<br>
MemoryError, backed up by the kernel log saying it was
OOM. Current run<br>
shows it using 43.1G memory. Seems like a lot, doesn't it?
The other 2<br>
larger masters (the ones actually running builds) are
about 2G, and the<br>
remaining, smallest one is about 1.7G.<br>
<br>
What does the UI hold onto that makes it so large?<br>
<br>
I'm supposed to get some heap profiling thingy wedged into
buildbot so I<br>
can try to figure it out. My current plan is to introduce
a new build<br>
step (like our current custom ones) that triggers the
profiling process,<br>
create a sort of dummy build using the step, and create a
force<br>
scheduler to trigger that build. That'll keep it on our UI
master<br>
anyway. Is there a better way to trigger such code in the
master at<br>
arbitrary times?<br>
<br>
I'm also starting to get complaints (and I've noticed this
myself), that<br>
sometimes queued builds take a long time to start after
the previous<br>
build finishes. Sometimes on the order of a half-hour or
more. Or at<br>
least it appears so. I'm going on what the UI tells me. I
haven't tried<br>
delving into the logs of the master controlling the builds
to match up<br>
times. Any ideas? I'm afraid I'm not up on exactly
how/when builds are<br>
supposed to get started.<br>
<br>
And our database seems to be accumulating integrity errors
again.<br>
<br>
Neil Gilmore<br>
<a href="mailto:raito@raito.com" target="_blank"
moz-do-not-send="true">raito@raito.com</a><br>
</blockquote>
</div>
</blockquote>
<br>
<br>
</blockquote>
<br>
<br>
</blockquote>
<br>
</body>
</html>