<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">Hi all,<br>
<br>
I think the API requests that Neil mentions may have been a red
herring. He credits the elimination of the API requests with the
reduced memory consumption, but what actually did the trick was
deleting 95% of the buildrequests from the database.<br>
<br>
I've been digging into the UI master's memory use a bit more using
a debug build of python, a heap profiler (tcmalloc), a
python-layer heap profiler (guppy/heapy), manhole, and gdb.
Python is built --without-pymalloc so that the heap profiler will
be useful.<br>
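<p>For reference, the python-layer sampling is roughly the following,
run from a manhole session (a minimal sketch of how we drive guppy;
setrelheap/heap are standard guppy calls, but when and where we run
them is specific to our setup):</p>
<pre wrap="">from guppy import hpy

hp = hpy()
hp.setrelheap()     # only count objects allocated after this point
# ... let the master run for a while, then, from the same session:
print(hp.heap())    # per-type breakdown, comparable to the tcmalloc view
</pre>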
<p>I don't think there are outright leaks. We see the heap
payload yo-yo between 1 and 10 GB frequently once things get
warmed up; the important thing is that it generally goes back
down. Fragmentation, though, is certainly getting worse with
time.</p>
<p>A heap profile taken when 10 GB were in use reveals the
following:</p>
<p>dict - 6 GB<br>
str - 2 GB<br>
list - 1 GB<br>
datetime - 300 MB<br>
unicode - 300 MB<br>
...</p>
<p><br>
</p>
<p>That's a lot of datetime objects.<br>
</p>
<p>guppy/heapy does not seem to detect nearly as many objects as
the malloc-level profiler. It is unclear whether that is because
we are running it at the wrong time or because it simply doesn't
find all the objects, so I'm unable to describe a random sampling
of the values of the objects responsible for using the space.<br>
</p>
<p>Using gdb to break into the process, I've noticed that it is
usually spending its time freeing lists of 2 million BrDict
objects, or freeing postgres result-set objects, also of length
2 million. In my opinion, the UI master has no business having
2 million build requests in memory at the same time. <br>
</p>
<p>2 million is the number of build requests in the database:<br>
</p>
<pre wrap="">select seq_tup_read+idx_tup_fetch as tup_fetch, relname, n_live_tup,
       seq_scan, seq_tup_read, idx_tup_fetch
  from pg_stat_user_tables
 order by seq_tup_read+idx_tup_fetch desc;

  tup_fetch   |        relname        | n_live_tup | seq_scan  | seq_tup_read | idx_tup_fetch
--------------+-----------------------+------------+-----------+--------------+---------------
 326697132142 | logchunks             |   16936743 |         3 |      7038305 |  326690093837
 129499345714 | buildrequests         |    2128108 |     48979 |  24905174984 |  104594170730
  17425488144 | tags                  |         55 | 300167451 |  16087109106 |    1338379038
  16679707813 | buildrequest_claims   |    1578498 |    272420 |  16207663053 |     472044760
   6279416129 | buildset_sourcestamps |     174852 |    519155 |   1035926078 |    5243490051
   6199684579 | buildsets             |     174852 |    175191 |   2851064037 |    3348620542
   3992967017 | builds                |     300202 |     48125 |   3328395644 |     664571373
   2505110400 | build_properties      |    3283964 |         1 |            0 |    2505110400
   1849345467 | builders              |        668 |   1149411 |    671123174 |    1178222293
   1451803528 | changes               |       4500 |    709374 |   1426480341 |      25323187
   1383266706 | sourcestamps          |       4502 |    301851 |    111653587 |    1271613119
</pre>
<p>The table above is sorted, approximately, by the amount of
reading performed against each table; n_live_tup is the
relation's cardinality (live row count).<br>
</p>
Only about 1,000 of those 2 million build requests are incomplete:<br>
<pre wrap=""># select count(*) from buildrequests where complete=0;
 count
-------
  1088
(1 row)
</pre>
<br>
<br>
I'm thinking the UI master is repeatedly fetching all
buildrequests going back to the beginning of time; based on the
timestamps in the buildrequests table, these rows never get
deleted. The nature of Twisted makes it difficult to see which
URLs are causing this work, since all I get are call stacks where
the results are being freed. This code in buildrequests.py is
sometimes at the end of the stack:<br>
<pre wrap="">        if resultSpec is not None:
            return resultSpec.thd_execute(
                conn, q,
                lambda r: self._brdictFromRow(r, self.db.master.masterid))
</pre>
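<br>
If I wanted to bound that query from the caller's side, I believe the
data layer would let me pass something like the following (just a
sketch; ResultSpec comes from buildbot.data.resultspec in 0.9.x, and
get_incomplete_brdicts is an illustrative name, not code from our
master, so treat the exact call as an assumption):<br>
<pre wrap="">from twisted.internet import defer
from buildbot.data.resultspec import ResultSpec

@defer.inlineCallbacks
def get_incomplete_brdicts(master):
    # ask the db layer for incomplete requests only, and cap the row count,
    # instead of pulling every buildrequest back to the beginning of time
    rs = ResultSpec(limit=5000)
    brdicts = yield master.db.buildrequests.getBuildRequests(
        complete=False, resultSpec=rs)
    defer.returnValue(brdicts)
</pre>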
<br>
<br>
This method is also popular:<br>
<br>
<pre wrap="">class Db2DataMixin(object):
    def db2data(self, dbdict):
        data = {
            'buildrequestid': dbdict['buildrequestid'],
            'buildsetid': dbdict['buildsetid'],
            'builderid': dbdict['builderid'],
            'priority': dbdict['priority'],
            'claimed': dbdict['claimed'],
            'claimed_at': dbdict['claimed_at'],
            'claimed_by_masterid': dbdict['claimed_by_masterid'],
            'complete': dbdict['complete'],
            'results': dbdict['results'],
            'submitted_at': dbdict['submitted_at'],
            'complete_at': dbdict['complete_at'],
            'waited_for': dbdict['waited_for'],
        }
        return defer.succeed(data)
</pre>
<br>
Above that point, it's mostly boring Twisted functions with names
like "run".<br>
<br>
I've also noticed that the /api/v2/builds URL produces a 100 MB
response and takes about 12 minutes and several GB of memory to do
so. I don't believe it is responsible, though, because the problem
continued to happen after those requests were discontinued.<br>
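<br>
If those consumers come back, paging the endpoint would at least
spread the cost out. Something like this is what I have in mind (a
sketch only; I'm assuming the v2 REST API honours the limit/offset/order
query parameters the way the docs describe, and fetch_builds is just an
illustrative name):<br>
<pre wrap="">import json
import urllib
import urllib2

def fetch_builds(base_url, page_size=500):
    # walk /api/v2/builds a page at a time instead of dumping the whole table
    builds, offset = [], 0
    while True:
        query = urllib.urlencode(
            {"order": "-buildid", "limit": page_size, "offset": offset})
        url = "%s/api/v2/builds?%s" % (base_url, query)
        chunk = json.load(urllib2.urlopen(url)).get("builds", [])
        builds.extend(chunk)
        if len(chunk) != page_size:
            return builds
        offset += page_size
</pre>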
<br>
I have added ON DELETE CASCADE to many of the foreign-key
references in the database and deleted all but the most recent
100,000 buildrequests (which in turn cleans out the corresponding
builds and logs). The performance issue in the UI master seems to
be worked around, and I'm happy enough to do this for the time
being.<br>
<br>
However, there is another bottleneck. It is frequently the case
that a build request gets put in and then sits for ~6 hours before
a worker is asked to start it (the configuration says that delay
should be about 10 minutes, not 6 hours). The 0.9 web UI doesn't
show timestamps for when builds start/finish, so it's hard to say
exactly what the timeline is. I'm pretty sure the master is at
fault for not dispatching these things promptly. Any ideas on how
to debug this? Could the huge number of build requests in the DB
also be responsible for this? postgres itself isn't breaking a
sweat.<br>
<br>
I noticed one of the non-UI masters was fully utilizing a CPU
for many hours last night. Unfortunately, it was running under a
stripped python binary, so I have limited insight into what it
was doing. It looks like it was running a regex substitution:<br>
#0 0x000000000054d6a3 in sre_charset.67717 ()<br>
#1 0x000000000045207c in sre_search.67817 ()<br>
#2 0x000000000047f714 in pattern_subx ()<br>
#3 0x000000000047fdbd in pattern_sub ()<br>
#4 0x0000000000536f94 in PyEval_EvalFrameEx ()<br>
<br>
<br>
My hunch is that this happens whenever someone makes a huge commit
(many changed files), like a branch merge. Do you folks have any
intuition about which regex this might be? This problem seems to
come and go depending on the commits of the day. I do not know if
this is related to the slow processing of build requests.<br>
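<br>
One thing I might try next time it happens on a master where I do
have manhole: wrap re.sub from the manhole session so that slow
substitutions get logged with their pattern. This is only a sketch of
the idea (it would miss calls made on pre-compiled pattern objects,
and the 1-second threshold is arbitrary):<br>
<pre wrap="">import re
import time

_orig_sub = re.sub

def _timed_sub(pattern, repl, string, *args, **kwargs):
    # time each module-level re.sub call and report the slow ones
    start = time.time()
    try:
        return _orig_sub(pattern, repl, string, *args, **kwargs)
    finally:
        elapsed = time.time() - start
        if elapsed > 1.0:
            print("slow re.sub: %.1fs pattern=%r input length %d"
                  % (elapsed, pattern, len(string)))

re.sub = _timed_sub
</pre>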
<br>
- Dave<br>
<br>
On 8/18/2017 10:41 AM, Neil Gilmore wrote:<br>
</div>
<blockquote type="cite"
cite="mid:b9cb4859-38db-7967-8b0e-b56762880ef6@grammatech.com">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Hi again,<br>
<br>
This morning's update, after spending some quality time trying to
figure out just what's going on...<br>
<br>
We stuffed in some of the google perftools stuff to try to see
where memory was going. That didn't tell us much, other than
there's a bunch of data in dictionaries (big surprise there).<br>
<br>
Top was showing a reserve much higher than the current memory
usage. Not surprising if python isn't actually giving memory back
and that's acting as a high water mark.<br>
<br>
Last night, when people were less likely to be around, my boss
attached gdb to our UI master. According to him, it spends quite a
bit of time dealing with lists of 2 million build requests. His
synopsis is that it makes a SQL query that results in 2 million
build requests, does stuff with them, then spends time freeing
them.<br>
<br>
We also spent a bit of time in the http logs. We were certainly
able to spike memory usage by hitting the REST api.<br>
<br>
Which, as it turns out, some of our users do. One of them says
that UI is 'too slow', so she's grabbing the info she needs every
10 minutes. Another has a python script that grabs various
combinations of builds and builders. But he had, at least last
night, 202 processes running that script, because it didn't check
to see if the previous run had finished. And he had it running in
cron.<br>
<br>
That's a bit like having an extra couple hundred users hammering
the UI at once. I haven't exactly explored that side of the code,
but I can envision scenarios that might spike memory to the point
where the system kills it.<br>
<br>
I doubt that any of this is new. It was probably people trying to
work around the limitations of our old, single-master, 0.8.x
installation.<br>
<br>
This morning, after killing those scripts and the cron job, top
looks a lot better. That master is using about 1.3G. Under our old
conditions, it blows past that in the first couple minutes it's
running.<br>
<br>
We'll see how it does the rest of today. I did have to restart
that other master again.<br>
<br>
Neil Gilmore<br>
grammatech.com<br>
<br>
<div class="moz-cite-prefix">On 8/17/2017 10:10 AM, Neil Gilmore
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:4a70aaca-5a2d-76a1-7ef9-d91eec1f0747@grammatech.com">
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8">
Hi everyone.<br>
<br>
This is a sort of extension to my original message.<br>
<br>
I had to restart one of our other masters this morning. It
apparently lost its WAMP connection, which happens on occasion.
The symptoms are that none of its builders show up in the UI.
The master itself was running, though it may depend on the
definition of 'running'. The process was present, but it sat for
minutes after its last log message, which is very unusual for
that master (it's the one that does most of the builds).<br>
<br>
I've also found that the stop command is nearly useless for our
masters. If they've been up for a while, doing things, they
don't ever seem to actually shut down. In this morning's case, I
did the stop command. The log showed the process getting the
SIGTERM. Then I waited for 10 minutes. The process didn't stop.
I finally killed it (and a regular kill won't do it, I have to
-9) because I have people asking me about it constantly when
it's not working right.<br>
<br>
I'll also point out that we've been accumulating integrity
errors in the database again. They look like this in the UI
master's log:<br>
<br>
sqlalchemy.exc.IntegrityError: (IntegrityError) update
or delete on table "changes" violates foreign key constraint
"changes_parent_changeids_fkey" on table "changes"<br>
DETAIL: Key (changeid)=(4188) is still referenced from
table "changes".<br>
'DELETE FROM changes WHERE changes.changeid IN
(%(changeid_1)s, %(changeid_2)s, %(changeid_3)s, %(changeid_4)s,
%(changeid_5)s, %(changeid_6)s, %(changeid_7)s, %(changeid_8)s,
%(changeid_9)s, %(changeid_10)s, %(changeid_11)s,
%(changeid_12)s, %(changeid_13)s, %(changeid_14)s,
%(changeid_15)s, %(changeid_16)s, %(changeid_17)s,
%(changeid_18)s, %(changeid_19)s, %(changeid_20)s,
%(changeid_21)s, %(changeid_22)s, %(changeid_23)s,
%(changeid_24)s, %(changeid_25)s, %(changeid_26)s,
%(changeid_27)s, %(changeid_28)s, %(changeid_29)s,
%(changeid_30)s, %(changeid_31)s, %(changeid_32)s,
%(changeid_33)s, %(changeid_34)s, %(changeid_35)s,
%(changeid_36)s, %(changeid_37)s, %(changeid_38)s,
%(changeid_39)s, %(changeid_40)s, %(changeid_41)s,
%(changeid_42)s, %(changeid_43)s, %(changeid_44)s,
%(changeid_45)s, %(changeid_46)s, %(changeid_47)s,
%(changeid_48)s, %(changeid_49)s, %(changeid_50)s,
%(changeid_51)s, %(changeid_52)s, %(changeid_53)s,
%(changeid_54)s, %(changeid_55)s, %(changeid_56)s,
%(changeid_57)s, %(changeid_58)s, %(changeid_59)s,
%(changeid_60)s, %(changeid_61)s, %(changeid_62)s,
%(changeid_63)s, %(changeid_64)s, %(changeid_65)s,
%(changeid_66)s, %(changeid_67)s, %(changeid_68)s,
%(changeid_69)s, %(changeid_70)s, %(changeid_71)s,
%(changeid_72)s, %(changeid_73)s, %(changeid_74)s,
%(changeid_75)s, %(changeid_76)s, %(changeid_77)s,
%(changeid_78)s, %(changeid_79)s, %(changeid_80)s,
%(changeid_81)s, %(changeid_82)s, %(changeid_83)s,
%(changeid_84)s, %(changeid_85)s, %(changeid_86)s,
%(changeid_87)s, %(changeid_88)s, %(changeid_89)s,
%(changeid_90)s, %(changeid_91)s, %(changeid_92)s,
%(changeid_93)s, %(changeid_94)s, %(changeid_95)s,
%(changeid_96)s, %(changeid_97)s, %(changeid_98)s,
%(changeid_99)s, %(changeid_100)s)' {'changeid_100': 4180,
'changeid_29': 4251, 'changeid_28': 4252, 'changeid_27': 4253,
'changeid_26': 4254, 'changeid_25': 4255, 'changeid_24': 4256,
'changeid_23': 4257, 'changeid_22': 4258, 'changeid_21': 4259,
'changeid_20': 4260, 'changeid_89': 4191, 'changeid_88': 4192,
'changeid_81': 4199, 'changeid_80': 4200, 'changeid_83': 4197,
'changeid_82': 4198, 'changeid_85': 4195, 'changeid_84': 4196,
'changeid_87': 4193, 'changeid_86': 4194, 'changeid_38': 4242,
'changeid_39': 4241, 'changeid_34': 4246, 'changeid_35': 4245,
'changeid_36': 4244, 'changeid_37': 4243, 'changeid_30': 4250,
'changeid_31': 4249, 'changeid_32': 4248, 'changeid_33': 4247,
'changeid_98': 4182, 'changeid_99': 4181, 'changeid_96': 4184,
'changeid_97': 4183, 'changeid_94': 4186, 'changeid_95': 4185,
'changeid_92': 4188, 'changeid_93': 4187, 'changeid_90': 4190,
'changeid_91': 4189, 'changeid_63': 4217, 'changeid_62': 4218,
'changeid_61': 4219, 'changeid_60': 4220, 'changeid_67': 4213,
'changeid_66': 4214, 'changeid_65': 4215, 'changeid_64': 4216,
'changeid_69': 4211, 'changeid_68': 4212, 'changeid_16': 4264,
'changeid_17': 4263, 'changeid_14': 4266, 'changeid_15': 4265,
'changeid_12': 4268, 'changeid_13': 4267, 'changeid_10': 4270,
'changeid_11': 4269, 'changeid_18': 4262, 'changeid_19': 4261,
'changeid_70': 4210, 'changeid_71': 4209, 'changeid_72': 4208,
'changeid_73': 4207, 'changeid_74': 4206, 'changeid_75': 4205,
'changeid_76': 4204, 'changeid_77': 4203, 'changeid_78': 4202,
'changeid_79': 4201, 'changeid_45': 4235, 'changeid_44': 4236,
'changeid_47': 4233, 'changeid_46': 4234, 'changeid_41': 4239,
'changeid_40': 4240, 'changeid_43': 4237, 'changeid_42': 4238,
'changeid_49': 4231, 'changeid_48': 4232, 'changeid_4': 4276,
'changeid_5': 4275, 'changeid_6': 4274, 'changeid_7': 4273,
'changeid_1': 4279, 'changeid_2': 4278, 'changeid_3': 4277,
'changeid_8': 4272, 'changeid_9': 4271, 'changeid_58': 4222,
'changeid_59': 4221, 'changeid_52': 4228, 'changeid_53': 4227,
'changeid_50': 4230, 'changeid_51': 4229, 'changeid_56': 4224,
'changeid_57': 4223, 'changeid_54': 4226, 'changeid_55': 4225}<br>
<br>
We did have the UI master get stopped again for being OOM,
somewhere north of 50G. It's currently at 46.8G, and it has been
running only a couple of days. My plan, when I get time, is to use
some stuff we have in our own product based on guppy to see what's
going on in memory, by attaching gdb to it and using
PyRun_SimpleString() to run the heap-checking stuff. But for the
time being, I can keep restarting things.<br>
<br>
My gut feeling is that I'm going to have to shut everything down
and start with a clean database.<br>
<br>
Neil Gilmore<br>
<a class="moz-txt-link-abbreviated"
href="mailto:raito@raito.com" moz-do-not-send="true">raito@raito.com</a><br>
<br>
<br>
<div class="moz-cite-prefix">On 8/10/2017 3:39 PM, Pierre Tardy
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAJ+soVe7Z980tGD5XVJKABGqaD-62j=BPUhds06eE+wtF0C9LQ@mail.gmail.com">
<div dir="ltr">Mmh.. sounds like you found a memory leak in
the UI.
<div><br>
<div>Several points to check that could be out of control
for the UI:</div>
<div>Messages could be queuing up over hung (or badly
closed) websockets.</div>
<div>REST query data could be keeping references around in
some way, which would make them just accumulate in memory.</div>
<div>Somebody could be doing huge REST API queries. It is
unfortunately quite easy to just hit /api/v2/builds and
dump the whole build db.</div>
<div>This wouldn't go all the way to 45GB IMO.</div>
<div><br>
</div>
<div>During my perf tests, I saw that the logs system is quite
memory hungry, and that buildbot trades memory for CPU if the
master can't keep up with recording all the logs in the
database (nothing fancy: the twisted queues will just keep
growing while waiting for callbacks to be called). </div>
<div>Due to several rounds of encoding and decoding, the memory
usage in the log queue could be 2-4 times the amount of
actual log bytes processed. Anyway, this is probably not
your problem.</div>
<div><br>
</div>
<div>You can look at <a
href="https://github.com/tardyp/buildbot_profiler"
moz-do-not-send="true">https://github.com/tardyp/buildbot_profiler</a> and
maybe improve it to add a memory profiler.</div>
<div><br>
</div>
<div><br>
</div>
<div>Pierre</div>
<div> </div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr">Le jeu. 10 août 2017 à 21:52, Neil Gilmore
<<a href="mailto:ngilmore@grammatech.com"
moz-do-not-send="true">ngilmore@grammatech.com</a>> a
écrit :<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all,<br>
<br>
A couple of times in the last couple weeks, the master
running our UI<br>
has crashed/exited/ceased to run. This most recent time it
was a<br>
MemoryError, backed up by the kernel log saying it was
OOM. Current run<br>
shows it using 43.1G memory. Seems like a lot, doesn't it?
The other 2<br>
larger masters (the ones actually running builds) are
about 2G, and the<br>
remaining, smallest one is about 1.7G.<br>
<br>
What does the UI hold onto that makes it so large?<br>
<br>
I'm supposed to get some heap profiling thingy wedged into
buildbot so I<br>
can try to figure it out. My current plan is to introduce
a new build<br>
step (like our current custom ones) that triggers the
profiling process,<br>
create a sort of dummy build using the step, and create a
force<br>
scheduler to trigger that build. That'll keep it on our UI
master<br>
anyway. Is there a better way to trigger such code in the
master at<br>
arbitrary times?<br>
<br>
I'm also starting to get complaints (and I've noticed this
myself), that<br>
sometimes queued builds take a long time to start after
the previous<br>
build finishes. Sometimes on the order of a half-hour or
more. Or at<br>
least it appears so. I'm going on what the UI tells me. I
haven't tried<br>
delving into the logs of the master controlling the builds
to match up<br>
times. Any ideas? I'm afraid I'm not up on exactly
how/when builds are<br>
supposed to get started.<br>
<br>
And our database seems to be accumulating integrity errors
again.<br>
<br>
Neil Gilmore<br>
<a href="mailto:raito@raito.com" target="_blank"
moz-do-not-send="true">raito@raito.com</a><br>
</blockquote>
</div>
</blockquote>
<br>
<br>
</blockquote>
<br>
<br>
</blockquote>
<br>
</body>
</html>