[users at bb.net] And even more multi-master anecdotes.

Dave Vitek dvitek at grammatech.com
Fri Aug 18 14:50:30 UTC 2017


Hi all,

I think the API requests that Neil mentions may have been a red 
herring.  He credits the elimination of the API requests with the 
reduced memory consumption, but what actually did the trick was deleting 
95% of the buildrequests from the database.

I've been digging into the UI master's memory use a bit more using a 
debug build of python, a heap profiler (tcmalloc), a python-layer heap 
profiler (guppy/heapy), manhole, and gdb. Python is built 
--without-pymalloc so that the heap profiler will be useful.
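
The Python-level snapshots come down to something like the following, 
pasted into the manhole session (a minimal sketch; heapy only accounts 
for objects it can reach, which may be relevant below):

    from guppy import hpy

    hp = hpy()
    hp.setrelheap()        # optional: measure growth relative to this point
    snapshot = hp.heap()   # reachable Python objects, summarized by type
    print(snapshot)
    print(snapshot.byrcs)  # the same data, classified by referrers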

I don't think there are outright leaks.  We see the heap payload yo-yo 
between 1 and 10 GB frequently once things get warmed up.  The important 
thing is that it generally goes back down.  Fragmentation is certainly 
getting worse over time.

A heap profile taken when 10 GB were in use reveals the following:

dict     - 6 GB
str      - 2 GB
list     - 1 GB
datetime - 300 MB
unicode  - 300 MB
...


That's a lot of datetime objects.

guppy/heapy seems to not be detecting nearly as many objects as the 
malloc-level profiler.  It is unclear whether this is because we are 
running it at the wrong time or it just doesn't find all the objects.  
So I'm unable to describe a random sampling of the values of the objects 
responsible for using the space.

Using gdb to break into the process, what I've noticed is that the 
process is usually spending its time freeing lists of 2 million BrDict 
objects or freeing postgres result set objects, also of length 2 
million.  In my opinion, the UI master has no business having 2 million 
build requests in memory at the same time.

2 million is the number of build requests in the database:

select seq_tup_read + idx_tup_fetch as tup_fetch, relname, n_live_tup,
       seq_scan, seq_tup_read, idx_tup_fetch
from pg_stat_user_tables
order by seq_tup_read + idx_tup_fetch desc;

  tup_fetch   |        relname        | n_live_tup | seq_scan  | seq_tup_read | idx_tup_fetch
--------------+-----------------------+------------+-----------+--------------+---------------
 326697132142 | logchunks             |   16936743 |         3 |      7038305 |  326690093837
 129499345714 | buildrequests         |    2128108 |     48979 |  24905174984 |  104594170730
  17425488144 | tags                  |         55 | 300167451 |  16087109106 |    1338379038
  16679707813 | buildrequest_claims   |    1578498 |    272420 |  16207663053 |     472044760
   6279416129 | buildset_sourcestamps |     174852 |    519155 |   1035926078 |    5243490051
   6199684579 | buildsets             |     174852 |    175191 |   2851064037 |    3348620542
   3992967017 | builds                |     300202 |     48125 |   3328395644 |     664571373
   2505110400 | build_properties      |    3283964 |         1 |            0 |    2505110400
   1849345467 | builders              |        668 |   1149411 |    671123174 |    1178222293
   1451803528 | changes               |       4500 |    709374 |   1426480341 |      25323187
   1383266706 | sourcestamps          |       4502 |    301851 |    111653587 |    1271613119

This table is sorted by approximately the amount of reading performed 
against each table.  n_live_tup is the relation's cardinality.

Only about 1000 of those 2 million build requests are incomplete:
# select count(*) from buildrequests where complete=0;
  count
-------
   1088
(1 row)


I'm thinking the UI master is repeatedly fetching all buildrequests 
going back to the beginning of time.  Based on the timestamps in the 
buildrequests table in the database, these rows never get deleted.  
The asynchronous nature of Twisted makes it difficult to see which URLs 
are causing this work, since all I get are call stacks at the point where 
the results are being freed.  This code in buildrequests.py is sometimes 
at the end of the stack:
            if resultSpec is not None:
                return resultSpec.thd_execute(
                    conn, q,
                    lambda r: self._brdictFromRow(r, self.db.master.masterid))


This method is also popular:

class Db2DataMixin(object):
    def db2data(self, dbdict):
        data = {
            'buildrequestid': dbdict['buildrequestid'],
            'buildsetid': dbdict['buildsetid'],
            'builderid': dbdict['builderid'],
            'priority': dbdict['priority'],
            'claimed': dbdict['claimed'],
            'claimed_at': dbdict['claimed_at'],
            'claimed_by_masterid': dbdict['claimed_by_masterid'],
            'complete': dbdict['complete'],
            'results': dbdict['results'],
            'submitted_at': dbdict['submitted_at'],
            'complete_at': dbdict['complete_at'],
            'waited_for': dbdict['waited_for'],
        }
        return defer.succeed(data)

Above that it's mostly generic Twisted functions with names like "run".

I've also noticed that the /api/v2/builds URL produces a 100 MB response 
and takes about 12 minutes and several GB of memory to do so.  I don't 
believe it is responsible, though, because the problem continued to 
happen after those requests were discontinued.
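
For what it's worth, users who really need this data could ask for a 
bounded slice instead of dumping the whole table.  A minimal sketch, 
assuming the collection endpoints honor the limit/order query parameters 
that the resultSpec machinery suggests (the base URL here is made up):

    import requests

    BASE = "http://buildbot.example.com/api/v2"   # hypothetical URL

    # Ask for the 200 most recent builds instead of dumping everything.
    resp = requests.get(BASE + "/builds",
                        params={"limit": 200, "order": "-buildid"})
    resp.raise_for_status()
    builds = resp.json()["builds"]
    print("fetched %d builds" % len(builds))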

I have added ON DELETE CASCADE to many of the foreign key references in 
the database, and deleted all but the most recent 100,000 buildrequests 
(which in turn cleans out the corresponding builds and logs).  The 
performance issue in the UI master seems to be worked around, and I guess 
I'm happy enough to do this for the time being.
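
For anyone wanting to do something similar, here is a rough sketch of 
that kind of cleanup.  The constraint name and DSN below are illustrative 
only (read the real constraint names out of psql with \d builds and 
friends first), and similar cascades need to exist down the 
builds/steps/logs chain for the delete to take everything with it:

    import psycopg2

    conn = psycopg2.connect("dbname=buildbot")   # hypothetical DSN
    cur = conn.cursor()

    # Recreate the builds -> buildrequests FK with ON DELETE CASCADE.
    # "builds_buildrequestid_fkey" is a guess at the generated name.
    cur.execute("ALTER TABLE builds DROP CONSTRAINT builds_buildrequestid_fkey")
    cur.execute("""
        ALTER TABLE builds
          ADD CONSTRAINT builds_buildrequestid_fkey
          FOREIGN KEY (buildrequestid) REFERENCES buildrequests (id)
          ON DELETE CASCADE
    """)

    # Keep only the newest 100,000 build requests; the cascades remove
    # the dependent rows.
    cur.execute("""
        DELETE FROM buildrequests
         WHERE id NOT IN (SELECT id FROM buildrequests
                           ORDER BY id DESC LIMIT 100000)
    """)
    conn.commit()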

However, there is another bottleneck.  It is frequently the case that a 
buildrequest gets put in and then sits for ~6 hours before a worker is 
asked to start it, even though the configuration says that delay should 
be about 10 minutes.  The 0.9 web UI doesn't show timestamps for when 
builds start/finish, so it's hard to say exactly what the timeline is.  
I'm pretty sure the master is at fault for not dispatching these things 
promptly.  Any ideas on how to debug this?  Could the huge number of 
build requests in the DB also be responsible for this?  postgres itself 
isn't breaking a sweat.
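
The database does have the timestamps the UI lacks, so the submit-to-claim 
latency can at least be measured directly.  A sketch of that kind of 
query (column names here are assumptions based on the 0.9 schema):

    import psycopg2

    conn = psycopg2.connect("dbname=buildbot")   # hypothetical DSN
    cur = conn.cursor()
    # submitted_at/claimed_at are epoch integers, so the difference is seconds.
    cur.execute("""
        SELECT r.id, (c.claimed_at - r.submitted_at) AS wait_seconds
          FROM buildrequest_claims c
          JOIN buildrequests r ON r.id = c.brid
         ORDER BY wait_seconds DESC
         LIMIT 20
    """)
    for brid, wait in cur.fetchall():
        print("buildrequest %d waited %d seconds to be claimed" % (brid, wait))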

I noticed one of the non-UI masters was fully utilizing a CPU for many 
hours last night.  Unfortunately, it was running under a stripped python 
binary, so I have limited insight into what it was doing.  It looks like 
it was running a regex substitution:
#0  0x000000000054d6a3 in sre_charset.67717 ()
#1  0x000000000045207c in sre_search.67817 ()
#2  0x000000000047f714 in pattern_subx ()
#3  0x000000000047fdbd in pattern_sub ()
#4  0x0000000000536f94 in PyEval_EvalFrameEx ()


My hunch is that this happens whenever someone makes a huge commit (many 
changed files), like a branch merge.  Do you folks have any intuition 
about which regex this might be?  This problem seems to come and go 
depending on the commits of the day.  I do not know if this is related 
to the slow processing of build requests.
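
One way to catch the culprit would be to wrap the module-level re.sub 
from a manhole session so that slow substitutions get logged with a 
stack.  A rough sketch (it only catches calls that go through re.sub, 
not .sub() on precompiled patterns, so it's a partial net):

    import re
    import time
    import traceback

    _orig_sub = re.sub

    def _timed_sub(pattern, repl, string, count=0, flags=0):
        start = time.time()
        try:
            return _orig_sub(pattern, repl, string, count, flags)
        finally:
            elapsed = time.time() - start
            if elapsed > 1.0:
                print("slow re.sub (%.1fs): %r" % (elapsed, pattern))
                traceback.print_stack()

    re.sub = _timed_sub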

- Dave

On 8/18/2017 10:41 AM, Neil Gilmore wrote:
> Hi again,
>
> This morning's update, after spending some quality time trying to 
> figure out just what's going on...
>
> We stuffed in some of the google perftools stuff to try to see where 
> memory was going. That didn't tell us much, other than there's a bunch 
> of data in dictionaries (big surprise there).
>
> Top was showing a reserve much higher than the current memory usage. 
> Not surprising if python isn't actually giving memory back and that's 
> acting as a high water mark.
>
> Last night, when people were less likely to be around, my boss 
> attached gdb to our UI master. According to him, it spends quite a bit 
> of time dealing with lists of 2 million build requests. His synopsis is 
> that it makes a sql query that results in 2 million build requests, 
> does stuff with them, then spends time freeing them.
>
> We also spent a bit of time in the http logs. We were certainly able 
> to spike memory usage by hitting the REST api.
>
> Which, as it turns out, some of our users do. One of them says that UI 
> is 'too slow', so she's grabbing the info she needs every 10 minutes. 
> Another has a python script that grabs various combinations of builds 
> and builders. But he had, at least last night, 202 processes running 
> that script, because it didn't check to see if the previous run had 
> finished. And he had it running in cron.
>
> That's a bit like having an extra couple hundred users hammering the 
> UI at once. I haven't exactly explored that side of the code, but I 
> can envision scenarios that might spike memory to the point where the 
> system kills it.
>
> I doubt that any of this is new. It was probably people trying to work 
> around the limitations of our old, single-master, 0.8.x installation.
>
> This morning, after killing those scripts and the cron job, top looks 
> a lot better. That master is using about 1.3G. Under our old 
> conditions, it blows past that in the first couple minutes it's running.
>
> We'll see how it does the rest of today. I did have to restart that 
> other master again.
>
> Neil Gilmore
> grammatech.com
>
> On 8/17/2017 10:10 AM, Neil Gilmore wrote:
>> Hi everyone.
>>
>> This is a sort of extension to my original message.
>>
>> I had to restart one of our other masters this morning. It apparently 
>> lost its WAMP connection, which happens on occasion. The symptoms are 
>> that none of its builders show up in the UI. The master itself was 
>> running, though it may depend on the definition of 'running'. The 
>> process was present, but it sat for minutes after its last log 
>> message, which is very unusual for that master (it's the one that 
>> does most of the builds).
>>
>> I've also found that the stop command is nearly useless for our 
>> masters. If they've been up for a while, doing things, they don't 
>> ever seem to actually shut down. In this morning's case, I did the 
>> stop command. The log showed the process getting the SIGTERM. Then I 
>> waited for 10 minutes. The process didn't stop. I finally killed it 
>> (and a regular kill won't do it, I have to -9) because I have people 
>> asking me about it constantly when it's not working right.
>>
>> I'll also point out that we've been accumulating integrity errors in 
>> the database again. They look like this in the UI master's log:
>>
>>         sqlalchemy.exc.IntegrityError: (IntegrityError) update or 
>> delete on table "changes" violates foreign key constraint 
>> "changes_parent_changeids_fkey" on table "changes"
>>         DETAIL:  Key (changeid)=(4188) is still referenced from table 
>> "changes".
>>          'DELETE FROM changes WHERE changes.changeid IN 
>> (%(changeid_1)s, %(changeid_2)s, %(changeid_3)s, %(changeid_4)s, 
>> %(changeid_5)s, %(changeid_6)s, %(changeid_7)s, %(changeid_8)s, 
>> %(changeid_9)s, %(changeid_10)s, %(changeid_11)s, %(changeid_12)s, 
>> %(changeid_13)s, %(changeid_14)s, %(changeid_15)s, %(changeid_16)s, 
>> %(changeid_17)s, %(changeid_18)s, %(changeid_19)s, %(changeid_20)s, 
>> %(changeid_21)s, %(changeid_22)s, %(changeid_23)s, %(changeid_24)s, 
>> %(changeid_25)s, %(changeid_26)s, %(changeid_27)s, %(changeid_28)s, 
>> %(changeid_29)s, %(changeid_30)s, %(changeid_31)s, %(changeid_32)s, 
>> %(changeid_33)s, %(changeid_34)s, %(changeid_35)s, %(changeid_36)s, 
>> %(changeid_37)s, %(changeid_38)s, %(changeid_39)s, %(changeid_40)s, 
>> %(changeid_41)s, %(changeid_42)s, %(changeid_43)s, %(changeid_44)s, 
>> %(changeid_45)s, %(changeid_46)s, %(changeid_47)s, %(changeid_48)s, 
>> %(changeid_49)s, %(changeid_50)s, %(changeid_51)s, %(changeid_52)s, 
>> %(changeid_53)s, %(changeid_54)s, %(changeid_55)s, %(changeid_56)s, 
>> %(changeid_57)s, %(changeid_58)s, %(changeid_59)s, %(changeid_60)s, 
>> %(changeid_61)s, %(changeid_62)s, %(changeid_63)s, %(changeid_64)s, 
>> %(changeid_65)s, %(changeid_66)s, %(changeid_67)s, %(changeid_68)s, 
>> %(changeid_69)s, %(changeid_70)s, %(changeid_71)s, %(changeid_72)s, 
>> %(changeid_73)s, %(changeid_74)s, %(changeid_75)s, %(changeid_76)s, 
>> %(changeid_77)s, %(changeid_78)s, %(changeid_79)s, %(changeid_80)s, 
>> %(changeid_81)s, %(changeid_82)s, %(changeid_83)s, %(changeid_84)s, 
>> %(changeid_85)s, %(changeid_86)s, %(changeid_87)s, %(changeid_88)s, 
>> %(changeid_89)s, %(changeid_90)s, %(changeid_91)s, %(changeid_92)s, 
>> %(changeid_93)s, %(changeid_94)s, %(changeid_95)s, %(changeid_96)s, 
>> %(changeid_97)s, %(changeid_98)s, %(changeid_99)s, %(changeid_100)s)' 
>> {'changeid_100': 4180, 'changeid_29': 4251, 'changeid_28': 4252, 
>> 'changeid_27': 4253, 'changeid_26': 4254, 'changeid_25': 4255, 
>> 'changeid_24': 4256, 'changeid_23': 4257, 'changeid_22': 4258, 
>> 'changeid_21': 4259, 'changeid_20': 4260, 'changeid_89': 4191, 
>> 'changeid_88': 4192, 'changeid_81': 4199, 'changeid_80': 4200, 
>> 'changeid_83': 4197, 'changeid_82': 4198, 'changeid_85': 4195, 
>> 'changeid_84': 4196, 'changeid_87': 4193, 'changeid_86': 4194, 
>> 'changeid_38': 4242, 'changeid_39': 4241, 'changeid_34': 4246, 
>> 'changeid_35': 4245, 'changeid_36': 4244, 'changeid_37': 4243, 
>> 'changeid_30': 4250, 'changeid_31': 4249, 'changeid_32': 4248, 
>> 'changeid_33': 4247, 'changeid_98': 4182, 'changeid_99': 4181, 
>> 'changeid_96': 4184, 'changeid_97': 4183, 'changeid_94': 4186, 
>> 'changeid_95': 4185, 'changeid_92': 4188, 'changeid_93': 4187, 
>> 'changeid_90': 4190, 'changeid_91': 4189, 'changeid_63': 4217, 
>> 'changeid_62': 4218, 'changeid_61': 4219, 'changeid_60': 4220, 
>> 'changeid_67': 4213, 'changeid_66': 4214, 'changeid_65': 4215, 
>> 'changeid_64': 4216, 'changeid_69': 4211, 'changeid_68': 4212, 
>> 'changeid_16': 4264, 'changeid_17': 4263, 'changeid_14': 4266, 
>> 'changeid_15': 4265, 'changeid_12': 4268, 'changeid_13': 4267, 
>> 'changeid_10': 4270, 'changeid_11': 4269, 'changeid_18': 4262, 
>> 'changeid_19': 4261, 'changeid_70': 4210, 'changeid_71': 4209, 
>> 'changeid_72': 4208, 'changeid_73': 4207, 'changeid_74': 4206, 
>> 'changeid_75': 4205, 'changeid_76': 4204, 'changeid_77': 4203, 
>> 'changeid_78': 4202, 'changeid_79': 4201, 'changeid_45': 4235, 
>> 'changeid_44': 4236, 'changeid_47': 4233, 'changeid_46': 4234, 
>> 'changeid_41': 4239, 'changeid_40': 4240, 'changeid_43': 4237, 
>> 'changeid_42': 4238, 'changeid_49': 4231, 'changeid_48': 4232, 
>> 'changeid_4': 4276, 'changeid_5': 4275, 'changeid_6': 4274, 
>> 'changeid_7': 4273, 'changeid_1': 4279, 'changeid_2': 4278, 
>> 'changeid_3': 4277, 'changeid_8': 4272, 'changeid_9': 4271, 
>> 'changeid_58': 4222, 'changeid_59': 4221, 'changeid_52': 4228, 
>> 'changeid_53': 4227, 'changeid_50': 4230, 'changeid_51': 4229, 
>> 'changeid_56': 4224, 'changeid_57': 4223, 'changeid_54': 4226, 
>> 'changeid_55': 4225}
>>
>> We did have the UI master get stopped again for being OOM, somewhere 
>> north of 50G. It's currently at 46.8G, and it's been running only a 
>> couple of days. My plan, when I get time, is to use some stuff we have 
>> in our own product based on guppy to see what's going on in memory, by 
>> attaching gdb to it and using PyRun_SimpleString() to run the heap 
>> checking stuff. But for the time being, I can keep restarting things.
>>
>> My gut feeling is that I'm going to have to shut everything down and 
>> start with a clean database.
>>
>> Neil Gilmore
>> raito at raito.com
>>
>>
>> On 8/10/2017 3:39 PM, Pierre Tardy wrote:
>>> Mmh.. sounds like you found a memory leak in the UI.
>>>
>>> Several points to check that could be out of control for the UI:
>>> - Messages could be queuing up over hung (or badly closed) websockets.
>>> - REST query data could be keeping references somewhere, which would 
>>> make them just accumulate in memory.
>>> - Somebody is doing huge REST API queries. It is unfortunately quite 
>>> easy to just hit /api/v2/builds and dump the whole build db.
>>> This wouldn't go all the way to 45GB IMO.
>>>
>>> During my perf tests, I saw that the logs system is quite memory 
>>> hungry, and that buildbot trades memory for cpu if the master can't 
>>> keep up with recording all the logs in the database (nothing fancy; 
>>> the twisted queues just keep growing while waiting for callbacks to 
>>> be called).
>>> Due to several rounds of encoding and decoding, the memory usage in 
>>> the log queue could be 2-4 times the amount of actual log bytes 
>>> processed. Anyway, this is probably not your problem.
>>>
>>> You can look at https://github.com/tardyp/buildbot_profiler and maybe 
>>> improve it to add a memory profiler.
>>>
>>>
>>> Pierre
>>>
>>> On Thu, Aug 10, 2017 at 21:52, Neil Gilmore <ngilmore at grammatech.com 
>>> <mailto:ngilmore at grammatech.com>> wrote:
>>>
>>>     Hi all,
>>>
>>>     A couple of times in the last couple weeks, the master running
>>>     our UI
>>>     has crashed/exited/ceased to run. This most recent time it was a
>>>     MemoryError, backed up by the kernel log saying it was OOM.
>>>     Current run
>>>     shows it using 43.1G memory. Seems like a lot, doesn't it? The
>>>     other 2
>>>     larger masters (the ones actually running builds) are about 2G,
>>>     and the
>>>     remaining, smallest one is about 1.7G.
>>>
>>>     What does the UI hold onto that makes it so large?
>>>
>>>     I'm supposed to get some heap profiling thingy wedged into
>>>     buildbot so I
>>>     can try to figure it out. My current plan is to introduce a new
>>>     build
>>>     step (like our current custom ones) that triggers the profiling
>>>     process,
>>>     create a sort of dummy build using the step, and create a force
>>>     scheduler to trigger that build. That'll keep it on our UI master
>>>     anyway. Is there a better way to trigger such code in the master at
>>>     arbitrary times?
>>>
>>>     I'm also starting to get complaints (and I've noticed this
>>>     myself), that
>>>     sometimes queued builds take a long time to start after the previous
>>>     build finishes. Sometimes on the order of a half-hour or more. Or at
>>>     least it appears so. I'm going on what the UI tells me. I
>>>     haven't tried
>>>     delving into the logs of the master controlling the builds to
>>>     match up
>>>     times. Any ideas? I'm afraid I'm not up on exactly how/when
>>>     builds are
>>>     supposed to get started.
>>>
>>>     And our database seems to be accumulating integrity errors again.
>>>
>>>     Neil Gilmore
>>>     raito at raito.com <mailto:raito at raito.com>
>>>
>>
>>
>>
>
>
>


