[users at bb.net] Performance issues in 2.10

Yngve N. Pettersen yngve at vivaldi.com
Thu Jan 14 15:38:18 UTC 2021


On Thu, 14 Jan 2021 15:21:24 +0100, Pierre Tardy <tardyp at gmail.com> wrote:

> In the case of the UI customization, this would be hard for me to reason
> without the exact actual code.
> Maybe, indeed, calling getChanges for builds for 1000 builds is not a  
> good
> idea. For example in the grid view we couldn't do this for performance
> reason, and thus, we have a more global algorithm that is running inside
> the browser.
> onNew happens for all new builds that appear, regardless of any paging.

It might perhaps be an idea to limit that somehow?

> Is this really custom to your company? maybe it would make sense to

It might not be just custom to us.

> upstream the code and usecase in order to discuss the best implementation
> path together.

Maybe, but given what I currently suspect, I suspect our current  
implementation won't work very well, and will probably require a rewrite  
 from scratch, probably with some extra support in the framework.

So far we have been experimenting a bit with this to get the changelog  
texts displayed. It felt ready for a first deployment this week, although  
it was not feature complete. We were still investigating how to implement  
some of the extras, like showing multiple commit messages in a pop-up.

For reference, we have a few other customizations in some of the views:

   * grouping our builders into categories in the builders view and sort  
them inside the group
   * Replace the buildnumber (1001) with the builds actual version number  
(1.2.3.4 ; in some cases with a suffix name); this actually has some  
refresh issues when the state changes; need to reload/redraw the page to  
get the build version number displayed.  
<https://github.com/buildbot/buildbot/issues/4516>
   * We display branch and revision info in the builds table.  
<https://github.com/buildbot/buildbot/issues/4516>
   * We show the date when the build completed, instead of when it started  
(but when building, show the started time)  
<https://github.com/buildbot/buildbot/issues/4516>
   * We expanded the number of builds displayed on the Home page to 100  
(from 20; with ~30 main builders 20 is way too small)  
<https://github.com/buildbot/buildbot/issues/4515>

   * In the changelog prototype we also rearranged the columns

Most of this is really something that I'd like to somehow configure via  
settings, rather than customizing the www-base component (less code  
maintenance).

> We had a few contributions this week e.g.
> https://github.com/buildbot/buildbot/pull/5747 which go in the direction  
> of
> more information in the builder page with mouse-over.
> Maybe you can go this path in order to get your sourcestamp information.
>
>
> Another issue that could happen for your problem is a 100% cpu in the
> browser! That would explain a freeze of one tab and work of other tab.

Possible (didn't check that), although my dev machine is actually one of  
the most powerful dev machines we have.

> If the get of the changes is done in onNew, indeed for 1000 builds, this
> will probably be done in parallel.
> In that case if you have http only then you will get serialized by  
> browser
> to 8 request in //, but if you have http2, then you will have unlimited
> parallelisation.
> In this case those 1000 request can indeed pile-up, and starve the
> threadpool.

Sounds plausible.

> If you have a largeVM, and want to make the current code work, maybe you
> can raise the db poolsize
> https://github.com/tardyp/buildbot/blob/9f1cac1d3bb61baa0b6c836cc18812a64cfa9c2b/master/buildbot/db/enginestrategy.py#L273
> hence, you can configure the db_url like this:
>
> "driver://[username:password@]host:port/database?pool_size=20"

Now that I probably know the cause of the issue, I'd rather have the data  
retrieval happen when the build item triggers an "onVisible" event; I was  
already a bit irritated about the random display sequence for the  
changelog texts.

It would be much better to just retrieve the data we are going to display,  
but at least AFAICT currently there isn't support for something like that  
(but then I am primarily a C++ dev, not a JS dev).

>
> Regards
> Pierre
>
>
> Le jeu. 14 janv. 2021 à 13:22, Yngve N. Pettersen <yngve at vivaldi.com> a
> écrit :
>
>> Hi,
>>
>> Hmmm, there is a bit of a difference between a 2ms sampling interval and
>> a
>> 500 ms one :P .
>>
>> There isn't that much load on the machine (we actually expanded CPU  
>> count
>> threefold, and more than doubled RAM trying to eliminate those potential
>> issues; I am not aware of any significant load on the VM-host), and I
>> have
>> seen this freeze happen while there was no build activity.
>>
>> This VM is just running the buildbot and the postgresql DB; There should
>> be no disk speed issues.
>>
>> What I have noticed in the netdata graphs is that the Pressure Stall
>> Information graph
>> <https://www.kernel.org/doc/html/latest/accounting/psi.html> also
>> spikes,
>> mostly in the some10 line, no idea why or how bad the stall really is.
>>
>> I have had one colleague report that pasting the page URL into a new tab
>> had the page load quickly, while the original page was still blank. I
>> have
>> not tried that yet, but I have tried to load two different pages, and
>> both
>> remained blank.
>>
>>
>> *However*, during the last case of this, I was checking the inspector
>> (planning to check network traffic) and noticed that there were console
>> reports from code we had added on the page I was *leaving*. It seems  
>> that
>> this code continued running after I navigated away from the page, to the
>> previous page in the history.
>>
>> The modified code is used to load and display changelog entries for each
>> build displayed on the builder page; this was developed in response to a
>> feature request from our CEO, and this week was the first time it was
>> deployed in production.
>>
>> The current version of that code triggers a changelog retrieval for each
>> build in the list in the $scope.builds.onNew function in
>> www/base/src/app/builders/builder/builder.controller.js.
>>
>> If this code is indeed involved in the freezes, then it appears that the
>> code continues to run after the page is no longer the current page. IMO
>> those calls should have been unsubscribed when the navigation started.
>>
>> Further (if this is indeed what caused the issue):
>>
>>    * it seems like the onNew calls for the changes entries got priority
>> above what was going on on the new page
>>
>>    * It also seems that the builds onNew call is triggered for builds
>> that
>> are not yet visible on the page, and perhaps are also never going to be
>> displayed since they beyond the limit of what is to be displayed unless
>> the user request more. If this triggers loading of 1000 builds'  
>> changelog
>> entries .....
>>
>> As far as the changelog info is concerned, we don't actually need it
>> until
>> the build info is rendered on the page, so an "onDisplay" call would be
>> better for our purposes.
>>
>>
>>
>> I have just backed those changes out of our deployed custom www-base
>> module, so that we can see if that was involved in our issue.
>>
>> BTW, I am noticing some problems with pip and this custom module. It
>> seems
>> that the requirement declaration is set such that only the 2.10.0  
>> package
>> can be loaded; this caused issues since the default version for building
>> the module is 2.10.1-devX; I had to force the version to 2.10.0 to be
>> able
>> to load it, and that is causing warnings from pip, and problems
>> installing
>> an updated module. AFAICT this started becoming a problem with recent
>> versions of Pip. (Note: I am not interested in building the whole
>> buildbot
>> package bundle; in fact, I'd prefer to register our customization in a
>> different manner)
>>
>>
>> On Thu, 14 Jan 2021 10:07:32 +0100, Pierre Tardy <tardyp at gmail.com>  
>> wrote:
>>
>> > (Back to mailing list)
>> >
>> > Thanks for this data. Here is how you would read it:
>> >
>> >
>> > [image: image.png]
>> > In the left, you get the main thread activity. You can say that  
>> because
>> > the
>> > bottom frame(s) are about startReactor.
>> > In the right, you get what happens in the threads. You can say that
>> > because
>> > we see threading.py in the last bottom frame.
>> >
>> > You can see that the reactor thread does pretty much nothing:
>> > [image: image.png]
>> > doPoll is just the reactor waiting for events (that's 1 third of the
>> > samples)
>> > then the left part is about getting sourcestams, and waiting the
>> > dbthread.
>> > then the right part is log management, which is usually the major part
>> > of a
>> > buildbot load.
>> >
>> > On the thread side, this is pretty similar picture. We can see that
>> most
>> > of
>> > the load is managing source stamps.
>> > [image: image.png]
>> > So frankly, I wouldn't say this is an obvious profile of a blocked
>> > buildbot
>> > master.
>> >
>> > The reactor does pretty much nothing. Reading 300 sourcestamps  
>> requests
>> > by
>> > second from the db shouldn't be that much of a problem with modern HW.
>> > We can see that the profiler itself has difficulties to run in time,  
>> as
>> > we
>> > see some profiler stacks.
>> > BTW, I gave wrong advice last time, I told you to put 500HZ frequency
>> > while
>> > I meant 500ms timer. we should rather put a 2HZ to have 2 sample per
>> > second
>> > and per thread. This may avoid to disrupt too much the measurment
>> > (Heisenberg principle).
>> >
>> > So I thought about memory or disk starvation. Do you have any other
>> load
>> > in
>> > the master that would starve CPU ressources?
>> > I can see that there is 28GB memory free so no memory starvation. but  
>> the
>> > profiler do not output the loadaverage.
>> > Do we have db connection pool or threadpool starvation?
>> > Do we have disk io or network io starvation?
>> >
>> > Regards
>> > Pierre
>> >
>> >
>> > Le jeu. 14 janv. 2021 à 00:09, Yngve N. Pettersen <yngve at vivaldi.com>  
>> a
>> > écrit :
>> >
>> >> Hi,
>> >>
>> >> I finally got a trace of the problem area.
>> >>
>> >> I wasn't able to make sense of the profile, maybe you can? zip  
>> attached
>> >>
>> >> The area of interest is the ~400 second segment towards the end,  
>> which
>> >> matches the bump in the PNG.
>> >>
>> >> This happened while reloading the Build>Builders page after loading a
>> >> build and then returning to the previous point in the history.  
>> Loading
>> >> of
>> >> the page took about 5 minutes of the 7 minutes of high load.
>> >>
>> >> The Python process hit ~120% CPU
>> >>
>> >>
>> >> On Wed, 13 Jan 2021 18:35:26 +0100, Pierre Tardy <tardyp at gmail.com>
>> >> wrote:
>> >>
>> >> >> Regarding the profiler, I also had some problems running it for a
>> >> longer
>> >> > time, so if you find a fix, please share it with me.
>> >> > It was a bytes versus string issue indeed in the python parts.
>> >> >
>> >> > I did struggle to build the project again as expected, but managed  
>> to
>> >> > hack
>> >> > something eventually.
>> >> >
>> >> > see https://pypi.org/project/buildbot-profiler/1.3.1/
>> >> >
>> >> > I did verify the profile duration is now working.
>> >> >
>> >> > Regards
>> >> > Pierre
>> >> >
>> >> >
>> >> > Le mer. 13 janv. 2021 à 12:49, Vlad Bogolin <vlad at mariadb.org> a
>> >> écrit :
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> So the changes issue I was referring to doesn't seem to be fixed  
>> (I
>> >> >> checked the latest Buildbot code) and looked back over my changes.
>> >> The
>> >> >> problem that I identified is here
>> >> >>
>> >>
>> https://github.com/buildbot/buildbot/blob/9f1cac1d3bb61baa0b6c836cc18812a64cfa9c2b/master/buildbot/data/resultspec.py#L320
>> >> >> because is some cases there would be an unmatched_filter or
>> >> >> unmatched_order. *So if this is indeed the case, you should see  
>> the
>> >> >> warning in the buildbot log file:*
>> >> >>
>> >> >> "Warning: limited data api query is not backed by db because of
>> >> >> following
>> >> >> filters..." as defined here
>> >> >>
>> >>
>> https://github.com/buildbot/buildbot/blob/9f1cac1d3bb61baa0b6c836cc18812a64cfa9c2b/master/buildbot/data/resultspec.py#L322
>> >> >>
>> >> >> There are two reasons why this happens. One, an incomplete
>> >> definition of
>> >> >> the fieldMapping for the data/changes.py and this would be the  
>> fix:
>> >> >>
>> >> >> +++ b/master/buildbot/data/changes.py
>> >> >> @@ -42,6 +42,19 @@ class FixerMixin:
>> >> >>          return change
>> >> >>      fieldMapping = {
>> >> >>          'changeid': 'changes.id',
>> >> >> +        'author': 'changes.author',
>> >> >> +        'committer': 'changes.committer',
>> >> >> +        'comments': 'changes.comments',
>> >> >> +        'branch': 'changes.branch',
>> >> >> +        'revision': 'changes.revision',
>> >> >> +        'revlink': 'changes.revlink',
>> >> >> +        'when_timestamp': 'changes.when_timestamp',
>> >> >> +        'category': 'changes.category',
>> >> >> +        'repository': 'changes.repository',
>> >> >> +        'codebase': 'changes.codebase',
>> >> >> +        'project': 'changes.project',
>> >> >> +        'sourcestampid': 'changes.sourcestampid',
>> >> >> +        'parent_changeids': 'changes.parent_changeids',
>> >> >>      }
>> >> >>
>> >> >> and two not having all the columns in the select statement in  
>> order
>> >> to
>> >> >> be
>> >> >> able to check for matched/unmatched filtering or ordering as it is
>> >> done
>> >> >> here
>> >> >>
>> >>
>> https://github.com/buildbot/buildbot/blob/9f1cac1d3bb61baa0b6c836cc18812a64cfa9c2b/master/buildbot/data/resultspec.py#L269
>> >> .
>> >> >> My solution was to get all the fields from the changes table from  
>> the
>> >> >> database (don't know if this is the best approach) as you can see
>> >> here:
>> >> >>
>> >> >> +++ b/master/buildbot/db/changes.py
>> >> >> @@ -249,7 +249,7 @@ class
>> >> >> ChangesConnectorComponent(base.DBConnectorComponent):
>> >> >>          def thd(conn):
>> >> >>              # get the changeids from the 'changes' table
>> >> >>              changes_tbl = self.db.model.changes
>> >> >> -            q = sa.select([changes_tbl.c.changeid])
>> >> >> +            q = changes_tbl.select()
>> >> >>
>> >> >> However, since only the changeid would be needed, don't sure if  
>> this
>> >> >> would
>> >> >> be the right approach. Also, I would suspect some missing tests
>> >> because
>> >> >> I
>> >> >> would expect this to be a pretty common use case.
>> >> >>
>> >> >> Regarding the profiler, I also had some problems running it for a
>> >> longer
>> >> >> time, so if you find a fix, please share it with me.
>> >> >>
>> >> >> Cheers,
>> >> >> Vlad
>> >> >>
>> >> >> On Wed, Jan 13, 2021 at 1:33 PM Pierre Tardy <tardyp at gmail.com>
>> >> wrote:
>> >> >>
>> >> >>>
>> >> >>> I am not sure why this wouldn't work. I vaguely recall there was  
>> an
>> >> >>> issue
>> >> >>> there, but can't figure it out staring at the code.
>> >> >>> As this is quite ancient, I am not sure the JS part will even  
>> build
>> >> >>> anymore :-/
>> >> >>>
>> >> >>> You can change the default values at that line in the python
>> >> >>>
>> >> >>>
>> >>
>> https://github.com/tardyp/buildbot_profiler/blob/master/buildbot_profiler/api.py#L193
>> >> >>> I think you will be able to force them by editing this file  
>> inside
>> >> your
>> >> >>> virtualenv..
>> >> >>>
>> >> >>>
>> >> >>> Regards
>> >> >>> Pierre
>> >> >>>
>> >> >>>
>> >> >>> Le mer. 13 janv. 2021 à 11:46, Yngve N. Pettersen
>> >> <yngve at vivaldi.com>
>> >> a
>> >> >>> écrit :
>> >> >>>
>> >> >>>> > I insist on the buildbot profiler. What I was saying before is
>> >> that
>> >> >>>> you
>> >> >>>> > need to hit the record button before the problem appears, and
>> >> put a
>> >> >>>> large
>> >> >>>> > enough record time to be sure to catch a spike.
>> >> >>>> > Then, you will be able to zoom to the cpu spike and catch the
>> >> issue
>> >> >>>> > precisely.
>> >> >>>> >
>> >> >>>> > If the spike is in the order of minutes like you said, you can
>> >> >>>> configure
>> >> >>>> > it
>> >> >>>> > like this and get enough samples to get enough evidence to
>> where
>> >> the
>> >> >>>> code
>> >> >>>> > is actually spending time:
>> >> >>>> >
>> >> >>>> > ProfilerService(frequency=500, gatherperiod=60 * 60,
>> >> mode='virtual',
>> >> >>>> > basepath=None, wantBuilds=100
>> >> >>>>
>> >> >>>> I tried configuring this with the settings dropdown in the  
>> WebGUI
>> >> >>>> plugin,
>> >> >>>> but AFAICT it is not working, it only gathers info for 30  
>> seconds.
>> >> >>>>
>> >> >>>> I guess I must be holding it incorrectly.
>> >> >>>>
>> >> >>>> > This will record for one hour, and mitigate the memory used if
>> >> you
>> >> >>>> worry
>> >> >>>> > about it.
>> >> >>>>
>> >> >>>> --
>> >> >>>> Sincerely,
>> >> >>>> Yngve N. Pettersen
>> >> >>>> Vivaldi Technologies AS
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >> --
>> >> >> Vlad
>> >> >>
>> >>
>> >>
>> >> --
>> >> Sincerely,
>> >> Yngve N. Pettersen
>> >> Vivaldi Technologies AS
>>
>>
>> --
>> Sincerely,
>> Yngve N. Pettersen
>> Vivaldi Technologies AS
>>


-- 
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS


More information about the users mailing list