[users at bb.net] Performance issues in 2.10

Yngve N. Pettersen yngve at vivaldi.com
Wed Jan 13 00:34:57 UTC 2021


Hi,

Thanks for that info.

In my case the problem is apparently something that happens now and then.

As mentioned, I have seen it on the Builds->Builders and Builds->Workers  
pages, neither of which includes any changelog access AFAIK.

I have also seen it occasionally on individual build pages, which has a  
log of steps with logs, and a changelog panel.

Just a few minutes ago I saw this freeze/spike happen while the buildbot  
manager was completely idle, since all active tasks had completed, and I  
had paused all workers since I needed to restart the manager (due to the  
hanging build).

I have also had reports about the Grid view and Console pages displaying  
this issue, but have not seen it myself.

At present I have enabled logging in the postgresql server, so maybe I can  
figure out what requests are handled during the spike.



On Wed, 13 Jan 2021 00:38:34 +0100, Vlad Bogolin <vlad at mariadb.org> wrote:

> Hi,
>
> I have experienced some similar interface freezes while trying to  
> configure
> our version of buildbot. I now remember two cases:
>
> 1) A "changes" API problem where it seemed that the "limit" argument was
> ignored in some cases which translated into a full changes table scan.  
> This
> was reproducible when hitting the "Builds > Last Changes" dashboard and
> then all the other pages were frozen. There are other requests to  
> changes,
> so this may be related to the Builds page too. Also, this only
> happened when the number of changes from the db was high. I was planning  
> on
> submitting a proper fix, but we are running a custom version of 2.7.1  
> where
> I implemented a fast workaround and did not managed to submit a proper  
> fix
> (hope to be able to do it next week).
>
> 2) We experienced the same issue as you describe when a lot of logs where
> coming (which seems to be your case too) and the master process was
> overwhelmed when multiple builds were running in the same time (constant
> CPU usage around ~120%). We solved the issue by switching to multi master
> and limiting the amount of logs, but if you say that this was not an  
> issue
> in 2.7 I would really be interested in finding out what is the root cause
> (I thought it was the high amount of logs). You can test this hypothesis  
> by
> limiting the amount of running builds and see if the issue keeps
> reproducing.
>
> What worked for me in order to find out the "changes" API problem was
> visiting each dashboard and see if the freeze occurs or not.
>
> Hope this helps!
>
> Cheers,
> Vlad
>
> On Wed, Jan 13, 2021 at 12:40 AM Yngve N. Pettersen <yngve at vivaldi.com>
> wrote:
>
>> On Tue, 12 Jan 2021 22:13:50 +0100, Pierre Tardy <tardyp at gmail.com>  
>> wrote:
>>
>> > Thanks for the update.
>> >
>> > Some random thoughts...
>> >
>> > You should probably leave the profiler open until you get the  
>> performance
>> > spike.
>> > If you are inside the spike when starting, indeed, you won't be able  
>> to
>> > start profiler, but if it is started before the spike it for sure will
>> > detect exactly where the code is.
>>
>> I did have the profiler open in this latest case; as far as I could tell
>> it still didn't start recording until after the spike ended (there was  
>> no
>> progress information in the recorder line).
>>
>> The two major items showing up were
>>
>>    /buildbot/db/builds.py+91:getPrevSuccessfulBuild
>>    /buildbot/db/pool.py+190:__thd
>>
>> but I think they were recorded after the spike.
>>
>> I am planning to activate more detailed logging in the postgresql  
>> server,
>> but have not done that yet (probably need to shut down and restart
>> buildbot when I do).
>>
>>
>> BTW, I suspect that this issue can also cause trouble for builds whose
>> steps ends at the time the problem is occurring; I just noticed a task
>> that is still running more than 4 hours after it started a step that
>> should have been killed after 20 minutes if it was hanging. It should
>> have
>> ended at about the time one of the hangs was occuring. And it is
>> impossible to stop the task for some reason, even shutting down the
>> worker
>> process did not work. AFAIK the only way to fix the issue is to shut the
>> buildbot manager down.
>>
>> > statistic profiling will use timer interrupts which will preempt  
>> anything
>> > that is running, and make a call stack trace.
>> >
>> > Waiting for repro, if, from the db log, you manage to get the info of
>> > what
>> > kind of db data that is, maybe we can narrow down the usual suspects..
>> >
>> > If there are lots of short selects like you said, usually, you would
>> > have a
>> > back and forth from reactor thread to db thread, so it sounds weird.
>> > What can be leading to your behavior is that whatever is halting the
>> > processing, everything is queued up in between, and unqueued when it  
>> is
>> > finished, which could lead to spike of db actions in the end of the
>> > event.
>>
>> The DB actions were going on for the entire 3 minutes that spike lasted;
>> it is not a burst at either end, but a ~180 second long continuous
>> sequence (or barrage) of approximately 70-90000 transactions, if I am
>> interpreting the graph data correctly.
>>
>> > Regards
>> > Pierre
>> >
>> >
>> > Le mar. 12 janv. 2021 à 21:49, Yngve N. Pettersen <yngve at vivaldi.com>  
>> a
>> > écrit :
>> >
>> >> Hi again,
>> >>
>> >> A bit of an update.
>> >>
>> >> I have not been able to locate the issue using the profiler.
>> >>
>> >> It seems that when Buildbot gets into the problematic mode, then the
>> >> profiler is not able to work at all. It only starts collecting after  
>> the
>> >> locked mode is resolved.
>> >>
>> >> It does seem like the locked mode occurs when Buildbot is fetching a  
>> lot
>> >> of data from the DB and then spends a lot of time processing that  
>> data,
>> >> without yielding to other processing needs.
>> >>
>> >> Looking at the monitoring of the server, it also appears that  
>> buildbot
>> >> is
>> >> fetching a lot of data. During the most recent instance, the returned
>> >> tuples count in the graph for the server indicates three minutes of,  
>> on
>> >> average 25000 tuples returned, with spikes to 80K and 100K, per  
>> second.
>> >>
>> >> The number of open connections rose to 6 or 7, and the transaction  
>> count
>> >> was 400-500 per second during the whole time (rolled back  
>> transactions,
>> >> which I assume is just one or more selects).
>> >>
>> >> IMO this makes it look like, while requesting these data, Buildbot is
>> >> *synchronously* querying the DB and processing the returned data, not
>> >> yielding. It might also be that it is requesting data more data than  
>> it
>> >> needs, and also requesting other data earlier than it is actually
>> >> needed.
>> >>
>> >>
>> >>
>> >> On Tue, 12 Jan 2021 12:48:40 +0100, Yngve N. Pettersen
>> >> <yngve at vivaldi.com>
>> >>
>> >> wrote:
>> >>
>> >> > Hi,
>> >> >
>> >> > IIRC the only real processing in our system that might be heavy is
>> >> done
>> >> > via logobserver.LineConsumerLogObserver in a class (now) derived  
>> from
>> >> > ShellCommandNewStyle, so if that is the issue, and deferToThread is
>> >> the
>> >> > solution, then if it isn't already done, my suggestion would be to
>> >> > implement that inside the code handling the log observers.
>> >> >
>> >> > I've tested the profiler a little, but haven't seen any samples  
>> within
>> >> > our code so far, just inside buildbot, quite a lot of log DB  
>> actions,
>> >> > also some TLS activity.
>> >> >
>> >> > The performance issue for those pages seems to be a bit flaky; at
>> >> > present its not happening AFAICT
>> >> >
>> >> > On Tue, 12 Jan 2021 10:59:42 +0100, Pierre Tardy <tardyp at gmail.com>
>> >> > wrote:
>> >> >
>> >> >> Hello,
>> >> >>
>> >> >> A lot of things happen between 2.7 and 2.10, although I don't see
>> >> >> anything
>> >> >> which could impact the performance that much. (maybe new reporter
>> >> >> framework, but really not convinced)
>> >> >> If you see that the db is underutilized this must be a classical
>> >> reactor
>> >> >> starvation.
>> >> >> With asynchronous systems like buildbot, you shouldn't do any  
>> heavy
>> >> >> computation in the main event loop thread, those must be done in a
>> >> >> thread
>> >> >> via deferToThread and co.
>> >> >>
>> >> >> Those are the common issues you can have with performance
>> >> >> independantly from upgrade regressions:
>> >> >>
>> >> >> 1) Custom steps:
>> >> >> A lot of time, we see people struggling with performance when they
>> >> just
>> >> >> have some custom step doing heavy computation that block the main
>> >> thread
>> >> >> constantly, preventing all the very quick tasks to run in //.
>> >> >>
>> >> >> 2) too much logs
>> >> >>  In this case, there is not much to do beside reducing the log
>> >> amount.
>> >> >> This
>> >> >> would be the time to switch to a multi-master setup, where you  
>> put 2
>> >> >> masters for builds, and one master for web UI.
>> >> >> You can put those in the same machine/VM, no problem, the only  
>> work
>> >> is
>> >> >> to
>> >> >> have separate processes that each have several event queues. You
>> can
>> >> use
>> >> >> docker-compose or kubernetes in order to more easily create such
>> >> >> deployment. We don't have readily useable for that, but several
>> >> people
>> >> >> have
>> >> >> done and documented it, for example
>> >> >> https://github.com/pop/buildbot-on-kubernetes
>> >> >>
>> >> >>
>> >> >> I have developed the buildbot profiler in order to quickly find
>> >> those.
>> >> >> You
>> >> >> just have to install it as a plugin and start a profile whenever  
>> the
>> >> >> buildbot feels slow.
>> >> >> It is a statistical profiler, so it will not significantly change  
>> the
>> >> >> actual performance so it is safe to run in production.
>> >> >>
>> >> >> https://pypi.org/project/buildbot-profiler/
>> >> >>
>> >> >>
>> >> >> Regards,
>> >> >> Pierre
>> >> >>
>> >> >>
>> >> >> Le mar. 12 janv. 2021 à 01:29, Yngve N. Pettersen
>> >> <yngve at vivaldi.com> a
>> >> >> écrit :
>> >> >>
>> >> >>> Hello all,
>> >> >>>
>> >> >>> We have just upgraded our buildbot system from 2.7 to 2.10.
>> >> >>>
>> >> >>> However, I am noticing performance issues when loading these  
>> pages:
>> >> >>>
>> >> >>>   Builds->Builders
>> >> >>>   Builds->Workers
>> >> >>>   individual builds
>> >> >>>
>> >> >>> Loading these can take several minutes, although there are  
>> periods
>> >> of
>> >> >>> immediate responses.
>> >> >>>
>> >> >>> What I am seeing on the buildbot manager machine is that the  
>> Python3
>> >> >>> process hits 90-100% for the entire period.
>> >> >>>
>> >> >>> The Python version is 3.6.9 running on Ubuntu 18.04
>> >> >>>
>> >> >>> As far as I can tell, the Postgresql database is mostly idle  
>> during
>> >> >>> this
>> >> >>> period. I did do a full vacuum a few hours ago, in case that was  
>> the
>> >> >>> issue.
>> >> >>>
>> >> >>> There are about 40 builders, and 30 workers in the system, only
>> >> about
>> >> >>> 10-15 of these have a 10-20 builds for the past few days,  
>> although
>> >> most
>> >> >>> of
>> >> >>> these have active histories of 3000 builds (which do make me
>> wonder
>> >> if
>> >> >>> the
>> >> >>> problem could be a lack of limiting the DB queries, at present I
>> >> have
>> >> >>> not
>> >> >>> inspected the DB queries).
>> >> >>>
>> >> >>> The individual builds can have very large log files in the build
>> >> steps,
>> >> >>> in
>> >> >>> many cases tens of thousands of lines (we _are_ talking about a
>> >> >>> Chromium
>> >> >>> based project).
>> >> >>>
>> >> >>> Our changes in the builders and workers JS code are minimal (we  
>> are
>> >> >>> using
>> >> >>> a custom build of www-base), just using different information for
>> >> the
>> >> >>> build labels (build version number), and grouping the builders,
>> >> which
>> >> >>> should not be causing any performance issues. (we have larger
>> >> changes
>> >> >>> in
>> >> >>> the individual builder view, where we include Git commit  
>> messages,
>> >> and
>> >> >>> I
>> >> >>> have so far not seen any performance issues there)
>> >> >>>
>> >> >>> BTW: The line plots for build time and successes on builders  
>> seems
>> >> to
>> >> >>> be
>> >> >>> MIA. Not sure if that is an upstream issue, or due to something  
>> in
>> >> our
>> >> >>> www-base build.
>> >> >>>
>> >> >>> Do you have any suggestions for where to look for the cause of  
>> the
>> >> >>> problem?
>> >> >>>
>> >> >>>
>> >> >>> --
>> >> >>> Sincerely,
>> >> >>> Yngve N. Pettersen
>> >> >>> Vivaldi Technologies AS
>> >> >>> _______________________________________________
>> >> >>> users mailing list
>> >> >>> users at buildbot.net
>> >> >>> https://lists.buildbot.net/mailman/listinfo/users
>> >> >>>
>> >> >
>> >> >
>> >>
>> >>
>> >> --
>> >> Sincerely,
>> >> Yngve N. Pettersen
>> >> Vivaldi Technologies AS
>> >>
>>
>>
>> --
>> Sincerely,
>> Yngve N. Pettersen
>> Vivaldi Technologies AS
>> _______________________________________________
>> users mailing list
>> users at buildbot.net
>> https://lists.buildbot.net/mailman/listinfo/users
>>


-- 
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS


More information about the users mailing list