[users at bb.net] Performance issues in 2.10

Vlad Bogolin vlad at mariadb.org
Tue Jan 12 23:38:34 UTC 2021


Hi,

I have experienced some similar interface freezes while configuring our
version of buildbot. Two cases come to mind:

1) A "changes" API problem where it seemed that the "limit" argument was
ignored in some cases which translated into a full changes table scan. This
was reproducible when hitting the "Builds > Last Changes" dashboard and
then all the other pages were frozen. There are other requests to changes,
so this may be related to the Builds page too. Also, this only
happened when the number of changes from the db was high. I was planning on
submitting a proper fix, but we are running a custom version of 2.7.1 where
I implemented a fast workaround and did not managed to submit a proper fix
(hope to be able to do it next week).
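
If you want to check whether you are hitting the same thing, one quick test
that does not require touching the master is to time the REST endpoint
directly with different limits; if the "limit" argument is honoured, the
response time should grow with the limit instead of staying uniformly slow.
A rough sketch (the base URL and port are just placeholders for your
installation):

    import time
    import requests

    BASE = "http://localhost:8010/api/v2"  # adjust to your buildbot URL

    for limit in (10, 100, 1000):
        start = time.monotonic()
        resp = requests.get(f"{BASE}/changes", params={"limit": limit})
        resp.raise_for_status()
        count = len(resp.json()["changes"])
        print(f"limit={limit}: {count} changes in "
              f"{time.monotonic() - start:.2f}s")

If the timings barely change with the limit, the limit is probably being
ignored and you are in the same situation we were.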

2) We experienced the same issue as you describe when a lot of logs were
coming in (which seems to be your case too) and the master process was
overwhelmed when multiple builds were running at the same time (constant
CPU usage of around 120%). We solved the issue by switching to multi-master
and limiting the amount of logs, but if you say this was not an issue in
2.7 I would really be interested in finding out the root cause (I thought
it was the high volume of logs). You can test this hypothesis by limiting
the number of concurrently running builds and seeing whether the issue
still reproduces.
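
The usual knobs for that kind of limiting are roughly the following
master.cfg fragment (only a sketch; worker names, passwords and the exact
numbers are placeholders, tune them to your setup). The multi-master part
is sketched further down in the quoted thread, next to Pierre's
explanation.

    from buildbot.plugins import worker

    # Truncate oversized step logs; the tail is kept so that failures
    # remain debuggable.
    c['logMaxSize'] = 5 * 1024 * 1024   # bytes per log
    c['logMaxTailSize'] = 64 * 1024     # keep the end of truncated logs

    # Cap how many builds each worker runs at once, so the master is not
    # flooded with log chunks from many parallel builds.
    c['workers'] = [
        worker.Worker('builder-01', 'secret-password', max_builds=2),
    ]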

What worked for me to track down the "changes" API problem was visiting
each dashboard in turn and checking whether the freeze occurred or not.

Hope this helps!

Cheers,
Vlad

On Wed, Jan 13, 2021 at 12:40 AM Yngve N. Pettersen <yngve at vivaldi.com>
wrote:

> On Tue, 12 Jan 2021 22:13:50 +0100, Pierre Tardy <tardyp at gmail.com> wrote:
>
> > Thanks for the update.
> >
> > Some random thoughts...
> >
> > You should probably leave the profiler open until you get the performance
> > spike.
> > If you are inside the spike when starting it, indeed, you won't be able to
> > start the profiler, but if it is started before the spike it will for sure
> > show exactly where the code is spending its time.
>
> I did have the profiler open in this latest case; as far as I could tell
> it still didn't start recording until after the spike ended (there was no
> progress information in the recorder line).
>
> The two major items showing up were
>
>    /buildbot/db/builds.py+91:getPrevSuccessfulBuild
>    /buildbot/db/pool.py+190:__thd
>
> but I think they were recorded after the spike.
>
> I am planning to activate more detailed logging in the postgresql server,
> but have not done that yet (probably need to shut down and restart
> buildbot when I do).
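
On the PostgreSQL side that only needs a few settings in postgresql.conf,
and most of the log_* parameters can be applied with a configuration reload
rather than a full server restart, so buildbot should not need to come down
for it. The values below are just an illustration:

    # postgresql.conf -- temporary, verbose query logging
    log_min_duration_statement = 0   # log every statement with its duration
    log_statement = 'all'            # log all statement types
    log_line_prefix = '%m [%p] %u@%d '   # timestamp, pid, user@database

    -- then, from psql, apply without restarting the server:
    SELECT pg_reload_conf();
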
>
>
> BTW, I suspect that this issue can also cause trouble for builds whose
> steps end at the time the problem is occurring; I just noticed a task that
> is still running more than 4 hours after it started a step that should
> have been killed after 20 minutes if it was hanging. It should have ended
> at about the time one of the hangs was occurring. And for some reason it
> is impossible to stop the task; even shutting down the worker process did
> not work. AFAIK the only way to fix this is to shut the buildbot manager
> down.
>
> > Statistical profiling uses timer interrupts, which preempt whatever is
> > running and take a call stack trace.
> >
> > While waiting for a repro: if, from the db log, you manage to get info on
> > what kind of db data that is, maybe we can narrow down the usual suspects.
> >
> > If there are lots of short selects like you said, you would usually have a
> > back and forth from the reactor thread to the db thread, so it sounds
> > weird. What could be leading to your behavior is that whatever is halting
> > the processing, everything gets queued up in the meantime and dequeued
> > when it finishes, which could lead to a spike of db actions at the end of
> > the event.
>
> The DB actions were going on for the entire 3 minutes that the spike
> lasted; it is not a burst at either end, but a ~180 second long continuous
> sequence (or barrage) of approximately 70,000-90,000 transactions, if I am
> interpreting the graph data correctly.
>
> > Regards
> > Pierre
> >
> >
> > On Tue, Jan 12, 2021 at 21:49, Yngve N. Pettersen <yngve at vivaldi.com>
> > wrote:
> >
> >> Hi again,
> >>
> >> A bit of an update.
> >>
> >> I have not been able to locate the issue using the profiler.
> >>
> >> It seems that when Buildbot gets into the problematic mode, the profiler
> >> is not able to work at all. It only starts collecting after the locked
> >> mode is resolved.
> >>
> >> It does seem like the locked mode occurs when Buildbot is fetching a lot
> >> of data from the DB and then spends a lot of time processing that data,
> >> without yielding to other processing needs.
> >>
> >> Looking at the monitoring of the server, it also appears that buildbot is
> >> fetching a lot of data. During the most recent instance, the returned
> >> tuples count in the graph for the server indicates three minutes of, on
> >> average, 25,000 tuples returned per second, with spikes to 80K and 100K.
> >>
> >> The number of open connections rose to 6 or 7, and the transaction count
> >> was 400-500 per second during the whole time (rolled-back transactions,
> >> which I assume are just one or more selects each).
> >>
> >> IMO this makes it look like, while requesting these data, Buildbot is
> >> *synchronously* querying the DB and processing the returned data without
> >> yielding. It might also be that it is requesting more data than it
> >> needs, and requesting some data earlier than it is actually needed.
> >>
> >>
> >>
> >> On Tue, 12 Jan 2021 12:48:40 +0100, Yngve N. Pettersen
> >> <yngve at vivaldi.com> wrote:
> >>
> >> > Hi,
> >> >
> >> > IIRC the only real processing in our system that might be heavy is done
> >> > via logobserver.LineConsumerLogObserver in a class (now) derived from
> >> > ShellCommandNewStyle. So if that is the issue, and deferToThread is the
> >> > solution, then, if it isn't already done, my suggestion would be to
> >> > implement that inside the code handling the log observers.
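
If the per-line processing turns out to be the expensive part, one pattern
that avoids blocking the reactor is to keep the LineConsumerLogObserver
consumer itself trivial (just buffer the lines) and push the heavy parsing
into a thread once the command has finished. A rough sketch of a custom
step along those lines (class and method names are made up for the
example):

    from twisted.internet import defer, threads
    from buildbot.process import buildstep, logobserver

    class ParsingShellCommand(buildstep.ShellMixin, buildstep.BuildStep):

        def __init__(self, **kwargs):
            kwargs = self.setupShellMixin(kwargs)
            super().__init__(**kwargs)
            self._lines = []
            self.addLogObserver(
                'stdio', logobserver.LineConsumerLogObserver(self._consume))

        def _consume(self):
            # Generator consumer: runs on the reactor thread, so it must
            # stay cheap -- only collect the stdout lines here.
            while True:
                stream, line = yield
                if stream == 'o':
                    self._lines.append(line)

        def _heavy_parse(self, lines):
            # Placeholder for the expensive per-line work.
            return len(lines)

        @defer.inlineCallbacks
        def run(self):
            cmd = yield self.makeRemoteShellCommand()
            yield self.runCommand(cmd)
            # Do the expensive work off the reactor thread.
            parsed = yield threads.deferToThread(self._heavy_parse,
                                                 self._lines)
            # ... use `parsed` to set properties / the step summary ...
            return cmd.results()

This is only a sketch, not what the upstream log observer code does today,
but it keeps the event loop free while the parsing runs.
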
> >> >
> >> > I've tested the profiler a little, but haven't seen any samples within
> >> > our code so far, just inside buildbot itself: quite a lot of log DB
> >> > actions, and also some TLS activity.
> >> >
> >> > The performance issue for those pages seems to be a bit flaky; at
> >> > present it's not happening, AFAICT.
> >> >
> >> > On Tue, 12 Jan 2021 10:59:42 +0100, Pierre Tardy <tardyp at gmail.com>
> >> > wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> A lot of things happened between 2.7 and 2.10, although I don't see
> >> >> anything that could impact performance that much (maybe the new
> >> >> reporter framework, but I'm really not convinced).
> >> >> If you see that the db is underutilized, this must be classical
> >> >> reactor starvation.
> >> >> With asynchronous systems like buildbot, you shouldn't do any heavy
> >> >> computation in the main event loop thread; that work must be done in
> >> >> a thread via deferToThread and co.
> >> >>
> >> >> These are the common performance issues you can have, independently
> >> >> of upgrade regressions:
> >> >>
> >> >> 1) Custom steps:
> >> >> A lot of the time, we see people struggling with performance when they
> >> >> just have some custom step doing heavy computation that blocks the
> >> >> main thread constantly, preventing all the very quick tasks from
> >> >> running in parallel.
> >> >>
> >> >> 2) Too many logs:
> >> >> In this case, there is not much to do besides reducing the log volume.
> >> >> This would be the time to switch to a multi-master setup, where you
> >> >> put 2 masters for builds, and one master for the web UI.
> >> >> You can put those on the same machine/VM, no problem; the only work is
> >> >> to have separate processes, so that you end up with several event
> >> >> queues. You can use docker-compose or kubernetes to more easily create
> >> >> such a deployment. We don't have a ready-made setup for that, but
> >> >> several people have done it and documented it, for example
> >> >> https://github.com/pop/buildbot-on-kubernetes
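
Since Pierre mentions multi-master: the master.cfg side of that switch is
fairly small. Every master points at the same database and at a shared WAMP
broker (crossbar.io), and only the UI master keeps the www section. Roughly
(the URLs, realm and credentials below are placeholders):

    # On every master (UI master and build masters alike):
    c['multiMaster'] = True
    c['db_url'] = 'postgresql://buildbot:password@dbhost/buildbot'
    c['mq'] = {
        'type': 'wamp',
        'router_url': 'ws://localhost:8080/ws',  # crossbar.io router
        'realm': 'realm1',
    }

    # Only on the master that serves the web UI:
    c['www'] = dict(port=8010,
                    plugins=dict(waterfall_view={}, console_view={}))

The build-only masters simply omit c['www'] and carry the builders and
workers instead.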
> >> >>
> >> >>
> >> >> I have developed the buildbot profiler in order to quickly find those.
> >> >> You just have to install it as a plugin and start a profile whenever
> >> >> buildbot feels slow.
> >> >> It is a statistical profiler, so it will not significantly change the
> >> >> actual performance, and it is safe to run in production.
> >> >>
> >> >> https://pypi.org/project/buildbot-profiler/
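
If it saves anyone a lookup, enabling the profiler is (if I remember its
README correctly) just a pip install plus one line in master.cfg, after
which a profiler entry shows up in the web UI where you can start and stop
a capture:

    $ pip install buildbot-profiler

    # master.cfg -- enable the profiler web UI plugin:
    c['www']['plugins']['profiler'] = True
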
> >> >>
> >> >>
> >> >> Regards,
> >> >> Pierre
> >> >>
> >> >>
> >> >> On Tue, Jan 12, 2021 at 01:29, Yngve N. Pettersen
> >> >> <yngve at vivaldi.com> wrote:
> >> >>
> >> >>> Hello all,
> >> >>>
> >> >>> We have just upgraded our buildbot system from 2.7 to 2.10.
> >> >>>
> >> >>> However, I am noticing performance issues when loading these pages:
> >> >>>
> >> >>>   Builds->Builders
> >> >>>   Builds->Workers
> >> >>>   individual builds
> >> >>>
> >> >>> Loading these can take several minutes, although there are periods of
> >> >>> immediate responses.
> >> >>>
> >> >>> What I am seeing on the buildbot manager machine is that the Python3
> >> >>> process hits 90-100% CPU for the entire period.
> >> >>>
> >> >>> The Python version is 3.6.9 running on Ubuntu 18.04
> >> >>>
> >> >>> As far as I can tell, the PostgreSQL database is mostly idle during
> >> >>> this period. I did do a full vacuum a few hours ago, in case that was
> >> >>> the issue.
> >> >>>
> >> >>> There are about 40 builders and 30 workers in the system; only about
> >> >>> 10-15 of these have 10-20 builds from the past few days, although most
> >> >>> of these have active histories of 3000 builds (which does make me
> >> >>> wonder if the problem could be a lack of limiting in the DB queries;
> >> >>> at present I have not inspected the DB queries).
> >> >>>
> >> >>> The individual builds can have very large log files in the build
> >> >>> steps, in many cases tens of thousands of lines (we _are_ talking
> >> >>> about a Chromium-based project).
> >> >>>
> >> >>> Our changes to the builders and workers JS code are minimal (we are
> >> >>> using a custom build of www-base): we just use different information
> >> >>> for the build labels (build version number) and group the builders,
> >> >>> which should not be causing any performance issues. (We have larger
> >> >>> changes in the individual builder view, where we include Git commit
> >> >>> messages, and I have so far not seen any performance issues there.)
> >> >>>
> >> >>> BTW: The line plots for build time and successes on builders seem to
> >> >>> be MIA. Not sure if that is an upstream issue, or due to something in
> >> >>> our www-base build.
> >> >>>
> >> >>> Do you have any suggestions for where to look for the cause of the
> >> >>> problem?
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Sincerely,
> >> >>> Yngve N. Pettersen
> >> >>> Vivaldi Technologies AS
> >> >>> _______________________________________________
> >> >>> users mailing list
> >> >>> users at buildbot.net
> >> >>> https://lists.buildbot.net/mailman/listinfo/users
> >> >>>
> >> >
> >> >
> >>
> >>
> >> --
> >> Sincerely,
> >> Yngve N. Pettersen
> >> Vivaldi Technologies AS
> >>
>
>
> --
> Sincerely,
> Yngve N. Pettersen
> Vivaldi Technologies AS
> _______________________________________________
> users mailing list
> users at buildbot.net
> https://lists.buildbot.net/mailman/listinfo/users
>