[users at bb.net] Performance issues in 2.10
Povilas Kanapickas
povilas at radix.lt
Wed Jan 13 10:32:37 UTC 2021
Hi,
Looking into diff between 2.7 and 2.10, pretty much the only difference
in the existing data and DB APIs are related to changes. So I think it's
very likely there's something not obvious we didn't consider.
I will build a test case and see if I can reproduce the problem. Long
term I think we need to have tests to prevent this kind of issues. I'll
think about how this can be done.
Yngve, thanks a lot for your work investigating this issue.
Cheers,
Povilas
On 1/13/21 12:26 PM, Pierre Tardy wrote:
> Hello,
>
> getPrevSuccessfulBuild is called by getChanges for build which in turn
> is called by /builds/NN/changes Rest API.
> the bug Vlad was referring to was a perf issue on the /changes API,
> which has been fixed a while back.
>
> Indeed, this algorithm is far from optimized,but I don't see why this
> would lead to main thread blocking. Looking at the code, I see that
> there are no big loops that do not yield to the main reactor loop.
>
> I insist on the buildbot profiler. What I was saying before is that you
> need to hit the record button before the problem appears, and put a
> large enough record time to be sure to catch a spike.
> Then, you will be able to zoom to the cpu spike and catch the issue
> precisely.
>
> If the spike is in the order of minutes like you said, you can configure
> it like this and get enough samples to get enough evidence to where the
> code is actually spending time:
>
> ProfilerService(frequency=500, gatherperiod=60 * 60, mode='virtual', basepath=None, wantBuilds=100
>
> This will record for one hour, and mitigate the memory used if you worry
> about it.
>
> Pierre
>
>
> Le mer. 13 janv. 2021 à 11:01, Yngve N. Pettersen <yngve at vivaldi.com
> <mailto:yngve at vivaldi.com>> a écrit :
>
>
> Hi again,
>
> I was just able to get a partial database log from a freeze incident
> when
> refreshing the Builds->Builders page.
>
> It looks like Vlad is on the right track.
>
> There are a *lot* of individual source stamp requests, but also requests
> to the builds and change tables.
>
> An interesting part of the builds request is this request:
>
> SELECT builds.id <http://builds.id>, builds.number,
> builds.builderid, builds.buildrequestid,
> builds.workerid, builds.masterid, builds.started_at, builds.complete_at,
> builds.state_string, builds.results
> FROM builds
> WHERE builds.builderid = 51 AND builds.number < 46 AND
> builds.results = 0 ORDER BY builds.complete_at DESC
> LIMIT 1000 OFFSET 0
>
> which appears to then be followed by a lot of changes and source stamp
> requests.
>
> The log contains a lot of these requests per second; according to the DB
> graph 200 to 400 per second.
>
> The 1000 limit appears to come from
> db.builds.BuildsConnectorComponent.getPrevSuccessfulBuild(), but that
> value seems to have been that way for a while, so the problem is likely
> caused by something else. This function did show up at the beginning
> of my
> traces related to these freezes.
>
>
> One possibility that I can think of, is that several of these pages,
> or
> the functions they are using, are no longer restricting how far back
> in
> the build history they are fetching build information for. E.g the
> Builders page is only supposed to show a couple of days of builds
> for each
> builder, so there should be no need to fetch data for a 1000 builds
> (making sure you have the build ids is one thing, fetching all the
> associated data even for builds that are not to be displayed is
> something
> else).
>
> BTW, I have noticed that another page, the waterfall, is not
> displaying
> anything, even after waiting for a very long time.
>
>
> On Wed, 13 Jan 2021 01:34:57 +0100, Yngve N. Pettersen
> <yngve at vivaldi.com <mailto:yngve at vivaldi.com>>
> wrote:
>
> > Hi,
> >
> > Thanks for that info.
> >
> > In my case the problem is apparently something that happens now
> and then.
> >
> > As mentioned, I have seen it on the Builds->Builders and
> Builds->Workers
> > pages, neither of which includes any changelog access AFAIK.
> >
> > I have also seen it occasionally on individual build pages, which
> has a
> > log of steps with logs, and a changelog panel.
> >
> > Just a few minutes ago I saw this freeze/spike happen while the
> buildbot
> > manager was completely idle, since all active tasks had completed,
> and I
> > had paused all workers since I needed to restart the manager (due
> to the
> > hanging build).
> >
> > I have also had reports about the Grid view and Console pages
> displaying
> > this issue, but have not seen it myself.
> >
> > At present I have enabled logging in the postgresql server, so
> maybe I
> > can figure out what requests are handled during the spike.
> >
> >
> >
> > On Wed, 13 Jan 2021 00:38:34 +0100, Vlad Bogolin <vlad at mariadb.org
> <mailto:vlad at mariadb.org>>
> > wrote:
> >
> >> Hi,
> >>
> >> I have experienced some similar interface freezes while trying to
> >> configure
> >> our version of buildbot. I now remember two cases:
> >>
> >> 1) A "changes" API problem where it seemed that the "limit"
> argument was
> >> ignored in some cases which translated into a full changes table
> scan.
> >> This
> >> was reproducible when hitting the "Builds > Last Changes"
> dashboard and
> >> then all the other pages were frozen. There are other requests to
> >> changes,
> >> so this may be related to the Builds page too. Also, this only
> >> happened when the number of changes from the db was high. I was
> >> planning on
> >> submitting a proper fix, but we are running a custom version of
> 2.7.1
> >> where
> >> I implemented a fast workaround and did not managed to submit a
> proper
> >> fix
> >> (hope to be able to do it next week).
> >>
> >> 2) We experienced the same issue as you describe when a lot of logs
> >> where
> >> coming (which seems to be your case too) and the master process was
> >> overwhelmed when multiple builds were running in the same time
> (constant
> >> CPU usage around ~120%). We solved the issue by switching to multi
> >> master
> >> and limiting the amount of logs, but if you say that this was not
> an
> >> issue
> >> in 2.7 I would really be interested in finding out what is the root
> >> cause
> >> (I thought it was the high amount of logs). You can test this
> >> hypothesis by
> >> limiting the amount of running builds and see if the issue keeps
> >> reproducing.
> >>
> >> What worked for me in order to find out the "changes" API problem was
> >> visiting each dashboard and see if the freeze occurs or not.
> >>
> >> Hope this helps!
> >>
> >> Cheers,
> >> Vlad
> >>
> >> On Wed, Jan 13, 2021 at 12:40 AM Yngve N. Pettersen
> <yngve at vivaldi.com <mailto:yngve at vivaldi.com>>
> >> wrote:
> >>
> >>> On Tue, 12 Jan 2021 22:13:50 +0100, Pierre Tardy
> <tardyp at gmail.com <mailto:tardyp at gmail.com>>
> >>> wrote:
> >>>
> >>> > Thanks for the update.
> >>> >
> >>> > Some random thoughts...
> >>> >
> >>> > You should probably leave the profiler open until you get the
> >>> performance
> >>> > spike.
> >>> > If you are inside the spike when starting, indeed, you won't
> be able
> >>> to
> >>> > start profiler, but if it is started before the spike it for
> sure
> >>> will
> >>> > detect exactly where the code is.
> >>>
> >>> I did have the profiler open in this latest case; as far as I
> could
> >>> tell
> >>> it still didn't start recording until after the spike ended
> (there was
> >>> no
> >>> progress information in the recorder line).
> >>>
> >>> The two major items showing up were
> >>>
> >>> /buildbot/db/builds.py+91:getPrevSuccessfulBuild
> >>> /buildbot/db/pool.py+190:__thd
> >>>
> >>> but I think they were recorded after the spike.
> >>>
> >>> I am planning to activate more detailed logging in the postgresql
> >>> server,
> >>> but have not done that yet (probably need to shut down and restart
> >>> buildbot when I do).
> >>>
> >>>
> >>> BTW, I suspect that this issue can also cause trouble for builds
> whose
> >>> steps ends at the time the problem is occurring; I just noticed
> a task
> >>> that is still running more than 4 hours after it started a step that
> >>> should have been killed after 20 minutes if it was hanging. It
> should
> >>> have
> >>> ended at about the time one of the hangs was occuring. And it is
> >>> impossible to stop the task for some reason, even shutting down the
> >>> worker
> >>> process did not work. AFAIK the only way to fix the issue is to
> shut
> >>> the
> >>> buildbot manager down.
> >>>
> >>> > statistic profiling will use timer interrupts which will preempt
> >>> anything
> >>> > that is running, and make a call stack trace.
> >>> >
> >>> > Waiting for repro, if, from the db log, you manage to get the
> info of
> >>> > what
> >>> > kind of db data that is, maybe we can narrow down the usual
> >>> suspects..
> >>> >
> >>> > If there are lots of short selects like you said, usually, you
> would
> >>> > have a
> >>> > back and forth from reactor thread to db thread, so it sounds
> weird.
> >>> > What can be leading to your behavior is that whatever is
> halting the
> >>> > processing, everything is queued up in between, and unqueued
> when it
> >>> is
> >>> > finished, which could lead to spike of db actions in the end
> of the
> >>> > event.
> >>>
> >>> The DB actions were going on for the entire 3 minutes that spike
> >>> lasted;
> >>> it is not a burst at either end, but a ~180 second long continuous
> >>> sequence (or barrage) of approximately 70-90000 transactions, if
> I am
> >>> interpreting the graph data correctly.
> >>>
> >>> > Regards
> >>> > Pierre
> >>> >
> >>> >
> >>> > Le mar. 12 janv. 2021 à 21:49, Yngve N. Pettersen
> >>> <yngve at vivaldi.com <mailto:yngve at vivaldi.com>> a
> >>> > écrit :
> >>> >
> >>> >> Hi again,
> >>> >>
> >>> >> A bit of an update.
> >>> >>
> >>> >> I have not been able to locate the issue using the profiler.
> >>> >>
> >>> >> It seems that when Buildbot gets into the problematic mode,
> then the
> >>> >> profiler is not able to work at all. It only starts collecting
> >>> after the
> >>> >> locked mode is resolved.
> >>> >>
> >>> >> It does seem like the locked mode occurs when Buildbot is
> fetching
> >>> a lot
> >>> >> of data from the DB and then spends a lot of time processing
> that
> >>> data,
> >>> >> without yielding to other processing needs.
> >>> >>
> >>> >> Looking at the monitoring of the server, it also appears that
> >>> buildbot
> >>> >> is
> >>> >> fetching a lot of data. During the most recent instance, the
> >>> returned
> >>> >> tuples count in the graph for the server indicates three
> minutes
> >>> of, on
> >>> >> average 25000 tuples returned, with spikes to 80K and 100K, per
> >>> second.
> >>> >>
> >>> >> The number of open connections rose to 6 or 7, and the
> transaction
> >>> count
> >>> >> was 400-500 per second during the whole time (rolled back
> >>> transactions,
> >>> >> which I assume is just one or more selects).
> >>> >>
> >>> >> IMO this makes it look like, while requesting these data,
> Buildbot
> >>> is
> >>> >> *synchronously* querying the DB and processing the returned
> data,
> >>> not
> >>> >> yielding. It might also be that it is requesting data more data
> >>> than it
> >>> >> needs, and also requesting other data earlier than it is actually
> >>> >> needed.
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Tue, 12 Jan 2021 12:48:40 +0100, Yngve N. Pettersen
> >>> >> <yngve at vivaldi.com <mailto:yngve at vivaldi.com>>
> >>> >>
> >>> >> wrote:
> >>> >>
> >>> >> > Hi,
> >>> >> >
> >>> >> > IIRC the only real processing in our system that might be
> heavy is
> >>> >> done
> >>> >> > via logobserver.LineConsumerLogObserver in a class (now)
> derived
> >>> from
> >>> >> > ShellCommandNewStyle, so if that is the issue, and
> deferToThread
> >>> is
> >>> >> the
> >>> >> > solution, then if it isn't already done, my suggestion
> would be to
> >>> >> > implement that inside the code handling the log observers.
> >>> >> >
> >>> >> > I've tested the profiler a little, but haven't seen any
> samples
> >>> within
> >>> >> > our code so far, just inside buildbot, quite a lot of log DB
> >>> actions,
> >>> >> > also some TLS activity.
> >>> >> >
> >>> >> > The performance issue for those pages seems to be a bit
> flaky; at
> >>> >> > present its not happening AFAICT
> >>> >> >
> >>> >> > On Tue, 12 Jan 2021 10:59:42 +0100, Pierre Tardy
> >>> <tardyp at gmail.com <mailto:tardyp at gmail.com>>
> >>> >> > wrote:
> >>> >> >
> >>> >> >> Hello,
> >>> >> >>
> >>> >> >> A lot of things happen between 2.7 and 2.10, although I
> don't see
> >>> >> >> anything
> >>> >> >> which could impact the performance that much. (maybe new
> reporter
> >>> >> >> framework, but really not convinced)
> >>> >> >> If you see that the db is underutilized this must be a
> classical
> >>> >> reactor
> >>> >> >> starvation.
> >>> >> >> With asynchronous systems like buildbot, you shouldn't do
> any
> >>> heavy
> >>> >> >> computation in the main event loop thread, those must be
> done in
> >>> a
> >>> >> >> thread
> >>> >> >> via deferToThread and co.
> >>> >> >>
> >>> >> >> Those are the common issues you can have with performance
> >>> >> >> independantly from upgrade regressions:
> >>> >> >>
> >>> >> >> 1) Custom steps:
> >>> >> >> A lot of time, we see people struggling with performance
> when
> >>> they
> >>> >> just
> >>> >> >> have some custom step doing heavy computation that block
> the main
> >>> >> thread
> >>> >> >> constantly, preventing all the very quick tasks to run in //.
> >>> >> >>
> >>> >> >> 2) too much logs
> >>> >> >> In this case, there is not much to do beside reducing the log
> >>> >> amount.
> >>> >> >> This
> >>> >> >> would be the time to switch to a multi-master setup, where
> you
> >>> put 2
> >>> >> >> masters for builds, and one master for web UI.
> >>> >> >> You can put those in the same machine/VM, no problem, the
> only
> >>> work
> >>> >> is
> >>> >> >> to
> >>> >> >> have separate processes that each have several event
> queues. You
> >>> can
> >>> >> use
> >>> >> >> docker-compose or kubernetes in order to more easily
> create such
> >>> >> >> deployment. We don't have readily useable for that, but
> several
> >>> >> people
> >>> >> >> have
> >>> >> >> done and documented it, for example
> >>> >> >> https://github.com/pop/buildbot-on-kubernetes
> >>> >> >>
> >>> >> >>
> >>> >> >> I have developed the buildbot profiler in order to quickly
> find
> >>> >> those.
> >>> >> >> You
> >>> >> >> just have to install it as a plugin and start a profile
> whenever
> >>> the
> >>> >> >> buildbot feels slow.
> >>> >> >> It is a statistical profiler, so it will not significantly
> >>> change the
> >>> >> >> actual performance so it is safe to run in production.
> >>> >> >>
> >>> >> >> https://pypi.org/project/buildbot-profiler/
> >>> >> >>
> >>> >> >>
> >>> >> >> Regards,
> >>> >> >> Pierre
> >>> >> >>
> >>> >> >>
> >>> >> >> Le mar. 12 janv. 2021 à 01:29, Yngve N. Pettersen
> >>> >> <yngve at vivaldi.com <mailto:yngve at vivaldi.com>> a
> >>> >> >> écrit :
> >>> >> >>
> >>> >> >>> Hello all,
> >>> >> >>>
> >>> >> >>> We have just upgraded our buildbot system from 2.7 to 2.10.
> >>> >> >>>
> >>> >> >>> However, I am noticing performance issues when loading
> these
> >>> pages:
> >>> >> >>>
> >>> >> >>> Builds->Builders
> >>> >> >>> Builds->Workers
> >>> >> >>> individual builds
> >>> >> >>>
> >>> >> >>> Loading these can take several minutes, although there are
> >>> periods
> >>> >> of
> >>> >> >>> immediate responses.
> >>> >> >>>
> >>> >> >>> What I am seeing on the buildbot manager machine is that
> the
> >>> Python3
> >>> >> >>> process hits 90-100% for the entire period.
> >>> >> >>>
> >>> >> >>> The Python version is 3.6.9 running on Ubuntu 18.04
> >>> >> >>>
> >>> >> >>> As far as I can tell, the Postgresql database is mostly
> idle
> >>> during
> >>> >> >>> this
> >>> >> >>> period. I did do a full vacuum a few hours ago, in case
> that
> >>> was the
> >>> >> >>> issue.
> >>> >> >>>
> >>> >> >>> There are about 40 builders, and 30 workers in the
> system, only
> >>> >> about
> >>> >> >>> 10-15 of these have a 10-20 builds for the past few days,
> >>> although
> >>> >> most
> >>> >> >>> of
> >>> >> >>> these have active histories of 3000 builds (which do make me
> >>> wonder
> >>> >> if
> >>> >> >>> the
> >>> >> >>> problem could be a lack of limiting the DB queries, at
> present I
> >>> >> have
> >>> >> >>> not
> >>> >> >>> inspected the DB queries).
> >>> >> >>>
> >>> >> >>> The individual builds can have very large log files in
> the build
> >>> >> steps,
> >>> >> >>> in
> >>> >> >>> many cases tens of thousands of lines (we _are_ talking
> about a
> >>> >> >>> Chromium
> >>> >> >>> based project).
> >>> >> >>>
> >>> >> >>> Our changes in the builders and workers JS code are
> minimal (we
> >>> are
> >>> >> >>> using
> >>> >> >>> a custom build of www-base), just using different
> information
> >>> for
> >>> >> the
> >>> >> >>> build labels (build version number), and grouping the
> builders,
> >>> >> which
> >>> >> >>> should not be causing any performance issues. (we have larger
> >>> >> changes
> >>> >> >>> in
> >>> >> >>> the individual builder view, where we include Git commit
> >>> messages,
> >>> >> and
> >>> >> >>> I
> >>> >> >>> have so far not seen any performance issues there)
> >>> >> >>>
> >>> >> >>> BTW: The line plots for build time and successes on
> builders
> >>> seems
> >>> >> to
> >>> >> >>> be
> >>> >> >>> MIA. Not sure if that is an upstream issue, or due to
> something
> >>> in
> >>> >> our
> >>> >> >>> www-base build.
> >>> >> >>>
> >>> >> >>> Do you have any suggestions for where to look for the
> cause of
> >>> the
> >>> >> >>> problem?
> >>> >> >>>
> >>> >> >>>
> >>> >> >>> --
> >>> >> >>> Sincerely,
> >>> >> >>> Yngve N. Pettersen
> >>> >> >>> Vivaldi Technologies AS
> >>> >> >>> _______________________________________________
> >>> >> >>> users mailing list
> >>> >> >>> users at buildbot.net <mailto:users at buildbot.net>
> >>> >> >>> https://lists.buildbot.net/mailman/listinfo/users
> >>> >> >>>
> >>> >> >
> >>> >> >
> >>> >>
> >>> >>
> >>> >> --
> >>> >> Sincerely,
> >>> >> Yngve N. Pettersen
> >>> >> Vivaldi Technologies AS
> >>> >>
> >>>
> >>>
> >>> --
> >>> Sincerely,
> >>> Yngve N. Pettersen
> >>> Vivaldi Technologies AS
> >>> _______________________________________________
> >>> users mailing list
> >>> users at buildbot.net <mailto:users at buildbot.net>
> >>> https://lists.buildbot.net/mailman/listinfo/users
> >>>
> >
> >
>
>
> --
> Sincerely,
> Yngve N. Pettersen
> Vivaldi Technologies AS
>
>
> _______________________________________________
> users mailing list
> users at buildbot.net
> https://lists.buildbot.net/mailman/listinfo/users
>
More information about the users
mailing list