[users at bb.net] Performance issues in 2.10
Yngve N. Pettersen
yngve at vivaldi.com
Wed Jan 13 14:42:14 UTC 2021
Hi,
I've been thinking a bit.
I am not familiar with the defer system, so I don't know the deep details
about it, so please take the following speculation with some big spoons of
salt.
What I started wondering about is, what happens with the deferring system
if the following happens?
- The list of builders is requested
- for each builder the 1000 most recent builds are requested
- for each build the change entries are requests
- for each build or change entry the source stamps are requested
Could this lead to such a long queue of deferred function calls that the
server gets bogged down handling them, not being able to even forward data
that are ready to the web front-end?
I currently suspect that there is a missing restriction (either numbers of
builds or time-range) on requesting information about the individual
builds.
Requesting a list of 1000 builds for each builder is probably OK as the
query should be quick, but the question is how many of those builds one
should requests deep details of?
Another, related aspect of what I saw in the DB log is that there were
many requests for single items, especially for source stamps and changes.
Maybe they could be gathered together into a single requests for X number
of records? That does require keeping a separate list/queue with the items
that has been requested, and might be complicated to implement.
On Wed, 13 Jan 2021 11:32:37 +0100, Povilas Kanapickas <povilas at radix.lt>
wrote:
> Hi,
>
> Looking into diff between 2.7 and 2.10, pretty much the only difference
> in the existing data and DB APIs are related to changes. So I think it's
> very likely there's something not obvious we didn't consider.
>
> I will build a test case and see if I can reproduce the problem. Long
> term I think we need to have tests to prevent this kind of issues. I'll
> think about how this can be done.
>
> Yngve, thanks a lot for your work investigating this issue.
>
> Cheers,
> Povilas
>
> On 1/13/21 12:26 PM, Pierre Tardy wrote:
>> Hello,
>>
>> getPrevSuccessfulBuild is called by getChanges for build which in turn
>> is called by /builds/NN/changes Rest API.
>> the bug Vlad was referring to was a perf issue on the /changes API,
>> which has been fixed a while back.
>>
>> Indeed, this algorithm is far from optimized,but I don't see why this
>> would lead to main thread blocking. Looking at the code, I see that
>> there are no big loops that do not yield to the main reactor loop.
>>
>> I insist on the buildbot profiler. What I was saying before is that you
>> need to hit the record button before the problem appears, and put a
>> large enough record time to be sure to catch a spike.
>> Then, you will be able to zoom to the cpu spike and catch the issue
>> precisely.
>>
>> If the spike is in the order of minutes like you said, you can configure
>> it like this and get enough samples to get enough evidence to where the
>> code is actually spending time:
>>
>> ProfilerService(frequency=500, gatherperiod=60 * 60, mode='virtual',
>> basepath=None, wantBuilds=100
>>
>> This will record for one hour, and mitigate the memory used if you worry
>> about it.
>>
>> Pierre
>>
>>
>> Le mer. 13 janv. 2021 à 11:01, Yngve N. Pettersen <yngve at vivaldi.com
>> <mailto:yngve at vivaldi.com>> a écrit :
>>
>>
>> Hi again,
>>
>> I was just able to get a partial database log from a freeze incident
>> when
>> refreshing the Builds->Builders page.
>>
>> It looks like Vlad is on the right track.
>>
>> There are a *lot* of individual source stamp requests, but also
>> requests
>> to the builds and change tables.
>>
>> An interesting part of the builds request is this request:
>>
>> SELECT builds.id <http://builds.id>, builds.number,
>> builds.builderid, builds.buildrequestid,
>> builds.workerid, builds.masterid, builds.started_at,
>> builds.complete_at,
>> builds.state_string, builds.results
>> FROM builds
>> WHERE builds.builderid = 51 AND builds.number < 46 AND
>> builds.results = 0 ORDER BY builds.complete_at DESC
>> LIMIT 1000 OFFSET 0
>>
>> which appears to then be followed by a lot of changes and source
>> stamp
>> requests.
>>
>> The log contains a lot of these requests per second; according to
>> the DB
>> graph 200 to 400 per second.
>>
>> The 1000 limit appears to come from
>> db.builds.BuildsConnectorComponent.getPrevSuccessfulBuild(), but
>> that
>> value seems to have been that way for a while, so the problem is
>> likely
>> caused by something else. This function did show up at the beginning
>> of my
>> traces related to these freezes.
>>
>>
>> One possibility that I can think of, is that several of these pages,
>> or the functions they are using, are no longer restricting how
>> far back
>> in the build history they are fetching build information for.
>> E.g the Builders page is only supposed to show a couple of days of
>> builds
>> for each builder, so there should be no need to fetch data for a
>> 1000 builds (making sure you have the build ids is one thing,
>> fetching all the associated data even for builds that are not to be
>> displayed is
>> something else).
>>
>> BTW, I have noticed that another page, the waterfall, is not
>> displaying anything, even after waiting for a very long time.
>>
>>
>> On Wed, 13 Jan 2021 01:34:57 +0100, Yngve N. Pettersen
>> <yngve at vivaldi.com <mailto:yngve at vivaldi.com>>
>> wrote:
>>
>> > Hi,
>> >
>> > Thanks for that info.
>> >
>> > In my case the problem is apparently something that happens now
>> and then.
>> >
>> > As mentioned, I have seen it on the Builds->Builders and
>> Builds->Workers > pages, neither of which includes any changelog
>> access AFAIK.
>> >
>> > I have also seen it occasionally on individual build pages, which
>> has a > log of steps with logs, and a changelog panel.
>> >
>> > Just a few minutes ago I saw this freeze/spike happen while the
>> buildbot > manager was completely idle, since all active tasks
>> had completed,
>> and I > had paused all workers since I needed to restart the
>> manager (due
>> to the > hanging build).
>> >
>> > I have also had reports about the Grid view and Console pages
>> displaying > this issue, but have not seen it myself.
>> >
>> > At present I have enabled logging in the postgresql server, so
>> maybe I > can figure out what requests are handled during the
>> spike.
>> >
>> >
>> >
>> > On Wed, 13 Jan 2021 00:38:34 +0100, Vlad Bogolin <vlad at mariadb.org
>> <mailto:vlad at mariadb.org>> > wrote:
>> >
>> >> Hi,
>> >>
>> >> I have experienced some similar interface freezes while trying to
>> >> configure
>> >> our version of buildbot. I now remember two cases:
>> >>
>> >> 1) A "changes" API problem where it seemed that the "limit"
>> argument was
>> >> ignored in some cases which translated into a full changes table
>> scan. >> This
>> >> was reproducible when hitting the "Builds > Last Changes"
>> dashboard and
>> >> then all the other pages were frozen. There are other requests to
>> >> changes,
>> >> so this may be related to the Builds page too. Also, this only
>> >> happened when the number of changes from the db was high. I was
>> >> planning on
>> >> submitting a proper fix, but we are running a custom version of
>> 2.7.1 >> where
>> >> I implemented a fast workaround and did not managed to submit a
>> proper >> fix
>> >> (hope to be able to do it next week).
>> >>
>> >> 2) We experienced the same issue as you describe when a lot of
>> logs >> where
>> >> coming (which seems to be your case too) and the master process
>> was
>> >> overwhelmed when multiple builds were running in the same time
>> (constant
>> >> CPU usage around ~120%). We solved the issue by switching to
>> multi >> master
>> >> and limiting the amount of logs, but if you say that this was not
>> an >> issue
>> >> in 2.7 I would really be interested in finding out what is the
>> root >> cause
>> >> (I thought it was the high amount of logs). You can test this
>> >> hypothesis by
>> >> limiting the amount of running builds and see if the issue keeps
>> >> reproducing.
>> >>
>> >> What worked for me in order to find out the "changes" API
>> problem was
>> >> visiting each dashboard and see if the freeze occurs or not.
>> >>
>> >> Hope this helps!
>> >>
>> >> Cheers,
>> >> Vlad
>> >>
>> >> On Wed, Jan 13, 2021 at 12:40 AM Yngve N. Pettersen
>> <yngve at vivaldi.com <mailto:yngve at vivaldi.com>>
>> >> wrote:
>> >>
>> >>> On Tue, 12 Jan 2021 22:13:50 +0100, Pierre Tardy
>> <tardyp at gmail.com <mailto:tardyp at gmail.com>> >>> wrote:
>> >>>
>> >>> > Thanks for the update.
>> >>> >
>> >>> > Some random thoughts...
>> >>> >
>> >>> > You should probably leave the profiler open until you get the
>> >>> performance
>> >>> > spike.
>> >>> > If you are inside the spike when starting, indeed, you won't
>> be able >>> to
>> >>> > start profiler, but if it is started before the spike it for
>> sure >>> will
>> >>> > detect exactly where the code is.
>> >>>
>> >>> I did have the profiler open in this latest case; as far as I
>> could >>> tell
>> >>> it still didn't start recording until after the spike ended
>> (there was >>> no
>> >>> progress information in the recorder line).
>> >>>
>> >>> The two major items showing up were
>> >>>
>> >>> /buildbot/db/builds.py+91:getPrevSuccessfulBuild
>> >>> /buildbot/db/pool.py+190:__thd
>> >>>
>> >>> but I think they were recorded after the spike.
>> >>>
>> >>> I am planning to activate more detailed logging in the
>> postgresql >>> server,
>> >>> but have not done that yet (probably need to shut down and
>> restart
>> >>> buildbot when I do).
>> >>>
>> >>>
>> >>> BTW, I suspect that this issue can also cause trouble for builds
>> whose
>> >>> steps ends at the time the problem is occurring; I just noticed
>> a task
>> >>> that is still running more than 4 hours after it started a step
>> that
>> >>> should have been killed after 20 minutes if it was hanging. It
>> should
>> >>> have
>> >>> ended at about the time one of the hangs was occuring. And it is
>> >>> impossible to stop the task for some reason, even shutting down
>> the
>> >>> worker
>> >>> process did not work. AFAIK the only way to fix the issue is to
>> shut >>> the
>> >>> buildbot manager down.
>> >>>
>> >>> > statistic profiling will use timer interrupts which will
>> preempt >>> anything
>> >>> > that is running, and make a call stack trace.
>> >>> >
>> >>> > Waiting for repro, if, from the db log, you manage to get the
>> info of
>> >>> > what
>> >>> > kind of db data that is, maybe we can narrow down the usual
>> >>> suspects..
>> >>> >
>> >>> > If there are lots of short selects like you said, usually, you
>> would
>> >>> > have a
>> >>> > back and forth from reactor thread to db thread, so it sounds
>> weird.
>> >>> > What can be leading to your behavior is that whatever is
>> halting the
>> >>> > processing, everything is queued up in between, and unqueued
>> when it >>> is
>> >>> > finished, which could lead to spike of db actions in the end
>> of the
>> >>> > event.
>> >>>
>> >>> The DB actions were going on for the entire 3 minutes that spike
>> >>> lasted;
>> >>> it is not a burst at either end, but a ~180 second long
>> continuous
>> >>> sequence (or barrage) of approximately 70-90000 transactions, if
>> I am
>> >>> interpreting the graph data correctly.
>> >>>
>> >>> > Regards
>> >>> > Pierre
>> >>> >
>> >>> >
>> >>> > Le mar. 12 janv. 2021 à 21:49, Yngve N. Pettersen >>>
>> <yngve at vivaldi.com <mailto:yngve at vivaldi.com>> a
>> >>> > écrit :
>> >>> >
>> >>> >> Hi again,
>> >>> >>
>> >>> >> A bit of an update.
>> >>> >>
>> >>> >> I have not been able to locate the issue using the profiler.
>> >>> >>
>> >>> >> It seems that when Buildbot gets into the problematic mode,
>> then the
>> >>> >> profiler is not able to work at all. It only starts
>> collecting >>> after the
>> >>> >> locked mode is resolved.
>> >>> >>
>> >>> >> It does seem like the locked mode occurs when Buildbot is
>> fetching >>> a lot
>> >>> >> of data from the DB and then spends a lot of time processing
>> that >>> data,
>> >>> >> without yielding to other processing needs.
>> >>> >>
>> >>> >> Looking at the monitoring of the server, it also appears that
>> >>> buildbot
>> >>> >> is
>> >>> >> fetching a lot of data. During the most recent instance, the
>> >>> returned
>> >>> >> tuples count in the graph for the server indicates three
>> minutes >>> of, on
>> >>> >> average 25000 tuples returned, with spikes to 80K and 100K,
>> per >>> second.
>> >>> >>
>> >>> >> The number of open connections rose to 6 or 7, and the
>> transaction >>> count
>> >>> >> was 400-500 per second during the whole time (rolled back
>> >>> transactions,
>> >>> >> which I assume is just one or more selects).
>> >>> >>
>> >>> >> IMO this makes it look like, while requesting these data,
>> Buildbot >>> is
>> >>> >> *synchronously* querying the DB and processing the returned
>> data, >>> not
>> >>> >> yielding. It might also be that it is requesting data more
>> data >>> than it
>> >>> >> needs, and also requesting other data earlier than it is
>> actually
>> >>> >> needed.
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> On Tue, 12 Jan 2021 12:48:40 +0100, Yngve N. Pettersen
>> >>> >> <yngve at vivaldi.com <mailto:yngve at vivaldi.com>>
>> >>> >>
>> >>> >> wrote:
>> >>> >>
>> >>> >> > Hi,
>> >>> >> >
>> >>> >> > IIRC the only real processing in our system that might be
>> heavy is
>> >>> >> done
>> >>> >> > via logobserver.LineConsumerLogObserver in a class (now)
>> derived >>> from
>> >>> >> > ShellCommandNewStyle, so if that is the issue, and
>> deferToThread >>> is
>> >>> >> the
>> >>> >> > solution, then if it isn't already done, my suggestion
>> would be to
>> >>> >> > implement that inside the code handling the log observers.
>> >>> >> >
>> >>> >> > I've tested the profiler a little, but haven't seen any
>> samples >>> within
>> >>> >> > our code so far, just inside buildbot, quite a lot of log
>> DB >>> actions,
>> >>> >> > also some TLS activity.
>> >>> >> >
>> >>> >> > The performance issue for those pages seems to be a bit
>> flaky; at
>> >>> >> > present its not happening AFAICT
>> >>> >> >
>> >>> >> > On Tue, 12 Jan 2021 10:59:42 +0100, Pierre Tardy >>>
>> <tardyp at gmail.com <mailto:tardyp at gmail.com>>
>> >>> >> > wrote:
>> >>> >> >
>> >>> >> >> Hello,
>> >>> >> >>
>> >>> >> >> A lot of things happen between 2.7 and 2.10, although I
>> don't see
>> >>> >> >> anything
>> >>> >> >> which could impact the performance that much. (maybe new
>> reporter
>> >>> >> >> framework, but really not convinced)
>> >>> >> >> If you see that the db is underutilized this must be a
>> classical
>> >>> >> reactor
>> >>> >> >> starvation.
>> >>> >> >> With asynchronous systems like buildbot, you shouldn't do
>> any >>> heavy
>> >>> >> >> computation in the main event loop thread, those must be
>> done in >>> a
>> >>> >> >> thread
>> >>> >> >> via deferToThread and co.
>> >>> >> >>
>> >>> >> >> Those are the common issues you can have with performance
>> >>> >> >> independantly from upgrade regressions:
>> >>> >> >>
>> >>> >> >> 1) Custom steps:
>> >>> >> >> A lot of time, we see people struggling with performance
>> when >>> they
>> >>> >> just
>> >>> >> >> have some custom step doing heavy computation that block
>> the main
>> >>> >> thread
>> >>> >> >> constantly, preventing all the very quick tasks to run in
>> //.
>> >>> >> >>
>> >>> >> >> 2) too much logs
>> >>> >> >> In this case, there is not much to do beside reducing
>> the log
>> >>> >> amount.
>> >>> >> >> This
>> >>> >> >> would be the time to switch to a multi-master setup, where
>> you >>> put 2
>> >>> >> >> masters for builds, and one master for web UI.
>> >>> >> >> You can put those in the same machine/VM, no problem, the
>> only >>> work
>> >>> >> is
>> >>> >> >> to
>> >>> >> >> have separate processes that each have several event
>> queues. You
>> >>> can
>> >>> >> use
>> >>> >> >> docker-compose or kubernetes in order to more easily
>> create such
>> >>> >> >> deployment. We don't have readily useable for that, but
>> several
>> >>> >> people
>> >>> >> >> have
>> >>> >> >> done and documented it, for example
>> >>> >> >> https://github.com/pop/buildbot-on-kubernetes
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> I have developed the buildbot profiler in order to quickly
>> find
>> >>> >> those.
>> >>> >> >> You
>> >>> >> >> just have to install it as a plugin and start a profile
>> whenever >>> the
>> >>> >> >> buildbot feels slow.
>> >>> >> >> It is a statistical profiler, so it will not significantly
>> >>> change the
>> >>> >> >> actual performance so it is safe to run in production.
>> >>> >> >>
>> >>> >> >> https://pypi.org/project/buildbot-profiler/
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> Regards,
>> >>> >> >> Pierre
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> Le mar. 12 janv. 2021 à 01:29, Yngve N. Pettersen
>> >>> >> <yngve at vivaldi.com <mailto:yngve at vivaldi.com>> a
>> >>> >> >> écrit :
>> >>> >> >>
>> >>> >> >>> Hello all,
>> >>> >> >>>
>> >>> >> >>> We have just upgraded our buildbot system from 2.7 to
>> 2.10.
>> >>> >> >>>
>> >>> >> >>> However, I am noticing performance issues when loading
>> these >>> pages:
>> >>> >> >>>
>> >>> >> >>> Builds->Builders
>> >>> >> >>> Builds->Workers
>> >>> >> >>> individual builds
>> >>> >> >>>
>> >>> >> >>> Loading these can take several minutes, although there
>> are >>> periods
>> >>> >> of
>> >>> >> >>> immediate responses.
>> >>> >> >>>
>> >>> >> >>> What I am seeing on the buildbot manager machine is that
>> the >>> Python3
>> >>> >> >>> process hits 90-100% for the entire period.
>> >>> >> >>>
>> >>> >> >>> The Python version is 3.6.9 running on Ubuntu 18.04
>> >>> >> >>>
>> >>> >> >>> As far as I can tell, the Postgresql database is mostly
>> idle >>> during
>> >>> >> >>> this
>> >>> >> >>> period. I did do a full vacuum a few hours ago, in case
>> that >>> was the
>> >>> >> >>> issue.
>> >>> >> >>>
>> >>> >> >>> There are about 40 builders, and 30 workers in the
>> system, only
>> >>> >> about
>> >>> >> >>> 10-15 of these have a 10-20 builds for the past few days,
>> >>> although
>> >>> >> most
>> >>> >> >>> of
>> >>> >> >>> these have active histories of 3000 builds (which do
>> make me
>> >>> wonder
>> >>> >> if
>> >>> >> >>> the
>> >>> >> >>> problem could be a lack of limiting the DB queries, at
>> present I
>> >>> >> have
>> >>> >> >>> not
>> >>> >> >>> inspected the DB queries).
>> >>> >> >>>
>> >>> >> >>> The individual builds can have very large log files in
>> the build
>> >>> >> steps,
>> >>> >> >>> in
>> >>> >> >>> many cases tens of thousands of lines (we _are_ talking
>> about a
>> >>> >> >>> Chromium
>> >>> >> >>> based project).
>> >>> >> >>>
>> >>> >> >>> Our changes in the builders and workers JS code are
>> minimal (we >>> are
>> >>> >> >>> using
>> >>> >> >>> a custom build of www-base), just using different
>> information >>> for
>> >>> >> the
>> >>> >> >>> build labels (build version number), and grouping the
>> builders,
>> >>> >> which
>> >>> >> >>> should not be causing any performance issues. (we have
>> larger
>> >>> >> changes
>> >>> >> >>> in
>> >>> >> >>> the individual builder view, where we include Git commit
>> >>> messages,
>> >>> >> and
>> >>> >> >>> I
>> >>> >> >>> have so far not seen any performance issues there)
>> >>> >> >>>
>> >>> >> >>> BTW: The line plots for build time and successes on
>> builders >>> seems
>> >>> >> to
>> >>> >> >>> be
>> >>> >> >>> MIA. Not sure if that is an upstream issue, or due to
>> something >>> in
>> >>> >> our
>> >>> >> >>> www-base build.
>> >>> >> >>>
>> >>> >> >>> Do you have any suggestions for where to look for the
>> cause of >>> the
>> >>> >> >>> problem?
>> >>> >> >>>
>> >>> >> >>>
>> >>> >> >>> --
>> >>> >> >>> Sincerely,
>> >>> >> >>> Yngve N. Pettersen
>> >>> >> >>> Vivaldi Technologies AS
>> >>> >> >>> _______________________________________________
>> >>> >> >>> users mailing list
>> >>> >> >>> users at buildbot.net <mailto:users at buildbot.net>
>> >>> >> >>> https://lists.buildbot.net/mailman/listinfo/users
>> >>> >> >>>
>> >>> >> >
>> >>> >> >
>> >>> >>
>> >>> >>
>> >>> >> --
>> >>> >> Sincerely,
>> >>> >> Yngve N. Pettersen
>> >>> >> Vivaldi Technologies AS
>> >>> >>
>> >>>
>> >>>
>> >>> --
>> >>> Sincerely,
>> >>> Yngve N. Pettersen
>> >>> Vivaldi Technologies AS
>> >>> _______________________________________________
>> >>> users mailing list
>> >>> users at buildbot.net <mailto:users at buildbot.net>
>> >>> https://lists.buildbot.net/mailman/listinfo/users
>> >>>
>> >
>> >
>>
>>
>> --
>> Sincerely,
>> Yngve N. Pettersen
>> Vivaldi Technologies AS
>>
>>
>> _______________________________________________
>> users mailing list
>> users at buildbot.net
>> https://lists.buildbot.net/mailman/listinfo/users
>>
>
--
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS
More information about the users
mailing list