<div dir="ltr"><div>Hello,</div><div><br></div><div>getPrevSuccessfulBuild is called by getChanges for a build, which in turn is called by the /builds/NN/changes REST API.<br></div><div>The bug Vlad was referring to was a perf issue on the /changes API, which was fixed a while back.<br><div><br></div><div>Indeed, this algorithm is far from optimized, but I don't see why this would lead to main thread blocking. Looking at the code, I see no big loops that fail to yield to the main reactor loop.</div><div><br></div><div>I insist on using the buildbot profiler. What I was saying before is that you need to hit the record button before the problem appears, and set a large enough record time to be sure to catch a spike.</div><div>Then, you will be able to zoom in on the CPU spike and catch the issue precisely.</div><div><br></div><div>If the spike is on the order of minutes, as you said, you can configure it like this and get enough samples to show where the code is actually spending time:</div><div><pre style="box-sizing:inherit;overflow:auto;font-family:'Source Code Pro',monospace;font-size:0.85rem;margin-top:30px;margin-bottom:0px;padding:15px;background-color:rgb(249,249,249);border:1px solid rgb(211,211,211);color:rgb(108,108,108)"><span class="gmail-n" style="box-sizing:inherit">ProfilerService</span><span class="gmail-p" style="box-sizing:inherit">(</span><span class="gmail-n" style="box-sizing:inherit">frequency</span><span class="gmail-o" style="box-sizing:inherit">=</span><span class="gmail-mi" style="box-sizing:inherit;color:rgb(17,106,30)">500</span><span class="gmail-p" style="box-sizing:inherit">,</span> <span class="gmail-n" style="box-sizing:inherit">gatherperiod</span><span class="gmail-o" style="box-sizing:inherit">=</span><span class="gmail-mi" style="box-sizing:inherit;color:rgb(17,106,30)">60</span> <span class="gmail-o" style="box-sizing:inherit">*</span> <span class="gmail-mi" 
style="box-sizing:inherit;color:rgb(17,106,30)">60</span><span class="gmail-p" style="box-sizing:inherit">,</span> <span class="gmail-n" style="box-sizing:inherit">mode</span><span class="gmail-o" style="box-sizing:inherit">=</span><span class="gmail-s1" style="box-sizing:inherit;color:rgb(213,45,64)">'virtual'</span><span class="gmail-p" style="box-sizing:inherit">,</span> <span class="gmail-n" style="box-sizing:inherit">basepath</span><span class="gmail-o" style="box-sizing:inherit">=</span><span class="gmail-kc" style="box-sizing:inherit;color:rgb(17,106,30)">None</span><span class="gmail-p" style="box-sizing:inherit">,</span> <span class="gmail-n" style="box-sizing:inherit">wantBuilds</span><span class="gmail-o" style="box-sizing:inherit">=</span><span class="gmail-mi" style="box-sizing:inherit;color:rgb(17,106,30)">100</span><span class="gmail-p" style="box-sizing:inherit">)</span></pre><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">This will record for one hour and limit the memory used, if you are worried about it.</div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><br></div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">Pierre</div></div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 13 Jan 2021 at 11:01, Yngve N. Pettersen <<a href="mailto:yngve@vivaldi.com">yngve@vivaldi.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
Hi again,<br>
<br>
I was just able to get a partial database log from a freeze incident when<br>
refreshing the Builds->Builders page.<br>
<br>
It looks like Vlad is on the right track.<br>
<br>
There are a *lot* of individual source stamp requests, but also requests<br>
to the builds and change tables.<br>
<br>
An interesting part of the builds requests is this query:<br>
<br>
SELECT <a href="http://builds.id" rel="noreferrer" target="_blank">builds.id</a>, builds.number, builds.builderid, builds.buildrequestid,<br>
builds.workerid, builds.masterid, builds.started_at, builds.complete_at,<br>
builds.state_string, builds.results<br>
FROM builds<br>
WHERE builds.builderid = 51 AND builds.number < 46 AND<br>
builds.results = 0 ORDER BY builds.complete_at DESC<br>
LIMIT 1000 OFFSET 0<br>
<br>
which appears to then be followed by a lot of changes and source stamp<br>
requests.<br>
<br>
The log contains a lot of these requests per second; according to the DB<br>
graph 200 to 400 per second.<br>
<br>
The 1000 limit appears to come from<br>
db.builds.BuildsConnectorComponent.getPrevSuccessfulBuild(), but that<br>
value seems to have been that way for a while, so the problem is likely<br>
caused by something else. This function did show up at the beginning of my<br>
traces related to these freezes.<br>
<br>
<br>
One possibility I can think of is that several of these pages, or <br>
the functions they are using, are no longer restricting how far back in <br>
the build history they fetch build information. E.g., the <br>
Builders page is only supposed to show a couple of days of builds for each <br>
builder, so there should be no need to fetch data for 1000 builds <br>
(making sure you have the build ids is one thing, fetching all the <br>
associated data even for builds that are not to be displayed is something <br>
else).<br>
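<br>
A minimal sketch of the distinction I mean, using an in-memory SQLite table: first fetch only the ids of the builds inside the display window, then fetch full rows just for those. The table layout, the two-day window and the function names are all made up for illustration; this is not Buildbot's actual schema or code:<br>
<pre>
```python
import sqlite3

# Hypothetical, simplified table; not Buildbot's actual schema.
SCHEMA = """
CREATE TABLE builds (
    id INTEGER PRIMARY KEY,
    builderid INTEGER,
    number INTEGER,
    complete_at INTEGER,
    state_string TEXT
)
"""


def make_db(builds_per_builder=1000, builders=(51, 52)):
    """Create an in-memory database with one build per hour per builder."""
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    rows = [
        (builderid, number, number * 3600, "build successful")
        for builderid in builders
        for number in range(1, builds_per_builder + 1)
    ]
    db.executemany(
        "INSERT INTO builds (builderid, number, complete_at, state_string) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    return db


def recent_build_ids(db, builderid, now, window_seconds):
    """Cheap query: only the ids of builds completed inside the display window."""
    cur = db.execute(
        "SELECT id FROM builds WHERE builderid = ? "
        "AND complete_at BETWEEN ? AND ? ORDER BY complete_at DESC",
        (builderid, now - window_seconds, now),
    )
    return [row[0] for row in cur]


def build_details(db, build_ids):
    """Expensive part: full rows, fetched only for the builds to be shown."""
    if not build_ids:
        return []
    marks = ",".join("?" * len(build_ids))
    cur = db.execute(
        "SELECT id, builderid, number, complete_at, state_string "
        f"FROM builds WHERE id IN ({marks}) ORDER BY complete_at DESC",
        build_ids,
    )
    return cur.fetchall()


if __name__ == "__main__":
    db = make_db()
    now = 1000 * 3600          # completion time of the newest build
    two_days = 2 * 24 * 3600
    ids = recent_build_ids(db, 51, now, two_days)
    print(len(ids), "of 1000 builds need their details fetched")
    print(build_details(db, ids)[0])
```
</pre>
With 1000 builds per builder at one build per hour, only the few dozen builds from the last two days have their details fetched, rather than all 1000.<br>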
<br>
BTW, I have noticed that another page, the waterfall, is not displaying <br>
anything, even after waiting for a very long time.<br>
<br>
<br>
On Wed, 13 Jan 2021 01:34:57 +0100, Yngve N. Pettersen <<a href="mailto:yngve@vivaldi.com" target="_blank">yngve@vivaldi.com</a>><br>
wrote:<br>
<br>
> Hi,<br>
><br>
> Thanks for that info.<br>
><br>
> In my case the problem is apparently something that happens now and then.<br>
><br>
> As mentioned, I have seen it on the Builds->Builders and Builds->Workers <br>
> pages, neither of which includes any changelog access AFAIK.<br>
><br>
> I have also seen it occasionally on individual build pages, which have a <br>
> log of steps with logs, and a changelog panel.<br>
><br>
> Just a few minutes ago I saw this freeze/spike happen while the buildbot <br>
> manager was completely idle, since all active tasks had completed, and I <br>
> had paused all workers since I needed to restart the manager (due to the <br>
> hanging build).<br>
><br>
> I have also had reports about the Grid view and Console pages displaying <br>
> this issue, but have not seen it myself.<br>
><br>
> At present I have enabled logging in the postgresql server, so maybe I <br>
> can figure out what requests are handled during the spike.<br>
><br>
><br>
><br>
> On Wed, 13 Jan 2021 00:38:34 +0100, Vlad Bogolin <<a href="mailto:vlad@mariadb.org" target="_blank">vlad@mariadb.org</a>> <br>
> wrote:<br>
><br>
>> Hi,<br>
>><br>
>> I have experienced some similar interface freezes while trying to configure<br>
>> our version of buildbot. I now remember two cases:<br>
>><br>
>> 1) A "changes" API problem where it seemed that the "limit" argument was<br>
>> ignored in some cases, which translated into a full changes table scan.<br>
>> This was reproducible when hitting the "Builds > Last Changes" dashboard,<br>
>> after which all the other pages were frozen. There are other requests to<br>
>> changes, so this may be related to the Builds page too. Also, this only<br>
>> happened when the number of changes in the db was high. I was planning on<br>
>> submitting a proper fix, but we are running a custom version of 2.7.1 where<br>
>> I implemented a quick workaround and did not manage to submit a proper fix<br>
>> (I hope to be able to do it next week).<br>
>><br>
>> 2) We experienced the same issue as you describe when a lot of logs were<br>
>> coming in (which seems to be your case too) and the master process was<br>
>> overwhelmed when multiple builds were running at the same time (constant<br>
>> CPU usage around 120%). We solved the issue by switching to multi-master<br>
>> and limiting the amount of logs, but if you say that this was not an issue<br>
>> in 2.7, I would really be interested in finding out what the root cause is<br>
>> (I thought it was the high amount of logs). You can test this hypothesis by<br>
>> limiting the number of running builds and seeing if the issue keeps<br>
>> reproducing.<br>
>><br>
>> What worked for me in order to find the "changes" API problem was<br>
>> visiting each dashboard and seeing whether the freeze occurs or not.<br>
>><br>
>> Hope this helps!<br>
>><br>
>> Cheers,<br>
>> Vlad<br>
>><br>
>> On Wed, Jan 13, 2021 at 12:40 AM Yngve N. Pettersen <<a href="mailto:yngve@vivaldi.com" target="_blank">yngve@vivaldi.com</a>><br>
>> wrote:<br>
>><br>
>>> On Tue, 12 Jan 2021 22:13:50 +0100, Pierre Tardy <<a href="mailto:tardyp@gmail.com" target="_blank">tardyp@gmail.com</a>> <br>
>>> wrote:<br>
>>><br>
>>> > Thanks for the update.<br>
>>> ><br>
>>> > Some random thoughts...<br>
>>> ><br>
>>> > You should probably leave the profiler open until you get the performance<br>
>>> > spike.<br>
>>> > If you are inside the spike when starting, indeed, you won't be able to<br>
>>> > start the profiler, but if it is started before the spike it will for sure<br>
>>> > detect exactly where the code is.<br>
>>><br>
>>> I did have the profiler open in this latest case; as far as I could tell<br>
>>> it still didn't start recording until after the spike ended (there was no<br>
>>> progress information in the recorder line).<br>
>>><br>
>>> The two major items showing up were<br>
>>><br>
>>> /buildbot/db/builds.py+91:getPrevSuccessfulBuild<br>
>>> /buildbot/db/pool.py+190:__thd<br>
>>><br>
>>> but I think they were recorded after the spike.<br>
>>><br>
>>> I am planning to activate more detailed logging in the postgresql server,<br>
>>> but have not done that yet (probably need to shut down and restart<br>
>>> buildbot when I do).<br>
>>><br>
>>><br>
>>> BTW, I suspect that this issue can also cause trouble for builds whose<br>
>>> steps end at the time the problem is occurring; I just noticed a task<br>
>>> that is still running more than 4 hours after it started a step that<br>
>>> should have been killed after 20 minutes if it was hanging. It should have<br>
>>> ended at about the time one of the hangs was occurring. And it is<br>
>>> impossible to stop the task for some reason; even shutting down the worker<br>
>>> process did not work. AFAIK the only way to fix the issue is to shut the<br>
>>> buildbot manager down.<br>
>>><br>
>>> > Statistical profiling uses timer interrupts, which will preempt anything<br>
>>> > that is running and capture a call stack trace.<br>
>>> ><br>
>>> > While waiting for a repro, if you manage to get from the db log what<br>
>>> > kind of db data that is, maybe we can narrow down the usual suspects.<br>
>>> ><br>
>>> > If there are lots of short selects like you said, usually you would have a<br>
>>> > back and forth from the reactor thread to the db thread, so it sounds weird.<br>
>>> > What could lead to your behavior is that, whatever is halting the<br>
>>> > processing, everything is queued up in the meantime and unqueued when it is<br>
>>> > finished, which could lead to a spike of db actions at the end of the<br>
>>> > event.<br>
>>><br>
>>> The DB actions were going on for the entire 3 minutes that the spike lasted;<br>
>>> it is not a burst at either end, but a ~180 second long continuous<br>
>>> sequence (or barrage) of approximately 70,000-90,000 transactions, if I am<br>
>>> interpreting the graph data correctly.<br>
>>><br>
>>> > Regards<br>
>>> > Pierre<br>
>>> ><br>
>>> ><br>
>>> > On Tue, 12 Jan 2021 at 21:49, Yngve N. Pettersen<br>
>>> > <<a href="mailto:yngve@vivaldi.com" target="_blank">yngve@vivaldi.com</a>> wrote:<br>
>>> ><br>
>>> >> Hi again,<br>
>>> >><br>
>>> >> A bit of an update.<br>
>>> >><br>
>>> >> I have not been able to locate the issue using the profiler.<br>
>>> >><br>
>>> >> It seems that when Buildbot gets into the problematic mode, then the<br>
>>> >> profiler is not able to work at all. It only starts collecting after the<br>
>>> >> locked mode is resolved.<br>
>>> >><br>
>>> >> It does seem like the locked mode occurs when Buildbot is fetching a lot<br>
>>> >> of data from the DB and then spends a lot of time processing that data,<br>
>>> >> without yielding to other processing needs.<br>
>>> >><br>
>>> >> Looking at the monitoring of the server, it also appears that buildbot is<br>
>>> >> fetching a lot of data. During the most recent instance, the returned<br>
>>> >> tuples count in the graph for the server indicates three minutes of, on<br>
>>> >> average, 25000 tuples returned per second, with spikes to 80K and 100K.<br>
>>> >><br>
>>> >> The number of open connections rose to 6 or 7, and the transaction count<br>
>>> >> was 400-500 per second during the whole time (rolled back transactions,<br>
>>> >> which I assume is just one or more selects).<br>
>>> >><br>
>>> >> IMO this makes it look like, while requesting these data, Buildbot is<br>
>>> >> *synchronously* querying the DB and processing the returned data, not<br>
>>> >> yielding. It might also be that it is requesting more data than it<br>
>>> >> needs, and also requesting other data earlier than it is actually<br>
>>> >> needed.<br>
>>> >><br>
>>> >><br>
>>> >><br>
>>> >> On Tue, 12 Jan 2021 12:48:40 +0100, Yngve N. Pettersen<br>
>>> >> <<a href="mailto:yngve@vivaldi.com" target="_blank">yngve@vivaldi.com</a>><br>
>>> >><br>
>>> >> wrote:<br>
>>> >><br>
>>> >> > Hi,<br>
>>> >> ><br>
>>> >> > IIRC the only real processing in our system that might be heavy is done<br>
>>> >> > via logobserver.LineConsumerLogObserver in a class (now) derived from<br>
>>> >> > ShellCommandNewStyle, so if that is the issue, and deferToThread is the<br>
>>> >> > solution, then if it isn't already done, my suggestion would be to<br>
>>> >> > implement that inside the code handling the log observers.<br>
>>> >> ><br>
>>> >> > I've tested the profiler a little, but haven't seen any samples within<br>
>>> >> > our code so far, just inside buildbot, quite a lot of log DB actions,<br>
>>> >> > also some TLS activity.<br>
>>> >> ><br>
>>> >> > The performance issue for those pages seems to be a bit flaky; at<br>
>>> >> > present it's not happening AFAICT.<br>
>>> >> ><br>
>>> >> > On Tue, 12 Jan 2021 10:59:42 +0100, Pierre Tardy<br>
>>> >> > <<a href="mailto:tardyp@gmail.com" target="_blank">tardyp@gmail.com</a>><br>
>>> >> > wrote:<br>
>>> >> ><br>
>>> >> >> Hello,<br>
>>> >> >><br>
>>> >> >> A lot of things happened between 2.7 and 2.10, although I don't see<br>
>>> >> >> anything which could impact the performance that much (maybe the new<br>
>>> >> >> reporter framework, but I am really not convinced).<br>
>>> >> >> If you see that the db is underutilized, this must be classical reactor<br>
>>> >> >> starvation.<br>
>>> >> >> With asynchronous systems like buildbot, you shouldn't do any heavy<br>
>>> >> >> computation in the main event loop thread; that must be done in a thread<br>
>>> >> >> via deferToThread and co.<br>
>>> >> >><br>
>>> >> >> These are the common issues you can have with performance,<br>
>>> >> >> independently of upgrade regressions:<br>
>>> >> >><br>
>>> >> >> 1) Custom steps:<br>
>>> >> >> A lot of the time, we see people struggling with performance when they just<br>
>>> >> >> have some custom step doing heavy computation that blocks the main thread<br>
>>> >> >> constantly, preventing all the very quick tasks from running in parallel.<br>
>>> >> >><br>
>>> >> >> 2) Too many logs:<br>
>>> >> >> In this case, there is not much to do besides reducing the log amount.<br>
>>> >> >> This would be the time to switch to a multi-master setup, where you put 2<br>
>>> >> >> masters for builds, and one master for the web UI.<br>
>>> >> >> You can put those on the same machine/VM, no problem; the only work is to<br>
>>> >> >> have separate processes that each have several event queues. You can use<br>
>>> >> >> docker-compose or kubernetes in order to more easily create such a<br>
>>> >> >> deployment. We don't have anything readily usable for that, but several<br>
>>> >> >> people have done and documented it, for example<br>
>>> >> >> <a href="https://github.com/pop/buildbot-on-kubernetes" rel="noreferrer" target="_blank">https://github.com/pop/buildbot-on-kubernetes</a><br>
>>> >> >><br>
>>> >> >><br>
>>> >> >> I have developed the buildbot profiler in order to quickly find those.<br>
>>> >> >> You just have to install it as a plugin and start a profile whenever<br>
>>> >> >> buildbot feels slow.<br>
>>> >> >> It is a statistical profiler, so it will not significantly change the<br>
>>> >> >> actual performance, so it is safe to run in production.<br>
>>> >> >><br>
>>> >> >> <a href="https://pypi.org/project/buildbot-profiler/" rel="noreferrer" target="_blank">https://pypi.org/project/buildbot-profiler/</a><br>
>>> >> >><br>
>>> >> >><br>
>>> >> >> Regards,<br>
>>> >> >> Pierre<br>
>>> >> >><br>
>>> >> >><br>
>>> >> >> On Tue, 12 Jan 2021 at 01:29, Yngve N. Pettersen<br>
>>> >> >> <<a href="mailto:yngve@vivaldi.com" target="_blank">yngve@vivaldi.com</a>> wrote:<br>
>>> >> >><br>
>>> >> >>> Hello all,<br>
>>> >> >>><br>
>>> >> >>> We have just upgraded our buildbot system from 2.7 to 2.10.<br>
>>> >> >>><br>
>>> >> >>> However, I am noticing performance issues when loading these pages:<br>
>>> >> >>><br>
>>> >> >>> Builds->Builders<br>
>>> >> >>> Builds->Workers<br>
>>> >> >>> individual builds<br>
>>> >> >>><br>
>>> >> >>> Loading these can take several minutes, although there are periods of<br>
>>> >> >>> immediate responses.<br>
>>> >> >>><br>
>>> >> >>> What I am seeing on the buildbot manager machine is that the Python3<br>
>>> >> >>> process hits 90-100% CPU for the entire period.<br>
>>> >> >>><br>
>>> >> >>> The Python version is 3.6.9 running on Ubuntu 18.04<br>
>>> >> >>><br>
>>> >> >>> As far as I can tell, the Postgresql database is mostly idle during this<br>
>>> >> >>> period. I did do a full vacuum a few hours ago, in case that was the<br>
>>> >> >>> issue.<br>
>>> >> >>><br>
>>> >> >>> There are about 40 builders and 30 workers in the system; only about<br>
>>> >> >>> 10-15 of these have 10-20 builds for the past few days, although most of<br>
>>> >> >>> these have active histories of 3000 builds (which does make me wonder if<br>
>>> >> >>> the problem could be a lack of limiting the DB queries; at present I have<br>
>>> >> >>> not inspected the DB queries).<br>
>>> >> >>><br>
>>> >> >>> The individual builds can have very large log files in the build steps,<br>
>>> >> >>> in many cases tens of thousands of lines (we _are_ talking about a<br>
>>> >> >>> Chromium based project).<br>
>>> >> >>><br>
>>> >> >>> Our changes in the builders and workers JS code are minimal (we are using<br>
>>> >> >>> a custom build of www-base), just using different information for the<br>
>>> >> >>> build labels (build version number), and grouping the builders, which<br>
>>> >> >>> should not be causing any performance issues. (We have larger changes in<br>
>>> >> >>> the individual builder view, where we include Git commit messages, and I<br>
>>> >> >>> have so far not seen any performance issues there.)<br>
>>> >> >>><br>
>>> >> >>> BTW: The line plots for build time and successes on builders seem to be<br>
>>> >> >>> MIA. Not sure if that is an upstream issue, or due to something in our<br>
>>> >> >>> www-base build.<br>
>>> >> >>><br>
>>> >> >>> Do you have any suggestions for where to look for the cause of the<br>
>>> >> >>> problem?<br>
>>> >> >>><br>
>>> >> >>><br>
>>> >> >>> --<br>
>>> >> >>> Sincerely,<br>
>>> >> >>> Yngve N. Pettersen<br>
>>> >> >>> Vivaldi Technologies AS<br>
>>> >> >>> _______________________________________________<br>
>>> >> >>> users mailing list<br>
>>> >> >>> <a href="mailto:users@buildbot.net" target="_blank">users@buildbot.net</a><br>
>>> >> >>> <a href="https://lists.buildbot.net/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.buildbot.net/mailman/listinfo/users</a><br>
>>> >> >>><br>
>>> >> ><br>
>>> >> ><br>
>>> >><br>
>>> >><br>
>>> >> --<br>
>>> >> Sincerely,<br>
>>> >> Yngve N. Pettersen<br>
>>> >> Vivaldi Technologies AS<br>
>>> >><br>
>>><br>
>>><br>
>>> --<br>
>>> Sincerely,<br>
>>> Yngve N. Pettersen<br>
>>> Vivaldi Technologies AS<br>
>>><br>
><br>
><br>
<br>
<br>
-- <br>
Sincerely,<br>
Yngve N. Pettersen<br>
Vivaldi Technologies AS<br>
</blockquote></div>