<div dir="ltr">(Back to mailing list)<div><br></div><div>Thanks for this data. Here is how you would read it:</div><div><br></div><div><br></div><div><img src="cid:ii_kjwlcazz0" alt="image.png" width="1114" height="571" style="margin-right: 0px;"></div><div><div>In the left, you get the main thread activity. You can say that because the bottom frame(s) are about startReactor.</div><div>In the right, you get what happens in the threads. You can say that because we see threading.py in the last bottom frame.</div><div></div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><br></div><div class="gmail_signature" data-smartmail="gmail_signature">You can see that the reactor thread does pretty much nothing:</div><div class="gmail_signature" data-smartmail="gmail_signature"><img src="cid:ii_kjwlhm4e1" alt="image.png" width="779" height="409" style="margin-right: 0px;"><br></div><div class="gmail_signature" data-smartmail="gmail_signature">doPoll is just the reactor waiting for events (that's 1 third of the samples)</div><div class="gmail_signature" data-smartmail="gmail_signature">then the left part is about getting sourcestams, and waiting the dbthread.</div><div class="gmail_signature" data-smartmail="gmail_signature">then the right part is log management, which is usually the major part of a buildbot load.</div><div class="gmail_signature" data-smartmail="gmail_signature"><br></div><div class="gmail_signature" data-smartmail="gmail_signature">On the thread side, this is pretty similar picture. We can see that most of the load is managing source stamps.</div><div class="gmail_signature" data-smartmail="gmail_signature"><img src="cid:ii_kjwllxt42" alt="image.png" width="773" height="169" style="margin-right: 0px;"><br></div><div class="gmail_signature" data-smartmail="gmail_signature">So frankly, I wouldn't say this is an obvious profile of a blocked buildbot master.</div><div class="gmail_signature" data-smartmail="gmail_signature"><br></div><div class="gmail_signature" data-smartmail="gmail_signature">The reactor does pretty much nothing. Reading 300 sourcestamps requests by second from the db shouldn't be that much of a problem with modern HW.</div><div class="gmail_signature" data-smartmail="gmail_signature">We can see that the profiler itself has difficulties to run in time, as we see some profiler stacks.</div><div class="gmail_signature" data-smartmail="gmail_signature">BTW, I gave wrong advice last time, I told you to put 500HZ frequency while I meant 500ms timer. we should rather put a 2HZ to have 2 sample per second and per thread. This may avoid to disrupt too much the measurment (Heisenberg principle).</div><div class="gmail_signature" data-smartmail="gmail_signature"><br></div><div class="gmail_signature" data-smartmail="gmail_signature">So I thought about memory or disk starvation. Do you have any other load in the master that would starve CPU ressources?</div><div class="gmail_signature" data-smartmail="gmail_signature">I can see that there is 28GB memory free so no memory starvation. but the profiler do not output the loadaverage.</div><div class="gmail_signature" data-smartmail="gmail_signature">Do we have db connection pool or threadpool starvation?</div><div class="gmail_signature" data-smartmail="gmail_signature">Do we have disk io or network io starvation?</div><div class="gmail_signature" data-smartmail="gmail_signature"><br></div><div class="gmail_signature" data-smartmail="gmail_signature">Regards</div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">Pierre</div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Le jeu. 14 janv. 2021 à 00:09, Yngve N. Pettersen <<a href="mailto:yngve@vivaldi.com">yngve@vivaldi.com</a>> a écrit :<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>
<br>
I finally got a trace of the problem area.<br>
<br>
I wasn't able to make sense of the profile, maybe you can? zip attached<br>
<br>
The area of interest is the ~400 second segment towards the end, which <br>
matches the bump in the PNG.<br>
<br>
This happened while reloading the Build>Builders page after loading a <br>
build and then returning to the previous point in the history. Loading of <br>
the page took about 5 minutes of the 7 minutes of high load.<br>
<br>
The Python process hit ~120% CPU<br>
<br>
<br>
On Wed, 13 Jan 2021 18:35:26 +0100, Pierre Tardy <<a href="mailto:tardyp@gmail.com" target="_blank">tardyp@gmail.com</a>> wrote:<br>
<br>
>> Regarding the profiler, I also had some problems running it for a longer<br>
> time, so if you find a fix, please share it with me.<br>
> It was a bytes versus string issue indeed in the python parts.<br>
><br>
> I did struggle to build the project again as expected, but managed to <br>
> hack<br>
> something eventually.<br>
><br>
> see <a href="https://pypi.org/project/buildbot-profiler/1.3.1/" rel="noreferrer" target="_blank">https://pypi.org/project/buildbot-profiler/1.3.1/</a><br>
><br>
> I did verify the profile duration is now working.<br>
><br>
> Regards<br>
> Pierre<br>
><br>
><br>
> Le mer. 13 janv. 2021 à 12:49, Vlad Bogolin <<a href="mailto:vlad@mariadb.org" target="_blank">vlad@mariadb.org</a>> a écrit :<br>
><br>
>> Hi,<br>
>><br>
>> So the changes issue I was referring to doesn't seem to be fixed (I<br>
>> checked the latest Buildbot code) and looked back over my changes. The<br>
>> problem that I identified is here<br>
>> <a href="https://github.com/buildbot/buildbot/blob/9f1cac1d3bb61baa0b6c836cc18812a64cfa9c2b/master/buildbot/data/resultspec.py#L320" rel="noreferrer" target="_blank">https://github.com/buildbot/buildbot/blob/9f1cac1d3bb61baa0b6c836cc18812a64cfa9c2b/master/buildbot/data/resultspec.py#L320</a><br>
>> because is some cases there would be an unmatched_filter or<br>
>> unmatched_order. *So if this is indeed the case, you should see the<br>
>> warning in the buildbot log file:*<br>
>><br>
>> "Warning: limited data api query is not backed by db because of <br>
>> following<br>
>> filters..." as defined here<br>
>> <a href="https://github.com/buildbot/buildbot/blob/9f1cac1d3bb61baa0b6c836cc18812a64cfa9c2b/master/buildbot/data/resultspec.py#L322" rel="noreferrer" target="_blank">https://github.com/buildbot/buildbot/blob/9f1cac1d3bb61baa0b6c836cc18812a64cfa9c2b/master/buildbot/data/resultspec.py#L322</a><br>
>><br>
>> There are two reasons why this happens. One, an incomplete definition of<br>
>> the fieldMapping for the data/changes.py and this would be the fix:<br>
>><br>
>> +++ b/master/buildbot/data/changes.py<br>
>> @@ -42,6 +42,19 @@ class FixerMixin:<br>
>> return change<br>
>> fieldMapping = {<br>
>> 'changeid': '<a href="http://changes.id" rel="noreferrer" target="_blank">changes.id</a>',<br>
>> + 'author': 'changes.author',<br>
>> + 'committer': 'changes.committer',<br>
>> + 'comments': 'changes.comments',<br>
>> + 'branch': 'changes.branch',<br>
>> + 'revision': 'changes.revision',<br>
>> + 'revlink': 'changes.revlink',<br>
>> + 'when_timestamp': 'changes.when_timestamp',<br>
>> + 'category': 'changes.category',<br>
>> + 'repository': 'changes.repository',<br>
>> + 'codebase': 'changes.codebase',<br>
>> + 'project': 'changes.project',<br>
>> + 'sourcestampid': 'changes.sourcestampid',<br>
>> + 'parent_changeids': 'changes.parent_changeids',<br>
>> }<br>
>><br>
>> and two not having all the columns in the select statement in order to <br>
>> be<br>
>> able to check for matched/unmatched filtering or ordering as it is done<br>
>> here<br>
>> <a href="https://github.com/buildbot/buildbot/blob/9f1cac1d3bb61baa0b6c836cc18812a64cfa9c2b/master/buildbot/data/resultspec.py#L269" rel="noreferrer" target="_blank">https://github.com/buildbot/buildbot/blob/9f1cac1d3bb61baa0b6c836cc18812a64cfa9c2b/master/buildbot/data/resultspec.py#L269</a>.<br>
>> My solution was to get all the fields from the changes table from the<br>
>> database (don't know if this is the best approach) as you can see here:<br>
>><br>
>> +++ b/master/buildbot/db/changes.py<br>
>> @@ -249,7 +249,7 @@ class<br>
>> ChangesConnectorComponent(base.DBConnectorComponent):<br>
>> def thd(conn):<br>
>> # get the changeids from the 'changes' table<br>
>> changes_tbl = self.db.model.changes<br>
>> - q = sa.select([changes_tbl.c.changeid])<br>
>> + q = changes_tbl.select()<br>
>><br>
>> However, since only the changeid would be needed, don't sure if this <br>
>> would<br>
>> be the right approach. Also, I would suspect some missing tests because <br>
>> I<br>
>> would expect this to be a pretty common use case.<br>
>><br>
>> Regarding the profiler, I also had some problems running it for a longer<br>
>> time, so if you find a fix, please share it with me.<br>
>><br>
>> Cheers,<br>
>> Vlad<br>
>><br>
>> On Wed, Jan 13, 2021 at 1:33 PM Pierre Tardy <<a href="mailto:tardyp@gmail.com" target="_blank">tardyp@gmail.com</a>> wrote:<br>
>><br>
>>><br>
>>> I am not sure why this wouldn't work. I vaguely recall there was an <br>
>>> issue<br>
>>> there, but can't figure it out staring at the code.<br>
>>> As this is quite ancient, I am not sure the JS part will even build<br>
>>> anymore :-/<br>
>>><br>
>>> You can change the default values at that line in the python<br>
>>><br>
>>> <a href="https://github.com/tardyp/buildbot_profiler/blob/master/buildbot_profiler/api.py#L193" rel="noreferrer" target="_blank">https://github.com/tardyp/buildbot_profiler/blob/master/buildbot_profiler/api.py#L193</a><br>
>>> I think you will be able to force them by editing this file inside your<br>
>>> virtualenv..<br>
>>><br>
>>><br>
>>> Regards<br>
>>> Pierre<br>
>>><br>
>>><br>
>>> Le mer. 13 janv. 2021 à 11:46, Yngve N. Pettersen <<a href="mailto:yngve@vivaldi.com" target="_blank">yngve@vivaldi.com</a>> a<br>
>>> écrit :<br>
>>><br>
>>>> > I insist on the buildbot profiler. What I was saying before is that <br>
>>>> you<br>
>>>> > need to hit the record button before the problem appears, and put a<br>
>>>> large<br>
>>>> > enough record time to be sure to catch a spike.<br>
>>>> > Then, you will be able to zoom to the cpu spike and catch the issue<br>
>>>> > precisely.<br>
>>>> ><br>
>>>> > If the spike is in the order of minutes like you said, you can<br>
>>>> configure<br>
>>>> > it<br>
>>>> > like this and get enough samples to get enough evidence to where the<br>
>>>> code<br>
>>>> > is actually spending time:<br>
>>>> ><br>
>>>> > ProfilerService(frequency=500, gatherperiod=60 * 60, mode='virtual',<br>
>>>> > basepath=None, wantBuilds=100<br>
>>>><br>
>>>> I tried configuring this with the settings dropdown in the WebGUI<br>
>>>> plugin,<br>
>>>> but AFAICT it is not working, it only gathers info for 30 seconds.<br>
>>>><br>
>>>> I guess I must be holding it incorrectly.<br>
>>>><br>
>>>> > This will record for one hour, and mitigate the memory used if you<br>
>>>> worry<br>
>>>> > about it.<br>
>>>><br>
>>>> --<br>
>>>> Sincerely,<br>
>>>> Yngve N. Pettersen<br>
>>>> Vivaldi Technologies AS<br>
>>>><br>
>>><br>
>><br>
>> --<br>
>> Vlad<br>
>><br>
<br>
<br>
-- <br>
Sincerely,<br>
Yngve N. Pettersen<br>
Vivaldi Technologies AS</blockquote></div>