[users at bb.net] buildbot CPU usage

Francesco Di Mizio francescodimizio at gmail.com
Mon Jan 30 16:49:40 UTC 2017


Digging this up again. Havent moved to a physical server yet. I will first
be splitting the bbot server and postgres on different hosts.

What I've noticed the CPU spiking is always (and I rea) due to queries like:
SELECT logchunks.first_line, logchunks.last_line, logchunks.content,
logchunks.compressed FROM logchunks WHERE logchunks.logid = 56343 ORDER BY
logchunks.first_line

If I am correct that call can only come from CompressLog
<https://github.com/buildbot/buildbot/blob/cfe60e31206748bcf01232eb13c0d6177cd04d5d/master/buildbot/db/logs.py#L252>
.
They do take a very long time (even 1 minute). While running all the system
is stuck. I wonder if I should use some custom postgres config, I am simply
running the official postgres 9.5 container.

Thanks,
 Francesco

On Mon, Dec 19, 2016 at 4:06 PM, Francesco Di Mizio <
francescodimizio at gmail.com> wrote:

> Pierre,
>
> I will have IT move the server to a real physical box. Will be using a
> speedy SSD so if that's really the problem it should be evident soon.
> Will of course update here with my findings.
>
> Thanks!
>
> On Fri, Dec 16, 2016 at 12:59 PM, Pierre Tardy <tardyp at gmail.com> wrote:
>
>> Hi Francesco,
>> I am continuing to do my scalability benchmarking with 0.9.2, and one
>> issue I am finding is that buildbot consumes lots of memory in order to
>> compress the logs.
>> It actually loads the whole log in memory in order to compress it at once
>> This simplifies the algorithm, but can be quite memory hungry. On my
>> current test, I am doing 300 builds in parallel with 10k lines of log, and
>> it eats 4GB+ of memory.
>> This explains why Greg had issue with 32bit python, and could explain
>> your performance issue, if your master starts swapping while doing the
>> compressions.
>>
>> This is probably something that I want to eventually fix.
>> Pierre
>>
>>
>> Le mer. 14 déc. 2016 à 19:12, Greg MacDonald <gmacdonald at trionworlds.com>
>> a écrit :
>>
>>> Might not be related but I had some issues with logs and UI lag when I
>>> was using 32 bit python for the server. Went away when I upgraded to 64
>>> bit. - Greg
>>>
>>>
>>>
>>> *From:* users [mailto:users-bounces at buildbot.net] *On Behalf Of *Pierre
>>> Tardy
>>> *Sent:* Wednesday, December 14, 2016 10:48 AM
>>> *To:* Francesco Di Mizio <francescodimizio at gmail.com>
>>> *Cc:* Buildbot <users at buildbot.net>
>>> *Subject:* Re: [users at bb.net] buildbot CPU usage
>>>
>>>
>>>
>>> Hi Francesco,
>>>
>>>
>>>
>>> I have tried to reproduce your issue by doing some stress tests on the
>>> buildbot master.
>>>
>>> I have been able to run thousands of builds throwing 30k lines of logs
>>> each without significant lag in the UI.
>>>
>>> it takes ~2min to compress the whole log with sqlite, but I can always
>>> get smooth access to the UI.
>>>
>>>
>>>
>>> I also tried with pg, and it was much faster.
>>>
>>>
>>>
>>> I am setting up a cleaner test environment, and I will be publishing
>>> some benchmarks.
>>>
>>>
>>>
>>> This week we got this bug about performance issue:
>>> http://trac.buildbot.net/ticket/3654#comment:4
>>>
>>> This eventually resolved to be a hardware problem on the machine. Could
>>> this be something like this?
>>>
>>>
>>>
>>> Pierre
>>>
>>>
>>>
>>> Le mar. 13 déc. 2016 à 01:26, Francesco Di Mizio <
>>> francescodimizio at gmail.com> a écrit :
>>>
>>> Pierre i am not using SQLite but pg hosted in an other container on the
>>> same box. I'd say the issue isn't which dB engine i am running as it also
>>> happens with  MySQL.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>  Francesco
>>>
>>> On Dec 13, 2016 12:40 AM, "Pierre Tardy" <tardyp at gmail.com> wrote:
>>>
>>> Hi Francesco,
>>>
>>> I would say it is normal that the CPU goes to 100% while the log are
>>> being compressed. As soon as anything else is not blocked, for me this is
>>> fine.
>>>
>>> I can see you are using sqlite. The thing is with sqlite the number of
>>> parallel access to the db is restricted to one by default in order to
>>> protect against weird parallel behaviour we got with sqlite.
>>>
>>> I think this explains your behaviour.
>>>
>>> I would recommend as an easy fix to switch to pg or mysql. I'll try to
>>> think of some workarounds like compressing the log only when there is a low
>>> load average. This is already something that we though about with my
>>> teammates. Log compression is important, but not urgent, so this is fine to
>>> delay until there is not so much to do.
>>>
>>>
>>>
>>> Pierre
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Le lun. 12 déc. 2016 à 12:40, Francesco Di Mizio <
>>> francescodimizio at gmail.com> a écrit :
>>>
>>> Hey Pierre,
>>>
>>>
>>>
>>> first of all thanks for taking a look. I never ever have more than 9-10
>>> builds going on at the same time. Like I said many times the system can
>>> happily be sitting on a nice 10-15% CPU usage when even 8 builds are
>>> running, but the CPU spikes can happen at any time with any number of
>>> builds going on. Things seem to go worse when the system runs for a few
>>> days with no reboot. Just now I had 7 builds running and the cpu keeps
>>> oscillating between 10% and 120%,  now it's contantly over 100%. I wonder
>>> if we can find a way for you to access the system so you can take a look
>>> yourself (of course if you are fine with it)?
>>>
>>>
>>>
>>> I will def be looking at multimaster in the future, just have no time
>>> atm.
>>>
>>>
>>>
>>> Thanks again,
>>>
>>>  Francesco
>>>
>>> On Mon, Dec 12, 2016 at 11:18 AM, Pierre Tardy <tardyp at gmail.com> wrote:
>>>
>>> Hi Francesco,
>>>
>>> I finally got the time to look at your profile.
>>>
>>>
>>>
>>> I dont see anything other than normal load. Buildbot is hardworking at
>>> compressing the log in the db, managing the build and steps, etc.
>>>
>>> How many build in parallel do you have for your master? Maybe its time
>>> to start thinking about multimaster.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Pierre
>>>
>>> Le lun. 28 nov. 2016 à 11:30, Francesco Di Mizio <
>>> francescodimizio at gmail.com> a écrit :
>>>
>>> Reviving this old thread. According to things that were said on this
>>> thread, I reduced as much as I could all the logs produced by all steps.
>>> Bigger has now 300 lines.
>>>
>>> Despite that I am still seeing cpu spiked. I managed to grab a hopefully
>>> useful json from the profiler which you'll find attached. Toards the end
>>> you can clearly see the load goes through the roof.
>>>
>>>
>>>
>>> If anybody could take a look I'd really appreciate!
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Francesco
>>>
>>> On Fri, Aug 26, 2016 at 6:06 PM, Neil Gilmore <ngilmore at grammatech.com>
>>> wrote:
>>>
>>> Hi Pierre,
>>>
>>> Well, my previous message was really saying that I think buildbot can
>>> handle logs a lot bigger then 12K lines.
>>>
>>> It seems unlikely that it's stuck on log compression. The step says it's
>>> been running for 9.5 hours, and previous runs of the entire build have
>>> durations in the 1-3 hour range. The particular step is a ShellCommand that
>>> actually does the build. The place where the log stops is during an svn up
>>> operation.
>>>
>>> I don't find any exceptions in the log for this, though that's not
>>> definitive (current number of log files is 122, and sometimes finding
>>> things is tricky). I do see the build cancellation in the log, and it also
>>> shows up in the UI.
>>>
>>> I'm betting on a lost deferred. So far, I've never seen an exception
>>> relating to that.
>>>
>>> On the bright side, this seems to be the only worker whose builders are
>>> stuck (some of the rest are waiting for locks, which this build has).
>>> Usually, there's more.
>>>
>>> Neil Gilmore
>>> grammatech.com
>>>
>>>
>>>
>>> On 8/26/2016 10:35 AM, Pierre Tardy wrote:
>>>
>>> Hum,
>>>
>>> It might not be stuck, actually, but just spending very long time to
>>> compress the log.
>>>
>>>
>>>
>>> In theory, the log compression is not waited for, though.
>>>
>>>
>>>
>>> Note that if you stop a build that is waiting for a lost deferred, this
>>> will have no effect as you describe.
>>>
>>>
>>>
>>> You got absolutly no exception in twisted.log?
>>>
>>>
>>>
>>> Pierre
>>>
>>>
>>>
>>>
>>>
>>> Le ven. 26 août 2016 à 17:06, Neil Gilmore <ngilmore at grammatech.com> a
>>> écrit :
>>>
>>> I'm currently looking at a step with 3 logs:
>>> 81531 lines
>>> 489285 lines
>>> 489311 lines
>>>
>>> An earlier successful run would have that first log at 244080 lines.
>>>
>>> This particular build is stuck, though. :( (which is why I'm looking at
>>> it.) (A bit off-topic, but I tried stopping it. The last step is marked
>>> cancelled, but that's the only effect.)
>>>
>>>
>>>
>>> Neil Gilmore
>>> grammatech.com
>>>
>>>
>>>
>>> On 8/26/2016 6:25 AM, Pierre Tardy wrote:
>>>
>>>
>>>
>>> The ram looks like sufficient, it might be a good test to try and
>>> increase the number of cpu for that VM.
>>>
>>> In your trace, I can see the use of up to 7 threads at the same time, so
>>> you might gain by going 8 CPUs
>>>
>>>
>>>
>>> Also, make sure that your VM host is not overbooked. In my experience of
>>> using VMware VMs provided by IT, overbooking has been a source of
>>> inexplicable performance issues.
>>>
>>>
>>>
>>> 12k lines is a lot, but buildbot shall support this kind of load without
>>> issue.
>>>
>>>
>>>
>>> Le ven. 26 août 2016 à 12:05, Francesco Di Mizio <
>>> francescodimizio at gmail.com> a écrit :
>>>
>>> It's a vmware virtual machine with 4 GIGs RAM and 4 CPUs at 3Ghz. It runs,
>>> among other marginal things, 2 docker containers - one for the buildbot and
>>> one for the postgres db.
>>>
>>>
>>>
>>> The most beefy logs have around 12K lines. Is it too much?
>>>
>>> Also some other logs are read from the worker's filesystem and added as
>>> additional logs.
>>>
>>>
>>>
>>> On Fri, Aug 26, 2016 at 11:51 AM, Pierre Tardy <tardyp at gmail.com> wrote:
>>>
>>> Cool!
>>>
>>> I can indeed see 3 spikes.
>>>
>>>
>>>
>>> Looks related to logs and logs compression.
>>>
>>>
>>>
>>> What is the HW spec of your master machine?
>>>
>>> How much log does your build generate?
>>>
>>>
>>>
>>> Pierre
>>>
>>>
>>>
>>> Le ven. 26 août 2016 à 11:42, Francesco Di Mizio <
>>> francescodimizio at gmail.com> a écrit :
>>>
>>> Pierre,
>>>
>>>
>>>
>>> I enabled it, waited 1 min and saw the spike, then stopped after a few
>>> secs. Attached the json.
>>>
>>> Awesome tool btw, work wonders!
>>>
>>>
>>>
>>> On Thu, Aug 25, 2016 at 1:14 PM, Pierre Tardy <tardyp at gmail.com> wrote:
>>>
>>> You can try to hit the button in the morning with a two hours gather
>>> period, and hope that you see the spike during that period..
>>>
>>>
>>>
>>>
>>>
>>> Le jeu. 25 août 2016 à 12:17, Francesco Di Mizio <
>>> francescodimizio at gmail.com> a écrit :
>>>
>>> Thanks a lot! Pierre I will def will give it a shot. I am not sure I'll
>>> be able to smash that 'start recording' button as the UI isusuallystuck
>>> when the CPU spikes. Updates to come!
>>>
>>>
>>>
>>> On Thu, Aug 25, 2016 at 10:45 AM, Pierre Tardy <tardyp at gmail.com> wrote:
>>>
>>> Hi Francesco,
>>>
>>>
>>>
>>> I spent some time in order to implement a profiler plugin for buildbot
>>>
>>>
>>>
>>> You can give it a look, and send your profile.json file if you need more
>>> analysis from me.
>>>
>>> https://github.com/tardyp/buildbot_profiler
>>>
>>>
>>>
>>> Regards,
>>>
>>> Pierre
>>>
>>>
>>>
>>>
>>>
>>> Le mer. 24 août 2016 à 22:43, Francesco Di Mizio <
>>> francescodimizio at gmail.com> a écrit :
>>>
>>> I've tried and it's not an easy task because of my Win into Vagrant into
>>> Docker setup.
>>>
>>> I'll try again soon when I get a Linux  box!
>>>
>>>
>>>
>>> On Fri, Aug 19, 2016 at 5:54 PM, Vasily <just.one.man at yandex.ru> wrote:
>>>
>>> Umm, no. VTune has Python support starting 2017 Beta, and, well, it was
>>> my team (at Intel) work actually :-)
>>>
>>> P.S. I'm from Intel, too.
>>>
>>> Thanks,
>>> Vasily
>>>
>>> 19 авг. 2016 г. 18:17 пользователь "Francesco Di Mizio" <
>>> francescodimizio at gmail.com> написал:
>>>
>>>
>>>
>>> I had thought you were making fun of Intel somehow ;)
>>>
>>>
>>>
>>> On Aug 19, 2016 5:07 PM, "Pierre Tardy" <tardyp at gmail.com> wrote:
>>>
>>> ahah
>>>
>>>
>>>
>>> I though this was a taunt on me being employed by Intel.
>>>
>>> I actually had mitigated experience with vtune few years ago, and didn't
>>> know they had python support until then.
>>>
>>> Being an opensource guy, I usually neglegate to look at proprietary
>>> stuff.
>>>
>>>
>>>
>>> Pierre
>>>
>>>
>>>
>>> Le ven. 19 août 2016 à 12:18, Vasily <just.one.man at yandex.ru> a écrit :
>>>
>>> I'm again suggesting to look into Python profiling capabilities of
>>> Intel® VTune™ Amplifier. It could run statistical profiling for a long time
>>> and display CPU usage over time, so the developer can look at specific time
>>> range where CPU usage was too high and see which functions were executed.
>>>
>>> Thanks,
>>> Vasily
>>>
>>> 19 авг. 2016 г. 11:57 пользователь "Pierre Tardy" <tardyp at gmail.com>
>>> написал:
>>>
>>>
>>>
>>> Hi Francesco,
>>>
>>>
>>>
>>> Your described setup looks sane to me.
>>>
>>>
>>>
>>> The problems we are trying to catch are cpu spikes, as far as I
>>> understand, which does not happen for very long, but are very annoying for
>>> users, as it is blocking the reactor.
>>>
>>>
>>>
>>> This problem is not easy to see in the profile you sent, as this profile
>>> is over long time, so we see the average of each method during the day and
>>> not the spikes.
>>>
>>>
>>>
>>> What would really be needed is a on-demand profiler which would detect
>>> cpu spikes and only log the stack traces during those times.
>>>
>>>
>>>
>>> Here is a nice blog pst explaining why statistic profiling is cool and
>>> easy to implement in python.
>>>
>>> https://nylas.com/blog/performance
>>>
>>>
>>>
>>> For 0.9.1 I want to concentrate on scalability, and write a debugging ui
>>> plugin based on those ideas (and probably code)
>>>
>>>
>>>
>>> That would be great if your team can help on that matter.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Pierre
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>>
>>> users mailing list
>>>
>>> users at buildbot.net
>>>
>>> https://lists.buildbot.net/mailman/listinfo/users
>>>
>>> _______________________________________________
>>> users mailing list
>>> users at buildbot.net
>>> https://lists.buildbot.net/mailman/listinfo/users
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users at buildbot.net
>>> https://lists.buildbot.net/mailman/listinfo/users
>>>
>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.buildbot.net/pipermail/users/attachments/20170130/4e628c2f/attachment.html>


More information about the users mailing list