[users at bb.net] buildbot CPU usage

Francesco Di Mizio francescodimizio at gmail.com
Mon Dec 19 15:06:33 UTC 2016


Pierre,

I will have IT move the server to a real physical box. Will be using a
speedy SSD so if that's really the problem it should be evident soon.
Will of course update here with my findings.

Thanks!

On Fri, Dec 16, 2016 at 12:59 PM, Pierre Tardy <tardyp at gmail.com> wrote:

> Hi Francesco,
> I am continuing to do my scalability benchmarking with 0.9.2, and one
> issue I am finding is that buildbot consumes lots of memory in order to
> compress the logs.
> It actually loads the whole log in memory in order to compress it at once
> This simplifies the algorithm, but can be quite memory hungry. On my
> current test, I am doing 300 builds in parallel with 10k lines of log, and
> it eats 4GB+ of memory.
> This explains why Greg had issue with 32bit python, and could explain your
> performance issue, if your master starts swapping while doing the
> compressions.
>
> This is probably something that I want to eventually fix.
> Pierre
>
>
> Le mer. 14 déc. 2016 à 19:12, Greg MacDonald <gmacdonald at trionworlds.com>
> a écrit :
>
>> Might not be related but I had some issues with logs and UI lag when I
>> was using 32 bit python for the server. Went away when I upgraded to 64
>> bit. - Greg
>>
>>
>>
>> *From:* users [mailto:users-bounces at buildbot.net] *On Behalf Of *Pierre
>> Tardy
>> *Sent:* Wednesday, December 14, 2016 10:48 AM
>> *To:* Francesco Di Mizio <francescodimizio at gmail.com>
>> *Cc:* Buildbot <users at buildbot.net>
>> *Subject:* Re: [users at bb.net] buildbot CPU usage
>>
>>
>>
>> Hi Francesco,
>>
>>
>>
>> I have tried to reproduce your issue by doing some stress tests on the
>> buildbot master.
>>
>> I have been able to run thousands of builds throwing 30k lines of logs
>> each without significant lag in the UI.
>>
>> it takes ~2min to compress the whole log with sqlite, but I can always
>> get smooth access to the UI.
>>
>>
>>
>> I also tried with pg, and it was much faster.
>>
>>
>>
>> I am setting up a cleaner test environment, and I will be publishing some
>> benchmarks.
>>
>>
>>
>> This week we got this bug about performance issue: http://trac.buildbot.
>> net/ticket/3654#comment:4
>>
>> This eventually resolved to be a hardware problem on the machine. Could
>> this be something like this?
>>
>>
>>
>> Pierre
>>
>>
>>
>> Le mar. 13 déc. 2016 à 01:26, Francesco Di Mizio <
>> francescodimizio at gmail.com> a écrit :
>>
>> Pierre i am not using SQLite but pg hosted in an other container on the
>> same box. I'd say the issue isn't which dB engine i am running as it also
>> happens with  MySQL.
>>
>>
>>
>> Thanks,
>>
>>  Francesco
>>
>> On Dec 13, 2016 12:40 AM, "Pierre Tardy" <tardyp at gmail.com> wrote:
>>
>> Hi Francesco,
>>
>> I would say it is normal that the CPU goes to 100% while the log are
>> being compressed. As soon as anything else is not blocked, for me this is
>> fine.
>>
>> I can see you are using sqlite. The thing is with sqlite the number of
>> parallel access to the db is restricted to one by default in order to
>> protect against weird parallel behaviour we got with sqlite.
>>
>> I think this explains your behaviour.
>>
>> I would recommend as an easy fix to switch to pg or mysql. I'll try to
>> think of some workarounds like compressing the log only when there is a low
>> load average. This is already something that we though about with my
>> teammates. Log compression is important, but not urgent, so this is fine to
>> delay until there is not so much to do.
>>
>>
>>
>> Pierre
>>
>>
>>
>>
>>
>>
>>
>> Le lun. 12 déc. 2016 à 12:40, Francesco Di Mizio <
>> francescodimizio at gmail.com> a écrit :
>>
>> Hey Pierre,
>>
>>
>>
>> first of all thanks for taking a look. I never ever have more than 9-10
>> builds going on at the same time. Like I said many times the system can
>> happily be sitting on a nice 10-15% CPU usage when even 8 builds are
>> running, but the CPU spikes can happen at any time with any number of
>> builds going on. Things seem to go worse when the system runs for a few
>> days with no reboot. Just now I had 7 builds running and the cpu keeps
>> oscillating between 10% and 120%,  now it's contantly over 100%. I wonder
>> if we can find a way for you to access the system so you can take a look
>> yourself (of course if you are fine with it)?
>>
>>
>>
>> I will def be looking at multimaster in the future, just have no time atm.
>>
>>
>>
>> Thanks again,
>>
>>  Francesco
>>
>> On Mon, Dec 12, 2016 at 11:18 AM, Pierre Tardy <tardyp at gmail.com> wrote:
>>
>> Hi Francesco,
>>
>> I finally got the time to look at your profile.
>>
>>
>>
>> I dont see anything other than normal load. Buildbot is hardworking at
>> compressing the log in the db, managing the build and steps, etc.
>>
>> How many build in parallel do you have for your master? Maybe its time to
>> start thinking about multimaster.
>>
>>
>>
>> Regards,
>>
>> Pierre
>>
>> Le lun. 28 nov. 2016 à 11:30, Francesco Di Mizio <
>> francescodimizio at gmail.com> a écrit :
>>
>> Reviving this old thread. According to things that were said on this
>> thread, I reduced as much as I could all the logs produced by all steps.
>> Bigger has now 300 lines.
>>
>> Despite that I am still seeing cpu spiked. I managed to grab a hopefully
>> useful json from the profiler which you'll find attached. Toards the end
>> you can clearly see the load goes through the roof.
>>
>>
>>
>> If anybody could take a look I'd really appreciate!
>>
>>
>>
>> Thanks,
>>
>> Francesco
>>
>> On Fri, Aug 26, 2016 at 6:06 PM, Neil Gilmore <ngilmore at grammatech.com>
>> wrote:
>>
>> Hi Pierre,
>>
>> Well, my previous message was really saying that I think buildbot can
>> handle logs a lot bigger then 12K lines.
>>
>> It seems unlikely that it's stuck on log compression. The step says it's
>> been running for 9.5 hours, and previous runs of the entire build have
>> durations in the 1-3 hour range. The particular step is a ShellCommand that
>> actually does the build. The place where the log stops is during an svn up
>> operation.
>>
>> I don't find any exceptions in the log for this, though that's not
>> definitive (current number of log files is 122, and sometimes finding
>> things is tricky). I do see the build cancellation in the log, and it also
>> shows up in the UI.
>>
>> I'm betting on a lost deferred. So far, I've never seen an exception
>> relating to that.
>>
>> On the bright side, this seems to be the only worker whose builders are
>> stuck (some of the rest are waiting for locks, which this build has).
>> Usually, there's more.
>>
>> Neil Gilmore
>> grammatech.com
>>
>>
>>
>> On 8/26/2016 10:35 AM, Pierre Tardy wrote:
>>
>> Hum,
>>
>> It might not be stuck, actually, but just spending very long time to
>> compress the log.
>>
>>
>>
>> In theory, the log compression is not waited for, though.
>>
>>
>>
>> Note that if you stop a build that is waiting for a lost deferred, this
>> will have no effect as you describe.
>>
>>
>>
>> You got absolutly no exception in twisted.log?
>>
>>
>>
>> Pierre
>>
>>
>>
>>
>>
>> Le ven. 26 août 2016 à 17:06, Neil Gilmore <ngilmore at grammatech.com> a
>> écrit :
>>
>> I'm currently looking at a step with 3 logs:
>> 81531 lines
>> 489285 lines
>> 489311 lines
>>
>> An earlier successful run would have that first log at 244080 lines.
>>
>> This particular build is stuck, though. :( (which is why I'm looking at
>> it.) (A bit off-topic, but I tried stopping it. The last step is marked
>> cancelled, but that's the only effect.)
>>
>>
>>
>> Neil Gilmore
>> grammatech.com
>>
>>
>>
>> On 8/26/2016 6:25 AM, Pierre Tardy wrote:
>>
>>
>>
>> The ram looks like sufficient, it might be a good test to try and
>> increase the number of cpu for that VM.
>>
>> In your trace, I can see the use of up to 7 threads at the same time, so
>> you might gain by going 8 CPUs
>>
>>
>>
>> Also, make sure that your VM host is not overbooked. In my experience of
>> using VMware VMs provided by IT, overbooking has been a source of
>> inexplicable performance issues.
>>
>>
>>
>> 12k lines is a lot, but buildbot shall support this kind of load without
>> issue.
>>
>>
>>
>> Le ven. 26 août 2016 à 12:05, Francesco Di Mizio <
>> francescodimizio at gmail.com> a écrit :
>>
>> It's a vmware virtual machine with 4 GIGs RAM and 4 CPUs at 3Ghz. It runs,
>> among other marginal things, 2 docker containers - one for the buildbot and
>> one for the postgres db.
>>
>>
>>
>> The most beefy logs have around 12K lines. Is it too much?
>>
>> Also some other logs are read from the worker's filesystem and added as
>> additional logs.
>>
>>
>>
>> On Fri, Aug 26, 2016 at 11:51 AM, Pierre Tardy <tardyp at gmail.com> wrote:
>>
>> Cool!
>>
>> I can indeed see 3 spikes.
>>
>>
>>
>> Looks related to logs and logs compression.
>>
>>
>>
>> What is the HW spec of your master machine?
>>
>> How much log does your build generate?
>>
>>
>>
>> Pierre
>>
>>
>>
>> Le ven. 26 août 2016 à 11:42, Francesco Di Mizio <
>> francescodimizio at gmail.com> a écrit :
>>
>> Pierre,
>>
>>
>>
>> I enabled it, waited 1 min and saw the spike, then stopped after a few
>> secs. Attached the json.
>>
>> Awesome tool btw, work wonders!
>>
>>
>>
>> On Thu, Aug 25, 2016 at 1:14 PM, Pierre Tardy <tardyp at gmail.com> wrote:
>>
>> You can try to hit the button in the morning with a two hours gather
>> period, and hope that you see the spike during that period..
>>
>>
>>
>>
>>
>> Le jeu. 25 août 2016 à 12:17, Francesco Di Mizio <
>> francescodimizio at gmail.com> a écrit :
>>
>> Thanks a lot! Pierre I will def will give it a shot. I am not sure I'll
>> be able to smash that 'start recording' button as the UI isusuallystuck
>> when the CPU spikes. Updates to come!
>>
>>
>>
>> On Thu, Aug 25, 2016 at 10:45 AM, Pierre Tardy <tardyp at gmail.com> wrote:
>>
>> Hi Francesco,
>>
>>
>>
>> I spent some time in order to implement a profiler plugin for buildbot
>>
>>
>>
>> You can give it a look, and send your profile.json file if you need more
>> analysis from me.
>>
>> https://github.com/tardyp/buildbot_profiler
>>
>>
>>
>> Regards,
>>
>> Pierre
>>
>>
>>
>>
>>
>> Le mer. 24 août 2016 à 22:43, Francesco Di Mizio <
>> francescodimizio at gmail.com> a écrit :
>>
>> I've tried and it's not an easy task because of my Win into Vagrant into
>> Docker setup.
>>
>> I'll try again soon when I get a Linux  box!
>>
>>
>>
>> On Fri, Aug 19, 2016 at 5:54 PM, Vasily <just.one.man at yandex.ru> wrote:
>>
>> Umm, no. VTune has Python support starting 2017 Beta, and, well, it was
>> my team (at Intel) work actually :-)
>>
>> P.S. I'm from Intel, too.
>>
>> Thanks,
>> Vasily
>>
>> 19 авг. 2016 г. 18:17 пользователь "Francesco Di Mizio" <
>> francescodimizio at gmail.com> написал:
>>
>>
>>
>> I had thought you were making fun of Intel somehow ;)
>>
>>
>>
>> On Aug 19, 2016 5:07 PM, "Pierre Tardy" <tardyp at gmail.com> wrote:
>>
>> ahah
>>
>>
>>
>> I though this was a taunt on me being employed by Intel.
>>
>> I actually had mitigated experience with vtune few years ago, and didn't
>> know they had python support until then.
>>
>> Being an opensource guy, I usually neglegate to look at proprietary stuff.
>>
>>
>>
>> Pierre
>>
>>
>>
>> Le ven. 19 août 2016 à 12:18, Vasily <just.one.man at yandex.ru> a écrit :
>>
>> I'm again suggesting to look into Python profiling capabilities of Intel®
>> VTune™ Amplifier. It could run statistical profiling for a long time and
>> display CPU usage over time, so the developer can look at specific time
>> range where CPU usage was too high and see which functions were executed.
>>
>> Thanks,
>> Vasily
>>
>> 19 авг. 2016 г. 11:57 пользователь "Pierre Tardy" <tardyp at gmail.com>
>> написал:
>>
>>
>>
>> Hi Francesco,
>>
>>
>>
>> Your described setup looks sane to me.
>>
>>
>>
>> The problems we are trying to catch are cpu spikes, as far as I
>> understand, which does not happen for very long, but are very annoying for
>> users, as it is blocking the reactor.
>>
>>
>>
>> This problem is not easy to see in the profile you sent, as this profile
>> is over long time, so we see the average of each method during the day and
>> not the spikes.
>>
>>
>>
>> What would really be needed is a on-demand profiler which would detect
>> cpu spikes and only log the stack traces during those times.
>>
>>
>>
>> Here is a nice blog pst explaining why statistic profiling is cool and
>> easy to implement in python.
>>
>> https://nylas.com/blog/performance
>>
>>
>>
>> For 0.9.1 I want to concentrate on scalability, and write a debugging ui
>> plugin based on those ideas (and probably code)
>>
>>
>>
>> That would be great if your team can help on that matter.
>>
>>
>>
>> Regards,
>>
>> Pierre
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>>
>> users mailing list
>>
>> users at buildbot.net
>>
>> https://lists.buildbot.net/mailman/listinfo/users
>>
>> _______________________________________________
>> users mailing list
>> users at buildbot.net
>> https://lists.buildbot.net/mailman/listinfo/users
>>
>>
>> _______________________________________________
>> users mailing list
>> users at buildbot.net
>> https://lists.buildbot.net/mailman/listinfo/users
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.buildbot.net/pipermail/users/attachments/20161219/bec69a54/attachment.html>


More information about the users mailing list