[users at bb.net] buildbot CPU usage

Greg MacDonald gmacdonald at trionworlds.com
Wed Dec 14 18:12:44 UTC 2016


Might not be related but I had some issues with logs and UI lag when I was using 32 bit python for the server. Went away when I upgraded to 64 bit. - Greg

From: users [mailto:users-bounces at buildbot.net] On Behalf Of Pierre Tardy
Sent: Wednesday, December 14, 2016 10:48 AM
To: Francesco Di Mizio <francescodimizio at gmail.com>
Cc: Buildbot <users at buildbot.net>
Subject: Re: [users at bb.net] buildbot CPU usage

Hi Francesco,

I have tried to reproduce your issue by doing some stress tests on the buildbot master.
I have been able to run thousands of builds throwing 30k lines of logs each without significant lag in the UI.
it takes ~2min to compress the whole log with sqlite, but I can always get smooth access to the UI.

I also tried with pg, and it was much faster.

I am setting up a cleaner test environment, and I will be publishing some benchmarks.

This week we got this bug about performance issue: http://trac.buildbot.net/ticket/3654#comment:4
This eventually resolved to be a hardware problem on the machine. Could this be something like this?

Pierre

Le mar. 13 déc. 2016 à 01:26, Francesco Di Mizio <francescodimizio at gmail.com<mailto:francescodimizio at gmail.com>> a écrit :
Pierre i am not using SQLite but pg hosted in an other container on the same box. I'd say the issue isn't which dB engine i am running as it also happens with  MySQL.

Thanks,
 Francesco

On Dec 13, 2016 12:40 AM, "Pierre Tardy" <tardyp at gmail.com<mailto:tardyp at gmail.com>> wrote:
Hi Francesco,
I would say it is normal that the CPU goes to 100% while the log are being compressed. As soon as anything else is not blocked, for me this is fine.
I can see you are using sqlite. The thing is with sqlite the number of parallel access to the db is restricted to one by default in order to protect against weird parallel behaviour we got with sqlite.
I think this explains your behaviour.
I would recommend as an easy fix to switch to pg or mysql. I'll try to think of some workarounds like compressing the log only when there is a low load average. This is already something that we though about with my teammates. Log compression is important, but not urgent, so this is fine to delay until there is not so much to do.

Pierre



Le lun. 12 déc. 2016 à 12:40, Francesco Di Mizio <francescodimizio at gmail.com<mailto:francescodimizio at gmail.com>> a écrit :
Hey Pierre,

first of all thanks for taking a look. I never ever have more than 9-10 builds going on at the same time. Like I said many times the system can happily be sitting on a nice 10-15% CPU usage when even 8 builds are running, but the CPU spikes can happen at any time with any number of builds going on. Things seem to go worse when the system runs for a few days with no reboot. Just now I had 7 builds running and the cpu keeps oscillating between 10% and 120%,  now it's contantly over 100%. I wonder if we can find a way for you to access the system so you can take a look yourself (of course if you are fine with it)?

I will def be looking at multimaster in the future, just have no time atm.

Thanks again,
 Francesco

On Mon, Dec 12, 2016 at 11:18 AM, Pierre Tardy <tardyp at gmail.com<mailto:tardyp at gmail.com>> wrote:
Hi Francesco,
I finally got the time to look at your profile.

I dont see anything other than normal load. Buildbot is hardworking at compressing the log in the db, managing the build and steps, etc.
How many build in parallel do you have for your master? Maybe its time to start thinking about multimaster.

Regards,
Pierre

Le lun. 28 nov. 2016 à 11:30, Francesco Di Mizio <francescodimizio at gmail.com<mailto:francescodimizio at gmail.com>> a écrit :
Reviving this old thread. According to things that were said on this thread, I reduced as much as I could all the logs produced by all steps. Bigger has now 300 lines.
Despite that I am still seeing cpu spiked. I managed to grab a hopefully useful json from the profiler which you'll find attached. Toards the end you can clearly see the load goes through the roof.

If anybody could take a look I'd really appreciate!

Thanks,
Francesco

On Fri, Aug 26, 2016 at 6:06 PM, Neil Gilmore <ngilmore at grammatech.com<mailto:ngilmore at grammatech.com>> wrote:
Hi Pierre,

Well, my previous message was really saying that I think buildbot can handle logs a lot bigger then 12K lines.

It seems unlikely that it's stuck on log compression. The step says it's been running for 9.5 hours, and previous runs of the entire build have durations in the 1-3 hour range. The particular step is a ShellCommand that actually does the build. The place where the log stops is during an svn up operation.

I don't find any exceptions in the log for this, though that's not definitive (current number of log files is 122, and sometimes finding things is tricky). I do see the build cancellation in the log, and it also shows up in the UI.

I'm betting on a lost deferred. So far, I've never seen an exception relating to that.

On the bright side, this seems to be the only worker whose builders are stuck (some of the rest are waiting for locks, which this build has). Usually, there's more.

Neil Gilmore
grammatech.com<http://grammatech.com>

On 8/26/2016 10:35 AM, Pierre Tardy wrote:
Hum,
It might not be stuck, actually, but just spending very long time to compress the log.

In theory, the log compression is not waited for, though.

Note that if you stop a build that is waiting for a lost deferred, this will have no effect as you describe.

You got absolutly no exception in twisted.log?

Pierre


Le ven. 26 août 2016 à 17:06, Neil Gilmore <ngilmore at grammatech.com<mailto:ngilmore at grammatech.com>> a écrit :
I'm currently looking at a step with 3 logs:
81531 lines
489285 lines
489311 lines

An earlier successful run would have that first log at 244080 lines.

This particular build is stuck, though. :( (which is why I'm looking at it.) (A bit off-topic, but I tried stopping it. The last step is marked cancelled, but that's the only effect.)


Neil Gilmore
grammatech.com<http://grammatech.com>

On 8/26/2016 6:25 AM, Pierre Tardy wrote:

The ram looks like sufficient, it might be a good test to try and increase the number of cpu for that VM.
In your trace, I can see the use of up to 7 threads at the same time, so you might gain by going 8 CPUs

Also, make sure that your VM host is not overbooked. In my experience of using VMware VMs provided by IT, overbooking has been a source of inexplicable performance issues.

12k lines is a lot, but buildbot shall support this kind of load without issue.

Le ven. 26 août 2016 à 12:05, Francesco Di Mizio <francescodimizio at gmail.com<mailto:francescodimizio at gmail.com>> a écrit :
It's a vmware virtual machine with 4 GIGs RAM and 4 CPUs at 3Ghz. It runs, among other marginal things, 2 docker containers - one for the buildbot and one for the postgres db.

The most beefy logs have around 12K lines. Is it too much?
Also some other logs are read from the worker's filesystem and added as additional logs.

On Fri, Aug 26, 2016 at 11:51 AM, Pierre Tardy <tardyp at gmail.com<mailto:tardyp at gmail.com>> wrote:
Cool!
I can indeed see 3 spikes.

Looks related to logs and logs compression.

What is the HW spec of your master machine?
How much log does your build generate?

Pierre

Le ven. 26 août 2016 à 11:42, Francesco Di Mizio <francescodimizio at gmail.com<mailto:francescodimizio at gmail.com>> a écrit :
Pierre,

I enabled it, waited 1 min and saw the spike, then stopped after a few secs. Attached the json.
Awesome tool btw, work wonders!

On Thu, Aug 25, 2016 at 1:14 PM, Pierre Tardy <tardyp at gmail.com<mailto:tardyp at gmail.com>> wrote:
You can try to hit the button in the morning with a two hours gather period, and hope that you see the spike during that period..


Le jeu. 25 août 2016 à 12:17, Francesco Di Mizio <francescodimizio at gmail.com<mailto:francescodimizio at gmail.com>> a écrit :
Thanks a lot! Pierre I will def will give it a shot. I am not sure I'll be able to smash that 'start recording' button as the UI isusuallystuck when the CPU spikes. Updates to come!

On Thu, Aug 25, 2016 at 10:45 AM, Pierre Tardy <tardyp at gmail.com<mailto:tardyp at gmail.com>> wrote:
Hi Francesco,

I spent some time in order to implement a profiler plugin for buildbot

You can give it a look, and send your profile.json file if you need more analysis from me.
https://github.com/tardyp/buildbot_profiler

Regards,
Pierre


Le mer. 24 août 2016 à 22:43, Francesco Di Mizio <francescodimizio at gmail.com<mailto:francescodimizio at gmail.com>> a écrit :
I've tried and it's not an easy task because of my Win into Vagrant into Docker setup.
I'll try again soon when I get a Linux  box!

On Fri, Aug 19, 2016 at 5:54 PM, Vasily <just.one.man at yandex.ru<mailto:just.one.man at yandex.ru>> wrote:

Umm, no. VTune has Python support starting 2017 Beta, and, well, it was my team (at Intel) work actually :-)

P.S. I'm from Intel, too.

Thanks,
Vasily
19 авг. 2016 г. 18:17 пользователь "Francesco Di Mizio" <francescodimizio at gmail.com<mailto:francescodimizio at gmail.com>> написал:


I had thought you were making fun of Intel somehow ;)

On Aug 19, 2016 5:07 PM, "Pierre Tardy" <tardyp at gmail.com<mailto:tardyp at gmail.com>> wrote:
ahah

I though this was a taunt on me being employed by Intel.
I actually had mitigated experience with vtune few years ago, and didn't know they had python support until then.
Being an opensource guy, I usually neglegate to look at proprietary stuff.

Pierre

Le ven. 19 août 2016 à 12:18, Vasily <just.one.man at yandex.ru<mailto:just.one.man at yandex.ru>> a écrit :

I'm again suggesting to look into Python profiling capabilities of Intel® VTune™ Amplifier. It could run statistical profiling for a long time and display CPU usage over time, so the developer can look at specific time range where CPU usage was too high and see which functions were executed.

Thanks,
Vasily
19 авг. 2016 г. 11:57 пользователь "Pierre Tardy" <tardyp at gmail.com<mailto:tardyp at gmail.com>> написал:

Hi Francesco,

Your described setup looks sane to me.

The problems we are trying to catch are cpu spikes, as far as I understand, which does not happen for very long, but are very annoying for users, as it is blocking the reactor.

This problem is not easy to see in the profile you sent, as this profile is over long time, so we see the average of each method during the day and not the spikes.

What would really be needed is a on-demand profiler which would detect cpu spikes and only log the stack traces during those times.

Here is a nice blog pst explaining why statistic profiling is cool and easy to implement in python.
https://nylas.com/blog/performance

For 0.9.1 I want to concentrate on scalability, and write a debugging ui plugin based on those ideas (and probably code)

That would be great if your team can help on that matter.

Regards,
Pierre







_______________________________________________

users mailing list

users at buildbot.net<mailto:users at buildbot.net>

https://lists.buildbot.net/mailman/listinfo/users
_______________________________________________
users mailing list
users at buildbot.net<mailto:users at buildbot.net>
https://lists.buildbot.net/mailman/listinfo/users


_______________________________________________
users mailing list
users at buildbot.net<mailto:users at buildbot.net>
https://lists.buildbot.net/mailman/listinfo/users


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.buildbot.net/pipermail/users/attachments/20161214/c57b00b1/attachment.html>


More information about the users mailing list