[Buildbot-devel] RFC: Assigning builds when load is high
Vitali Lovich
vlovich at gmail.com
Thu May 7 02:19:59 UTC 2015
> On May 6, 2015, at 6:12 PM, Jared Grubb <jared.grubb at gmail.com> wrote:
>
>
>> On May 6, 2015, at 13:59, Vitali Lovich <vlovich at gmail.com> wrote:
>>
>> The problem with #2 is that you won’t actually use your compute cluster since you’ll be waiting for a particular buildslave even though other buildslaves may be idle.
>
> No, you misunderstand me … under #2, if no buildslave works, then the build request is not assigned to any buildslave (under #1, it is assigned to a random one, since they’re all busy) … when buildslaves go idle, the loop checks again.
>
> So to put it another way, if you had one builder with an exclusive slave lock, #2 guarantees each buildslave has at most one build assigned… the others “float” until something frees up. Under #1, all builds will get assigned to something, but will get stuck in “Waiting For Lock”.
Oh - I think that’s what we do. I didn’t realize it wasn’t the expected behavior. IIRC, the documentation says that if a build needs any locks and those locks can’t be acquired, the build won’t be created.
There may be some bugs in how that works, but I believe you’re describing the intended behavior.
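
For reference, the two options under discussion can be sketched roughly as follows. This is a minimal illustration only: `Slave`, `assign`, and `wait_when_busy` are hypothetical names, not Buildbot classes or settings, and a single `lock_free` flag stands in for "the builder's locks can be acquired on this buildslave".

```python
import random
from dataclasses import dataclass, field


@dataclass
class Slave:
    """Hypothetical stand-in for a buildslave; not a Buildbot class."""
    name: str
    lock_free: bool = True          # can the builder's locks be acquired here?
    assigned: list = field(default_factory=list)


def assign(request, slaves, wait_when_busy, rng=random):
    """Assign `request` to a slave, or return None to leave it floating.

    wait_when_busy=False models option #1 (fall back to a random slave).
    wait_when_busy=True  models option #2 (leave the request unassigned).
    """
    candidates = [s for s in slaves if s.lock_free]
    if not candidates:
        if wait_when_busy or not slaves:
            return None             # caller retries when a slave goes idle
        candidates = slaves         # option #1: everything is busy, pick any
    chosen = rng.choice(candidates)
    chosen.assigned.append(request)
    chosen.lock_free = False        # the slave's lock is now held or contended
    return chosen
```

With every slave busy, `assign(..., wait_when_busy=True)` returns `None` and the request stays unassigned, whereas `wait_when_busy=False` queues it behind whatever the randomly chosen slave is already running.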
>> The approach I’ve found that works better is implementing a prioritization that knows about which jobs are likely to be quick & which aren’t so that quick jobs are picked for completion first.
>> This does make it a domain-specific problem unfortunately but is tractable. Ping me offline if you want to discuss the details for our setup.
>>
>> If buildbot wants to properly solve scheduling I think there are a few moving parts where a revamped ETA is crucial:
>>
>> 1. ETA needs to be implemented properly & robustly. That means being able to provide a buildslave-specific ETA for each build step + take into account domain-specific dimensions.
>> In other words, the user has to be able to provide a set of properties that must match for the ETA samples so that if a builder is shared between projects the ETA is still correct or if a build request is for a clean build vs incremental.
>> Similarly, some kind of fallback mechanism is likely necessary since if I’m building a branch it likely needs to use the master ETA as a baseline if we don’t have anything more up-to-date for the branch itself.
>
> ETA is gone in nine, and hopefully will re-emerge under the new “metrics” stuff that is coming. (I hope?)
>
>> 2. ETA needs to have a guess function that given a buildslave & domain-specific dimensions returns how long the build *would* take (including accounting for any locks we might need to acquire up-front).
>> 3. The ETA would need to be available for locks based on current load by using ETA of completion of the builds/buildsteps holding the lock (read-only lock would be the max of the ETA of the things holding the lock).
>> 4. The queue would need to take the ETA for a given BR for each buildslave & then try to use the buildslave that minimizes the ETA (regardless of any current locks being held).
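
Points 1 to 3 above could look something like the sketch below. Everything here is an assumption made for illustration: `EtaEstimator`, `record`, `guess`, `fallbacks`, and `lock_wait` are hypothetical names, the median is just one possible estimate, and none of this is an existing Buildbot API.

```python
from statistics import median


class EtaEstimator:
    """Hypothetical duration estimator keyed on user-chosen dimensions."""

    def __init__(self):
        self._samples = {}   # key tuple -> list of observed durations (seconds)

    def record(self, slave, dims, seconds):
        # dims is a tuple such as (project, branch, "clean" or "incremental")
        self._samples.setdefault((slave,) + tuple(dims), []).append(seconds)

    def guess(self, slave, dims, fallbacks=()):
        """Return an estimated duration, or None if no samples match.

        `fallbacks` lists alternative dims to try in order, e.g. the same
        build on master when the branch has no history yet.
        """
        for candidate in (dims, *fallbacks):
            samples = self._samples.get((slave,) + tuple(candidate))
            if samples:
                return median(samples)
        return None


def lock_wait(holder_etas):
    """Estimated wait for a lock: the slowest current holder decides."""
    return max(holder_etas, default=0.0)
```

A branch build with no history would call `guess(slave, branch_dims, fallbacks=[master_dims])` and get the master baseline until its own samples accumulate.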
>>
>> This way, if you add a machine that is 10x faster than the rest, you’ll have jobs queue up on it leaving your slower machines idle until it’s faster to overflow to other machines.
>
> I don’t know any good reason to assign builds to buildslaves if they’ll just block. The only advantage is that there’s then something to Cancel early if you wanted to, I guess.
I think you misunderstood. I’m not saying it should create the build just to block. I’m saying that even if another machine is free, the scheduler shouldn’t blindly assign a job to it. The canonical example is a non-uniform cluster where the powerful machines, even with jobs already queued on them, will finish all of those jobs before the less powerful machine would finish the one job.
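
A rough sketch of that greedy rule, under simplifying assumptions: each buildslave is reduced to a hypothetical `speed` (work units per second) plus a backlog in seconds, locks are ignored, and `pick_slave` and `schedule` are illustrative names rather than anything in Buildbot.

```python
def pick_slave(work_units, slaves):
    """Return the name of the slave expected to finish the job soonest.

    `slaves` maps name -> (speed, backlog_seconds); speed is work units
    per second and backlog_seconds is the time to drain already-queued jobs.
    """
    def finish_time(name):
        speed, backlog = slaves[name]
        return backlog + work_units / speed
    return min(slaves, key=finish_time)


def schedule(jobs, speeds):
    """Greedily place each job in order; return a list of (job, slave) pairs."""
    state = {name: (speed, 0.0) for name, speed in speeds.items()}
    placement = []
    for work_units in jobs:
        name = pick_slave(work_units, state)
        speed, backlog = state[name]
        state[name] = (speed, backlog + work_units / speed)
        placement.append((work_units, name))
    return placement
```

With one machine ten times faster than the other and a run of equal-sized jobs, the fast machine keeps receiving jobs while the slow one sits idle, until the fast machine's backlog grows long enough that overflowing to the slow machine finishes sooner.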
>> This isn’t optimal from a total queue scheduling perspective since it’s greedy instead of co-operative, but it will actually likely behave
>> per user expectations (i.e. use all the available capacity so that jobs finish the most quickly).
>>
>> -Vitali
>>
>>> On May 6, 2015, at 1:15 PM, Jared Grubb <jared.grubb at gmail.com> wrote:
>>>
>>> Many months ago, I made a change in buildbot to enhance the way that buildslaves and builds get assigned. In particular, we added a “canStartBuild” functor that lets you adjust how these mappings happen.
>>>
>>> There was a design decision I made that I’m starting to regret (and have disabled in my buildbot).
>>>
>>> Question:
>>> - The BRD attempts to pick buildslaves that can acquire builder locks. If no buildslave qualifies (i.e. under high load), we have two choices:
>>> 1. pick a random buildslave that would work otherwise
>>> 2. give up and wait until a buildslave can acquire the locks needed
>>>
>>> Currently, the BRD does #1. However, I’ve seen this cause problems when quick builds get stuck behind long builds: my set of buildslaves goes idle except for one, which has a few builds on it, all stuck behind one long build. If we did #2, the short builds would get assigned as soon as the next buildslave goes idle.
>>>
>>> I am thinking that #2 should be the default behavior, or at least be opt-in configurable.
>>>
>>> Note this applies to both eight and nine and is a fairly trivial patch either way.
>>>
>>> Anyone have any thoughts or comments?
>>>
>>> Jared
>>
>