[Buildbot-devel] Ignoring offline builders/slaves

Jim Rowan jmr at computing.com
Thu Mar 12 05:19:37 UTC 2015


Jason,

Many of your builders are serviced by only one buildslave. If that buildslave is offline, builds for that builder can’t proceed. The normal way to avoid this is to have more than one buildslave servicing each builder, so that if one of them is down, one of the others will handle the build. This doesn’t really have anything to do with gerrit.

You have a policy coded in your config that all the builds for a particular change have to succeed in order to gain a +1 verified vote in gerrit. Your goal seems to be to relax that policy so that only some subset of the builds is required. I can’t think of a particularly reasonable way to do that.
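For illustration only, the "every build must succeed" policy can be sketched as a plain function. This is a minimal sketch, not the code from your config: the name summary_verdict, the shape of the input (a list of dicts with 'name' and 'result' keys) and the (message, verified) return value are assumptions made for the example, and SUCCESS is defined locally rather than imported from Buildbot.

```python
# Hypothetical sketch of an "all builds must succeed" voting policy.
# Nothing here is Buildbot API; the names and data shapes are assumptions.

SUCCESS = 0  # assumed to mirror Buildbot's SUCCESS result code


def summary_verdict(build_info_list):
    """Return (message, verified) for the finished builds of one change.

    build_info_list is assumed to be a list of dicts, each with a
    'name' (builder name) and a 'result' (integer result code).
    """
    if not build_info_list:
        # No builds reported: cast no vote either way.
        return ("No builds were run", 0)

    failed = sorted(b["name"] for b in build_info_list
                    if b["result"] != SUCCESS)
    if failed:
        return ("Failed builders: " + ", ".join(failed), -1)

    return ("All %d builds succeeded" % len(build_info_list), 1)
```

A function like this only ever sees builds that have finished, so it cannot by itself decide to ignore a build that never started.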

The fundamental problem is that GerritStatusPush won’t call its summary callback until all of the builds in the buildset have completed. If some of the builds are blocked, you’re stuck waiting.

I suppose the most straightforward way to approach this is to write a custom scheduler that dynamically checks which of the builders have at least one buildslave online at the time a change comes in, and triggers only those rather than the whole list. Another approach might be to have something that recognizes when builds are queued for builders that have no online buildslaves, and cancels those builds. (I’m not sure that is effective in removing them from the buildset, but that’s the objective.)
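The selection logic behind both ideas can be sketched roughly as follows. This is only a sketch under stated assumptions: the function names and the plain dict/list inputs are hypothetical, not Buildbot API. In a real master the builder-to-slave mapping and the set of connected slaves would have to come from the master's own runtime state, and the scheduler or cancelling code around this logic is not shown.

```python
# Hypothetical helpers; none of these names exist in Buildbot.

def builders_with_online_slaves(builder_slaves, online_slaves):
    """Return the sorted names of builders that have at least one
    online slave.

    builder_slaves: dict mapping builder name -> list of slave names
    online_slaves:  iterable of slave names currently connected
    """
    online = set(online_slaves)
    return sorted(name for name, slaves in builder_slaves.items()
                  if online.intersection(slaves))


def stalled_builders(queued_builders, builder_slaves, online_slaves):
    """Return the sorted names of builders that have queued builds but
    no online slave, i.e. candidates for cancelling."""
    runnable = set(builders_with_online_slaves(builder_slaves,
                                               online_slaves))
    return sorted(set(queued_builders) - runnable)


if __name__ == "__main__":
    # Made-up builder and slave names, purely for demonstration.
    mapping = {
        "linux-gerrit": ["slave-a", "slave-b"],
        "solaris-gerrit": ["slave-c"],
    }
    print(builders_with_online_slaves(mapping, ["slave-b"]))
    print(stalled_builders(["linux-gerrit", "solaris-gerrit"],
                           mapping, ["slave-b"]))
```

The first function corresponds to the custom scheduler idea (trigger only the returned builders); the second corresponds to finding queued builds that can never start.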

On Mar 11, 2015, at 7:06 PM, Jason Edgecombe <jason at rampaginggeek.com> wrote:

> On 03/11/2015 02:07 PM, Mikhail Sobolev wrote:
>> Hi Jason,
>> 
>> On Tue, Mar 10, 2015 at 07:54:21PM -0400, Jason Edgecombe wrote:
>>> On 03/10/2015 02:33 PM, Mikhail Sobolev wrote:
>>>> On Tue, Mar 10, 2015 at 09:32:00AM -0400, Jason Edgecombe wrote:
>>>>> We maintain a buildbot farm of volunteer slaves. A subset of the slaves
>>>>> use Gerrit as a changesource. Occasionally, one of the gerrit slaves is
>>>>> down, and can be down for hours until the slave admin can fix it. During
>>>>> the outage, gerrit changes are blocked from being built and receiving
>>>>> status updates on the builds. Is there a way to dynamically exclude the
>>>>> offline builders from the gerrit pool?
>>>> Could you please elaborate a bit?  My understanding is that if a build slave is
>>>> down, it won't get any jobs, so how do gerrit changes get blocked?
>>>> 
>>>>> For reference, my buildbot config file is at
>>>>> https://github.com/edgester/afsbotcfg/blob/master/master.cfg
>>>> Thanks for the link.  I'm looking at it now to see if I understand the problem
>>>> better.
>>> We use a summary callback in the Buildbot config, so that only one
>>> comment is posted to gerrit when all of the slaves are done. This is
>>> fine, but new gerrit changes will typically trigger a build on all of
>>> the gerrit builders, even the ones that are down at the time of submission.
>> Let me see if I understand the problem correctly.
>> 
>> There's a number of platforms that you'd like to check the things against.  So
>> a build is created for each of those platforms.  However when for one of the
>> platforms all Gerrit capable build slaves are down, the things get stuck.
>> 
>> If this is the case, then I do not think it's possible to work around this: a
>> build for each platform has to be performed.
>> 
> When a build is submitted or started, buildbot already knows which 
> builders are online. Is there a way to instruct the buildmaster to 
> submit the change to all builders that are online at that time? I 
> understand that changes that are currently building, and possibly those 
> in-queue are hosed, but I would like for newly-submitted (and possibly 
> queued) changes to act as if the downed builder is not in the build list.
> 
> The problem is that one of the admins must intervene to reconfigure the 
> buildmaster to ignore the downed slave in order to restore partial 
> service. Since the admins, like myself, are volunteers and possibly in 
> different timezones, other developers may be waiting for half a day 
> until an admin gets a chance to intervene.
> 
> Thanks,
> Jason
> 




