[Buildbot-devel] Multi-master setup (or not)

Benoît Allard benoit.allard at greenbone.net
Wed Nov 26 11:28:37 UTC 2014


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 11/26/2014 10:37 AM, Edward Armes wrote:
> Hi all,
> 
> I see Buildbot's exact problem in this case as semi-unique. Most
> systems follow a pattern where there is a central master that can be
> written to, and duplicate masters that act as readers (i.e. the
> traditional distributed DB approach). Git seems to take the approach
> that every copy is a master, and the true master is designated only
> by user convention. However, I get the feeling that neither of these
> approaches would work for Buildbot? Have you looked into how
> Zookeeper approaches this problem? I imagine they may have had to
> solve similar problems.
> 

Thanks for the hint! Let's look together at Zookeeper's Overview
page [0].

Zookeeper has an elected server that acts as a leader, and doesn't
accept direct connections from clients (? not sure about that one ?).
All Zookeeper servers know about each other, and they all share the
same configuration. Every server holds a copy of the 'in-memory'
database. The 'between-server' protocol takes care of replacing the
leader on failure and of syncing the followers with the leader.

Let's consider Zookeeper's database as our 'configuration', and our
database as an implementation detail.

To summarize and put it in Buildbot's terms: each master has its own
copy of the configuration, and one (randomly?) chosen master acts as
the leader while all the others are followers. Masters communicate
with each other directly, and the leader is responsible for
propagating configuration updates to the others. That communication
protocol also takes care of electing a new leader upon failure (in
less than 200 ms!).
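To make the leader/follower idea a bit more concrete, here is a minimal in-process Python sketch, assuming masters are plain objects and the "network" is just a method call. `Master`, `Cluster`, `elect()`, `update_config()`, `fail()` and `recover()` are hypothetical names, and the election rule (newest config wins, ties broken by name) is a stand-in chosen for illustration. This is NOT ZooKeeper's actual Zab protocol: there is no quorum, no persistence and no network partition handling.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Hypothetical stand-in for a buildbot master holding a config copy."""
    name: str
    alive: bool = True
    config_version: int = 0
    config: dict = field(default_factory=dict)

    def apply(self, version, config):
        # Followers ignore updates that are not newer than what they have.
        if version <= self.config_version:
            return False
        self.config_version = version
        self.config = dict(config)
        return True


class Cluster:
    """Toy leader/follower group; NOT ZooKeeper's Zab protocol (no quorum)."""

    def __init__(self, masters):
        self.masters = {m.name: m for m in masters}
        self.leader = None
        self.elect()

    def elect(self):
        # Stand-in election: the live master with the newest config wins,
        # ties broken by the lexicographically smallest name.
        live = [m for m in self.masters.values() if m.alive]
        if not live:
            raise RuntimeError("no live master left to elect")
        self.leader = min(live, key=lambda m: (-m.config_version, m.name))
        return self.leader

    def update_config(self, config):
        # Only the leader assigns versions; it pushes to every live master.
        version = self.leader.config_version + 1
        for m in self.masters.values():
            if m.alive:
                m.apply(version, config)
        return version

    def fail(self, name):
        self.masters[name].alive = False
        if self.leader.name == name:
            self.elect()

    def recover(self, name):
        # A returning master is re-synced from the current leader's copy.
        m = self.masters[name]
        m.alive = True
        m.apply(self.leader.config_version, self.leader.config)


cluster = Cluster([Master("m1"), Master("m2"), Master("m3")])
cluster.update_config({"builders": ["linux"]})
cluster.fail("m1")  # m1 was the leader, so a new one is elected
cluster.update_config({"builders": ["linux", "windows"]})
cluster.recover("m1")
print(cluster.leader.name, [m.config_version for m in cluster.masters.values()])
# prints: m2 [2, 2, 2]
```

Even this toy version shows where the real work lies: version ordering, re-syncing a master that comes back, and deciding who may assign the next version.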

Sounds very reliable! But not trivial to implement. Besides, how do
we keep the configuration up to date between masters? Update the
in-memory model via the protocol, and synchronize the files as well?

Regards,
Ben.

[0] http://zookeeper.apache.org/doc/trunk/zookeeperOver.html

> Edward
> 
> On Tue, Nov 25, 2014 at 11:09 PM, Dustin J. Mitchell
> <dustin at v.igoro.us> wrote:
> 
>> The fact that Buildbot's configuration is code makes this
>> tricky. Otherwise, we could just load the configuration into the
>> DB and have all masters pull from there.  Configuration as code
>> is one of Buildbot's advantages over other tools, so I don't want
>> to lose that.
>> 
>> I'd love to have more of the smart people on this list looking at
>> the problem and thinking about the right solution.  There's time
>> for some simple modifications to the model, like you suggest, to
>> make it into nine.  There's also an argument to be made (as
>> Pierre did) that the model isn't fundamentally broken, but just
>> has a buggy implementation and needs some utilities built up.
>> 
>> In particular, I'd love to hear how other software has solved
>> similar problems -- we've reinvented enough wheels already here
>> at Buildbot HQ!
>> 
>> Dustin
>> 
>> On Tue, Nov 25, 2014 at 2:43 PM, Benoît Allard 
>> <benoit.allard at greenbone.net> wrote:
>>> Hi there,
>>> 
>>> [TL;DR: I propose introducing a supervisor to manage the
>>> master(s).]
>>> 
>>> The build properties PRs are flowing in [0] (more reviews welcome!),
>>> so it's time to start tackling the next big problem I have with
>>> the current development branch, namely the absence of a master
>>> hierarchy.
>>> 
>>> Let me explain.
>>> 
>>> For years I believed (really!) that I didn't need to bother with
>>> all that multi-master stuff: I don't have hundreds of slaves, and
>>> have no more than a dozen repositories to take care of, so why
>>> should I care? Well, that's what I thought until I realised that,
>>> even without using this feature, it has already bitten me quite a
>>> few times since I started experimenting with the current
>>> development branch.
>>> 
>>> In the current development branch ('nine'), all data is stored in
>>> a common database. Each master (only one in most cases) is
>>> responsible for its own configuration (master.cfg plus its
>>> dependencies), and as such registers it in the DB: its slaves, its
>>> builders, its schedulers, ... The masters then populate the
>>> database with sourcestamps, buildrequests, buildsets, builds, and
>>> everything else.
>>> 
>>> Everything was fine until I tried to reconfigure my master: my old
>>> builders (I had renamed some of them) were still shown on the
>>> waterfall page. A few reconfigs/restarts later, half of the
>>> waterfall page (and of the builder list) is taken up by builders
>>> that are no longer defined anywhere in my configuration. I'm afraid
>>> to modify my configuration any further! Looking further, old slaves
>>> (actually the current one, but with a different username/password)
>>> are still present (and linked to my master!), although they don't
>>> exist in any configuration. The same goes for change sources; I
>>> guess you get the picture.
>>> 
>>> I didn't immediately realise the size of the problem I had run
>>> into. I opened an issue [1] and expected an easy answer like
>>> "Yes, sure, you just forgot to ..." or something similar. The
>>> answer I got was quite different: it explained that this was a
>>> consequence of the current design. The slaves / builders / change
>>> sources / ... could have switched masters, or could belong to a
>>> master that is not up at that moment, so no one is in a position
>>> to delete their entries from the database. I had just hit a design
>>> flaw.
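The situation can be reproduced with a toy model. The following Python sketch uses the standard `sqlite3` module and a hypothetical, heavily simplified schema (`masters`, `builders`, `builder_masters`) loosely inspired by the shared-database design described above; it is NOT Buildbot's actual schema or data API, and `register_master()`, `reconfig()`, `listed_builders()` and `unowned_builders()` are illustrative names. Each master can rewrite only its own links, and it cannot safely delete a shared `builders` row, since another (possibly stopped) master might still own it.

```python
import sqlite3

# Hypothetical, heavily simplified schema; NOT Buildbot's real schema.
SCHEMA = """
CREATE TABLE masters (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE builders (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE builder_masters (
    builderid INTEGER, masterid INTEGER, UNIQUE (builderid, masterid));
"""


def register_master(db, name):
    db.execute("INSERT OR IGNORE INTO masters (name) VALUES (?)", (name,))
    row = db.execute("SELECT id FROM masters WHERE name = ?", (name,)).fetchone()
    return row[0]


def reconfig(db, masterid, builder_names):
    # A master only knows its own master.cfg, so it can only rewrite its
    # own links to builders...
    db.execute("DELETE FROM builder_masters WHERE masterid = ?", (masterid,))
    for name in builder_names:
        db.execute("INSERT OR IGNORE INTO builders (name) VALUES (?)", (name,))
        bid = db.execute(
            "SELECT id FROM builders WHERE name = ?", (name,)).fetchone()[0]
        db.execute("INSERT OR IGNORE INTO builder_masters VALUES (?, ?)",
                   (bid, masterid))
    # ...but it must not DELETE FROM builders: another master, possibly one
    # that is currently down, might still own a builder with that name.


def listed_builders(db):
    # A UI that lists every known builder also shows the stale ones.
    return [r[0] for r in db.execute("SELECT name FROM builders ORDER BY name")]


def unowned_builders(db):
    # Builders that no master links to any more.
    return [r[0] for r in db.execute(
        "SELECT name FROM builders WHERE id NOT IN "
        "(SELECT builderid FROM builder_masters) ORDER BY name")]


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
m1 = register_master(db, "master1")
reconfig(db, m1, ["runtests", "docs"])
reconfig(db, m1, ["runtests-linux", "docs"])  # "runtests" was renamed
print(listed_builders(db))   # prints ['docs', 'runtests', 'runtests-linux']
print(unowned_builders(db))  # prints ['runtests']
```

The renamed builder survives every reconfig: the only component that could delete it is one that knows the configuration of all masters, and in this model none does.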
>>> 
>>> A few days later, my SVNPoller stopped polling [2], and nothing
>>> could bring it back to life: restart, reconfig, removing it from
>>> the configuration and re-adding it, nothing... The point was that
>>> in the (common) database, the poller was still marked as active on
>>> a master, so my (one and only) master didn't try to start it! I was
>>> hit by the same design problem.
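The poller symptom boils down to a claim that outlives its owner. Here is a minimal sketch, again using `sqlite3`, with a hypothetical `changesources` table (one row per change source, holding the id of the master currently running it). `try_claim()` and `release_all()` are illustrative names, not Buildbot functions, and the "master 1 vanishes, the surviving master is registered as master 2" scenario is an assumption made for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical table: which master (if any) currently runs each change source.
db.execute("CREATE TABLE changesources (name TEXT PRIMARY KEY, masterid INTEGER)")


def try_claim(db, name, masterid):
    """Start `name` on `masterid` only if no other master holds it."""
    db.execute("INSERT OR IGNORE INTO changesources (name, masterid) "
               "VALUES (?, NULL)", (name,))
    db.execute("UPDATE changesources SET masterid = ? "
               "WHERE name = ? AND masterid IS NULL", (masterid, name))
    owner = db.execute("SELECT masterid FROM changesources WHERE name = ?",
                       (name,)).fetchone()[0]
    return owner == masterid


def release_all(db, masterid):
    # Only safe if someone *knows* that `masterid` is really gone.
    db.execute("UPDATE changesources SET masterid = NULL WHERE masterid = ?",
               (masterid,))


started = try_claim(db, "SVNPoller", 1)    # master 1 starts the poller
# Master 1 vanishes without releasing its claim; the only running master
# is now registered as master 2.
restarted = try_claim(db, "SVNPoller", 2)  # False: still claimed by master 1
release_all(db, 1)                         # requires an authority to decide
recovered = try_claim(db, "SVNPoller", 2)  # True
print(started, restarted, recovered)       # prints True False True
```

Restarting or reconfiguring master 2 never changes the outcome of the second call; only the `release_all(db, 1)` step does, and no single master is in a position to decide that it is safe.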
>>> 
>>> A few weeks later (now), I haven't run into any other manifestation
>>> of this problem. But I know it's still there...
>>> 
>>> I hope you get the picture now.
>>> 
>>> The good news is that I have an idea of how to solve it. I'm just
>>> not sure it's the best one: it involves quite a few modifications,
>>> and comes at a price...
>>> 
>>> I've been wondering how other distributed systems do it.
>>> 
>>> Are there any other distributed systems that rely on a common
>>> database and are able to tell active from inactive items? I don't
>>> know, and so far I haven't come across any. If you know of one,
>>> please speak up; I'd be interested to know how they manage their
>>> data.
>>> 
>>> Back in eight, the problem was not that big: the database was only
>>> there to pass information from schedulers to builders. Neither
>>> schedulers, nor builders, nor ... were stored in the DB; they
>>> belonged to the private data of the master responsible for them. If
>>> that master disappeared, so did that information. The old builders
>>> (and builds) did not disappear from disk, but they were no longer
>>> visible in the web interface, as the master knew which information
>>> to show.
>>> 
>>> My idea is quite simple (in theory). I believe the main problem is
>>> that no one has authority over all the masters, hence I propose
>>> introducing a 'supervisor' that would be the only one to know about
>>> the configuration, and that would manage the master(s). The
>>> configuration would probably gain some 'sections' (one per master),
>>> so that the supervisor knows which part to send to which master.
>>> For instance, the master responsible for the web interface would
>>> get a list of active identities (schedulers, slaves, builders, ...)
>>> and just show them.
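A minimal sketch of the proposed supervisor, assuming the configuration has already been evaluated into plain Python data with one section per master. The section names (`master-a`, `master-b`, `master-web`), the keys, and the `Supervisor` class with its `section_for()`, `active_identities()` and `stale()` methods are all hypothetical, chosen for illustration; none of this is an existing Buildbot API.

```python
# Hypothetical config layout: one section per master, as proposed above.
CONFIG = {
    "master-a": {"builders": ["linux", "docs"], "schedulers": ["trunk"],
                 "slaves": ["bs1"]},
    "master-b": {"builders": ["windows"], "schedulers": ["nightly"],
                 "slaves": ["bs2"]},
    "master-web": {"www": {"port": 8010}},
}

KINDS = ("builders", "schedulers", "slaves")


class Supervisor:
    """Sole owner of the full configuration (hypothetical)."""

    def __init__(self, config):
        self.config = config

    def section_for(self, master_name):
        # Each master only receives its own part of the configuration.
        return self.config.get(master_name, {})

    def active_identities(self):
        # Only the supervisor sees every section, so only it can tell
        # which builders/schedulers/slaves currently exist.
        return {kind: sorted({name for section in self.config.values()
                              for name in section.get(kind, [])})
                for kind in KINDS}

    def stale(self, kind, names_in_db):
        # Anything recorded in the DB but absent from every section can be
        # hidden by the web master (or deleted).
        return sorted(set(names_in_db) - set(self.active_identities()[kind]))


sup = Supervisor(CONFIG)
print(sup.section_for("master-b")["builders"])  # prints ['windows']
print(sup.active_identities()["builders"])      # prints ['docs', 'linux', 'windows']
print(sup.stale("builders", ["docs", "linux", "runtests", "windows"]))
# prints ['runtests']
```

The design choice this illustrates: deciding what is stale becomes a simple set difference once a single component holds the whole configuration, instead of a question no individual master can answer.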
>>> 
>>> I'm convinced that this solution could completely solve the
>>> problem I've identified. However, it's not an easy one: it involves
>>> quite a few modifications (not **too** many; the goal is to keep
>>> them as small as possible - KISS), and they come at a price, namely
>>> time...
>>> 
>>> Do you have another / better idea?
>>> 
>>> Thanks for reading so far.
>>> 
>>> Best Regards, Ben.
>>> 
>>> [0] #1380, #1382, #1384, #1385, #1886, #1887 (and a few more to
>>>     come)
>>> [1] TRAC-2959
>>> [2] TRAC-3012
>>> 
>>> 
>>> _______________________________________________
>>> Buildbot-devel mailing list 
>>> Buildbot-devel at lists.sourceforge.net 
>>> https://lists.sourceforge.net/lists/listinfo/buildbot-devel


- -- 
Benoît Allard (B30A05B0)|Greenbone Networks GmbH|http://greenbone.net
Neuer Graben 17, 49074 Osnabrück, Germany | AG Osnabrück, HR B 202460
Executive Directors: Lukas Grunwald, Dr. Jan-Oliver Wagner
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iQEbBAEBAgAGBQJUdblgAAoJEHZCfVOzCgWw3ZMH+MvBJTCZMuCKyHA0hqN6jvAQ
y+0GWXy5ojGuKfN0olgFlwqig+HzNIgN/XOTW3eTSfcZlharlmlqHXHSuCGF4yEf
ctY+YNGebIv64hE+41asEvvgf2zWlr/wqHa/h2+eKWvFSUfp4RK+co5FNSQeNeU1
/rB+RHx1G3NWe7os6z9fmehXF+8hTsPpVuLZFoQmC7h0BFeqYOXPNDB6rX4EDyGp
XZWi785KqcVnOtcwe5Xt4pRxRNiYOsNbaohcHkjnv5nWUnzP/pKtpAKAF3VnIJ+r
fKLBecXv5Z9P6prXlc+5M1UVQ33C4tB11bkTVDqgvihVUaKwW3gpOPjQP6kPRw==
=ek2B
-----END PGP SIGNATURE-----



