[Buildbot-devel] Multi-master setup (or not)
Benoît Allard
benoit.allard at greenbone.net
Tue Nov 25 19:43:27 UTC 2014
Hi there,
[TL;DR: I propose introducing a supervisor to manage the master(s).]
The build properties PRs are flowing in [0] (more reviews welcome!), so
it's time to start tackling the next big problem I have with the
current development branch, namely the absence of a hierarchy among
masters.
Let me explain.
For years I believed that I had no need to bother with the
multi-master feature: I don't have hundreds of slaves, nor more than a
dozen repositories to take care of, so why should I care?
Well, that's what I thought until I realised that, even without using
this feature, it has already bitten me quite a few times since I
started experimenting with the current development branch.
In the current development branch ('nine'), all the data is stored in
a common database. Each master (one in most cases) is responsible
for its own configuration (the master.cfg + dependencies) and, as such,
registers it in the db: its slaves, its builders, its schedulers, ...
The masters then populate the database with sourcestamps,
buildrequests, buildsets, builds, and all the rest.
Nothing was wrong until I tried to reconfigure my master, and my old
builders (I had renamed some of them) were still visible on the
waterfall page. A few reconfigs/restarts later, half of that waterfall
page (and of the builder list) is taken up by builders that are not
defined anywhere in my configuration any more. I'm afraid of modifying
my configuration any further! Looking closer, old slaves (actually the
current one, but with a different username/password) are still present
(and linked to my master!), although they do not exist in any
configuration. The same goes for change sources. I guess you get the
picture.
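To make the symptom concrete, here is a minimal Python sketch of how such
stale entries can pile up. It is a toy model, not Buildbot code: the
SharedDB class, its register_builders method and the builder names are all
hypothetical, and the only assumption carried over from the text is that
registering a configuration adds rows but never deletes any.

```python
# Toy model of a shared database; NOT Buildbot's real schema or API.
class SharedDB:
    def __init__(self):
        # builder name -> id of the master that last registered it
        self.builders = {}

    def register_builders(self, master_id, names):
        # Registration only adds or updates rows. Nothing is deleted,
        # because a missing builder might belong to a master that is down.
        for name in names:
            self.builders[name] = master_id


db = SharedDB()
db.register_builders("master-1", ["linux-build", "win-build"])

# Reconfigure: "win-build" has been renamed to "windows-build".
db.register_builders("master-1", ["linux-build", "windows-build"])

# The old name is still listed, so a page built from the db still shows it.
print(sorted(db.builders))
```

Each rename leaves one more orphaned row behind, which matches the
growing list of dead builders on the waterfall page.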
I didn't immediately realise the size of the trouble I had run into. I
opened an issue [1], and expected an easy answer like "Yes, sure,
you just forgot to ..." or something similar. The answer I got was quite
different: it explained that this was a consequence of the current
design, that the slaves / builders / change sources / ... could have
switched masters, or could belong to a master that is not up at
that moment, so no one is in a position to delete their entries from
the database. I had just hit a design flaw.
A few days later, my SVNPoller stopped polling [2], and nothing could
bring it back to life: restart, reconfig, delete from configuration /
reinsert, nothing ... The point was that, in the (common) database, the
poller was still marked as active on a master, so my (one and only)
master didn't try to start it! I was hit by the same design problem.
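The poller lock-out can be sketched the same way. Again this is a toy
model with hypothetical names (ClaimTable, try_claim, release), not the
actual Buildbot schema; it assumes that a change source is started only
when the database shows it as unclaimed, and that the earlier "active"
mark was never cleared. The source does not say why the mark was left
behind.

```python
# Toy model of a claim table; names are hypothetical, not Buildbot's API.
class ClaimTable:
    def __init__(self):
        # change source name -> id of the master marked as running it
        self.active_on = {}

    def try_claim(self, source, master_id):
        # A master starts a change source only if the db shows it unclaimed.
        if source in self.active_on:
            return False
        self.active_on[source] = master_id
        return True

    def release(self, source, master_id):
        # Clears the mark, but only for the master that holds it.
        if self.active_on.get(source) == master_id:
            del self.active_on[source]


claims = ClaimTable()
first_start = claims.try_claim("svn-poller", "master-1")

# The master stops without calling release(); the row stays behind.
# After a restart, the poller still looks active, so it is not started.
second_start = claims.try_claim("svn-poller", "master-1")

print(first_start, second_start)
```

In this model no restart or reconfig can help, because nothing ever
removes the stale row.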
A few weeks later (now), I haven't met any other manifestation of this
problem. But I know it's still there ...
I hope you get the picture now.
The good news is that I have an idea of how to solve it. I'm just not
sure it's the best one: it involves quite a few modifications, and comes
at a price ...
I've been wondering how other distributed systems handle this.
Are there any other distributed systems that rely on a common database
and are able to tell active from inactive entities? I don't know, and so
far I haven't come across any. If you know of one, please speak up; I'd
be interested to know how they manage their data.
Back in eight, the trouble was not that big: the database was only
there to pass information from schedulers to builders. Neither
schedulers, nor builders, nor ... were put in the db; they belonged to
the private data of the master that was responsible for them. If that
master disappeared, so did that information. The old builders (and
builds) did not disappear from the disk, but they were no longer visible
in the web interface, as the master knew which information to show.
My idea is quite simple (in theory). I believe the main problem is that
no one has authority over all the masters; hence I propose introducing a
'supervisor' that would be the only one to know about the configuration,
and that would manage the master(s). The configuration would probably
gain some 'sections' (one per master), so that the supervisor knows
which part to send to which master. For instance, the master responsible
for the web interface would get a list of active identities (schedulers,
slaves, builders, ...) and just show them.
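As a rough illustration of the proposal, here is a small Python sketch.
Nothing in it exists in Buildbot: MasterSection, Supervisor, section_for
and active_identities are hypothetical names, and the per-master
'sections' are modelled as plain sets of names. It only shows the idea
that a single component owns the whole configuration and can therefore
say which identities are active.

```python
# Hypothetical sketch of the proposed supervisor; none of these names
# exist in Buildbot.
from dataclasses import dataclass, field


@dataclass
class MasterSection:
    builders: set = field(default_factory=set)
    slaves: set = field(default_factory=set)
    schedulers: set = field(default_factory=set)


class Supervisor:
    def __init__(self, sections):
        # master name -> MasterSection; only the supervisor holds this.
        self.sections = sections

    def section_for(self, master_name):
        # The part of the configuration that is sent to one master.
        return self.sections[master_name]

    def active_identities(self):
        # Union over all sections: everything currently configured.
        active = {"builders": set(), "slaves": set(), "schedulers": set()}
        for section in self.sections.values():
            active["builders"] |= section.builders
            active["slaves"] |= section.slaves
            active["schedulers"] |= section.schedulers
        return active


supervisor = Supervisor({
    "build-master": MasterSection(
        builders={"linux-build", "windows-build"},
        slaves={"slave-a"},
        schedulers={"nightly"},
    ),
    "web-master": MasterSection(),
})

# Rows found in the shared database, including a stale renamed builder.
builders_in_db = {"linux-build", "win-build", "windows-build"}

# The web master shows only what the supervisor reports as active.
visible = builders_in_db & supervisor.active_identities()["builders"]
print(sorted(visible))
```

The stale "win-build" row is filtered out without anyone having to
decide whether it may be deleted from the database.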
I'm convinced that this solution could completely solve the problem
I've identified. However, it's not an easy one: it involves quite a few
modifications (not **too** many, the goal is to keep it as small as
possible - KISS), and they come at a price, namely time ...
Do you have another / better idea?
Thanks for reading so far.
Best Regards,
Ben.
[0] #1380, #1382, #1384, #1385, #1886, #1887 (and a few more to come)
[1] TRAC-2959
[2] TRAC-3012