[Buildbot-devel] Multi-master setup (or not)

Benoît Allard benoit.allard at greenbone.net
Tue Nov 25 19:43:27 UTC 2014


Hi there,

[TL;DR: I propose introducing a supervisor to manage the master(s).]

The build properties PRs are flowing in [0] (more reviews welcome!), so 
it's time to start tackling the next big problem I have with the 
current development branch, namely the absence of a hierarchy among 
masters.

Let me explain.

For years I believed that I didn't need to bother with this 
multi-master stuff: I don't have hundreds of slaves, and no more than a 
dozen repositories to take care of, so why should I care? Well, that's 
what I thought until I realised that, even without using this feature, 
it has bitten me quite a few times already since I started 
experimenting with the current development branch.

In the current development branch ('nine'), all the data is stored in 
a common database. Each master (one in most cases) is responsible for 
its own configuration (the master.cfg + dependencies), and as such 
registers it in the db: its slaves, its builders, its schedulers, ... 
The masters then populate the database with sourcestamps, 
buildrequests, buildsets, builds, and all the rest.

Nothing was wrong until I tried to reconfigure my master, and my old 
builders (I had renamed some of them) were still shown on the 
waterfall page. A few reconfigs/restarts later, half of that waterfall 
page (and of the builder list) is taken up by builders that are not 
defined anywhere in my configuration any more. I'm afraid of modifying 
my configuration any further! Looking closer, old slaves (actually the 
current one, but with a different username/password) are still present 
(and linked to my master!), although they don't exist in any 
configuration! Same for change sources. I guess you get the picture.

I didn't immediately realise the size of the trouble I had run into. I 
opened an issue [1], and expected an easy answer like ... "Yes, sure, 
you just forgot to ..." or something similar. The answer I got was quite 
different: it explained that this was a consequence of the current 
design, that the slaves / builders / change sources / ... could have 
switched master, or could belong to a master that is not up at 
that moment, so no one is in a position to delete their entries from 
the database. I had just hit a design flaw.
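
To make this concrete, here is a toy model of the situation. It is plain 
Python, not Buildbot code: the SharedDb class, its methods and the 
builder names are all made up for illustration, and the real schema is 
certainly different. It only shows that if masters can add rows but no 
one is entitled to remove them, a rename leaves the old name visible.

```python
# Toy model only: names and structure are hypothetical, not Buildbot's schema.
class SharedDb:
    """Stands in for the common database shared by all masters."""

    def __init__(self):
        # builder name -> set of master names that registered it
        self.builders = {}

    def register(self, master, builder_names):
        # A master only adds what its own master.cfg defines; it never
        # deletes, because a row might belong to another (offline) master.
        for name in builder_names:
            self.builders.setdefault(name, set()).add(master)

    def visible_builders(self):
        # The web interface lists every row: nothing marks a row as obsolete.
        return sorted(self.builders)


db = SharedDb()
db.register("master-1", ["build-linux", "build-win"])
# Reconfig: "build-win" was renamed to "build-windows" in master.cfg.
db.register("master-1", ["build-linux", "build-windows"])
print(db.visible_builders())
# ['build-linux', 'build-win', 'build-windows']
```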

A few days later, my SVNPoller stopped polling [2], and nothing could 
bring it back to life: restart, reconfig, delete from configuration / 
reinsert, nothing ... The point was that, in the (common) database, the 
poller was still marked as active on a master, so my (one and only) 
master didn't try to start it! I was hit by the same design problem.
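
Again as a toy model (hypothetical names, not the actual Buildbot 
implementation): assume the database keeps one "claim" per change 
source, and a master only starts a source that nobody has claimed. How 
the claim became stale is not modelled here, and "master-old" is just a 
placeholder for whatever the row pointed to.

```python
# Toy model only: the claim table and function names are hypothetical.
claims = {}  # change source name -> name of the master that claimed it


def try_start(master, source):
    """Start `source` on `master` unless the database says it is claimed."""
    owner = claims.get(source)
    if owner is not None:
        # Believed to be running on `owner`, so this master does nothing.
        return False
    claims[source] = master
    return True


# A claim row left behind earlier and never released.
claims["svn-poller"] = "master-old"

# The only running master finds the row and declines to start the poller,
# no matter how often it is restarted or reconfigured.
started = try_start("master-1", "svn-poller")
print(started)  # False
```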

A few weeks later (now), I haven't met any other manifestation of this 
trouble. But I know it's still there ...

Hope you got the picture now.

The good news is, I have an idea how to solve it. I'm just not sure 
it's the best one: it involves quite a few modifications, and comes at a 
price ...

I've been wondering how other distributed systems handle this.

Are there any other distributed systems that rely on a common database 
and are able to tell active from inactive entries? I don't know, and so 
far I've not met any. If you know of one, please speak up, I'd be 
interested to know how they manage their data.

Back in eight, the trouble was not that big: the database was only 
there to pass information from schedulers to builders. Neither 
schedulers, nor builders, nor ... were put in the db; they belonged to 
the personal data of the master that was responsible for them. If that 
master disappeared, so did that information. The old builders (and 
builds) did not disappear from the disk, but they were no longer 
visible in the web interface, as the master knew which information to 
show.

My idea is quite simple (in theory). I believe the main trouble is that 
no one has authority over all the masters: hence I propose introducing a 
'supervisor' that would be the only one to know about the configuration, 
and would manage the master(s). The configuration would probably gain 
some 'sections' (one per master), so that the supervisor knows which 
part to send to which master. For instance, the master responsible for 
the web interface would get a list of active identities (schedulers, 
slaves, builders, ...) and just show them.
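
A very rough sketch of what I have in mind, in plain Python. None of 
this exists in Buildbot: the CONFIG layout, the section names and the 
functions are assumptions made up for this mail, only meant to show the 
supervisor splitting one configuration into per-master sections and 
deriving the list of active identities from it.

```python
# Hypothetical sketch: neither this layout nor these names exist in Buildbot.
CONFIG = {
    "master-build-1": {
        "builders": ["build-linux"],
        "slaves": ["slave-1"],
        "schedulers": ["nightly"],
    },
    "master-build-2": {
        "builders": ["build-windows"],
        "slaves": ["slave-2"],
    },
    "master-web": {"www": True},
}


def section_for(master):
    """Part of the configuration the supervisor would send to one master."""
    return CONFIG[master]


def active_identities():
    """Everything currently configured, per kind, across all sections."""
    active = {}
    for section in CONFIG.values():
        for kind, names in section.items():
            if isinstance(names, list):  # skip plain flags such as "www"
                active.setdefault(kind, set()).update(names)
    return active


def payload_for_web_master():
    # The web master gets its own section plus the identities to display;
    # anything else found in the database would be treated as inactive.
    return {"section": section_for("master-web"), "active": active_identities()}


print(sorted(payload_for_web_master()["active"]["builders"]))
# ['build-linux', 'build-windows']
```

Since only the supervisor sees the whole configuration, it is the one 
place that can say "this builder is gone", which no single master can 
do today.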

I'm convinced that this solution could completely solve the trouble 
I've identified. However, it's not an easy one: it involves quite a few 
modifications (not **too** many, the goal is to keep it as small as 
possible - KISS), and they come at a price, namely time ...

Do you have another / better idea?

Thanks for reading so far.

Best Regards,
Ben.

[0] #1380, #1382, #1384, #1385, #1886, #1887 (and a few more to come)
[1] TRAC-2959
[2] TRAC-3012



