[users at bb.net] Another anecdote from the multi-master trenches.

Pierre Tardy tardyp at gmail.com
Thu Dec 8 20:25:13 UTC 2016


Hi Neil,
Thanks for the detailed report. I think it is unlikely that the symptoms you
describe were caused by a failure of the multi-master messaging.
If the data API showed the builders while the UI did not, that means the
builders were seen as attached to 0 masters.
There is a "show old builders" checkbox that could have confirmed that. A
builder is considered "old" when it has no master.
The builders REST API has a masterids attribute that will tell you that.
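
To make that check concrete, here is a minimal sketch that lists the builders
whose masterids attribute is empty. It assumes the builders collection is
served at /api/v2/builders and returns a "builders" list; the base URL and
the helper names are hypothetical, so adjust them to your setup.

```python
import json
from urllib.request import urlopen


def builders_without_master(payload):
    """Return the names of builders whose masterids list is empty."""
    return [
        b["name"]
        for b in payload.get("builders", [])
        if not b.get("masterids")
    ]


def fetch_builders(base_url):
    """Fetch the builders collection; base_url is an assumption, e.g. the UI master."""
    with urlopen(base_url.rstrip("/") + "/api/v2/builders") as resp:
        return json.load(resp)
```

Calling builders_without_master(fetch_builders("http://localhost:8010")) would
then print-ready list the "old" builders, where the URL is only a placeholder
for your UI master's address.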

There are several actions that will make the list of builders of a given
master go to 0:

- During a reconfiguration. The BotMaster service writes the new list
of builders to the database (it could go to 0 in case of misconfiguration).

- At master shutdown. The master sets itself inactive and unregisters
from all its builders.

- After the master health-check period. Each master has a timestamp which it
needs to update regularly in the database to inform other masters that it
is still alive. During that heartbeat callback, the master also checks
whether the other masters have correctly updated their own timestamps. If
one has not done so for the previous 10 minutes, it is assumed to have
crashed without notice, so the first master to detect this marks the quiet
master as disconnected. In your case, this could explain the
behaviour. Maybe there was a time when the consumer and producer masters
were unavailable, blocked or off-network. The third master marked them
away; the two then came back but did not notice that they had been marked
disconnected, and still continued to take buildrequests. I think this is a
design bug that we need to fix. A single reconfig would have fixed the
situation (no need for a restart).
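
The heartbeat scenario can be illustrated with a toy model. This is only a
sketch of the behaviour described above, not Buildbot's actual code: the
MasterRow class and both function names are invented for illustration, and
the only figure taken from the description is the 10-minute window.

```python
from dataclasses import dataclass

HEARTBEAT_TIMEOUT = 10 * 60  # seconds; the 10-minute window described above


@dataclass
class MasterRow:
    """Hypothetical stand-in for one master's row in the database."""
    name: str
    last_active: float  # time of the last heartbeat, in seconds
    active: bool = True


def heartbeat(rows, me, now):
    """Refresh our own timestamp and mark quiet peers inactive.

    Returns the names of the masters newly marked inactive.
    """
    expired = []
    for row in rows:
        if row.name == me:
            row.last_active = now
        elif row.active and now - row.last_active > HEARTBEAT_TIMEOUT:
            row.active = False
            expired.append(row.name)
    return expired


def marked_inactive_by_peer(rows, me):
    """The check the returning masters did not make in the scenario above."""
    return any(row.name == me and not row.active for row in rows)
```

In this model a master that comes back only refreshes its timestamp, so it
stays marked inactive until something (such as a reconfig) re-registers it.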

In any case, I would expect twistd.log to give you some hints.
Either you will find exceptions raised during a reconfiguration,
or you may find a period of time with suspicious activity, which could
explain a missed heartbeat timer.
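
One way to look for such a period is to scan the master's log for long
silences. The sketch below assumes each log entry starts with a timestamp
such as "2016-12-08 17:45:01" (lines without one, like traceback
continuations, are skipped); the function name and the 5-minute default are
arbitrary choices of mine.

```python
from datetime import datetime, timedelta


def find_log_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (before, after) timestamp pairs separated by more than threshold."""
    gaps = []
    previous = None
    for line in lines:
        try:
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # no leading timestamp on this line
        if previous is not None and stamp - previous > threshold:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps
```

It accepts any iterable of lines, so an open log file can be passed directly.
A gap only shows that the master logged nothing for a while; it is a hint to
look closer, not proof that the heartbeat was missed.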

Let us know if you reproduce the problem again and whether this advice helped
you better understand it.
Regards,
Pierre

On Thu, 8 Dec 2016 at 17:45, Neil Gilmore <ngilmore at grammatech.com>
wrote:

> Hi everyone.
>
> First, a bit of good news. My current top priority is to make the
> schedulers reconfigurable. Not conceptually difficult, but I wasn't
> well-versed in Python argument passing (which figures prominently in
> this), so I've had a couple aborted tries on that score. I think I've
> got all that sorted out for now. It's just biting us way too badly to
> not be able to reconfigure schedulers.
>
> Now, the anecdote. As you may remember, we're running 4 masters. 1 just
> has the UI and force schedulers. 1 has our overall logging system. The
> other 2 are split between producing builds, and consuming them for tests.
>
> Sometime between when I left yesterday and when the test lead looked
> this morning, the UI stopped displaying the builders for the producer
> and consumer masters. Looking at all the masters, they were running, and
> I didn't immediately see anything suspicious in the logs. Looking at the
> data api, I could see all the builders and workers. The workers all
> showed connected_to being valid, but only the logging workers showed
> anything in configured_on. I restarted our UI master and that didn't
> help. Restarting the producer and consumer seems to have solved the
> problem. I can see the builders in the UI, and looking at the workers in
> the data API, I see that most appear to have configured_on set. I have
> no idea what actually happened. My wild conjecture is that the
> inter-master communication got screwed up somehow. Either that or they
> lost connection to the database (less likely, I think. Postgres is
> pretty stable that way.).
>
> Neil Gilmore
> grammatech.com