[users at bb.net] Multi-master 0.9.3 anecdotes.

Neil Gilmore ngilmore at grammatech.com
Fri Feb 3 21:42:29 UTC 2017


Hi Pierre,

We've had notify_on_missing configured for some time -- certainly it was 
present when I restarted the masters on Tues. I also restarted the 
master that has those workers. Shouldn't all the workers in master.cfg 
have attached then?

This could cause us some trouble. Here's how we start workers:

We have a builder consisting of ShellCommands that does an ssh login 
into the worker machine, sees if the worker is running, and runs it if 
it is not. This allows us to make certain that all the workers that are 
in master.cfg also have matching worker processes on the worker 
machines. This builder runs every hour. It is entirely possible that a 
worker machine without a worker process could sit for more than an hour 
before having its worker process started.
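
A minimal sketch of the check-and-start step described above, not the
actual builder. The host name, the base directory, and the reliance on a
twistd.pid file in the worker's base directory are assumptions made for
illustration; the function only builds the ssh command line, which could
then be handed to a ShellCommand or run by hand.

```python
import shlex


def ensure_worker_command(host, basedir):
    """Return an ssh argv that starts the worker only if it is not running.

    host and basedir are hypothetical placeholders. The liveness check
    assumes the worker leaves a twistd.pid file in its base directory.
    """
    pidfile = shlex.quote(basedir.rstrip("/") + "/twistd.pid")
    # kill -0 sends no signal; it only tests whether the pid is alive.
    # If the pid file is missing or stale, the check fails and the
    # right-hand side of || starts the worker.
    remote = (
        'kill -0 "$(cat {pid} 2>/dev/null)" 2>/dev/null'
        " || buildbot-worker start {base}"
    ).format(pid=pidfile, base=shlex.quote(basedir))
    return ["ssh", host, remote]
```

For example, ensure_worker_command("worker1.example.com", "/home/bb/worker")
yields a three-element argv whose last element is the remote shell snippet.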

What you guys seem to be telling me is that if I were to stop a worker 
process and let more than an hour go by, that worker would never, ever 
have its builds run, even though the worker is attached and the builds 
are queued.
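
A toy model of the failure as it appears in the traceback, assuming the
missing-worker timer is a one-shot delayed call that the master cancels
when the worker attaches. OneShotTimer and both attach functions are
hypothetical stand-ins, not Buildbot or Twisted code; the only behavior
borrowed from Twisted is that cancelling a delayed call that has already
fired raises AlreadyCalled. The guarded variant is one illustrative way
to avoid the error and is not necessarily what the PR does.

```python
class AlreadyCalled(Exception):
    """Stand-in for twisted.internet.error.AlreadyCalled."""


class OneShotTimer:
    """Toy one-shot timer that refuses to be cancelled after it fired."""

    def __init__(self):
        self.called = False
        self.cancelled = False

    def fire(self):
        # Models missing_timeout elapsing before the worker shows up.
        self.called = True

    def cancel(self):
        if self.called:
            raise AlreadyCalled("Tried to cancel an already-called event.")
        self.cancelled = True

    def active(self):
        # True only while the timer is still pending.
        return not (self.called or self.cancelled)


def attach_unguarded(timer):
    """Cancel unconditionally; fails if the timer already fired."""
    timer.cancel()
    return "attached"


def attach_guarded(timer):
    """Cancel only a pending timer, so a late attach still succeeds."""
    if timer.active():
        timer.cancel()
    return "attached"
```

In this model a worker that attaches before the timer fires succeeds either
way, while one that attaches afterwards succeeds only with the guarded
variant.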

That sounds pretty bad to me. Am I understanding correctly? Or can I 
wait until a worker process is running, then reconfigure without that 
worker in master.cfg (and get unknown worker errors, I suppose), then 
reconfigure with the worker back in master.cfg, so that it attempts to 
attach before the timeout?

Let me emphasize that the thing that brought this to my attention was 
adding a new worker and its builders to master.cfg. The worker process 
would have been started sometime after that by the builder that starts 
worker processes.

Neil Gilmore
grammatech.com

On 2/3/2017 3:17 PM, Pierre Tardy wrote:
> Hi Neil,
>
> The timer starts when the worker is first configured,
> but only if notify_on_missing is configured.
>
> That may be a reason why you do not see the bug for ancient workers.
>
> Pierre
>
> On Fri, Feb 3, 2017 at 21:59, Neil Gilmore <ngilmore at grammatech.com 
> <mailto:ngilmore at grammatech.com>> wrote:
>
>     Hi Andrej,
>
>     Thanks for the reply.
>
>     I don't see missing_timeout in our master.cfg anywhere. But I do
>     see this:
>
>     c['workers'] = [Worker(host, '<password>',
>     notify_on_missing=bots_email[host]) for host in bots_list]
>
>     Let's see if I understood you. The default missing_timeout is 60
>     minutes. If I start the master and wait 60 minutes, then start the
>     worker, the worker won't attach?
>
>     In our case, we're not even adding the worker to master.cfg until well
>     after that 60 minutes (a couple days after). We're adding new workers.
>     Do you figure this could be the same problem?
>
>     What happens with a default notify_on_missing? I figure I can try the
>     patch in your PR when we restart the masters.
>
>     Neil Gilmore
>     raito at raito.com <mailto:raito at raito.com>
>
>     On 2/3/2017 2:42 PM, Andrej Rode wrote:
>     > Hi Neil,
>     >
>     >> 2017-02-03T12:39:09-0500 [Broker,28906,10.233.216.43] worker '<name>' attaching from IPv4Address(TCP, '<ip>', 35642)
>     >> 2017-02-03T12:39:09-0500 [Broker,28906,10.233.216.43] Got workerinfo from '<name>'
>     >> 2017-02-03T12:39:09-0500 [-] bot attached
>     >> 2017-02-03T12:39:09-0500 [-] worker <name> cannot attach
>     >>          Traceback (most recent call last):
>     >>          Failure: twisted.internet.error.AlreadyCalled: Tried to cancel an already-called event.
>     > I had the same problems, but with a single-master setup. By any chance,
>     > are you using a non-default `missing_timeout` and/or `notify_on_missing`
>     > on your workers?
>     >
>     > For my issue I have a PR up [0], and now I can detach and attach workers
>     > as I like. But it is still not clear why we even run into problems here.
>     >
>     > I figured out that attaching a worker more than `missing_timeout`
>     > after a master start results in this problem on my setup.
>     > (Default `missing_timeout` is 60 minutes.)
>     >
>     > Cheers,
>     > Andrej
>     >
>     > [0] https://github.com/buildbot/buildbot/pull/2708
>     > _______________________________________________
>     > users mailing list
>     > users at buildbot.net <mailto:users at buildbot.net>
>     > https://lists.buildbot.net/mailman/listinfo/users
>
