[users at bb.net] More anecdotes from the multi-master trenches.

Neil Gilmore ngilmore at grammatech.com
Thu Dec 1 16:52:12 UTC 2016


Hi everyone,

More anecdotes. Scheduler related.

As you may recall, I attempted to fix our problem with collapsing
behavior by appending the master name to each scheduler (except the
force schedulers). The idea was to make the schedulers master-specific,
so that the builder and scheduler would be on the same master and the
default collapsing behavior would work (mostly default: we changed it
to not regard revision as significant). This appears to have at least
mostly worked.
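
A minimal sketch of that naming idea, for illustration only. The helper
name, the separator, and the example names are all hypothetical and not
taken from the actual master.cfg:

```python
def master_specific_name(base_name, master_name, is_force=False):
    """Return a scheduler name that is unique to one master.

    Hypothetical helper: force schedulers keep their plain name,
    every other scheduler gets the master name appended.
    """
    if is_force:
        return base_name
    return "%s-%s" % (base_name, master_name)


# Example with made-up scheduler and master names.
print(master_specific_name("nightly", "master1"))
print(master_specific_name("force", "master1", is_force=True))
```

With names built this way, each master ends up with its own copy of
every non-force scheduler, and no two masters share a scheduler name.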

But there's always a snag, isn't there? A few days ago, we switched the 
branch that most of our work is on. The way we have our master.cfg set
up, this is a one-line change. But it changes nearly everything. It
changes builder names, scheduler names, etc.

Now I'm seeing some odd anomalies, such as builds being scheduled by
schedulers that no longer exist on any master and are not in our
master.cfg, but are still in the database.
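
One way to spot such leftovers is to compare the scheduler names
recorded in the database with the names the current config defines. A
rough sketch, assuming both lists have already been extracted by some
other means (how to get them is not shown, and the names are made up):

```python
def stale_scheduler_names(names_in_db, names_in_config):
    """Return names present in the database but absent from the config."""
    return sorted(set(names_in_db) - set(names_in_config))


# Hypothetical names: one scheduler left over from the old branch.
db_names = ["nightly-old-branch-master1", "nightly-new-branch-master1"]
cfg_names = ["nightly-new-branch-master1"]
print(stale_scheduler_names(db_names, cfg_names))
```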

I am also seeing builders in current schedulers that never seem to get 
builds in their queues. We have to force them to see anything happen.

And builders with builds in their queues that never seem to start.

Could this be part of the result of schedulers not being particularly 
reconfigurable?

And on that note, there seem to be 3 schemes in 0.9.x for
checkConfig/reconfigService.

Number 1 is how the schedulers do it, which is that they don't. They
just have largish __init__() methods.

Number 2 is how the workers do it. checkConfig looks a lot like __init__
might, and reconfigService looks a lot like checkConfig, except that it
doesn't raise exceptions.

Number 3 is how things like reporters do it. checkConfig only does 
checks (and the occasional null-ish initialization), and reconfigService 
copies its arguments into itself.

Which is the proper way, since I'm likely to have a go at updating the 
schedulers? Number 1 is right out. Number 2 is pretty easy, mostly 
moving the __init__ to checkConfig, and mostly copying to 
reconfigService, and making sure to call base classes' methods properly.
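
For comparison, scheme 3 applied to a scheduler-like class might look
roughly like the sketch below. It does not import Buildbot:
ServiceSketch, SchedulerSketch, and ConfigError are hypothetical
stand-ins, and everything is synchronous, leaving out whatever
asynchronous handling the real services do.

```python
class ConfigError(Exception):
    """Stand-in error type for this sketch (not Buildbot's own class)."""


class ServiceSketch(object):
    """Hypothetical stand-in for a reconfigurable service base class."""

    def __init__(self, *args, **kwargs):
        # Validate at construction time; attributes are filled in later
        # by reconfigService().
        self.checkConfig(*args, **kwargs)

    def checkConfig(self, *args, **kwargs):
        pass

    def reconfigService(self, *args, **kwargs):
        pass


class SchedulerSketch(ServiceSketch):
    """Scheme 3: checkConfig only checks, reconfigService only copies."""

    def checkConfig(self, name, builderNames):
        if not builderNames:
            raise ConfigError(
                "scheduler %r needs at least one builder" % (name,))

    def reconfigService(self, name, builderNames):
        self.name = name
        self.builderNames = list(builderNames)


sched = SchedulerSketch("nightly", ["builder-a"])
sched.reconfigService("nightly", ["builder-a"])

# A later reconfig revalidates, then copies the new arguments in.
sched.checkConfig("nightly", ["builder-b", "builder-c"])
sched.reconfigService("nightly", ["builder-b", "builder-c"])
print(sched.builderNames)
```

The point of the split is that a bad config is rejected before any
state on the running object changes.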

One slightly happier anecdote...

We ended up with a situation where there were 2 builders for a 
particular worker. Both had current builds marked as acquiring locks 
(remember that we use locks to keep it to one build per worker, except 
for a special builder that should always run, even if there's another 
build running. That's why we don't restrict builds at the worker level).

I did manage to go in through the manhole and release the lock from 
whoever was holding it. By the time I got far enough to do that, I 
wasn't interested in figuring out which build was actually holding onto it.

The first builder's build completed, and the second builder picked up 
after that.

Yay.

As always, thanks for your assistance.

Neil Gilmore
grammatech.com



