[Buildbot] #3176: Surprising deadlock behaviour
Buildbot trac
trac at buildbot.net
Sat Jan 31 06:29:49 UTC 2015
#3176: Surprising deadlock behaviour
----------------------+------------------------
Reporter: vlovich | Owner:
Type: undecided | Status: new
Priority: major | Milestone: undecided
Version: 0.8.10 | Resolution:
Keywords: |
----------------------+------------------------
Comment (by vlovich):
Unfortunately I can't quite give snippets since it's a lot of code. This
is my understanding of the deadlock, although I haven't dug into it in
depth. Changing builder 1 to trigger builder 2 without waiting on it has
stopped the deadlock from recurring.
* Builder 1: acquires a shared slave lock for the build. Its last step
triggers the scheduler for builder 2.
* Builder 2: acquires a shared slave lock for the build (it technically
also acquires a separate exclusive master lock).
* Builder 3: acquires an exclusive slave lock for a step.
This is a per-slave lock. I technically have 2 slaves but this is more
easily reproduced with 1. Builder 1 has no limit on the number of jobs.
Builder 2 and 3 honor the per-slave limit of 1 job.
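For readers who want something concrete, here is a rough master.cfg fragment for Buildbot 0.8.x that matches the description above. It is a sketch, not the reporter's actual configuration: the lock names, builder names, slave name, commands and the maxCount value are all made up for illustration, and the per-slave job limit for builders 2 and 3 is not modelled.

```python
# Hypothetical master.cfg fragment (Buildbot 0.8.x); all names are assumptions.
from buildbot import locks
from buildbot.config import BuilderConfig
from buildbot.process.factory import BuildFactory
from buildbot.schedulers.triggerable import Triggerable
from buildbot.steps.shell import ShellCommand
from buildbot.steps.trigger import Trigger

# Lock A: per-slave lock. maxCount=10 is an arbitrary value chosen so that
# several builds can hold it in shared ("counting") mode at once.
lock_a = locks.SlaveLock("lock_a", maxCount=10)
# Lock B: master-wide lock, taken exclusively by builder 2.
lock_b = locks.MasterLock("lock_b")

# Builder 1: the last step triggers builder 2 and waits for it to finish.
f1 = BuildFactory()
f1.addStep(ShellCommand(command=["make"]))
f1.addStep(Trigger(schedulerNames=["trigger-builder2"],
                   waitForFinish=True))  # False is the workaround described above

# Builder 2: a long-running build.
f2 = BuildFactory()
f2.addStep(ShellCommand(command=["make", "test"]))

# Builder 3: a single step takes lock A exclusively.
f3 = BuildFactory()
f3.addStep(ShellCommand(command=["make", "install"],
                        locks=[lock_a.access("exclusive")]))

c = BuildmasterConfig = {}
c["schedulers"] = [
    Triggerable(name="trigger-builder2", builderNames=["builder2"]),
]
c["builders"] = [
    BuilderConfig(name="builder1", slavenames=["slave1"], factory=f1,
                  locks=[lock_a.access("counting")]),
    BuilderConfig(name="builder2", slavenames=["slave1"], factory=f2,
                  locks=[lock_a.access("counting"),
                         lock_b.access("exclusive")]),
    BuilderConfig(name="builder3", slavenames=["slave1"], factory=f3),
]
```

The fragment omits the slaves, change sources and status targets that a complete master.cfg needs.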
I think the easiest ordering that exposes this race is:
* Builder 1 #0: acquires shared slave lock A. Triggers builder 2 and waits.
* Builder 2 #0: acquires shared slave lock A. Acquires master lock B.
Starts running (takes about 15 minutes to complete).
* Builder 1 #1: acquires shared slave lock A. Triggers builder 2 and
waits.
* Builder 2 #1: waits to acquire master lock B, currently held by builder
2 #0.
* Builder 3 #0: waits to acquire exclusive lock A.
* Builder 2 #0 finishes.
* Builder 1 #0 finishes.
Lock graph:
* Builder 2 #1 cannot acquire shared lock A because builder 3 #0 has a
step waiting on exclusive lock A.
* Builder 3 #0 cannot acquire exclusive lock A because builder 1 #1 holds
shared lock A.
* Builder 1 #1 can never finish and release lock A because it's blocked
waiting for builder 2 #1.
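The three edges above form a cycle, which is easy to check mechanically. The following is a minimal sketch of cycle detection on a wait-for graph; it is plain Python, not Buildbot code, and the node names and the `find_cycle` helper are made up for illustration.

```python
def find_cycle(waits_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None."""
    visiting = []   # nodes on the current depth-first path, in order
    done = set()    # nodes already proven not to lead into a cycle

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # We walked back onto our own path: everything from the first
            # occurrence of `node` onwards is a cycle.
            return visiting[visiting.index(node):]
        visiting.append(node)
        for blocker in waits_for.get(node, ()):
            cycle = visit(blocker)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in waits_for:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# "X waits for Y" edges taken from the lock graph above.
waits_for = {
    "builder2#1": ["builder3#0"],  # shared A is queued behind exclusive A
    "builder3#0": ["builder1#1"],  # exclusive A needs shared holders gone
    "builder1#1": ["builder2#1"],  # trigger step waits for the triggered build
}

cycle = find_cycle(waits_for)
print(cycle)
```

Note that the last edge is not a lock at all but a trigger-and-wait dependency, so a detector that only looks at locks would miss this cycle.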
Thus we're in a deadlock because builder 2 is trying to be nice and
prevent exclusive lock starvation by waiting for the exclusive request in
front of it to be acquired and released. That's not a bad thing, but it
should only be done in cases where it's not going to deadlock. I don't
know if there's an easy way to solve this problem. One option is to try
to detect deadlock every time you wait for a lock, and then grant all the
locks you can; if that fails, start aborting jobs that prevent forward
progress. Another is to let shared locks be acquired whenever possible,
even if there's an exclusive request in front. Yes, the exclusive lock
might be starved for longer than one would like, but assuming the slave
isn't over-provisioned, it won't livelock.
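To make the difference between the two policies concrete, here is a toy model of lock A that replays the ordering from this comment. It is only a sketch under simplifying assumptions: the `SharedExclusiveLock` class and `replay` function are hypothetical, this is not how Buildbot implements its locks, and master lock B is reduced to the point in time at which builder 2 #1 first asks for A.

```python
class SharedExclusiveLock:
    """Toy shared/exclusive lock with a switchable fairness policy."""

    def __init__(self, favor_waiting_exclusive):
        # True:  shared requests queue behind a waiting exclusive request.
        # False: shared requests are granted whenever no exclusive holder exists.
        self.favor_waiting_exclusive = favor_waiting_exclusive
        self.shared_holders = set()
        self.exclusive_holder = None
        self.exclusive_waiters = []

    def try_shared(self, who):
        if self.exclusive_holder is not None:
            return False
        if self.favor_waiting_exclusive and self.exclusive_waiters:
            return False
        self.shared_holders.add(who)
        return True

    def try_exclusive(self, who):
        if self.exclusive_holder is None and not self.shared_holders:
            if who in self.exclusive_waiters:
                self.exclusive_waiters.remove(who)
            self.exclusive_holder = who
            return True
        if who not in self.exclusive_waiters:
            self.exclusive_waiters.append(who)
        return False

    def release(self, who):
        self.shared_holders.discard(who)
        if self.exclusive_holder == who:
            self.exclusive_holder = None


def replay(favor_waiting_exclusive):
    """Replay the ordering; return (builder2#1 got A, builder3#0 got A)."""
    a = SharedExclusiveLock(favor_waiting_exclusive)
    a.try_shared("builder1#0")
    a.try_shared("builder2#0")
    a.try_shared("builder1#1")
    a.try_exclusive("builder3#0")   # must wait: A has shared holders
    a.release("builder2#0")
    a.release("builder1#0")
    # Master lock B is free now, so builder2#1 asks for shared A.
    b2_granted = a.try_shared("builder2#1")
    if b2_granted:
        # builder2#1 runs and finishes, which unblocks builder1#1's trigger.
        a.release("builder2#1")
        a.release("builder1#1")
    # If builder2#1 was refused, builder1#1 keeps waiting and never releases A.
    b3_granted = a.try_exclusive("builder3#0")
    return b2_granted, b3_granted


print("queue behind exclusive:", replay(True))
print("grant shared eagerly:  ", replay(False))
```

With the fair policy neither builder 2 #1 nor builder 3 #0 ever gets lock A, which is the deadlock described above. With the eager policy both make progress, at the cost of the exclusive request waiting longer.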
--
Ticket URL: <http://trac.buildbot.net/ticket/3176#comment:2>
Buildbot <http://buildbot.net/>
Buildbot: build/test automation