[Buildbot] #3176: Surprising deadlock behaviour

Buildbot trac trac at buildbot.net
Sat Jan 31 06:29:49 UTC 2015


#3176: Surprising deadlock behaviour
----------------------+------------------------
Reporter:  vlovich    |       Owner:
    Type:  undecided  |      Status:  new
Priority:  major      |   Milestone:  undecided
 Version:  0.8.10     |  Resolution:
Keywords:             |
----------------------+------------------------

Comment (by vlovich):

 Unfortunately I can't give snippets since it's a lot of code.  This is my
 understanding of the deadlock, although I haven't dug into it in depth.
 Changing builder 1 to trigger builder 2 without waiting on it has stopped
 the deadlock from recurring.

 * Builder 1: acquires a shared slave lock for the build.  The last step
 triggers the scheduler for builder 2.
 * Builder 2: acquires a shared slave lock for the build (it technically
 also acquires a separate exclusive master lock).
 * Builder 3: acquires an exclusive slave lock for a step.

 This is a per-slave lock.  I technically have 2 slaves but this is more
 easily reproduced with 1.  Builder 1 has no limit on the number of jobs.
 Builder 2 and 3 honor the per-slave limit of 1 job.

 I think the easiest ordering that exposes this race is:

 * Builder 1 #0: acquires shared slave lock A.  Triggers builder 2 and
 waits.
 * Builder 2 #0: acquires shared slave lock A.  Acquires master lock B.
 Starts running (takes about 15 minutes to complete).
 * Builder 1 #1: acquires shared slave lock A.  Triggers builder 2 and
 waits.
 * Builder 2 #1: waits to acquire master lock B, currently held by builder
 2 #0.
 * Builder 3 #0: waits to acquire exclusive lock A.
 * Builder 2 #0 finishes.
 * Builder 1 #0 finishes.

 Lock graph:

 * Builder 2 #1 cannot acquire shared lock A because builder 3 #0 has a
 step waiting on exclusive lock A.
 * Builder 3 #0 cannot acquire exclusive lock A because builder 1 #1 holds
 shared lock A.
 * Builder 1 #1 can never finish and release lock A because it's blocked
 waiting for builder 2 #1.

 Thus we're in a deadlock because builder 2 is trying to be nice and
 prevent exclusive lock starvation by waiting for the exclusive lock in
 front of it to be acquired and released.  That's not a bad thing, but it
 should only be done in cases where it's not going to deadlock.  I don't
 know if there's an easy way to solve this problem.  You can try to detect
 deadlock every time you wait for a lock, and then grant all the locks you
 can.  If that fails, start aborting jobs that prevent forward progress.
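
 As a rough illustration of the detection idea, the lock graph above can be
 written as a wait-for graph and searched for a cycle.  This is only a
 sketch: the node names and the find_cycle helper are hypothetical and are
 not part of Buildbot.

```python
# Hypothetical wait-for graph for the scenario in this ticket.
# An edge X -> Y means "X cannot proceed until Y does".
WAITS_FOR = {
    "builder2#1": ["builder3#0"],  # queued behind the exclusive request on A
    "builder3#0": ["builder1#1"],  # exclusive A is blocked by a shared holder
    "builder1#1": ["builder2#1"],  # trigger step waits for the triggered build
}


def find_cycle(graph):
    """Return one cycle as a list of nodes, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for nxt in graph.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path, so we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for start in list(graph):
        if colour.get(start, WHITE) == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None


cycle = find_cycle(WAITS_FOR)

# Same graph, but builder 1 triggers builder 2 without waiting for it.
no_wait = dict(WAITS_FOR)
no_wait["builder1#1"] = []
cycle_without_wait = find_cycle(no_wait)
```

 With the edges above, cycle contains all three builds, while
 cycle_without_wait is None, which matches the observation that triggering
 without waiting makes the deadlock go away.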

 Another way is to let shared locks be acquired whenever possible, even if
 there's an exclusive lock in front.  Yes, the exclusive lock might be
 starved for longer than one would like, but assuming the slave isn't
 over-provisioned, it won't livelock.
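
 To make the difference between the two policies concrete, here is a toy
 model of a single shared/exclusive lock.  It is a sketch under my own
 assumptions, not Buildbot's lock code: the SharedExclusiveLock class, the
 fifo_fairness flag and the scenario function are all made up for
 illustration, and master lock B is left out because builder 2 #0 has
 already finished at this point.

```python
class SharedExclusiveLock:
    """Toy model of one lock; NOT Buildbot's actual implementation."""

    def __init__(self, fifo_fairness):
        # True: nobody may overtake an earlier waiter (protects exclusive
        # requests from starvation).  False: shared requests are granted
        # whenever no exclusive holder exists.
        self.fifo_fairness = fifo_fairness
        self.holders = []  # (owner, mode) pairs currently holding the lock
        self.waiters = []  # (owner, mode) pairs in arrival order

    def holds(self, owner):
        return any(o == owner for o, _ in self.holders)

    def _can_grant(self, mode, position):
        if mode == "exclusive":
            ok = not self.holders  # needs the lock completely free
        else:
            ok = all(m == "shared" for _, m in self.holders)
        if self.fifo_fairness:
            ok = ok and position == 0  # only the head of the queue may go
        return ok

    def _drain(self):
        # Keep granting until no waiter can be satisfied.
        granted = True
        while granted:
            granted = False
            for position, (_, mode) in enumerate(self.waiters):
                if self._can_grant(mode, position):
                    self.holders.append(self.waiters.pop(position))
                    granted = True
                    break

    def request(self, owner, mode):
        """Queue a request; return True if it was granted immediately."""
        self.waiters.append((owner, mode))
        self._drain()
        return self.holds(owner)

    def release(self, owner):
        self.holders = [(o, m) for o, m in self.holders if o != owner]
        self._drain()


def scenario(fifo_fairness):
    """Replay the tail of the ordering from this ticket on lock A."""
    lock = SharedExclusiveLock(fifo_fairness)
    lock.request("builder1#1", "shared")      # held while waiting on builder 2
    lock.request("builder3#0", "exclusive")   # has to wait for builder 1 #1
    b2_runs = lock.request("builder2#1", "shared")
    if b2_runs:
        lock.release("builder2#1")  # triggered build finishes
        lock.release("builder1#1")  # so the trigger wait ends too
    return b2_runs, lock.holds("builder3#0")


fair = scenario(fifo_fairness=True)
greedy = scenario(fifo_fairness=False)
```

 In this model fair is (False, False): builder 2 #1 never starts and
 builder 3 #0 never gets the lock.  greedy is (True, True): builder 2 #1
 runs, both shared holders release, and builder 3 #0 then gets its
 exclusive lock.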

--
Ticket URL: <http://trac.buildbot.net/ticket/3176#comment:2>
Buildbot <http://buildbot.net/>
Buildbot: build/test automation

