[Buildbot-devel] Debugging (presumed) deadlocks

Dmitry Mikhin dmitry.mikhin at gmail.com
Tue Feb 7 21:26:15 UTC 2012


Hi guys,

I'm faced with the following issue:

I have a box that acts both as a native slave and as a host for several
libvirt slaves. More specifically, it has to run 8 build tasks natively
plus 3 virtual slaves with 2 tasks each. The total build-test cycle
(build, test, rebuild with different options, test again, etc.) is quite
long, reaching 2-3 hours, so load balancing is pretty important. The box
has a 6-core CPU and max_builds=4, and it handles 4 tasks at a time
without issues.

Now, after properly configuring the automated libvirt slaves, I first noticed
that they are not included in the build count: the master attempts
to start 4 native tasks (per max_builds) plus several virtual boxes at once.
Some were eventually starved of CPU and/or memory, causing the builds
to time out.

I implemented a custom locking scheme:

from buildbot import locks
my_num_cpus = 4
my_lock = locks.MasterLock("my", maxCount=my_num_cpus)

Then the main (native) slave is configured with

            'max_builds': my_num_cpus,
            'locks'     : [my_lock.access("counting")],

while the libvirt slaves are configured with

            'max_builds'         : 1,
            'locks'              : [my_lock.access("counting")],

This helped in the sense that the master never starts more than my_num_cpus
jobs at once. However, I noticed that one of the virtual slaves is never
run. All slaves are configured identically and differ only in their brand
of Linux, which for buildbot purposes is just a string. If the buildmaster
is restarted, the left-out slave wakes up and does its two builds
properly. However, if the master is restarted late in the day and new scheduled
builds kick in before the two belated ones are completed, the problems
start to cascade:

1. Only up to 3 builders (out of max_builds=4) run concurrently.

2. Some other seemingly random libvirt slave is marked offline.

Now, to the questions:

1. Does it make sense to do explicit lock-based load balancing of
virtual and native slaves? Is there a more appropriate API for doing this?

2. Do the symptoms sound familiar? Any ideas where to poke?

3. How can I check who is holding the locks at the moment (assuming it is
indeed a deadlock problem)?

The buildmaster is configured to accept manhole connections, but I'm not
sufficiently conversant with buildbot internals to dig down to the locks
from the top-level master object.
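
For reference, this is the direction I have been poking in from the manhole
prompt; the attribute names (botmaster.locks, lock.owners, lock.waiting) are
my guesses at the internals, not something I have verified:

# Guesswork: I assume the botmaster keeps a dict of the real lock objects,
# and that each lock tracks its current owners and the queued waiters.
for lockid, lock in master.botmaster.locks.items():
    print lock
    print "  owners: ", lock.owners    # who currently holds the lock
    print "  waiting:", lock.waiting   # requests queued behind them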

Best regards,
Dmitry Mikhin