[Buildbot-commits] [Buildbot] #2427: Master requires restart sometimes if slave connection lost during graceful shutdown
Buildbot
nobody at buildbot.net
Fri Jan 18 18:30:57 UTC 2013
#2427: Master requires restart sometimes if slave connection lost during graceful
shutdown
-------------------+-----------------------
Reporter: dank | Owner:
Type: defect | Status: new
Priority: minor | Milestone: undecided
Version: 0.8.7 | Keywords:
-------------------+-----------------------
I run with a patch applied that requests a graceful shutdown on each
slave as soon as each build starts; that lets me run slaves that do
just one build and exit.
That works great, except that it seems to expose a problem in the
graceful shutdown protocol. Fairly often, if two jobs are submitted
to a builder, the first one works, but during the graceful shutdown,
the master gets confused and shows the builder as stuck; the initial
step fails with a connection error, but the build is not failed.
Restarting the master recovers from the problem.
When the problem happens, stderr from the affected job looks like
[Failure instance: Traceback (failure with no frames): <class
'twisted.internet.error.ConnectionLost'>: Connection to the other side was
lost in a non-clean fashion.
]
or occasionally
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback
(failure with no frames): <class 'twisted.internet.error.ConnectionDone'>:
Connection was closed cleanly.
]
See discussion at
http://permalink.gmane.org/gmane.comp.python.buildbot.devel/8964
which says
"The code looks like this:

    def remote_shutdown(self):
        log.msg("slave shutting down on command from master")
        # there's no good way to learn that the PB response has been delivered,
        # so we'll just wait a bit, in hopes the master hears back.  Masters are
        # resilient to slaves dropping their connections, so there is no harm
        # if this timeout is too short.
        reactor.callLater(0.2, reactor.stop)

which should have you thinking "OMG GIANT HACK""
The problem happens about 1 in 10 tries. I'll try upping that delay
now and see if that makes it happen less often.
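
For comparison, here is a minimal sketch of a shutdown handshake that does not depend on a fixed delay. It is not Buildbot or Twisted PB code: it uses the stdlib asyncio streams, and the one-line "shutdown"/"ack" protocol, the handle_shutdown and demo names, and the 5 second timeout are all assumptions made up for illustration. The idea is that the slave sends its reply and then waits for the master to close the connection, so it only stops once the master has evidently read the reply, with a timeout as a fallback.

```python
import asyncio


async def handle_shutdown(reader, writer, timeout=5.0):
    """Slave side (hypothetical protocol): ack a shutdown request, then
    wait for the master to close the connection before stopping."""
    request = await reader.readline()
    if request.strip() != b"shutdown":
        writer.close()
        await writer.wait_closed()
        return False
    writer.write(b"ack\n")
    await writer.drain()
    try:
        # The master closes its side after reading the ack, so EOF
        # (an empty read) here means the reply was received.
        confirmed = (await asyncio.wait_for(reader.read(), timeout)) == b""
    except asyncio.TimeoutError:
        # Fallback: the master never closed; give up waiting and stop anyway.
        confirmed = False
    writer.close()
    await writer.wait_closed()
    return confirmed


async def demo():
    """Run a slave and a stand-in master over a loopback connection."""
    result = asyncio.get_running_loop().create_future()

    async def on_connect(reader, writer):
        result.set_result(await handle_shutdown(reader, writer))

    server = await asyncio.start_server(on_connect, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    # Stand-in master: request shutdown, read the ack, then hang up.
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"shutdown\n")
    await writer.drain()
    reply = await reader.readline()
    writer.close()
    await writer.wait_closed()

    confirmed = await result
    server.close()
    return reply, confirmed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Whether something similar is practical in PB is a separate question; the sketch only shows the ordering (reply, wait for the peer to hang up, then stop) that the 0.2 second callLater is approximating.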
--
Ticket URL: <http://trac.buildbot.net/ticket/2427>
Buildbot <http://buildbot.net/>
Buildbot: build/test automation