[Buildbot-commits] [Buildbot] #2427: Master requires restart sometimes if slave connection lost during graceful shutdown

Fri Jan 18 18:30:57 UTC 2013

#2427: Master requires restart sometimes if slave connection lost during graceful
shutdown
-------------------+-----------------------
Reporter:  dank    |      Owner:
    Type:  defect  |     Status:  new
Priority:  minor   |  Milestone:  undecided
 Version:  0.8.7   |   Keywords:
-------------------+-----------------------
 I run with a patch applied that requests a graceful shutdown on each
 slave as soon as each build starts; that lets me run slaves that do
 just one build and exit.
 That works great, except that it seems to expose a problem in the
 graceful shutdown protocol.  Fairly often, if two jobs are submitted
 to a builder, the first one works, but during the graceful shutdown,
 the master gets confused and shows the builder as stuck; the initial
 step fails with a connection error, but the build is not failed.
 Restarting the master recovers from the problem.

 When the problem happens, stderr from the affected job looks like
 [Failure instance: Traceback (failure with no frames): <class
 'twisted.internet.error.ConnectionLost'>: Connection to the other side was
 lost in a non-clean fashion.
 ]
 or occasionally
 Traceback (most recent call last):
 Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback
 (failure with no frames): <class 'twisted.internet.error.ConnectionDone'>:
 Connection was closed cleanly.
 ]

 See discussion at
 http://permalink.gmane.org/gmane.comp.python.buildbot.devel/8964
 which says

 "The code looks like this:

     def remote_shutdown(self):
         log.msg("slave shutting down on command from master")
         # there's no good way to learn that the PB response has been delivered,
         # so we'll just wait a bit, in hopes the master hears back.  Masters are
         # resilient to slaves dropping their connections, so there is no harm
         # if this timeout is too short.
         reactor.callLater(0.2, reactor.stop)

 which should have you thinking "OMG GIANT HACK"."

 The problem happens about 1 in 10 tries.  I'll try upping that delay
 now and see if that makes it happen less often.
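
 To make the race concrete, here is a minimal sketch that contrasts the
 fixed-delay stop quoted above with waiting for an explicit
 acknowledgement. It is not Buildbot or Twisted code: it uses asyncio
 from the standard library as a stand-in, and the names
 shutdown_fixed_delay, shutdown_after_ack and ack_latency are
 hypothetical. The acknowledgement itself is an assumption; the quoted
 comment says PB offers no good way to learn that the response was
 delivered.

```python
import asyncio


async def shutdown_fixed_delay(ack_latency, delay=0.2):
    """Mimic the quoted approach: stop after a fixed delay, acked or not.

    ack_latency is a hypothetical stand-in for the time the master needs
    to receive the shutdown reply. Returns True if the reply arrived
    before the slave stopped.
    """
    acked = asyncio.Event()
    loop = asyncio.get_running_loop()
    # Simulate the master receiving the reply after ack_latency seconds.
    handle = loop.call_later(ack_latency, acked.set)
    await asyncio.sleep(delay)  # plays the role of callLater(0.2, reactor.stop)
    handle.cancel()             # the slave is gone; a late reply is lost
    return acked.is_set()


async def shutdown_after_ack(ack_latency, timeout=5.0):
    """Assumed alternative: stop once the master has acknowledged.

    The timeout only bounds the wait if the master never answers.
    """
    acked = asyncio.Event()
    loop = asyncio.get_running_loop()
    handle = loop.call_later(ack_latency, acked.set)
    try:
        await asyncio.wait_for(acked.wait(), timeout)
    except asyncio.TimeoutError:
        pass                    # give up and stop anyway
    handle.cancel()
    return acked.is_set()


if __name__ == "__main__":
    # Reply slower than the fixed delay: the master never hears back.
    print(asyncio.run(shutdown_fixed_delay(0.05, delay=0.01)))
    # Same latency, but the slave waits for the acknowledgement.
    print(asyncio.run(shutdown_after_ack(0.05)))
```

 In this model, raising the delay only shrinks the window in which the
 reply can be lost. Waiting for an acknowledgement closes that window,
 at the cost of needing a timeout for a master that never answers.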

-- 
Ticket URL: <http://trac.buildbot.net/ticket/2427>
Buildbot <http://buildbot.net/>
Buildbot: build/test automation