[Buildbot-commits] [Buildbot] #2427: Master requires restart sometimes if slave connection lost during graceful shutdown

Buildbot nobody at buildbot.net
Fri Jan 18 19:19:51 UTC 2013


#2427: Master requires restart sometimes if slave connection lost during graceful
shutdown
-------------------+--------------------
Reporter:  dank    |       Owner:
    Type:  defect  |      Status:  new
Priority:  minor   |   Milestone:  0.8.8
 Version:  0.8.7   |  Resolution:
Keywords:          |
-------------------+--------------------
Changes (by dustin):

 * milestone:  undecided => 0.8.8


Old description:

> I run with a patch applied that requests a graceful shutdown on each
> slave as soon as each build starts; that lets me run slaves that do
> just one build and then exit.
> That works great, except that it seems to expose a problem in the
> graceful shutdown protocol.  Fairly often, if two jobs are submitted
> to a builder, the first one works, but during the graceful shutdown,
> the master gets confused and shows the builder as stuck; the initial
> step fails with a connection error, but the build is not failed.
> Restarting the master recovers from the problem.
>
> When the problem happens, stderr from the affected job looks like
> [Failure instance: Traceback (failure with no frames): <class
> 'twisted.internet.error.ConnectionLost'>: Connection to the other side
> was lost in a non-clean fashion.
> ]
> or occasionally
> Traceback (most recent call last):
> Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback
> (failure with no frames): <class
> 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
> ]
>
> See discussion at
> http://permalink.gmane.org/gmane.comp.python.buildbot.devel/8964
> which says
>
> "The code looks like this:
>
>     def remote_shutdown(self):
>         log.msg("slave shutting down on command from master")
>         # there's no good way to learn that the PB response has been delivered,
>         # so we'll just wait a bit, in hopes the master hears back.  Masters are
>         # resilient to slaves dropping their connections, so there is no harm
>         # if this timeout is too short.
>         reactor.callLater(0.2, reactor.stop)
>
> which should have you thinking "OMG GIANT HACK"
>
> The problem happens about 1 in 10 tries.  I'll try upping that delay
> now and see if that makes it happen less often.

New description:

 I run with a patch applied that requests a graceful shutdown on each
 slave as soon as each build starts; that lets me run slaves that do
 just one build and then exit.
 That works great, except that it seems to expose a problem in the
 graceful shutdown protocol.  Fairly often, if two jobs are submitted
 to a builder, the first one works, but during the graceful shutdown,
 the master gets confused and shows the builder as stuck; the initial
 step fails with a connection error, but the build is not failed.
 Restarting the master recovers from the problem.

 When the problem happens, stderr from the affected job looks like
 {{{
 [Failure instance: Traceback (failure with no frames): <class
 'twisted.internet.error.ConnectionLost'>: Connection to the other side was
 lost in a non-clean fashion.
 ]
 }}}
 or occasionally
 {{{
 Traceback (most recent call last):
 Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback
 (failure with no frames): <class 'twisted.internet.error.ConnectionDone'>:
 Connection was closed cleanly.
 ]
 }}}
 See discussion at
 http://permalink.gmane.org/gmane.comp.python.buildbot.devel/8964
 which says

 "The code looks like this:

 {{{
     def remote_shutdown(self):
         log.msg("slave shutting down on command from master")
         # there's no good way to learn that the PB response has been delivered,
         # so we'll just wait a bit, in hopes the master hears back.  Masters are
         # resilient to slaves dropping their connections, so there is no harm
         # if this timeout is too short.
         reactor.callLater(0.2, reactor.stop)
 }}}

 which should have you thinking "OMG GIANT HACK"

 The problem happens about 1 in 10 tries.  I'll try upping that delay
 now and see if that makes it happen less often.
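 As an illustration only (not Buildbot code), the race inherent in the
 fixed-delay approach can be sketched like this: the slave stops its
 reactor after a fixed delay, so any PB response that takes longer than
 that to reach the master is lost, no matter how large the delay is.
 The `shutdown_race` helper and the latency figures below are made up
 for the sketch:

```python
import random

def shutdown_race(delay, latencies):
    """Count how often the master misses the shutdown acknowledgement.

    The slave drops the connection `delay` seconds after replying; if
    delivering the PB response takes longer than that, the master sees
    a lost connection instead of a clean reply.  Purely illustrative.
    """
    failures = 0
    for latency in latencies:
        if latency > delay:
            failures += 1
    return failures

# Simulated delivery latencies with a 0.05 s mean; upping the delay
# shrinks the failure count but can never make it zero in principle.
random.seed(42)
latencies = [random.expovariate(1 / 0.05) for _ in range(1000)]
print(shutdown_race(0.2, latencies))   # a handful of failures remain
print(shutdown_race(2.0, latencies))   # usually none, but not guaranteed
```

 This is why enlarging the `callLater` delay only makes the bug rarer:
 the master still has no positive confirmation that the response was
 delivered before the connection drops.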

--

-- 
Ticket URL: <http://trac.buildbot.net/ticket/2427#comment:1>
Buildbot <http://buildbot.net/>
Buildbot: build/test automation

