[Buildbot-commits] [Buildbot] #2427: Master requires restart sometimes if slave connection lost during graceful shutdown
Buildbot
nobody at buildbot.net
Fri Jan 18 19:19:51 UTC 2013
#2427: Master requires restart sometimes if slave connection lost during graceful
shutdown
-------------------+--------------------
Reporter: dank     | Owner:
Type: defect       | Status: new
Priority: minor    | Milestone: 0.8.8
Version: 0.8.7     | Resolution:
Keywords:          |
-------------------+--------------------
Changes (by dustin):
* milestone: undecided => 0.8.8
Old description:
> I run with a patch applied that requests a graceful shutdown on each
> slave as soon as each build starts; that lets me run slaves that do
> just one build and exit.
> That works great, except that it seems to expose a problem in the
> graceful shutdown protocol. Fairly often, if two jobs are submitted
> to a builder, the first one works, but during the graceful shutdown,
> the master gets confused and shows the builder as stuck; the initial
> step fails with a connection error, but the build is not failed.
> Restarting the master recovers from the problem.
>
> When the problem happens, stderr from the affected job looks like
> [Failure instance: Traceback (failure with no frames): <class
> 'twisted.internet.error.ConnectionLost'>: Connection to the other side
> was lost in a non-clean fashion.
> ]
> or occasionally
> Traceback (most recent call last):
> Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback
> (failure with no frames): <class
> 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
> ]
>
> See discussion at
> http://permalink.gmane.org/gmane.comp.python.buildbot.devel/8964
> which says
>
> "The code looks like this:
>
> def remote_shutdown(self):
>     log.msg("slave shutting down on command from master")
>     # there's no good way to learn that the PB response has been delivered,
>     # so we'll just wait a bit, in hopes the master hears back. Masters are
>     # resilient to slaves dropping their connections, so there is no harm
>     # if this timeout is too short.
>     reactor.callLater(0.2, reactor.stop)
>
> which should have you thinking "OMG GIANT HACK"
>
> The problem happens about 1 in 10 tries. I'll try upping that delay
> now and see if that makes it happen less often.
New description:
I run with a patch applied that requests a graceful shutdown on each
slave as soon as each build starts; that lets me run slaves that do
just one build and exit.
That works great, except that it seems to expose a problem in the
graceful shutdown protocol. Fairly often, if two jobs are submitted
to a builder, the first one works, but during the graceful shutdown,
the master gets confused and shows the builder as stuck; the initial
step fails with a connection error, but the build is not failed.
Restarting the master recovers from the problem.
When the problem happens, stderr from the affected job looks like
{{{
[Failure instance: Traceback (failure with no frames): <class
'twisted.internet.error.ConnectionLost'>: Connection to the other side was
lost in a non-clean fashion.
]
}}}
or occasionally
{{{
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback
(failure with no frames): <class 'twisted.internet.error.ConnectionDone'>:
Connection was closed cleanly.
]
}}}
See discussion at
http://permalink.gmane.org/gmane.comp.python.buildbot.devel/8964
which says
"The code looks like this:
{{{
def remote_shutdown(self):
    log.msg("slave shutting down on command from master")
    # there's no good way to learn that the PB response has been delivered,
    # so we'll just wait a bit, in hopes the master hears back. Masters are
    # resilient to slaves dropping their connections, so there is no harm
    # if this timeout is too short.
    reactor.callLater(0.2, reactor.stop)
}}}
which should have you thinking "OMG GIANT HACK"
The problem happens about 1 in 10 tries. I'll try upping that delay
now and see if that makes it happen less often.
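The race described above can be modeled without Twisted. What follows is a
minimal, deterministic sketch and not Buildbot code: master_sees_ack,
ack_latency, and stop_delay are hypothetical names, and the model assumes the
only thing that matters is whether the shutdown response reaches the master
before the slave's reactor stops.

```python
import heapq


def master_sees_ack(ack_latency, stop_delay=0.2):
    """Return True if the acknowledgement is delivered before the slave stops.

    Hypothetical model of the fixed-delay shutdown: ack_latency is the time
    the response needs to reach the master, stop_delay is the fixed wait
    before the slave stops its reactor (0.2 in the quoted code).
    """
    # Two scheduled events, processed in time order like a tiny event loop.
    events = [(ack_latency, "ack_delivered"), (stop_delay, "reactor_stopped")]
    heapq.heapify(events)

    ack_seen = False
    while events:
        _, name = heapq.heappop(events)
        if name == "ack_delivered":
            ack_seen = True
        elif name == "reactor_stopped":
            # Once the slave stops, anything still in flight is lost.
            break
    return ack_seen
```

Under this model, master_sees_ack(0.05) succeeds, master_sees_ack(0.5) fails,
and master_sees_ack(0.5, stop_delay=1.0) succeeds again. A longer delay only
narrows the window; any response slower than the delay is still lost.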
--
Ticket URL: <http://trac.buildbot.net/ticket/2427#comment:1>
Buildbot <http://buildbot.net/>
Buildbot: build/test automation