[Buildbot-devel] database-backed status/scheduler-state project

Mon Sep 7 06:36:05 UTC 2009

Brian Warner <warner at lothar.com> writes:

>> From: exarkun at twistedmatrix.com
>> 
>> One of the main problems I currently encounter with BuildBot operation 
>> is that slaves will often lose their connection to the master as a 
>> *consequence* of attempting a build.  Something that I'd be concerned 
>> about with automatic re-queueing is that a build would result in a never 
>> ending stream of work for a slave, as it tries the build, fails, 
>> disconnects, reconnects, is again instructed to attempt the build, 
>> fails, disconnects, etc.
>
> Yeah, I don't know how to solve that yet. I think we've got consensus that
> retrying a build that fails because of a transient buildbot problem (lost
> slave, master bounce) is good, but of course telling the difference between
> transient and caused failures is generally impossible. Maybe a retry limit?
> If the build experiences "transient" failures more than N times it gets
> shelved for manual triggering?

What I did in another project was the following: when a failure was deemed
"transient", the status display showed a black box with a "host down" status
(plus the stdio logs for debugging the failure). The build was retried only
after a 30-minute delay. In addition, an alert was raised in the sysadmin
monitoring tool (XYmon in our case, but it could be Nagios etc.) so that
someone would know to check and fix the box. This worked really well.
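
To make the scheme above concrete, here is a rough sketch in Python. It is not
Buildbot code and not the code from the other project: the class name, the
field names and the alert list (a stand-in for the XYmon/Nagios hook) are all
made up for illustration. Only the "host down" status, the 30-minute delay and
the alert come from the description above.

```python
from dataclasses import dataclass, field

RETRY_DELAY_SECONDS = 30 * 60  # the 30-minute delay described above


@dataclass
class TransientRetryPolicy:
    """Hypothetical helper, not part of Buildbot."""

    delay: int = RETRY_DELAY_SECONDS
    # Stand-in for the monitoring hook (XYmon, Nagios, ...): messages are
    # simply collected here instead of being sent anywhere.
    alerts: list = field(default_factory=list)

    def on_transient_failure(self, build_id, slave, now):
        """Record a transient failure and say when the build may be retried."""
        # Tell the sysadmins that the box needs attention.
        self.alerts.append(f"host down: {slave} (build {build_id})")
        # The status display would show "host down"; the build is not
        # re-queued before `retry_at` (seconds, same clock as `now`).
        return {
            "build_id": build_id,
            "status": "host down",
            "retry_at": now + self.delay,
        }
```

The caller passes in the current time rather than the helper reading a clock,
which keeps the sketch easy to test. A real implementation would also attach
the stdio logs to the status entry.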

The most important transient failure is that the TCP connection is dropped,
and that can be detected fairly reliably. I agree though that it is necessary
to be conservative with respect to when a failure is deemed "transient".
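
As an illustration of what "conservative" could mean here, the sketch below
treats only connection-level errors as transient and everything else as a real
build failure. It uses plain Python 3 built-in exceptions as stand-ins; how
Buildbot itself would see a dropped connection is not shown, and the function
name is hypothetical.

```python
# Only errors that mean "the TCP connection went away" count as transient.
# Anything not listed here is assumed to be caused by the build itself.
TRANSIENT_ERRORS = (
    ConnectionResetError,
    ConnectionAbortedError,
    BrokenPipeError,
)


def is_transient(exc):
    """Return True only for failures we are confident are connection drops."""
    return isinstance(exc, TRANSIENT_ERRORS)
```

Timeouts are deliberately left out of the list, since a hung build and a dead
network can look the same from the master's side.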

I am not too worried about endless retries. That is a problem that needs to be
solved on the host anyway, and our slaves are busy most of the time in any
case. One thing that would be very useful, though, is the ability to configure
scheduling priorities, so that a more important build is done before a less
important one (whether for the first time or as a retry after a transient
failure). This should be easy to add if the scheduling state were kept in a
persistent database.
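
To show why a database makes this easy, here is a minimal sketch using
Python's sqlite3 module. The table and column names (pending_builds, priority,
not_before) and the three functions are assumptions for illustration only and
are not an existing or proposed Buildbot schema. The point is just that
"most important build first" becomes a single ORDER BY, and a delayed retry is
one more column.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open (or create) the hypothetical pending-builds table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pending_builds ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " builder TEXT NOT NULL,"
        " priority INTEGER NOT NULL DEFAULT 0,"    # higher runs first
        " not_before INTEGER NOT NULL DEFAULT 0)"  # earliest start time
    )
    return db


def enqueue(db, builder, priority=0, not_before=0):
    """Queue a build, either a first attempt or a delayed retry."""
    with db:  # commits on success
        db.execute(
            "INSERT INTO pending_builds (builder, priority, not_before)"
            " VALUES (?, ?, ?)",
            (builder, priority, not_before),
        )


def next_build(db, now):
    """Pop the highest-priority build that is allowed to start at `now`."""
    row = db.execute(
        "SELECT id, builder FROM pending_builds"
        " WHERE not_before <= ?"
        " ORDER BY priority DESC, id ASC LIMIT 1",
        (now,),
    ).fetchone()
    if row is None:
        return None
    with db:
        db.execute("DELETE FROM pending_builds WHERE id = ?", (row[0],))
    return row[1]
```

For example, after queueing "nightly-docs" with priority 1 and "release" with
priority 10, next_build() hands out "release" first. A retry queued with
not_before set 30 minutes ahead is skipped until that time has passed, no
matter how high its priority is.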

> Although not necessarily a part of this project, I still want to have a less
> connection-sensitive master-slave build protocol, one which can tolerate
> connection loss without dropping messages as long as the processes on both
> sides remain running. (i.e. stop treating buildsteps as pb.Referenceable, use

Yes, this would also be really useful. We have machines behind poor ISP
connections, which sometimes drop for a short period.

 - Kristian.