[Buildbot-devel] "Connection lost in a non-clean fashion" / duplicate slave

Harry Percival harry at pythonanywhere.com
Tue Nov 18 09:14:55 UTC 2014


PPS, and apologies for the train-of-thought style email here.

We've managed to get a stable connection by changing the "keepalive" 
parameter in buildbot.tac.  At first I had misunderstood where this 
file lived; it turns out it's the buildbot.tac on the buildslave that 
needed changing.  We set it to 60 seconds, which may have been overkill, 
but it works now.
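
For anyone who finds this thread later, the change amounts to roughly the 
following.  This is only a sketch of the one relevant line, assuming the 
standard generated buildbot.tac on the buildslave; everything else in the 
file is left as it was.

```python
# buildbot.tac on the buildslave: only the relevant setting is shown.
# The quoted slave log reports the previous interval as 600 seconds.
keepalive = 60  # seconds between application-level keepalives
```

The slave presumably needs a restart to pick up the new value, since 
buildbot.tac is read when the slave starts.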

So, current best theory is that Azure has some sort of firewall that 
disconnects some TCP/IP connections if they go idle, and sending regular 
keepalive packets from the client fixes it.
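
As a rough sanity check on that theory, the timestamps in the slave log 
quoted below can be compared with the two keepalive intervals.  This is 
a sketch only: it assumes nothing else crossed the connection between 
the last keepalive and the failure, and the variable names are made up 
for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S%z"

# Timestamps copied from the slave log quoted below.
last_keepalive = datetime.strptime("2014-11-17 09:12:14+0000", FMT)
connection_lost = datetime.strptime("2014-11-17 09:18:50+0000", FMT)

# How long the link may have been quiet when the failure surfaced
# (assumption: no other traffic in between).
quiet_seconds = (connection_lost - last_keepalive).total_seconds()
print(quiet_seconds)  # 396.0

old_keepalive = 600  # interval reported in the quoted log
new_keepalive = 60   # interval we switched to

# The old interval leaves gaps longer than the quiet period seen before
# the failure; the new one does not.
print(old_keepalive > quiet_seconds)  # True
print(new_keepalive < quiet_seconds)  # True
```

If the idle-timeout theory is right, this would put the cutoff somewhere 
under about six and a half minutes, which fits with 600 seconds failing 
and 60 seconds working.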

Keep up the good work, everyone!

-- 
Harry Percival
Developer
harry at pythonanywhere.com

PythonAnywhere - a fully browser-based Python development and hosting environment
<http://www.pythonanywhere.com/>

PythonAnywhere LLP
17a Clerkenwell Road, London EC1M 5RD, UK
VAT No.: GB 893 5643 79
Registered in England and Wales as company number OC378414.
Registered address: 28 Ely Place, 3rd Floor, London EC1N 6TD, UK

On 17/11/14 16:02, Harry Percival wrote:
> PS - having inspected the logs on the slave, it looks like one of the 
> keepalives being sent from the slave is failing maybe?
>
>     2014-11-17 09:02:14+0000 [-] sending app-level keepalive
>     2014-11-17 09:12:14+0000 [-] sending app-level keepalive
>     2014-11-17 09:18:50+0000 [Broker,client] SlaveBuilder._ackFailed:
>     SlaveBuilder.sendUpdate
>     2014-11-17 09:18:50+0000 [Broker,client] Unhandled Error
>         Traceback (most recent call last):
>         Failure: twisted.spread.pb.PBConnectionLost: [Failure
>     instance: Traceback (failure with no frames): <class
>     'twisted.internet.error.ConnectionLost'>: Connection to the other
>     side was lost in a non-clean fashion.
>         ]
>
>     2014-11-17 09:18:50+0000 [Broker,client] SlaveBuilder._ackFailed:
>     SlaveBuilder.sendUpdate
>     2014-11-17 09:18:50+0000 [Broker,client] Unhandled Error
>         Traceback (most recent call last):
>         Failure: twisted.spread.pb.PBConnectionLost: [Failure
>     instance: Traceback (failure with no frames): <class
>     'twisted.internet.error.ConnectionLost'>: Connection to the other
>     side was lost in a non-clean fashion.
>         ]
>
>     2014-11-17 09:18:50+0000 [Broker,client] SlaveBuilder._ackFailed:
>     SlaveBuilder.sendUpdate
>     2014-11-17 09:18:50+0000 [Broker,client] Unhandled Error
>         Traceback (most recent call last):
>         Failure: twisted.spread.pb.PBConnectionLost: [Failure
>     instance: Traceback (failure with no frames): <class
>     'twisted.internet.error.ConnectionLost'>: Connection to the other
>     side was lost in a non-clean fashion.
>         ]
>
>     2014-11-17 09:18:50+0000 [Broker,client] lost remote
>     2014-11-17 09:18:50+0000 [Broker,client] lost remote step
>     2014-11-17 09:18:50+0000 [Broker,client] stopCommand: halting
>     current command <buildslave.commands.shell.SlaveShellCommand
>     instance at 0x0000000002DE6C08>
>     2014-11-17 09:18:50+0000 [Broker,client] command interrupted,
>     attempting to kill
>     2014-11-17 09:18:50+0000 [Broker,client] using TASKKILL PID /F /T
>     to kill pid 3356
>     2014-11-17 09:18:51+0000 [Broker,client] taskkill'd pid 3356
>     2014-11-17 09:18:51+0000 [Broker,client] Lost connection to
>     integration.company.com:9886
>     2014-11-17 09:18:51+0000 [Broker,client]
>     <twisted.internet.tcp.Connector instance at 0x0000000002BD6108>
>     will retry in 2 seconds
>     2014-11-17 09:18:51+0000 [Broker,client] Stopping factory
>     <buildslave.bot.BotFactory instance at 0x0000000002BC6E48>
>     2014-11-17 09:18:51+0000 [-] command finished with signal None,
>     exit code 1, elapsedTime: 2062.839000
>     2014-11-17 09:18:51+0000 [-] would sendStatus but not .running
>     2014-11-17 09:18:51+0000 [-] SlaveBuilder.commandComplete None
>     2014-11-17 09:18:54+0000 [-] Starting factory
>     <buildslave.bot.BotFactory instance at 0x0000000002BC6E48>
>     2014-11-17 09:18:54+0000 [-] Connecting to
>     integration.company.com:9886
>     2014-11-17 09:18:54+0000 [Broker,client] message from master:
>     master already has a connection named 'redacted' - checking its
>     liveness
>     2014-11-17 09:19:04+0000 [Broker,client] message from master: attached
>     2014-11-17 09:19:04+0000 [Broker,client]
>     SlaveBuilder.remote_print(google chrome stress): message from
>     master: attached
>     2014-11-17 09:19:04+0000 [Broker,client] Connected to
>     integration.company.com:9886; slave is ready
>     2014-11-17 09:19:04+0000 [Broker,client] sending application-level
>     keepalives every 600 seconds
>
> On 17/11/14 09:28, Harry Percival wrote:
>> Hi there,
>>
>> We've been running a build farm with a Linux master and Windows slaves
>> for many years.  Have been experimenting with moving the slaves to
>> Azure, and I'm seeing a lot of errors saying:
>>
>>       remoteFailed: [Failure instance: Traceback (failure with no
>> frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to
>> the other side was lost in a non-clean fashion.
>>
>> Which abort the build.  In the twistd.log, I'm seeing these messages at
>> around the same time:
>>
>>       <timestamp> [Broker,<n>,<ip>] duplicate slave <slavename>; delaying
>> new slave (IPv4Address(TCP, '<ip>', <port>)) and pinging old
>> (IPv4Address(TCP, '<ip>', <other-port>))
>>       <timestamp+10s> [Broker,<n-1>,<ip>] BuildSlave.detached(<slavename>)
>>
>> What could be happening here?
>>
>> The slaves are running Windows Server 2012R2 Datacenter.
>>
>> I've had a look at this:
>> https://mariadb.com/kb/en/mariadb/development/tools/buildbot/buildbot-setup/buildbot-setup-buildbot-setup-for-windows/  
>> and tried changing the buildbot.tac keepalive variable, with no apparent change.
>>
>> thanks for any help!
>>
>> Harry
>>
>
