[users at bb.net] Issue connecting slave to master

Colin Chargy Colin.Chargy at bentley.com
Fri Jun 30 08:31:57 UTC 2017


Hi Jim,
Thanks for the input.
I did copy the slave folder from one slave machine to the other.
I also tried rebooting the slave, with no luck.
I don’t know the details of the network; we have an IT team that handles it. All the other slaves have the same network setup.
I don’t know of any other software running on the computer.

Apart from tcpdump/Wireshark, does anyone know another way to debug this? Do Buildbot or Twisted have a verbose mode?
I’m going to try opening a direct SSH port-forwarding tunnel between the master and the slave to see if that changes anything (it might help us understand the cause of the issue). I’ll keep you posted.
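
To compare the direct path with the tunnelled one, a minimal Python sketch along these lines could be run on the slave machine. It uses only the standard `socket` module. The function name `can_connect` is illustrative, and the tunnel endpoint in the comment (`127.0.0.1:19989`) is a hypothetical local forward, not something set up in this thread; the master address and port are the ones shown in the logs.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a plain TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and applies the timeout
        # to the connect attempt; the socket is closed on leaving the block.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example calls (addresses from the logs; the tunnel port is hypothetical):
#   can_connect("192.168.0.1", 9989)   # direct path to the master
#   can_connect("127.0.0.1", 19989)    # through a local SSH forward
```

This only shows whether the TCP connection can be opened. In the logs the connection is established and then lost after the slave attaches, so a successful result here rules out basic reachability problems but does not reproduce the failure itself.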

Regards,
Colin Chargy

From: Jim Rowan [mailto:jmr at computing.com]
Sent: Friday, June 23, 2017 19:54
To: Colin Chargy <Colin.Chargy at bentley.com>
Cc: users at buildbot.net
Subject: Re: [users at bb.net] Issue connecting slave to master

hmmm.    :).   Some thoughts/questions interleaved below:


On Jun 22, 2017, at 10:06 AM, Colin Chargy <Colin.Chargy at bentley.com> wrote:

Hi Jim,
Thanks for the input. I tested what you suggested. The same slave folder on another computer works fine, and another slave folder (from another computer) on this one doesn’t work.

I’m not positive what you’re saying — I think what you did above was to copy the actual slave folders in question from machine to machine, and then try to start them up?   If so, that’s fine … just trying to fully understand.

In both of the tests you mention above, are the slaves talking to the “second” master — the one that doesn’t work?

If so, I think that pretty much proves that it’s something about *this machine*, and seems to me to be almost certainly external to buildbot.

I have a few somewhat-wild-guess things to look at:

1.) If you haven’t already, reboot the slave.

2.) I notice the slave's address is  192.168.0.254 and the master’s address is 192.168.0.1.   Assuming a /24 network, those are by convention both a bit special — people might configure either one of them as gateways to other subnets.   Although that isn’t technically a problem, it makes me suspicious.   Is this indeed on a /24 subnet?  Are you sure that no other system is using these addresses?   Do both machines have the correct subnet mask?  (Some of these might be answered by your tcpdump file, but I didn’t crack it open..)

3.) Is there some other software running on this particular slave machine that makes it unusual compared to the others?
For grins, don’t start anything else after a reboot, and just try to run this one slave.  (By hand, if you are normally starting it as a service.)
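
The subnet questions in item 2 can be checked by hand; as a rough sketch, Python's standard `ipaddress` module can also do it. The /24 prefix length is the assumption stated above, not a confirmed fact about this network, and the function names are illustrative.

```python
import ipaddress


def same_network(addr_a, addr_b, prefix_len=24):
    """True if both addresses fall in the same network for the given prefix."""
    net_a = ipaddress.ip_interface(f"{addr_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{addr_b}/{prefix_len}").network
    return net_a == net_b


def describe_hosts(addresses, prefix_len=24):
    """For each address, report its network and whether it is the first
    or last usable host (the addresses often chosen for gateways)."""
    report = {}
    for addr in addresses:
        iface = ipaddress.ip_interface(f"{addr}/{prefix_len}")
        net = iface.network
        report[addr] = {
            "network": str(net),
            "is_first_host": iface.ip == net.network_address + 1,
            "is_last_host": iface.ip == net.broadcast_address - 1,
        }
    return report


# Addresses taken from the logs in this thread.
if __name__ == "__main__":
    print(same_network("192.168.0.254", "192.168.0.1"))
    for addr, info in describe_hosts(["192.168.0.254", "192.168.0.1"]).items():
        print(addr, info)
```

This only checks the arithmetic for an assumed mask. Whether another system is also using one of these addresses, or whether a machine is configured with a different mask, still has to be verified on the machines themselves.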



I’m assuming that both slaves on that machine are sharing the same python installation, and therefore the same buildslave code?  So the only thing unique is the actual slavedir and the name/password?
Yes

And the master that it’s talking to has other working slaves on different machines?
Yes, plenty, and none of them has these issues.

Any other ideas?

Regards,
Colin Chargy

From: Jim Rowan [mailto:jmr at computing.com]
Sent: Tuesday, June 20, 2017 21:35
To: Colin Chargy <Colin.Chargy at bentley.com>
Cc: users at buildbot.net
Subject: Re: [users at bb.net] Issue connecting slave to master



Wow .. so it’s apparently something specific to this particular slave on this particular machine, or the tuple of those with the particular master. Have you tried instantiating and starting the slave on a different Windows 10 machine? Or changing buildslave.tac to use a different (and working) slavename that is defined on that same master? (I think you said you did test this.)

I’m assuming that both slaves on that machine are sharing the same python installation, and therefore the same buildslave code?  So the only thing unique is the actual slavedir and the name/password?

And the master that it’s talking to has other working slaves on different machines?


On Jun 20, 2017, at 10:13 AM, Colin Chargy <Colin.Chargy at bentley.com> wrote:

Hi,
Thanks for your input. It doesn’t change anything. ☹

Best regards,
Colin Chargy

From: Jim Rowan [mailto:jmr at computing.com]
Sent: Tuesday, June 20, 2017 17:11
To: Colin Chargy <Colin.Chargy at bentley.com>
Cc: Pierre Tardy <tardyp at gmail.com>; users at buildbot.net
Subject: Re: [users at bb.net] Issue connecting slave to master

It’s a bit of a wild guess, but what happens if you stop the second (working) slave that is on the same machine before trying to start this one?

On Jun 20, 2017, at 8:10 AM, Colin Chargy <Colin.Chargy at bentley.com> wrote:

Hi Pierre,
I tried with the following version :
$ buildslave --version
Buildslave version: 0.8.8
Twisted version: 12.3.0

It’s now exactly the same as on the master, and the behavior continues…

Is there anything else I could try?

I’ll ask the admin of the server to update twisted.

Best regards,
Colin Chargy

From: Pierre Tardy [mailto:tardyp at gmail.com]
Sent: Tuesday, June 20, 2017 15:02
To: Colin Chargy <Colin.Chargy at bentley.com>; users at buildbot.net
Subject: Re: [users at bb.net] FW: Issue connecting slave to master

Oh, I had not noticed the very old Twisted version. You can indeed try downgrading on the worker.

I see no reason not to upgrade twisted on master, though

Pierre

On Tue, Jun 20, 2017 at 2:45 PM Colin Chargy <Colin.Chargy at bentley.com> wrote:
Hi Pierre,
I tested what you suggested :
$ buildslave --version
Buildslave version: 0.8.8
Twisted version: 17.5.0

This does not change the behavior. Should I test with another twisted version ?

Regards,
Colin

From: Pierre Tardy [mailto:tardyp at gmail.com]
Sent: Tuesday, June 20, 2017 14:15
To: Colin Chargy <Colin.Chargy at bentley.com>; users at buildbot.net
Subject: Re: [users at bb.net] FW: Issue connecting slave to master

Colin,
It’s a bit harder for me to help you efficiently, as 0.8.8 is quite an old version. I imagine upgrading is not an option.

It might be an incompatibility in the slave version string. We usually try to keep new master versions compatible with old slave versions, but we might not always take care to support running new slaves with an older master.
Did you try downgrading your slave version to 0.8.8?

Pierre

On Tue, Jun 20, 2017 at 11:53 AM Colin Chargy <Colin.Chargy at bentley.com> wrote:
Hi Pierre,
Thanks for your reply.
Indeed, I’ve seen in the failedToGetPerspective docstring that it can fail because of a wrong login or password. However, the slave name and password seem correct (i.e. the same in the slave’s .tac file and in the server config). We also tested multiple login/password pairs to see if that changes anything (with no luck). The TCP dump seems to show that the last things sent are the host name and slave info, which are the default ones (I tried modifying them, with no luck). What happens after/inside failedToGetPerspective? Does the connection change port, settings, or anything else at this point?

I should probably add some info about our setup: the server runs 2 buildbot masters, and the slave computer also runs 2 buildbot slaves (one for each master). We have other computers that work this way without any problem. Of course, we checked that each slave is connecting to the correct master. Only one of the slave/master pairs fails (and, as already said, only on this computer).
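
Comparing the credentials by eye can miss things like a trailing space. As a rough sketch, the settings can be read out of the .tac file without executing it and compared with `repr()` so that stray whitespace shows up. The variable names `buildmaster_host`, `port`, `slavename` and `passwd` are assumed to match the stock buildslave.tac template, the expected values have to be copied by hand from the master config, and the password in the example is made up.

```python
import ast

# Assumed to be the names used by the stock buildslave.tac template.
TAC_NAMES = ("buildmaster_host", "port", "slavename", "passwd")


def read_tac_settings(tac_source, names=TAC_NAMES):
    """Collect top-level literal assignments from .tac text without running it."""
    found = {}
    for node in ast.parse(tac_source).body:
        if isinstance(node, ast.Assign) and len(node.targets) == 1:
            target = node.targets[0]
            if isinstance(target, ast.Name) and target.id in names:
                try:
                    found[target.id] = ast.literal_eval(node.value)
                except ValueError:
                    pass  # value is not a plain literal; skip it
    return found


def credential_mismatches(tac_settings, expected):
    """Return (key, repr(found), repr(expected)) for every differing value."""
    problems = []
    for key, want in expected.items():
        got = tac_settings.get(key)
        if got != want:
            problems.append((key, repr(got), repr(want)))
    return problems


if __name__ == "__main__":
    sample = "slavename = 'lrttestauto-test'\npasswd = 'secret '\n"
    expected = {"slavename": "lrttestauto-test", "passwd": "secret"}
    print(credential_mismatches(read_tac_settings(sample), expected))
```

The file has to parse with the Python version running the script, so a .tac written for a very old interpreter may need the check done with that interpreter instead.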

Best regards,
Colin Chargy

From: Pierre Tardy [mailto:tardyp at gmail.com]
Sent: Tuesday, June 20, 2017 11:41
To: Colin Chargy <Colin.Chargy at bentley.com>; users at buildbot.net
Subject: Re: [users at bb.net] FW: Issue connecting slave to master

Hi Colin
Could that be a problem with your slave password?


    def failedToGetPerspective(self, why):
        """The login process failed, most likely because of an authorization
        failure (bad password), but it is also possible that we lost the new
        connection before we managed to send our credentials.
        """
        log.msg("ReconnectingPBClientFactory.failedToGetPerspective")
        if why.check(pb.PBConnectionLost):
            log.msg("we lost the brand-new connection")
            # retrying might help here, let clientConnectionLost decide
            return
        # probably authorization
        self.stopTrying()  # logging in harder won't help
        log.err(why)


On Tue, Jun 20, 2017 at 9:18 AM Colin Chargy <Colin.Chargy at bentley.com> wrote:
Hi everyone,
Before I start describing my issue, let me say that we have dozens of slaves (Windows, Mac and Linux platforms, all working perfectly right now); only one is problematic.
We are facing an issue with a slave connecting to its master. Here is the log on the slave side (see the enclosed twisted.log for the complete log):
[Broker,client] message from master: attached
[Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
[Broker,client] we lost the brand-new connection
[Broker,client] Lost connection to 192.168.0.1:9989
[Broker,client] <twisted.internet.tcp.Connector instance at 0x03471918> will retry in 3 seconds

And then the whole sequence starts again.
On the server side, the following log is produced:
2017-06-19 16:11:27+0200 [Broker,9423,192.168.0.254] slave 'lrttestauto-test' attaching from IPv4Address(TCP, '192.168.0.254', 35524)
2017-06-19 16:11:27+0200 [Broker,9423,192.168.0.254] Starting buildslave keepalive timer for 'lrttestauto-test'
2017-06-19 16:11:27+0200 [Broker,9423,192.168.0.254] Peer will receive following PB traceback:
2017-06-19 16:11:27+0200 [Broker,9423,192.168.0.254] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
        ]
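
When the retry loop runs for a while, the master log fills up with repeats of this block. As a rough sketch, the entries can be grouped per broker id to see which peers attach and which of those connections end in an error. The line layout is assumed from the excerpt above only (timestamp, `[Broker,<id>,<peer>]`, message), and the function and field names are illustrative.

```python
import re
from datetime import datetime

# Layout assumed from the master log excerpt above.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[+-]\d{4}) "
    r"\[Broker,(?P<broker>\d+),(?P<peer>[\d.]+)\] (?P<msg>.*)$"
)
ATTACH_RE = re.compile(r"^slave '(?P<name>[^']+)' attaching from ")


def summarize_master_log(lines):
    """Group log lines by broker id, recording slave name and error state."""
    brokers = {}
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # traceback continuation lines carry no prefix
        entry = brokers.setdefault(
            match.group("broker"),
            {"peer": match.group("peer"), "slave": None,
             "first_seen": None, "error": False},
        )
        if entry["first_seen"] is None:
            entry["first_seen"] = datetime.strptime(
                match.group("ts"), "%Y-%m-%d %H:%M:%S%z")
        attach = ATTACH_RE.match(match.group("msg"))
        if attach is not None:
            entry["slave"] = attach.group("name")
        if match.group("msg").startswith("Unhandled Error"):
            entry["error"] = True
    return brokers
```

Feeding it the lines of the master's log file (for example `summarize_master_log(open(path))`, with `path` pointing at that file) gives one entry per connection attempt, which makes it easier to see whether every attempt from 192.168.0.254 fails or only some of them.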

I've checked that the login and password are correct. The Buildbot versions are the following:
On the server side (which runs Debian):
Buildbot version: 0.8.8
Twisted version: 12.3.0

On the slave side (which runs Windows 10, with buildslave installed via pip):
Buildslave version: 0.8.14
Twisted version: 17.5.0

I've enclosed the slave log, the slave tac file and a tcpdump showing data transfer between slave and server (I've tried to debug it with Wireshark with no luck).

What can I do to debug or to solve this issue ?

Best regards,
Colin Chargy
_______________________________________________
users mailing list
users at buildbot.net
https://lists.buildbot.net/mailman/listinfo/users
