[Buildbot-devel] lost remote - slave cannot keep connection with master
Aaron Maxwell
amax at snaplogic.org
Wed Feb 27 07:13:31 UTC 2008
On Tuesday 26 February 2008 17:47:51 you wrote:
> I thought that buildbot prevented duplicate processing happening
> though storing the pid in a file?
It's supposed to, but I think this was an extraordinary situation. One thing
I remember doing was running "buildbot stop ~; buildbot start ~" (I have that
sequence as an alias) on that slave's account, and ^C'ing it because it hung
waiting for some message over the network. My experience is that a stop or
start sometimes becomes really slow, perhaps because we have build nodes on
different VMs on slightly overloaded hardware, and it's faster to abort and
retry. Perhaps the timing of that ^C somehow allowed the old buildbot process
to live. That's speculation; there is really no way to know. In any event,
when I searched for running buildbot processes, I saw something like this:
{{{
build at linbot1:~$ ps aux | grep buildbot
build 417 0.0 2.3 23776 11928 ? Sl 14:47
0:08 /usr/bin/python /usr/bin/buildbot start /home/build
build 927 0.0 2.3 23776 11928 ? Sl Feb24
0:00 /usr/bin/python /usr/bin/buildbot start /home/build
build 1073 0.0 0.1 2800 756 pts/0 S+ 23:55 0:00 grep buildbot
build at linbot1:~$
}}}
(This is a reconstruction - I did not save the output at the time, so not all
of the fields above will be accurate.) The key point is that there should be
only one "buildbot start /home/build" process, but there are two. However
that came to pass, it explains why I was having this "lost remote" issue on
the main buildbot but not the development version, even though literally
every byte of the master.cfg and its support files was identical.
Perhaps there is another explanation, but so far I don't think so.
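For what it's worth, the duplicate-prevention you mentioned boils down to a
pidfile liveness check: read the recorded pid and see whether that process
still exists. Here is a minimal sketch of that mechanism; the function names
and the pidfile path argument are illustrative, not buildbot's actual code
(buildbot/twistd keeps its pid in twistd.pid in the base directory):

{{{
import os

def pid_is_alive(pid):
    """Check whether a process with this pid exists.

    Signal 0 delivers nothing; the kernel only performs the
    existence/permission check.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True

def read_stale_pidfile(path):
    """Return the pid from a pidfile if that process is gone, else None.

    'path' is hypothetical here; the point is that a pidfile only
    guards against duplicates if the stored pid is actually checked.
    """
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None
    return None if pid_is_alive(pid) else pid
}}}

If a ^C interrupts the stop/start sequence at the wrong moment, the pidfile
can end up describing the wrong (or no) process while the old daemon is still
running, and a check like the above passes for the wrong reason.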
--
Aaron Maxwell .:. amax at snaplogic.org .:. http://snaplogic.org
SnapLogic, Inc. - Data Integration for the Last Mile