[Buildbot-devel] lost remote - slave cannot keep connection with master
Aaron Maxwell
amax at snaplogic.org
Wed Feb 27 07:13:31 UTC 2008
On Tuesday 26 February 2008 17:47:51 you wrote:
> I thought that buildbot prevented duplicate processing happening
> though storing the pid in a file?
It's supposed to, but I think this was an extraordinary situation. One thing
I remember doing was running "buildbot stop ~; buildbot start ~" (I have that
sequence as an alias) on that slave's account, and ^C'ing it because it hung
waiting for some message over the network. My experience is that a stop or
start sometimes becomes really slow, perhaps because we have build nodes on
different VMs on slightly overloaded hardware, and it's faster to abort and
retry. Perhaps the timing of that ^C somehow allowed the old buildbot process
to live. That's speculation; there is really no way to know. In any event,
when I searched for running buildbot processes, I saw something like this:
{{{
build at linbot1:~$ ps aux | grep buildbot
build 417 0.0 2.3 23776 11928 ? Sl 14:47
0:08 /usr/bin/python /usr/bin/buildbot start /home/build
build 927 0.0 2.3 23776 11928 ? Sl Feb24
0:00 /usr/bin/python /usr/bin/buildbot start /home/build
build 1073 0.0 0.1 2800 756 pts/0 S+ 23:55 0:00 grep buildbot
build at linbot1:~$
}}}
(This is a reconstruction - I did not save the output at the time, so not all
of the fields above will be accurate.) The key point is that there should be
only one "buildbot start /home/build" process, but there are two. However
that came to pass, it explains why I was having this "lost remote" issue on
the main buildbot but not the development version, even though literally
every byte of the master.cfg and its support files was identical.
Perhaps there is another explanation, but so far I don't think so.
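For what it's worth, the duplicate-prevention you mentioned boils down to a
pidfile liveness check: read the recorded pid and see whether that process
still exists. Here is a minimal sketch of that mechanism; the function names
and the pidfile path argument are illustrative, not buildbot's actual code
(buildbot/twistd keeps its pid in twistd.pid in the base directory):

{{{
import os

def pid_is_alive(pid):
    """Check whether a process with this pid exists.

    Signal 0 delivers nothing; the kernel only performs the
    existence/permission check.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True

def read_stale_pidfile(path):
    """Return the pid from a pidfile if that process is gone, else None.

    'path' is hypothetical here; the point is that a pidfile only
    guards against duplicates if the stored pid is actually checked.
    """
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None
    return None if pid_is_alive(pid) else pid
}}}

If a ^C interrupts the stop/start sequence at the wrong moment, the pidfile
can end up describing the wrong (or no) process while the old daemon is still
running, and a check like the above passes for the wrong reason.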
--
Aaron Maxwell .:. amax at snaplogic.org .:. http://snaplogic.org
SnapLogic, Inc. - Data Integration for the Last Mile