[users at bb.net] How to Check Worker Status?
ngilmore at grammatech.com
Thu Feb 1 18:29:34 UTC 2018
What we have is a builder that ssh's into the machine the worker is
running on, cd's into the worker's directory, and looks for twistd.pid,
and restarts based on whether it's present and whether it appears in the
process list and so on.
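
A minimal sketch of that check, assuming a POSIX host and that the worker's base directory holds a twistd.pid file as described above. The function names are made up for illustration and are not part of Buildbot. A stale PID can be reused by an unrelated process, so a stricter check would also look at the process name in the process list.

```python
import os
from pathlib import Path


def read_pid(basedir):
    """Return the PID stored in <basedir>/twistd.pid, or None if missing or unreadable."""
    try:
        return int((Path(basedir) / "twistd.pid").read_text().strip())
    except (OSError, ValueError):
        return None


def pid_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def worker_needs_start(basedir):
    """True when there is no usable pidfile or the recorded PID is not running."""
    pid = read_pid(basedir)
    return pid is None or not pid_alive(pid)
```

A caller would run `buildbot-worker start <basedir>` only when `worker_needs_start(basedir)` returns True, which avoids interrupting a worker that is mid-build.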
One huge benefit of this over cron jobs is that we can construct the
list of workers inside of master.cfg. Very useful when our master.cfg
changes multiple times in a day. The only cron job we run is one for the
masters and the single worker that runs the builder that checks and
starts all the others.
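
As a rough illustration of driving this from a single list, the sketch below builds one ssh command per worker from a dictionary that could live in master.cfg. The host names, paths, and the `WORKERS` and `check_and_start_command` names are hypothetical; our actual builder is not shown here, and this only assembles the commands that a shell step could run.

```python
import shlex

# Hypothetical inventory: worker name -> (ssh host, worker base directory).
WORKERS = {
    "worker1": ("build01.example.com", "/home/buildbot/worker1"),
    "worker2": ("build02.example.com", "/home/buildbot/worker2"),
}


def check_and_start_command(host, basedir):
    """Build an ssh argv that starts the worker only if its recorded PID is not alive."""
    d = shlex.quote(basedir)
    remote = (
        f'if [ -f {d}/twistd.pid ] && kill -0 "$(cat {d}/twistd.pid)" 2>/dev/null; '
        f"then echo running; else buildbot-worker start {d}; fi"
    )
    return ["ssh", host, remote]


def all_commands(workers):
    """Return {worker name: ssh argv} for every worker in the inventory."""
    return {
        name: check_and_start_command(host, basedir)
        for name, (host, basedir) in workers.items()
    }
```

Because the commands are derived from the same data as the rest of the configuration, adding a worker to the dictionary is enough to have it checked on the next run.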
And at least in our older system, buildbot-worker start does terminate,
but it takes 10 seconds or more before printing the following:
The worker took more than 10 seconds to start and/or connect to the
buildmaster, so we were unable to confirm that it started and connected
correctly. Please 'tail twistd.log' and look for a line that says
'message from master: attached' to verify correct startup. If you see a
bunch of messages like 'will retry in 6 seconds', your worker might not
have the correct hostname or portnumber for the buildmaster, or the
buildmaster might not be running. If you see messages like
'Failure: twisted.cred.error.UnauthorizedLogin'
then your worker might be using the wrong botname or password. Please
correct these problems and then restart the worker.
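
The manual check that message suggests could be automated along these lines. This is only a sketch: it assumes the success marker in twistd.log is the string 'message from master: attached', and the verdict labels and function name are invented for illustration.

```python
# (substring to look for in a log line, verdict to report)
MARKERS = [
    ("message from master: attached", "attached"),
    ("twisted.cred.error.UnauthorizedLogin", "bad-credentials"),
    ("will retry in", "cannot-reach-master"),
]


def classify_startup(log_text):
    """Classify a twistd.log excerpt by its most recent recognizable line."""
    # Walk backwards so that a successful attach after earlier retries wins.
    for line in reversed(log_text.splitlines()):
        for needle, verdict in MARKERS:
            if needle in line:
                return verdict
    return "unknown"
```

Reading only the tail of the log written since the start attempt keeps old entries from a previous run out of the verdict.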
On 2/1/2018 12:19 PM, Chris Spencer wrote:
> I'm having a problem with workers randomly stopping. From the worker's
> logs, I'm seeing:
> 2018-01-26 01:22:33-0500 [-] sending app-level keepalive
> 2018-01-26 01:32:33-0500 [-] sending app-level keepalive
> 2018-01-26 01:42:33-0500 [-] sending app-level keepalive
> 2018-01-26 01:52:33-0500 [-] sending app-level keepalive
> 2018-01-26 02:00:00-0500 [-] Received SIGTERM, shutting down.
> 2018-01-26 02:00:00-0500 [HangCheckProtocol,client] Lost connection to 10.159.135.58:9989
> 2018-01-26 02:00:00-0500 [-] Stopping factory <buildbot_worker.pb.BotFactory instance at 0x7f50af441950>
> 2018-01-26 02:00:00-0500 [-] Main loop terminated.
> 2018-01-26 02:00:00-0500 [-] Server Shut Down.
> However, my master's still running, as well as other workers, so I
> don't know why a single worker would receive a SIGTERM, and
> nothing else.
> To work around this issue, I want to create a cronjob that
> periodically checks to see if the worker has stopped and restart it.
> Looking at the docs for buildbot-worker at
> http://docs.buildbot.net/latest/manual/cmdline.html, I see options to
> start, stop and restart, but there's no option to check status.
> How do I check to see if a specific worker is running, so I know to
> restart it?
> I tried just re-running `buildbot-worker start workerN` but that hangs
> if that worker is already running, showing the error message:
> Following twistd.log until startup finished..
> Another twistd server is running, PID 13758
> This could either be a previously started instance of your application or a
> different application entirely. To start a new one, either run it in some other
> directory, or use the --pidfile and --logfile parameters to avoid clashes.
> Why does that not simply exit after showing the error message? I had
> to send ctrl-c to make it return.
> And obviously I don't want to run `buildbot-worker restart workerN`
> because that will kill the current worker if it's already running,
> interrupting the current build.
> I can check for the existence of <buildbot_dir>/workerN/twistd.pid,
> but that feels a little hacky and likely to break if Buildbot changes
> how it tracks worker pids.