[users at bb.net] How to Check Worker Status?

Thu Feb 1 18:29:34 UTC 2018

What we have is a builder that ssh's into the machine the worker is 
running on, cd's into the worker's directory, and looks for twistd.pid. 
It restarts the worker based on whether that file is present and whether 
the PID in it appears in the process list, among other checks.
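
For illustration only, the pidfile check described above could look roughly 
like the sketch below. This is not our actual builder; the function name 
worker_status and its three return values are made up for the example. It 
assumes a POSIX system and that twistd.pid contains a single integer PID.

```python
import os
from pathlib import Path


def worker_status(worker_dir):
    """Return 'running', 'stale', or 'stopped' for a worker base directory."""
    pid_file = Path(worker_dir) / "twistd.pid"
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        # No pidfile, or one that does not hold an integer.
        return "stopped"
    if pid <= 0:
        # Not a usable PID; never probe process groups by accident.
        return "stopped"
    try:
        # Signal 0 only checks that the process exists; nothing is delivered.
        os.kill(pid, 0)
    except ProcessLookupError:
        # The pidfile was left behind by a process that is gone.
        return "stale"
    except PermissionError:
        # The process exists but belongs to another user.
        return "running"
    return "running"
```

A live PID does not prove the process is the worker, since PIDs get reused, 
so a real check would probably also look at the process's command line.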

One huge benefit of this over cron jobs is that we can construct the 
list of workers inside of master.cfg. Very useful when our master.cfg 
changes multiple times in a day. The only cron job we run is one for the 
masters and the single worker that runs the builder that checks and 
starts all the others.
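
As a rough sketch of the idea (not our real master.cfg), the per-worker 
check commands can be generated from a plain Python list. The WORKERS list, 
the host names, the paths, and the check_command helper are all hypothetical. 
It also assumes that `buildbot-worker start .` run from the worker's 
directory is how the worker gets started on your hosts.

```python
import shlex

# Hypothetical inventory; in practice this would be derived from the
# worker definitions already present in master.cfg.
WORKERS = [
    ("build1.example.com", "/home/buildbot/worker1"),
    ("build2.example.com", "/home/buildbot/worker2"),
]


def check_command(host, worker_dir):
    """Return an ssh argv that starts the worker unless its PID is alive."""
    remote = (
        # Bail out if the directory is missing, so nothing runs elsewhere.
        f"cd {shlex.quote(worker_dir)} || exit 1; "
        # kill -0 tests for the process without sending a signal.
        'if [ -f twistd.pid ] && kill -0 "$(cat twistd.pid)" 2>/dev/null; '
        "then echo running; "
        "else buildbot-worker start .; fi"
    )
    return ["ssh", host, remote]


# One argv per worker, ready to hand to whatever runs shell commands.
CHECK_COMMANDS = [check_command(host, path) for host, path in WORKERS]
```

Each entry in CHECK_COMMANDS could then become the command of one shell step 
in the checking builder, so adding a worker to the list adds its check too.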

And at least in our older system, buildbot-worker start will terminate, 
but it takes 10 seconds or more before spitting out the following:
    The worker took more than 10 seconds to start and/or connect to the buildmaster,
    so we were unable to confirm that it started and connected correctly. Please
    'tail twistd.log' and look for a line that says 'message from master: attached'
    to verify correct startup. If you see a bunch of messages like 'will retry in 6
    seconds', your worker might not have the correct hostname or portnumber for the
    buildmaster, or the buildmaster might not be running. If you see messages like
      'Failure: twisted.cred.error.UnauthorizedLogin'
    then your worker might be using the wrong botname or password. Please correct
    these problems and then restart the worker.
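
If you want to script the 'tail twistd.log' step, something like the 
following sketch could work. The function name classify_startup and its 
return values are invented for the example, and the marker strings are taken 
from the message above, so the exact wording in your twistd.log may differ 
by version.

```python
def classify_startup(log_lines):
    """Return the state implied by the most recent recognizable log line."""
    markers = [
        ("message from master: attached", "attached"),
        ("will retry in", "retrying"),
        ("twisted.cred.error.UnauthorizedLogin", "unauthorized"),
    ]
    # Walk backwards so that the newest matching line wins.
    for line in reversed(list(log_lines)):
        for needle, state in markers:
            if needle in line:
                return state
    return "unknown"
```

Since twistd.log also holds lines from earlier runs, it is probably safer to 
pass in only the lines written after the start attempt.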

Neil Gilmore
grammatech.com

On 2/1/2018 12:19 PM, Chris Spencer wrote:
> I'm having a problem with workers randomly stopping. From the worker's 
> logs, I'm seeing:
>
> 2018-01-26 01:22:33-0500 [-] sending app-level keepalive
> 2018-01-26 01:32:33-0500 [-] sending app-level keepalive
> 2018-01-26 01:42:33-0500 [-] sending app-level keepalive
> 2018-01-26 01:52:33-0500 [-] sending app-level keepalive
> 2018-01-26 02:00:00-0500 [-] Received SIGTERM, shutting down.
> 2018-01-26 02:00:00-0500 [HangCheckProtocol,client] Lost connection to 10.159.135.58:9989
> 2018-01-26 02:00:00-0500 [-] Stopping factory <buildbot_worker.pb.BotFactory instance at 0x7f50af441950>
> 2018-01-26 02:00:00-0500 [-] Main loop terminated.
> 2018-01-26 02:00:00-0500 [-] Server Shut Down.
>
> However, my master's still running, as well as other workers, so I 
> don't know why a single worker would receive a SIGTERM, and 
> nothing else.
>
> To work around this issue, I want to create a cronjob that 
> periodically checks to see if the worker has stopped and restarts it. 
> Looking at the docs for buildbot-worker at 
> http://docs.buildbot.net/latest/manual/cmdline.html, I see options to 
> start, stop and restart, but there's no option to check status.
>
> How do I check to see if a specific worker is running, so I know to 
> restart it?
>
> I tried just re-running `buildbot-worker start workerN` but that hangs 
> if that worker is already running, showing the error message:
>
>     Following twistd.log until startup finished..
>     Another twistd server is running, PID 13758
>
>     This could either be a previously started instance of your application or a
>     different application entirely. To start a new one, either run it in some other
>     directory, or use the --pidfile and --logfile parameters to avoid clashes.
>
> Why does that not simply exit after showing the error message? I had 
> to send ctrl-c to make it return.
>
> And obviously I don't want to run `buildbot-worker restart workerN` 
> because that will kill the current worker if it's already running, 
> interrupting the current build.
>
> I can check for the existence of <buildbot_dir>/workerN/twistd.pid, 
> but that feels a little hacky and likely to break if Buildbot changes 
> how it tracks worker pids.