[Buildbot-devel] buildbot slave hangs on SunOS?

Brian Warner warner at lothar.com
Wed Feb 4 21:11:41 UTC 2004


> The cvs output comes out as usual, and seems to work, but it never
> seems to detect the end of the cvs command. (That is, the command
> doesn't ever return.)

Interesting..

> There are some weird things about python on a sun system. There is a
> test, relating to pty, which fails.
> 
> Similarly, some of the twisted tests hang on the sun:
> 
> <snip>
>   PosixProcessTestCasePTY
>     testAbnormalTermination ...                                          [FAIL]
>     testNormalTermination ... 
> 
> And it hangs here. Perhaps this is related to the problem?

Yes, definitely.

(by the way, you can run the twisted tests with the -o option to get a
similarly verbose list of tests and their results, but without the ANSI
color-switching escape sequences)

So it looks like there is a Solaris/PTY/Python problem. As a workaround, try
telling the buildslave not to use PTYs: in buildbot/slavecommand.py, edit
line 169 (in ShellCommand.start) to say 'usePTY=0' instead of 'usePTY=1'.
That will make it fall back to running the child process under a regular set
of pipes instead of a PTY, which might avoid triggering that bug.
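
To make "a regular set of pipes" concrete, here is a minimal sketch using
only the Python standard library. It is not the buildslave's code (that goes
through Twisted's process support); run_with_pipes is a hypothetical helper
name, and the sketch assumes a modern Python 3 interpreter. It only shows
what a child sees when no PTY is involved.

```python
import subprocess
import sys


def run_with_pipes(argv, timeout=30):
    """Run argv with its output attached to an ordinary pipe, not a PTY.

    Hypothetical helper for illustration only.
    Returns (exit_code, combined_stdout_and_stderr_bytes).
    """
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,   # the child gets no terminal on stdin
        stdout=subprocess.PIPE,     # plain pipe back to the parent
        stderr=subprocess.STDOUT,   # fold stderr into the same pipe
    )
    output, _ = proc.communicate(timeout=timeout)
    return proc.returncode, output


if __name__ == "__main__":
    # The child reports whether its stdout looks like a terminal.
    rc, out = run_with_pipes(
        [sys.executable, "-c", "import sys; print(sys.stdout.isatty())"]
    )
    print(rc, out.decode().strip())
```

Some programs change their buffering or output format when stdout is not a
terminal, so the logs may look slightly different from a PTY run.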

The only disadvantage of usePTY=0 is that it makes it impractical to kill off
all the grand-child processes if/when the command times out. In simple build
situations this is rare, but I've had projects in which the unit tests needed
to spawn several helper child processes themselves. In those cases, if the
test program locked up, the buildslave's timeout would kick in and kill off
the top-level 'make test' process, but it would have no process group by
which to kill off the (now orphaned) helper children.
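
The process-group point can be sketched as follows. This is an illustration
under assumptions, not what the buildslave does: it uses the
start_new_session option of the modern Python 3 subprocess module (which
calls setsid() in the child) to give the child its own process group, and
the helper names and the CHILD script are made up for the example. Killing
the group with os.killpg() is equivalent to the os.kill(-pid, ...) call that
appears in the log quoted below.

```python
import os
import signal
import subprocess
import sys

# Hypothetical stand-in for a 'make test' that spawns a helper process:
# it starts a sleeping grandchild, prints the grandchild's pid, then hangs.
CHILD = (
    "import subprocess, sys, time\n"
    "g = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])\n"
    "print(g.pid, flush=True)\n"
    "time.sleep(60)\n"
)


def spawn_in_new_group(argv):
    """Start argv as the leader of a new session and process group."""
    return subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        start_new_session=True,  # child runs setsid(); its pgid equals its pid
    )


def kill_group(proc):
    """SIGKILL every process in proc's group, then reap the direct child."""
    os.killpg(proc.pid, signal.SIGKILL)  # same effect as os.kill(-pid, SIGKILL)
    proc.wait()


if __name__ == "__main__":
    proc = spawn_in_new_group([sys.executable, "-c", CHILD])
    print("helper pid:", int(proc.stdout.readline()))
    kill_group(proc)
    # EOF arrives only once every holder of the pipe, helper included, is gone.
    proc.stdout.read()
    print("child exit status:", proc.returncode)
```

Without a shared group, only the direct child could be signalled, and the
helper would keep running after the timeout.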

> 2004/02/04 13:48 MST [-] command timed out: 1200 seconds without output, killing pid 26839
> 2004/02/04 13:48 MST [-] trying os.kill(-pid, signal.SIGKILL)
> 2004/02/04 13:48 MST [-]  successful
> 2004/02/04 13:48 MST [-] trying process.signalProcess('KILL')
> 2004/02/04 13:48 MST [-]  successful
> 2004/02/04 13:48 MST [-] Failure: buildbot.slavecommand.TimeoutError: SIGKILL failed to kill process

I suspect that process reaping is somehow broken on Solaris: the child
process finishes, but the parent somehow misses the signal that indicates it
should be reaped. I noticed your message to the twisted-dev list. In
addition to that, I'd file a bug on their bug tracker (linked from the
twistedmatrix.com home page). They may need your help to track down the
problem, since I don't think most of the developers have access to a Solaris
box.
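
For anyone trying to reproduce this outside of Twisted, here is a rough
sketch of the kill-then-reap sequence using only the standard library. It is
not the buildbot or Twisted implementation; kill_and_reap and its grace and
poll parameters are hypothetical. It sends SIGKILL and then polls
os.waitpid() instead of relying on a signal to announce the child's exit, so
it may help show whether the child actually becomes reapable on the Solaris
box.

```python
import os
import signal
import sys
import time


def kill_and_reap(pid, grace=5.0, poll=0.05):
    """SIGKILL a direct child and poll until it has been reaped.

    Hypothetical helper for illustration only. Returns the raw wait status,
    or raises TimeoutError if the child cannot be reaped within `grace`
    seconds.
    """
    os.kill(pid, signal.SIGKILL)
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        # WNOHANG: return (0, 0) immediately if the child has not exited yet.
        reaped, status = os.waitpid(pid, os.WNOHANG)
        if reaped == pid:
            return status
        time.sleep(poll)
    raise TimeoutError("pid %d was not reaped after SIGKILL" % pid)


if __name__ == "__main__":
    # Start a child that would otherwise sleep for a minute.
    child = os.spawnv(
        os.P_NOWAIT,
        sys.executable,
        [sys.executable, "-c", "import time; time.sleep(60)"],
    )
    status = kill_and_reap(child)
    print("killed by signal:", os.WTERMSIG(status))
```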

let me know how that works,
 -Brian



