[Buildbot-devel] spurious SIGHUP when running under Ubuntu hardy heron (buildbot 0.7.6)

Brian Warner warner-buildbot at lothar.com
Mon Mar 24 07:38:41 UTC 2008


On Thu, 6 Mar 2008 20:21:21 -0500
"Charles Lepple" <clepple at gmail.com> wrote:

> Signal 1 seems to be SIGHUP, so I ran it under strace, and caught the
> following: ...
> 
> However, I am curious as to what changed, and if there is something
> simple that could be added to buildbot to work around this. It doesn't
> seem like the slave side uses SIGHUP for much, if anything, but I am
> somewhat mystified by the details of signal handling across process
> groups, etc.

I've been wondering the same thing. I've been seeing several process-spawning
unit tests fail about 10% of the time from SIGHUP (on my debian/sid system),
and I wasn't seeing that six months ago.

We close stdin on the child process shortly after starting it, as soon as we
finish writing any data that's supposed to go to stdin (which is usually an
empty string, but 'try' builds will send data to the 'patch' command, and
other buildsteps can use the same mechanism). When the child is running under
a PTY, I suspect this makes the PTY believe that we're done with it, so it
sends SIGHUP to the processes attached to it.
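
To illustrate the suspected mechanism, here is a minimal sketch (Linux/POSIX
only) and not buildbot's actual code. It assumes that "closing stdin" on a PTY
child amounts to closing the parent's (master) end of the PTY. The function
name sighup_on_master_close and the timeout parameter are made up for this
example.

```python
import os
import pty
import signal
import time


def sighup_on_master_close(timeout=10):
    """Fork a child on a fresh PTY, close the master end, and report
    whether the child was terminated by SIGHUP."""
    pid, master_fd = pty.fork()
    if pid == 0:
        # Child: session leader, with the PTY slave as controlling terminal.
        try:
            # Make sure SIGHUP has its default (terminate) action, even if
            # the parent was started with SIGHUP ignored (e.g. under nohup).
            signal.signal(signal.SIGHUP, signal.SIG_DFL)
            os.write(1, b"r")      # tell the parent we are set up
            time.sleep(timeout)    # pretend to do some work
        finally:
            os._exit(0)            # only reached if no SIGHUP arrived
    # Parent: wait until the child is ready, then hang up the terminal.
    os.read(master_fd, 1)
    os.close(master_fd)
    _, status = os.waitpid(pid, 0)
    return os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGHUP
```

On a system where hanging up the PTY signals the session leader, the function
returns True well before the timeout expires. The timing question raised in
this thread is a separate matter that this sketch does not try to reproduce.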

My suspicion is that something changed in recent kernels or maybe libc, which
changed the timing of this SIGHUP (or maybe older versions never sent one at
all). Running the same strace on the gutsy box might reveal something.

I've switched most of the unit tests to usePTY=False, to avoid this problem
during tests. I think that I'm going to change the way we use stdin too: the
combination to avoid is usePTY=True and closeStdin=True. I think that means
that if we're writing anything to stdin, then we need to *not* be using a
PTY.
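
The rule above could be expressed as a small guard. This is only a sketch: the
helper name effective_use_pty and its stdin_data argument are hypothetical and
not part of buildbot's API. Only the usePTY/closeStdin combination comes from
this thread.

```python
def effective_use_pty(use_pty, close_stdin, stdin_data=b""):
    """Decide whether a child may really be run under a PTY.

    Hypothetical helper: falls back to plain pipes whenever we would
    write to, or close, the child's stdin, since that combination is
    the one suspected of triggering the spurious SIGHUP.
    """
    if use_pty and (close_stdin or stdin_data):
        return False
    return use_pty
```

With this guard, a 'try' build that feeds a patch to stdin would always get
pipes, while a command that never touches stdin keeps its PTY.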

Note that we use PTYs to put the child subprocess, and all of its
descendants, into the same "process group". If we ever interrupt the build,
we do an os.kill() with the negative PID number, which means "kill everything
in the process group". Once upon a time, I had unit test suites which ran
under 'make' and began by launching a daemon for the test programs to work
against; the daemon was supposed to be shut down when the tests finished.
Without the process group trick, the daemon would tend to get left running
whenever the tests failed.
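
For reference, the process group trick does not strictly need a PTY. The
following sketch (POSIX only) is a swapped-in alternative and not how buildbot
does it: it uses subprocess with start_new_session=True instead of a PTY. The
function name run_and_kill_group is made up for the example.

```python
import os
import signal
import subprocess


def run_and_kill_group(argv, sig=signal.SIGTERM):
    """Start argv in its own process group, signal the whole group,
    and return the child's exit status as reported by subprocess."""
    # start_new_session=True makes the child call setsid(), so it becomes
    # the leader of a new process group whose ID equals its PID. Any
    # descendants it spawns stay in that group unless they leave it.
    proc = subprocess.Popen(argv, start_new_session=True)
    # Equivalent to os.kill(-proc.pid, sig): signal everything in the group.
    os.killpg(proc.pid, sig)
    return proc.wait()
```

A negative return value from proc.wait() means the child was killed by that
signal number, so a daemon launched by the test suite would be taken down
along with 'make' itself.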

But for most programs (like 'svn' or 'cp'), we don't need this sort of
complication, and in fact it looks like it interacts badly with the way we
handle stdin writing and closing.


There are a couple of tickets around this one: #158 and #198. They won't make
it into the upcoming 0.7.7, but I think we can fix it at the beginning of the
next cycle.

cheers,
 -Brian



