[Buildbot-devel] buildbot from git hangs during setup until slave connects?

Dustin J. Mitchell dustin at v.igoro.us
Tue Sep 18 02:00:29 UTC 2012


This does sound something like #1992, in the sense that the selfpipe
is failing entirely.  It's interesting that it looks like it's failing
in epipe, not select.  (This sounds a little different from #2055, as
in that bug there was a deadlock that couldn't be fixed for love or
money.  In your case, sending a signal to the process *fixes* it,
rather than breaking it.)

A few thoughts for further experimentation:

Does sending a harmless signal (kill -CONT $pid) to the process "unstick" it?

You mention a full strace - can I see a snippet of that, to know
what's going on beforehand, too?  Also, strace -f might help to
distinguish threads in the same process.  It seems odd that the writes
and reads to the selfpipe are in the same thread, but that's what it
looks like from what you've pasted.

The stats of /etc/localtime are related to the printing of log
messages.  I suppose it wants to be responsive to timezone changes.
As far as I can see all glibc versions do that.  I'm surprised I don't
see calls to write() after each of the stat's, though.

The relevant snippet from a run of the latest master on my system
(which is a pretty up-to-date Gentoo system) is

[pid 13059] ftruncate(9, 76800)         = 0
[pid 13059] fsync(9)                    = 0
[pid 13059] unlink("/home/dustin/code/buildbot/t/buildbot/sand27/master/state.sqlite-shm")
= 0
[pid 13059] munmap(0x7fc59cf4c000, 32768) = 0
[pid 13059] close(11)                   = 0
[pid 13059] close(10)                   = 0
[pid 13059] unlink("/home/dustin/code/buildbot/t/buildbot/sand27/master/state.sqlite-wal")
= 0
[pid 13059] fcntl(9, F_SETLK, {type=F_UNLCK, whence=SEEK_SET,
start=1073741824, len=2}) = 0
[pid 13059] fcntl(9, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=0,
len=0}) = 0
[pid 13059] close(9)                    = 0
[pid 13059] write(8, "x", 1)            = 1
[pid 13052] <... epoll_wait resumed> {{EPOLLIN, {u32=7,
u64=11286045658605158407}}}, 3, 2937) = 1
[pid 13052] futex(0xb9f020, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 13059] futex(0xb9f020, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 13052] <... futex resumed> )       = -1 EAGAIN (Resource
temporarily unavailable)
[pid 13059] <... futex resumed> )       = 0
[pid 13052] read(7,  <unfinished ...>
[pid 13059] futex(0x7fc588019460, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 13052] <... read resumed> "x", 8192) = 1
[pid 13052] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9
[pid 13052] fcntl(9, F_GETFL)           = 0x2 (flags O_RDWR)
[pid 13052] fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 13052] fcntl(9, F_GETFD)           = 0
[pid 13052] fcntl(9, F_SETFD, FD_CLOEXEC) = 0
[pid 13052] setsockopt(9, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid 13052] bind(9, {sa_family=AF_INET, sin_port=htons(8888),
sin_addr=inet_addr("0.0.0.0")}, 16) = 0
[pid 13052] listen(9, 50)               = 0
[pid 13052] getsockname(9, {sa_family=AF_INET, sin_port=htons(8888),
sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
[pid 13052] stat("/etc/localtime", {st_mode=S_IFREG|0644,
st_size=3519, ...}) = 0
[pid 13052] write(1, "2012-09-17 21:40:46-0400 [-] PBS"..., 62) = 62
[pid 13052] stat("/etc/localtime", {st_mode=S_IFREG|0644,
st_size=3519, ...}) = 0
[pid 13052] write(1, "2012-09-17 21:40:46-0400 [-] Sta"..., 104) = 104
[pid 13052] epoll_ctl(6, EPOLL_CTL_ADD, 9, {EPOLLIN, {u32=9,
u64=11286045658605158409}}) = 0
[pid 13052] getcwd("/home/dustin/code/buildbot/t/buildbot/sand27/master",
1024) = 52
...
[pid 13052] open("/home/dustin/code/buildbot/t/buildbot/sand27/lib/python2.7/site-packages/jinja2/ext.py",
O_RDONLY) = 10

Here, 13059 is the db thread.  It writes an 'x' to the selfpipe, which
wakes up the epoll in 13052.  The next bit starts binding a PB port
(8888), then begins the web status startup.

In comparing this to your snippet, I'm not sure why you saw the write
to the selfpipe *after* you awakened the process with a signal.
However, that may have been a different write to the selfpipe, since
six x's were queued in the pipe.

As a final debugging technique, it's probably worth adding the hack
patch that I landed for #1992 to see if that helps.  If so, then a
good next step is to compare the versions of Twisted and Python you're
running to those in the bug.

Dustin




More information about the devel mailing list