[Buildbot-devel] buildbot from git hangs during setup until slave connects?
Dan Kegel
dank at kegel.com
Mon Sep 17 22:31:40 UTC 2012
On Tue, Sep 11, 2012 at 6:25 PM, Dustin J. Mitchell <dustin at v.igoro.us> wrote:
>> Two things will unstick it:
>> 1) if a slave connects (or if I telnet to the slave port)
>> 2) if I strace the process (with strace -p)
>
> Oh dear, this sounds like a Python or Twisted bug of some sort.
> Generally when strace will "fix" things, it means that some change is
> not being properly communicated to Twisted via the select() call. In
> older versions, this was sometimes due to missing a SIGCHLD signal
> when a child exited before the signal handler was installed. But
> schedulers don't use subprocesses. So that's odd.
>
> When you do strace it, what do you see?
It happend a bunch of times today, so I have some data.
This is on a single-core machine (really, a vmware instance with
only a single core allocated), but I'm not sure that's important.
When I do
$ cat /home/buildbot/master-state/sandbox/gspeak/twistd.pid
62047
$ strace -o /tmp/slog -p 62047
attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
$ echo 0|sudo tee /proc/sys/kernel/yama/ptrace_scope
0
$ strace -o /tmp/slog -p 62047
it gets unstuck, and /tmp/slog starts off with a double stat of localtime:
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
futex(0x7ff610002dd0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x259f290, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7ff614013210, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x259f290, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7ff6240454c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x259f290, FUTEX_WAKE_PRIVATE, 1) = 1
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
write(4, "2012-09-17 14:29:03-0700 [-] add"..., 63) = 63
futex(0x7ff6200129c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x259f290, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7ff60c052cd0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x259f290, FUTEX_WAKE_PRIVATE, 1) = 1
write(8, "x", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=7, u64=22205092589469703}}}, 5, 108884) = 1
read(7, "xxxxxx", 8192) = 6
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
futex(0x7ff61c0379a0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x259f290, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7ff618012620, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x259f290, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7ff610002dd0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x259f290, FUTEX_WAKE_PRIVATE, 1) = 1
getcwd("/home/buildbot/master-state/sandbox/gspeak", 1024) = 43
...
open("/home/buildbot/master-state/sandbox/local/lib/python2.7/site-packages/jinja2/ext.py",
O_RDONLY) = 12
It seems:
- the hang is shortly before jinja2/ext.py is loaded
- can be found in the strace log of a full startup by looking for the
last call to getcwd()
- the self-pipe trick from _SocketWaker is happening in the vicinity
and might be related
Maybe a call to wakeUp is missing somewhere, or doesn't do what it should.
A little searching finds I'm not the first person to report mysterious hangs
on startup: http://trac.buildbot.net/ticket/1992
And now it's stopped happening again. I'll be back with more data next time
it happens.
- Dan
More information about the devel
mailing list