[Buildbot-devel] buildbot from git hangs during setup until slave connects?

Dan Kegel dank at kegel.com
Fri Sep 28 22:20:19 UTC 2012


Short story:
Passing --quiet to buildbot start seems to work around the problem,
and twistd's --reactor poll option might also work around it.

Long story:

I ran the stress script again, with buildbot under strace -f.  After
four good start/stop cycles, it failed like the first time (web status
port closed, but slave status port open).
Here's the log: http://kegel.com/twistd.log
And here's the strace log: http://kegel.com/slog5.bad.rz
Filtering out just the lines of interest with
  rzip -d slog5.bad.rz
  egrep 'clone|execve|64220.*10|64228.*10' < slog5.bad
I see
    423 64220 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f12193099d0) = 64221
    424 64220 pipe([10, 11])                    = 0
    425 64220 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f12193099d0) = 64222
    426 64220 fcntl(10, F_GETFL)                = 0 (flags O_RDONLY)
    427 64220 fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
    428 64220 epoll_ctl(5, EPOLL_CTL_ADD, 10, {EPOLLIN, {u32=10,
u64=22205092589469706}} <unfinished ...>
...
    444 64227 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f12193099d0) = 64228
    445 64228 open("/home/buildbot/master-state/sandbox/hello/cfg.py",
O_RDONLY) = 10
...
    752 64228 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
...
    758 64228 bind(10, {sa_family=AF_INET, sin_port=htons(60020),
sin_addr=inet_addr("0.0.0.0")}, 16) = 0
    759 64228 listen(10, 50)                    = 0
...
    762 64228 epoll_ctl(5, EPOLL_CTL_ADD, 10, {EPOLLIN, {u32=10,
u64=22205092589469706}}) = 0
    763 64228 futex(0x7f121002b290, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    764 64228 <... epoll_wait resumed> {{EPOLLERR, {u32=4,
u64=22205092589469700}}, {EPOLLHUP, {u32=10, u64=22205092589469706}}},
4, 996) = 2
    765 64220 epoll_ctl(5, EPOLL_CTL_DEL, 10,
{EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x4e9020, {u32=0,
u64=22205092589469696}} <unfinished ...>
    766 64228 epoll_ctl(5, EPOLL_CTL_DEL, 10,
{EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x4e9020, {u32=0,
u64=22205092589469696}} <unfinished ...>
    767 64228 shutdown(10, 2 /* send and receive */ <unfinished ...>
    768 64228 close(10 <unfinished ...>
    769 64220 close(10 <unfinished ...>
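
For anyone without egrep handy, the same filter can be sketched in
Python.  The pattern is the one from the egrep command above; the
function names and the sample lines are made up for illustration, and
the parser assumes plain "strace -f" lines of the form
"PID syscall(...)" (without the leading line numbers shown in the
excerpt).

```python
import re

# Same pattern as the egrep command: fork/exec calls, plus any line from
# pid 64220 or 64228 that mentions "10" somewhere after the pid.
INTEREST = re.compile(r"clone|execve|64220.*10|64228.*10")


def filter_trace(lines):
    """Return only the lines the egrep command would have kept."""
    return [line.rstrip("\n") for line in lines if INTEREST.search(line)]


def syscalls_by_pid(lines):
    """Map pid -> list of syscall names, in order of appearance."""
    calls = {}
    for line in lines:
        m = re.match(r"\s*(\d+)\s+(\w+)\(", line)
        if m:
            calls.setdefault(m.group(1), []).append(m.group(2))
    return calls


if __name__ == "__main__":
    # Hypothetical sample in strace -f format, loosely based on the excerpt.
    sample = [
        "64220 pipe([10, 11])                    = 0",
        "64220 getpid()                          = 64220",
        "64228 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10",
        "64228 listen(10, 50)                    = 0",
    ]
    kept = filter_trace(sample)
    print("\n".join(kept))
    print(syscalls_by_pid(kept))
```

Note that the pattern is loose: ".*10" also matches things like a
buffer size of 1024, so expect some noise on a full log.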

This makes me wonder whether epoll is confusing two different fd 10's
(the pipe opened by 64220 and the socket opened by 64228), or whether
stale events are going around.
Starting twistd with "--reactor poll" might work around this, but I
don't see an easy way to do it (ok, I'm lazy).
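
To illustrate the kind of fd confusion suspected here, below is a
minimal single-process sketch (Linux only, since it uses select.epoll).
It is not taken from buildbot or twisted, and it does not prove that
this is what happens in the trace.  It relies on the fact that an epoll
registration belongs to the underlying open file, not to the fd number:
if the number is closed and reused while another copy of the old file
stays open, events for the old file are still reported under that
number.  The os.dup() call stands in for the parent process keeping its
copy of the pipe after fork, and a socketpair stands in for the
listening socket; the function name is made up.

```python
import os
import select
import socket


def stale_epoll_event():
    """Show an epoll event reported under an fd number that was reused.

    Returns (pipe_fd, sock_fd, events).
    """
    ep = select.epoll()
    r, w = os.pipe()
    ep.register(r, select.EPOLLIN)   # like EPOLL_CTL_ADD on the pipe's fd

    # Keep the pipe's open file alive under another number.  This stands
    # in for another process still holding its copy after fork.
    keep = os.dup(r)
    os.close(r)                      # number r is free; registration stays

    # New fds take the lowest free number, so the socket reuses r.
    a, b = socket.socketpair()
    sock_fd = a.fileno()
    ep.register(sock_fd, select.EPOLLIN)  # accepted: same number, new file

    os.close(w)                      # last writer gone: hangup on the pipe
    events = ep.poll(1.0)            # list of (fd number, event mask)

    os.close(keep)
    a.close()
    b.close()
    ep.close()
    return r, sock_fd, events


if __name__ == "__main__":
    pipe_fd, sock_fd, events = stale_epoll_event()
    print("pipe fd:", pipe_fd, "socket fd:", sock_fd)
    for fd, mask in events:
        print("event on fd", fd, "hangup:", bool(mask & select.EPOLLHUP))
```

On Linux this should report a hangup under the socket's fd number even
though nothing happened to the socket.  A reactor that trusts the
number would then shut down the wrong object, which is at least
consistent with the shutdown(10)/close(10) calls at the end of the
trace.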

The fork above looks like it's from
http://buildbot.net/buildbot/docs/latest/reference/buildbot.scripts.start-pysrc.html
I also see that the --quiet option to buildbot start disables that fork.
Running with --quiet let me run 24 iterations without trouble.

I'll file a bug report when/if I get around to the minimal test case.



