[Buildbot-devel] buildbot from git hangs during setup until slave connects?
Dan Kegel
dank at kegel.com
Fri Sep 28 22:20:19 UTC 2012
Short story:
--quiet seems to work around the problem, and twistd's --reactor poll
option might also work around it.
Long story:
I ran the stress script again, with buildbot under strace -f. After
four good start/stop cycles, it failed like the first time (web status
port closed, but slave status port open).
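
For reference, the failure symptom (web status port closed while the slave status port is open) can be checked with a few lines of Python. This is only a sketch and not the stress script itself; the helper names, host, and port numbers are hypothetical placeholders to be filled in from your own master.cfg.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return s.connect_ex((host, port)) == 0


def looks_like_the_failure(host, web_port, slave_port):
    """True when the web port is closed but the slave port is open."""
    web_open = port_is_open(host, web_port)
    slave_open = port_is_open(host, slave_port)
    return (not web_open) and slave_open
```

A stress loop could call looks_like_the_failure() after each start and stop as soon as it returns True.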
Here's the log: http://kegel.com/twistd.log
And here's the strace log: http://kegel.com/slog5.bad.rz
Filtering out just the lines of interest with
rzip -d slog5.bad.rz
egrep 'clone|execve|64220.*10|64228.*10' < slog5.bad
I see
423 64220 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f12193099d0) = 64221
424 64220 pipe([10, 11]) = 0
425 64220 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f12193099d0) = 64222
426 64220 fcntl(10, F_GETFL) = 0 (flags O_RDONLY)
427 64220 fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
428 64220 epoll_ctl(5, EPOLL_CTL_ADD, 10, {EPOLLIN, {u32=10,
u64=22205092589469706}} <unfinished ...>
...
444 64227 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f12193099d0) = 64228
445 64228 open("/home/buildbot/master-state/sandbox/hello/cfg.py",
O_RDONLY) = 10
...
752 64228 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
...
758 64228 bind(10, {sa_family=AF_INET, sin_port=htons(60020),
sin_addr=inet_addr("0.0.0.0")}, 16) = 0
759 64228 listen(10, 50) = 0
...
762 64228 epoll_ctl(5, EPOLL_CTL_ADD, 10, {EPOLLIN, {u32=10,
u64=22205092589469706}}) = 0
763 64228 futex(0x7f121002b290, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
764 64228 <... epoll_wait resumed> {{EPOLLERR, {u32=4,
u64=22205092589469700}}, {EPOLLHUP, {u32=10, u64=22205092589469706}}},
4, 996) = 2
765 64220 epoll_ctl(5, EPOLL_CTL_DEL, 10,
{EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x4e9020, {u32=0,
u64=22205092589469696}} <unfinished ...>
766 64228 epoll_ctl(5, EPOLL_CTL_DEL, 10,
{EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x4e9020, {u32=0,
u64=22205092589469696}} <unfinished ...>
767 64228 shutdown(10, 2 /* send and receive */ <unfinished ...>
768 64228 close(10 <unfinished ...>
769 64220 close(10 <unfinished ...>
This makes me wonder if epoll is getting two different fd 10's
confused, or there are stale events going around.
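
To illustrate how that kind of confusion can arise in general: an epoll instance is shared across fork(), and its registrations are keyed by fd number plus the underlying open file, so a child can register a different file under a number the parent already registered and then receive the parent's events for that number. The sketch below is Linux-only, uses only the Python standard library, and is not a confirmed reproduction of the buildbot problem; the function name and the setup are made up for illustration.

```python
import json
import os
import select
import socket


def child_events_after_fork(timeout=5.0):
    """Fork with a shared epoll; return (fd, events the child saw)."""
    ep = select.epoll()
    r, w = os.pipe()
    ep.register(r, select.EPOLLIN)      # parent watches the pipe's read end

    report_r, report_w = os.pipe()      # child reports its events here
    pid = os.fork()
    if pid == 0:                        # child
        try:
            os.close(report_r)
            os.close(w)
            os.close(r)                 # parent still holds the pipe open,
                                        # so the epoll registration survives
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.bind(("127.0.0.1", 0))
            sock.listen(1)
            if sock.fileno() != r:      # make sure the number is reused
                os.dup2(sock.fileno(), r)
            ep.register(r, select.EPOLLIN)  # same number, different file
            events = ep.poll(timeout)
            os.write(report_w, json.dumps(events).encode())
        finally:
            os._exit(0)

    os.close(report_w)
    os.close(w)                         # last writer gone: pipe reports HUP
    data = b""
    while True:
        chunk = os.read(report_r, 4096)
        if not chunk:
            break
        data += chunk
    os.waitpid(pid, 0)
    os.close(report_r)
    os.close(r)
    ep.close()
    return r, [tuple(e) for e in json.loads(data)]


if __name__ == "__main__":
    fd, events = child_events_after_fork()
    print("child's socket fd:", fd, "events seen by child:", events)
```

In this sketch the child gets an EPOLLHUP tagged with its own socket's fd number even though nothing happened to the socket; the event belongs to the parent's pipe. A reactor that trusts the fd number would then tear down the wrong object.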
I wonder if starting twistd with "--reactor poll" would work around
this, but I don't see an easy way to do it (ok, I'm lazy).
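
One reason poll might behave differently: poll() keeps no registration state in the kernel, so each call looks up the fd numbers in the calling process's own fd table. The sketch below repeats a fork-and-reuse-the-fd-number scenario with select.poll from the standard library. It says nothing definite about Twisted's poll reactor or about buildbot; the function name and setup are hypothetical.

```python
import json
import os
import select
import socket


def child_poll_events_after_fork(timeout_ms=500):
    """Fork with an inherited poll object; return (fd, events child saw)."""
    poller = select.poll()
    r, w = os.pipe()
    poller.register(r, select.POLLIN)   # parent watches the pipe's read end

    report_r, report_w = os.pipe()      # child reports its events here
    pid = os.fork()
    if pid == 0:                        # child
        try:
            os.close(report_r)
            os.close(w)
            os.close(r)
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.bind(("127.0.0.1", 0))
            sock.listen(1)
            if sock.fileno() != r:      # make sure the number is reused
                os.dup2(sock.fileno(), r)
            # The copied poll object still lists fd number r, but the kernel
            # resolves r in this process, where it is now the idle socket.
            events = poller.poll(timeout_ms)
            os.write(report_w, json.dumps(events).encode())
        finally:
            os._exit(0)

    os.close(report_w)
    os.close(w)                         # pipe now reports HUP to its readers
    data = b""
    while True:
        chunk = os.read(report_r, 4096)
        if not chunk:
            break
        data += chunk
    os.waitpid(pid, 0)
    os.close(report_r)
    os.close(r)
    return r, [tuple(e) for e in json.loads(data)]


if __name__ == "__main__":
    fd, events = child_poll_events_after_fork()
    print("child's socket fd:", fd, "events seen by child:", events)
```

Here the child's poll() times out with no events, because the hangup on the parent's pipe is not visible through the child's fd number.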
The fork above looks like it's from
http://buildbot.net/buildbot/docs/latest/reference/buildbot.scripts.start-pysrc.html
I also see that the --quiet option to buildbot start disables that fork.
Running with --quiet let me run 24 iterations without trouble.
I'll file a bug report when/if I get around to the minimal test case.