[Buildbot] #3183: fix network "spottiness" on new infra
Buildbot trac
trac at buildbot.net
Sun Feb 8 21:33:08 UTC 2015
#3183: fix network "spottiness" on new infra
---------------------+--------------------------
Reporter: dustin | Owner:
Type: defect | Status: new
Priority: critical | Milestone: sys - other
Version: | Resolution:
Keywords: |
---------------------+--------------------------
Comment (by dustin):
On that post, I once again got "Warning: the change has been saved ..",
and the connection hung for quite a while. I put the tcpdumps from both
trac and my laptop in
https://gist.github.com/djmitche/4dad36e0fdbfcc5cc215, with some editing
to make them more useful. Overall, the connection shows a 78-second TCP
connection. The "trac" host is using TSO, which means that tcpdump on
that host captures the oversized segments that the kernel hands to the
NIC, while tcpdump on the laptop captures the smaller segments into which
the NIC breaks them. This all seems to work OK, although it makes it
difficult to match up packets.
Things start off well enough:
{{{
4:03 IP laptop.60061 > trac.http: Flags [S], seq 833594151, win 29200,
options [mss 1460,sackOK,TS val 17271196 ecr 0,nop,wscale 7], length 0
4:03 IP trac.http > laptop.60061: Flags [S.], seq 1018352583, ack
833594152, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val
4013412119 ecr 17271196], length 0
4:03 IP laptop.60061 > trac.http: Flags [.], ack 1, win 229, options
[nop,nop,TS val 17271280 ecr 4013412119], length 0
4:03 IP laptop.60061 > trac.http: Flags [.], seq 1:1449, ack 1, win 229,
options [nop,nop,TS val 17271280 ecr 4013412119], length 1448
4:03 IP laptop.60061 > trac.http: Flags [.], seq 1449:2897, ack 1, win
229, options [nop,nop,TS val 17271280 ecr 4013412119], length 1448
4:03 IP laptop.60061 > trac.http: Flags [P.], seq 2897:3681, ack 1, win
229, options [nop,nop,TS val 17271282 ecr 4013412119], length 784
4:04 IP trac.http > laptop.60061: Flags [.], ack 2897, win 1018, options
[nop,nop,TS val 4013412205 ecr 17271280], length 0
4:04 IP trac.http > laptop.60061: Flags [.], ack 3681, win 1040, options
[nop,nop,TS val 4013412312 ecr 17271282], length 0
}}}
laptop sends a 3681-byte HTTP request, and track acks receipt.
10 seconds later, and every 10 seconds thereafter, laptop pings and track
pongs:
{{{
4:14 IP laptop.60061 > trac.http: Flags [.], ack 1, win 229, options
[nop,nop,TS val 17281473 ecr 4013412312], length 0
4:14 IP trac.http > laptop.60061: Flags [.], ack 3681, win 1040, options
[nop,nop,TS val 4013422396 ecr 17271282], length 0
4:24 IP laptop.60061 > trac.http: Flags [.], ack 1, win 229, options
[nop,nop,TS val 17291584 ecr 4013422396], length 0
4:24 IP trac.http > laptop.60061: Flags [.], ack 3681, win 1040, options
[nop,nop,TS val 4013432507 ecr 17271282], length 0
4:34 IP laptop.60061 > trac.http: Flags [.], ack 1, win 229, options
[nop,nop,TS val 17301696 ecr 4013432507], length 0
4:34 IP trac.http > laptop.60061: Flags [.], ack 3681, win 1040, options
[nop,nop,TS val 4013442619 ecr 17271282], length 0
4:44 IP laptop.60061 > trac.http: Flags [.], ack 1, win 229, options
[nop,nop,TS val 17311808 ecr 4013442619], length 0
4:44 IP trac.http > laptop.60061: Flags [.], ack 3681, win 1040, options
[nop,nop,TS val 4013452730 ecr 17271282], length 0
4:54 IP laptop.60061 > trac.http: Flags [.], ack 1, win 229, options
[nop,nop,TS val 17321920 ecr 4013452730], length 0
4:54 IP trac.http > laptop.60061: Flags [.], ack 3681, win 1040, options
[nop,nop,TS val 4013462842 ecr 17271282], length 0
5:04 IP laptop.60061 > trac.http: Flags [.], ack 1, win 229, options
[nop,nop,TS val 17332032 ecr 4013462842], length 0
5:04 IP trac.http > laptop.60061: Flags [.], ack 3681, win 1040, options
[nop,nop,TS val 4013472954 ecr 17271282], length 0
}}}
I don't recognize this pattern, but it looks like the hosts do, so OK.
A full 76 seconds after the connection was initiated, trac begins
replying:
{{{
5:19 IP trac.http > laptop.60061: Flags [P.], seq 1:399, ack 3681, win
1040, options [nop,nop,TS val 4013487830 ecr 17271282], length 398
}}}
this goes on for a while, seemingly without any issue.
Finally, after 47419 bytes of response, the laptop sends a FIN:
{{{
5:21 IP trac.http > laptop.60061: Flags [.], seq 45971:47419, ack 4762,
win 1040, options [nop,nop,TS val 4013490058 ecr 17349135], length 1448
5:21 IP laptop.60061 > trac.http: Flags [.], ack 47419, win 984, options
[nop,nop,TS val 17349220 ecr 4013490058], length 0
5:21 IP laptop.60061 > trac.http: Flags [F.], seq 4762, ack 47419, win
984, options [nop,nop,TS val 17349220 ecr 4013490058], length 0
}}}
But trac isn't done yet, and keeps sending data, to which the laptop sends
RST packets:
{{{
5:21 IP trac.http > laptop.60061: Flags [.], seq 47419:48867, ack 4762,
win 1040, options [nop,nop,TS val 4013490059 ecr 17349135], length 1448
5:21 IP laptop.60061 > trac.http: Flags [R], seq 833598913, win 0, length
0
5:21 IP trac.http > laptop.60061: Flags [.], seq 48867:50315, ack 4762,
win 1040, options [nop,nop,TS val 4013490060 ecr 17349136], length 1448
5:21 IP laptop.60061 > trac.http: Flags [R], seq 833598913, win 0, length
0
5:21 IP trac.http > laptop.60061: Flags [.], seq 50315:50351, ack 4762,
win 1040, options [nop,nop,TS val 4013490060 ecr 17349136], length 36
5:21 IP laptop.60061 > trac.http: Flags [R], seq 833598913, win 0, length
0
5:21 IP trac.http > laptop.60061: Flags [.], seq 50351:51799, ack 4762,
win 1040, options [nop,nop,TS val 4013490143 ecr 17349219], length 1448
5:21 IP laptop.60061 > trac.http: Flags [R], seq 833598913, win 0, length
0
5:21 IP trac.http > laptop.60061: Flags [.], seq 51799:53247, ack 4762,
win 1040, options [nop,nop,TS val 4013490144 ecr 17349220], length 1448
5:21 IP laptop.60061 > trac.http: Flags [R], seq 833598913, win 0, length
0
5:21 IP trac.http > laptop.60061: Flags [.], seq 53247:54695, ack 4762,
win 1040, options [nop,nop,TS val 4013490144 ecr 17349220], length 1448
5:21 IP laptop.60061 > trac.http: Flags [R], seq 833598913, win 0, length
0
5:21 IP trac.http > laptop.60061: Flags [.], seq 54695:56143, ack 4763,
win 1040, options [nop,nop,TS val 4013490144 ecr 17349220], length 1448
5:21 IP laptop.60061 > trac.http: Flags [R], seq 833598914, win 0, length
0
}}}
So, two funny things here. First, it took well over a minute for Trac to
respond -- which is probably associated with the failed attempt to send an
email. That's probably the bug we're after here.
Second, trac's sending more data than the browser expects. This might be
due to a missing or incorrect Content-Length header. I'm not going to
worry about that one right now.
On a whim, I disabled TSO on service2's em0 and em1 interfaces. Let's see
if that helps!
--
Ticket URL: <http://trac.buildbot.net/ticket/3183#comment:2>
Buildbot <http://buildbot.net/>
Buildbot: build/test automation
More information about the bugs
mailing list