All of lore.kernel.org
 help / color / mirror / Atom feed
* [BUG] tcp : how many times a frame can possibly be retransmitted ?
@ 2011-08-24 16:21 Eric Dumazet
  2011-08-24 19:03 ` Alexander Zimmermann
  2011-08-24 22:44 ` Ilpo Järvinen
  0 siblings, 2 replies; 22+ messages in thread
From: Eric Dumazet @ 2011-08-24 16:21 UTC (permalink / raw)
  To: netdev, Jerry Chu, Damian Lukowski

On one dev machine running net-next, I just found strange tcp sessions
that retransmit a frame forever (The other peer disappeared)

# ss -emoi dst 10.2.1.1
State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
ESTAB      0      816              10.2.1.2:37930             10.2.1.1:ssh      timer:(on,630ms,246) ino:60786 sk:ffff8801189aa400
	 mem:(r0,w3776,f320,t0) ts sack ecn cubic wscale:8,6 rto:1680 rtt:16.25/7.5 ato:40 ssthresh:7 send 1.4Mbps rcv_rtt:10 rcv_space:16632


You can see the retransmit count : 246 

What possibly can be going on ?

What happened to backoff ?

# grep . /proc/sys/net/ipv4/tcp_retries*
/proc/sys/net/ipv4/tcp_retries1:3
/proc/sys/net/ipv4/tcp_retries2:15
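(tcp_retries2 is supposed to cap retransmissions of an established
connection at 15 attempts - roughly 900+ seconds with full exponential
backoff - so a retransmit counter of 246 should not be possible.)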



extract of tcpdump :

12:01:02.074244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128024 59389>
12:01:03.754243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128192 59389>
12:01:05.434245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128360 59389>
12:01:07.114243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128528 59389>
12:01:08.794248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128696 59389>
12:01:10.474242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128864 59389>
12:01:12.154243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129032 59389>
12:01:13.834241 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129200 59389>
12:01:15.514246 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129368 59389>
12:01:17.194244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129536 59389>
12:01:18.874248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129704 59389>
12:01:20.554243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129872 59389>
12:01:22.234244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130040 59389>
12:01:23.914244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130208 59389>
12:01:25.594247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130376 59389>
12:01:27.274242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130544 59389>
12:01:28.954242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130712 59389>
12:01:30.634248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130880 59389>
12:01:32.314245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131048 59389>
12:01:33.994243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131216 59389>
12:01:35.674250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131384 59389>
12:01:37.354244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131552 59389>
12:01:39.034245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131720 59389>
12:01:40.714245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131888 59389>
12:01:42.394245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132056 59389>
12:01:44.074242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132224 59389>
12:01:45.754249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132392 59389>
12:01:47.434242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132560 59389>
12:01:49.114247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132728 59389>
12:01:50.794250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132896 59389>
12:01:52.474247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133064 59389>
12:01:54.154242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133232 59389>
12:01:55.834246 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133400 59389>
12:01:57.514243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133568 59389>
12:01:59.194247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133736 59389>
12:02:00.874250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133904 59389>
12:02:02.554242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134072 59389>
12:02:04.234243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134240 59389>
12:02:05.914245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134408 59389>
12:02:07.594244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134576 59389>
12:02:09.274249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134744 59389>
12:02:10.954241 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134912 59389>
12:02:12.634249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16135080 59389>

tcp_retransmit_timer() does the exponential backoff, but something
resets icsk_rto to a low value ?

Ah, it seems to be because of commit f1ecd5d9e7366609 
(Revert Backoff [v3]: Revert RTO on ICMP destination unreachable)

Since arp resolution (or routing, I don't know yet) fails, an
internal/loopback ICMP host/network unreachable message is
generated and handled in tcp_v4_err() :

icsk_backoff-- and icsk_rto is reset.

I am afraid this can generate a storm (cpu time at the very least)
if we have many tcp sessions in this state.

I guess it's time for me to read RFC 6069

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [BUG] tcp : how many times a frame can possibly be retransmitted ?
  2011-08-24 16:21 [BUG] tcp : how many times a frame can possibly be retransmitted ? Eric Dumazet
@ 2011-08-24 19:03 ` Alexander Zimmermann
  2011-08-24 19:39   ` Jerry Chu
  2011-08-24 19:45   ` Eric Dumazet
  2011-08-24 22:44 ` Ilpo Järvinen
  1 sibling, 2 replies; 22+ messages in thread
From: Alexander Zimmermann @ 2011-08-24 19:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Jerry Chu, Lukowski Damian, Hannemann Arnd

Hi Eric,

On 24.08.2011 at 18:21, Eric Dumazet wrote:

> On one dev machine running net-next, I just found strange tcp sessions
> that retransmit a frame forever (The other peer disappeared)

Not forever...
If I remember correctly, it will stop after 120s.

> 
> # ss -emoi dst 10.2.1.1
> State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
> ESTAB      0      816              10.2.1.2:37930             10.2.1.1:ssh      timer:(on,630ms,246) ino:60786 sk:ffff8801189aa400
> 	 mem:(r0,w3776,f320,t0) ts sack ecn cubic wscale:8,6 rto:1680 rtt:16.25/7.5 ato:40 ssthresh:7 send 1.4Mbps rcv_rtt:10 rcv_space:16632
> 
> 
> You can see the retransmit count : 246 
> 
> What possibly can be going on ?
> 
> What happened to backoff ?
> 
> # grep . /proc/sys/net/ipv4/tcp_retries*
> /proc/sys/net/ipv4/tcp_retries1:3
> /proc/sys/net/ipv4/tcp_retries2:15
> 
> 
> 
> extract of tcpdump :
> 
> 12:01:02.074244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128024 59389>
> 12:01:03.754243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128192 59389>
> 12:01:05.434245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128360 59389>
> 12:01:07.114243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128528 59389>
> 12:01:08.794248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128696 59389>
> 12:01:10.474242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128864 59389>
> 12:01:12.154243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129032 59389>
> 12:01:13.834241 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129200 59389>
> 12:01:15.514246 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129368 59389>
> 12:01:17.194244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129536 59389>
> 12:01:18.874248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129704 59389>
> 12:01:20.554243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129872 59389>
> 12:01:22.234244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130040 59389>
> 12:01:23.914244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130208 59389>
> 12:01:25.594247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130376 59389>
> 12:01:27.274242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130544 59389>
> 12:01:28.954242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130712 59389>
> 12:01:30.634248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130880 59389>
> 12:01:32.314245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131048 59389>
> 12:01:33.994243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131216 59389>
> 12:01:35.674250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131384 59389>
> 12:01:37.354244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131552 59389>
> 12:01:39.034245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131720 59389>
> 12:01:40.714245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131888 59389>
> 12:01:42.394245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132056 59389>
> 12:01:44.074242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132224 59389>
> 12:01:45.754249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132392 59389>
> 12:01:47.434242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132560 59389>
> 12:01:49.114247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132728 59389>
> 12:01:50.794250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132896 59389>
> 12:01:52.474247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133064 59389>
> 12:01:54.154242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133232 59389>
> 12:01:55.834246 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133400 59389>
> 12:01:57.514243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133568 59389>
> 12:01:59.194247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133736 59389>
> 12:02:00.874250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133904 59389>
> 12:02:02.554242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134072 59389>
> 12:02:04.234243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134240 59389>
> 12:02:05.914245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134408 59389>
> 12:02:07.594244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134576 59389>
> 12:02:09.274249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134744 59389>
> 12:02:10.954241 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134912 59389>
> 12:02:12.634249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16135080 59389>
> 
> tcp_retransmit_timer() does the exponential backoff, but something
> resets icsk_rto to a low value ?
> 
> Ah, it seems to be because of commit f1ecd5d9e7366609 
> (Revert Backoff [v3]: Revert RTO on ICMP destination unreachable)
> 
> Since arp resolution (or routing, I don't know yet) fails, an
> internal/loopback ICMP host/network unreachable message is
> generated and handled in tcp_v4_err() :

Yeah, you have a local connectivity disruption. This is one
possible scenario.

> 
> icsk_backoff-- and icsk_rto is reset.
> 
> I am afraid this can generate a storm (cpu time at the very least)
> if we have many tcp sessions in this state.

Hmm, maybe. I don't know. Arnd or Damian, what do you think about this point?

> 
> I guess it's time for me to read RFC 6069

If you find a bug, let me know.

Alex


//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: zimmermann@cs.rwth-aachen.de
// web: http://www.umic-mesh.net
//

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [BUG] tcp : how many times a frame can possibly be retransmitted ?
  2011-08-24 19:03 ` Alexander Zimmermann
@ 2011-08-24 19:39   ` Jerry Chu
  2011-08-24 19:45   ` Eric Dumazet
  1 sibling, 0 replies; 22+ messages in thread
From: Jerry Chu @ 2011-08-24 19:39 UTC (permalink / raw)
  To: Alexander Zimmermann
  Cc: Eric Dumazet, netdev, Lukowski Damian, Hannemann Arnd

Hi Alexander,

On Wed, Aug 24, 2011 at 12:03 PM, Alexander Zimmermann
<alexander.zimmermann@comsys.rwth-aachen.de> wrote:
> Hi Eric,
>
> On 24.08.2011 at 18:21, Eric Dumazet wrote:
>
>> On one dev machine running net-next, I just found strange tcp sessions
>> that retransmit a frame forever (The other peer disappeared)
>
> Not forever...
> If I remember correctly, it will stop after 120s.

Yup. It looks like this "feature" was introduced by Damian as well, in
the patch "Revert Backoff [v3]: Calculate TCP's connection close
threshold as a time value", to bound the abort timeout by a time
duration rather than by a number of retries (icsk_retransmits). But as
pointed out, if the rto is small it can mean a lot of retransmissions
before one gives up.

Jerry

>
>>
>> # ss -emoi dst 10.2.1.1
>> State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port
>> ESTAB      0      816              10.2.1.2:37930             10.2.1.1:ssh      timer:(on,630ms,246) ino:60786 sk:ffff8801189aa400
>>        mem:(r0,w3776,f320,t0) ts sack ecn cubic wscale:8,6 rto:1680 rtt:16.25/7.5 ato:40 ssthresh:7 send 1.4Mbps rcv_rtt:10 rcv_space:16632
>>
>>
>> You can see the retransmit count : 246
>>
>> What possibly can be going on ?
>>
>> What happened to backoff ?
>>
>> # grep . /proc/sys/net/ipv4/tcp_retries*
>> /proc/sys/net/ipv4/tcp_retries1:3
>> /proc/sys/net/ipv4/tcp_retries2:15
>>
>>
>>
>> extract of tcpdump :
>>
>> 12:01:02.074244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128024 59389>
>> 12:01:03.754243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128192 59389>
>> 12:01:05.434245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128360 59389>
>> 12:01:07.114243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128528 59389>
>> 12:01:08.794248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128696 59389>
>> 12:01:10.474242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128864 59389>
>> 12:01:12.154243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129032 59389>
>> 12:01:13.834241 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129200 59389>
>> 12:01:15.514246 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129368 59389>
>> 12:01:17.194244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129536 59389>
>> 12:01:18.874248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129704 59389>
>> 12:01:20.554243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129872 59389>
>> 12:01:22.234244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130040 59389>
>> 12:01:23.914244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130208 59389>
>> 12:01:25.594247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130376 59389>
>> 12:01:27.274242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130544 59389>
>> 12:01:28.954242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130712 59389>
>> 12:01:30.634248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130880 59389>
>> 12:01:32.314245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131048 59389>
>> 12:01:33.994243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131216 59389>
>> 12:01:35.674250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131384 59389>
>> 12:01:37.354244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131552 59389>
>> 12:01:39.034245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131720 59389>
>> 12:01:40.714245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131888 59389>
>> 12:01:42.394245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132056 59389>
>> 12:01:44.074242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132224 59389>
>> 12:01:45.754249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132392 59389>
>> 12:01:47.434242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132560 59389>
>> 12:01:49.114247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132728 59389>
>> 12:01:50.794250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132896 59389>
>> 12:01:52.474247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133064 59389>
>> 12:01:54.154242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133232 59389>
>> 12:01:55.834246 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133400 59389>
>> 12:01:57.514243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133568 59389>
>> 12:01:59.194247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133736 59389>
>> 12:02:00.874250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133904 59389>
>> 12:02:02.554242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134072 59389>
>> 12:02:04.234243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134240 59389>
>> 12:02:05.914245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134408 59389>
>> 12:02:07.594244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134576 59389>
>> 12:02:09.274249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134744 59389>
>> 12:02:10.954241 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134912 59389>
>> 12:02:12.634249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16135080 59389>
>>
>> tcp_retransmit_timer() does the exponential backoff, but something
>> resets icsk_rto to a low value ?
>>
>> Ah, it seems to be because of commit f1ecd5d9e7366609
>> (Revert Backoff [v3]: Revert RTO on ICMP destination unreachable)
>>
>> Since arp resolution (or routing, I don't know yet) fails, an
>> internal/loopback ICMP host/network unreachable message is
>> generated and handled in tcp_v4_err() :
>
> Yeah, you have a local connectivity disruption. This is one
> possible scenario.
>
>>
>> icsk_backoff-- and icsk_rto is reset.
>>
>> I am afraid this can generate a storm (cpu time at the very least)
>> if we have many tcp sessions in this state.
>
> Hmm, maybe. I don't know. Arnd or Damian, what do you think about this point?
>
>>
>> I guess it's time for me to read RFC 6069
>
> If you find a bug, let me know.
>
> Alex
>
>
> //
> // Dipl.-Inform. Alexander Zimmermann
> // Department of Computer Science, Informatik 4
> // RWTH Aachen University
> // Ahornstr. 55, 52056 Aachen, Germany
> // phone: (49-241) 80-21422, fax: (49-241) 80-22222
> // email: zimmermann@cs.rwth-aachen.de
> // web: http://www.umic-mesh.net
> //
>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [BUG] tcp : how many times a frame can possibly be retransmitted ?
  2011-08-24 19:03 ` Alexander Zimmermann
  2011-08-24 19:39   ` Jerry Chu
@ 2011-08-24 19:45   ` Eric Dumazet
  1 sibling, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2011-08-24 19:45 UTC (permalink / raw)
  To: Alexander Zimmermann; +Cc: netdev, Jerry Chu, Lukowski Damian, Hannemann Arnd

On Wednesday, 24 August 2011 at 21:03 +0200, Alexander Zimmermann wrote:
> Hi Eric,
> 
> On 24.08.2011 at 18:21, Eric Dumazet wrote:
> 
> > On one dev machine running net-next, I just found strange tcp sessions
> > that retransmit a frame forever (The other peer disappeared)
> 
> Not forever...
> If I remember correctly, it will stop after 120s.
> 

Hi Alexander

I just tried one session again, and got a much longer delay than that.

It stops because of a side effect: "icsk_retransmits" is an 8-bit
field.

Every 256 retransmits, it wraps from 255+1 to 0.

retransmits_timed_out() then immediately returns false.

And the backoff increases at this point.

Eventually, we retransmit 256*15 times and process 256*15 ICMP
messages.
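
For reference, the timeout check of that era looked roughly like this
(paraphrased from net/ipv4/tcp_timer.c; details may differ slightly):

static inline bool retransmits_timed_out(struct sock *sk,
					 unsigned int boundary)
{
	unsigned int timeout, linear_backoff_thresh, start_ts;

	/* icsk_retransmits is a u8, so after the 255th retransmission
	 * it wraps back to 0 and this early return fires again,
	 * effectively restarting the abort countdown.
	 */
	if (!inet_csk(sk)->icsk_retransmits)
		return false;

	start_ts = tcp_sk(sk)->retrans_stamp;

	/* boundary (tcp_retries2) is converted to a time value,
	 * assuming exponential backoff from TCP_RTO_MIN capped at
	 * TCP_RTO_MAX */
	linear_backoff_thresh = ilog2(TCP_RTO_MAX / TCP_RTO_MIN);
	if (boundary <= linear_backoff_thresh)
		timeout = ((2 << boundary) - 1) * TCP_RTO_MIN;
	else
		timeout = ((2 << linear_backoff_thresh) - 1) * TCP_RTO_MIN +
			  (boundary - linear_backoff_thresh) * TCP_RTO_MAX;

	return (tcp_time_stamp - start_ts) >= timeout;
}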

Thanks

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [BUG] tcp : how many times a frame can possibly be retransmitted ?
  2011-08-24 16:21 [BUG] tcp : how many times a frame can possibly be retransmitted ? Eric Dumazet
  2011-08-24 19:03 ` Alexander Zimmermann
@ 2011-08-24 22:44 ` Ilpo Järvinen
  2011-08-24 23:00   ` Eric Dumazet
  1 sibling, 1 reply; 22+ messages in thread
From: Ilpo Järvinen @ 2011-08-24 22:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Jerry Chu, Damian Lukowski

On Wed, 24 Aug 2011, Eric Dumazet wrote:

> On one dev machine running net-next, I just found strange tcp sessions
> that retransmit a frame forever (The other peer disappeared)
> 
> # ss -emoi dst 10.2.1.1
> State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
> ESTAB      0      816              10.2.1.2:37930             10.2.1.1:ssh      timer:(on,630ms,246) ino:60786 sk:ffff8801189aa400
> 	 mem:(r0,w3776,f320,t0) ts sack ecn cubic wscale:8,6 rto:1680 rtt:16.25/7.5 ato:40 ssthresh:7 send 1.4Mbps rcv_rtt:10 rcv_space:16632
> 
> 
> You can see the retransmit count : 246 
> 
> What possibly can be going on ?
> 
> What happened to backoff ?
> 
> # grep . /proc/sys/net/ipv4/tcp_retries*
> /proc/sys/net/ipv4/tcp_retries1:3
> /proc/sys/net/ipv4/tcp_retries2:15
> 
> 
> 
> extract of tcpdump :
> 
> 12:01:02.074244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128024 59389>
> 12:01:03.754243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128192 59389>
> 12:01:05.434245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128360 59389>
> 12:01:07.114243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128528 59389>
> 12:01:08.794248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128696 59389>
> 12:01:10.474242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128864 59389>
> 12:01:12.154243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129032 59389>
> 12:01:13.834241 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129200 59389>
> 12:01:15.514246 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129368 59389>
> 12:01:17.194244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129536 59389>
> 12:01:18.874248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129704 59389>
> 12:01:20.554243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129872 59389>
> 12:01:22.234244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130040 59389>
> 12:01:23.914244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130208 59389>
> 12:01:25.594247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130376 59389>
> 12:01:27.274242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130544 59389>
> 12:01:28.954242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130712 59389>
> 12:01:30.634248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130880 59389>
> 12:01:32.314245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131048 59389>
> 12:01:33.994243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131216 59389>
> 12:01:35.674250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131384 59389>
> 12:01:37.354244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131552 59389>
> 12:01:39.034245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131720 59389>
> 12:01:40.714245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131888 59389>
> 12:01:42.394245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132056 59389>
> 12:01:44.074242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132224 59389>
> 12:01:45.754249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132392 59389>
> 12:01:47.434242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132560 59389>
> 12:01:49.114247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132728 59389>
> 12:01:50.794250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132896 59389>
> 12:01:52.474247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133064 59389>
> 12:01:54.154242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133232 59389>
> 12:01:55.834246 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133400 59389>
> 12:01:57.514243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133568 59389>
> 12:01:59.194247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133736 59389>
> 12:02:00.874250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133904 59389>
> 12:02:02.554242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134072 59389>
> 12:02:04.234243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134240 59389>
> 12:02:05.914245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134408 59389>
> 12:02:07.594244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134576 59389>
> 12:02:09.274249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134744 59389>
> 12:02:10.954241 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134912 59389>
> 12:02:12.634249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16135080 59389>
> 
> tcp_retransmit_timer() does the exponential backoff, but something
> resets icsk_rto to a low value ?
> 
> Ah, it seems to be because of commit f1ecd5d9e7366609 
> (Revert Backoff [v3]: Revert RTO on ICMP destination unreachable)
> 
> Since arp resolution (or routing, I don't know yet) fails, an
> internal/loopback ICMP host/network unreachable message is
> generated and handled in tcp_v4_err() :
> 
> icsk_backoff-- and icsk_rto is reset.
> 
> I am afraid this can generate a storm (cpu time at the very least)
> if we have many tcp sessions in this state.

But the RTO (even without any backoff) should be lower-bounded to some
value that is not close to zero?

> I guess it's time for me to read RFC 6069

-- 
 i.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [BUG] tcp : how many times a frame can possibly be retransmitted ?
  2011-08-24 22:44 ` Ilpo Järvinen
@ 2011-08-24 23:00   ` Eric Dumazet
  2011-08-24 23:41     ` [PATCH] tcp: bound RTO to minimum Hagen Paul Pfeifer
  2011-08-25  8:56     ` [BUG] tcp : how many times a frame can possibly be retransmitted ? Ilpo Järvinen
  0 siblings, 2 replies; 22+ messages in thread
From: Eric Dumazet @ 2011-08-24 23:00 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: netdev, Jerry Chu, Damian Lukowski

On Thursday, 25 August 2011 at 01:44 +0300, Ilpo Järvinen wrote:
> On Wed, 24 Aug 2011, Eric Dumazet wrote:
> 
> > On one dev machine running net-next, I just found strange tcp sessions
> > that retransmit a frame forever (The other peer disappeared)
> > 
> > # ss -emoi dst 10.2.1.1
> > State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
> > ESTAB      0      816              10.2.1.2:37930             10.2.1.1:ssh      timer:(on,630ms,246) ino:60786 sk:ffff8801189aa400
> > 	 mem:(r0,w3776,f320,t0) ts sack ecn cubic wscale:8,6 rto:1680 rtt:16.25/7.5 ato:40 ssthresh:7 send 1.4Mbps rcv_rtt:10 rcv_space:16632
> > 
> > 
> > You can see the retransmit count : 246 
> > 
> > What possibly can be going on ?
> > 
> > What happened to backoff ?
> > 
> > # grep . /proc/sys/net/ipv4/tcp_retries*
> > /proc/sys/net/ipv4/tcp_retries1:3
> > /proc/sys/net/ipv4/tcp_retries2:15
> > 
> > 
> > 
> > extract of tcpdump :
> > 
> > 12:01:02.074244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128024 59389>
> > 12:01:03.754243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128192 59389>
> > 12:01:05.434245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128360 59389>
> > 12:01:07.114243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128528 59389>
> > 12:01:08.794248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128696 59389>
> > 12:01:10.474242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128864 59389>
> > 12:01:12.154243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129032 59389>
> > 12:01:13.834241 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129200 59389>
> > 12:01:15.514246 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129368 59389>
> > 12:01:17.194244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129536 59389>
> > 12:01:18.874248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129704 59389>
> > 12:01:20.554243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129872 59389>
> > 12:01:22.234244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130040 59389>
> > 12:01:23.914244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130208 59389>
> > 12:01:25.594247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130376 59389>
> > 12:01:27.274242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130544 59389>
> > 12:01:28.954242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130712 59389>
> > 12:01:30.634248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130880 59389>
> > 12:01:32.314245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131048 59389>
> > 12:01:33.994243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131216 59389>
> > 12:01:35.674250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131384 59389>
> > 12:01:37.354244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131552 59389>
> > 12:01:39.034245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131720 59389>
> > 12:01:40.714245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131888 59389>
> > 12:01:42.394245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132056 59389>
> > 12:01:44.074242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132224 59389>
> > 12:01:45.754249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132392 59389>
> > 12:01:47.434242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132560 59389>
> > 12:01:49.114247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132728 59389>
> > 12:01:50.794250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132896 59389>
> > 12:01:52.474247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133064 59389>
> > 12:01:54.154242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133232 59389>
> > 12:01:55.834246 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133400 59389>
> > 12:01:57.514243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133568 59389>
> > 12:01:59.194247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133736 59389>
> > 12:02:00.874250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133904 59389>
> > 12:02:02.554242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134072 59389>
> > 12:02:04.234243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134240 59389>
> > 12:02:05.914245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134408 59389>
> > 12:02:07.594244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134576 59389>
> > 12:02:09.274249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134744 59389>
> > 12:02:10.954241 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134912 59389>
> > 12:02:12.634249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16135080 59389>
> > 
> > tcp_retransmit_timer() does the exponential backoff, but something
> > resets icsk_rto to a low value ?
> > 
> > Ah, it seems to be because of commit f1ecd5d9e7366609 
> > (Revert Backoff [v3]: Revert RTO on ICMP destination unreachable)
> > 
> > Since arp resolution (or routing, I don't know yet) fails, an
> > internal/loopback ICMP host/network unreachable message is
> > generated and handled in tcp_v4_err() :
> > 
> > icsk_backoff-- and icsk_rto is reset.
> > 
> > I am afraid this can generate a storm (cpu time at the very least)
> > if we have many tcp sessions in this state.
> 
> But the RTO (even without any backoff) should be lower-bounded to some
> value that is not close to zero?

Apparently not.

The only thing that protects us from a flood is that ip_error() uses
the inetpeer cache to rate-limit icmp_send(ICMP_DEST_UNREACH).

This is why we get a retransmit period >= 1 sec

vi +432 net/ipv4/tcp_ipv4.c

                icsk->icsk_backoff--;
                inet_csk(sk)->icsk_rto = (tp->srtt ? __tcp_set_rto(tp) :
                        TCP_TIMEOUT_INIT) << icsk->icsk_backoff;
                tcp_bound_rto(sk);

and __tcp_set_rto() uses : return (tp->srtt >> 3) + tp->rttvar;
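
Rough arithmetic against the ss output above (my reading of the
fields, not verified against the socket state): srtt/8 is ~16ms and
tp->rttvar is floored near the 200ms minimum, so the recomputed base
rto is ~210ms; with a residual icsk_backoff of 3 that gives
210ms << 3 = 1680ms, matching the rto:1680 that ss reports.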

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] tcp: bound RTO to minimum
  2011-08-24 23:00   ` Eric Dumazet
@ 2011-08-24 23:41     ` Hagen Paul Pfeifer
  2011-08-24 23:43       ` Hagen Paul Pfeifer
  2011-08-25  1:50       ` Yuchung Cheng
  2011-08-25  8:56     ` [BUG] tcp : how many times a frame can possibly be retransmitted ? Ilpo Järvinen
  1 sibling, 2 replies; 22+ messages in thread
From: Hagen Paul Pfeifer @ 2011-08-24 23:41 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet, Hagen Paul Pfeifer

Check if the calculated RTO is less than TCP_RTO_MIN. If so, clamp
the value to TCP_RTO_MIN.

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
---
 include/net/tcp.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 149a415..9b5f4bf 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -520,6 +520,8 @@ static inline void tcp_bound_rto(const struct sock *sk)
 {
 	if (inet_csk(sk)->icsk_rto > TCP_RTO_MAX)
 		inet_csk(sk)->icsk_rto = TCP_RTO_MAX;
+	else if (inet_csk(sk)->icsk_rto < TCP_RTO_MIN)
+		inet_csk(sk)->icsk_rto = TCP_RTO_MIN;
 }
 
 static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
-- 
1.7.4.1.57.g0466.dirty

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] tcp: bound RTO to minimum
  2011-08-24 23:41     ` [PATCH] tcp: bound RTO to minimum Hagen Paul Pfeifer
@ 2011-08-24 23:43       ` Hagen Paul Pfeifer
  2011-08-25  1:50       ` Yuchung Cheng
  1 sibling, 0 replies; 22+ messages in thread
From: Hagen Paul Pfeifer @ 2011-08-24 23:43 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet, Ilpo Järvinen

This should do the trick, Eric, Ilpo?

Hagen

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tcp: bound RTO to minimum
  2011-08-24 23:41     ` [PATCH] tcp: bound RTO to minimum Hagen Paul Pfeifer
  2011-08-24 23:43       ` Hagen Paul Pfeifer
@ 2011-08-25  1:50       ` Yuchung Cheng
  2011-08-25  5:28         ` Eric Dumazet
  1 sibling, 1 reply; 22+ messages in thread
From: Yuchung Cheng @ 2011-08-25  1:50 UTC (permalink / raw)
  To: Hagen Paul Pfeifer; +Cc: netdev, eric.dumazet

On Wed, Aug 24, 2011 at 4:41 PM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
> Check if the calculated RTO is less than TCP_RTO_MIN. If so, clamp
> the value to TCP_RTO_MIN.
>
but tp->rttvar is already lower-bounded via tcp_rto_min()?

static inline void tcp_set_rto(struct sock *sk)
{
...

  /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
   * guarantees that rto is higher.
   */
  tcp_bound_rto(sk);
}
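
For reference, tcp_rto_min() at the time looked roughly like this
(from memory of net/ipv4/tcp_input.c, so line-for-line details may
differ); tcp_rtt_estimator() uses it to floor tp->rttvar:

static inline u32 tcp_rto_min(struct sock *sk)
{
	struct dst_entry *dst = __sk_dst_get(sk);
	u32 rto_min = TCP_RTO_MIN;	/* HZ/5, i.e. 200ms */

	/* a per-route rto_min metric may override the default */
	if (dst && dst_metric_locked(dst, RTAX_RTO_MIN))
		rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN);
	return rto_min;
}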

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tcp: bound RTO to minimum
  2011-08-25  1:50       ` Yuchung Cheng
@ 2011-08-25  5:28         ` Eric Dumazet
  2011-08-25  7:28           ` Alexander Zimmermann
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2011-08-25  5:28 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: Hagen Paul Pfeifer, netdev

On Wednesday, 24 August 2011 at 18:50 -0700, Yuchung Cheng wrote:
> On Wed, Aug 24, 2011 at 4:41 PM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
> > Check if the calculated RTO is less than TCP_RTO_MIN. If so, clamp
> > the value to TCP_RTO_MIN.
> >
> but tp->rttvar is already lower-bounded via tcp_rto_min()?
> 
> static inline void tcp_set_rto(struct sock *sk)
> {
> ...
> 
>   /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
>    * guarantees that rto is higher.
>    */
>   tcp_bound_rto(sk);
> }

Yes, and furthermore, we also rate-limit ICMP, so in my tests I reach
icsk_rto > 1 sec within a few rounds

07:16:13.010633 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 3833540215:3833540263(48) ack 2593537670 win 305
07:16:13.221111 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:13.661151 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:14.541153 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:16.301152 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
<from this point, icsk_rto=1.76sec >
07:16:18.061158 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:19.821158 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:21.581018 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:23.341156 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:25.101151 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:26.861155 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:28.621158 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:30.381152 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:32.141157 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 

The real question is: do we really want to process ~1000 timer
interrupts per tcp session, ~2000 skb alloc/free/build/handling
operations, and possibly ~1000 ARP requests, only to make tcp recover
in ~1 sec when connectivity comes back? This just doesn't scale.

On a server handling ~1.000.000 (long living) sessions, using
application side keepalives (say one message sent every minute on each
session), a temporary connectivity disruption _could_ makes it enter a
critical zone, burning cpu and memory.
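
Back-of-envelope, assuming no ICMP rate limiting at all: 1,000,000
stuck sessions retransmitting every ~200ms would be ~5,000,000 timer
firings and skb allocations per second, for the whole duration of the
outage.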

It seems TCP-LCD (RFC6069) depends very much on ICMP being rate limited.

I'll have to check what happens with multiple sessions: we might have
cpus fighting on a single inetpeer and throttling, thus allowing the
backoff to increase after all.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tcp: bound RTO to minimum
  2011-08-25  5:28         ` Eric Dumazet
@ 2011-08-25  7:28           ` Alexander Zimmermann
  2011-08-25  8:26             ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Alexander Zimmermann @ 2011-08-25  7:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Yuchung Cheng, Hagen Paul Pfeifer, netdev, Hannemann Arnd,
	Lukowski Damian

Hi Eric,

On 25.08.2011 at 07:28, Eric Dumazet wrote:

> On Wednesday, 24 August 2011 at 18:50 -0700, Yuchung Cheng wrote:
>> On Wed, Aug 24, 2011 at 4:41 PM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
>>> Check if the calculated RTO is less than TCP_RTO_MIN. If so, clamp
>>> the value to TCP_RTO_MIN.
>>> 
>> but tp->rttvar is already lower-bounded via tcp_rto_min()?
>> 
>> static inline void tcp_set_rto(struct sock *sk)
>> {
>> ...
>> 
>>  /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
>>   * guarantees that rto is higher.
>>   */
>>  tcp_bound_rto(sk);
>> }
> 
> Yes, and furthermore, we also rate-limit ICMP, so in my tests I reach
> icsk_rto > 1 sec within a few rounds
> 
> 07:16:13.010633 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 3833540215:3833540263(48) ack 2593537670 win 305
> 07:16:13.221111 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:13.661151 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:14.541153 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:16.301152 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> <from this point, icsk_rto=1.76sec >
> 07:16:18.061158 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:19.821158 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:21.581018 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:23.341156 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:25.101151 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:26.861155 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:28.621158 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:30.381152 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:32.141157 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 
> The real question is: do we really want to process ~1000 timer
> interrupts per tcp session, ~2000 skb alloc/free/build/handling
> operations, and possibly ~1000 ARP requests, only to make tcp recover
> in ~1 sec when connectivity comes back? This just doesn't scale.

Maybe a stupid question, but 1000? With a minRTO of 200ms and a maximum
probing time of 120s, we get 600 retransmits in a worst-case scenario
(assuming we get an ICMP for every RTO retransmission). No?

> 
> On a server handling ~1,000,000 (long-living) sessions, using
> application-side keepalives (say one message sent every minute on each
> session), a temporary connectivity disruption _could_ make it enter a
> critical zone, burning cpu and memory.
> 
> It seems TCP-LCD (RFC6069) depends very much on ICMP being rate limited.

This is right. We assume that servers/routers send ICMPs only when
they have free cycles.

> 
> I'll have to check what happens with multiple sessions: we might have
> cpus fighting on a single inetpeer and throttling, thus allowing the
> backoff to increase after all.

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: zimmermann@cs.rwth-aachen.de
// web: http://www.umic-mesh.net
//

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tcp: bound RTO to minimum
  2011-08-25  7:28           ` Alexander Zimmermann
@ 2011-08-25  8:26             ` Eric Dumazet
  2011-08-25  8:44               ` Alexander Zimmermann
  2011-08-25  8:46               ` Arnd Hannemann
  0 siblings, 2 replies; 22+ messages in thread
From: Eric Dumazet @ 2011-08-25  8:26 UTC (permalink / raw)
  To: Alexander Zimmermann
  Cc: Yuchung Cheng, Hagen Paul Pfeifer, netdev, Hannemann Arnd,
	Lukowski Damian

On Thursday, 25 August 2011 at 09:28 +0200, Alexander Zimmermann wrote:
> Hi Eric,
> 
> On 25.08.2011 at 07:28, Eric Dumazet wrote:

> > The real question is: do we really want to process ~1000 timer
> > interrupts per tcp session, ~2000 skb alloc/free/build/handling
> > operations, and possibly ~1000 ARP requests, only to make tcp recover
> > in ~1 sec when connectivity comes back? This just doesn't scale.
> 
>> Maybe a stupid question, but 1000? With a minRTO of 200ms and a maximum
>> probing time of 120s, we get 600 retransmits in a worst-case scenario
>> (assuming we get an ICMP for every RTO retransmission). No?

Where is the "max probing time of 120s" asserted?

It is not the case on my machine:
I see far more retransmits than that, even spaced 1600 ms apart

07:16:13.389331 write(3, "\350F\235JC\357\376\363&\3\374\270R\21L\26\324{\37p\342\244i\304\356\241I:\301\332\222\26"..., 48) = 48
07:16:13.389417 select(7, [3 4], [], NULL, NULL) = 1 (in [3])
07:31:39.901311 read(3, 0xff8c4c90, 8192) = -1 EHOSTUNREACH (No route to host)
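
(Note the timestamps: the read fails with EHOSTUNREACH about 926
seconds after the write, so the overall time bound more or less held,
but at ~1.76s per retry that is on the order of 500 retransmissions
instead of 15 - rough arithmetic on the trace above, nothing more.)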

Old kernels performed up to 15 retries, doing exponential backoff.

Now it's kind of unlimited, according to experimental results.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tcp: bound RTO to minimum
  2011-08-25  8:26             ` Eric Dumazet
@ 2011-08-25  8:44               ` Alexander Zimmermann
  2011-08-25  8:46               ` Arnd Hannemann
  1 sibling, 0 replies; 22+ messages in thread
From: Alexander Zimmermann @ 2011-08-25  8:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Yuchung Cheng, Hagen Paul Pfeifer, netdev, Hannemann Arnd,
	Lukowski Damian


On 25.08.2011 at 10:26, Eric Dumazet wrote:

> On Thursday, 25 August 2011 at 09:28 +0200, Alexander Zimmermann wrote:
>> Hi Eric,
>> 
>> On 25.08.2011 at 07:28, Eric Dumazet wrote:
> 
>>> The real question is: do we really want to process ~1000 timer
>>> interrupts per tcp session, ~2000 skb alloc/free/build/handling
>>> operations, and possibly ~1000 ARP requests, only to make tcp recover
>>> in ~1 sec when connectivity comes back? This just doesn't scale.
>> 
>> Maybe a stupid question, but 1000? With a minRTO of 200ms and a maximum
>> probing time of 120s, we get 600 retransmits in a worst-case scenario
>> (assuming we get an ICMP for every RTO retransmission). No?
> 
> Where is the "max probing time of 120s" asserted?
> 
> It is not the case on my machine:
> I see far more retransmits than that, even spaced 1600 ms apart
> 
> 07:16:13.389331 write(3, "\350F\235JC\357\376\363&\3\374\270R\21L\26\324{\37p\342\244i\304\356\241I:\301\332\222\26"..., 48) = 48
> 07:16:13.389417 select(7, [3 4], [], NULL, NULL) = 1 (in [3])
> 07:31:39.901311 read(3, 0xff8c4c90, 8192) = -1 EHOSTUNREACH (No route to host)
> 
> Old kernels performed up to 15 retries, doing exponential backoff.

Yes, I know. And in combination with RFC 6069 we had to convert this;
see Section 7.1

and

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6fa12c85031485dff38ce550c24f10da23b0adaa

Is the transformation broken? Damian?


> 
> Now it's kind of unlimited, according to experimental results.

Ok, unlimited is not what I expect...


> 
> 
> 

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: zimmermann@cs.rwth-aachen.de
// web: http://www.umic-mesh.net
//

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tcp: bound RTO to minimum
  2011-08-25  8:26             ` Eric Dumazet
  2011-08-25  8:44               ` Alexander Zimmermann
@ 2011-08-25  8:46               ` Arnd Hannemann
  2011-08-25  9:09                 ` Eric Dumazet
  1 sibling, 1 reply; 22+ messages in thread
From: Arnd Hannemann @ 2011-08-25  8:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Zimmermann, Yuchung Cheng, Hagen Paul Pfeifer, netdev,
	Lukowski Damian

Hi,

On 25.08.2011 10:26, Eric Dumazet wrote:
> On Thursday, 25 August 2011 at 09:28 +0200, Alexander Zimmermann wrote:
>> Hi Eric,
>>
>> On 25.08.2011 at 07:28, Eric Dumazet wrote:
> 
>>> The real question is: do we really want to process ~1000 timer
>>> interrupts per tcp session, ~2000 skb alloc/free/build/handling
>>> operations, and possibly ~1000 ARP requests, only to make tcp recover
>>> in ~1 sec when connectivity comes back? This just doesn't scale.
>>
>> Maybe a stupid question, but 1000? With a minRTO of 200ms and a maximum
>> probing time of 120s, we get 600 retransmits in a worst-case scenario
>> (assuming we get an ICMP for every RTO retransmission). No?
> 
> Where is the "max probing time of 120s" asserted?
> 
> It is not the case on my machine:
> I see far more retransmits than that, even spaced 1600 ms apart
> 
> 07:16:13.389331 write(3, "\350F\235JC\357\376\363&\3\374\270R\21L\26\324{\37p\342\244i\304\356\241I:\301\332\222\26"..., 48) = 48
> 07:16:13.389417 select(7, [3 4], [], NULL, NULL) = 1 (in [3])
> 07:31:39.901311 read(3, 0xff8c4c90, 8192) = -1 EHOSTUNREACH (No route to host)
> 
> Old kernels performed up to 15 retries, doing exponential backoff.
> 
> Now it's kind of unlimited, according to experimental results.

That shouldn't be. It should stop after the same time as a TCP
connection with the minimum RTO doing 15 retries (tcp_retries2=15)
with exponential backoff, so it should be around 900s*. But it could
be that this doesn't work as expected because of the icsk_retransmits
wraparound.

* 200ms + 400ms + 800ms ...

Best regards,
Arnd

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [BUG] tcp : how many times a frame can possibly be retransmitted ?
  2011-08-24 23:00   ` Eric Dumazet
  2011-08-24 23:41     ` [PATCH] tcp: bound RTO to minimum Hagen Paul Pfeifer
@ 2011-08-25  8:56     ` Ilpo Järvinen
  2011-08-25  9:40       ` Eric Dumazet
  1 sibling, 1 reply; 22+ messages in thread
From: Ilpo Järvinen @ 2011-08-25  8:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Jerry Chu, Damian Lukowski


On Thu, 25 Aug 2011, Eric Dumazet wrote:

> On Thursday, 25 August 2011 at 01:44 +0300, Ilpo Järvinen wrote:
> > On Wed, 24 Aug 2011, Eric Dumazet wrote:
> > 
> > > On one dev machine running net-next, I just found strange tcp sessions
> > > that retransmit a frame forever (The other peer disappeared)
> > > 
> > > # ss -emoi dst 10.2.1.1
> > > State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
> > > ESTAB      0      816              10.2.1.2:37930             10.2.1.1:ssh      timer:(on,630ms,246) ino:60786 sk:ffff8801189aa400
> > > 	 mem:(r0,w3776,f320,t0) ts sack ecn cubic wscale:8,6 rto:1680 rtt:16.25/7.5 ato:40 ssthresh:7 send 1.4Mbps rcv_rtt:10 rcv_space:16632
> > > 
> > > 
> > > You can see the retransmit count : 246 
> > > 
> > > What possibly can be going on ?
> > > 
> > > What happened to backoff ?
> > 
> > But the RTO (even without any backoff) should be lower-bounded to some
> > value that is not close to zero?
> 
> Apparently not.
> 
> The only thing that protects us from a flood is that ip_error() uses
> the inetpeer cache to rate-limit icmp_send(ICMP_DEST_UNREACH).
> 
> This is why we get a retransmit period >= 1 sec
>
> vi +432 net/ipv4/tcp_ipv4.c
> 
>                 icsk->icsk_backoff--;
>                 inet_csk(sk)->icsk_rto = (tp->srtt ? __tcp_set_rto(tp) :
>                         TCP_TIMEOUT_INIT) << icsk->icsk_backoff;
>                 tcp_bound_rto(sk);
> 
> and __tcp_set_rto() uses : return (tp->srtt >> 3) + tp->rttvar;

So you think that this is not true?

        /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
         * guarantees that rto is higher.
         */

...it would still be smaller than 1 sec, though certainly not going to
cause flooding either. The default tcp_rto_min should be 200ms, so it's
5 packets + 5 ICMPs sent, received and processed per second. Which
doesn't sound like that bad a CPU load?!?

It is unclear to me how tp->rttvar could become smaller than 
tcp_rto_min().

-- 
 i.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tcp: bound RTO to minimum
  2011-08-25  8:46               ` Arnd Hannemann
@ 2011-08-25  9:09                 ` Eric Dumazet
  2011-08-25  9:46                   ` Arnd Hannemann
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2011-08-25  9:09 UTC (permalink / raw)
  To: Arnd Hannemann
  Cc: Alexander Zimmermann, Yuchung Cheng, Hagen Paul Pfeifer, netdev,
	Lukowski Damian

On Thursday, 25 August 2011 at 10:46 +0200, Arnd Hannemann wrote:
> Hi,
> 
> On 25.08.2011 10:26, Eric Dumazet wrote:
> > On Thursday, 25 August 2011 at 09:28 +0200, Alexander Zimmermann wrote:
> >> Hi Eric,
> >>
> >> On 25.08.2011 at 07:28, Eric Dumazet wrote:
> > 
> >>> The real question is: do we really want to process ~1000 timer interrupts
> >>> per tcp session, ~2000 skb alloc/free/build/handling, possibly ~1000 ARP
> >>> requests, only to make tcp recover in ~1sec when connectivity comes
> >>> back. This just doesn't scale.
> >>
> >> maybe a stupid question, but 1000? With a minRTO of 200ms and a maximum
> >> probing time of 120s, we get 600 retransmits in a worst-case scenario
> >> (assuming that we get an ICMP for every RTO retransmission). No?
> > 
> > Where is the "max probing time of 120s" asserted?
> > 
> > It is not the case on my machine:
> > I have way more retransmits than that, even if spaced by 1600 ms
> > 
> > 07:16:13.389331 write(3, "\350F\235JC\357\376\363&\3\374\270R\21L\26\324{\37p\342\244i\304\356\241I:\301\332\222\26"..., 48) = 48
> > 07:16:13.389417 select(7, [3 4], [], NULL, NULL) = 1 (in [3])
> > 07:31:39.901311 read(3, 0xff8c4c90, 8192) = -1 EHOSTUNREACH (No route to host)
> > 
> > Old kernels were performing up to 15 retries, doing exponential backoff.
> > 
> > Now it's kind of unlimited, according to experimental results.
> 
> That shouldn't be. It should stop after the same time as a TCP connection with an
> RTO equal to the minimum RTO doing 15 retries (tcp_retries2=15) with exponential backoff.
> So it should be around 900s*. But it could be that, because of the icsk_retransmits
> wraparound, this doesn't work as expected.
> 
> * 200ms + 400ms + 800ms ...

It is 924 seconds with retries2=15 (the default value).

I said ~1000 probes.

If ICMPs are not rate-limited, that could be about 924*5 probes, instead
of 15 probes on old kernels.

Maybe we should refine the thing a bit, to not reverse backoff unless
rto is > some_threshold.

Say 10s were the value; that would give at most 92 tries.

I mean, what is the gain in being able to restart a frozen TCP session with
a 1sec latency instead of 10s if it was blocked for more than 60 seconds?
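
A quick userspace sketch of those numbers (my own toy model, not kernel code:
it assumes one probe per RTO, a 200ms RTO floor while ICMPs keep reverting
the backoff, and the 924.6s give-up window computed above):

#include <stdio.h>

int main(void)
{
	double rto_min = 0.2, rto_max = 120.0, window = 924.6;
	double t, rto;
	int probes = 0;

	/* old behaviour: exponential backoff, RTO capped at 120s;
	 * counts roughly the 15 probes of old kernels */
	for (t = 0.0, rto = rto_min; t < window; t += rto) {
		probes++;
		rto = (2 * rto < rto_max) ? 2 * rto : rto_max;
	}
	printf("with backoff    : %d probes\n", probes);

	/* new behaviour: each ICMP reverts the backoff, so the RTO is
	 * stuck near the floor: ~924*5 = 4623 probes */
	printf("backoff reverted: %d probes\n", (int)(window / rto_min));

	/* proposed: only revert while rto <= 10s: at most 92 tries */
	printf("10s threshold   : %d probes\n", (int)(window / 10.0));
	return 0;
}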

* Re: [BUG] tcp : how many times a frame can possibly be retransmitted ?
  2011-08-25  8:56     ` [BUG] tcp : how many times a frame can possibly be retransmitted ? Ilpo Järvinen
@ 2011-08-25  9:40       ` Eric Dumazet
  2011-08-25 10:07         ` Ilpo Järvinen
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2011-08-25  9:40 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: netdev, Jerry Chu, Damian Lukowski

On Thursday, 25 August 2011 at 11:56 +0300, Ilpo Järvinen wrote:

> So you think that this is not true?
> 
>         /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
>          * guarantees that rto is higher.
>          */
> 
> ...it would still be smaller than 1 sec, though certainly not going to
> cause flooding either. Default tcp_rto_min should be 200ms, so it's
> 5 pkts + 5 ICMPs sent, received and processed per second. Which doesn't sound
> like that bad a CPU load?!?
> 

Unless you have 100,000 active sessions, maybe?

Some years ago, I helped people running servers with more than 1,000,000
long-living active sessions, and a temporary network disruption was
already very critical at that time, with old kernels. (At that time, the IP
route cache could blow up and consume too much RAM or CPU time; things
are now under control.)

I guess they would not try a new kernel :(

> It is unclear to me how tp->rttvar could become smaller than 
> tcp_rto_min().

I believe this part is fine, Ilpo.

As long as we handle only a few tcp sessions, it's fine to send 5 messages per
session per second.

* Re: [PATCH] tcp: bound RTO to minimum
  2011-08-25  9:09                 ` Eric Dumazet
@ 2011-08-25  9:46                   ` Arnd Hannemann
  2011-08-25 10:02                     ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Arnd Hannemann @ 2011-08-25  9:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Zimmermann, Yuchung Cheng, Hagen Paul Pfeifer, netdev,
	Lukowski Damian

Hi Eric,

On 25.08.2011 11:09, Eric Dumazet wrote:
> On Thursday, 25 August 2011 at 10:46 +0200, Arnd Hannemann wrote:
>> On 25.08.2011 10:26, Eric Dumazet wrote:
>>> On Thursday, 25 August 2011 at 09:28 +0200, Alexander Zimmermann wrote:
>>>> On 25.08.2011 at 07:28, Eric Dumazet wrote:
>>>
>>>>> The real question is: do we really want to process ~1000 timer interrupts
>>>>> per tcp session, ~2000 skb alloc/free/build/handling, possibly ~1000 ARP
>>>>> requests, only to make tcp recover in ~1sec when connectivity comes
>>>>> back. This just doesn't scale.
>>>>
>>>> maybe a stupid question, but 1000? With a minRTO of 200ms and a maximum
>>>> probing time of 120s, we get 600 retransmits in a worst-case scenario
>>>> (assuming that we get an ICMP for every RTO retransmission). No?
>>>
>>> Where is the "max probing time of 120s" asserted?
>>>
>>> It is not the case on my machine:
>>> I have way more retransmits than that, even if spaced by 1600 ms
>>>
>>> 07:16:13.389331 write(3, "\350F\235JC\357\376\363&\3\374\270R\21L\26\324{\37p\342\244i\304\356\241I:\301\332\222\26"..., 48) = 48
>>> 07:16:13.389417 select(7, [3 4], [], NULL, NULL) = 1 (in [3])
>>> 07:31:39.901311 read(3, 0xff8c4c90, 8192) = -1 EHOSTUNREACH (No route to host)
>>>
>>> Old kernels were performing up to 15 retries, doing exponential backoff.
>>>
>>> Now it's kind of unlimited, according to experimental results.
>>
>> That shouldn't be. It should stop after the same time as a TCP connection with an
>> RTO equal to the minimum RTO doing 15 retries (tcp_retries2=15) with exponential backoff.
>> So it should be around 900s*. But it could be that, because of the icsk_retransmits
>> wraparound, this doesn't work as expected.
>>
>> * 200ms + 400ms + 800ms ...
> 
> It is 924 seconds with retries2=15 (the default value).
> 
> I said ~1000 probes.
> 
> If ICMPs are not rate-limited, that could be about 924*5 probes, instead
> of 15 probes on old kernels.

At a rate of 5 packets/s if RTT is zero, yes. I would like to say: so
what? But your example with millions of idle connections stands.

> Maybe we should refine the thing a bit, to not reverse backoff unless
> rto is > some_threshold.
> 
> > Say 10s were the value; that would give at most 92 tries.

I personally think that 10s would be too large and would eliminate the benefit
of the algorithm, so I would prefer a different solution.

In the case of one bulk-data TCP session, which was transmitting hundreds of
packets/s before the connectivity disruption, that worst-case rate of 5 packets/s
really seems conservative enough.

However, in the case of a lot of idle connections, which were transmitting only
a few packets per minute, we might increase the rate drastically for
a certain period until it throttles down. You say that we have a problem here,
correct?

Do you think it would be possible without much hassle to use a kind of "global"
rate limiting only for these probe packets of a TCP connection?
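
Something like a single token bucket shared by all sockets, consulted only on
that probe path, is what I have in mind. A rough userspace sketch (all names
and numbers are made up for illustration; a real version would also need a
lock or per-cpu counters):

#include <stdbool.h>

#define PROBE_RATE	1000	/* sustained probes/sec, system-wide */
#define PROBE_BURST	2000	/* allow short bursts after idle periods */

static unsigned long probe_tokens = PROBE_BURST;
static unsigned long probe_stamp;	/* last refill time, in ms */

/* called before sending a backoff-reverted probe; returns false when
 * the probe should keep its backed-off RTO instead */
static bool probe_rate_ok(unsigned long now_ms)
{
	probe_tokens += (now_ms - probe_stamp) * PROBE_RATE / 1000;
	if (probe_tokens > PROBE_BURST)
		probe_tokens = PROBE_BURST;
	probe_stamp = now_ms;

	if (!probe_tokens)
		return false;
	probe_tokens--;
	return true;
}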

> I mean, what is the gain in being able to restart a frozen TCP session with
> a 1sec latency instead of 10s if it was blocked for more than 60 seconds?

I'm afraid it matters a lot, especially in highly dynamic environments. You
don't just have the additional latency; you may actually miss the full
period where connectivity was there, and then just retransmit into the next
connectivity-disrupted period.

Best regards,
Arnd

* Re: [PATCH] tcp: bound RTO to minimum
  2011-08-25  9:46                   ` Arnd Hannemann
@ 2011-08-25 10:02                     ` Eric Dumazet
  2011-08-25 10:14                       ` Ilpo Järvinen
  2011-08-25 10:15                       ` Arnd Hannemann
  0 siblings, 2 replies; 22+ messages in thread
From: Eric Dumazet @ 2011-08-25 10:02 UTC (permalink / raw)
  To: Arnd Hannemann
  Cc: Alexander Zimmermann, Yuchung Cheng, Hagen Paul Pfeifer, netdev,
	Lukowski Damian

On Thursday, 25 August 2011 at 11:46 +0200, Arnd Hannemann wrote:
> Hi Eric,
> 
> On 25.08.2011 11:09, Eric Dumazet wrote:

> > Maybe we should refine the thing a bit, to not reverse backoff unless
> > rto is > some_threshold.
> > 
> > Say 10s were the value; that would give at most 92 tries.
> 
> I personally think that 10s would be too large and would eliminate the benefit
> of the algorithm, so I would prefer a different solution.
> 
> In the case of one bulk-data TCP session, which was transmitting hundreds of
> packets/s before the connectivity disruption, that worst-case rate of 5 packets/s
> really seems conservative enough.
> 
> However, in the case of a lot of idle connections, which were transmitting only
> a few packets per minute, we might increase the rate drastically for
> a certain period until it throttles down. You say that we have a problem here,
> correct?
> 
> Do you think it would be possible without much hassle to use a kind of "global"
> rate limiting only for these probe packets of a TCP connection?
> 
> > I mean, what is the gain in being able to restart a frozen TCP session with
> > a 1sec latency instead of 10s if it was blocked for more than 60 seconds?
> 
> I'm afraid it matters a lot, especially in highly dynamic environments. You
> don't just have the additional latency; you may actually miss the full
> period where connectivity was there, and then just retransmit into the next
> connectivity-disrupted period.

Problem with this is that with short and synchronized timers, all
sessions will flood at the same time and you'll get congestion this
time.

The reason for exponential backoff is also to smooth the restarts of
sessions, because timers are randomized.

* Re: [BUG] tcp : how many times a frame can possibly be retransmitted ?
  2011-08-25  9:40       ` Eric Dumazet
@ 2011-08-25 10:07         ` Ilpo Järvinen
  0 siblings, 0 replies; 22+ messages in thread
From: Ilpo Järvinen @ 2011-08-25 10:07 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Jerry Chu, Damian Lukowski

On Thu, 25 Aug 2011, Eric Dumazet wrote:

> On Thursday, 25 August 2011 at 11:56 +0300, Ilpo Järvinen wrote:
> 
> > So you think that this is not true?
> > 
> >         /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
> >          * guarantees that rto is higher.
> >          */
> > 
> > ...it would still be smaller than 1 sec, though certainly not going to
> > cause flooding either. Default tcp_rto_min should be 200ms, so it's
> > 5 pkts + 5 ICMPs sent, received and processed per second. Which doesn't sound
> > like that bad a CPU load?!?
> > 
> 
> Unless you have 100,000 active sessions, maybe?
> 
> Some years ago, I helped people running servers with more than 1,000,000
> long-living active sessions, and a temporary network disruption was
> already very critical at that time, with old kernels. (At that time, the IP
> route cache could blow up and consume too much RAM or CPU time; things
> are now under control.)
> 
> I guess they would not try a new kernel :(
> 
> > It is unclear to me how tp->rttvar could become smaller than 
> > tcp_rto_min().
> 
> I believe this part is fine, Ilpo.
> 
> As long as we handle only a few tcp sessions, it's fine to send 5 messages per
> session per second.

Yeah, thanks for the clarification. I was just confused by your initial
wording, which seemed to imply that we could, at worst, end up
doing it at full rate without any timers.

To me it seems that both cases are quite valid, with pretty much
contradictory goals.


-- 
 i.

* Re: [PATCH] tcp: bound RTO to minimum
  2011-08-25 10:02                     ` Eric Dumazet
@ 2011-08-25 10:14                       ` Ilpo Järvinen
  2011-08-25 10:15                       ` Arnd Hannemann
  1 sibling, 0 replies; 22+ messages in thread
From: Ilpo Järvinen @ 2011-08-25 10:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Arnd Hannemann, Alexander Zimmermann, Yuchung Cheng,
	Hagen Paul Pfeifer, netdev, Lukowski Damian

On Thu, 25 Aug 2011, Eric Dumazet wrote:

> On Thursday, 25 August 2011 at 11:46 +0200, Arnd Hannemann wrote:
> > Hi Eric,
> > 
> > On 25.08.2011 11:09, Eric Dumazet wrote:
> 
> > > Maybe we should refine the thing a bit, to not reverse backoff unless
> > > rto is > some_threshold.
> > > 
> > > Say 10s were the value; that would give at most 92 tries.
> > 
> > I personally think that 10s would be too large and would eliminate the benefit
> > of the algorithm, so I would prefer a different solution.
> > 
> > In the case of one bulk-data TCP session, which was transmitting hundreds of
> > packets/s before the connectivity disruption, that worst-case rate of 5 packets/s
> > really seems conservative enough.
> > 
> > However, in the case of a lot of idle connections, which were transmitting only
> > a few packets per minute, we might increase the rate drastically for
> > a certain period until it throttles down. You say that we have a problem here,
> > correct?
> > 
> > Do you think it would be possible without much hassle to use a kind of 
> > "global" rate limiting only for these probe packets of a TCP connection?
> >
> > > I mean, what is the gain in being able to restart a frozen TCP session with
> > > a 1sec latency instead of 10s if it was blocked for more than 60 seconds?
> > 
> > I'm afraid it matters a lot, especially in highly dynamic environments. You
> > don't just have the additional latency; you may actually miss the full
> > period where connectivity was there, and then just retransmit into the next
> > connectivity-disrupted period.
> 
> Problem with this is that with short and synchronized timers, all
> sessions will flood at the same time and you'll get congestion this
> time.
>
> The reason for exponential backoff is also to smooth the restarts of
> sessions, because timers are randomized.

But if you get real congestion, the system will self-regulate using
exponential backoff, due to the lack of ICMPs for some of the connections?


-- 
 i.

* Re: [PATCH] tcp: bound RTO to minimum
  2011-08-25 10:02                     ` Eric Dumazet
  2011-08-25 10:14                       ` Ilpo Järvinen
@ 2011-08-25 10:15                       ` Arnd Hannemann
  1 sibling, 0 replies; 22+ messages in thread
From: Arnd Hannemann @ 2011-08-25 10:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Zimmermann, Yuchung Cheng, Hagen Paul Pfeifer, netdev,
	Lukowski Damian

Hi Eric,

On 25.08.2011 12:02, Eric Dumazet wrote:
> On Thursday, 25 August 2011 at 11:46 +0200, Arnd Hannemann wrote:
>> Hi Eric,
>>
>> On 25.08.2011 11:09, Eric Dumazet wrote:
> 
>>> Maybe we should refine the thing a bit, to not reverse backoff unless
>>> rto is > some_threshold.
>>>
>>> Say 10s were the value; that would give at most 92 tries.
>>
>> I personally think that 10s would be too large and would eliminate the benefit
>> of the algorithm, so I would prefer a different solution.
>>
>> In the case of one bulk-data TCP session, which was transmitting hundreds of
>> packets/s before the connectivity disruption, that worst-case rate of 5 packets/s
>> really seems conservative enough.
>>
>> However, in the case of a lot of idle connections, which were transmitting only
>> a few packets per minute, we might increase the rate drastically for
>> a certain period until it throttles down. You say that we have a problem here,
>> correct?
>>
>> Do you think it would be possible without much hassle to use a kind of "global"
>> rate limiting only for these probe packets of a TCP connection?
>>
>>> I mean, what is the gain in being able to restart a frozen TCP session with
>>> a 1sec latency instead of 10s if it was blocked for more than 60 seconds?
>>
>> I'm afraid it matters a lot, especially in highly dynamic environments. You
>> don't just have the additional latency; you may actually miss the full
>> period where connectivity was there, and then just retransmit into the next
>> connectivity-disrupted period.
> 
> Problem with this is that with short and synchronized timers, all
> sessions will flood at the same time and you'll get congestion this
> time.

Why do you think the timers are "synchronized"? If you have congestion,
then you will do exponential backoff.

> The reason for exponential backoff is also to smooth the restarts of
> sessions, because timers are randomized.

If the RTOs of these sessions were "randomized", they keep this randomization
even if backoffs are reverted; at least they should.

Best regards,
Arnd

end of thread

Thread overview: 22+ messages
2011-08-24 16:21 [BUG] tcp : how many times a frame can possibly be retransmitted ? Eric Dumazet
2011-08-24 19:03 ` Alexander Zimmermann
2011-08-24 19:39   ` Jerry Chu
2011-08-24 19:45   ` Eric Dumazet
2011-08-24 22:44 ` Ilpo Järvinen
2011-08-24 23:00   ` Eric Dumazet
2011-08-24 23:41     ` [PATCH] tcp: bound RTO to minimum Hagen Paul Pfeifer
2011-08-24 23:43       ` Hagen Paul Pfeifer
2011-08-25  1:50       ` Yuchung Cheng
2011-08-25  5:28         ` Eric Dumazet
2011-08-25  7:28           ` Alexander Zimmermann
2011-08-25  8:26             ` Eric Dumazet
2011-08-25  8:44               ` Alexander Zimmermann
2011-08-25  8:46               ` Arnd Hannemann
2011-08-25  9:09                 ` Eric Dumazet
2011-08-25  9:46                   ` Arnd Hannemann
2011-08-25 10:02                     ` Eric Dumazet
2011-08-25 10:14                       ` Ilpo Järvinen
2011-08-25 10:15                       ` Arnd Hannemann
2011-08-25  8:56     ` [BUG] tcp : how many times a frame can possibly be retransmitted ? Ilpo Järvinen
2011-08-25  9:40       ` Eric Dumazet
2011-08-25 10:07         ` Ilpo Järvinen
