* [PATCH net-next] tcp: RFC6298 supersedes RFC2988bis
@ 2012-04-13  5:48 Eric Dumazet
  2012-04-14 19:47 ` David Miller
  2012-04-15 18:33 ` Jerry Chu
  0 siblings, 2 replies; 7+ messages in thread
From: Eric Dumazet @ 2012-04-13  5:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, H.K. Jerry Chu, Tom Herbert

Updates some comments to track RFC6298

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Tom Herbert <therbert@google.com>
---
BTW, one side effect of the TCP_TIMEOUT_INIT change (3 -> 1) is
inet_csk_reqsk_queue_prune() latency is 200% worse:

It fires every 200ms and scans 40% of hash table each time, listener
socket held.

 include/net/tcp.h               |    2 +-
 net/ipv4/inet_connection_sock.c |    2 +-
 net/ipv4/tcp_input.c            |    4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index f75a04d..057f016 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -123,7 +123,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #endif
 #define TCP_RTO_MAX	((unsigned)(120*HZ))
 #define TCP_RTO_MIN	((unsigned)(HZ/5))
-#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))	/* RFC2988bis initial RTO value	*/
+#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))	/* RFC6298 2.1 initial RTO value	*/
 #define TCP_TIMEOUT_FALLBACK ((unsigned)(3*HZ))	/* RFC 1122 initial RTO value, now
 						 * used as a fallback RTO for the
 						 * initial data transmission if no
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 19d66ce..c12396f 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -514,7 +514,7 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
 
 	/* Normally all the openreqs are young and become mature
 	 * (i.e. converted to established socket) for first timeout.
-	 * If synack was not acknowledged for 3 seconds, it means
+	 * If synack was not acknowledged for 1 second, it means
 	 * one of the following things: synack was lost, ack was lost,
 	 * rtt is high or nobody planned to ack (i.e. synflood).
 	 * When server is a bit loaded, queue is populated with old
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e886e2f..9147c27 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -933,7 +933,7 @@ static void tcp_init_metrics(struct sock *sk)
 	tcp_set_rto(sk);
 reset:
 	if (tp->srtt == 0) {
-		/* RFC2988bis: We've failed to get a valid RTT sample from
+		/* RFC6298: 5.7 We've failed to get a valid RTT sample from
 		 * 3WHS. This is most likely due to retransmission,
 		 * including spurious one. Reset the RTO back to 3secs
 		 * from the more aggressive 1sec to avoid more spurious
@@ -943,7 +943,7 @@ reset:
 		inet_csk(sk)->icsk_rto = TCP_TIMEOUT_FALLBACK;
 	}
 	/* Cut cwnd down to 1 per RFC5681 if SYN or SYN-ACK has been
-	 * retransmitted. In light of RFC2988bis' more aggressive 1sec
+	 * retransmitted. In light of RFC6298 more aggressive 1sec
 	 * initRTO, we only reset cwnd when more than 1 SYN/SYN-ACK
 	 * retransmission has occurred.
 	 */

* Re: [PATCH net-next] tcp: RFC6298 supersedes RFC2988bis
  2012-04-13  5:48 [PATCH net-next] tcp: RFC6298 supersedes RFC2988bis Eric Dumazet
@ 2012-04-14 19:47 ` David Miller
  2012-04-15 15:40   ` Eric Dumazet
  2012-04-15 18:33 ` Jerry Chu
  1 sibling, 1 reply; 7+ messages in thread
From: David Miller @ 2012-04-14 19:47 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, hkchu, therbert

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 13 Apr 2012 07:48:40 +0200

> Updates some comments to track RFC6298
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: H.K. Jerry Chu <hkchu@google.com>
> Cc: Tom Herbert <therbert@google.com>

Applied.

> BTW, one side effect of the TCP_TIMEOUT_INIT change (3 -> 1) is
> inet_csk_reqsk_queue_prune() latency is 200% worse:
> 
> It fires every 200ms and scans 40% of hash table each time, listener
> socket held.

That's rather unfortunate, but I can't see an easy way around it.  We
have to process the whole table within the timeout quantum.

* Re: [PATCH net-next] tcp: RFC6298 supersedes RFC2988bis
  2012-04-14 19:47 ` David Miller
@ 2012-04-15 15:40   ` Eric Dumazet
  2012-04-15 16:48     ` David Miller
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2012-04-15 15:40 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, hkchu, therbert

On Sat, 2012-04-14 at 15:47 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>

> > BTW, one side effect of the TCP_TIMEOUT_INIT change (3 -> 1) is
> > inet_csk_reqsk_queue_prune() latency is 200% worse:
> > 
> > It fires every 200ms and scans 40% of hash table each time, listener
> > socket held.
> 
> That's rather unfortunate, but I can't see an easy way around it.  We
> have to process the whole table within the timeout quantum.

I am currently working on this issue (and, more generally, on making
LISTEN/SYN_RECV processing scale better):

1) convert inet_csk_reqsk_queue_prune() to a work queue instead of a timer
2) use hashed spinlocks to protect syn_table[]
3) use RCU and don't hold the parent socket lock, to allow parallelism for
multiqueue NICs (or RPS ...)
4) use 32bit instead of 16bit for sk_max_ack_backlog/sk_ack_backlog

It occurred to me that on my 12-core machine (24 threads) with an IXGBE card
(24 queues per link), a moderate SYN packet load could basically freeze the
whole machine, with 23 cpus waiting for one cpu to be done with the listener
lock.

TCP processing on ESTABLISHED/TIME_WAIT sockets has RCU and all the goodies;
the time has come to address the LISTEN/SYN_RECV states as well.

* Re: [PATCH net-next] tcp: RFC6298 supersedes RFC2988bis
  2012-04-15 15:40   ` Eric Dumazet
@ 2012-04-15 16:48     ` David Miller
  0 siblings, 0 replies; 7+ messages in thread
From: David Miller @ 2012-04-15 16:48 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, hkchu, therbert

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 15 Apr 2012 17:40:06 +0200

> 3) use RCU and don't hold the parent socket lock, to allow parallelism for
> multiqueue NICs (or RPS ...)

This part could be tricky.

We have to be careful in the case where one cpu comes in and finds the
listener sock for a particular child while another cpu progresses
that child socket into ESTABLISHED state.  Most of the parent locking
and strict synchronization is there to make sure this case works out
properly.

* Re: [PATCH net-next] tcp: RFC6298 supersedes RFC2988bis
  2012-04-13  5:48 [PATCH net-next] tcp: RFC6298 supersedes RFC2988bis Eric Dumazet
  2012-04-14 19:47 ` David Miller
@ 2012-04-15 18:33 ` Jerry Chu
  2012-04-15 20:01   ` Eric Dumazet
  1 sibling, 1 reply; 7+ messages in thread
From: Jerry Chu @ 2012-04-15 18:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Tom Herbert

[send again - it looks like my previous comment was lost...]

On Thu, Apr 12, 2012 at 10:48 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Updates some comments to track RFC6298
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: H.K. Jerry Chu <hkchu@google.com>
> Cc: Tom Herbert <therbert@google.com>
> ---
> BTW, one side effect of the TCP_TIMEOUT_INIT change (3 -> 1) is
> inet_csk_reqsk_queue_prune() latency is 200% worse:

Or even worse - 300% (3/1)?

>
> It fires every 200ms and scans 40% of hash table each time, listener
> socket held.

If this becomes a real issue we could decrease TCP_SYNQ_INTERVAL,
essentially making the keepalive timer fire more often, but each time with
less work to do?

Also why is
budget = 2 * (lopt->nr_table_entries / (timeout / interval));
rather than
budget = (lopt->nr_table_entries / (timeout / interval)) + 1;
?

Acked-by: Jerry Chu <hkchu@google.com>

>
>  include/net/tcp.h               |    2 +-
>  net/ipv4/inet_connection_sock.c |    2 +-
>  net/ipv4/tcp_input.c            |    4 ++--
>  3 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index f75a04d..057f016 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -123,7 +123,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
>  #endif
>  #define TCP_RTO_MAX    ((unsigned)(120*HZ))
>  #define TCP_RTO_MIN    ((unsigned)(HZ/5))
> -#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))    /* RFC2988bis initial RTO value */
> +#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))    /* RFC6298 2.1 initial RTO value        */
>  #define TCP_TIMEOUT_FALLBACK ((unsigned)(3*HZ))        /* RFC 1122 initial RTO value, now
>                                                 * used as a fallback RTO for the
>                                                 * initial data transmission if no
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 19d66ce..c12396f 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -514,7 +514,7 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
>
>        /* Normally all the openreqs are young and become mature
>         * (i.e. converted to established socket) for first timeout.
> -        * If synack was not acknowledged for 3 seconds, it means
> +        * If synack was not acknowledged for 1 second, it means
>         * one of the following things: synack was lost, ack was lost,
>         * rtt is high or nobody planned to ack (i.e. synflood).
>         * When server is a bit loaded, queue is populated with old
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index e886e2f..9147c27 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -933,7 +933,7 @@ static void tcp_init_metrics(struct sock *sk)
>        tcp_set_rto(sk);
>  reset:
>        if (tp->srtt == 0) {
> -               /* RFC2988bis: We've failed to get a valid RTT sample from
> +               /* RFC6298: 5.7 We've failed to get a valid RTT sample from
>                 * 3WHS. This is most likely due to retransmission,
>                 * including spurious one. Reset the RTO back to 3secs
>                 * from the more aggressive 1sec to avoid more spurious
> @@ -943,7 +943,7 @@ reset:
>                inet_csk(sk)->icsk_rto = TCP_TIMEOUT_FALLBACK;
>        }
>        /* Cut cwnd down to 1 per RFC5681 if SYN or SYN-ACK has been
> -        * retransmitted. In light of RFC2988bis' more aggressive 1sec
> +        * retransmitted. In light of RFC6298 more aggressive 1sec
>         * initRTO, we only reset cwnd when more than 1 SYN/SYN-ACK
>         * retransmission has occurred.
>         */
>
>

* Re: [PATCH net-next] tcp: RFC6298 supersedes RFC2988bis
  2012-04-15 18:33 ` Jerry Chu
@ 2012-04-15 20:01   ` Eric Dumazet
  2012-04-16  2:21     ` Jerry Chu
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2012-04-15 20:01 UTC (permalink / raw)
  To: Jerry Chu; +Cc: David Miller, netdev, Tom Herbert

On Sun, 2012-04-15 at 11:33 -0700, Jerry Chu wrote:
> [send again - it looks like my previous comment was lost...]
> 
> On Thu, Apr 12, 2012 at 10:48 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Updates some comments to track RFC6298
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: H.K. Jerry Chu <hkchu@google.com>
> > Cc: Tom Herbert <therbert@google.com>
> > ---
> > BTW, one side effect of the TCP_TIMEOUT_INIT change (3 -> 1) is
> > inet_csk_reqsk_queue_prune() latency is 200% worse:
> 
> Or even worse - 300% (3/1)?

well, 3 instead of 1 is a 200% increase ;)

> 
> >
> > It fires every 200ms and scans 40% of hash table each time, listener
> > socket held.
> 
> If this becomes a real issue we could decrease TCP_SYNQ_INTERVAL,
> essentially making the keepalive timer fire more often, but each time with
> less work to do?

Hmm... 200ms is already aggressive for power saving

> 
> Also why is
> budget = 2 * (lopt->nr_table_entries / (timeout / interval));
> rather than
> budget = (lopt->nr_table_entries / (timeout / interval)) + 1;
> ?

That's because if we do that, retransmits could be delayed by 100%,
instead of 50% with this solution.

(right now it takes 2.5 rounds to scan the whole table, so a one-second
'timer' can fire after 1.6 seconds)

* Re: [PATCH net-next] tcp: RFC6298 supersedes RFC2988bis
  2012-04-15 20:01   ` Eric Dumazet
@ 2012-04-16  2:21     ` Jerry Chu
  0 siblings, 0 replies; 7+ messages in thread
From: Jerry Chu @ 2012-04-16  2:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Tom Herbert

On Sun, Apr 15, 2012 at 1:01 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sun, 2012-04-15 at 11:33 -0700, Jerry Chu wrote:
>> [send again - it looks like my previous comment was lost...]
>>
>> On Thu, Apr 12, 2012 at 10:48 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > Updates some comments to track RFC6298
>> >
>> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>> > Cc: H.K. Jerry Chu <hkchu@google.com>
>> > Cc: Tom Herbert <therbert@google.com>
>> > ---
>> > BTW, one side effect of the TCP_TIMEOUT_INIT change (3 -> 1) is
>> > inet_csk_reqsk_queue_prune() latency is 200% worse:
>>
>> Or even worse - 300% (3/1)?
>
> well, 3 instead of 1 is a 200% increase ;)
>
>>
>> >
>> > It fires every 200ms and scans 40% of hash table each time, listener
>> > socket held.
>>
>> If this becomes a real issue we could decrease TCP_SYNQ_INTERVAL,
>> essentially making the keepalive timer fire more often, but each time with
>> less work to do?
>
> Hmm... 200ms is already aggressive for power saving

Not sure how much power saving one can attain when syn_table is non-empty
anyway.

>
>>
>> Also why is
>> budget = 2 * (lopt->nr_table_entries / (timeout / interval));
>> rather than
>> budget = (lopt->nr_table_entries / (timeout / interval)) + 1;
>> ?
>
> That's because if we do that, retransmits could be delayed by 100%,
> instead of 50% with this solution.

Oh, that's right - the delay can be up to the time it takes to scan the
whole table, so it's a balance between the delay and the processing overhead
in the current data structure...

Thanks,

Jerry

>
> (right now it takes 2.5 rounds to scan whole table, so a one sec 'timer'
> can be fired after 1.6 second)
>
>
>
