* [PATCH net-next] tcp: RFC6298 supersedes RFC2988bis
@ 2012-04-13 5:48 Eric Dumazet
2012-04-14 19:47 ` David Miller
2012-04-15 18:33 ` Jerry Chu
0 siblings, 2 replies; 7+ messages in thread
From: Eric Dumazet @ 2012-04-13 5:48 UTC
To: David Miller; +Cc: netdev, H.K. Jerry Chu, Tom Herbert
Updates some comments to track RFC6298
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Tom Herbert <therbert@google.com>
---
BTW, one side effect of the TCP_TIMEOUT_INIT change (3 -> 1) is
inet_csk_reqsk_queue_prune() latency is 200% worse:
It fires every 200ms and scans 40% of hash table each time, listener
socket held.
include/net/tcp.h | 2 +-
net/ipv4/inet_connection_sock.c | 2 +-
net/ipv4/tcp_input.c | 4 ++--
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f75a04d..057f016 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -123,7 +123,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
#endif
#define TCP_RTO_MAX ((unsigned)(120*HZ))
#define TCP_RTO_MIN ((unsigned)(HZ/5))
-#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ)) /* RFC2988bis initial RTO value */
+#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ)) /* RFC6298 2.1 initial RTO value */
#define TCP_TIMEOUT_FALLBACK ((unsigned)(3*HZ)) /* RFC 1122 initial RTO value, now
* used as a fallback RTO for the
* initial data transmission if no
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 19d66ce..c12396f 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -514,7 +514,7 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
/* Normally all the openreqs are young and become mature
* (i.e. converted to established socket) for first timeout.
- * If synack was not acknowledged for 3 seconds, it means
+ * If synack was not acknowledged for 1 second, it means
* one of the following things: synack was lost, ack was lost,
* rtt is high or nobody planned to ack (i.e. synflood).
* When server is a bit loaded, queue is populated with old
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e886e2f..9147c27 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -933,7 +933,7 @@ static void tcp_init_metrics(struct sock *sk)
tcp_set_rto(sk);
reset:
if (tp->srtt == 0) {
- /* RFC2988bis: We've failed to get a valid RTT sample from
+ /* RFC6298: 5.7 We've failed to get a valid RTT sample from
* 3WHS. This is most likely due to retransmission,
* including spurious one. Reset the RTO back to 3secs
* from the more aggressive 1sec to avoid more spurious
@@ -943,7 +943,7 @@ reset:
inet_csk(sk)->icsk_rto = TCP_TIMEOUT_FALLBACK;
}
/* Cut cwnd down to 1 per RFC5681 if SYN or SYN-ACK has been
- * retransmitted. In light of RFC2988bis' more aggressive 1sec
+ * retransmitted. In light of RFC6298 more aggressive 1sec
* initRTO, we only reset cwnd when more than 1 SYN/SYN-ACK
* retransmission has occurred.
*/
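The two comments touched in tcp_input.c describe a small decision rule, which can be sketched in user space as follows. This is only an illustration under stated assumptions: struct toy_tcp, toy_init_metrics(), the field names, the HZ value and the init_cwnd parameter are all hypothetical stand-ins, not the kernel's definitions. Only the 1 s / 3 s values and the "more than 1 retransmission" rule come from the patch.

```c
#define HZ 1000u                          /* assumed tick rate, illustration only */
#define TCP_TIMEOUT_INIT     (1u * HZ)    /* RFC 6298 initial RTO */
#define TCP_TIMEOUT_FALLBACK (3u * HZ)    /* RFC 1122 value, kept as fallback */

/* Hypothetical, simplified stand-in for the state the comments talk about. */
struct toy_tcp {
	unsigned int srtt;          /* 0 means no valid RTT sample from the 3WHS */
	unsigned int rto;           /* retransmission timeout, in ticks */
	unsigned int total_retrans; /* SYN/SYN-ACK retransmissions seen so far */
	unsigned int cwnd;          /* congestion window, in segments */
};

static void toy_init_metrics(struct toy_tcp *tp, unsigned int init_cwnd)
{
	/* No RTT sample from the handshake: back off from 1 s to 3 s. */
	if (tp->srtt == 0)
		tp->rto = TCP_TIMEOUT_FALLBACK;

	/* Cut cwnd to 1 only when more than one SYN/SYN-ACK was retransmitted. */
	tp->cwnd = (tp->total_retrans > 1) ? 1 : init_cwnd;
}
```

With a single retransmission and no RTT sample, the sketch raises the RTO to 3 s but keeps the initial window; with two retransmissions it also drops the window to 1.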
* Re: [PATCH net-next] tcp: RFC6298 supersedes RFC2988bis
2012-04-13 5:48 [PATCH net-next] tcp: RFC6298 supersedes RFC2988bis Eric Dumazet
@ 2012-04-14 19:47 ` David Miller
2012-04-15 15:40 ` Eric Dumazet
2012-04-15 18:33 ` Jerry Chu
1 sibling, 1 reply; 7+ messages in thread
From: David Miller @ 2012-04-14 19:47 UTC
To: eric.dumazet; +Cc: netdev, hkchu, therbert
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 13 Apr 2012 07:48:40 +0200
> Updates some comments to track RFC6298
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: H.K. Jerry Chu <hkchu@google.com>
> Cc: Tom Herbert <therbert@google.com>
Applied.
> BTW, one side effect of the TCP_TIMEOUT_INIT change (3 -> 1) is
> inet_csk_reqsk_queue_prune() latency is 200% worse:
>
> It fires every 200ms and scans 40% of hash table each time, listener
> socket held.
That's rather unfortunate, but I can't see an easy way around it. We
have to process the whole table within the timeout quantum.
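The 40% figure can be reproduced from the budget formula quoted later in this thread. The sketch below is a user-space illustration only: prune_budget() is a hypothetical helper, and the 512-bucket table size and HZ value are assumptions chosen for the arithmetic.

```c
#define HZ 1000UL   /* assumed tick rate, illustration only */

/*
 * Number of hash buckets visited per timer run, using the formula
 * quoted in the thread:
 *   budget = 2 * (nr_table_entries / (timeout / interval))
 */
static unsigned long prune_budget(unsigned long nr_table_entries,
				  unsigned long timeout,
				  unsigned long interval)
{
	return 2 * (nr_table_entries / (timeout / interval));
}
```

For a 512-bucket table and a 200 ms interval, a 1 s timeout gives prune_budget(512, HZ, HZ / 5) = 204 buckets per run (about 40%), while the old 3 s timeout gives prune_budget(512, 3 * HZ, HZ / 5) = 68 buckets (about 13%). Each run therefore does three times the work, i.e. 200% more. Halving the interval to 100 ms would give 102 buckets (about 20%) per run.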
* Re: [PATCH net-next] tcp: RFC6298 supersedes RFC2988bis
2012-04-14 19:47 ` David Miller
@ 2012-04-15 15:40 ` Eric Dumazet
2012-04-15 16:48 ` David Miller
0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2012-04-15 15:40 UTC
To: David Miller; +Cc: netdev, hkchu, therbert
On Sat, 2012-04-14 at 15:47 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> > BTW, one side effect of the TCP_TIMEOUT_INIT change (3 -> 1) is
> > inet_csk_reqsk_queue_prune() latency is 200% worse:
> >
> > It fires every 200ms and scans 40% of hash table each time, listener
> > socket held.
>
> That's rather unfortunate, but I can't see an easy way around it. We
> have to process the whole table within the timeout quantum.
I am currently working on this issue (and, more generally, on providing
more scalable LISTEN/SYN_RECV processing):
1) convert inet_csk_reqsk_queue_prune() to work queue instead of timer
2) use hashed spinlocks to protect syn_table[]
3) use RCU and don't hold parent socket lock to allow parallelism for
multiqueue NICs (or RPS ...)
4) use 32bit instead of 16bit for sk_max_ack_backlog/sk_ack_backlog
It occurred to me that on my 12-core machine (24 threads) with an IXGBE card
(24 queues per link), a moderate SYN packet load could basically freeze the
whole machine, with 23 cpus waiting until one cpu is done with the listener lock.
TCP processing on ESTABLISHED/TIME_WAIT sockets has RCU and all goodies,
time has come to address the LISTEN/SYN_RECV states as well.
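Item 2 in the list above (hashed spinlocks) is a form of lock striping: a small array of locks, with each hash bucket mapped to one of them, so that CPUs working on different buckets rarely contend. The following is a generic user-space sketch using C11 atomics, not the kernel implementation; every name in it (stripe_locks, NR_STRIPES, stripe_lock() and so on) is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_STRIPES 16u   /* hypothetical lock count; must be a power of two */

struct stripe_locks {
	atomic_flag flag[NR_STRIPES];
};

static void stripe_locks_init(struct stripe_locks *s)
{
	for (size_t i = 0; i < NR_STRIPES; i++)
		atomic_flag_clear(&s->flag[i]);
}

/* Map a hash bucket index to the lock that guards it. */
static size_t stripe_of(size_t bucket)
{
	return bucket & (NR_STRIPES - 1);
}

/* Returns true if the lock guarding this bucket was taken. */
static bool stripe_trylock(struct stripe_locks *s, size_t bucket)
{
	return !atomic_flag_test_and_set_explicit(&s->flag[stripe_of(bucket)],
						  memory_order_acquire);
}

static void stripe_lock(struct stripe_locks *s, size_t bucket)
{
	while (!stripe_trylock(s, bucket))
		;	/* busy-wait until the current holder releases the lock */
}

static void stripe_unlock(struct stripe_locks *s, size_t bucket)
{
	atomic_flag_clear_explicit(&s->flag[stripe_of(bucket)],
				   memory_order_release);
}
```

Buckets 3 and 19 share a lock here (both map to stripe 3), while buckets 3 and 4 can be held by different CPUs at the same time. The trade-off is memory for the lock array against the chance of two unrelated buckets colliding on one lock.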
* Re: [PATCH net-next] tcp: RFC6298 supersedes RFC2988bis
2012-04-15 15:40 ` Eric Dumazet
@ 2012-04-15 16:48 ` David Miller
0 siblings, 0 replies; 7+ messages in thread
From: David Miller @ 2012-04-15 16:48 UTC
To: eric.dumazet; +Cc: netdev, hkchu, therbert
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 15 Apr 2012 17:40:06 +0200
> 3) use RCU and don't hold parent socket lock to allow parallelism for
> multiqueue NICs (or RPS ...)
This part could be tricky.
We have to be careful in the case that one cpu comes in and finds the
listener sock for a particular child, meanwhile another cpu progresses
that child socket into ESTABLISHED state. Most of the parent locking
and strict synchronization is there to make sure this case works out
properly.
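The race described above is that two cpus may both try to act on the same pending request. One generic way to make a state transition happen exactly once without a shared lock is a compare-and-swap on the state word. The sketch below only illustrates that general idea in user space; it is not something proposed in this thread, it ignores everything else the parent lock protects, and toy_req, REQ_SYN_RECV and toy_req_establish() are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { REQ_SYN_RECV, REQ_ESTABLISHED };

/* Hypothetical, minimal stand-in for a pending connection request. */
struct toy_req {
	_Atomic int state;
};

/*
 * Try to move the request from SYN_RECV to ESTABLISHED.
 * Returns true only for the single caller that performs the transition;
 * any other caller sees the state already changed and gets false.
 */
static bool toy_req_establish(struct toy_req *req)
{
	int expected = REQ_SYN_RECV;

	return atomic_compare_exchange_strong(&req->state, &expected,
					      REQ_ESTABLISHED);
}
```

The losing cpu learns from the return value that the child has already progressed and can bail out. Whether this is enough for the real LISTEN/SYN_RECV path is exactly the open question raised above.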
* Re: [PATCH net-next] tcp: RFC6298 supersedes RFC2988bis
2012-04-13 5:48 [PATCH net-next] tcp: RFC6298 supersedes RFC2988bis Eric Dumazet
2012-04-14 19:47 ` David Miller
@ 2012-04-15 18:33 ` Jerry Chu
2012-04-15 20:01 ` Eric Dumazet
1 sibling, 1 reply; 7+ messages in thread
From: Jerry Chu @ 2012-04-15 18:33 UTC
To: Eric Dumazet; +Cc: David Miller, netdev, Tom Herbert
[send again - it looks like my previous comment was lost...]
On Thu, Apr 12, 2012 at 10:48 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Updates some comments to track RFC6298
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: H.K. Jerry Chu <hkchu@google.com>
> Cc: Tom Herbert <therbert@google.com>
> ---
> BTW, one side effect of the TCP_TIMEOUT_INIT change (3 -> 1) is
> inet_csk_reqsk_queue_prune() latency is 200% worse:
Or even worse - 300% (3/1)?
>
> It fires every 200ms and scans 40% of hash table each time, listener
> socket held.
If this becomes a real issue we could decrease TCP_SYNQ_INTERVAL,
essentially making the keepalive timer fire more often, but each time with
less work to do?
Also why is
budget = 2 * (lopt->nr_table_entries / (timeout / interval));
rather than
budget = (lopt->nr_table_entries / (timeout / interval)) + 1;
?
Acked-by: Jerry Chu <hkchu@google.com>
>
> include/net/tcp.h | 2 +-
> net/ipv4/inet_connection_sock.c | 2 +-
> net/ipv4/tcp_input.c | 4 ++--
> 3 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index f75a04d..057f016 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -123,7 +123,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
> #endif
> #define TCP_RTO_MAX ((unsigned)(120*HZ))
> #define TCP_RTO_MIN ((unsigned)(HZ/5))
> -#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ)) /* RFC2988bis initial RTO value */
> +#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ)) /* RFC6298 2.1 initial RTO value */
> #define TCP_TIMEOUT_FALLBACK ((unsigned)(3*HZ)) /* RFC 1122 initial RTO value, now
> * used as a fallback RTO for the
> * initial data transmission if no
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 19d66ce..c12396f 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -514,7 +514,7 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
>
> /* Normally all the openreqs are young and become mature
> * (i.e. converted to established socket) for first timeout.
> - * If synack was not acknowledged for 3 seconds, it means
> + * If synack was not acknowledged for 1 second, it means
> * one of the following things: synack was lost, ack was lost,
> * rtt is high or nobody planned to ack (i.e. synflood).
> * When server is a bit loaded, queue is populated with old
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index e886e2f..9147c27 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -933,7 +933,7 @@ static void tcp_init_metrics(struct sock *sk)
> tcp_set_rto(sk);
> reset:
> if (tp->srtt == 0) {
> - /* RFC2988bis: We've failed to get a valid RTT sample from
> + /* RFC6298: 5.7 We've failed to get a valid RTT sample from
> * 3WHS. This is most likely due to retransmission,
> * including spurious one. Reset the RTO back to 3secs
> * from the more aggressive 1sec to avoid more spurious
> @@ -943,7 +943,7 @@ reset:
> inet_csk(sk)->icsk_rto = TCP_TIMEOUT_FALLBACK;
> }
> /* Cut cwnd down to 1 per RFC5681 if SYN or SYN-ACK has been
> - * retransmitted. In light of RFC2988bis' more aggressive 1sec
> + * retransmitted. In light of RFC6298 more aggressive 1sec
> * initRTO, we only reset cwnd when more than 1 SYN/SYN-ACK
> * retransmission has occurred.
> */
>
>
* Re: [PATCH net-next] tcp: RFC6298 supersedes RFC2988bis
2012-04-15 18:33 ` Jerry Chu
@ 2012-04-15 20:01 ` Eric Dumazet
2012-04-16 2:21 ` Jerry Chu
0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2012-04-15 20:01 UTC
To: Jerry Chu; +Cc: David Miller, netdev, Tom Herbert
On Sun, 2012-04-15 at 11:33 -0700, Jerry Chu wrote:
> [send again - it looks like my previous comment was lost...]
>
> On Thu, Apr 12, 2012 at 10:48 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Updates some comments to track RFC6298
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: H.K. Jerry Chu <hkchu@google.com>
> > Cc: Tom Herbert <therbert@google.com>
> > ---
> > BTW, one side effect of the TCP_TIMEOUT_INIT change (3 -> 1) is
> > inet_csk_reqsk_queue_prune() latency is 200% worse:
>
> Or even worse - 300% (3/1)?
well, 3 instead of 1 is a 200% increase ;)
>
> >
> > It fires every 200ms and scans 40% of hash table each time, listener
> > socket held.
>
> If this becomes a real issue we could decrease TCP_SYNQ_INTERVAL,
> essentially making the keepalive timer fire more often, but each time with
> less work to do?
Hmm... 200ms is already aggressive for power saving
>
> Also why is
> budget = 2 * (lopt->nr_table_entries / (timeout / interval));
> rather than
> budget = (lopt->nr_table_entries / (timeout / interval)) + 1;
> ?
That's because if we do that, retransmits could be delayed by 100%,
instead of 50% with this solution.
(right now it takes 2.5 rounds to scan the whole table, so a one-second 'timer'
can fire after 1.6 seconds)
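The comparison between the two budget formulas can be checked with a little arithmetic. The sketch below is a user-space illustration with hypothetical helper names and an assumed 512-bucket table; it counts how many 200 ms timer runs are needed to visit every bucket once, which bounds how late a given request can be noticed.

```c
/* Timer runs needed to visit every bucket once (rounded up). */
static unsigned long full_scan_runs(unsigned long nr_table_entries,
				    unsigned long budget)
{
	return (nr_table_entries + budget - 1) / budget;
}

/* Worst-case extra delay, in milliseconds, before a bucket is revisited. */
static unsigned long full_scan_delay_ms(unsigned long nr_table_entries,
					unsigned long budget,
					unsigned long interval_ms)
{
	return full_scan_runs(nr_table_entries, budget) * interval_ms;
}
```

With 512 buckets and timeout / interval = 5, the current formula gives a budget of 2 * (512 / 5) = 204, so a full scan takes 3 runs, or 600 ms, which is consistent with a 1 s timer firing after up to 1.6 s. The alternative (512 / 5) + 1 = 103 needs 5 runs, or 1000 ms, which is the 100% delay mentioned above.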
* Re: [PATCH net-next] tcp: RFC6298 supersedes RFC2988bis
2012-04-15 20:01 ` Eric Dumazet
@ 2012-04-16 2:21 ` Jerry Chu
0 siblings, 0 replies; 7+ messages in thread
From: Jerry Chu @ 2012-04-16 2:21 UTC
To: Eric Dumazet; +Cc: David Miller, netdev, Tom Herbert
On Sun, Apr 15, 2012 at 1:01 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sun, 2012-04-15 at 11:33 -0700, Jerry Chu wrote:
>> [send again - it looks like my previous comment was lost...]
>>
>> On Thu, Apr 12, 2012 at 10:48 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > Updates some comments to track RFC6298
>> >
>> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>> > Cc: H.K. Jerry Chu <hkchu@google.com>
>> > Cc: Tom Herbert <therbert@google.com>
>> > ---
>> > BTW, one side effect of the TCP_TIMEOUT_INIT change (3 -> 1) is
>> > inet_csk_reqsk_queue_prune() latency is 200% worse:
>>
>> Or even worse - 300% (3/1)?
>
> well, 3 instead of 1 is a 200% increase ;)
>
>>
>> >
>> > It fires every 200ms and scans 40% of hash table each time, listener
>> > socket held.
>>
>> If this becomes a real issue we could decrease TCP_SYNQ_INTERVAL,
>> essentially making the keepalive timer fire more often, but each time with
>> less work to do?
>
> Hmm... 200ms is already aggressive for power saving
Not sure how much power saving one can attain when syn_table is non-empty
anyway.
>
>>
>> Also why is
>> budget = 2 * (lopt->nr_table_entries / (timeout / interval));
>> rather than
>> budget = (lopt->nr_table_entries / (timeout / interval)) + 1;
>> ?
>
> That's because if we do that, retransmits could be delayed by 100%,
> instead of 50% with this solution.
Oh, that's right - the delay can be up to the time to scan the whole table,
so it's a balance between how much delay vs the processing overhead
in the current data structure...
Thanks,
Jerry
>
> (right now it takes 2.5 rounds to scan whole table, so a one sec 'timer'
> can be fired after 1.6 second)
>
>
>