* [PATCH net-next v2] tcp: add rfc3168, section 6.1.1.1. fallback
@ 2015-05-19 19:04 Daniel Borkmann
2015-05-19 20:50 ` [PATCH net-next v3] " Eric Dumazet
2015-05-19 20:54 ` David Miller
From: Daniel Borkmann @ 2015-05-19 19:04 UTC
To: davem
Cc: netdev, Daniel Borkmann, Florian Westphal, Mirja Kühlewind,
Brian Trammell, Eric Dumazet, Dave That
This work is a follow-up to commit f7b3bec6f516 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet on the first timeout, as suggested by the RFC:
  [...] A host that receives no reply to an ECN-setup SYN within the
  normal SYN retransmission timeout interval MAY resend the SYN and
  any subsequent SYN retransmissions with CWR and ECE cleared. [...]
Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3c9 ("tcp: extend
ECN sysctl to allow server-side only ECN"):
1) Normal ECN-capable path:
    SYN ECE CWR ----->
                <----- SYN ACK ECE
            ACK ----->
2) Path with broken middlebox, when client has fallback:
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
            SYN ----->
                <----- SYN ACK
            ACK ----->
Without the fallback implemented, every retransmission would be dropped at
the same middlebox, and the connection attempt would end up as:
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
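The retransmission behaviour in the diagrams above can be sketched in a few lines of plain C. This is only an illustrative user-space model, not the kernel code: rtx_flags() is a hypothetical helper, the ecn_fallback parameter stands in for the sysctl, and TCPHDR_ACK is included only so that a non-qualifying segment can be shown. The ECE and CWR values are the ones from include/net/tcp.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* TCP header flag bits; ECE and CWR match the values used in the patch. */
#define TCPHDR_SYN	0x02
#define TCPHDR_ACK	0x10
#define TCPHDR_ECE	0x40
#define TCPHDR_CWR	0x80
#define TCPHDR_SYN_ECN	(TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)

/* Hypothetical helper: flags to put on a segment that is being
 * retransmitted. Only an ECN-setup SYN (SYN, ECE and CWR all set)
 * is rewritten, and only if the fallback is enabled.
 */
static uint8_t rtx_flags(uint8_t flags, bool ecn_fallback)
{
	if ((flags & TCPHDR_SYN_ECN) == TCPHDR_SYN_ECN && ecn_fallback)
		flags &= (uint8_t)~(TCPHDR_ECE | TCPHDR_CWR);
	return flags;
}
```

With the fallback enabled, rtx_flags(TCPHDR_SYN_ECN, true) yields a plain SYN (case 2 above); with it disabled, the ECN-setup SYN is resent unchanged (the third diagram). A segment that does not carry all three bits, such as SYN|ACK|ECE, is left alone by the mask comparison.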
In any case, only a small percentage of sites would see such additional
setup latency: at the end of 2014, ~56% of IPv4 and 65% of IPv6 servers on
the Alexa 1 million list were found to negotiate ECN (aka tcp_ecn=2
default), while 0.42% of these webservers failed to connect when the client
tried to negotiate ECN (tcp_ecn=1) due to timeouts, which the fallback
would mitigate with a slight latency trade-off. A recent related paper on
this topic:
Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
Gorry Fairhurst, and Richard Scheffenegger:
"Enabling Internet-Wide Deployment of Explicit Congestion Notification."
Proc. PAM 2015, New York.
http://ecn.ethz.ch/ecn-pam15.pdf
Thus, when net.ipv4.tcp_ecn=1 is set, the patch performs the RFC3168,
section 6.1.1.1. fallback on timeout. For users who explicitly do not want
this, which can be the case in DC deployments, we add a
net.ipv4.tcp_ecn_fallback knob that allows disabling the fallback.
tp->ecn_flags are not cleared in tcp_ecn_clear_syn() on output; rather, we
let tcp_ecn_rcv_synack() take care of that on the input path in case a
SYN ACK ECE was merely delayed. Thus, a spurious SYN retransmission will not
prevent ECN from being negotiated eventually in that case.
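A minimal sketch of that input-path decision, modelled on but not copied from tcp_ecn_rcv_synack(). The names struct conn, struct synack_hdr and ecn_rcv_synack() are hypothetical, and the numeric value of TCP_ECN_OK is an assumption of this sketch. The client keeps its "ECN requested" state after the SYN flags were cleared on retransmit, and only the received SYN ACK decides the outcome.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for the sketch; only the name mirrors the kernel flag. */
#define TCP_ECN_OK	1

/* Hypothetical, simplified stand-ins for the socket and TCP header. */
struct conn {
	uint8_t ecn_flags;
};

struct synack_hdr {
	bool ece;
	bool cwr;
};

/* An ECN-setup SYN ACK has ECE set and CWR clear. Anything else
 * means the peer (or the path) did not agree to ECN, so the
 * client drops its ECN state at this point, not earlier.
 */
static void ecn_rcv_synack(struct conn *c, const struct synack_hdr *th)
{
	if ((c->ecn_flags & TCP_ECN_OK) && (!th->ece || th->cwr))
		c->ecn_flags &= (uint8_t)~TCP_ECN_OK;
}
```

If a delayed SYN ACK ECE answering the original ECN-setup SYN arrives after the plain SYN was retransmitted, the state is still set and ECN remains negotiated. A plain SYN ACK clears it.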
Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Dave That <dave.taht@gmail.com>
---
v1 -> v2:
- Added suggestion from Eric to let ecn_flags be cleared eventually
in tcp_ecn_rcv_synack(), thanks!
- Rest as is.
Documentation/networking/dctcp.txt | 1 +
Documentation/networking/ip-sysctl.txt | 9 +++++++++
include/net/netns/ipv4.h | 2 ++
include/net/tcp.h | 2 ++
net/ipv4/sysctl_net_ipv4.c | 7 +++++++
net/ipv4/tcp_ipv4.c | 5 ++++-
net/ipv4/tcp_output.c | 13 +++++++++++++
7 files changed, 38 insertions(+), 1 deletion(-)
diff --git a/Documentation/networking/dctcp.txt b/Documentation/networking/dctcp.txt
index 0d5dfbc..cd9d3eb 100644
--- a/Documentation/networking/dctcp.txt
+++ b/Documentation/networking/dctcp.txt
@@ -8,6 +8,7 @@ the data center network to provide multi-bit feedback to the end hosts.
To enable it on end hosts:
sysctl -w net.ipv4.tcp_congestion_control=dctcp
+ sysctl -w net.ipv4.tcp_ecn_fallback=0 (optional)
All switches in the data center network running DCTCP must support ECN
marking and be configured for marking when reaching defined switch buffer
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 5095c63..cb083e0 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -267,6 +267,15 @@ tcp_ecn - INTEGER
but do not request ECN on outgoing connections.
Default: 2
+tcp_ecn_fallback - BOOLEAN
+ If the kernel detects that ECN connection misbehaves, enable fall
+ back to non-ECN. Currently, this knob implements the fallback
+ from RFC3168, section 6.1.1.1., but we reserve that in future,
+ additional detection mechanisms could be implemented under this
+ knob. The value is not used, if tcp_ecn or per route (or congestion
+ control) ECN settings are disabled.
+ Default: 1 (fallback enabled)
+
tcp_fack - BOOLEAN
Enable FACK congestion avoidance and fast retransmission.
The value is not used, if tcp_sack is not enabled.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 614a49b..6848b8b 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -77,6 +77,8 @@ struct netns_ipv4 {
struct local_ports ip_local_ports;
int sysctl_tcp_ecn;
+ int sysctl_tcp_ecn_fallback;
+
int sysctl_ip_no_pmtu_disc;
int sysctl_ip_fwd_use_pmtu;
int sysctl_ip_nonlocal_bind;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7ace6ac..3275f93 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -702,6 +702,8 @@ static inline u32 tcp_skb_timestamp(const struct sk_buff *skb)
#define TCPHDR_ECE 0x40
#define TCPHDR_CWR 0x80
+#define TCPHDR_SYN_ECN (TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)
+
/* This is what the send packet queuing engine uses to pass
* TCP per-packet control information to the transmission code.
* We also store the host-order sequence numbers in here too.
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index c3852a7..841de32 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -821,6 +821,13 @@ static struct ctl_table ipv4_net_table[] = {
.proc_handler = proc_dointvec
},
{
+ .procname = "tcp_ecn_fallback",
+ .data = &init_net.ipv4.sysctl_tcp_ecn_fallback,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec
+ },
+ {
.procname = "ip_local_port_range",
.maxlen = sizeof(init_net.ipv4.ip_local_ports.range),
.data = &init_net.ipv4.ip_local_ports.range,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 91cb476..0cc4b5a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2411,12 +2411,15 @@ static int __net_init tcp_sk_init(struct net *net)
goto fail;
*per_cpu_ptr(net->ipv4.tcp_sk, cpu) = sk;
}
+
net->ipv4.sysctl_tcp_ecn = 2;
+ net->ipv4.sysctl_tcp_ecn_fallback = 1;
+
net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
- return 0;
+ return 0;
fail:
tcp_sk_exit(net);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7386d32..a057054 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -350,6 +350,15 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
 	}
 }
 
+static void tcp_ecn_clear_syn(struct sock *sk, struct sk_buff *skb)
+{
+	if (sock_net(sk)->ipv4.sysctl_tcp_ecn_fallback)
+		/* tp->ecn_flags are cleared at a later point in time when
+		 * SYN ACK is ultimatively being received.
+		 */
+		TCP_SKB_CB(skb)->tcp_flags &= ~(TCPHDR_ECE | TCPHDR_CWR);
+}
+
 static void
 tcp_ecn_make_synack(const struct request_sock *req, struct tcphdr *th,
 		    struct sock *sk)
@@ -2615,6 +2624,10 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 		}
 	}
 
+	/* RFC3168, section 6.1.1.1. ECN fallback */
+	if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN_ECN) == TCPHDR_SYN_ECN)
+		tcp_ecn_clear_syn(sk, skb);
+
 	tcp_retrans_try_collapse(sk, skb, cur_mss);
 
 	/* Make a copy, if the first transmission SKB clone we made
1.9.3
* Re: [PATCH net-next v3] tcp: add rfc3168, section 6.1.1.1. fallback
2015-05-19 19:04 [PATCH net-next v2] tcp: add rfc3168, section 6.1.1.1. fallback Daniel Borkmann
@ 2015-05-19 20:50 ` Eric Dumazet
2015-05-19 20:54 ` David Miller
From: Eric Dumazet @ 2015-05-19 20:50 UTC
To: Daniel Borkmann
Cc: davem, netdev, Florian Westphal, Mirja Kühlewind,
Brian Trammell, Eric Dumazet, Dave Täht
On Tue, 2015-05-19 at 21:33 +0200, Daniel Borkmann wrote:
...
>
> Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
> Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
> Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Dave Täht <dave.taht@gmail.com>
> ---
> v2 -> v3:
> - Very sorry. Typo happened in Dave's name since v1, getting it right
> this time, no bad intentions. ;)
> v1 -> v2:
> - Added suggestion from Eric to let ecn_flags be cleared eventually in
> tcp_ecn_rcv_synack(), thanks!
> - Rest as is.
Acked-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH net-next v3] tcp: add rfc3168, section 6.1.1.1. fallback
2015-05-19 19:04 [PATCH net-next v2] tcp: add rfc3168, section 6.1.1.1. fallback Daniel Borkmann
2015-05-19 20:50 ` [PATCH net-next v3] " Eric Dumazet
@ 2015-05-19 20:54 ` David Miller
2015-05-20 18:13 ` Vijay Subramanian
From: David Miller @ 2015-05-19 20:54 UTC
To: daniel; +Cc: netdev, fw, mirja.kuehlewind, trammell, edumazet, dave.taht
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Tue, 19 May 2015 21:33:42 +0200
> This work as a follow-up of commit f7b3bec6f516 ("net: allow setting ecn
> via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
> ECN connections. In other words, this work adds a retry with a non-ECN
> setup SYN packet, as suggested from the RFC on the first timeout:
>
> [...] A host that receives no reply to an ECN-setup SYN within the
> normal SYN retransmission timeout interval MAY resend the SYN and
> any subsequent SYN retransmissions with CWR and ECE cleared. [...]
...
> Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
> Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
> Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Dave Täht <dave.taht@gmail.com>
> ---
> v2 -> v3:
> - Very sorry. Typo happened in Dave's name since v1, getting it right
> this time, no bad intentions. ;)
> v1 -> v2:
> - Added suggestion from Eric to let ecn_flags be cleared eventually in
> tcp_ecn_rcv_synack(), thanks!
> - Rest as is.
Applied, thanks everyone.
* Re: [PATCH net-next v3] tcp: add rfc3168, section 6.1.1.1. fallback
2015-05-19 20:54 ` David Miller
@ 2015-05-20 18:13 ` Vijay Subramanian
2015-05-20 18:37 ` Eric Dumazet
From: Vijay Subramanian @ 2015-05-20 18:13 UTC
To: David Miller
Cc: Daniel Borkmann, netdev, fw, mirja.kuehlewind, trammell,
Eric Dumazet, Dave Taht
Hi Daniel,
With this commit, ifconfig does not show any of the interfaces and I
don't have any connectivity as a result.
Can you double check this?
Thanks!
Vijay
On 19 May 2015 at 13:54, David Miller <davem@davemloft.net> wrote:
> From: Daniel Borkmann <daniel@iogearbox.net>
> Date: Tue, 19 May 2015 21:33:42 +0200
>
>> This work as a follow-up of commit f7b3bec6f516 ("net: allow setting ecn
>> via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
>> ECN connections. In other words, this work adds a retry with a non-ECN
>> setup SYN packet, as suggested from the RFC on the first timeout:
>>
>> [...] A host that receives no reply to an ECN-setup SYN within the
>> normal SYN retransmission timeout interval MAY resend the SYN and
>> any subsequent SYN retransmissions with CWR and ECE cleared. [...]
> ...
>> Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
>> Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> Signed-off-by: Florian Westphal <fw@strlen.de>
>> Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
>> Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Cc: Dave Täht <dave.taht@gmail.com>
>> ---
>> v2 -> v3:
>> - Very sorry. Typo happened in Dave's name since v1, getting it right
>> this time, no bad intentions. ;)
>> v1 -> v2:
>> - Added suggestion from Eric to let ecn_flags be cleared eventually in
>> tcp_ecn_rcv_synack(), thanks!
>> - Rest as is.
>
> Applied, thanks everyone.
* Re: [PATCH net-next v3] tcp: add rfc3168, section 6.1.1.1. fallback
2015-05-20 18:13 ` Vijay Subramanian
@ 2015-05-20 18:37 ` Eric Dumazet
2015-05-21 18:10 ` Vijay Subramanian
From: Eric Dumazet @ 2015-05-20 18:37 UTC
To: Vijay Subramanian
Cc: David Miller, Daniel Borkmann, netdev, fw, mirja.kuehlewind,
trammell, Eric Dumazet, Dave Taht
On Wed, 2015-05-20 at 11:13 -0700, Vijay Subramanian wrote:
> Hi Daniel,
>
> With this commit, ifconfig does not show any of the interfaces and I
> don't have any connectivity as a result.
> Can you double check this?
Please do not top post.
No problem here. I do not see obvious reasons for breaking your setup.
* Re: [PATCH net-next v3] tcp: add rfc3168, section 6.1.1.1. fallback
2015-05-20 18:37 ` Eric Dumazet
@ 2015-05-21 18:10 ` Vijay Subramanian
From: Vijay Subramanian @ 2015-05-21 18:10 UTC
To: Eric Dumazet
Cc: David Miller, Daniel Borkmann, netdev, fw, mirja.kuehlewind,
trammell, Eric Dumazet, Dave Taht
>
> Please do not top post.
My fault. Will be more careful in future.
>
> No problem here. I do not see obvious reasons for breaking your setup.
>
It was a problem with my network driver not getting installed due to
some symbol mismatch after a compile. It got sorted out after I
cleaned up everything. This was a false alarm. Apologies for the
noise.
Vijay