All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCHv3 net-next 0/4] Improvements for TCP PMTU
@ 2015-02-28  3:23 Fan Du
  2015-02-28  3:23 ` [PATCHv3 net-next 1/4] ipv4: Raise tcp PMTU probe mss base size Fan Du
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Fan Du @ 2015-02-28  3:23 UTC (permalink / raw)
  To: johnwheffner; +Cc: edumazet, davem, netdev, fengyuleidian0615

Hi,

This patchset performs some improvements and enhancement
for current TCP PMTU as per RFC4821 with the aim to find
optimal mms size quickly, and also be adaptive to route
changes like enlarged path MTU. Then TCP PMTU could be
used to probe a effective pmtu in absence of ICMP message
for tunnels(e.g. vxlan) across different networking stack.

Patch1/4: Set probe mss base to 1024 Bytes per RFC4821
Patch2/4: Do not double probe_size for each probing,
          use a simple binary search to gain maximum performance.
Patch3/4: When it comes to blackhole detection, shrink current
	  mss for next probing.
Patch4/4: Create a probe timer to detect enlarged path MTU.

Changelog:
v3:
   - Update commit message for patch2
   - Fix pseudo timer delta caculation in patch4
v2:
   - Introduce sysctl_tcp_probe_threshold to control when
     probing will stop, as suggested by John Heffner.
   - Add patch3 to shrink current mss value for search low boundary.
   - Drop cannonical timer usages, implements pseudo timer based on
     32bits jiffies tcp_time_stamp, as suggested by Eric Dumazet.

Fan Du (4):
  ipv4: Raise tcp PMTU probe mss base size
  ipv4: Use binary search to choose tcp PMTU probe_size
  ipv4: shrink current mss for tcp PMTU blackhole detection
  ipv4: Create probe timer for tcp PMTU as per RFC4821

 include/net/inet_connection_sock.h |    2 +
 include/net/netns/ipv4.h           |    2 +
 include/net/tcp.h                  |    8 +++++-
 net/ipv4/sysctl_net_ipv4.c         |   14 +++++++++++
 net/ipv4/tcp_ipv4.c                |    2 +
 net/ipv4/tcp_output.c              |   46 ++++++++++++++++++++++++++++++++---
 net/ipv4/tcp_timer.c               |    9 ++++--
 7 files changed, 75 insertions(+), 8 deletions(-)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCHv3 net-next 1/4] ipv4: Raise tcp PMTU probe mss base size
  2015-02-28  3:23 [PATCHv3 net-next 0/4] Improvements for TCP PMTU Fan Du
@ 2015-02-28  3:23 ` Fan Du
  2015-02-28  3:23 ` [PATCHv3 net-next 2/4] ipv4: Use binary search to choose tcp PMTU probe_size Fan Du
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Fan Du @ 2015-02-28  3:23 UTC (permalink / raw)
  To: johnwheffner; +Cc: edumazet, davem, netdev, fengyuleidian0615

Quotes from RFC4821 7.2.  Selecting Initial Values

   It is RECOMMENDED that search_low be initially set to an MTU size
   that is likely to work over a very wide range of environments.  Given
   today's technologies, a value of 1024 bytes is probably safe enough.
   The initial value for search_low SHOULD be configurable.

Moreover, set a small value will introduce extra time for the search
to converge. So set the initial probe base mss size to 1024 Bytes.

Signed-off-by: Fan Du <fan.du@intel.com>
---
 include/net/tcp.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 8d6b983..7b57e5b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -65,7 +65,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCP_MIN_MSS		88U
 
 /* The least MTU to use for probing */
-#define TCP_BASE_MSS		512
+#define TCP_BASE_MSS		1024
 
 /* After receiving this amount of duplicate ACKs fast retransmit starts. */
 #define TCP_FASTRETRANS_THRESH 3
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCHv3 net-next 2/4] ipv4: Use binary search to choose tcp PMTU probe_size
  2015-02-28  3:23 [PATCHv3 net-next 0/4] Improvements for TCP PMTU Fan Du
  2015-02-28  3:23 ` [PATCHv3 net-next 1/4] ipv4: Raise tcp PMTU probe mss base size Fan Du
@ 2015-02-28  3:23 ` Fan Du
  2015-02-28 23:20   ` John Heffner
  2015-02-28  3:23 ` [PATCHv3 net-next 3/4] ipv4: shrink current mss for tcp PMTU blackhole detection Fan Du
  2015-02-28  3:23 ` [PATCHv3 net-next 4/4] ipv4: Create probe timer for tcp PMTU as per RFC4821 Fan Du
  3 siblings, 1 reply; 11+ messages in thread
From: Fan Du @ 2015-02-28  3:23 UTC (permalink / raw)
  To: johnwheffner; +Cc: edumazet, davem, netdev, fengyuleidian0615

Current probe_size is chosen by doubling mss_cache,
the probing process will end shortly with a sub-optimal
mss size, and the link mtu will not be taken full
advantage of, in return, this will make user to tweak
tcp_base_mss with care.

Use binary search to choose probe_size in a fine
granularity manner, an optimal mss will be found
to boost performance as its maxmium.

In addition, introduce a sysctl_tcp_probe_threshold
to control when probing will stop in respect to
the width of search range.

Test env:
Docker instance with vxlan encapuslation(82599EB)
iperf -c 10.0.0.24  -t 60

before this patch:
1.26 Gbits/sec

After this patch: increase 26%
1.59 Gbits/sec

Signed-off-by: Fan Du <fan.du@intel.com>
---
v3:
  - Fix commit message
v2:
  - Use sysctl_tcp_probe_threshold to control when probing
    will stop wrt interval between search high and search low.
---
 include/net/netns/ipv4.h   |    1 +
 include/net/tcp.h          |    3 +++
 net/ipv4/sysctl_net_ipv4.c |    7 +++++++
 net/ipv4/tcp_ipv4.c        |    1 +
 net/ipv4/tcp_output.c      |   14 +++++++++++---
 5 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index dbe2254..25200d4 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -84,6 +84,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_fwmark_accept;
 	int sysctl_tcp_mtu_probing;
 	int sysctl_tcp_base_mss;
+	int sysctl_tcp_probe_threshold;
 
 	struct ping_group_range ping_group_range;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7b57e5b..d269c91 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -67,6 +67,9 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 /* The least MTU to use for probing */
 #define TCP_BASE_MSS		1024
 
+/* Specify interval when tcp mtu probing will stop */
+#define TCP_PROBE_THRESHOLD	8
+
 /* After receiving this amount of duplicate ACKs fast retransmit starts. */
 #define TCP_FASTRETRANS_THRESH 3
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d151539..d3c09c1 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -883,6 +883,13 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "tcp_probe_threshold",
+		.data		= &init_net.ipv4.sysctl_tcp_probe_threshold,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 5a2dfed..35790d9 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2460,6 +2460,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	}
 	net->ipv4.sysctl_tcp_ecn = 2;
 	net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
+	net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
 	return 0;
 
 fail:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a2a796c..c418829 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1837,11 +1837,13 @@ static int tcp_mtu_probe(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct sk_buff *skb, *nskb, *next;
+	struct net *net = sock_net(sk);
 	int len;
 	int probe_size;
 	int size_needed;
 	int copy;
 	int mss_now;
+	int interval;
 
 	/* Not currently probing/verifying,
 	 * not in recovery,
@@ -1854,11 +1856,17 @@ static int tcp_mtu_probe(struct sock *sk)
 	    tp->rx_opt.num_sacks || tp->rx_opt.dsack)
 		return -1;
 
-	/* Very simple search strategy: just double the MSS. */
+	/* Use binary search for probe_size bewteen tcp_mss_base,
+	 * and current mss_clamp. if (search_high - search_low)
+	 * smaller than a threshold, backoff from probing.
+	 */
 	mss_now = tcp_current_mss(sk);
-	probe_size = 2 * tp->mss_cache;
+	probe_size = (icsk->icsk_mtup.search_high +
+		      icsk->icsk_mtup.search_low) >> 1;
 	size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
-	if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high)) {
+	interval = icsk->icsk_mtup.search_high - icsk->icsk_mtup.search_low;
+	if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high) ||
+	    interval < net->ipv4.sysctl_tcp_probe_threshold) {
 		/* TODO: set timer for probe_converge_event */
 		return -1;
 	}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCHv3 net-next 3/4] ipv4: shrink current mss for tcp PMTU blackhole detection
  2015-02-28  3:23 [PATCHv3 net-next 0/4] Improvements for TCP PMTU Fan Du
  2015-02-28  3:23 ` [PATCHv3 net-next 1/4] ipv4: Raise tcp PMTU probe mss base size Fan Du
  2015-02-28  3:23 ` [PATCHv3 net-next 2/4] ipv4: Use binary search to choose tcp PMTU probe_size Fan Du
@ 2015-02-28  3:23 ` Fan Du
  2015-02-28 23:39   ` John Heffner
  2015-02-28  3:23 ` [PATCHv3 net-next 4/4] ipv4: Create probe timer for tcp PMTU as per RFC4821 Fan Du
  3 siblings, 1 reply; 11+ messages in thread
From: Fan Du @ 2015-02-28  3:23 UTC (permalink / raw)
  To: johnwheffner; +Cc: edumazet, davem, netdev, fengyuleidian0615

Reducing current tcp mss instead of search_low will make
more sense, as current tcp mss still got lost.

In addition, rename tcp_mtu_probing to tcp_blackhole_probe
to clearify confusion between tcp_mtu_probing and
tcp_mtu_probe.

Signed-off-by: Fan Du <fan.du@intel.com>
---
 net/ipv4/tcp_timer.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 0732b78..3465175 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -99,7 +99,8 @@ static int tcp_orphan_retries(struct sock *sk, int alive)
 	return retries;
 }
 
-static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
+static void tcp_blackhole_probe(struct inet_connection_sock *icsk,
+				struct sock *sk)
 {
 	struct net *net = sock_net(sk);
 
@@ -113,7 +114,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
 			struct tcp_sock *tp = tcp_sk(sk);
 			int mss;
 
-			mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
+			/* try mss smaller than current mss */
+			mss = tcp_current_mss(sk) >> 1;
 			mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
 			mss = max(mss, 68 - tp->tcp_header_len);
 			icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
@@ -176,7 +178,7 @@ static int tcp_write_timeout(struct sock *sk)
 	} else {
 		if (retransmits_timed_out(sk, sysctl_tcp_retries1, 0, 0)) {
 			/* Black hole detection */
-			tcp_mtu_probing(icsk, sk);
+			tcp_blackhole_probe(icsk, sk);
 
 			dst_negative_advice(sk);
 		}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCHv3 net-next 4/4] ipv4: Create probe timer for tcp PMTU as per RFC4821
  2015-02-28  3:23 [PATCHv3 net-next 0/4] Improvements for TCP PMTU Fan Du
                   ` (2 preceding siblings ...)
  2015-02-28  3:23 ` [PATCHv3 net-next 3/4] ipv4: shrink current mss for tcp PMTU blackhole detection Fan Du
@ 2015-02-28  3:23 ` Fan Du
  3 siblings, 0 replies; 11+ messages in thread
From: Fan Du @ 2015-02-28  3:23 UTC (permalink / raw)
  To: johnwheffner; +Cc: edumazet, davem, netdev, fengyuleidian0615

As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
be armed once probing has converged. Once this timer expired,
probing again to take advantage of any path PMTU change. The
recommended probing interval is 10 minutes per RFC1981. Probing
interval could be sysctled by sysctl_tcp_probe_interval.

Eric Dumazet suggested to implement pseudo timer based on 32bits
jiffies tcp_time_stamp instead of using classic timer for such
rare event.

Signed-off-by: Fan Du <fan.du@intel.com>
---
v3:
  - Fix pseudo timer delta caculation.
v2:
  - Create a pseudo timer based on 32bits jiffies tcp_time_stamp
    to control when to reprobing as suggested by Eric.
---
 include/net/inet_connection_sock.h |    2 ++
 include/net/netns/ipv4.h           |    1 +
 include/net/tcp.h                  |    3 +++
 net/ipv4/sysctl_net_ipv4.c         |    7 +++++++
 net/ipv4/tcp_ipv4.c                |    1 +
 net/ipv4/tcp_output.c              |   34 ++++++++++++++++++++++++++++++++--
 net/ipv4/tcp_timer.c               |    1 +
 7 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 5976bde..b9a6b0a 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -126,6 +126,8 @@ struct inet_connection_sock {
 
 		/* Information on the current probe. */
 		int		  probe_size;
+
+		u32		  probe_timestamp;
 	} icsk_mtup;
 	u32			  icsk_ca_priv[16];
 	u32			  icsk_user_timeout;
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 25200d4..18ebc99 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -85,6 +85,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_mtu_probing;
 	int sysctl_tcp_base_mss;
 	int sysctl_tcp_probe_threshold;
+	u32 sysctl_tcp_probe_interval;
 
 	struct ping_group_range ping_group_range;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index d269c91..dd6adbc 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -67,6 +67,9 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 /* The least MTU to use for probing */
 #define TCP_BASE_MSS		1024
 
+/* probing interval, default to 10 minutes as per RFC4821 */
+#define TCP_PROBE_INTERVAL	600
+
 /* Specify interval when tcp mtu probing will stop */
 #define TCP_PROBE_THRESHOLD	8
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d3c09c1..fdf8991 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -890,6 +890,13 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "tcp_probe_interval",
+		.data		= &init_net.ipv4.sysctl_tcp_probe_interval,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 35790d9..f0c6fc3 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2461,6 +2461,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_ecn = 2;
 	net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
 	net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
+	net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
 	return 0;
 
 fail:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c418829..f9b2906 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1354,6 +1354,8 @@ void tcp_mtup_init(struct sock *sk)
 			       icsk->icsk_af_ops->net_header_len;
 	icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, net->ipv4.sysctl_tcp_base_mss);
 	icsk->icsk_mtup.probe_size = 0;
+	if (icsk->icsk_mtup.enabled)
+		icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
 }
 EXPORT_SYMBOL(tcp_mtup_init);
 
@@ -1823,6 +1825,31 @@ send_now:
 	return false;
 }
 
+static inline void tcp_mtu_check_reprobe(struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct net *net = sock_net(sk);
+	u32 interval;
+	s32 delta;
+
+	interval = net->ipv4.sysctl_tcp_probe_interval;
+	delta = tcp_time_stamp - icsk->icsk_mtup.probe_timestamp;
+	if (unlikely(delta >= interval * HZ)) {
+		int mss = tcp_current_mss(sk);
+
+		/* Update current search range */
+		icsk->icsk_mtup.search_high = tp->rx_opt.mss_clamp +
+			sizeof(struct tcphdr) +
+			icsk->icsk_af_ops->net_header_len;
+		icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
+		icsk->icsk_mtup.probe_size = 0;
+
+		/* Update probe time stamp */
+		icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
+	}
+}
+
 /* Create a new MTU probe if we are ready.
  * MTU probe is regularly attempting to increase the path MTU by
  * deliberately sending larger packets.  This discovers routing
@@ -1866,8 +1893,11 @@ static int tcp_mtu_probe(struct sock *sk)
 	size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
 	interval = icsk->icsk_mtup.search_high - icsk->icsk_mtup.search_low;
 	if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high) ||
-	    interval < net->ipv4.sysctl_tcp_probe_threshold) {
-		/* TODO: set timer for probe_converge_event */
+		interval < net->ipv4.sysctl_tcp_probe_threshold) {
+		/* Check whether enough time has elaplased for
+		 * another round of probing.
+		 */
+		tcp_mtu_check_reprobe(sk);
 		return -1;
 	}
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 3465175..da937fd 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -108,6 +108,7 @@ static void tcp_blackhole_probe(struct inet_connection_sock *icsk,
 	if (net->ipv4.sysctl_tcp_mtu_probing) {
 		if (!icsk->icsk_mtup.enabled) {
 			icsk->icsk_mtup.enabled = 1;
+			icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
 			tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
 		} else {
 			struct net *net = sock_net(sk);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCHv3 net-next 2/4] ipv4: Use binary search to choose tcp PMTU probe_size
  2015-02-28  3:23 ` [PATCHv3 net-next 2/4] ipv4: Use binary search to choose tcp PMTU probe_size Fan Du
@ 2015-02-28 23:20   ` John Heffner
  2015-03-02  9:29     ` Fan Du
  0 siblings, 1 reply; 11+ messages in thread
From: John Heffner @ 2015-02-28 23:20 UTC (permalink / raw)
  To: Fan Du; +Cc: edumazet, David Miller, Netdev, Fan Du

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index a2a796c..c418829 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1837,11 +1837,13 @@ static int tcp_mtu_probe(struct sock *sk)
>         struct tcp_sock *tp = tcp_sk(sk);
>         struct inet_connection_sock *icsk = inet_csk(sk);
>         struct sk_buff *skb, *nskb, *next;
> +       struct net *net = sock_net(sk);
>         int len;
>         int probe_size;
>         int size_needed;
>         int copy;
>         int mss_now;
> +       int interval;
>
>         /* Not currently probing/verifying,
>          * not in recovery,
> @@ -1854,11 +1856,17 @@ static int tcp_mtu_probe(struct sock *sk)
>             tp->rx_opt.num_sacks || tp->rx_opt.dsack)
>                 return -1;
>
> -       /* Very simple search strategy: just double the MSS. */
> +       /* Use binary search for probe_size bewteen tcp_mss_base,
> +        * and current mss_clamp. if (search_high - search_low)
> +        * smaller than a threshold, backoff from probing.
> +        */
>         mss_now = tcp_current_mss(sk);
> -       probe_size = 2 * tp->mss_cache;
> +       probe_size = (icsk->icsk_mtup.search_high +
> +                     icsk->icsk_mtup.search_low) >> 1;
>         size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
> -       if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high)) {
> +       interval = icsk->icsk_mtup.search_high - icsk->icsk_mtup.search_low;
> +       if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high) ||
> +           interval < net->ipv4.sysctl_tcp_probe_threshold) {
>                 /* TODO: set timer for probe_converge_event */
>                 return -1;
>         }

A couple things: the local variable probe_size here is TCP segment
size, while search_low and search_high are IP datagram sizes.  Use
tcp_mtu_to_mss to subtract headers.  Also, I think if you set
sysctl_tcp_probe_threshold <= 0, this will keep probing indefinitely
at search_low (not useful).  You probably want to test interval <
max(1, sysctl_tcp_probe_threshold).

  -John

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCHv3 net-next 3/4] ipv4: shrink current mss for tcp PMTU blackhole detection
  2015-02-28  3:23 ` [PATCHv3 net-next 3/4] ipv4: shrink current mss for tcp PMTU blackhole detection Fan Du
@ 2015-02-28 23:39   ` John Heffner
  2015-03-02  9:29     ` Fan Du
  0 siblings, 1 reply; 11+ messages in thread
From: John Heffner @ 2015-02-28 23:39 UTC (permalink / raw)
  To: Fan Du; +Cc: edumazet, David Miller, Netdev, Fan Du

On Fri, Feb 27, 2015 at 10:23 PM, Fan Du <fan.du@intel.com> wrote:
> Reducing current tcp mss instead of search_low will make
> more sense, as current tcp mss still got lost.

> @@ -113,7 +114,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
>                         struct tcp_sock *tp = tcp_sk(sk);
>                         int mss;
>
> -                       mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
> +                       /* try mss smaller than current mss */
> +                       mss = tcp_current_mss(sk) >> 1;
>                         mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
>                         mss = max(mss, 68 - tp->tcp_header_len);
>                         icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
> @@ -176,7 +178,7 @@ static int tcp_write_timeout(struct sock *sk)

I don't believe there's a problem with the code as written.  Modulo
headers, search_low <= mss =< search_high.  This code, on successive
timeouts, widens the search range by adjusting search_low (and mss)
downward.

  -John

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCHv3 net-next 2/4] ipv4: Use binary search to choose tcp PMTU probe_size
  2015-02-28 23:20   ` John Heffner
@ 2015-03-02  9:29     ` Fan Du
  0 siblings, 0 replies; 11+ messages in thread
From: Fan Du @ 2015-03-02  9:29 UTC (permalink / raw)
  To: John Heffner; +Cc: Fan Du, edumazet, David Miller, Netdev

于 2015年03月01日 07:20, John Heffner 写道:
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index a2a796c..c418829 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -1837,11 +1837,13 @@ static int tcp_mtu_probe(struct sock *sk)
>>          struct tcp_sock *tp = tcp_sk(sk);
>>          struct inet_connection_sock *icsk = inet_csk(sk);
>>          struct sk_buff *skb, *nskb, *next;
>> +       struct net *net = sock_net(sk);
>>          int len;
>>          int probe_size;
>>          int size_needed;
>>          int copy;
>>          int mss_now;
>> +       int interval;
>>
>>          /* Not currently probing/verifying,
>>           * not in recovery,
>> @@ -1854,11 +1856,17 @@ static int tcp_mtu_probe(struct sock *sk)
>>              tp->rx_opt.num_sacks || tp->rx_opt.dsack)
>>                  return -1;
>>
>> -       /* Very simple search strategy: just double the MSS. */
>> +       /* Use binary search for probe_size bewteen tcp_mss_base,
>> +        * and current mss_clamp. if (search_high - search_low)
>> +        * smaller than a threshold, backoff from probing.
>> +        */
>>          mss_now = tcp_current_mss(sk);
>> -       probe_size = 2 * tp->mss_cache;
>> +       probe_size = (icsk->icsk_mtup.search_high +
>> +                     icsk->icsk_mtup.search_low) >> 1;
>>          size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>> -       if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high)) {
>> +       interval = icsk->icsk_mtup.search_high - icsk->icsk_mtup.search_low;
>> +       if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high) ||
>> +           interval < net->ipv4.sysctl_tcp_probe_threshold) {
>>                  /* TODO: set timer for probe_converge_event */
>>                  return -1;
>>          }
>
> A couple things: the local variable probe_size here is TCP segment
> size, while search_low and search_high are IP datagram sizes.  Use
> tcp_mtu_to_mss to subtract headers.  Also, I think if you set
> sysctl_tcp_probe_threshold <= 0, this will keep probing indefinitely
> at search_low (not useful).  You probably want to test interval <
> max(1, sysctl_tcp_probe_threshold).
Thanks for the comments, will update in next version.
btw, I'm little confused about two points here:

1. Checking if there is enough user data available in write queue with size_needed,
    Is there any special consideration here involving (tp->reordering + 1) * tp->mss_cache ?
    Since the assembly always copies "probe_size" bytes data from the write queue.

    size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;

2. Traverse write queue to build probing packet with "probe_size" bytes segment,
    when will the final nskb len exceeding probe_size? which trigger the break in line:1971

1971                 if (len >= probe_size)
1972                         break;
1973         }
1974         tcp_init_tso_segs(sk, nskb, nskb->len);


>    -John
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCHv3 net-next 3/4] ipv4: shrink current mss for tcp PMTU blackhole detection
  2015-02-28 23:39   ` John Heffner
@ 2015-03-02  9:29     ` Fan Du
  2015-03-02 20:32       ` John Heffner
  0 siblings, 1 reply; 11+ messages in thread
From: Fan Du @ 2015-03-02  9:29 UTC (permalink / raw)
  To: John Heffner; +Cc: Fan Du, edumazet, David Miller, Netdev

于 2015年03月01日 07:39, John Heffner 写道:
> On Fri, Feb 27, 2015 at 10:23 PM, Fan Du<fan.du@intel.com>  wrote:
>> >Reducing current tcp mss instead of search_low will make
>> >more sense, as current tcp mss still got lost.
>> >@@ -113,7 +114,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
>> >                         struct tcp_sock *tp = tcp_sk(sk);
>> >                         int mss;
>> >
>> >-                       mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
>> >+                       /* try mss smaller than current mss */
>> >+                       mss = tcp_current_mss(sk) >> 1;
>> >                         mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
>> >                         mss = max(mss, 68 - tp->tcp_header_len);
>> >                         icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
>> >@@ -176,7 +178,7 @@ static int tcp_write_timeout(struct sock *sk)
> I don't believe there's a problem with the code as written.  Modulo
> headers, search_low <= mss =< search_high.  This code, on successive
> timeouts, widens the search range by adjusting search_low (and mss)
> downward.
With original approach(doubling mss), mss is not in between search_low and search_high,
it always equates search_low(subtract headers), the potential mss in case of blackhole
is 256 and 128, after doubling, it will become 512 and 256 eventually no matter how
route changes, even mtu reduced from 1500 to 1100 in intermediate node.

After using binary search, halve search_low will make search window left boundary expending
in a slower manner, as a result the limit of mss is search_high/2, because search_high does
not change.

Timeout indicates search_high should be set to the new mtu corresponding to current_mss no
matter how we change search_low. So the best shot here IMO would be updating search_high
with current_mss, which in return makes the search window *slide* from right to left, and
the probing will converge in good speed eventually.

So my thoughts is:
@@ -113,6 +113,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
                         struct tcp_sock *tp = tcp_sk(sk);
                         int mss;

+                       icsk_mtup.search_high = tcp_mss_to_mtu(sk, tcp_current_mss(sk));
                         mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
                         mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
                         mss = max(mss, 68 - tp->tcp_header_len);

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCHv3 net-next 3/4] ipv4: shrink current mss for tcp PMTU blackhole detection
  2015-03-02  9:29     ` Fan Du
@ 2015-03-02 20:32       ` John Heffner
  2015-03-03  9:18         ` Fan Du
  0 siblings, 1 reply; 11+ messages in thread
From: John Heffner @ 2015-03-02 20:32 UTC (permalink / raw)
  To: Fan Du; +Cc: Fan Du, Eric Dumazet, David Miller, Netdev

On Mon, Mar 2, 2015 at 4:29 AM, Fan Du <fengyuleidian0615@gmail.com> wrote:
> Timeout indicates search_high should be set to the new mtu corresponding to
> current_mss no
> matter how we change search_low. So the best shot here IMO would be updating
> search_high
> with current_mss, which in return makes the search window *slide* from right
> to left, and
> the probing will converge in good speed eventually.
>
> So my thoughts is:
> @@ -113,6 +113,7 @@ static void tcp_mtu_probing(struct inet_connection_sock
> *icsk, struct sock *sk)
>                         struct tcp_sock *tp = tcp_sk(sk);
>                         int mss;
>
> +                       icsk_mtup.search_high = tcp_mss_to_mtu(sk,
> tcp_current_mss(sk));
>                         mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low)
>>> 1;
>                         mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
>                         mss = max(mss, 68 - tp->tcp_header_len);

Search_high should be adjusted downward only when you're quite certain
that you've gotten a negative signal.  There are many possible reasons
for successive timeouts including intermittent disconnection, and they
should not have a persistent (or very long-term) effect on MTU.  Leave
search_high where it is until a working MTU can be established, then
probe upward until probing can give you confidence you've found a new
ceiling, or gotten back to the old one.

If you think the current approach is broken, it would help to see a
concrete demonstration of how it's deficient (a real packet trace is
good!), and how a different approach work better.

Thanks,
  -John

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCHv3 net-next 3/4] ipv4: shrink current mss for tcp PMTU blackhole detection
  2015-03-02 20:32       ` John Heffner
@ 2015-03-03  9:18         ` Fan Du
  0 siblings, 0 replies; 11+ messages in thread
From: Fan Du @ 2015-03-03  9:18 UTC (permalink / raw)
  To: John Heffner; +Cc: Fan Du, Eric Dumazet, David Miller, Netdev

于 2015年03月03日 04:32, John Heffner 写道:
> On Mon, Mar 2, 2015 at 4:29 AM, Fan Du<fengyuleidian0615@gmail.com>  wrote:
>> >Timeout indicates search_high should be set to the new mtu corresponding to
>> >current_mss no
>> >matter how we change search_low. So the best shot here IMO would be updating
>> >search_high
>> >with current_mss, which in return makes the search window*slide*  from right
>> >to left, and
>> >the probing will converge in good speed eventually.
>> >
>> >So my thoughts is:
>> >@@ -113,6 +113,7 @@ static void tcp_mtu_probing(struct inet_connection_sock
>> >*icsk, struct sock *sk)
>> >                         struct tcp_sock *tp = tcp_sk(sk);
>> >                         int mss;
>> >
>> >+                       icsk_mtup.search_high = tcp_mss_to_mtu(sk,
>> >tcp_current_mss(sk));
>> >                         mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low)
>>>> >>>1;
>> >                         mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
>> >                         mss = max(mss, 68 - tp->tcp_header_len);
> Search_high should be adjusted downward only when you're quite certain
> that you've gotten a negative signal.  There are many possible reasons
> for successive timeouts including intermittent disconnection, and they
> should not have a persistent (or very long-term) effect on MTU.  Leave
> search_high where it is until a working MTU can be established, then
> probe upward until probing can give you confidence you've found a new
> ceiling, or gotten back to the old one.
>
> If you think the current approach is broken, it would help to see a
> concrete demonstration of how it's deficient (a real packet trace is
> good!), and how a different approach work better.

 > With original approach(doubling mss), mss is not in between search_low and search_high,
 > it always equates search_low(subtract headers), the potential mss in case of blackhole
 > is 256 and 128, after doubling, it will become 512 and 256 eventually no matter how
 > route changes, even mtu reduced from 1500 to 1100 in intermediate node.

As for above statement, my test scenario is simple, using vxlan tunnel to connect two
docker instances on two hosts. All mtu default to 1500, run iperf between docker instances.
After the connection is setup, adjust the iperf sender host physical eth0 mtu to
1100, and after a couple of seconds, mss will be set to 512.

And after using binary search, actually, search_high will be set to probe_size trigger from
tcp_fastretrans_alert by my investigation. So the searching window does slide to left, thus
I will drop this patch. Maybe this is because my test method is not practical. And I believe
what you said about:
> There are many possible reasons for successive timeouts including intermittent disconnection,
 > and they should not have a persistent (or very long-term) effect on MTU.

Never mind, no matter whether to adjust search_high, the reprobe timer will restore search_high
to maximum allowed, and once again, new mss will be available.

I will resend the rest of patches after incorporating your comments for reviewing.
Thanks for your feedback.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-03-03  9:23 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-28  3:23 [PATCHv3 net-next 0/4] Improvements for TCP PMTU Fan Du
2015-02-28  3:23 ` [PATCHv3 net-next 1/4] ipv4: Raise tcp PMTU probe mss base size Fan Du
2015-02-28  3:23 ` [PATCHv3 net-next 2/4] ipv4: Use binary search to choose tcp PMTU probe_size Fan Du
2015-02-28 23:20   ` John Heffner
2015-03-02  9:29     ` Fan Du
2015-02-28  3:23 ` [PATCHv3 net-next 3/4] ipv4: shrink current mss for tcp PMTU blackhole detection Fan Du
2015-02-28 23:39   ` John Heffner
2015-03-02  9:29     ` Fan Du
2015-03-02 20:32       ` John Heffner
2015-03-03  9:18         ` Fan Du
2015-02-28  3:23 ` [PATCHv3 net-next 4/4] ipv4: Create probe timer for tcp PMTU as per RFC4821 Fan Du

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.