All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCHv4 net-next 0/3] Improvements for TCP PMTU
@ 2015-03-03  9:19 Fan Du
  2015-03-03  9:19 ` [PATCHv4 net-next 1/3] ipv4: Raise tcp PMTU probe mss base size Fan Du
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Fan Du @ 2015-03-03  9:19 UTC (permalink / raw)
  To: johnwheffner; +Cc: edumazet, davem, netdev, fengyuleidian0615

Hi,

This patchset performs some improvements and enhancement
for current TCP PMTU as per RFC4821 with the aim to find
optimal mms size quickly, and also be adaptive to route
changes like enlarged path MTU. Then TCP PMTU could be
used to probe a effective pmtu in absence of ICMP message
for tunnels(e.g. vxlan) across different networking stack.

Patch1/3: Set probe mss base to 1024 Bytes per RFC4821
Patch2/3: Do not double probe_size for each probing,
          use a simple binary search to gain maximum performance.
	  mss for next probing.
Patch3/3: Create a probe timer to detect enlarged path MTU.

Changelog:
v4:
  - Convert probe_size to mss, not directly from search_low/high
  - Clamp probe_threshold
  - Don't adjust search_high in blackhole probe, so drop orignal patch3
v3:
  - Update commit message for patch2
  - Fix pseudo timer delta calculation in patch4
v2:
  - Introduce sysctl_tcp_probe_threshold to control when
    probing will stop, as suggested by John Heffner.
  - Add patch3 to shrink current mss value for search low boundary.
  - Drop cannonical timer usages, implements pseudo timer based on
    32bits jiffies tcp_time_stamp, as suggested by Eric Dumazet.


Fan Du (3):
  ipv4: Raise tcp PMTU probe mss base size
  ipv4: Use binary search to choose tcp PMTU probe_size
  ipv4: Create probe timer for tcp PMTU as per RFC4821

 include/net/inet_connection_sock.h |    2 +
 include/net/netns/ipv4.h           |    2 +
 include/net/tcp.h                  |    8 +++++-
 net/ipv4/sysctl_net_ipv4.c         |   14 +++++++++++
 net/ipv4/tcp_ipv4.c                |    2 +
 net/ipv4/tcp_output.c              |   46 ++++++++++++++++++++++++++++++++---
 net/ipv4/tcp_timer.c               |    1 +
 7 files changed, 70 insertions(+), 5 deletions(-)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCHv4 net-next 1/3] ipv4: Raise tcp PMTU probe mss base size
  2015-03-03  9:19 [PATCHv4 net-next 0/3] Improvements for TCP PMTU Fan Du
@ 2015-03-03  9:19 ` Fan Du
  2015-03-03 16:52   ` John Heffner
  2015-03-03  9:19 ` [PATCHv4 net-next 2/3] ipv4: Use binary search to choose tcp PMTU probe_size Fan Du
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Fan Du @ 2015-03-03  9:19 UTC (permalink / raw)
  To: johnwheffner; +Cc: edumazet, davem, netdev, fengyuleidian0615

Quotes from RFC4821 7.2.  Selecting Initial Values

   It is RECOMMENDED that search_low be initially set to an MTU size
   that is likely to work over a very wide range of environments.  Given
   today's technologies, a value of 1024 bytes is probably safe enough.
   The initial value for search_low SHOULD be configurable.

Moreover, set a small value will introduce extra time for the search
to converge. So set the initial probe base mss size to 1024 Bytes.

Signed-off-by: Fan Du <fan.du@intel.com>
---
 include/net/tcp.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 8d6b983..7b57e5b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -65,7 +65,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCP_MIN_MSS		88U
 
 /* The least MTU to use for probing */
-#define TCP_BASE_MSS		512
+#define TCP_BASE_MSS		1024
 
 /* After receiving this amount of duplicate ACKs fast retransmit starts. */
 #define TCP_FASTRETRANS_THRESH 3
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCHv4 net-next 2/3] ipv4: Use binary search to choose tcp PMTU probe_size
  2015-03-03  9:19 [PATCHv4 net-next 0/3] Improvements for TCP PMTU Fan Du
  2015-03-03  9:19 ` [PATCHv4 net-next 1/3] ipv4: Raise tcp PMTU probe mss base size Fan Du
@ 2015-03-03  9:19 ` Fan Du
  2015-03-03 16:51   ` John Heffner
  2015-03-03  9:19 ` [PATCHv4 net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821 Fan Du
  2015-03-03 16:43 ` [PATCHv4 net-next 0/3] Improvements for TCP PMTU John Heffner
  3 siblings, 1 reply; 14+ messages in thread
From: Fan Du @ 2015-03-03  9:19 UTC (permalink / raw)
  To: johnwheffner; +Cc: edumazet, davem, netdev, fengyuleidian0615

Current probe_size is chosen by doubling mss_cache,
the probing process will end shortly with a sub-optimal
mss size, and the link mtu will not be taken full
advantage of, in return, this will make user to tweak
tcp_base_mss with care.

Use binary search to choose probe_size in a fine
granularity manner, an optimal mss will be found
to boost performance as its maxmium.

In addition, introduce a sysctl_tcp_probe_threshold
to control when probing will stop in respect to
the width of search range.

Test env:
Docker instance with vxlan encapuslation(82599EB)
iperf -c 10.0.0.24  -t 60

before this patch:
1.26 Gbits/sec

After this patch: increase 26%
1.59 Gbits/sec

Signed-off-by: Fan Du <fan.du@intel.com>
---
v4:
  - Convert probe_size to mss
  - Clamp probe_threshold
v3:
  - Fix commit message
v2:
  - Use sysctl_tcp_probe_threshold to control when probing
    will stop wrt interval between search high and search low.
---
 include/net/netns/ipv4.h   |    1 +
 include/net/tcp.h          |    3 +++
 net/ipv4/sysctl_net_ipv4.c |    7 +++++++
 net/ipv4/tcp_ipv4.c        |    1 +
 net/ipv4/tcp_output.c      |   14 +++++++++++---
 5 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 1b26c6c..374bf3f 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -85,6 +85,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_fwmark_accept;
 	int sysctl_tcp_mtu_probing;
 	int sysctl_tcp_base_mss;
+	int sysctl_tcp_probe_threshold;
 
 	struct ping_group_range ping_group_range;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7b57e5b..d269c91 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -67,6 +67,9 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 /* The least MTU to use for probing */
 #define TCP_BASE_MSS		1024
 
+/* Specify interval when tcp mtu probing will stop */
+#define TCP_PROBE_THRESHOLD	8
+
 /* After receiving this amount of duplicate ACKs fast retransmit starts. */
 #define TCP_FASTRETRANS_THRESH 3
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d151539..d3c09c1 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -883,6 +883,13 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "tcp_probe_threshold",
+		.data		= &init_net.ipv4.sysctl_tcp_probe_threshold,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 5a2dfed..35790d9 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2460,6 +2460,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	}
 	net->ipv4.sysctl_tcp_ecn = 2;
 	net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
+	net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
 	return 0;
 
 fail:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a2a796c..46acddc 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1837,11 +1837,13 @@ static int tcp_mtu_probe(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct sk_buff *skb, *nskb, *next;
+	struct net *net = sock_net(sk);
 	int len;
 	int probe_size;
 	int size_needed;
 	int copy;
 	int mss_now;
+	int interval;
 
 	/* Not currently probing/verifying,
 	 * not in recovery,
@@ -1854,11 +1856,17 @@ static int tcp_mtu_probe(struct sock *sk)
 	    tp->rx_opt.num_sacks || tp->rx_opt.dsack)
 		return -1;
 
-	/* Very simple search strategy: just double the MSS. */
+	/* Use binary search for probe_size between tcp_mss_base,
+	 * and current mss_clamp. if (search_high - search_low)
+	 * smaller than a threshold, backoff from probing.
+	 */
 	mss_now = tcp_current_mss(sk);
-	probe_size = 2 * tp->mss_cache;
+	probe_size = tcp_mtu_to_mss(sk, (icsk->icsk_mtup.search_high +
+				    icsk->icsk_mtup.search_low) >> 1);
 	size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
-	if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high)) {
+	interval = icsk->icsk_mtup.search_high - icsk->icsk_mtup.search_low;
+	if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high) ||
+	    interval < max(1, net->ipv4.sysctl_tcp_probe_threshold)) {
 		/* TODO: set timer for probe_converge_event */
 		return -1;
 	}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCHv4 net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821
  2015-03-03  9:19 [PATCHv4 net-next 0/3] Improvements for TCP PMTU Fan Du
  2015-03-03  9:19 ` [PATCHv4 net-next 1/3] ipv4: Raise tcp PMTU probe mss base size Fan Du
  2015-03-03  9:19 ` [PATCHv4 net-next 2/3] ipv4: Use binary search to choose tcp PMTU probe_size Fan Du
@ 2015-03-03  9:19 ` Fan Du
  2015-03-03 16:39   ` John Heffner
  2015-03-03 16:43 ` [PATCHv4 net-next 0/3] Improvements for TCP PMTU John Heffner
  3 siblings, 1 reply; 14+ messages in thread
From: Fan Du @ 2015-03-03  9:19 UTC (permalink / raw)
  To: johnwheffner; +Cc: edumazet, davem, netdev, fengyuleidian0615

As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
be armed once probing has converged. Once this timer expired,
probing again to take advantage of any path PMTU change. The
recommended probing interval is 10 minutes per RFC1981. Probing
interval could be sysctled by sysctl_tcp_probe_interval.

Eric Dumazet suggested to implement pseudo timer based on 32bits
jiffies tcp_time_stamp instead of using classic timer for such
rare event.

Signed-off-by: Fan Du <fan.du@intel.com>
---
v3:
  - Fix pseudo timer delta caculation.
v2:
  - Create a pseudo timer based on 32bits jiffies tcp_time_stamp
    to control when to reprobing as suggested by Eric.
---
 include/net/inet_connection_sock.h |    2 ++
 include/net/netns/ipv4.h           |    1 +
 include/net/tcp.h                  |    3 +++
 net/ipv4/sysctl_net_ipv4.c         |    7 +++++++
 net/ipv4/tcp_ipv4.c                |    1 +
 net/ipv4/tcp_output.c              |   34 ++++++++++++++++++++++++++++++++--
 net/ipv4/tcp_timer.c               |    1 +
 7 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 5976bde..b9a6b0a 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -126,6 +126,8 @@ struct inet_connection_sock {
 
 		/* Information on the current probe. */
 		int		  probe_size;
+
+		u32		  probe_timestamp;
 	} icsk_mtup;
 	u32			  icsk_ca_priv[16];
 	u32			  icsk_user_timeout;
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 374bf3f..6323a22 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -86,6 +86,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_mtu_probing;
 	int sysctl_tcp_base_mss;
 	int sysctl_tcp_probe_threshold;
+	u32 sysctl_tcp_probe_interval;
 
 	struct ping_group_range ping_group_range;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index d269c91..dd6adbc 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -67,6 +67,9 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 /* The least MTU to use for probing */
 #define TCP_BASE_MSS		1024
 
+/* probing interval, default to 10 minutes as per RFC4821 */
+#define TCP_PROBE_INTERVAL	600
+
 /* Specify interval when tcp mtu probing will stop */
 #define TCP_PROBE_THRESHOLD	8
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d3c09c1..fdf8991 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -890,6 +890,13 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "tcp_probe_interval",
+		.data		= &init_net.ipv4.sysctl_tcp_probe_interval,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 35790d9..f0c6fc3 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2461,6 +2461,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_ecn = 2;
 	net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
 	net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
+	net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
 	return 0;
 
 fail:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 46acddc..e99edf2 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1354,6 +1354,8 @@ void tcp_mtup_init(struct sock *sk)
 			       icsk->icsk_af_ops->net_header_len;
 	icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, net->ipv4.sysctl_tcp_base_mss);
 	icsk->icsk_mtup.probe_size = 0;
+	if (icsk->icsk_mtup.enabled)
+		icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
 }
 EXPORT_SYMBOL(tcp_mtup_init);
 
@@ -1823,6 +1825,31 @@ send_now:
 	return false;
 }
 
+static inline void tcp_mtu_check_reprobe(struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct net *net = sock_net(sk);
+	u32 interval;
+	s32 delta;
+
+	interval = net->ipv4.sysctl_tcp_probe_interval;
+	delta = tcp_time_stamp - icsk->icsk_mtup.probe_timestamp;
+	if (unlikely(delta >= interval * HZ)) {
+		int mss = tcp_current_mss(sk);
+
+		/* Update current search range */
+		icsk->icsk_mtup.search_high = tp->rx_opt.mss_clamp +
+			sizeof(struct tcphdr) +
+			icsk->icsk_af_ops->net_header_len;
+		icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
+		icsk->icsk_mtup.probe_size = 0;
+
+		/* Update probe time stamp */
+		icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
+	}
+}
+
 /* Create a new MTU probe if we are ready.
  * MTU probe is regularly attempting to increase the path MTU by
  * deliberately sending larger packets.  This discovers routing
@@ -1866,8 +1893,11 @@ static int tcp_mtu_probe(struct sock *sk)
 	size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
 	interval = icsk->icsk_mtup.search_high - icsk->icsk_mtup.search_low;
 	if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high) ||
-	    interval < max(1, net->ipv4.sysctl_tcp_probe_threshold)) {
-		/* TODO: set timer for probe_converge_event */
+		interval < net->ipv4.sysctl_tcp_probe_threshold) {
+		/* Check whether enough time has elaplased for
+		 * another round of probing.
+		 */
+		tcp_mtu_check_reprobe(sk);
 		return -1;
 	}
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 0732b78..1550593 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -107,6 +107,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
 	if (net->ipv4.sysctl_tcp_mtu_probing) {
 		if (!icsk->icsk_mtup.enabled) {
 			icsk->icsk_mtup.enabled = 1;
+			icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
 			tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
 		} else {
 			struct net *net = sock_net(sk);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCHv4 net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821
  2015-03-03  9:19 ` [PATCHv4 net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821 Fan Du
@ 2015-03-03 16:39   ` John Heffner
  2015-03-04  2:11     ` Fan Du
  0 siblings, 1 reply; 14+ messages in thread
From: John Heffner @ 2015-03-03 16:39 UTC (permalink / raw)
  To: Fan Du; +Cc: Eric Dumazet, David Miller, Netdev, Fan Du

On Tue, Mar 3, 2015 at 4:19 AM, Fan Du <fan.du@intel.com> wrote:
> As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
> be armed once probing has converged. Once this timer expired,
> probing again to take advantage of any path PMTU change. The
> recommended probing interval is 10 minutes per RFC1981. Probing
> interval could be sysctled by sysctl_tcp_probe_interval.
>
> Eric Dumazet suggested to implement pseudo timer based on 32bits
> jiffies tcp_time_stamp instead of using classic timer for such
> rare event.
>
> Signed-off-by: Fan Du <fan.du@intel.com>

> @@ -1823,6 +1825,31 @@ send_now:
>         return false;
>  }
>
> +static inline void tcp_mtu_check_reprobe(struct sock *sk)
> +{
> +       struct inet_connection_sock *icsk = inet_csk(sk);
> +       struct tcp_sock *tp = tcp_sk(sk);
> +       struct net *net = sock_net(sk);
> +       u32 interval;
> +       s32 delta;
> +
> +       interval = net->ipv4.sysctl_tcp_probe_interval;
> +       delta = tcp_time_stamp - icsk->icsk_mtup.probe_timestamp;
> +       if (unlikely(delta >= interval * HZ)) {
> +               int mss = tcp_current_mss(sk);
> +
> +               /* Update current search range */
> +               icsk->icsk_mtup.search_high = tp->rx_opt.mss_clamp +
> +                       sizeof(struct tcphdr) +
> +                       icsk->icsk_af_ops->net_header_len;
> +               icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
> +               icsk->icsk_mtup.probe_size = 0;
> +
> +               /* Update probe time stamp */
> +               icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
> +       }
> +}
> +
>  /* Create a new MTU probe if we are ready.
>   * MTU probe is regularly attempting to increase the path MTU by
>   * deliberately sending larger packets.  This discovers routing

I think the only update to the search range required here is
potentially moving search_high upward.  Touching search_low seems
unnecessary, and probe_size better be zero when executing this anyway.
(We don't want to be changing the state while a probe is in flight.)



> @@ -1866,8 +1893,11 @@ static int tcp_mtu_probe(struct sock *sk)
>         size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>         interval = icsk->icsk_mtup.search_high - icsk->icsk_mtup.search_low;
>         if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high) ||
> -           interval < max(1, net->ipv4.sysctl_tcp_probe_threshold)) {
> -               /* TODO: set timer for probe_converge_event */
> +               interval < net->ipv4.sysctl_tcp_probe_threshold) {
> +               /* Check whether enough time has elaplased for
> +                * another round of probing.
> +                */
> +               tcp_mtu_check_reprobe(sk);
>                 return -1;
>         }

The way this check works, I think putting it here may not be exactly
what we want.  The comment was to set a timer here, but that's not
exactly what tcp_mtu_check_reprobe does.  Since it may update the
search range, I think it would be better to call prior to comparing
the candidate probe_size to search_high.

Thanks,
  -John

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHv4 net-next 0/3] Improvements for TCP PMTU
  2015-03-03  9:19 [PATCHv4 net-next 0/3] Improvements for TCP PMTU Fan Du
                   ` (2 preceding siblings ...)
  2015-03-03  9:19 ` [PATCHv4 net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821 Fan Du
@ 2015-03-03 16:43 ` John Heffner
  2015-03-04  2:12   ` Fan Du
  3 siblings, 1 reply; 14+ messages in thread
From: John Heffner @ 2015-03-03 16:43 UTC (permalink / raw)
  To: Fan Du; +Cc: Eric Dumazet, David Miller, Netdev, Fan Du

On Tue, Mar 3, 2015 at 4:19 AM, Fan Du <fan.du@intel.com> wrote:
> Fan Du (3):
>   ipv4: Raise tcp PMTU probe mss base size
>   ipv4: Use binary search to choose tcp PMTU probe_size
>   ipv4: Create probe timer for tcp PMTU as per RFC4821
>
>  include/net/inet_connection_sock.h |    2 +
>  include/net/netns/ipv4.h           |    2 +
>  include/net/tcp.h                  |    8 +++++-
>  net/ipv4/sysctl_net_ipv4.c         |   14 +++++++++++
>  net/ipv4/tcp_ipv4.c                |    2 +
>  net/ipv4/tcp_output.c              |   46 ++++++++++++++++++++++++++++++++---
>  net/ipv4/tcp_timer.c               |    1 +
>  7 files changed, 70 insertions(+), 5 deletions(-)
>

This changeset adds a couple new sysctls.  They probably deserve
descriptions in ip-sysctl.txt.

Thanks,
  -John

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHv4 net-next 2/3] ipv4: Use binary search to choose tcp PMTU probe_size
  2015-03-03  9:19 ` [PATCHv4 net-next 2/3] ipv4: Use binary search to choose tcp PMTU probe_size Fan Du
@ 2015-03-03 16:51   ` John Heffner
  2015-03-04 13:39     ` John Heffner
  0 siblings, 1 reply; 14+ messages in thread
From: John Heffner @ 2015-03-03 16:51 UTC (permalink / raw)
  To: Fan Du; +Cc: Eric Dumazet, David Miller, Netdev, Fan Du

On Tue, Mar 3, 2015 at 4:19 AM, Fan Du <fan.du@intel.com> wrote:
> Current probe_size is chosen by doubling mss_cache,
> the probing process will end shortly with a sub-optimal
> mss size, and the link mtu will not be taken full
> advantage of, in return, this will make user to tweak
> tcp_base_mss with care.
>
> Use binary search to choose probe_size in a fine
> granularity manner, an optimal mss will be found
> to boost performance as its maxmium.
>
> In addition, introduce a sysctl_tcp_probe_threshold
> to control when probing will stop in respect to
> the width of search range.
>
> Test env:
> Docker instance with vxlan encapuslation(82599EB)
> iperf -c 10.0.0.24  -t 60
>
> before this patch:
> 1.26 Gbits/sec
>
> After this patch: increase 26%
> 1.59 Gbits/sec
>
> Signed-off-by: Fan Du <fan.du@intel.com>
> ---
> v4:
>   - Convert probe_size to mss
>   - Clamp probe_threshold
> v3:
>   - Fix commit message
> v2:
>   - Use sysctl_tcp_probe_threshold to control when probing
>     will stop wrt interval between search high and search low.
> ---
>  include/net/netns/ipv4.h   |    1 +
>  include/net/tcp.h          |    3 +++
>  net/ipv4/sysctl_net_ipv4.c |    7 +++++++
>  net/ipv4/tcp_ipv4.c        |    1 +
>  net/ipv4/tcp_output.c      |   14 +++++++++++---
>  5 files changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index 1b26c6c..374bf3f 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -85,6 +85,7 @@ struct netns_ipv4 {
>         int sysctl_tcp_fwmark_accept;
>         int sysctl_tcp_mtu_probing;
>         int sysctl_tcp_base_mss;
> +       int sysctl_tcp_probe_threshold;
>
>         struct ping_group_range ping_group_range;
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 7b57e5b..d269c91 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -67,6 +67,9 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
>  /* The least MTU to use for probing */
>  #define TCP_BASE_MSS           1024
>
> +/* Specify interval when tcp mtu probing will stop */
> +#define TCP_PROBE_THRESHOLD    8
> +
>  /* After receiving this amount of duplicate ACKs fast retransmit starts. */
>  #define TCP_FASTRETRANS_THRESH 3
>
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index d151539..d3c09c1 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -883,6 +883,13 @@ static struct ctl_table ipv4_net_table[] = {
>                 .mode           = 0644,
>                 .proc_handler   = proc_dointvec,
>         },
> +       {
> +               .procname       = "tcp_probe_threshold",
> +               .data           = &init_net.ipv4.sysctl_tcp_probe_threshold,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec,
> +       },
>         { }
>  };
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 5a2dfed..35790d9 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2460,6 +2460,7 @@ static int __net_init tcp_sk_init(struct net *net)
>         }
>         net->ipv4.sysctl_tcp_ecn = 2;
>         net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
> +       net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
>         return 0;
>
>  fail:
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index a2a796c..46acddc 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1837,11 +1837,13 @@ static int tcp_mtu_probe(struct sock *sk)
>         struct tcp_sock *tp = tcp_sk(sk);
>         struct inet_connection_sock *icsk = inet_csk(sk);
>         struct sk_buff *skb, *nskb, *next;
> +       struct net *net = sock_net(sk);
>         int len;
>         int probe_size;
>         int size_needed;
>         int copy;
>         int mss_now;
> +       int interval;
>
>         /* Not currently probing/verifying,
>          * not in recovery,
> @@ -1854,11 +1856,17 @@ static int tcp_mtu_probe(struct sock *sk)
>             tp->rx_opt.num_sacks || tp->rx_opt.dsack)
>                 return -1;
>
> -       /* Very simple search strategy: just double the MSS. */
> +       /* Use binary search for probe_size between tcp_mss_base,
> +        * and current mss_clamp. if (search_high - search_low)
> +        * smaller than a threshold, backoff from probing.
> +        */
>         mss_now = tcp_current_mss(sk);
> -       probe_size = 2 * tp->mss_cache;
> +       probe_size = tcp_mtu_to_mss(sk, (icsk->icsk_mtup.search_high +
> +                                   icsk->icsk_mtup.search_low) >> 1);
>         size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
> -       if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high)) {
> +       interval = icsk->icsk_mtup.search_high - icsk->icsk_mtup.search_low;
> +       if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high) ||
> +           interval < max(1, net->ipv4.sysctl_tcp_probe_threshold)) {
>                 /* TODO: set timer for probe_converge_event */
>                 return -1;
>         }
> --
> 1.7.1
>


I suspect there's plenty of room for further improvement in the
probing heuristic, but this seems like a reasonable improvement.

Acked-by: John Heffner <johnwheffner@gmail.com>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHv4 net-next 1/3] ipv4: Raise tcp PMTU probe mss base size
  2015-03-03  9:19 ` [PATCHv4 net-next 1/3] ipv4: Raise tcp PMTU probe mss base size Fan Du
@ 2015-03-03 16:52   ` John Heffner
  0 siblings, 0 replies; 14+ messages in thread
From: John Heffner @ 2015-03-03 16:52 UTC (permalink / raw)
  To: Fan Du; +Cc: Eric Dumazet, David Miller, Netdev, Fan Du

On Tue, Mar 3, 2015 at 4:19 AM, Fan Du <fan.du@intel.com> wrote:
> Quotes from RFC4821 7.2.  Selecting Initial Values
>
>    It is RECOMMENDED that search_low be initially set to an MTU size
>    that is likely to work over a very wide range of environments.  Given
>    today's technologies, a value of 1024 bytes is probably safe enough.
>    The initial value for search_low SHOULD be configurable.
>
> Moreover, set a small value will introduce extra time for the search
> to converge. So set the initial probe base mss size to 1024 Bytes.
>
> Signed-off-by: Fan Du <fan.du@intel.com>
> ---
>  include/net/tcp.h |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 8d6b983..7b57e5b 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -65,7 +65,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
>  #define TCP_MIN_MSS            88U
>
>  /* The least MTU to use for probing */
> -#define TCP_BASE_MSS           512
> +#define TCP_BASE_MSS           1024
>
>  /* After receiving this amount of duplicate ACKs fast retransmit starts. */
>  #define TCP_FASTRETRANS_THRESH 3
> --
> 1.7.1
>

Acked-by: John Heffner <johnwheffner@gmail.com>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHv4 net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821
  2015-03-03 16:39   ` John Heffner
@ 2015-03-04  2:11     ` Fan Du
  2015-03-04 13:26       ` John Heffner
  0 siblings, 1 reply; 14+ messages in thread
From: Fan Du @ 2015-03-04  2:11 UTC (permalink / raw)
  To: John Heffner; +Cc: Fan Du, Eric Dumazet, David Miller, Netdev

于 2015年03月04日 00:39, John Heffner 写道:
> On Tue, Mar 3, 2015 at 4:19 AM, Fan Du <fan.du@intel.com> wrote:
>> As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
>> be armed once probing has converged. Once this timer expired,
>> probing again to take advantage of any path PMTU change. The
>> recommended probing interval is 10 minutes per RFC1981. Probing
>> interval could be sysctled by sysctl_tcp_probe_interval.
>>
>> Eric Dumazet suggested to implement pseudo timer based on 32bits
>> jiffies tcp_time_stamp instead of using classic timer for such
>> rare event.
>>
>> Signed-off-by: Fan Du <fan.du@intel.com>
>
>> @@ -1823,6 +1825,31 @@ send_now:
>>          return false;
>>   }
>>
>> +static inline void tcp_mtu_check_reprobe(struct sock *sk)
>> +{
>> +       struct inet_connection_sock *icsk = inet_csk(sk);
>> +       struct tcp_sock *tp = tcp_sk(sk);
>> +       struct net *net = sock_net(sk);
>> +       u32 interval;
>> +       s32 delta;
>> +
>> +       interval = net->ipv4.sysctl_tcp_probe_interval;
>> +       delta = tcp_time_stamp - icsk->icsk_mtup.probe_timestamp;
>> +       if (unlikely(delta >= interval * HZ)) {
>> +               int mss = tcp_current_mss(sk);
>> +
>> +               /* Update current search range */
>> +               icsk->icsk_mtup.search_high = tp->rx_opt.mss_clamp +
>> +                       sizeof(struct tcphdr) +
>> +                       icsk->icsk_af_ops->net_header_len;
>> +               icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
>> +               icsk->icsk_mtup.probe_size = 0;
>> +
>> +               /* Update probe time stamp */
>> +               icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
>> +       }
>> +}
>> +
>>   /* Create a new MTU probe if we are ready.
>>    * MTU probe is regularly attempting to increase the path MTU by
>>    * deliberately sending larger packets.  This discovers routing
>
> I think the only update to the search range required here is
> potentially moving search_high upward.  Touching search_low seems
> unnecessary,
No.

If you restoring search_low to its original, the first probe will probably
be no better than using current mss, because current mss is working good,
there may be better mss between mtu(current mss) and search_high. Reprobing
from search_low to search_high is definitely not worth the effort.

> and probe_size better be zero when executing this anyway.
> (We don't want to be changing the state while a probe is in flight.)
Good point.
probe size should be set to zero frist, then update search range to keep
state consistent.
>
>
>> @@ -1866,8 +1893,11 @@ static int tcp_mtu_probe(struct sock *sk)
>>          size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>>          interval = icsk->icsk_mtup.search_high - icsk->icsk_mtup.search_low;
>>          if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high) ||
>> -           interval < max(1, net->ipv4.sysctl_tcp_probe_threshold)) {
>> -               /* TODO: set timer for probe_converge_event */
>> +               interval < net->ipv4.sysctl_tcp_probe_threshold) {
>> +               /* Check whether enough time has elaplased for
>> +                * another round of probing.
>> +                */
>> +               tcp_mtu_check_reprobe(sk);
>>                  return -1;
>>          }
>
> The way this check works, I think putting it here may not be exactly
> what we want.  The comment was to set a timer here, but that's not
> exactly what tcp_mtu_check_reprobe does.  Since it may update the
> search range, I think it would be better to call prior to comparing
> the candidate probe_size to search_high.
Semantically speaking, reprobe checking should be placed inside where
a decision to not probe has been made, reprobing from a stable state ;)
If we moving the reporobe checking outside, current probing may be legal,
but the reprobe check will rule it out.

So I don't see any compelling reason to move the reprobe checking outside.

> Thanks,
>    -John
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHv4 net-next 0/3] Improvements for TCP PMTU
  2015-03-03 16:43 ` [PATCHv4 net-next 0/3] Improvements for TCP PMTU John Heffner
@ 2015-03-04  2:12   ` Fan Du
  0 siblings, 0 replies; 14+ messages in thread
From: Fan Du @ 2015-03-04  2:12 UTC (permalink / raw)
  To: John Heffner; +Cc: Fan Du, Eric Dumazet, David Miller, Netdev

于 2015年03月04日 00:43, John Heffner 写道:
> On Tue, Mar 3, 2015 at 4:19 AM, Fan Du<fan.du@intel.com>  wrote:
>> >Fan Du (3):
>> >   ipv4: Raise tcp PMTU probe mss base size
>> >   ipv4: Use binary search to choose tcp PMTU probe_size
>> >   ipv4: Create probe timer for tcp PMTU as per RFC4821
>> >
>> >  include/net/inet_connection_sock.h |    2 +
>> >  include/net/netns/ipv4.h           |    2 +
>> >  include/net/tcp.h                  |    8 +++++-
>> >  net/ipv4/sysctl_net_ipv4.c         |   14 +++++++++++
>> >  net/ipv4/tcp_ipv4.c                |    2 +
>> >  net/ipv4/tcp_output.c              |   46 ++++++++++++++++++++++++++++++++---
>> >  net/ipv4/tcp_timer.c               |    1 +
>> >  7 files changed, 70 insertions(+), 5 deletions(-)
>> >
> This changeset adds a couple new sysctls.  They probably deserve
> descriptions in ip-sysctl.txt.

Thanks for the notice, will update ip-sysctl.txt in next version.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHv4 net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821
  2015-03-04  2:11     ` Fan Du
@ 2015-03-04 13:26       ` John Heffner
  2015-03-05  2:36         ` Fan Du
  0 siblings, 1 reply; 14+ messages in thread
From: John Heffner @ 2015-03-04 13:26 UTC (permalink / raw)
  To: Fan Du; +Cc: Fan Du, Eric Dumazet, David Miller, Netdev

On Tue, Mar 3, 2015 at 9:11 PM, Fan Du <fengyuleidian0615@gmail.com> wrote:
> 于 2015年03月04日 00:39, John Heffner 写道:
>
>> On Tue, Mar 3, 2015 at 4:19 AM, Fan Du <fan.du@intel.com> wrote:
>>>
>>> As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
>>> be armed once probing has converged. Once this timer expired,
>>> probing again to take advantage of any path PMTU change. The
>>> recommended probing interval is 10 minutes per RFC1981. Probing
>>> interval could be sysctled by sysctl_tcp_probe_interval.
>>>
>>> Eric Dumazet suggested to implement pseudo timer based on 32bits
>>> jiffies tcp_time_stamp instead of using classic timer for such
>>> rare event.
>>>
>>> Signed-off-by: Fan Du <fan.du@intel.com>
>>
>>
>>> @@ -1823,6 +1825,31 @@ send_now:
>>>          return false;
>>>   }
>>>
>>> +static inline void tcp_mtu_check_reprobe(struct sock *sk)
>>> +{
>>> +       struct inet_connection_sock *icsk = inet_csk(sk);
>>> +       struct tcp_sock *tp = tcp_sk(sk);
>>> +       struct net *net = sock_net(sk);
>>> +       u32 interval;
>>> +       s32 delta;
>>> +
>>> +       interval = net->ipv4.sysctl_tcp_probe_interval;
>>> +       delta = tcp_time_stamp - icsk->icsk_mtup.probe_timestamp;
>>> +       if (unlikely(delta >= interval * HZ)) {
>>> +               int mss = tcp_current_mss(sk);
>>> +
>>> +               /* Update current search range */
>>> +               icsk->icsk_mtup.search_high = tp->rx_opt.mss_clamp +
>>> +                       sizeof(struct tcphdr) +
>>> +                       icsk->icsk_af_ops->net_header_len;
>>> +               icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
>>> +               icsk->icsk_mtup.probe_size = 0;
>>> +
>>> +               /* Update probe time stamp */
>>> +               icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
>>> +       }
>>> +}
>>> +
>>>   /* Create a new MTU probe if we are ready.
>>>    * MTU probe is regularly attempting to increase the path MTU by
>>>    * deliberately sending larger packets.  This discovers routing
>>
>>
>> I think the only update to the search range required here is
>> potentially moving search_high upward.  Touching search_low seems
>> unnecessary,
>
> No.
>
> If you restoring search_low to its original, the first probe will probably
> be no better than using current mss, because current mss is working good,
> there may be better mss between mtu(current mss) and search_high. Reprobing
> from search_low to search_high is definitely not worth the effort.

Set search_high to the maximal MSS.  The next probe will be halfway
between search_low and MSS.  There's no reason to reduce search_low
when the current MSS == search_low.


>>> @@ -1866,8 +1893,11 @@ static int tcp_mtu_probe(struct sock *sk)
>>>          size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>>>          interval = icsk->icsk_mtup.search_high -
>>> icsk->icsk_mtup.search_low;
>>>          if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high)
>>> ||
>>> -           interval < max(1, net->ipv4.sysctl_tcp_probe_threshold)) {
>>> -               /* TODO: set timer for probe_converge_event */
>>> +               interval < net->ipv4.sysctl_tcp_probe_threshold) {
>>> +               /* Check whether enough time has elaplased for
>>> +                * another round of probing.
>>> +                */
>>> +               tcp_mtu_check_reprobe(sk);
>>>                  return -1;
>>>          }
>>
>>
>> The way this check works, I think putting it here may not be exactly
>> what we want.  The comment was to set a timer here, but that's not
>> exactly what tcp_mtu_check_reprobe does.  Since it may update the
>> search range, I think it would be better to call prior to comparing
>> the candidate probe_size to search_high.
>
> Semantically speaking, reprobe checking should be placed inside where
> a decision to not probe has been made, reprobing from a stable state ;)
> If we moving the reporobe checking outside, current probing may be legal,
> but the reprobe check will rule it out.
>
> So I don't see any compelling reason to move the reprobe checking outside.

The reprobe check should happen when you know you don't already have a
pending probe, not when a decision has already been made not to probe.
The impact is relatively minor because practically you're just
extending the duration of a long timer a bit more, but there's not a
good reason to do so.

Thanks,
  -John

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHv4 net-next 2/3] ipv4: Use binary search to choose tcp PMTU probe_size
  2015-03-03 16:51   ` John Heffner
@ 2015-03-04 13:39     ` John Heffner
  2015-03-05  2:36       ` Fan Du
  0 siblings, 1 reply; 14+ messages in thread
From: John Heffner @ 2015-03-04 13:39 UTC (permalink / raw)
  To: Fan Du; +Cc: Eric Dumazet, David Miller, Netdev, Fan Du

On Tue, Mar 3, 2015 at 11:51 AM, John Heffner <johnwheffner@gmail.com> wrote:
> On Tue, Mar 3, 2015 at 4:19 AM, Fan Du <fan.du@intel.com> wrote:
>> Current probe_size is chosen by doubling mss_cache,
>> the probing process will end shortly with a sub-optimal
>> mss size, and the link mtu will not be taken full
>> advantage of, in return, this will make user to tweak
>> tcp_base_mss with care.
>>
>> Use binary search to choose probe_size in a fine
>> granularity manner, an optimal mss will be found
>> to boost performance as its maxmium.
>>
>> In addition, introduce a sysctl_tcp_probe_threshold
>> to control when probing will stop in respect to
>> the width of search range.
>>
>> Test env:
>> Docker instance with vxlan encapuslation(82599EB)
>> iperf -c 10.0.0.24  -t 60
>>
>> before this patch:
>> 1.26 Gbits/sec
>>
>> After this patch: increase 26%
>> 1.59 Gbits/sec
>>
>> Signed-off-by: Fan Du <fan.du@intel.com>
>> ---
>> v4:
>>   - Convert probe_size to mss
>>   - Clamp probe_threshold
>> v3:
>>   - Fix commit message
>> v2:
>>   - Use sysctl_tcp_probe_threshold to control when probing
>>     will stop wrt interval between search high and search low.
>> ---
>>  include/net/netns/ipv4.h   |    1 +
>>  include/net/tcp.h          |    3 +++
>>  net/ipv4/sysctl_net_ipv4.c |    7 +++++++
>>  net/ipv4/tcp_ipv4.c        |    1 +
>>  net/ipv4/tcp_output.c      |   14 +++++++++++---
>>  5 files changed, 23 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>> index 1b26c6c..374bf3f 100644
>> --- a/include/net/netns/ipv4.h
>> +++ b/include/net/netns/ipv4.h
>> @@ -85,6 +85,7 @@ struct netns_ipv4 {
>>         int sysctl_tcp_fwmark_accept;
>>         int sysctl_tcp_mtu_probing;
>>         int sysctl_tcp_base_mss;
>> +       int sysctl_tcp_probe_threshold;
>>
>>         struct ping_group_range ping_group_range;
>>
>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>> index 7b57e5b..d269c91 100644
>> --- a/include/net/tcp.h
>> +++ b/include/net/tcp.h
>> @@ -67,6 +67,9 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
>>  /* The least MTU to use for probing */
>>  #define TCP_BASE_MSS           1024
>>
>> +/* Specify interval when tcp mtu probing will stop */
>> +#define TCP_PROBE_THRESHOLD    8
>> +
>>  /* After receiving this amount of duplicate ACKs fast retransmit starts. */
>>  #define TCP_FASTRETRANS_THRESH 3
>>
>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>> index d151539..d3c09c1 100644
>> --- a/net/ipv4/sysctl_net_ipv4.c
>> +++ b/net/ipv4/sysctl_net_ipv4.c
>> @@ -883,6 +883,13 @@ static struct ctl_table ipv4_net_table[] = {
>>                 .mode           = 0644,
>>                 .proc_handler   = proc_dointvec,
>>         },
>> +       {
>> +               .procname       = "tcp_probe_threshold",
>> +               .data           = &init_net.ipv4.sysctl_tcp_probe_threshold,
>> +               .maxlen         = sizeof(int),
>> +               .mode           = 0644,
>> +               .proc_handler   = proc_dointvec,
>> +       },
>>         { }
>>  };
>>
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index 5a2dfed..35790d9 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -2460,6 +2460,7 @@ static int __net_init tcp_sk_init(struct net *net)
>>         }
>>         net->ipv4.sysctl_tcp_ecn = 2;
>>         net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
>> +       net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
>>         return 0;
>>
>>  fail:
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index a2a796c..46acddc 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -1837,11 +1837,13 @@ static int tcp_mtu_probe(struct sock *sk)
>>         struct tcp_sock *tp = tcp_sk(sk);
>>         struct inet_connection_sock *icsk = inet_csk(sk);
>>         struct sk_buff *skb, *nskb, *next;
>> +       struct net *net = sock_net(sk);
>>         int len;
>>         int probe_size;
>>         int size_needed;
>>         int copy;
>>         int mss_now;
>> +       int interval;
>>
>>         /* Not currently probing/verifying,
>>          * not in recovery,
>> @@ -1854,11 +1856,17 @@ static int tcp_mtu_probe(struct sock *sk)
>>             tp->rx_opt.num_sacks || tp->rx_opt.dsack)
>>                 return -1;
>>
>> -       /* Very simple search strategy: just double the MSS. */
>> +       /* Use binary search for probe_size between tcp_mss_base,
>> +        * and current mss_clamp. if (search_high - search_low)
>> +        * smaller than a threshold, backoff from probing.
>> +        */
>>         mss_now = tcp_current_mss(sk);
>> -       probe_size = 2 * tp->mss_cache;
>> +       probe_size = tcp_mtu_to_mss(sk, (icsk->icsk_mtup.search_high +
>> +                                   icsk->icsk_mtup.search_low) >> 1);
>>         size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>> -       if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high)) {
>> +       interval = icsk->icsk_mtup.search_high - icsk->icsk_mtup.search_low;
>> +       if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high) ||
>> +           interval < max(1, net->ipv4.sysctl_tcp_probe_threshold)) {
>>                 /* TODO: set timer for probe_converge_event */
>>                 return -1;
>>         }
>> --
>> 1.7.1
>>
>
>
> I suspect there's plenty of room for further improvement in the
> probing heuristic, but this seems like a reasonable improvement.
>
> Acked-by: John Heffner <johnwheffner@gmail.com>

Actually, one final suggestion here.  The cost of a failed probe is
much higher than a successful probe.  I'd suggest still bounding the
probe segment size to no more than 2*cur_mss.  (Think of a common case
where mss_clamp = 9000 - headers, but the actual path MTU = 1500.)

  -John

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHv4 net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821
  2015-03-04 13:26       ` John Heffner
@ 2015-03-05  2:36         ` Fan Du
  0 siblings, 0 replies; 14+ messages in thread
From: Fan Du @ 2015-03-05  2:36 UTC (permalink / raw)
  To: John Heffner; +Cc: Fan Du, Eric Dumazet, David Miller, Netdev

于 2015年03月04日 21:26, John Heffner 写道:
> On Tue, Mar 3, 2015 at 9:11 PM, Fan Du <fengyuleidian0615@gmail.com> wrote:
>> 于 2015年03月04日 00:39, John Heffner 写道:
>>
>>> On Tue, Mar 3, 2015 at 4:19 AM, Fan Du <fan.du@intel.com> wrote:
>>>>
>>>> As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
>>>> be armed once probing has converged. Once this timer expired,
>>>> probing again to take advantage of any path PMTU change. The
>>>> recommended probing interval is 10 minutes per RFC1981. Probing
>>>> interval could be sysctled by sysctl_tcp_probe_interval.
>>>>
>>>> Eric Dumazet suggested to implement pseudo timer based on 32bits
>>>> jiffies tcp_time_stamp instead of using classic timer for such
>>>> rare event.
>>>>
>>>> Signed-off-by: Fan Du <fan.du@intel.com>
>>>
>>>
>>>> @@ -1823,6 +1825,31 @@ send_now:
>>>>           return false;
>>>>    }
>>>>
>>>> +static inline void tcp_mtu_check_reprobe(struct sock *sk)
>>>> +{
>>>> +       struct inet_connection_sock *icsk = inet_csk(sk);
>>>> +       struct tcp_sock *tp = tcp_sk(sk);
>>>> +       struct net *net = sock_net(sk);
>>>> +       u32 interval;
>>>> +       s32 delta;
>>>> +
>>>> +       interval = net->ipv4.sysctl_tcp_probe_interval;
>>>> +       delta = tcp_time_stamp - icsk->icsk_mtup.probe_timestamp;
>>>> +       if (unlikely(delta >= interval * HZ)) {
>>>> +               int mss = tcp_current_mss(sk);
>>>> +
>>>> +               /* Update current search range */
>>>> +               icsk->icsk_mtup.search_high = tp->rx_opt.mss_clamp +
>>>> +                       sizeof(struct tcphdr) +
>>>> +                       icsk->icsk_af_ops->net_header_len;
>>>> +               icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
>>>> +               icsk->icsk_mtup.probe_size = 0;
>>>> +
>>>> +               /* Update probe time stamp */
>>>> +               icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
>>>> +       }
>>>> +}
>>>> +
>>>>    /* Create a new MTU probe if we are ready.
>>>>     * MTU probe is regularly attempting to increase the path MTU by
>>>>     * deliberately sending larger packets.  This discovers routing
>>>
>>>
>>> I think the only update to the search range required here is
>>> potentially moving search_high upward.  Touching search_low seems
>>> unnecessary,
>>
>> No.
>>
>> If you restoring search_low to its original, the first probe will probably
>> be no better than using current mss, because current mss is working good,
>> there may be better mss between mtu(current mss) and search_high. Reprobing
>> from search_low to search_high is definitely not worth the effort.
>
> Set search_high to the maximal MSS.  The next probe will be halfway
> between search_low and MSS.  There's no reason to reduce search_low
> when the current MSS == search_low.

I think you mis-interpret the code and my intention here, put it in a simple way:
Search range is 512 ~ 1500, if current mss is 1024, then for another reprobing,
Should we start from A or B?
A: 1024 ~ 1500
B: 512  ~ 1500

The modification choose searching from A, because more optimal mss lies between
current mss and search_high.

In case of current mss == search_low, the modification also cover this scenario.
Again reprobing from current mss to search high.

>
>>>> @@ -1866,8 +1893,11 @@ static int tcp_mtu_probe(struct sock *sk)
>>>>           size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>>>>           interval = icsk->icsk_mtup.search_high -
>>>> icsk->icsk_mtup.search_low;
>>>>           if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high)
>>>> ||
>>>> -           interval < max(1, net->ipv4.sysctl_tcp_probe_threshold)) {
>>>> -               /* TODO: set timer for probe_converge_event */
>>>> +               interval < net->ipv4.sysctl_tcp_probe_threshold) {
>>>> +               /* Check whether enough time has elaplased for
>>>> +                * another round of probing.
>>>> +                */
>>>> +               tcp_mtu_check_reprobe(sk);
>>>>                   return -1;
>>>>           }
>>>
>>>
>>> The way this check works, I think putting it here may not be exactly
>>> what we want.  The comment was to set a timer here, but that's not
>>> exactly what tcp_mtu_check_reprobe does.  Since it may update the
>>> search range, I think it would be better to call prior to comparing
>>> the candidate probe_size to search_high.
>>
>> Semantically speaking, reprobe checking should be placed inside where
>> a decision to not probe has been made, reprobing from a stable state ;)
>> If we moving the reporobe checking outside, current probing may be legal,
>> but the reprobe check will rule it out.
>>
>> So I don't see any compelling reason to move the reprobe checking outside.
>
> The reprobe check should happen when you know you don't already have a
> pending probe, not when a decision has already been made not to probe.
> The impact is relatively minor because practically you're just
> extending the duration of a long timer a bit more, but there's not a
> good reason to do so.
When a reprobe check tell us enough time has elapsed, it's time for another probe
attempt, but *IF *we have already been probing at the time being due to misfortunes,
and we are approaching to the optimal mss no longer in distance than reprobing
from the very beginning. Why bother ignoring current probe effort?

Restarting the reprobe in exactly time interval sounds logical, but not so in
scenario where a *active* probing is going on.

> Thanks,
>    -John
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHv4 net-next 2/3] ipv4: Use binary search to choose tcp PMTU probe_size
  2015-03-04 13:39     ` John Heffner
@ 2015-03-05  2:36       ` Fan Du
  0 siblings, 0 replies; 14+ messages in thread
From: Fan Du @ 2015-03-05  2:36 UTC (permalink / raw)
  To: John Heffner; +Cc: Fan Du, Eric Dumazet, David Miller, Netdev

于 2015年03月04日 21:39, John Heffner 写道:
>> >I suspect there's plenty of room for further improvement in the
>> >probing heuristic, but this seems like a reasonable improvement.
>> >
>> >Acked-by: John Heffner<johnwheffner@gmail.com>
> Actually, one final suggestion here.  The cost of a failed probe is
> much higher than a successful probe.  I'd suggest still bounding the
> probe segment size to no more than 2*cur_mss.  (Think of a common case
> where mss_clamp = 9000 - headers, but the actual path MTU = 1500.)

huh...
Are you serious about clamping probe size to no more than 2*current_mss?
What about the opposite scenario where path MTU enlarges, e.g., current_mss
is nearing search_low?

I will make next version incorporating:
a. Update ip-sysctl.txt
b. Zero probe_size before reset search_low/high


>    -John

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2015-03-05  2:41 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-03  9:19 [PATCHv4 net-next 0/3] Improvements for TCP PMTU Fan Du
2015-03-03  9:19 ` [PATCHv4 net-next 1/3] ipv4: Raise tcp PMTU probe mss base size Fan Du
2015-03-03 16:52   ` John Heffner
2015-03-03  9:19 ` [PATCHv4 net-next 2/3] ipv4: Use binary search to choose tcp PMTU probe_size Fan Du
2015-03-03 16:51   ` John Heffner
2015-03-04 13:39     ` John Heffner
2015-03-05  2:36       ` Fan Du
2015-03-03  9:19 ` [PATCHv4 net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821 Fan Du
2015-03-03 16:39   ` John Heffner
2015-03-04  2:11     ` Fan Du
2015-03-04 13:26       ` John Heffner
2015-03-05  2:36         ` Fan Du
2015-03-03 16:43 ` [PATCHv4 net-next 0/3] Improvements for TCP PMTU John Heffner
2015-03-04  2:12   ` Fan Du

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.