All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 0/3] Small fix for TCP PMTU
@ 2015-02-13  8:16 Fan Du
  2015-02-13  8:16 ` [PATCH net-next 1/3] ipv4: Raise tcp PMTU probe mss base size Fan Du
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Fan Du @ 2015-02-13  8:16 UTC (permalink / raw)
  To: davem; +Cc: netdev, fengyuleidian0615

This patchset perform some small fix for current TCP PMTU
as per RFC4821 with below changes:

Patch1/3: Set probe mss base to 1024 Bytes
Patch2/3: Do not double probe_size for each probing,
	  use a simple binary search to gain maximum performance.
Patch3/3: Create a probe timer to detect enlarged path MTU.

Fan Du (3):
  ipv4: Raise tcp PMTU probe mss base size
  ipv4: Use binary search to choose tcp PMTU probe_size
  ipv4: Create probe timer for tcp PMTU as per RFC4821

 include/net/inet_connection_sock.h |    5 +++++
 include/net/netns/ipv4.h           |    1 +
 include/net/tcp.h                  |    5 ++++-
 net/ipv4/sysctl_net_ipv4.c         |    7 +++++++
 net/ipv4/tcp.c                     |    2 ++
 net/ipv4/tcp_input.c               |    5 ++++-
 net/ipv4/tcp_ipv4.c                |    1 +
 net/ipv4/tcp_output.c              |   35 +++++++++++++++++++++++++++++++----
 net/ipv4/tcp_timer.c               |    2 +-
 9 files changed, 56 insertions(+), 7 deletions(-)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH net-next 1/3] ipv4: Raise tcp PMTU probe mss base size
  2015-02-13  8:16 [PATCH net-next 0/3] Small fix for TCP PMTU Fan Du
@ 2015-02-13  8:16 ` Fan Du
  2015-02-13  9:49   ` yzhu1
  2015-02-13  8:16 ` [PATCH net-next 2/3] ipv4: Use binary search to choose tcp PMTU probe_size Fan Du
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Fan Du @ 2015-02-13  8:16 UTC (permalink / raw)
  To: davem; +Cc: netdev, fengyuleidian0615

Quotes from RFC4821 7.2.  Selecting Initial Values

   It is RECOMMENDED that search_low be initially set to an MTU size
   that is likely to work over a very wide range of environments.  Given
   today's technologies, a value of 1024 bytes is probably safe enough.
   The initial value for search_low SHOULD be configurable.

Moreover, set a small value will introduce extra time for the search
to converge. So set the initial probe base mss size to 1024 Bytes.

Signed-off-by: Fan Du <fan.du@intel.com>
---
 include/net/tcp.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 8d6b983..7b57e5b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -65,7 +65,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCP_MIN_MSS		88U
 
 /* The least MTU to use for probing */
-#define TCP_BASE_MSS		512
+#define TCP_BASE_MSS		1024
 
 /* After receiving this amount of duplicate ACKs fast retransmit starts. */
 #define TCP_FASTRETRANS_THRESH 3
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH net-next 2/3] ipv4: Use binary search to choose tcp PMTU probe_size
  2015-02-13  8:16 [PATCH net-next 0/3] Small fix for TCP PMTU Fan Du
  2015-02-13  8:16 ` [PATCH net-next 1/3] ipv4: Raise tcp PMTU probe mss base size Fan Du
@ 2015-02-13  8:16 ` Fan Du
  2015-02-13 17:52   ` John Heffner
  2015-02-13  8:16 ` [PATCH net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821 Fan Du
  2015-02-26  3:49 ` [PATCHv2 net-next 0/4] Small fix for TCP PMTU Fan Du
  3 siblings, 1 reply; 24+ messages in thread
From: Fan Du @ 2015-02-13  8:16 UTC (permalink / raw)
  To: davem; +Cc: netdev, fengyuleidian0615

Current probe_size is chosen by doubling mss_cache,
the initial mss base is 512 Bytes, as a result the
converged probe_size will only be 1024 Bytes, there
is still big gap between 1024 and common 1500 bytes
of mtu.

Use binary search to choose probe_size in a fine
granularity manner, an optimal mss will be found
to boost performance as its maxmium.

Test env:
Docker instance with vxlan encapuslation(82599EB)
iperf -c 10.0.0.24  -t 60

before this patch:
1.26 Gbits/sec

After this patch: increase 26%
1.59 Gbits/sec

Signed-off-by: Fan Du <fan.du@intel.com>
---
 include/net/inet_connection_sock.h |    3 +++
 net/ipv4/tcp_input.c               |    5 ++++-
 net/ipv4/tcp_output.c              |   12 +++++++++---
 net/ipv4/tcp_timer.c               |    2 +-
 4 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 5976bde..3d0932e 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -124,6 +124,9 @@ struct inet_connection_sock {
 		int		  search_high;
 		int		  search_low;
 
+		int		  search_high_sav;
+		int		  search_low_sav;
+
 		/* Information on the current probe. */
 		int		  probe_size;
 	} icsk_mtup;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8fdd27b..20b28e9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2613,7 +2613,10 @@ static void tcp_mtup_probe_success(struct sock *sk)
 	tp->snd_cwnd_stamp = tcp_time_stamp;
 	tp->snd_ssthresh = tcp_current_ssthresh(sk);
 
-	icsk->icsk_mtup.search_low = icsk->icsk_mtup.probe_size;
+	if (icsk->icsk_mtup.search_low == icsk->icsk_mtup.probe_size)
+		icsk->icsk_mtup.search_low = icsk->icsk_mtup.search_high;
+	else
+		icsk->icsk_mtup.search_low = icsk->icsk_mtup.probe_size;
 	icsk->icsk_mtup.probe_size = 0;
 	tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
 }
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a2a796c..0a60deb 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1349,10 +1349,13 @@ void tcp_mtup_init(struct sock *sk)
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct net *net = sock_net(sk);
 
-	icsk->icsk_mtup.enabled = net->ipv4.sysctl_tcp_mtu_probing > 1;
+	icsk->icsk_mtup.enabled = net->ipv4.sysctl_tcp_mtu_probing;
 	icsk->icsk_mtup.search_high = tp->rx_opt.mss_clamp + sizeof(struct tcphdr) +
 			       icsk->icsk_af_ops->net_header_len;
 	icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, net->ipv4.sysctl_tcp_base_mss);
+
+	icsk->icsk_mtup.search_high_sav = icsk->icsk_mtup.search_high;
+	icsk->icsk_mtup.search_low_sav = icsk->icsk_mtup.search_low;
 	icsk->icsk_mtup.probe_size = 0;
 }
 EXPORT_SYMBOL(tcp_mtup_init);
@@ -1854,9 +1857,12 @@ static int tcp_mtu_probe(struct sock *sk)
 	    tp->rx_opt.num_sacks || tp->rx_opt.dsack)
 		return -1;
 
-	/* Very simple search strategy: just double the MSS. */
+	/* Use binary search for probe_size bewteen tcp_mss_base
+	 * and current mss_clamp.
+	 */
 	mss_now = tcp_current_mss(sk);
-	probe_size = 2 * tp->mss_cache;
+
+	probe_size = (icsk->icsk_mtup.search_high + icsk->icsk_mtup.search_low) >> 1;
 	size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
 	if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high)) {
 		/* TODO: set timer for probe_converge_event */
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 0732b78..9d1cfe0 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -113,7 +113,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
 			struct tcp_sock *tp = tcp_sk(sk);
 			int mss;
 
-			mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
+			mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low);
 			mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
 			mss = max(mss, 68 - tp->tcp_header_len);
 			icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821
  2015-02-13  8:16 [PATCH net-next 0/3] Small fix for TCP PMTU Fan Du
  2015-02-13  8:16 ` [PATCH net-next 1/3] ipv4: Raise tcp PMTU probe mss base size Fan Du
  2015-02-13  8:16 ` [PATCH net-next 2/3] ipv4: Use binary search to choose tcp PMTU probe_size Fan Du
@ 2015-02-13  8:16 ` Fan Du
  2015-02-13  9:59   ` Ying Xue
  2015-02-13 12:31   ` Eric Dumazet
  2015-02-26  3:49 ` [PATCHv2 net-next 0/4] Small fix for TCP PMTU Fan Du
  3 siblings, 2 replies; 24+ messages in thread
From: Fan Du @ 2015-02-13  8:16 UTC (permalink / raw)
  To: davem; +Cc: netdev, fengyuleidian0615

As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
be armed once probing has converged. Once this timer expired,
probing again to take advantage of any path PMTU change. The
recommended probing interval is 10 minutes per RFC1981.

Signed-off-by: Fan Du <fan.du@intel.com>
---
 include/net/inet_connection_sock.h |    2 ++
 include/net/netns/ipv4.h           |    1 +
 include/net/tcp.h                  |    3 +++
 net/ipv4/sysctl_net_ipv4.c         |    7 +++++++
 net/ipv4/tcp.c                     |    2 ++
 net/ipv4/tcp_ipv4.c                |    1 +
 net/ipv4/tcp_output.c              |   23 ++++++++++++++++++++++-
 7 files changed, 38 insertions(+), 1 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 3d0932e..e78e5ab 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -126,6 +126,8 @@ struct inet_connection_sock {
 
 		int		  search_high_sav;
 		int		  search_low_sav;
+
+		struct timer_list probe_timer;
 
 		/* Information on the current probe. */
 		int		  probe_size;
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index dbe2254..bb2c2d1 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -84,6 +84,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_fwmark_accept;
 	int sysctl_tcp_mtu_probing;
 	int sysctl_tcp_base_mss;
+	u32 sysctl_tcp_probe_interval;
 
 	struct ping_group_range ping_group_range;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7b57e5b..16fa2e6 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -67,6 +67,9 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 /* The least MTU to use for probing */
 #define TCP_BASE_MSS		1024
 
+/* probing interval, default to 10 minutes as per RFC4821 */
+#define TCP_PROBE_INTERVAL	600
+
 /* After receiving this amount of duplicate ACKs fast retransmit starts. */
 #define TCP_FASTRETRANS_THRESH 3
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d151539..4fa5d31 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -883,6 +883,13 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "tcp_probe_interval",
+		.data		= &init_net.ipv4.sysctl_tcp_probe_interval,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9d72a0f..46413ee 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1986,6 +1986,7 @@ void tcp_close(struct sock *sk, long timeout)
 	struct sk_buff *skb;
 	int data_was_unread = 0;
 	int state;
+	struct inet_connection_sock *icsk = inet_csk(sk);
 
 	lock_sock(sk);
 	sk->sk_shutdown = SHUTDOWN_MASK;
@@ -2149,6 +2150,7 @@ adjudge_to_death:
 	/* Otherwise, socket is reprieved until protocol close. */
 
 out:
+	del_timer(&icsk->icsk_mtup.probe_timer);
 	bh_unlock_sock(sk);
 	local_bh_enable();
 	sock_put(sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 5a2dfed..3cc71b3 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2460,6 +2460,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	}
 	net->ipv4.sysctl_tcp_ecn = 2;
 	net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
+	net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
 	return 0;
 
 fail:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0a60deb..461b4a4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1342,6 +1342,18 @@ int tcp_mss_to_mtu(struct sock *sk, int mss)
 	return mtu;
 }
 
+static void icsk_mtup_probe_timer(unsigned long arg)
+{
+	struct sock *sk = (struct sock *)arg;
+	struct net *net = sock_net(sk);
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	/* Restore orignal search range */
+	icsk->icsk_mtup.search_high = icsk->icsk_mtup.search_high_sav;
+	icsk->icsk_mtup.search_low = icsk->icsk_mtup.search_low_sav;
+	icsk->icsk_mtup.probe_size = 0;
+}
+
 /* MTU probing init per socket */
 void tcp_mtup_init(struct sock *sk)
 {
@@ -1357,6 +1369,9 @@ void tcp_mtup_init(struct sock *sk)
 	icsk->icsk_mtup.search_high_sav = icsk->icsk_mtup.search_high;
 	icsk->icsk_mtup.search_low_sav = icsk->icsk_mtup.search_low;
 	icsk->icsk_mtup.probe_size = 0;
+
+	setup_timer(&icsk->icsk_mtup.probe_timer, icsk_mtup_probe_timer,
+		    (unsigned long)sk);
 }
 EXPORT_SYMBOL(tcp_mtup_init);
 
@@ -1840,6 +1855,7 @@ static int tcp_mtu_probe(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct sk_buff *skb, *nskb, *next;
+	struct net *net = sock_net(sk);
 	int len;
 	int probe_size;
 	int size_needed;
@@ -1865,7 +1881,12 @@ static int tcp_mtu_probe(struct sock *sk)
 	probe_size = (icsk->icsk_mtup.search_high + icsk->icsk_mtup.search_low) >> 1;
 	size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
 	if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high)) {
-		/* TODO: set timer for probe_converge_event */
+		u32 probe_interval = net->ipv4.sysctl_tcp_probe_interval;
+
+		/* Search has been converged, start the timer,
+		 * take advantage of path changing */
+		mod_timer(&icsk->icsk_mtup.probe_timer,
+			  jiffies + msecs_to_jiffies(1000*probe_interval));
 		return -1;
 	}
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 1/3] ipv4: Raise tcp PMTU probe mss base size
  2015-02-13  8:16 ` [PATCH net-next 1/3] ipv4: Raise tcp PMTU probe mss base size Fan Du
@ 2015-02-13  9:49   ` yzhu1
  2015-02-16  5:15     ` Fan Du
  0 siblings, 1 reply; 24+ messages in thread
From: yzhu1 @ 2015-02-13  9:49 UTC (permalink / raw)
  To: Fan Du, davem; +Cc: netdev, fengyuleidian0615

backward compatible? :-D

Zhu Yanjun
On 02/13/2015 04:16 PM, Fan Du wrote:
> Quotes from RFC4821 7.2.  Selecting Initial Values
>
>     It is RECOMMENDED that search_low be initially set to an MTU size
>     that is likely to work over a very wide range of environments.  Given
>     today's technologies, a value of 1024 bytes is probably safe enough.
>     The initial value for search_low SHOULD be configurable.
>
> Moreover, set a small value will introduce extra time for the search
> to converge. So set the initial probe base mss size to 1024 Bytes.
>
> Signed-off-by: Fan Du <fan.du@intel.com>
> ---
>   include/net/tcp.h |    2 +-
>   1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 8d6b983..7b57e5b 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -65,7 +65,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
>   #define TCP_MIN_MSS		88U
>   
>   /* The least MTU to use for probing */
> -#define TCP_BASE_MSS		512
> +#define TCP_BASE_MSS		1024
>   
>   /* After receiving this amount of duplicate ACKs fast retransmit starts. */
>   #define TCP_FASTRETRANS_THRESH 3

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821
  2015-02-13  8:16 ` [PATCH net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821 Fan Du
@ 2015-02-13  9:59   ` Ying Xue
  2015-02-16  5:28     ` Fan Du
  2015-02-13 12:31   ` Eric Dumazet
  1 sibling, 1 reply; 24+ messages in thread
From: Ying Xue @ 2015-02-13  9:59 UTC (permalink / raw)
  To: Fan Du, davem; +Cc: netdev, fengyuleidian0615

On 02/13/2015 04:16 PM, Fan Du wrote:
> As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
> be armed once probing has converged. Once this timer expired,
> probing again to take advantage of any path PMTU change. The
> recommended probing interval is 10 minutes per RFC1981.
> 
> Signed-off-by: Fan Du <fan.du@intel.com>
> ---
>  include/net/inet_connection_sock.h |    2 ++
>  include/net/netns/ipv4.h           |    1 +
>  include/net/tcp.h                  |    3 +++
>  net/ipv4/sysctl_net_ipv4.c         |    7 +++++++
>  net/ipv4/tcp.c                     |    2 ++
>  net/ipv4/tcp_ipv4.c                |    1 +
>  net/ipv4/tcp_output.c              |   23 ++++++++++++++++++++++-
>  7 files changed, 38 insertions(+), 1 deletions(-)
> 
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index 3d0932e..e78e5ab 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -126,6 +126,8 @@ struct inet_connection_sock {
>  
>  		int		  search_high_sav;
>  		int		  search_low_sav;
> +
> +		struct timer_list probe_timer;
>  
>  		/* Information on the current probe. */
>  		int		  probe_size;
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index dbe2254..bb2c2d1 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -84,6 +84,7 @@ struct netns_ipv4 {
>  	int sysctl_tcp_fwmark_accept;
>  	int sysctl_tcp_mtu_probing;
>  	int sysctl_tcp_base_mss;
> +	u32 sysctl_tcp_probe_interval;
>  
>  	struct ping_group_range ping_group_range;
>  
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 7b57e5b..16fa2e6 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -67,6 +67,9 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
>  /* The least MTU to use for probing */
>  #define TCP_BASE_MSS		1024
>  
> +/* probing interval, default to 10 minutes as per RFC4821 */
> +#define TCP_PROBE_INTERVAL	600
> +
>  /* After receiving this amount of duplicate ACKs fast retransmit starts. */
>  #define TCP_FASTRETRANS_THRESH 3
>  
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index d151539..4fa5d31 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -883,6 +883,13 @@ static struct ctl_table ipv4_net_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec,
>  	},
> +	{
> +		.procname	= "tcp_probe_interval",
> +		.data		= &init_net.ipv4.sysctl_tcp_probe_interval,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
>  	{ }
>  };
>  
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 9d72a0f..46413ee 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1986,6 +1986,7 @@ void tcp_close(struct sock *sk, long timeout)
>  	struct sk_buff *skb;
>  	int data_was_unread = 0;
>  	int state;
> +	struct inet_connection_sock *icsk = inet_csk(sk);
>  
>  	lock_sock(sk);
>  	sk->sk_shutdown = SHUTDOWN_MASK;
> @@ -2149,6 +2150,7 @@ adjudge_to_death:
>  	/* Otherwise, socket is reprieved until protocol close. */
>  
>  out:
> +	del_timer(&icsk->icsk_mtup.probe_timer);
>  	bh_unlock_sock(sk);
>  	local_bh_enable();
>  	sock_put(sk);
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 5a2dfed..3cc71b3 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2460,6 +2460,7 @@ static int __net_init tcp_sk_init(struct net *net)
>  	}
>  	net->ipv4.sysctl_tcp_ecn = 2;
>  	net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
> +	net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
>  	return 0;
>  
>  fail:
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 0a60deb..461b4a4 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1342,6 +1342,18 @@ int tcp_mss_to_mtu(struct sock *sk, int mss)
>  	return mtu;
>  }
>  
> +static void icsk_mtup_probe_timer(unsigned long arg)
> +{
> +	struct sock *sk = (struct sock *)arg;
> +	struct net *net = sock_net(sk);
> +	struct inet_connection_sock *icsk = inet_csk(sk);
> +
> +	/* Restore orignal search range */
> +	icsk->icsk_mtup.search_high = icsk->icsk_mtup.search_high_sav;
> +	icsk->icsk_mtup.search_low = icsk->icsk_mtup.search_low_sav;
> +	icsk->icsk_mtup.probe_size = 0;
> +}
> +

As icsk_mtup_probe_timer() is run asynchronously, we may touch an invalid socket
instance if we don't hold socket's refcount before launching the timer.

Therefore, in general we use the standard interfaces like sk_reset_timer() and
sk_stop_timer() to operate timers associated with socket. So, the usage about
timer in the patch seems unsafe for us. For instance, you can study how
icsk_retransmit_timer, icsk_delack_timer and sk_timer, are implemented.

Regards,
Ying

>  /* MTU probing init per socket */
>  void tcp_mtup_init(struct sock *sk)
>  {
> @@ -1357,6 +1369,9 @@ void tcp_mtup_init(struct sock *sk)
>  	icsk->icsk_mtup.search_high_sav = icsk->icsk_mtup.search_high;
>  	icsk->icsk_mtup.search_low_sav = icsk->icsk_mtup.search_low;
>  	icsk->icsk_mtup.probe_size = 0;
> +
> +	setup_timer(&icsk->icsk_mtup.probe_timer, icsk_mtup_probe_timer,
> +		    (unsigned long)sk);
>  }
>  EXPORT_SYMBOL(tcp_mtup_init);
>  
> @@ -1840,6 +1855,7 @@ static int tcp_mtu_probe(struct sock *sk)
>  	struct tcp_sock *tp = tcp_sk(sk);
>  	struct inet_connection_sock *icsk = inet_csk(sk);
>  	struct sk_buff *skb, *nskb, *next;
> +	struct net *net = sock_net(sk);
>  	int len;
>  	int probe_size;
>  	int size_needed;
> @@ -1865,7 +1881,12 @@ static int tcp_mtu_probe(struct sock *sk)
>  	probe_size = (icsk->icsk_mtup.search_high + icsk->icsk_mtup.search_low) >> 1;
>  	size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>  	if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high)) {
> -		/* TODO: set timer for probe_converge_event */
> +		u32 probe_interval = net->ipv4.sysctl_tcp_probe_interval;
> +
> +		/* Search has been converged, start the timer,
> +		 * take advantage of path changing */
> +		mod_timer(&icsk->icsk_mtup.probe_timer,
> +			  jiffies + msecs_to_jiffies(1000*probe_interval));
>  		return -1;
>  	}
>  
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821
  2015-02-13  8:16 ` [PATCH net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821 Fan Du
  2015-02-13  9:59   ` Ying Xue
@ 2015-02-13 12:31   ` Eric Dumazet
  2015-02-16  5:38     ` Fan Du
  1 sibling, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2015-02-13 12:31 UTC (permalink / raw)
  To: Fan Du; +Cc: davem, netdev, fengyuleidian0615

On Fri, 2015-02-13 at 16:16 +0800, Fan Du wrote:
> As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
> be armed once probing has converged. Once this timer expired,
> probing again to take advantage of any path PMTU change. The
> recommended probing interval is 10 minutes per RFC1981.
> 
> Signed-off-by: Fan Du <fan.du@intel.com>
> ---
>  include/net/inet_connection_sock.h |    2 ++
>  include/net/netns/ipv4.h           |    1 +
>  include/net/tcp.h                  |    3 +++
>  net/ipv4/sysctl_net_ipv4.c         |    7 +++++++
>  net/ipv4/tcp.c                     |    2 ++
>  net/ipv4/tcp_ipv4.c                |    1 +
>  net/ipv4/tcp_output.c              |   23 ++++++++++++++++++++++-
>  7 files changed, 38 insertions(+), 1 deletions(-)
> 
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index 3d0932e..e78e5ab 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -126,6 +126,8 @@ struct inet_connection_sock {
>  
>  		int		  search_high_sav;
>  		int		  search_low_sav;
> +
> +		struct timer_list probe_timer;
>  

We certainly wont add yet another timer in tcp socket for such usage.

And a buggy one, since you forgot all the refcounting associated with
such timers.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/3] ipv4: Use binary search to choose tcp PMTU probe_size
  2015-02-13  8:16 ` [PATCH net-next 2/3] ipv4: Use binary search to choose tcp PMTU probe_size Fan Du
@ 2015-02-13 17:52   ` John Heffner
  2015-02-16  5:27     ` Fan Du
  0 siblings, 1 reply; 24+ messages in thread
From: John Heffner @ 2015-02-13 17:52 UTC (permalink / raw)
  To: Fan Du; +Cc: David Miller, Netdev, fengyuleidian0615

On Fri, Feb 13, 2015 at 3:16 AM, Fan Du <fan.du@intel.com> wrote:
> Current probe_size is chosen by doubling mss_cache,
> the initial mss base is 512 Bytes, as a result the
> converged probe_size will only be 1024 Bytes, there
> is still big gap between 1024 and common 1500 bytes
> of mtu.
>
> Use binary search to choose probe_size in a fine
> granularity manner, an optimal mss will be found
> to boost performance as its maxmium.
>
> Test env:
> Docker instance with vxlan encapuslation(82599EB)
> iperf -c 10.0.0.24  -t 60
>
> before this patch:
> 1.26 Gbits/sec
>
> After this patch: increase 26%
> 1.59 Gbits/sec
>
> Signed-off-by: Fan Du <fan.du@intel.com>

Thanks for looking into making mtu probing better.  Improving the
search strategy is commendable.  One high level comment though is that
there's some cost associated with probing and diminishing returns the
smaller the interval (search_high - search_low), so there should be
some threshold below which further probing is deemed no longer useful.

Aside from that, some things in this patch don't look right to me.
Comments inline below.


> ---
>  include/net/inet_connection_sock.h |    3 +++
>  net/ipv4/tcp_input.c               |    5 ++++-
>  net/ipv4/tcp_output.c              |   12 +++++++++---
>  net/ipv4/tcp_timer.c               |    2 +-
>  4 files changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index 5976bde..3d0932e 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -124,6 +124,9 @@ struct inet_connection_sock {
>                 int               search_high;
>                 int               search_low;
>
> +               int               search_high_sav;
> +               int               search_low_sav;
> +
>                 /* Information on the current probe. */
>                 int               probe_size;
>         } icsk_mtup;


What are these for?  They're assigned but not used.


> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 8fdd27b..20b28e9 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2613,7 +2613,10 @@ static void tcp_mtup_probe_success(struct sock *sk)
>         tp->snd_cwnd_stamp = tcp_time_stamp;
>         tp->snd_ssthresh = tcp_current_ssthresh(sk);
>
> -       icsk->icsk_mtup.search_low = icsk->icsk_mtup.probe_size;
> +       if (icsk->icsk_mtup.search_low == icsk->icsk_mtup.probe_size)
> +               icsk->icsk_mtup.search_low = icsk->icsk_mtup.search_high;
> +       else
> +               icsk->icsk_mtup.search_low = icsk->icsk_mtup.probe_size;
>         icsk->icsk_mtup.probe_size = 0;
>         tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
>  }

It would be cleaner to handle this in tcp_mtu_probe, in deciding
whether to issue a probe, than to change the semantics of search_high
and search_low.  Issuing a probe where probe_size == search_low seems
like the wrong thing to do.


> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index a2a796c..0a60deb 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1349,10 +1349,13 @@ void tcp_mtup_init(struct sock *sk)
>         struct inet_connection_sock *icsk = inet_csk(sk);
>         struct net *net = sock_net(sk);
>
> -       icsk->icsk_mtup.enabled = net->ipv4.sysctl_tcp_mtu_probing > 1;
> +       icsk->icsk_mtup.enabled = net->ipv4.sysctl_tcp_mtu_probing;
>         icsk->icsk_mtup.search_high = tp->rx_opt.mss_clamp + sizeof(struct tcphdr) +
>                                icsk->icsk_af_ops->net_header_len;
>         icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, net->ipv4.sysctl_tcp_base_mss);
> +
> +       icsk->icsk_mtup.search_high_sav = icsk->icsk_mtup.search_high;
> +       icsk->icsk_mtup.search_low_sav = icsk->icsk_mtup.search_low;
>         icsk->icsk_mtup.probe_size = 0;
>  }
>  EXPORT_SYMBOL(tcp_mtup_init);

You're changing the meaning of sysctl_tcp_mtu_probing.  I don't think
that's what you want.  From Documentation/networking/ip-sysctl.txt:

tcp_mtu_probing - INTEGER
Controls TCP Packetization-Layer Path MTU Discovery.  Takes three
values:
 0 - Disabled
 1 - Disabled by default, enabled when an ICMP black hole detected
 2 - Always enabled, use initial MSS of tcp_base_mss.



> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 0732b78..9d1cfe0 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -113,7 +113,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
>                         struct tcp_sock *tp = tcp_sk(sk);
>                         int mss;
>
> -                       mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
> +                       mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low);
>                         mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
>                         mss = max(mss, 68 - tp->tcp_header_len);
>                         icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);

Why did you change this?  I think this breaks black hole detection.

Thanks,
  -John

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 1/3] ipv4: Raise tcp PMTU probe mss base size
  2015-02-13  9:49   ` yzhu1
@ 2015-02-16  5:15     ` Fan Du
  0 siblings, 0 replies; 24+ messages in thread
From: Fan Du @ 2015-02-16  5:15 UTC (permalink / raw)
  To: yzhu1; +Cc: Fan Du, davem, netdev

于 2015年02月13日 17:49, yzhu1 写道:
> backward compatible? :-D

yes, it will be auto tuned back in black hole detecting path.
Also user could adjust the base size through /proc/sys/net/ipv4/tcp_base_mss
for optimal configuration.

> Zhu Yanjun
> On 02/13/2015 04:16 PM, Fan Du wrote:
>> Quotes from RFC4821 7.2.  Selecting Initial Values
>>
>>     It is RECOMMENDED that search_low be initially set to an MTU size
>>     that is likely to work over a very wide range of environments.  Given
>>     today's technologies, a value of 1024 bytes is probably safe enough.
>>     The initial value for search_low SHOULD be configurable.
>>
>> Moreover, set a small value will introduce extra time for the search
>> to converge. So set the initial probe base mss size to 1024 Bytes.
>>
>> Signed-off-by: Fan Du <fan.du@intel.com>
>> ---

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/3] ipv4: Use binary search to choose tcp PMTU probe_size
  2015-02-13 17:52   ` John Heffner
@ 2015-02-16  5:27     ` Fan Du
  2015-02-16 23:59       ` John Heffner
  0 siblings, 1 reply; 24+ messages in thread
From: Fan Du @ 2015-02-16  5:27 UTC (permalink / raw)
  To: John Heffner; +Cc: Fan Du, David Miller, Netdev

于 2015年02月14日 01:52, John Heffner 写道:
> On Fri, Feb 13, 2015 at 3:16 AM, Fan Du <fan.du@intel.com> wrote:
>> Current probe_size is chosen by doubling mss_cache,
>> the initial mss base is 512 Bytes, as a result the
>> converged probe_size will only be 1024 Bytes, there
>> is still big gap between 1024 and common 1500 bytes
>> of mtu.
>>
>> Use binary search to choose probe_size in a fine
>> granularity manner, an optimal mss will be found
>> to boost performance as its maxmium.
>>
>> Test env:
>> Docker instance with vxlan encapuslation(82599EB)
>> iperf -c 10.0.0.24  -t 60
>>
>> before this patch:
>> 1.26 Gbits/sec
>>
>> After this patch: increase 26%
>> 1.59 Gbits/sec
>>
>> Signed-off-by: Fan Du <fan.du@intel.com>
>
> Thanks for looking into making mtu probing better.  Improving the
> search strategy is commendable.  One high level comment though is that
> there's some cost associated with probing and diminishing returns the
> smaller the interval (search_high - search_low), so there should be
> some threshold below which further probing is deemed no longer useful.
>
> Aside from that, some things in this patch don't look right to me.
> Comments inline below.
>
>
>> ---
>>   include/net/inet_connection_sock.h |    3 +++
>>   net/ipv4/tcp_input.c               |    5 ++++-
>>   net/ipv4/tcp_output.c              |   12 +++++++++---
>>   net/ipv4/tcp_timer.c               |    2 +-
>>   4 files changed, 17 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
>> index 5976bde..3d0932e 100644
>> --- a/include/net/inet_connection_sock.h
>> +++ b/include/net/inet_connection_sock.h
>> @@ -124,6 +124,9 @@ struct inet_connection_sock {
>>                  int               search_high;
>>                  int               search_low;
>>
>> +               int               search_high_sav;
>> +               int               search_low_sav;
>> +
>>                  /* Information on the current probe. */
>>                  int               probe_size;
>>          } icsk_mtup;
>
>
> What are these for?  They're assigned but not used.

It's used by the probe timer to restore original search range.
See patch3/3.

>
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index 8fdd27b..20b28e9 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -2613,7 +2613,10 @@ static void tcp_mtup_probe_success(struct sock *sk)
>>          tp->snd_cwnd_stamp = tcp_time_stamp;
>>          tp->snd_ssthresh = tcp_current_ssthresh(sk);
>>
>> -       icsk->icsk_mtup.search_low = icsk->icsk_mtup.probe_size;
>> +       if (icsk->icsk_mtup.search_low == icsk->icsk_mtup.probe_size)
>> +               icsk->icsk_mtup.search_low = icsk->icsk_mtup.search_high;
>> +       else
>> +               icsk->icsk_mtup.search_low = icsk->icsk_mtup.probe_size;
>>          icsk->icsk_mtup.probe_size = 0;
>>          tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
>>   }
>
> It would be cleaner to handle this in tcp_mtu_probe, in deciding
> whether to issue a probe, than to change the semantics of search_high
> and search_low.  Issuing a probe where probe_size == search_low seems
> like the wrong thing to do.
That's my original thoughts, the seconds thoughts is every BYTE in datacenter
cost money, so why not to get optimal performance by using every possible byte
available.

Anyway, a sysctl threshold will also do the job, will incorporate this in next version.

>
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index a2a796c..0a60deb 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -1349,10 +1349,13 @@ void tcp_mtup_init(struct sock *sk)
>>          struct inet_connection_sock *icsk = inet_csk(sk);
>>          struct net *net = sock_net(sk);
>>
>> -       icsk->icsk_mtup.enabled = net->ipv4.sysctl_tcp_mtu_probing > 1;
>> +       icsk->icsk_mtup.enabled = net->ipv4.sysctl_tcp_mtu_probing;
>>          icsk->icsk_mtup.search_high = tp->rx_opt.mss_clamp + sizeof(struct tcphdr) +
>>                                 icsk->icsk_af_ops->net_header_len;
>>          icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, net->ipv4.sysctl_tcp_base_mss);
>> +
>> +       icsk->icsk_mtup.search_high_sav = icsk->icsk_mtup.search_high;
>> +       icsk->icsk_mtup.search_low_sav = icsk->icsk_mtup.search_low;
>>          icsk->icsk_mtup.probe_size = 0;
>>   }
>>   EXPORT_SYMBOL(tcp_mtup_init);
>
> You're changing the meaning of sysctl_tcp_mtu_probing.  I don't think
> that's what you want.  From Documentation/networking/ip-sysctl.txt:
>
> tcp_mtu_probing - INTEGER
> Controls TCP Packetization-Layer Path MTU Discovery.  Takes three
> values:
>   0 - Disabled
>   1 - Disabled by default, enabled when an ICMP black hole detected
>   2 - Always enabled, use initial MSS of tcp_base_mss.
yes, will honor the original enable theme.

>
>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> index 0732b78..9d1cfe0 100644
>> --- a/net/ipv4/tcp_timer.c
>> +++ b/net/ipv4/tcp_timer.c
>> @@ -113,7 +113,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
>>                          struct tcp_sock *tp = tcp_sk(sk);
>>                          int mss;
>>
>> -                       mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
>> +                       mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low);
>>                          mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
>>                          mss = max(mss, 68 - tp->tcp_header_len);
>>                          icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
>
> Why did you change this?  I think this breaks black hole detection.
hmm, I misunderstood this part.
In case of pure black hole detection, lowering the current tcp mss instead of search_low,
will make more sense, as current tcp mss still got lost.

-                       mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
+                       /* try mss smaller than current mss */
+                       mss = tcp_current_mss(sk) >> 1;

Black hole seems more like a misconfiguration in administrative level on intermediate node,
rather than a stack issue, why keep shrinking mss to get packet through with poor performance?


> Thanks,
>    -John
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821
  2015-02-13  9:59   ` Ying Xue
@ 2015-02-16  5:28     ` Fan Du
  0 siblings, 0 replies; 24+ messages in thread
From: Fan Du @ 2015-02-16  5:28 UTC (permalink / raw)
  To: Ying Xue; +Cc: Fan Du, davem, netdev

于 2015年02月13日 17:59, Ying Xue 写道:
>> +static void icsk_mtup_probe_timer(unsigned long arg)
>> >+{
>> >+	struct sock *sk = (struct sock *)arg;
>> >+	struct net *net = sock_net(sk);
>> >+	struct inet_connection_sock *icsk = inet_csk(sk);
>> >+
>> >+	/* Restore orignal search range */
>> >+	icsk->icsk_mtup.search_high = icsk->icsk_mtup.search_high_sav;
>> >+	icsk->icsk_mtup.search_low = icsk->icsk_mtup.search_low_sav;
>> >+	icsk->icsk_mtup.probe_size = 0;
>> >+}
>> >+
> As icsk_mtup_probe_timer() is run asynchronously, we may touch an invalid socket
> instance if we don't hold socket's refcount before launching the timer.
>
> Therefore, in general we use the standard interfaces like sk_reset_timer() and
> sk_stop_timer() to operate timers associated with socket. So, the usage about
> timer in the patch seems unsafe for us. For instance, you can study how
> icsk_retransmit_timer, icsk_delack_timer and sk_timer, are implemented.

right, socket layer has stander API wrapper to manipulate timers like you point out.
Thanks for the notice :)

> Regards,
> Ying
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821
  2015-02-13 12:31   ` Eric Dumazet
@ 2015-02-16  5:38     ` Fan Du
  2015-02-16 12:19       ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Fan Du @ 2015-02-16  5:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Fan Du, davem, netdev

于 2015年02月13日 20:31, Eric Dumazet 写道:
> On Fri, 2015-02-13 at 16:16 +0800, Fan Du wrote:
>> >As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
>> >be armed once probing has converged. Once this timer expired,
>> >probing again to take advantage of any path PMTU change. The
>> >recommended probing interval is 10 minutes per RFC1981.
>> >
>> >Signed-off-by: Fan Du<fan.du@intel.com>
>> >---
>> >  include/net/inet_connection_sock.h |    2 ++
>> >  include/net/netns/ipv4.h           |    1 +
>> >  include/net/tcp.h                  |    3 +++
>> >  net/ipv4/sysctl_net_ipv4.c         |    7 +++++++
>> >  net/ipv4/tcp.c                     |    2 ++
>> >  net/ipv4/tcp_ipv4.c                |    1 +
>> >  net/ipv4/tcp_output.c              |   23 ++++++++++++++++++++++-
>> >  7 files changed, 38 insertions(+), 1 deletions(-)
>> >
>> >diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
>> >index 3d0932e..e78e5ab 100644
>> >--- a/include/net/inet_connection_sock.h
>> >+++ b/include/net/inet_connection_sock.h
>> >@@ -126,6 +126,8 @@ struct inet_connection_sock {
>> >
>> >  		int		  search_high_sav;
>> >  		int		  search_low_sav;
>> >+
>> >+		struct timer_list probe_timer;
>> >
> We certainly wont add yet another timer in tcp socket for such usage.
>
> And a buggy one, since you forgot all the refcounting associated with
> such timers.

oh, embarrassing...
Will place probe timer aside with icsk_delack_timer in struct inet_connection_sock,
and manipulate through sk_reset_timer.

Thanks for the reviewing.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821
  2015-02-16  5:38     ` Fan Du
@ 2015-02-16 12:19       ` Eric Dumazet
  0 siblings, 0 replies; 24+ messages in thread
From: Eric Dumazet @ 2015-02-16 12:19 UTC (permalink / raw)
  To: Fan Du; +Cc: Fan Du, davem, netdev

On Mon, 2015-02-16 at 13:38 +0800, Fan Du wrote:
> 于 2015年02月13日 20:31, Eric Dumazet 写道:

> > We certainly wont add yet another timer in tcp socket for such usage.
> >
> > And a buggy one, since you forgot all the refcounting associated with
> > such timers.
> 
> oh, embarrassing...
> Will place probe timer aside with icsk_delack_timer in struct inet_connection_sock,
> and manipulate through sk_reset_timer.

No : I said we would _not_ accept yet another timer (ie a big structure)
for such very rare event.

You can implement a pseudo timer, using a simple u32 field, that you
test in tcp_mtu_probe() to clear icsk->icsk_mtup.probe_size when enough
time has elapsed.

(tcp_time_stamp is u32, so are tp->lsndtime, tp->snd_cwnd_stamp, 
tp->rcv_tstamp, tp->retrans_stamp, tp->tso_deferred, ...)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/3] ipv4: Use binary search to choose tcp PMTU probe_size
  2015-02-16  5:27     ` Fan Du
@ 2015-02-16 23:59       ` John Heffner
  0 siblings, 0 replies; 24+ messages in thread
From: John Heffner @ 2015-02-16 23:59 UTC (permalink / raw)
  To: Fan Du; +Cc: Fan Du, David Miller, Netdev

On Mon, Feb 16, 2015 at 12:27 AM, Fan Du <fengyuleidian0615@gmail.com> wrote:
> 于 2015年02月14日 01:52, John Heffner 写道:
>
>> On Fri, Feb 13, 2015 at 3:16 AM, Fan Du <fan.du@intel.com> wrote:
>>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>>> index 0732b78..9d1cfe0 100644
>>> --- a/net/ipv4/tcp_timer.c
>>> +++ b/net/ipv4/tcp_timer.c
>>> @@ -113,7 +113,7 @@ static void tcp_mtu_probing(struct
>>> inet_connection_sock *icsk, struct sock *sk)
>>>                          struct tcp_sock *tp = tcp_sk(sk);
>>>                          int mss;
>>>
>>> -                       mss = tcp_mtu_to_mss(sk,
>>> icsk->icsk_mtup.search_low) >> 1;
>>> +                       mss = tcp_mtu_to_mss(sk,
>>> icsk->icsk_mtup.search_low);
>>>                          mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
>>>                          mss = max(mss, 68 - tp->tcp_header_len);
>>>                          icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk,
>>> mss);
>>
>>
>> Why did you change this?  I think this breaks black hole detection.
>
> hmm, I misunderstood this part.
> In case of pure black hole detection, lowering the current tcp mss instead
> of search_low,
> will make more sense, as current tcp mss still got lost.
>
> -                       mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low)
>>> 1;
> +                       /* try mss smaller than current mss */
> +                       mss = tcp_current_mss(sk) >> 1;
>
> Black hole seems more like a misconfiguration in administrative level on
> intermediate node,
> rather than a stack issue, why keep shrinking mss to get packet through with
> poor performance?

The idea here is to make it robust to route changes, where the new
path has a lower MTU.  This also protects us against paths that have a
lower MTU than the base mss (which you're also trying to raise in this
patch series).

Thanks,
  -John

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCHv2 net-next 0/4] Small fix for TCP PMTU
  2015-02-13  8:16 [PATCH net-next 0/3] Small fix for TCP PMTU Fan Du
                   ` (2 preceding siblings ...)
  2015-02-13  8:16 ` [PATCH net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821 Fan Du
@ 2015-02-26  3:49 ` Fan Du
  2015-02-26  3:49   ` [PATCHv2 net-next 1/4] ipv4: Raise tcp PMTU probe mss base size Fan Du
                     ` (4 more replies)
  3 siblings, 5 replies; 24+ messages in thread
From: Fan Du @ 2015-02-26  3:49 UTC (permalink / raw)
  To: johnwheffner; +Cc: edumazet, davem, netdev, fengyuleidian0615

Hi,

This patchset performs some small fix and enhancement
for current TCP PMTU as per RFC4821 with below changes:

Patch1/4: Set probe mss base to 1024 Bytes per RFC4821
Patch2/4: Do not double probe_size for each probing,
          use a simple binary search to gain maximum performance.
Patch3/4: When it comes to blackhole detection, shrink current
	  mss for next probing.
Patch4/4: Create a probe timer to detect enlarged path MTU.

Changelog:
v2:
   - Introduce sysctl_tcp_probe_threshold to control when
     probing will stop, as suggested by John Heffner.
   - Add patch3 to shrink current mss value for search low boundary.
   - Drop cannonical timer usages, implements pseudo timer based on
     32bits jiffies tcp_time_stamp, as suggested by Eric Dumazet.

Fan Du (4):
  ipv4: Raise tcp PMTU probe mss base size
  ipv4: Use binary search to choose tcp PMTU probe_size
  ipv4: shrink current mss for tcp PMTU blackhole detection
  ipv4: Create probe timer for tcp PMTU as per RFC4821

 include/net/inet_connection_sock.h |    2 +
 include/net/netns/ipv4.h           |    2 +
 include/net/tcp.h                  |    8 +++++-
 net/ipv4/sysctl_net_ipv4.c         |   14 ++++++++++
 net/ipv4/tcp_ipv4.c                |    2 +
 net/ipv4/tcp_output.c              |   50 +++++++++++++++++++++++++++++++++---
 net/ipv4/tcp_timer.c               |    8 +++--
 7 files changed, 78 insertions(+), 8 deletions(-)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCHv2 net-next 1/4] ipv4: Raise tcp PMTU probe mss base size
  2015-02-26  3:49 ` [PATCHv2 net-next 0/4] Small fix for TCP PMTU Fan Du
@ 2015-02-26  3:49   ` Fan Du
  2015-02-26  3:49   ` [PATCHv2 net-next 2/4] ipv4: Use binary search to choose tcp PMTU probe_size Fan Du
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Fan Du @ 2015-02-26  3:49 UTC (permalink / raw)
  To: johnwheffner; +Cc: edumazet, davem, netdev, fengyuleidian0615

Quotes from RFC4821 7.2.  Selecting Initial Values

   It is RECOMMENDED that search_low be initially set to an MTU size
   that is likely to work over a very wide range of environments.  Given
   today's technologies, a value of 1024 bytes is probably safe enough.
   The initial value for search_low SHOULD be configurable.

Moreover, set a small value will introduce extra time for the search
to converge. So set the initial probe base mss size to 1024 Bytes.

Signed-off-by: Fan Du <fan.du@intel.com>
---
 include/net/tcp.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 8d6b983..7b57e5b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -65,7 +65,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCP_MIN_MSS		88U
 
 /* The least MTU to use for probing */
-#define TCP_BASE_MSS		512
+#define TCP_BASE_MSS		1024
 
 /* After receiving this amount of duplicate ACKs fast retransmit starts. */
 #define TCP_FASTRETRANS_THRESH 3
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCHv2 net-next 2/4] ipv4: Use binary search to choose tcp PMTU probe_size
  2015-02-26  3:49 ` [PATCHv2 net-next 0/4] Small fix for TCP PMTU Fan Du
  2015-02-26  3:49   ` [PATCHv2 net-next 1/4] ipv4: Raise tcp PMTU probe mss base size Fan Du
@ 2015-02-26  3:49   ` Fan Du
  2015-02-27 22:17     ` David Miller
  2015-02-26  3:49   ` [PATCHv2 net-next 3/4] ipv4: shrink current mss for tcp PMTU blackhole detection Fan Du
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Fan Du @ 2015-02-26  3:49 UTC (permalink / raw)
  To: johnwheffner; +Cc: edumazet, davem, netdev, fengyuleidian0615

Current probe_size is chosen by doubling mss_cache,
the initial mss base is 512 Bytes, as a result the
converged probe_size will only be 1024 Bytes, there
is still big gap between 1024 and common 1500 bytes
of mtu.

Use binary search to choose probe_size in a fine
granularity manner, an optimal mss will be found
to boost performance as its maxmium.

In addition, introduce a sysctl_tcp_probe_threshold
to control when probing will stop in respect to
the width of search range.

Test env:
Docker instance with vxlan encapuslation(82599EB)
iperf -c 10.0.0.24  -t 60

before this patch:
1.26 Gbits/sec

After this patch: increase 26%
1.59 Gbits/sec

Signed-off-by: Fan Du <fan.du@intel.com>
---
v2:
  - Use sysctl_tcp_probe_threshold to control when probing
    will stop wrt interval between search high and search low.
---
 include/net/netns/ipv4.h   |    1 +
 include/net/tcp.h          |    3 +++
 net/ipv4/sysctl_net_ipv4.c |    7 +++++++
 net/ipv4/tcp_ipv4.c        |    1 +
 net/ipv4/tcp_output.c      |   14 +++++++++++---
 5 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index dbe2254..25200d4 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -84,6 +84,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_fwmark_accept;
 	int sysctl_tcp_mtu_probing;
 	int sysctl_tcp_base_mss;
+	int sysctl_tcp_probe_threshold;
 
 	struct ping_group_range ping_group_range;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7b57e5b..d269c91 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -67,6 +67,9 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 /* The least MTU to use for probing */
 #define TCP_BASE_MSS		1024
 
+/* Specify interval when tcp mtu probing will stop */
+#define TCP_PROBE_THRESHOLD	8
+
 /* After receiving this amount of duplicate ACKs fast retransmit starts. */
 #define TCP_FASTRETRANS_THRESH 3
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d151539..d3c09c1 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -883,6 +883,13 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "tcp_probe_threshold",
+		.data		= &init_net.ipv4.sysctl_tcp_probe_threshold,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 5a2dfed..35790d9 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2460,6 +2460,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	}
 	net->ipv4.sysctl_tcp_ecn = 2;
 	net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
+	net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
 	return 0;
 
 fail:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a2a796c..c418829 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1837,11 +1837,13 @@ static int tcp_mtu_probe(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct sk_buff *skb, *nskb, *next;
+	struct net *net = sock_net(sk);
 	int len;
 	int probe_size;
 	int size_needed;
 	int copy;
 	int mss_now;
+	int interval;
 
 	/* Not currently probing/verifying,
 	 * not in recovery,
@@ -1854,11 +1856,17 @@ static int tcp_mtu_probe(struct sock *sk)
 	    tp->rx_opt.num_sacks || tp->rx_opt.dsack)
 		return -1;
 
-	/* Very simple search strategy: just double the MSS. */
+	/* Use binary search for probe_size bewteen tcp_mss_base,
+	 * and current mss_clamp. if (search_high - search_low)
+	 * smaller than a threshold, backoff from probing.
+	 */
 	mss_now = tcp_current_mss(sk);
-	probe_size = 2 * tp->mss_cache;
+	probe_size = (icsk->icsk_mtup.search_high +
+		      icsk->icsk_mtup.search_low) >> 1;
 	size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
-	if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high)) {
+	interval = icsk->icsk_mtup.search_high - icsk->icsk_mtup.search_low;
+	if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high) ||
+	    interval < net->ipv4.sysctl_tcp_probe_threshold) {
 		/* TODO: set timer for probe_converge_event */
 		return -1;
 	}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCHv2 net-next 3/4] ipv4: shrink current mss for tcp PMTU blackhole detection
  2015-02-26  3:49 ` [PATCHv2 net-next 0/4] Small fix for TCP PMTU Fan Du
  2015-02-26  3:49   ` [PATCHv2 net-next 1/4] ipv4: Raise tcp PMTU probe mss base size Fan Du
  2015-02-26  3:49   ` [PATCHv2 net-next 2/4] ipv4: Use binary search to choose tcp PMTU probe_size Fan Du
@ 2015-02-26  3:49   ` Fan Du
  2015-02-26  3:49   ` [PATCHv2 net-next 4/4] ipv4: Create probe timer for tcp PMTU as per RFC4821 Fan Du
  2015-02-26 13:40   ` [PATCHv2 net-next 0/4] Small fix for TCP PMTU David Laight
  4 siblings, 0 replies; 24+ messages in thread
From: Fan Du @ 2015-02-26  3:49 UTC (permalink / raw)
  To: johnwheffner; +Cc: edumazet, davem, netdev, fengyuleidian0615

Reducing current tcp mss instead of search_low will make
more sense, as current tcp mss still got lost.

In addition, rename tcp_mtu_probing to tcp_blackhole_probe
to clearify confusion between tcp_mtu_probing and
tcp_mtu_probe.

Signed-off-by: Fan Du <fan.du@intel.com>
---
 net/ipv4/tcp_timer.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 0732b78..3465175 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -99,7 +99,8 @@ static int tcp_orphan_retries(struct sock *sk, int alive)
 	return retries;
 }
 
-static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
+static void tcp_blackhole_probe(struct inet_connection_sock *icsk,
+				struct sock *sk)
 {
 	struct net *net = sock_net(sk);
 
@@ -113,7 +114,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
 			struct tcp_sock *tp = tcp_sk(sk);
 			int mss;
 
-			mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
+			/* try mss smaller than current mss */
+			mss = tcp_current_mss(sk) >> 1;
 			mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
 			mss = max(mss, 68 - tp->tcp_header_len);
 			icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
@@ -176,7 +178,7 @@ static int tcp_write_timeout(struct sock *sk)
 	} else {
 		if (retransmits_timed_out(sk, sysctl_tcp_retries1, 0, 0)) {
 			/* Black hole detection */
-			tcp_mtu_probing(icsk, sk);
+			tcp_blackhole_probe(icsk, sk);
 
 			dst_negative_advice(sk);
 		}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCHv2 net-next 4/4] ipv4: Create probe timer for tcp PMTU as per RFC4821
  2015-02-26  3:49 ` [PATCHv2 net-next 0/4] Small fix for TCP PMTU Fan Du
                     ` (2 preceding siblings ...)
  2015-02-26  3:49   ` [PATCHv2 net-next 3/4] ipv4: shrink current mss for tcp PMTU blackhole detection Fan Du
@ 2015-02-26  3:49   ` Fan Du
  2015-02-26  4:19     ` Eric Dumazet
  2015-02-26 13:40   ` [PATCHv2 net-next 0/4] Small fix for TCP PMTU David Laight
  4 siblings, 1 reply; 24+ messages in thread
From: Fan Du @ 2015-02-26  3:49 UTC (permalink / raw)
  To: johnwheffner; +Cc: edumazet, davem, netdev, fengyuleidian0615

As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
be armed once probing has converged. Once this timer expired,
probing again to take advantage of any path PMTU change. The
recommended probing interval is 10 minutes per RFC1981. Probing
interval could be sysctled by sysctl_tcp_probe_interval.

Eric suggested to implement pseudo timer based on 32bits jiffies
tcp_time_stamp instead of using classic timer for such rare event.

Signed-off-by: Fan Du <fan.du@intel.com>
---
v2:
   - Create a pseudo timer based on 32bits jiffies tcp_time_stamp
     to control when to reprobing as suggested by Eric.
---
 include/net/inet_connection_sock.h |    2 +
 include/net/netns/ipv4.h           |    1 +
 include/net/tcp.h                  |    3 ++
 net/ipv4/sysctl_net_ipv4.c         |    7 ++++++
 net/ipv4/tcp_ipv4.c                |    1 +
 net/ipv4/tcp_output.c              |   38 ++++++++++++++++++++++++++++++++++-
 6 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 5976bde..b9a6b0a 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -126,6 +126,8 @@ struct inet_connection_sock {
 
 		/* Information on the current probe. */
 		int		  probe_size;
+
+		u32		  probe_timestamp;
 	} icsk_mtup;
 	u32			  icsk_ca_priv[16];
 	u32			  icsk_user_timeout;
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 25200d4..18ebc99 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -85,6 +85,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_mtu_probing;
 	int sysctl_tcp_base_mss;
 	int sysctl_tcp_probe_threshold;
+	u32 sysctl_tcp_probe_interval;
 
 	struct ping_group_range ping_group_range;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index d269c91..dd6adbc 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -67,6 +67,9 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 /* The least MTU to use for probing */
 #define TCP_BASE_MSS		1024
 
+/* probing interval, default to 10 minutes as per RFC4821 */
+#define TCP_PROBE_INTERVAL	600
+
 /* Specify interval when tcp mtu probing will stop */
 #define TCP_PROBE_THRESHOLD	8
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d3c09c1..fdf8991 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -890,6 +890,13 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "tcp_probe_interval",
+		.data		= &init_net.ipv4.sysctl_tcp_probe_interval,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 35790d9..f0c6fc3 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2461,6 +2461,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_ecn = 2;
 	net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
 	net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
+	net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
 	return 0;
 
 fail:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c418829..0e9ee6a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1354,6 +1354,8 @@ void tcp_mtup_init(struct sock *sk)
 			       icsk->icsk_af_ops->net_header_len;
 	icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, net->ipv4.sysctl_tcp_base_mss);
 	icsk->icsk_mtup.probe_size = 0;
+	if (icsk->icsk_mtup.enabled)
+		icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
 }
 EXPORT_SYMBOL(tcp_mtup_init);
 
@@ -1823,6 +1825,35 @@ send_now:
 	return false;
 }
 
+static inline void tcp_mtu_check_reprobe(struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct net *net = sock_net(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	u32 delta;
+	u32 now = tcp_time_stamp;
+	u32 interval = net->ipv4.sysctl_tcp_probe_interval;
+
+	if (likely(now > icsk->icsk_mtup.probe_timestamp))
+		delta = now - icsk->icsk_mtup.probe_timestamp;
+	else
+		delta = now + (U32_MAX - icsk->icsk_mtup.probe_timestamp);
+	if (unlikely(jiffies_to_msecs(delta) >= interval * MSEC_PER_SC)) {
+		int mss = tcp_current_mss(sk);
+
+		/* Update current search range */
+		icsk->icsk_mtup.search_high = tp->rx_opt.mss_clamp +
+			sizeof(struct tcphdr) +
+			icsk->icsk_af_ops->net_header_len;
+		icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
+		icsk->icsk_mtup.probe_size = 0;
+
+		/* Update probe time stamp */
+		icsk->icsk_mtup.probe_timestamp = now;
+	}
+}
+
 /* Create a new MTU probe if we are ready.
  * MTU probe is regularly attempting to increase the path MTU by
  * deliberately sending larger packets.  This discovers routing
@@ -1866,8 +1897,11 @@ static int tcp_mtu_probe(struct sock *sk)
 	size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
 	interval = icsk->icsk_mtup.search_high - icsk->icsk_mtup.search_low;
 	if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high) ||
-	    interval < net->ipv4.sysctl_tcp_probe_threshold) {
-		/* TODO: set timer for probe_converge_event */
+		interval < net->ipv4.sysctl_tcp_probe_threshold) {
+		/* Check whether enough time has elaplased for
+		 * another round of probing.
+		 */
+		tcp_mtu_check_reprobe(sk);
 		return -1;
 	}
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCHv2 net-next 4/4] ipv4: Create probe timer for tcp PMTU as per RFC4821
  2015-02-26  3:49   ` [PATCHv2 net-next 4/4] ipv4: Create probe timer for tcp PMTU as per RFC4821 Fan Du
@ 2015-02-26  4:19     ` Eric Dumazet
  2015-02-26  6:24       ` Fan Du
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2015-02-26  4:19 UTC (permalink / raw)
  To: Fan Du; +Cc: johnwheffner, edumazet, davem, netdev, fengyuleidian0615

On Thu, 2015-02-26 at 11:49 +0800, Fan Du wrote:
> As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
> be armed once probing has converged. Once this timer expired,
> probing again to take advantage of any path PMTU change. The
> recommended probing interval is 10 minutes per RFC1981. Probing
> interval could be sysctled by sysctl_tcp_probe_interval.
> 
> Eric suggested to implement pseudo timer based on 32bits jiffies
> tcp_time_stamp instead of using classic timer for such rare event.
> 
> Signed-off-by: Fan Du <fan.du@intel.com>
> ---
> v2:
>    - Create a pseudo timer based on 32bits jiffies tcp_time_stamp
>      to control when to reprobing as suggested by Eric.

...

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index c418829..0e9ee6a 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1354,6 +1354,8 @@ void tcp_mtup_init(struct sock *sk)
>  			       icsk->icsk_af_ops->net_header_len;
>  	icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, net->ipv4.sysctl_tcp_base_mss);
>  	icsk->icsk_mtup.probe_size = 0;
> +	if (icsk->icsk_mtup.enabled)

I guess the test can be removed.

Assignment of probe_timestamp can be done regardless of the flag.

> +		icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
>  }
>  EXPORT_SYMBOL(tcp_mtup_init);
>  
> @@ -1823,6 +1825,35 @@ send_now:
>  	return false;
>  }
>  
> +static inline void tcp_mtu_check_reprobe(struct sock *sk)
> +{
> +	struct inet_connection_sock *icsk = inet_csk(sk);
> +	struct net *net = sock_net(sk);
> +	struct tcp_sock *tp = tcp_sk(sk);
> +

extra newline

> +	u32 delta;
> +	u32 now = tcp_time_stamp;
> +	u32 interval = net->ipv4.sysctl_tcp_probe_interval;


Please try to keep variables sorted by lengths, it helps readability :

	u32 interval = net->ipv4.sysctl_tcp_probe_interval;
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
	struct net *net = sock_net(sk);
	s32 delta;
 


> +
> +	if (likely(now > icsk->icsk_mtup.probe_timestamp))

Wrapping issue.

 if ((s32)(now - icsk->icsk_mtup.probe_timestamp)) > 0) ...


> +		delta = now - icsk->icsk_mtup.probe_timestamp;
> +	else
> +		delta = now + (U32_MAX - icsk->icsk_mtup.probe_timestamp);


oh well..

delta = tcp_time_stamp - icsk->icsk_mtup.probe_timestamp;

is fine (if delta is declared as s32).
Adding U32_MAX makes no sense. No need for @now var.

> +	if (unlikely(jiffies_to_msecs(delta) >= interval * MSEC_PER_SC)) {

if (unlikely(delta >= interval * HZ)) {

> +		int mss = tcp_current_mss(sk);
> +
> +		/* Update current search range */
> +		icsk->icsk_mtup.search_high = tp->rx_opt.mss_clamp +
> +			sizeof(struct tcphdr) +
> +			icsk->icsk_af_ops->net_header_len;
> +		icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
> +		icsk->icsk_mtup.probe_size = 0;
> +
> +		/* Update probe time stamp */
> +		icsk->icsk_mtup.probe_timestamp = now;

		icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
> +	}
> +}
> +
>  /* Create a new MTU probe if we are ready.
>   * MTU probe is regularly attempting to increase the path MTU by
>   * deliberately sending larger packets.  This discovers routing
> @@ -1866,8 +1897,11 @@ static int tcp_mtu_probe(struct sock *sk)
>  	size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>  	interval = icsk->icsk_mtup.search_high - icsk->icsk_mtup.search_low;
>  	if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high) ||
> -	    interval < net->ipv4.sysctl_tcp_probe_threshold) {
> -		/* TODO: set timer for probe_converge_event */
> +		interval < net->ipv4.sysctl_tcp_probe_threshold) {
> +		/* Check whether enough time has elaplased for
> +		 * another round of probing.
> +		 */
> +		tcp_mtu_check_reprobe(sk);
>  		return -1;
>  	}
>  

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv2 net-next 4/4] ipv4: Create probe timer for tcp PMTU as per RFC4821
  2015-02-26  4:19     ` Eric Dumazet
@ 2015-02-26  6:24       ` Fan Du
  0 siblings, 0 replies; 24+ messages in thread
From: Fan Du @ 2015-02-26  6:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Fan Du, johnwheffner, edumazet, davem, netdev

于 2015年02月26日 12:19, Eric Dumazet 写道:
> On Thu, 2015-02-26 at 11:49 +0800, Fan Du wrote:
>> As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
>> be armed once probing has converged. Once this timer expired,
>> probing again to take advantage of any path PMTU change. The
>> recommended probing interval is 10 minutes per RFC1981. Probing
>> interval could be sysctled by sysctl_tcp_probe_interval.
>>
>> Eric suggested to implement pseudo timer based on 32bits jiffies
>> tcp_time_stamp instead of using classic timer for such rare event.
>>
>> Signed-off-by: Fan Du <fan.du@intel.com>
>> ---
>> v2:
>>     - Create a pseudo timer based on 32bits jiffies tcp_time_stamp
>>       to control when to reprobing as suggested by Eric.
>
> ...
>
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index c418829..0e9ee6a 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -1354,6 +1354,8 @@ void tcp_mtup_init(struct sock *sk)
>>   			       icsk->icsk_af_ops->net_header_len;
>>   	icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, net->ipv4.sysctl_tcp_base_mss);
>>   	icsk->icsk_mtup.probe_size = 0;
>> +	if (icsk->icsk_mtup.enabled)
>
> I guess the test can be removed.
>
> Assignment of probe_timestamp can be done regardless of the flag.

Exactly speaking, the assignment should honor enabled variable,
once enabled is 1, probing actually starts, so probe_timestamp should also be marked from that point.
one place missing is below:

@@ -108,6 +108,7 @@ static void tcp_blackhole_probe(struct inet_connection_sock *icsk,
         if (net->ipv4.sysctl_tcp_mtu_probing) {
                 if (!icsk->icsk_mtup.enabled) {
                         icsk->icsk_mtup.enabled = 1;
+                       icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;

>> +		icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
>>   }
>>   EXPORT_SYMBOL(tcp_mtup_init);
>>
>> @@ -1823,6 +1825,35 @@ send_now:
>>   	return false;
>>   }
>>
>> +static inline void tcp_mtu_check_reprobe(struct sock *sk)
>> +{
>> +	struct inet_connection_sock *icsk = inet_csk(sk);
>> +	struct net *net = sock_net(sk);
>> +	struct tcp_sock *tp = tcp_sk(sk);
>> +
>
> extra newline
>
>> +	u32 delta;
>> +	u32 now = tcp_time_stamp;
>> +	u32 interval = net->ipv4.sysctl_tcp_probe_interval;
>
>
> Please try to keep variables sorted by lengths, it helps readability :
>
> 	u32 interval = net->ipv4.sysctl_tcp_probe_interval;
> 	struct inet_connection_sock *icsk = inet_csk(sk);
> 	struct tcp_sock *tp = tcp_sk(sk);
> 	struct net *net = sock_net(sk);
> 	s32 delta;

then I have to arrange those variable like this ;)

+static inline void tcp_mtu_check_reprobe(struct sock *sk)
+{
+       struct inet_connection_sock *icsk = inet_csk(sk);
+       struct tcp_sock *tp = tcp_sk(sk);
+       struct net *net = sock_net(sk);
+       u32 interval;
+       s32 delta;
+
+       interval = net->ipv4.sysctl_tcp_probe_interval;
+       delta = tcp_time_stamp - icsk->icsk_mtup.probe_timestamp;
+       if (unlikely(delta >= interval * HZ)) {
+               int mss = tcp_current_mss(sk);
+
+               /* Update current search range */
+               icsk->icsk_mtup.search_high = tp->rx_opt.mss_clamp +
+                       sizeof(struct tcphdr) +
+                       icsk->icsk_af_ops->net_header_len;
+               icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
+               icsk->icsk_mtup.probe_size = 0;
+
+               /* Update probe time stamp */
+               icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
+       }
+}

And thanks for the all the comments.

>
>
>> +
>> +	if (likely(now > icsk->icsk_mtup.probe_timestamp))
>
> Wrapping issue.
>
>   if ((s32)(now - icsk->icsk_mtup.probe_timestamp)) > 0) ...
>
>
>> +		delta = now - icsk->icsk_mtup.probe_timestamp;
>> +	else
>> +		delta = now + (U32_MAX - icsk->icsk_mtup.probe_timestamp);
>
>
> oh well..
>
> delta = tcp_time_stamp - icsk->icsk_mtup.probe_timestamp;
>
> is fine (if delta is declared as s32).
> Adding U32_MAX makes no sense. No need for @now var.
>
>> +	if (unlikely(jiffies_to_msecs(delta) >= interval * MSEC_PER_SC)) {
>
> if (unlikely(delta >= interval * HZ)) {
>
>> +		int mss = tcp_current_mss(sk);
>> +
>> +		/* Update current search range */
>> +		icsk->icsk_mtup.search_high = tp->rx_opt.mss_clamp +
>> +			sizeof(struct tcphdr) +
>> +			icsk->icsk_af_ops->net_header_len;
>> +		icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
>> +		icsk->icsk_mtup.probe_size = 0;
>> +
>> +		/* Update probe time stamp */
>> +		icsk->icsk_mtup.probe_timestamp = now;
>
> 		icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
>> +	}
>> +}
>> +
>>   /* Create a new MTU probe if we are ready.
>>    * MTU probe is regularly attempting to increase the path MTU by
>>    * deliberately sending larger packets.  This discovers routing
>> @@ -1866,8 +1897,11 @@ static int tcp_mtu_probe(struct sock *sk)
>>   	size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>>   	interval = icsk->icsk_mtup.search_high - icsk->icsk_mtup.search_low;
>>   	if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high) ||
>> -	    interval < net->ipv4.sysctl_tcp_probe_threshold) {
>> -		/* TODO: set timer for probe_converge_event */
>> +		interval < net->ipv4.sysctl_tcp_probe_threshold) {
>> +		/* Check whether enough time has elaplased for
>> +		 * another round of probing.
>> +		 */
>> +		tcp_mtu_check_reprobe(sk);
>>   		return -1;
>>   	}
>>
>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCHv2 net-next 0/4] Small fix for TCP PMTU
  2015-02-26  3:49 ` [PATCHv2 net-next 0/4] Small fix for TCP PMTU Fan Du
                     ` (3 preceding siblings ...)
  2015-02-26  3:49   ` [PATCHv2 net-next 4/4] ipv4: Create probe timer for tcp PMTU as per RFC4821 Fan Du
@ 2015-02-26 13:40   ` David Laight
  2015-02-27  5:37     ` Fan Du
  4 siblings, 1 reply; 24+ messages in thread
From: David Laight @ 2015-02-26 13:40 UTC (permalink / raw)
  To: 'Fan Du', johnwheffner; +Cc: edumazet, davem, netdev, fengyuleidian0615

From: Fan Du
> This patchset performs some small fix and enhancement
> for current TCP PMTU as per RFC4821 with below changes:
> 
> Patch1/4: Set probe mss base to 1024 Bytes per RFC4821
> Patch2/4: Do not double probe_size for each probing,
>           use a simple binary search to gain maximum performance.
> Patch3/4: When it comes to blackhole detection, shrink current
> 	  mss for next probing.
> Patch4/4: Create a probe timer to detect enlarged path MTU.

Isn't the most likely problem PPPoE blackholes?
Where a reduction of the PMTU from 1500 to (IIRC) 1492 is
all that is required?

Would seem appropriate to test that value reasonable quickly.

	David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv2 net-next 0/4] Small fix for TCP PMTU
  2015-02-26 13:40   ` [PATCHv2 net-next 0/4] Small fix for TCP PMTU David Laight
@ 2015-02-27  5:37     ` Fan Du
  0 siblings, 0 replies; 24+ messages in thread
From: Fan Du @ 2015-02-27  5:37 UTC (permalink / raw)
  To: David Laight; +Cc: 'Fan Du', johnwheffner, edumazet, davem, netdev

于 2015年02月26日 21:40, David Laight 写道:
> From: Fan Du
>> >This patchset performs some small fix and enhancement
>> >for current TCP PMTU as per RFC4821 with below changes:
>> >
>> >Patch1/4: Set probe mss base to 1024 Bytes per RFC4821
>> >Patch2/4: Do not double probe_size for each probing,
>> >           use a simple binary search to gain maximum performance.
>> >Patch3/4: When it comes to blackhole detection, shrink current
>> >	  mss for next probing.
>> >Patch4/4: Create a probe timer to detect enlarged path MTU.
> Isn't the most likely problem PPPoE blackholes?
No.

It's used to probe a optimal pmtu value for vxlan tunnel when ICMP message
is not available. Refer patch2/4 for test summary.

> Where a reduction of the PMTU from 1500 to (IIRC) 1492 is
> all that is required?
>
> Would seem appropriate to test that value reasonable quickly.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv2 net-next 2/4] ipv4: Use binary search to choose tcp PMTU probe_size
  2015-02-26  3:49   ` [PATCHv2 net-next 2/4] ipv4: Use binary search to choose tcp PMTU probe_size Fan Du
@ 2015-02-27 22:17     ` David Miller
  0 siblings, 0 replies; 24+ messages in thread
From: David Miller @ 2015-02-27 22:17 UTC (permalink / raw)
  To: fan.du; +Cc: johnwheffner, edumazet, netdev, fengyuleidian0615

From: Fan Du <fan.du@intel.com>
Date: Thu, 26 Feb 2015 11:49:24 +0800

> Current probe_size is chosen by doubling mss_cache,
> the initial mss base is 512 Bytes, as a result the
> converged probe_size will only be 1024 Bytes, there
> is still big gap between 1024 and common 1500 bytes
> of mtu.

After patch #1 in this series it is not longer 512.

You need to update this commit message to match reality.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2015-02-27 22:17 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-13  8:16 [PATCH net-next 0/3] Small fix for TCP PMTU Fan Du
2015-02-13  8:16 ` [PATCH net-next 1/3] ipv4: Raise tcp PMTU probe mss base size Fan Du
2015-02-13  9:49   ` yzhu1
2015-02-16  5:15     ` Fan Du
2015-02-13  8:16 ` [PATCH net-next 2/3] ipv4: Use binary search to choose tcp PMTU probe_size Fan Du
2015-02-13 17:52   ` John Heffner
2015-02-16  5:27     ` Fan Du
2015-02-16 23:59       ` John Heffner
2015-02-13  8:16 ` [PATCH net-next 3/3] ipv4: Create probe timer for tcp PMTU as per RFC4821 Fan Du
2015-02-13  9:59   ` Ying Xue
2015-02-16  5:28     ` Fan Du
2015-02-13 12:31   ` Eric Dumazet
2015-02-16  5:38     ` Fan Du
2015-02-16 12:19       ` Eric Dumazet
2015-02-26  3:49 ` [PATCHv2 net-next 0/4] Small fix for TCP PMTU Fan Du
2015-02-26  3:49   ` [PATCHv2 net-next 1/4] ipv4: Raise tcp PMTU probe mss base size Fan Du
2015-02-26  3:49   ` [PATCHv2 net-next 2/4] ipv4: Use binary search to choose tcp PMTU probe_size Fan Du
2015-02-27 22:17     ` David Miller
2015-02-26  3:49   ` [PATCHv2 net-next 3/4] ipv4: shrink current mss for tcp PMTU blackhole detection Fan Du
2015-02-26  3:49   ` [PATCHv2 net-next 4/4] ipv4: Create probe timer for tcp PMTU as per RFC4821 Fan Du
2015-02-26  4:19     ` Eric Dumazet
2015-02-26  6:24       ` Fan Du
2015-02-26 13:40   ` [PATCHv2 net-next 0/4] Small fix for TCP PMTU David Laight
2015-02-27  5:37     ` Fan Du

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.