* [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing
@ 2021-04-21 10:21 Leonard Crestez
  2021-04-21 12:47 ` Neal Cardwell
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Leonard Crestez @ 2021-04-21 10:21 UTC (permalink / raw)
  To: Willem de Bruijn, Ilya Lesokhin, David S. Miller, Eric Dumazet
  Cc: Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, Wei Wang,
	Neal Cardwell, Soheil Hassas Yeganeh, Roopa Prabhu, netdev,
	linux-kernel

According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
in order to accumulate enough data", but Linux almost never does that.

Linux waits for probe_size + (1 + retries) * mss_cache to be available
in the send buffer, and if that condition is not met it will send anyway
using the current MSS. The feature can be made to work by sending very
large chunks of data from userspace (for example 128k), but for small
writes on fast links probes almost never happen.
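
For reference, the existing check looks roughly like this (a standalone
paraphrase of the logic in tcp_mtu_probe(), not the verbatim kernel code;
the function name is invented for illustration and tp->reordering plays
the role of "retries" above):

/* Sketch of the current "enough data queued?" test, modeled on
 * tcp_mtu_probe().  Returning -1 means "skip probing and send at the
 * current MSS", which is what the kernel does today.
 */
static int mtu_probe_has_enough_data(unsigned int write_seq,
                                     unsigned int snd_nxt,
                                     int probe_size, int reordering,
                                     int mss_cache)
{
        int size_needed = probe_size + (reordering + 1) * mss_cache;

        if ((int)(write_seq - snd_nxt) < size_needed)
                return -1;      /* too little queued: send at current MSS */
        return 0;
}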

This patch tries to implement the "MAY" by adding an extra flag
"wait_data" to icsk_mtup which is set to 1 if a probe is possible but
insufficient data is available. Then data is held back in
tcp_write_xmit until a probe is sent, probing conditions are no longer
met, or 500ms pass.

Signed-off-by: Leonard Crestez <cdleonard@gmail.com>

---
 Documentation/networking/ip-sysctl.rst |  4 ++
 include/net/inet_connection_sock.h     |  7 +++-
 include/net/netns/ipv4.h               |  1 +
 include/net/tcp.h                      |  2 +
 net/ipv4/sysctl_net_ipv4.c             |  7 ++++
 net/ipv4/tcp_ipv4.c                    |  1 +
 net/ipv4/tcp_output.c                  | 54 ++++++++++++++++++++++++--
 7 files changed, 71 insertions(+), 5 deletions(-)

My tests are here: https://github.com/cdleonard/test-tcp-mtu-probing

This patch makes the test pass quite reliably with
ICMP_BLACKHOLE=1 TCP_MTU_PROBING=1 IPERF_WINDOW=256k IPERF_LEN=8k while
before it only worked with much higher IPERF_LEN=256k

In my loopback tests I also observed another issue where tcp_retries
increases because of SACKReorder. This makes the original problem worse
(since the retry count factors into the buffer requirement) and seems to
be an unrelated issue. Maybe when loss happens due to MTU shrinkage the
sender's SACK logic is confused somehow?

I know it's towards the end of the cycle but this is mostly just intended for
discussion.

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index c2ecc9894fd0..47945a6b435d 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -550,10 +550,14 @@ tcp_probe_interval - UNSIGNED INTEGER
 tcp_probe_threshold - INTEGER
 	Controls when TCP Packetization-Layer Path MTU Discovery probing
 	will stop in respect to the width of search range in bytes. Default
 	is 8 bytes.
 
+tcp_probe_wait - INTEGER
+	How long to wait for data for a TCP MTU probe. The default is 500
+	milliseconds; zero disables this feature.
+
 tcp_no_metrics_save - BOOLEAN
 	By default, TCP saves various connection metrics in the route cache
 	when the connection closes, so that connections established in the
 	near future can use these to set initial conditions.  Usually, this
 	increases overall performance, but may sometimes cause performance
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 3c8c59471bc1..19afcc7a4f4a 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -123,15 +123,20 @@ struct inet_connection_sock {
 		/* Range of MTUs to search */
 		int		  search_high;
 		int		  search_low;
 
 		/* Information on the current probe. */
-		u32		  probe_size:31,
+		u32		  probe_size:30,
+		/* Are we actively accumulating data for an mtu probe? */
+				  wait_data:1,
 		/* Is the MTUP feature enabled for this connection? */
 				  enabled:1;
 
 		u32		  probe_timestamp;
+
+		/* Timer for wait_data */
+		struct	  timer_list	wait_data_timer;
 	} icsk_mtup;
 	u32			  icsk_probes_tstamp;
 	u32			  icsk_user_timeout;
 
 	u64			  icsk_ca_priv[104 / sizeof(u64)];
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 87e1612497ea..5af1e8a31a02 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -126,10 +126,11 @@ struct netns_ipv4 {
 	int sysctl_tcp_mtu_probe_floor;
 	int sysctl_tcp_base_mss;
 	int sysctl_tcp_min_snd_mss;
 	int sysctl_tcp_probe_threshold;
 	u32 sysctl_tcp_probe_interval;
+	int sysctl_tcp_probe_wait;
 
 	int sysctl_tcp_keepalive_time;
 	int sysctl_tcp_keepalive_intvl;
 	u8 sysctl_tcp_keepalive_probes;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index eaea43afcc97..9060cc855363 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1375,10 +1375,12 @@ static inline void tcp_slow_start_after_idle_check(struct sock *sk)
 	s32 delta;
 
 	if (!sock_net(sk)->ipv4.sysctl_tcp_slow_start_after_idle || tp->packets_out ||
 	    ca_ops->cong_control)
 		return;
+	if (inet_csk(sk)->icsk_mtup.wait_data)
+		return;
 	delta = tcp_jiffies32 - tp->lsndtime;
 	if (delta > inet_csk(sk)->icsk_rto)
 		tcp_cwnd_restart(sk, delta);
 }
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index a62934b9f15a..1b6bbb24138a 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -842,10 +842,17 @@ static struct ctl_table ipv4_net_table[] = {
 		.maxlen		= sizeof(u32),
 		.mode		= 0644,
 		.proc_handler	= proc_douintvec_minmax,
 		.extra2		= &u32_max_div_HZ,
 	},
+	{
+		.procname	= "tcp_probe_wait",
+		.data		= &init_net.ipv4.sysctl_tcp_probe_wait,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_ms_jiffies,
+	},
 	{
 		.procname	= "igmp_link_local_mcast_reports",
 		.data		= &init_net.ipv4.sysctl_igmp_llm_reports,
 		.maxlen		= sizeof(u8),
 		.mode		= 0644,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 312184cead57..39f74f04e8b5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2889,10 +2889,11 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
 	net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
 	net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
 	net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
 	net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;
+	net->ipv4.sysctl_tcp_probe_wait = HZ / 2;
 
 	net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
 	net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES;
 	net->ipv4.sysctl_tcp_keepalive_intvl = TCP_KEEPALIVE_INTVL;
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index bde781f46b41..15bbf0c97959 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1754,10 +1754,39 @@ int tcp_mss_to_mtu(struct sock *sk, int mss)
 	}
 	return mtu;
 }
 EXPORT_SYMBOL(tcp_mss_to_mtu);
 
+void tcp_mtu_probe_wait_stop(struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	if (icsk->icsk_mtup.wait_data) {
+		icsk->icsk_mtup.wait_data = false;
+		sk_stop_timer(sk, &icsk->icsk_mtup.wait_data_timer);
+	}
+}
+
+static void tcp_mtu_probe_wait_timer(struct timer_list *t)
+{
+	struct inet_connection_sock *icsk = from_timer(icsk, t, icsk_mtup.wait_data_timer);
+	struct sock *sk = &icsk->icsk_inet.sk;
+
+	bh_lock_sock(sk);
+	if (!sock_owned_by_user(sk)) {
+		/* push pending frames now */
+		icsk->icsk_mtup.wait_data = false;
+		tcp_push_pending_frames(sk);
+	} else {
+		/* flush later if sock locked by user */
+		sk_reset_timer(sk, &icsk->icsk_mtup.wait_data_timer, jiffies + HZ / 10);
+	}
+	bh_unlock_sock(sk);
+	sock_put(sk);
+}
+
+
 /* MTU probing init per socket */
 void tcp_mtup_init(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
@@ -1768,10 +1797,11 @@ void tcp_mtup_init(struct sock *sk)
 			       icsk->icsk_af_ops->net_header_len;
 	icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, net->ipv4.sysctl_tcp_base_mss);
 	icsk->icsk_mtup.probe_size = 0;
 	if (icsk->icsk_mtup.enabled)
 		icsk->icsk_mtup.probe_timestamp = tcp_jiffies32;
+	timer_setup(&icsk->icsk_mtup.wait_data_timer, tcp_mtu_probe_wait_timer, 0);
 }
 EXPORT_SYMBOL(tcp_mtup_init);
 
 /* This function synchronize snd mss to current pmtu/exthdr set.
 
@@ -2366,16 +2396,18 @@ static int tcp_mtu_probe(struct sock *sk)
 		 */
 		tcp_mtu_check_reprobe(sk);
 		return -1;
 	}
 
+	/* Can probe ever fit inside window? */
+	if (tp->snd_wnd < size_needed)
+		return -1;
+
 	/* Have enough data in the send queue to probe? */
 	if (tp->write_seq - tp->snd_nxt < size_needed)
-		return -1;
+		return net->ipv4.sysctl_tcp_probe_wait ? 0 : -1;
 
-	if (tp->snd_wnd < size_needed)
-		return -1;
 	if (after(tp->snd_nxt + size_needed, tcp_wnd_end(tp)))
 		return 0;
 
 	/* Do we need to wait to drain cwnd? With none in flight, don't stall */
 	if (tcp_packets_in_flight(tp) + 2 > tp->snd_cwnd) {
@@ -2596,28 +2628,42 @@ void tcp_chrono_stop(struct sock *sk, const enum tcp_chrono type)
  * but cannot send anything now because of SWS or another problem.
  */
 static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			   int push_one, gfp_t gfp)
 {
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct net *net = sock_net(sk);
 	struct sk_buff *skb;
 	unsigned int tso_segs, sent_pkts;
 	int cwnd_quota;
 	int result;
 	bool is_cwnd_limited = false, is_rwnd_limited = false;
 	u32 max_segs;
 
 	sent_pkts = 0;
 
 	tcp_mstamp_refresh(tp);
-	if (!push_one) {
+	/*
+	 * Waiting for tcp probe data also applies when push_one=1.
+	 * If the user does many small writes we hold them until we have enough
+	 * for a probe.
+	 */
+	if (!push_one || (push_one < 2 && net->ipv4.sysctl_tcp_probe_wait)) {
 		/* Do MTU probing. */
 		result = tcp_mtu_probe(sk);
 		if (!result) {
+			if (net->ipv4.sysctl_tcp_probe_wait && !icsk->icsk_mtup.wait_data) {
+				icsk->icsk_mtup.wait_data = true;
+				sk_reset_timer(sk, &icsk->icsk_mtup.wait_data_timer, jiffies + net->ipv4.sysctl_tcp_probe_wait);
+			}
 			return false;
 		} else if (result > 0) {
+			tcp_mtu_probe_wait_stop(sk);
 			sent_pkts = 1;
+		} else {
+			tcp_mtu_probe_wait_stop(sk);
 		}
 	}
 
 	max_segs = tcp_tso_segs(sk, mss_now);
 	while ((skb = tcp_send_head(sk))) {
-- 
2.25.1



* Re: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing
  2021-04-21 10:21 [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing Leonard Crestez
@ 2021-04-21 12:47 ` Neal Cardwell
       [not found]   ` <CAH56bmDBGsHOSjJpo=TseUATOh0cZqTMFyFO1sqtQmMrTPHtrA@mail.gmail.com>
  2021-04-26  2:34   ` Leonard Crestez
  2021-04-21 13:05 ` kernel test robot
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 14+ messages in thread
From: Neal Cardwell @ 2021-04-21 12:47 UTC (permalink / raw)
  To: Leonard Crestez
  Cc: Willem de Bruijn, Ilya Lesokhin, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, Wei Wang,
	Soheil Hassas Yeganeh, Roopa Prabhu, netdev, linux-kernel,
	Matt Mathis, Yuchung Cheng

On Wed, Apr 21, 2021 at 6:21 AM Leonard Crestez <cdleonard@gmail.com> wrote:
>
> According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
> in order to accumulate enough data" but linux almost never does that.
>
> Linux waits for probe_size + (1 + retries) * mss_cache to be available
> in the send buffer and if that condition is not met it will send anyway
> using the current MSS. The feature can be made to work by sending very
> large chunks of data from userspace (for example 128k) but for small writes
> on fast links probes almost never happen.
>
> This patch tries to implement the "MAY" by adding an extra flag
> "wait_data" to icsk_mtup which is set to 1 if a probe is possible but
> insufficient data is available. Then data is held back in
> tcp_write_xmit until a probe is sent, probing conditions are no longer
> met, or 500ms pass.
>
> Signed-off-by: Leonard Crestez <cdleonard@gmail.com>
>
> ---
>  Documentation/networking/ip-sysctl.rst |  4 ++
>  include/net/inet_connection_sock.h     |  7 +++-
>  include/net/netns/ipv4.h               |  1 +
>  include/net/tcp.h                      |  2 +
>  net/ipv4/sysctl_net_ipv4.c             |  7 ++++
>  net/ipv4/tcp_ipv4.c                    |  1 +
>  net/ipv4/tcp_output.c                  | 54 ++++++++++++++++++++++++--
>  7 files changed, 71 insertions(+), 5 deletions(-)
>
> My tests are here: https://github.com/cdleonard/test-tcp-mtu-probing
>
> This patch makes the test pass quite reliably with
> ICMP_BLACKHOLE=1 TCP_MTU_PROBING=1 IPERF_WINDOW=256k IPERF_LEN=8k while
> before it only worked with much higher IPERF_LEN=256k
>
> In my loopback tests I also observed another issue when tcp_retries
> increases because of SACKReorder. This makes the original problem worse
> (since the retries amount factors in buffer requirement) and seems to be
> unrelated issue. Maybe when loss happens due to MTU shrinkage the sender
> sack logic is confused somehow?
>
> I know it's towards the end of the cycle but this is mostly just intended for
> discussion.

Thanks for raising the question of how to trigger PMTU probes more often!

AFAICT this approach would cause unacceptable performance impacts by
often injecting unnecessary 500ms delays when there is no need to do
so.

If the goal is to increase the frequency of PMTU probes, which seems
like a valid goal, I would suggest that we rethink the Linux heuristic
for triggering PMTU probes in the light of the fact that the loss
detection mechanism is now RACK-TLP, which provides quick recovery in
a much wider variety of scenarios.

After all, https://tools.ietf.org/html/rfc4821#section-7.4 says:

   In addition, the timely loss detection algorithms in most protocols
   have pre-conditions that SHOULD be satisfied before sending a probe.

And we know that the "timely loss detection algorithms" have advanced
since this RFC was written in 2007.

You mention:
> Linux waits for probe_size + (1 + retries) * mss_cache to be available

The code in question seems to be:

  size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;

How about just changing this to:

  size_needed = probe_size + tp->mss_cache;

The rationale would be that if that amount of data is available, then
the sender can send one probe and one following current-mss-size
packet. If the path MTU has not increased to allow the probe of size
probe_size to pass through the network, then the following
current-mss-size packet will likely pass through the network, generate
a SACK, and trigger a RACK fast recovery 1/4*min_rtt later, when the
RACK reorder timer fires.
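
To make the difference concrete, here is a small standalone illustration
(not kernel code; probe_size is really derived from the MTU search range,
2*mss_cache is just an example value, and tp->reordering defaults to 3):

#include <stdio.h>

int main(void)
{
        int mss_cache = 1460;                   /* typical Ethernet MSS */
        int probe_size = 2 * mss_cache;         /* example probe size */
        int reordering = 3;                     /* kernel default */

        /* what the current heuristic requires to be queued */
        int current_needed = probe_size + (reordering + 1) * mss_cache;
        /* what the proposed heuristic would require */
        int proposed_needed = probe_size + mss_cache;

        printf("current:  %d bytes queued\n", current_needed);   /* 8760 */
        printf("proposed: %d bytes queued\n", proposed_needed);  /* 4380 */
        return 0;
}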

A secondary rationale for this heuristic would be: if the flow never
accumulates roughly two packets worth of data, then does the flow
really need a bigger packet size?

IMHO, just reducing the size_needed seems far preferable to needlessly
injecting 500ms delays.

best,
neal


* Re: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing
  2021-04-21 10:21 [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing Leonard Crestez
  2021-04-21 12:47 ` Neal Cardwell
@ 2021-04-21 13:05 ` kernel test robot
  2021-04-21 13:05 ` [RFC PATCH] tcp: tcp_mtu_probe_wait_stop() can be static kernel test robot
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: kernel test robot @ 2021-04-21 13:05 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 1739 bytes --]

Hi Leonard,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on next-20210420]
[also build test WARNING on v5.12-rc8]
[cannot apply to linus/master v5.12-rc8 v5.12-rc7 v5.12-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Leonard-Crestez/tcp-Delay-sending-non-probes-for-RFC4821-mtu-probing/20210421-182309
base:    593ef1658ecf61d3619885bdbbcfffa3d1417891
config: um-randconfig-s031-20210421 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-341-g8af24329-dirty
        # https://github.com/0day-ci/linux/commit/0b2ec1d5d492384a89a499c7227c7f6a21f283d0
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Leonard-Crestez/tcp-Delay-sending-non-probes-for-RFC4821-mtu-probing/20210421-182309
        git checkout 0b2ec1d5d492384a89a499c7227c7f6a21f283d0
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' W=1 ARCH=um 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
>> net/ipv4/tcp_output.c:1759:6: sparse: sparse: symbol 'tcp_mtu_probe_wait_stop' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 12068 bytes --]


* [RFC PATCH] tcp: tcp_mtu_probe_wait_stop() can be static
  2021-04-21 10:21 [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing Leonard Crestez
  2021-04-21 12:47 ` Neal Cardwell
  2021-04-21 13:05 ` kernel test robot
@ 2021-04-21 13:05 ` kernel test robot
  2021-04-21 13:15 ` [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing kernel test robot
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: kernel test robot @ 2021-04-21 13:05 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 612 bytes --]


Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
---
 tcp_output.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 15bbf0c97959a8..db71f039dc8db4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1756,7 +1756,7 @@ int tcp_mss_to_mtu(struct sock *sk, int mss)
 }
 EXPORT_SYMBOL(tcp_mss_to_mtu);
 
-void tcp_mtu_probe_wait_stop(struct sock *sk)
+static void tcp_mtu_probe_wait_stop(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 


* Re: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing
  2021-04-21 10:21 [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing Leonard Crestez
                   ` (2 preceding siblings ...)
  2021-04-21 13:05 ` [RFC PATCH] tcp: tcp_mtu_probe_wait_stop() can be static kernel test robot
@ 2021-04-21 13:15 ` kernel test robot
  2021-04-21 14:53 ` kernel test robot
  2021-04-21 15:09 ` kernel test robot
  5 siblings, 0 replies; 14+ messages in thread
From: kernel test robot @ 2021-04-21 13:15 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2233 bytes --]

Hi Leonard,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on next-20210420]
[also build test WARNING on v5.12-rc8]
[cannot apply to linus/master v5.12-rc8 v5.12-rc7 v5.12-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Leonard-Crestez/tcp-Delay-sending-non-probes-for-RFC4821-mtu-probing/20210421-182309
base:    593ef1658ecf61d3619885bdbbcfffa3d1417891
config: nios2-defconfig (attached as .config)
compiler: nios2-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/0b2ec1d5d492384a89a499c7227c7f6a21f283d0
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Leonard-Crestez/tcp-Delay-sending-non-probes-for-RFC4821-mtu-probing/20210421-182309
        git checkout 0b2ec1d5d492384a89a499c7227c7f6a21f283d0
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross W=1 ARCH=nios2 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> net/ipv4/tcp_output.c:1759:6: warning: no previous prototype for 'tcp_mtu_probe_wait_stop' [-Wmissing-prototypes]
    1759 | void tcp_mtu_probe_wait_stop(struct sock *sk)
         |      ^~~~~~~~~~~~~~~~~~~~~~~


vim +/tcp_mtu_probe_wait_stop +1759 net/ipv4/tcp_output.c

  1758	
> 1759	void tcp_mtu_probe_wait_stop(struct sock *sk)
  1760	{
  1761		struct inet_connection_sock *icsk = inet_csk(sk);
  1762	
  1763		if (icsk->icsk_mtup.wait_data) {
  1764			icsk->icsk_mtup.wait_data = false;
  1765			sk_stop_timer(sk, &icsk->icsk_mtup.wait_data_timer);
  1766		}
  1767	}
  1768	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 10039 bytes --]


* Re: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing
  2021-04-21 10:21 [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing Leonard Crestez
                   ` (3 preceding siblings ...)
  2021-04-21 13:15 ` [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing kernel test robot
@ 2021-04-21 14:53 ` kernel test robot
  2021-04-21 15:09 ` kernel test robot
  5 siblings, 0 replies; 14+ messages in thread
From: kernel test robot @ 2021-04-21 14:53 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 7650 bytes --]

Hi Leonard,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on next-20210420]
[also build test WARNING on v5.12-rc8]
[cannot apply to linus/master v5.12-rc8 v5.12-rc7 v5.12-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Leonard-Crestez/tcp-Delay-sending-non-probes-for-RFC4821-mtu-probing/20210421-182309
base:    593ef1658ecf61d3619885bdbbcfffa3d1417891
config: s390-randconfig-r022-20210421 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project d87b9b81ccb95217181ce75515c6c68bbb408ca4)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install s390 cross compiling tool for clang build
        # apt-get install binutils-s390x-linux-gnu
        # https://github.com/0day-ci/linux/commit/0b2ec1d5d492384a89a499c7227c7f6a21f283d0
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Leonard-Crestez/tcp-Delay-sending-non-probes-for-RFC4821-mtu-probing/20210421-182309
        git checkout 0b2ec1d5d492384a89a499c7227c7f6a21f283d0
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=s390 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from net/ipv4/tcp_output.c:40:
   In file included from include/net/tcp.h:20:
   In file included from include/linux/tcp.h:17:
   In file included from include/linux/skbuff.h:31:
   In file included from include/linux/dma-mapping.h:10:
   In file included from include/linux/scatterlist.h:9:
   In file included from arch/s390/include/asm/io.h:75:
   include/asm-generic/io.h:464:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __raw_readb(PCI_IOBASE + addr);
                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:477:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/big_endian.h:36:59: note: expanded from macro '__le16_to_cpu'
   #define __le16_to_cpu(x) __swab16((__force __u16)(__le16)(x))
                                                             ^
   include/uapi/linux/swab.h:102:54: note: expanded from macro '__swab16'
   #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
                                                        ^
   In file included from net/ipv4/tcp_output.c:40:
   In file included from include/net/tcp.h:20:
   In file included from include/linux/tcp.h:17:
   In file included from include/linux/skbuff.h:31:
   In file included from include/linux/dma-mapping.h:10:
   In file included from include/linux/scatterlist.h:9:
   In file included from arch/s390/include/asm/io.h:75:
   include/asm-generic/io.h:490:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/big_endian.h:34:59: note: expanded from macro '__le32_to_cpu'
   #define __le32_to_cpu(x) __swab32((__force __u32)(__le32)(x))
                                                             ^
   include/uapi/linux/swab.h:115:54: note: expanded from macro '__swab32'
   #define __swab32(x) (__u32)__builtin_bswap32((__u32)(x))
                                                        ^
   In file included from net/ipv4/tcp_output.c:40:
   In file included from include/net/tcp.h:20:
   In file included from include/linux/tcp.h:17:
   In file included from include/linux/skbuff.h:31:
   In file included from include/linux/dma-mapping.h:10:
   In file included from include/linux/scatterlist.h:9:
   In file included from arch/s390/include/asm/io.h:75:
   include/asm-generic/io.h:501:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writeb(value, PCI_IOBASE + addr);
                               ~~~~~~~~~~ ^
   include/asm-generic/io.h:511:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
   include/asm-generic/io.h:521:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
   include/asm-generic/io.h:609:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsb(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:617:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsw(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:625:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsl(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:634:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesb(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
   include/asm-generic/io.h:643:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesw(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
   include/asm-generic/io.h:652:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesl(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
>> net/ipv4/tcp_output.c:1759:6: warning: no previous prototype for function 'tcp_mtu_probe_wait_stop' [-Wmissing-prototypes]
   void tcp_mtu_probe_wait_stop(struct sock *sk)
        ^
   net/ipv4/tcp_output.c:1759:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void tcp_mtu_probe_wait_stop(struct sock *sk)
   ^
   static 
   13 warnings generated.


vim +/tcp_mtu_probe_wait_stop +1759 net/ipv4/tcp_output.c

  1758	
> 1759	void tcp_mtu_probe_wait_stop(struct sock *sk)
  1760	{
  1761		struct inet_connection_sock *icsk = inet_csk(sk);
  1762	
  1763		if (icsk->icsk_mtup.wait_data) {
  1764			icsk->icsk_mtup.wait_data = false;
  1765			sk_stop_timer(sk, &icsk->icsk_mtup.wait_data_timer);
  1766		}
  1767	}
  1768	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 16777 bytes --]


* Re: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing
  2021-04-21 10:21 [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing Leonard Crestez
                   ` (4 preceding siblings ...)
  2021-04-21 14:53 ` kernel test robot
@ 2021-04-21 15:09 ` kernel test robot
  5 siblings, 0 replies; 14+ messages in thread
From: kernel test robot @ 2021-04-21 15:09 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 4362 bytes --]

Hi Leonard,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on next-20210420]
[also build test WARNING on v5.12-rc8]
[cannot apply to linus/master v5.12-rc8 v5.12-rc7 v5.12-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Leonard-Crestez/tcp-Delay-sending-non-probes-for-RFC4821-mtu-probing/20210421-182309
base:    593ef1658ecf61d3619885bdbbcfffa3d1417891
config: x86_64-randconfig-a015-20210421 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project d87b9b81ccb95217181ce75515c6c68bbb408ca4)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # https://github.com/0day-ci/linux/commit/0b2ec1d5d492384a89a499c7227c7f6a21f283d0
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Leonard-Crestez/tcp-Delay-sending-non-probes-for-RFC4821-mtu-probing/20210421-182309
        git checkout 0b2ec1d5d492384a89a499c7227c7f6a21f283d0
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   net/ipv4/tcp_output.c:188:3: warning: result of comparison of constant -1 with expression of type 'u8' (aka 'unsigned char') is always false [-Wtautological-constant-out-of-range-compare]
                   NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPACKCOMPRESSED,
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/net/ip.h:290:41: note: expanded from macro 'NET_ADD_STATS'
   #define NET_ADD_STATS(net, field, adnd) SNMP_ADD_STATS((net)->mib.net_statistics, field, adnd)
                                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/net/snmp.h:143:4: note: expanded from macro 'SNMP_ADD_STATS'
                           this_cpu_add(mib->mibs[field], addend)
                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/percpu-defs.h:509:33: note: expanded from macro 'this_cpu_add'
   #define this_cpu_add(pcp, val)          __pcpu_size_call(this_cpu_add_, pcp, val)
                                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   note: (skipping 1 expansions in backtrace; use -fmacro-backtrace-limit=0 to see all)
   <scratch space>:218:1: note: expanded from here
   this_cpu_add_8
   ^
   arch/x86/include/asm/percpu.h:326:35: note: expanded from macro 'this_cpu_add_8'
   #define this_cpu_add_8(pcp, val)                percpu_add_op(8, volatile, (pcp), val)
                                                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/x86/include/asm/percpu.h:127:31: note: expanded from macro 'percpu_add_op'
                                 ((val) == 1 || (val) == -1)) ?            \
                                                ~~~~~ ^  ~~
>> net/ipv4/tcp_output.c:1759:6: warning: no previous prototype for function 'tcp_mtu_probe_wait_stop' [-Wmissing-prototypes]
   void tcp_mtu_probe_wait_stop(struct sock *sk)
        ^
   net/ipv4/tcp_output.c:1759:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void tcp_mtu_probe_wait_stop(struct sock *sk)
   ^
   static 
   2 warnings generated.


vim +/tcp_mtu_probe_wait_stop +1759 net/ipv4/tcp_output.c

  1758	
> 1759	void tcp_mtu_probe_wait_stop(struct sock *sk)
  1760	{
  1761		struct inet_connection_sock *icsk = inet_csk(sk);
  1762	
  1763		if (icsk->icsk_mtup.wait_data) {
  1764			icsk->icsk_mtup.wait_data = false;
  1765			sk_stop_timer(sk, &icsk->icsk_mtup.wait_data_timer);
  1766		}
  1767	}
  1768	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 38984 bytes --]


* Fwd: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing
       [not found]   ` <CAH56bmDBGsHOSjJpo=TseUATOh0cZqTMFyFO1sqtQmMrTPHtrA@mail.gmail.com>
@ 2021-04-21 16:45     ` Matt Mathis
  2021-04-26  2:32       ` Leonard Crestez
  0 siblings, 1 reply; 14+ messages in thread
From: Matt Mathis @ 2021-04-21 16:45 UTC (permalink / raw)
  To: Leonard Crestez
  Cc: Willem de Bruijn, Neal Cardwell, Ilya Lesokhin,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Wei Wang, Soheil Hassas Yeganeh, Roopa Prabhu,
	netdev, linux-kernel, Yuchung Cheng

(Resending in plain text mode)

Surely there is a way to adapt tcp_tso_should_defer(); it is trying to
solve a similar problem.

If I were to implement PLPMTUD today, I would more deeply entwine it
into TCP's support for TSO; e.g. successfully deferring segments
sometimes enables TSO and sometimes enables PLPMTUD.

But there is a deeper question:  John Heffner and I invested a huge
amount of energy in trying to make PLPMTUD work for opportunistic
Jumbo discovery, only to discover that we had moved the problem down
to the device driver/nic, where it isn't so readily solvable.

The driver needs to carve nic buffer memory before it can communicate
with a switch (to either ask or measure the MTU), and once it has done
that it needs to either re-carve the memory or run with suboptimal
carving.  Both of these are problematic.

There is also a problem that many link technologies will
non-deterministically deliver jumbo frames at greatly increased error
rates.  This issue requires a long conversation on its own.

Thanks,
--MM--
The best way to predict the future is to create it.  - Alan Kay

We must not tolerate intolerance;
       however our response must be carefully measured:
            too strong would be hypocritical and risks spiraling out of control;
            too weak risks being mistaken for tacit approval.


On Wed, Apr 21, 2021 at 5:48 AM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Wed, Apr 21, 2021 at 6:21 AM Leonard Crestez <cdleonard@gmail.com> wrote:
> >
> > According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
> > in order to accumulate enough data" but linux almost never does that.
> >
> > Linux waits for probe_size + (1 + retries) * mss_cache to be available
> > in the send buffer and if that condition is not met it will send anyway
> > using the current MSS. The feature can be made to work by sending very
> > large chunks of data from userspace (for example 128k) but for small writes
> > on fast links probes almost never happen.
> >
> > This patch tries to implement the "MAY" by adding an extra flag
> > "wait_data" to icsk_mtup which is set to 1 if a probe is possible but
> > insufficient data is available. Then data is held back in
> > tcp_write_xmit until a probe is sent, probing conditions are no longer
> > met, or 500ms pass.
> >
> > Signed-off-by: Leonard Crestez <cdleonard@gmail.com>
> >
> > ---
> >  Documentation/networking/ip-sysctl.rst |  4 ++
> >  include/net/inet_connection_sock.h     |  7 +++-
> >  include/net/netns/ipv4.h               |  1 +
> >  include/net/tcp.h                      |  2 +
> >  net/ipv4/sysctl_net_ipv4.c             |  7 ++++
> >  net/ipv4/tcp_ipv4.c                    |  1 +
> >  net/ipv4/tcp_output.c                  | 54 ++++++++++++++++++++++++--
> >  7 files changed, 71 insertions(+), 5 deletions(-)
> >
> > My tests are here: https://github.com/cdleonard/test-tcp-mtu-probing
> >
> > This patch makes the test pass quite reliably with
> > ICMP_BLACKHOLE=1 TCP_MTU_PROBING=1 IPERF_WINDOW=256k IPERF_LEN=8k while
> > before it only worked with much higher IPERF_LEN=256k
> >
> > In my loopback tests I also observed another issue when tcp_retries
> > increases because of SACKReorder. This makes the original problem worse
> > (since the retries amount factors in buffer requirement) and seems to be
> > unrelated issue. Maybe when loss happens due to MTU shrinkage the sender
> > sack logic is confused somehow?
> >
> > I know it's towards the end of the cycle but this is mostly just intended for
> > discussion.
>
> Thanks for raising the question of how to trigger PMTU probes more often!
>
> AFAICT this approach would cause unacceptable performance impacts by
> often injecting unnecessary 500ms delays when there is no need to do
> so.
>
> If the goal is to increase the frequency of PMTU probes, which seems
> like a valid goal, I would suggest that we rethink the Linux heuristic
> for triggering PMTU probes in the light of the fact that the loss
> detection mechanism is now RACK-TLP, which provides quick recovery in
> a much wider variety of scenarios.
>
> After all, https://tools.ietf.org/html/rfc4821#section-7.4 says:
>
>    In addition, the timely loss detection algorithms in most protocols
>    have pre-conditions that SHOULD be satisfied before sending a probe.
>
> And we know that the "timely loss detection algorithms" have advanced
> since this RFC was written in 2007.
>
> You mention:
> > Linux waits for probe_size + (1 + retries) * mss_cache to be available
>
> The code in question seems to be:
>
>   size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>
> How about just changing this to:
>
>   size_needed = probe_size + tp->mss_cache;
>
> The rationale would be that if that amount of data is available, then
> the sender can send one probe and one following current-mss-size
> packet. If the path MTU has not increased to allow the probe of size
> probe_size to pass through the network, then the following
> current-mss-size packet will likely pass through the network, generate
> a SACK, and trigger a RACK fast recovery 1/4*min_rtt later, when the
> RACK reorder timer fires.
>
> A secondary rationale for this heuristic would be: if the flow never
> accumulates roughly two packets worth of data, then does the flow
> really need a bigger packet size?
>
> IMHO, just reducing the size_needed seems far preferable to needlessly
> injecting 500ms delays.
>
> best,
> neal


* Re: Fwd: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing
  2021-04-21 16:45     ` Fwd: " Matt Mathis
@ 2021-04-26  2:32       ` Leonard Crestez
  0 siblings, 0 replies; 14+ messages in thread
From: Leonard Crestez @ 2021-04-26  2:32 UTC (permalink / raw)
  To: Matt Mathis
  Cc: Willem de Bruijn, Neal Cardwell, Ilya Lesokhin,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Soheil Hassas Yeganeh, Roopa Prabhu, netdev,
	linux-kernel, Yuchung Cheng, John Heffner

On 4/21/21 7:45 PM, Matt Mathis wrote:
> (Resending in plain text mode)
> 
> Surely there is a way to adapt tcp_tso_should_defer(), it is trying to
> solve a similar problem.
> 
> If I were to implement PLPMTUD today, I would more deeply entwine it
> into TCP's support for TSO.  e.g. successful deferring segments
> sometimes enables TSO and sometimes enables PLPMTUD.

The mechanisms for delaying sending are difficult to understand; this 
RFC just added a brand-new unrelated timer. Intertwining it with 
existing mechanisms would indeed be better. On a closer look it seems 
that they're not actually based on a timer but on other heuristics.

It seems that tcp_sendmsg will "tcp_push_one" once the skb at the head 
of the queue reaches tcp_xmit_size_goal, and tcp_xmit_size_goal does not 
take mtu probing into account. In practice this means that 
application-limited streams won't perform mtu probing unless a single 
write is 5*mss + probe_size (1*mss over size_needed).

I sent a different RFC which tries to modify tcp_xmit_size_goal.
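
As a rough sketch of that idea (a hypothetical helper only, not code from
either RFC or the kernel; the name and parameters are invented for
illustration): while a probe could be built once enough data accumulates,
the size goal would be stretched so small writes keep coalescing instead
of being pushed out at the current MSS:

/* Hypothetical sketch, not kernel code: stretch the xmit size goal
 * while an mtu probe is pending so that small writes keep accumulating
 * until a probe-sized chunk of data is available.
 */
static unsigned int xmit_size_goal_for_probe(unsigned int size_goal,
                                             unsigned int probe_size_needed,
                                             int probe_pending)
{
        if (probe_pending && size_goal < probe_size_needed)
                return probe_size_needed;
        return size_goal;
}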

> But there is a deeper question:  John Heffner and I invested a huge
> amount of energy in trying to make PLPMTUD work for opportunistic
> Jumbo discovery, only to discover that we had moved the problem down
> to the device driver/nic, were it isn't so readily solvable.
> 
> The driver needs to carve nic buffer memory before it can communicate
> with a switch (to either ask or measure the MTU), and once it has done
> that it needs to either re-carve the memory or run with suboptimal
> carving.  Both of these are problematic.
> 
> There is also a problem that many link technologies will
> non-deterministically deliver jumbo frames at greatly increased error
> rates.   This issue requires a long conversation on it's own.

I'm looking to improve this for tunnels that don't correctly send ICMP 
packet-too-big messages; the hardware is assumed to be fine.

--
Regards,
Leonard


* Re: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing
  2021-04-21 12:47 ` Neal Cardwell
       [not found]   ` <CAH56bmDBGsHOSjJpo=TseUATOh0cZqTMFyFO1sqtQmMrTPHtrA@mail.gmail.com>
@ 2021-04-26  2:34   ` Leonard Crestez
  2021-04-26  3:20     ` Matt Mathis
  2021-04-26 15:59     ` Neal Cardwell
  1 sibling, 2 replies; 14+ messages in thread
From: Leonard Crestez @ 2021-04-26  2:34 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Willem de Bruijn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Soheil Hassas Yeganeh,
	Roopa Prabhu, netdev, linux-kernel, Matt Mathis, Yuchung Cheng,
	John Heffner

On 4/21/21 3:47 PM, Neal Cardwell wrote:
> On Wed, Apr 21, 2021 at 6:21 AM Leonard Crestez <cdleonard@gmail.com> wrote:
>>
>> According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
>> in order to accumulate enough data" but linux almost never does that.
>>
>> Linux waits for probe_size + (1 + retries) * mss_cache to be available
>> in the send buffer and if that condition is not met it will send anyway
>> using the current MSS. The feature can be made to work by sending very
>> large chunks of data from userspace (for example 128k) but for small writes
>> on fast links probes almost never happen.
>>
>> This patch tries to implement the "MAY" by adding an extra flag
>> "wait_data" to icsk_mtup which is set to 1 if a probe is possible but
>> insufficient data is available. Then data is held back in
>> tcp_write_xmit until a probe is sent, probing conditions are no longer
>> met, or 500ms pass.
>>
>> Signed-off-by: Leonard Crestez <cdleonard@gmail.com>
>>
>> ---
>>   Documentation/networking/ip-sysctl.rst |  4 ++
>>   include/net/inet_connection_sock.h     |  7 +++-
>>   include/net/netns/ipv4.h               |  1 +
>>   include/net/tcp.h                      |  2 +
>>   net/ipv4/sysctl_net_ipv4.c             |  7 ++++
>>   net/ipv4/tcp_ipv4.c                    |  1 +
>>   net/ipv4/tcp_output.c                  | 54 ++++++++++++++++++++++++--
>>   7 files changed, 71 insertions(+), 5 deletions(-)
>>
>> My tests are here: https://github.com/cdleonard/test-tcp-mtu-probing
>>
>> This patch makes the test pass quite reliably with
>> ICMP_BLACKHOLE=1 TCP_MTU_PROBING=1 IPERF_WINDOW=256k IPERF_LEN=8k while
>> before it only worked with much higher IPERF_LEN=256k
>>
>> In my loopback tests I also observed another issue when tcp_retries
>> increases because of SACKReorder. This makes the original problem worse
>> (since the retries amount factors in buffer requirement) and seems to be
>> unrelated issue. Maybe when loss happens due to MTU shrinkage the sender
>> sack logic is confused somehow?
>>
>> I know it's towards the end of the cycle but this is mostly just intended for
>> discussion.
> 
> Thanks for raising the question of how to trigger PMTU probes more often!
> 
> AFAICT this approach would cause unacceptable performance impacts by
> often injecting unnecessary 500ms delays when there is no need to do
> so.
> 
> If the goal is to increase the frequency of PMTU probes, which seems
> like a valid goal, I would suggest that we rethink the Linux heuristic
> for triggering PMTU probes in the light of the fact that the loss
> detection mechanism is now RACK-TLP, which provides quick recovery in
> a much wider variety of scenarios.

> After all, https://tools.ietf.org/html/rfc4821#section-7.4 says:
> 
>     In addition, the timely loss detection algorithms in most protocols
>     have pre-conditions that SHOULD be satisfied before sending a probe.
> 
> And we know that the "timely loss detection algorithms" have advanced
> since this RFC was written in 2007. >
> You mention:
>> Linux waits for probe_size + (1 + retries) * mss_cache to be available
> 
> The code in question seems to be:
> 
>    size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;

As far as I understand this is meant to work with classical retransmit: 
if 3 dupacks are received then the first segment is considered lost and 
probe success or failure can be determined within roughly 1*rtt. RACK 
marks segments as lost based on echoed timestamps so it doesn't need 
multiple segments. The minimum time interval is only a little higher 
(5/4 rtt). Is this correct?

> How about just changing this to:
> 
>    size_needed = probe_size + tp->mss_cache;
> 
> The rationale would be that if that amount of data is available, then
> the sender can send one probe and one following current-mss-size
> packet. If the path MTU has not increased to allow the probe of size
> probe_size to pass through the network, then the following
> current-mss-size packet will likely pass through the network, generate
> a SACK, and trigger a RACK fast recovery 1/4*min_rtt later, when the
> RACK reorder timer fires.

This appears to almost work, except it stalls after a while. I spent some 
time investigating it and it seems that cwnd is shrunk on mss increases 
and does not go back up. This causes probes to be skipped because of a 
"snd_cwnd < 11" condition.

I don't understand where that magical "11" comes from; could it be 
shrunk? Maybe it's meant to only send probes when the cwnd is above the 
default of 10? Then maybe mtu_probe_success shouldn't shrink mss below 
what is required for an additional probe, or at least round up.
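
For context, the gate I'm referring to looks roughly like this (a
standalone paraphrase of the entry check in tcp_mtu_probe() around v5.12,
not the verbatim kernel code; the function name is invented for
illustration):

/* Probing is skipped entirely unless all of these hold. */
static int mtu_probe_allowed(int mtup_enabled, int probe_in_flight,
                             int ca_state_open, unsigned int snd_cwnd,
                             int sacks_or_dsacks)
{
        return mtup_enabled &&
               !probe_in_flight &&
               ca_state_open &&
               snd_cwnd >= 11 &&        /* the "magical 11" */
               !sacks_or_dsacks;
}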

The shrinkage of cwnd is a problem with this "short probes" approach 
because tcp_is_cwnd_limited returns false when tp->max_packets_out is 
smaller (4). With longer probes tp->max_packets_out is larger (6), so 
tcp_is_cwnd_limited returns true even for a cwnd of 10.

I'm testing using namespace-to-namespace loopback so my delays are close 
to zero. I tried to introduce an artificial delay of 30ms (using tc 
netem) and it works but 20ms does not.

> A secondary rationale for this heuristic would be: if the flow never
> accumulates roughly two packets worth of data, then does the flow
> really need a bigger packet size?

The problem is that "accumulating sufficient data" is an extremely fuzzy 
concept. In particular it seems that at the same traffic level 
performing shorter writes from userspace (2kb instead of 64k) can 
prevent mtu probing entirely, which is unreasonable.

--
Regards,
Leonard


* Re: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing
  2021-04-26  2:34   ` Leonard Crestez
@ 2021-04-26  3:20     ` Matt Mathis
  2021-04-26 15:59     ` Neal Cardwell
  1 sibling, 0 replies; 14+ messages in thread
From: Matt Mathis @ 2021-04-26  3:20 UTC (permalink / raw)
  To: Leonard Crestez
  Cc: Neal Cardwell, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Soheil Hassas Yeganeh, Roopa Prabhu, netdev, linux-kernel,
	Yuchung Cheng, John Heffner

First, RFC 4821 was finished in 2007, but most of the work was done
several years earlier.  In the intervening years just about
everything else in TCP has been overhauled.  The 11-packet threshold was
probably reasonable for Reno with early retransmit and non-adaptive
reordering.  These are archaic algorithms by today's standards.

You have two problems that you need to address:
1) Like TSO, you need to hold back transmits that are otherwise
eligible to be sent, such that you have enough data in the backlog to
send a probe or TSO.
2) You need to make sure that TCP recovers promptly when a probe is
lost (which is the most common case).  This is not so hard on a
well-behaved network, but is likely to make your head hurt in the
presence of reordering (e.g. ECMP w/o flow pinning).  RACK helps, I'm sure.

Thanks,
--MM--
The best way to predict the future is to create it.  - Alan Kay

We must not tolerate intolerance;
       however our response must be carefully measured:
            too strong would be hypocritical and risks spiraling out of control;
            too weak risks being mistaken for tacit approval.

On Sun, Apr 25, 2021 at 7:34 PM Leonard Crestez <lcrestez@drivenets.com> wrote:
>
> On 4/21/21 3:47 PM, Neal Cardwell wrote:
> > On Wed, Apr 21, 2021 at 6:21 AM Leonard Crestez <cdleonard@gmail.com> wrote:
> >>
> >> According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
> >> in order to accumulate enough data" but linux almost never does that.
> >>
> >> Linux waits for probe_size + (1 + retries) * mss_cache to be available
> >> in the send buffer and if that condition is not met it will send anyway
> >> using the current MSS. The feature can be made to work by sending very
> >> large chunks of data from userspace (for example 128k) but for small writes
> >> on fast links probes almost never happen.
> >>
> >> This patch tries to implement the "MAY" by adding an extra flag
> >> "wait_data" to icsk_mtup which is set to 1 if a probe is possible but
> >> insufficient data is available. Then data is held back in
> >> tcp_write_xmit until a probe is sent, probing conditions are no longer
> >> met, or 500ms pass.
> >>
> >> Signed-off-by: Leonard Crestez <cdleonard@gmail.com>
> >>
> >> ---
> >>   Documentation/networking/ip-sysctl.rst |  4 ++
> >>   include/net/inet_connection_sock.h     |  7 +++-
> >>   include/net/netns/ipv4.h               |  1 +
> >>   include/net/tcp.h                      |  2 +
> >>   net/ipv4/sysctl_net_ipv4.c             |  7 ++++
> >>   net/ipv4/tcp_ipv4.c                    |  1 +
> >>   net/ipv4/tcp_output.c                  | 54 ++++++++++++++++++++++++--
> >>   7 files changed, 71 insertions(+), 5 deletions(-)
> >>
> >> My tests are here: https://github.com/cdleonard/test-tcp-mtu-probing
> >>
> >> This patch makes the test pass quite reliably with
> >> ICMP_BLACKHOLE=1 TCP_MTU_PROBING=1 IPERF_WINDOW=256k IPERF_LEN=8k while
> >> before it only worked with much higher IPERF_LEN=256k
> >>
> >> In my loopback tests I also observed another issue when tcp_retries
> >> increases because of SACKReorder. This makes the original problem worse
> >> (since the retries amount factors in buffer requirement) and seems to be
> >> unrelated issue. Maybe when loss happens due to MTU shrinkage the sender
> >> sack logic is confused somehow?
> >>
> >> I know it's towards the end of the cycle but this is mostly just intended for
> >> discussion.
> >
> > Thanks for raising the question of how to trigger PMTU probes more often!
> >
> > AFAICT this approach would cause unacceptable performance impacts by
> > often injecting unnecessary 500ms delays when there is no need to do
> > so.
> >
> > If the goal is to increase the frequency of PMTU probes, which seems
> > like a valid goal, I would suggest that we rethink the Linux heuristic
> > for triggering PMTU probes in the light of the fact that the loss
> > detection mechanism is now RACK-TLP, which provides quick recovery in
> > a much wider variety of scenarios.
>
> > After all, https://tools.ietf.org/html/rfc4821#section-7.4 says:
> >
> >     In addition, the timely loss detection algorithms in most protocols
> >     have pre-conditions that SHOULD be satisfied before sending a probe.
> >
> > And we know that the "timely loss detection algorithms" have advanced
> > since this RFC was written in 2007. >
> > You mention:
> >> Linux waits for probe_size + (1 + retries) * mss_cache to be available
> >
> > The code in question seems to be:
> >
> >    size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>
> As far as I understand this is meant to work with classical retransmit:
> if 3 dupacks are received then the first segment is considered lost and
> probe success or failure is can determine within roughly 1*rtt. RACK
> marks segments as lost based on echoed timestamps so it doesn't need
> multiple segments. The minimum time interval is only a little higher
> (5/4 rtt). Is this correct?
>
> > How about just changing this to:
> >
> >    size_needed = probe_size + tp->mss_cache;
> >
> > The rationale would be that if that amount of data is available, then
> > the sender can send one probe and one following current-mss-size
> > packet. If the path MTU has not increased to allow the probe of size
> > probe_size to pass through the network, then the following
> > current-mss-size packet will likely pass through the network, generate
> > a SACK, and trigger a RACK fast recovery 1/4*min_rtt later, when the
> > RACK reorder timer fires.
>
> This appears to almost work except it stalls after a while. I spend some
> time investigating it and it seems that cwnd is shrunk on mss increases
> and does not go back up. This causes probes to be skipped because of a
> "snd_cwnd < 11" condition.
>
> I don't undestand where that magical "11" comes from, could that be
> shrunk. Maybe it's meant to only send probes when the cwnd is above the
> default of 10? Then maybe mtu_probe_success shouldn't shrink mss below
> what is required for an additional probe, or at least round-up.
>
> The shrinkage of cwnd is a problem with this "short probes" approach
> because tcp_is_cwnd_limited returns false because tp->max_packets_out is
> smaller (4). With longer probes tp->max_packets_out is larger (6) so
> tcp_is_cwnd_limited returns true even for a cwnd of 10.
>
> I'm testing using namespace-to-namespace loopback so my delays are close
> to zero. I tried to introduce an artificial delay of 30ms (using tc
> netem) and it works but 20ms does not.
>
> > A secondary rationale for this heuristic would be: if the flow never
> > accumulates roughly two packets worth of data, then does the flow
> > really need a bigger packet size?
>
> The problem is that "accumulating sufficient data" is an extremely fuzzy
> concept. In particular it seems that at the same traffic level
> performing shorter writes from userspace (2kb instead of 64k) can
> prevent mtu probing entirely and this is unreasonable.
>
> --
> Regards,
> Leonard


* Re: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing
  2021-04-26  2:34   ` Leonard Crestez
  2021-04-26  3:20     ` Matt Mathis
@ 2021-04-26 15:59     ` Neal Cardwell
  2021-04-26 17:09       ` Leonard Crestez
  1 sibling, 1 reply; 14+ messages in thread
From: Neal Cardwell @ 2021-04-26 15:59 UTC (permalink / raw)
  To: Leonard Crestez
  Cc: Willem de Bruijn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Soheil Hassas Yeganeh,
	Roopa Prabhu, netdev, linux-kernel, Matt Mathis, Yuchung Cheng,
	John Heffner

On Sun, Apr 25, 2021 at 10:34 PM Leonard Crestez <lcrestez@drivenets.com> wrote:
>
> On 4/21/21 3:47 PM, Neal Cardwell wrote:
> > On Wed, Apr 21, 2021 at 6:21 AM Leonard Crestez <cdleonard@gmail.com> wrote:
> >>
> >> According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
> >> in order to accumulate enough data" but linux almost never does that.
> >>
> >> Linux waits for probe_size + (1 + retries) * mss_cache to be available
> >> in the send buffer and if that condition is not met it will send anyway
> >> using the current MSS. The feature can be made to work by sending very
> >> large chunks of data from userspace (for example 128k) but for small writes
> >> on fast links probes almost never happen.
> >>
> >> This patch tries to implement the "MAY" by adding an extra flag
> >> "wait_data" to icsk_mtup which is set to 1 if a probe is possible but
> >> insufficient data is available. Then data is held back in
> >> tcp_write_xmit until a probe is sent, probing conditions are no longer
> >> met, or 500ms pass.
> >>
> >> Signed-off-by: Leonard Crestez <cdleonard@gmail.com>
> >>
> >> ---
> >>   Documentation/networking/ip-sysctl.rst |  4 ++
> >>   include/net/inet_connection_sock.h     |  7 +++-
> >>   include/net/netns/ipv4.h               |  1 +
> >>   include/net/tcp.h                      |  2 +
> >>   net/ipv4/sysctl_net_ipv4.c             |  7 ++++
> >>   net/ipv4/tcp_ipv4.c                    |  1 +
> >>   net/ipv4/tcp_output.c                  | 54 ++++++++++++++++++++++++--
> >>   7 files changed, 71 insertions(+), 5 deletions(-)
> >>
> >> My tests are here: https://github.com/cdleonard/test-tcp-mtu-probing
> >>
> >> This patch makes the test pass quite reliably with
> >> ICMP_BLACKHOLE=1 TCP_MTU_PROBING=1 IPERF_WINDOW=256k IPERF_LEN=8k while
> >> before it only worked with much higher IPERF_LEN=256k
> >>
> >> In my loopback tests I also observed another issue when tcp_retries
> >> increases because of SACKReorder. This makes the original problem worse
> >> (since the retries amount factors in buffer requirement) and seems to be
> >> unrelated issue. Maybe when loss happens due to MTU shrinkage the sender
> >> sack logic is confused somehow?
> >>
> >> I know it's towards the end of the cycle but this is mostly just intended for
> >> discussion.
> >
> > Thanks for raising the question of how to trigger PMTU probes more often!
> >
> > AFAICT this approach would cause unacceptable performance impacts by
> > often injecting unnecessary 500ms delays when there is no need to do
> > so.
> >
> > If the goal is to increase the frequency of PMTU probes, which seems
> > like a valid goal, I would suggest that we rethink the Linux heuristic
> > for triggering PMTU probes in the light of the fact that the loss
> > detection mechanism is now RACK-TLP, which provides quick recovery in
> > a much wider variety of scenarios.
>
> > After all, https://tools.ietf.org/html/rfc4821#section-7.4 says:
> >
> >     In addition, the timely loss detection algorithms in most protocols
> >     have pre-conditions that SHOULD be satisfied before sending a probe.
> >
> > And we know that the "timely loss detection algorithms" have advanced
> > since this RFC was written in 2007.
> > You mention:
> >> Linux waits for probe_size + (1 + retries) * mss_cache to be available
> >
> > The code in question seems to be:
> >
> >    size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>
> As far as I understand this is meant to work with classical retransmit:
> if 3 dupacks are received then the first segment is considered lost and
> probe success or failure can be determined within roughly 1*rtt.

Yes, that is my sense as well.

> RACK
> marks segments as lost based on echoed timestamps so it doesn't need
> multiple segments. The minimum time interval is only a little higher
> (5/4 rtt). Is this correct?

That's basically the case, though RACK doesn't even require echoed timestamps.

> > How about just changing this to:
> >
> >    size_needed = probe_size + tp->mss_cache;
> >
> > The rationale would be that if that amount of data is available, then
> > the sender can send one probe and one following current-mss-size
> > packet. If the path MTU has not increased to allow the probe of size
> > probe_size to pass through the network, then the following
> > current-mss-size packet will likely pass through the network, generate
> > a SACK, and trigger a RACK fast recovery 1/4*min_rtt later, when the
> > RACK reorder timer fires.
>
> This appears to almost work except it stalls after a while. I spent some
> time investigating it and it seems that cwnd is shrunk on mss increases
> and does not go back up. This causes probes to be skipped because of a
> "snd_cwnd < 11" condition.
>
> I don't understand where that magical "11" comes from; could that be
> shrunk? Maybe it's meant to only send probes when the cwnd is above the
> default of 10? Then maybe mtu_probe_success shouldn't shrink mss below
> what is required for an additional probe, or at least round up.
>
> The shrinkage of cwnd is a problem with this "short probes" approach:
> tcp_is_cwnd_limited returns false because tp->max_packets_out is
> smaller (4). With longer probes tp->max_packets_out is larger (6), so
> tcp_is_cwnd_limited returns true even for a cwnd of 10.
>
> I'm testing using namespace-to-namespace loopback so my delays are close
> to zero. I tried to introduce an artificial delay of 30ms (using tc
> netem) and it works but 20ms does not.

I agree the magic 11 seems outdated and unnecessarily high, given RACK-TLP.

I think it would be fine to change the magic 11 to a magic
(TCP_FASTRETRANS_THRESH+1), aka 3+1=4:

  - tp->snd_cwnd < 11 ||
  + tp->snd_cwnd < (TCP_FASTRETRANS_THRESH + 1) ||

As long as the cwnd is >= TCP_FASTRETRANS_THRESH+1, the sender
should usually be able to send the 1 probe packet plus 3 additional
packets beyond the probe; in the common case (no reordering) a failed
probe should then let the sender quickly receive 3 SACKed segments and
enter fast recovery.
Even if the sender doesn't have 3 additional packets, or if reordering
has been detected, then RACK-TLP should be able to start recovery
quickly (5/4*RTT if there is at least one SACK, or 2*RTT for a TLP if
there is no SACK).
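
For concreteness, a rough sketch of how that gate in tcp_mtu_probe()
would read with the change; the surrounding checks are paraphrased from
net/ipv4/tcp_output.c of that era and may not match the exact upstream
source, so treat this as an illustration rather than a tested patch:

    /* Sketch only: same gate as today, with the magic 11 replaced.
     * TCP_FASTRETRANS_THRESH is the existing constant (3).
     */
    if (likely(!icsk->icsk_mtup.enabled ||
               icsk->icsk_mtup.probe_size ||
               inet_csk(sk)->icsk_ca_state != TCP_CA_Open ||
               tp->snd_cwnd < TCP_FASTRETRANS_THRESH + 1 ||
               tp->rx_opt.num_sacks || tp->rx_opt.dsack))
            return -1;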

> > A secondary rationale for this heuristic would be: if the flow never
> > accumulates roughly two packets worth of data, then does the flow
> > really need a bigger packet size?
>
> The problem is that "accumulating sufficient data" is an extremely fuzzy
> concept. In particular it seems that at the same traffic level
> performing shorter writes from userspace (2kb instead of 64k) can
> prevent mtu probing entirely and this is unreasonable.

Something like your autocorking-enhanced-PMTU patch sounds like a
reasonable way to deal with this.

thanks,
neal


* Re: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing
  2021-04-26 15:59     ` Neal Cardwell
@ 2021-04-26 17:09       ` Leonard Crestez
  2021-04-26 17:24         ` Neal Cardwell
  0 siblings, 1 reply; 14+ messages in thread
From: Leonard Crestez @ 2021-04-26 17:09 UTC (permalink / raw)
  To: Neal Cardwell, Matt Mathis
  Cc: Willem de Bruijn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Soheil Hassas Yeganeh,
	Roopa Prabhu, netdev, linux-kernel, Yuchung Cheng, John Heffner

On 26.04.2021 18:59, Neal Cardwell wrote:
> On Sun, Apr 25, 2021 at 10:34 PM Leonard Crestez <lcrestez@drivenets.com> wrote:
>> On 4/21/21 3:47 PM, Neal Cardwell wrote:
>>> On Wed, Apr 21, 2021 at 6:21 AM Leonard Crestez <cdleonard@gmail.com> wrote:

>>> If the goal is to increase the frequency of PMTU probes, which seems
>>> like a valid goal, I would suggest that we rethink the Linux heuristic
>>> for triggering PMTU probes in the light of the fact that the loss
>>> detection mechanism is now RACK-TLP, which provides quick recovery in
>>> a much wider variety of scenarios.
>>
>>> You mention:
>>>> Linux waits for probe_size + (1 + retries) * mss_cache to be available
>>>
>>> The code in question seems to be:
>>>
>>>     size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>>> How about just changing this to:
>>>
>>>     size_needed = probe_size + tp->mss_cache;
>>>
>>> The rationale would be that if that amount of data is available, then
>>> the sender can send one probe and one following current-mss-size
>>> packet. If the path MTU has not increased to allow the probe of size
>>> probe_size to pass through the network, then the following
>>> current-mss-size packet will likely pass through the network, generate
>>> a SACK, and trigger a RACK fast recovery 1/4*min_rtt later, when the
>>> RACK reorder timer fires.
>>
>> This appears to almost work except it stalls after a while. I spent some
>> time investigating it and it seems that cwnd is shrunk on mss increases
>> and does not go back up. This causes probes to be skipped because of a
>> "snd_cwnd < 11" condition.
>>
>> I don't understand where that magical "11" comes from; could that be
>> shrunk? Maybe it's meant to only send probes when the cwnd is above the
>> default of 10? Then maybe mtu_probe_success shouldn't shrink mss below
>> what is required for an additional probe, or at least round up.
>>
>> The shrinkage of cwnd is a problem with this "short probes" approach:
>> tcp_is_cwnd_limited returns false because tp->max_packets_out is
>> smaller (4). With longer probes tp->max_packets_out is larger (6), so
>> tcp_is_cwnd_limited returns true even for a cwnd of 10.
>>
>> I'm testing using namespace-to-namespace loopback so my delays are close
>> to zero. I tried to introduce an artificial delay of 30ms (using tc
>> netem) and it works but 20ms does not.
> 
> I agree the magic 11 seems outdated and unnecessarily high, given RACK-TLP.
> 
> I think it would be fine to change the magic 11 to a magic
> (TCP_FASTRETRANS_THRESH+1), aka 3+1=4:
> 
>    - tp->snd_cwnd < 11 ||
>    + tp->snd_cwnd < (TCP_FASTRETRANS_THRESH + 1) ||
> 
> As long as the cwnd is >= TCP_FASTRETRANS_THRESH+1, the sender
> should usually be able to send the 1 probe packet plus 3 additional
> packets beyond the probe; in the common case (no reordering) a failed
> probe should then let the sender quickly receive 3 SACKed segments and
> enter fast recovery.
> Even if the sender doesn't have 3 additional packets, or if reordering
> has been detected, then RACK-TLP should be able to start recovery
> quickly (5/4*RTT if there is at least one SACK, or 2*RTT for a TLP if
> there is no SACK).

As far as I understand tp->reordering is a dynamic evaluation of the 
fastretrans threshold to deal with environments with lots of reordering. 
Your suggestion seems equivalent to the current size_needed calculation 
except using packets instead of bytes.

Wouldn't it be easier to drop the "11" check and just verify that 
size_needed fits into cwnd as bytes?
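
Purely as an illustration of what I mean -- the exact placement in
tcp_mtu_probe() and the u64 cast are assumptions, not a tested change:

    /* Sketch only: instead of the fixed snd_cwnd < 11 gate, require
     * that size_needed (in bytes) fits into the congestion window.
     */
    if ((u64)tp->snd_cwnd * tp->mss_cache < size_needed)
            return -1;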

--
Regards,
Leonard


* Re: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing
  2021-04-26 17:09       ` Leonard Crestez
@ 2021-04-26 17:24         ` Neal Cardwell
  0 siblings, 0 replies; 14+ messages in thread
From: Neal Cardwell @ 2021-04-26 17:24 UTC (permalink / raw)
  To: Leonard Crestez
  Cc: Matt Mathis, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Soheil Hassas Yeganeh, Roopa Prabhu, netdev, linux-kernel,
	Yuchung Cheng, John Heffner

On Mon, Apr 26, 2021 at 1:09 PM Leonard Crestez <lcrestez@drivenets.com> wrote:
>
> On 26.04.2021 18:59, Neal Cardwell wrote:
> > On Sun, Apr 25, 2021 at 10:34 PM Leonard Crestez <lcrestez@drivenets.com> wrote:
> >> On 4/21/21 3:47 PM, Neal Cardwell wrote:
> >>> On Wed, Apr 21, 2021 at 6:21 AM Leonard Crestez <cdleonard@gmail.com> wrote:
>
> >>> If the goal is to increase the frequency of PMTU probes, which seems
> >>> like a valid goal, I would suggest that we rethink the Linux heuristic
> >>> for triggering PMTU probes in the light of the fact that the loss
> >>> detection mechanism is now RACK-TLP, which provides quick recovery in
> >>> a much wider variety of scenarios.
> >>
> >>> You mention:
> >>>> Linux waits for probe_size + (1 + retries) * mss_cache to be available
> >>>
> >>> The code in question seems to be:
> >>>
> >>>     size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
> >>> How about just changing this to:
> >>>
> >>>     size_needed = probe_size + tp->mss_cache;
> >>>
> >>> The rationale would be that if that amount of data is available, then
> >>> the sender can send one probe and one following current-mss-size
> >>> packet. If the path MTU has not increased to allow the probe of size
> >>> probe_size to pass through the network, then the following
> >>> current-mss-size packet will likely pass through the network, generate
> >>> a SACK, and trigger a RACK fast recovery 1/4*min_rtt later, when the
> >>> RACK reorder timer fires.
> >>
> >> This appears to almost work except it stalls after a while. I spent some
> >> time investigating it and it seems that cwnd is shrunk on mss increases
> >> and does not go back up. This causes probes to be skipped because of a
> >> "snd_cwnd < 11" condition.
> >>
> >> I don't understand where that magical "11" comes from; could that be
> >> shrunk? Maybe it's meant to only send probes when the cwnd is above the
> >> default of 10? Then maybe mtu_probe_success shouldn't shrink mss below
> >> what is required for an additional probe, or at least round up.
> >>
> >> The shrinkage of cwnd is a problem with this "short probes" approach:
> >> tcp_is_cwnd_limited returns false because tp->max_packets_out is
> >> smaller (4). With longer probes tp->max_packets_out is larger (6), so
> >> tcp_is_cwnd_limited returns true even for a cwnd of 10.
> >>
> >> I'm testing using namespace-to-namespace loopback so my delays are close
> >> to zero. I tried to introduce an artificial delay of 30ms (using tc
> >> netem) and it works but 20ms does not.
> >
> > I agree the magic 11 seems outdated and unnecessarily high, given RACK-TLP.
> >
> > I think it would be fine to change the magic 11 to a magic
> > (TCP_FASTRETRANS_THRESH+1), aka 3+1=4:
> >
> >    - tp->snd_cwnd < 11 ||
> >    + tp->snd_cwnd < (TCP_FASTRETRANS_THRESH + 1) ||
> >
> > As long as the cwnd is >= TCP_FASTRETRANS_THRESH+1, the sender
> > should usually be able to send the 1 probe packet plus 3 additional
> > packets beyond the probe; in the common case (no reordering) a failed
> > probe should then let the sender quickly receive 3 SACKed segments and
> > enter fast recovery.
> > Even if the sender doesn't have 3 additional packets, or if reordering
> > has been detected, then RACK-TLP should be able to start recovery
> > quickly (5/4*RTT if there is at least one SACK, or 2*RTT for a TLP if
> > there is no SACK).
>
> As far as I understand tp->reordering is a dynamic evaluation of the
> fastretrans threshold to deal with environments with lots of reordering.
> Your suggestion seems equivalent to the current size_needed calculation
> except using packets instead of bytes.
>
> Wouldn't it be easier to drop the "11" check and just verify that
> size_needed fits into cwnd as bytes?

Yes, that sounds good to me (dropping the "11" check in favor of
verifying that size_needed fits into the cwnd).

neal


end of thread, other threads:[~2021-04-26 17:24 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-21 10:21 [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing Leonard Crestez
2021-04-21 12:47 ` Neal Cardwell
     [not found]   ` <CAH56bmDBGsHOSjJpo=TseUATOh0cZqTMFyFO1sqtQmMrTPHtrA@mail.gmail.com>
2021-04-21 16:45     ` Fwd: " Matt Mathis
2021-04-26  2:32       ` Leonard Crestez
2021-04-26  2:34   ` Leonard Crestez
2021-04-26  3:20     ` Matt Mathis
2021-04-26 15:59     ` Neal Cardwell
2021-04-26 17:09       ` Leonard Crestez
2021-04-26 17:24         ` Neal Cardwell
2021-04-21 13:05 ` kernel test robot
2021-04-21 13:05 ` [RFC PATCH] tcp: tcp_mtu_probe_wait_stop() can be static kernel test robot
2021-04-21 13:15 ` [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing kernel test robot
2021-04-21 14:53 ` kernel test robot
2021-04-21 15:09 ` kernel test robot
