netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next v3] tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
@ 2023-06-09 20:47 Mike Freemon
  2023-06-10  0:42 ` Stephen Hemminger
  2023-06-10  1:43 ` Eric Dumazet
  0 siblings, 2 replies; 6+ messages in thread
From: Mike Freemon @ 2023-06-09 20:47 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, edumazet, mfreemon

From: "mfreemon@cloudflare.com" <mfreemon@cloudflare.com>

Under certain circumstances, the tcp receive buffer memory limit
set by autotuning (sk_rcvbuf) is increased due to incoming data
packets as a result of the window not closing when it should be.
This can result in the receive buffer growing all the way up to
tcp_rmem[2], even for tcp sessions with a low BDP.

To reproduce:  Connect a TCP session with the receiver doing
nothing and the sender sending small packets (an infinite loop
of socket send() with 4 bytes of payload with a sleep of 1 ms
in between each send()).  This will cause the tcp receive buffer
to grow all the way up to tcp_rmem[2].

As a result, a host can have individual tcp sessions with receive
buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
limits, causing the host to go into tcp memory pressure mode.

The fundamental issue is the relationship between the granularity
of the window scaling factor and the number of byte ACKed back
to the sender.  This problem has previously been identified in
RFC 7323, appendix F [1].

The Linux kernel currently adheres to never shrinking the window.

In addition to the overallocation of memory mentioned above, the
current behavior is functionally incorrect, because once tcp_rmem[2]
is reached when no remediations remain (i.e. tcp collapse fails to
free up any more memory and there are no packets to prune from the
out-of-order queue), the receiver will drop in-window packets
resulting in retransmissions and an eventual timeout of the tcp
session.  A receive buffer full condition should instead result
in a zero window and an indefinite wait.

In practice, this problem is largely hidden for most flows.  It
is not applicable to mice flows.  Elephant flows can send data
fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
triggering a zero window.

But this problem does show up for other types of flows.  Examples
are websockets and other type of flows that send small amounts of
data spaced apart slightly in time.  In these cases, we directly
encounter the problem described in [1].

RFC 7323, section 2.4 [2], says there are instances when a retracted
window can be offered, and that TCP implementations MUST ensure
that they handle a shrinking window, as specified in RFC 1122,
section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
management have made clear that sender must accept a shrunk window
from the receiver, including RFC 793 [4] and RFC 1323 [5].

This patch implements the functionality to shrink the tcp window
when necessary to keep the right edge within the memory limit by
autotuning (sk_rcvbuf).  This new functionality is enabled with
the new sysctl: net.ipv4.tcp_shrink_window

Additional information can be found at:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91
[4] https://www.rfc-editor.org/rfc/rfc793
[5] https://www.rfc-editor.org/rfc/rfc1323

Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
---
 Documentation/networking/ip-sysctl.rst | 13 +++++
 include/net/netns/ipv4.h               |  1 +
 net/ipv4/sysctl_net_ipv4.c             |  9 ++++
 net/ipv4/tcp_ipv4.c                    |  2 +
 net/ipv4/tcp_output.c                  | 73 ++++++++++++++++++++++++--
 5 files changed, 93 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 366e2a5097d9..ddb895e8af56 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -981,6 +981,19 @@ tcp_tw_reuse - INTEGER
 tcp_window_scaling - BOOLEAN
 	Enable window scaling as defined in RFC1323.
 
+tcp_shrink_window - BOOLEAN
+	This changes how the TCP receive window is calculated.
+
+	RFC 7323, section 2.4, says there are instances when a retracted
+	window can be offered, and that TCP implementations MUST ensure
+	that they handle a shrinking window, as specified in RFC 1122.
+
+	- 0 - Disabled.	The window is never shrunk.
+	- 1 - Enabled.	The window is shrunk when necessary to remain within
+			the memory limit set by autotuning (sk_rcvbuf).
+
+	Default: 0
+
 tcp_wmem - vector of 3 INTEGERs: min, default, max
 	min: Amount of memory reserved for send buffers for TCP sockets.
 	Each TCP socket has rights to use it due to fact of its birth.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index a4efb7a2796c..f00374718159 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -65,6 +65,7 @@ struct netns_ipv4 {
 #endif
 	bool			fib_has_custom_local_routes;
 	bool			fib_offload_disabled;
+	u8			sysctl_tcp_shrink_window;
 #ifdef CONFIG_IP_ROUTE_CLASSID
 	atomic_t		fib_num_tclassid_users;
 #endif
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 356afe54951c..2afb0870648b 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -1480,6 +1480,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= &tcp_syn_linear_timeouts_max,
 	},
+	{
+		.procname	= "tcp_shrink_window",
+		.data		= &init_net.ipv4.sysctl_tcp_shrink_window,
+		.maxlen		= sizeof(u8),
+		.mode		= 0644,
+		.proc_handler	= proc_dou8vec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 	{ }
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 84a5d557dc1a..9213804b034f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3281,6 +3281,8 @@ static int __net_init tcp_sk_init(struct net *net)
 		net->ipv4.tcp_congestion_control = &tcp_reno;
 
 	net->ipv4.sysctl_tcp_syn_linear_timeouts = 4;
+	net->ipv4.sysctl_tcp_shrink_window = 0;
+
 	return 0;
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f8ce77ce7c3e..5c86873e2193 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -260,8 +260,8 @@ static u16 tcp_select_window(struct sock *sk)
 	u32 old_win = tp->rcv_wnd;
 	u32 cur_win = tcp_receive_window(tp);
 	u32 new_win = __tcp_select_window(sk);
+	struct net *net = sock_net(sk);
 
-	/* Never shrink the offered window */
 	if (new_win < cur_win) {
 		/* Danger Will Robinson!
 		 * Don't update rcv_wup/rcv_wnd here or else
@@ -270,11 +270,15 @@ static u16 tcp_select_window(struct sock *sk)
 		 *
 		 * Relax Will Robinson.
 		 */
-		if (new_win == 0)
-			NET_INC_STATS(sock_net(sk),
-				      LINUX_MIB_TCPWANTZEROWINDOWADV);
-		new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+		if (!READ_ONCE(net->ipv4.sysctl_tcp_shrink_window)) {
+			/* Never shrink the offered window */
+			if (new_win == 0)
+				NET_INC_STATS(sock_net(sk),
+					      LINUX_MIB_TCPWANTZEROWINDOWADV);
+			new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+		}
 	}
+
 	tp->rcv_wnd = new_win;
 	tp->rcv_wup = tp->rcv_nxt;
 
@@ -3003,6 +3007,7 @@ u32 __tcp_select_window(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct net *net = sock_net(sk);
 	/* MSS for the peer's data.  Previous versions used mss_clamp
 	 * here.  I don't know if the value based on our guesses
 	 * of peer's MSS is better for the performance.  It's more correct
@@ -3024,6 +3029,12 @@ u32 __tcp_select_window(struct sock *sk)
 		if (mss <= 0)
 			return 0;
 	}
+
+	if (READ_ONCE(net->ipv4.sysctl_tcp_shrink_window))
+		goto shrink_window_allowed;
+
+	/* do not allow window to shrink */
+
 	if (free_space < (full_space >> 1)) {
 		icsk->icsk_ack.quick = 0;
 
@@ -3077,6 +3088,58 @@ u32 __tcp_select_window(struct sock *sk)
 			window = free_space;
 	}
 
+	return window;
+
+shrink_window_allowed:
+	/* new window should always be an exact multiple of scaling factor */
+	free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);
+
+	if (free_space < (full_space >> 1)) {
+		icsk->icsk_ack.quick = 0;
+
+		if (tcp_under_memory_pressure(sk))
+			tcp_adjust_rcv_ssthresh(sk);
+
+		/* if free space is too low, return a zero window */
+		if (free_space < (allowed_space >> 4) || free_space < mss ||
+			free_space < (1 << tp->rx_opt.rcv_wscale))
+			return 0;
+	}
+
+	if (free_space > tp->rcv_ssthresh) {
+		free_space = tp->rcv_ssthresh;
+		/* new window should always be an exact multiple of scaling factor
+		 *
+		 * For this case, we ALIGN "up" (increase free_space) because
+		 * we know free_space is not zero here, it has been reduced from
+		 * the memory-based limit, and rcv_ssthresh is not a hard limit
+		 * (unlike sk_rcvbuf).
+		 */
+		free_space = ALIGN(free_space, (1 << tp->rx_opt.rcv_wscale));
+	}
+
+	/* Don't do rounding if we are using window scaling, since the
+	 * scaled window will not line up with the MSS boundary anyway.
+	 */
+	if (tp->rx_opt.rcv_wscale) {
+		window = free_space;
+	} else {
+		window = tp->rcv_wnd;
+		/* Get the largest window that is a nice multiple of mss.
+		 * Window clamp already applied above.
+		 * If our current window offering is within 1 mss of the
+		 * free space we just keep it. This prevents the divide
+		 * and multiply from happening most of the time.
+		 * We also don't do any window rounding when the free space
+		 * is too small.
+		 */
+		if (window <= free_space - mss || window > free_space)
+			window = rounddown(free_space, mss);
+		else if (mss == full_space &&
+			 free_space > window + (full_space >> 1))
+			window = free_space;
+	}
+
 	return window;
 }
 
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH net-next v3] tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
  2023-06-09 20:47 [PATCH net-next v3] tcp: enforce receive buffer memory limits by allowing the tcp window to shrink Mike Freemon
@ 2023-06-10  0:42 ` Stephen Hemminger
  2023-06-10  1:37   ` Eric Dumazet
  2023-06-10  1:43 ` Eric Dumazet
  1 sibling, 1 reply; 6+ messages in thread
From: Stephen Hemminger @ 2023-06-10  0:42 UTC (permalink / raw)
  To: Mike Freemon; +Cc: netdev, kernel-team, edumazet

On Fri,  9 Jun 2023 15:47:06 -0500
Mike Freemon <mfreemon@cloudflare.com> wrote:

> +	{
> +		.procname	= "tcp_shrink_window",
> +		.data		= &init_net.ipv4.sysctl_tcp_shrink_window,
> +		.maxlen		= sizeof(u8),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dou8vec_minmax,
> +		.extra1		= SYSCTL_ZERO,
> +		.extra2		= SYSCTL_ONE,
> +	},

NAK on introducing another sysctl.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net-next v3] tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
  2023-06-10  0:42 ` Stephen Hemminger
@ 2023-06-10  1:37   ` Eric Dumazet
  0 siblings, 0 replies; 6+ messages in thread
From: Eric Dumazet @ 2023-06-10  1:37 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Mike Freemon, netdev, kernel-team

On Sat, Jun 10, 2023 at 2:42 AM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Fri,  9 Jun 2023 15:47:06 -0500
> Mike Freemon <mfreemon@cloudflare.com> wrote:
>
> > +     {
> > +             .procname       = "tcp_shrink_window",
> > +             .data           = &init_net.ipv4.sysctl_tcp_shrink_window,
> > +             .maxlen         = sizeof(u8),
> > +             .mode           = 0644,
> > +             .proc_handler   = proc_dou8vec_minmax,
> > +             .extra1         = SYSCTL_ZERO,
> > +             .extra2         = SYSCTL_ONE,
> > +     },
>
> NAK on introducing another sysctl.

We respectfully disagree.

A sysctl makes a lot of sense to us, as it allows us to not break many
packetdrill tests.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net-next v3] tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
  2023-06-09 20:47 [PATCH net-next v3] tcp: enforce receive buffer memory limits by allowing the tcp window to shrink Mike Freemon
  2023-06-10  0:42 ` Stephen Hemminger
@ 2023-06-10  1:43 ` Eric Dumazet
  2023-06-11  0:26   ` Mike Freemon
  2023-06-11  4:10   ` Mike Freemon
  1 sibling, 2 replies; 6+ messages in thread
From: Eric Dumazet @ 2023-06-10  1:43 UTC (permalink / raw)
  To: Mike Freemon, Neal Cardwell; +Cc: netdev, kernel-team

On Fri, Jun 9, 2023 at 10:47 PM Mike Freemon <mfreemon@cloudflare.com> wrote:
>
> From: "mfreemon@cloudflare.com" <mfreemon@cloudflare.com>
>
> Under certain circumstances, the tcp receive buffer memory limit
> set by autotuning (sk_rcvbuf) is increased due to incoming data
> packets as a result of the window not closing when it should be.
> This can result in the receive buffer growing all the way up to
> tcp_rmem[2], even for tcp sessions with a low BDP.
>
> To reproduce:  Connect a TCP session with the receiver doing
> nothing and the sender sending small packets (an infinite loop
> of socket send() with 4 bytes of payload with a sleep of 1 ms
> in between each send()).  This will cause the tcp receive buffer
> to grow all the way up to tcp_rmem[2].
>
> As a result, a host can have individual tcp sessions with receive
> buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
> limits, causing the host to go into tcp memory pressure mode.
>
> The fundamental issue is the relationship between the granularity
> of the window scaling factor and the number of byte ACKed back
> to the sender.  This problem has previously been identified in
> RFC 7323, appendix F [1].
>
> The Linux kernel currently adheres to never shrinking the window.
>
> In addition to the overallocation of memory mentioned above, the
> current behavior is functionally incorrect, because once tcp_rmem[2]
> is reached when no remediations remain (i.e. tcp collapse fails to
> free up any more memory and there are no packets to prune from the
> out-of-order queue), the receiver will drop in-window packets
> resulting in retransmissions and an eventual timeout of the tcp
> session.  A receive buffer full condition should instead result
> in a zero window and an indefinite wait.
>
> In practice, this problem is largely hidden for most flows.  It
> is not applicable to mice flows.  Elephant flows can send data
> fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
> triggering a zero window.
>
> But this problem does show up for other types of flows.  Examples
> are websockets and other type of flows that send small amounts of
> data spaced apart slightly in time.  In these cases, we directly
> encounter the problem described in [1].
>
> RFC 7323, section 2.4 [2], says there are instances when a retracted
> window can be offered, and that TCP implementations MUST ensure
> that they handle a shrinking window, as specified in RFC 1122,
> section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
> management have made clear that sender must accept a shrunk window
> from the receiver, including RFC 793 [4] and RFC 1323 [5].
>
> This patch implements the functionality to shrink the tcp window
> when necessary to keep the right edge within the memory limit by
> autotuning (sk_rcvbuf).  This new functionality is enabled with
> the new sysctl: net.ipv4.tcp_shrink_window
>
> Additional information can be found at:
> https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
>
> [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
> [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
> [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
> [4] https://www.rfc-editor.org/rfc/rfc793
> [5] https://www.rfc-editor.org/rfc/rfc1323
>
> Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
> ---
>  Documentation/networking/ip-sysctl.rst | 13 +++++
>  include/net/netns/ipv4.h               |  1 +
>  net/ipv4/sysctl_net_ipv4.c             |  9 ++++
>  net/ipv4/tcp_ipv4.c                    |  2 +
>  net/ipv4/tcp_output.c                  | 73 ++++++++++++++++++++++++--
>  5 files changed, 93 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index 366e2a5097d9..ddb895e8af56 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -981,6 +981,19 @@ tcp_tw_reuse - INTEGER
>  tcp_window_scaling - BOOLEAN
>         Enable window scaling as defined in RFC1323.
>
> +tcp_shrink_window - BOOLEAN
> +       This changes how the TCP receive window is calculated.
> +
> +       RFC 7323, section 2.4, says there are instances when a retracted
> +       window can be offered, and that TCP implementations MUST ensure
> +       that they handle a shrinking window, as specified in RFC 1122.
> +
> +       - 0 - Disabled. The window is never shrunk.
> +       - 1 - Enabled.  The window is shrunk when necessary to remain within
> +                       the memory limit set by autotuning (sk_rcvbuf).
> +
> +       Default: 0
> +
>  tcp_wmem - vector of 3 INTEGERs: min, default, max
>         min: Amount of memory reserved for send buffers for TCP sockets.
>         Each TCP socket has rights to use it due to fact of its birth.
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index a4efb7a2796c..f00374718159 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -65,6 +65,7 @@ struct netns_ipv4 {
>  #endif
>         bool                    fib_has_custom_local_routes;
>         bool                    fib_offload_disabled;
> +       u8                      sysctl_tcp_shrink_window;
>  #ifdef CONFIG_IP_ROUTE_CLASSID
>         atomic_t                fib_num_tclassid_users;
>  #endif
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 356afe54951c..2afb0870648b 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -1480,6 +1480,15 @@ static struct ctl_table ipv4_net_table[] = {
>                 .extra1         = SYSCTL_ZERO,
>                 .extra2         = &tcp_syn_linear_timeouts_max,
>         },
> +       {
> +               .procname       = "tcp_shrink_window",
> +               .data           = &init_net.ipv4.sysctl_tcp_shrink_window,
> +               .maxlen         = sizeof(u8),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dou8vec_minmax,
> +               .extra1         = SYSCTL_ZERO,
> +               .extra2         = SYSCTL_ONE,
> +       },
>         { }
>  };
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 84a5d557dc1a..9213804b034f 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -3281,6 +3281,8 @@ static int __net_init tcp_sk_init(struct net *net)
>                 net->ipv4.tcp_congestion_control = &tcp_reno;
>
>         net->ipv4.sysctl_tcp_syn_linear_timeouts = 4;
> +       net->ipv4.sysctl_tcp_shrink_window = 0;
> +
>         return 0;
>  }
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index f8ce77ce7c3e..5c86873e2193 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -260,8 +260,8 @@ static u16 tcp_select_window(struct sock *sk)
>         u32 old_win = tp->rcv_wnd;
>         u32 cur_win = tcp_receive_window(tp);
>         u32 new_win = __tcp_select_window(sk);
> +       struct net *net = sock_net(sk);

Here you cache sock_net() in @net variable.

>
> -       /* Never shrink the offered window */
>         if (new_win < cur_win) {
>                 /* Danger Will Robinson!
>                  * Don't update rcv_wup/rcv_wnd here or else
> @@ -270,11 +270,15 @@ static u16 tcp_select_window(struct sock *sk)
>                  *
>                  * Relax Will Robinson.
>                  */
> -               if (new_win == 0)
> -                       NET_INC_STATS(sock_net(sk),
> -                                     LINUX_MIB_TCPWANTZEROWINDOWADV);
> -               new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
> +               if (!READ_ONCE(net->ipv4.sysctl_tcp_shrink_window)) {
> +                       /* Never shrink the offered window */
> +                       if (new_win == 0)
> +                               NET_INC_STATS(sock_net(sk),

You then can replace sock_net(sk) by @net here.

> +                                             LINUX_MIB_TCPWANTZEROWINDOWADV);
> +                       new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
> +               }
>         }
> +
>         tp->rcv_wnd = new_win;
>         tp->rcv_wup = tp->rcv_nxt;
>
> @@ -3003,6 +3007,7 @@ u32 __tcp_select_window(struct sock *sk)
>  {
>         struct inet_connection_sock *icsk = inet_csk(sk);
>         struct tcp_sock *tp = tcp_sk(sk);
> +       struct net *net = sock_net(sk);
>         /* MSS for the peer's data.  Previous versions used mss_clamp
>          * here.  I don't know if the value based on our guesses
>          * of peer's MSS is better for the performance.  It's more correct
> @@ -3024,6 +3029,12 @@ u32 __tcp_select_window(struct sock *sk)
>                 if (mss <= 0)
>                         return 0;
>         }
> +
> +       if (READ_ONCE(net->ipv4.sysctl_tcp_shrink_window))
> +               goto shrink_window_allowed;
> +
> +       /* do not allow window to shrink */
> +
>         if (free_space < (full_space >> 1)) {
>                 icsk->icsk_ack.quick = 0;
>
> @@ -3077,6 +3088,58 @@ u32 __tcp_select_window(struct sock *sk)
>                         window = free_space;
>         }
>
> +       return window;
> +
> +shrink_window_allowed:
> +       /* new window should always be an exact multiple of scaling factor */
> +       free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);
> +
> +       if (free_space < (full_space >> 1)) {
> +               icsk->icsk_ack.quick = 0;
> +
> +               if (tcp_under_memory_pressure(sk))
> +                       tcp_adjust_rcv_ssthresh(sk);
> +
> +               /* if free space is too low, return a zero window */
> +               if (free_space < (allowed_space >> 4) || free_space < mss ||
> +                       free_space < (1 << tp->rx_opt.rcv_wscale))
> +                       return 0;

Are you sure this block can not be shared with the existing one ?

Existing one has this added part:

free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);

Not sure why this would break the tcp_shrink_window == 1 case.


> +       }
> +
> +       if (free_space > tp->rcv_ssthresh) {
> +               free_space = tp->rcv_ssthresh;
> +               /* new window should always be an exact multiple of scaling factor
> +                *
> +                * For this case, we ALIGN "up" (increase free_space) because
> +                * we know free_space is not zero here, it has been reduced from
> +                * the memory-based limit, and rcv_ssthresh is not a hard limit
> +                * (unlike sk_rcvbuf).
> +                */
> +               free_space = ALIGN(free_space, (1 << tp->rx_opt.rcv_wscale));
> +       }
> +
> +       /* Don't do rounding if we are using window scaling, since the
> +        * scaled window will not line up with the MSS boundary anyway.
> +        */
> +       if (tp->rx_opt.rcv_wscale) {
> +               window = free_space;
> +       } else {
> +               window = tp->rcv_wnd;
> +               /* Get the largest window that is a nice multiple of mss.
> +                * Window clamp already applied above.
> +                * If our current window offering is within 1 mss of the
> +                * free space we just keep it. This prevents the divide
> +                * and multiply from happening most of the time.
> +                * We also don't do any window rounding when the free space
> +                * is too small.
> +                */
> +               if (window <= free_space - mss || window > free_space)
> +                       window = rounddown(free_space, mss);
> +               else if (mss == full_space &&
> +                        free_space > window + (full_space >> 1))
> +                       window = free_space;
> +       }
> +

I am a bit surprised we can not come up with something simpler.

I was suggesting to look at the sysctl only if we were in a dangerous operation,


Neal, can you have a look ?

Thanks.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net-next v3] tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
  2023-06-10  1:43 ` Eric Dumazet
@ 2023-06-11  0:26   ` Mike Freemon
  2023-06-11  4:10   ` Mike Freemon
  1 sibling, 0 replies; 6+ messages in thread
From: Mike Freemon @ 2023-06-11  0:26 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell; +Cc: netdev, kernel-team



On 6/9/23 20:43, Eric Dumazet wrote:
> On Fri, Jun 9, 2023 at 10:47 PM Mike Freemon <mfreemon@cloudflare.com> wrote:
>>
>> From: "mfreemon@cloudflare.com" <mfreemon@cloudflare.com>
>>
>> Under certain circumstances, the tcp receive buffer memory limit
>> set by autotuning (sk_rcvbuf) is increased due to incoming data
>> packets as a result of the window not closing when it should be.
>> This can result in the receive buffer growing all the way up to
>> tcp_rmem[2], even for tcp sessions with a low BDP.
>>
>> To reproduce:  Connect a TCP session with the receiver doing
>> nothing and the sender sending small packets (an infinite loop
>> of socket send() with 4 bytes of payload with a sleep of 1 ms
>> in between each send()).  This will cause the tcp receive buffer
>> to grow all the way up to tcp_rmem[2].
>>
>> As a result, a host can have individual tcp sessions with receive
>> buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
>> limits, causing the host to go into tcp memory pressure mode.
>>
>> The fundamental issue is the relationship between the granularity
>> of the window scaling factor and the number of byte ACKed back
>> to the sender.  This problem has previously been identified in
>> RFC 7323, appendix F [1].
>>
>> The Linux kernel currently adheres to never shrinking the window.
>>
>> In addition to the overallocation of memory mentioned above, the
>> current behavior is functionally incorrect, because once tcp_rmem[2]
>> is reached when no remediations remain (i.e. tcp collapse fails to
>> free up any more memory and there are no packets to prune from the
>> out-of-order queue), the receiver will drop in-window packets
>> resulting in retransmissions and an eventual timeout of the tcp
>> session.  A receive buffer full condition should instead result
>> in a zero window and an indefinite wait.
>>
>> In practice, this problem is largely hidden for most flows.  It
>> is not applicable to mice flows.  Elephant flows can send data
>> fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
>> triggering a zero window.
>>
>> But this problem does show up for other types of flows.  Examples
>> are websockets and other type of flows that send small amounts of
>> data spaced apart slightly in time.  In these cases, we directly
>> encounter the problem described in [1].
>>
>> RFC 7323, section 2.4 [2], says there are instances when a retracted
>> window can be offered, and that TCP implementations MUST ensure
>> that they handle a shrinking window, as specified in RFC 1122,
>> section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
>> management have made clear that sender must accept a shrunk window
>> from the receiver, including RFC 793 [4] and RFC 1323 [5].
>>
>> This patch implements the functionality to shrink the tcp window
>> when necessary to keep the right edge within the memory limit by
>> autotuning (sk_rcvbuf).  This new functionality is enabled with
>> the new sysctl: net.ipv4.tcp_shrink_window
>>
>> Additional information can be found at:
>> https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
>>
>> [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
>> [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
>> [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
>> [4] https://www.rfc-editor.org/rfc/rfc793
>> [5] https://www.rfc-editor.org/rfc/rfc1323
>>
>> Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
>> ---
>>  Documentation/networking/ip-sysctl.rst | 13 +++++
>>  include/net/netns/ipv4.h               |  1 +
>>  net/ipv4/sysctl_net_ipv4.c             |  9 ++++
>>  net/ipv4/tcp_ipv4.c                    |  2 +
>>  net/ipv4/tcp_output.c                  | 73 ++++++++++++++++++++++++--
>>  5 files changed, 93 insertions(+), 5 deletions(-)
>>
>> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
>> index 366e2a5097d9..ddb895e8af56 100644
>> --- a/Documentation/networking/ip-sysctl.rst
>> +++ b/Documentation/networking/ip-sysctl.rst
>> @@ -981,6 +981,19 @@ tcp_tw_reuse - INTEGER
>>  tcp_window_scaling - BOOLEAN
>>         Enable window scaling as defined in RFC1323.
>>
>> +tcp_shrink_window - BOOLEAN
>> +       This changes how the TCP receive window is calculated.
>> +
>> +       RFC 7323, section 2.4, says there are instances when a retracted
>> +       window can be offered, and that TCP implementations MUST ensure
>> +       that they handle a shrinking window, as specified in RFC 1122.
>> +
>> +       - 0 - Disabled. The window is never shrunk.
>> +       - 1 - Enabled.  The window is shrunk when necessary to remain within
>> +                       the memory limit set by autotuning (sk_rcvbuf).
>> +
>> +       Default: 0
>> +
>>  tcp_wmem - vector of 3 INTEGERs: min, default, max
>>         min: Amount of memory reserved for send buffers for TCP sockets.
>>         Each TCP socket has rights to use it due to fact of its birth.
>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>> index a4efb7a2796c..f00374718159 100644
>> --- a/include/net/netns/ipv4.h
>> +++ b/include/net/netns/ipv4.h
>> @@ -65,6 +65,7 @@ struct netns_ipv4 {
>>  #endif
>>         bool                    fib_has_custom_local_routes;
>>         bool                    fib_offload_disabled;
>> +       u8                      sysctl_tcp_shrink_window;
>>  #ifdef CONFIG_IP_ROUTE_CLASSID
>>         atomic_t                fib_num_tclassid_users;
>>  #endif
>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>> index 356afe54951c..2afb0870648b 100644
>> --- a/net/ipv4/sysctl_net_ipv4.c
>> +++ b/net/ipv4/sysctl_net_ipv4.c
>> @@ -1480,6 +1480,15 @@ static struct ctl_table ipv4_net_table[] = {
>>                 .extra1         = SYSCTL_ZERO,
>>                 .extra2         = &tcp_syn_linear_timeouts_max,
>>         },
>> +       {
>> +               .procname       = "tcp_shrink_window",
>> +               .data           = &init_net.ipv4.sysctl_tcp_shrink_window,
>> +               .maxlen         = sizeof(u8),
>> +               .mode           = 0644,
>> +               .proc_handler   = proc_dou8vec_minmax,
>> +               .extra1         = SYSCTL_ZERO,
>> +               .extra2         = SYSCTL_ONE,
>> +       },
>>         { }
>>  };
>>
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index 84a5d557dc1a..9213804b034f 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -3281,6 +3281,8 @@ static int __net_init tcp_sk_init(struct net *net)
>>                 net->ipv4.tcp_congestion_control = &tcp_reno;
>>
>>         net->ipv4.sysctl_tcp_syn_linear_timeouts = 4;
>> +       net->ipv4.sysctl_tcp_shrink_window = 0;
>> +
>>         return 0;
>>  }
>>
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index f8ce77ce7c3e..5c86873e2193 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -260,8 +260,8 @@ static u16 tcp_select_window(struct sock *sk)
>>         u32 old_win = tp->rcv_wnd;
>>         u32 cur_win = tcp_receive_window(tp);
>>         u32 new_win = __tcp_select_window(sk);
>> +       struct net *net = sock_net(sk);
> 
> Here you cache sock_net() in @net variable.
> 
>>
>> -       /* Never shrink the offered window */
>>         if (new_win < cur_win) {
>>                 /* Danger Will Robinson!
>>                  * Don't update rcv_wup/rcv_wnd here or else
>> @@ -270,11 +270,15 @@ static u16 tcp_select_window(struct sock *sk)
>>                  *
>>                  * Relax Will Robinson.
>>                  */
>> -               if (new_win == 0)
>> -                       NET_INC_STATS(sock_net(sk),
>> -                                     LINUX_MIB_TCPWANTZEROWINDOWADV);
>> -               new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
>> +               if (!READ_ONCE(net->ipv4.sysctl_tcp_shrink_window)) {
>> +                       /* Never shrink the offered window */
>> +                       if (new_win == 0)
>> +                               NET_INC_STATS(sock_net(sk),
> 
> You then can replace sock_net(sk) by @net here.
> 
>> +                                             LINUX_MIB_TCPWANTZEROWINDOWADV);
>> +                       new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
>> +               }
>>         }
>> +
>>         tp->rcv_wnd = new_win;
>>         tp->rcv_wup = tp->rcv_nxt;
>>
>> @@ -3003,6 +3007,7 @@ u32 __tcp_select_window(struct sock *sk)
>>  {
>>         struct inet_connection_sock *icsk = inet_csk(sk);
>>         struct tcp_sock *tp = tcp_sk(sk);
>> +       struct net *net = sock_net(sk);
>>         /* MSS for the peer's data.  Previous versions used mss_clamp
>>          * here.  I don't know if the value based on our guesses
>>          * of peer's MSS is better for the performance.  It's more correct
>> @@ -3024,6 +3029,12 @@ u32 __tcp_select_window(struct sock *sk)
>>                 if (mss <= 0)
>>                         return 0;
>>         }
>> +
>> +       if (READ_ONCE(net->ipv4.sysctl_tcp_shrink_window))
>> +               goto shrink_window_allowed;
>> +
>> +       /* do not allow window to shrink */
>> +
>>         if (free_space < (full_space >> 1)) {
>>                 icsk->icsk_ack.quick = 0;
>>
>> @@ -3077,6 +3088,58 @@ u32 __tcp_select_window(struct sock *sk)
>>                         window = free_space;
>>         }
>>
>> +       return window;
>> +
>> +shrink_window_allowed:
>> +       /* new window should always be an exact multiple of scaling factor */
>> +       free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);
>> +
>> +       if (free_space < (full_space >> 1)) {
>> +               icsk->icsk_ack.quick = 0;
>> +
>> +               if (tcp_under_memory_pressure(sk))
>> +                       tcp_adjust_rcv_ssthresh(sk);
>> +
>> +               /* if free space is too low, return a zero window */
>> +               if (free_space < (allowed_space >> 4) || free_space < mss ||
>> +                       free_space < (1 << tp->rx_opt.rcv_wscale))
>> +                       return 0;
> 
> Are you sure this block can not be shared with the existing one ?
> 
> Existing one has this added part:
> 
> free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);
> 
> Not sure why this would break the tcp_shrink_window == 1 case.
> 
> 
>> +       }
>> +
>> +       if (free_space > tp->rcv_ssthresh) {
>> +               free_space = tp->rcv_ssthresh;
>> +               /* new window should always be an exact multiple of scaling factor
>> +                *
>> +                * For this case, we ALIGN "up" (increase free_space) because
>> +                * we know free_space is not zero here, it has been reduced from
>> +                * the memory-based limit, and rcv_ssthresh is not a hard limit
>> +                * (unlike sk_rcvbuf).
>> +                */
>> +               free_space = ALIGN(free_space, (1 << tp->rx_opt.rcv_wscale));
>> +       }
>> +
>> +       /* Don't do rounding if we are using window scaling, since the
>> +        * scaled window will not line up with the MSS boundary anyway.
>> +        */
>> +       if (tp->rx_opt.rcv_wscale) {
>> +               window = free_space;
>> +       } else {
>> +               window = tp->rcv_wnd;
>> +               /* Get the largest window that is a nice multiple of mss.
>> +                * Window clamp already applied above.
>> +                * If our current window offering is within 1 mss of the
>> +                * free space we just keep it. This prevents the divide
>> +                * and multiply from happening most of the time.
>> +                * We also don't do any window rounding when the free space
>> +                * is too small.
>> +                */
>> +               if (window <= free_space - mss || window > free_space)
>> +                       window = rounddown(free_space, mss);
>> +               else if (mss == full_space &&
>> +                        free_space > window + (full_space >> 1))
>> +                       window = free_space;
>> +       }
>> +
> 
> I am a bit surprised we can not come up with something simpler.

Since sysctl_tcp_shrink_window was only intended to apply when rcv_wscale > 0, I can make
that explicit, which will eliminate this whole latter block of code.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net-next v3] tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
  2023-06-10  1:43 ` Eric Dumazet
  2023-06-11  0:26   ` Mike Freemon
@ 2023-06-11  4:10   ` Mike Freemon
  1 sibling, 0 replies; 6+ messages in thread
From: Mike Freemon @ 2023-06-11  4:10 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell; +Cc: netdev, kernel-team



On 6/9/23 20:43, Eric Dumazet wrote:
> On Fri, Jun 9, 2023 at 10:47 PM Mike Freemon <mfreemon@cloudflare.com> wrote:
>>
>> From: "mfreemon@cloudflare.com" <mfreemon@cloudflare.com>
>>
>> Under certain circumstances, the tcp receive buffer memory limit
>> set by autotuning (sk_rcvbuf) is increased due to incoming data
>> packets as a result of the window not closing when it should be.
>> This can result in the receive buffer growing all the way up to
>> tcp_rmem[2], even for tcp sessions with a low BDP.
>>
>> To reproduce:  Connect a TCP session with the receiver doing
>> nothing and the sender sending small packets (an infinite loop
>> of socket send() with 4 bytes of payload with a sleep of 1 ms
>> in between each send()).  This will cause the tcp receive buffer
>> to grow all the way up to tcp_rmem[2].
>>
>> As a result, a host can have individual tcp sessions with receive
>> buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
>> limits, causing the host to go into tcp memory pressure mode.
>>
>> The fundamental issue is the relationship between the granularity
>> of the window scaling factor and the number of byte ACKed back
>> to the sender.  This problem has previously been identified in
>> RFC 7323, appendix F [1].
>>
>> The Linux kernel currently adheres to never shrinking the window.
>>
>> In addition to the overallocation of memory mentioned above, the
>> current behavior is functionally incorrect, because once tcp_rmem[2]
>> is reached when no remediations remain (i.e. tcp collapse fails to
>> free up any more memory and there are no packets to prune from the
>> out-of-order queue), the receiver will drop in-window packets
>> resulting in retransmissions and an eventual timeout of the tcp
>> session.  A receive buffer full condition should instead result
>> in a zero window and an indefinite wait.
>>
>> In practice, this problem is largely hidden for most flows.  It
>> is not applicable to mice flows.  Elephant flows can send data
>> fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
>> triggering a zero window.
>>
>> But this problem does show up for other types of flows.  Examples
>> are websockets and other type of flows that send small amounts of
>> data spaced apart slightly in time.  In these cases, we directly
>> encounter the problem described in [1].
>>
>> RFC 7323, section 2.4 [2], says there are instances when a retracted
>> window can be offered, and that TCP implementations MUST ensure
>> that they handle a shrinking window, as specified in RFC 1122,
>> section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
>> management have made clear that sender must accept a shrunk window
>> from the receiver, including RFC 793 [4] and RFC 1323 [5].
>>
>> This patch implements the functionality to shrink the tcp window
>> when necessary to keep the right edge within the memory limit by
>> autotuning (sk_rcvbuf).  This new functionality is enabled with
>> the new sysctl: net.ipv4.tcp_shrink_window
>>
>> Additional information can be found at:
>> https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
>>
>> [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
>> [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
>> [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
>> [4] https://www.rfc-editor.org/rfc/rfc793
>> [5] https://www.rfc-editor.org/rfc/rfc1323
>>
>> Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
>> ---
>>  Documentation/networking/ip-sysctl.rst | 13 +++++
>>  include/net/netns/ipv4.h               |  1 +
>>  net/ipv4/sysctl_net_ipv4.c             |  9 ++++
>>  net/ipv4/tcp_ipv4.c                    |  2 +
>>  net/ipv4/tcp_output.c                  | 73 ++++++++++++++++++++++++--
>>  5 files changed, 93 insertions(+), 5 deletions(-)
>>
>> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
>> index 366e2a5097d9..ddb895e8af56 100644
>> --- a/Documentation/networking/ip-sysctl.rst
>> +++ b/Documentation/networking/ip-sysctl.rst
>> @@ -981,6 +981,19 @@ tcp_tw_reuse - INTEGER
>>  tcp_window_scaling - BOOLEAN
>>         Enable window scaling as defined in RFC1323.
>>
>> +tcp_shrink_window - BOOLEAN
>> +       This changes how the TCP receive window is calculated.
>> +
>> +       RFC 7323, section 2.4, says there are instances when a retracted
>> +       window can be offered, and that TCP implementations MUST ensure
>> +       that they handle a shrinking window, as specified in RFC 1122.
>> +
>> +       - 0 - Disabled. The window is never shrunk.
>> +       - 1 - Enabled.  The window is shrunk when necessary to remain within
>> +                       the memory limit set by autotuning (sk_rcvbuf).
>> +
>> +       Default: 0
>> +
>>  tcp_wmem - vector of 3 INTEGERs: min, default, max
>>         min: Amount of memory reserved for send buffers for TCP sockets.
>>         Each TCP socket has rights to use it due to fact of its birth.
>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>> index a4efb7a2796c..f00374718159 100644
>> --- a/include/net/netns/ipv4.h
>> +++ b/include/net/netns/ipv4.h
>> @@ -65,6 +65,7 @@ struct netns_ipv4 {
>>  #endif
>>         bool                    fib_has_custom_local_routes;
>>         bool                    fib_offload_disabled;
>> +       u8                      sysctl_tcp_shrink_window;
>>  #ifdef CONFIG_IP_ROUTE_CLASSID
>>         atomic_t                fib_num_tclassid_users;
>>  #endif
>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>> index 356afe54951c..2afb0870648b 100644
>> --- a/net/ipv4/sysctl_net_ipv4.c
>> +++ b/net/ipv4/sysctl_net_ipv4.c
>> @@ -1480,6 +1480,15 @@ static struct ctl_table ipv4_net_table[] = {
>>                 .extra1         = SYSCTL_ZERO,
>>                 .extra2         = &tcp_syn_linear_timeouts_max,
>>         },
>> +       {
>> +               .procname       = "tcp_shrink_window",
>> +               .data           = &init_net.ipv4.sysctl_tcp_shrink_window,
>> +               .maxlen         = sizeof(u8),
>> +               .mode           = 0644,
>> +               .proc_handler   = proc_dou8vec_minmax,
>> +               .extra1         = SYSCTL_ZERO,
>> +               .extra2         = SYSCTL_ONE,
>> +       },
>>         { }
>>  };
>>
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index 84a5d557dc1a..9213804b034f 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -3281,6 +3281,8 @@ static int __net_init tcp_sk_init(struct net *net)
>>                 net->ipv4.tcp_congestion_control = &tcp_reno;
>>
>>         net->ipv4.sysctl_tcp_syn_linear_timeouts = 4;
>> +       net->ipv4.sysctl_tcp_shrink_window = 0;
>> +
>>         return 0;
>>  }
>>
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index f8ce77ce7c3e..5c86873e2193 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -260,8 +260,8 @@ static u16 tcp_select_window(struct sock *sk)
>>         u32 old_win = tp->rcv_wnd;
>>         u32 cur_win = tcp_receive_window(tp);
>>         u32 new_win = __tcp_select_window(sk);
>> +       struct net *net = sock_net(sk);
> 
> Here you cache sock_net() in @net variable.
> 
>>
>> -       /* Never shrink the offered window */
>>         if (new_win < cur_win) {
>>                 /* Danger Will Robinson!
>>                  * Don't update rcv_wup/rcv_wnd here or else
>> @@ -270,11 +270,15 @@ static u16 tcp_select_window(struct sock *sk)
>>                  *
>>                  * Relax Will Robinson.
>>                  */
>> -               if (new_win == 0)
>> -                       NET_INC_STATS(sock_net(sk),
>> -                                     LINUX_MIB_TCPWANTZEROWINDOWADV);
>> -               new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
>> +               if (!READ_ONCE(net->ipv4.sysctl_tcp_shrink_window)) {
>> +                       /* Never shrink the offered window */
>> +                       if (new_win == 0)
>> +                               NET_INC_STATS(sock_net(sk),
> 
> You then can replace sock_net(sk) by @net here.
> 
>> +                                             LINUX_MIB_TCPWANTZEROWINDOWADV);
>> +                       new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
>> +               }
>>         }
>> +
>>         tp->rcv_wnd = new_win;
>>         tp->rcv_wup = tp->rcv_nxt;
>>
>> @@ -3003,6 +3007,7 @@ u32 __tcp_select_window(struct sock *sk)
>>  {
>>         struct inet_connection_sock *icsk = inet_csk(sk);
>>         struct tcp_sock *tp = tcp_sk(sk);
>> +       struct net *net = sock_net(sk);
>>         /* MSS for the peer's data.  Previous versions used mss_clamp
>>          * here.  I don't know if the value based on our guesses
>>          * of peer's MSS is better for the performance.  It's more correct
>> @@ -3024,6 +3029,12 @@ u32 __tcp_select_window(struct sock *sk)
>>                 if (mss <= 0)
>>                         return 0;
>>         }
>> +
>> +       if (READ_ONCE(net->ipv4.sysctl_tcp_shrink_window))
>> +               goto shrink_window_allowed;
>> +
>> +       /* do not allow window to shrink */
>> +
>>         if (free_space < (full_space >> 1)) {
>>                 icsk->icsk_ack.quick = 0;
>>
>> @@ -3077,6 +3088,58 @@ u32 __tcp_select_window(struct sock *sk)
>>                         window = free_space;
>>         }
>>
>> +       return window;
>> +
>> +shrink_window_allowed:
>> +       /* new window should always be an exact multiple of scaling factor */
>> +       free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);
>> +
>> +       if (free_space < (full_space >> 1)) {
>> +               icsk->icsk_ack.quick = 0;
>> +
>> +               if (tcp_under_memory_pressure(sk))
>> +                       tcp_adjust_rcv_ssthresh(sk);
>> +
>> +               /* if free space is too low, return a zero window */
>> +               if (free_space < (allowed_space >> 4) || free_space < mss ||
>> +                       free_space < (1 << tp->rx_opt.rcv_wscale))
>> +                       return 0;
> 
> Are you sure this block can not be shared with the existing one ?
> 
> Existing one has this added part:
> 
> free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);
> 
> Not sure why this would break the tcp_shrink_window == 1 case.

The main issue is the additional conditional on the if that returns 0:
    
    free_space < (1 << tp->rx_opt.rcv_wscale))

This is an important difference that I think needs to continue to be
different in the two cases.

I'll post a "v4" of the patch with the cleaned up code from the other 
comments so we can have a clean look at it.


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-06-11  4:10 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-09 20:47 [PATCH net-next v3] tcp: enforce receive buffer memory limits by allowing the tcp window to shrink Mike Freemon
2023-06-10  0:42 ` Stephen Hemminger
2023-06-10  1:37   ` Eric Dumazet
2023-06-10  1:43 ` Eric Dumazet
2023-06-11  0:26   ` Mike Freemon
2023-06-11  4:10   ` Mike Freemon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).