netdev.vger.kernel.org archive mirror
* [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB
@ 2022-07-21 15:10 Marek Majkowski
  2022-07-21 15:10 ` [PATCH net-next 1/2] " Marek Majkowski
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Marek Majkowski @ 2022-07-21 15:10 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kernel-team, ivan, edumazet, davem, kuba, pabeni, ast,
	daniel, andrii, Marek Majkowski

Among many route options, we support the initrwnd/RTAX_INITRWND path
attribute:

 $ ip route change local 127.0.0.0/8 dev lo initrwnd 1024

This sets the initial receive window size (in packets). However, it's
not very useful in practice. For smaller buffers (<128KiB) it can be
used to bring the initial receive window down, but it's hard to
imagine when this is useful. The same effect can be achieved with
TCP_WINDOW_CLAMP / RTAX_WINDOW option.

For larger buffers (>128KiB) the initial receive window is usually
limited by rcv_ssthresh, which starts at 64KiB. The initrwnd option
can't bring the window above that threshold, which limits its usefulness.

This patch changes that. Now, setting the RTAX_INITRWND path attribute
brings the initial rcv_ssthresh in line with the initrwnd value. This
allows the initial advertised receive window to grow above 64KiB
immediately, after the first TCP RTT.

With this change, the administrator can configure a route (or an skops
eBPF program) where the receive window is opened much faster than
usual. This is useful on high-BDP connections - high latency, high
throughput - where it takes a long time to fully open the receive
window, due to the usual rcv_ssthresh cap.

However, this feature should be used with caution. It only makes sense
to employ it in limited circumstances:

 * When using high-bandwidth TCP transfers over big-latency links.
 * When the truesize of the flow/NIC is sensible and predictable.
 * When the application is ready to send a lot of data immediately
   after the flow is established.
 * When the sender has configured larger than usual `initcwnd`.
 * When optimizing for every possible RTT.

This patch is related to previous work by Ivan Babrou:

  https://lore.kernel.org/bpf/CAA93jw5+LjKLcCaNr5wJGPrXhbjvLhts8hqpKPFx7JeWG4g0AA@mail.gmail.com/T/

Please note that due to TCP wscale semantics, the TCP sender needs to
receive the first ACK to learn about the large opened receive
window. That is: the large window is advertised only in the first ACK
from the peer. When the TCP client has a large window, it is
advertised in the third packet (ACK) of the handshake. When the TCP
server has a large window, it is advertised only in the first ACK
after some data has been received.

Signed-off-by: Marek Majkowski <marek@cloudflare.com>

Marek Majkowski (2):
  RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB
  Tests for RTAX_INITRWND

 include/net/inet_sock.h                       |   1 +
 net/ipv4/tcp_minisocks.c                      |   8 +-
 net/ipv4/tcp_output.c                         |  10 +-
 .../selftests/bpf/prog_tests/tcp_initrwnd.c   | 398 ++++++++++++++++++
 .../selftests/bpf/progs/test_tcp_initrwnd.c   |  30 ++
 5 files changed, 443 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_initrwnd.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_initrwnd.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH net-next 1/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB
  2022-07-21 15:10 [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB Marek Majkowski
@ 2022-07-21 15:10 ` Marek Majkowski
  2022-07-22  9:23   ` Eric Dumazet
  2022-07-21 15:10 ` [PATCH net-next 2/2] Tests for RTAX_INITRWND Marek Majkowski
  2022-07-22  1:54 ` [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB Jakub Kicinski
  2 siblings, 1 reply; 7+ messages in thread
From: Marek Majkowski @ 2022-07-21 15:10 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kernel-team, ivan, edumazet, davem, kuba, pabeni, ast,
	daniel, andrii, Marek Majkowski

We already support RTAX_INITRWND / initrwnd path attribute:

 $ ip route change local 127.0.0.0/8 dev lo initrwnd 1024

Normally, however, the initial advertised receive window is limited to
64KiB by rcv_ssthresh, regardless of initrwnd. This patch changes
that, bumping up rcv_ssthresh to a value derived from initrwnd. This
allows for larger initial advertised receive windows, which is useful
for specific types of TCP flows: big BDP ones, where there is a lot of
data to send immediately after the flow is established.

There are three places where we initialize sockets:
 - tcp_output:tcp_connect_init
 - tcp_minisocks:tcp_openreq_init_rwin
 - syncookies

In the first two we already have calls to `tcp_rwnd_init_bpf` and
`dst_metric(RTAX_INITRWND)` which retrieve the bpf/path initrwnd
attribute. We use this value to bring `rcv_ssthresh` up, potentially
above the traditional 64KiB.

With a higher initial `rcv_ssthresh` the receiver will open the
receive window more aggressively, which can improve high-BDP flows -
high throughput, high latency.

This patch does not cover the syncookies case.

Signed-off-by: Marek Majkowski <marek@cloudflare.com>
---
 include/net/inet_sock.h  |  1 +
 net/ipv4/tcp_minisocks.c |  8 ++++++--
 net/ipv4/tcp_output.c    | 10 ++++++++--
 3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index daead5fb389a..bc68c9b70942 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -89,6 +89,7 @@ struct inet_request_sock {
 				no_srccheck: 1,
 				smc_ok	   : 1;
 	u32                     ir_mark;
+	u32                     rcv_ssthresh;
 	union {
 		struct ip_options_rcu __rcu	*ireq_opt;
 #if IS_ENABLED(CONFIG_IPV6)
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 6854bb1fb32b..89ba2a30a012 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -360,6 +360,7 @@ void tcp_openreq_init_rwin(struct request_sock *req,
 	u32 window_clamp;
 	__u8 rcv_wscale;
 	u32 rcv_wnd;
+	int adj_mss;
 	int mss;
 
 	mss = tcp_mss_clamp(tp, dst_metric_advmss(dst));
@@ -378,15 +379,18 @@ void tcp_openreq_init_rwin(struct request_sock *req,
 	else if (full_space < rcv_wnd * mss)
 		full_space = rcv_wnd * mss;
 
+	adj_mss = mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0);
+
 	/* tcp_full_space because it is guaranteed to be the first packet */
 	tcp_select_initial_window(sk_listener, full_space,
-		mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
+		adj_mss,
 		&req->rsk_rcv_wnd,
 		&req->rsk_window_clamp,
 		ireq->wscale_ok,
 		&rcv_wscale,
 		rcv_wnd);
 	ireq->rcv_wscale = rcv_wscale;
+	ireq->rcv_ssthresh = max(req->rsk_rcv_wnd, rcv_wnd * adj_mss);
 }
 EXPORT_SYMBOL(tcp_openreq_init_rwin);
 
@@ -502,7 +506,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 	newtp->rx_opt.tstamp_ok = ireq->tstamp_ok;
 	newtp->rx_opt.sack_ok = ireq->sack_ok;
 	newtp->window_clamp = req->rsk_window_clamp;
-	newtp->rcv_ssthresh = req->rsk_rcv_wnd;
+	newtp->rcv_ssthresh = ireq->rcv_ssthresh;
 	newtp->rcv_wnd = req->rsk_rcv_wnd;
 	newtp->rx_opt.wscale_ok = ireq->wscale_ok;
 	if (newtp->rx_opt.wscale_ok) {
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 18c913a2347a..0f2d4174ea59 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3642,6 +3642,7 @@ static void tcp_connect_init(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 	__u8 rcv_wscale;
 	u32 rcv_wnd;
+	u32 mss;
 
 	/* We'll fix this up when we get a response from the other end.
 	 * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT.
@@ -3679,8 +3680,10 @@ static void tcp_connect_init(struct sock *sk)
 	if (rcv_wnd == 0)
 		rcv_wnd = dst_metric(dst, RTAX_INITRWND);
 
+	mss = tp->advmss - (tp->rx_opt.ts_recent_stamp ?
+			    tp->tcp_header_len - sizeof(struct tcphdr) : 0);
 	tcp_select_initial_window(sk, tcp_full_space(sk),
-				  tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
+				  mss,
 				  &tp->rcv_wnd,
 				  &tp->window_clamp,
 				  sock_net(sk)->ipv4.sysctl_tcp_window_scaling,
@@ -3688,7 +3691,10 @@ static void tcp_connect_init(struct sock *sk)
 				  rcv_wnd);
 
 	tp->rx_opt.rcv_wscale = rcv_wscale;
-	tp->rcv_ssthresh = tp->rcv_wnd;
+	if (rcv_wnd)
+		tp->rcv_ssthresh = max(tp->rcv_wnd, rcv_wnd * mss);
+	else
+		tp->rcv_ssthresh = tp->rcv_wnd;
 
 	sk->sk_err = 0;
 	sock_reset_flag(sk, SOCK_DONE);
-- 
2.25.1



* [PATCH net-next 2/2] Tests for RTAX_INITRWND
  2022-07-21 15:10 [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB Marek Majkowski
  2022-07-21 15:10 ` [PATCH net-next 1/2] " Marek Majkowski
@ 2022-07-21 15:10 ` Marek Majkowski
  2022-07-22  1:54 ` [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB Jakub Kicinski
  2 siblings, 0 replies; 7+ messages in thread
From: Marek Majkowski @ 2022-07-21 15:10 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kernel-team, ivan, edumazet, davem, kuba, pabeni, ast,
	daniel, andrii, Marek Majkowski

Accompanying tests. We load a sockops program hooked on the
BPF_SOCK_OPS_RWND_INIT event, where we return an updated value of the
initrwnd route path attribute.

The tests verify that values above 64KiB are indeed advertised
correctly to the remote peer.

Signed-off-by: Marek Majkowski <marek@cloudflare.com>
---
 .../selftests/bpf/prog_tests/tcp_initrwnd.c   | 398 ++++++++++++++++++
 .../selftests/bpf/progs/test_tcp_initrwnd.c   |  30 ++
 2 files changed, 428 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_initrwnd.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_initrwnd.c

diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_initrwnd.c b/tools/testing/selftests/bpf/prog_tests/tcp_initrwnd.c
new file mode 100644
index 000000000000..0276fe9c8ce6
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tcp_initrwnd.c
@@ -0,0 +1,398 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+// Copyright (c) 2022 Cloudflare
+
+#include "test_progs.h"
+#include "bpf_util.h"
+#include "network_helpers.h"
+
+#include "test_tcp_initrwnd.skel.h"
+
+#define CG_NAME "/tcpbpf-user-test"
+
+/* It's easier to hardcode offsets than to fight with headers
+ *
+ * $ pahole tcp_info
+ * struct tcp_info {
+ *	__u32                      tcpi_rcv_ssthresh;   *    64     4 *
+ *	__u32                      tcpi_snd_wnd;        *   228     4 *
+ */
+
+#define TCPI_RCV_SSTHRESH(info) info[64 / 4]
+#define TCPI_SND_WND(info) info[228 / 4]
+
+static int read_int_sysctl(const char *sysctl)
+{
+	char buf[16];
+	int fd, ret;
+
+	fd = open(sysctl, 0);
+	if (CHECK_FAIL(fd == -1))
+		goto err;
+
+	ret = read(fd, buf, sizeof(buf));
+	if (CHECK_FAIL(ret <= 0))
+		goto err;
+
+	close(fd);
+	return atoi(buf);
+err:
+	if (fd >= 0)
+		close(fd);
+	return -1;
+}
+
+static int write_int_sysctl(const char *sysctl, int v)
+{
+	int fd, ret, size;
+	char buf[16];
+
+	fd = open(sysctl, O_RDWR);
+	if (CHECK_FAIL(fd < 0))
+		goto err;
+
+	size = snprintf(buf, sizeof(buf), "%d", v);
+	ret = write(fd, buf, size);
+	if (CHECK_FAIL(ret < 0))
+		goto err;
+
+	close(fd);
+	return 0;
+err:
+	if (fd >= 0)
+		close(fd);
+	return -1;
+}
+
+static int tcp_timestamps;
+static int tcp_window_scaling;
+static int tcp_workaround_signed_windows;
+
+static void do_test_server(int server_fd, struct test_tcp_initrwnd *skel,
+			   int initrwnd, unsigned int tcpi_snd_wnd_on_connect,
+			   unsigned int rcv_ssthresh_on_recv,
+			   unsigned int tcpi_snd_wnd_on_recv)
+{
+	int client_fd = -1, sd = -1, r;
+	__u32 info[256 / 4];
+	socklen_t optlen = sizeof(info);
+	char b[1] = { 0x55 };
+
+	fprintf(stderr,
+		"[*] server initrwnd=%d tcp_timestamps=%d tcp_window_scaling=%d tcp_workaround_signed_windows=%d\n",
+		initrwnd, tcp_timestamps, tcp_window_scaling,
+		tcp_workaround_signed_windows);
+
+	skel->bss->initrwnd = initrwnd; // in full MSS packets
+
+	client_fd = connect_to_fd(server_fd, 0);
+	if (CHECK_FAIL(client_fd < 0))
+		goto err;
+
+	sd = accept(server_fd, NULL, NULL);
+	if (CHECK_FAIL(sd < 0))
+		goto err;
+
+	/* There are three moments where we check the window/rcv_ssthresh.
+	 *
+	 * (1) First, after socket creation, TCP handshake, we expect
+	 * the client to see only SYN+ACK which is without window
+	 * scaling. That is: from client/sender point of view we see
+	 * at most 64KiB open receive window.
+	 */
+	r = getsockopt(client_fd, SOL_TCP, TCP_INFO, &info, &optlen);
+	if (CHECK_FAIL(r < 0))
+		goto err;
+
+	ASSERT_EQ(TCPI_SND_WND(info), tcpi_snd_wnd_on_connect,
+		  "getsockopt(TCP_INFO.tcpi_snd_wnd) on connect");
+
+	/* (2) At the same time, from the server/receiver point of
+	 * view, we already initiated socket, so rcv_ssthresh is set
+	 * to high value, potentially larger than 64KiB.
+	 */
+	r = getsockopt(sd, SOL_TCP, TCP_INFO, &info, &optlen);
+	if (CHECK_FAIL(r < 0))
+		goto err;
+
+	ASSERT_EQ(TCPI_RCV_SSTHRESH(info), rcv_ssthresh_on_recv,
+		  "getsockopt(TCP_INFO.rcv_ssthresh) on recv");
+
+	/* (3) Finally, after receiving some ACK from client, the
+	 * client/sender should also see wider open window, larger
+	 * than 64KiB.
+	 */
+	if (CHECK_FAIL(write(client_fd, &b, sizeof(b)) != 1))
+		perror("Failed to send single byte");
+
+	if (CHECK_FAIL(read(sd, &b, sizeof(b)) != 1))
+		perror("Failed to receive single byte");
+
+	r = getsockopt(client_fd, SOL_TCP, TCP_INFO, &info, &optlen);
+	if (CHECK_FAIL(r < 0))
+		goto err;
+
+	ASSERT_EQ(TCPI_SND_WND(info), tcpi_snd_wnd_on_recv,
+		  "getsockopt(TCP_INFO.tcpi_snd_wnd) after recv");
+
+err:
+	if (sd != -1)
+		close(sd);
+	if (client_fd != -1)
+		close(client_fd);
+}
+
+static int socket_client(int server_fd)
+{
+	socklen_t optlen;
+	int family, type, protocol, r;
+
+	optlen = sizeof(family);
+	r = getsockopt(server_fd, SOL_SOCKET, SO_DOMAIN, &family, &optlen);
+	if (CHECK_FAIL(r < 0))
+		return -1;
+
+	optlen = sizeof(type);
+	r = getsockopt(server_fd, SOL_SOCKET, SO_TYPE, &type, &optlen);
+	if (CHECK_FAIL(r < 0))
+		return -1;
+
+	optlen = sizeof(protocol);
+	r = getsockopt(server_fd, SOL_SOCKET, SO_PROTOCOL, &protocol, &optlen);
+	if (CHECK_FAIL(r < 0))
+		return -1;
+
+	return socket(family, type, protocol);
+}
+
+static void do_test_client(int server_fd, struct test_tcp_initrwnd *skel,
+			   int initrwnd, unsigned int rcv_ssthresh,
+			   unsigned int tcpi_snd_wnd)
+{
+	int client_fd = -1, sd = -1, r, maxseg;
+	__u32 info[256 / 4];
+	socklen_t optlen = sizeof(info);
+	size_t rcvbuf;
+	char b[1] = { 0x55 };
+
+	fprintf(stderr,
+		"[*] client initrwnd=%d tcp_timestamps=%d tcp_window_scaling=%d tcp_workaround_signed_windows=%d\n",
+		initrwnd, tcp_timestamps, tcp_window_scaling,
+		tcp_workaround_signed_windows);
+
+	skel->bss->initrwnd = initrwnd; // in full MSS packets
+
+	client_fd = socket_client(server_fd);
+	if (CHECK_FAIL(client_fd < 0))
+		goto err;
+
+	/* With MSS=64KiB on loopback it's hard to argue about init
+	 * rwnd. Let's set MSS to something that will make our life
+	 * easier, like 1024 + timestamps.
+	 */
+	maxseg = 1024;
+
+	r = setsockopt(client_fd, SOL_TCP, TCP_MAXSEG, &maxseg, sizeof(maxseg));
+	if (CHECK_FAIL(r < 0))
+		goto err;
+
+	rcvbuf = 208 * 1024;
+	r = setsockopt(client_fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf,
+		       sizeof(rcvbuf));
+	if (CHECK_FAIL(r < 0))
+		goto err;
+
+	r = connect_fd_to_fd(client_fd, server_fd, 0);
+	if (CHECK_FAIL(r < 0))
+		goto err;
+
+	sd = accept(server_fd, NULL, NULL);
+	if (CHECK_FAIL(sd < 0))
+		goto err;
+
+	if (CHECK_FAIL(write(sd, &b, sizeof(b)) != 1))
+		perror("Failed to send single byte");
+
+	if (CHECK_FAIL(read(client_fd, &b, sizeof(b)) != 1))
+		perror("Failed to receive single byte");
+
+	/* There is only one moment to check - the server should know
+	 * about client window just after accept. First check client
+	 * rcv_ssthresh.
+	 */
+	r = getsockopt(client_fd, SOL_TCP, TCP_INFO, &info, &optlen);
+	if (CHECK_FAIL(r < 0))
+		goto err;
+
+	ASSERT_EQ(TCPI_RCV_SSTHRESH(info), rcv_ssthresh,
+		  "getsockopt(TCP_INFO.tcpi_rcv_ssthresh) on client");
+
+	/* And the receive window size as seen from the server.
+	 */
+	r = getsockopt(sd, SOL_TCP, TCP_INFO, &info, &optlen);
+	if (CHECK_FAIL(r < 0))
+		goto err;
+
+	ASSERT_EQ(TCPI_SND_WND(info), tcpi_snd_wnd,
+		  "getsockopt(TCP_INFO.tcpi_snd_wnd)");
+
+err:
+	if (sd != -1)
+		close(sd);
+	if (client_fd != -1)
+		close(client_fd);
+}
+
+static void run_tests(int cg_fd, struct test_tcp_initrwnd *skel)
+{
+	int server_fd = -1, r, rcvbuf, maxseg;
+	unsigned int max_wnd, buf;
+
+	skel->links.bpf_testcb =
+		bpf_program__attach_cgroup(skel->progs.bpf_testcb, cg_fd);
+	if (!ASSERT_OK_PTR(skel->links.bpf_testcb, "attach_cgroup(bpf_testcb)"))
+		goto err;
+
+	server_fd = start_server(AF_INET, SOCK_STREAM, NULL, 0, 0);
+	if (CHECK_FAIL(server_fd < 0))
+		goto err;
+
+	maxseg = 1024;
+	if (tcp_timestamps)
+		maxseg += 12;
+
+	/* With MSS=64KiB on loopback it's hard to argue about init
+	 * rwnd. Let's set MSS to something that will make our life
+	 * easier, like 1024 + timestamps.
+	 */
+	r = setsockopt(server_fd, SOL_TCP, TCP_MAXSEG, &maxseg, sizeof(maxseg));
+	if (CHECK_FAIL(r < 0))
+		goto err;
+
+	/* The receive buffer must be large from the start for the
+	 * initrwnd to make any dent in rcv_ssthresh (assuming the
+	 * default tcp_rmem of 128KiB).
+	 */
+	rcvbuf = 208 * 1024;
+	r = setsockopt(server_fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf,
+		       sizeof(rcvbuf));
+	if (CHECK_FAIL(r < 0))
+		goto err;
+
+	max_wnd = tcp_workaround_signed_windows ? 32767 : 65535;
+
+	/* [*] server advertising large window */
+
+	/* Small initrwnd. Not exceeding 64KiB */
+	do_test_server(server_fd, skel, 1, 1024, 1024, 1024);
+
+	if (tcp_window_scaling) {
+		/* Borderline. Not exceeding 64KiB */
+		do_test_server(server_fd, skel, 63, MIN(max_wnd, 63 * 1024),
+			       63 * 1024, 63 * 1024);
+	} else {
+		do_test_server(server_fd, skel, 63, MIN(max_wnd, 63 * 1024),
+			       63 * 1024, MIN(max_wnd, 63 * 1024));
+	}
+
+	if (tcp_window_scaling) {
+		/* The interesting case. Crossing 64KiB */
+		do_test_server(server_fd, skel, 128, max_wnd, 128 * 1024,
+			       128 * 1024);
+	} else {
+		do_test_server(server_fd, skel, 128, max_wnd, 65535,
+			       MIN(max_wnd, 65535));
+	}
+
+	if (tcp_window_scaling) {
+		/* Super large. The buffer is 208*2 */
+		do_test_server(server_fd, skel, 206, max_wnd, 206 * 1024U,
+			       206 * 1024U);
+		buf = 207 * 1024U - (tcp_timestamps ? 12 : 0);
+		do_test_server(server_fd, skel, 512, max_wnd, buf, buf);
+	}
+
+	/* [*] client advertising large window */
+
+	/* Test if client advertises small rcv window */
+	do_test_client(server_fd, skel, 1, 1024, 1024);
+
+	if (tcp_window_scaling) {
+		/* Medium size */
+		do_test_client(server_fd, skel, 63, 63 * 1024, 63 * 1024);
+	} else {
+		do_test_client(server_fd, skel, 63, 63 * 1024,
+			       MIN(max_wnd, 63 * 1024));
+	}
+
+	if (tcp_window_scaling) {
+		/* And large window */
+		do_test_client(server_fd, skel, 128, 128 * 1024, 128 * 1024);
+	} else {
+		do_test_client(server_fd, skel, 128, 65535,
+			       MIN(max_wnd, 65535));
+	}
+
+	if (tcp_window_scaling) {
+		/* Super large. */
+		do_test_client(server_fd, skel, 206, 206 * 1024U, 206 * 1024U);
+		buf = 207 * 1024U + (tcp_timestamps ? 12 : 0);
+		do_test_client(server_fd, skel, 512, buf, buf);
+	}
+err:
+	if (server_fd != -1)
+		close(server_fd);
+}
+
+#define PROC_TCP_TIMESTAMPS "/proc/sys/net/ipv4/tcp_timestamps"
+#define PROC_TCP_WINDOW_SCALING "/proc/sys/net/ipv4/tcp_window_scaling"
+#define PROC_TCP_WORKAROUND_SIGNED_WINDOWS \
+	"/proc/sys/net/ipv4/tcp_workaround_signed_windows"
+
+void test_tcp_initrwnd(void)
+{
+	struct test_tcp_initrwnd *skel;
+	unsigned int i;
+	int cg_fd;
+
+	int saved_tcp_timestamps = read_int_sysctl(PROC_TCP_TIMESTAMPS);
+	int saved_tcp_window_scaling = read_int_sysctl(PROC_TCP_WINDOW_SCALING);
+	int saved_tcp_workaround_signed_windows =
+		read_int_sysctl(PROC_TCP_WORKAROUND_SIGNED_WINDOWS);
+
+	if (CHECK_FAIL(saved_tcp_timestamps == -1 ||
+		       saved_tcp_window_scaling == -1 ||
+		       saved_tcp_workaround_signed_windows == -1))
+		return;
+
+	cg_fd = test__join_cgroup(CG_NAME);
+	if (CHECK_FAIL(cg_fd < 0))
+		return;
+
+	skel = test_tcp_initrwnd__open_and_load();
+	if (CHECK_FAIL(!skel)) {
+		close(cg_fd);
+		return;
+	}
+
+	for (i = 0; i < 8; i++) {
+		tcp_timestamps = !!(i & 0x1);
+		tcp_window_scaling = !!(i & 0x2);
+		tcp_workaround_signed_windows = !!(i & 0x4);
+
+		write_int_sysctl(PROC_TCP_TIMESTAMPS, tcp_timestamps);
+		write_int_sysctl(PROC_TCP_WINDOW_SCALING, tcp_window_scaling);
+		write_int_sysctl(PROC_TCP_WORKAROUND_SIGNED_WINDOWS,
+				 tcp_workaround_signed_windows);
+
+		run_tests(cg_fd, skel);
+	}
+
+	write_int_sysctl(PROC_TCP_TIMESTAMPS, saved_tcp_timestamps);
+	write_int_sysctl(PROC_TCP_WINDOW_SCALING, saved_tcp_window_scaling);
+	write_int_sysctl(PROC_TCP_WORKAROUND_SIGNED_WINDOWS,
+			 saved_tcp_workaround_signed_windows);
+
+	test_tcp_initrwnd__destroy(skel);
+
+	close(cg_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_tcp_initrwnd.c b/tools/testing/selftests/bpf/progs/test_tcp_initrwnd.c
new file mode 100644
index 000000000000..d532e9e2d344
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tcp_initrwnd.c
@@ -0,0 +1,30 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+// Copyright (c) 2022 Cloudflare
+
+#include <linux/bpf.h>
+
+#include <bpf/bpf_helpers.h>
+
+int initrwnd;
+
+SEC("sockops")
+int bpf_testcb(struct bpf_sock_ops *skops)
+{
+	int rv = -1;
+	int op;
+
+	op = (int)skops->op;
+
+	switch (op) {
+	case BPF_SOCK_OPS_RWND_INIT:
+		rv = initrwnd;
+		break;
+
+	default:
+		rv = -1;
+	}
+	skops->reply = rv;
+	return 1;
+}
+
+char _license[] SEC("license") = "Dual BSD/GPL";
-- 
2.25.1



* Re: [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB
  2022-07-21 15:10 [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB Marek Majkowski
  2022-07-21 15:10 ` [PATCH net-next 1/2] " Marek Majkowski
  2022-07-21 15:10 ` [PATCH net-next 2/2] Tests for RTAX_INITRWND Marek Majkowski
@ 2022-07-22  1:54 ` Jakub Kicinski
  2 siblings, 0 replies; 7+ messages in thread
From: Jakub Kicinski @ 2022-07-22  1:54 UTC (permalink / raw)
  To: Marek Majkowski
  Cc: netdev, bpf, kernel-team, ivan, edumazet, davem, pabeni, ast,
	daniel, andrii

On Thu, 21 Jul 2022 17:10:39 +0200 Marek Majkowski wrote:
> Among many route options we support initrwnd/RTAX_INITRWND path
> attribute:
> 
>  $ ip route change local 127.0.0.0/8 dev lo initrwnd 1024
> 
> This sets the initial receive window size (in packets). However, it's
> not very useful in practice. For smaller buffers (<128KiB) it can be
> used to bring the initial receive window down, but it's hard to
> imagine when this is useful. The same effect can be achieved with
> TCP_WINDOW_CLAMP / RTAX_WINDOW option.

I think you based this on bpf-next so you should put bpf-next 
in the subject, it doesn't apply to net-next. Either way let's 
wait for Eric to comment.


* Re: [PATCH net-next 1/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB
  2022-07-21 15:10 ` [PATCH net-next 1/2] " Marek Majkowski
@ 2022-07-22  9:23   ` Eric Dumazet
  2022-07-27 11:19     ` Marek Majkowski
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2022-07-22  9:23 UTC (permalink / raw)
  To: Marek Majkowski
  Cc: netdev, bpf, kernel-team, Ivan Babrou, David Miller,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko

On Thu, Jul 21, 2022 at 5:10 PM Marek Majkowski <marek@cloudflare.com> wrote:
>
> We already support RTAX_INITRWND / initrwnd path attribute:
>
>  $ ip route change local 127.0.0.0/8 dev lo initrwnd 1024
>
> Normally, however, the initial advertised receive window is limited to
> 64KiB by rcv_ssthresh, regardless of initrwnd. This patch changes
> that, bumping up rcv_ssthresh to a value derived from initrwnd. This
> allows for larger initial advertised receive windows, which is useful
> for specific types of TCP flows: big BDP ones, where there is a lot of
> data to send immediately after the flow is established.
>
> There are three places where we initialize sockets:
>  - tcp_output:tcp_connect_init
>  - tcp_minisocks:tcp_openreq_init_rwin
>  - syncookies
>
> In the first two we already have calls to `tcp_rwnd_init_bpf` and
> `dst_metric(RTAX_INITRWND)` which retrieve the bpf/path initrwnd
> attribute. We use this value to bring `rcv_ssthresh` up, potentially
> above the traditional 64KiB.
>
> With a higher initial `rcv_ssthresh` the receiver will open the
> receive window more aggressively, which can improve high-BDP flows -
> high throughput, high latency.
>
> This patch does not cover the syncookies case.
>
> Signed-off-by: Marek Majkowski <marek@cloudflare.com>
> ---
>  include/net/inet_sock.h  |  1 +
>  net/ipv4/tcp_minisocks.c |  8 ++++++--
>  net/ipv4/tcp_output.c    | 10 ++++++++--
>  3 files changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
> index daead5fb389a..bc68c9b70942 100644
> --- a/include/net/inet_sock.h
> +++ b/include/net/inet_sock.h
> @@ -89,6 +89,7 @@ struct inet_request_sock {
>                                 no_srccheck: 1,
>                                 smc_ok     : 1;
>         u32                     ir_mark;
> +       u32                     rcv_ssthresh;

Why do we need to store this value in the request_sock ?

It is derived from a route attribute and MSS, all this should be
available when the full blown socket is created.

It would also work even with syncookies.

>         union {
>                 struct ip_options_rcu __rcu     *ireq_opt;
>  #if IS_ENABLED(CONFIG_IPV6)
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 6854bb1fb32b..89ba2a30a012 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -360,6 +360,7 @@ void tcp_openreq_init_rwin(struct request_sock *req,
>         u32 window_clamp;
>         __u8 rcv_wscale;
>         u32 rcv_wnd;
> +       int adj_mss;
>         int mss;
>
>         mss = tcp_mss_clamp(tp, dst_metric_advmss(dst));
> @@ -378,15 +379,18 @@ void tcp_openreq_init_rwin(struct request_sock *req,
>         else if (full_space < rcv_wnd * mss)
>                 full_space = rcv_wnd * mss;
>
> +       adj_mss = mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0);
> +
>         /* tcp_full_space because it is guaranteed to be the first packet */
>         tcp_select_initial_window(sk_listener, full_space,
> -               mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
> +               adj_mss,
>                 &req->rsk_rcv_wnd,
>                 &req->rsk_window_clamp,
>                 ireq->wscale_ok,
>                 &rcv_wscale,
>                 rcv_wnd);
>         ireq->rcv_wscale = rcv_wscale;
> +       ireq->rcv_ssthresh = max(req->rsk_rcv_wnd, rcv_wnd * adj_mss);
>  }
>  EXPORT_SYMBOL(tcp_openreq_init_rwin);
>
> @@ -502,7 +506,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
>         newtp->rx_opt.tstamp_ok = ireq->tstamp_ok;
>         newtp->rx_opt.sack_ok = ireq->sack_ok;
>         newtp->window_clamp = req->rsk_window_clamp;
> -       newtp->rcv_ssthresh = req->rsk_rcv_wnd;
> +       newtp->rcv_ssthresh = ireq->rcv_ssthresh;
>         newtp->rcv_wnd = req->rsk_rcv_wnd;
>         newtp->rx_opt.wscale_ok = ireq->wscale_ok;
>         if (newtp->rx_opt.wscale_ok) {
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 18c913a2347a..0f2d4174ea59 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -3642,6 +3642,7 @@ static void tcp_connect_init(struct sock *sk)
>         struct tcp_sock *tp = tcp_sk(sk);
>         __u8 rcv_wscale;
>         u32 rcv_wnd;
> +       u32 mss;
>
>         /* We'll fix this up when we get a response from the other end.
>          * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT.
> @@ -3679,8 +3680,10 @@ static void tcp_connect_init(struct sock *sk)
>         if (rcv_wnd == 0)
>                 rcv_wnd = dst_metric(dst, RTAX_INITRWND);
>
> +       mss = tp->advmss - (tp->rx_opt.ts_recent_stamp ?
> +                           tp->tcp_header_len - sizeof(struct tcphdr) : 0);
>         tcp_select_initial_window(sk, tcp_full_space(sk),
> -                                 tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
> +                                 mss,
>                                   &tp->rcv_wnd,
>                                   &tp->window_clamp,
>                                   sock_net(sk)->ipv4.sysctl_tcp_window_scaling,
> @@ -3688,7 +3691,10 @@ static void tcp_connect_init(struct sock *sk)
>                                   rcv_wnd);
>
>         tp->rx_opt.rcv_wscale = rcv_wscale;
> -       tp->rcv_ssthresh = tp->rcv_wnd;
> +       if (rcv_wnd)
> +               tp->rcv_ssthresh = max(tp->rcv_wnd, rcv_wnd * mss);
> +       else
> +               tp->rcv_ssthresh = tp->rcv_wnd;
>
>         sk->sk_err = 0;
>         sock_reset_flag(sk, SOCK_DONE);
> --
> 2.25.1
>


* Re: [PATCH net-next 1/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB
  2022-07-22  9:23   ` Eric Dumazet
@ 2022-07-27 11:19     ` Marek Majkowski
  2022-07-27 12:54       ` Eric Dumazet
  0 siblings, 1 reply; 7+ messages in thread
From: Marek Majkowski @ 2022-07-27 11:19 UTC (permalink / raw)
  To: Eric Dumazet, brakmo
  Cc: netdev, bpf, kernel-team, Ivan Babrou, David Miller,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko

On Fri, Jul 22, 2022 at 11:23 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Jul 21, 2022 at 5:10 PM Marek Majkowski <marek@cloudflare.com> wrote:
> >
> > We already support RTAX_INITRWND / initrwnd path attribute:
> >
> >  $ ip route change local 127.0.0.0/8 dev lo initrwnd 1024
> >
> > Normally, however, the initial advertised receive window is limited to
> > 64KiB by rcv_ssthresh, regardless of initrwnd. This patch changes
> > that, bumping up rcv_ssthresh to a value derived from initrwnd. This
> > allows for larger initial advertised receive windows, which is useful
> > for specific types of TCP flows: big BDP ones, where there is a lot of
> > data to send immediately after the flow is established.
> >
> > There are three places where we initialize sockets:
> >  - tcp_output:tcp_connect_init
> >  - tcp_minisocks:tcp_openreq_init_rwin
> >  - syncookies
> >
> > In the first two we already have calls to `tcp_rwnd_init_bpf` and
> > `dst_metric(RTAX_INITRWND)` which retrieve the bpf/path initrwnd
> > attribute. We use this value to bring `rcv_ssthresh` up, potentially
> > above the traditional 64KiB.
> >
> > With a higher initial `rcv_ssthresh` the receiver will open the
> > receive window more aggressively, which can improve high-BDP flows -
> > high throughput, high latency.
> >
> > This patch does not cover the syncookies case.
> >
> > Signed-off-by: Marek Majkowski <marek@cloudflare.com>
> > ---
> >  include/net/inet_sock.h  |  1 +
> >  net/ipv4/tcp_minisocks.c |  8 ++++++--
> >  net/ipv4/tcp_output.c    | 10 ++++++++--
> >  3 files changed, 15 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
> > index daead5fb389a..bc68c9b70942 100644
> > --- a/include/net/inet_sock.h
> > +++ b/include/net/inet_sock.h
> > @@ -89,6 +89,7 @@ struct inet_request_sock {
> >                                 no_srccheck: 1,
> >                                 smc_ok     : 1;
> >         u32                     ir_mark;
> > +       u32                     rcv_ssthresh;
>
> Why do we need to store this value in the request_sock ?
>
> It is derived from a route attribute and MSS, all this should be
> available when the full blown socket is created.
>
> It would also work even with syncookies.

Eric,

Thanks for the feedback. For some context, I published a blog post
explaining this work in detail [1].

https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/

I understand the suggestion is to move tcp_rwnd_init_bpf +
RTAX_INITRWND lookup from `tcp_openreq_init_rwin` into
`tcp_create_openreq_child`.

I gave it a try (patch: [2]), but I don't think this will work under
all circumstances. The issue is that we need to advertise *some*
window in the SYNACK packet, before creating the full blown socket.

With RTAX_INITRWND it is possible to move the advertised window up, or
down.

In the latter case - reducing the window - we must already know at
SYNACK time whether the window drops below 64KiB. This is what happens
today: RTAX_INITRWND can _reduce_ the window to small values, I guess
down to 1 MSS, and this smaller window is then advertised in the
SYNACK.

If we move RTAX_INITRWND lookup into the later
`tcp_create_openreq_child` then it will be too late, we won't know the
correct window size on SYNACK stage. We will likely end up sending
large window on SYNACK and then a small window on subsequent ACK,
violating TCP.

There are two approaches here. First, keep the semantics and allow
RTAX_INITRWND to _reduce_ the initial window.

In this case there are four ways out of this:

1) Keep it as proposed, which indeed requires a new field in
request_sock (perhaps it could be smaller than u32).

2) Send the SYNACK with small/zero window, since we haven't done the
initrwnd lookup at this stage, but that would be at least
controversial, and also adds one more RTT to the common case. I don't
think this is acceptable.

3) Do two initrwnd lookups. One in the `tcp_openreq_init_rwin` to
figure out if the window is smaller than 64KiB, second one in
`tcp_create_openreq_child` to figure out if the suggested window is
larger than 64KiB.

4) Abort the whole approach and recycle Ivan's
bpf_setsockopt(TCP_BPF_RCV_SSTHRESH) approach [3]. But I prefer the route
attribute approach, which seems easier to use and more flexible.

But, thinking about it, I don't think we could ever support reducing
the initial receive window in the syncookie case. Only (3) - two initrwnd
lookups - could be made to work, but even that is controversial.

However the intention of RTAX_INITRWND as far as I understand was to
_increase_ rcv_ssthresh, back in the days when it started from 10MSS
(so I was told).

So, we could change the semantics of RTAX_INITRWND to allow only
*increasing* the window - and disallow reducing it. With such an
approach we could indeed structure the code as you suggested, and move
the route attribute lookup out of minisocks into `tcp_create_openreq_child`.

Marek

[1] https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/
[2] https://gist.github.com/majek/13848c050a3dc218ed295364ee717879
[3] https://lore.kernel.org/bpf/20220111192952.49040-1-ivan@cloudflare.com/t/

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH net-next 1/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB
  2022-07-27 11:19     ` Marek Majkowski
@ 2022-07-27 12:54       ` Eric Dumazet
  0 siblings, 0 replies; 7+ messages in thread
From: Eric Dumazet @ 2022-07-27 12:54 UTC (permalink / raw)
  To: Marek Majkowski
  Cc: Lawrence Brakmo, netdev, bpf, kernel-team, Ivan Babrou,
	David Miller, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko

On Wed, Jul 27, 2022 at 1:19 PM Marek Majkowski <marek@cloudflare.com> wrote:
>
> On Fri, Jul 22, 2022 at 11:23 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Jul 21, 2022 at 5:10 PM Marek Majkowski <marek@cloudflare.com> wrote:
> > >
> > > We already support RTAX_INITRWND / initrwnd path attribute:
> > >
> > >  $ ip route change local 127.0.0.0/8 dev lo initrwnd 1024
> > >
> > > However normally, the initial advertised receive window is limited to
> > > 64KiB by rcv_ssthresh, regardless of initrwnd. This patch changes
> > > that, bumping rcv_ssthresh up to a value derived from initrwnd. This
> > > allows for larger initial advertised receive windows, which is useful
> > > for specific types of TCP flows: big BDP ones, where there is a lot of
> > > data to send immediately after the flow is established.
> > >
> > > There are three places where we initialize sockets:
> > >  - tcp_output:tcp_connect_init
> > >  - tcp_minisocks:tcp_openreq_init_rwin
> > >  - syncookies
> > >
> > > In the first two we already have a call to `tcp_rwnd_init_bpf` and
> > > `dst_metric(RTAX_INITRWND)` which retrieve the bpf/path initrwnd
> > > attribute. We use this value to bring `rcv_ssthresh` up, potentially
> > > above the traditional 64KiB.
> > >
> > > With higher initial `rcv_ssthresh` the receiver will open the receive
> > > window more aggressively, which can improve large-BDP flows - high
> > > throughput and high latency.
> > >
> > > This patch does not cover the syncookies case.
> > >
> > > Signed-off-by: Marek Majkowski <marek@cloudflare.com>
> > > ---
> > >  include/net/inet_sock.h  |  1 +
> > >  net/ipv4/tcp_minisocks.c |  8 ++++++--
> > >  net/ipv4/tcp_output.c    | 10 ++++++++--
> > >  3 files changed, 15 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
> > > index daead5fb389a..bc68c9b70942 100644
> > > --- a/include/net/inet_sock.h
> > > +++ b/include/net/inet_sock.h
> > > @@ -89,6 +89,7 @@ struct inet_request_sock {
> > >                                 no_srccheck: 1,
> > >                                 smc_ok     : 1;
> > >         u32                     ir_mark;
> > > +       u32                     rcv_ssthresh;

Please move this into struct tcp_request_sock.

> >
> > Why do we need to store this value in the request_sock ?
> >
> > It is derived from a route attribute and MSS, all this should be
> > available when the full blown socket is created.
> >
> > It would also work even with syncookies.
>
> Eric,
>
> Thanks for the feedback. For some context, I published a blog post
> explaining this work in detail [1].
>
> https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/
>
> I understand the suggestion is to move tcp_rwnd_init_bpf +
> RTAX_INITRWND lookup from `tcp_openreq_init_rwin` into
> `tcp_create_openreq_child`.
>
> I gave it a try (patch: [2]), but I don't think this will work under
> all circumstances. The issue is that we need to advertise *some*
> window in the SYNACK packet, before creating the full blown socket.
>
> With RTAX_INITRWND it is possible to move the advertised window up, or
> down.
>
> In the latter case - reducing the window - we must already know at
> SYNACK time whether the window drops below 64KiB. This is what happens
> today: RTAX_INITRWND can _reduce_ the window to small values, I guess
> down to 1 MSS, and this smaller window is then advertised in the
> SYNACK.
>
> If we move RTAX_INITRWND lookup into the later
> `tcp_create_openreq_child` then it will be too late, we won't know the
> correct window size on SYNACK stage. We will likely end up sending
> large window on SYNACK and then a small window on subsequent ACK,
> violating TCP.
>
> There are two approaches here. First, keep the semantics and allow
> RTAX_INITRWND to _reduce_ the initial window.
>
> In this case there are four ways out of this:
>
> 1) Keep it as proposed, which indeed requires a new field in
> request_sock (perhaps it could be smaller than u32).
>
> 2) Send the SYNACK with small/zero window, since we haven't done the
> initrwnd lookup at this stage, but that would be at least
> controversial, and also adds one more RTT to the common case. I don't
> think this is acceptable.
>
> 3) Do two initrwnd lookups. One in the `tcp_openreq_init_rwin` to
> figure out if the window is smaller than 64KiB, second one in
> `tcp_create_openreq_child` to figure out if the suggested window is
> larger than 64KiB.

I think syncookies can be handled, if you look at cookie_v6_check() &
cookie_v4_check() after their calls to cookie_tcp_reqsk_alloc().

>
> 4) Abort the whole approach and recycle Ivan's
> bpf_setsockopt(TCP_BPF_RCV_SSTHRESH) approach [3]. But I prefer the route
> attribute approach, which seems easier to use and more flexible.
>
> But, thinking about it, I don't think we could ever support reducing
> the initial receive window in the syncookie case. Only (3) - two initrwnd
> lookups - could be made to work, but even that is controversial.
>
> However the intention of RTAX_INITRWND as far as I understand was to
> _increase_ rcv_ssthresh, back in the days when it started from 10MSS
> (so I was told).

That was before we fixed DRS and made the initial RWIN 65535, the
max allowed value in a SYN or SYNACK packet.
But yes...

>
> So, we could change the semantics of RTAX_INITRWND to allow only
> *increasing* the window - and disallow reducing it. With such an
> approach we could indeed structure the code as you suggested, and move
> the route attribute lookup out of minisocks into `tcp_create_openreq_child`.
>
> Marek
>
> [1] https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/
> [2] https://gist.github.com/majek/13848c050a3dc218ed295364ee717879
> [3] https://lore.kernel.org/bpf/20220111192952.49040-1-ivan@cloudflare.com/t/

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-07-27 12:55 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-21 15:10 [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB Marek Majkowski
2022-07-21 15:10 ` [PATCH net-next 1/2] " Marek Majkowski
2022-07-22  9:23   ` Eric Dumazet
2022-07-27 11:19     ` Marek Majkowski
2022-07-27 12:54       ` Eric Dumazet
2022-07-21 15:10 ` [PATCH net-next 2/2] Tests for RTAX_INITRWND Marek Majkowski
2022-07-22  1:54 ` [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB Jakub Kicinski
