* [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB @ 2022-07-21 15:10 Marek Majkowski 2022-07-21 15:10 ` [PATCH net-next 1/2] " Marek Majkowski ` (2 more replies) 0 siblings, 3 replies; 7+ messages in thread From: Marek Majkowski @ 2022-07-21 15:10 UTC (permalink / raw) To: netdev Cc: bpf, kernel-team, ivan, edumazet, davem, kuba, pabeni, ast, daniel, andrii, Marek Majkowski

Among many route options we support the initrwnd/RTAX_INITRWND path attribute:

$ ip route change local 127.0.0.0/8 dev lo initrwnd 1024

This sets the initial receive window size (in packets). However, it's not very useful in practice. For smaller buffers (<128KiB) it can be used to bring the initial receive window down, but it's hard to imagine when this is useful. The same effect can be achieved with the TCP_WINDOW_CLAMP / RTAX_WINDOW option. For larger buffers (>128KiB) the initial receive window is usually limited by rcv_ssthresh, which starts at 64KiB. The initrwnd option can't bring the window above it, which limits its usefulness.

This patch changes that. Now, setting the RTAX_INITRWND path attribute brings the initial rcv_ssthresh in line with the initrwnd value. This allows the initial advertised receive window to grow above 64KiB instantly, after the first TCP RTT.

With this change, the administrator can configure a route (or skops ebpf program) where the receive window is opened much faster than usual. This is useful on big BDP connections - large latency, high throughput - where it takes a long time to fully open the receive window, due to the usual rcv_ssthresh cap.

However, this feature should be used with caution. It only makes sense to employ it in limited circumstances:

* When using high-bandwidth TCP transfers over big-latency links.
* When the truesize of the flow/NIC is sensible and predictable.
* When the application is ready to send a lot of data immediately after the flow is established.
* When the sender has configured a larger than usual `initcwnd`.
* When optimizing for every possible RTT.

This patch is related to previous work by Ivan Babrou: https://lore.kernel.org/bpf/CAA93jw5+LjKLcCaNr5wJGPrXhbjvLhts8hqpKPFx7JeWG4g0AA@mail.gmail.com/T/

Please note that due to TCP wscale semantics, the TCP sender will need to receive the first ACK to be informed of the large opened receive window. That is: the large window is advertised only in the first ACK from the peer. When the TCP client has a large window, it is advertised in the third packet (the ACK) of the handshake. When the TCP server has a large window, it is advertised only in the first ACK after some data has been received.

Signed-off-by: Marek Majkowski <marek@cloudflare.com>

Marek Majkowski (2):
  RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB
  Tests for RTAX_INITRWND

 include/net/inet_sock.h | 1 +
 net/ipv4/tcp_minisocks.c | 8 +-
 net/ipv4/tcp_output.c | 10 +-
 .../selftests/bpf/prog_tests/tcp_initrwnd.c | 398 ++++++++++++++++++
 .../selftests/bpf/progs/test_tcp_initrwnd.c | 30 ++
 5 files changed, 443 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_initrwnd.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_initrwnd.c

-- 2.25.1

^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH net-next 1/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB 2022-07-21 15:10 [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB Marek Majkowski @ 2022-07-21 15:10 ` Marek Majkowski 2022-07-22 9:23 ` Eric Dumazet 2022-07-21 15:10 ` [PATCH net-next 2/2] Tests for RTAX_INITRWND Marek Majkowski 2022-07-22 1:54 ` [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB Jakub Kicinski 2 siblings, 1 reply; 7+ messages in thread From: Marek Majkowski @ 2022-07-21 15:10 UTC (permalink / raw) To: netdev Cc: bpf, kernel-team, ivan, edumazet, davem, kuba, pabeni, ast, daniel, andrii, Marek Majkowski

We already support the RTAX_INITRWND / initrwnd path attribute:

$ ip route change local 127.0.0.0/8 dev lo initrwnd 1024

Normally, however, the initial advertised receive window is limited to 64KiB by rcv_ssthresh, regardless of initrwnd. This patch changes that, bumping up rcv_ssthresh to a value derived from initrwnd. This allows for larger initial advertised receive windows, which is useful for specific types of TCP flows: big BDP ones, where there is a lot of data to send immediately after the flow is established.

There are three places where we initialize sockets:
- tcp_output:tcp_connect_init
- tcp_minisocks:tcp_openreq_init_rwin
- syncookies

In the first two we already have calls to `tcp_rwnd_init_bpf` and `dst_metric(RTAX_INITRWND)`, which retrieve the bpf/path initrwnd attribute. We use this value to bring `rcv_ssthresh` up, potentially above the traditional 64KiB.

With a higher initial `rcv_ssthresh` the receiver will open the receive window more aggressively, which can improve large BDP flows - high throughput and high latency.

This patch does not cover the syncookies case. 
Signed-off-by: Marek Majkowski <marek@cloudflare.com> --- include/net/inet_sock.h | 1 + net/ipv4/tcp_minisocks.c | 8 ++++++-- net/ipv4/tcp_output.c | 10 ++++++++-- 3 files changed, 15 insertions(+), 4 deletions(-) diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h index daead5fb389a..bc68c9b70942 100644 --- a/include/net/inet_sock.h +++ b/include/net/inet_sock.h @@ -89,6 +89,7 @@ struct inet_request_sock { no_srccheck: 1, smc_ok : 1; u32 ir_mark; + u32 rcv_ssthresh; union { struct ip_options_rcu __rcu *ireq_opt; #if IS_ENABLED(CONFIG_IPV6) diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index 6854bb1fb32b..89ba2a30a012 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -360,6 +360,7 @@ void tcp_openreq_init_rwin(struct request_sock *req, u32 window_clamp; __u8 rcv_wscale; u32 rcv_wnd; + int adj_mss; int mss; mss = tcp_mss_clamp(tp, dst_metric_advmss(dst)); @@ -378,15 +379,18 @@ void tcp_openreq_init_rwin(struct request_sock *req, else if (full_space < rcv_wnd * mss) full_space = rcv_wnd * mss; + adj_mss = mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0); + /* tcp_full_space because it is guaranteed to be the first packet */ tcp_select_initial_window(sk_listener, full_space, - mss - (ireq->tstamp_ok ? 
TCPOLEN_TSTAMP_ALIGNED : 0), + adj_mss, &req->rsk_rcv_wnd, &req->rsk_window_clamp, ireq->wscale_ok, &rcv_wscale, rcv_wnd); ireq->rcv_wscale = rcv_wscale; + ireq->rcv_ssthresh = max(req->rsk_rcv_wnd, rcv_wnd * adj_mss); } EXPORT_SYMBOL(tcp_openreq_init_rwin); @@ -502,7 +506,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk, newtp->rx_opt.tstamp_ok = ireq->tstamp_ok; newtp->rx_opt.sack_ok = ireq->sack_ok; newtp->window_clamp = req->rsk_window_clamp; - newtp->rcv_ssthresh = req->rsk_rcv_wnd; + newtp->rcv_ssthresh = ireq->rcv_ssthresh; newtp->rcv_wnd = req->rsk_rcv_wnd; newtp->rx_opt.wscale_ok = ireq->wscale_ok; if (newtp->rx_opt.wscale_ok) { diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 18c913a2347a..0f2d4174ea59 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3642,6 +3642,7 @@ static void tcp_connect_init(struct sock *sk) struct tcp_sock *tp = tcp_sk(sk); __u8 rcv_wscale; u32 rcv_wnd; + u32 mss; /* We'll fix this up when we get a response from the other end. * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT. @@ -3679,8 +3680,10 @@ static void tcp_connect_init(struct sock *sk) if (rcv_wnd == 0) rcv_wnd = dst_metric(dst, RTAX_INITRWND); + mss = tp->advmss - (tp->rx_opt.ts_recent_stamp ? + tp->tcp_header_len - sizeof(struct tcphdr) : 0); tcp_select_initial_window(sk, tcp_full_space(sk), - tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0), + mss, &tp->rcv_wnd, &tp->window_clamp, sock_net(sk)->ipv4.sysctl_tcp_window_scaling, @@ -3688,7 +3691,10 @@ static void tcp_connect_init(struct sock *sk) rcv_wnd); tp->rx_opt.rcv_wscale = rcv_wscale; - tp->rcv_ssthresh = tp->rcv_wnd; + if (rcv_wnd) + tp->rcv_ssthresh = max(tp->rcv_wnd, rcv_wnd * mss); + else + tp->rcv_ssthresh = tp->rcv_wnd; sk->sk_err = 0; sock_reset_flag(sk, SOCK_DONE); -- 2.25.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH net-next 1/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB 2022-07-21 15:10 ` [PATCH net-next 1/2] " Marek Majkowski @ 2022-07-22 9:23 ` Eric Dumazet 2022-07-27 11:19 ` Marek Majkowski 0 siblings, 1 reply; 7+ messages in thread From: Eric Dumazet @ 2022-07-22 9:23 UTC (permalink / raw) To: Marek Majkowski Cc: netdev, bpf, kernel-team, Ivan Babrou, David Miller, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko On Thu, Jul 21, 2022 at 5:10 PM Marek Majkowski <marek@cloudflare.com> wrote: > > We already support RTAX_INITRWND / initrwnd path attribute: > > $ ip route change local 127.0.0.0/8 dev lo initrwnd 1024 > > However normally, the initial advertised receive window is limited to > 64KiB by rcv_ssthresh, regardless of initrwnd. This patch changes > that, bumping up rcv_ssthresh to value derived from initrwnd. This > allows for larger initial advertised receive windows, which is useful > for specific types of TCP flows: big BDP ones, where there is a lot of > data to send immediately after the flow is established. > > There are three places where we initialize sockets: > - tcp_output:tcp_connect_init > - tcp_minisocks:tcp_openreq_init_rwin > - syncookies > > In the first two we already have a call to `tcp_rwnd_init_bpf` and > `dst_metric(RTAX_INITRWND)` which retrieve the bpf/path initrwnd > attribute. We use this value to bring `rcv_ssthresh` up, potentially > above the traditional 64KiB. > > With higher initial `rcv_ssthresh` the receiver will open the receive > window more aggresively, which can improve large BDP flows - large > throughput and latency. > > This patch does not cover the syncookies case. 
> > Signed-off-by: Marek Majkowski <marek@cloudflare.com> > --- > include/net/inet_sock.h | 1 + > net/ipv4/tcp_minisocks.c | 8 ++++++-- > net/ipv4/tcp_output.c | 10 ++++++++-- > 3 files changed, 15 insertions(+), 4 deletions(-) > > diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h > index daead5fb389a..bc68c9b70942 100644 > --- a/include/net/inet_sock.h > +++ b/include/net/inet_sock.h > @@ -89,6 +89,7 @@ struct inet_request_sock { > no_srccheck: 1, > smc_ok : 1; > u32 ir_mark; > + u32 rcv_ssthresh; Why do we need to store this value in the request_sock ? It is derived from a route attribute and MSS, all this should be available when the full blown socket is created. It would also work even with syncookies. > union { > struct ip_options_rcu __rcu *ireq_opt; > #if IS_ENABLED(CONFIG_IPV6) > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c > index 6854bb1fb32b..89ba2a30a012 100644 > --- a/net/ipv4/tcp_minisocks.c > +++ b/net/ipv4/tcp_minisocks.c > @@ -360,6 +360,7 @@ void tcp_openreq_init_rwin(struct request_sock *req, > u32 window_clamp; > __u8 rcv_wscale; > u32 rcv_wnd; > + int adj_mss; > int mss; > > mss = tcp_mss_clamp(tp, dst_metric_advmss(dst)); > @@ -378,15 +379,18 @@ void tcp_openreq_init_rwin(struct request_sock *req, > else if (full_space < rcv_wnd * mss) > full_space = rcv_wnd * mss; > > + adj_mss = mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0); > + > /* tcp_full_space because it is guaranteed to be the first packet */ > tcp_select_initial_window(sk_listener, full_space, > - mss - (ireq->tstamp_ok ? 
TCPOLEN_TSTAMP_ALIGNED : 0), > + adj_mss, > &req->rsk_rcv_wnd, > &req->rsk_window_clamp, > ireq->wscale_ok, > &rcv_wscale, > rcv_wnd); > ireq->rcv_wscale = rcv_wscale; > + ireq->rcv_ssthresh = max(req->rsk_rcv_wnd, rcv_wnd * adj_mss); > } > EXPORT_SYMBOL(tcp_openreq_init_rwin); > > @@ -502,7 +506,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk, > newtp->rx_opt.tstamp_ok = ireq->tstamp_ok; > newtp->rx_opt.sack_ok = ireq->sack_ok; > newtp->window_clamp = req->rsk_window_clamp; > - newtp->rcv_ssthresh = req->rsk_rcv_wnd; > + newtp->rcv_ssthresh = ireq->rcv_ssthresh; > newtp->rcv_wnd = req->rsk_rcv_wnd; > newtp->rx_opt.wscale_ok = ireq->wscale_ok; > if (newtp->rx_opt.wscale_ok) { > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index 18c913a2347a..0f2d4174ea59 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -3642,6 +3642,7 @@ static void tcp_connect_init(struct sock *sk) > struct tcp_sock *tp = tcp_sk(sk); > __u8 rcv_wscale; > u32 rcv_wnd; > + u32 mss; > > /* We'll fix this up when we get a response from the other end. > * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT. > @@ -3679,8 +3680,10 @@ static void tcp_connect_init(struct sock *sk) > if (rcv_wnd == 0) > rcv_wnd = dst_metric(dst, RTAX_INITRWND); > > + mss = tp->advmss - (tp->rx_opt.ts_recent_stamp ? > + tp->tcp_header_len - sizeof(struct tcphdr) : 0); > tcp_select_initial_window(sk, tcp_full_space(sk), > - tp->advmss - (tp->rx_opt.ts_recent_stamp ? 
tp->tcp_header_len - sizeof(struct tcphdr) : 0), > + mss, > &tp->rcv_wnd, > &tp->window_clamp, > sock_net(sk)->ipv4.sysctl_tcp_window_scaling, > @@ -3688,7 +3691,10 @@ static void tcp_connect_init(struct sock *sk) > rcv_wnd); > > tp->rx_opt.rcv_wscale = rcv_wscale; > - tp->rcv_ssthresh = tp->rcv_wnd; > + if (rcv_wnd) > + tp->rcv_ssthresh = max(tp->rcv_wnd, rcv_wnd * mss); > + else > + tp->rcv_ssthresh = tp->rcv_wnd; > > sk->sk_err = 0; > sock_reset_flag(sk, SOCK_DONE); > -- > 2.25.1 > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH net-next 1/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB 2022-07-22 9:23 ` Eric Dumazet @ 2022-07-27 11:19 ` Marek Majkowski 2022-07-27 12:54 ` Eric Dumazet 0 siblings, 1 reply; 7+ messages in thread From: Marek Majkowski @ 2022-07-27 11:19 UTC (permalink / raw) To: Eric Dumazet, brakmo Cc: netdev, bpf, kernel-team, Ivan Babrou, David Miller, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko On Fri, Jul 22, 2022 at 11:23 AM Eric Dumazet <edumazet@google.com> wrote: > > On Thu, Jul 21, 2022 at 5:10 PM Marek Majkowski <marek@cloudflare.com> wrote: > > > > We already support RTAX_INITRWND / initrwnd path attribute: > > > > $ ip route change local 127.0.0.0/8 dev lo initrwnd 1024 > > > > However normally, the initial advertised receive window is limited to > > 64KiB by rcv_ssthresh, regardless of initrwnd. This patch changes > > that, bumping up rcv_ssthresh to value derived from initrwnd. This > > allows for larger initial advertised receive windows, which is useful > > for specific types of TCP flows: big BDP ones, where there is a lot of > > data to send immediately after the flow is established. > > > > There are three places where we initialize sockets: > > - tcp_output:tcp_connect_init > > - tcp_minisocks:tcp_openreq_init_rwin > > - syncookies > > > > In the first two we already have a call to `tcp_rwnd_init_bpf` and > > `dst_metric(RTAX_INITRWND)` which retrieve the bpf/path initrwnd > > attribute. We use this value to bring `rcv_ssthresh` up, potentially > > above the traditional 64KiB. > > > > With higher initial `rcv_ssthresh` the receiver will open the receive > > window more aggresively, which can improve large BDP flows - large > > throughput and latency. > > > > This patch does not cover the syncookies case. 
> > > > Signed-off-by: Marek Majkowski <marek@cloudflare.com> > > --- > > include/net/inet_sock.h | 1 + > > net/ipv4/tcp_minisocks.c | 8 ++++++-- > > net/ipv4/tcp_output.c | 10 ++++++++-- > > 3 files changed, 15 insertions(+), 4 deletions(-) > > > > diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h > > index daead5fb389a..bc68c9b70942 100644 > > --- a/include/net/inet_sock.h > > +++ b/include/net/inet_sock.h > > @@ -89,6 +89,7 @@ struct inet_request_sock { > > no_srccheck: 1, > > smc_ok : 1; > > u32 ir_mark; > > + u32 rcv_ssthresh; > > Why do we need to store this value in the request_sock ? > > It is derived from a route attribute and MSS, all this should be > available when the full blown socket is created. > > It would also work even with syncookies. Eric, Thanks for the feedback. For some context, I published a blog post explaining this work in detail [1]. https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/ I understand the suggestion is to move tcp_rwnd_init_bpf + RTAX_INITRWND lookup from `tcp_openreq_init_rwin` into `tcp_create_openreq_child`. I gave it a try (patch: [2]), but I don't think this will work under all circumstances. The issue is that we need to advertise *some* window in the SYNACK packet, before creating the full blown socket. With RTAX_INITRWND it is possible to move the advertised window up, or down. In the latter case, of reducing the window, at the SYNACK moment we must know if the window is reduced under 64KiB. This is what happens right now, we can _reduce_ window with RTAX_INITRWND to small values, I guess down to 1 MSS. This smaller window is then advertised in the SYNACK. If we move RTAX_INITRWND lookup into the later `tcp_create_openreq_child` then it will be too late, we won't know the correct window size on SYNACK stage. We will likely end up sending large window on SYNACK and then a small window on subsequent ACK, violating TCP. There are two approaches here. 
First, keep the semantics and allow RTAX_INITRWND to _reduce_ the initial window. In this case there are four ways out of this:

1) Keep it as proposed; that indeed requires a new value in request_sock (perhaps it could be smaller than u32).

2) Send the SYNACK with a small/zero window, since we haven't done the initrwnd lookup at this stage. But that would be at least controversial, and it also adds one more RTT to the common case. I don't think this is acceptable.

3) Do two initrwnd lookups. One in `tcp_openreq_init_rwin` to figure out if the window is smaller than 64KiB, and a second one in `tcp_create_openreq_child` to figure out if the suggested window is larger than 64KiB.

4) Abort the whole approach and recycle Ivan's bpf_setsockopt(TCP_BPF_RCV_SSTHRESH) approach [3]. But I prefer the route attribute approach, as it seems easier to use and more flexible.

But, thinking about it, I don't think we could ever support reducing the initial receive window in the syncookie case. Only (3) - two initrwnd lookups - could be made to work, but even that is controversial.

However, the intention of RTAX_INITRWND, as far as I understand, was to _increase_ rcv_ssthresh, back in the days when it started from 10 MSS (so I was told).

So, we could change the semantics of RTAX_INITRWND to allow only *increasing* the window - and disallow reducing it. With such an approach we could indeed make the code as you suggested, and move the route attribute lookup away from minisocks into `tcp_create_openreq_child`.

Marek

[1] https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/
[2] https://gist.github.com/majek/13848c050a3dc218ed295364ee717879
[3] https://lore.kernel.org/bpf/20220111192952.49040-1-ivan@cloudflare.com/t/

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH net-next 1/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB 2022-07-27 11:19 ` Marek Majkowski @ 2022-07-27 12:54 ` Eric Dumazet 0 siblings, 0 replies; 7+ messages in thread From: Eric Dumazet @ 2022-07-27 12:54 UTC (permalink / raw) To: Marek Majkowski Cc: Lawrence Brakmo, netdev, bpf, kernel-team, Ivan Babrou, David Miller, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko On Wed, Jul 27, 2022 at 1:19 PM Marek Majkowski <marek@cloudflare.com> wrote: > > On Fri, Jul 22, 2022 at 11:23 AM Eric Dumazet <edumazet@google.com> wrote: > > > > On Thu, Jul 21, 2022 at 5:10 PM Marek Majkowski <marek@cloudflare.com> wrote: > > > > > > We already support RTAX_INITRWND / initrwnd path attribute: > > > > > > $ ip route change local 127.0.0.0/8 dev lo initrwnd 1024 > > > > > > However normally, the initial advertised receive window is limited to > > > 64KiB by rcv_ssthresh, regardless of initrwnd. This patch changes > > > that, bumping up rcv_ssthresh to value derived from initrwnd. This > > > allows for larger initial advertised receive windows, which is useful > > > for specific types of TCP flows: big BDP ones, where there is a lot of > > > data to send immediately after the flow is established. > > > > > > There are three places where we initialize sockets: > > > - tcp_output:tcp_connect_init > > > - tcp_minisocks:tcp_openreq_init_rwin > > > - syncookies > > > > > > In the first two we already have a call to `tcp_rwnd_init_bpf` and > > > `dst_metric(RTAX_INITRWND)` which retrieve the bpf/path initrwnd > > > attribute. We use this value to bring `rcv_ssthresh` up, potentially > > > above the traditional 64KiB. > > > > > > With higher initial `rcv_ssthresh` the receiver will open the receive > > > window more aggresively, which can improve large BDP flows - large > > > throughput and latency. > > > > > > This patch does not cover the syncookies case. 
> > > > > > Signed-off-by: Marek Majkowski <marek@cloudflare.com> > > > --- > > > include/net/inet_sock.h | 1 + > > > net/ipv4/tcp_minisocks.c | 8 ++++++-- > > > net/ipv4/tcp_output.c | 10 ++++++++-- > > > 3 files changed, 15 insertions(+), 4 deletions(-) > > > > > > diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h > > > index daead5fb389a..bc68c9b70942 100644 > > > --- a/include/net/inet_sock.h > > > +++ b/include/net/inet_sock.h > > > @@ -89,6 +89,7 @@ struct inet_request_sock { > > > no_srccheck: 1, > > > smc_ok : 1; > > > u32 ir_mark; > > > + u32 rcv_ssthresh; Please move this in struct tcp_request_sock > > > > Why do we need to store this value in the request_sock ? > > > > It is derived from a route attribute and MSS, all this should be > > available when the full blown socket is created. > > > > It would also work even with syncookies. > > Eric, > > Thanks for the feedback. For some context, I published a blog post > explaining this work in detail [1]. > > https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/ > > I understand the suggestion is to move tcp_rwnd_init_bpf + > RTAX_INITRWND lookup from `tcp_openreq_init_rwin` into > `tcp_create_openreq_child`. > > I gave it a try (patch: [2]), but I don't think this will work under > all circumstances. The issue is that we need to advertise *some* > window in the SYNACK packet, before creating the full blown socket. > > With RTAX_INITRWND it is possible to move the advertised window up, or > down. > > In the latter case, of reducing the window, at the SYNACK moment we > must know if the window is reduced under 64KiB. This is what happens > right now, we can _reduce_ window with RTAX_INITRWND to small values, > I guess down to 1 MSS. This smaller window is then advertised in the > SYNACK. 
> > If we move RTAX_INITRWND lookup into the later > `tcp_create_openreq_child` then it will be too late, we won't know the > correct window size on SYNACK stage. We will likely end up sending > large window on SYNACK and then a small window on subsequent ACK, > violating TCP. > > There are two approaches here. First, keep the semantics and allow > RTAX_INITRWND to _reduce_ the initial window. > > In this case there are four ways out of this: > > 1) Keep it as proposed, that indeed requires some new value in > request_sock. (perhaps maybe it could be it smaller than u32) > > 2) Send the SYNACK with small/zero window, since we haven't done the > initrwnd lookup at this stage, but that would be at least > controversial, and also adds one more RTT to the common case. I don't > think this is acceptable. > > 3) Do two initrwnd lookups. One in the `tcp_openreq_init_rwin` to > figure out if the window is smaller than 64KiB, second one in > `tcp_create_openreq_child` to figure out if the suggested window is > larger than 64KiB. I think syncookies can be handled, if you look at cookie_v6_check() & cookie_v4_check() after their calls to cookie_tcp_reqsk_alloc() > > 4) Abort the whole approach and recycle Ivan's > bpf_setsockopt(TCP_BPF_RCV_SSTHRESH) approach [3]. But I prefer the route > attribute approach, seems easier to use and more flexible. > > But, thinking about it, I don't think we could ever support reducing > initial receive window in the syncookie case. Only (3) - two initrwnd > lookups - could be made to work, but even that is controversial. > > However the intention of RTAX_INITRWND as far as I understand was to > _increase_ rcv_ssthresh, back in the days when it started from 10MSS > (so I was told). That was before we fixed DRS and that we made initial RWIN 65535, the max allowed value in a SYN , SYNACK packet. But yes... > > So, we could change the semantics of RTAX_INITRWND to allow only > *increasing* the window - and disallow reducing it. 
With such an > approach indeed we could make the code as you suggested, and move the > route attribute lookup away from minisocks into `tcp_create_openreq_child`. > > Marek > > [1] https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/ > [2] https://gist.github.com/majek/13848c050a3dc218ed295364ee717879 > [3] https://lore.kernel.org/bpf/20220111192952.49040-1-ivan@cloudflare.com/t/ ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH net-next 2/2] Tests for RTAX_INITRWND 2022-07-21 15:10 [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB Marek Majkowski 2022-07-21 15:10 ` [PATCH net-next 1/2] " Marek Majkowski @ 2022-07-21 15:10 ` Marek Majkowski 2022-07-22 1:54 ` [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB Jakub Kicinski 2 siblings, 0 replies; 7+ messages in thread From: Marek Majkowski @ 2022-07-21 15:10 UTC (permalink / raw) To: netdev Cc: bpf, kernel-team, ivan, edumazet, davem, kuba, pabeni, ast, daniel, andrii, Marek Majkowski

Accompanying tests. We open an skops program, hooking on the BPF_SOCK_OPS_RWND_INIT event, where we return the updated value of the initrwnd route path attribute. In the tests we check whether values above 64KiB are indeed advertised correctly to the remote peer.

Signed-off-by: Marek Majkowski <marek@cloudflare.com>
---
 .../selftests/bpf/prog_tests/tcp_initrwnd.c | 398 ++++++++++++++++++
 .../selftests/bpf/progs/test_tcp_initrwnd.c | 30 ++
 2 files changed, 428 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_initrwnd.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_initrwnd.c

diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_initrwnd.c b/tools/testing/selftests/bpf/prog_tests/tcp_initrwnd.c new file mode 100644 index 000000000000..0276fe9c8ce6 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/tcp_initrwnd.c @@ -0,0 +1,398 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause +// Copyright (c) 2022 Cloudflare + +#include "test_progs.h" +#include "bpf_util.h" +#include "network_helpers.h" + +#include "test_tcp_initrwnd.skel.h" + +#define CG_NAME "/tcpbpf-user-test" + +/* It's easier to hardcode offsets than to fight with headers + * + * $ pahole tcp_info + * struct tcp_info { + * __u32 tcpi_rcv_ssthresh; * 64 4 * + * __u32 tcpi_snd_wnd; * 228 4 * + */ + +#define TCPI_RCV_SSTHRESH(info) info[64 / 4] +#define TCPI_SND_WND(info) info[228 / 4] + 
+static int read_int_sysctl(const char *sysctl) +{ + char buf[16]; + int fd, ret; + + fd = open(sysctl, 0); + if (CHECK_FAIL(fd == -1)) + goto err; + + ret = read(fd, buf, sizeof(buf)); + if (CHECK_FAIL(ret <= 0)) + goto err; + + close(fd); + return atoi(buf); +err: + if (fd >= 0) + close(fd); + return -1; +} + +static int write_int_sysctl(const char *sysctl, int v) +{ + int fd, ret, size; + char buf[16]; + + fd = open(sysctl, O_RDWR); + if (CHECK_FAIL(fd < 0)) + goto err; + + size = snprintf(buf, sizeof(buf), "%d", v); + ret = write(fd, buf, size); + if (CHECK_FAIL(ret < 0)) + goto err; + + close(fd); + return 0; +err: + if (fd >= 0) + close(fd); + return -1; +} + +static int tcp_timestamps; +static int tcp_window_scaling; +static int tcp_workaround_signed_windows; + +static void do_test_server(int server_fd, struct test_tcp_initrwnd *skel, + int initrwnd, unsigned int tcpi_snd_wnd_on_connect, + unsigned int rcv_ssthresh_on_recv, + unsigned int tcpi_snd_wnd_on_recv) +{ + int client_fd = -1, sd = -1, r; + __u32 info[256 / 4]; + socklen_t optlen = sizeof(info); + char b[1] = { 0x55 }; + + fprintf(stderr, + "[*] server initrwnd=%d tcp_timestamps=%d tcp_window_scaling=%d tcp_workaround_signed_windows=%d\n", + initrwnd, tcp_timestamps, tcp_window_scaling, + tcp_workaround_signed_windows); + + skel->bss->initrwnd = initrwnd; // in full MSS packets + + client_fd = connect_to_fd(server_fd, 0); + if (CHECK_FAIL(client_fd < 0)) + goto err; + + sd = accept(server_fd, NULL, NULL); + if (CHECK_FAIL(sd < 0)) + goto err; + + /* There are three moments where we check the window/rcv_ssthresh. + * + * (1) First, after socket creation, TCP handshake, we expect + * the client to see only SYN+ACK which is without window + * scaling. That is: from client/sender point of view we see + * at most 64KiB open receive window. 
+ */ + r = getsockopt(client_fd, SOL_TCP, TCP_INFO, &info, &optlen); + if (CHECK_FAIL(r < 0)) + goto err; + + ASSERT_EQ(TCPI_SND_WND(info), tcpi_snd_wnd_on_connect, + "getsockopt(TCP_INFO.tcpi_snd_wnd) on connect"); + + /* (2) At the same time, from the server/receiver point of + * view, we have already initialized the socket, so rcv_ssthresh + * is set to a high value, potentially larger than 64KiB. + */ + r = getsockopt(sd, SOL_TCP, TCP_INFO, &info, &optlen); + if (CHECK_FAIL(r < 0)) + goto err; + + ASSERT_EQ(TCPI_RCV_SSTHRESH(info), rcv_ssthresh_on_recv, + "getsockopt(TCP_INFO.rcv_ssthresh) on recv"); + + /* (3) Finally, after receiving some ACK from client, the + * client/sender should also see wider open window, larger + * than 64KiB. + */ + if (CHECK_FAIL(write(client_fd, &b, sizeof(b)) != 1)) + perror("Failed to send single byte"); + + if (CHECK_FAIL(read(sd, &b, sizeof(b)) != 1)) + perror("Failed to read single byte"); + + r = getsockopt(client_fd, SOL_TCP, TCP_INFO, &info, &optlen); + if (CHECK_FAIL(r < 0)) + goto err; + + ASSERT_EQ(TCPI_SND_WND(info), tcpi_snd_wnd_on_recv, + "getsockopt(TCP_INFO.tcpi_snd_wnd) after recv"); + +err: + if (sd != -1) + close(sd); + if (client_fd != -1) + close(client_fd); +} + +static int socket_client(int server_fd) +{ + socklen_t optlen; + int family, type, protocol, r; + + optlen = sizeof(family); + r = getsockopt(server_fd, SOL_SOCKET, SO_DOMAIN, &family, &optlen); + if (CHECK_FAIL(r < 0)) + return -1; + + optlen = sizeof(type); + r = getsockopt(server_fd, SOL_SOCKET, SO_TYPE, &type, &optlen); + if (CHECK_FAIL(r < 0)) + return -1; + + optlen = sizeof(protocol); + r = getsockopt(server_fd, SOL_SOCKET, SO_PROTOCOL, &protocol, &optlen); + if (CHECK_FAIL(r < 0)) + return -1; + + return socket(family, type, protocol); +} + +static void do_test_client(int server_fd, struct test_tcp_initrwnd *skel, + int initrwnd, unsigned int rcv_ssthresh, + unsigned int tcpi_snd_wnd) +{ + int client_fd = -1, sd = -1, r, maxseg; + __u32 info[256 / 4]; + 
socklen_t optlen = sizeof(info); + size_t rcvbuf; + char b[1] = { 0x55 }; + + fprintf(stderr, + "[*] client initrwnd=%d tcp_timestamps=%d tcp_window_scaling=%d tcp_workaround_signed_windows=%d\n", + initrwnd, tcp_timestamps, tcp_window_scaling, + tcp_workaround_signed_windows); + + skel->bss->initrwnd = initrwnd; // in full MSS packets + + client_fd = socket_client(server_fd); + if (CHECK_FAIL(client_fd < 0)) + goto err; + + /* With MSS=64KiB on loopback it's hard to argue about init + * rwnd. Let's set MSS to something that will make our life + * easier, like 1024 + timestamps. + */ + maxseg = 1024; + + r = setsockopt(client_fd, SOL_TCP, TCP_MAXSEG, &maxseg, sizeof(maxseg)); + if (CHECK_FAIL(r < 0)) + goto err; + + rcvbuf = 208 * 1024; + r = setsockopt(client_fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, + sizeof(rcvbuf)); + if (CHECK_FAIL(r < 0)) + goto err; + + r = connect_fd_to_fd(client_fd, server_fd, 0); + if (CHECK_FAIL(r < 0)) + goto err; + + sd = accept(server_fd, NULL, NULL); + if (CHECK_FAIL(sd < 0)) + goto err; + + if (CHECK_FAIL(write(sd, &b, sizeof(b)) != 1)) + perror("Failed to send single byte"); + + if (CHECK_FAIL(read(client_fd, &b, sizeof(b)) != 1)) + perror("Failed to read single byte"); + + /* There is only one moment to check - the server should know + * about client window just after accept. First check client + * rcv_ssthresh. + */ + r = getsockopt(client_fd, SOL_TCP, TCP_INFO, &info, &optlen); + if (CHECK_FAIL(r < 0)) + goto err; + + ASSERT_EQ(TCPI_RCV_SSTHRESH(info), rcv_ssthresh, + "getsockopt(TCP_INFO.tcpi_rcv_ssthresh) on client"); + + /* And the receive window size as seen from the server. 
+ */ + r = getsockopt(sd, SOL_TCP, TCP_INFO, &info, &optlen); + if (CHECK_FAIL(r < 0)) + goto err; + + ASSERT_EQ(TCPI_SND_WND(info), tcpi_snd_wnd, + "getsockopt(TCP_INFO.tcpi_snd_wnd)"); + +err: + if (sd != -1) + close(sd); + if (client_fd != -1) + close(client_fd); +} + +static void run_tests(int cg_fd, struct test_tcp_initrwnd *skel) +{ + int server_fd = -1, r, rcvbuf, maxseg; + unsigned int max_wnd, buf; + + skel->links.bpf_testcb = + bpf_program__attach_cgroup(skel->progs.bpf_testcb, cg_fd); + if (!ASSERT_OK_PTR(skel->links.bpf_testcb, "attach_cgroup(bpf_testcb)")) + goto err; + + server_fd = start_server(AF_INET, SOCK_STREAM, NULL, 0, 0); + if (CHECK_FAIL(server_fd < 0)) + goto err; + + maxseg = 1024; + if (tcp_timestamps) + maxseg += 12; + + /* With MSS=64KiB on loopback it's hard to argue about init + * rwnd. Let's set MSS to something that will make our life + * easier, like 1024 + timestamps. + */ + r = setsockopt(server_fd, SOL_TCP, TCP_MAXSEG, &maxseg, sizeof(maxseg)); + if (CHECK_FAIL(r < 0)) + goto err; + + /* Obviously, rcvbuffer must be large at the start for the + * initrwnd to make any dent in rcv_ssthresh (assuming default + * tcp_rmem of 128KiB) + */ + rcvbuf = 208 * 1024; + r = setsockopt(server_fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, + sizeof(rcvbuf)); + if (CHECK_FAIL(r < 0)) + goto err; + + max_wnd = tcp_workaround_signed_windows ? 32767 : 65535; + + /* [*] server advertising large window ** */ + + /* Small initrwnd. Not exceeding 64KiB */ + do_test_server(server_fd, skel, 1, 1024, 1024, 1024); + + if (tcp_window_scaling) { + /* Borderline. Not exceeding 64KiB */ + do_test_server(server_fd, skel, 63, MIN(max_wnd, 63 * 1024), + 63 * 1024, 63 * 1024); + } else { + do_test_server(server_fd, skel, 63, MIN(max_wnd, 63 * 1024), + 63 * 1024, MIN(max_wnd, 63 * 1024)); + } + + if (tcp_window_scaling) { + /* The interesting case. 
Crossing 64KiB */ + do_test_server(server_fd, skel, 128, max_wnd, 128 * 1024, + 128 * 1024); + } else { + do_test_server(server_fd, skel, 128, max_wnd, 65535, + MIN(max_wnd, 65535)); + } + + if (tcp_window_scaling) { + /* Super large. The buffer is 208*2 */ + do_test_server(server_fd, skel, 206, max_wnd, 206 * 1024U, + 206 * 1024U); + buf = 207 * 1024U - (tcp_timestamps ? 12 : 0); + do_test_server(server_fd, skel, 512, max_wnd, buf, buf); + } + + /* [*] client advertising large window ** */ + + /* Test if client advertises small rcv window */ + do_test_client(server_fd, skel, 1, 1024, 1024); + + if (tcp_window_scaling) { + /* Medium size */ + do_test_client(server_fd, skel, 63, 63 * 1024, 63 * 1024); + } else { + do_test_client(server_fd, skel, 63, 63 * 1024, + MIN(max_wnd, 63 * 1024)); + } + + if (tcp_window_scaling) { + /* And large window */ + do_test_client(server_fd, skel, 128, 128 * 1024, 128 * 1024); + } else { + do_test_client(server_fd, skel, 128, 65535, + MIN(max_wnd, 65535)); + } + + if (tcp_window_scaling) { + /* Super large. */ + do_test_client(server_fd, skel, 206, 206 * 1024U, 206 * 1024U); + buf = 207 * 1024U + (tcp_timestamps ? 
12 : 0); + do_test_client(server_fd, skel, 512, buf, buf); + } +err: + if (server_fd != -1) + close(server_fd); +} + +#define PROC_TCP_TIMESTAMPS "/proc/sys/net/ipv4/tcp_timestamps" +#define PROC_TCP_WINDOW_SCALING "/proc/sys/net/ipv4/tcp_window_scaling" +#define PROC_TCP_WORKAROUND_SIGNED_WINDOWS \ + "/proc/sys/net/ipv4/tcp_workaround_signed_windows" + +void test_tcp_initrwnd(void) +{ + struct test_tcp_initrwnd *skel; + unsigned int i; + int cg_fd; + + int saved_tcp_timestamps = read_int_sysctl(PROC_TCP_TIMESTAMPS); + int saved_tcp_window_scaling = read_int_sysctl(PROC_TCP_WINDOW_SCALING); + int saved_tcp_workaround_signed_windows = + read_int_sysctl(PROC_TCP_WORKAROUND_SIGNED_WINDOWS); + + if (CHECK_FAIL(saved_tcp_timestamps == -1 || + saved_tcp_window_scaling == -1 || + saved_tcp_workaround_signed_windows == -1)) + return; + + cg_fd = test__join_cgroup(CG_NAME); + if (CHECK_FAIL(cg_fd < 0)) + return; + + skel = test_tcp_initrwnd__open_and_load(); + if (CHECK_FAIL(!skel)) { + close(cg_fd); + return; + } + + for (i = 0; i < 8; i++) { + tcp_timestamps = !!(i & 0x1); + tcp_window_scaling = !!(i & 0x2); + tcp_workaround_signed_windows = !!(i & 0x4); + + write_int_sysctl(PROC_TCP_TIMESTAMPS, tcp_timestamps); + write_int_sysctl(PROC_TCP_WINDOW_SCALING, tcp_window_scaling); + write_int_sysctl(PROC_TCP_WORKAROUND_SIGNED_WINDOWS, + tcp_workaround_signed_windows); + + run_tests(cg_fd, skel); + } + + write_int_sysctl(PROC_TCP_TIMESTAMPS, saved_tcp_timestamps); + write_int_sysctl(PROC_TCP_WINDOW_SCALING, saved_tcp_window_scaling); + write_int_sysctl(PROC_TCP_WORKAROUND_SIGNED_WINDOWS, + saved_tcp_workaround_signed_windows); + + test_tcp_initrwnd__destroy(skel); + + close(cg_fd); +} diff --git a/tools/testing/selftests/bpf/progs/test_tcp_initrwnd.c b/tools/testing/selftests/bpf/progs/test_tcp_initrwnd.c new file mode 100644 index 000000000000..d532e9e2d344 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_tcp_initrwnd.c @@ -0,0 +1,30 @@ +// SPDX-License-Identifier: 
GPL-2.0 OR BSD-3-Clause +// Copyright (c) 2022 Cloudflare + +#include <linux/bpf.h> + +#include <bpf/bpf_helpers.h> + +int initrwnd; + +SEC("sockops") +int bpf_testcb(struct bpf_sock_ops *skops) +{ + int rv = -1; + int op; + + op = (int)skops->op; + + switch (op) { + case BPF_SOCK_OPS_RWND_INIT: + rv = initrwnd; + break; + + default: + rv = -1; + } + skops->reply = rv; + return 1; +} + +char _license[] SEC("license") = "Dual BSD/GPL"; -- 2.25.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
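For readers skimming the expectation matrix in run_tests(), the window caps the assertions encode can be summarized in a small model (an editor's paraphrase of the test's expectations, not code from the patch):

```python
# A rough model of the caps the selftest asserts: with window scaling off, the
# 16-bit TCP window field limits the advertised window to 65535 bytes, or
# 32767 when the tcp_workaround_signed_windows sysctl is set; with scaling on,
# initrwnd (in packets) times the MSS can be advertised, subject to the
# receive buffer.

MSS = 1024  # the tests pin TCP_MAXSEG to 1024 bytes

def expected_window(initrwnd_pkts, window_scaling, signed_windows_workaround):
    """Receive window a peer configured with `initrwnd` packets can advertise."""
    wnd = initrwnd_pkts * MSS
    if not window_scaling:
        # Without RFC 7323 window scaling, the raw window field caps us.
        wnd = min(wnd, 32767 if signed_windows_workaround else 65535)
    return wnd

# These mirror the do_test_server() expectations above:
assert expected_window(1, True, False) == 1024          # small, below 64KiB
assert expected_window(63, True, False) == 63 * 1024    # borderline
assert expected_window(128, True, False) == 128 * 1024  # crossing 64KiB
assert expected_window(128, False, False) == 65535      # capped w/o scaling
assert expected_window(128, False, True) == 32767       # signed-window cap
```

The 32767 cap exists because tcp_workaround_signed_windows accommodates remote stacks that misinterpret the 16-bit window field as a signed value.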
* Re: [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB
  2022-07-21 15:10 [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB Marek Majkowski
  2022-07-21 15:10 ` [PATCH net-next 1/2] " Marek Majkowski
  2022-07-21 15:10 ` [PATCH net-next 2/2] Tests for RTAX_INITRWND Marek Majkowski
@ 2022-07-22  1:54 ` Jakub Kicinski
  2 siblings, 0 replies; 7+ messages in thread

From: Jakub Kicinski @ 2022-07-22 1:54 UTC (permalink / raw)
To: Marek Majkowski
Cc: netdev, bpf, kernel-team, ivan, edumazet, davem, pabeni, ast, daniel, andrii

On Thu, 21 Jul 2022 17:10:39 +0200 Marek Majkowski wrote:
> Among many route options we support initrwnd/RTAX_INITRWND path
> attribute:
>
> $ ip route change local 127.0.0.0/8 dev lo initrwnd 1024
>
> This sets the initial receive window size (in packets). However, it's
> not very useful in practice. For smaller buffers (<128KiB) it can be
> used to bring the initial receive window down, but it's hard to
> imagine when this is useful. The same effect can be achieved with
> TCP_WINDOW_CLAMP / RTAX_WINDOW option.

I think you based this on bpf-next, so you should put bpf-next in the
subject; it doesn't apply to net-next. Either way, let's wait for Eric
to comment.
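The selftests in patch 2 pull rcv_ssthresh out of the raw TCP_INFO blob with offset macros. A user-space sketch of the same probe might look like the following; note that the TCP_INFO option number and the field offset are assumptions based on <netinet/tcp.h> and the current struct tcp_info layout, not something the patch defines, and the offset may differ across kernel versions:

```python
import socket
import struct

TCP_INFO = 11          # from <netinet/tcp.h>; CPython does not export it
RCV_SSTHRESH_OFF = 64  # assumed offset of tcpi_rcv_ssthresh: 8 one-byte
                       # fields followed by 14 __u32 fields in struct tcp_info

def rcv_ssthresh(sock):
    # Fetch the raw struct tcp_info and pull one __u32 out of it, the way
    # the selftest's TCPI_RCV_SSTHRESH() macro indexes into info[].
    raw = sock.getsockopt(socket.IPPROTO_TCP, TCP_INFO, 256)
    return struct.unpack_from("I", raw, RCV_SSTHRESH_OFF)[0]

# Loopback demo: a freshly connected socket starts with rcv_ssthresh capped
# around 64KiB, which is exactly the limit the RTAX_INITRWND patch raises.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
sd, _ = srv.accept()
ssthresh = rcv_ssthresh(cli)
print("client rcv_ssthresh:", ssthresh)
for s in (sd, cli, srv):
    s.close()
```

This is Linux-only and is a diagnostic sketch rather than a stable API; real code should use a binding that tracks the kernel's struct layout.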
end of thread, other threads: [~2022-07-27 12:55 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-21 15:10 [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB Marek Majkowski
2022-07-21 15:10 ` [PATCH net-next 1/2] " Marek Majkowski
2022-07-22  9:23   ` Eric Dumazet
2022-07-27 11:19     ` Marek Majkowski
2022-07-27 12:54       ` Eric Dumazet
2022-07-21 15:10 ` [PATCH net-next 2/2] Tests for RTAX_INITRWND Marek Majkowski
2022-07-22  1:54 ` [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB Jakub Kicinski