netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Eric Dumazet <edumazet@google.com>
To: Marek Majkowski <marek@cloudflare.com>
Cc: Lawrence Brakmo <brakmo@fb.com>, netdev <netdev@vger.kernel.org>,
	bpf <bpf@vger.kernel.org>,
	kernel-team <kernel-team@cloudflare.com>,
	Ivan Babrou <ivan@cloudflare.com>,
	David Miller <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>
Subject: Re: [PATCH net-next 1/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB
Date: Wed, 27 Jul 2022 14:54:59 +0200	[thread overview]
Message-ID: <CANn89i++56L396Mhr1LxL3UN6D9uPMGsp5yaDTY5n4bhuir_BQ@mail.gmail.com> (raw)
In-Reply-To: <CAJPywTKf1FdCRt2DZz3H+yhXqdFQ2tq9eNC4jtNHb0SgLwGfgA@mail.gmail.com>

On Wed, Jul 27, 2022 at 1:19 PM Marek Majkowski <marek@cloudflare.com> wrote:
>
> On Fri, Jul 22, 2022 at 11:23 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Jul 21, 2022 at 5:10 PM Marek Majkowski <marek@cloudflare.com> wrote:
> > >
> > > We already support RTAX_INITRWND / initrwnd path attribute:
> > >
> > >  $ ip route change local 127.0.0.0/8 dev lo initrwnd 1024
> > >
> > > However normally, the initial advertised receive window is limited to
> > > 64KiB by rcv_ssthresh, regardless of initrwnd. This patch changes
> > > that, bumping up rcv_ssthresh to value derived from initrwnd. This
> > > allows for larger initial advertised receive windows, which is useful
> > > for specific types of TCP flows: big BDP ones, where there is a lot of
> > > data to send immediately after the flow is established.
> > >
> > > There are three places where we initialize sockets:
> > >  - tcp_output:tcp_connect_init
> > >  - tcp_minisocks:tcp_openreq_init_rwin
> > >  - syncookies
> > >
> > > In the first two we already have a call to `tcp_rwnd_init_bpf` and
> > > `dst_metric(RTAX_INITRWND)` which retrieve the bpf/path initrwnd
> > > attribute. We use this value to bring `rcv_ssthresh` up, potentially
> > > above the traditional 64KiB.
> > >
> > > With higher initial `rcv_ssthresh` the receiver will open the receive
> > > window more aggresively, which can improve large BDP flows - large
> > > throughput and latency.
> > >
> > > This patch does not cover the syncookies case.
> > >
> > > Signed-off-by: Marek Majkowski <marek@cloudflare.com>
> > > ---
> > >  include/net/inet_sock.h  |  1 +
> > >  net/ipv4/tcp_minisocks.c |  8 ++++++--
> > >  net/ipv4/tcp_output.c    | 10 ++++++++--
> > >  3 files changed, 15 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
> > > index daead5fb389a..bc68c9b70942 100644
> > > --- a/include/net/inet_sock.h
> > > +++ b/include/net/inet_sock.h
> > > @@ -89,6 +89,7 @@ struct inet_request_sock {
> > >                                 no_srccheck: 1,
> > >                                 smc_ok     : 1;
> > >         u32                     ir_mark;
> > > +       u32                     rcv_ssthresh;

Please move this in struct tcp_request_sock

> >
> > Why do we need to store this value in the request_sock ?
> >
> > It is derived from a route attribute and MSS, all this should be
> > available when the full blown socket is created.
> >
> > It would also work even with syncookies.
>
> Eric,
>
> Thanks for the feedback. For some context, I published a blog post
> explaining this work in detail [1].
>
> https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/
>
> I understand the suggestion is to move tcp_rwnd_init_bpf +
> RTAX_INITRWND lookup from `tcp_openreq_init_rwin` into
> `tcp_create_openreq_child`.
>
> I gave it a try (patch: [2]), but I don't think this will work under
> all circumstances. The issue is that we need to advertise *some*
> window in the SYNACK packet, before creating the full blown socket.
>
> With RTAX_INITRWND it is possible to move the advertised window up, or
> down.
>
> In the latter case, of reducing the window, at the SYNACK moment we
> must know if the window is reduced under 64KiB. This is what happens
> right now, we can _reduce_ window with RTAX_INITRWND to small values,
> I guess down to 1 MSS. This smaller window is then advertised in the
> SYNACK.
>
> If we move RTAX_INITRWND lookup into the later
> `tcp_create_openreq_child` then it will be too late, we won't know the
> correct window size on SYNACK stage. We will likely end up sending
> large window on SYNACK and then a small window on subsequent ACK,
> violating TCP.
>
> There are two approaches here. First, keep the semantics and allow
> RTAX_INITRWND to _reduce_ the initial window.
>
> In this case there are four ways out of this:
>
> 1) Keep it as proposed, that indeed requires some new value in
> request_sock. (perhaps maybe it could be it smaller than u32)
>
> 2) Send the SYNACK with small/zero window, since we haven't done the
> initrwnd lookup at this stage, but that would be at least
> controversial, and also adds one more RTT to the common case. I don't
> think this is acceptable.
>
> 3) Do two initrwnd lookups. One in the `tcp_openreq_init_rwin` to
> figure out if the window is smaller than 64KiB, second one in
> `tcp_create_openreq_child` to figure out if the suggested window is
> larger than 64KiB.

I think syncookies can be handled, if you look at cookie_v6_check() &
cookie_v4_check()
after their calls to cookie_tcp_reqsk_alloc()

>
> 4) Abort the whole approach and recycle Ivan's
> bpf_setsockopt(TCP_BPF_RCV_SSTHRESH) approach [3]. But I prefer the route
> attribute approach, seems easier to use and more flexible.
>
> But, thinking about it, I don't think we could ever support reducing
> initial receive window in the syncookie case. Only (3) - two initrwnd
> lookups - could be made to work, but even that is controversial.
>
> However the intention of RTAX_INITRWND as far as I understand was to
> _increase_ rcv_ssthresh, back in the days when it started from 10MSS
> (so I was told).

That was before we fixed DRS and that we made initial RWIN 65535, the
max allowed value in a SYN , SYNACK packet.
But yes...

>
> So, we could change the semantics of RTAX_INITRWND to allow only
> *increasing* the window - and disallow reducing it. With such an
> approach indeed we could make the code as you suggested, and move the
> route attribute lookup away from minisocks into `tcp_create_openreq_child`.
>
> Marek
>
> [1] https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/
> [2] https://gist.github.com/majek/13848c050a3dc218ed295364ee717879
> [3] https://lore.kernel.org/bpf/20220111192952.49040-1-ivan@cloudflare.com/t/

  reply	other threads:[~2022-07-27 12:55 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-07-21 15:10 [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB Marek Majkowski
2022-07-21 15:10 ` [PATCH net-next 1/2] " Marek Majkowski
2022-07-22  9:23   ` Eric Dumazet
2022-07-27 11:19     ` Marek Majkowski
2022-07-27 12:54       ` Eric Dumazet [this message]
2022-07-21 15:10 ` [PATCH net-next 2/2] Tests for RTAX_INITRWND Marek Majkowski
2022-07-22  1:54 ` [PATCH net-next 0/2] RTAX_INITRWND should be able to bring the rcv_ssthresh above 64KiB Jakub Kicinski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CANn89i++56L396Mhr1LxL3UN6D9uPMGsp5yaDTY5n4bhuir_BQ@mail.gmail.com \
    --to=edumazet@google.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=brakmo@fb.com \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=ivan@cloudflare.com \
    --cc=kernel-team@cloudflare.com \
    --cc=kuba@kernel.org \
    --cc=marek@cloudflare.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).