Re: Initial TCP receive window is clamped to 64k by rcv_ssthresh

From: Eric Dumazet <edumazet@google.com>
To: Ivan Babrou <ivan@cloudflare.com>
Cc: Linux Kernel Network Developers <netdev@vger.kernel.org>,
	kernel-team <kernel-team@cloudflare.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	"David S. Miller" <davem@davemloft.net>
Subject: Re: Initial TCP receive window is clamped to 64k by rcv_ssthresh
Date: Wed, 22 Dec 2021 10:09:52 -0800
Message-ID: <CANn89i+mhqGaM2tuhgEmEPbbNu_59GGMhBMha4jnnzFE=UBNYg@mail.gmail.com>
In-Reply-To: <CABWYdi3bzd4P0nsyZe6P6coBCQ2jN=kVOJte62zKj=Q8iJCSOQ@mail.gmail.com>

On Wed, Dec 22, 2021 at 9:46 AM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> Hello,
>
> I noticed that the advertised TCP receive window in the first ACK from
> the client is clamped at 64k. I'm wondering if this is intentional.
>
> We have an environment with many pairs of distant servers connected by
> high BDP links. For reasons that aren't relevant here, we need to
> re-establish connections between those often and expect to have as few
> round trips as possible to get a response after a handshake.
>
> We have made BBR cooperate on the initcwnd front with TCP_BPF_IW and
> some code that remembers cwnd and lets new connections start with a
> high value. It's safe to assume that we set initcwnd to 250 from the
> server side. I have no issues with the congestion control side of
> things.
>
> We also have high rmem and wmem values and plenty of memory.
>
> The problem lies in the fact that no matter how high we crank up the
> initcwnd, the connection will hit the 64k wall of the receive window
> and will have to stall waiting on ACKs from the other side, which take
> a long while to arrive on high latency links. A realistic scenario:
>
> 1. TCP connection established, receive window = 64k.
> 2. Client sends a request.
> 3. Server userspace program generates a 120k response and writes it to
> the socket. That's T0.
> 4. Server sends 64k worth of data in TCP packets to the client.
> 5. Client sees the first 64k worth of data T0 + RTT/2 later.
> 6. Client sends ACKs to cover for the data it just received.
> 7. Server sees the ACKs T0 + 1 RTT later.
> 8. Server sends the remaining data.
> 9. Client sees the remaining data T0 + RTT + RTT/2 later.
>
> In my mind, on a good network (guarded by the initcwnd) I expect the
> whole response to be sent immediately at T0 and received RTT/2
> later.
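
The arithmetic behind the numbered scenario above can be sketched as
follows. This is a simplified model, not kernel code: it assumes cwnd
is never the limit, serialization time is negligible, one-way delay is
RTT/2, and the receiver reopens the full window with every round of
ACKs. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of receive-window-limited flights needed to send the response. */
size_t flights_needed(size_t response_bytes, size_t rwnd_bytes)
{
	assert(rwnd_bytes > 0);
	/* Ceiling division: a partial last flight still costs a flight. */
	return (response_bytes + rwnd_bytes - 1) / rwnd_bytes;
}

/*
 * Time from T0 until the last byte reaches the client, in units of RTT.
 * The first flight arrives after RTT/2; each extra flight has to wait
 * one full RTT for the ACKs that reopen the window.
 */
double delivery_time_rtts(size_t response_bytes, size_t rwnd_bytes)
{
	size_t flights = flights_needed(response_bytes, rwnd_bytes);

	if (flights == 0)
		return 0.0; /* nothing to send */
	return 0.5 + (double)(flights - 1);
}
```

Under these assumptions a 120k response behind a 64k window needs two
flights and completes at T0 + 1.5 RTT, as in step 9, while a 512k
window lets it complete at T0 + 0.5 RTT.
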
>
> The current TCP connection establishment code picks two window sizes
> in tcp_select_initial_window() during the SYN packet generation:
>
> * rcv_wnd to advertise (cannot be higher than 64k during SYN, as we
> don't know whether wscale is supported yet)
> * window_clamp (current max memory allowed for the socket, can be large)
>
> You can find these in code here:
>
> * https://elixir.bootlin.com/linux/v5.15.10/source/include/linux/tcp.h#L209
>
> The call into tcp_select_initial_window() is here:
>
> * https://elixir.bootlin.com/linux/v5.15.10/source/net/ipv4/tcp_output.c#L3682
>
> Immediately after that, rcv_ssthresh is set to rcv_wnd. This is the
> part that gives me pause.
>
> During the generation of the first ACK after the SYN-ACK is received
> on the client, assuming the window scaling is supported, I expect the
> client to advertise the whole buffer as available and let congestion
> control handle whether it can be filled from the sender side. What
> happens in reality is that rcv_ssthresh is sent as the window value.
> Unfortunately, rcv_ssthresh is limited to 64k from rcv_wnd as
> described above.
>
> My question is whether it should be limited to window_clamp in
> tcp_connect_init() instead.
>
> I tried looking through git history, and the following line has been
> there since the Git import in 2005:
>
>   tp->rcv_ssthresh = tp->rcv_wnd;
>
> I made a small patch that toggles rcv_ssthresh between rcv_wnd and
> window_clamp and I'm doing some testing to see if it solves my issue.
> I can see it advertise 512k receive buffer in the first ACK from the
> client, which seems to address my problem. I'm not sure if there's
> some drawback here.

The stack is conservative about increasing RWIN: it wants to receive
packets first, to get an idea of the skb->len/skb->truesize ratio it
needs to convert a memory budget into RWIN.

Some drivers have to allocate 16K buffers (or even 32K buffers) just
to hold one segment (of less than 1500 bytes of payload), while others
are able to pack memory more efficiently.
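
As a rough illustration of why that ratio matters, here is a
simplified model of the conversion. It is not the kernel's actual
algorithm, and the function name and the example numbers (a 512K
budget, 1448-byte payloads, a 2K buffer for the "efficient" case) are
assumptions made up for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale a memory budget by the payload/truesize ratio to estimate how
 * many payload bytes the receiver can afford to have in flight.
 * Illustrative only; very large budgets could overflow the product.
 */
uint64_t window_from_budget(uint64_t budget_bytes, uint32_t payload_len,
			    uint32_t truesize)
{
	assert(truesize > 0 && payload_len <= truesize);
	return budget_bytes * payload_len / truesize;
}
```

With 16K of memory charged per 1448-byte segment, a 512K budget only
covers 32 segments (46336 bytes of payload). If each segment were
charged 2K instead, the same budget would cover 256 segments (370688
bytes).
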

I guess that you could use eBPF code to precisely tweak the stack's
behavior to your needs.