From: Eric Dumazet <edumazet@google.com>
To: "Mohamed Abuelfotoh, Hazem" <abuehaze@amazon.com>
Cc: Neal Cardwell <ncardwell@google.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>,
	"ycheng@google.com" <ycheng@google.com>,
	"weiwan@google.com" <weiwan@google.com>,
	"Strohman, Andy" <astroh@amazon.com>,
	"Herrenschmidt, Benjamin" <benh@amazon.com>
Subject: Re: [PATCH net-next] tcp: optimise receiver buffer autotuning initialisation for high latency connections
Date: Tue, 8 Dec 2020 00:22:15 +0100	[thread overview]
Message-ID: <CANn89iL48fnjMQ=rt7oGRC_cejhuwfvcyy5JcUs=k5Cet+-Vwg@mail.gmail.com> (raw)
In-Reply-To: <4235A2E1-A685-43DE-B513-C9163DE434CB@amazon.com>

On Mon, Dec 7, 2020 at 9:09 PM Mohamed Abuelfotoh, Hazem
<abuehaze@amazon.com> wrote:
>
> > I want to state again that using 536 bytes as a magic value makes no
> > sense to me.
>
> > autotuning might be delayed by one RTT, this does not match numbers
> > given by Mohamed (flows stuck in low speed)
>
> > autotuning is a heuristic, and because it has one RTT latency, it is
> > crucial to get proper initial rcvmem values.
>
> > People using MTU=9000 should know they have to tune tcp_rmem[1]
> > accordingly, especially when using drivers consuming one page per
> > incoming MSS.
>
>
>
> The magic number would be 10*rcv_mss=5360, not 536, and in my opinion
> that is a lot of data to have to send in a security attack. If we are
> talking about a DDoS attack triggering autotuning at 5360 bytes, I'd
> say the attacker would also be able to trigger it by sending 64KB. I
> totally agree, though, that it would be easier with a lower
> rcvq_space.space; it's always a tradeoff between security and
> performance.
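
For reference, here is a minimal sketch of the arithmetic in question
(not kernel code; the 536 figure comes from the initial icsk_ack.rcv_mss
estimate being capped at TCP_MSS_DEFAULT, and the advmss values are
assumed, illustrative ones):

/* Illustrative arithmetic only: candidate initial values for
 * tp->rcvq_space.space under the two formulas discussed above.
 */
#include <stdio.h>

#define TCP_INIT_CWND 10

int main(void)
{
	unsigned int rcv_mss_init = 536;  /* initial rcv_mss estimate, capped at 536 */
	unsigned int advmss_1500  = 1460; /* assumed advmss for MTU 1500 */
	unsigned int advmss_9000  = 8960; /* assumed advmss for MTU 9000 */

	printf("10 * rcv_mss       = %u bytes\n", TCP_INIT_CWND * rcv_mss_init); /* 5360  */
	printf("10 * advmss (1500) = %u bytes\n", TCP_INIT_CWND * advmss_1500);  /* 14600 */
	printf("10 * advmss (9000) = %u bytes\n", TCP_INIT_CWND * advmss_9000);  /* 89600 */
	return 0;
}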



>
> Other options would be to either consider the configured MTU in the
> rcv_wnd calculation or check the MTU before calculating the initial
> receive space. We have to make sure that the initial receive space is
> lower than the initial receive window, so that autotuning works
> regardless of the MTU configured on the receiver, and only people
> using jumbo frames pay the price, if we agree that jumbo-frame users
> are expected to have machines with more memory. I'd say something
> like the below should work:
>
> void tcp_init_buffer_space(struct sock *sk)
> {
>         int tcp_app_win = sock_net(sk)->ipv4.sysctl_tcp_app_win;
>         struct inet_connection_sock *icsk = inet_csk(sk);
>         struct tcp_sock *tp = tcp_sk(sk);
>         int maxwin;
>
>         if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
>                 tcp_sndbuf_expand(sk);
>         if (tp->advmss < 6000)
>                 tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);

This is just another hack, based on 'magic' numbers.

>         else
>                 tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss);
>         tcp_mstamp_refresh(tp);
>         tp->rcvq_space.time = tp->tcp_mstamp;
>         tp->rcvq_space.seq = tp->copied_seq;
>
>
>
> I don't think we should rely on admins manually tuning tcp_rmem[1]
> when jumbo frames are in use, and Linux users shouldn't expect a
> performance regression after a kernel upgrade. Although [1] is the
> only public report of this issue, I am pretty sure we will see more
> users reporting it as the major Linux distributions move to kernel
> 5.4 as their stable version. In summary, we should come up with
> something, either the proposed patch or something else, to avoid
> admins having to do this manual job.
>



Default MTU is 1500, not 9000.

I hinted in my very first reply to you that MTU 9000 is not easy and
needs tuning. We could argue about that and try to make it less of a
pain in a future kernel (net-next).

<quote>Also worth noting that if you set MTU to 9000 (instead of
standard 1500), you probably need to tweak a few sysctls.
</quote>

I think I have asked you multiple times to test appropriate
tcp_rmem[1] settings...

I gave the reason why tcp_rmem[1] set to 131072 is not good for MTU
9000. I would prefer a solution that involves no kernel patch and no
backports, just a matter of educating sysadmins, for increased TCP
performance, especially when really using a 9000-byte MTU...
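
To put rough numbers on that (the tcp_adv_win_scale default of 1 and
the advmss/tcp_rmem[1] values below are assumptions for illustration,
not a recommendation):

/* Illustrative arithmetic only: why tcp_rmem[1] = 131072 is tight for a
 * 9000-byte MTU.  With the default tcp_adv_win_scale of 1, roughly half
 * of the receive buffer is reserved for overhead, so the usable window
 * derived from 131072 is about 65536 bytes, while ten jumbo-frame MSS
 * already amount to ~89600 bytes.  Raising the middle value of
 * net.ipv4.tcp_rmem to e.g. 262144 leaves enough headroom.
 */
#include <stdio.h>

static unsigned int win_from_space(unsigned int space, int adv_win_scale)
{
	/* mirrors the tcp_win_from_space() idea for a positive scale */
	return space - (space >> adv_win_scale);
}

int main(void)
{
	unsigned int advmss_9000 = 8960; /* assumed advmss for MTU 9000 */

	printf("usable window from 131072: %u\n", win_from_space(131072, 1));
	printf("usable window from 262144: %u\n", win_from_space(262144, 1));
	printf("10 * advmss (MTU 9000):    %u\n", 10 * advmss_9000);
	return 0;
}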

Your patch would change the behavior of the TCP stack for standard
MTU=1500 flows, which are still the vast majority. This is very risky.

Anyway, _if_ we really wanted to change the kernel (while keeping the
stupid tcp_rmem[1] value):

In the tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND *
tp->advmss); formula, the bug really is in the tp->rcv_wnd term, not
in the second one.

This is buggy because tcp_init_buffer_space() ends up with
tp->window_clamp smaller than tp->rcv_wnd, so tcp_grow_window() is not
able to raise tp->rcv_ssthresh.

The only mechanism allowing tp->window_clamp to be changed later would
be DRS (Dynamic Right-Sizing), so we had better use the proper limit
when initializing tp->rcvq_space.space.

This issue disappears if tcp_rmem[1] is slightly above 131072, because
then the following is not needed.

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9e8a6c1aa0190cc248b3b99b073a4c6e45884cf5..81b5d9375860ae583e08045fb25b089c456c60ab 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -534,6 +534,7 @@ static void tcp_init_buffer_space(struct sock *sk)

        tp->rcv_ssthresh = min(tp->rcv_ssthresh, tp->window_clamp);
        tp->snd_cwnd_stamp = tcp_jiffies32;
+       tp->rcvq_space.space = min(tp->rcv_ssthresh, tp->rcvq_space.space);
 }

 /* 4. Recalculate window clamp after socket hit its memory bounds. */
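
For what it's worth, here is a small userspace sketch of what that
one-line clamp changes, using made-up numbers for the corner case
described above (a small tcp_rmem[1] with a jumbo MTU); all values are
illustrative assumptions, not taken from a real trace:

/* Illustrative only: how clamping rcvq_space.space to rcv_ssthresh
 * changes the DRS starting point when window_clamp (and therefore
 * rcv_ssthresh) ends up below rcv_wnd.
 */
#include <stdio.h>

#define TCP_INIT_CWND 10

static unsigned int min_u32(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned int advmss       = 8960;  /* assumed advmss for MTU 9000 */
	unsigned int rcv_wnd      = 64000; /* assumed initial receive window */
	unsigned int rcv_ssthresh = 48000; /* assumed, below rcv_wnd here */

	unsigned int before = min_u32(rcv_wnd, TCP_INIT_CWND * advmss);
	unsigned int after  = min_u32(rcv_ssthresh, before);

	/* 'before' sits above what the clamped window can deliver in one
	 * RTT, so DRS never sees copied > rcvq_space.space; 'after' is
	 * reachable, so autotuning can start growing the buffer. */
	printf("rcvq_space.space before the patch: %u\n", before);
	printf("rcvq_space.space after the patch:  %u\n", after);
	return 0;
}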


Thread overview: 22+ messages
2020-12-04 18:06 [PATCH net-next] tcp: optimise receiver buffer autotuning initialisation for high latency connections Hazem Mohamed Abuelfotoh
2020-12-04 18:19 ` Mohamed Abuelfotoh, Hazem
2020-12-04 18:41   ` Eric Dumazet
     [not found]     ` <3F02FF08-EDA6-4DFD-8D93-479A5B05E25A@amazon.com>
2020-12-07 15:25       ` Eric Dumazet
2020-12-07 16:09         ` Mohamed Abuelfotoh, Hazem
2020-12-07 16:22           ` Eric Dumazet
2020-12-07 16:33             ` Neal Cardwell
2020-12-07 17:08               ` Eric Dumazet
2020-12-07 20:09                 ` Mohamed Abuelfotoh, Hazem
2020-12-07 23:22                   ` Eric Dumazet [this message]
2020-12-07 17:16               ` Mohamed Abuelfotoh, Hazem
2020-12-07 17:27                 ` Eric Dumazet
2020-12-08 16:28                   ` Mohamed Abuelfotoh, Hazem
2020-12-08 16:30                     ` Mohamed Abuelfotoh, Hazem
2020-12-08 16:46                     ` Eric Dumazet
2020-12-07 16:34             ` Mohamed Abuelfotoh, Hazem
2020-12-07 17:46               ` Greg KH
2020-12-07 17:54                 ` Mohamed Abuelfotoh, Hazem
2020-12-04 19:10 ` Eric Dumazet
2020-12-04 21:28 ` Neal Cardwell
2020-12-07 11:46   ` [PATCH net] tcp: fix receive buffer autotuning to trigger for any valid advertised MSS Hazem Mohamed Abuelfotoh
2020-12-07 18:53     ` Jakub Kicinski
