Re: [PATCH net-next] net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()

From: Eric Dumazet <edumazet@google.com>
To: Jakub Kicinski <kuba@kernel.org>
Cc: Wei Wang <weiwan@google.com>,
	netdev@vger.kernel.org,  "David S . Miller" <davem@davemloft.net>,
	cgroups@vger.kernel.org, linux-mm@kvack.org,
	 Shakeel Butt <shakeelb@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	 Neil Spring <ntspring@meta.com>,
	ycheng@google.com
Subject: Re: [PATCH net-next] net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()
Date: Thu, 13 Oct 2022 15:02:26 -0700
Message-ID: <CANn89iJ0SYX_oxjZE_2BOLzWXemA2mZeMeOdPoEFiu-AxE2GMQ@mail.gmail.com>
In-Reply-To: <20221013144950.44b52f90@kernel.org>

On Thu, Oct 13, 2022 at 2:49 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 12 Oct 2022 16:33:00 -0700 Jakub Kicinski wrote:
> > This patch is causing a little bit of pain to us, to workloads running
> > with just memory.max set. After this change the TCP rx path no longer
> > forces the charging.
> >
> > Any recommendation for the fix? Setting memory.high a few MB under
> > memory.max seems to remove the failures.
>
> Eric, is there anything that would make TCP perform particularly
> poorly under memory pressure?
>
> Dropping and pruning happens a lot here:
>
> # nstat -a | grep -i -E 'Prune|Drop'
> TcpExtPruneCalled               1202577            0.0
> TcpExtOfoPruned                 734606             0.0
> TcpExtTCPOFODrop                64191              0.0
> TcpExtTCPRcvQDrop               384305             0.0
>
> Same workload on 5.6 kernel:
>
> TcpExtPruneCalled               1223043            0.0
> TcpExtOfoPruned                 3377               0.0
> TcpExtListenDrops               10596              0.0
> TcpExtTCPOFODrop                22                 0.0
> TcpExtTCPRcvQDrop               734                0.0
>
> From a quick look at the code and with what Shakeel explained in mind -
> previously we would have "loaded up the cache" after the first failed
> try, so we never got into the loop inside tcp_try_rmem_schedule() which
> most likely nukes the entire OFO queue:
>
> static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
>                                  unsigned int size)
> {
>         if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
>             !sk_rmem_schedule(sk, skb, size)) {
>             /* ^ would fail but "load up the cache" ^ */
>
>                 if (tcp_prune_queue(sk) < 0)
>                         return -1;
>
>                 /* v this one would not fail due to the cache v */
>                 while (!sk_rmem_schedule(sk, skb, size)) {
>                         if (!tcp_prune_ofo_queue(sk))
>                                 return -1;
>
> Neil mentioned that he's seen multi-second stalls when SACKed segments
> get dropped from the OFO queue. The sender waits for a very long time
> before retrying something that was already SACKed if the receiver keeps
> SACKing new, later segments. This happens even when the ACK reaches the
> previously SACKed block, which should prove to the sender that something
> is very wrong.
>
> I tried to repro this with packetdrill and it's not exactly what I
> see: I need to keep shortening the RTT, otherwise the retx comes
> out before the next SACK arrives.
>
> I'll try to read the code, and maybe I'll get lucky and manage to capture
> the exact impacted flows :S But does anything of this nature ring a
> bell?
>
> `../common/defaults.sh`
>
>     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>    +0 bind(3, ..., ...) = 0
>    +0 listen(3, 1) = 0
>
>    +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 8>
>    +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
>   +.1 < . 1:1(0) ack 1 win 2048
>    +0 accept(3, ..., ...) = 4
>
>    +0 write(4, ..., 60000) = 60000
>    +0 > P. 1:10001(10000) ack 1
>
> // Do some SACK-ing
>   +.1 < . 1:1(0) ack 1 win 513 <sack 1001:2001,nop,nop>
> +.001 < . 1:1(0) ack 1 win 513 <sack 1001:2001 3001:4001 5001:6001,nop,nop>
> // ..and we pretend we lost 1001:2001
> +.001 < . 1:1(0) ack 1 win 513 <sack 2001:10001,nop,nop>
>
> // re-xmit holes and send more
>    +0 > . 10001:11001(1000) ack 1
>    +0 > . 1:1001(1000) ack 1
>    +0 > . 2001:3001(1000) ack 1 win 256
>    +0 > P. 11001:13001(2000) ack 1 win 256
>    +0 > P. 13001:15001(2000) ack 1 win 256
>
>   +.1 < . 1:1(0) ack 1001 win 513 <sack 2001:15001,nop,nop>
>
>    +0 > P. 15001:18001(3000) ack 1 win 256
>    +0 > P. 18001:20001(2000) ack 1 win 256
>    +0 > P. 20001:22001(2000) ack 1 win 256
>
>   +.1 < . 1:1(0) ack 1001 win 513 <sack 2001:22001,nop,nop>
>
>    +0 > P. 22001:24001(2000) ack 1 win 256
>    +0 > P. 24001:26001(2000) ack 1 win 256
>    +0 > P. 26001:28001(2000) ack 1 win 256
>    +0 > .  28001:29001(1000) ack 1 win 256
>
> +0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:29001,nop,nop>
>
>    +0 > P. 29001:31001(2000) ack 1 win 256
>    +0 > P. 31001:33001(2000) ack 1 win 256
>    +0 > P. 33001:35001(2000) ack 1 win 256
>    +0 > . 35001:36001(1000) ack 1 win 256
>
> +0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:36001,nop,nop>
>
>    +0 > P. 36001:38001(2000) ack 1 win 256
>    +0 > P. 38001:40001(2000) ack 1 win 256
>    +0 > P. 40001:42001(2000) ack 1 win 256
>    +0 > .  42001:43001(1000) ack 1 win 256
>
> +0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:43001,nop,nop>
>
>    +0 > P. 43001:45001(2000) ack 1 win 256
>    +0 > P. 45001:47001(2000) ack 1 win 256
>    +0 > P. 47001:49001(2000) ack 1 win 256
>    +0 > .  49001:50001(1000) ack 1 win 256
>
> +0.04 < . 1:1(0) ack 1001 win 257 <sack 2001:50001,nop,nop>
>
>    +0 > P. 50001:52001(2000) ack 1 win 256
>    +0 > P. 52001:54001(2000) ack 1 win 256
>    +0 > P. 54001:56001(2000) ack 1 win 256
>    +0 > .  56001:57001(1000) ack 1 win 256
>
> +0.04 > . 1001:2001(1000) ack 1 win 256
>
>

This is SACK reneging. I would have to double-check Linux behavior, but
reverting to a timeout could very well happen.
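
For readers following along, here is a minimal sketch of how a sender
could notice reneging in a trace like the one quoted above. It is not
the kernel's code: struct seg, apply_sack(), sack_reneging_suspected()
and demo_trace() are hypothetical names made up for illustration, and
sequence-number wraparound is ignored. The idea is only that once the
cumulative ACK stops right at a segment the receiver had SACKed earlier,
the receiver has apparently discarded data it previously reported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified sender-side scoreboard entry (not a kernel type). */
struct seg {
	uint32_t start;	/* first sequence number of the segment */
	uint32_t end;	/* one past the last sequence number */
	bool sacked;	/* receiver SACKed this segment at some point */
};

/* Mark every segment fully covered by the SACK block [lo, hi). */
static void apply_sack(struct seg *q, size_t n, uint32_t lo, uint32_t hi)
{
	assert(lo < hi);	/* assumption: no sequence wraparound */
	for (size_t i = 0; i < n; i++)
		if (q[i].start >= lo && q[i].end <= hi)
			q[i].sacked = true;
}

/*
 * After a cumulative ACK for snd_una, inspect the first segment that is
 * still unacknowledged. If it was SACKed earlier, the receiver seems to
 * have dropped data it had reported, i.e. it may have reneged.
 */
static bool sack_reneging_suspected(const struct seg *q, size_t n,
				    uint32_t snd_una)
{
	for (size_t i = 0; i < n; i++) {
		if (q[i].end <= snd_una)
			continue;	/* already cumulatively acked */
		return q[i].sacked;
	}
	return false;			/* everything is acked */
}

/* Replays the first three 1000-byte segments of the quoted script. */
static bool demo_trace(void)
{
	struct seg q[] = {
		{ 1, 1001, false },
		{ 1001, 2001, false },
		{ 2001, 3001, false },
	};

	apply_sack(q, 3, 1001, 2001);	/* <sack 1001:2001> */
	apply_sack(q, 3, 2001, 3001);	/* later SACK omits 1001:2001 */
	return sack_reneging_suspected(q, 3, 1001);	/* ack 1001 */
}
```

In this sketch the sacked flag is sticky, so a later SACK that no longer
lists 1001:2001 does not clear it. When "ack 1001" arrives, the head of
the queue is a segment the receiver once SACKed, and demo_trace()
reports the suspicion. What a real stack does at that point (for
example, falling back to a retransmission timeout) is a separate
question.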

>   +.1 < . 1:1(0) ack 1001 win 257 <sack 2001:29001,nop,nop>
>