From: Eric Dumazet <edumazet@google.com>
To: Jakub Kicinski <kuba@kernel.org>
Cc: Wei Wang <weiwan@google.com>,
netdev@vger.kernel.org, "David S . Miller" <davem@davemloft.net>,
cgroups@vger.kernel.org, linux-mm@kvack.org,
Shakeel Butt <shakeelb@google.com>,
Roman Gushchin <roman.gushchin@linux.dev>,
Neil Spring <ntspring@meta.com>,
ycheng@google.com
Subject: Re: [PATCH net-next] net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()
Date: Thu, 13 Oct 2022 15:02:26 -0700
Message-ID: <CANn89iJ0SYX_oxjZE_2BOLzWXemA2mZeMeOdPoEFiu-AxE2GMQ@mail.gmail.com>
In-Reply-To: <20221013144950.44b52f90@kernel.org>
On Thu, Oct 13, 2022 at 2:49 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 12 Oct 2022 16:33:00 -0700 Jakub Kicinski wrote:
> > This patch is causing a little bit of pain to us, to workloads running
> > with just memory.max set. After this change the TCP rx path no longer
> > forces the charging.
> >
> > Any recommendation for the fix? Setting memory.high a few MB under
> > memory.max seems to remove the failures.
>
> Eric, is there anything that would make the TCP perform particularly
> poorly under mem pressure?
>
> Dropping and pruning happens a lot here:
>
> # nstat -a | grep -i -E 'Prune|Drop'
> TcpExtPruneCalled 1202577 0.0
> TcpExtOfoPruned 734606 0.0
> TcpExtTCPOFODrop 64191 0.0
> TcpExtTCPRcvQDrop 384305 0.0
>
> Same workload on 5.6 kernel:
>
> TcpExtPruneCalled 1223043 0.0
> TcpExtOfoPruned 3377 0.0
> TcpExtListenDrops 10596 0.0
> TcpExtTCPOFODrop 22 0.0
> TcpExtTCPRcvQDrop 734 0.0
>
> From a quick look at the code and with what Shakeel explained in mind -
> previously we would have "loaded up the cache" after the first failed
> try, so we never got into the loop inside tcp_try_rmem_schedule() which
> most likely nukes the entire OFO queue:
>
> static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
>                                  unsigned int size)
> {
>         if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
>             !sk_rmem_schedule(sk, skb, size)) {
>                 /* ^ would fail but "load up the cache" ^ */
>
>                 if (tcp_prune_queue(sk) < 0)
>                         return -1;
>
>                 /* v this one would not fail due to the cache v */
>                 while (!sk_rmem_schedule(sk, skb, size)) {
>                         if (!tcp_prune_ofo_queue(sk))
>                                 return -1;
>
> Neil mentioned that he's seen multi-second stalls when SACKed segments
> get dropped from the OFO queue. The sender waits for a very long time
> before retrying something that was already SACKed if the receiver keeps
> SACKing new, later segments. This happens even when the ACK reaches the
> previously SACKed block, which should prove to the sender that something
> is very wrong.
>
> I tried to repro this with packetdrill and it's not exactly what I
> see: I need to keep shortening the RTT, otherwise the retx comes
> out before the next SACK arrives.
>
> I'll try to read the code, and maybe I'll get lucky and manage to capture
> the exact impacted flows :S But does anything of this nature ring a
> bell?
>
> `../common/defaults.sh`
>
> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0 bind(3, ..., ...) = 0
> +0 listen(3, 1) = 0
>
> +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 8>
> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
> +.1 < . 1:1(0) ack 1 win 2048
> +0 accept(3, ..., ...) = 4
>
> +0 write(4, ..., 60000) = 60000
> +0 > P. 1:10001(10000) ack 1
>
> // Do some SACK-ing
> +.1 < . 1:1(0) ack 1 win 513 <sack 1001:2001,nop,nop>
> +.001 < . 1:1(0) ack 1 win 513 <sack 1001:2001 3001:4001 5001:6001,nop,nop>
> // ..and we pretend we lost 1001:2001
> +.001 < . 1:1(0) ack 1 win 513 <sack 2001:10001,nop,nop>
>
> // re-xmit holes and send more
> +0 > . 10001:11001(1000) ack 1
> +0 > . 1:1001(1000) ack 1
> +0 > . 2001:3001(1000) ack 1 win 256
> +0 > P. 11001:13001(2000) ack 1 win 256
> +0 > P. 13001:15001(2000) ack 1 win 256
>
> +.1 < . 1:1(0) ack 1001 win 513 <sack 2001:15001,nop,nop>
>
> +0 > P. 15001:18001(3000) ack 1 win 256
> +0 > P. 18001:20001(2000) ack 1 win 256
> +0 > P. 20001:22001(2000) ack 1 win 256
>
> +.1 < . 1:1(0) ack 1001 win 513 <sack 2001:22001,nop,nop>
>
> +0 > P. 22001:24001(2000) ack 1 win 256
> +0 > P. 24001:26001(2000) ack 1 win 256
> +0 > P. 26001:28001(2000) ack 1 win 256
> +0 > . 28001:29001(1000) ack 1 win 256
>
> +0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:29001,nop,nop>
>
> +0 > P. 29001:31001(2000) ack 1 win 256
> +0 > P. 31001:33001(2000) ack 1 win 256
> +0 > P. 33001:35001(2000) ack 1 win 256
> +0 > . 35001:36001(1000) ack 1 win 256
>
> +0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:36001,nop,nop>
>
> +0 > P. 36001:38001(2000) ack 1 win 256
> +0 > P. 38001:40001(2000) ack 1 win 256
> +0 > P. 40001:42001(2000) ack 1 win 256
> +0 > . 42001:43001(1000) ack 1 win 256
>
> +0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:43001,nop,nop>
>
> +0 > P. 43001:45001(2000) ack 1 win 256
> +0 > P. 45001:47001(2000) ack 1 win 256
> +0 > P. 47001:49001(2000) ack 1 win 256
> +0 > . 49001:50001(1000) ack 1 win 256
>
> +0.04 < . 1:1(0) ack 1001 win 257 <sack 2001:50001,nop,nop>
>
> +0 > P. 50001:52001(2000) ack 1 win 256
> +0 > P. 52001:54001(2000) ack 1 win 256
> +0 > P. 54001:56001(2000) ack 1 win 256
> +0 > . 56001:57001(1000) ack 1 win 256
>
> +0.04 > . 1001:2001(1000) ack 1 win 256
>
>
This is SACK reneging. I would have to double-check Linux behavior, but
reverting to a timeout could very well happen.
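
For illustration only, here is a minimal user-space sketch of one way a
sender could notice reneging: after processing the cumulative ACK, check
whether the first still-unacked segment had been SACKed earlier. The
struct seg type and the sack_reneged() helper are hypothetical names made
up for this sketch; this is not the kernel's code. It assumes the segments
are sorted by sequence number and ignores sequence wraparound.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sender-side record of one transmitted segment. */
struct seg {
	uint32_t start;  /* first sequence number of the segment */
	uint32_t end;    /* one past the last sequence number */
	bool sacked;     /* receiver SACKed this range at some point */
};

/*
 * Return true if the first segment not covered by the cumulative ACK
 * was previously SACKed. A receiver that still held that data would
 * have advanced the cumulative ACK past it, so this suggests the
 * receiver discarded (reneged on) data it had SACKed.
 *
 * Assumes segs[] is sorted by start; ignores sequence wraparound.
 */
static bool sack_reneged(const struct seg *segs, int n, uint32_t cum_ack)
{
	for (int i = 0; i < n; i++) {
		if (segs[i].end <= cum_ack)
			continue;          /* fully acknowledged, skip */
		return segs[i].sacked;     /* first unacked segment */
	}
	return false;                      /* everything is acked */
}
```

In the script above, 1001:2001 was SACKed early on and the later ACKs
carry "ack 1001" with SACK blocks starting at 2001, so a check of this
shape would flag the segment at 1001 as reneged.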
> +.1 < . 1:1(0) ack 1001 win 257 <sack 2001:29001,nop,nop>
>