From: Daniel Dao <dqminh@cloudflare.com>
To: Eric Dumazet <edumazet@google.com>
Cc: netdev <netdev@vger.kernel.org>,
	kernel-team <kernel-team@cloudflare.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	David Miller <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>,
	Marek Majkowski <marek@cloudflare.com>
Subject: Re: Expensive tcp_collapse with high tcp_rmem limit
Date: Thu, 20 Jan 2022 17:29:52 +0000
Message-ID: <CA+wXwBSGsBjovTqvoPQEe012yEF2eYbnC5_0W==EAvWH1zbOAg@mail.gmail.com>
In-Reply-To: <CANn89iKBqPRHFy5U+SMxT5RUPkioDFrZ5rN5WKNwfzA-TkMhwA@mail.gmail.com>

On Thu, Jan 6, 2022 at 6:55 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Jan 6, 2022 at 10:52 AM Eric Dumazet <edumazet@google.com> wrote:
>
> > I think that you should first look if you are under some kind of attack [1]
> >
> > Eventually you would still have to make room, involving expensive copies.
> >
> > 12% of 16MB is still a lot of memory to copy.
> >
> > [1] Detecting an attack signature could allow you to zap the socket
> > and save ~16MB of memory per flow.

Sorry for the late reply; we spent the past few weeks gathering more data.

>   tid 0: rmem_alloc=16780416 sk_rcvbuf=16777216 rcv_ssthresh=2920
>   tid 0: advmss=1460 wclamp=4194304 rcv_wnd=450560
>   tid 0: len=3316 truesize=15808
>   tid 0: len=4106 truesize=16640
>   tid 0: len=3967 truesize=16512
>   tid 0: len=2988 truesize=15488
> > I think that you should first look if you are under some kind of attack [1]

This, and indeed the majority of similar occurrences, comes from a
websocket origin that can emit a large flow of tiny packets. Since the
tcp_collapse hiccups occur on a proxy node, we think a combination of
slow / unresponsive clients and the websocket traffic can trigger this:
tiny payloads pile up in the receive queue faster than a slow client
lets the proxy drain them.
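
(For scale: at the ~4.8x truesize/len ratio of the first sample above
(15808 / 3316), a 16MB receive budget charged by truesize fills while
holding only roughly 16MB / 4.8 ~= 3.5MB of actual payload.)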

As a workaround, we clamped the websocket sockets' rcvbuf to a smaller
value, which reduces the peak latency of tcp_collapse since we no
longer need to collapse up to 16MB in one go.
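
For reference, the clamp boils down to something like the following
minimal userspace sketch (illustrative only: clamp_rcvbuf is a made-up
helper, not our actual proxy code, and the value we deploy is chosen
per service):

  #include <stdio.h>
  #include <sys/socket.h>

  /* Cap the receive buffer of one socket. SO_RCVBUF overrides the
   * tcp_rmem[2] limit for this socket; the kernel doubles the
   * requested value to leave room for bookkeeping overhead
   * (truesize vs len). */
  static int clamp_rcvbuf(int fd, int bytes)
  {
          if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                         &bytes, sizeof(bytes)) < 0) {
                  perror("setsockopt(SO_RCVBUF)");
                  return -1;
          }
          return 0;
  }

One caveat: setting SO_RCVBUF locks the buffer size and disables
receive-buffer autotuning for that socket, so the clamp trades peak
throughput on long fat paths for bounded collapse latency.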

> What kind of NIC driver is used on your host ?

We are running the mlx5 driver.

> Except that you would still have to parse the linear list.

Most of the time when we see a large tcp_collapse latency, the bloated
skb is at the head of the receive queue. I guess the client is already
unresponsive, so the flow is full of bloated skbs. We would rather not
spend so much time collapsing these skbs.
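
To make "bloated" concrete, here is a rough illustration in C
(skb_sample is a stand-in struct for the two sk_buff fields in the
dump above, and the 2x threshold is our own heuristic for
illustration, not the kernel's exact collapse test):

  /* Stand-in for the two sk_buff fields we dumped above. */
  struct skb_sample {
          unsigned int len;      /* payload bytes */
          unsigned int truesize; /* bytes charged against sk_rcvbuf */
  };

  /* An skb is "bloated" when its memory charge far exceeds its
   * payload; every sample above is well past this 2x threshold
   * (e.g. 15808 vs 3316, about 4.8x). */
  static int skb_looks_bloated(const struct skb_sample *s)
  {
          return s->truesize > 2 * s->len;
  }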

