Re: Expensive tcp_collapse with high tcp_rmem limit

linux-kernel.vger.kernel.org archive mirror

* Re: Expensive tcp_collapse with high tcp_rmem limit
       [not found] <CA+wXwBRbLq6SW39qCD8GNG98YD5BJR2MFXmJV2zU1xwFjC-V0A@mail.gmail.com>
@ 2022-01-05 13:38 ` Eric Dumazet
  2022-01-06 12:32   ` Daniel Dao
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Dumazet @ 2022-01-05 13:38 UTC (permalink / raw)
  To: Daniel Dao
  Cc: netdev, kernel-team, linux-kernel, David Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, Marek Majkowski

On Wed, Jan 5, 2022 at 4:15 AM Daniel Dao <dqminh@cloudflare.com> wrote:
>
> Hello,
>
> We are looking at increasing the maximum value of TCP receive buffer in order
> to take better advantage of high BDP links. For historical reasons (
> https://blog.cloudflare.com/the-story-of-one-latency-spike/), this was set to
> a lower than default value.
>
> We are still occasionally seeing long time spent in tcp_collapse, and the time
> seems to be proportional with max rmem. For example, with net.ipv4.tcp_rmem = 8192 2097152 16777216,
> we observe tcp_collapse latency with the following bpftrace command:
>

I suggest you add more traces, such as the payload/truesize ratio when
these events happen, along with tp->rcv_ssthresh and sk->sk_rcvbuf.

By default, the TCP stack assumes a conservative [1] payload/truesize
ratio of 50%.

This means that a 16MB sk->sk_rcvbuf translates to a TCP RWIN of 8MB.
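
The 50% figure can be reproduced with a little arithmetic. The sketch below
is only an illustration in Python, modelled on the kernel helper
tcp_win_from_space() as found in kernels of this era and assuming the
default tcp_adv_win_scale of 1. The name win_from_space is made up for this
example and is not a kernel API.

```python
def win_from_space(space: int, adv_win_scale: int = 1) -> int:
    """Approximate the usable TCP window for `space` bytes of receive buffer.

    Illustration only: mirrors the arithmetic of tcp_win_from_space() in
    kernels of this era, it is not the kernel code itself.
    """
    if adv_win_scale <= 0:
        # Non-positive scale: the window is space / 2^(-scale).
        return space >> (-adv_win_scale)
    # Positive scale: space / 2^scale is set aside for skb overhead.
    return space - (space >> adv_win_scale)


rcvbuf = 16 * 1024 * 1024  # the 16MB tcp_rmem maximum discussed here
print(win_from_space(rcvbuf))  # window under the assumed default scale of 1
```

With a scale of 1, half of the buffer is treated as overhead, which is where
the 8MB window for a 16MB buffer comes from.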

I suspect that you use XDP, and standard MTU=1500.
Drivers in XDP mode use one page (4096 bytes on x86) per incoming frame.
In this case, the ratio is ~1428/4096 = 35%
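
To make the consequence concrete, here is a rough Python sketch of how much
payload fits in a 16MB buffer at that ratio. The helper names are invented
for this example, and the model (every frame charged exactly one page) is a
simplifying assumption rather than a description of any particular driver.

```python
PAGE_SIZE = 4096          # one page per incoming frame (x86), as stated above
PAYLOAD_PER_FRAME = 1428  # payload figure quoted above for MTU=1500


def payload_ratio(payload: int, truesize: int) -> float:
    """Fraction of the memory charged to the socket that is real payload."""
    return payload / truesize


def payload_capacity(rcvbuf: int, ratio: float) -> int:
    """Payload bytes that fit before truesize accounting fills `rcvbuf`."""
    return int(rcvbuf * ratio)


ratio = payload_ratio(PAYLOAD_PER_FRAME, PAGE_SIZE)
capacity = payload_capacity(16 * 1024 * 1024, ratio)
print(f"ratio = {ratio:.1%}")
print(f"payload that fits in 16MB of truesize = {capacity} bytes")
```

Under these assumptions the buffer fills with under 6MB of payload, which is
less than the 8MB window derived from the 50% assumption, so the receiver
can run out of buffer before the sender runs out of window.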

This is one of the reasons we switched to a 4K MTU at Google: we get
an effective ratio close to 100% (even when XDP is used).

[1] The 50% ratio assumed by TCP is defeated by a small MSS and by
malicious traffic.


>   bpftrace -e 'kprobe:tcp_collapse { @start[tid] = nsecs; } kretprobe:tcp_collapse /@start[tid] != 0/ { $us = (nsecs - @start[tid])/1000; @us = hist($us); delete(@start[tid]); printf("%ld us\n", $us);} interval:s:6000 { exit(); }'
>   Attaching 3 probes...
>   15496 us
>   14301 us
>   12248 us
>   @us:
>   [8K, 16K)              3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> Spending up to 16ms with 16MiB maximum receive buffer seems high.  Are there any
> recommendations on possible approaches to reduce the tcp_collapse latency ?
> Would clamping the duration of a tcp_collapse call be reasonable, since we only
> need to spend enough time to free space to queue the required skb ?

It depends on whether the incoming skb is queued in the in-order queue
or the out-of-order queue.
For the out-of-order queue, we have a strategy in tcp_prune_ofo_queue()
which should work reasonably well after commit
72cd43ba64fc17 tcp: free batches of packets in tcp_prune_ofo_queue()

Given the nature of tcp_collapse(), limiting it to even 1ms of processing time
would still allow for malicious traffic to hurt you quite a lot.


* Re: Expensive tcp_collapse with high tcp_rmem limit
  2022-01-05 13:38 ` Expensive tcp_collapse with high tcp_rmem limit Eric Dumazet
@ 2022-01-06 12:32   ` Daniel Dao
  2022-01-06 18:52     ` Eric Dumazet
  0 siblings, 1 reply; 5+ messages in thread
From: Daniel Dao @ 2022-01-06 12:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, kernel-team, linux-kernel, David Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, Marek Majkowski

On Wed, Jan 5, 2022 at 1:38 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Jan 5, 2022 at 4:15 AM Daniel Dao <dqminh@cloudflare.com> wrote:
> >
> > Hello,
> >
> > We are looking at increasing the maximum value of TCP receive buffer in order
> > to take better advantage of high BDP links. For historical reasons (
> > https://blog.cloudflare.com/the-story-of-one-latency-spike/), this was set to
> > a lower than default value.
> >
> > We are still occasionally seeing long time spent in tcp_collapse, and the time
> > seems to be proportional with max rmem. For example, with net.ipv4.tcp_rmem = 8192 2097152 16777216,
> > we observe tcp_collapse latency with the following bpftrace command:
> >
>
> I suggest you add more traces, like the payload/truesize ratio when
> these events happen.
> and tp->rcv_ssthresh, sk->sk_rcvbuf
>
> TCP stack by default assumes a conservative [1] payload/truesize ratio of 50%

I forgot to add that for this experiment we also set
tcp_adv_win_scale = -2 to see if it reduces the chance of triggering
tcp_collapse.

>
> Meaning that a 16MB sk->rcvbuf would translate to a TCP RWIN of 8MB.
>
> I suspect that you use XDP, and standard MTU=1500.
> Drivers in XDP mode use one page (4096 bytes on x86) per incoming frame.
> In this case, the ratio is ~1428/4096 = 35%
>
> This is one of the reason we switched to a 4K MTU at Google, because we
> have an effective ratio close to 100% (even if XDP was used)
>
> [1] The 50% ratio of TCP is defeated with small MSS, and malicious traffic.

I updated the bpftrace script to get len/truesize data for the collapsed skbs:

  kprobe:tcp_collapse {
    $sk = (struct sock *) arg0;
    $tp = (struct tcp_sock *) arg0;
    printf("tid %d: rmem_alloc=%ld sk_rcvbuf=%ld rcv_ssthresh=%ld\n", tid,
        $sk->sk_backlog.rmem_alloc.counter, $sk->sk_rcvbuf, $tp->rcv_ssthresh);
    printf("tid %d: advmss=%ld wclamp=%ld rcv_wnd=%ld\n", tid, $tp->advmss,
        $tp->window_clamp, $tp->rcv_wnd);
    @start[tid] = nsecs;
  }

  kretprobe:tcp_collapse /@start[tid] != 0/ {
    $us = (nsecs - @start[tid])/1000;
    @us = hist($us);
    printf("tid %d: %ld us\n", tid, $us);
    delete(@start[tid]);
  }

  kprobe:tcp_collapse_one {
    $skb = (struct sk_buff *) arg1;
    printf("tid %d: s=%ld len=%ld truesize=%ld\n", tid,
        sizeof(struct sk_buff), $skb->len, $skb->truesize);
  }

  interval:s:6000 { exit(); }

Here is the output:

  tid 0: rmem_alloc=16780416 sk_rcvbuf=16777216 rcv_ssthresh=2920
  tid 0: advmss=1460 wclamp=4194304 rcv_wnd=450560
  tid 0: len=3316 truesize=15808
  tid 0: len=4106 truesize=16640
  tid 0: len=3967 truesize=16512
  tid 0: len=2988 truesize=15488
  ...
  tid 0: len=5279 truesize=17664
  tid 0: len=425 truesize=2048
  tid 0: 17176 us

The skbs do indeed look bloated (e.g. len=3316, truesize=15808), so
collapsing definitely helps. It just took a long time to go through
thousands of 16KB skbs.
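
For reference, the payload/truesize ratio of the skbs shown above can be
worked out with a few lines of Python. This is only a sketch over the six
sample lines visible in the output (the elided lines are not included), so
the aggregate is indicative at best.

```python
# (len, truesize) pairs copied from the trace output above.
samples = [
    (3316, 15808),
    (4106, 16640),
    (3967, 16512),
    (2988, 15488),
    (5279, 17664),
    (425, 2048),
]

# Per-skb ratio: how much of the charged memory is actual payload.
for length, truesize in samples:
    print(f"len={length:5d} truesize={truesize:5d} "
          f"ratio={length / truesize:.1%}")

total_len = sum(length for length, _ in samples)
total_truesize = sum(truesize for _, truesize in samples)
aggregate = total_len / total_truesize
print(f"aggregate ratio = {aggregate:.1%}")
```

Every sampled skb is well below the 50% ratio that the stack assumes by
default.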

>
>
> >   bpftrace -e 'kprobe:tcp_collapse { @start[tid] = nsecs; } kretprobe:tcp_collapse /@start[tid] != 0/ { $us = (nsecs - @start[tid])/1000; @us = hist($us); delete(@start[tid]); printf("%ld us\n", $us);} interval:s:6000 { exit(); }'
> >   Attaching 3 probes...
> >   15496 us
> >   14301 us
> >   12248 us
> >   @us:
> >   [8K, 16K)              3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >
> > Spending up to 16ms with 16MiB maximum receive buffer seems high.  Are there any
> > recommendations on possible approaches to reduce the tcp_collapse latency ?
> > Would clamping the duration of a tcp_collapse call be reasonable, since we only
> > need to spend enough time to free space to queue the required skb ?
>
> It depends if the incoming skb is queued in in-order queue or
> out-of-order queue.
> For out-of-orders, we have a strategy in tcp_prune_ofo_queue() which
> should work reasonably well after commit
> 72cd43ba64fc17 tcp: free batches of packets in tcp_prune_ofo_queue()
>
> Given the nature of tcp_collapse(), limiting it to even 1ms of processing time
> would still allow for malicious traffic to hurt you quite a lot.

I don't yet understand why we have cases of bloated skbs. But adapting
the batch prune strategy of tcp_prune_ofo_queue() to tcp_collapse makes
sense to me.

I think every collapsed skb saves us truesize - len (?), and we could
set a goal of freeing up 12.5% of sk_rcvbuf, the same as
tcp_prune_ofo_queue().
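
A rough back-of-the-envelope version of this idea, as a Python sketch. It
assumes that collapsing an skb really saves truesize - len (the open
question marked with "(?)" above), ignores the overhead of the skb that
receives the copied data, and treats the six sampled skbs as representative.
The helper names are invented for this example.

```python
SK_RCVBUF = 16777216  # sk_rcvbuf from the trace output above

# (len, truesize) pairs copied from the trace output above.
samples = [
    (3316, 15808), (4106, 16640), (3967, 16512),
    (2988, 15488), (5279, 17664), (425, 2048),
]


def prune_goal(rcvbuf: int) -> int:
    """12.5% of the receive buffer, i.e. rcvbuf / 8."""
    return rcvbuf >> 3


def skbs_needed(goal: int, pairs: list[tuple[int, int]]) -> int:
    """Estimate how many sample-like skbs must be collapsed to free `goal`.

    Uses the average (truesize - len) saving over `pairs`, rounded up.
    """
    total_saving = sum(truesize - length for length, truesize in pairs)
    # Ceiling of goal / (total_saving / len(pairs)) using integer math.
    return -(-goal * len(pairs) // total_saving)


goal = prune_goal(SK_RCVBUF)
needed = skbs_needed(goal, samples)
print(f"goal = {goal} bytes, about {needed} skbs to collapse")
```

Under these assumptions a batch goal would stop after a few hundred skbs
instead of walking the whole queue.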


* Re: Expensive tcp_collapse with high tcp_rmem limit
  2022-01-06 12:32   ` Daniel Dao
@ 2022-01-06 18:52     ` Eric Dumazet
  2022-01-06 18:55       ` Eric Dumazet
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Dumazet @ 2022-01-06 18:52 UTC (permalink / raw)
  To: Daniel Dao
  Cc: netdev, kernel-team, linux-kernel, David Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, Marek Majkowski

On Thu, Jan 6, 2022 at 4:32 AM Daniel Dao <dqminh@cloudflare.com> wrote:
>
> On Wed, Jan 5, 2022 at 1:38 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Jan 5, 2022 at 4:15 AM Daniel Dao <dqminh@cloudflare.com> wrote:
> > >
> > > Hello,
> > >
> > > We are looking at increasing the maximum value of TCP receive buffer in order
> > > to take better advantage of high BDP links. For historical reasons (
> > > https://blog.cloudflare.com/the-story-of-one-latency-spike/), this was set to
> > > a lower than default value.
> > >
> > > We are still occasionally seeing long time spent in tcp_collapse, and the time
> > > seems to be proportional with max rmem. For example, with net.ipv4.tcp_rmem = 8192 2097152 16777216,
> > > we observe tcp_collapse latency with the following bpftrace command:
> > >
> >
> > I suggest you add more traces, like the payload/truesize ratio when
> > these events happen.
> > and tp->rcv_ssthresh, sk->sk_rcvbuf
> >
> > TCP stack by default assumes a conservative [1] payload/truesize ratio of 50%
>
> I forgot to add that for this experiment we also set tcp_adv_win_scale
> = -2 to see if it
> reduces the chance of triggering tcp_collapse
>
> >
> > Meaning that a 16MB sk->rcvbuf would translate to a TCP RWIN of 8MB.
> >
> > I suspect that you use XDP, and standard MTU=1500.
> > Drivers in XDP mode use one page (4096 bytes on x86) per incoming frame.
> > In this case, the ratio is ~1428/4096 = 35%
> >
> > This is one of the reason we switched to a 4K MTU at Google, because we
> > have an effective ratio close to 100% (even if XDP was used)
> >
> > [1] The 50% ratio of TCP is defeated with small MSS, and malicious traffic.
>
> I updated the bpftrace script to get data on len/truesize on collapsed skb
>
>   kprobe:tcp_collapse {
>     $sk = (struct sock *) arg0;
>     $tp = (struct tcp_sock *) arg0;
>     printf("tid %d: rmem_alloc=%ld sk_rcvbuf=%ld rcv_ssthresh=%ld\n", tid,
>         $sk->sk_backlog.rmem_alloc.counter, $sk->sk_rcvbuf, $tp->rcv_ssthresh);
>     printf("tid %d: advmss=%ld wclamp=%ld rcv_wnd=%ld\n", tid, $tp->advmss,
>         $tp->window_clamp, $tp->rcv_wnd);
>     @start[tid] = nsecs;
>   }
>
>   kretprobe:tcp_collapse /@start[tid] != 0/ {
>     $us = (nsecs - @start[tid])/1000;
>     @us = hist($us);
>     printf("tid %d: %ld us\n", tid, $us);
>     delete(@start[tid]);
>   }
>
>   kprobe:tcp_collapse_one {
>     $skb = (struct sk_buff *) arg1;
>     printf("tid %d: s=%ld len=%ld truesize=%ld\n", tid, sizeof(struct
> sk_buff), $skb->len, $skb->truesize);
>   }
>
>   interval:s:6000 { exit(); }
>
> Here is the output:
>
>   tid 0: rmem_alloc=16780416 sk_rcvbuf=16777216 rcv_ssthresh=2920
>   tid 0: advmss=1460 wclamp=4194304 rcv_wnd=450560
>   tid 0: len=3316 truesize=15808
>   tid 0: len=4106 truesize=16640
>   tid 0: len=3967 truesize=16512
>   tid 0: len=2988 truesize=15488

Ouch.
What kind of NIC driver is used on your host ?

>   ...
>   tid 0: len=5279 truesize=17664
>   tid 0: len=425 truesize=2048
>   tid 0: 17176 us
>
> The skb looks indeed bloated (len=3316, truesize=15808), so collapsing
> definitely
> helps. It just took a long time to go through thousands of 16KB skb
>
> >
> >
> > >   bpftrace -e 'kprobe:tcp_collapse { @start[tid] = nsecs; } kretprobe:tcp_collapse /@start[tid] != 0/ { $us = (nsecs - @start[tid])/1000; @us = hist($us); delete(@start[tid]); printf("%ld us\n", $us);} interval:s:6000 { exit(); }'
> > >   Attaching 3 probes...
> > >   15496 us
> > >   14301 us
> > >   12248 us
> > >   @us:
> > >   [8K, 16K)              3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > >
> > > Spending up to 16ms with 16MiB maximum receive buffer seems high.  Are there any
> > > recommendations on possible approaches to reduce the tcp_collapse latency ?
> > > Would clamping the duration of a tcp_collapse call be reasonable, since we only
> > > need to spend enough time to free space to queue the required skb ?
> >
> > It depends if the incoming skb is queued in in-order queue or
> > out-of-order queue.
> > For out-of-orders, we have a strategy in tcp_prune_ofo_queue() which
> > should work reasonably well after commit
> > 72cd43ba64fc17 tcp: free batches of packets in tcp_prune_ofo_queue()
> >
> > Given the nature of tcp_collapse(), limiting it to even 1ms of processing time
> > would still allow for malicious traffic to hurt you quite a lot.
>
> I don't yet understand why we have cases of bloated skbs. But it seems
> like adapting the
> batch prune strategy in tcp_prune_ofo_queue() to tcp_collapse makes sense to me.
>

Except that you would still have to parse the linear list.

> I think every collapsed skb saves us truesize - len (?), and we can
> set goal to free up 12.5% of sk_rcvbuf
> same as tcp_prune_ofo_queue()

I think that you should first check whether you are under some kind of attack [1]

Eventually you would still have to make room, involving expensive copies.

12% of 16MB is still a lot of memory to copy.

[1] Detecting an attack signature could allow you to zap the socket
and save ~16MB of memory per flow.


* Re: Expensive tcp_collapse with high tcp_rmem limit
  2022-01-06 18:52     ` Eric Dumazet
@ 2022-01-06 18:55       ` Eric Dumazet
  2022-01-20 17:29         ` Daniel Dao
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Dumazet @ 2022-01-06 18:55 UTC (permalink / raw)
  To: Daniel Dao
  Cc: netdev, kernel-team, linux-kernel, David Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, Marek Majkowski

On Thu, Jan 6, 2022 at 10:52 AM Eric Dumazet <edumazet@google.com> wrote:

> I think that you should first look if you are under some kind of attack [1]
>
> Eventually you would still have to make room, involving expensive copies.
>
> 12% of 16MB is still a lot of memory to copy.
>
> [1] Detecting an attack signature could allow you to zap the socket
> and save ~16MB of memory per flow.

I forgot to ask, have you set tcp_min_snd_mss to a sensible value ?

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=5f3e2bf008c2221478101ee72f5cb4654b9fc363


* Re: Expensive tcp_collapse with high tcp_rmem limit
  2022-01-06 18:55       ` Eric Dumazet
@ 2022-01-20 17:29         ` Daniel Dao
  0 siblings, 0 replies; 5+ messages in thread
From: Daniel Dao @ 2022-01-20 17:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, kernel-team, linux-kernel, David Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, Marek Majkowski

On Thu, Jan 6, 2022 at 6:55 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Jan 6, 2022 at 10:52 AM Eric Dumazet <edumazet@google.com> wrote:
>
> > I think that you should first look if you are under some kind of attack [1]
> >
> > Eventually you would still have to make room, involving expensive copies.
> >
> > 12% of 16MB is still a lot of memory to copy.
> >
> > [1] Detecting an attack signature could allow you to zap the socket
> > and save ~16MB of memory per flow.

Sorry for the late reply; we spent the past few weeks gathering more
data.

>   tid 0: rmem_alloc=16780416 sk_rcvbuf=16777216 rcv_ssthresh=2920
>   tid 0: advmss=1460 wclamp=4194304 rcv_wnd=450560
>   tid 0: len=3316 truesize=15808
>   tid 0: len=4106 truesize=16640
>   tid 0: len=3967 truesize=16512
>   tid 0: len=2988 truesize=15488
> > I think that you should first look if you are under some kind of attack [1]

This, and indeed the majority of similar occurrences, comes from a
websocket origin that can emit a large flow of tiny packets. As the
tcp_collapse hiccups occur on a proxy node, we think that a combination
of slow or unresponsive clients and the websocket traffic can trigger
this.

We made a workaround that clamps the websocket's rcvbuf to a smaller
value. It reduces the peak latency of tcp_collapse, as we no longer
need to collapse up to 16MB.
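
For readers who want to try a similar clamp, one common way to bound a
single socket's receive buffer is the SO_RCVBUF socket option. The Python
sketch below is an assumption about how such a clamp could be done, not a
description of our proxy's code; the helper name and the 64KB value are
purely illustrative.

```python
import socket


def clamp_rcvbuf(sock: socket.socket, nbytes: int) -> int:
    """Request a fixed receive buffer and return the size the kernel reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


CLAMP_BYTES = 64 * 1024  # hypothetical clamp value, chosen for illustration

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    # Set the option before connect()/listen() so it applies from the start.
    effective = clamp_rcvbuf(s, CLAMP_BYTES)
    print(f"requested {CLAMP_BYTES}, kernel reports {effective}")
```

On Linux the kernel doubles the requested value to account for bookkeeping
overhead and caps the request at net.core.rmem_max, so the reported size
usually differs from the requested one. Setting SO_RCVBUF explicitly also
turns off receive buffer autotuning for that socket.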

> What kind of NIC driver is used on your host ?

We are running mlx5

> Except that you would still have to parse the linear list.

Most of the time when we see a high tcp_collapse latency, the bloated
skbs are at the top of the list. I guess the client is already
unresponsive, so the flow is full of bloated skbs. I would rather not
have to spend too much time collapsing these skbs.

