From: Yuchung Cheng <ycheng@google.com>
To: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>,
	David Miller <davem@davemloft.net>,
	netdev <netdev@vger.kernel.org>, kernel-team <kernel-team@fb.com>,
	Neil Spring <ntspring@fb.com>,
	Neal Cardwell <ncardwell@google.com>
Subject: Re: [PATCH net] net: tcp: don't allocate fast clones for fastopen SYN
Date: Tue, 2 Mar 2021 12:52:14 -0800
Message-ID: <CAK6E8=fL7HP3ObOrUtR=UbR5ZrCDjc0qQ-t7cD9oUMorWFsKwg@mail.gmail.com>
In-Reply-To: <CANn89iLaQuCGeWOh7Hp8X9dL09FhPP8Nwj+zV=rhYX7Cq7efpg@mail.gmail.com>

On Tue, Mar 2, 2021 at 11:58 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > When receiver does not accept TCP Fast Open it will only ack
> > the SYN, and not the data. We detect this and immediately queue
> > the data for (re)transmission in tcp_rcv_fastopen_synack().
> >
> > In DC networks with very low RTT and without RFS the SYN-ACK
> > may arrive before NIC driver reported Tx completion on
> > the original SYN. In which case skb_still_in_host_queue()
> > returns true and sender will need to wait for the retransmission
> > timer to fire milliseconds later.
> >
> > Revert back to non-fast clone skbs, this way
> > skb_still_in_host_queue() won't prevent the recovery flow
> > from completing.
> >
> > Suggested-by: Eric Dumazet <edumazet@google.com>
> > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly")
>
> Hmmm, not sure if this Fixes: tag makes sense.
>
> Really, if we delay TX completions by say 10 ms, other parts of the
> stack will misbehave anyway.
>
> Also, backporting this patch up to linux-3.19 is going to be tricky.
>
> The real issue here is that skb_still_in_host_queue() can give a false positive.
>
> I have mixed feelings here, as you can read my answer :/
>
> Maybe skb_still_in_host_queue() signal should not be used when a part
> of the SKB has been received/acknowledged by the remote peer
> (in this case the SYN part).

Thank you Eric and Jakub for working on the TFO issue.

I like this option the most because it's more generic and easier to
understand. Would it be easy to implement by checking snd_una, etc.?
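
To make the snd_una idea concrete, here is a rough userspace sketch, not
kernel code. The struct and function names (model_skb, seq_before,
model_defer_retransmit) are made up for illustration, and the sketch assumes
the skb still covers the SYN's sequence number when the SYN-ACK arrives.
Whether that holds for the real SYN-data skb is not something this model
establishes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Wraparound-safe sequence comparison, same idea as the kernel's before(). */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Hypothetical stand-in for the few fields the check would look at. */
struct model_skb {
	uint32_t seq;          /* first sequence number covered by the skb */
	bool     clone_in_use; /* models "Tx completion not yet reported" */
};

/*
 * Return true if retransmission should be deferred (the -EBUSY case).
 * Assumption: once snd_una has moved past skb->seq, the peer has received
 * at least part of the skb, so a pending Tx completion is ignored.
 */
static bool model_defer_retransmit(const struct model_skb *skb,
				   uint32_t snd_una)
{
	assert(skb != NULL);

	if (seq_before(skb->seq, snd_una))
		return false; /* partially acked: skip the host-queue signal */

	return skb->clone_in_use;
}
```

With seq = ISS and snd_una = ISS + 1 (only the SYN acked), the model does not
defer even though the completion is still pending. With snd_una == seq it
falls back to the existing host-queue signal.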




>
> Alternative is that drivers unable to TX complete their skbs in a
> reasonable time should call skb_orphan() to avoid skb_unclone()
> penalties (and this skb_still_in_host_queue() issue)
>
> If you really want to play and delay TX completions, maybe provide a
> way to disable skb_still_in_host_queue() globally,
> using a static key ?
>
> My personal WIP/hack was something like :
>
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 69a545db80d2ead47ffcf2f3819a6d066e95f35d..666f6f0a6a06fece204199e07a79e21d1faf8f92 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5995,7 +5995,8 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
>                 else
>                         tp->fastopen_client_fail = TFO_DATA_NOT_ACKED;
>                 skb_rbtree_walk_from(data) {
> -                       if (__tcp_retransmit_skb(sk, data, 1))
> +                       /* segs = -1 to bypass skb_still_in_host_queue() check */
> +                       if (__tcp_retransmit_skb(sk, data, -1))
>                                 break;
>                 }
>                 tcp_rearm_rto(sk);
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index fbf140a770d8e21b936369b79abbe9857537acd8..1d1489e596976e352fe7d5ccee7a6eae55fdbcce 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -3155,8 +3155,12 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
>                   sk->sk_sndbuf))
>                 return -EAGAIN;
>
> -       if (skb_still_in_host_queue(sk, skb))
> -               return -EBUSY;
> +       if (segs > 0) {
> +               if (skb_still_in_host_queue(sk, skb))
> +                       return -EBUSY;
> +       } else {
> +               segs = -segs;
> +       }
>
>         if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
>                 if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) {
>
>
> > Signed-off-by: Neil Spring <ntspring@fb.com>
> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> > ---
> >  net/ipv4/tcp_output.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index fbf140a770d8..cd9461588539 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -3759,9 +3759,16 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
> >         /* limit to order-0 allocations */
> >         space = min_t(size_t, space, SKB_MAX_HEAD(MAX_TCP_HEADER));
> >
> > -       syn_data = sk_stream_alloc_skb(sk, space, sk->sk_allocation, false);
> > +       syn_data = alloc_skb(MAX_TCP_HEADER + space, sk->sk_allocation);
> >         if (!syn_data)
> >                 goto fallback;
> > +       if (!sk_wmem_schedule(sk, syn_data->truesize)) {
> > +               __kfree_skb(syn_data);
> > +               goto fallback;
> > +       }
> > +       skb_reserve(syn_data, MAX_TCP_HEADER);
> > +       INIT_LIST_HEAD(&syn_data->tcp_tsorted_anchor);
> > +
> >         syn_data->ip_summed = CHECKSUM_PARTIAL;
> >         memcpy(syn_data->cb, syn->cb, sizeof(syn->cb));
> >         if (space) {
> > --
> > 2.26.2
> >
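
For readers following the thread, the race described in the commit message
can be modeled in userspace roughly as below. This is a toy, not the kernel
implementation: the toy_* names are invented, and the only assumption carried
over from the discussion is that a fast-clone original counts as "still in
the host queue" while its companion clone holds a reference, whereas an skb
allocated without a fast clone never trips the check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical allocation kinds: with or without a companion fast clone. */
enum toy_fclone { TOY_FCLONE_UNAVAILABLE, TOY_FCLONE_ORIG };

struct toy_skb {
	enum toy_fclone fclone;
	int fclone_ref; /* 1 = original only, 2 = clone still held by driver */
};

/* Transmit: a fast-clone original hands its companion clone to the driver. */
static void toy_xmit(struct toy_skb *skb)
{
	if (skb->fclone == TOY_FCLONE_ORIG)
		skb->fclone_ref++;
}

/* Tx completion: the driver frees the clone and drops the reference. */
static void toy_tx_complete(struct toy_skb *skb)
{
	assert(skb->fclone_ref >= 1);
	if (skb->fclone == TOY_FCLONE_ORIG && skb->fclone_ref > 1)
		skb->fclone_ref--;
}

/* Model of the check: only fast-clone originals can report "busy". */
static bool toy_still_in_host_queue(const struct toy_skb *skb)
{
	return skb->fclone == TOY_FCLONE_ORIG && skb->fclone_ref > 1;
}
```

If the SYN-ACK is processed between toy_xmit() and toy_tx_complete(), the
fast-clone skb reports busy and the retransmit is deferred. The plain skb
does not, which is the effect the patch relies on.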

