All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Dumazet <edumazet@google.com>
To: Dust Li <dust.li@linux.alibaba.com>
Cc: David Miller <davem@davemloft.net>,
	netdev <netdev@vger.kernel.org>,
	Tony Lu <tonylu@linux.alibaba.com>,
	Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Subject: Re: [PATCH net-next] tcp: avoid spurious loopback retransmit
Date: Mon, 7 Jun 2021 18:17:45 +0200	[thread overview]
Message-ID: <CANn89i+dDy6ev50mBMwoK7f0NN+0xHf8V-Jas8zAmew02hJV4w@mail.gmail.com> (raw)
In-Reply-To: <20210607154534.57034-1-dust.li@linux.alibaba.com>

On Mon, Jun 7, 2021 at 5:45 PM Dust Li <dust.li@linux.alibaba.com> wrote:
>
> We found there are pretty much loopback TCP retransmitions
> on our online servers. Most of them are TLP retransmition.
>
> This is because for loopback communication, TLP is usally
> triggered about 2ms after the last packet was sent if no
> ACK was received within that period.
> This condition can be met if there are some kernel tasks
> running on that CPU for more than 2ms, which delays the
> softirq to process the sd->backlog.
>
> We sampled the loopback TLP retransmit on our online servers,
> and found an average 2K+ retransmit per hour. But in some cases,
> this can be much bigger, I found a peak 40 retrans/s on the
> same server.
> Actually, those loopback retransmitions are not a big problem as
> long as they don't happen too frequently. It's just spurious and
> meanless and waste some CPU cycles.

So, why do you send such a patch, adding a lot of code ?

>
> I also write a test module which just busy-loop in the kernel
> for more then 2ms, and the lo retransmition can be triggered
> every time we run the busy-loop on the target CPU.
> With this patch, the retransmition is gone and the throughput
> is not affected.


This makes no sense to me.

Are you running a pristine linux kernel, or some modified version of it ?

Why loopback is magical, compared to veth pair ?

The only case where skb_fclone_busy() might have an issue is when a driver
is very slow to perform tx completion (compared to possible RTX)

But given that the loopback driver does an skb_orphan() in its
ndo_start_xmit (loopback_xmit()),
the skb_fclone_busy() should not be fired at all.

(skb->sk is NULL, before packet is 'looped back')

It seems your diagnosis is wrong.

>
> Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
> ---
>  include/linux/skbuff.h |  7 +++++--
>  net/ipv4/tcp_output.c  | 31 +++++++++++++++++++++++++++----
>  net/xfrm/xfrm_policy.c |  2 +-
>  3 files changed, 33 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index dbf820a50a39..290e0a6a3a47 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1131,9 +1131,12 @@ struct sk_buff_fclones {
>   * Returns true if skb is a fast clone, and its clone is not freed.
>   * Some drivers call skb_orphan() in their ndo_start_xmit(),
>   * so we also check that this didnt happen.
> + * For loopback, the skb maybe in the target sock's receive_queue
> + * we need to ignore that case.
>   */
>  static inline bool skb_fclone_busy(const struct sock *sk,
> -                                  const struct sk_buff *skb)
> +                                  const struct sk_buff *skb,
> +                                  bool is_loopback)
>  {
>         const struct sk_buff_fclones *fclones;
>
> @@ -1141,7 +1144,7 @@ static inline bool skb_fclone_busy(const struct sock *sk,
>
>         return skb->fclone == SKB_FCLONE_ORIG &&
>                refcount_read(&fclones->fclone_ref) > 1 &&
> -              READ_ONCE(fclones->skb2.sk) == sk;
> +              is_loopback ? true : READ_ONCE(fclones->skb2.sk) == sk;
>  }
>
>  /**
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index bde781f46b41..f51a6a565678 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2771,6 +2771,20 @@ bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto)
>         return true;
>  }
>
> +static int sock_is_loopback(const struct sock *sk)
> +{
> +       struct dst_entry *dst;
> +       int loopback = 0;
> +
> +       rcu_read_lock();
> +       dst = rcu_dereference(sk->sk_dst_cache);
> +       if (dst && dst->dev &&
> +           (dst->dev->features & NETIF_F_LOOPBACK))
> +               loopback = 1;
> +       rcu_read_unlock();
> +       return loopback;
> +}
> +
>  /* Thanks to skb fast clones, we can detect if a prior transmit of
>   * a packet is still in a qdisc or driver queue.
>   * In this case, there is very little point doing a retransmit !
> @@ -2778,15 +2792,24 @@ bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto)
>  static bool skb_still_in_host_queue(struct sock *sk,
>                                     const struct sk_buff *skb)
>  {
> -       if (unlikely(skb_fclone_busy(sk, skb))) {
> -               set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
> -               smp_mb__after_atomic();
> -               if (skb_fclone_busy(sk, skb)) {
> +       bool is_loopback = sock_is_loopback(sk);
> +
> +       if (unlikely(skb_fclone_busy(sk, skb, is_loopback))) {
> +               if (!is_loopback) {
> +                       set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
> +                       smp_mb__after_atomic();
> +                       if (skb_fclone_busy(sk, skb, is_loopback)) {
> +                               NET_INC_STATS(sock_net(sk),
> +                                             LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES);
> +                               return true;
> +                       }
> +               } else {
>                         NET_INC_STATS(sock_net(sk),
>                                       LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES);
>                         return true;
>                 }
>         }
> +
>         return false;
>  }
>
> diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> index ce500f847b99..f8ea62a840e9 100644
> --- a/net/xfrm/xfrm_policy.c
> +++ b/net/xfrm/xfrm_policy.c
> @@ -2846,7 +2846,7 @@ static int xdst_queue_output(struct net *net, struct sock *sk, struct sk_buff *s
>         struct xfrm_policy *pol = xdst->pols[0];
>         struct xfrm_policy_queue *pq = &pol->polq;
>
> -       if (unlikely(skb_fclone_busy(sk, skb))) {
> +       if (unlikely(skb_fclone_busy(sk, skb, false))) {
>                 kfree_skb(skb);
>                 return 0;
>         }
> --
> 2.19.1.3.ge56e4f7
>

  reply	other threads:[~2021-06-07 16:21 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-07 15:45 [PATCH net-next] tcp: avoid spurious loopback retransmit Dust Li
2021-06-07 16:17 ` Eric Dumazet [this message]
2021-06-08  3:09   ` dust.li
2021-06-08 16:04     ` Eric Dumazet
2021-06-09  0:25       ` dust.li
2021-06-09  7:03         ` Eric Dumazet
2021-06-09  8:45           ` dust.li
2021-06-09 16:46             ` Eric Dumazet
2021-06-22  8:25 ` [tcp] 7bf6f9c438: kernel-selftests.net/mptcp.mptcp_connect.sh.fail kernel test robot
2021-06-22  8:25   ` kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CANn89i+dDy6ev50mBMwoK7f0NN+0xHf8V-Jas8zAmew02hJV4w@mail.gmail.com \
    --to=edumazet@google.com \
    --cc=davem@davemloft.net \
    --cc=dust.li@linux.alibaba.com \
    --cc=netdev@vger.kernel.org \
    --cc=tonylu@linux.alibaba.com \
    --cc=xuanzhuo@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.