Re: [PATCH net-next] tcp: avoid spurious loopback retransmit

From: "dust.li" <dust.li@linux.alibaba.com>
To: Eric Dumazet <edumazet@google.com>
Cc: David Miller <davem@davemloft.net>,
	netdev <netdev@vger.kernel.org>,
	Tony Lu <tonylu@linux.alibaba.com>,
	Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Subject: Re: [PATCH net-next] tcp: avoid spurious loopback retransmit
Date: Wed, 9 Jun 2021 08:25:42 +0800	[thread overview]
Message-ID: <20210609002542.GO53857@linux.alibaba.com> (raw)
In-Reply-To: <CANn89i+VEA4rc3T_oC7tJXYvA7OAmDc=Vk_wyxYwzYz23nENPg@mail.gmail.com>

On Tue, Jun 08, 2021 at 06:04:14PM +0200, Eric Dumazet wrote:
>On Tue, Jun 8, 2021 at 5:09 AM dust.li <dust.li@linux.alibaba.com> wrote:
>>
>> On Mon, Jun 07, 2021 at 06:17:45PM +0200, Eric Dumazet wrote:
>> >On Mon, Jun 7, 2021 at 5:45 PM Dust Li <dust.li@linux.alibaba.com> wrote:
>> >>
>> >> We found there are pretty much loopback TCP retransmitions
>> >> on our online servers. Most of them are TLP retransmition.
>> >>
>> >> This is because for loopback communication, TLP is usally
>> >> triggered about 2ms after the last packet was sent if no
>> >> ACK was received within that period.
>> >> This condition can be met if there are some kernel tasks
>> >> running on that CPU for more than 2ms, which delays the
>> >> softirq to process the sd->backlog.
>> >>
>> >> We sampled the loopback TLP retransmit on our online servers,
>> >> and found an average 2K+ retransmit per hour. But in some cases,
>> >> this can be much bigger, I found a peak 40 retrans/s on the
>> >> same server.
>> >> Actually, those loopback retransmitions are not a big problem as
>> >> long as they don't happen too frequently. It's just spurious and
>> >> meanless and waste some CPU cycles.
>> >
>> >So, why do you send such a patch, adding a lot of code ?
>> It's because we are doing retransmition analysis, we did a statistic
>> on our online server and found the most frequented retransmitted
>> (src_ip, dst_ip) tuple of all is between (127.0.0.1, 127.0.0.1),
>> but actually there are no loopback drop at all.
>>
>> Those loopback retransmittions distract us, and are really meaningless,
>> that why I tried to get rid of it.
>>
>> >
>> >>
>> >> I also write a test module which just busy-loop in the kernel
>> >> for more then 2ms, and the lo retransmition can be triggered
>> >> every time we run the busy-loop on the target CPU.
>> >> With this patch, the retransmition is gone and the throughput
>> >> is not affected.
>> >
>> >
>> >This makes no sense to me.
>> >
>> >Are you running a pristine linux kernel, or some modified version of it ?
>> No, I did the exact same test on the current upstream kernel.
>>
>> >
>> >Why loopback is magical, compared to veth pair ?
>> >
>> >The only case where skb_fclone_busy() might have an issue is when a driver
>> >is very slow to perform tx completion (compared to possible RTX)
>> >
>> >But given that the loopback driver does an skb_orphan() in its
>> >ndo_start_xmit (loopback_xmit()),
>> >the skb_fclone_busy() should not be fired at all.
>> >
>> >(skb->sk is NULL, before packet is 'looped back')
>> >
>> >It seems your diagnosis is wrong.
>>
>> For loopback, this should be the opposite, because in most cases
>> the orignal packet maybe still in the per CPU backlog queue.
>> We want the skb_fclone_busy() to fire to prevent the retransmittion
>>
>> But now after skb->sk set to NULL in loopback_xmit(),
>> skb_still_in_host_queue() returns false, and triggered the unneeded
>> retransmittion.
>
>Honestly I do not see how this is even possible.
>
>Normal TCP RTO timers are at least 200 ms
>
>For fast retransmits, this would mean you have huge OOO issues, maybe
>because of crazy thread migrations.

Normal RTO and fast retransmits are rarely triggerred.
But for TLP timers, it is easy since its timeout is usally only 2ms.

Thanks.

>
>I would rather address the root cause instead of risking serious bugs,
>if the packet is _dropped_ by some eBPF filter or something.
>
>If we do not retransmit while we _must_, then the flow is definitely broken.
>
>I prefer spurious retransmits, especially because they seem to be
>absolute noise in normal setups.
>
>>
>> Given the following case:
>>
>>      CPU0                        CPU1
>> -----------------
>>       |
>>       v
>> loopback_xmit()                 -----
>>       |                           ^
>>       v                           |
>> enqueue_to_backlog()              | CPU1 is doing some work in
>>       |                           | kernel for more than 2ms without
>>  (more then 2ms                   | calling net_rx_action()
>>   without ACK)                    | CPU1's backlog queue not been processed.
>>       |                           v
>>       v                         -----
>> CPU0 try TLP probe                |
>> and skb_fclone_busy()             |
>> found skb been orphaned.          |
>> So, send the TLP probe            |
>>                                   v
>>                            process_backlog()
>>                                 After CPU0's TLP probe, CPU1 finnally
>>                                 come back and processed its backlog queue,
>>                                 and saw the original packet, send the ACK
>>                                 back to CPU0.
>>
>>
>> The problem here is packet sent by CPU0 is first queued in CPU1's
>> backlog queue, but CPU1 is busy doing something else for longer
>> then TLP probe time. And CPU0 don't think the packet is still in
>> "host queue", so he retransmitted.
>>
>> I think the core problem is how we define "host queue", currently we
>> only think qdisc and device driver is "host queue", I think we should
>> also cover the per CPU backlog queue.
>>
>>
>> For veth, since it also uses netif_rx() to receive the packet on the
>> peer's device, it should have the some problem.
>> Yeah, checking sock_is_loopback() seems not a good solution, maybe any
>> better suggestions ?
>>
>>
>> Thanks.
>>
>> >
>> >>
>> >> Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
>> >> ---
>> >>  include/linux/skbuff.h |  7 +++++--
>> >>  net/ipv4/tcp_output.c  | 31 +++++++++++++++++++++++++++----
>> >>  net/xfrm/xfrm_policy.c |  2 +-
>> >>  3 files changed, 33 insertions(+), 7 deletions(-)
>> >>
>> >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> >> index dbf820a50a39..290e0a6a3a47 100644
>> >> --- a/include/linux/skbuff.h
>> >> +++ b/include/linux/skbuff.h
>> >> @@ -1131,9 +1131,12 @@ struct sk_buff_fclones {
>> >>   * Returns true if skb is a fast clone, and its clone is not freed.
>> >>   * Some drivers call skb_orphan() in their ndo_start_xmit(),
>> >>   * so we also check that this didnt happen.
>> >> + * For loopback, the skb maybe in the target sock's receive_queue
>> >> + * we need to ignore that case.
>> >>   */
>> >>  static inline bool skb_fclone_busy(const struct sock *sk,
>> >> -                                  const struct sk_buff *skb)
>> >> +                                  const struct sk_buff *skb,
>> >> +                                  bool is_loopback)
>> >>  {
>> >>         const struct sk_buff_fclones *fclones;
>> >>
>> >> @@ -1141,7 +1144,7 @@ static inline bool skb_fclone_busy(const struct sock *sk,
>> >>
>> >>         return skb->fclone == SKB_FCLONE_ORIG &&
>> >>                refcount_read(&fclones->fclone_ref) > 1 &&
>> >> -              READ_ONCE(fclones->skb2.sk) == sk;
>> >> +              is_loopback ? true : READ_ONCE(fclones->skb2.sk) == sk;
>> >>  }
>> >>
>> >>  /**
>> >> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> >> index bde781f46b41..f51a6a565678 100644
>> >> --- a/net/ipv4/tcp_output.c
>> >> +++ b/net/ipv4/tcp_output.c
>> >> @@ -2771,6 +2771,20 @@ bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto)
>> >>         return true;
>> >>  }
>> >>
>> >> +static int sock_is_loopback(const struct sock *sk)
>> >> +{
>> >> +       struct dst_entry *dst;
>> >> +       int loopback = 0;
>> >> +
>> >> +       rcu_read_lock();
>> >> +       dst = rcu_dereference(sk->sk_dst_cache);
>> >> +       if (dst && dst->dev &&
>> >> +           (dst->dev->features & NETIF_F_LOOPBACK))
>> >> +               loopback = 1;
>> >> +       rcu_read_unlock();
>> >> +       return loopback;
>> >> +}
>> >> +
>> >>  /* Thanks to skb fast clones, we can detect if a prior transmit of
>> >>   * a packet is still in a qdisc or driver queue.
>> >>   * In this case, there is very little point doing a retransmit !
>> >> @@ -2778,15 +2792,24 @@ bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto)
>> >>  static bool skb_still_in_host_queue(struct sock *sk,
>> >>                                     const struct sk_buff *skb)
>> >>  {
>> >> -       if (unlikely(skb_fclone_busy(sk, skb))) {
>> >> -               set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
>> >> -               smp_mb__after_atomic();
>> >> -               if (skb_fclone_busy(sk, skb)) {
>> >> +       bool is_loopback = sock_is_loopback(sk);
>> >> +
>> >> +       if (unlikely(skb_fclone_busy(sk, skb, is_loopback))) {
>> >> +               if (!is_loopback) {
>> >> +                       set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
>> >> +                       smp_mb__after_atomic();
>> >> +                       if (skb_fclone_busy(sk, skb, is_loopback)) {
>> >> +                               NET_INC_STATS(sock_net(sk),
>> >> +                                             LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES);
>> >> +                               return true;
>> >> +                       }
>> >> +               } else {
>> >>                         NET_INC_STATS(sock_net(sk),
>> >>                                       LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES);
>> >>                         return true;
>> >>                 }
>> >>         }
>> >> +
>> >>         return false;
>> >>  }
>> >>
>> >> diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
>> >> index ce500f847b99..f8ea62a840e9 100644
>> >> --- a/net/xfrm/xfrm_policy.c
>> >> +++ b/net/xfrm/xfrm_policy.c
>> >> @@ -2846,7 +2846,7 @@ static int xdst_queue_output(struct net *net, struct sock *sk, struct sk_buff *s
>> >>         struct xfrm_policy *pol = xdst->pols[0];
>> >>         struct xfrm_policy_queue *pq = &pol->polq;
>> >>
>> >> -       if (unlikely(skb_fclone_busy(sk, skb))) {
>> >> +       if (unlikely(skb_fclone_busy(sk, skb, false))) {
>> >>                 kfree_skb(skb);
>> >>                 return 0;
>> >>         }
>> >> --
>> >> 2.19.1.3.ge56e4f7
>> >>