All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jakub Kicinski <kuba@kernel.org>
To: Yunsheng Lin <linyunsheng@huawei.com>
Cc: Yunsheng Lin <yunshenglin0825@gmail.com>, <davem@davemloft.net>,
	<olteanv@gmail.com>, <ast@kernel.org>, <daniel@iogearbox.net>,
	<andriin@fb.com>, <edumazet@google.com>, <weiwan@google.com>,
	<cong.wang@bytedance.com>, <ap420073@gmail.com>,
	<netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<linuxarm@openeuler.org>, <mkl@pengutronix.de>,
	<linux-can@vger.kernel.org>, <jhs@mojatatu.com>,
	<xiyou.wangcong@gmail.com>, <jiri@resnulli.us>,
	<andrii@kernel.org>, <kafai@fb.com>, <songliubraving@fb.com>,
	<yhs@fb.com>, <john.fastabend@gmail.com>, <kpsingh@kernel.org>,
	<bpf@vger.kernel.org>, <jonas.bonn@netrounds.com>,
	<pabeni@redhat.com>, <mzhivich@akamai.com>, <johunt@akamai.com>,
	<albcamus@gmail.com>, <kehuan.feng@gmail.com>,
	<a.fatoum@pengutronix.de>, <atenart@kernel.org>,
	<alexander.duyck@gmail.com>, <hdanton@sina.com>,
	<jgross@suse.com>, <JKosina@suse.com>, <mkubecek@suse.cz>,
	<bjorn@kernel.org>, <alobakin@pm.me>
Subject: Re: [Linuxarm] Re: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc
Date: Mon, 31 May 2021 21:51:46 -0700	[thread overview]
Message-ID: <20210531215146.5ca802a5@kicinski-fedora-PC1C0HJN.hsd1.ca.comcast.net> (raw)
In-Reply-To: <428f92d8-f4a2-13cf-8dcc-b38d48a42965@huawei.com>

On Mon, 31 May 2021 20:40:01 +0800 Yunsheng Lin wrote:
> On 2021/5/31 9:10, Yunsheng Lin wrote:
> > On 2021/5/31 8:40, Yunsheng Lin wrote:  
> >> On 2021/5/31 4:21, Jakub Kicinski wrote:  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
> >>
> >> When nolock_qdisc_is_empty() is not re-checking under q->seqlock, we
> >> may have:
> >>
> >>
> >>         CPU1                                   CPU2
> >>   qdisc_run_begin(q)                            .
> >>           .                                enqueue skb1
> >> deuqueue skb1 and clear MISSED                  .
> >>           .                        nolock_qdisc_is_empty() return true
> >>     requeue skb                                 .
> >>    q->enqueue()                                 .
> >>     set MISSED                                  .
> >>         .                                       .
> >>  qdisc_run_end(q)                               .
> >>         .                              qdisc_run_begin(q)
> >>         .                             transmit skb2 directly
> >>         .                           transmit the requeued skb1
> >>
> >> The problem here is that skb1 and skb2  are from the same CPU, which
> >> means they are likely from the same flow, so we need to avoid this,
> >> right?  
> > 
> > 
> >          CPU1                                   CPU2
> >    qdisc_run_begin(q)                            .
> >            .                                enqueue skb1
> >      dequeue skb1                                .
> >            .                                     .
> > netdevice stopped and MISSED is clear            .
> >            .                        nolock_qdisc_is_empty() return true
> >      requeue skb                                 .
> >            .                                     .
> >            .                                     .
> >            .                                     .
> >   qdisc_run_end(q)                               .
> >            .                              qdisc_run_begin(q)
> >            .                             transmit skb2 directly
> >            .                           transmit the requeued skb1
> > 
> > The above sequence diagram seems more correct, it is basically about how to
> > avoid transmitting a packet directly bypassing the requeued packet.

I see, thanks! That explains the need. Perhaps we can rephrase the
comment? Maybe:

+			/* Retest nolock_qdisc_is_empty() within the protection
+			 * of q->seqlock to protect from racing with requeuing.
+			 */

> I had did some interesting testing to show how adjust a small number
> of code has some notiable performance degrade.
> 
> 1. I used below patch to remove the nolock_qdisc_is_empty() testing
>    under q->seqlock.
> 
> @@ -3763,17 +3763,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>         if (q->flags & TCQ_F_NOLOCK) {
>                 if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
>                     qdisc_run_begin(q)) {
> -                       /* Retest nolock_qdisc_is_empty() within the protection
> -                        * of q->seqlock to ensure qdisc is indeed empty.
> -                        */
> -                       if (unlikely(!nolock_qdisc_is_empty(q))) {
> -                               rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> -                               __qdisc_run(q);
> -                               qdisc_run_end(q);
> -
> -                               goto no_lock_out;
> -                       }
> -
>                         qdisc_bstats_cpu_update(q, skb);
>                         if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
>                             !nolock_qdisc_is_empty(q))
> @@ -3786,7 +3775,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>                 rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
>                 qdisc_run(q);
> 
> -no_lock_out:
>                 if (unlikely(to_free))
>                         kfree_skb_list(to_free);
>                 return rc;
> 
> which has the below performance improvement:
> 
>  threads      v1             v1 + above patch          delta
>     1       3.21Mpps            3.20Mpps               -0.3%
>     2       5.56Mpps            5.94Mpps               +4.9%
>     4       5.58Mpps            5.60Mpps               +0.3%
>     8       2.76Mpps            2.77Mpps               +0.3%
>    16       2.23Mpps            2.23Mpps               +0.0%
> 
> v1 = this patchset.
> 
> 
> 2. After the above testing, it seems worthwhile to remove the
>    nolock_qdisc_is_empty() testing under q->seqlock, so I used below
>    patch to make sure nolock_qdisc_is_empty() always return false for
>    netdev queue stopped case。
> 
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -38,6 +38,15 @@ EXPORT_SYMBOL(default_qdisc_ops);
>  static void qdisc_maybe_clear_missed(struct Qdisc *q,
>                                      const struct netdev_queue *txq)
>  {
> +       set_bit(__QDISC_STATE_DRAINING, &q->state);
> +
> +       /* Make sure DRAINING is set before clearing MISSED
> +        * to make sure nolock_qdisc_is_empty() always return
> +        * false for aoviding transmitting a packet directly
> +        * bypassing the requeued packet.
> +        */
> +       smp_mb__after_atomic();
> +
>         clear_bit(__QDISC_STATE_MISSED, &q->state);
> 
>         /* Make sure the below netif_xmit_frozen_or_stopped()
> @@ -52,8 +61,6 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
>          */
>         if (!netif_xmit_frozen_or_stopped(txq))
>                 set_bit(__QDISC_STATE_MISSED, &q->state);
> -       else
> -               set_bit(__QDISC_STATE_DRAINING, &q->state);
>  }

But this would not be enough because we may also clear MISSING 
in pfifo_fast_dequeue()?

> which has the below performance data:
> 
>  threads      v1          v1 + above two patch          delta
>     1       3.21Mpps            3.20Mpps               -0.3%
>     2       5.56Mpps            5.94Mpps               +4.9%
>     4       5.58Mpps            5.02Mpps                -10%
>     8       2.76Mpps            2.77Mpps               +0.3%
>    16       2.23Mpps            2.23Mpps               +0.0%
> 
> So the adjustment in qdisc_maybe_clear_missed() seems to have
> caused about 10% performance degradation for 4 threads case.
> 
> And the cpu topdown perf data suggested that icache missed and
> bad Speculation play the main factor to those performance difference.
> 
> I tried to control the above factor by removing the inline function
> and add likely and unlikely tag for netif_xmit_frozen_or_stopped()
> in sch_generic.c.
> 
> And after removing the inline mark for function in sch_generic.c
> and add likely/unlikely tag for netif_xmit_frozen_or_stopped()
> checking in in sch_generic.c, we got notiable performance improvement
> for 1/2 threads case(some performance improvement for ip forwarding
> test too), but not for 4 threads case.
> 
> So it seems we need to ignore the performance degradation for 4
> threads case? or any idea?

No ideas, are the threads pinned to CPUs in some particular way?

  reply	other threads:[~2021-06-01  4:51 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-28  2:49 [PATCH net-next 0/3] Some optimization for lockless qdisc Yunsheng Lin
2021-05-28  2:49 ` [PATCH net-next 1/3] net: sched: avoid unnecessary seqcount operation " Yunsheng Lin
2021-05-28  2:49 ` [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS " Yunsheng Lin
2021-05-29  1:00   ` Jakub Kicinski
2021-05-29  1:44     ` Yunsheng Lin
2021-05-29  4:32       ` Jakub Kicinski
2021-05-29  7:03         ` Yunsheng Lin
2021-05-29 18:49           ` Jakub Kicinski
2021-05-30  1:37             ` Yunsheng Lin
2021-05-30 20:21               ` Jakub Kicinski
2021-05-31  0:40                 ` Yunsheng Lin
2021-05-31  1:10                   ` [Linuxarm] " Yunsheng Lin
2021-05-31 12:40                     ` Yunsheng Lin
2021-06-01  4:51                       ` Jakub Kicinski [this message]
2021-06-01  8:18                         ` Yunsheng Lin
2021-06-01 20:48                           ` Jakub Kicinski
2021-06-02  1:21                             ` Yunsheng Lin
2021-06-02 16:28                               ` Jakub Kicinski
2021-05-28  2:49 ` [PATCH net-next 3/3] net: sched: remove qdisc->empty " Yunsheng Lin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210531215146.5ca802a5@kicinski-fedora-PC1C0HJN.hsd1.ca.comcast.net \
    --to=kuba@kernel.org \
    --cc=JKosina@suse.com \
    --cc=a.fatoum@pengutronix.de \
    --cc=albcamus@gmail.com \
    --cc=alexander.duyck@gmail.com \
    --cc=alobakin@pm.me \
    --cc=andrii@kernel.org \
    --cc=andriin@fb.com \
    --cc=ap420073@gmail.com \
    --cc=ast@kernel.org \
    --cc=atenart@kernel.org \
    --cc=bjorn@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=cong.wang@bytedance.com \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=hdanton@sina.com \
    --cc=jgross@suse.com \
    --cc=jhs@mojatatu.com \
    --cc=jiri@resnulli.us \
    --cc=john.fastabend@gmail.com \
    --cc=johunt@akamai.com \
    --cc=jonas.bonn@netrounds.com \
    --cc=kafai@fb.com \
    --cc=kehuan.feng@gmail.com \
    --cc=kpsingh@kernel.org \
    --cc=linux-can@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxarm@openeuler.org \
    --cc=linyunsheng@huawei.com \
    --cc=mkl@pengutronix.de \
    --cc=mkubecek@suse.cz \
    --cc=mzhivich@akamai.com \
    --cc=netdev@vger.kernel.org \
    --cc=olteanv@gmail.com \
    --cc=pabeni@redhat.com \
    --cc=songliubraving@fb.com \
    --cc=weiwan@google.com \
    --cc=xiyou.wangcong@gmail.com \
    --cc=yhs@fb.com \
    --cc=yunshenglin0825@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.