From: Jakub Kicinski <kuba@kernel.org>
To: Yunsheng Lin <linyunsheng@huawei.com>
Cc: <davem@davemloft.net>, <olteanv@gmail.com>, <ast@kernel.org>,
<daniel@iogearbox.net>, <andriin@fb.com>, <edumazet@google.com>,
<weiwan@google.com>, <cong.wang@bytedance.com>,
<ap420073@gmail.com>, <netdev@vger.kernel.org>,
<linux-kernel@vger.kernel.org>, <linuxarm@openeuler.org>,
<mkl@pengutronix.de>, <linux-can@vger.kernel.org>,
<jhs@mojatatu.com>, <xiyou.wangcong@gmail.com>,
<jiri@resnulli.us>, <andrii@kernel.org>, <kafai@fb.com>,
<songliubraving@fb.com>, <yhs@fb.com>, <john.fastabend@gmail.com>,
<kpsingh@kernel.org>, <bpf@vger.kernel.org>,
<jonas.bonn@netrounds.com>, <pabeni@redhat.com>,
<mzhivich@akamai.com>, <johunt@akamai.com>, <albcamus@gmail.com>,
<kehuan.feng@gmail.com>, <a.fatoum@pengutronix.de>,
<atenart@kernel.org>, <alexander.duyck@gmail.com>,
<hdanton@sina.com>, <jgross@suse.com>, <JKosina@suse.com>,
<mkubecek@suse.cz>, <bjorn@kernel.org>, <alobakin@pm.me>
Subject: Re: [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc
Date: Fri, 28 May 2021 18:00:12 -0700 [thread overview]
Message-ID: <20210528180012.676797d6@kicinski-fedora-PC1C0HJN.hsd1.ca.comcast.net> (raw)
In-Reply-To: <1622170197-27370-3-git-send-email-linyunsheng@huawei.com>
On Fri, 28 May 2021 10:49:56 +0800 Yunsheng Lin wrote:
> Currently pfifo_fast has both TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
> flag set, but queue discipline by-pass does not work for lockless
> qdisc because skb is always enqueued to qdisc even when the qdisc
> is empty, see __dev_xmit_skb().
>
> This patch calls sch_direct_xmit() to transmit the skb directly
> to the driver for empty lockless qdisc, which aviod enqueuing
> and dequeuing operation.
>
> As qdisc->empty is not reliable to indicate a empty qdisc because
> there is a time window between enqueuing and setting qdisc->empty.
> So we use the MISSED state added in commit a90c57f2cedd ("net:
> sched: fix packet stuck problem for lockless qdisc"), which
> indicate there is lock contention, suggesting that it is better
> not to do the qdisc bypass in order to avoid packet out of order
> problem.
>
> In order to make MISSED state reliable to indicate a empty qdisc,
> we need to ensure that testing and clearing of MISSED state is
> within the protection of qdisc->seqlock, only setting MISSED state
> can be done without the protection of qdisc->seqlock. A MISSED
> state testing is added without the protection of qdisc->seqlock to
> aviod doing unnecessary spin_trylock() for contention case.
>
> There are below cases that need special handling:
> 1. When MISSED state is cleared before another round of dequeuing
> in pfifo_fast_dequeue(), and __qdisc_run() might not be able to
> dequeue all skb in one round and call __netif_schedule(), which
> might result in a non-empty qdisc without MISSED set. In order
> to avoid this, the MISSED state is set for lockless qdisc and
> __netif_schedule() will be called at the end of qdisc_run_end.
>
> 2. The MISSED state also need to be set for lockless qdisc instead
> of calling __netif_schedule() directly when requeuing a skb for
> a similar reason.
>
> 3. For netdev queue stopped case, the MISSED case need clearing
> while the netdev queue is stopped, otherwise there may be
> unnecessary __netif_schedule() calling. So a new DRAINING state
> is added to indicate this case, which also indicate a non-empty
> qdisc.
>
> 4. As there is already netif_xmit_frozen_or_stopped() checking in
> dequeue_skb() and sch_direct_xmit(), which are both within the
> protection of qdisc->seqlock, but the same checking in
> __dev_xmit_skb() is without the protection, which might cause
> empty indication of a lockless qdisc to be not reliable. So
> remove the checking in __dev_xmit_skb(), and the checking in
> the protection of qdisc->seqlock seems enough to avoid the cpu
> consumption problem for netdev queue stopped case.
>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 3ed6bcc..177f240 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -37,8 +37,15 @@ enum qdisc_state_t {
> __QDISC_STATE_SCHED,
> __QDISC_STATE_DEACTIVATED,
> __QDISC_STATE_MISSED,
> + __QDISC_STATE_DRAINING,
> };
>
> +#define QDISC_STATE_MISSED BIT(__QDISC_STATE_MISSED)
> +#define QDISC_STATE_DRAINING BIT(__QDISC_STATE_DRAINING)
> +
> +#define QDISC_STATE_NON_EMPTY (QDISC_STATE_MISSED | \
> + QDISC_STATE_DRAINING)
> +
> struct qdisc_size_table {
> struct rcu_head rcu;
> struct list_head list;
> @@ -145,6 +152,11 @@ static inline bool qdisc_is_running(struct Qdisc *qdisc)
> return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
> }
>
> +static inline bool nolock_qdisc_is_empty(const struct Qdisc *qdisc)
> +{
> + return !(READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY);
> +}
> +
> static inline bool qdisc_is_percpu_stats(const struct Qdisc *q)
> {
> return q->flags & TCQ_F_CPUSTATS;
> @@ -206,10 +218,8 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
> spin_unlock(&qdisc->seqlock);
>
> if (unlikely(test_bit(__QDISC_STATE_MISSED,
> - &qdisc->state))) {
> - clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
> + &qdisc->state)))
Why remove the clear_bit()? Wasn't that added to avoid infinite
re-schedules?
> __netif_schedule(qdisc);
> - }
> } else {
> write_seqcount_end(&qdisc->running);
> }
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 50531a2..e4cc926 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3852,10 +3852,32 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> qdisc_calculate_pkt_len(skb, q);
>
> if (q->flags & TCQ_F_NOLOCK) {
> + if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
> + qdisc_run_begin(q)) {
> + /* Retest nolock_qdisc_is_empty() within the protection
> + * of q->seqlock to ensure qdisc is indeed empty.
> + */
> + if (unlikely(!nolock_qdisc_is_empty(q))) {
This is just for the DRAINING case right?
MISSED can be set at any moment, testing MISSED seems confusing.
Is it really worth the extra code?
> + rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> + __qdisc_run(q);
> + qdisc_run_end(q);
> +
> + goto no_lock_out;
> + }
> +
> + qdisc_bstats_cpu_update(q, skb);
> + if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
> + !nolock_qdisc_is_empty(q))
> + __qdisc_run(q);
> +
> + qdisc_run_end(q);
> + return NET_XMIT_SUCCESS;
> + }
> +
> rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> - if (likely(!netif_xmit_frozen_or_stopped(txq)))
> - qdisc_run(q);
> + qdisc_run(q);
>
> +no_lock_out:
> if (unlikely(to_free))
> kfree_skb_list(to_free);
> return rc;
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index fc8b56b..83d7f5f 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -52,6 +52,8 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
> */
> if (!netif_xmit_frozen_or_stopped(txq))
> set_bit(__QDISC_STATE_MISSED, &q->state);
> + else
> + set_bit(__QDISC_STATE_DRAINING, &q->state);
> }
>
> /* Main transmission queue. */
> @@ -164,9 +166,13 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>
> skb = next;
> }
> - if (lock)
> +
> + if (lock) {
> spin_unlock(lock);
> - __netif_schedule(q);
> + set_bit(__QDISC_STATE_MISSED, &q->state);
> + } else {
> + __netif_schedule(q);
Could we reorder qdisc_run_begin() with clear_bit(SCHED)
in net_tx_action() and add SCHED to the NON_EMPTY mask?
> + }
> }
>
> static void try_bulk_dequeue_skb(struct Qdisc *q,
> @@ -409,7 +415,11 @@ void __qdisc_run(struct Qdisc *q)
> while (qdisc_restart(q, &packets)) {
> quota -= packets;
> if (quota <= 0) {
> - __netif_schedule(q);
> + if (q->flags & TCQ_F_NOLOCK)
> + set_bit(__QDISC_STATE_MISSED, &q->state);
> + else
> + __netif_schedule(q);
> +
> break;
> }
> }
> @@ -680,13 +690,14 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
> if (likely(skb)) {
> qdisc_update_stats_at_dequeue(qdisc, skb);
> } else if (need_retry &&
> - test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
> + READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY) {
Do we really need to retry based on DRAINING being set?
Or is it just a convenient way of coding things up?
> /* Delay clearing the STATE_MISSED here to reduce
> * the overhead of the second spin_trylock() in
> * qdisc_run_begin() and __netif_schedule() calling
> * in qdisc_run_end().
> */
> clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
> + clear_bit(__QDISC_STATE_DRAINING, &qdisc->state);
next prev parent reply other threads:[~2021-05-29 1:00 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-05-28 2:49 [PATCH net-next 0/3] Some optimization for lockless qdisc Yunsheng Lin
2021-05-28 2:49 ` [PATCH net-next 1/3] net: sched: avoid unnecessary seqcount operation " Yunsheng Lin
2021-05-28 2:49 ` [PATCH net-next 2/3] net: sched: implement TCQ_F_CAN_BYPASS " Yunsheng Lin
2021-05-29 1:00 ` Jakub Kicinski [this message]
2021-05-29 1:44 ` Yunsheng Lin
2021-05-29 4:32 ` Jakub Kicinski
2021-05-29 7:03 ` Yunsheng Lin
2021-05-29 18:49 ` Jakub Kicinski
2021-05-30 1:37 ` Yunsheng Lin
2021-05-30 20:21 ` Jakub Kicinski
2021-05-31 0:40 ` Yunsheng Lin
2021-05-31 1:10 ` [Linuxarm] " Yunsheng Lin
2021-05-31 12:40 ` Yunsheng Lin
2021-06-01 4:51 ` Jakub Kicinski
2021-06-01 8:18 ` Yunsheng Lin
2021-06-01 20:48 ` Jakub Kicinski
2021-06-02 1:21 ` Yunsheng Lin
2021-06-02 16:28 ` Jakub Kicinski
2021-05-28 2:49 ` [PATCH net-next 3/3] net: sched: remove qdisc->empty " Yunsheng Lin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210528180012.676797d6@kicinski-fedora-PC1C0HJN.hsd1.ca.comcast.net \
--to=kuba@kernel.org \
--cc=JKosina@suse.com \
--cc=a.fatoum@pengutronix.de \
--cc=albcamus@gmail.com \
--cc=alexander.duyck@gmail.com \
--cc=alobakin@pm.me \
--cc=andrii@kernel.org \
--cc=andriin@fb.com \
--cc=ap420073@gmail.com \
--cc=ast@kernel.org \
--cc=atenart@kernel.org \
--cc=bjorn@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=cong.wang@bytedance.com \
--cc=daniel@iogearbox.net \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=hdanton@sina.com \
--cc=jgross@suse.com \
--cc=jhs@mojatatu.com \
--cc=jiri@resnulli.us \
--cc=john.fastabend@gmail.com \
--cc=johunt@akamai.com \
--cc=jonas.bonn@netrounds.com \
--cc=kafai@fb.com \
--cc=kehuan.feng@gmail.com \
--cc=kpsingh@kernel.org \
--cc=linux-can@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linuxarm@openeuler.org \
--cc=linyunsheng@huawei.com \
--cc=mkl@pengutronix.de \
--cc=mkubecek@suse.cz \
--cc=mzhivich@akamai.com \
--cc=netdev@vger.kernel.org \
--cc=olteanv@gmail.com \
--cc=pabeni@redhat.com \
--cc=songliubraving@fb.com \
--cc=weiwan@google.com \
--cc=xiyou.wangcong@gmail.com \
--cc=yhs@fb.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).