* [PATCH net v8 0/3] fix packet stuck problem for lockless qdisc
@ 2021-05-14  2:26 Yunsheng Lin
From: Yunsheng Lin @ 2021-05-14  2:26 UTC
  To: davem, kuba
  Cc: olteanv, ast, daniel, andriin, edumazet, weiwan, cong.wang,
	ap420073, netdev, linux-kernel, linuxarm, mkl, linux-can, jhs,
	xiyou.wangcong, jiri, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, bpf, jonas.bonn, pabeni, mzhivich,
	johunt, albcamus, kehuan.feng, a.fatoum, atenart,
	alexander.duyck, hdanton, jgross, JKosina, mkubecek, bjorn,
	alobakin

This patchset fixes the packet stuck problem mentioned in [1].

Patch 1: Add STATE_MISSED flag to fix packet stuck problem.
Patch 2: Fix a tx_action rescheduling problem after STATE_MISSED
         flag is added in patch 1.
Patch 3: Fix the significantly higher CPU consumption problem when
         multiple threads are competing on a saturated outgoing
         device.

V8: Change function name in patch 3 as suggested by Jakub, adjust
    commit log in patch 2, and add Acked-by from Jakub.
V7: Fix netif_tx_wake_queue() data race noted by Jakub.
V6: Some performance optimization in patch 1 suggested by Jakub
    and drop NET_XMIT_DROP checking in patch 3.
V5: add patch 3 to fix the problem reported by Michal Kubecek.
V4: Change STATE_NEED_RESCHEDULE to STATE_MISSED and add patch 2.

[1]. https://lkml.org/lkml/2019/10/9/42

Yunsheng Lin (3):
  net: sched: fix packet stuck problem for lockless qdisc
  net: sched: fix tx action rescheduling issue during deactivation
  net: sched: fix tx action reschedule issue with stopped queue

 include/net/pkt_sched.h   |  7 +------
 include/net/sch_generic.h | 35 ++++++++++++++++++++++++++++++++-
 net/core/dev.c            | 29 ++++++++++++++++++++++-----
 net/sched/sch_generic.c   | 50 +++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 107 insertions(+), 14 deletions(-)

-- 
2.7.4



* [PATCH net v8 1/3] net: sched: fix packet stuck problem for lockless qdisc
From: Yunsheng Lin @ 2021-05-14  2:26 UTC
  To: davem, kuba
  Cc: olteanv, ast, daniel, andriin, edumazet, weiwan, cong.wang,
	ap420073, netdev, linux-kernel, linuxarm, mkl, linux-can, jhs,
	xiyou.wangcong, jiri, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, bpf, jonas.bonn, pabeni, mzhivich,
	johunt, albcamus, kehuan.feng, a.fatoum, atenart,
	alexander.duyck, hdanton, jgross, JKosina, mkubecek, bjorn,
	alobakin

Lockless qdisc has the following concurrency problem:
    cpu0                 cpu1
     .                     .
q->enqueue                 .
     .                     .
qdisc_run_begin()          .
     .                     .
dequeue_skb()              .
     .                     .
sch_direct_xmit()          .
     .                     .
     .                q->enqueue
     .             qdisc_run_begin()
     .            return and do nothing
     .                     .
qdisc_run_end()            .

cpu1 enqueues a skb without calling __qdisc_run() because cpu0
has not released the lock yet and spin_trylock() returns false
for cpu1 in qdisc_run_begin(), and cpu0 does not see the skb
enqueued by cpu1 when calling dequeue_skb() because cpu1 may
enqueue the skb after cpu0 calls dequeue_skb() and before
cpu0 calls qdisc_run_end().
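
For reference, a minimal sketch of the pre-patch lockless path,
reconstructed from the context lines of the diff below (simplified;
the locked path and the seqcount handling are elided):

	static inline bool qdisc_run_begin(struct Qdisc *qdisc)
	{
		if (qdisc->flags & TCQ_F_NOLOCK) {
			/* A failed trylock leaves no trace for the
			 * current lock holder to observe, so the skb
			 * just enqueued may sit in the queue until
			 * the next unrelated enqueue: this is the
			 * race described above.
			 */
			if (!spin_trylock(&qdisc->seqlock))
				return false;
			WRITE_ONCE(qdisc->empty, false);
		} else if (qdisc_is_running(qdisc)) {
			return false;
		}
		/* ... mark the qdisc as running ... */
		return true;
	}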

Lockless qdisc has another concurrency problem when
tx_action is involved:

cpu0(serving tx_action)     cpu1             cpu2
          .                   .                .
          .              q->enqueue            .
          .            qdisc_run_begin()       .
          .              dequeue_skb()         .
          .                   .            q->enqueue
          .                   .                .
          .             sch_direct_xmit()      .
          .                   .         qdisc_run_begin()
          .                   .       return and do nothing
          .                   .                .
 clear __QDISC_STATE_SCHED    .                .
 qdisc_run_begin()            .                .
 return and do nothing        .                .
          .                   .                .
          .            qdisc_run_end()         .

This patch fixes the above data race by:
1. If the first spin_trylock() returns false and STATE_MISSED is
   not set, set STATE_MISSED and retry another spin_trylock() in
   case the other CPU may not see STATE_MISSED after it releases
   the lock.
2. Reschedule if STATE_MISSED is set after the lock is released
   at the end of qdisc_run_end().

For the tx_action case, STATE_MISSED is also set when cpu1 is at
the end of qdisc_run_end(), so tx_action will be rescheduled again
to dequeue the skb enqueued by cpu2.

Clear STATE_MISSED before retrying a dequeue when the dequeue
returns NULL, in order to reduce the overhead of the second
spin_trylock() and of calling __netif_schedule().

Also clear STATE_MISSED before calling __netif_schedule() at
the end of qdisc_run_end() to avoid doing another round of
dequeuing in pfifo_fast_dequeue().

The performance impact of this patch, tested using pktgen and
dummy netdev with pfifo_fast qdisc attached:

 threads  without+this_patch   with+this_patch      delta
    1        2.61Mpps            2.60Mpps           -0.3%
    2        3.97Mpps            3.82Mpps           -3.7%
    4        5.62Mpps            5.59Mpps           -0.5%
    8        2.78Mpps            2.77Mpps           -0.3%
   16        2.22Mpps            2.22Mpps           -0.0%

Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
V7: Clear STATE_MISSED before calling __netif_schedule()
    as suggested by Jakub.
V6: Check MISSED after the first trylock, and remove the
    atomic test and set for performance's sake as suggested
    by Jakub.
V4: Change STATE_NEED_RESCHEDULE to STATE_MISSED mirroring
    NAPI's NAPIF_STATE_MISSED, and add Juergen's "Tested-by"
    tag for there is only renaming and typo fixing between
    V4 and V3.
V3: Fix a compile error and a few comment typos, remove the
    __QDISC_STATE_DEACTIVATED checking, and update the
    performance data.
V2: Avoid the overhead of fixing the data race as much as
    possible.
---
 include/net/sch_generic.h | 35 ++++++++++++++++++++++++++++++++++-
 net/sched/sch_generic.c   | 19 +++++++++++++++++++
 2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f7a6e14..1e62551 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -36,6 +36,7 @@ struct qdisc_rate_table {
 enum qdisc_state_t {
 	__QDISC_STATE_SCHED,
 	__QDISC_STATE_DEACTIVATED,
+	__QDISC_STATE_MISSED,
 };
 
 struct qdisc_size_table {
@@ -159,8 +160,33 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
 	if (qdisc->flags & TCQ_F_NOLOCK) {
+		if (spin_trylock(&qdisc->seqlock))
+			goto nolock_empty;
+
+		/* If the MISSED flag is set, it means other thread has
+		 * set the MISSED flag before second spin_trylock(), so
+		 * we can return false here to avoid multi cpus doing
+		 * the set_bit() and second spin_trylock() concurrently.
+		 */
+		if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
+			return false;
+
+		/* Set the MISSED flag before the second spin_trylock(),
+		 * if the second spin_trylock() return false, it means
+		 * other cpu holding the lock will do dequeuing for us
+		 * or it will see the MISSED flag set after releasing
+		 * lock and reschedule the net_tx_action() to do the
+		 * dequeuing.
+		 */
+		set_bit(__QDISC_STATE_MISSED, &qdisc->state);
+
+		/* Retry again in case other CPU may not see the new flag
+		 * after it releases the lock at the end of qdisc_run_end().
+		 */
 		if (!spin_trylock(&qdisc->seqlock))
 			return false;
+
+nolock_empty:
 		WRITE_ONCE(qdisc->empty, false);
 	} else if (qdisc_is_running(qdisc)) {
 		return false;
@@ -176,8 +202,15 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
 	write_seqcount_end(&qdisc->running);
-	if (qdisc->flags & TCQ_F_NOLOCK)
+	if (qdisc->flags & TCQ_F_NOLOCK) {
 		spin_unlock(&qdisc->seqlock);
+
+		if (unlikely(test_bit(__QDISC_STATE_MISSED,
+				      &qdisc->state))) {
+			clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
+			__netif_schedule(qdisc);
+		}
+	}
 }
 
 static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 44991ea..795d986 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 {
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
 	struct sk_buff *skb = NULL;
+	bool need_retry = true;
 	int band;
 
+retry:
 	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
 		struct skb_array *q = band2list(priv, band);
 
@@ -652,6 +654,23 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 	}
 	if (likely(skb)) {
 		qdisc_update_stats_at_dequeue(qdisc, skb);
+	} else if (need_retry &&
+		   test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
+		/* Delay clearing the STATE_MISSED here to reduce
+		 * the overhead of the second spin_trylock() in
+		 * qdisc_run_begin() and __netif_schedule() calling
+		 * in qdisc_run_end().
+		 */
+		clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
+
+		/* Make sure dequeuing happens after clearing
+		 * STATE_MISSED.
+		 */
+		smp_mb__after_atomic();
+
+		need_retry = false;
+
+		goto retry;
 	} else {
 		WRITE_ONCE(qdisc->empty, true);
 	}
-- 
2.7.4



* [PATCH net v8 2/3] net: sched: fix tx action rescheduling issue during deactivation
From: Yunsheng Lin @ 2021-05-14  2:26 UTC
  To: davem, kuba
  Cc: olteanv, ast, daniel, andriin, edumazet, weiwan, cong.wang,
	ap420073, netdev, linux-kernel, linuxarm, mkl, linux-can, jhs,
	xiyou.wangcong, jiri, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, bpf, jonas.bonn, pabeni, mzhivich,
	johunt, albcamus, kehuan.feng, a.fatoum, atenart,
	alexander.duyck, hdanton, jgross, JKosina, mkubecek, bjorn,
	alobakin

Currently qdisc_run() checks the STATE_DEACTIVATED of a lockless
qdisc before calling __qdisc_run(), which ultimately clears
STATE_MISSED once all skbs have been dequeued. If STATE_DEACTIVATED
is set before STATE_MISSED is cleared, net_tx_action() may be
rescheduled at the end of qdisc_run_end(), see below:

CPU0(net_tx_action)  CPU1(__dev_xmit_skb)  CPU2(dev_deactivate)
          .                   .                     .
          .            set STATE_MISSED             .
          .           __netif_schedule()            .
          .                   .           set STATE_DEACTIVATED
          .                   .                qdisc_reset()
          .                   .                     .
          .<---------------   .              synchronize_net()
clear __QDISC_STATE_SCHED  |  .                     .
          .                |  .                     .
          .                |  .            some_qdisc_is_busy()
          .                |  .               return *false*
          .                |  .                     .
  test STATE_DEACTIVATED   |  .                     .
__qdisc_run() *not* called |  .                     .
          .                |  .                     .
   test STATE_MISSED       |  .                     .
 __netif_schedule()--------|  .                     .
          .                   .                     .
          .                   .                     .

__qdisc_run() is not called by net_tx_action() in CPU0 because
CPU2 has set the STATE_DEACTIVATED flag during dev_deactivate(),
and STATE_MISSED is only cleared in __qdisc_run(), so
__netif_schedule() is called at the end of qdisc_run_end(),
causing the tx action rescheduling problem.

qdisc_run() called by net_tx_action() runs in softirq context,
which should have the same semantics as the qdisc_run() called by
__dev_xmit_skb() under rcu_read_lock_bh(). And since there is a
synchronize_net() between the STATE_DEACTIVATED flag being set and
qdisc_reset()/some_qdisc_is_busy() in dev_deactivate(), we can
safely bail out for a deactivated lockless qdisc in net_tx_action(),
and qdisc_reset() will reset all skbs not yet dequeued.

So add rcu_read_lock() explicitly to protect the qdisc_run() and
do the STATE_DEACTIVATED checking in net_tx_action() before calling
qdisc_run_begin(). Another option is to do the checking in
qdisc_run_end(), but that would add unnecessary overhead for the
non-tx_action case, because __dev_queue_xmit() will not see a qdisc
with STATE_DEACTIVATED after synchronize_net(); such a qdisc can
only be seen by net_tx_action() because of __netif_schedule().

The STATE_DEACTIVATED checking in qdisc_run() is there to avoid the
race between net_tx_action() and qdisc_reset(), see:
commit d518d2ed8640 ("net/sched: fix race between deactivation
and dequeue for NOLOCK qdisc"). As the bailout added above for a
deactivated lockless qdisc in net_tx_action() provides better
protection for that race without calling qdisc_run() at all,
remove the STATE_DEACTIVATED checking in qdisc_run().

After qdisc_reset() there is no skb left in the qdisc to be
dequeued, so clear STATE_MISSED in dev_reset_queue() too.

Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
V7: Clearing STATE_MISSED before calling __netif_schedule() avoids
    the endless rescheduling problem, but there may still be an
    unnecessary rescheduling, so adjust the commit log.
---
 include/net/pkt_sched.h |  7 +------
 net/core/dev.c          | 26 ++++++++++++++++++++++----
 net/sched/sch_generic.c |  4 +++-
 3 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index f5c1bee..6d7b12c 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -128,12 +128,7 @@ void __qdisc_run(struct Qdisc *q);
 static inline void qdisc_run(struct Qdisc *q)
 {
 	if (qdisc_run_begin(q)) {
-		/* NOLOCK qdisc must check 'state' under the qdisc seqlock
-		 * to avoid racing with dev_qdisc_reset()
-		 */
-		if (!(q->flags & TCQ_F_NOLOCK) ||
-		    likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
-			__qdisc_run(q);
+		__qdisc_run(q);
 		qdisc_run_end(q);
 	}
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index 222b1d3..d596cd7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5025,25 +5025,43 @@ static __latent_entropy void net_tx_action(struct softirq_action *h)
 		sd->output_queue_tailp = &sd->output_queue;
 		local_irq_enable();
 
+		rcu_read_lock();
+
 		while (head) {
 			struct Qdisc *q = head;
 			spinlock_t *root_lock = NULL;
 
 			head = head->next_sched;
 
-			if (!(q->flags & TCQ_F_NOLOCK)) {
-				root_lock = qdisc_lock(q);
-				spin_lock(root_lock);
-			}
 			/* We need to make sure head->next_sched is read
 			 * before clearing __QDISC_STATE_SCHED
 			 */
 			smp_mb__before_atomic();
+
+			if (!(q->flags & TCQ_F_NOLOCK)) {
+				root_lock = qdisc_lock(q);
+				spin_lock(root_lock);
+			} else if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED,
+						     &q->state))) {
+				/* There is a synchronize_net() between
+				 * STATE_DEACTIVATED flag being set and
+				 * qdisc_reset()/some_qdisc_is_busy() in
+				 * dev_deactivate(), so we can safely bail out
+				 * early here to avoid data race between
+				 * qdisc_deactivate() and some_qdisc_is_busy()
+				 * for lockless qdisc.
+				 */
+				clear_bit(__QDISC_STATE_SCHED, &q->state);
+				continue;
+			}
+
 			clear_bit(__QDISC_STATE_SCHED, &q->state);
 			qdisc_run(q);
 			if (root_lock)
 				spin_unlock(root_lock);
 		}
+
+		rcu_read_unlock();
 	}
 
 	xfrm_dev_backlog(sd);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 795d986..d86c4cc 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1177,8 +1177,10 @@ static void dev_reset_queue(struct net_device *dev,
 	qdisc_reset(qdisc);
 
 	spin_unlock_bh(qdisc_lock(qdisc));
-	if (nolock)
+	if (nolock) {
+		clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
 		spin_unlock_bh(&qdisc->seqlock);
+	}
 }
 
 static bool some_qdisc_is_busy(struct net_device *dev)
-- 
2.7.4



* [PATCH net v8 3/3] net: sched: fix tx action reschedule issue with stopped queue
From: Yunsheng Lin @ 2021-05-14  2:26 UTC
  To: davem, kuba
  Cc: olteanv, ast, daniel, andriin, edumazet, weiwan, cong.wang,
	ap420073, netdev, linux-kernel, linuxarm, mkl, linux-can, jhs,
	xiyou.wangcong, jiri, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, bpf, jonas.bonn, pabeni, mzhivich,
	johunt, albcamus, kehuan.feng, a.fatoum, atenart,
	alexander.duyck, hdanton, jgross, JKosina, mkubecek, bjorn,
	alobakin

The netdev queue might be stopped when the byte queue limit has
been reached or the tx hw ring is full; net_tx_action() may still
be rescheduled endlessly if STATE_MISSED is set, which consumes
a lot of cpu without dequeuing or transmitting any skb because
the netdev queue is stopped, see qdisc_run_end().

This patch fixes it by checking the netdev queue state before
calling qdisc_run() and by clearing STATE_MISSED if the netdev
queue is stopped during qdisc_run(); net_tx_action() is then
rescheduled again when the netdev queue is restarted, see
netif_tx_wake_queue().
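
For context, a sketch of how a restarted netdev queue kicks the tx
softirq again; the shape follows netif_tx_wake_queue() in
net/core/dev.c of this era, simplified and not verbatim:

	void netif_tx_wake_queue(struct netdev_queue *dev_queue)
	{
		if (test_and_clear_bit(__QUEUE_STATE_DRV_XOFF,
				       &dev_queue->state)) {
			/* __netif_schedule() raises NET_TX_SOFTIRQ,
			 * so net_tx_action() runs again after the
			 * driver restarts the queue.
			 */
			__netif_schedule(dev_queue->qdisc);
		}
	}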

As there is a time window between the netif_xmit_frozen_or_stopped()
check and the STATE_MISSED clearing, during which STATE_MISSED may
be set by the net_tx_action() scheduled from netif_tx_wake_queue(),
set STATE_MISSED again if the netdev queue has been restarted.

Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
V8: Change qdisc_maybe_stop_tx() to qdisc_maybe_clear_missed()
    as suggested by Jakub.
V7: Fix the netif_tx_wake_queue() data race noted by Jakub.
V6: Drop the NET_XMIT_DROP checking as it is not really relevant
    to this patch, and it may cause performance regression with
    multiple pktgen threads on a dummy netdev with pfifo_fast
    qdisc attached.
---
 net/core/dev.c          |  3 ++-
 net/sched/sch_generic.c | 27 ++++++++++++++++++++++++++-
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index d596cd7..ef8cf76 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3853,7 +3853,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 
 	if (q->flags & TCQ_F_NOLOCK) {
 		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
-		qdisc_run(q);
+		if (likely(!netif_xmit_frozen_or_stopped(txq)))
+			qdisc_run(q);
 
 		if (unlikely(to_free))
 			kfree_skb_list(to_free);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index d86c4cc..fc8b56b 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -35,6 +35,25 @@
 const struct Qdisc_ops *default_qdisc_ops = &pfifo_fast_ops;
 EXPORT_SYMBOL(default_qdisc_ops);
 
+static void qdisc_maybe_clear_missed(struct Qdisc *q,
+				     const struct netdev_queue *txq)
+{
+	clear_bit(__QDISC_STATE_MISSED, &q->state);
+
+	/* Make sure the below netif_xmit_frozen_or_stopped()
+	 * checking happens after clearing STATE_MISSED.
+	 */
+	smp_mb__after_atomic();
+
+	/* Checking netif_xmit_frozen_or_stopped() again to
+	 * make sure STATE_MISSED is set if the STATE_MISSED
+	 * set by netif_tx_wake_queue()'s rescheduling of
+	 * net_tx_action() is cleared by the above clear_bit().
+	 */
+	if (!netif_xmit_frozen_or_stopped(txq))
+		set_bit(__QDISC_STATE_MISSED, &q->state);
+}
+
 /* Main transmission queue. */
 
 /* Modifications to data participating in scheduling must be protected with
@@ -74,6 +93,7 @@ static inline struct sk_buff *__skb_dequeue_bad_txq(struct Qdisc *q)
 			}
 		} else {
 			skb = SKB_XOFF_MAGIC;
+			qdisc_maybe_clear_missed(q, txq);
 		}
 	}
 
@@ -242,6 +262,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 			}
 		} else {
 			skb = NULL;
+			qdisc_maybe_clear_missed(q, txq);
 		}
 		if (lock)
 			spin_unlock(lock);
@@ -251,8 +272,10 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 	*validate = true;
 
 	if ((q->flags & TCQ_F_ONETXQUEUE) &&
-	    netif_xmit_frozen_or_stopped(txq))
+	    netif_xmit_frozen_or_stopped(txq)) {
+		qdisc_maybe_clear_missed(q, txq);
 		return skb;
+	}
 
 	skb = qdisc_dequeue_skb_bad_txq(q);
 	if (unlikely(skb)) {
@@ -311,6 +334,8 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 		HARD_TX_LOCK(dev, txq, smp_processor_id());
 		if (!netif_xmit_frozen_or_stopped(txq))
 			skb = dev_hard_start_xmit(skb, dev, txq, &ret);
+		else
+			qdisc_maybe_clear_missed(q, txq);
 
 		HARD_TX_UNLOCK(dev, txq);
 	} else {
-- 
2.7.4



* Re: [PATCH net v8 0/3] fix packet stuck problem for lockless qdisc
From: Yunsheng Lin @ 2021-05-14  2:56 UTC
  To: davem, kuba
  Cc: olteanv, ast, daniel, andriin, edumazet, weiwan, cong.wang,
	ap420073, netdev, linux-kernel, linuxarm, mkl, linux-can, jhs,
	xiyou.wangcong, jiri, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, bpf, jonas.bonn, pabeni, mzhivich,
	johunt, albcamus, kehuan.feng, a.fatoum, atenart,
	alexander.duyck, hdanton, jgross, JKosina, mkubecek, bjorn,
	alobakin

On 2021/5/14 10:26, Yunsheng Lin wrote:
> This patchset fixes the packet stuck problem mentioned in [1].
> 
> Patch 1: Add STATE_MISSED flag to fix packet stuck problem.
> Patch 2: Fix a tx_action rescheduling problem after STATE_MISSED
>          flag is added in patch 1.
> Patch 3: Fix the significantly higher CPU consumption problem when
>          multiple threads are competing on a saturated outgoing
>          device.
> 
> V8: Change function name in patch 3 as suggested by Jakub, adjust
>     commit log in patch 2, and add Acked-by from Jakub.

Please ignore this patchset, there are some typos in patch 3.

> V7: Fix netif_tx_wake_queue() data race noted by Jakub.
> V6: Some performance optimization in patch 1 suggested by Jakub
>     and drop NET_XMIT_DROP checking in patch 3.
> V5: add patch 3 to fix the problem reported by Michal Kubecek.
> V4: Change STATE_NEED_RESCHEDULE to STATE_MISSED and add patch 2.
> 
> [1]. https://lkml.org/lkml/2019/10/9/42
> 
> Yunsheng Lin (3):
>   net: sched: fix packet stuck problem for lockless qdisc
>   net: sched: fix tx action rescheduling issue during deactivation
>   net: sched: fix tx action reschedule issue with stopped queue
> 
>  include/net/pkt_sched.h   |  7 +------
>  include/net/sch_generic.h | 35 ++++++++++++++++++++++++++++++++-
>  net/core/dev.c            | 29 ++++++++++++++++++++++-----
>  net/sched/sch_generic.c   | 50 +++++++++++++++++++++++++++++++++++++++++++++--
>  4 files changed, 107 insertions(+), 14 deletions(-)
> 



* Re: [PATCH net v8 1/3] net: sched: fix packet stuck problem for lockless qdisc
From: Cong Wang @ 2021-05-14 23:36 UTC
  To: Yunsheng Lin
  Cc: David Miller, Jakub Kicinski, Vladimir Oltean,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang .,
	Taehee Yoo, Linux Kernel Network Developers, LKML, linuxarm,
	Marc Kleine-Budde, linux-can, Jamal Hadi Salim, Jiri Pirko,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, kpsingh, bpf, Jonas Bonn, Paolo Abeni,
	Michael Zhivich, Josh Hunt, Jike Song, Kehuan Feng, Ahmad Fatoum,
	atenart, Alexander Duyck, Hillf Danton, jgross, JKosina,
	Michal Kubecek, Björn Töpel, Alexander Lobakin

On Thu, May 13, 2021 at 7:27 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>  struct qdisc_size_table {
> @@ -159,8 +160,33 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  {
>         if (qdisc->flags & TCQ_F_NOLOCK) {
> +               if (spin_trylock(&qdisc->seqlock))
> +                       goto nolock_empty;
> +
> +               /* If the MISSED flag is set, it means other thread has
> +                * set the MISSED flag before second spin_trylock(), so
> +                * we can return false here to avoid multi cpus doing
> +                * the set_bit() and second spin_trylock() concurrently.
> +                */
> +               if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
> +                       return false;
> +
> +               /* Set the MISSED flag before the second spin_trylock(),
> +                * if the second spin_trylock() return false, it means
> +                * other cpu holding the lock will do dequeuing for us
> +                * or it will see the MISSED flag set after releasing
> +                * lock and reschedule the net_tx_action() to do the
> +                * dequeuing.
> +                */
> +               set_bit(__QDISC_STATE_MISSED, &qdisc->state);
> +
> +               /* Retry again in case other CPU may not see the new flag
> +                * after it releases the lock at the end of qdisc_run_end().
> +                */
>                 if (!spin_trylock(&qdisc->seqlock))
>                         return false;
> +
> +nolock_empty:
>                 WRITE_ONCE(qdisc->empty, false);
>         } else if (qdisc_is_running(qdisc)) {
>                 return false;
> @@ -176,8 +202,15 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  static inline void qdisc_run_end(struct Qdisc *qdisc)
>  {
>         write_seqcount_end(&qdisc->running);
> -       if (qdisc->flags & TCQ_F_NOLOCK)
> +       if (qdisc->flags & TCQ_F_NOLOCK) {
>                 spin_unlock(&qdisc->seqlock);
> +
> +               if (unlikely(test_bit(__QDISC_STATE_MISSED,
> +                                     &qdisc->state))) {
> +                       clear_bit(__QDISC_STATE_MISSED, &qdisc->state);


We have test_and_clear_bit() which is atomic, test_bit()+clear_bit()
is not.


> +                       __netif_schedule(qdisc);
> +               }
> +       }
>  }
>
>  static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 44991ea..795d986 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>  {
>         struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>         struct sk_buff *skb = NULL;
> +       bool need_retry = true;
>         int band;
>
> +retry:
>         for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>                 struct skb_array *q = band2list(priv, band);
>
> @@ -652,6 +654,23 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>         }
>         if (likely(skb)) {
>                 qdisc_update_stats_at_dequeue(qdisc, skb);
> +       } else if (need_retry &&
> +                  test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
> +               /* Delay clearing the STATE_MISSED here to reduce
> +                * the overhead of the second spin_trylock() in
> +                * qdisc_run_begin() and __netif_schedule() calling
> +                * in qdisc_run_end().
> +                */
> +               clear_bit(__QDISC_STATE_MISSED, &qdisc->state);

Ditto.

> +
> +               /* Make sure dequeuing happens after clearing
> +                * STATE_MISSED.
> +                */
> +               smp_mb__after_atomic();
> +
> +               need_retry = false;
> +
> +               goto retry;

Two concurrent pfifo_fast_dequeue() calls could possibly retry at
the same time, when they both test __QDISC_STATE_MISSED at the
same time and get true. Is this a problem?

Also, any reason why you want pfifo_fast to handle a generic
Qdisc flag? IOW, why not handle this logic in, for example,
qdisc_restart()?

Thanks.


* Re: [PATCH net v8 1/3] net: sched: fix packet stuck problem for lockless qdisc
From: Jakub Kicinski @ 2021-05-14 23:39 UTC
  To: Cong Wang
  Cc: Yunsheng Lin, David Miller, Vladimir Oltean, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eric Dumazet, Wei Wang,
	Cong Wang .,
	Taehee Yoo, Linux Kernel Network Developers, LKML, linuxarm,
	Marc Kleine-Budde, linux-can, Jamal Hadi Salim, Jiri Pirko,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, kpsingh, bpf, Jonas Bonn, Paolo Abeni,
	Michael Zhivich, Josh Hunt, Jike Song, Kehuan Feng, Ahmad Fatoum,
	atenart, Alexander Duyck, Hillf Danton, jgross, JKosina,
	Michal Kubecek, Björn Töpel, Alexander Lobakin

On Fri, 14 May 2021 16:36:16 -0700 Cong Wang wrote:
> > @@ -176,8 +202,15 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
> >  static inline void qdisc_run_end(struct Qdisc *qdisc)
> >  {
> >         write_seqcount_end(&qdisc->running);
> > -       if (qdisc->flags & TCQ_F_NOLOCK)
> > +       if (qdisc->flags & TCQ_F_NOLOCK) {
> >                 spin_unlock(&qdisc->seqlock);
> > +
> > +               if (unlikely(test_bit(__QDISC_STATE_MISSED,
> > +                                     &qdisc->state))) {
> > +                       clear_bit(__QDISC_STATE_MISSED, &qdisc->state);  
> 
> We have test_and_clear_bit() which is atomic, test_bit()+clear_bit()
> is not.

It doesn't have to be atomic, right? I asked to split the test because
test_and_clear is a locked op on x86, test by itself is not.


* Re: [PATCH net v8 1/3] net: sched: fix packet stuck problem for lockless qdisc
From: Cong Wang @ 2021-05-14 23:57 UTC
  To: Jakub Kicinski
  Cc: Yunsheng Lin, David Miller, Vladimir Oltean, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eric Dumazet, Wei Wang,
	Cong Wang .,
	Taehee Yoo, Linux Kernel Network Developers, LKML, linuxarm,
	Marc Kleine-Budde, linux-can, Jamal Hadi Salim, Jiri Pirko,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, kpsingh, bpf, Jonas Bonn, Paolo Abeni,
	Michael Zhivich, Josh Hunt, Jike Song, Kehuan Feng, Ahmad Fatoum,
	atenart, Alexander Duyck, Hillf Danton, jgross, JKosina,
	Michal Kubecek, Björn Töpel, Alexander Lobakin

On Fri, May 14, 2021 at 4:39 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 14 May 2021 16:36:16 -0700 Cong Wang wrote:
> > > @@ -176,8 +202,15 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
> > >  static inline void qdisc_run_end(struct Qdisc *qdisc)
> > >  {
> > >         write_seqcount_end(&qdisc->running);
> > > -       if (qdisc->flags & TCQ_F_NOLOCK)
> > > +       if (qdisc->flags & TCQ_F_NOLOCK) {
> > >                 spin_unlock(&qdisc->seqlock);
> > > +
> > > +               if (unlikely(test_bit(__QDISC_STATE_MISSED,
> > > +                                     &qdisc->state))) {
> > > +                       clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
> >
> > We have test_and_clear_bit() which is atomic, test_bit()+clear_bit()
> > is not.
>
> It doesn't have to be atomic, right? I asked to split the test because
> test_and_clear is a locked op on x86, test by itself is not.

It depends on whether you expect the code under the true condition
to run once or multiple times, something like:

if (test_bit()) {
  clear_bit();
  // this code may run multiple times
}

With the atomic test_and_clear_bit(), it only runs once:

if (test_and_clear_bit()) {
  // this code runs once
}

This is why __netif_schedule() uses test_and_set_bit() instead of
test_bit()+set_bit().
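
For reference, a minimal sketch of that guard; the shape follows
__netif_schedule() in net/core/dev.c, simplified:

	void __netif_schedule(struct Qdisc *q)
	{
		/* STATE_SCHED doubles as the "already on
		 * sd->output_queue" marker, so the test and the set
		 * must be a single atomic operation.
		 */
		if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state))
			__netif_reschedule(q);
	}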

Thanks.


* Re: [PATCH net v8 1/3] net: sched: fix packet stuck problem for lockless qdisc
From: Jakub Kicinski @ 2021-05-15  0:17 UTC
  To: Cong Wang
  Cc: Yunsheng Lin, David Miller, Vladimir Oltean, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eric Dumazet, Wei Wang,
	Cong Wang .,
	Taehee Yoo, Linux Kernel Network Developers, LKML, linuxarm,
	Marc Kleine-Budde, linux-can, Jamal Hadi Salim, Jiri Pirko,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, kpsingh, bpf, Jonas Bonn, Paolo Abeni,
	Michael Zhivich, Josh Hunt, Jike Song, Kehuan Feng, Ahmad Fatoum,
	atenart, Alexander Duyck, Hillf Danton, jgross, JKosina,
	Michal Kubecek, Björn Töpel, Alexander Lobakin

On Fri, 14 May 2021 16:57:29 -0700 Cong Wang wrote:
> On Fri, May 14, 2021 at 4:39 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Fri, 14 May 2021 16:36:16 -0700 Cong Wang wrote:  
>  [...]  
> > >
> > > We have test_and_clear_bit() which is atomic, test_bit()+clear_bit()
> > > is not.  
> >
> > It doesn't have to be atomic, right? I asked to split the test because
> > test_and_clear is a locked op on x86, test by itself is not.  
> 
> It depends on whether you expect the code under the true condition
> to run once or multiple times, something like:
> 
> if (test_bit()) {
>   clear_bit();
>   // this code may run multiple times
> }
> 
> With the atomic test_and_clear_bit(), it only runs once:
> 
> if (test_and_clear_bit()) {
>   // this code runs once
> }
> 
> This is why __netif_schedule() uses test_and_set_bit() instead of
> test_bit()+set_bit().

Thanks, makes sense, so hopefully the MISSED-was-set case is not common
and we can depend on __netif_schedule() to DTRT, avoiding the atomic op
in the common case.
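
A minimal sketch of the trade-off, mirroring the hunk under
discussion and assuming x86 bitop semantics (test_bit() is a plain
load, test_and_clear_bit() a locked read-modify-write):

	/* Plain load: the common MISSED-not-set path stays free of
	 * locked instructions.
	 */
	if (test_bit(__QDISC_STATE_MISSED, &q->state)) {
		/* clear_bit() is atomic, but only runs on the rare
		 * MISSED-was-set path.
		 */
		clear_bit(__QDISC_STATE_MISSED, &q->state);
		__netif_schedule(q);
	}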


* Re: [PATCH net v8 1/3] net: sched: fix packet stuck problem for lockless qdisc
From: Yunsheng Lin @ 2021-05-15  2:25 UTC
  To: Jakub Kicinski, Cong Wang
  Cc: David Miller, Vladimir Oltean, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eric Dumazet, Wei Wang,
	Cong Wang .,
	Taehee Yoo, Linux Kernel Network Developers, LKML, linuxarm,
	Marc Kleine-Budde, linux-can, Jamal Hadi Salim, Jiri Pirko,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, kpsingh, bpf, Jonas Bonn, Paolo Abeni,
	Michael Zhivich, Josh Hunt, Jike Song, Kehuan Feng, Ahmad Fatoum,
	atenart, Alexander Duyck, Hillf Danton, jgross, JKosina,
	Michal Kubecek, Björn Töpel, Alexander Lobakin

On 2021/5/15 8:17, Jakub Kicinski wrote:
> On Fri, 14 May 2021 16:57:29 -0700 Cong Wang wrote:
>> On Fri, May 14, 2021 at 4:39 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>>
>>> On Fri, 14 May 2021 16:36:16 -0700 Cong Wang wrote:  
>>  [...]  
>>>>
>>>> We have test_and_clear_bit() which is atomic, test_bit()+clear_bit()
>>>> is not.  
>>>
>>> It doesn't have to be atomic, right? I asked to split the test because
>>> test_and_clear is a locked op on x86, test by itself is not.  
>>
>> It depends on whether you expect the code under the true condition
>> to run once or multiple times, something like:
>>
>> if (test_bit()) {
>>   clear_bit();
>>   // this code may run multiple times
>> }
>>
>> With the atomic test_and_clear_bit(), it only runs once:
>>
>> if (test_and_clear_bit()) {
>>   // this code runs once
>> }

I am not sure if the above really matters when the test and clear
does not need to be atomic.

In order for the above to happen, MISSED has to be set between
the test and the clear, right?

>>
>> This is why __netif_schedule() uses test_and_set_bit() instead of
>> test_bit()+set_bit().

I think test_and_set_bit() is needed in __netif_schedule() mainly
because STATE_SCHED is also used to indicate whether the qdisc is
in sd->output_queue, so it has to be atomic.

> 
> Thanks, makes sense, so hopefully the MISSED-was-set case is not common
> and we can depend on __netif_schedule() to DTRT, avoiding the atomic op
> in the common case.
> 
> .
> 



* Re: [PATCH net v8 1/3] net: sched: fix packet stuck problem for lockless qdisc
From: Cong Wang @ 2021-05-18  0:49 UTC
  To: Yunsheng Lin
  Cc: Jakub Kicinski, David Miller, Vladimir Oltean,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang .,
	Taehee Yoo, Linux Kernel Network Developers, LKML, linuxarm,
	Marc Kleine-Budde, linux-can, Jamal Hadi Salim, Jiri Pirko,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, kpsingh, bpf, Jonas Bonn, Paolo Abeni,
	Michael Zhivich, Josh Hunt, Jike Song, Kehuan Feng, Ahmad Fatoum,
	atenart, Alexander Duyck, Hillf Danton, jgross, JKosina,
	Michal Kubecek, Björn Töpel, Alexander Lobakin

On Fri, May 14, 2021 at 7:25 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2021/5/15 8:17, Jakub Kicinski wrote:
> > On Fri, 14 May 2021 16:57:29 -0700 Cong Wang wrote:
> >> On Fri, May 14, 2021 at 4:39 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >>>
> >>> On Fri, 14 May 2021 16:36:16 -0700 Cong Wang wrote:
> >>  [...]
> >>>>
> >>>> We have test_and_clear_bit() which is atomic, test_bit()+clear_bit()
> >>>> is not.
> >>>
> >>> It doesn't have to be atomic, right? I asked to split the test because
> >>> test_and_clear is a locked op on x86, test by itself is not.
> >>
> >> It depends on whether you expect the code under the true condition
> >> to run once or multiple times, something like:
> >>
> >> if (test_bit()) {
> >>   clear_bit();
> >>   // this code may run multiple times
> >> }
> >>
> >> With the atomic test_and_clear_bit(), it only runs once:
> >>
> >> if (test_and_clear_bit()) {
> >>   // this code runs once
> >> }
>
> I am not sure if the above really matters when the test and clear
> does not need to be atomic.
>
> In order for the above to happen, MISSED has to be set between
> the test and the clear, right?

Nope, see the following:

// MISSED bit is already set
CPU0                            CPU1
if (test_bit(MISSED)) { // true
                                if (test_bit(MISSED)) { // also true
        clear_bit(MISSED);
        do_something();
                                        clear_bit(MISSED);
                                        do_something();
                                }
}

Now do_something() is called twice instead of once. This may or may
not be a problem, hence I asked this question.

>
> >>
> >> This is why __netif_schedule() uses test_and_set_bit() instead of
> >> test_bit()+set_bit().
>
> I think test_and_set_bit() is needed in __netif_schedule() mainly
> because STATE_SCHED is also used to indicate if the qdisc is in
> sd->output_queue, so it has to be atomic.

If you replace the "do_something()" above with __netif_reschedule(),
this is exactly my point. An entry cannot be inserted twice into a
list, hence it should never be called twice like above.
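
For reference, a sketch of why a double insert would corrupt the
list; the shape follows __netif_reschedule() in net/core/dev.c of
this era, simplified:

	static void __netif_reschedule(struct Qdisc *q)
	{
		struct softnet_data *sd;
		unsigned long flags;

		local_irq_save(flags);
		sd = this_cpu_ptr(&softnet_data);
		q->next_sched = NULL;
		/* Tail insert into a singly-linked list: inserting
		 * the same qdisc twice would leave q->next_sched
		 * pointing at q itself, a self-loop, hence the
		 * test_and_set_bit() guard in __netif_schedule().
		 */
		*sd->output_queue_tailp = q;
		sd->output_queue_tailp = &q->next_sched;
		raise_softirq_irqoff(NET_TX_SOFTIRQ);
		local_irq_restore(flags);
	}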

Thanks.


