* [PATCH net v5 0/3] fix packet stuck problem for lockless qdisc
@ 2021-05-06  1:57 Yunsheng Lin
  2021-05-06  1:57 ` [PATCH net v5 1/3] net: sched: " Yunsheng Lin
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Yunsheng Lin @ 2021-05-06  1:57 UTC
  To: davem, kuba
  Cc: olteanv, ast, daniel, andriin, edumazet, weiwan, cong.wang,
	ap420073, netdev, linux-kernel, linuxarm, mkl, linux-can, jhs,
	xiyou.wangcong, jiri, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, bpf, jonas.bonn, pabeni, mzhivich,
	johunt, albcamus, kehuan.feng, a.fatoum, atenart,
	alexander.duyck, hdanton, jgross, JKosina, mkubecek, bjorn

This patchset fixes the packet stuck problem mentioned in [1].

Patch 1: Add STATE_MISSED flag to fix packet stuck problem.
Patch 2: Fix a tx_action rescheduling problem after STATE_MISSED
         flag is added in patch 1.
Patch 3: Fix the significantly higher CPU consumption problem when
         multiple threads are competing on a saturated outgoing
         device.

V5: Add patch 3 to fix the problem reported by Michal Kubecek.
V4: Change STATE_NEED_RESCHEDULE to STATE_MISSED and add patch 2.

[1]. https://lkml.org/lkml/2019/10/9/42

Yunsheng Lin (3):
  net: sched: fix packet stuck problem for lockless qdisc
  net: sched: fix endless tx action reschedule during deactivation
  net: sched: fix tx action reschedule issue with stopped queue

 include/net/pkt_sched.h   |  7 +------
 include/net/sch_generic.h | 37 ++++++++++++++++++++++++++++++++++++-
 net/core/dev.c            | 30 +++++++++++++++++++++++++-----
 net/sched/sch_generic.c   | 24 ++++++++++++++++++++++--
 4 files changed, 84 insertions(+), 14 deletions(-)

-- 
2.7.4



* [PATCH net v5 1/3] net: sched: fix packet stuck problem for lockless qdisc
  2021-05-06  1:57 [PATCH net v5 0/3] fix packet stuck problem for lockless qdisc Yunsheng Lin
@ 2021-05-06  1:57 ` Yunsheng Lin
  2021-05-07 23:57   ` Jakub Kicinski
  2021-05-06  1:57 ` [PATCH net v5 2/3] net: sched: fix endless tx action reschedule during deactivation Yunsheng Lin
  2021-05-06  1:57 ` [PATCH net v5 3/3] net: sched: fix tx action reschedule issue with stopped queue Yunsheng Lin
  2 siblings, 1 reply; 7+ messages in thread
From: Yunsheng Lin @ 2021-05-06  1:57 UTC
  To: davem, kuba
  Cc: olteanv, ast, daniel, andriin, edumazet, weiwan, cong.wang,
	ap420073, netdev, linux-kernel, linuxarm, mkl, linux-can, jhs,
	xiyou.wangcong, jiri, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, bpf, jonas.bonn, pabeni, mzhivich,
	johunt, albcamus, kehuan.feng, a.fatoum, atenart,
	alexander.duyck, hdanton, jgross, JKosina, mkubecek, bjorn

Lockless qdisc has the following concurrency problem:
    cpu0                 cpu1
     .                     .
q->enqueue                 .
     .                     .
qdisc_run_begin()          .
     .                     .
dequeue_skb()              .
     .                     .
sch_direct_xmit()          .
     .                     .
     .                q->enqueue
     .             qdisc_run_begin()
     .            return and do nothing
     .                     .
qdisc_run_end()            .

cpu1 enqueues an skb without calling __qdisc_run() because cpu0
has not released the lock yet, so spin_trylock() returns false
for cpu1 in qdisc_run_begin(). And cpu0 does not see the skb
enqueued by cpu1 when calling dequeue_skb(), because cpu1 may
enqueue the skb after cpu0 has called dequeue_skb() and before
cpu0 calls qdisc_run_end().
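
A minimal sketch of the pre-patch lockless path in qdisc_run_begin(),
reconstructed from the unchanged context lines in the diff below
(simplified, not the full function), shows why the skb can get stuck:
a failed trylock leaves no state behind, so the lock holder has no
way to know that another cpu enqueued an skb and gave up:

	static inline bool qdisc_run_begin(struct Qdisc *qdisc)
	{
		if (qdisc->flags & TCQ_F_NOLOCK) {
			/* cpu1 returns here while cpu0 still holds
			 * the lock; nothing records the missed skb.
			 */
			if (!spin_trylock(&qdisc->seqlock))
				return false;
			WRITE_ONCE(qdisc->empty, false);
		} else if (qdisc_is_running(qdisc)) {
			return false;
		}
		/* seqcount begin omitted for brevity */
		return true;
	}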

Lockless qdisc has another concurrency problem when tx_action
is involved:

cpu0(serving tx_action)     cpu1             cpu2
          .                   .                .
          .              q->enqueue            .
          .            qdisc_run_begin()       .
          .              dequeue_skb()         .
          .                   .            q->enqueue
          .                   .                .
          .             sch_direct_xmit()      .
          .                   .         qdisc_run_begin()
          .                   .       return and do nothing
          .                   .                .
 clear __QDISC_STATE_SCHED    .                .
 qdisc_run_begin()            .                .
 return and do nothing        .                .
          .                   .                .
          .            qdisc_run_end()         .

This patch fixes the above data race by:
1. Testing STATE_MISSED before doing spin_trylock().
2. If the first spin_trylock() returns false and STATE_MISSED was
   not set before it, setting STATE_MISSED and retrying another
   spin_trylock(), in case the other CPU does not see STATE_MISSED
   after it releases the lock.
3. Rescheduling if STATE_MISSED is set after the lock is released
   at the end of qdisc_run_end().

For the tx_action case, STATE_MISSED is also set when cpu1 is at
the end of qdisc_run_end(), so tx_action will be rescheduled
again to dequeue the skb enqueued by cpu2.

Clear STATE_MISSED and retry the dequeue once when dequeuing
returns NULL, in order to reduce the overhead of the above double
spin_trylock() and the __netif_schedule() call.

The performance impact of this patch, tested using pktgen and
dummy netdev with pfifo_fast qdisc attached:

 threads  without+this_patch   with+this_patch      delta
    1        2.61Mpps            2.60Mpps           -0.3%
    2        3.97Mpps            3.82Mpps           -3.7%
    4        5.62Mpps            5.59Mpps           -0.5%
    8        2.78Mpps            2.77Mpps           -0.3%
   16        2.22Mpps            2.22Mpps           -0.0%

Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Tested-by: Juergen Gross <jgross@suse.com>
---
V4: Change STATE_NEED_RESCHEDULE to STATE_MISSED, mirroring
    NAPI's NAPIF_STATE_MISSED, and add Juergen's "Tested-by"
    tag, as there is only renaming and typo fixing between
    V3 and V4.
V3: Fix a compile error and a few comment typos, remove the
    __QDISC_STATE_DEACTIVATED checking, and update the
    performance data.
V2: Avoid the overhead of fixing the data race as much as
    possible.
---
 include/net/sch_generic.h | 37 ++++++++++++++++++++++++++++++++++++-
 net/sched/sch_generic.c   | 12 ++++++++++++
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f7a6e14..b85b8ea 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -36,6 +36,7 @@ struct qdisc_rate_table {
 enum qdisc_state_t {
 	__QDISC_STATE_SCHED,
 	__QDISC_STATE_DEACTIVATED,
+	__QDISC_STATE_MISSED,
 };
 
 struct qdisc_size_table {
@@ -159,8 +160,37 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
 	if (qdisc->flags & TCQ_F_NOLOCK) {
+		bool dont_retry = test_bit(__QDISC_STATE_MISSED,
+					   &qdisc->state);
+
+		if (spin_trylock(&qdisc->seqlock))
+			goto nolock_empty;
+
+		/* If the flag is set before doing the spin_trylock() and
+		 * the above spin_trylock() return false, it means other cpu
+		 * holding the lock will do dequeuing for us, or it wil see
+		 * the flag set after releasing lock and reschedule the
+		 * net_tx_action() to do the dequeuing.
+		 */
+		if (dont_retry)
+			return false;
+
+		/* We could do set_bit() before the first spin_trylock(),
+		 * and avoid doing second spin_trylock() completely, then
+		 * we could have multi cpus doing the set_bit(). Here use
+		 * dont_retry to avoid doing the set_bit() and the second
+		 * spin_trylock(), which has 5% performance improvement than
+		 * doing the set_bit() before the first spin_trylock().
+		 */
+		set_bit(__QDISC_STATE_MISSED, &qdisc->state);
+
+		/* Retry again in case other CPU may not see the new flag
+		 * after it releases the lock at the end of qdisc_run_end().
+		 */
 		if (!spin_trylock(&qdisc->seqlock))
 			return false;
+
+nolock_empty:
 		WRITE_ONCE(qdisc->empty, false);
 	} else if (qdisc_is_running(qdisc)) {
 		return false;
@@ -176,8 +206,13 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
 	write_seqcount_end(&qdisc->running);
-	if (qdisc->flags & TCQ_F_NOLOCK)
+	if (qdisc->flags & TCQ_F_NOLOCK) {
 		spin_unlock(&qdisc->seqlock);
+
+		if (unlikely(test_bit(__QDISC_STATE_MISSED,
+				      &qdisc->state)))
+			__netif_schedule(qdisc);
+	}
 }
 
 static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 44991ea..9bc73ea 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 {
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
 	struct sk_buff *skb = NULL;
+	bool need_retry = true;
 	int band;
 
+retry:
 	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
 		struct skb_array *q = band2list(priv, band);
 
@@ -652,6 +654,16 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 	}
 	if (likely(skb)) {
 		qdisc_update_stats_at_dequeue(qdisc, skb);
+	} else if (need_retry &&
+		   test_and_clear_bit(__QDISC_STATE_MISSED,
+				      &qdisc->state)) {
+		/* do another dequeuing after clearing the flag to
+		 * avoid calling __netif_schedule().
+		 */
+		smp_mb__after_atomic();
+		need_retry = false;
+
+		goto retry;
 	} else {
 		WRITE_ONCE(qdisc->empty, true);
 	}
-- 
2.7.4



* [PATCH net v5 2/3] net: sched: fix endless tx action reschedule during deactivation
  2021-05-06  1:57 [PATCH net v5 0/3] fix packet stuck problem for lockless qdisc Yunsheng Lin
  2021-05-06  1:57 ` [PATCH net v5 1/3] net: sched: " Yunsheng Lin
@ 2021-05-06  1:57 ` Yunsheng Lin
  2021-05-06  1:57 ` [PATCH net v5 3/3] net: sched: fix tx action reschedule issue with stopped queue Yunsheng Lin
  2 siblings, 0 replies; 7+ messages in thread
From: Yunsheng Lin @ 2021-05-06  1:57 UTC
  To: davem, kuba
  Cc: olteanv, ast, daniel, andriin, edumazet, weiwan, cong.wang,
	ap420073, netdev, linux-kernel, linuxarm, mkl, linux-can, jhs,
	xiyou.wangcong, jiri, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, bpf, jonas.bonn, pabeni, mzhivich,
	johunt, albcamus, kehuan.feng, a.fatoum, atenart,
	alexander.duyck, hdanton, jgross, JKosina, mkubecek, bjorn

Currently qdisc_run() checks STATE_DEACTIVATED of a lockless
qdisc before calling __qdisc_run(), which ultimately clears
STATE_MISSED once all skbs are dequeued. If STATE_DEACTIVATED
is set before STATE_MISSED is cleared, net_tx_action() may be
rescheduled endlessly at the end of qdisc_run_end(), see below:

CPU0(net_tx_action)  CPU1(__dev_xmit_skb)  CPU2(dev_deactivate)
          .                   .                     .
          .            set STATE_MISSED             .
          .           __netif_schedule()            .
          .                   .           set STATE_DEACTIVATED
          .                   .                qdisc_reset()
          .                   .                     .
          .<---------------   .              synchronize_net()
clear __QDISC_STATE_SCHED  |  .                     .
          .                |  .                     .
          .                |  .                     .
          .                |  .           --------->.
          .                |  .          |          .
  test STATE_DEACTIVATED   |  .          | some_qdisc_is_busy()
__qdisc_run() *not* called |  .          |-----return *true*
          .                |  .                     .
   test STATE_MISSED       |  .                     .
 __netif_schedule()--------|  .                     .
          .                   .                     .
          .                   .                     .

__qdisc_run() is not called by net_tx_action() on CPU0 because
CPU2 has set the STATE_DEACTIVATED flag during dev_deactivate(),
and STATE_MISSED is only cleared in __qdisc_run(). So
__netif_schedule() is called endlessly at the end of
qdisc_run_end(), causing the endless tx action rescheduling
problem.

qdisc_run() called by net_tx_action() runs in softirq context,
which should have the same semantics as qdisc_run() called by
__dev_xmit_skb() under rcu_read_lock_bh(). And since there is a
synchronize_net() between the STATE_DEACTIVATED flag being set
and qdisc_reset()/some_qdisc_is_busy() in dev_deactivate(), we
can safely bail out for a deactivated lockless qdisc in
net_tx_action(), and qdisc_reset() will reset all skbs not yet
dequeued.

So add rcu_read_lock() explicitly to protect qdisc_run(), and
check STATE_DEACTIVATED in net_tx_action() before calling
qdisc_run_begin(). Another option is to do the check in
qdisc_run_end(), but that would add unnecessary overhead for the
non-tx_action case: __dev_queue_xmit() will not see a qdisc with
STATE_DEACTIVATED after synchronize_net(); such a qdisc can only
be seen by net_tx_action() because of __netif_schedule().

The STATE_DEACTIVATED check in qdisc_run() was there to avoid a
race between net_tx_action() and qdisc_reset(), see:
commit d518d2ed8640 ("net/sched: fix race between deactivation
and dequeue for NOLOCK qdisc"). As the bailout added above for a
deactivated lockless qdisc in net_tx_action() provides better
protection for that race without calling qdisc_run() at all,
remove the STATE_DEACTIVATED check from qdisc_run().

After qdisc_reset(), there are no skbs left in the qdisc to be
dequeued, so clear STATE_MISSED in dev_reset_queue() too.

Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/net/pkt_sched.h |  7 +------
 net/core/dev.c          | 26 ++++++++++++++++++++++----
 net/sched/sch_generic.c |  4 +++-
 3 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index f5c1bee..6d7b12c 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -128,12 +128,7 @@ void __qdisc_run(struct Qdisc *q);
 static inline void qdisc_run(struct Qdisc *q)
 {
 	if (qdisc_run_begin(q)) {
-		/* NOLOCK qdisc must check 'state' under the qdisc seqlock
-		 * to avoid racing with dev_qdisc_reset()
-		 */
-		if (!(q->flags & TCQ_F_NOLOCK) ||
-		    likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
-			__qdisc_run(q);
+		__qdisc_run(q);
 		qdisc_run_end(q);
 	}
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index 222b1d3..d596cd7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5025,25 +5025,43 @@ static __latent_entropy void net_tx_action(struct softirq_action *h)
 		sd->output_queue_tailp = &sd->output_queue;
 		local_irq_enable();
 
+		rcu_read_lock();
+
 		while (head) {
 			struct Qdisc *q = head;
 			spinlock_t *root_lock = NULL;
 
 			head = head->next_sched;
 
-			if (!(q->flags & TCQ_F_NOLOCK)) {
-				root_lock = qdisc_lock(q);
-				spin_lock(root_lock);
-			}
 			/* We need to make sure head->next_sched is read
 			 * before clearing __QDISC_STATE_SCHED
 			 */
 			smp_mb__before_atomic();
+
+			if (!(q->flags & TCQ_F_NOLOCK)) {
+				root_lock = qdisc_lock(q);
+				spin_lock(root_lock);
+			} else if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED,
+						     &q->state))) {
+				/* There is a synchronize_net() between
+				 * STATE_DEACTIVATED flag being set and
+				 * qdisc_reset()/some_qdisc_is_busy() in
+				 * dev_deactivate(), so we can safely bail out
+				 * early here to avoid data race between
+				 * qdisc_deactivate() and some_qdisc_is_busy()
+				 * for lockless qdisc.
+				 */
+				clear_bit(__QDISC_STATE_SCHED, &q->state);
+				continue;
+			}
+
 			clear_bit(__QDISC_STATE_SCHED, &q->state);
 			qdisc_run(q);
 			if (root_lock)
 				spin_unlock(root_lock);
 		}
+
+		rcu_read_unlock();
 	}
 
 	xfrm_dev_backlog(sd);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 9bc73ea..c32ac5b 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1170,8 +1170,10 @@ static void dev_reset_queue(struct net_device *dev,
 	qdisc_reset(qdisc);
 
 	spin_unlock_bh(qdisc_lock(qdisc));
-	if (nolock)
+	if (nolock) {
+		clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
 		spin_unlock_bh(&qdisc->seqlock);
+	}
 }
 
 static bool some_qdisc_is_busy(struct net_device *dev)
-- 
2.7.4



* [PATCH net v5 3/3] net: sched: fix tx action reschedule issue with stopped queue
  2021-05-06  1:57 [PATCH net v5 0/3] fix packet stuck problem for lockless qdisc Yunsheng Lin
  2021-05-06  1:57 ` [PATCH net v5 1/3] net: sched: " Yunsheng Lin
  2021-05-06  1:57 ` [PATCH net v5 2/3] net: sched: fix endless tx action reschedule during deactivation Yunsheng Lin
@ 2021-05-06  1:57 ` Yunsheng Lin
  2 siblings, 0 replies; 7+ messages in thread
From: Yunsheng Lin @ 2021-05-06  1:57 UTC
  To: davem, kuba
  Cc: olteanv, ast, daniel, andriin, edumazet, weiwan, cong.wang,
	ap420073, netdev, linux-kernel, linuxarm, mkl, linux-can, jhs,
	xiyou.wangcong, jiri, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, bpf, jonas.bonn, pabeni, mzhivich,
	johunt, albcamus, kehuan.feng, a.fatoum, atenart,
	alexander.duyck, hdanton, jgross, JKosina, mkubecek, bjorn

The netdev queue might be stopped when the byte queue limit has
been reached or the tx hw ring is full, yet net_tx_action() may
still be rescheduled endlessly if STATE_MISSED is set, which
consumes a lot of cpu without dequeuing or transmitting any skb
because the netdev queue is stopped, see qdisc_run_end().

This patch fixes it by checking the netdev queue state before
calling qdisc_run() and clearing STATE_MISSED if the netdev queue
is stopped during qdisc_run(). net_tx_action() is rescheduled
again when the netdev queue is restarted, see
netif_tx_wake_queue().
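
For context, netif_tx_wake_queue() does roughly the following (a
simplified sketch, not verbatim kernel source):

	void netif_tx_wake_queue(struct netdev_queue *dev_queue)
	{
		/* clearing the driver XOFF bit reschedules the qdisc,
		 * which is what gives the tx action another chance
		 * once the driver restarts the queue
		 */
		if (test_and_clear_bit(__QUEUE_STATE_DRV_XOFF,
				       &dev_queue->state))
			__netif_schedule(dev_queue->qdisc);
	}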

As q->enqueue() may return NET_XMIT_DROP when there is not enough
space, running qdisc_run() would likely consume unnecessary cpu,
so avoid calling qdisc_run() when q->enqueue() returns
NET_XMIT_DROP too.

Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 net/core/dev.c          | 4 +++-
 net/sched/sch_generic.c | 8 +++++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index d596cd7..005bc3e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3853,7 +3853,9 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 
 	if (q->flags & TCQ_F_NOLOCK) {
 		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
-		qdisc_run(q);
+		if (likely(rc != NET_XMIT_DROP &&
+			   !netif_xmit_frozen_or_stopped(txq)))
+			qdisc_run(q);
 
 		if (unlikely(to_free))
 			kfree_skb_list(to_free);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index c32ac5b..2bb829ea 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -74,6 +74,7 @@ static inline struct sk_buff *__skb_dequeue_bad_txq(struct Qdisc *q)
 			}
 		} else {
 			skb = SKB_XOFF_MAGIC;
+			clear_bit(__QDISC_STATE_MISSED, &q->state);
 		}
 	}
 
@@ -242,6 +243,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 			}
 		} else {
 			skb = NULL;
+			clear_bit(__QDISC_STATE_MISSED, &q->state);
 		}
 		if (lock)
 			spin_unlock(lock);
@@ -251,8 +253,10 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 	*validate = true;
 
 	if ((q->flags & TCQ_F_ONETXQUEUE) &&
-	    netif_xmit_frozen_or_stopped(txq))
+	    netif_xmit_frozen_or_stopped(txq)) {
+		clear_bit(__QDISC_STATE_MISSED, &q->state);
 		return skb;
+	}
 
 	skb = qdisc_dequeue_skb_bad_txq(q);
 	if (unlikely(skb)) {
@@ -311,6 +315,8 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 		HARD_TX_LOCK(dev, txq, smp_processor_id());
 		if (!netif_xmit_frozen_or_stopped(txq))
 			skb = dev_hard_start_xmit(skb, dev, txq, &ret);
+		else
+			clear_bit(__QDISC_STATE_MISSED, &q->state);
 
 		HARD_TX_UNLOCK(dev, txq);
 	} else {
-- 
2.7.4



* Re: [PATCH net v5 1/3] net: sched: fix packet stuck problem for lockless qdisc
  2021-05-06  1:57 ` [PATCH net v5 1/3] net: sched: " Yunsheng Lin
@ 2021-05-07 23:57   ` Jakub Kicinski
  2021-05-08  2:55     ` Yunsheng Lin
  0 siblings, 1 reply; 7+ messages in thread
From: Jakub Kicinski @ 2021-05-07 23:57 UTC
  To: Yunsheng Lin
  Cc: davem, olteanv, ast, daniel, andriin, edumazet, weiwan,
	cong.wang, ap420073, netdev, linux-kernel, linuxarm, mkl,
	linux-can, jhs, xiyou.wangcong, jiri, andrii, kafai,
	songliubraving, yhs, john.fastabend, kpsingh, bpf, jonas.bonn,
	pabeni, mzhivich, johunt, albcamus, kehuan.feng, a.fatoum,
	atenart, alexander.duyck, hdanton, jgross, JKosina, mkubecek,
	bjorn

On Thu, 6 May 2021 09:57:42 +0800 Yunsheng Lin wrote:
> @@ -159,8 +160,37 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  {
>  	if (qdisc->flags & TCQ_F_NOLOCK) {
> +		bool dont_retry = test_bit(__QDISC_STATE_MISSED,
> +					   &qdisc->state);
> +
> +		if (spin_trylock(&qdisc->seqlock))
> +			goto nolock_empty;
> +
> +		/* If the flag is set before doing the spin_trylock() and
> +		 * the above spin_trylock() return false, it means other cpu
> +		 * holding the lock will do dequeuing for us, or it wil see

s/wil/will/

> +		 * the flag set after releasing lock and reschedule the
> +		 * net_tx_action() to do the dequeuing.

I don't understand why MISSED is checked before the trylock.
Could you explain why it can't be tested directly here?

> +		 */
> +		if (dont_retry)
> +			return false;
> +
> +		/* We could do set_bit() before the first spin_trylock(),
> +		 * and avoid doing second spin_trylock() completely, then
> +		 * we could have multi cpus doing the set_bit(). Here use
> +		 * dont_retry to avoid doing the set_bit() and the second
> +		 * spin_trylock(), which has 5% performance improvement than
> +		 * doing the set_bit() before the first spin_trylock().
> +		 */
> +		set_bit(__QDISC_STATE_MISSED, &qdisc->state);
> +
> +		/* Retry again in case other CPU may not see the new flag
> +		 * after it releases the lock at the end of qdisc_run_end().
> +		 */
>  		if (!spin_trylock(&qdisc->seqlock))
>  			return false;
> +
> +nolock_empty:
>  		WRITE_ONCE(qdisc->empty, false);
>  	} else if (qdisc_is_running(qdisc)) {
>  		return false;
> @@ -176,8 +206,13 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  static inline void qdisc_run_end(struct Qdisc *qdisc)
>  {
>  	write_seqcount_end(&qdisc->running);
> -	if (qdisc->flags & TCQ_F_NOLOCK)
> +	if (qdisc->flags & TCQ_F_NOLOCK) {
>  		spin_unlock(&qdisc->seqlock);
> +
> +		if (unlikely(test_bit(__QDISC_STATE_MISSED,
> +				      &qdisc->state)))
> +			__netif_schedule(qdisc);
> +	}
>  }
>  
>  static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 44991ea..9bc73ea 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>  {
>  	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>  	struct sk_buff *skb = NULL;
> +	bool need_retry = true;
>  	int band;
>  
> +retry:
>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>  		struct skb_array *q = band2list(priv, band);
>  
> @@ -652,6 +654,16 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>  	}
>  	if (likely(skb)) {
>  		qdisc_update_stats_at_dequeue(qdisc, skb);
> +	} else if (need_retry &&
> +		   test_and_clear_bit(__QDISC_STATE_MISSED,
> +				      &qdisc->state)) {

Why test_and_clear_bit() here? AFAICT this is the only place the bit 
is cleared. So the test and clear do not have to be atomic.

To my limited understanding, on x86 test_bit() is never a locked
operation, while test_and_clear_bit() is always locked. So we'd save
an atomic operation in the un-contended case if we tested first and
then cleared.

> +		/* do another dequeuing after clearing the flag to
> +		 * avoid calling __netif_schedule().
> +		 */
> +		smp_mb__after_atomic();

test_and_clear_bit() which returned true implies a memory barrier,
AFAIU, so the barrier is not needed with the code as is. It will be
needed if we switch to test_bit() + clear_bit(), but please clarify
what it is pairing with.

> +		need_retry = false;
> +
> +		goto retry;
>  	} else {
>  		WRITE_ONCE(qdisc->empty, true);
>  	}
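
A sketch of the test-then-clear variant suggested above, applied to
the v5 hunk (illustrative only, not posted code):

	} else if (need_retry &&
		   test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
		/* the atomic clear_bit() is only paid for when the
		 * flag is actually set; test_bit() alone is not a
		 * locked operation on x86
		 */
		clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
		smp_mb__after_atomic();
		need_retry = false;

		goto retry;
	} else {
		WRITE_ONCE(qdisc->empty, true);
	}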



* Re: [PATCH net v5 1/3] net: sched: fix packet stuck problem for lockless qdisc
  2021-05-07 23:57   ` Jakub Kicinski
@ 2021-05-08  2:55     ` Yunsheng Lin
  2021-05-08  3:05       ` Jakub Kicinski
  0 siblings, 1 reply; 7+ messages in thread
From: Yunsheng Lin @ 2021-05-08  2:55 UTC
  To: Jakub Kicinski
  Cc: davem, olteanv, ast, daniel, andriin, edumazet, weiwan,
	cong.wang, ap420073, netdev, linux-kernel, linuxarm, mkl,
	linux-can, jhs, xiyou.wangcong, jiri, andrii, kafai,
	songliubraving, yhs, john.fastabend, kpsingh, bpf, jonas.bonn,
	pabeni, mzhivich, johunt, albcamus, kehuan.feng, a.fatoum,
	atenart, alexander.duyck, hdanton, jgross, JKosina, mkubecek,
	bjorn

On 2021/5/8 7:57, Jakub Kicinski wrote:
> On Thu, 6 May 2021 09:57:42 +0800 Yunsheng Lin wrote:
>> @@ -159,8 +160,37 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
>>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>>  {
>>  	if (qdisc->flags & TCQ_F_NOLOCK) {
>> +		bool dont_retry = test_bit(__QDISC_STATE_MISSED,
>> +					   &qdisc->state);
>> +
>> +		if (spin_trylock(&qdisc->seqlock))
>> +			goto nolock_empty;
>> +
>> +		/* If the flag is set before doing the spin_trylock() and
>> +		 * the above spin_trylock() return false, it means other cpu
>> +		 * holding the lock will do dequeuing for us, or it wil see
> 
> s/wil/will/

Thanks.

> 
>> +		 * the flag set after releasing lock and reschedule the
>> +		 * net_tx_action() to do the dequeuing.
> 
> I don't understand why MISSED is checked before the trylock.
> Could you explain why it can't be tested directly here?
The initial thinking was:
just like the set_bit() before the second trylock, if MISSED is set
before the first trylock, it means another thread has already set the
MISSED flag before this thread did its first trylock, so this thread
does not need to do the set_bit().

But that was over-thinking it: thread 3 setting MISSED before the
second trylock ensures that either thread 3's second trylock returns
true or thread 2, which holds the lock, will see the MISSED flag. So
thread 1 can do the test_bit() before or after the first trylock, as
below:

    thread 1                thread 2                    thread 3
                         holding q->seqlock
first trylock failed                              first trylock failed
                         unlock q->seqlock
                                               test_bit(MISSED) return false
                   test_bit(MISSED) return false
                          and not reschedule
                                                      set_bit(MISSED)
						      trylock success
test_bit(MISSED) returns true
and not retry second trylock

If the above is correct, it seems we could:
1. do test_bit(MISSED) before the first trylock to avoid doing the
   first trylock in the contended case.
or
2. do test_bit(MISSED) after the first trylock to avoid doing the
   test_bit() in the un-contended case.

Which one do you prefer?
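
For illustration, option 2 would look roughly like this in
qdisc_run_begin() (a sketch against the v5 code above, not posted
code):

	if (qdisc->flags & TCQ_F_NOLOCK) {
		if (spin_trylock(&qdisc->seqlock))
			goto nolock_empty;

		/* only look at MISSED once the trylock has already
		 * failed, avoiding the test_bit() entirely in the
		 * un-contended case
		 */
		if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
			return false;

		set_bit(__QDISC_STATE_MISSED, &qdisc->state);

		if (!spin_trylock(&qdisc->seqlock))
			return false;

nolock_empty:
		WRITE_ONCE(qdisc->empty, false);
	}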

> 
>> +		 */
>> +		if (dont_retry)
>> +			return false;
>> +
>> +		/* We could do set_bit() before the first spin_trylock(),
>> +		 * and avoid doing second spin_trylock() completely, then
>> +		 * we could have multi cpus doing the set_bit(). Here use
>> +		 * dont_retry to avoid doing the set_bit() and the second
>> +		 * spin_trylock(), which has 5% performance improvement than
>> +		 * doing the set_bit() before the first spin_trylock().
>> +		 */
>> +		set_bit(__QDISC_STATE_MISSED, &qdisc->state);
>> +
>> +		/* Retry again in case other CPU may not see the new flag
>> +		 * after it releases the lock at the end of qdisc_run_end().
>> +		 */
>>  		if (!spin_trylock(&qdisc->seqlock))
>>  			return false;
>> +
>> +nolock_empty:
>>  		WRITE_ONCE(qdisc->empty, false);
>>  	} else if (qdisc_is_running(qdisc)) {
>>  		return false;
>> @@ -176,8 +206,13 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>>  static inline void qdisc_run_end(struct Qdisc *qdisc)
>>  {
>>  	write_seqcount_end(&qdisc->running);
>> -	if (qdisc->flags & TCQ_F_NOLOCK)
>> +	if (qdisc->flags & TCQ_F_NOLOCK) {
>>  		spin_unlock(&qdisc->seqlock);
>> +
>> +		if (unlikely(test_bit(__QDISC_STATE_MISSED,
>> +				      &qdisc->state)))
>> +			__netif_schedule(qdisc);
>> +	}
>>  }
>>  
>>  static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 44991ea..9bc73ea 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>  {
>>  	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>>  	struct sk_buff *skb = NULL;
>> +	bool need_retry = true;
>>  	int band;
>>  
>> +retry:
>>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>  		struct skb_array *q = band2list(priv, band);
>>  
>> @@ -652,6 +654,16 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>  	}
>>  	if (likely(skb)) {
>>  		qdisc_update_stats_at_dequeue(qdisc, skb);
>> +	} else if (need_retry &&
>> +		   test_and_clear_bit(__QDISC_STATE_MISSED,
>> +				      &qdisc->state)) {
> 
> Why test_and_clear_bit() here? AFAICT this is the only place the bit 
> is cleared. So the test and clear do not have to be atomic.

The bit is also cleared in another place in patch 2/3, but within the
protection of q->seqlock too, so yes, the test and clear do not have
to be atomic, for performance's sake.

> 
> To my limited understanding on x86 test_bit() is never a locked
> operation, while test_and_clear_bit() is always locked. So we'd save
> an atomic operation in un-contended case if we tested first and then
> cleared.
> 
>> +		/* do another dequeuing after clearing the flag to
>> +		 * avoid calling __netif_schedule().
>> +		 */
>> +		smp_mb__after_atomic();
> 
> test_and_clear_bit() which returned true implies a memory barrier,
> AFAIU, so the barrier is not needed with the code as is. It will be
> needed if we switch to test_bit() + clear_bit(), but please clarify
> what it is paring with.

Ok.
Thanks for reviewing.

> 
>> +		need_retry = false;
>> +
>> +		goto retry;
>>  	} else {
>>  		WRITE_ONCE(qdisc->empty, true);
>>  	}
> 
> 
> .
> 



* Re: [PATCH net v5 1/3] net: sched: fix packet stuck problem for lockless qdisc
  2021-05-08  2:55     ` Yunsheng Lin
@ 2021-05-08  3:05       ` Jakub Kicinski
  0 siblings, 0 replies; 7+ messages in thread
From: Jakub Kicinski @ 2021-05-08  3:05 UTC
  To: Yunsheng Lin
  Cc: davem, olteanv, ast, daniel, andriin, edumazet, weiwan,
	cong.wang, ap420073, netdev, linux-kernel, linuxarm, mkl,
	linux-can, jhs, xiyou.wangcong, jiri, andrii, kafai,
	songliubraving, yhs, john.fastabend, kpsingh, bpf, jonas.bonn,
	pabeni, mzhivich, johunt, albcamus, kehuan.feng, a.fatoum,
	atenart, alexander.duyck, hdanton, jgross, JKosina, mkubecek,
	bjorn

On Sat, 8 May 2021 10:55:19 +0800 Yunsheng Lin wrote:
> >> +		 * the flag set after releasing lock and reschedule the
> >> +		 * net_tx_action() to do the dequeuing.  
> > 
> > I don't understand why MISSED is checked before the trylock.
> > Could you explain why it can't be tested directly here?  
> The initial thinking was:
> just like the set_bit() before the second trylock, if MISSED is set
> before the first trylock, it means another thread has already set the
> MISSED flag before this thread did its first trylock, so this thread
> does not need to do the set_bit().
> 
> But that was over-thinking it: thread 3 setting MISSED before the
> second trylock ensures that either thread 3's second trylock returns
> true or thread 2, which holds the lock, will see the MISSED flag. So
> thread 1 can do the test_bit() before or after the first trylock, as
> below:
> 
>     thread 1                thread 2                    thread 3
>                          holding q->seqlock
> first trylock failed                              first trylock failed
>                          unlock q->seqlock
>                                                test_bit(MISSED) return false
>                    test_bit(MISSED) return false
>                           and not reschedule
>                                                       set_bit(MISSED)
> 						      trylock success
> test_bit(MISSED) returns true
> and not retry second trylock
> 
> If the above is correct, it seems we could:
> 1. do test_bit(MISSED) before the first trylock to avoid doing the
>    first trylock in the contended case.
> or
> 2. do test_bit(MISSED) after the first trylock to avoid doing the
>    test_bit() in the un-contended case.
> 
> Which one do you prefer?

No strong preference but testing after the trylock seems more obvious
as it saves the temporary variable.

For the contended case could we potentially move or add a MISSED test
before even the first try_lock()? I'm not good at optimizing things, 
but it could save us the atomic op, right? (at least on x86)
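
A rough sketch of that idea (illustrative only; bailing out before
the first trylock relies on the lock holder seeing the flag and
rescheduling, as discussed above):

	if (qdisc->flags & TCQ_F_NOLOCK) {
		/* contended case: skip the trylock entirely, the
		 * lock holder will dequeue for us or reschedule
		 * net_tx_action() when it sees the flag
		 */
		if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
			return false;

		if (spin_trylock(&qdisc->seqlock))
			goto nolock_empty;

		set_bit(__QDISC_STATE_MISSED, &qdisc->state);

		if (!spin_trylock(&qdisc->seqlock))
			return false;

nolock_empty:
		WRITE_ONCE(qdisc->empty, false);
	}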

