* [PATCH net] net: sched: fix packet stuck problem for lockless qdisc
@ 2021-03-18  6:53 Yunsheng Lin
  2021-03-19  9:25 ` [Linuxarm] " Yunsheng Lin
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Yunsheng Lin @ 2021-03-18  6:53 UTC (permalink / raw)
  To: davem, kuba
  Cc: olteanv, ast, daniel, andriin, edumazet, weiwan, cong.wang,
	ap420073, netdev, linux-kernel, linuxarm, mkl, linux-can, jhs,
	xiyou.wangcong, jiri, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, bpf, jonas.bonn, pabeni, mzhivich,
	johunt, albcamus, kehuan.feng

Lockless qdisc has the following concurrency problem:
        cpu0                  cpu1
          .                     .
     q->enqueue                 .
          .                     .
   qdisc_run_begin()            .
          .                     .
     dequeue_skb()              .
          .                     .
   sch_direct_xmit()            .
          .                     .
          .                q->enqueue
          .             qdisc_run_begin()
          .            return and do nothing
          .                     .
qdisc_run_end()                 .

cpu1 enqueues an skb without calling __qdisc_run() because cpu0
has not released the lock yet and spin_trylock() returns false
for cpu1 in qdisc_run_begin(), and cpu0 does not see the skb
enqueued by cpu1 when calling dequeue_skb(), because cpu1 may
enqueue the skb after cpu0 calls dequeue_skb() and before
cpu0 calls qdisc_run_end().
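
For reference, the relevant pre-patch fragment of qdisc_run_begin()
is the below (simplified; it also appears as context in the diff
further down):

	if (qdisc->flags & TCQ_F_NOLOCK) {
		/* A failed trylock assumes the current lock holder will
		 * drain the queue, but the holder may already be past
		 * dequeue_skb(), leaving the newly enqueued skb stuck.
		 */
		if (!spin_trylock(&qdisc->seqlock))
			return false;
		WRITE_ONCE(qdisc->empty, false);
	} else if (qdisc_is_running(qdisc)) {
		return false;
	}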

Lockless qdisc has another concurrency problem when tx_action
is involved:

cpu0(serving tx_action)     cpu1             cpu2
          .                   .                .
          .              q->enqueue            .
          .            qdisc_run_begin()       .
          .              dequeue_skb()         .
          .                   .            q->enqueue
          .                   .                .
          .             sch_direct_xmit()      .
          .                   .         qdisc_run_begin()
          .                   .       return and do nothing
          .                   .                .
clear __QDISC_STATE_SCHED     .                .
    qdisc_run_begin()         .                .
return and do nothing         .                .
          .                   .                .
          .           qdisc_run_end()

This patch fixes the above data races by:
1. Setting a flag after spin_trylock() returns false.
2. Retrying spin_trylock() in case the other CPU may not see the
   new flag after it releases the lock.
3. Rescheduling if the flag is set after the lock is released
   at the end of qdisc_run_end().

For the tx_action case, the flag is also set when cpu1 is at the
end of qdisc_run_begin(), so tx_action will be rescheduled
again to dequeue the skb enqueued by cpu2.

Also clear the flag before dequeuing in order to reduce the
overhead of the above process, and avoid doing the heavy
test_and_clear_bit() at the end of qdisc_run_end().

Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
For those who have not been following the qdisc scheduling
discussion: there is a packet stuck problem for lockless qdisc,
see [1], and I have done some cleanup and added some enhanced
features too, see [2] [3].
While I was doing the optimization for lockless qdisc, it
occurred to me that these optimizations are useless if there is
still a basic bug in lockless qdisc, even if the bug is not easily
reproducible. So looking through [1] again, I found the data
race for tx_action mentioned by Michael, thought deeply about
it, and came up with this patch trying to fix it.

So I would really appreciate it if someone who still has the
reproducer could try this patch and report back.

1. https://lore.kernel.org/netdev/d102074f-7489-e35a-98cf-e2cad7efd8a2@netrounds.com/t/#ma7013a79b8c4d8e7c49015c724e481e6d5325b32
2. https://patchwork.kernel.org/project/netdevbpf/patch/1615777818-13969-1-git-send-email-linyunsheng@huawei.com/
3. https://patchwork.kernel.org/project/netdevbpf/patch/1615800610-34700-1-git-send-email-linyunsheng@huawei.com/
---
 include/net/sch_generic.h | 23 ++++++++++++++++++++---
 net/sched/sch_generic.c   |  1 +
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f7a6e14..4220eab 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -36,6 +36,7 @@ struct qdisc_rate_table {
 enum qdisc_state_t {
 	__QDISC_STATE_SCHED,
 	__QDISC_STATE_DEACTIVATED,
+	__QDISC_STATE_NEED_RESCHEDULE,
 };
 
 struct qdisc_size_table {
@@ -159,8 +160,17 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
 	if (qdisc->flags & TCQ_F_NOLOCK) {
-		if (!spin_trylock(&qdisc->seqlock))
-			return false;
+		if (!spin_trylock(&qdisc->seqlock)) {
+			set_bit(__QDISC_STATE_NEED_RESCHEDULE,
+				&qdisc->state);
+
+			/* Retry again in case other CPU may not see the
+			 * new flags after it releases the lock at the
+			 * end of qdisc_run_end().
+			 */
+			if (!spin_trylock(&qdisc->seqlock))
+				return false;
+		}
 		WRITE_ONCE(qdisc->empty, false);
 	} else if (qdisc_is_running(qdisc)) {
 		return false;
@@ -176,8 +186,15 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
 	write_seqcount_end(&qdisc->running);
-	if (qdisc->flags & TCQ_F_NOLOCK)
+	if (qdisc->flags & TCQ_F_NOLOCK) {
 		spin_unlock(&qdisc->seqlock);
+
+		if (unlikely(test_bit(__QDISC_STATE_NEED_RESCHEDULE,
+				      &qdisc->state) &&
+			     !test_bit(__QDISC_STATE_DEACTIVATED,
+				       &qdisc->state)))
+			__netif_schedule(qdisc);
+	}
 }
 
 static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 44991ea..25d75d8 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -205,6 +205,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 	const struct netdev_queue *txq = q->dev_queue;
 	struct sk_buff *skb = NULL;
 
+	clear_bit(__QDISC_STATE_NEED_RESCHEDULE, &q->state);
 	*packets = 1;
 	if (unlikely(!skb_queue_empty(&q->gso_skb))) {
 		spinlock_t *lock = NULL;
-- 
2.7.4



* Re: [Linuxarm] [PATCH net] net: sched: fix packet stuck problem for lockless qdisc
  2021-03-18  6:53 [PATCH net] net: sched: fix packet stuck problem for lockless qdisc Yunsheng Lin
@ 2021-03-19  9:25 ` Yunsheng Lin
  2021-03-19 19:45   ` Cong Wang
  2021-03-19 19:40 ` Cong Wang
  2021-03-22  9:09 ` [RFC v3] net: sched: implement TCQ_F_CAN_BYPASS " Yunsheng Lin
  2 siblings, 1 reply; 11+ messages in thread
From: Yunsheng Lin @ 2021-03-19  9:25 UTC (permalink / raw)
  To: davem, kuba
  Cc: olteanv, ast, daniel, andriin, edumazet, weiwan, cong.wang,
	ap420073, netdev, linux-kernel, linuxarm, mkl, linux-can, jhs,
	xiyou.wangcong, jiri, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, bpf, jonas.bonn, pabeni, mzhivich,
	johunt, albcamus, kehuan.feng, Ahmad Fatoum

On 2021/3/18 14:53, Yunsheng Lin wrote:
> Lockless qdisc has the following concurrency problem:
>         cpu0                  cpu1
>           .                     .
>      q->enqueue                 .
>           .                     .
>    qdisc_run_begin()            .
>           .                     .
>      dequeue_skb()              .
>           .                     .
>    sch_direct_xmit()            .
>           .                     .
>           .                q->enqueue
>           .             qdisc_run_begin()
>           .            return and do nothing
>           .                     .
> qdisc_run_end()                 .
> 
> cpu1 enqueues an skb without calling __qdisc_run() because cpu0
> has not released the lock yet and spin_trylock() returns false
> for cpu1 in qdisc_run_begin(), and cpu0 does not see the skb
> enqueued by cpu1 when calling dequeue_skb(), because cpu1 may
> enqueue the skb after cpu0 calls dequeue_skb() and before
> cpu0 calls qdisc_run_end().
> 
> Lockless qdisc has another concurrency problem when tx_action
> is involved:
> 
> cpu0(serving tx_action)     cpu1             cpu2
>           .                   .                .
>           .              q->enqueue            .
>           .            qdisc_run_begin()       .
>           .              dequeue_skb()         .
>           .                   .            q->enqueue
>           .                   .                .
>           .             sch_direct_xmit()      .
>           .                   .         qdisc_run_begin()
>           .                   .       return and do nothing
>           .                   .                .
> clear __QDISC_STATE_SCHED     .                .
>     qdisc_run_begin()         .                .
> return and do nothing         .                .
>           .                   .                .
>           .           qdisc_run_end()
> 
> This patch fixes the above data races by:
> 1. Setting a flag after spin_trylock() returns false.
> 2. Retrying spin_trylock() in case the other CPU may not see the
>    new flag after it releases the lock.
> 3. Rescheduling if the flag is set after the lock is released
>    at the end of qdisc_run_end().
> 
> For the tx_action case, the flag is also set when cpu1 is at the
> end of qdisc_run_begin(), so tx_action will be rescheduled
> again to dequeue the skb enqueued by cpu2.
> 
> Also clear the flag before dequeuing in order to reduce the
> overhead of the above process, and avoid doing the heavy
> test_and_clear_bit() at the end of qdisc_run_end().
> 
> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> For those who have not been following the qdisc scheduling
> discussion: there is a packet stuck problem for lockless qdisc,
> see [1], and I have done some cleanup and added some enhanced
> features too, see [2] [3].
> While I was doing the optimization for lockless qdisc, it
> occurred to me that these optimizations are useless if there is
> still a basic bug in lockless qdisc, even if the bug is not easily
> reproducible. So looking through [1] again, I found the data
> race for tx_action mentioned by Michael, thought deeply about
> it, and came up with this patch trying to fix it.
> 
> So I would really appreciate it if someone who still has the
> reproducer could try this patch and report back.

I have done some performance tests to see if there is value in
fixing the packet stuck problem and supporting lockless qdisc bypass.
Here are some results using pktgen in 'queue_xmit' mode on a dummy
device, as Paolo Abeni had done in [1], using the pfifo_fast qdisc:

threads	 vanilla    locked-qdisc    vanilla+this_patch
   1     2.6Mpps      2.9Mpps            2.5Mpps
   2     3.9Mpps      4.8Mpps            3.6Mpps
   4     5.6Mpps      3.0Mpps            4.7Mpps
   8     2.7Mpps      1.6Mpps            2.8Mpps
   16    2.2Mpps      1.3Mpps            2.3Mpps

locked-qdisc: tested by removing "TCQ_F_NOLOCK | TCQ_F_CPUSTATS".

And adding the lockless qdisc bypass patch and other optimizations
on top of this patch:

threads   patch_set_1   patch_set_2     patch_set_3
   1       2.5Mpps        3.0Mpps         3.0Mpps
   2       3.6Mpps        4.1Mpps         5.3Mpps
   4       4.7Mpps        4.6Mpps         5.1Mpps
   8       2.8Mpps        2.6Mpps         2.7Mpps
   16      2.3Mpps        2.2Mpps         2.2Mpps

patch_set_1: vanilla + this_patch
patch_set_2: vanilla + this_patch + lockless_qdisc_bypass_patch
patch_set_3: vanilla + this_patch + lockless_qdisc_bypass_patch +
             remove_seq_operation_for_lockless_qdisc_optimization +
             check_rc_before_calling_qdisc_run()_optimization +
             spin_trylock()_retry_optimization.

So with all the fixes and optimizations added together, the lockless
qdisc has better performance than vanilla except for the 4-thread case,
which shows about 9% performance degradation compared with the vanilla
one, but is still better than the locked qdisc.


> 
> 1. https://lore.kernel.org/netdev/d102074f-7489-e35a-98cf-e2cad7efd8a2@netrounds.com/t/#ma7013a79b8c4d8e7c49015c724e481e6d5325b32
> 2. https://patchwork.kernel.org/project/netdevbpf/patch/1615777818-13969-1-git-send-email-linyunsheng@huawei.com/
> 3. https://patchwork.kernel.org/project/netdevbpf/patch/1615800610-34700-1-git-send-email-linyunsheng@huawei.com/
> 



* Re: [PATCH net] net: sched: fix packet stuck problem for lockless qdisc
  2021-03-18  6:53 [PATCH net] net: sched: fix packet stuck problem for lockless qdisc Yunsheng Lin
  2021-03-19  9:25 ` [Linuxarm] " Yunsheng Lin
@ 2021-03-19 19:40 ` Cong Wang
  2021-03-22  1:31   ` Yunsheng Lin
  2021-03-22  9:09 ` [RFC v3] net: sched: implement TCQ_F_CAN_BYPASS " Yunsheng Lin
  2 siblings, 1 reply; 11+ messages in thread
From: Cong Wang @ 2021-03-19 19:40 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: David Miller, Jakub Kicinski, Vladimir Oltean,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang .,
	Taehee Yoo, Linux Kernel Network Developers, LKML, linuxarm,
	Marc Kleine-Budde, linux-can, Jamal Hadi Salim, Jiri Pirko,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, kpsingh, bpf, Jonas Bonn, Paolo Abeni,
	Michael Zhivich, Josh Hunt, Jike Song, Kehuan Feng

On Wed, Mar 17, 2021 at 11:52 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> Lockless qdisc has the following concurrency problem:
>         cpu0                  cpu1
>           .                     .
>      q->enqueue                 .
>           .                     .
>    qdisc_run_begin()            .
>           .                     .
>      dequeue_skb()              .
>           .                     .
>    sch_direct_xmit()            .
>           .                     .
>           .                q->enqueue
>           .             qdisc_run_begin()
>           .            return and do nothing
>           .                     .
> qdisc_run_end()                 .
>
> cpu1 enqueues an skb without calling __qdisc_run() because cpu0
> has not released the lock yet and spin_trylock() returns false
> for cpu1 in qdisc_run_begin(), and cpu0 does not see the skb
> enqueued by cpu1 when calling dequeue_skb(), because cpu1 may
> enqueue the skb after cpu0 calls dequeue_skb() and before
> cpu0 calls qdisc_run_end().
>
> Lockless qdisc has another concurrency problem when tx_action
> is involved:
>
> cpu0(serving tx_action)     cpu1             cpu2
>           .                   .                .
>           .              q->enqueue            .
>           .            qdisc_run_begin()       .
>           .              dequeue_skb()         .
>           .                   .            q->enqueue
>           .                   .                .
>           .             sch_direct_xmit()      .
>           .                   .         qdisc_run_begin()
>           .                   .       return and do nothing
>           .                   .                .
> clear __QDISC_STATE_SCHED     .                .
>     qdisc_run_begin()         .                .
> return and do nothing         .                .
>           .                   .                .
>           .           qdisc_run_end()
>
> This patch fixes the above data races by:
> 1. Setting a flag after spin_trylock() returns false.
> 2. Retrying spin_trylock() in case the other CPU may not see the
>    new flag after it releases the lock.
> 3. Rescheduling if the flag is set after the lock is released
>    at the end of qdisc_run_end().
>
> For the tx_action case, the flag is also set when cpu1 is at the
> end of qdisc_run_begin(), so tx_action will be rescheduled
> again to dequeue the skb enqueued by cpu2.
>
> Also clear the flag before dequeuing in order to reduce the
> overhead of the above process, and avoid doing the heavy
> test_and_clear_bit() at the end of qdisc_run_end().
>
> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> For those who have not been following the qdisc scheduling
> discussion: there is a packet stuck problem for lockless qdisc,
> see [1], and I have done some cleanup and added some enhanced
> features too, see [2] [3].
> While I was doing the optimization for lockless qdisc, it
> occurred to me that these optimizations are useless if there is
> still a basic bug in lockless qdisc, even if the bug is not easily
> reproducible. So looking through [1] again, I found the data
> race for tx_action mentioned by Michael, thought deeply about
> it, and came up with this patch trying to fix it.
>
> So I would really appreciate it if someone who still has the
> reproducer could try this patch and report back.
>
> 1. https://lore.kernel.org/netdev/d102074f-7489-e35a-98cf-e2cad7efd8a2@netrounds.com/t/#ma7013a79b8c4d8e7c49015c724e481e6d5325b32
> 2. https://patchwork.kernel.org/project/netdevbpf/patch/1615777818-13969-1-git-send-email-linyunsheng@huawei.com/
> 3. https://patchwork.kernel.org/project/netdevbpf/patch/1615800610-34700-1-git-send-email-linyunsheng@huawei.com/
> ---
>  include/net/sch_generic.h | 23 ++++++++++++++++++++---
>  net/sched/sch_generic.c   |  1 +
>  2 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index f7a6e14..4220eab 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -36,6 +36,7 @@ struct qdisc_rate_table {
>  enum qdisc_state_t {
>         __QDISC_STATE_SCHED,
>         __QDISC_STATE_DEACTIVATED,
> +       __QDISC_STATE_NEED_RESCHEDULE,
>  };
>
>  struct qdisc_size_table {
> @@ -159,8 +160,17 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  {
>         if (qdisc->flags & TCQ_F_NOLOCK) {
> -               if (!spin_trylock(&qdisc->seqlock))
> -                       return false;
> +               if (!spin_trylock(&qdisc->seqlock)) {
> +                       set_bit(__QDISC_STATE_NEED_RESCHEDULE,
> +                               &qdisc->state);

Why do we need another bit? I mean why not just call __netif_schedule()?

> +
> +                       /* Retry again in case other CPU may not see the
> +                        * new flags after it releases the lock at the
> +                        * end of qdisc_run_end().
> +                        */
> +                       if (!spin_trylock(&qdisc->seqlock))
> +                               return false;
> +               }
>                 WRITE_ONCE(qdisc->empty, false);
>         } else if (qdisc_is_running(qdisc)) {
>                 return false;
> @@ -176,8 +186,15 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  static inline void qdisc_run_end(struct Qdisc *qdisc)
>  {
>         write_seqcount_end(&qdisc->running);
> -       if (qdisc->flags & TCQ_F_NOLOCK)
> +       if (qdisc->flags & TCQ_F_NOLOCK) {
>                 spin_unlock(&qdisc->seqlock);
> +
> +               if (unlikely(test_bit(__QDISC_STATE_NEED_RESCHEDULE,
> +                                     &qdisc->state) &&
> +                            !test_bit(__QDISC_STATE_DEACTIVATED,
> +                                      &qdisc->state)))

Testing two bits one by one is not atomic...
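
A single READ_ONCE() snapshot of qdisc->state would at least make
the two checks consistent with each other, e.g. (untested sketch):

	unsigned long state = READ_ONCE(qdisc->state);

	if (unlikely((state & BIT(__QDISC_STATE_NEED_RESCHEDULE)) &&
		     !(state & BIT(__QDISC_STATE_DEACTIVATED))))
		__netif_schedule(qdisc);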


> +                       __netif_schedule(qdisc);
> +       }
>  }
>
>  static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 44991ea..25d75d8 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -205,6 +205,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
>         const struct netdev_queue *txq = q->dev_queue;
>         struct sk_buff *skb = NULL;
>
> +       clear_bit(__QDISC_STATE_NEED_RESCHEDULE, &q->state);
>         *packets = 1;
>         if (unlikely(!skb_queue_empty(&q->gso_skb))) {
>                 spinlock_t *lock = NULL;
> --
> 2.7.4
>


* Re: [Linuxarm] [PATCH net] net: sched: fix packet stuck problem for lockless qdisc
  2021-03-19  9:25 ` [Linuxarm] " Yunsheng Lin
@ 2021-03-19 19:45   ` Cong Wang
  2021-03-22  1:37     ` Yunsheng Lin
  0 siblings, 1 reply; 11+ messages in thread
From: Cong Wang @ 2021-03-19 19:45 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: David Miller, Jakub Kicinski, Vladimir Oltean,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang .,
	Taehee Yoo, Linux Kernel Network Developers, LKML, linuxarm,
	Marc Kleine-Budde, linux-can, Jamal Hadi Salim, Jiri Pirko,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, kpsingh, bpf, Jonas Bonn, Paolo Abeni,
	Michael Zhivich, Josh Hunt, Jike Song, Kehuan Feng, Ahmad Fatoum

On Fri, Mar 19, 2021 at 2:25 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> I have done some performance tests to see if there is value in
> fixing the packet stuck problem and supporting lockless qdisc bypass.
> Here are some results using pktgen in 'queue_xmit' mode on a dummy
> device, as Paolo Abeni had done in [1], using the pfifo_fast qdisc:
>
> threads  vanilla    locked-qdisc    vanilla+this_patch
>    1     2.6Mpps      2.9Mpps            2.5Mpps
>    2     3.9Mpps      4.8Mpps            3.6Mpps
>    4     5.6Mpps      3.0Mpps            4.7Mpps
>    8     2.7Mpps      1.6Mpps            2.8Mpps
>    16    2.2Mpps      1.3Mpps            2.3Mpps
>
> locked-qdisc: tested by removing "TCQ_F_NOLOCK | TCQ_F_CPUSTATS".

I read this as: this patch somehow introduces a performance
regression for -net, while the lockless bypass patch you submitted
is for -net-next.

Thanks.


* Re: [PATCH net] net: sched: fix packet stuck problem for lockless qdisc
  2021-03-19 19:40 ` Cong Wang
@ 2021-03-22  1:31   ` Yunsheng Lin
  0 siblings, 0 replies; 11+ messages in thread
From: Yunsheng Lin @ 2021-03-22  1:31 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Miller, Jakub Kicinski, Vladimir Oltean,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang .,
	Taehee Yoo, Linux Kernel Network Developers, LKML, linuxarm,
	Marc Kleine-Budde, linux-can, Jamal Hadi Salim, Jiri Pirko,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, kpsingh, bpf, Jonas Bonn, Paolo Abeni,
	Michael Zhivich, Josh Hunt, Jike Song, Kehuan Feng

On 2021/3/20 3:40, Cong Wang wrote:
> On Wed, Mar 17, 2021 at 11:52 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> Lockless qdisc has the following concurrency problem:
>>         cpu0                  cpu1
>>           .                     .
>>      q->enqueue                 .
>>           .                     .
>>    qdisc_run_begin()            .
>>           .                     .
>>      dequeue_skb()              .
>>           .                     .
>>    sch_direct_xmit()            .
>>           .                     .
>>           .                q->enqueue
>>           .             qdisc_run_begin()
>>           .            return and do nothing
>>           .                     .
>> qdisc_run_end()                 .
>>
>> cpu1 enqueues an skb without calling __qdisc_run() because cpu0
>> has not released the lock yet and spin_trylock() returns false
>> for cpu1 in qdisc_run_begin(), and cpu0 does not see the skb
>> enqueued by cpu1 when calling dequeue_skb(), because cpu1 may
>> enqueue the skb after cpu0 calls dequeue_skb() and before
>> cpu0 calls qdisc_run_end().
>>
>> Lockless qdisc has another concurrency problem when tx_action
>> is involved:
>>
>> cpu0(serving tx_action)     cpu1             cpu2
>>           .                   .                .
>>           .              q->enqueue            .
>>           .            qdisc_run_begin()       .
>>           .              dequeue_skb()         .
>>           .                   .            q->enqueue
>>           .                   .                .
>>           .             sch_direct_xmit()      .
>>           .                   .         qdisc_run_begin()
>>           .                   .       return and do nothing
>>           .                   .                .
>> clear __QDISC_STATE_SCHED     .                .
>>     qdisc_run_begin()         .                .
>> return and do nothing         .                .
>>           .                   .                .
>>           .           qdisc_run_end()
>>
>> This patch fixes the above data races by:
>> 1. Setting a flag after spin_trylock() returns false.
>> 2. Retrying spin_trylock() in case the other CPU may not see the
>>    new flag after it releases the lock.
>> 3. Rescheduling if the flag is set after the lock is released
>>    at the end of qdisc_run_end().
>>
>> For the tx_action case, the flag is also set when cpu1 is at the
>> end of qdisc_run_begin(), so tx_action will be rescheduled
>> again to dequeue the skb enqueued by cpu2.
>>
>> Also clear the flag before dequeuing in order to reduce the
>> overhead of the above process, and avoid doing the heavy
>> test_and_clear_bit() at the end of qdisc_run_end().
>>
>> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> ---
>> For those who have not been following the qdisc scheduling
>> discussion: there is a packet stuck problem for lockless qdisc,
>> see [1], and I have done some cleanup and added some enhanced
>> features too, see [2] [3].
>> While I was doing the optimization for lockless qdisc, it
>> occurred to me that these optimizations are useless if there is
>> still a basic bug in lockless qdisc, even if the bug is not easily
>> reproducible. So looking through [1] again, I found the data
>> race for tx_action mentioned by Michael, thought deeply about
>> it, and came up with this patch trying to fix it.
>>
>> So I would really appreciate it if someone who still has the
>> reproducer could try this patch and report back.
>>
>> 1. https://lore.kernel.org/netdev/d102074f-7489-e35a-98cf-e2cad7efd8a2@netrounds.com/t/#ma7013a79b8c4d8e7c49015c724e481e6d5325b32
>> 2. https://patchwork.kernel.org/project/netdevbpf/patch/1615777818-13969-1-git-send-email-linyunsheng@huawei.com/
>> 3. https://patchwork.kernel.org/project/netdevbpf/patch/1615800610-34700-1-git-send-email-linyunsheng@huawei.com/
>> ---
>>  include/net/sch_generic.h | 23 ++++++++++++++++++++---
>>  net/sched/sch_generic.c   |  1 +
>>  2 files changed, 21 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> index f7a6e14..4220eab 100644
>> --- a/include/net/sch_generic.h
>> +++ b/include/net/sch_generic.h
>> @@ -36,6 +36,7 @@ struct qdisc_rate_table {
>>  enum qdisc_state_t {
>>         __QDISC_STATE_SCHED,
>>         __QDISC_STATE_DEACTIVATED,
>> +       __QDISC_STATE_NEED_RESCHEDULE,
>>  };
>>
>>  struct qdisc_size_table {
>> @@ -159,8 +160,17 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
>>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>>  {
>>         if (qdisc->flags & TCQ_F_NOLOCK) {
>> -               if (!spin_trylock(&qdisc->seqlock))
>> -                       return false;
>> +               if (!spin_trylock(&qdisc->seqlock)) {
>> +                       set_bit(__QDISC_STATE_NEED_RESCHEDULE,
>> +                               &qdisc->state);
> 
> Why do we need another bit? I mean why not just call __netif_schedule()?

I think that was your proposal in [1]; the only difference is that
this also handles the tx_action case when __netif_schedule() is
called here.

So yes, it can also fix the two data races described in this patch,
but with a bigger performance degradation. Quoting the performance
data in [1]:

pktgen threads	vanilla		patched		delta
nr		kpps		kpps		%

1		3240		3240		0
2		3910		2710		-30.5
4		5140		4920		-4


performance data with this patch:

threads  vanilla       vanilla+this_patch       delta
   1     2.6Mpps            2.5Mpps              -3%
   2     3.9Mpps            3.6Mpps              -7%
   4     5.6Mpps            4.7Mpps             -16%


So performance is the reason why I do not call __netif_schedule()
directly here.
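
For clarity, my understanding of that alternative is to reschedule
unconditionally on a failed trylock, roughly like the below (a sketch
of the idea, not the exact patch from [1]):

	if (qdisc->flags & TCQ_F_NOLOCK) {
		if (!spin_trylock(&qdisc->seqlock)) {
			/* always defer to the tx softirq on contention */
			__netif_schedule(qdisc);
			return false;
		}
		WRITE_ONCE(qdisc->empty, false);
	}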

1. https://lore.kernel.org/netdev/d102074f-7489-e35a-98cf-e2cad7efd8a2@netrounds.com/t/#md927651488ce4d226f6279aad6699b4bee4674a3

> 
>> +
>> +                       /* Retry again in case other CPU may not see the
>> +                        * new flags after it releases the lock at the
>> +                        * end of qdisc_run_end().
>> +                        */
>> +                       if (!spin_trylock(&qdisc->seqlock))
>> +                               return false;
>> +               }
>>                 WRITE_ONCE(qdisc->empty, false);
>>         } else if (qdisc_is_running(qdisc)) {
>>                 return false;
>> @@ -176,8 +186,15 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>>  static inline void qdisc_run_end(struct Qdisc *qdisc)
>>  {
>>         write_seqcount_end(&qdisc->running);
>> -       if (qdisc->flags & TCQ_F_NOLOCK)
>> +       if (qdisc->flags & TCQ_F_NOLOCK) {
>>                 spin_unlock(&qdisc->seqlock);
>> +
>> +               if (unlikely(test_bit(__QDISC_STATE_NEED_RESCHEDULE,
>> +                                     &qdisc->state) &&
>> +                            !test_bit(__QDISC_STATE_DEACTIVATED,
>> +                                      &qdisc->state)))
> 
> Testing two bits one by one is not atomic...

For the non-tx_action case, it actually is atomic, because the above
testing of the two bits is within RCU protection, and qdisc reset
will do a synchronize_net() after setting __QDISC_STATE_DEACTIVATED.

For the tx_action case, I think we need explicit RCU protection in
net_tx_action() too, at least for PREEMPT_RCU:

https://stackoverflow.com/questions/21287932/is-it-necessary-invoke-rcu-read-lock-in-softirq-context
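
Something like the below, where the net_tx_action() loop body is
paraphrased and only the rcu_read_lock()/rcu_read_unlock() pair is
new (untested sketch):

	if (!(q->flags & TCQ_F_NOLOCK)) {
		root_lock = qdisc_lock(q);
		spin_lock(root_lock);
	}

	/* Pairs with the synchronize_net() in the qdisc reset path:
	 * while we are in this read-side section a concurrent reset
	 * cannot complete, so the state bit tests in qdisc_run_end()
	 * are safe.
	 */
	rcu_read_lock();
	smp_mb__before_atomic();
	clear_bit(__QDISC_STATE_SCHED, &q->state);
	qdisc_run(q);
	rcu_read_unlock();

	if (root_lock)
		spin_unlock(root_lock);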

> 
> 
>> +                       __netif_schedule(qdisc);
>> +       }
>>  }
>>
>>  static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 44991ea..25d75d8 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -205,6 +205,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
>>         const struct netdev_queue *txq = q->dev_queue;
>>         struct sk_buff *skb = NULL;
>>
>> +       clear_bit(__QDISC_STATE_NEED_RESCHEDULE, &q->state);
>>         *packets = 1;
>>         if (unlikely(!skb_queue_empty(&q->gso_skb))) {
>>                 spinlock_t *lock = NULL;
>> --
>> 2.7.4
>>
> 
> .
> 



* Re: [Linuxarm] [PATCH net] net: sched: fix packet stuck problem for lockless qdisc
  2021-03-19 19:45   ` Cong Wang
@ 2021-03-22  1:37     ` Yunsheng Lin
  0 siblings, 0 replies; 11+ messages in thread
From: Yunsheng Lin @ 2021-03-22  1:37 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Miller, Jakub Kicinski, Vladimir Oltean,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang .,
	Taehee Yoo, Linux Kernel Network Developers, LKML, linuxarm,
	Marc Kleine-Budde, linux-can, Jamal Hadi Salim, Jiri Pirko,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, kpsingh, bpf, Jonas Bonn, Paolo Abeni,
	Michael Zhivich, Josh Hunt, Jike Song, Kehuan Feng, Ahmad Fatoum

On 2021/3/20 3:45, Cong Wang wrote:
> On Fri, Mar 19, 2021 at 2:25 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>> I have done some performance tests to see if there is value in
>> fixing the packet stuck problem and supporting lockless qdisc bypass.
>> Here are some results using pktgen in 'queue_xmit' mode on a dummy
>> device, as Paolo Abeni had done in [1], using the pfifo_fast qdisc:
>>
>> threads  vanilla    locked-qdisc    vanilla+this_patch
>>    1     2.6Mpps      2.9Mpps            2.5Mpps
>>    2     3.9Mpps      4.8Mpps            3.6Mpps
>>    4     5.6Mpps      3.0Mpps            4.7Mpps
>>    8     2.7Mpps      1.6Mpps            2.8Mpps
>>    16    2.2Mpps      1.3Mpps            2.3Mpps
>>
>> locked-qdisc: tested by removing "TCQ_F_NOLOCK | TCQ_F_CPUSTATS".
> 
> I read this as: this patch somehow introduces a performance
> regression for -net, while the lockless bypass patch you submitted
> is for -net-next.

Yes, right now there is a performance regression from fixing this bug.
But if we cannot fix the above data race without any performance
regression, would you prefer to send this patch to -net, or to
-net-next together with the lockless bypass patch?

Any ideas on how to fix this with less performance regression?

> 
> Thanks.
> 
> .
> 



* [RFC v3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc
  2021-03-18  6:53 [PATCH net] net: sched: fix packet stuck problem for lockless qdisc Yunsheng Lin
  2021-03-19  9:25 ` [Linuxarm] " Yunsheng Lin
  2021-03-19 19:40 ` Cong Wang
@ 2021-03-22  9:09 ` Yunsheng Lin
  2021-03-22 20:00   ` Vladimir Oltean
  2021-03-23  6:37   ` Ahmad Fatoum
  2 siblings, 2 replies; 11+ messages in thread
From: Yunsheng Lin @ 2021-03-22  9:09 UTC (permalink / raw)
  To: davem, kuba
  Cc: olteanv, ast, daniel, andriin, edumazet, weiwan, cong.wang,
	ap420073, netdev, linux-kernel, linuxarm, mkl, linux-can, jhs,
	xiyou.wangcong, jiri, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, bpf, jonas.bonn, pabeni, mzhivich,
	johunt, albcamus, kehuan.feng, a.fatoum

Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
flags set, but queue discipline bypass does not work for lockless
qdisc because the skb is always enqueued to the qdisc even when the
qdisc is empty, see __dev_xmit_skb().

This patch calls sch_direct_xmit() to transmit the skb directly
to the driver for an empty lockless qdisc too, which avoids the
enqueue and dequeue operations. qdisc->empty is set to false whenever
an skb is enqueued, see pfifo_fast_enqueue(), and is set to true when
skb dequeuing returns NULL, see pfifo_fast_dequeue().
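
For context, the dequeue side maintains the hint roughly like the
below, paraphrasing the tail of pfifo_fast_dequeue() (the hint is
best-effort by design):

	if (likely(skb))
		qdisc_update_stats_at_dequeue(qdisc, skb);
	else
		WRITE_ONCE(qdisc->empty, true);

	return skb;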

There is a data race between enqueue/dequeue and the qdisc->empty
setting, so qdisc->empty is only used as a hint, and we need to call
sch_may_need_requeuing() to see if the queue is really empty and
whether there is a requeued skb, which has higher priority than the
current skb.

The performance of the ip_forward test increases by about 10% with
this patch.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
Hi Vladimir and Ahmad,
	Please give this a test to see if there are any out-of-order
packets with this patch, which has removed the priv->lock added in
RFC v2.

There is a data race as below:

      CPU1                                   CPU2
qdisc_run_begin(q)                            .
        .                                q->enqueue()
sch_may_need_requeuing()                      .
    return true                               .
        .                                     .
        .                                     .
    q->enqueue()                              .

When the above happens, the skb enqueued by CPU1 is dequeued after the
skb enqueued by CPU2, because sch_may_need_requeuing() returns true.
Without qdisc bypass, CPU1 would have a better chance of queueing
its skb before CPU2's.

This patch does not take care of the above data race, because I
view it as similar to the below:

Even if CPU1 and CPU2 write skbs at the same time to two sockets
that are both heading to the same qdisc, there is no guarantee as to
which skb will hit the qdisc first, because there are a lot of
factors like interrupts/softirqs/cache misses/scheduling affecting
that.

So I hope the above data race will not cause problems for Vladimir
and Ahmad.
---
 include/net/pkt_sched.h   |  1 +
 include/net/sch_generic.h |  1 -
 net/core/dev.c            | 22 ++++++++++++++++++++++
 net/sched/sch_generic.c   | 11 +++++++++++
 4 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index f5c1bee..5715ddf 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -122,6 +122,7 @@ void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc);
 bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 		     struct net_device *dev, struct netdev_queue *txq,
 		     spinlock_t *root_lock, bool validate);
+bool sch_may_need_requeuing(struct Qdisc *q);
 
 void __qdisc_run(struct Qdisc *q);
 
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f7a6e14..e08cc77 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -161,7 +161,6 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 	if (qdisc->flags & TCQ_F_NOLOCK) {
 		if (!spin_trylock(&qdisc->seqlock))
 			return false;
-		WRITE_ONCE(qdisc->empty, false);
 	} else if (qdisc_is_running(qdisc)) {
 		return false;
 	}
diff --git a/net/core/dev.c b/net/core/dev.c
index be941ed..317180a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3796,9 +3796,31 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	qdisc_calculate_pkt_len(skb, q);
 
 	if (q->flags & TCQ_F_NOLOCK) {
+		if (q->flags & TCQ_F_CAN_BYPASS && READ_ONCE(q->empty) &&
+		    qdisc_run_begin(q)) {
+			if (sch_may_need_requeuing(q)) {
+				rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
+				__qdisc_run(q);
+				qdisc_run_end(q);
+
+				goto no_lock_out;
+			}
+
+			qdisc_bstats_cpu_update(q, skb);
+
+			if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
+			    !READ_ONCE(q->empty))
+				__qdisc_run(q);
+
+			qdisc_run_end(q);
+			return NET_XMIT_SUCCESS;
+		}
+
 		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
+		WRITE_ONCE(q->empty, false);
 		qdisc_run(q);
 
+no_lock_out:
 		if (unlikely(to_free))
 			kfree_skb_list(to_free);
 		return rc;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 44991ea..2145fdad 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -146,6 +146,8 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 	}
 	if (lock)
 		spin_unlock(lock);
+
+	WRITE_ONCE(q->empty, false);
 	__netif_schedule(q);
 }
 
@@ -273,6 +275,15 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 	return skb;
 }
 
+bool sch_may_need_requeuing(struct Qdisc *q)
+{
+	if (likely(skb_queue_empty(&q->gso_skb) &&
+		   !q->ops->peek(q)))
+		return false;
+
+	return true;
+}
+
 /*
  * Transmit possibly several skbs, and handle the return status as
  * required. Owning running seqcount bit guarantees that
-- 
2.7.4



* Re: [RFC v3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc
  2021-03-22  9:09 ` [RFC v3] net: sched: implement TCQ_F_CAN_BYPASS " Yunsheng Lin
@ 2021-03-22 20:00   ` Vladimir Oltean
  2021-03-23 11:39     ` Vladimir Oltean
  2021-03-23  6:37   ` Ahmad Fatoum
  1 sibling, 1 reply; 11+ messages in thread
From: Vladimir Oltean @ 2021-03-22 20:00 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, ast, daniel, andriin, edumazet, weiwan, cong.wang,
	ap420073, netdev, linux-kernel, linuxarm, mkl, linux-can, jhs,
	xiyou.wangcong, jiri, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, bpf, jonas.bonn, pabeni, mzhivich,
	johunt, albcamus, kehuan.feng, a.fatoum

Hi Yunsheng,

On Mon, Mar 22, 2021 at 05:09:16PM +0800, Yunsheng Lin wrote:
> Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
> flags set, but queue discipline bypass does not work for lockless
> qdisc because the skb is always enqueued to the qdisc even when the
> qdisc is empty, see __dev_xmit_skb().
> 
> This patch calls sch_direct_xmit() to transmit the skb directly
> to the driver for an empty lockless qdisc too, which avoids the
> enqueue and dequeue operations. qdisc->empty is set to false whenever
> an skb is enqueued, see pfifo_fast_enqueue(), and is set to true when
> skb dequeuing returns NULL, see pfifo_fast_dequeue().
> 
> There is a data race between enqueue/dequeue and the qdisc->empty
> setting, so qdisc->empty is only used as a hint, and we need to call
> sch_may_need_requeuing() to see if the queue is really empty and
> whether there is a requeued skb, which has higher priority than the
> current skb.
> 
> The performance of the ip_forward test increases by about 10% with
> this patch.
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> Hi Vladimir and Ahmad,
> 	Please give this a test to see if there are any out-of-order
> packets with this patch, which has removed the priv->lock added in
> RFC v2.
> 
> There is a data race as below:
> 
>       CPU1                                   CPU2
> qdisc_run_begin(q)                            .
>         .                                q->enqueue()
> sch_may_need_requeuing()                      .
>     return true                               .
>         .                                     .
>         .                                     .
>     q->enqueue()                              .
> 
> When the above happens, the skb enqueued by CPU1 is dequeued after the
> skb enqueued by CPU2, because sch_may_need_requeuing() returns true.
> Without qdisc bypass, CPU1 would have a better chance of queueing
> its skb before CPU2's.
> 
> This patch does not take care of the above data race, because I
> view it as similar to the below:
> 
> Even if CPU1 and CPU2 write skbs at the same time to two sockets
> that are both heading to the same qdisc, there is no guarantee as to
> which skb will hit the qdisc first, because there are a lot of
> factors like interrupts/softirqs/cache misses/scheduling affecting
> that.
> 
> So I hope the above data race will not cause problems for Vladimir
> and Ahmad.
> ---

Preliminary results on my test setup look fine, but please allow me to
run the canfdtest overnight, since as you say, races are still
theoretically possible.


* Re: [RFC v3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc
  2021-03-22  9:09 ` [RFC v3] net: sched: implement TCQ_F_CAN_BYPASS " Yunsheng Lin
  2021-03-22 20:00   ` Vladimir Oltean
@ 2021-03-23  6:37   ` Ahmad Fatoum
  2021-03-23 11:34     ` Yunsheng Lin
  1 sibling, 1 reply; 11+ messages in thread
From: Ahmad Fatoum @ 2021-03-23  6:37 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba
  Cc: olteanv, ast, daniel, andriin, edumazet, weiwan, cong.wang,
	ap420073, netdev, linux-kernel, linuxarm, mkl, linux-can, jhs,
	xiyou.wangcong, jiri, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, bpf, jonas.bonn, pabeni, mzhivich,
	johunt, albcamus, kehuan.feng

Hi,

On 22.03.21 10:09, Yunsheng Lin wrote:
> Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
> flags set, but queue discipline bypass does not work for lockless
> qdisc because the skb is always enqueued to the qdisc even when the
> qdisc is empty, see __dev_xmit_skb().
> 
> This patch calls sch_direct_xmit() to transmit the skb directly
> to the driver for an empty lockless qdisc too, which avoids the
> enqueue and dequeue operations. qdisc->empty is set to false whenever
> an skb is enqueued, see pfifo_fast_enqueue(), and is set to true when
> skb dequeuing returns NULL, see pfifo_fast_dequeue().
> 
> There is a data race between enqueue/dequeue and the qdisc->empty
> setting, so qdisc->empty is only used as a hint, and we need to call
> sch_may_need_requeuing() to see if the queue is really empty and
> whether there is a requeued skb, which has higher priority than the
> current skb.
> 
> The performance of the ip_forward test increases by about 10% with
> this patch.
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> Hi Vladimir and Ahmad,
> 	Please give this a test to see if there are any out-of-order
> packets with this patch, which has removed the priv->lock added in
> RFC v2.

Overnight test (10h, 64 mil frames) didn't see any out-of-order frames
between 2 FlexCANs on a dual core machine:

Tested-by: Ahmad Fatoum <a.fatoum@pengutronix.de>

No performance measurements taken.

> 
> There is a data race as below:
> 
>       CPU1                                   CPU2
> qdisc_run_begin(q)                            .
>         .                                q->enqueue()
> sch_may_need_requeuing()                      .
>     return true                               .
>         .                                     .
>         .                                     .
>     q->enqueue()                              .
> 
> When the above happens, the skb enqueued by CPU1 is dequeued after the
> skb enqueued by CPU2, because sch_may_need_requeuing() returns true.
> Without qdisc bypass, CPU1 would have a better chance of queueing
> its skb before CPU2's.
> 
> This patch does not take care of the above data race, because I
> view it as similar to the below:
> 
> Even if CPU1 and CPU2 write skbs at the same time to two sockets
> that are both heading to the same qdisc, there is no guarantee as to
> which skb will hit the qdisc first, because there are a lot of
> factors like interrupts/softirqs/cache misses/scheduling affecting
> that.
> 
> So I hope the above data race will not cause problems for Vladimir
> and Ahmad.
> ---
>  include/net/pkt_sched.h   |  1 +
>  include/net/sch_generic.h |  1 -
>  net/core/dev.c            | 22 ++++++++++++++++++++++
>  net/sched/sch_generic.c   | 11 +++++++++++
>  4 files changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> index f5c1bee..5715ddf 100644
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -122,6 +122,7 @@ void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc);
>  bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
>  		     struct net_device *dev, struct netdev_queue *txq,
>  		     spinlock_t *root_lock, bool validate);
> +bool sch_may_need_requeuing(struct Qdisc *q);
>  
>  void __qdisc_run(struct Qdisc *q);
>  
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index f7a6e14..e08cc77 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -161,7 +161,6 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  	if (qdisc->flags & TCQ_F_NOLOCK) {
>  		if (!spin_trylock(&qdisc->seqlock))
>  			return false;
> -		WRITE_ONCE(qdisc->empty, false);
>  	} else if (qdisc_is_running(qdisc)) {
>  		return false;
>  	}
> diff --git a/net/core/dev.c b/net/core/dev.c
> index be941ed..317180a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3796,9 +3796,31 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  	qdisc_calculate_pkt_len(skb, q);
>  
>  	if (q->flags & TCQ_F_NOLOCK) {
> +		if (q->flags & TCQ_F_CAN_BYPASS && READ_ONCE(q->empty) &&
> +		    qdisc_run_begin(q)) {
> +			if (sch_may_need_requeuing(q)) {
> +				rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> +				__qdisc_run(q);
> +				qdisc_run_end(q);
> +
> +				goto no_lock_out;
> +			}
> +
> +			qdisc_bstats_cpu_update(q, skb);
> +
> +			if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
> +			    !READ_ONCE(q->empty))
> +				__qdisc_run(q);
> +
> +			qdisc_run_end(q);
> +			return NET_XMIT_SUCCESS;
> +		}
> +
>  		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> +		WRITE_ONCE(q->empty, false);
>  		qdisc_run(q);
>  
> +no_lock_out:
>  		if (unlikely(to_free))
>  			kfree_skb_list(to_free);
>  		return rc;
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 44991ea..2145fdad 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -146,6 +146,8 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>  	}
>  	if (lock)
>  		spin_unlock(lock);
> +
> +	WRITE_ONCE(q->empty, false);
>  	__netif_schedule(q);
>  }
>  
> @@ -273,6 +275,15 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
>  	return skb;
>  }
>  
> +bool sch_may_need_requeuing(struct Qdisc *q)
> +{
> +	if (likely(skb_queue_empty(&q->gso_skb) &&
> +		   !q->ops->peek(q)))
> +		return false;
> +
> +	return true;
> +}
> +
>  /*
>   * Transmit possibly several skbs, and handle the return status as
>   * required. Owning running seqcount bit guarantees that
> 

-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |


* Re: [RFC v3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc
  2021-03-23  6:37   ` Ahmad Fatoum
@ 2021-03-23 11:34     ` Yunsheng Lin
  0 siblings, 0 replies; 11+ messages in thread
From: Yunsheng Lin @ 2021-03-23 11:34 UTC (permalink / raw)
  To: Ahmad Fatoum, davem, kuba
  Cc: olteanv, ast, daniel, andriin, edumazet, weiwan, cong.wang,
	ap420073, netdev, linux-kernel, linuxarm, mkl, linux-can, jhs,
	xiyou.wangcong, jiri, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, bpf, jonas.bonn, pabeni, mzhivich,
	johunt, albcamus, kehuan.feng

On 2021/3/23 14:37, Ahmad Fatoum wrote:
> Hi,
> 
> On 22.03.21 10:09, Yunsheng Lin wrote:
>> Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
>> flags set, but queue discipline bypass does not work for lockless
>> qdisc because the skb is always enqueued to the qdisc even when the
>> qdisc is empty, see __dev_xmit_skb().
>>
>> This patch calls sch_direct_xmit() to transmit the skb directly
>> to the driver for an empty lockless qdisc too, which avoids the
>> enqueue and dequeue operations. qdisc->empty is set to false whenever
>> an skb is enqueued, see pfifo_fast_enqueue(), and is set to true when
>> skb dequeuing returns NULL, see pfifo_fast_dequeue().
>>
>> There is a data race between enqueue/dequeue and the qdisc->empty
>> setting, so qdisc->empty is only used as a hint, and we need to call
>> sch_may_need_requeuing() to see if the queue is really empty and
>> whether there is a requeued skb, which has higher priority than the
>> current skb.
>>
>> The performance of the ip_forward test increases by about 10% with
>> this patch.
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> ---
>> Hi Vladimir and Ahmad,
>> 	Please give this a test to see if there are any out-of-order
>> packets with this patch, which has removed the priv->lock added in
>> RFC v2.
> 
> Overnight test (10h, 64 mil frames) didn't see any out-of-order frames
> between 2 FlexCANs on a dual core machine:
> 
> Tested-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
> 
> No performance measurements taken.

Thanks for the testing.
And I have done the performance measurements.

The L3 forwarding test improves from 1.09Mpps to 1.21Mpps, still about
a 10% improvement.

pktgen + dummy netdev:

 threads  without+this_patch   with+this_patch      delta
    1       2.56Mpps            3.11Mpps             +21%
    2       3.76Mpps            4.31Mpps             +14%
    4       5.51Mpps            5.53Mpps             +0.3%
    8       2.81Mpps            2.72Mpps             -3%
   16       2.24Mpps            2.22Mpps             -0.8%

> 
>>




* Re: [RFC v3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc
  2021-03-22 20:00   ` Vladimir Oltean
@ 2021-03-23 11:39     ` Vladimir Oltean
  0 siblings, 0 replies; 11+ messages in thread
From: Vladimir Oltean @ 2021-03-23 11:39 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, ast, daniel, andriin, edumazet, weiwan, cong.wang,
	ap420073, netdev, linux-kernel, linuxarm, mkl, linux-can, jhs,
	xiyou.wangcong, jiri, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, bpf, jonas.bonn, pabeni, mzhivich,
	johunt, albcamus, kehuan.feng, a.fatoum

On Mon, Mar 22, 2021 at 10:00:33PM +0200, Vladimir Oltean wrote:
> Hi Yunsheng,
> 
> On Mon, Mar 22, 2021 at 05:09:16PM +0800, Yunsheng Lin wrote:
> > Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
> > flags set, but queue discipline bypass does not work for lockless
> > qdisc because the skb is always enqueued to the qdisc even when the
> > qdisc is empty, see __dev_xmit_skb().
> > 
> > This patch calls sch_direct_xmit() to transmit the skb directly
> > to the driver for an empty lockless qdisc too, which avoids the
> > enqueue and dequeue operations. qdisc->empty is set to false whenever
> > an skb is enqueued, see pfifo_fast_enqueue(), and is set to true when
> > skb dequeuing returns NULL, see pfifo_fast_dequeue().
> > 
> > There is a data race between enqueue/dequeue and the qdisc->empty
> > setting, so qdisc->empty is only used as a hint, and we need to call
> > sch_may_need_requeuing() to see if the queue is really empty and
> > whether there is a requeued skb, which has higher priority than the
> > current skb.
> > 
> > The performance of the ip_forward test increases by about 10% with
> > this patch.
> > 
> > Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> > ---
> > Hi Vladimir and Ahmad,
> > 	Please give this a test to see if there are any out-of-order
> > packets with this patch, which has removed the priv->lock added in
> > RFC v2.
> > 
> > There is a data race as below:
> > 
> >       CPU1                                   CPU2
> > qdisc_run_begin(q)                            .
> >         .                                q->enqueue()
> > sch_may_need_requeuing()                      .
> >     return true                               .
> >         .                                     .
> >         .                                     .
> >     q->enqueue()                              .
> > 
> > When the above happens, the skb enqueued by CPU1 is dequeued after the
> > skb enqueued by CPU2, because sch_may_need_requeuing() returns true.
> > Without qdisc bypass, CPU1 would have a better chance of queueing
> > its skb before CPU2's.
> > 
> > This patch does not take care of the above data race, because I
> > view it as similar to the below:
> > 
> > Even if CPU1 and CPU2 write skbs at the same time to two sockets
> > that are both heading to the same qdisc, there is no guarantee as to
> > which skb will hit the qdisc first, because there are a lot of
> > factors like interrupts/softirqs/cache misses/scheduling affecting
> > that.
> > 
> > So I hope the above data race will not cause problems for Vladimir
> > and Ahmad.
> > ---
> 
> Preliminary results on my test setup look fine, but please allow me to
> run the canfdtest overnight, since as you say, races are still
> theoretically possible.

I haven't found any issues during the overnight test or since.

Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan

