* [PATCH] rps: process the skb directly if rps cpu not changed
@ 2023-03-21 12:12 yang.yang29
  2023-03-22  2:02 ` Yunsheng Lin
  2023-03-22  2:54 ` Jakub Kicinski
  0 siblings, 2 replies; 8+ messages in thread
From: yang.yang29 @ 2023-03-21 12:12 UTC
  To: edumazet
  Cc: davem, kuba, netdev, linux-kernel, xu.xin16, jiang.xuexin, zhang.yunkai

From: xu xin <xu.xin16@zte.com.cn>

In the RPS procedure of NAPI receiving, regardless of whether the
rps-calculated CPU of the skb equals the currently processing CPU, RPS
always uses enqueue_to_backlog to enqueue the skb to the per-cpu backlog,
which triggers a new NET_RX softirq.

Actually, it is not necessary to enqueue the skb to the backlog when the
rps-calculated CPU id equals the current processing CPU; we can call
__netif_receive_skb or __netif_receive_skb_list to process the skb directly.
The benefit is that it reduces the number of NET_RX softirqs and the
processing delay of the skb.

The measured result shows that the patch brings a 50% reduction in NET_RX
softirqs. The test was done in a QEMU environment with a two-core CPU, using
iperf3:
taskset 01 iperf3 -c 192.168.2.250 -t 3 -u -R;
taskset 02 iperf3 -c 192.168.2.250 -t 3 -u -R;

Previous RPS:
                CPU0       CPU1
NET_RX:         45          0    (before iperf3 testing)
NET_RX:        1095         241   (after iperf3 testing)

Patched RPS:
                CPU0       CPU1
NET_RX:         28          4    (before iperf3 testing)
NET_RX:         573         32   (after iperf3 testing)
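
A minimal sketch of how such per-CPU counts can be read (this assumes the
NET_RX row of /proc/softirqs as the source; the measurement method is not
stated above):

#include <stdio.h>
#include <string.h>

/* Print the CPU header line and the NET_RX row of /proc/softirqs. */
int main(void)
{
	char line[512];
	int first = 1;
	FILE *fp = fopen("/proc/softirqs", "r");

	if (!fp) {
		perror("/proc/softirqs");
		return 1;
	}
	while (fgets(line, sizeof(line), fp)) {
		if (first || strstr(line, "NET_RX:"))
			fputs(line, stdout);
		first = 0;
	}
	fclose(fp);
	return 0;
}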

Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: Zhang Yunkai <zhang.yunkai@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn>
---
 net/core/dev.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c7853192563d..c33ddac3c012 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5666,8 +5666,9 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
 	if (static_branch_unlikely(&rps_needed)) {
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
 		int cpu = get_rps_cpu(skb->dev, skb, &rflow);
+		int current_cpu = smp_processor_id();

-		if (cpu >= 0) {
+		if (cpu >= 0 && cpu != current_cpu) {
 			ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
 			rcu_read_unlock();
 			return ret;
@@ -5699,8 +5700,9 @@ void netif_receive_skb_list_internal(struct list_head *head)
 		list_for_each_entry_safe(skb, next, head, list) {
 			struct rps_dev_flow voidflow, *rflow = &voidflow;
 			int cpu = get_rps_cpu(skb->dev, skb, &rflow);
+			int current_cpu = smp_processor_id();

-			if (cpu >= 0) {
+			if (cpu >= 0 && cpu != current_cpu) {
 				/* Will be handled, remove from list */
 				skb_list_del_init(skb);
 				enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
-- 
2.15.2

* Re: [PATCH] rps: process the skb directly if rps cpu not changed
  2023-03-21 12:12 [PATCH] rps: process the skb directly if rps cpu not changed yang.yang29
@ 2023-03-22  2:02 ` Yunsheng Lin
  2023-03-22  7:21   ` [PATCH v5 2/6] ksm: support unsharing zero pages placed by KSM xu xin
  2023-03-22  7:24   ` [PATCH] rps: process the skb directly if rps cpu not changed xu xin
  2023-03-22  2:54 ` Jakub Kicinski
  1 sibling, 2 replies; 8+ messages in thread
From: Yunsheng Lin @ 2023-03-22  2:02 UTC
  To: yang.yang29, edumazet
  Cc: davem, kuba, netdev, linux-kernel, xu.xin16, jiang.xuexin, zhang.yunkai

On 2023/3/21 20:12, yang.yang29@zte.com.cn wrote:
> From: xu xin <xu.xin16@zte.com.cn>
> 
> In the RPS procedure of NAPI receiving, regardless of whether the
> rps-calculated CPU of the skb equals to the currently processing CPU, RPS
> will always use enqueue_to_backlog to enqueue the skb to per-cpu backlog,
> which will trigger a new NET_RX softirq.

Does bypassing the backlog cause an out-of-order problem for packet handling?
It seems that currently RPS/RFS ensures in-order delivery, see for example:
https://elixir.bootlin.com/linux/v6.3-rc3/source/net/core/dev.c#L4485
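
The check behind that link (abridged from get_rps_cpu() in net/core/dev.c;
the exact form varies slightly between kernel versions) only switches a flow
to a new CPU once the old CPU's backlog has drained past the flow's last
enqueued packet:

	/* inside get_rps_cpu(), when the desired CPU differs from the
	 * one recorded in the flow table entry:
	 */
	if (unlikely(tcpu != next_cpu) &&
	    (tcpu >= nr_cpu_ids || !cpu_online(tcpu) ||
	     ((int)(per_cpu(softnet_data, tcpu).input_queue_head -
	      rflow->last_qtail)) >= 0)) {
		tcpu = next_cpu;
		rflow = set_rps_cpu(dev, skb, rflow, next_cpu);
	}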

Also, this is an optimization, so it should target the net-next branch:
[PATCH net-next] rps: process the skb directly if rps cpu not changed

> 
> Actually, it's not necessary to enqueue it to backlog when rps-calculated
> CPU id equals to the current processing CPU, and we can call
> __netif_receive_skb or __netif_receive_skb_list to process the skb directly.
> The benefit is that it can reduce the number of softirqs of NET_RX and reduce
> the processing delay of skb.
> 
> The measured result shows the patch brings 50% reduction of NET_RX softirqs.
> The test was done on the QEMU environment with two-core CPU by iperf3.
> taskset 01 iperf3 -c 192.168.2.250 -t 3 -u -R;
> taskset 02 iperf3 -c 192.168.2.250 -t 3 -u -R;
> 
> Previous RPS:
> 		    	CPU0       CPU1
> NET_RX:         45          0    (before iperf3 testing)
> NET_RX:        1095         241   (after iperf3 testing)
> 
> Patched RPS:
>                 CPU0       CPU1
> NET_RX:         28          4    (before iperf3 testing)
> NET_RX:         573         32   (after iperf3 testing)
> 
> Signed-off-by: xu xin <xu.xin16@zte.com.cn>
> Reviewed-by: Zhang Yunkai <zhang.yunkai@zte.com.cn>
> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
> Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn>
> ---
>  net/core/dev.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index c7853192563d..c33ddac3c012 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5666,8 +5666,9 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
>  	if (static_branch_unlikely(&rps_needed)) {
>  		struct rps_dev_flow voidflow, *rflow = &voidflow;
>  		int cpu = get_rps_cpu(skb->dev, skb, &rflow);
> +		int current_cpu = smp_processor_id();
> 
> -		if (cpu >= 0) {
> +		if (cpu >= 0 && cpu != current_cpu) {
>  			ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
>  			rcu_read_unlock();
>  			return ret;
> @@ -5699,8 +5700,9 @@ void netif_receive_skb_list_internal(struct list_head *head)
>  		list_for_each_entry_safe(skb, next, head, list) {
>  			struct rps_dev_flow voidflow, *rflow = &voidflow;
>  			int cpu = get_rps_cpu(skb->dev, skb, &rflow);
> +			int current_cpu = smp_processor_id();
> 
> -			if (cpu >= 0) {
> +			if (cpu >= 0 && cpu != current_cpu) {
>  				/* Will be handled, remove from list */
>  				skb_list_del_init(skb);
>  				enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
> 

* Re: [PATCH] rps: process the skb directly if rps cpu not changed
  2023-03-21 12:12 [PATCH] rps: process the skb directly if rps cpu not changed yang.yang29
  2023-03-22  2:02 ` Yunsheng Lin
@ 2023-03-22  2:54 ` Jakub Kicinski
  1 sibling, 0 replies; 8+ messages in thread
From: Jakub Kicinski @ 2023-03-22  2:54 UTC
  To: yang.yang29
  Cc: edumazet, davem, netdev, linux-kernel, xu.xin16, jiang.xuexin,
	zhang.yunkai

On Tue, 21 Mar 2023 20:12:29 +0800 (CST) yang.yang29@zte.com.cn wrote:
> The measured result shows the patch brings 50% reduction of NET_RX softirqs.
> The test was done on the QEMU environment with two-core CPU by iperf3.
> taskset 01 iperf3 -c 192.168.2.250 -t 3 -u -R;
> taskset 02 iperf3 -c 192.168.2.250 -t 3 -u -R;
> 
> Previous RPS:
> 		    	CPU0       CPU1

this header looks misaligned

> NET_RX:         45          0    (before iperf3 testing)
> NET_RX:        1095         241   (after iperf3 testing)
> 
> Patched RPS:
>                 CPU0       CPU1
> NET_RX:         28          4    (before iperf3 testing)
> NET_RX:         573         32   (after iperf3 testing)

This table is really confusing. What's the unit, how is it measured 
and why are you showing before/after rather than the delta?

> diff --git a/net/core/dev.c b/net/core/dev.c
> index c7853192563d..c33ddac3c012 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5666,8 +5666,9 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
>  	if (static_branch_unlikely(&rps_needed)) {
>  		struct rps_dev_flow voidflow, *rflow = &voidflow;
>  		int cpu = get_rps_cpu(skb->dev, skb, &rflow);
> +		int current_cpu = smp_processor_id();
> 
> -		if (cpu >= 0) {
> +		if (cpu >= 0 && cpu != current_cpu) {
>  			ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
>  			rcu_read_unlock();
>  			return ret;
> @@ -5699,8 +5700,9 @@ void netif_receive_skb_list_internal(struct list_head *head)
>  		list_for_each_entry_safe(skb, next, head, list) {
>  			struct rps_dev_flow voidflow, *rflow = &voidflow;
>  			int cpu = get_rps_cpu(skb->dev, skb, &rflow);
> +			int current_cpu = smp_processor_id();

This does not have to be in the loop.

> 
> -			if (cpu >= 0) {
> +			if (cpu >= 0 && cpu != current_cpu) {

Please answer Yunsheng's question as well..
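
A minimal sketch of the hoist suggested above (illustration only, not taken
from the posted patch):

	if (static_branch_unlikely(&rps_needed)) {
		/* read once, outside the per-skb loop */
		int current_cpu = smp_processor_id();

		list_for_each_entry_safe(skb, next, head, list) {
			struct rps_dev_flow voidflow, *rflow = &voidflow;
			int cpu = get_rps_cpu(skb->dev, skb, &rflow);

			if (cpu >= 0 && cpu != current_cpu) {
				/* Will be handled, remove from list */
				skb_list_del_init(skb);
				enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
			}
		}
	}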

* Re: [PATCH v5 2/6] ksm: support unsharing zero pages placed by KSM
  2023-03-22  2:02 ` Yunsheng Lin
@ 2023-03-22  7:21   ` xu xin
  2023-03-22  7:24   ` [PATCH] rps: process the skb directly if rps cpu not changed xu xin
  1 sibling, 0 replies; 8+ messages in thread
From: xu xin @ 2023-03-22  7:21 UTC
  To: linyunsheng, kuba
  Cc: davem, edumazet, jiang.xuexin, linux-kernel, netdev, xu.xin16,
	yang.yang29, zhang.yunkai

On 2023/3/21 20:12, yang.yang29@zte.com.cn wrote:
>> From: xu xin <xu.xin16@zte.com.cn>
>> 
>> In the RPS procedure of NAPI receiving, regardless of whether the
>> rps-calculated CPU of the skb equals to the currently processing CPU, RPS
>> will always use enqueue_to_backlog to enqueue the skb to per-cpu backlog,
>> which will trigger a new NET_RX softirq.
>
>Does bypassing the backlog cause out of order problem for packet handling?
>It seems currently the RPS/RFS will ensure order delivery,such as:
>https://elixir.bootlin.com/linux/v6.3-rc3/source/net/core/dev.c#L4485
>
>Also, this is an optimization, it should target the net-next branch:
>[PATCH net-next] rps: process the skb directly if rps cpu not changed
>

Well, I thought the patch wouldn't break the effort RFS makes to avoid "Out of
Order" packets. But thanks for your reminder, I have rethought it: bypassing the
backlog from "netif_receive_skb_list" will mislead RFS's judgment of whether all
previous packets for the flow have been dequeued, because RFS would think all
packets had been dealt with while they are actually still in the skb lists.
Fortunately, bypassing the backlog from "netif_receive_skb" for a single skb is
okay and won't cause OOO packets, because every skb is processed serially by RPS
and sent to the protocol stack as soon as possible.

If I'm correct, the code as follows can fix this.

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5666,8 +5666,9 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
        if (static_branch_unlikely(&rps_needed)) {
                struct rps_dev_flow voidflow, *rflow = &voidflow;
                int cpu = get_rps_cpu(skb->dev, skb, &rflow);
+               int current_cpu = smp_processor_id();
 
-               if (cpu >= 0) {
+               if (cpu >= 0 && cpu != current_cpu) {
                        ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
                        rcu_read_unlock();
                        return ret;
@@ -5699,11 +5700,15 @@ void netif_receive_skb_list_internal(struct list_head *head)
                list_for_each_entry_safe(skb, next, head, list) {
                        struct rps_dev_flow voidflow, *rflow = &voidflow;
                        int cpu = get_rps_cpu(skb->dev, skb, &rflow);
+                       int current_cpu = smp_processor_id();
 
                        if (cpu >= 0) {
                                /* Will be handled, remove from list */
                                skb_list_del_init(skb);
-                               enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+                               if (cpu != current_cpu)
+                                       enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+                               else
+                                       __netif_receive_skb(skb);
                        }
                }


Thanks.

>> 
>> Actually, it's not necessary to enqueue it to backlog when rps-calculated
>> CPU id equals to the current processing CPU, and we can call
>> __netif_receive_skb or __netif_receive_skb_list to process the skb directly.
>> The benefit is that it can reduce the number of softirqs of NET_RX and reduce
>> the processing delay of skb.
>> 
>> The measured result shows the patch brings 50% reduction of NET_RX softirqs.
>> The test was done on the QEMU environment with two-core CPU by iperf3.
>> taskset 01 iperf3 -c 192.168.2.250 -t 3 -u -R;
>> taskset 02 iperf3 -c 192.168.2.250 -t 3 -u -R;
>> 
>> Previous RPS:
>> 		    	CPU0       CPU1
>> NET_RX:         45          0    (before iperf3 testing)
>> NET_RX:        1095         241   (after iperf3 testing)
>> 
>> Patched RPS:
>>                 CPU0       CPU1
>> NET_RX:         28          4    (before iperf3 testing)
>> NET_RX:         573         32   (after iperf3 testing)
>
>Sincerely.
>Xu Xin

* Re: [PATCH] rps: process the skb directly if rps cpu not changed
  2023-03-22  2:02 ` Yunsheng Lin
  2023-03-22  7:21   ` [PATCH v5 2/6] ksm: support unsharing zero pages placed by KSM xu xin
@ 2023-03-22  7:24   ` xu xin
  2023-03-23  9:01     ` Yunsheng Lin
  1 sibling, 1 reply; 8+ messages in thread
From: xu xin @ 2023-03-22  7:24 UTC
  To: linyunsheng, kuba
  Cc: davem, edumazet, jiang.xuexin, linux-kernel, netdev, xu.xin16,
	yang.yang29, zhang.yunkai

[So sorry, I made a mistake in the reply title]

On 2023/3/21 20:12, yang.yang29@zte.com.cn wrote:
>> From: xu xin <xu.xin16@zte.com.cn>
>> 
>> In the RPS procedure of NAPI receiving, regardless of whether the
>> rps-calculated CPU of the skb equals to the currently processing CPU, RPS
>> will always use enqueue_to_backlog to enqueue the skb to per-cpu backlog,
>> which will trigger a new NET_RX softirq.
>
>Does bypassing the backlog cause out of order problem for packet handling?
>It seems currently the RPS/RFS will ensure order delivery,such as:
>https://elixir.bootlin.com/linux/v6.3-rc3/source/net/core/dev.c#L4485
>
>Also, this is an optimization, it should target the net-next branch:
>[PATCH net-next] rps: process the skb directly if rps cpu not changed
>

Well, I thought the patch wouldn't break the effort RFS makes to avoid "Out of
Order" packets. But thanks for your reminder, I have rethought it: bypassing the
backlog from "netif_receive_skb_list" will mislead RFS's judgment of whether all
previous packets for the flow have been dequeued, because RFS would think all
packets had been dealt with while they are actually still in the skb lists.
Fortunately, bypassing the backlog from "netif_receive_skb" for a single skb is
okay and won't cause OOO packets, because every skb is processed serially by RPS
and sent to the protocol stack as soon as possible.

If I'm correct, the code as follows can fix this.

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5666,8 +5666,9 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
        if (static_branch_unlikely(&rps_needed)) {
                struct rps_dev_flow voidflow, *rflow = &voidflow;
                int cpu = get_rps_cpu(skb->dev, skb, &rflow);
+               int current_cpu = smp_processor_id();
 
-               if (cpu >= 0) {
+               if (cpu >= 0 && cpu != current_cpu) {
                        ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
                        rcu_read_unlock();
                        return ret;
@@ -5699,11 +5700,15 @@ void netif_receive_skb_list_internal(struct list_head *head)
                list_for_each_entry_safe(skb, next, head, list) {
                        struct rps_dev_flow voidflow, *rflow = &voidflow;
                        int cpu = get_rps_cpu(skb->dev, skb, &rflow);
+                       int current_cpu = smp_processor_id();
 
                        if (cpu >= 0) {
                                /* Will be handled, remove from list */
                                skb_list_del_init(skb);
-                               enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+                               if (cpu != current_cpu)
+                                       enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+                               else
+                                       __netif_receive_skb(skb);
                        }
                }


Thanks.

>> 
>> Actually, it's not necessary to enqueue it to backlog when rps-calculated
>> CPU id equals to the current processing CPU, and we can call
>> __netif_receive_skb or __netif_receive_skb_list to process the skb directly.
>> The benefit is that it can reduce the number of softirqs of NET_RX and reduce
>> the processing delay of skb.
>> 
>> The measured result shows the patch brings 50% reduction of NET_RX softirqs.
>> The test was done on the QEMU environment with two-core CPU by iperf3.
>> taskset 01 iperf3 -c 192.168.2.250 -t 3 -u -R;
>> taskset 02 iperf3 -c 192.168.2.250 -t 3 -u -R;
>> 
>> Previous RPS:
>> 		    	CPU0       CPU1
>> NET_RX:         45          0    (before iperf3 testing)
>> NET_RX:        1095         241   (after iperf3 testing)
>> 
>> Patched RPS:
>>                 CPU0       CPU1
>> NET_RX:         28          4    (before iperf3 testing)
>> NET_RX:         573         32   (after iperf3 testing)
>
>Sincerely.
>Xu Xin

* Re: [PATCH] rps: process the skb directly if rps cpu not changed
  2023-03-22  7:24   ` [PATCH] rps: process the skb directly if rps cpu not changed xu xin
@ 2023-03-23  9:01     ` Yunsheng Lin
  2023-03-23 10:04       ` xu xin
  0 siblings, 1 reply; 8+ messages in thread
From: Yunsheng Lin @ 2023-03-23  9:01 UTC
  To: xu xin, kuba
  Cc: davem, edumazet, jiang.xuexin, linux-kernel, netdev, xu.xin16,
	yang.yang29, zhang.yunkai

On 2023/3/22 15:24, xu xin wrote:
> [So sorry, I made a mistake in the reply title]
> 
> On 2023/3/21 20:12, yang.yang29@zte.com.cn wrote:
>>> From: xu xin <xu.xin16@zte.com.cn>
>>>
>>> In the RPS procedure of NAPI receiving, regardless of whether the
>>> rps-calculated CPU of the skb equals to the currently processing CPU, RPS
>>> will always use enqueue_to_backlog to enqueue the skb to per-cpu backlog,
>>> which will trigger a new NET_RX softirq.
>>
>> Does bypassing the backlog cause out of order problem for packet handling?
>> It seems currently the RPS/RFS will ensure order delivery,such as:
>> https://elixir.bootlin.com/linux/v6.3-rc3/source/net/core/dev.c#L4485
>>
>> Also, this is an optimization, it should target the net-next branch:
>> [PATCH net-next] rps: process the skb directly if rps cpu not changed
>>
> 
> Well, I thought the patch would't break the effort RFS tried to avoid "Out of
> Order" packets. But thanks for your reminder, I rethink it again, bypassing the
> backlog from "netif_receive_skb_list" will mislead RFS's judging if all
> previous packets for the flow have been dequeued, where RFS thought all packets
> have been dealed with, but actually they are still in skb lists. Fortunately,
> bypassing the backlog from "netif_receive_skb" for a single skb is okay and won't
> cause OOO packets because every skb is processed serially by RPS and sent to the
> protocol stack as soon as possible.

Suppose a lot of skbs have been queued to the backlog, waiting to be
processed and passed to the stack, while current_cpu is not the same as the
target cpu. Then current_cpu changes to be the same as the target cpu; with
your patch, a new skb will be processed and passed to the stack immediately,
which may bypass the old skbs in the backlog.

> 
> If I'm correct, the code as follws can fix this.
> 
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5666,8 +5666,9 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
>         if (static_branch_unlikely(&rps_needed)) {
>                 struct rps_dev_flow voidflow, *rflow = &voidflow;
>                 int cpu = get_rps_cpu(skb->dev, skb, &rflow);
> +               int current_cpu = smp_processor_id();
>  
> -               if (cpu >= 0) {
> +               if (cpu >= 0 && cpu != current_cpu) {
>                         ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
>                         rcu_read_unlock();
>                         return ret;
> @@ -5699,11 +5700,15 @@ void netif_receive_skb_list_internal(struct list_head *head)
>                 list_for_each_entry_safe(skb, next, head, list) {
>                         struct rps_dev_flow voidflow, *rflow = &voidflow;
>                         int cpu = get_rps_cpu(skb->dev, skb, &rflow);
> +                       int current_cpu = smp_processor_id();
>  
>                         if (cpu >= 0) {
>                                 /* Will be handled, remove from list */
>                                 skb_list_del_init(skb);
> -                               enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
> +                               if (cpu != current_cpu)
> +                                       enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
> +                               else
> +                                       __netif_receive_skb(skb);
>                         }
>                 }
> 
> 
> Thanks.
> 
>>>
>>> Actually, it's not necessary to enqueue it to backlog when rps-calculated
>>> CPU id equals to the current processing CPU, and we can call
>>> __netif_receive_skb or __netif_receive_skb_list to process the skb directly.
>>> The benefit is that it can reduce the number of softirqs of NET_RX and reduce
>>> the processing delay of skb.
>>>
>>> The measured result shows the patch brings 50% reduction of NET_RX softirqs.
>>> The test was done on the QEMU environment with two-core CPU by iperf3.
>>> taskset 01 iperf3 -c 192.168.2.250 -t 3 -u -R;
>>> taskset 02 iperf3 -c 192.168.2.250 -t 3 -u -R;
>>>
>>> Previous RPS:
>>> 		    	CPU0       CPU1
>>> NET_RX:         45          0    (before iperf3 testing)
>>> NET_RX:        1095         241   (after iperf3 testing)
>>>
>>> Patched RPS:
>>>                 CPU0       CPU1
>>> NET_RX:         28          4    (before iperf3 testing)
>>> NET_RX:         573         32   (after iperf3 testing)
>>
>> Sincerely.
>> Xu Xin
> .
> 

* Re: [PATCH] rps: process the skb directly if rps cpu not changed
  2023-03-23  9:01     ` Yunsheng Lin
@ 2023-03-23 10:04       ` xu xin
  2023-03-23 10:47         ` Yunsheng Lin
  0 siblings, 1 reply; 8+ messages in thread
From: xu xin @ 2023-03-23 10:04 UTC
  To: linyunsheng, kuba
  Cc: davem, edumazet, jiang.xuexin, linux-kernel, netdev, xu.xin16,
	yang.yang29, zhang.yunkai

>On 2023/3/22 15:24, xu xin wrote:
>> [So sorry, I made a mistake in the reply title]
>> 
>> On 2023/3/21 20:12, yang.yang29@zte.com.cn wrote:
>>>> From: xu xin <xu.xin16@zte.com.cn>
>>>>
>>>> In the RPS procedure of NAPI receiving, regardless of whether the
>>>> rps-calculated CPU of the skb equals to the currently processing CPU, RPS
>>>> will always use enqueue_to_backlog to enqueue the skb to per-cpu backlog,
>>>> which will trigger a new NET_RX softirq.
>>>
>>> Does bypassing the backlog cause out of order problem for packet handling?
>>> It seems currently the RPS/RFS will ensure order delivery,such as:
>>> https://elixir.bootlin.com/linux/v6.3-rc3/source/net/core/dev.c#L4485
>>>
>>> Also, this is an optimization, it should target the net-next branch:
>>> [PATCH net-next] rps: process the skb directly if rps cpu not changed
>>>
>> 
>> Well, I thought the patch would't break the effort RFS tried to avoid "Out of
>> Order" packets. But thanks for your reminder, I rethink it again, bypassing the
>> backlog from "netif_receive_skb_list" will mislead RFS's judging if all
>> previous packets for the flow have been dequeued, where RFS thought all packets
>> have been dealed with, but actually they are still in skb lists. Fortunately,
>> bypassing the backlog from "netif_receive_skb" for a single skb is okay and won't
>> cause OOO packets because every skb is processed serially by RPS and sent to the
>> protocol stack as soon as possible.
>
>Suppose a lot of skbs have been queued to the backlog waiting to
>processed and passed to the stack when current_cpu is not the same
>as the target cpu,

Well, I'm afraid that what we mean by current_cpu may be different. The
"current_cpu" in my patch refers to the cpu the NAPI poll is running on (or
the cpu that the skb originates from).

>then current_cpu is changed to be the same as the
>target cpu, with your patch, new skb will be processed and passed to
>the stack immediately, which may bypass the old skb in the backlog.
>
I think not: the RFS procedure won't let the target cpu switch to a new cpu
if there are still old skbs in the backlog of the last recorded cpu. So the
target cpu of a new skb will never equal current_cpu while old skbs remain
in the backlog.
==========================================================================
Let me draw the situation you described: at time T1, the app runs on cpu-0,
so many packets are queued into rxqueue-0 by RFS from cpu-1 (suppose the
NAPI poll is processing on cpu-1). Then, suddenly at time T2, the app
transfers to cpu-1; RFS knows there are still old skbs in rxqueue-0, so
get_rps_cpu will not return cpu-1, but cpu-0 instead.

========================================================
When T1, app runs on cpu-0:
  APP
-----------------------------
|      |        |      |
|cpu-0 |        |cpu-1 |
|stack |        |stack |
|      |        |      |
   ^
  |=|
  |=|             | |
  |=|             | |
(rxqueue-0)      (rxqueue-1,empty)
   ^<--
       <--
         <--
	     <-- packet(poll on cpu1)
			
===========================================================
When T2, app transfers to cpu-1, the target cpu is still cpu-0:
                  APP
----------------------------
|      |        |      |
|cpu-0 |        |cpu-1 |
|stack |        |stack |
|      |        |      |
   ^
   |
  |=|             | |
  |=|             | |
(rxqueue-0)      (rxqueue-1,empty)
   ^<--
       <--
         <--
            <-- packet(poll on cpu1)

===================================

Thanks for your reply.

>> 
>> If I'm correct, the code as follws can fix this.
>> 
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -5666,8 +5666,9 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
>>         if (static_branch_unlikely(&rps_needed)) {
>>                 struct rps_dev_flow voidflow, *rflow = &voidflow;
>>                 int cpu = get_rps_cpu(skb->dev, skb, &rflow);
>> +               int current_cpu = smp_processor_id();
>>  
>> -               if (cpu >= 0) {
>> +               if (cpu >= 0 && cpu != current_cpu) {
>>                         ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
>>                         rcu_read_unlock();
>>                         return ret;
>> @@ -5699,11 +5700,15 @@ void netif_receive_skb_list_internal(struct list_head *head)
>>                 list_for_each_entry_safe(skb, next, head, list) {
>>                         struct rps_dev_flow voidflow, *rflow = &voidflow;
>>                         int cpu = get_rps_cpu(skb->dev, skb, &rflow);
>> +                       int current_cpu = smp_processor_id();
>>  
>>                         if (cpu >= 0) {
>>                                 /* Will be handled, remove from list */
>>                                 skb_list_del_init(skb);
>> -                               enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
>> +                               if (cpu != current_cpu)
>> +                                       enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
>> +                               else
>> +                                       __netif_receive_skb(skb);
>>                         }
>>                 }
>> 
>> 
>> Thanks.
>> 
>>>>
>>>> Actually, it's not necessary to enqueue it to backlog when rps-calculated
>>>> CPU id equals to the current processing CPU, and we can call
>>>> __netif_receive_skb or __netif_receive_skb_list to process the skb directly.
>>>> The benefit is that it can reduce the number of softirqs of NET_RX and reduce
>>>> the processing delay of skb.
>>>>
>>>> The measured result shows the patch brings 50% reduction of NET_RX softirqs.
>>>> The test was done on the QEMU environment with two-core CPU by iperf3.
>>>> taskset 01 iperf3 -c 192.168.2.250 -t 3 -u -R;
>>>> taskset 02 iperf3 -c 192.168.2.250 -t 3 -u -R;
>>>>
>>>> Previous RPS:
>>>> 		    	CPU0       CPU1
>>>> NET_RX:         45          0    (before iperf3 testing)
>>>> NET_RX:        1095         241   (after iperf3 testing)
>>>>
>>>> Patched RPS:
>>>>                 CPU0       CPU1
>>>> NET_RX:         28          4    (before iperf3 testing)
>>>> NET_RX:         573         32   (after iperf3 testing)
>>>
>>> Sincerely.
>>> Xu Xin
>> .
>> 

* Re: [PATCH] rps: process the skb directly if rps cpu not changed
  2023-03-23 10:04       ` xu xin
@ 2023-03-23 10:47         ` Yunsheng Lin
  0 siblings, 0 replies; 8+ messages in thread
From: Yunsheng Lin @ 2023-03-23 10:47 UTC
  To: xu xin, kuba
  Cc: davem, edumazet, jiang.xuexin, linux-kernel, netdev, xu.xin16,
	yang.yang29, zhang.yunkai

On 2023/3/23 18:04, xu xin wrote:
>> On 2023/3/22 15:24, xu xin wrote:
>>> [So sorry, I made a mistake in the reply title]
>>>
>>> On 2023/3/21 20:12, yang.yang29@zte.com.cn wrote:
>>>>> From: xu xin <xu.xin16@zte.com.cn>
>>>>>
>>>>> In the RPS procedure of NAPI receiving, regardless of whether the
>>>>> rps-calculated CPU of the skb equals to the currently processing CPU, RPS
>>>>> will always use enqueue_to_backlog to enqueue the skb to per-cpu backlog,
>>>>> which will trigger a new NET_RX softirq.
>>>>
>>>> Does bypassing the backlog cause out of order problem for packet handling?
>>>> It seems currently the RPS/RFS will ensure order delivery,such as:
>>>> https://elixir.bootlin.com/linux/v6.3-rc3/source/net/core/dev.c#L4485
>>>>
>>>> Also, this is an optimization, it should target the net-next branch:
>>>> [PATCH net-next] rps: process the skb directly if rps cpu not changed
>>>>
>>>
>>> Well, I thought the patch would't break the effort RFS tried to avoid "Out of
>>> Order" packets. But thanks for your reminder, I rethink it again, bypassing the
>>> backlog from "netif_receive_skb_list" will mislead RFS's judging if all
>>> previous packets for the flow have been dequeued, where RFS thought all packets
>>> have been dealed with, but actually they are still in skb lists. Fortunately,
>>> bypassing the backlog from "netif_receive_skb" for a single skb is okay and won't
>>> cause OOO packets because every skb is processed serially by RPS and sent to the
>>> protocol stack as soon as possible.
>>
>> Suppose a lot of skbs have been queued to the backlog waiting to
>> processed and passed to the stack when current_cpu is not the same
>> as the target cpu,
> 
> Well.  I'm afraid that what we mean by current_cpu may be different. The
> "current_cpu" in my patch refers to the cpu NAPI poll is running on (Or
> the cpu that the skb origins from).

That's what I meant too. current_cpu refers to the cpu on which the driver's
NAPI poll is running.

> 
>> then current_cpu is changed to be the same as the
>> target cpu, with your patch, new skb will be processed and passed to
>> the stack immediately, which may bypass the old skb in the backlog.
>>
> I think Nop, RFS procedure won't let target cpu switch into a new cpu
> if there are still old skbs in the backlog of last recorded cpu. So the
> target cpu of the new skb will never equal to current_cpu if old skb in the
> backlog.
> ==========================================================================
> Let me draw the situation you described: At the time of T1, the app runs
> on cpu-0, so there are many packets queueing into the rxqueue-0 by RFS from
> CPU-1(suppose NAPI poll processing on cpu-1). Then, suddenly at the time of
> T2, the app tranfers to cpu-1, RFS know there are still old skb in rxqueue-0,
> so get_rps_cpu will not return a value of cpu-1, but cpu-0 instead.
> 
> ========================================================
> When T1, app runs on cpu-0:
>   APP
> -----------------------------
> |      |        |      |
> |cpu-0 |        |cpu-1 |
> |stack |        |stack |
> |      |        |      |
>    ^
>   |=|
>   |=|             | |
>   |=|             | |
> (rxqueue-0)      (rxqueue-1,empty)
>    ^<--
>        <--
>          <--
> 	     <-- packet(poll on cpu1)
> 			
> ===========================================================
> When T2, app tranfer to cpu-1, target cpu is still on cpu-0:
>                   APP
> ----------------------------
> |      |        |      |
> |cpu-0 |        |cpu-1 |
> |stack |        |stack |
> |      |        |      |
>    ^
>    |
>   |=|             | |
>   |=|             | |
> (rxqueue-0)      (rxqueue-2,empty)
>    ^<--
>        <--
>          <--
>             <-- packet(poll on cpu1)

What about the NAPI poll changing between cpu0 and cpu1 while the APP stays
on the same cpu?

Note that the backlog queue is per cpu.
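
For reference, the backlog lives in per-CPU state (abridged from
include/linux/netdevice.h and net/core/dev.c; exact field layout varies by
kernel version):

	struct softnet_data {
		...
		unsigned int		input_queue_head;	/* under CONFIG_RPS */
		...
		struct sk_buff_head	input_pkt_queue;	/* the backlog queue */
		struct napi_struct	backlog;
		...
	};

	/* one instance per CPU */
	DEFINE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);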

> 
> ===================================
> 
> Thanks for your reply.
> 
>>>
>>> If I'm correct, the code as follws can fix this.
>>>
>>> --- a/net/core/dev.c
>>> +++ b/net/core/dev.c
>>> @@ -5666,8 +5666,9 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
>>>         if (static_branch_unlikely(&rps_needed)) {
>>>                 struct rps_dev_flow voidflow, *rflow = &voidflow;
>>>                 int cpu = get_rps_cpu(skb->dev, skb, &rflow);
>>> +               int current_cpu = smp_processor_id();
>>>  
>>> -               if (cpu >= 0) {
>>> +               if (cpu >= 0 && cpu != current_cpu) {
>>>                         ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
>>>                         rcu_read_unlock();
>>>                         return ret;
>>> @@ -5699,11 +5700,15 @@ void netif_receive_skb_list_internal(struct list_head *head)
>>>                 list_for_each_entry_safe(skb, next, head, list) {
>>>                         struct rps_dev_flow voidflow, *rflow = &voidflow;
>>>                         int cpu = get_rps_cpu(skb->dev, skb, &rflow);
>>> +                       int current_cpu = smp_processor_id();
>>>  
>>>                         if (cpu >= 0) {
>>>                                 /* Will be handled, remove from list */
>>>                                 skb_list_del_init(skb);
>>> -                               enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
>>> +                               if (cpu != current_cpu)
>>> +                                       enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
>>> +                               else
>>> +                                       __netif_receive_skb(skb);
>>>                         }
>>>                 }
>>>
>>>
>>> Thanks.
>>>
>>>>>
>>>>> Actually, it's not necessary to enqueue it to backlog when rps-calculated
>>>>> CPU id equals to the current processing CPU, and we can call
>>>>> __netif_receive_skb or __netif_receive_skb_list to process the skb directly.
>>>>> The benefit is that it can reduce the number of softirqs of NET_RX and reduce
>>>>> the processing delay of skb.
>>>>>
>>>>> The measured result shows the patch brings 50% reduction of NET_RX softirqs.
>>>>> The test was done on the QEMU environment with two-core CPU by iperf3.
>>>>> taskset 01 iperf3 -c 192.168.2.250 -t 3 -u -R;
>>>>> taskset 02 iperf3 -c 192.168.2.250 -t 3 -u -R;
>>>>>
>>>>> Previous RPS:
>>>>> 		    	CPU0       CPU1
>>>>> NET_RX:         45          0    (before iperf3 testing)
>>>>> NET_RX:        1095         241   (after iperf3 testing)
>>>>>
>>>>> Patched RPS:
>>>>>                 CPU0       CPU1
>>>>> NET_RX:         28          4    (before iperf3 testing)
>>>>> NET_RX:         573         32   (after iperf3 testing)
>>>>
>>>> Sincerely.
>>>> Xu Xin
>>> .
>>>
> .
> 
