netdev.vger.kernel.org archive mirror
* [PATCH net] net: Correct wrong skb_flow_limit check when enable RPS
@ 2018-05-10  8:28 gfree.wind
  2018-05-10 13:02 ` Eric Dumazet
  2018-05-11  3:54 ` Willem de Bruijn
  0 siblings, 2 replies; 11+ messages in thread
From: gfree.wind @ 2018-05-10  8:28 UTC (permalink / raw)
  To: davem, daniel, jakub.kicinski, dsahern, netdev; +Cc: Gao Feng

From: Gao Feng <gfree.wind@vip.163.com>

The skb flow limit is implemented for each CPU independently. In the
current code, the function skb_flow_limit gets the softnet_data by
this_cpu_ptr. But the target CPU of enqueue_to_backlog may not be the
current CPU when RPS is enabled. As a result, skb_flow_limit checks the
stats of the current CPU, while the skb is going to be appended to the
queue of another CPU. It isn't the expected behavior.

Now pass the softnet_data as a parameter to skb_flow_limit to make this
consistent.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
---
 net/core/dev.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index af0558b..0f98eff 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3883,18 +3883,15 @@ static int rps_ipi_queued(struct softnet_data *sd)
 int netdev_flow_limit_table_len __read_mostly = (1 << 12);
 #endif
 
-static bool skb_flow_limit(struct sk_buff *skb, unsigned int qlen)
+static bool skb_flow_limit(struct softnet_data *sd, struct sk_buff *skb, unsigned int qlen)
 {
 #ifdef CONFIG_NET_FLOW_LIMIT
 	struct sd_flow_limit *fl;
-	struct softnet_data *sd;
 	unsigned int old_flow, new_flow;
 
 	if (qlen < (netdev_max_backlog >> 1))
 		return false;
 
-	sd = this_cpu_ptr(&softnet_data);
-
 	rcu_read_lock();
 	fl = rcu_dereference(sd->flow_limit);
 	if (fl) {
@@ -3938,7 +3935,7 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
 	if (!netif_running(skb->dev))
 		goto drop;
 	qlen = skb_queue_len(&sd->input_pkt_queue);
-	if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
+	if (qlen <= netdev_max_backlog && !skb_flow_limit(sd, skb, qlen)) {
 		if (qlen) {
 enqueue:
 			__skb_queue_tail(&sd->input_pkt_queue, skb);
-- 
1.9.1
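
For readers following the discussion below: the diff only shows the context
lines of skb_flow_limit(). The following is a rough sketch of the complete
pre-patch function, reconstructed here for reference; constants and field
names may differ slightly from the exact tree this patch applies to.

static bool skb_flow_limit(struct sk_buff *skb, unsigned int qlen)
{
#ifdef CONFIG_NET_FLOW_LIMIT
	struct sd_flow_limit *fl;
	struct softnet_data *sd;
	unsigned int old_flow, new_flow;

	/* Only start policing once the backlog is half full. */
	if (qlen < (netdev_max_backlog >> 1))
		return false;

	/* The line under discussion: the table of the *local* (IRQ) CPU
	 * is used, not that of the RPS target CPU.
	 */
	sd = this_cpu_ptr(&softnet_data);

	rcu_read_lock();
	fl = rcu_dereference(sd->flow_limit);
	if (fl) {
		/* Hash the packet into a small bucket table and keep a
		 * sliding history of the last FLOW_LIMIT_HISTORY buckets.
		 */
		new_flow = skb_get_hash(skb) & (fl->num_buckets - 1);
		old_flow = fl->history[fl->history_head];
		fl->history[fl->history_head] = new_flow;

		fl->history_head++;
		fl->history_head &= FLOW_LIMIT_HISTORY - 1;

		if (likely(fl->buckets[old_flow]))
			fl->buckets[old_flow]--;

		/* A single bucket occupying more than half of the recent
		 * history is treated as an "elephant" flow and dropped.
		 */
		if (fl->buckets[new_flow]++ > (FLOW_LIMIT_HISTORY >> 1)) {
			fl->count++;
			rcu_read_unlock();
			return true;
		}
	}
	rcu_read_unlock();
#endif
	return false;
}

The history[]/buckets[] accounting above is what the replies below refer to
as the per-CPU flow table.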

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH net] net: Correct wrong skb_flow_limit check when enable RPS
  2018-05-10  8:28 [PATCH net] net: Correct wrong skb_flow_limit check when enable RPS gfree.wind
@ 2018-05-10 13:02 ` Eric Dumazet
  2018-05-11  0:18   ` Gao Feng
  2018-05-11  3:54 ` Willem de Bruijn
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2018-05-10 13:02 UTC (permalink / raw)
  To: gfree.wind, davem, daniel, jakub.kicinski, dsahern, netdev



On 05/10/2018 01:28 AM, gfree.wind@vip.163.com wrote:
> From: Gao Feng <gfree.wind@vip.163.com>
> 
> The skb flow limit is implemented for each CPU independently. In the
> current code, the function skb_flow_limit gets the softnet_data by
> this_cpu_ptr. But the target CPU of enqueue_to_backlog may not be the
> current CPU when RPS is enabled. As a result, skb_flow_limit checks the
> stats of the current CPU, while the skb is going to be appended to the
> queue of another CPU. It isn't the expected behavior.
>
> Now pass the softnet_data as a parameter to skb_flow_limit to make this
> consistent.
>

Please add a correct Fixes: tag

By doing so, you will likely add a CC: tag to make sure the author of the code
will receive your email and give feedback.

Thanks !

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re:Re: [PATCH net] net: Correct wrong skb_flow_limit check when enable RPS
  2018-05-10 13:02 ` Eric Dumazet
@ 2018-05-11  0:18   ` Gao Feng
  2018-05-11  0:55     ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Gao Feng @ 2018-05-11  0:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, daniel, jakub.kicinski, David Ahern, netdev

At 2018-05-10 21:02:55, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:
>
>
>On 05/10/2018 01:28 AM, gfree.wind@vip.163.com wrote:
>> From: Gao Feng <gfree.wind@vip.163.com>
>> 
>> The skb flow limit is implemented for each CPU independently. In the
>> current code, the function skb_flow_limit gets the softnet_data by
>> this_cpu_ptr. But the target CPU of enqueue_to_backlog may not be the
>> current CPU when RPS is enabled. As a result, skb_flow_limit checks the
>> stats of the current CPU, while the skb is going to be appended to the
>> queue of another CPU. It isn't the expected behavior.
>>
>> Now pass the softnet_data as a parameter to skb_flow_limit to make this
>> consistent.
>>
>
>Please add a correct Fixes: tag

Thanks Eric.

I have one question about the "Fixes:" tag.
Most patches are bug fixes, but when is it necessary to add the "Fixes:" tag, and when not?

I'm not clear about it. Could you explain it, please?

Best Regards
Feng

>
>By doing so, you will likely add a CC: tag to make sure the author of the code
>will receive your email and give feedback.
>
>Thanks !
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net] net: Correct wrong skb_flow_limit check when enable RPS
  2018-05-11  0:18   ` Gao Feng
@ 2018-05-11  0:55     ` Eric Dumazet
  2018-05-11  1:29       ` Gao Feng
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2018-05-11  0:55 UTC (permalink / raw)
  To: Gao Feng, Eric Dumazet; +Cc: davem, daniel, jakub.kicinski, David Ahern, netdev



On 05/10/2018 05:18 PM, Gao Feng wrote:
> At 2018-05-10 21:02:55, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:
>>
>>
>> On 05/10/2018 01:28 AM, gfree.wind@vip.163.com wrote:
>>> From: Gao Feng <gfree.wind@vip.163.com>
>>>
>>> The skb flow limit is implemented for each CPU independently. In the
>>> current code, the function skb_flow_limit gets the softnet_data by
>>> this_cpu_ptr. But the target CPU of enqueue_to_backlog may not be the
>>> current CPU when RPS is enabled. As a result, skb_flow_limit checks the
>>> stats of the current CPU, while the skb is going to be appended to the
>>> queue of another CPU. It isn't the expected behavior.
>>>
>>> Now pass the softnet_data as a parameter to skb_flow_limit to make this
>>> consistent.
>>>
>>
>> Please add a correct Fixes: tag
> 
> Thanks Eric.
> 
> I have one question about the "Fixes:" tag.
> Most patches are bug fixes, but when is it necessary to add the "Fixes:" tag, and when not?
>
> I'm not clear about it. Could you explain it, please?
> 

For this particular patch, since you have not CC'd Willem (the author of the patch),
it would be very useful if you did a search to find out.
Once you have found which commit added the problem, simply add the Fixes: tag and CC: the author.

Doing so saves us (stable teams, reviewers, maintainers) a lot of time really.

In my opinion, Fixes: tags should be mandatory when applicable.
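
For reference, the Fixes: tag goes in the patch trailer next to the
Signed-off-by line and names the commit that introduced the bug. The hash,
subject and address below are placeholders, not the actual offending commit:

Fixes: 123456789abc ("subsys: one-line subject of the commit that introduced the bug")
Cc: Original Author <author@example.com>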

> Best Regards
> Feng
> 
>>
>> By doing so, you will likely add a CC: tag to make sure the author of the code
>> will receive your email and give feed back.
>>
>> Thanks !
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re:Re: [PATCH net] net: Correct wrong skb_flow_limit check when enable RPS
  2018-05-11  0:55     ` Eric Dumazet
@ 2018-05-11  1:29       ` Gao Feng
  0 siblings, 0 replies; 11+ messages in thread
From: Gao Feng @ 2018-05-11  1:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, daniel, jakub.kicinski, David Ahern, netdev

At 2018-05-11 08:55:47, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:
>
>
>On 05/10/2018 05:18 PM, Gao Feng wrote:
>> At 2018-05-10 21:02:55, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:
>>>
>>>
>>> On 05/10/2018 01:28 AM, gfree.wind@vip.163.com wrote:
>>>> From: Gao Feng <gfree.wind@vip.163.com>
>>>>
>>>> The skb flow limit is implemented for each CPU independently. In the
>>>> current code, the function skb_flow_limit gets the softnet_data by
>>>> this_cpu_ptr. But the target CPU of enqueue_to_backlog may not be the
>>>> current CPU when RPS is enabled. As a result, skb_flow_limit checks the
>>>> stats of the current CPU, while the skb is going to be appended to the
>>>> queue of another CPU. It isn't the expected behavior.
>>>>
>>>> Now pass the softnet_data as a parameter to skb_flow_limit to make this
>>>> consistent.
>>>>
>>>
>>> Please add a correct Fixes: tag
>>
>> Thanks Eric.
>>
>> I have one question about the "Fixes:" tag.
>> Most patches are bug fixes, but when is it necessary to add the "Fixes:" tag, and when not?
>>
>> I'm not clear about it. Could you explain it, please?
>>
>
>For this particular patch, since you have not CC'd Willem (the author of the patch),
>it would be very useful if you did a search to find out.
>Once you have found which commit added the problem, simply add the Fixes: tag and CC: the author.
>
>Doing so saves us (stable teams, reviewers, maintainers) a lot of time really.

Normally I get the "To:" list with the get_maintainer.pl script; from now on I will save the stable team's time ASAP.

>
>In my opinion, Fixes: tags should be mandatory when applicable.

Thanks for your explanations, I get it.

Best Regards
Feng

>
>> Best Regards
>> Feng
>>
>>>
>>> By doing so, you will likely add a CC: tag to make sure the author of the code
>>> will receive your email and give feedback.
>>>
>>> Thanks !
>>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net] net: Correct wrong skb_flow_limit check when enable RPS
  2018-05-10  8:28 [PATCH net] net: Correct wrong skb_flow_limit check when enable RPS gfree.wind
  2018-05-10 13:02 ` Eric Dumazet
@ 2018-05-11  3:54 ` Willem de Bruijn
  2018-05-11  6:20   ` Gao Feng
  1 sibling, 1 reply; 11+ messages in thread
From: Willem de Bruijn @ 2018-05-11  3:54 UTC (permalink / raw)
  To: gfree.wind
  Cc: David Miller, Daniel Borkmann, jakub.kicinski, David Ahern,
	Network Development

On Thu, May 10, 2018 at 4:28 AM,  <gfree.wind@vip.163.com> wrote:
> From: Gao Feng <gfree.wind@vip.163.com>
>
> The skb flow limit is implemented for each CPU independently. In the
> current code, the function skb_flow_limit gets the softnet_data by
> this_cpu_ptr. But the target CPU of enqueue_to_backlog may not be the
> current CPU when RPS is enabled. As a result, skb_flow_limit checks the
> stats of the current CPU, while the skb is going to be appended to the
> queue of another CPU. It isn't the expected behavior.
>
> Now pass the softnet_data as a parameter to skb_flow_limit to make this
> consistent.

The local cpu softnet_data is used on purpose. The operations in
skb_flow_limit() on sd fields could race if not executed on the local cpu.

Flow limit tries to detect large ("elephant") DoS flows with a fixed four-tuple.
These would always hit the same RPS cpu, so that cpu being backlogged
may be an indication that such a flow is active. But the flow will also always
arrive on the same initial cpu courtesy of RSS. So storing the lookup table
on the initial CPU is also fine. There may be false positives on other CPUs
with the same RPS destination, but that is unlikely with a highly concurrent
traffic server mix ("mice").

Note that the sysctl net.core.flow_limit_cpu_bitmap enables the feature
for the cpus on which traffic initially lands, not the RPS destination cpus.
See also Documentation/networking/scaling.txt

That said, I had to reread the code, as it does seem sensible that the
same softnet_data is intended to be used both when testing qlen and
flow_limit.
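
For reference, a condensed sketch of the RPS enqueue path shows where the
initial (IRQ) CPU and the RPS target CPU diverge. This is simplified and
reconstructed here; timestamping, tracing and drop accounting are omitted,
so details may differ from the exact tree under discussion.

/* netif_rx_internal() runs on the CPU that took the interrupt; with RPS
 * enabled, get_rps_cpu() may pick a different CPU, and enqueue_to_backlog()
 * then appends the skb to that CPU's input_pkt_queue while (pre-patch)
 * skb_flow_limit() still looks at the local CPU's flow table.
 */
static int netif_rx_internal(struct sk_buff *skb)
{
	int ret;

#ifdef CONFIG_RPS
	if (static_key_false(&rps_needed)) {
		struct rps_dev_flow voidflow, *rflow = &voidflow;
		int cpu;

		preempt_disable();
		rcu_read_lock();

		cpu = get_rps_cpu(skb->dev, skb, &rflow);	/* RPS/RFS target */
		if (cpu < 0)
			cpu = smp_processor_id();		/* fall back to local CPU */

		ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);

		rcu_read_unlock();
		preempt_enable();
	} else
#endif
	{
		unsigned int qtail;

		ret = enqueue_to_backlog(skb, get_cpu(), &qtail);
		put_cpu();
	}
	return ret;
}

With RPS enabled, enqueue_to_backlog() therefore runs on the IRQ CPU but
queues to (and measures the qlen of) the target CPU's backlog, which is the
mismatch the patch tries to address and the replies below discuss.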

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re:Re: [PATCH net] net: Correct wrong skb_flow_limit check when enable RPS
  2018-05-11  3:54 ` Willem de Bruijn
@ 2018-05-11  6:20   ` Gao Feng
  2018-05-11 13:23     ` Willem de Bruijn
  0 siblings, 1 reply; 11+ messages in thread
From: Gao Feng @ 2018-05-11  6:20 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: davem, daniel, jakub.kicinski, David Ahern, netdev

At 2018-05-11 11:54:55, "Willem de Bruijn" <willemdebruijn.kernel@gmail.com> wrote:
>On Thu, May 10, 2018 at 4:28 AM,  <gfree.wind@vip.163.com> wrote:
>> From: Gao Feng <gfree.wind@vip.163.com>
>>
>> The skb flow limit is implemented for each CPU independently. In the
>> current code, the function skb_flow_limit gets the softnet_data by
>> this_cpu_ptr. But the target CPU of enqueue_to_backlog may not be the
>> current CPU when RPS is enabled. As a result, skb_flow_limit checks the
>> stats of the current CPU, while the skb is going to be appended to the
>> queue of another CPU. It isn't the expected behavior.
>>
>> Now pass the softnet_data as a parameter to skb_flow_limit to make this
>> consistent.
>
>The local cpu softnet_data is used on purpose. The operations in
>skb_flow_limit() on sd fields could race if not executed on the local cpu.

I think the race doesn't exist because of the rps_lock.
enqueue_to_backlog holds the rps_lock before calling skb_flow_limit.

>
>Flow limit tries to detect large ("elephant") DoS flows with a fixed four-tuple.
>These would always hit the same RPS cpu, so that cpu being backlogged

They may hit a different target CPU when RFS is enabled, because the app could be scheduled
to another CPU; RFS then tries to deliver the skb to the latest core, which has a hot cache.

>may be an indication that such a flow is active. But the flow will also always
>arrive on the same initial cpu courtesy of RSS. So storing the lookup table

RSS can't make sure the IRQ is handled by the same CPU. The IRQs could be balanced between
the CPUs.

>on the initial CPU is also fine. There may be false positives on other CPUs
>with the same RPS destination, but that is unlikely with a highly concurrent
>traffic server mix ("mice").

If my comment is right, the flow wouldn't always arrive on the same initial CPU, although
it may be sent to the same target CPU.

>
>Note that the sysctl net.core.flow_limit_cpu_bitmap enables the feature
>for the cpus on which traffic initially lands, not the RPS destination cpus.
>See also Documentation/networking/scaling.txt
>
>That said, I had to reread the code, as it does seem sensible that the
>same softnet_data is intended to be used both when testing qlen and
>flow_limit.

In most cases, the user configures the same RPS map as flow_limit, e.g. 0xff,
because the user can't predict which core the evil flow will arrive on.

Take an example: there are 2 cores, cpu0 and cpu1.
One flow is an evil flow, and its IRQ is sent to cpu0. After RPS/RFS, the target cpu is cpu1.
Now cpu0 invokes enqueue_to_backlog, and skb_flow_limit checks the queue length
of cpu0. It certainly passes the check of skb_flow_limit because there is no evil flow on cpu0.
Then cpu0 inserts the skb into the queue of cpu1.
As a result, skb_flow_limit doesn't work as expected.

BTW, I have already sent the v2 patch, which only adds the "Fixes:" tag.

Best Regards
Feng



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Re: [PATCH net] net: Correct wrong skb_flow_limit check when enable RPS
  2018-05-11  6:20   ` Gao Feng
@ 2018-05-11 13:23     ` Willem de Bruijn
  2018-05-11 14:44       ` Gao Feng
  0 siblings, 1 reply; 11+ messages in thread
From: Willem de Bruijn @ 2018-05-11 13:23 UTC (permalink / raw)
  To: Gao Feng; +Cc: davem, daniel, jakub.kicinski, David Ahern, netdev

On Fri, May 11, 2018 at 2:20 AM, Gao Feng <gfree.wind@vip.163.com> wrote:
> At 2018-05-11 11:54:55, "Willem de Bruijn" <willemdebruijn.kernel@gmail.com> wrote:
>>On Thu, May 10, 2018 at 4:28 AM,  <gfree.wind@vip.163.com> wrote:
>>> From: Gao Feng <gfree.wind@vip.163.com>
>>>
>>> The skb flow limit is implemented for each CPU independently. In the
>>> current code, the function skb_flow_limit gets the softnet_data by
>>> this_cpu_ptr. But the target CPU of enqueue_to_backlog may not be the
>>> current CPU when RPS is enabled. As a result, skb_flow_limit checks the
>>> stats of the current CPU, while the skb is going to be appended to the
>>> queue of another CPU. It isn't the expected behavior.
>>>
>>> Now pass the softnet_data as a parameter to skb_flow_limit to make this
>>> consistent.
>>
>>The local cpu softnet_data is used on purpose. The operations in
>>skb_flow_limit() on sd fields could race if not executed on the local cpu.
>
> I think the race doesn't exist because of the rps_lock.
> enqueue_to_backlog holds the rps_lock before calling skb_flow_limit.

Indeed, I overlooked that. There still is the matter of cache contention.

>>Flow limit tries to detect large ("elephant") DoS flows with a fixed four-tuple.
>>These would always hit the same RPS cpu, so that cpu being backlogged
>
> They may hit a different target CPU when RFS is enabled, because the app could be scheduled
> to another CPU; RFS then tries to deliver the skb to the latest core, which has a hot cache.

This even more suggests using the initial (or IRQ) cpu to track state, instead
of the destination (RPS/RFS) cpu.

>>may be an indication that such a flow is active. But the flow will also always
>>arrive on the same initial cpu courtesy of RSS. So storing the lookup table
>
> RSS can't make sure the IRQ is handled by the same CPU. The IRQs could be balanced between
> the CPUs.

IRQs are usually pinned to cores. Unless using something like irqbalance,
but that operates at too coarse a timescale to do anything useful at Mpps
packet rates.

>>on the initial CPU is also fine. There may be false positives on other CPUs
>>with the same RPS destination, but that is unlikely with a highly concurrent
>>traffic server mix ("mice").
>
> If my comment is right, the flow wouldn't always arrive on the same initial CPU, although
> it may be sent to the same target CPU.
>
>>
>>Note that the sysctl net.core.flow_limit_cpu_bitmap enables the feature
>>for the cpus on which traffic initially lands, not the RPS destination cpus.
>>See also Documentation/networking/scaling.txt
>>
>>That said, I had to reread the code, as it does seem sensible that the
>>same softnet_data is intended to be used both when testing qlen and
>>flow_limit.
>
> In most cases, the user configures the same RPS map as flow_limit, e.g. 0xff,
> because the user can't predict which core the evil flow will arrive on.
>
> Take an example: there are 2 cores, cpu0 and cpu1.
> One flow is an evil flow, and its IRQ is sent to cpu0. After RPS/RFS, the target cpu is cpu1.
> Now cpu0 invokes enqueue_to_backlog, and skb_flow_limit checks the queue length
> of cpu0. It certainly passes the check of skb_flow_limit because there is no evil flow on cpu0.

No, enqueue_to_backlog passes qlen to skb_flow_limit, so that does
check the queue length of the RPS cpu.

> Then cpu0 inserts the skb into the queue of cpu1.
> As a result, skb_flow_limit doesn't work as expected.
>
> BTW, I have already sent the v2 patch, which only adds the "Fixes:" tag.

The change also makes the code inconsistent with
Documentation/networking/scaling.txt

"In such environments, enable the feature on all CPUs that handle
network rx interrupts (as set in /proc/irq/N/smp_affinity)."

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re:Re: Re: [PATCH net] net: Correct wrong skb_flow_limit check when enable RPS
  2018-05-11 13:23     ` Willem de Bruijn
@ 2018-05-11 14:44       ` Gao Feng
  2018-05-11 14:56         ` Willem de Bruijn
  0 siblings, 1 reply; 11+ messages in thread
From: Gao Feng @ 2018-05-11 14:44 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: davem, daniel, jakub.kicinski, David Ahern, netdev

At 2018-05-11 21:23:55, "Willem de Bruijn" <willemdebruijn.kernel@gmail.com> wrote:
>On Fri, May 11, 2018 at 2:20 AM, Gao Feng <gfree.wind@vip.163.com> wrote:
>> At 2018-05-11 11:54:55, "Willem de Bruijn" <willemdebruijn.kernel@gmail.com> wrote:
>>>On Thu, May 10, 2018 at 4:28 AM,  <gfree.wind@vip.163.com> wrote:
>>>> From: Gao Feng <gfree.wind@vip.163.com>
>>>>
>>>> The skb flow limit is implemented for each CPU independently. In the
>>>> current code, the function skb_flow_limit gets the softnet_data by
>>>> this_cpu_ptr. But the target CPU of enqueue_to_backlog may not be the
>>>> current CPU when RPS is enabled. As a result, skb_flow_limit checks the
>>>> stats of the current CPU, while the skb is going to be appended to the
>>>> queue of another CPU. It isn't the expected behavior.
>>>>
>>>> Now pass the softnet_data as a parameter to skb_flow_limit to make this
>>>> consistent.
>>>
>>>The local cpu softnet_data is used on purpose. The operations in
>>>skb_flow_limit() on sd fields could race if not executed on the local cpu.
>>
>> I think the race doesn't exist because of the rps_lock.
>> The enqueue_to_backlog has hold the rps_lock before skb_flow_limit.
>
>Indeed, I overlooked that. There still is the matter of cache contention.

Is the cache contention really important in this case?
I don't think so, because enqueue_to_backlog has already touched and modified the softnet_stat
of the target CPU.

>
>>>Flow limit tries to detect large ("elephant") DoS flows with a fixed four-tuple.
>>>These would always hit the same RPS cpu, so that cpu being backlogged
>>
>> They may hit the different target CPU when enable RFS. Because the app could be scheduled
>> to another CPU, then RFS tries to deliver the skb to latest core which has hot cache.
>
>This even more suggests using the initial (or IRQ) cpu to track state, instead
>of the destination (RPS/RFS) cpu.

I don't understand why it is better to track state on the initial CPU rather than the target CPU.
The latter could give a more accurate result.

>
>>>may be an indication that such a flow is active. But the flow will also always
>>>arrive on the same initial cpu courtesy of RSS. So storing the lookup table
>>
>> The RSS couldn't make sure the irq is handled by same cpu. It would be balanced between
>> the cpus.
>
>IRQs are usually pinned to cores. Unless using something like irqbalance,
>but that operates at too coarse a timescale to do anything useful at Mpps
>packet rates.

There are some motherboards which can't make sure the IRQ is pinned.
The flow_limit then wouldn't work as well as expected.

>
>>>on the initial CPU is also fine. There may be false positives on other CPUs
>>>with the same RPS destination, but that is unlikely with a highly concurrent
>>>traffic server mix ("mice").
>>
>> If my comment is right, the flow couldn't always arrive one the same initial cpu,  although
>> it may be sent to one same target cpu.
>>
>>>
>>>Note that the sysctl net.core.flow_limit_cpu_bitmap enables the feature
>>>for the cpus on which traffic initially lands, not the RPS destination cpus.
>>>See also Documentation/networking/scaling.txt
>>>
>>>That said, I had to reread the code, as it does seem sensible that the
>>>same softnet_data is intended to be used both when testing qlen and
>>>flow_limit.
>>
>> In most cases, user configures the same RPS map with flow_limit like 0xff.
>> Because user couldn't predict which core the evil flow would arrive on.
>>
>> Take an example, there are 2 cores, cpu0 and cpu1.
>> One flow is the an evil flow, but the irq is sent to cpu0. After RPS/RFS, the target cpu is cpu1.
>> Now cpu0 invokes enqueue_to_backlog, then the skb_flow_limit checkes the queue length
>> of cpu0. Certainly it could pass the check of skb_flow_limit because there is no any evil flow on cpu0.
>
>No, enqueue_to_backlog passes qlen to skb_flow_limit, so that does
>check the queue length of the RPS cpu.

Sorry, I overlooked that the qlen is the queue length of the RPS CPU.
Then it's OK, except that the stats may not be accurate when the IRQ isn't pinned.

But I still doubt whether it is really important to track state on the initial CPU rather than the target CPU,
because enqueue_to_backlog has already touched the softnet_data of the target CPU.

Best Regards
Feng

>
>> Then the cpu0 inserts the skb into the queue of cpu1.
>> As a result, the skb_flow_limit doesn't work as expected.
>>
>> BTW, I have already sent the v2 patch which only adds the "Fixes: tag".
>
>The change also makes the code inconsistent with
>Documentation/networking/scaling.txt
>
>"In such environments, enable the feature on all CPUs that handle
>network rx interrupts (as set in /proc/irq/N/smp_affinity)."

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Re: Re: [PATCH net] net: Correct wrong skb_flow_limit check when enable RPS
  2018-05-11 14:44       ` Gao Feng
@ 2018-05-11 14:56         ` Willem de Bruijn
  2018-05-14  1:26           ` Gao Feng
  0 siblings, 1 reply; 11+ messages in thread
From: Willem de Bruijn @ 2018-05-11 14:56 UTC (permalink / raw)
  To: Gao Feng; +Cc: davem, daniel, jakub.kicinski, David Ahern, netdev

On Fri, May 11, 2018 at 10:44 AM, Gao Feng <gfree.wind@vip.163.com> wrote:
> At 2018-05-11 21:23:55, "Willem de Bruijn" <willemdebruijn.kernel@gmail.com> wrote:
>>On Fri, May 11, 2018 at 2:20 AM, Gao Feng <gfree.wind@vip.163.com> wrote:
>>> At 2018-05-11 11:54:55, "Willem de Bruijn" <willemdebruijn.kernel@gmail.com> wrote:
>>>>On Thu, May 10, 2018 at 4:28 AM,  <gfree.wind@vip.163.com> wrote:
>>>>> From: Gao Feng <gfree.wind@vip.163.com>
>>>>>
>>>>> The skb flow limit is implemented for each CPU independently. In the
>>>>> current code, the function skb_flow_limit gets the softnet_data by
>>>>> this_cpu_ptr. But the target CPU of enqueue_to_backlog may not be the
>>>>> current CPU when RPS is enabled. As a result, skb_flow_limit checks the
>>>>> stats of the current CPU, while the skb is going to be appended to the
>>>>> queue of another CPU. It isn't the expected behavior.
>>>>>
>>>>> Now pass the softnet_data as a parameter to skb_flow_limit to make this
>>>>> consistent.
>>>>
>>>>The local cpu softnet_data is used on purpose. The operations in
>>>>skb_flow_limit() on sd fields could race if not executed on the local cpu.
>>>
>>> I think the race doesn't exist because of the rps_lock.
>>> The enqueue_to_backlog has hold the rps_lock before skb_flow_limit.
>>
>>Indeed, I overlooked that. There still is the matter of cache contention.
>
> Is the cache contention really important in this case?
> I don't think so, because enqueue_to_backlog has already touched and modified the softnet_stat
> of the target CPU.
>
>>
>>>>Flow limit tries to detect large ("elephant") DoS flows with a fixed four-tuple.
>>>>These would always hit the same RPS cpu, so that cpu being backlogged
>>>
>>> They may hit the different target CPU when enable RFS. Because the app could be scheduled
>>> to another CPU, then RFS tries to deliver the skb to latest core which has hot cache.
>>
>>This even more suggest using the initial (or IRQ) cpu to track state, instead
>>of the destination (RPS/RFS) cpu.
>
> I don't understand why it is better to track state on the initial CPU rather than the target CPU.
> The latter could give a more accurate result.

For a single DoS flow with normal cpu pinned IRQs, the results will be equally
good when tracked on the initial IRQ cpu..

>
>>
>>>>may be an indication that such a flow is active. But the flow will also always
>>>>arrive on the same initial cpu courtesy of RSS. So storing the lookup table
>>>
>>> The RSS couldn't make sure the irq is handled by same cpu. It would be balanced between
>>> the cpus.
>>
>>IRQs are usually pinned to cores. Unless using something like irqbalance,
>>but that operates at too coarse a timescale to do anything useful at Mpps
>>packet rates.
>
> There are some motherboards which can't make sure the IRQ is pinned.
> The flow_limit then wouldn't work as well as expected.

.. this seems to be the crux of the argument. I am not aware of any network
interrupts that do not adhere to the cpu pinning configuration in

   /proc/irq/$IRQ/smp_affinity(_list)

What kind of hardware ignores this setting and sprays interrupts? I agree
that in that case flow_limit as is may be ineffective (if migration happens
at rates comparable to packet rates). But this should not happen?

>
>>
>>>>on the initial CPU is also fine. There may be false positives on other CPUs
>>>>with the same RPS destination, but that is unlikely with a highly concurrent
>>>>traffic server mix ("mice").
>>>
>>> If my comment is right, the flow couldn't always arrive one the same initial cpu,  although
>>> it may be sent to one same target cpu.
>>>
>>>>
>>>>Note that the sysctl net.core.flow_limit_cpu_bitmap enables the feature
>>>>for the cpus on which traffic initially lands, not the RPS destination cpus.
>>>>See also Documentation/networking/scaling.txt
>>>>
>>>>That said, I had to reread the code, as it does seem sensible that the
>>>>same softnet_data is intended to be used both when testing qlen and
>>>>flow_limit.
>>>
>>> In most cases, user configures the same RPS map with flow_limit like 0xff.
>>> Because user couldn't predict which core the evil flow would arrive on.
>>>
>>> Take an example, there are 2 cores, cpu0 and cpu1.
>>> One flow is the an evil flow, but the irq is sent to cpu0. After RPS/RFS, the target cpu is cpu1.
>>> Now cpu0 invokes enqueue_to_backlog, then the skb_flow_limit checkes the queue length
>>> of cpu0. Certainly it could pass the check of skb_flow_limit because there is no any evil flow on cpu0.
>>
>>No, enqueue_to_backlog passes qlen to skb_flow_limit, so that does
>>check the queue length of the RPS cpu.
>
> Sorry, I overlooked that the qlen is the queue length of the RPS CPU.
> Then it's OK, except that the stats may not be accurate when the IRQ isn't pinned.
>
> But I still doubt whether it is really important to track state on the initial CPU rather than the target CPU,
> because enqueue_to_backlog has already touched the softnet_data of the target CPU.

I think the merit of both IRQ and RPS cpu can be argued for attaching the
flow_limit state.

Either way, the current behavior is not a bug, so I don't think that this is a
candidate for net.

The cost of moving from IRQ to RPS cpu will be the cacheline contention
on a system with multiple IRQ cpus that all try to update the sd->flow_limit
of the same RPS cpus, which is particularly likely with RFS. I suspect that
this cost is non-trivial and not worth the benefit of handling hardware with
unpinned IRQs.
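
For context on the contention argument, the per-CPU state that every enqueue
mutates is roughly the following. This is a sketch of struct sd_flow_limit
from include/linux/netdevice.h; the exact layout and comments here are
illustrative and may differ from the tree under discussion.

/* Per-CPU flow limit state, allocated for a CPU when that CPU's bit is set
 * in net.core.flow_limit_cpu_bitmap.
 */
struct sd_flow_limit {
	u64		count;		/* packets dropped by the flow limiter */
	unsigned int	num_buckets;	/* netdev_flow_limit_table_len, a power of two */
	unsigned int	history_head;	/* ring index, advanced on every enqueue */
	u16		history[FLOW_LIMIT_HISTORY]; /* buckets of the last N packets */
	u8		buckets[];	/* per-bucket counts within that window */
};

Every call to skb_flow_limit() writes history_head, one history[] slot and
two buckets[] counters, so if several IRQ CPUs shared the table of one RPS
CPU those writes would bounce cache lines between them, which is the cost
referred to above.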

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re:Re: Re: Re: [PATCH net] net: Correct wrong skb_flow_limit check when enable RPS
  2018-05-11 14:56         ` Willem de Bruijn
@ 2018-05-14  1:26           ` Gao Feng
  0 siblings, 0 replies; 11+ messages in thread
From: Gao Feng @ 2018-05-14  1:26 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: davem, daniel, jakub.kicinski, David Ahern, netdev

At 2018-05-11 22:56:04, "Willem de Bruijn" <willemdebruijn.kernel@gmail.com> wrote:
>On Fri, May 11, 2018 at 10:44 AM, Gao Feng <gfree.wind@vip.163.com> wrote:
>> At 2018-05-11 21:23:55, "Willem de Bruijn" <willemdebruijn.kernel@gmail.com> wrote:
>>>On Fri, May 11, 2018 at 2:20 AM, Gao Feng <gfree.wind@vip.163.com> wrote:
>>>> At 2018-05-11 11:54:55, "Willem de Bruijn" <willemdebruijn.kernel@gmail.com> wrote:
>>>>>On Thu, May 10, 2018 at 4:28 AM,  <gfree.wind@vip.163.com> wrote:
>>>>>> From: Gao Feng <gfree.wind@vip.163.com>
>>>>>>
>>>>>> The skb flow limit is implemented for each CPU independently. In the
>>>>>> current code, the function skb_flow_limit gets the softnet_data by
>>>>>> this_cpu_ptr. But the target CPU of enqueue_to_backlog may not be the
>>>>>> current CPU when RPS is enabled. As a result, skb_flow_limit checks the
>>>>>> stats of the current CPU, while the skb is going to be appended to the
>>>>>> queue of another CPU. It isn't the expected behavior.
>>>>>>
>>>>>> Now pass the softnet_data as a parameter to skb_flow_limit to make this
>>>>>> consistent.
>>>>>
>>>>>The local cpu softnet_data is used on purpose. The operations in
>>>>>skb_flow_limit() on sd fields could race if not executed on the local cpu.
>>>>
>>>> I think the race doesn't exist because of the rps_lock.
>>>> The enqueue_to_backlog has hold the rps_lock before skb_flow_limit.
>>>
>>>Indeed, I overlooked that. There still is the matter of cache contention.
>>
>> The cache contention is really important in this case?
>> I don't think so, because the enqueue_to_backlog have touched and modified the softnet_stat
>> of target cpu.
>>
>>>
>>>>>Flow limit tries to detect large ("elephant") DoS flows with a fixed four-tuple.
>>>>>These would always hit the same RPS cpu, so that cpu being backlogged
>>>>
>>>> They may hit the different target CPU when enable RFS. Because the app could be scheduled
>>>> to another CPU, then RFS tries to deliver the skb to latest core which has hot cache.
>>>
>>>This even more suggest using the initial (or IRQ) cpu to track state, instead
>>>of the destination (RPS/RFS) cpu.
>>
>> I couldn't understand why it is better to track state on initial cpu, not the target cpu.
>> The latter one could get more accurate result.
>
>For a single DoS flow with normal cpu pinned IRQs, the results will be equally
>good when tracked on the initial IRQ cpu..
>
>>
>>>
>>>>>may be an indication that such a flow is active. But the flow will also always
>>>>>arrive on the same initial cpu courtesy of RSS. So storing the lookup table
>>>>
>>>> The RSS couldn't make sure the irq is handled by same cpu. It would be balanced between
>>>> the cpus.
>>>
>>>IRQs are usually pinned to cores. Unless using something like irqbalance,
>>>but that operates at too coarse a timescale to do anything useful at Mpps
>>>packet rates.
>>
>> There are some motherboard which couldn't make sure the irq is pinned.
>> The flow_limit wouldn't work as well as expected.
>
>.. this seems to be the crux of the argument. I am not aware of any network
>interrupts that do not adhere to the cpu pinning configuration in
>
>   /proc/irq/$IRQ/smp_affinity(_list)
>

When smp_affinity is configured as 0xff, I have met some hardware that delivers most of the
IRQs to a specific CPU, while the remaining IRQs are spread to the other CPUs. It can't
make sure the IRQ of one flow is delivered to a fixed CPU.

I don't know whether it is caused by the APIC or the motherboard.
What information do you need to confirm it?

>What kind of hardware ignores this setting and sprays interrupts? I agree
>that in that case flow_limit as is may be ineffective (if migration happens
>at rates comparable to packet rates). But this should not happen?
>
>>
>>>
>>>>>on the initial CPU is also fine. There may be false positives on other CPUs
>>>>>with the same RPS destination, but that is unlikely with a highly concurrent
>>>>>traffic server mix ("mice").
>>>>
>>>> If my comment is right, the flow couldn't always arrive one the same initial cpu,  although
>>>> it may be sent to one same target cpu.
>>>>
>>>>>
>>>>>Note that the sysctl net.core.flow_limit_cpu_bitmap enables the feature
>>>>>for the cpus on which traffic initially lands, not the RPS destination cpus.
>>>>>See also Documentation/networking/scaling.txt
>>>>>
>>>>>That said, I had to reread the code, as it does seem sensible that the
>>>>>same softnet_data is intended to be used both when testing qlen and
>>>>>flow_limit.
>>>>
>>>> In most cases, user configures the same RPS map with flow_limit like 0xff.
>>>> Because user couldn't predict which core the evil flow would arrive on.
>>>>
>>>> Take an example, there are 2 cores, cpu0 and cpu1.
>>>> One flow is the an evil flow, but the irq is sent to cpu0. After RPS/RFS, the target cpu is cpu1.
>>>> Now cpu0 invokes enqueue_to_backlog, then the skb_flow_limit checkes the queue length
>>>> of cpu0. Certainly it could pass the check of skb_flow_limit because there is no any evil flow on cpu0.
>>>
>>>No, enqueue_to_backlog passes qlen to skb_flow_limit, so that does
>>>check the queue length of the RPS cpu.
>>
>> Sorry, I overlooked the qlen is the length of the rps cpu.
>> Then it's ok unless the stats may be not accurate when irq isn't pinned.
>>
>> But I still doubt that is it really important to track state on initial cpu, not target cpu?
>> Because the enqueue_to_backlog have touched the softnet_data of target cpu.
>
>I think the merit of both IRQ and RPS cpu can be argued for attaching the
>flow_limit state.
>
>Either way, the current behavior is not a bug, so I don't think that this is a
>candidate for net.
>
>The cost of moving from IRQ to RPS cpu will be the cacheline contention
>on a system with multiple IRQ cpus that all try to update the sd->flow_limit
>of the same RPS cpus, which is particularly likely with RFS. I suspect that
>this cost is non-trivial and not worth the benefit of handling hardware with
>unpinned IRQs.

OK, I agree with you.
Thank you for the detailed discussion.

Best Regards
Feng

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-05-14  1:32 UTC | newest]

Thread overview: 11+ messages
2018-05-10  8:28 [PATCH net] net: Correct wrong skb_flow_limit check when enable RPS gfree.wind
2018-05-10 13:02 ` Eric Dumazet
2018-05-11  0:18   ` Gao Feng
2018-05-11  0:55     ` Eric Dumazet
2018-05-11  1:29       ` Gao Feng
2018-05-11  3:54 ` Willem de Bruijn
2018-05-11  6:20   ` Gao Feng
2018-05-11 13:23     ` Willem de Bruijn
2018-05-11 14:44       ` Gao Feng
2018-05-11 14:56         ` Willem de Bruijn
2018-05-14  1:26           ` Gao Feng
