linux-kernel.vger.kernel.org archive mirror
* [RFC] net;sched: Try to find idle cpu for RPS to handle packets
@ 2018-09-19 12:28 Kirill Tkhai
  2018-09-19 14:55 ` Eric Dumazet
  0 siblings, 1 reply; 6+ messages in thread
From: Kirill Tkhai @ 2018-09-19 12:28 UTC (permalink / raw)
  To: peterz, davem, daniel, edumazet, tom, ktkhai, netdev, linux-kernel

Many workloads poll for work. The application
checks for incoming packets from time to time, but it also
has other work to do when there are no packets. This RFC
develops the idea of queueing RPS packets on an idle
CPU in the L3 domain of the consumer, so that backlog
processing of the packets and the application can execute
in parallel.

We need this when a network card does not have enough
RX queues to cover all online CPUs (which seems to be
most cards): get_rps_cpu() then actually chooses a
remote cpu, and an SMP interrupt is sent. Here we may
try our best to find an idle CPU near the consumer's
CPU. Note that when a consumer works in poll mode and
does not wait for incoming packets, its CPU will not be
idle, while the CPU of a sleeping consumer may be. So
non-polling consumers will still get their skbs handled
on their own CPU.

When a network card has many queues, the device
interrupts arrive on the consumer's CPU, and this patch
won't try to find an idle cpu for them.

I've tried simple netperf test for this:
netserver -p 1234
netperf -L 127.0.0.1 -p 1234 -l 100
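
For this loopback test to exercise get_rps_cpu() at all, RPS has to be
enabled on lo first; something like the following (the CPU mask value is
only an example, assuming a 4-CPU machine):

```shell
# Enable RPS on loopback so backlog processing can run on remote CPUs.
# "f" = CPUs 0-3; pick a mask matching your machine.
echo f > /sys/class/net/lo/queues/rx-0/rps_cpus
```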

Before:
 87380  16384  16384    100.00   60323.56
 87380  16384  16384    100.00   60388.46
 87380  16384  16384    100.00   60217.68
 87380  16384  16384    100.00   57995.41
 87380  16384  16384    100.00   60659.00

After:
 87380  16384  16384    100.00   64569.09
 87380  16384  16384    100.00   64569.25
 87380  16384  16384    100.00   64691.63
 87380  16384  16384    100.00   64930.14
 87380  16384  16384    100.00   62670.15

The best runs differ by +7%,
the worst runs by +8%.

What do you think about moving in this direction?

[This also requires a pre-patch, which exports
 select_idle_sibling() and teaches it to handle
 a NULL task argument, but since it's not very
 interesting to look at, I've skipped sending it.]
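
Roughly, that pre-patch might look like the sketch below. This is
pseudocode against the 4.18-era scheduler, where select_idle_sibling()
is a static function in kernel/sched/fair.c; the NULL-task handling
shown is my guess at the skipped patch, not the author's actual code:

```c
/* Hypothetical sketch only -- not the posted pre-patch. */

/* kernel/sched/fair.c: drop "static" and export the symbol */
int select_idle_sibling(struct task_struct *p, int prev, int target)
{
	...
	/* Wherever the task's affinity mask is consulted, treat a
	 * NULL task as "any CPU is allowed":
	 */
	if (p && !cpumask_test_cpu(cpu, &p->cpus_allowed))
		continue;
	...
}
EXPORT_SYMBOL_GPL(select_idle_sibling);
```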

Kirill
---
 net/core/dev.c |   34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 559a91271f82..9a867ff34622 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3738,13 +3738,12 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		       struct rps_dev_flow **rflowp)
 {
-	const struct rps_sock_flow_table *sock_flow_table;
+	struct rps_sock_flow_table *sock_flow_table;
 	struct netdev_rx_queue *rxqueue = dev->_rx;
 	struct rps_dev_flow_table *flow_table;
 	struct rps_map *map;
+	u32 tcpu, hash, val;
 	int cpu = -1;
-	u32 tcpu;
-	u32 hash;
 
 	if (skb_rx_queue_recorded(skb)) {
 		u16 index = skb_get_rx_queue(skb);
@@ -3774,6 +3773,9 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	sock_flow_table = rcu_dereference(rps_sock_flow_table);
 	if (flow_table && sock_flow_table) {
 		struct rps_dev_flow *rflow;
+		bool want_new_cpu = false;
+		unsigned long flags;
+		unsigned int qhead;
 		u32 next_cpu;
 		u32 ident;
 
@@ -3801,12 +3803,26 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		 *     This guarantees that all previous packets for the flow
 		 *     have been dequeued, thus preserving in order delivery.
 		 */
-		if (unlikely(tcpu != next_cpu) &&
-		    (tcpu >= nr_cpu_ids || !cpu_online(tcpu) ||
-		     ((int)(per_cpu(softnet_data, tcpu).input_queue_head -
-		      rflow->last_qtail)) >= 0)) {
-			tcpu = next_cpu;
-			rflow = set_rps_cpu(dev, skb, rflow, next_cpu);
+		if (tcpu != next_cpu) {
+			qhead = per_cpu(softnet_data, tcpu).input_queue_head;
+			if (tcpu >= nr_cpu_ids || !cpu_online(tcpu) ||
+			    (int)(qhead - rflow->last_qtail) >= 0)
+				want_new_cpu = true;
+		} else if (tcpu < nr_cpu_ids && cpu_online(tcpu) &&
+			   tcpu != smp_processor_id() && !available_idle_cpu(tcpu)) {
+			want_new_cpu = true;
+		}
+
+		if (want_new_cpu) {
+			local_irq_save(flags);
+			next_cpu = select_idle_sibling(NULL, next_cpu, next_cpu);
+			local_irq_restore(flags);
+			if (tcpu != next_cpu) {
+				tcpu = next_cpu;
+				rflow = set_rps_cpu(dev, skb, rflow, tcpu);
+				val = (hash & ~rps_cpu_mask) | tcpu;
+				sock_flow_table->ents[hash & sock_flow_table->mask] = val;
+			}
 		}
 
 		if (tcpu < nr_cpu_ids && cpu_online(tcpu)) {


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [RFC] net;sched: Try to find idle cpu for RPS to handle packets
  2018-09-19 12:28 [RFC] net;sched: Try to find idle cpu for RPS to handle packets Kirill Tkhai
@ 2018-09-19 14:55 ` Eric Dumazet
  2018-09-19 15:41   ` Kirill Tkhai
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2018-09-19 14:55 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Peter Zijlstra, David Miller, Daniel Borkmann, tom, netdev, LKML

On Wed, Sep 19, 2018 at 5:29 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> Many workloads poll for work. The application
> checks for incoming packets from time to time, but it also
> has other work to do when there are no packets. This RFC
> develops the idea of queueing RPS packets on an idle
> CPU in the L3 domain of the consumer, so that backlog
> processing of the packets and the application can execute
> in parallel.
>
> We need this when a network card does not have enough
> RX queues to cover all online CPUs (which seems to be
> most cards): get_rps_cpu() then actually chooses a
> remote cpu, and an SMP interrupt is sent. Here we may
> try our best to find an idle CPU near the consumer's
> CPU. Note that when a consumer works in poll mode and
> does not wait for incoming packets, its CPU will not be
> idle, while the CPU of a sleeping consumer may be. So
> non-polling consumers will still get their skbs handled
> on their own CPU.
>
> When a network card has many queues, the device
> interrupts arrive on the consumer's CPU, and this patch
> won't try to find an idle cpu for them.
>
> I've tried simple netperf test for this:
> netserver -p 1234
> netperf -L 127.0.0.1 -p 1234 -l 100
>
> Before:
>  87380  16384  16384    100.00   60323.56
>  87380  16384  16384    100.00   60388.46
>  87380  16384  16384    100.00   60217.68
>  87380  16384  16384    100.00   57995.41
>  87380  16384  16384    100.00   60659.00
>
> After:
>  87380  16384  16384    100.00   64569.09
>  87380  16384  16384    100.00   64569.25
>  87380  16384  16384    100.00   64691.63
>  87380  16384  16384    100.00   64930.14
>  87380  16384  16384    100.00   62670.15
>
> The best runs differ by +7%,
> the worst runs by +8%.
>
> What do you think about moving in this direction?

Hi Kirill

In my experience, the scheduler has a poor view of softirq processing
happening on the various cpus.
A cpu spending 90% of its cycles processing IRQs might be considered 'idle'.

So please run a real workload (it is _very_ uncommon for anyone to set
up RPS on the lo interface!)

Like 400 or more concurrent netperf -t TCP_RR on a 10Gbit NIC.

Thanks.

PS: Idea of playing with L3 domains is interesting, I have personally
tried various strategies in the past but none of them
demonstrated a clear win.


* Re: [RFC] net;sched: Try to find idle cpu for RPS to handle packets
  2018-09-19 14:55 ` Eric Dumazet
@ 2018-09-19 15:41   ` Kirill Tkhai
  2018-09-19 15:49     ` Eric Dumazet
  0 siblings, 1 reply; 6+ messages in thread
From: Kirill Tkhai @ 2018-09-19 15:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, David Miller, Daniel Borkmann, tom, netdev, LKML

On 19.09.2018 17:55, Eric Dumazet wrote:
> On Wed, Sep 19, 2018 at 5:29 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> Many workloads poll for work. The application
>> checks for incoming packets from time to time, but it also
>> has other work to do when there are no packets. This RFC
>> develops the idea of queueing RPS packets on an idle
>> CPU in the L3 domain of the consumer, so that backlog
>> processing of the packets and the application can execute
>> in parallel.
>>
>> We need this when a network card does not have enough
>> RX queues to cover all online CPUs (which seems to be
>> most cards): get_rps_cpu() then actually chooses a
>> remote cpu, and an SMP interrupt is sent. Here we may
>> try our best to find an idle CPU near the consumer's
>> CPU. Note that when a consumer works in poll mode and
>> does not wait for incoming packets, its CPU will not be
>> idle, while the CPU of a sleeping consumer may be. So
>> non-polling consumers will still get their skbs handled
>> on their own CPU.
>>
>> When a network card has many queues, the device
>> interrupts arrive on the consumer's CPU, and this patch
>> won't try to find an idle cpu for them.
>>
>> I've tried simple netperf test for this:
>> netserver -p 1234
>> netperf -L 127.0.0.1 -p 1234 -l 100
>>
>> Before:
>>  87380  16384  16384    100.00   60323.56
>>  87380  16384  16384    100.00   60388.46
>>  87380  16384  16384    100.00   60217.68
>>  87380  16384  16384    100.00   57995.41
>>  87380  16384  16384    100.00   60659.00
>>
>> After:
>>  87380  16384  16384    100.00   64569.09
>>  87380  16384  16384    100.00   64569.25
>>  87380  16384  16384    100.00   64691.63
>>  87380  16384  16384    100.00   64930.14
>>  87380  16384  16384    100.00   62670.15
>>
>> The best runs differ by +7%,
>> the worst runs by +8%.
>>
>> What do you think about moving in this direction?
> 
> Hi Kirill
> 
> In my experience, scheduler has a poor view of softirq processing
> happening on various cpus.
> A cpu spending 90% of its cycles processing IRQ might be considered 'idle'

Yes, when softirq runs on top of irq_exit(), the cpu is not
considered busy. But after MAX_SOFTIRQ_TIME (= 2ms), ksoftirqd is
woken up to execute the work in process context, and the processor
is then considered !idle. 2ms is 2 timer ticks with HZ=1000. So we
don't restart softirq once it has executed for more than 2ms.

In the same way, a single net_rx_action() can't execute for longer
than 2ms.

Having 90% load in softirq (called on top of irq_exit()) should be
a very unlikely situation, where there are too many interrupts with
only a small amount of work for the related softirq calls to do for
each of them. I think that would be a problem even in the plain napi
case, since things would not work as expected.

But anyway: you worry that, while handling the next batch of skbs,
we find that the previous batch has already woken ksoftirqd, so we
don't see that cpu as idle? Yeah, then we would try to change the
cpu, and that is not what we want. We want to keep using the cpu
where the previous batch was handled. Hmm, I can't answer that
quickly, but certainly this could be handled in some more creative
way.

> So please run a real workload (it is _very_ uncommon anyone set up RPS
> on lo interface !)
>
> Like 400 or more concurrent netperf -t TCP_RR on a 10Gbit NIC.

Yeah, it's just a simulation of a single-irq nic. I'll try it on
more realistic hardware.

How do you run such tests? I don't see an appropriate netperf
parameter. Does this mean just starting 400 copies of netperf? How
do you aggregate their results in that case?
> Thanks.
> 
> PS: Idea of playing with L3 domains is interesting, I have personally
> tried various strategies in the past but none of them
> demonstrated a clear win.

Thanks,
Kirill


* Re: [RFC] net;sched: Try to find idle cpu for RPS to handle packets
  2018-09-19 15:41   ` Kirill Tkhai
@ 2018-09-19 15:49     ` Eric Dumazet
  2018-09-19 15:58       ` Kirill Tkhai
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2018-09-19 15:49 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Peter Zijlstra, David Miller, Daniel Borkmann, tom, netdev, LKML


On Wed, Sep 19, 2018 at 8:41 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 19.09.2018 17:55, Eric Dumazet wrote:
> > On Wed, Sep 19, 2018 at 5:29 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>
> >> Many workloads poll for work. The application
> >> checks for incoming packets from time to time, but it also
> >> has other work to do when there are no packets. This RFC
> >> develops the idea of queueing RPS packets on an idle
> >> CPU in the L3 domain of the consumer, so that backlog
> >> processing of the packets and the application can execute
> >> in parallel.
> >>
> >> We need this when a network card does not have enough
> >> RX queues to cover all online CPUs (which seems to be
> >> most cards): get_rps_cpu() then actually chooses a
> >> remote cpu, and an SMP interrupt is sent. Here we may
> >> try our best to find an idle CPU near the consumer's
> >> CPU. Note that when a consumer works in poll mode and
> >> does not wait for incoming packets, its CPU will not be
> >> idle, while the CPU of a sleeping consumer may be. So
> >> non-polling consumers will still get their skbs handled
> >> on their own CPU.
> >>
> >> When a network card has many queues, the device
> >> interrupts arrive on the consumer's CPU, and this patch
> >> won't try to find an idle cpu for them.
> >>
> >> I've tried simple netperf test for this:
> >> netserver -p 1234
> >> netperf -L 127.0.0.1 -p 1234 -l 100
> >>
> >> Before:
> >>  87380  16384  16384    100.00   60323.56
> >>  87380  16384  16384    100.00   60388.46
> >>  87380  16384  16384    100.00   60217.68
> >>  87380  16384  16384    100.00   57995.41
> >>  87380  16384  16384    100.00   60659.00
> >>
> >> After:
> >>  87380  16384  16384    100.00   64569.09
> >>  87380  16384  16384    100.00   64569.25
> >>  87380  16384  16384    100.00   64691.63
> >>  87380  16384  16384    100.00   64930.14
> >>  87380  16384  16384    100.00   62670.15
> >>
> >> The best runs differ by +7%,
> >> the worst runs by +8%.
> >>
> >> What do you think about moving in this direction?
> >
> > Hi Kirill
> >
> > In my experience, scheduler has a poor view of softirq processing
> > happening on various cpus.
> > A cpu spending 90% of its cycles processing IRQ might be considered 'idle'
>
> Yes, when softirq runs on top of irq_exit(), the cpu is not
> considered busy. But after MAX_SOFTIRQ_TIME (= 2ms), ksoftirqd is
> woken up to execute the work in process context, and the processor
> is then considered !idle. 2ms is 2 timer ticks with HZ=1000. So we
> don't restart softirq once it has executed for more than 2ms.
>

That's the theory, but reality is very different, unfortunately.

If RFS/RPS is set up properly, we really do not hit the MAX_SOFTIRQ_TIME
condition, except perhaps in some synthetic benchmarks.

> In the same way, a single net_rx_action() can't execute for longer
> than 2ms.
>
> Having 90% load in softirq (called on top of irq_exit()) should be
> a very unlikely situation, where there are too many interrupts with
> only a small amount of work for the related softirq calls to do for
> each of them. I think that would be a problem even in the plain napi
> case, since things would not work as expected.
>
> But anyway: you worry that, while handling the next batch of skbs,
> we find that the previous batch has already woken ksoftirqd, so we
> don't see that cpu as idle? Yeah, then we would try to change the
> cpu, and that is not what we want. We want to keep using the cpu
> where the previous batch was handled. Hmm, I can't answer that
> quickly, but certainly this could be handled in some more creative
> way.
>
> > So please run a real workload (it is _very_ uncommon anyone set up RPS
> > on lo interface !)
> >
> > Like 400 or more concurrent netperf -t TCP_RR on a 10Gbit NIC.
>
> Yeah, it's just a simulation of a single-irq nic. I'll try it on
> more realistic hardware.

Also my concern is that you might get results that are tied to a particular
version of the process scheduler, platform, workload...

One month later, a small change in the process scheduler, and you get
very different results.

This is why I believe this new feature must be controllable via a new
tunable (like RPS/RFS are controllable per rx queue).

>
> How do you run such tests? I don't see an appropriate netperf
> parameter. Does this mean just starting 400 copies of netperf? How
> do you aggregate their results in that case?

Yeah, there are various 'super_netperf' scripts available on the net
(almost trivial to write anyway)

( I am attaching one of them)

Thanks.
>
> > Thanks.
> >
> > PS: Idea of playing with L3 domains is interesting, I have personally
> > tried various strategies in the past but none of them
> > demonstrated a clear win.
>
> Thanks,
> Kirill

[-- Attachment #2: super_netperf --]
[-- Type: application/octet-stream, Size: 435 bytes --]

#!/bin/bash

run_netperf() {
	loops=$1
	shift
	for ((i=0; i<loops; i++)); do
		./netperf -s 2 $@ | awk '/Min/{
			if (!once) {
				print;
				once=1;
			}
		}
		{
			if (NR == 6)
				save = $NF
			else if (NR==7) {
				if (NF > 0)
					print $NF
				else
					print save
			} else if (NR==11) {
				print $0
			}
		}' &
	done
	wait
	return 0
}

run_netperf $@ | awk '{if (NF==7) {print $0; next}} {sum += $1} END {printf "%7u\n",sum}'
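
The final stage of the pipeline above simply sums the per-instance
throughput column. With synthetic input it behaves like this (the
`NF==7` branch, which passes header lines through, and the `%7u`
field-width padding are omitted here for brevity):

```shell
# Feed three synthetic per-instance throughput results through the
# same aggregation awk used by super_netperf; it prints their sum.
printf '60323\n64569\n60217\n' | awk '{sum += $1} END {printf "%u\n", sum}'
# 185109
```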


* Re: [RFC] net;sched: Try to find idle cpu for RPS to handle packets
  2018-09-19 15:49     ` Eric Dumazet
@ 2018-09-19 15:58       ` Kirill Tkhai
  2018-09-27 16:17         ` Willem de Bruijn
  0 siblings, 1 reply; 6+ messages in thread
From: Kirill Tkhai @ 2018-09-19 15:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, David Miller, Daniel Borkmann, tom, netdev, LKML

On 19.09.2018 18:49, Eric Dumazet wrote:
> On Wed, Sep 19, 2018 at 8:41 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 19.09.2018 17:55, Eric Dumazet wrote:
>>> On Wed, Sep 19, 2018 at 5:29 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>
>>>> Many workloads poll for work. The application
>>>> checks for incoming packets from time to time, but it also
>>>> has other work to do when there are no packets. This RFC
>>>> develops the idea of queueing RPS packets on an idle
>>>> CPU in the L3 domain of the consumer, so that backlog
>>>> processing of the packets and the application can execute
>>>> in parallel.
>>>>
>>>> We need this when a network card does not have enough
>>>> RX queues to cover all online CPUs (which seems to be
>>>> most cards): get_rps_cpu() then actually chooses a
>>>> remote cpu, and an SMP interrupt is sent. Here we may
>>>> try our best to find an idle CPU near the consumer's
>>>> CPU. Note that when a consumer works in poll mode and
>>>> does not wait for incoming packets, its CPU will not be
>>>> idle, while the CPU of a sleeping consumer may be. So
>>>> non-polling consumers will still get their skbs handled
>>>> on their own CPU.
>>>>
>>>> When a network card has many queues, the device
>>>> interrupts arrive on the consumer's CPU, and this patch
>>>> won't try to find an idle cpu for them.
>>>>
>>>> I've tried simple netperf test for this:
>>>> netserver -p 1234
>>>> netperf -L 127.0.0.1 -p 1234 -l 100
>>>>
>>>> Before:
>>>>  87380  16384  16384    100.00   60323.56
>>>>  87380  16384  16384    100.00   60388.46
>>>>  87380  16384  16384    100.00   60217.68
>>>>  87380  16384  16384    100.00   57995.41
>>>>  87380  16384  16384    100.00   60659.00
>>>>
>>>> After:
>>>>  87380  16384  16384    100.00   64569.09
>>>>  87380  16384  16384    100.00   64569.25
>>>>  87380  16384  16384    100.00   64691.63
>>>>  87380  16384  16384    100.00   64930.14
>>>>  87380  16384  16384    100.00   62670.15
>>>>
>>>> The best runs differ by +7%,
>>>> the worst runs by +8%.
>>>>
>>>> What do you think about moving in this direction?
>>>
>>> Hi Kirill
>>>
>>> In my experience, scheduler has a poor view of softirq processing
>>> happening on various cpus.
>>> A cpu spending 90% of its cycles processing IRQ might be considered 'idle'
>>
>> Yes, when softirq runs on top of irq_exit(), the cpu is not
>> considered busy. But after MAX_SOFTIRQ_TIME (= 2ms), ksoftirqd is
>> woken up to execute the work in process context, and the processor
>> is then considered !idle. 2ms is 2 timer ticks with HZ=1000. So we
>> don't restart softirq once it has executed for more than 2ms.
>>
> 
> That's the theory, but reality is very different unfortunately.
> 
> If RFS/RPS is setup properly, we really do not hit MAX_SOFTIRQ_TIME condition
> unless in some synthetic benchmarks maybe.
> 
>> In the same way, a single net_rx_action() can't execute for longer
>> than 2ms.
>>
>> Having 90% load in softirq (called on top of irq_exit()) should be
>> a very unlikely situation, where there are too many interrupts with
>> only a small amount of work for the related softirq calls to do for
>> each of them. I think that would be a problem even in the plain napi
>> case, since things would not work as expected.
>>
>> But anyway: you worry that, while handling the next batch of skbs,
>> we find that the previous batch has already woken ksoftirqd, so we
>> don't see that cpu as idle? Yeah, then we would try to change the
>> cpu, and that is not what we want. We want to keep using the cpu
>> where the previous batch was handled. Hmm, I can't answer that
>> quickly, but certainly this could be handled in some more creative
>> way.
>>
>>> So please run a real workload (it is _very_ uncommon anyone set up RPS
>>> on lo interface !)
>>>
>>> Like 400 or more concurrent netperf -t TCP_RR on a 10Gbit NIC.
>>
>> Yeah, it's just a simulation of a single-irq nic. I'll try it on
>> more realistic hardware.
> 
> Also my concern is that you might have results that are tied to a particular
> version of process scheduling, platform, workload...
> 
> One month later, a small change in process scheduler,
> and very different results.

Maybe, but that particular function's logic has not changed for a
long time, 10 years at least. The only recent change is that Peter
added idle-core searching.

> This is why I believe this new feature must be controllable, via a new
> tunable (like RPS/RFS are controllable per rx queue)
> 
>>
>> How do you run such tests? I don't see an appropriate netperf
>> parameter. Does this mean just starting 400 copies of netperf? How
>> do you aggregate their results in that case?
> 
> Yeah, there are various 'super_netperf' scripts available on the net
> (almost trivial to write anyway)
> 
> ( I am attaching one of them)

Thanks.

> Thanks.
>>
>>> Thanks.
>>>
>>> PS: Idea of playing with L3 domains is interesting, I have personally
>>> tried various strategies in the past but none of them
>>> demonstrated a clear win.
>>
>> Thanks,
>> Kirill


* Re: [RFC] net;sched: Try to find idle cpu for RPS to handle packets
  2018-09-19 15:58       ` Kirill Tkhai
@ 2018-09-27 16:17         ` Willem de Bruijn
  0 siblings, 0 replies; 6+ messages in thread
From: Willem de Bruijn @ 2018-09-27 16:17 UTC (permalink / raw)
  To: ktkhai
  Cc: Eric Dumazet, peterz, David Miller, Daniel Borkmann, Tom Herbert,
	Network Development, LKML

On Wed, Sep 19, 2018 at 12:02 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 19.09.2018 18:49, Eric Dumazet wrote:
> > On Wed, Sep 19, 2018 at 8:41 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>
> >> On 19.09.2018 17:55, Eric Dumazet wrote:
> >>> On Wed, Sep 19, 2018 at 5:29 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>>>
> >>>> Many workloads poll for work. The application
> >>>> checks for incoming packets from time to time, but it also
> >>>> has other work to do when there are no packets. This RFC
> >>>> develops the idea of queueing RPS packets on an idle
> >>>> CPU in the L3 domain of the consumer, so that backlog
> >>>> processing of the packets and the application can execute
> >>>> in parallel.
> >>>>
> >>>> We need this when a network card does not have enough
> >>>> RX queues to cover all online CPUs (which seems to be
> >>>> most cards): get_rps_cpu() then actually chooses a
> >>>> remote cpu, and an SMP interrupt is sent. Here we may
> >>>> try our best to find an idle CPU near the consumer's
> >>>> CPU. Note that when a consumer works in poll mode and
> >>>> does not wait for incoming packets, its CPU will not be
> >>>> idle, while the CPU of a sleeping consumer may be. So
> >>>> non-polling consumers will still get their skbs handled
> >>>> on their own CPU.
> >>>>
> >>>> When a network card has many queues, the device
> >>>> interrupts arrive on the consumer's CPU, and this patch
> >>>> won't try to find an idle cpu for them.
> >>>>
> >>>> I've tried simple netperf test for this:
> >>>> netserver -p 1234
> >>>> netperf -L 127.0.0.1 -p 1234 -l 100
> >>>>
> >>>> Before:
> >>>>  87380  16384  16384    100.00   60323.56
> >>>>  87380  16384  16384    100.00   60388.46
> >>>>  87380  16384  16384    100.00   60217.68
> >>>>  87380  16384  16384    100.00   57995.41
> >>>>  87380  16384  16384    100.00   60659.00
> >>>>
> >>>> After:
> >>>>  87380  16384  16384    100.00   64569.09
> >>>>  87380  16384  16384    100.00   64569.25
> >>>>  87380  16384  16384    100.00   64691.63
> >>>>  87380  16384  16384    100.00   64930.14
> >>>>  87380  16384  16384    100.00   62670.15
> >>>>
> >>>> The best runs differ by +7%,
> >>>> the worst runs by +8%.
> >>>>
> >>>> What do you think about moving in this direction?
> >>>
> >>> Hi Kirill
> >>>
> >>> In my experience, scheduler has a poor view of softirq processing
> >>> happening on various cpus.
> >>> A cpu spending 90% of its cycles processing IRQ might be considered 'idle'
> >>
> >> Yes, when softirq runs on top of irq_exit(), the cpu is not
> >> considered busy. But after MAX_SOFTIRQ_TIME (= 2ms), ksoftirqd is
> >> woken up to execute the work in process context, and the processor
> >> is then considered !idle. 2ms is 2 timer ticks with HZ=1000. So we
> >> don't restart softirq once it has executed for more than 2ms.
> >>
> >
> > That's the theory, but reality is very different unfortunately.
> >
> > If RFS/RPS is setup properly, we really do not hit MAX_SOFTIRQ_TIME condition
> > unless in some synthetic benchmarks maybe.
> >
> >> In the same way, a single net_rx_action() can't execute for longer
> >> than 2ms.
> >>
> >> Having 90% load in softirq (called on top of irq_exit()) should be
> >> a very unlikely situation, where there are too many interrupts with
> >> only a small amount of work for the related softirq calls to do for
> >> each of them. I think that would be a problem even in the plain napi
> >> case, since things would not work as expected.
> >>
> >> But anyway: you worry that, while handling the next batch of skbs,
> >> we find that the previous batch has already woken ksoftirqd, so we
> >> don't see that cpu as idle? Yeah, then we would try to change the
> >> cpu, and that is not what we want. We want to keep using the cpu
> >> where the previous batch was handled. Hmm, I can't answer that
> >> quickly, but certainly this could be handled in some more creative
> >> way.
> >>
> >>> So please run a real workload (it is _very_ uncommon anyone set up RPS
> >>> on lo interface !)
> >>>
> >>> Like 400 or more concurrent netperf -t TCP_RR on a 10Gbit NIC.
> >>
> >> Yeah, it's just a simulation of a single-irq nic. I'll try it on
> >> more realistic hardware.
> >
> > Also my concern is that you might have results that are tied to a particular
> > version of process scheduling, platform, workload...
> >
> > One month later, a small change in process scheduler,
> > and very different results.
>
> Maybe, but that particular function's logic has not changed for a
> long time, 10 years at least. The only recent change is that Peter
> added idle-core searching.
>
> > This is why I believe this new feature must be controllable, via a new
> > tunable (like RPS/RFS are controllable per rx queue)

Agreed. For RFS we can have different heuristics, but they
should be configurable.

Please also make clear in your patch that this changes RFS, not
RPS.

For RPS, selection should not silently change to select a CPU
outside the configured rps_cpus set. I don't think that that should
ever be relaxed, even with a new knob, as it makes reasoning
about RPS configuration that much harder.

RFS already ignores rps_cpus, so using a different heuristic there
is easier.
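
For context, the RPS/RFS tunables referenced here are per-rx-queue sysfs
files plus one global sysctl ("eth0" and the values below are placeholders):

```shell
# RPS: restrict backlog processing for rx-0 to the CPUs in this mask
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus

# RFS: global flow-table size, plus the per-queue flow count
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
```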

I have thought about experimenting with biasing towards a core affine
with the numa node of the rx softirq cpu. In other words, ignoring
RFS if the request is remote. With the assumption (correct or not)
that wake affinity would pull the thread to the same node, load
permitting. But I have zero data for that.


Thread overview: 6+ messages
2018-09-19 12:28 [RFC] net;sched: Try to find idle cpu for RPS to handle packets Kirill Tkhai
2018-09-19 14:55 ` Eric Dumazet
2018-09-19 15:41   ` Kirill Tkhai
2018-09-19 15:49     ` Eric Dumazet
2018-09-19 15:58       ` Kirill Tkhai
2018-09-27 16:17         ` Willem de Bruijn
