* [RFC] sched/core: Don't schedule threads on pre-empted vcpus
@ 2018-05-02 20:52 Rohit Jain
  2018-05-04  9:37 ` [tip:sched/core] sched/core: Don't schedule threads on pre-empted vCPUs tip-bot for Rohit Jain
  2018-05-04  9:47 ` [RFC] sched/core: Don't schedule threads on pre-empted vcpus Peter Zijlstra
  0 siblings, 2 replies; 6+ messages in thread
From: Rohit Jain @ 2018-05-02 20:52 UTC (permalink / raw)
  To: matt, peterz
  Cc: mingo, dhaval.giani, subhra.mazumdar, steven.sistare, linux-kernel

In paravirt configurations today, spinlocks check whether a vcpu is
running to decide whether it is worth spinning. We can use the same
logic to prioritize CPUs when scheduling threads. A thread placed on a
pre-empted vcpu incurs the extra cost of a VMENTER plus the time the
vcpu waits before it actually runs on a host CPU again. If other vcpus
are idle and actually running on a host CPU, we should schedule threads
there instead.
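
As a rough illustration of the idea described above (not the kernel
implementation), the following user-space sketch picks a target among
candidate vcpus, preferring one that is idle and not pre-empted. The
struct vcpu_state type and pick_target_cpu() are hypothetical names
introduced for this example only. The fallback to an idle but
pre-empted vcpu is also an assumption of the sketch; the patch itself
simply stops reporting such a vcpu as idle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-vcpu snapshot; not a kernel structure. */
struct vcpu_state {
	bool idle;       /* no runnable task on this vcpu */
	bool preempted;  /* the host has descheduled this vcpu */
};

/*
 * Return the index of the first vcpu that is idle and currently running
 * on a host CPU. Fall back to the first idle but pre-empted vcpu, and
 * return -1 if no vcpu is idle at all.
 */
int pick_target_cpu(const struct vcpu_state *cpus, size_t n)
{
	int fallback = -1;

	assert(cpus != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (!cpus[i].idle)
			continue;
		if (!cpus[i].preempted)
			return (int)i;
		if (fallback < 0)
			fallback = (int)i;
	}
	return fallback;
}
```

With three vcpus where vcpu 0 is busy, vcpu 1 is idle but pre-empted,
and vcpu 2 is idle and running, the sketch selects vcpu 2.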

Performance numbers:

Note: In the following, results with the patch are labeled Paravirt and
results without the patch are labeled Base.

1) When only 1 VM is running:

    a) Hackbench test on KVM 8 vcpus, 10,000 loops (lower is better):

+-------+-----------------+----------------+
|Number |Paravirt         |Base            |
|of     +---------+-------+-------+--------+
|Threads|Average  |Std Dev|Average| Std Dev|
+-------+---------+-------+-------+--------+
|1      |1.817    |0.076  |1.721  | 0.067  |
|2      |3.467    |0.120  |3.468  | 0.074  |
|4      |6.266    |0.035  |6.314  | 0.068  |
|8      |11.437   |0.105  |11.418 | 0.132  |
|16     |21.862   |0.167  |22.161 | 0.129  |
|25     |33.341   |0.326  |33.692 | 0.147  |
+-------+---------+-------+-------+--------+ 

2) When two VMs are running with same CPU affinities:

    a) tbench test on 8-vcpu VMs

Base:

VM1:

Throughput 220.59 MB/sec   1 clients  1 procs  max_latency=12.872 ms
Throughput 448.716 MB/sec  2 clients  2 procs  max_latency=7.555 ms
Throughput 861.009 MB/sec  4 clients  4 procs  max_latency=49.501 ms
Throughput 1261.81 MB/sec  7 clients  7 procs  max_latency=76.990 ms

VM2:

Throughput 219.937 MB/sec  1 clients  1 procs  max_latency=12.517 ms
Throughput 470.99 MB/sec   2 clients  2 procs  max_latency=12.419 ms
Throughput 841.299 MB/sec  4 clients  4 procs  max_latency=37.043 ms
Throughput 1240.78 MB/sec  7 clients  7 procs  max_latency=77.489 ms

Paravirt:

VM1:

Throughput 222.572 MB/sec  1 clients  1 procs  max_latency=7.057 ms
Throughput 485.993 MB/sec  2 clients  2 procs  max_latency=26.049 ms
Throughput 947.095 MB/sec  4 clients  4 procs  max_latency=45.338 ms
Throughput 1364.26 MB/sec  7 clients  7 procs  max_latency=145.124 ms

VM2:

Throughput 224.128 MB/sec  1 clients  1 procs  max_latency=4.564 ms
Throughput 501.878 MB/sec  2 clients  2 procs  max_latency=11.061 ms
Throughput 965.455 MB/sec  4 clients  4 procs  max_latency=45.370 ms
Throughput 1359.08 MB/sec  7 clients  7 procs  max_latency=168.053 ms

    b) Hackbench with 4 fd 1,000,000 loops

+-------+--------------------------------------+----------------------------------------+
|Number |Paravirt                              |Base                                    |
|of     +----------+--------+---------+--------+----------+--------+---------+----------+
|Threads|Average1  |Std Dev1|Average2 |Std Dev2|Average1  |Std Dev1|Average2 |Std Dev2  |
+-------+----------+--------+---------+--------+----------+--------+---------+----------+
|  1    | 3.748    | 0.620  | 3.576   | 0.432  | 4.006    | 0.395  | 3.446   | 0.787    |
+-------+----------+--------+---------+--------+----------+--------+---------+----------+

Note that this test was run only to show the interference effect that
over-subscription can have on the baseline.

    c) schbench results with 2 message groups on 8 vcpu VMs

+-----------+-------+---------------+--------------+------------+
|           |       | Paravirt      | Base         |            |
+-----------+-------+-------+-------+-------+------+------------+
|           |Threads| VM1   | VM2   |  VM1  | VM2  |%Improvement|
+-----------+-------+-------+-------+-------+------+------------+
|50.0000th  |    1  | 52    | 53    |  58   | 54   |  +6.25%    |
|75.0000th  |    1  | 69    | 61    |  83   | 59   |  +8.45%    |
|90.0000th  |    1  | 80    | 80    |  89   | 83   |  +6.98%    |
|95.0000th  |    1  | 83    | 83    |  93   | 87   |  +7.78%    |
|*99.0000th |    1  | 92    | 94    |  99   | 97   |  +5.10%    |
|99.5000th  |    1  | 95    | 100   |  102  | 103  |  +4.88%    |
|99.9000th  |    1  | 107   | 123   |  105  | 203  |  +25.32%   |
+-----------+-------+-------+-------+-------+------+------------+
|50.0000th  |    2  | 56    | 62    |  67   | 59   |  +6.35%    |
|75.0000th  |    2  | 69    | 75    |  80   | 71   |  +4.64%    |
|90.0000th  |    2  | 80    | 82    |  90   | 81   |  +5.26%    |
|95.0000th  |    2  | 85    | 87    |  97   | 91   |  +8.51%    |
|*99.0000th |    2  | 98    | 99    |  107  | 109  |  +8.79%    |
|99.5000th  |    2  | 107   | 105   |  109  | 116  |  +5.78%    |
|99.9000th  |    2  | 9968  | 609   |  875  | 3116 | -165.02%   |
+-----------+-------+-------+-------+-------+------+------------+
|50.0000th  |    4  | 78    | 77    |  78   | 79   |  +1.27%    |
|75.0000th  |    4  | 98    | 106   |  100  | 104  |   0.00%    |
|90.0000th  |    4  | 987   | 1001  |  995  | 1015 |  +1.09%    |
|95.0000th  |    4  | 4136  | 5368  |  5752 | 5192 |  +13.16%   |
|*99.0000th |    4  | 11632 | 11344 |  11024| 10736|  -5.59%    |
|99.5000th  |    4  | 12624 | 13040 |  12720| 12144|  -3.22%    |
|99.9000th  |    4  | 13168 | 18912 |  14992| 17824|  +2.24%    |
+-----------+-------+-------+-------+-------+------+------------+

Note: Improvement is measured for (VM1+VM2)
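
The %Improvement column appears to be computed from the combined
VM1+VM2 values as (Base - Paravirt) / Base. This reading is inferred
from the table values rather than stated in the post, and the function
name below is hypothetical. A minimal sketch of that arithmetic:

```c
#include <assert.h>

/*
 * Percentage improvement of the patched (Paravirt) run over Base,
 * using the sum of VM1 and VM2 for each side. Lower values are better
 * in the table, so a positive result means Paravirt did better.
 */
double pct_improvement(double pv_vm1, double pv_vm2,
		       double base_vm1, double base_vm2)
{
	double pv = pv_vm1 + pv_vm2;
	double base = base_vm1 + base_vm2;

	assert(base > 0.0);
	return (base - pv) / base * 100.0;
}
```

For the 50th percentile with 1 thread this gives (112 - 105) / 112,
or +6.25%, and for the 99.9th percentile with 2 threads it gives
(3991 - 10577) / 3991, or about -165.02%, matching the table.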

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
---
 kernel/sched/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e10aae..75d1ecf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4033,6 +4033,9 @@ int idle_cpu(int cpu)
 		return 0;
 #endif
 
+	if (vcpu_is_preempted(cpu))
+		return 0;
+
 	return 1;
 }
 
-- 
2.7.4


* [tip:sched/core] sched/core: Don't schedule threads on pre-empted vCPUs
  2018-05-02 20:52 [RFC] sched/core: Don't schedule threads on pre-empted vcpus Rohit Jain
@ 2018-05-04  9:37 ` tip-bot for Rohit Jain
  2018-05-04  9:47 ` [RFC] sched/core: Don't schedule threads on pre-empted vcpus Peter Zijlstra
  1 sibling, 0 replies; 6+ messages in thread
From: tip-bot for Rohit Jain @ 2018-05-04  9:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, rohit.k.jain, hpa, peterz, tglx, mingo, torvalds

Commit-ID:  247f2f6f3c706b40b5f3886646f3eb53671258bf
Gitweb:     https://git.kernel.org/tip/247f2f6f3c706b40b5f3886646f3eb53671258bf
Author:     Rohit Jain <rohit.k.jain@oracle.com>
AuthorDate: Wed, 2 May 2018 13:52:10 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 4 May 2018 10:00:09 +0200

sched/core: Don't schedule threads on pre-empted vCPUs

In paravirt configurations today, spinlocks check whether a vCPU is
running to decide whether it is worth spinning. We can use the same
logic to prioritize CPUs when scheduling threads. A thread placed on a
pre-empted vCPU incurs the extra cost of a VMENTER plus the time the
vCPU waits before it actually runs on a host CPU again. If other vCPUs
are idle and actually running on a host CPU, we should schedule threads
there instead.

Performance numbers:

Note: In the following, results with the patch are labeled Paravirt and
results without the patch are labeled Base.

1) When only 1 VM is running:

    a) Hackbench test on KVM 8 vCPUs, 10,000 loops (lower is better):

	+-------+-----------------+----------------+
	|Number |Paravirt         |Base            |
	|of     +---------+-------+-------+--------+
	|Threads|Average  |Std Dev|Average| Std Dev|
	+-------+---------+-------+-------+--------+
	|1      |1.817    |0.076  |1.721  | 0.067  |
	|2      |3.467    |0.120  |3.468  | 0.074  |
	|4      |6.266    |0.035  |6.314  | 0.068  |
	|8      |11.437   |0.105  |11.418 | 0.132  |
	|16     |21.862   |0.167  |22.161 | 0.129  |
	|25     |33.341   |0.326  |33.692 | 0.147  |
	+-------+---------+-------+-------+--------+

2) When two VMs are running with same CPU affinities:

    a) tbench test on 8-vCPU VMs

    Base:

	VM1:

	Throughput 220.59 MB/sec   1 clients  1 procs  max_latency=12.872 ms
	Throughput 448.716 MB/sec  2 clients  2 procs  max_latency=7.555 ms
	Throughput 861.009 MB/sec  4 clients  4 procs  max_latency=49.501 ms
	Throughput 1261.81 MB/sec  7 clients  7 procs  max_latency=76.990 ms

	VM2:

	Throughput 219.937 MB/sec  1 clients  1 procs  max_latency=12.517 ms
	Throughput 470.99 MB/sec   2 clients  2 procs  max_latency=12.419 ms
	Throughput 841.299 MB/sec  4 clients  4 procs  max_latency=37.043 ms
	Throughput 1240.78 MB/sec  7 clients  7 procs  max_latency=77.489 ms

    Paravirt:

	VM1:

	Throughput 222.572 MB/sec  1 clients  1 procs  max_latency=7.057 ms
	Throughput 485.993 MB/sec  2 clients  2 procs  max_latency=26.049 ms
	Throughput 947.095 MB/sec  4 clients  4 procs  max_latency=45.338 ms
	Throughput 1364.26 MB/sec  7 clients  7 procs  max_latency=145.124 ms

	VM2:

	Throughput 224.128 MB/sec  1 clients  1 procs  max_latency=4.564 ms
	Throughput 501.878 MB/sec  2 clients  2 procs  max_latency=11.061 ms
	Throughput 965.455 MB/sec  4 clients  4 procs  max_latency=45.370 ms
	Throughput 1359.08 MB/sec  7 clients  7 procs  max_latency=168.053 ms

    b) Hackbench with 4 fd 1,000,000 loops

	+-------+--------------------------------------+----------------------------------------+
	|Number |Paravirt                              |Base                                    |
	|of     +----------+--------+---------+--------+----------+--------+---------+----------+
	|Threads|Average1  |Std Dev1|Average2 |Std Dev2|Average1  |Std Dev1|Average2 |Std Dev2  |
	+-------+----------+--------+---------+--------+----------+--------+---------+----------+
	|  1    | 3.748    | 0.620  | 3.576   | 0.432  | 4.006    | 0.395  | 3.446   | 0.787    |
	+-------+----------+--------+---------+--------+----------+--------+---------+----------+

    Note that this test was run only to show the interference effect that
    over-subscription can have on the baseline.

    c) schbench results with 2 message groups on 8 vCPU VMs

	+-----------+-------+---------------+--------------+------------+
	|           |       | Paravirt      | Base         |            |
	+-----------+-------+-------+-------+-------+------+------------+
	|           |Threads| VM1   | VM2   |  VM1  | VM2  |%Improvement|
	+-----------+-------+-------+-------+-------+------+------------+
	|50.0000th  |    1  | 52    | 53    |  58   | 54   |  +6.25%    |
	|75.0000th  |    1  | 69    | 61    |  83   | 59   |  +8.45%    |
	|90.0000th  |    1  | 80    | 80    |  89   | 83   |  +6.98%    |
	|95.0000th  |    1  | 83    | 83    |  93   | 87   |  +7.78%    |
	|*99.0000th |    1  | 92    | 94    |  99   | 97   |  +5.10%    |
	|99.5000th  |    1  | 95    | 100   |  102  | 103  |  +4.88%    |
	|99.9000th  |    1  | 107   | 123   |  105  | 203  |  +25.32%   |
	+-----------+-------+-------+-------+-------+------+------------+
	|50.0000th  |    2  | 56    | 62    |  67   | 59   |  +6.35%    |
	|75.0000th  |    2  | 69    | 75    |  80   | 71   |  +4.64%    |
	|90.0000th  |    2  | 80    | 82    |  90   | 81   |  +5.26%    |
	|95.0000th  |    2  | 85    | 87    |  97   | 91   |  +8.51%    |
	|*99.0000th |    2  | 98    | 99    |  107  | 109  |  +8.79%    |
	|99.5000th  |    2  | 107   | 105   |  109  | 116  |  +5.78%    |
	|99.9000th  |    2  | 9968  | 609   |  875  | 3116 | -165.02%   |
	+-----------+-------+-------+-------+-------+------+------------+
	|50.0000th  |    4  | 78    | 77    |  78   | 79   |  +1.27%    |
	|75.0000th  |    4  | 98    | 106   |  100  | 104  |   0.00%    |
	|90.0000th  |    4  | 987   | 1001  |  995  | 1015 |  +1.09%    |
	|95.0000th  |    4  | 4136  | 5368  |  5752 | 5192 |  +13.16%   |
	|*99.0000th |    4  | 11632 | 11344 |  11024| 10736|  -5.59%    |
	|99.5000th  |    4  | 12624 | 13040 |  12720| 12144|  -3.22%    |
	|99.9000th  |    4  | 13168 | 18912 |  14992| 17824|  +2.24%    |
	+-----------+-------+-------+-------+-------+------+------------+

    Note: Improvement is measured for (VM1+VM2)

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dhaval.giani@oracle.com
Cc: matt@codeblueprint.co.uk
Cc: steven.sistare@oracle.com
Cc: subhra.mazumdar@oracle.com
Link: http://lkml.kernel.org/r/1525294330-7759-1-git-send-email-rohit.k.jain@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ffde9eebc846..71bdb86e07f9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4028,6 +4028,9 @@ int idle_cpu(int cpu)
 		return 0;
 #endif
 
+	if (vcpu_is_preempted(cpu))
+		return 0;
+
 	return 1;
 }
 


* Re: [RFC] sched/core: Don't schedule threads on pre-empted vcpus
  2018-05-02 20:52 [RFC] sched/core: Don't schedule threads on pre-empted vcpus Rohit Jain
  2018-05-04  9:37 ` [tip:sched/core] sched/core: Don't schedule threads on pre-empted vCPUs tip-bot for Rohit Jain
@ 2018-05-04  9:47 ` Peter Zijlstra
  2018-05-04 17:22   ` Rohit Jain
  1 sibling, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2018-05-04  9:47 UTC (permalink / raw)
  To: Rohit Jain
  Cc: matt, mingo, dhaval.giani, subhra.mazumdar, steven.sistare, linux-kernel

On Wed, May 02, 2018 at 01:52:10PM -0700, Rohit Jain wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5e10aae..75d1ecf 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4033,6 +4033,9 @@ int idle_cpu(int cpu)
>  		return 0;
>  #endif
>  
> +	if (vcpu_is_preempted(cpu))
> +		return 0;
> +
>  	return 1;
>  }

Basically OK with this, but did you consider idle_cpu() usage outside of
select_idle_sibling()?

For instance, I think got_nohz_idle_kick() isn't quite right with this
on. Similarly for scheduler_tick(), that wants the actual idle state.


* Re: [RFC] sched/core: Don't schedule threads on pre-empted vcpus
  2018-05-04  9:47 ` [RFC] sched/core: Don't schedule threads on pre-empted vcpus Peter Zijlstra
@ 2018-05-04 17:22   ` Rohit Jain
  2018-05-04 17:32     ` Steven Sistare
  0 siblings, 1 reply; 6+ messages in thread
From: Rohit Jain @ 2018-05-04 17:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: matt, mingo, dhaval.giani, subhra.mazumdar, steven.sistare, linux-kernel

Hi Peter,


On 05/04/2018 02:47 AM, Peter Zijlstra wrote:
> On Wed, May 02, 2018 at 01:52:10PM -0700, Rohit Jain wrote:
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 5e10aae..75d1ecf 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4033,6 +4033,9 @@ int idle_cpu(int cpu)
>>   		return 0;
>>   #endif
>>   
>> +	if (vcpu_is_preempted(cpu))
>> +		return 0;
>> +
>>   	return 1;
>>   }
> Basically OK with this, but did you consider idle_cpu() usage outside of
> select_idle_sibling()?
>
> For instance, I think got_nohz_idle_kick() isn't quite right with this
> on. Similarly for scheduler_tick(), that wants the actual idle state.

As far as intent is concerned, yes I agree you might be right. I left
the VM running for a couple of days, didn't see anything weird however.

We could add a check at each of those places or something to that effect
if this is an issue. Please let me know how you want to proceed.


* Re: [RFC] sched/core: Don't schedule threads on pre-empted vcpus
  2018-05-04 17:22   ` Rohit Jain
@ 2018-05-04 17:32     ` Steven Sistare
  2018-05-04 17:37       ` Rohit Jain
  0 siblings, 1 reply; 6+ messages in thread
From: Steven Sistare @ 2018-05-04 17:32 UTC (permalink / raw)
  To: Rohit Jain
  Cc: Peter Zijlstra, matt, mingo, dhaval.giani, subhra.mazumdar, linux-kernel

On 5/4/2018 1:22 PM, Rohit Jain wrote:
> Hi Peter,
> 
> On 05/04/2018 02:47 AM, Peter Zijlstra wrote:
>> On Wed, May 02, 2018 at 01:52:10PM -0700, Rohit Jain wrote:
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 5e10aae..75d1ecf 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -4033,6 +4033,9 @@ int idle_cpu(int cpu)
>>>  		return 0;
>>>  #endif
>>>  
>>> +	if (vcpu_is_preempted(cpu))
>>> +		return 0;
>>> +
>>>  	return 1;
>>>  }
>> Basically OK with this, but did you consider idle_cpu() usage outside of
>> select_idle_sibling()?
>>
>> For instance, I think got_nohz_idle_kick() isn't quite right with this
>> on. Similarly for scheduler_tick(), that wants the actual idle state.
> 
> As far as intent is concerned, yes I agree you might be right. I left
> the VM running for a couple of days, didn't see anything weird however.
> 
> We could add a check at each of those places or something to that effect
> if this is an issue. Please let me know how you want to proceed.

The point is that some idle_cpu() call sites should consider preemption state
and some should not, and they must be considered on a case by case basis.  You 
could define a new accessor to abstract the difference, and call it from
select_idle_sibling and anywhere else it makes sense.

available_idle_cpu()
{
  return idle_cpu() && !vcpu_is_preempted()
}

- Steve
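
To make the call-site distinction above concrete, here is a compilable
user-space sketch of the suggested accessor. The arrays and the stub
bodies of idle_cpu() and vcpu_is_preempted() are hypothetical stand-ins
for kernel state, and NR_CPUS is an arbitrary small value chosen for
the example; only the shape of available_idle_cpu() follows the
pseudocode.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Stand-ins for kernel state; hypothetical, for illustration only. */
bool cpu_is_idle[NR_CPUS];
bool cpu_is_preempted[NR_CPUS];

/* Stub for idle_cpu(): reports the actual idle state only. */
int idle_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu_is_idle[cpu];
}

/* Stub for the paravirt vcpu_is_preempted() check. */
bool vcpu_is_preempted(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu_is_preempted[cpu];
}

/* For placement decisions: idle and currently running on a host CPU. */
int available_idle_cpu(int cpu)
{
	return idle_cpu(cpu) && !vcpu_is_preempted(cpu);
}
```

Call sites that need the actual idle state keep calling idle_cpu(),
while placement code such as select_idle_sibling() would call
available_idle_cpu(). For an idle but pre-empted vcpu the two answers
differ: idle_cpu() reports idle and available_idle_cpu() does not.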


* Re: [RFC] sched/core: Don't schedule threads on pre-empted vcpus
  2018-05-04 17:32     ` Steven Sistare
@ 2018-05-04 17:37       ` Rohit Jain
  0 siblings, 0 replies; 6+ messages in thread
From: Rohit Jain @ 2018-05-04 17:37 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Peter Zijlstra, matt, mingo, dhaval.giani, subhra.mazumdar, linux-kernel

Hi Steve,


On 05/04/2018 10:32 AM, Steven Sistare wrote:
> On 5/4/2018 1:22 PM, Rohit Jain wrote:
>> Hi Peter,
>>
>> On 05/04/2018 02:47 AM, Peter Zijlstra wrote:
>>> On Wed, May 02, 2018 at 01:52:10PM -0700, Rohit Jain wrote:
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 5e10aae..75d1ecf 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -4033,6 +4033,9 @@ int idle_cpu(int cpu)
>>>>  		return 0;
>>>>  #endif
>>>>  
>>>> +	if (vcpu_is_preempted(cpu))
>>>> +		return 0;
>>>> +
>>>>  	return 1;
>>>>  }
>>> Basically OK with this, but did you consider idle_cpu() usage outside of
>>> select_idle_sibling()?
>>>
>>> For instance, I think got_nohz_idle_kick() isn't quite right with this
>>> on. Similarly for scheduler_tick(), that wants the actual idle state.
>> As far as intent is concerned, yes I agree you might be right. I left
>> the VM running for a couple of days, didn't see anything weird however.
>>
>> We could add a check at each of those places or something to that effect
>> if this is an issue. Please let me know how you want to proceed.
> The point is that some idle_cpu() call sites should consider preemption state
> and some should not, and they must be considered on a case by case basis.  You
> could define a new accessor to abstract the difference, and call it from
> select_idle_sibling and anywhere else it makes sense.
>
> available_idle_cpu()
> {
>    return idle_cpu() && !vcpu_is_preempted()
> }

Great! That's what I was thinking as "something to that effect" :)

Thanks,
Rohit
