linux-kernel.vger.kernel.org archive mirror
* [RFC 0/2] Optimize the idle CPU search
@ 2019-07-08  4:54 Parth Shah
  2019-07-08  4:54 ` [RFC 1/2] sched/fair: Rename select_idle_mask to iterator_mask Parth Shah
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Parth Shah @ 2019-07-08  4:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, vincent.guittot, subhra.mazumdar

When searching for an idle sibling, the scheduler first iterates over the
cores looking for an idle core and then iterates again looking for an idle
CPU. By maintaining an idle CPU mask during the idle core search, we can
mark the non-idle CPUs that the subsequent idle CPU search no longer needs
to visit. This is especially beneficial on a moderately loaded system.

Optimize the idle CPU search by marking the non-idle CPUs already found
during the idle core search. This reduces the number of CPUs the idle CPU
search has to iterate over, lowering the overall search time.
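
A simplified sketch of the idea (the actual change is in patch 2/2; the
surrounding code is abbreviated here):

	/* select_idle_sibling(): seed the shared mask with allowed CPUs */
	cpumask_and(idle_cpus, sched_domain_span(sd), &p->cpus_allowed);

	/* select_idle_core(): drop CPUs found busy while scanning cores */
	for_each_cpu(cpu, cpu_smt_mask(core)) {
		if (!available_idle_cpu(cpu))
			__cpumask_clear_cpu(cpu, idle_cpus);
	}

	/* select_idle_cpu(): only CPUs still marked need to be scanned */
	for_each_cpu_wrap(cpu, idle_cpus, target) {
		if (available_idle_cpu(cpu))
			break;
	}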

The results show that the time spent in `select_idle_cpu` decreases, with
no regression in the search time of `select_idle_core` and almost no
regression on schbench. With proper tuning, schbench shows a benefit as
well when the idle core search fails most of the time.

To do this, the locally used cpumask 'select_idle_mask' is first renamed,
so that the existing name can be reused for the new shared idle CPU mask.

Patch set based on tip/core/core

Results
===========
IBM POWER9 system: 2-socket, 44 cores, 176 CPUs

Function latency (in timebase ticks):
(lower is better)
+--------------+----------+--------+-------------+--------+
| select_idle_ | Baseline | stddev |    Patch    | stddev |
+--------------+----------+--------+-------------+--------+
| core         |     2080 |   1307 | 1975(+5.3%) |   1286 |
| cpu          |      834 |    393 |    91(+89%) |     64 |
| sibling      |     0.96 |  0.003 |   0.89(+7%) |   0.02 |
+--------------+----------+--------+-------------+--------+

Schbench:
- schbench -m 44 -t 1
(lower is better)
+------+----------+--------+-------------+--------+
| %ile | Baseline | stddev |    Patch    | stddev |
+------+----------+--------+-------------+--------+
|   50 |      9.9 |      2 | 10(-1.0%)   |    1.4 |
|   95 |      465 |    3.9 | 465(0%)     |      2 |
|   99 |      561 |     24 | 483(+13.9%) |     14 |
| 99.5 |      631 |     29 | 635(-0.6%)  |     32 |
| 99.9 |      801 |     41 | 763(+4.7%)  |    125 |
+------+----------+--------+-------------+--------+

- 44 threads spread across cores to make select_idle_core return -1 most
  of the time
- schbench -m 44 -t 1
+-------+----------+--------+-----------+--------+
| %ile  | Baseline | stddev |   patch   | stddev |
+-------+----------+--------+-----------+--------+
|    50 |       10 |      9 | 12(-20%)  |      1 |
|    95 |      468 |      3 | 31(+93%)  |      1 |
|    99 |      577 |     16 | 477(+17%) |     38 |
| 99.95 |      647 |     26 | 482(+25%) |      2 |
| 99.99 |      835 |     61 | 492(+41%) |      2 |
+-------+----------+--------+-----------+--------+


Hackbench:
- 44 threads spread across cores to make select_idle_core return -1 most
  of the time
- perf bench sched messaging -g 1 -l 100000
(lower is better)
+----------+--------+--------------+--------+
| Baseline | stddev |    patch     | stddev |
+----------+--------+--------------+--------+
|   16.107 |   0.62 | 16.02(+0.5%) |   0.32 |
+----------+--------+--------------+--------+


Series:
- Patch 01: Rename select_idle_mask so the name can be reused in the next patch
- Patch 02: Optimize the wakeup fast path


Parth Shah (2):
  sched/fair: Rename select_idle_mask to iterator_mask
  sched/fair: Optimize idle CPU search

 kernel/sched/core.c |  3 +++
 kernel/sched/fair.c | 15 ++++++++++-----
 2 files changed, 13 insertions(+), 5 deletions(-)

-- 
2.17.1



* [RFC 1/2] sched/fair: Rename select_idle_mask to iterator_mask
  2019-07-08  4:54 [RFC 0/2] Optimize the idle CPU search Parth Shah
@ 2019-07-08  4:54 ` Parth Shah
  2019-07-08  4:54 ` [RFC 2/2] sched/fair: Optimize the idle CPU search Parth Shah
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Parth Shah @ 2019-07-08  4:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, vincent.guittot, subhra.mazumdar

The per-CPU variable 'select_idle_mask' serves only as an iterator inside
the select_idle_core() method. There is also an opportunity to optimize
the idle CPU search, for which this mask's name is wanted in the
subsequent patch. Hence, rename this per-CPU variable to 'iterator_mask',
which can be used locally for CPU iteration.

The subsequent patch uses 'select_idle_mask' to keep track of idle CPUs,
so that it can be shared across function calls.


Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/core.c | 4 ++--
 kernel/sched/fair.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4778c48a7fda..d5a6bdc956c8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5950,7 +5950,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
 #endif
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
-DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
+DECLARE_PER_CPU(cpumask_var_t, iterator_mask);
 
 void __init sched_init(void)
 {
@@ -5989,7 +5989,7 @@ void __init sched_init(void)
 	for_each_possible_cpu(i) {
 		per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
-		per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
+		per_cpu(iterator_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 	}
 #endif /* CONFIG_CPUMASK_OFFSTACK */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdab7eb6f351..20affe03379d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5294,7 +5294,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 /* Working cpumask for: load_balance, load_balance_newidle. */
 DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
-DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
+DEFINE_PER_CPU(cpumask_var_t, iterator_mask);
 
 #ifdef CONFIG_NO_HZ_COMMON
 /*
@@ -6083,7 +6083,7 @@ void __update_idle_core(struct rq *rq)
  */
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
-	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(iterator_mask);
 	int core, cpu;
 
 	if (!static_branch_likely(&sched_smt_present))
-- 
2.17.1



* [RFC 2/2] sched/fair: Optimize the idle CPU search
  2019-07-08  4:54 [RFC 0/2] Optimize the idle CPU search Parth Shah
  2019-07-08  4:54 ` [RFC 1/2] sched/fair: Rename select_idle_mask to iterator_mask Parth Shah
@ 2019-07-08  4:54 ` Parth Shah
  2019-07-08  8:08 ` [RFC 0/2] " Peter Zijlstra
  2019-07-09  0:08 ` Subhra Mazumdar
  3 siblings, 0 replies; 9+ messages in thread
From: Parth Shah @ 2019-07-08  4:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, vincent.guittot, subhra.mazumdar

Optimize the idle CPU search by marking the non-idle CPUs already found
during the idle core search. This reduces the number of CPUs the idle CPU
search has to iterate over, lowering the overall search time.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/core.c |  3 +++
 kernel/sched/fair.c | 13 +++++++++----
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d5a6bdc956c8..196e4eaca66e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5951,6 +5951,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
 DECLARE_PER_CPU(cpumask_var_t, iterator_mask);
+DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
 
 void __init sched_init(void)
 {
@@ -5991,6 +5992,8 @@ void __init sched_init(void)
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 		per_cpu(iterator_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
+		per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
+			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 	}
 #endif /* CONFIG_CPUMASK_OFFSTACK */
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 20affe03379d..2b70b94b3e66 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5295,6 +5295,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Working cpumask for: load_balance, load_balance_newidle. */
 DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
 DEFINE_PER_CPU(cpumask_var_t, iterator_mask);
+DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
 
 #ifdef CONFIG_NO_HZ_COMMON
 /*
@@ -6084,6 +6085,7 @@ void __update_idle_core(struct rq *rq)
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(iterator_mask);
+	struct cpumask *idle_cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int core, cpu;
 
 	if (!static_branch_likely(&sched_smt_present))
@@ -6099,8 +6101,10 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 
 		for_each_cpu(cpu, cpu_smt_mask(core)) {
 			__cpumask_clear_cpu(cpu, cpus);
-			if (!available_idle_cpu(cpu))
+			if (!available_idle_cpu(cpu)) {
 				idle = false;
+				__cpumask_clear_cpu(cpu, idle_cpus);
+			}
 		}
 
 		if (idle)
@@ -6161,6 +6165,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 time, cost;
 	s64 delta;
 	int cpu, nr = INT_MAX;
+	struct cpumask *idle_cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6186,11 +6191,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	time = local_clock();
 
-	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+	for_each_cpu_wrap(cpu, idle_cpus, target) {
 		if (!--nr)
 			return -1;
-		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-			continue;
 		if (available_idle_cpu(cpu))
 			break;
 	}
@@ -6210,6 +6213,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 {
 	struct sched_domain *sd;
 	int i, recent_used_cpu;
+	struct cpumask *idle_cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 
 	if (available_idle_cpu(target))
 		return target;
@@ -6239,6 +6243,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (!sd)
 		return target;
 
+	cpumask_and(idle_cpus, sched_domain_span(sd), &p->cpus_allowed);
 	i = select_idle_core(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
-- 
2.17.1



* Re: [RFC 0/2] Optimize the idle CPU search
  2019-07-08  4:54 [RFC 0/2] Optimize the idle CPU search Parth Shah
  2019-07-08  4:54 ` [RFC 1/2] sched/fair: Rename select_idle_mask to iterator_mask Parth Shah
  2019-07-08  4:54 ` [RFC 2/2] sched/fair: Optimize the idle CPU search Parth Shah
@ 2019-07-08  8:08 ` Peter Zijlstra
  2019-07-08  9:09   ` Parth Shah
  2019-07-09  6:45   ` Subhra Mazumdar
  2019-07-09  0:08 ` Subhra Mazumdar
  3 siblings, 2 replies; 9+ messages in thread
From: Peter Zijlstra @ 2019-07-08  8:08 UTC (permalink / raw)
  To: Parth Shah; +Cc: linux-kernel, mingo, vincent.guittot, subhra.mazumdar

On Mon, Jul 08, 2019 at 10:24:30AM +0530, Parth Shah wrote:
> When searching for an idle sibling, the scheduler first iterates over the
> cores looking for an idle core and then iterates again looking for an idle
> CPU. By maintaining an idle CPU mask during the idle core search, we can
> mark the non-idle CPUs that the subsequent idle CPU search no longer needs
> to visit. This is especially beneficial on a moderately loaded system.
> 
> Optimize the idle CPU search by marking the non-idle CPUs already found
> during the idle core search. This reduces the number of CPUs the idle CPU
> search has to iterate over, lowering the overall search time.

Have you seen these patches:

  https://lkml.kernel.org/r/20180530142236.667774973@infradead.org

I've meant to get back to that, but never quite had the time :/


* Re: [RFC 0/2] Optimize the idle CPU search
  2019-07-08  8:08 ` [RFC 0/2] " Peter Zijlstra
@ 2019-07-08  9:09   ` Parth Shah
  2019-07-09  6:45   ` Subhra Mazumdar
  1 sibling, 0 replies; 9+ messages in thread
From: Parth Shah @ 2019-07-08  9:09 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, vincent.guittot, subhra.mazumdar



On 7/8/19 1:38 PM, Peter Zijlstra wrote:
> On Mon, Jul 08, 2019 at 10:24:30AM +0530, Parth Shah wrote:
>> When searching for an idle sibling, the scheduler first iterates over the
>> cores looking for an idle core and then iterates again looking for an idle
>> CPU. By maintaining an idle CPU mask during the idle core search, we can
>> mark the non-idle CPUs that the subsequent idle CPU search no longer needs
>> to visit. This is especially beneficial on a moderately loaded system.
>>
>> Optimize the idle CPU search by marking the non-idle CPUs already found
>> during the idle core search. This reduces the number of CPUs the idle CPU
>> search has to iterate over, lowering the overall search time.
> 
> Have you seen these patches:
> 
>   https://lkml.kernel.org/r/20180530142236.667774973@infradead.org
> 
> I've meant to get back to that, but never quite had the time :/
> 

Ok, I was not aware of this patch-set.
It seems interesting and I will evaluate it.


Thanks



* Re: [RFC 0/2] Optimize the idle CPU search
  2019-07-08  4:54 [RFC 0/2] Optimize the idle CPU search Parth Shah
                   ` (2 preceding siblings ...)
  2019-07-08  8:08 ` [RFC 0/2] " Peter Zijlstra
@ 2019-07-09  0:08 ` Subhra Mazumdar
  2019-07-09  5:38   ` Parth Shah
  3 siblings, 1 reply; 9+ messages in thread
From: Subhra Mazumdar @ 2019-07-09  0:08 UTC (permalink / raw)
  To: Parth Shah, linux-kernel; +Cc: peterz, mingo, vincent.guittot


On 7/8/19 10:24 AM, Parth Shah wrote:
> When searching for an idle sibling, the scheduler first iterates over the
> cores looking for an idle core and then iterates again looking for an idle
> CPU. By maintaining an idle CPU mask during the idle core search, we can
> mark the non-idle CPUs that the subsequent idle CPU search no longer needs
> to visit. This is especially beneficial on a moderately loaded system.
>
> Optimize the idle CPU search by marking the non-idle CPUs already found
> during the idle core search. This reduces the number of CPUs the idle CPU
> search has to iterate over, lowering the overall search time.
>
I believe this can co-exist with latency-nice? We can derive the 'nr' in
select_idle_cpu from latency-nice and use the new mask to iterate.


* Re: [RFC 0/2] Optimize the idle CPU search
  2019-07-09  0:08 ` Subhra Mazumdar
@ 2019-07-09  5:38   ` Parth Shah
  2019-07-09  6:48     ` Subhra Mazumdar
  0 siblings, 1 reply; 9+ messages in thread
From: Parth Shah @ 2019-07-09  5:38 UTC (permalink / raw)
  To: Subhra Mazumdar, linux-kernel; +Cc: peterz, mingo, vincent.guittot



On 7/9/19 5:38 AM, Subhra Mazumdar wrote:
> 
> On 7/8/19 10:24 AM, Parth Shah wrote:
>> When searching for an idle sibling, the scheduler first iterates over the
>> cores looking for an idle core and then iterates again looking for an idle
>> CPU. By maintaining an idle CPU mask during the idle core search, we can
>> mark the non-idle CPUs that the subsequent idle CPU search no longer needs
>> to visit. This is especially beneficial on a moderately loaded system.
>>
>> Optimize the idle CPU search by marking the non-idle CPUs already found
>> during the idle core search. This reduces the number of CPUs the idle CPU
>> search has to iterate over, lowering the overall search time.
>>
> I believe this can co-exist with latency-nice? We can derive the 'nr' in
> select_idle_cpu from latency-nice and use the new mask to iterate.
> 

I agree, this can be done with latency-nice.

Maybe something like below?
smt = nr_cpus / nr_cores
nr = smt + (p->latency_nice * (total_cpus-smt) / max_latency_nice)

This limits the lower bound to one core and goes through all the CPUs if
latency_nice is maximum for a task.
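
As a rough worked example (assuming latency_nice runs from 0 to
max_latency_nice and total_cpus is the 176 CPUs of the box above):
smt = 176 / 44 = 4, so latency_nice = 0 gives nr = 4 (one core), while
latency_nice = max_latency_nice gives nr = 4 + (176 - 4) = 176 (every CPU).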


Thanks
Parth



* Re: [RFC 0/2] Optimize the idle CPU search
  2019-07-08  8:08 ` [RFC 0/2] " Peter Zijlstra
  2019-07-08  9:09   ` Parth Shah
@ 2019-07-09  6:45   ` Subhra Mazumdar
  1 sibling, 0 replies; 9+ messages in thread
From: Subhra Mazumdar @ 2019-07-09  6:45 UTC (permalink / raw)
  To: Peter Zijlstra, Parth Shah; +Cc: linux-kernel, mingo, vincent.guittot


On 7/8/19 1:38 PM, Peter Zijlstra wrote:
> On Mon, Jul 08, 2019 at 10:24:30AM +0530, Parth Shah wrote:
>> When searching for an idle sibling, the scheduler first iterates over the
>> cores looking for an idle core and then iterates again looking for an idle
>> CPU. By maintaining an idle CPU mask during the idle core search, we can
>> mark the non-idle CPUs that the subsequent idle CPU search no longer needs
>> to visit. This is especially beneficial on a moderately loaded system.
>>
>> Optimize the idle CPU search by marking the non-idle CPUs already found
>> during the idle core search. This reduces the number of CPUs the idle CPU
>> search has to iterate over, lowering the overall search time.
> Have you seen these patches:
>
>    https://lkml.kernel.org/r/20180530142236.667774973@infradead.org
>
> I've meant to get back to that, but never quite had the time :/
The most relevant bit of that series was folding select_idle_core() and
select_idle_cpu() together. But it may be good to keep them separate, as
workloads which just want any idle CPU can find one and break early by
disabling the idle core search. And this can still work with latency-nice,
which can moderate both the idle core and the idle CPU search.
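
A minimal sketch of how that moderation could sit in select_idle_sibling()
(purely illustrative; the gate helper below is hypothetical and not part of
this series or of the latency-nice proposal):

	/* wants_idle_core() is an assumed latency-nice derived gate */
	if (wants_idle_core(p)) {
		i = select_idle_core(p, sd, target);
		if ((unsigned)i < nr_cpumask_bits)
			return i;
	}

	/* select_idle_cpu() could likewise derive its 'nr' from latency_nice */
	i = select_idle_cpu(p, sd, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;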



* Re: [RFC 0/2] Optimize the idle CPU search
  2019-07-09  5:38   ` Parth Shah
@ 2019-07-09  6:48     ` Subhra Mazumdar
  0 siblings, 0 replies; 9+ messages in thread
From: Subhra Mazumdar @ 2019-07-09  6:48 UTC (permalink / raw)
  To: Parth Shah, linux-kernel; +Cc: peterz, mingo, vincent.guittot


On 7/9/19 11:08 AM, Parth Shah wrote:
>
> On 7/9/19 5:38 AM, Subhra Mazumdar wrote:
>> On 7/8/19 10:24 AM, Parth Shah wrote:
>>> When searching for an idle sibling, the scheduler first iterates over the
>>> cores looking for an idle core and then iterates again looking for an idle
>>> CPU. By maintaining an idle CPU mask during the idle core search, we can
>>> mark the non-idle CPUs that the subsequent idle CPU search no longer needs
>>> to visit. This is especially beneficial on a moderately loaded system.
>>>
>>> Optimize the idle CPU search by marking the non-idle CPUs already found
>>> during the idle core search. This reduces the number of CPUs the idle CPU
>>> search has to iterate over, lowering the overall search time.
>>>
>> I believe this can co-exist with latency-nice? We can derive the 'nr' in
>> select_idle_cpu from latency-nice and use the new mask to iterate.
>>
> I agree, this can be done with latency-nice.
>
> Maybe something like below?
> smt = nr_cpus / nr_cores
> nr = smt + (p->latency_nice * (total_cpus-smt) / max_latency_nice)
>
> This limits the lower bound to one core and goes through all the CPUs if
> latency_nice is maximum for a task.
Yes, I had something similar in mind.

