* [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
@ 2013-01-24 3:06 ` Alex Shi
2013-02-12 10:11 ` Peter Zijlstra
2013-01-24 3:06 ` [patch v4 02/18] sched: select_task_rq_fair clean up Alex Shi
` (19 subsequent siblings)
20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24 3:06 UTC (permalink / raw)
To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi
The domain flag SD_PREFER_SIBLING was set on both the MC and CPU domains
by the initial commit b5d978e0c7e79a, and was carelessly removed during
the cleanup of the obsolete power scheduler. Commit 6956dc568 then
restored the flag on the CPU domain only. That works, but it introduces
an extra domain level, since it makes the MC and CPU domains differ.
So, restore the flag on the MC domain too, to remove a domain level on
x86 platforms.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
include/linux/topology.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..386bcf4 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -132,6 +132,7 @@ int arch_update_cpu_topology(void);
| 0*SD_SHARE_CPUPOWER \
| 1*SD_SHARE_PKG_RESOURCES \
| 0*SD_SERIALIZE \
+ | 1*SD_PREFER_SIBLING \
, \
.last_balance = jiffies, \
.balance_interval = 1, \
--
1.7.12
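As an aside for readers of the diff: the `0*FLAG | 1*FLAG` initializer style keeps every flag visible in the mask while arithmetic folds the disabled ones to zero. A minimal userspace sketch of the idiom, using made-up flag values rather than the kernel's real bit assignments:

```c
#include <assert.h>

/* Illustrative flag bits; NOT the kernel's actual values. */
#define SD_SHARE_CPUPOWER	0x0001
#define SD_SHARE_PKG_RESOURCES	0x0002
#define SD_SERIALIZE		0x0004
#define SD_PREFER_SIBLING	0x0008

/* Every flag stays listed; 0*FLAG disables it, 1*FLAG enables it. */
static const unsigned int mc_domain_flags =
	  0 * SD_SHARE_CPUPOWER
	| 1 * SD_SHARE_PKG_RESOURCES
	| 0 * SD_SERIALIZE
	| 1 * SD_PREFER_SIBLING;
```

Flipping a flag means editing a single digit, and a disabled flag still documents itself in the mask, which is why the patch is a one-line `+ | 1*SD_PREFER_SIBLING`.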
^ permalink raw reply related [flat|nested] 88+ messages in thread
* Re: [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level
2013-01-24 3:06 ` [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level Alex Shi
@ 2013-02-12 10:11 ` Peter Zijlstra
2013-02-13 13:22 ` Alex Shi
2013-02-13 14:17 ` Alex Shi
0 siblings, 2 replies; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-12 10:11 UTC (permalink / raw)
To: Alex Shi
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> The domain flag SD_PREFER_SIBLING was set on both the MC and CPU domains
> by the initial commit b5d978e0c7e79a, and was carelessly removed during
> the cleanup of the obsolete power scheduler. Commit 6956dc568 then
> restored the flag on the CPU domain only. That works, but it introduces
> an extra domain level, since it makes the MC and CPU domains differ.
>
> So, restore the flag on the MC domain too, to remove a domain level on
> x86 platforms.
This fails to clearly state why it's desirable.. I'm guessing it's because
we should use sibling cache domains before sibling threads, right?
A clearly stated reason is always preferable to: it was this way, make
it so again; which leaves us wondering why.
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level
2013-02-12 10:11 ` Peter Zijlstra
@ 2013-02-13 13:22 ` Alex Shi
2013-02-15 12:38 ` Peter Zijlstra
2013-02-13 14:17 ` Alex Shi
1 sibling, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-02-13 13:22 UTC (permalink / raw)
To: Peter Zijlstra
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On 02/12/2013 06:11 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> The domain flag SD_PREFER_SIBLING was set on both the MC and CPU domains
>> by the initial commit b5d978e0c7e79a, and was carelessly removed during
>> the cleanup of the obsolete power scheduler. Commit 6956dc568 then
>> restored the flag on the CPU domain only. That works, but it introduces
>> an extra domain level, since it makes the MC and CPU domains differ.
>>
>> So, restore the flag on the MC domain too, to remove a domain level on
>> x86 platforms.
Peter, I am very very happy to see you again! :)
>
> This fails to clearly state why it's desirable.. I'm guessing it's because
> we should use sibling cache domains before sibling threads, right?
No, the flag is set on the MC/CPU domains, but it is checked in their
parents' balancing, like in the NUMA domain.
Without the flag, the NUMA domain will become imbalanced. Like on my
2-socket NHM EP: 3 of 4 tasks were assigned on socket 0 (lcpu 10, 12, 14).
In this case, update_sd_pick_busiest() needs a reduced group_capacity to
return true:
if (sgs->sum_nr_running > sgs->group_capacity)
return true;
so that NUMA domain balancing gets a chance to start.
---------
05:00:28 AM CPU %usr %nice %idle
05:00:29 AM all 25.00 0.00 74.94
05:00:29 AM 0 0.00 0.00 99.00
05:00:29 AM 1 0.00 0.00 100.00
05:00:29 AM 2 0.00 0.00 100.00
05:00:29 AM 3 0.00 0.00 100.00
05:00:29 AM 4 0.00 0.00 100.00
05:00:29 AM 5 0.00 0.00 100.00
05:00:29 AM 6 0.00 0.00 100.00
05:00:29 AM 7 0.00 0.00 100.00
05:00:29 AM 8 0.00 0.00 100.00
05:00:29 AM 9 0.00 0.00 100.00
05:00:29 AM 10 100.00 0.00 0.00
05:00:29 AM 11 0.00 0.00 100.00
05:00:29 AM 12 100.00 0.00 0.00
05:00:29 AM 13 0.00 0.00 100.00
05:00:29 AM 14 100.00 0.00 0.00
05:00:29 AM 15 100.00 0.00 0.00
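The check quoted above can be modeled in a couple of lines. This is a userspace sketch with illustrative numbers only: with SD_PREFER_SIBLING, the parent level sees each child group's capacity clamped to roughly one task, so a socket running 3 tasks trips the condition and the NUMA level starts balancing.

```c
#include <assert.h>

/*
 * Toy model of the update_sd_pick_busiest() condition quoted above:
 * a group is picked as busiest once it runs more tasks than its
 * capacity.  The capacity values here are illustrative stand-ins,
 * not real kernel data.
 */
static int group_is_busiest(unsigned int sum_nr_running,
			    unsigned int group_capacity)
{
	return sum_nr_running > group_capacity;
}
```

With the reduced capacity (1) a 3-task socket is flagged busiest; with the full capacity (4) it never is, which matches the stuck mpstat picture above.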
--
Thanks
Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level
2013-02-13 13:22 ` Alex Shi
@ 2013-02-15 12:38 ` Peter Zijlstra
2013-02-16 5:16 ` Alex Shi
0 siblings, 1 reply; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-15 12:38 UTC (permalink / raw)
To: Alex Shi
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On Wed, 2013-02-13 at 21:22 +0800, Alex Shi wrote:
> No, the flag is set on the MC/CPU domains, but it is checked in their
> parents' balancing, like in the NUMA domain.
> Without the flag, the NUMA domain will become imbalanced. Like on my
> 2-socket NHM EP: 3 of 4 tasks were assigned on socket 0 (lcpu 10, 12, 14).
>
> In this case, update_sd_pick_busiest() needs a reduced group_capacity
> to return true:
> if (sgs->sum_nr_running > sgs->group_capacity)
> return true;
> so that NUMA domain balancing gets a chance to start.
Ah, indeed. It's always better to include such 'obvious' problems in the
changelog :-)
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level
2013-02-15 12:38 ` Peter Zijlstra
@ 2013-02-16 5:16 ` Alex Shi
0 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-16 5:16 UTC (permalink / raw)
To: Peter Zijlstra
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On 02/15/2013 08:38 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-13 at 21:22 +0800, Alex Shi wrote:
>> No, the flag is set on the MC/CPU domains, but it is checked in their
>> parents' balancing, like in the NUMA domain.
>> Without the flag, the NUMA domain will become imbalanced. Like on my
>> 2-socket NHM EP: 3 of 4 tasks were assigned on socket 0 (lcpu 10, 12, 14).
>>
>> In this case, update_sd_pick_busiest() needs a reduced group_capacity
>> to return true:
>> if (sgs->sum_nr_running > sgs->group_capacity)
>> return true;
>> so that NUMA domain balancing gets a chance to start.
>
> Ah, indeed. It's always better to include such 'obvious' problems in the
> changelog :-)
>
got it. :)
how about the following commit log and patch:
---
>From c97fceceaf9d68e73eaf015d5915474a9a94a2d1 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Fri, 28 Dec 2012 13:53:00 +0800
Subject: [PATCH] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain
level
The domain flag SD_PREFER_SIBLING was set on both the MC and CPU domains
by the initial commit b5d978e0c7e79a, and was carelessly removed during
the cleanup of the obsolete power scheduler. Commit 6956dc568 then
restored the flag on the CPU domain only. That works, but it introduces
an extra domain level, since it makes the MC and CPU domains differ.
So, restore the flag on the MC domain too, to remove a domain level on
x86 platforms.
This flag cannot be removed, since it is needed by the parent domains'
balancing: in the NUMA domain, for example, update_sd_pick_busiest()
needs a reduced group_capacity to return 'true' and then rebalance
tasks between groups.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
include/linux/topology.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..386bcf4 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -132,6 +132,7 @@ int arch_update_cpu_topology(void);
| 0*SD_SHARE_CPUPOWER \
| 1*SD_SHARE_PKG_RESOURCES \
| 0*SD_SERIALIZE \
+ | 1*SD_PREFER_SIBLING \
, \
.last_balance = jiffies, \
.balance_interval = 1, \
--
1.7.5.4
^ permalink raw reply related [flat|nested] 88+ messages in thread
* Re: [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level
2013-02-12 10:11 ` Peter Zijlstra
2013-02-13 13:22 ` Alex Shi
@ 2013-02-13 14:17 ` Alex Shi
1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-13 14:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On 02/12/2013 06:11 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> The domain flag SD_PREFER_SIBLING was set on both the MC and CPU domains
>> by the initial commit b5d978e0c7e79a, and was carelessly removed during
>> the cleanup of the obsolete power scheduler. Commit 6956dc568 then
>> restored the flag on the CPU domain only. That works, but it introduces
>> an extra domain level, since it makes the MC and CPU domains differ.
>>
>> So, restore the flag on the MC domain too, to remove a domain level on
>> x86 platforms.
In case I still failed to make the point of this patch clear:
Without this patch, the domain levels on my machines are:
SMT, MC, CPU, NUMA
With this patch, the domain levels are:
SMT, MC, NUMA
--
Thanks
Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* [patch v4 02/18] sched: select_task_rq_fair clean up
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
2013-01-24 3:06 ` [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level Alex Shi
@ 2013-01-24 3:06 ` Alex Shi
2013-02-12 10:14 ` Peter Zijlstra
2013-01-24 3:06 ` [patch v4 03/18] sched: fix find_idlest_group mess logical Alex Shi
` (18 subsequent siblings)
20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24 3:06 UTC (permalink / raw)
To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi
It is impossible to miss a task-allowed CPU in an eligible group.
And since find_idlest_group() only returns a group other than the old
CPU's group, it is also impossible for the new CPU to end up the same
as the old CPU.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
kernel/sched/fair.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eea870..6d3a95d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3378,11 +3378,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
}
new_cpu = find_idlest_cpu(group, p, cpu);
- if (new_cpu == -1 || new_cpu == cpu) {
- /* Now try balancing at a lower domain level of cpu */
- sd = sd->child;
- continue;
- }
/* Now try balancing at a lower domain level of new_cpu */
cpu = new_cpu;
--
1.7.12
^ permalink raw reply related [flat|nested] 88+ messages in thread
* Re: [patch v4 02/18] sched: select_task_rq_fair clean up
2013-01-24 3:06 ` [patch v4 02/18] sched: select_task_rq_fair clean up Alex Shi
@ 2013-02-12 10:14 ` Peter Zijlstra
2013-02-13 14:44 ` Alex Shi
0 siblings, 1 reply; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-12 10:14 UTC (permalink / raw)
To: Alex Shi
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> It is impossible to miss a task-allowed CPU in an eligible group.
I suppose your reasoning goes like: tsk->cpus_allowed is protected by
->pi_lock, we hold this, therefore it cannot change and
find_idlest_group() dtrt?
We can then state that this is due to adding proper serialization to
tsk->cpus_allowed.
> And since find_idlest_group() only returns a group other than the old
> CPU's group, it is also impossible for the new CPU to end up the same
> as the old CPU.
Sounds plausible, but I'm not convinced; do we have hard serialization
against hotplug?
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 02/18] sched: select_task_rq_fair clean up
2013-02-12 10:14 ` Peter Zijlstra
@ 2013-02-13 14:44 ` Alex Shi
0 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-13 14:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On 02/12/2013 06:14 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> It is impossible to miss a task-allowed CPU in an eligible group.
>
> I suppose your reasoning goes like: tsk->cpus_allowed is protected by
> ->pi_lock, we hold this, therefore it cannot change and
> find_idlest_group() dtrt?
yes.
>
> We can then state that this is due to adding proper serialization to
> tsk->cpus_allowed.
>
>> And since find_idlest_group() only returns a group other than the old
>> CPU's group, it is also impossible for the new CPU to end up the same
>> as the old CPU.
>
> Sounds plausible, but I'm not convinced; do we have hard serialization
> against hotplug?
Any caller of select_task_rq() will check whether the returned dst_cpu
is still working, so there is nothing to worry about.
>
>
>
--
Thanks
Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* [patch v4 03/18] sched: fix find_idlest_group mess logical
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
2013-01-24 3:06 ` [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level Alex Shi
2013-01-24 3:06 ` [patch v4 02/18] sched: select_task_rq_fair clean up Alex Shi
@ 2013-01-24 3:06 ` Alex Shi
2013-02-12 10:16 ` Peter Zijlstra
2013-01-24 3:06 ` [patch v4 04/18] sched: don't need go to smaller sched domain Alex Shi
` (17 subsequent siblings)
20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24 3:06 UTC (permalink / raw)
To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi
There are 4 situations in the function:
1, no group allows the task;
so min_load = ULONG_MAX, this_load = 0, idlest = NULL
2, only the local group allows the task;
so min_load = ULONG_MAX, this_load assigned, idlest = NULL
3, only non-local groups allow the task;
so min_load assigned, this_load = 0, idlest != NULL
4, the local group plus other groups allow the task;
so min_load assigned, this_load assigned, idlest != NULL
The current logic returns NULL in the first 3 scenarios, and still
returns NULL in the 4th situation if the idlest group is heavier than
the local group.
Actually, the groups in situations 2 and 3 are also eligible to host
the task; and in the 4th situation, it is fine to keep the bias toward
the local group. Hence this patch.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
kernel/sched/fair.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d3a95d..3c7b09a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3181,6 +3181,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int load_idx)
{
struct sched_group *idlest = NULL, *group = sd->groups;
+ struct sched_group *this_group = NULL;
unsigned long min_load = ULONG_MAX, this_load = 0;
int imbalance = 100 + (sd->imbalance_pct-100)/2;
@@ -3215,14 +3216,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
if (local_group) {
this_load = avg_load;
- } else if (avg_load < min_load) {
+ this_group = group;
+ }
+ if (avg_load < min_load) {
min_load = avg_load;
idlest = group;
}
} while (group = group->next, group != sd->groups);
- if (!idlest || 100*this_load < imbalance*min_load)
- return NULL;
+ if (this_group && idlest != this_group)
+ /* Bias toward our group again */
+ if (100*this_load < imbalance*min_load)
+ idlest = this_group;
+
return idlest;
}
--
1.7.12
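The changelog's four situations can be exercised against a toy model of the old return condition. `sample_group`, the loads, and the imbalance value below are illustrative stand-ins, not real scheduler data:

```c
#include <stddef.h>

/*
 * Toy model of the old find_idlest_group() tail: return NULL when no
 * non-local idlest group was found, or when the local group is light
 * enough.  idlest == NULL covers situations 1 and 2; this_load == 0
 * with idlest set is situation 3, which the old code also turns into
 * NULL even though the group could host the task.
 */
struct group { int id; };

struct group sample_group = { 1 };

static struct group *old_find_idlest_tail(struct group *idlest,
					  unsigned long this_load,
					  unsigned long min_load,
					  unsigned long imbalance)
{
	if (!idlest || 100 * this_load < imbalance * min_load)
		return NULL;
	return idlest;
}
```

In situation 3 (`this_load == 0`, `idlest` set) the comparison is always true, so the old code always bails out, which is exactly the miss the patch addresses.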
^ permalink raw reply related [flat|nested] 88+ messages in thread
* Re: [patch v4 03/18] sched: fix find_idlest_group mess logical
2013-01-24 3:06 ` [patch v4 03/18] sched: fix find_idlest_group mess logical Alex Shi
@ 2013-02-12 10:16 ` Peter Zijlstra
2013-02-13 15:07 ` Alex Shi
0 siblings, 1 reply; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-12 10:16 UTC (permalink / raw)
To: Alex Shi
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> There are 4 situations in the function:
> 1, no group allows the task;
> so min_load = ULONG_MAX, this_load = 0, idlest = NULL
> 2, only the local group allows the task;
> so min_load = ULONG_MAX, this_load assigned, idlest = NULL
> 3, only non-local groups allow the task;
> so min_load assigned, this_load = 0, idlest != NULL
> 4, the local group plus other groups allow the task;
> so min_load assigned, this_load assigned, idlest != NULL
>
> The current logic returns NULL in the first 3 scenarios, and still
> returns NULL in the 4th situation if the idlest group is heavier than
> the local group.
>
> Actually, the groups in situations 2 and 3 are also eligible to host
> the task; and in the 4th situation, it is fine to keep the bias toward
> the local group.
I'm not convinced this is actually a cleanup.. taken together with patch
4 (which is a direct consequence of this patch afaict) you replace one
conditional with 2.
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 03/18] sched: fix find_idlest_group mess logical
2013-02-12 10:16 ` Peter Zijlstra
@ 2013-02-13 15:07 ` Alex Shi
0 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-13 15:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On 02/12/2013 06:16 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> There are 4 situations in the function:
>> 1, no group allows the task;
>> so min_load = ULONG_MAX, this_load = 0, idlest = NULL
>> 2, only the local group allows the task;
>> so min_load = ULONG_MAX, this_load assigned, idlest = NULL
>> 3, only non-local groups allow the task;
>> so min_load assigned, this_load = 0, idlest != NULL
>> 4, the local group plus other groups allow the task;
>> so min_load assigned, this_load assigned, idlest != NULL
>>
>> The current logic returns NULL in the first 3 scenarios, and still
>> returns NULL in the 4th situation if the idlest group is heavier than
>> the local group.
>>
>> Actually, the groups in situations 2 and 3 are also eligible to host
>> the task; and in the 4th situation, it is fine to keep the bias toward
>> the local group.
>
> I'm not convinced this is actually a cleanup.. taken together with patch
> 4 (which is a direct consequence of this patch afaict) you replace one
> conditional with 2.
>
The current logic will always miss the eligible CPUs in the 3rd
situation: 'sd = sd->child' still skips the eligible CPUs of the
non-local groups.
--
Thanks
Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* [patch v4 04/18] sched: don't need go to smaller sched domain
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
` (2 preceding siblings ...)
2013-01-24 3:06 ` [patch v4 03/18] sched: fix find_idlest_group mess logical Alex Shi
@ 2013-01-24 3:06 ` Alex Shi
2013-01-24 3:06 ` [patch v4 05/18] sched: quicker balancing on fork/exec/wake Alex Shi
` (16 subsequent siblings)
20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24 3:06 UTC (permalink / raw)
To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi
If the parent sched domain has no task-allowed CPU, neither does
its child. So bail out to save the useless checking.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
kernel/sched/fair.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3c7b09a..ecfbf8e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3378,10 +3378,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
load_idx = sd->wake_idx;
group = find_idlest_group(sd, p, cpu, load_idx);
- if (!group) {
- sd = sd->child;
- continue;
- }
+ if (!group)
+ goto unlock;
new_cpu = find_idlest_cpu(group, p, cpu);
--
1.7.12
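The one-line argument above is just cpumask containment: a child domain's span is a subset of its parent's span, so an empty intersection at the parent implies an empty intersection at every child. A sketch with plain bitmasks standing in for cpumasks (the span and allowed-mask values are made up for illustration):

```c
/*
 * Spans modeled as bitmasks.  If (parent_span & allowed) == 0 and
 * child_span is a subset of parent_span, then (child_span & allowed)
 * is also 0, which is why select_task_rq_fair() can bail out instead
 * of descending to the child domain.
 */
static int span_has_allowed_cpu(unsigned long span, unsigned long allowed)
{
	return (span & allowed) != 0;
}

static int is_subset(unsigned long child, unsigned long parent)
{
	return (child & ~parent) == 0;
}
```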
^ permalink raw reply related [flat|nested] 88+ messages in thread
* [patch v4 05/18] sched: quicker balancing on fork/exec/wake
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
` (3 preceding siblings ...)
2013-01-24 3:06 ` [patch v4 04/18] sched: don't need go to smaller sched domain Alex Shi
@ 2013-01-24 3:06 ` Alex Shi
2013-02-12 10:22 ` Peter Zijlstra
2013-01-24 3:06 ` [patch v4 06/18] sched: give initial value for runnable avg of sched entities Alex Shi
` (15 subsequent siblings)
20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24 3:06 UTC (permalink / raw)
To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi
I guess the bottom-up search for a CPU in the domain tree comes from
commit 3dbd5342074a1e (sched: multilevel sbe sbf); its purpose was to
balance tasks over all domain levels.
This balancing costs too much if there are many domains/groups in a
large system.
If we remove this code, we get quick fork/exec/wake balancing with a
similar balancing result across the whole system.
This patch improves hackbench performance by 10+% on my 4-socket SNB
machines, and by about 3% on 2-socket servers.
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
kernel/sched/fair.c | 20 +-------------------
1 file changed, 1 insertion(+), 19 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ecfbf8e..895a3f4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
goto unlock;
}
- while (sd) {
+ if (sd) {
int load_idx = sd->forkexec_idx;
struct sched_group *group;
- int weight;
-
- if (!(sd->flags & sd_flag)) {
- sd = sd->child;
- continue;
- }
if (sd_flag & SD_BALANCE_WAKE)
load_idx = sd->wake_idx;
@@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
goto unlock;
new_cpu = find_idlest_cpu(group, p, cpu);
-
- /* Now try balancing at a lower domain level of new_cpu */
- cpu = new_cpu;
- weight = sd->span_weight;
- sd = NULL;
- for_each_domain(cpu, tmp) {
- if (weight <= tmp->span_weight)
- break;
- if (tmp->flags & sd_flag)
- sd = tmp;
- }
- /* while loop will break here if sd == NULL */
}
unlock:
rcu_read_unlock();
--
1.7.12
^ permalink raw reply related [flat|nested] 88+ messages in thread
* Re: [patch v4 05/18] sched: quicker balancing on fork/exec/wake
2013-01-24 3:06 ` [patch v4 05/18] sched: quicker balancing on fork/exec/wake Alex Shi
@ 2013-02-12 10:22 ` Peter Zijlstra
2013-02-14 3:13 ` Alex Shi
2013-02-14 8:12 ` Preeti U Murthy
0 siblings, 2 replies; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-12 10:22 UTC (permalink / raw)
To: Alex Shi
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> I guess the bottom-up search for a CPU in the domain tree comes from
> commit 3dbd5342074a1e (sched: multilevel sbe sbf); its purpose was to
> balance tasks over all domain levels.
>
> This balancing costs too much if there are many domains/groups in a
> large system.
>
> If we remove this code, we get quick fork/exec/wake balancing with a
> similar balancing result across the whole system.
>
> This patch improves hackbench performance by 10+% on my 4-socket SNB
> machines, and by about 3% on 2-socket servers.
>
>
Numbers be groovy.. still I'd like a little more on the behavioural
change. Expand on what exactly is lost by this change so that if we
later find a regression we have a better idea of what and how.
For instance, note how find_idlest_group() isn't symmetric wrt
local_group. So by not doing the domain iteration we change things.
Now, it might well be that all this is somewhat overkill as it is, but
should we then not replace all of it with a simple min search over all
eligible cpus; that would be a real clean up.
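The flat min search Peter sketches would look roughly like this in userspace form. The load array and allowed mask are stand-ins for weighted_cpuload() and tsk_cpus_allowed(); everything here is an illustrative assumption, not kernel code:

```c
#include <limits.h>

/*
 * Sketch of a single min-load pass over all eligible CPUs, the
 * alternative to iterating group-by-group down the domain tree.
 */
static int find_min_load_cpu(const unsigned long *load,
			     const int *allowed, int nr_cpus)
{
	unsigned long min_load = ULONG_MAX;
	int i, idlest = -1;

	for (i = 0; i < nr_cpus; i++) {
		if (!allowed[i])
			continue;
		if (load[i] < min_load) {
			min_load = load[i];
			idlest = i;
		}
	}
	return idlest;
}

/* Example data: CPU 1 has the lowest load but is not allowed. */
static const unsigned long example_load[4] = { 5, 2, 9, 3 };
static const int example_allowed[4] = { 1, 0, 1, 1 };
```

The trade-off is the one Preeti raises later in the thread: on a large machine this visits every CPU, where the domain walk can prune whole groups.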
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 05/18] sched: quicker balancing on fork/exec/wake
2013-02-12 10:22 ` Peter Zijlstra
@ 2013-02-14 3:13 ` Alex Shi
2013-02-14 8:12 ` Preeti U Murthy
1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-14 3:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On 02/12/2013 06:22 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> I guess the bottom-up search for a CPU in the domain tree comes from
>> commit 3dbd5342074a1e (sched: multilevel sbe sbf); its purpose was to
>> balance tasks over all domain levels.
>>
>> This balancing costs too much if there are many domains/groups in a
>> large system.
>>
>> If we remove this code, we get quick fork/exec/wake balancing with a
>> similar balancing result across the whole system.
>>
>> This patch improves hackbench performance by 10+% on my 4-socket SNB
>> machines, and by about 3% on 2-socket servers.
>>
>>
> Numbers be groovy.. still I'd like a little more on the behavioural
> change. Expand on what exactly is lost by this change so that if we
> later find a regression we have a better idea of what and how.
>
> For instance, note how find_idlest_group() isn't symmetric wrt
> local_group. So by not doing the domain iteration we change things.
>
> Now, it might well be that all this is somewhat overkill as it is, but
> should we then not replace all of it with a simple min search over all
> eligible cpus; that would be a real clean up.
>
Um, I will think about this again..
>
--
Thanks
Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 05/18] sched: quicker balancing on fork/exec/wake
2013-02-12 10:22 ` Peter Zijlstra
2013-02-14 3:13 ` Alex Shi
@ 2013-02-14 8:12 ` Preeti U Murthy
2013-02-14 14:08 ` Alex Shi
2013-02-15 13:00 ` Peter Zijlstra
1 sibling, 2 replies; 88+ messages in thread
From: Preeti U Murthy @ 2013-02-14 8:12 UTC (permalink / raw)
To: Peter Zijlstra, Alex Shi
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, viresh.kumar, linux-kernel
Hi everyone,
On 02/12/2013 03:52 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> I guess the bottom-up search for a CPU in the domain tree comes from
>> commit 3dbd5342074a1e (sched: multilevel sbe sbf); its purpose was to
>> balance tasks over all domain levels.
>>
>> This balancing costs too much if there are many domains/groups in a
>> large system.
>>
>> If we remove this code, we get quick fork/exec/wake balancing with a
>> similar balancing result across the whole system.
>>
>> This patch improves hackbench performance by 10+% on my 4-socket SNB
>> machines, and by about 3% on 2-socket servers.
>>
>>
> Numbers be groovy.. still I'd like a little more on the behavioural
> change. Expand on what exactly is lost by this change so that if we
> later find a regression we have a better idea of what and how.
>
> For instance, note how find_idlest_group() isn't symmetric wrt
> local_group. So by not doing the domain iteration we change things.
>
> Now, it might well be that all this is somewhat overkill as it is, but
> should we then not replace all of it with a simple min search over all
> eligible cpus; that would be a real clean up.
Hi Peter, Alex,
If the eligible cpus happen to be all the cpus, then iterating over all
the cpus for the idlest one would be much worse than iterating over
sched domains, right?
I am also wondering how important it is to bias the balancing of a
forked/woken-up task onto the idlest cpu at every iteration.
If biasing towards the idlest_cpu at every iteration is not really the
criterion, then we could cut down on the iterations in fork/exec/wake
balancing.
Then the problem boils down to the choice between biasing our search
towards the idlest_cpu or the idlest_group. If we are not really
concerned about balancing load across groups, but about ensuring we find
the idlest cpu to run the task on, then Alex's patch seems to have
covered the criteria.
However, if the concern is to distribute the load uniformly across
groups, then I have the following patch, which might reduce the overhead
of the search for an eligible cpu for a forked/exec'ed/woken-up task.
Alex, if your patch does not show favourable behavioural changes, you
could try the below and check the same.
**************************START PATCH*************************************
sched:Improve balancing for fork/exec/wake tasks
As I see it, currently, we first find the idlest group, then the idlest
cpu within it. However, the current code does not seem to be convinced
that the selected cpu in the current iteration is in the correct child
group, and does an iteration over all the groups yet again,
*taking the selected cpu as the reference to point to the next-level
sched domain.*
Why then find the idlest cpu at every iteration, if the concern is
primarily the idlest group at each iteration? Why not find the idlest
group in every iteration and the idlest cpu only in the final iteration?
As a result:
1. We save the time spent in going over all the cpus of a sched group in
find_idlest_cpu() at every iteration.
2. Functionality remains the same. We find the idlest group at every
level, and consider a cpu of the idlest group as a reference to get the
next lower level sched domain, so as to find the idlest group there.
However, instead of taking the idlest cpu as the reference, we take
the first cpu as the reference. The resulting next-level sched domain
remains the same.
*However, this completely removes the bias towards the idlest cpu at
every level.*
This patchset therefore tries to bias its search towards the right group
to put the task on, and not the idlest cpu.
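The two-step idea described above, pick only the idlest group while descending and search for the idlest cpu once at the bottom, can be sketched with a toy two-level topology. Group sizes and loads are made up for illustration:

```c
#include <limits.h>

/*
 * Toy two-level model of the proposal: choose the idlest group by
 * average load, then search for the idlest CPU only inside that one
 * group.  Groups are fixed-size slices of a per-CPU load array.
 */
#define GROUPS		2
#define CPUS_PER_GROUP	4

static int pick_cpu(const unsigned long load[GROUPS * CPUS_PER_GROUP])
{
	unsigned long min_avg = ULONG_MAX, min_load = ULONG_MAX;
	int g, i, idlest_group = 0, idlest_cpu = -1;

	/* Step 1: idlest group, comparing per-group average loads. */
	for (g = 0; g < GROUPS; g++) {
		unsigned long sum = 0;
		for (i = 0; i < CPUS_PER_GROUP; i++)
			sum += load[g * CPUS_PER_GROUP + i];
		if (sum / CPUS_PER_GROUP < min_avg) {
			min_avg = sum / CPUS_PER_GROUP;
			idlest_group = g;
		}
	}
	/* Step 2: idlest cpu, searched only in the chosen group. */
	for (i = 0; i < CPUS_PER_GROUP; i++) {
		int cpu = idlest_group * CPUS_PER_GROUP + i;
		if (load[cpu] < min_load) {
			min_load = load[cpu];
			idlest_cpu = cpu;
		}
	}
	return idlest_cpu;
}

/* Example: group 1 (cpus 4-7) is lighter on average; cpu 6 is idlest. */
static const unsigned long example_loads[GROUPS * CPUS_PER_GROUP] = {
	9, 8, 7, 9,   3, 4, 1, 5
};
```

The per-cpu search runs once instead of once per level, which is the saving claimed in point 1 above.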
---
kernel/sched/fair.c | 53 ++++++++++++++-------------------------------------
1 file changed, 15 insertions(+), 38 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8691b0d..90855dd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3190,16 +3190,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
*/
static struct sched_group *
find_idlest_group(struct sched_domain *sd, struct task_struct *p,
- int this_cpu, int load_idx)
+ int this_cpu)
{
struct sched_group *idlest = NULL, *group = sd->groups;
- struct sched_group *this_group = NULL;
- u64 min_load = ~0ULL, this_load = 0;
- int imbalance = 100 + (sd->imbalance_pct-100)/2;
+ struct sched_group;
+ u64 min_load = ~0ULL;
do {
u64 load, avg_load;
- int local_group;
int i;
/* Skip over this group if it has no CPUs allowed */
@@ -3207,18 +3205,11 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
tsk_cpus_allowed(p)))
continue;
- local_group = cpumask_test_cpu(this_cpu,
- sched_group_cpus(group));
-
/* Tally up the load of all CPUs in the group */
avg_load = 0;
for_each_cpu(i, sched_group_cpus(group)) {
- /* Bias balancing toward cpus of our domain */
- if (local_group)
- load = source_load(i, load_idx);
- else
- load = target_load(i, load_idx);
+ load = weighted_cpuload(i);
avg_load += load;
}
@@ -3227,20 +3218,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
avg_load = (avg_load * SCHED_POWER_SCALE);
do_div(avg_load, group->sgp->power);
- if (local_group) {
- this_load = avg_load;
- this_group = group;
- } else if (avg_load < min_load) {
+ if (avg_load < min_load) {
min_load = avg_load;
idlest = group;
}
} while (group = group->next, group != sd->groups);
- if (this_group && idlest!= this_group) {
- /* Bias towards our group again */
- if (!idlest || 100*this_load < imbalance*min_load)
- return this_group;
- }
return idlest;
}
@@ -3248,7 +3231,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
* find_idlest_cpu - find the idlest cpu among the cpus in group.
*/
static int
-find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
+find_idlest_cpu(struct sched_group *group, struct task_struct *p)
{
u64 load, min_load = ~0ULL;
int idlest = -1;
@@ -3258,7 +3241,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
load = weighted_cpuload(i);
- if (load < min_load || (load == min_load && i == this_cpu)) {
+ if (load < min_load) {
min_load = load;
idlest = i;
}
@@ -3325,6 +3308,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
int cpu = smp_processor_id();
int prev_cpu = task_cpu(p);
int new_cpu = cpu;
+ struct sched_group *group;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
@@ -3367,8 +3351,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
}
while (sd) {
- int load_idx = sd->forkexec_idx;
- struct sched_group *group;
int weight;
if (!(sd->flags & sd_flag)) {
@@ -3376,15 +3358,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
continue;
}
- if (sd_flag & SD_BALANCE_WAKE)
- load_idx = sd->wake_idx;
-
- group = find_idlest_group(sd, p, cpu, load_idx);
- if (!group) {
- goto unlock;
- }
+ group = find_idlest_group(sd, p, cpu);
- new_cpu = find_idlest_cpu(group, p, cpu);
+ new_cpu = group_first_cpu(group);
/* Now try balancing at a lower domain level of new_cpu */
cpu = new_cpu;
@@ -3398,6 +3374,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
}
/* while loop will break here if sd == NULL */
}
+ new_cpu = find_idlest_cpu(group, p);
unlock:
rcu_read_unlock();
@@ -4280,15 +4257,15 @@ static inline void update_sd_lb_power_stats(struct lb_env *env,
if (sgs->sum_nr_running + 1 > sgs->group_capacity)
return;
if (sgs->group_util > sds->leader_util ||
- sgs->group_util == sds->leader_util &&
- group_first_cpu(group) < group_first_cpu(sds->group_leader)) {
+ (sgs->group_util == sds->leader_util &&
+ group_first_cpu(group) < group_first_cpu(sds->group_leader))) {
sds->group_leader = group;
sds->leader_util = sgs->group_util;
}
/* Calculate the group which is almost idle */
if (sgs->group_util < sds->min_util ||
- sgs->group_util == sds->min_util &&
- group_first_cpu(group) > group_first_cpu(sds->group_leader)) {
+ (sgs->group_util == sds->min_util &&
+ group_first_cpu(group) > group_first_cpu(sds->group_leader))) {
sds->group_min = group;
sds->min_util = sgs->group_util;
sds->min_load_per_task = sgs->sum_weighted_load;
Regards
Preeti U Murthy
^ permalink raw reply related [flat|nested] 88+ messages in thread
* Re: [patch v4 05/18] sched: quicker balancing on fork/exec/wake
2013-02-14 8:12 ` Preeti U Murthy
@ 2013-02-14 14:08 ` Alex Shi
2013-02-15 13:00 ` Peter Zijlstra
1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-14 14:08 UTC (permalink / raw)
To: Preeti U Murthy
Cc: Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp, pjt,
namhyung, efault, vincent.guittot, gregkh, viresh.kumar,
linux-kernel
On 02/14/2013 04:12 PM, Preeti U Murthy wrote:
> **************************START PATCH*************************************
>
> sched:Improve balancing for fork/exec/wake tasks
Can you test the patch with hackbench and aim7 at 2000 threads, plus
the patch in the tip:core/locking tree:
3a15e0e0cdda rwsem: Implement writer lock-stealing for better scalability
If it works, you may see some improvement.
--
Thanks
Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 05/18] sched: quicker balancing on fork/exec/wake
2013-02-14 8:12 ` Preeti U Murthy
2013-02-14 14:08 ` Alex Shi
@ 2013-02-15 13:00 ` Peter Zijlstra
1 sibling, 0 replies; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-15 13:00 UTC (permalink / raw)
To: Preeti U Murthy
Cc: Alex Shi, torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung,
efault, vincent.guittot, gregkh, viresh.kumar, linux-kernel
On Thu, 2013-02-14 at 13:42 +0530, Preeti U Murthy wrote:
> Hi Peter,Alex,
> If the eligible cpus happen to be all the cpus,then iterating over all
> the
> cpus for idlest would be much worse than iterating over sched domains
> right?
Depends, doing a domain walk generally gets you 2n cpus visited --
geometric series and such. A simple scan of the top-most domain mask
that's eligible will limit that to n.
> I am also wondering how important it is to bias the balancing of
> forked/woken up
> task onto an idlest cpu at every iteration.
Yeah, I don't know, it seems overkill to me, that code is from before my
time, so far it has survived.
> If biasing towards the idlest_cpu at every iteration is not really the
> criteria,
> then we could cut down on the iterations in fork/exec/wake balancing.
> Then the problem boils down to,is the option between biasing our
> search towards
> the idlest_cpu or the idlest_group.If we are not really concerned
> about balancing
> load across groups,but ensuring we find the idlest cpu to run the
> task on,then
> Alex's patch seems to have covered the criteria.
>
> However if the concern is to distribute the load uniformly across
> groups,then
> I have the following patch which might reduce the overhead of the
> search of an
> eligible cpu for a forked/exec/woken up task.
Nah, so I think the whole bias thing was mostly done to avoid
over-balancing and possibly to compensate for some approximations on the
whole weight/load measurement stuff.
^ permalink raw reply [flat|nested] 88+ messages in thread
* [patch v4 06/18] sched: give initial value for runnable avg of sched entities.
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
` (4 preceding siblings ...)
2013-01-24 3:06 ` [patch v4 05/18] sched: quicker balancing on fork/exec/wake Alex Shi
@ 2013-01-24 3:06 ` Alex Shi
2013-02-12 10:23 ` Peter Zijlstra
2013-01-24 3:06 ` [patch v4 07/18] sched: set initial load avg of new forked task Alex Shi
` (14 subsequent siblings)
20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24 3:06 UTC (permalink / raw)
To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi
We need to initialize se.avg.{decay_count, load_avg_contrib} to zero
after a new task is forked.
Otherwise, random values in those variables make a mess when the new task
is enqueued:
enqueue_task_fair
enqueue_entity
enqueue_entity_load_avg
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
kernel/sched/core.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..1743746 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1559,6 +1559,8 @@ static void __sched_fork(struct task_struct *p)
#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
+ p->se.avg.decay_count = 0;
+ p->se.avg.load_avg_contrib = 0;
#endif
#ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
--
1.7.12
^ permalink raw reply related [flat|nested] 88+ messages in thread
* Re: [patch v4 06/18] sched: give initial value for runnable avg of sched entities.
2013-01-24 3:06 ` [patch v4 06/18] sched: give initial value for runnable avg of sched entities Alex Shi
@ 2013-02-12 10:23 ` Peter Zijlstra
0 siblings, 0 replies; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-12 10:23 UTC (permalink / raw)
To: Alex Shi
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> We need initialize the se.avg.{decay_count, load_avg_contrib} to zero
> after a new task forked.
> Otherwise random values of above variables cause mess when do new task
> enqueue:
> enqueue_task_fair
> enqueue_entity
> enqueue_entity_load_avg
>
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
> kernel/sched/core.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 26058d0..1743746 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1559,6 +1559,8 @@ static void __sched_fork(struct task_struct *p)
> #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
> p->se.avg.runnable_avg_period = 0;
> p->se.avg.runnable_avg_sum = 0;
> + p->se.avg.decay_count = 0;
> + p->se.avg.load_avg_contrib = 0;
> #endif
> #ifdef CONFIG_SCHEDSTATS
> memset(&p->se.statistics, 0, sizeof(p->se.statistics));
pjt?
^ permalink raw reply [flat|nested] 88+ messages in thread
* [patch v4 07/18] sched: set initial load avg of new forked task
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
` (5 preceding siblings ...)
2013-01-24 3:06 ` [patch v4 06/18] sched: give initial value for runnable avg of sched entities Alex Shi
@ 2013-01-24 3:06 ` Alex Shi
2013-02-12 10:26 ` Peter Zijlstra
2013-01-24 3:06 ` [patch v4 08/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
` (13 subsequent siblings)
20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24 3:06 UTC (permalink / raw)
To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi
A new task has no runnable sum when it first becomes runnable, so its
runnable load is zero. If we engage runnable load in balancing, burst
forking then selects only a few idle cpus to assign all the tasks to.
Set the initial load avg of a newly forked task to its load weight to
resolve this issue.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 11 +++++++++--
3 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d211247..f283d3d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1069,6 +1069,7 @@ struct sched_domain;
#else
#define ENQUEUE_WAKING 0
#endif
+#define ENQUEUE_NEWTASK 8
#define DEQUEUE_SLEEP 1
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1743746..7292965 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1706,7 +1706,7 @@ void wake_up_new_task(struct task_struct *p)
#endif
rq = __task_rq_lock(p);
- activate_task(rq, p, 0);
+ activate_task(rq, p, ENQUEUE_NEWTASK);
p->on_rq = 1;
trace_sched_wakeup_new(p, true);
check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 895a3f4..5c545e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1503,8 +1503,9 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
/* Add the load generated by se into cfs_rq's child load-average */
static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
struct sched_entity *se,
- int wakeup)
+ int flags)
{
+ int wakeup = flags & ENQUEUE_WAKEUP;
/*
* We track migrations using entity decay_count <= 0, on a wake-up
* migration we use a negative decay count to track the remote decays
@@ -1538,6 +1539,12 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
update_entity_load_avg(se, 0);
}
+ /*
+ * set the initial load avg of new task same as its load
+ * in order to avoid brust fork make few cpu too heavier
+ */
+ if (flags & ENQUEUE_NEWTASK)
+ se->avg.load_avg_contrib = se->load.weight;
cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
/* we force update consideration on load-balancer moves */
update_cfs_rq_blocked_load(cfs_rq, !wakeup);
@@ -1701,7 +1708,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
- enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
+ enqueue_entity_load_avg(cfs_rq, se, flags);
account_entity_enqueue(cfs_rq, se);
update_cfs_shares(cfs_rq);
--
1.7.12
^ permalink raw reply related [flat|nested] 88+ messages in thread
* Re: [patch v4 07/18] sched: set initial load avg of new forked task
2013-01-24 3:06 ` [patch v4 07/18] sched: set initial load avg of new forked task Alex Shi
@ 2013-02-12 10:26 ` Peter Zijlstra
2013-02-13 15:14 ` Alex Shi
0 siblings, 1 reply; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-12 10:26 UTC (permalink / raw)
To: Alex Shi
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> + /*
> + * set the initial load avg of new task same as its load
> + * in order to avoid brust fork make few cpu too heavier
> + */
> + if (flags & ENQUEUE_NEWTASK)
> + se->avg.load_avg_contrib = se->load.weight;
I seem to have vague recollections of a discussion with pjt where we
talked about the initial behaviour of tasks; from this haze I had the
impression that new tasks should behave like full weight..
PJT, is something more fundamental screwy?
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 07/18] sched: set initial load avg of new forked task
2013-02-12 10:26 ` Peter Zijlstra
@ 2013-02-13 15:14 ` Alex Shi
2013-02-13 15:41 ` Paul Turner
0 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-02-13 15:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On 02/12/2013 06:26 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> + /*
>> + * set the initial load avg of new task same as its load
>> + * in order to avoid brust fork make few cpu too heavier
>> + */
>> + if (flags & ENQUEUE_NEWTASK)
>> + se->avg.load_avg_contrib = se->load.weight;
>
> I seem to have vague recollections of a discussion with pjt where we
> talk about the initial behaviour of tasks; from this haze I had the
> impression that new tasks should behave like full weight..
>
Here we just give the new task full weight..
> PJT is something more fundamental screwy?
>
--
Thanks
Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 07/18] sched: set initial load avg of new forked task
2013-02-13 15:14 ` Alex Shi
@ 2013-02-13 15:41 ` Paul Turner
2013-02-14 13:07 ` Alex Shi
0 siblings, 1 reply; 88+ messages in thread
From: Paul Turner @ 2013-02-13 15:41 UTC (permalink / raw)
To: Alex Shi
Cc: Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp, namhyung,
efault, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Wed, Feb 13, 2013 at 7:14 AM, Alex Shi <alex.shi@intel.com> wrote:
> On 02/12/2013 06:26 PM, Peter Zijlstra wrote:
>> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>>> + /*
>>> + * set the initial load avg of new task same as its load
>>> + * in order to avoid brust fork make few cpu too heavier
>>> + */
>>> + if (flags & ENQUEUE_NEWTASK)
>>> + se->avg.load_avg_contrib = se->load.weight;
>>
>> I seem to have vague recollections of a discussion with pjt where we
>> talk about the initial behaviour of tasks; from this haze I had the
>> impression that new tasks should behave like full weight..
>>
>
> Here just make the new task has full weight..
>
>> PJT is something more fundamental screwy?
>>
So tasks get the quotient of their runnability over the period. Given
that the period is initially equivalent to runnability, it's definitely the
*intent* to start at full weight and ramp down.
Thinking on it, perhaps this is running afoul of amortization -- in
that we only recompute this quotient on each 1024ns boundary; perhaps
in the fork-bomb case we're too slow to accumulate these.
Alex, does something like the following help? This would force an
initial __update_entity_load_avg_contrib() update the first time we
see the task.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1dff78a..9d1c193 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1557,8 +1557,8 @@ static void __sched_fork(struct task_struct *p)
* load-balance).
*/
#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
- p->se.avg.runnable_avg_period = 0;
- p->se.avg.runnable_avg_sum = 0;
+ p->se.avg.runnable_avg_period = 1024;
+ p->se.avg.runnable_avg_sum = 1024;
#endif
#ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
>
>
> --
> Thanks
> Alex
^ permalink raw reply related [flat|nested] 88+ messages in thread
* Re: [patch v4 07/18] sched: set initial load avg of new forked task
2013-02-13 15:41 ` Paul Turner
@ 2013-02-14 13:07 ` Alex Shi
2013-02-19 11:34 ` Paul Turner
0 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-02-14 13:07 UTC (permalink / raw)
To: Paul Turner
Cc: Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp, namhyung,
efault, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1dff78a..9d1c193 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1557,8 +1557,8 @@ static void __sched_fork(struct task_struct *p)
> * load-balance).
> */
> #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
> - p->se.avg.runnable_avg_period = 0;
> - p->se.avg.runnable_avg_sum = 0;
> + p->se.avg.runnable_avg_period = 1024;
> + p->se.avg.runnable_avg_sum = 1024;
It can't work.
avg.decay_count needs to be set to 0 before enqueue_entity_load_avg(), so
update_entity_load_avg() can't be called and runnable_avg_period/sum
are unusable.
Even if we had a chance to call __update_entity_runnable_avg(),
avg.last_runnable_update needs to be set before that, usually to 'now';
that makes __update_entity_runnable_avg() return 0, so
update_entity_load_avg() still cannot reach
__update_entity_load_avg_contrib().
If we embed pieces of new-task load initialization into many functions,
it becomes too hard for a future reader to follow.
> #endif
> #ifdef CONFIG_SCHEDSTATS
> memset(&p->se.statistics, 0, sizeof(p->se.statistics));
>
>
>
>>
>>
>> --
>> Thanks
>> Alex
--
Thanks
Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 07/18] sched: set initial load avg of new forked task
2013-02-14 13:07 ` Alex Shi
@ 2013-02-19 11:34 ` Paul Turner
2013-02-20 4:18 ` Preeti U Murthy
2013-02-20 5:13 ` Alex Shi
0 siblings, 2 replies; 88+ messages in thread
From: Paul Turner @ 2013-02-19 11:34 UTC (permalink / raw)
To: Alex Shi
Cc: Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp, namhyung,
efault, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Fri, Feb 15, 2013 at 2:07 AM, Alex Shi <alex.shi@intel.com> wrote:
>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 1dff78a..9d1c193 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1557,8 +1557,8 @@ static void __sched_fork(struct task_struct *p)
>> * load-balance).
>> */
>> #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
>> - p->se.avg.runnable_avg_period = 0;
>> - p->se.avg.runnable_avg_sum = 0;
>> + p->se.avg.runnable_avg_period = 1024;
>> + p->se.avg.runnable_avg_sum = 1024;
>
> It can't work.
> avg.decay_count needs to be set to 0 before enqueue_entity_load_avg(), then
> update_entity_load_avg() can't be called, so, runnable_avg_period/sum
> are unusable.
Well we _could_ also use a negative decay_count here and treat it like
a migration; but the larger problem is the visibility of p->on_rq,
which gates whether we account the time as runnable and is set
after activate_task(), so that's out.
>
> Even we has chance to call __update_entity_runnable_avg(),
> avg.last_runnable_update needs be set before that, usually, it needs to
> be set as 'now', that cause __update_entity_runnable_avg() function
> return 0, then update_entity_load_avg() still can not reach to
> __update_entity_load_avg_contrib().
>
> If we embed a simple new task load initialization to many functions,
> that is too hard for future reader.
This is my concern about making this a special case with the
introduction of the ENQUEUE_NEWTASK flag; enqueue jumps through enough hoops
as it is.
I still don't see why we can't resolve this at init time in
__sched_fork(); your patch above just moves an explicit initialization
of load_avg_contrib into the enqueue path. Adding a call to
__update_task_entity_contrib() to the previous alternate suggestion
would similarly seem to resolve this?
>
>> #endif
>> #ifdef CONFIG_SCHEDSTATS
>> memset(&p->se.statistics, 0, sizeof(p->se.statistics));
>>
>>
>>
>>>
>>>
>>> --
>>> Thanks
>>> Alex
>
>
> --
> Thanks
> Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 07/18] sched: set initial load avg of new forked task
2013-02-19 11:34 ` Paul Turner
@ 2013-02-20 4:18 ` Preeti U Murthy
2013-02-20 5:13 ` Alex Shi
1 sibling, 0 replies; 88+ messages in thread
From: Preeti U Murthy @ 2013-02-20 4:18 UTC (permalink / raw)
To: Paul Turner
Cc: Alex Shi, Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp,
namhyung, efault, vincent.guittot, gregkh, viresh.kumar,
linux-kernel
Hi everyone,
On 02/19/2013 05:04 PM, Paul Turner wrote:
> On Fri, Feb 15, 2013 at 2:07 AM, Alex Shi <alex.shi@intel.com> wrote:
>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 1dff78a..9d1c193 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -1557,8 +1557,8 @@ static void __sched_fork(struct task_struct *p)
>>> * load-balance).
>>> */
>>> #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
>>> - p->se.avg.runnable_avg_period = 0;
>>> - p->se.avg.runnable_avg_sum = 0;
>>> + p->se.avg.runnable_avg_period = 1024;
>>> + p->se.avg.runnable_avg_sum = 1024;
>>
>> It can't work.
>> avg.decay_count needs to be set to 0 before enqueue_entity_load_avg(), then
>> update_entity_load_avg() can't be called, so, runnable_avg_period/sum
>> are unusable.
>
> Well we _could_ also use a negative decay_count here and treat it like
> a migration; but the larger problem is the visibility of p->on_rq;
> which is gates whether we account the time as runnable and occurs
> after activate_task() so that's out.
>
>>
>> Even we has chance to call __update_entity_runnable_avg(),
>> avg.last_runnable_update needs be set before that, usually, it needs to
>> be set as 'now', that cause __update_entity_runnable_avg() function
>> return 0, then update_entity_load_avg() still can not reach to
>> __update_entity_load_avg_contrib().
>>
>> If we embed a simple new task load initialization to many functions,
>> that is too hard for future reader.
>
> This is my concern about making this a special case with the
> introduction ENQUEUE_NEWTASK flag; enqueue jumps through enough hoops
> as it is.
>
> I still don't see why we can't resolve this at init time in
> __sched_fork(); your patch above just moves an explicit initialization
> of load_avg_contrib into the enqueue path. Adding a call to
> __update_task_entity_contrib() to the previous alternate suggestion
> would similarly seem to resolve this?
We could do this (add a call to __update_task_entity_contrib()), but
cfs_rq->runnable_load_avg gets updated only if the task is on the runqueue,
and in the forked task's case the on_rq flag is not yet set. Something like
the below:
---
kernel/sched/fair.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8691b0d..841e156 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1451,14 +1451,20 @@ static inline void update_entity_load_avg(struct sched_entity *se,
else
now = cfs_rq_clock_task(group_cfs_rq(se));
- if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
- return;
-
+ if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq)) {
+ if (!(flags & ENQUEUE_NEWTASK))
+ return;
+ }
contrib_delta = __update_entity_load_avg_contrib(se);
if (!update_cfs_rq)
return;
+ /* But the cfs_rq->runnable_load_avg does not get updated in case of
+ * a forked task, because se->on_rq == 0, although we update the
+ * task's load_avg_contrib above in
+ * __update_entity_load_avg_contrib().
+ */
if (se->on_rq)
cfs_rq->runnable_load_avg += contrib_delta;
else
@@ -1538,12 +1544,6 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
update_entity_load_avg(se, 0);
}
- /*
- * set the initial load avg of new task same as its load
- * in order to avoid brust fork make few cpu too heavier
- */
- if (flags & ENQUEUE_NEWTASK)
- se->avg.load_avg_contrib = se->load.weight;
cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
/* we force update consideration on load-balancer moves */
Thanks
Regards
Preeti U Murthy
^ permalink raw reply related [flat|nested] 88+ messages in thread
* Re: [patch v4 07/18] sched: set initial load avg of new forked task
2013-02-19 11:34 ` Paul Turner
2013-02-20 4:18 ` Preeti U Murthy
@ 2013-02-20 5:13 ` Alex Shi
1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-20 5:13 UTC (permalink / raw)
To: Paul Turner
Cc: Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp, namhyung,
efault, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
> This is my concern about making this a special case with the
> introduction ENQUEUE_NEWTASK flag; enqueue jumps through enough hoops
> as it is.
>
> I still don't see why we can't resolve this at init time in
> __sched_fork(); your patch above just moves an explicit initialization
> of load_avg_contrib into the enqueue path. Adding a call to
> __update_task_entity_contrib() to the previous alternate suggestion
> would similarly seem to resolve this?
>
Without the ENQUEUE_NEWTASK flag, we can use the following patch, which
embeds the new-fork handling in an implicit way.
But since the newtask flag just follows the existing enqueue path, it also
looks natural and is an explicit way.
I am OK with either solution.
=======
From 0f5dd6babe899e27cfb78ea49d337e4f0918591b Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Wed, 20 Feb 2013 12:51:28 +0800
Subject: [PATCH 02/15] sched: set initial load avg of new forked task
A new task has no runnable sum when it first becomes runnable, so its
runnable load is zero. If we engage runnable load in balancing, burst
forking then selects only a few idle cpus to assign all the tasks to.
Set the initial load avg of a newly forked task to its load weight to
resolve this issue.
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 4 ++++
2 files changed, 6 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1743746..93a7590 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1648,6 +1648,8 @@ void sched_fork(struct task_struct *p)
p->sched_reset_on_fork = 0;
}
+ p->se.avg.load_avg_contrib = p->se.load.weight;
+
if (!rt_prio(p->prio))
p->sched_class = &fair_sched_class;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81fa536..cae5134 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1509,6 +1509,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
* We track migrations using entity decay_count <= 0, on a wake-up
* migration we use a negative decay count to track the remote decays
* accumulated while sleeping.
+ *
+ * When enqueueing a newly forked task, se->avg.decay_count == 0, so
+ * we bypass update_entity_load_avg() and use the initial
+ * avg.load_avg_contrib value: se->load.weight.
*/
if (unlikely(se->avg.decay_count <= 0)) {
se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
--
1.7.12
^ permalink raw reply related [flat|nested] 88+ messages in thread
* [patch v4 08/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
` (6 preceding siblings ...)
2013-01-24 3:06 ` [patch v4 07/18] sched: set initial load avg of new forked task Alex Shi
@ 2013-01-24 3:06 ` Alex Shi
2013-02-12 10:27 ` Peter Zijlstra
2013-01-24 3:06 ` [patch v4 09/18] sched: add sched_policies in kernel Alex Shi
` (12 subsequent siblings)
20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24 3:06 UTC (permalink / raw)
To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi
Remove the CONFIG_FAIR_GROUP_SCHED guard that covers the runnable info,
so that we can use the runnable load variables.
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
include/linux/sched.h | 8 +-------
kernel/sched/core.c | 7 +------
kernel/sched/fair.c | 13 ++-----------
kernel/sched/sched.h | 9 +--------
4 files changed, 5 insertions(+), 32 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f283d3d..66b05e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1195,13 +1195,7 @@ struct sched_entity {
/* rq "owned" by this entity/group: */
struct cfs_rq *my_q;
#endif
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
- /* Per-entity load-tracking */
+#ifdef CONFIG_SMP
struct sched_avg avg;
#endif
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7292965..0bd9d5f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1551,12 +1551,7 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
p->se.avg.decay_count = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c545e4..efeb65c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1109,8 +1109,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/*
* We choose a half-life close to 1 scheduling period.
* Note: The tables below are dependent on this value.
@@ -3391,12 +3390,6 @@ unlock:
}
/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
* Called immediately before a task is migrated to a new cpu; task_cpu(p) and
* cfs_rq_of(p) references at time of call are still valid and identify the
* previous cpu. However, the caller only guarantees p->pi_lock is held; no
@@ -3419,7 +3412,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
-#endif
#endif /* CONFIG_SMP */
static unsigned long
@@ -6111,9 +6103,8 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq = migrate_task_rq_fair,
-#endif
+
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fc88644..ae3511e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -225,12 +225,6 @@ struct cfs_rq {
#endif
#ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* CFS Load tracking
* Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -240,8 +234,7 @@ struct cfs_rq {
u64 runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
u32 tg_runnable_contrib;
u64 tg_load_contrib;
--
1.7.12
^ permalink raw reply related [flat|nested] 88+ messages in thread
* Re: [patch v4 08/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
2013-01-24 3:06 ` [patch v4 08/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
@ 2013-02-12 10:27 ` Peter Zijlstra
2013-02-13 15:23 ` Alex Shi
0 siblings, 1 reply; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-12 10:27 UTC (permalink / raw)
To: Alex Shi
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
> we can use runnable load variables.
>
It would be nice if we could quantify the performance hit of doing so.
Haven't yet looked at later patches to see if we remove anything to
off-set this.
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 08/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
2013-02-12 10:27 ` Peter Zijlstra
@ 2013-02-13 15:23 ` Alex Shi
2013-02-13 15:45 ` Paul Turner
0 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-02-13 15:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On 02/12/2013 06:27 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
>> we can use runnable load variables.
>>
> It would be nice if we could quantify the performance hit of doing so.
> Haven't yet looked at later patches to see if we remove anything to
> off-set this.
>
In our rough testing, we saw no clear performance change.
--
Thanks
Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 08/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
2013-02-13 15:23 ` Alex Shi
@ 2013-02-13 15:45 ` Paul Turner
2013-02-14 3:07 ` Preeti U Murthy
0 siblings, 1 reply; 88+ messages in thread
From: Paul Turner @ 2013-02-13 15:45 UTC (permalink / raw)
To: Alex Shi
Cc: Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp, namhyung,
efault, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Wed, Feb 13, 2013 at 7:23 AM, Alex Shi <alex.shi@intel.com> wrote:
> On 02/12/2013 06:27 PM, Peter Zijlstra wrote:
>> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>>> Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
>>> we can use runnable load variables.
>>>
>> It would be nice if we could quantify the performance hit of doing so.
>> Haven't yet looked at later patches to see if we remove anything to
>> off-set this.
>>
>
> In our rough testing, no much clear performance changes.
>
I'd personally like this to go with a series that actually does
something with it.
There's been a few proposals floating around on _how_ to do this; but
the challenge is in getting it stable enough that all of the wake-up
balancing does not totally perforate your stability gains into the
noise. select_idle_sibling really is your nemesis here.
It's a small enough patch that it can go at the head of any such
series (and indeed; it was originally structured to make such a patch
rather explicit.)
> --
> Thanks
> Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 08/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
2013-02-13 15:45 ` Paul Turner
@ 2013-02-14 3:07 ` Preeti U Murthy
0 siblings, 0 replies; 88+ messages in thread
From: Preeti U Murthy @ 2013-02-14 3:07 UTC (permalink / raw)
To: Paul Turner
Cc: Alex Shi, Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp,
namhyung, efault, vincent.guittot, gregkh, viresh.kumar,
linux-kernel
Hi everyone,
On 02/13/2013 09:15 PM, Paul Turner wrote:
> On Wed, Feb 13, 2013 at 7:23 AM, Alex Shi <alex.shi@intel.com> wrote:
>> On 02/12/2013 06:27 PM, Peter Zijlstra wrote:
>>> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>>>> Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
>>>> we can use runnable load variables.
>>>>
>>> It would be nice if we could quantify the performance hit of doing so.
>>> Haven't yet looked at later patches to see if we remove anything to
>>> off-set this.
>>>
>>
>> In our rough testing, no much clear performance changes.
>>
>
> I'd personally like this to go with a series that actually does
> something with it.
>
> There's been a few proposals floating around on _how_ to do this; but
> the challenge is in getting it stable enough that all of the wake-up
> balancing does not totally perforate your stability gains into the
> noise. select_idle_sibling really is your nemesis here.
>
> It's a small enough patch that it can go at the head of any such
> series (and indeed; it was originally structured to make such a patch
> rather explicit.)
>
>> --
>> Thanks
>> Alex
>
Paul, what exactly do you mean by select_idle_sibling() being our
nemesis here? What we observed through our experiments was the
following:
1. With per-entity load tracking (runnable_load_avg) in load balancing,
the load is distributed appropriately across the cpus.
2. However, when a task sleeps and wakes up, select_idle_sibling()
searches for the idlest group top to bottom. If a suitable candidate is
not found, it wakes up the task on the prev_cpu/waker_cpu. This
increases the runqueue size and load of the prev_cpu/waker_cpu
respectively.
3. The load balancer would then come to the rescue and redistribute the load.
As a consequence,
*The primary observation was that there is no performance degradation
with the integration of per-entity load tracking into the load balancer,
but there was a good increase in the number of migrations*. This, as I
see it, is due to points 2 and 3 above. Is this what you call the
nemesis? OR
select_idle_sibling() does a top-to-bottom search of the chosen domain
for an idlest group and is very likely to spread the waking task to a
far-off group on underutilized systems. This would prove costly for the
software buddies in finding each other, due to the time taken for the
search and the possible spreading of the software buddy tasks. Is this
what you call the nemesis?
Another approach to remove the above two problems, if they are such,
would be to use blocked_load + runnable_load for balancing, but when
waking up a task, use select_idle_sibling() only to search the L2 cache
domains for an idlest group. If unsuccessful, return the prev_cpu, which
has already accounted for the task in the blocked_load; hence this move
would not increase its load. Would you recommend going in this direction?
Thank you
Regards
Preeti U Murthy
^ permalink raw reply [flat|nested] 88+ messages in thread
* [patch v4 09/18] sched: add sched_policies in kernel
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
` (7 preceding siblings ...)
2013-01-24 3:06 ` [patch v4 08/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
@ 2013-01-24 3:06 ` Alex Shi
2013-02-12 10:36 ` Peter Zijlstra
2013-01-24 3:06 ` [patch v4 10/18] sched: add sysfs interface for sched_policy selection Alex Shi
` (11 subsequent siblings)
20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24 3:06 UTC (permalink / raw)
To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi
Current scheduler behaviour only considers system performance, so it
tries to spread tasks onto more cpu sockets and cpu cores.
To add power awareness, this patchset introduces 2 new scheduler
policies: powersaving and balance. They will use the runnable load
utilization in scheduler balancing. The current scheduling behaviour is
kept as the performance policy.
performance: the current scheduling behaviour; tries to spread tasks
onto more CPU sockets or cores. Performance oriented.
powersaving: packs tasks into few sched groups until all LCPUs in the
group are full. Power oriented.
balance : packs tasks into few sched groups until group_capacity
numbers of CPUs are full. A balance between performance
and powersaving.
The following patches will enable powersaving/balance scheduling in CFS.
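The three policies can be summarized as a choice of packing threshold
per sched group. A minimal standalone sketch of that rule follows; the
function name is illustrative and mirrors the description above, not a
literal kernel function:

```c
#include <assert.h>

/* Mirror of the policy constants introduced by this patch. */
#define SCHED_POLICY_PERFORMANCE (0x1)
#define SCHED_POLICY_POWERSAVING (0x2)
#define SCHED_POLICY_BALANCE     (0x4)

/*
 * How many tasks a group may take before we stop packing into it:
 * powersaving packs until every logical CPU (group_weight, including
 * HT siblings) is busy; balance stops at group_capacity, i.e. the
 * number of full-power CPUs; performance never packs.
 */
static unsigned long packing_threshold(int policy,
                                       unsigned long group_weight,
                                       unsigned long group_capacity)
{
	switch (policy) {
	case SCHED_POLICY_POWERSAVING:
		return group_weight;	/* fill all LCPUs */
	case SCHED_POLICY_BALANCE:
		return group_capacity;	/* fill only full-power CPUs */
	default:
		return 0;		/* performance: no packing */
	}
}
```

For example, on an HT core with group_weight 2 and group_capacity 1,
powersaving would allow 2 tasks before spilling to another group, while
balance would allow only 1.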
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
kernel/sched/fair.c | 2 ++
kernel/sched/sched.h | 6 ++++++
2 files changed, 8 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index efeb65c..538f469 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6086,6 +6086,8 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
return rr_interval;
}
+/* The default scheduler policy is 'performance'. */
+int __read_mostly sched_policy = SCHED_POLICY_PERFORMANCE;
/*
* All the scheduling class methods:
*/
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ae3511e..66b08a1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -8,6 +8,12 @@
extern __read_mostly int scheduler_running;
+#define SCHED_POLICY_PERFORMANCE (0x1)
+#define SCHED_POLICY_POWERSAVING (0x2)
+#define SCHED_POLICY_BALANCE (0x4)
+
+extern int __read_mostly sched_policy;
+
/*
* Convert user-nice values [ -20 ... 0 ... 19 ]
* to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
--
1.7.12
^ permalink raw reply related [flat|nested] 88+ messages in thread
* Re: [patch v4 09/18] sched: add sched_policies in kernel
2013-01-24 3:06 ` [patch v4 09/18] sched: add sched_policies in kernel Alex Shi
@ 2013-02-12 10:36 ` Peter Zijlstra
2013-02-13 15:41 ` Alex Shi
0 siblings, 1 reply; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-12 10:36 UTC (permalink / raw)
To: Alex Shi
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> Current scheduler behavior is just consider the for larger performance
> of system. So it try to spread tasks on more cpu sockets and cpu cores
>
> To adding the consideration of power awareness, the patchset adds
> 2 kinds of scheduler policy: powersaving and balance. They will use
> runnable load util in scheduler balancing. The current scheduling is taken
> as performance policy.
>
> performance: the current scheduling behaviour, try to spread tasks
> on more CPU sockets or cores. performance oriented.
> powersaving: will pack tasks into few sched group until all LCPU in the
> group is full, power oriented.
> balance : will pack tasks into few sched group until group_capacity
> numbers CPU is full, balance between performance and
> powersaving.
_WHY_ do you start out with so much choice?
If your power policy is so abysmally poor on performance that you
already know you need a 3rd policy to keep people happy, maybe you're
doing something wrong?
> +#define SCHED_POLICY_PERFORMANCE (0x1)
> +#define SCHED_POLICY_POWERSAVING (0x2)
> +#define SCHED_POLICY_BALANCE (0x4)
> +
> +extern int __read_mostly sched_policy;
I'd much prefer: sched_balance_policy. Scheduler policy is a concept
already well defined by posix and we don't need it to mean two
completely different things.
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 09/18] sched: add sched_policies in kernel
2013-02-12 10:36 ` Peter Zijlstra
@ 2013-02-13 15:41 ` Alex Shi
0 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-13 15:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On 02/12/2013 06:36 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> Current scheduler behavior is just consider the for larger performance
>> of system. So it try to spread tasks on more cpu sockets and cpu cores
>>
>> To adding the consideration of power awareness, the patchset adds
>> 2 kinds of scheduler policy: powersaving and balance. They will use
>> runnable load util in scheduler balancing. The current scheduling is taken
>> as performance policy.
>>
>> performance: the current scheduling behaviour, try to spread tasks
>> on more CPU sockets or cores. performance oriented.
>> powersaving: will pack tasks into few sched group until all LCPU in the
>> group is full, power oriented.
>> balance : will pack tasks into few sched group until group_capacity
>> numbers CPU is full, balance between performance and
>> powersaving.
>
> _WHY_ do you start out with so much choice?
>
> If your power policy is so abysmally poor on performance that you
> already know you need a 3rd policy to keep people happy, maybe you're
> doing something wrong?
Nope, not much performance yield for either the powersaving or the
balance policy.
Most of the testing results are in my reply to Ingo's email on the
'0/18' thread -- the cover letter email threads:
https://lkml.org/lkml/2013/2/3/353
https://lkml.org/lkml/2013/2/4/735
I introduced the 'balance' policy because an HT-thread LCPU in an Intel
CPU has less than 1 usual cpu's power. It is for when someone wants to
save power but still wants tasks to have a whole cpu core...
>
>> +#define SCHED_POLICY_PERFORMANCE (0x1)
>> +#define SCHED_POLICY_POWERSAVING (0x2)
>> +#define SCHED_POLICY_BALANCE (0x4)
>> +
>> +extern int __read_mostly sched_policy;
>
> I'd much prefer: sched_balance_policy. Scheduler policy is a concept
> already well defined by posix and we don't need it to mean two
> completely different things.
>
Got it.
--
Thanks
Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* [patch v4 10/18] sched: add sysfs interface for sched_policy selection
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
` (8 preceding siblings ...)
2013-01-24 3:06 ` [patch v4 09/18] sched: add sched_policies in kernel Alex Shi
@ 2013-01-24 3:06 ` Alex Shi
2013-01-24 3:06 ` [patch v4 11/18] sched: log the cpu utilization at rq Alex Shi
` (10 subsequent siblings)
20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24 3:06 UTC (permalink / raw)
To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi
This patch adds the power aware scheduler knob into sysfs:
$cat /sys/devices/system/cpu/sched_policy/available_sched_policy
performance powersaving balance
$cat /sys/devices/system/cpu/sched_policy/current_sched_policy
powersaving
This means the sched policy in use is 'powersaving'.
The user can change the policy with the 'echo' command:
echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
Documentation/ABI/testing/sysfs-devices-system-cpu | 25 +++++++
kernel/sched/fair.c | 76 ++++++++++++++++++++++
2 files changed, 101 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 6943133..0ca0727 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -53,6 +53,31 @@ Description: Dynamic addition and removal of CPU's. This is not hotplug
the system. Information writtento the file to remove CPU's
is architecture specific.
+What: /sys/devices/system/cpu/sched_policy/current_sched_policy
+ /sys/devices/system/cpu/sched_policy/available_sched_policy
+Date: Oct 2012
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description: CFS scheduler policy showing and setting interface.
+
+ available_sched_policy shows there are 3 kinds of policy now:
+ performance, balance and powersaving.
+ current_sched_policy shows current scheduler policy. User
+ can change the policy by writing it.
+
+ The policy decides how the CFS scheduler distributes tasks onto
+ different CPU units.
+
+ performance: try to spread tasks onto more CPU sockets,
+ more CPU cores. Performance oriented.
+
+ powersaving: try to pack tasks onto the same core or same CPU
+ until every LCPU in the core or CPU socket is busy.
+ Powersaving oriented.
+
+ balance: try to pack tasks onto the same core or same CPU
+ until all full-power CPUs are busy.
+ A balance between performance and powersaving.
+
What: /sys/devices/system/cpu/cpu#/node
Date: October 2009
Contact: Linux memory management mailing list <linux-mm@kvack.org>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 538f469..947542f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6088,6 +6088,82 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
/* The default scheduler policy is 'performance'. */
int __read_mostly sched_policy = SCHED_POLICY_PERFORMANCE;
+
+#ifdef CONFIG_SYSFS
+static ssize_t show_available_sched_policy(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "performance balance powersaving\n");
+}
+
+static ssize_t show_current_sched_policy(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ if (sched_policy == SCHED_POLICY_PERFORMANCE)
+ return sprintf(buf, "performance\n");
+ else if (sched_policy == SCHED_POLICY_POWERSAVING)
+ return sprintf(buf, "powersaving\n");
+ else if (sched_policy == SCHED_POLICY_BALANCE)
+ return sprintf(buf, "balance\n");
+ return 0;
+}
+
+static ssize_t set_sched_policy(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t count)
+{
+ int ret;
+ char str_policy[16];
+
+ ret = sscanf(buf, "%15s", str_policy);
+ if (ret != 1)
+ return -EINVAL;
+
+ if (!strcmp(str_policy, "performance"))
+ sched_policy = SCHED_POLICY_PERFORMANCE;
+ else if (!strcmp(str_policy, "powersaving"))
+ sched_policy = SCHED_POLICY_POWERSAVING;
+ else if (!strcmp(str_policy, "balance"))
+ sched_policy = SCHED_POLICY_BALANCE;
+ else
+ return -EINVAL;
+
+ return count;
+}
+
+/*
+ * Sysfs setup bits:
+ */
+static DEVICE_ATTR(current_sched_policy, 0644, show_current_sched_policy,
+ set_sched_policy);
+
+static DEVICE_ATTR(available_sched_policy, 0444,
+ show_available_sched_policy, NULL);
+
+static struct attribute *sched_policy_default_attrs[] = {
+ &dev_attr_current_sched_policy.attr,
+ &dev_attr_available_sched_policy.attr,
+ NULL
+};
+static struct attribute_group sched_policy_attr_group = {
+ .attrs = sched_policy_default_attrs,
+ .name = "sched_policy",
+};
+
+int __init create_sysfs_sched_policy_group(struct device *dev)
+{
+ return sysfs_create_group(&dev->kobj, &sched_policy_attr_group);
+}
+
+static int __init sched_policy_sysfs_init(void)
+{
+ return create_sysfs_sched_policy_group(cpu_subsys.dev_root);
+}
+
+core_initcall(sched_policy_sysfs_init);
+#endif /* CONFIG_SYSFS */
+
/*
* All the scheduling class methods:
*/
--
1.7.12
^ permalink raw reply related [flat|nested] 88+ messages in thread
* [patch v4 11/18] sched: log the cpu utilization at rq
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
` (9 preceding siblings ...)
2013-01-24 3:06 ` [patch v4 10/18] sched: add sysfs interface for sched_policy selection Alex Shi
@ 2013-01-24 3:06 ` Alex Shi
2013-02-12 10:39 ` Peter Zijlstra
2013-01-24 3:06 ` [patch v4 12/18] sched: add power aware scheduling in fork/exec/wake Alex Shi
` (9 subsequent siblings)
20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24 3:06 UTC (permalink / raw)
To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi
The cpu's utilization measures how busy the cpu is:
util = cpu_rq(cpu)->avg.runnable_avg_sum
/ cpu_rq(cpu)->avg.runnable_avg_period;
Since the util is no more than 1, we use its percentage value in later
calculations, and set FULL_UTIL to 100%.
In the later power aware scheduling, we are sensitive to how busy the
cpu is, not how much load weight it has. As for power consumption, it
is more related to cpu busy time than to load weight.
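For illustration, the utilization figure this patch introduces can be
sketched as a standalone function (an approximation of the hunk below,
not kernel code; the divide-by-zero guard matches the patch):

```c
#include <assert.h>

/*
 * util = runnable_avg_sum / runnable_avg_period, scaled to percent
 * so that FULL_UTIL == 100. A period of 0 is treated as 1, as in
 * update_rq_runnable_avg() below.
 */
static unsigned int cpu_util_pct(unsigned int runnable_avg_sum,
                                 unsigned int runnable_avg_period)
{
	unsigned int period = runnable_avg_period ? runnable_avg_period : 1;

	return runnable_avg_sum * 100 / period;
}
```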
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 4 ++++
kernel/sched/sched.h | 4 ++++
3 files changed, 9 insertions(+)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2cd3c1b..e4035f7 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -318,6 +318,7 @@ do { \
P(ttwu_count);
P(ttwu_local);
+ P(util);
#undef P
#undef P64
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 947542f..20363fb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1495,8 +1495,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
{
+ u32 period;
__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
__update_tg_runnable_avg(&rq->avg, &rq->cfs);
+
+ period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
+ rq->util = rq->avg.runnable_avg_sum * 100 / period;
}
/* Add the load generated by se into cfs_rq's child load-average */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 66b08a1..fa8bdb9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -350,6 +350,9 @@ extern struct root_domain def_root_domain;
#endif /* CONFIG_SMP */
+/* the percentage full cpu utilization */
+#define FULL_UTIL 100
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -481,6 +484,7 @@ struct rq {
#endif
struct sched_avg avg;
+ unsigned int util;
};
static inline int cpu_of(struct rq *rq)
--
1.7.12
^ permalink raw reply related [flat|nested] 88+ messages in thread
* Re: [patch v4 11/18] sched: log the cpu utilization at rq
2013-01-24 3:06 ` [patch v4 11/18] sched: log the cpu utilization at rq Alex Shi
@ 2013-02-12 10:39 ` Peter Zijlstra
2013-02-14 3:10 ` Alex Shi
0 siblings, 1 reply; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-12 10:39 UTC (permalink / raw)
To: Alex Shi
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>
> The cpu's utilization is to measure how busy is the cpu.
> util = cpu_rq(cpu)->avg.runnable_avg_sum
> / cpu_rq(cpu)->avg.runnable_avg_period;
>
> Since the util is no more than 1, we use its percentage value in later
> caculations. And set the the FULL_UTIL as 100%.
>
> In later power aware scheduling, we are sensitive for how busy of the
> cpu, not how much weight of its load. As to power consuming, it is more
> related with cpu busy time, not the load weight.
I think we can make that argument in general; that is irrespective of
the actual policy. We simply never had anything better to go with.
So please clarify why you think this only applies to power aware
scheduling.
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 11/18] sched: log the cpu utilization at rq
2013-02-12 10:39 ` Peter Zijlstra
@ 2013-02-14 3:10 ` Alex Shi
0 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-14 3:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel
On 02/12/2013 06:39 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>>
>> The cpu's utilization is to measure how busy is the cpu.
>> util = cpu_rq(cpu)->avg.runnable_avg_sum
>> / cpu_rq(cpu)->avg.runnable_avg_period;
>>
>> Since the util is no more than 1, we use its percentage value in later
>> caculations. And set the the FULL_UTIL as 100%.
>>
>> In later power aware scheduling, we are sensitive for how busy of the
>> cpu, not how much weight of its load. As to power consuming, it is more
>> related with cpu busy time, not the load weight.
>
> I think we can make that argument in general; that is irrespective of
> the actual policy. We simply never had anything better to go with.
>
> So please clarify why you think this only applies to power aware
> scheduling.
Um, rq->util is a general statistic. It can be used in other places if
needed; it is not specific to power aware scheduling.
>
--
Thanks
Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* [patch v4 12/18] sched: add power aware scheduling in fork/exec/wake
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
` (10 preceding siblings ...)
2013-01-24 3:06 ` [patch v4 11/18] sched: log the cpu utilization at rq Alex Shi
@ 2013-01-24 3:06 ` Alex Shi
2013-01-24 3:06 ` [patch v4 13/18] sched: packing small tasks in wake/exec balancing Alex Shi
` (8 subsequent siblings)
20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24 3:06 UTC (permalink / raw)
To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi
This patch adds power aware scheduling in fork/exec/wake. It tries to
select a cpu from the busiest group that still has spare utilization.
That will save power for the other groups.
The trade-off is adding power aware statistics collection in group
seeking. But since the collection only happens when the power
scheduling eligibility condition holds, the worst case of hackbench
testing only drops about 2% with the powersaving/balance policy. No
clear change for the performance policy.
I had tried to use the rq utilization in this balancing, but since the
utilization needs much time to accumulate (345ms), it is bad for any
burst balancing. So I use the instant rq utilization -- nr_running.
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
kernel/sched/fair.c | 230 ++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 179 insertions(+), 51 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 20363fb..7c7d9db 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3318,25 +3318,189 @@ done:
}
/*
- * sched_balance_self: balance the current task (running on cpu) in domains
+ * sd_lb_stats - Structure to store the statistics of a sched_domain
+ * during load balancing.
+ */
+struct sd_lb_stats {
+ struct sched_group *busiest; /* Busiest group in this sd */
+ struct sched_group *this; /* Local group in this sd */
+ unsigned long total_load; /* Total load of all groups in sd */
+ unsigned long total_pwr; /* Total power of all groups in sd */
+ unsigned long avg_load; /* Average load across all groups in sd */
+
+ /** Statistics of this group */
+ unsigned long this_load;
+ unsigned long this_load_per_task;
+ unsigned long this_nr_running;
+ unsigned int this_has_capacity;
+ unsigned int this_idle_cpus;
+
+ /* Statistics of the busiest group */
+ unsigned int busiest_idle_cpus;
+ unsigned long max_load;
+ unsigned long busiest_load_per_task;
+ unsigned long busiest_nr_running;
+ unsigned long busiest_group_capacity;
+ unsigned int busiest_has_capacity;
+ unsigned int busiest_group_weight;
+
+ int group_imb; /* Is there imbalance in this sd */
+
+ /* Variables of power aware scheduling */
+ unsigned int sd_utils; /* sum utilizations of this domain */
+ unsigned long sd_capacity; /* capacity of this domain */
+ struct sched_group *group_leader; /* Group which relieves group_min */
+ unsigned long min_load_per_task; /* load_per_task in group_min */
+ unsigned int leader_util; /* sum utilizations of group_leader */
+ unsigned int min_util; /* sum utilizations of group_min */
+};
+
+/*
+ * sg_lb_stats - stats of a sched_group required for load_balancing
+ */
+struct sg_lb_stats {
+ unsigned long avg_load; /*Avg load across the CPUs of the group */
+ unsigned long group_load; /* Total load over the CPUs of the group */
+ unsigned long sum_nr_running; /* Nr tasks running in the group */
+ unsigned long sum_weighted_load; /* Weighted load of group's tasks */
+ unsigned long group_capacity;
+ unsigned long idle_cpus;
+ unsigned long group_weight;
+ int group_imb; /* Is there an imbalance in the group ? */
+ int group_has_capacity; /* Is there extra capacity in the group? */
+ unsigned int group_utils; /* sum utilizations of group */
+
+ unsigned long sum_shared_running; /* 0 on non-NUMA */
+};
+
+static inline int
+fix_small_capacity(struct sched_domain *sd, struct sched_group *group);
+
+/*
+ * Try to collect the task running number and capacity of the group.
+ */
+static void get_sg_power_stats(struct sched_group *group,
+ struct sched_domain *sd, struct sg_lb_stats *sgs)
+{
+ int i;
+
+ for_each_cpu(i, sched_group_cpus(group)) {
+ struct rq *rq = cpu_rq(i);
+
+ sgs->group_utils += rq->nr_running;
+ }
+
+ sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
+ SCHED_POWER_SCALE);
+ if (!sgs->group_capacity)
+ sgs->group_capacity = fix_small_capacity(sd, group);
+ sgs->group_weight = group->group_weight;
+}
+
+/*
+ * Try to collect the task running number and capacity of the domain.
+ */
+static void get_sd_power_stats(struct sched_domain *sd,
+ struct task_struct *p, struct sd_lb_stats *sds)
+{
+ struct sched_group *group;
+ struct sg_lb_stats sgs;
+ int sd_min_delta = INT_MAX;
+ int cpu = task_cpu(p);
+
+ group = sd->groups;
+ do {
+ long g_delta;
+ unsigned long threshold;
+
+ if (!cpumask_test_cpu(cpu, sched_group_mask(group)))
+ continue;
+
+ memset(&sgs, 0, sizeof(sgs));
+ get_sg_power_stats(group, sd, &sgs);
+
+ if (sched_policy == SCHED_POLICY_POWERSAVING)
+ threshold = sgs.group_weight;
+ else
+ threshold = sgs.group_capacity;
+
+ g_delta = threshold - sgs.group_utils;
+
+ if (g_delta > 0 && g_delta < sd_min_delta) {
+ sd_min_delta = g_delta;
+ sds->group_leader = group;
+ }
+
+ sds->sd_utils += sgs.group_utils;
+ sds->total_pwr += group->sgp->power;
+ } while (group = group->next, group != sd->groups);
+
+ sds->sd_capacity = DIV_ROUND_CLOSEST(sds->total_pwr,
+ SCHED_POWER_SCALE);
+}
+
+/*
+ * Execute power policy if this domain is not full.
+ */
+static inline int get_sd_sched_policy(struct sched_domain *sd,
+ int cpu, struct task_struct *p, struct sd_lb_stats *sds)
+{
+ unsigned long threshold;
+
+ if (sched_policy == SCHED_POLICY_PERFORMANCE)
+ return SCHED_POLICY_PERFORMANCE;
+
+ memset(sds, 0, sizeof(*sds));
+ get_sd_power_stats(sd, p, sds);
+
+ if (sched_policy == SCHED_POLICY_POWERSAVING)
+ threshold = sd->span_weight;
+ else
+ threshold = sds->sd_capacity;
+
+ /* still can hold one more task in this domain */
+ if (sds->sd_utils < threshold)
+ return sched_policy;
+
+ return SCHED_POLICY_PERFORMANCE;
+}
+
+/*
+ * If the power policy is eligible for this domain and it has a cpu the
+ * task is allowed on, we will select a CPU from this domain.
+ */
+static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
+ struct task_struct *p, struct sd_lb_stats *sds)
+{
+ int policy;
+ int new_cpu = -1;
+
+ policy = get_sd_sched_policy(sd, cpu, p, sds);
+ if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
+ new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+
+ return new_cpu;
+}
+
+/*
+ * select_task_rq_fair: balance the current task (running on cpu) in domains
* that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
* SD_BALANCE_EXEC.
*
- * Balance, ie. select the least loaded group.
- *
* Returns the target CPU number, or the same CPU if no balancing is needed.
*
* preempt must be disabled.
*/
static int
-select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
int prev_cpu = task_cpu(p);
int new_cpu = cpu;
int want_affine = 0;
- int sync = wake_flags & WF_SYNC;
+ int sync = flags & WF_SYNC;
+ struct sd_lb_stats sds;
if (p->nr_cpus_allowed == 1)
return prev_cpu;
@@ -3362,11 +3526,20 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
break;
}
- if (tmp->flags & sd_flag)
+ if (tmp->flags & sd_flag) {
sd = tmp;
+
+ new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+ if (new_cpu != -1)
+ goto unlock;
+ }
}
if (affine_sd) {
+ new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+ if (new_cpu != -1)
+ goto unlock;
+
if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
prev_cpu = cpu;
@@ -4167,51 +4340,6 @@ static unsigned long task_h_load(struct task_struct *p)
#endif
/********** Helpers for find_busiest_group ************************/
-/*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * during load balancing.
- */
-struct sd_lb_stats {
- struct sched_group *busiest; /* Busiest group in this sd */
- struct sched_group *this; /* Local group in this sd */
- unsigned long total_load; /* Total load of all groups in sd */
- unsigned long total_pwr; /* Total power of all groups in sd */
- unsigned long avg_load; /* Average load across all groups in sd */
-
- /** Statistics of this group */
- unsigned long this_load;
- unsigned long this_load_per_task;
- unsigned long this_nr_running;
- unsigned long this_has_capacity;
- unsigned int this_idle_cpus;
-
- /* Statistics of the busiest group */
- unsigned int busiest_idle_cpus;
- unsigned long max_load;
- unsigned long busiest_load_per_task;
- unsigned long busiest_nr_running;
- unsigned long busiest_group_capacity;
- unsigned long busiest_has_capacity;
- unsigned int busiest_group_weight;
-
- int group_imb; /* Is there imbalance in this sd */
-};
-
-/*
- * sg_lb_stats - stats of a sched_group required for load_balancing
- */
-struct sg_lb_stats {
- unsigned long avg_load; /*Avg load across the CPUs of the group */
- unsigned long group_load; /* Total load over the CPUs of the group */
- unsigned long sum_nr_running; /* Nr tasks running in the group */
- unsigned long sum_weighted_load; /* Weighted load of group's tasks */
- unsigned long group_capacity;
- unsigned long idle_cpus;
- unsigned long group_weight;
- int group_imb; /* Is there an imbalance in the group ? */
- int group_has_capacity; /* Is there extra capacity in the group? */
-};
-
/**
* get_sd_load_idx - Obtain the load index for a given sched domain.
* @sd: The sched_domain whose load_idx is to be obtained.
--
1.7.12
^ permalink raw reply related [flat|nested] 88+ messages in thread
* [patch v4 13/18] sched: packing small tasks in wake/exec balancing
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
` (11 preceding siblings ...)
2013-01-24 3:06 ` [patch v4 12/18] sched: add power aware scheduling in fork/exec/wake Alex Shi
@ 2013-01-24 3:06 ` Alex Shi
2013-01-24 3:06 ` [patch v4 14/18] sched: add power/performance balance allowed flag Alex Shi
` (7 subsequent siblings)
20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24 3:06 UTC (permalink / raw)
To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi
If the woken/execed task is idle enough, it has a chance to be
packed onto a cpu which is busy but still has spare time to serve it.
Morten Rasmussen caught a bug and suggested using different criteria for
different policies, thanks!
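The packing criterion used by find_leader_cpu below can be sketched in user space. FULL_UTIL, the clamping of nr_running, the 4x weighting of the task's util (putil << 2), and the 60% cap for the non-powersaving policies mirror the patch; the helper itself is a simplified model, not kernel code:

```c
#include <assert.h>

#define FULL_UTIL 100	/* utilization expressed as a percentage */

/*
 * Vacancy left on a cpu after packing a task whose utilization is
 * 'putil' percent: the task's util is weighted by 4 (putil << 2), so
 * only tasks below 25% util can fit under the powersaving cap
 * (FULL_UTIL), and below 15% under the 60% cap of the other policies.
 */
static int vacancy(int max_util, int cpu_util, int nr_running, int putil)
{
	if (nr_running < 1)
		nr_running = 1;	/* the patch clamps nr_running the same way */
	return max_util - (cpu_util * nr_running + (putil << 2));
}
```

A cpu is a packing candidate only when its vacancy is positive; the group member with the smallest positive vacancy becomes the leader cpu.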
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 60 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7c7d9db..eede065 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3466,19 +3466,72 @@ static inline int get_sd_sched_policy(struct sched_domain *sd,
}
/*
+ * find_leader_cpu - among the cpus in the group, find the busiest cpu
+ * that still has enough leisure time.
+ */
+static int
+find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu,
+ int policy)
+{
+ /* percentage of the task's util */
+ unsigned putil = p->se.avg.runnable_avg_sum * 100
+ / (p->se.avg.runnable_avg_period + 1);
+
+ struct rq *rq = cpu_rq(this_cpu);
+ int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
+ int vacancy, min_vacancy = INT_MAX, max_util;
+ int leader_cpu = -1;
+ int i;
+
+ if (policy == SCHED_POLICY_POWERSAVING)
+ max_util = FULL_UTIL;
+ else
+ /* maximum allowable util is 60% */
+ max_util = 60;
+
+ /* bias toward local cpu */
+ if (cpumask_test_cpu(this_cpu, tsk_cpus_allowed(p)) &&
+ max_util - (rq->util * nr_running + (putil << 2)) > 0)
+ return this_cpu;
+
+ /* Traverse only the allowed CPUs */
+ for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
+ if (i == this_cpu)
+ continue;
+
+ rq = cpu_rq(i);
+ nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
+
+ /* only light tasks allowed, e.g. putil < 25% for powersaving */
+ vacancy = max_util - (rq->util * nr_running + (putil << 2));
+
+ if (vacancy > 0 && vacancy < min_vacancy) {
+ min_vacancy = vacancy;
+ leader_cpu = i;
+ }
+ }
+ return leader_cpu;
+}
+
+/*
* If the power policy is eligible for this domain, and the domain has a CPU
* allowed for the task, we will select a CPU from this domain.
*/
static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
- struct task_struct *p, struct sd_lb_stats *sds)
+ struct task_struct *p, struct sd_lb_stats *sds, int fork)
{
int policy;
int new_cpu = -1;
policy = get_sd_sched_policy(sd, cpu, p, sds);
- if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
- new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
-
+ if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
+ if (!fork)
+ new_cpu = find_leader_cpu(sds->group_leader,
+ p, cpu, policy);
+ /* for fork balancing or a slightly busy task */
+ if (new_cpu == -1)
+ new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+ }
return new_cpu;
}
@@ -3529,14 +3582,15 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
if (tmp->flags & sd_flag) {
sd = tmp;
- new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+ new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
+ flags & SD_BALANCE_FORK);
if (new_cpu != -1)
goto unlock;
}
}
if (affine_sd) {
- new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+ new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 0);
if (new_cpu != -1)
goto unlock;
--
1.7.12
^ permalink raw reply related [flat|nested] 88+ messages in thread
* [patch v4 14/18] sched: add power/performance balance allowed flag
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
` (12 preceding siblings ...)
2013-01-24 3:06 ` [patch v4 13/18] sched: packing small tasks in wake/exec balancing Alex Shi
@ 2013-01-24 3:06 ` Alex Shi
2013-01-24 3:06 ` [patch v4 15/18] sched: pull all tasks from source group Alex Shi
` (6 subsequent siblings)
20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24 3:06 UTC (permalink / raw)
To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi
If a sched domain is idle enough for power balance, power_lb
will be set and perf_lb cleared. If a sched domain is busy,
the values are set the opposite way.
If the domain is suitable for power balance but the balance should not
be done by this cpu, both perf_lb and power_lb are cleared to wait for a
suitable cpu to do the power balance. That means no balance at all,
neither power balance nor performance balance, will be done on this cpu.
The logic above is implemented by the following patches.
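The three flag combinations described above can be modelled as follows. The function and its argument names are illustrative, not from the patch; the real decisions are spread over the later patches in the series:

```c
#include <assert.h>

struct lb_flags {
	int power_lb;	/* if power balance needed */
	int perf_lb;	/* if performance balance needed */
};

/*
 * One of three states: busy domain -> performance balance only;
 * idle-enough domain and suitable cpu -> power balance only;
 * idle-enough domain but unsuitable cpu -> no balance at all.
 */
static struct lb_flags classify(int domain_idle_enough, int cpu_suitable)
{
	struct lb_flags f = { .power_lb = 0, .perf_lb = 1 };

	if (domain_idle_enough) {
		f.perf_lb = 0;
		f.power_lb = cpu_suitable;
	}
	return f;
}
```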
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eede065..19624f4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4039,6 +4039,8 @@ struct lb_env {
unsigned int loop;
unsigned int loop_break;
unsigned int loop_max;
+ int power_lb; /* if power balance needed */
+ int perf_lb; /* if performance balance needed */
};
/*
@@ -5180,6 +5182,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.idle = idle,
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
+ .power_lb = 0,
+ .perf_lb = 1,
};
cpumask_copy(cpus, cpu_active_mask);
--
1.7.12
^ permalink raw reply related [flat|nested] 88+ messages in thread
* [patch v4 15/18] sched: pull all tasks from source group
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
` (13 preceding siblings ...)
2013-01-24 3:06 ` [patch v4 14/18] sched: add power/performance balance allowed flag Alex Shi
@ 2013-01-24 3:06 ` Alex Shi
2013-01-24 3:06 ` [patch v4 16/18] sched: don't care if the local group has capacity Alex Shi
` (5 subsequent siblings)
20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24 3:06 UTC (permalink / raw)
To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi
In power balance, we want some sched groups to be completely empty so
their CPUs can save power. So, we try to move all tasks off of them.
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
kernel/sched/fair.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 19624f4..a1ccb40 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5110,7 +5110,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
* When comparing with imbalance, use weighted_cpuload()
* which is not scaled with the cpu power.
*/
- if (capacity && rq->nr_running == 1 && wl > env->imbalance)
+ if (rq->nr_running == 0 ||
+ (!env->power_lb && capacity &&
+ rq->nr_running == 1 && wl > env->imbalance))
continue;
/*
@@ -5214,7 +5216,8 @@ redo:
ld_moved = 0;
lb_iterations = 1;
- if (busiest->nr_running > 1) {
+ if (busiest->nr_running > 1 ||
+ (busiest->nr_running == 1 && env.power_lb)) {
/*
* Attempt to move tasks. If find_busiest_group has found
* an imbalance but busiest->nr_running <= 1, the group is
--
1.7.12
^ permalink raw reply related [flat|nested] 88+ messages in thread
* [patch v4 16/18] sched: don't care if the local group has capacity
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
` (14 preceding siblings ...)
2013-01-24 3:06 ` [patch v4 15/18] sched: pull all tasks from source group Alex Shi
@ 2013-01-24 3:06 ` Alex Shi
2013-01-24 3:06 ` [patch v4 17/18] sched: power aware load balance, Alex Shi
` (4 subsequent siblings)
20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24 3:06 UTC (permalink / raw)
To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi
In power aware scheduling, we don't care about load weight and
don't want to pull tasks just because the local group has capacity:
the local group may have no tasks at the time, which is exactly what
power balance hopes for.
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
kernel/sched/fair.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1ccb40..94bd40b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4767,8 +4767,12 @@ static inline void update_sd_lb_stats(struct lb_env *env,
* extra check prevents the case where you always pull from the
* heaviest group when it is already under-utilized (possible
* with a large weight task outweighs the tasks on the system).
+ *
+ * In power aware scheduling, we don't care about load weight and
+ * don't want to pull tasks just because the local group has capacity.
*/
- if (prefer_sibling && !local_group && sds->this_has_capacity)
+ if (prefer_sibling && !local_group && sds->this_has_capacity
+ && env->perf_lb)
sgs.group_capacity = min(sgs.group_capacity, 1UL);
if (local_group) {
--
1.7.12
^ permalink raw reply related [flat|nested] 88+ messages in thread
* [patch v4 17/18] sched: power aware load balance,
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
` (15 preceding siblings ...)
2013-01-24 3:06 ` [patch v4 16/18] sched: don't care if the local group has capacity Alex Shi
@ 2013-01-24 3:06 ` Alex Shi
2013-01-24 3:07 ` [patch v4 18/18] sched: lazy power balance Alex Shi
` (3 subsequent siblings)
20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24 3:06 UTC (permalink / raw)
To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi
This patch enables the power aware consideration in load balance.
As mentioned in the power aware scheduler proposal, power aware
scheduling has 2 assumptions:
1, racing to idle is helpful for power saving
2, packing tasks onto fewer sched_groups will reduce power consumption
The first assumption makes the performance policy take over scheduling
when the system is busy.
The second assumption makes power aware scheduling try to move
dispersed tasks into fewer groups until those groups are full of tasks.
This patch reuses some of Suresh's power saving load balance code.
A summary of the enabling logic:
1, Collect power aware scheduler statistics during performance load
balance statistics collection.
2, If the balance cpu is eligible for power load balance, just do it
and forget performance load balance. But if the domain is suitable for
power balance while the cpu is not appropriate, stop both
power and performance balance; otherwise do performance load balance.
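A user-space sketch of the per-group threshold check that drives the fallback to performance balance (FULL_UTIL and the policy names mirror the series; the helper itself is illustrative, not kernel code):

```c
#include <assert.h>

#define FULL_UTIL 100	/* one cpu's worth of utilization, in percent */

enum sched_policy {
	SCHED_POLICY_PERFORMANCE,
	SCHED_POLICY_POWERSAVING,
	SCHED_POLICY_BALANCE,
};

/*
 * Powersaving packs up to the group's thread count (SMT threads
 * included), while the balance policy packs only up to the group's
 * compute capacity, so powersaving tolerates more packing before
 * falling back to performance load balance.
 */
static int group_overloaded(enum sched_policy policy, unsigned long group_utils,
			    unsigned long group_weight, unsigned long group_capacity)
{
	unsigned long threshold = (policy == SCHED_POLICY_POWERSAVING) ?
				  group_weight : group_capacity;

	return group_utils > threshold * FULL_UTIL;
}
```

This also explains why the i = 4 balance-policy power reading varies in the test below: with HT, a core's capacity is 1 while its weight is 2, so the two thresholds straddle the load.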
A test can show the effect of the different policies:
for ((i = 0; i < I; i++)) ; do while true; do :; done & done
On my SNB laptop with 4 cores * HT (the data is Watts):

        powersaving   balance   performance
i = 2        40          54         54
i = 4        57          64*        68
i = 8        68          68         68

Note:
When i = 4 with the balance policy, the power may vary in the range of
57~68 Watts, since the HT capacity and core capacity are both 1.
On a SNB EP machine with 2 sockets * 8 cores * HT:

        powersaving   balance   performance
i = 4       190         201        238
i = 8       205         241        268
i = 16      271         348        376
If the system has only a few long-running tasks, the power policy can
gain both performance and power, as in a sysbench fileio randrw test
with 16 threads on the SNB EP box.
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
kernel/sched/fair.c | 127 ++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 124 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 94bd40b..a83ad90 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3350,6 +3350,7 @@ struct sd_lb_stats {
unsigned int sd_utils; /* sum utilizations of this domain */
unsigned long sd_capacity; /* capacity of this domain */
struct sched_group *group_leader; /* Group which relieves group_min */
+ struct sched_group *group_min; /* Least loaded group in sd */
unsigned long min_load_per_task; /* load_per_task in group_min */
unsigned int leader_util; /* sum utilizations of group_leader */
unsigned int min_util; /* sum utilizations of group_min */
@@ -4396,6 +4397,106 @@ static unsigned long task_h_load(struct task_struct *p)
#endif
/********** Helpers for find_busiest_group ************************/
+
+/**
+ * init_sd_lb_power_stats - Initialize power savings statistics for
+ * the given sched_domain, during load balancing.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics for sd.
+ */
+static inline void init_sd_lb_power_stats(struct lb_env *env,
+ struct sd_lb_stats *sds)
+{
+ if (sched_policy == SCHED_POLICY_PERFORMANCE ||
+ env->idle == CPU_NOT_IDLE) {
+ env->power_lb = 0;
+ env->perf_lb = 1;
+ return;
+ }
+ env->perf_lb = 0;
+ env->power_lb = 1;
+ sds->min_util = UINT_MAX;
+ sds->leader_util = 0;
+}
+
+/**
+ * update_sd_lb_power_stats - Update the power saving stats for a
+ * sched_domain while performing load balancing.
+ *
+ * @env: The load balancing environment.
+ * @group: sched_group belonging to the sched_domain under consideration.
+ * @sds: Variable containing the statistics of the sched_domain
+ * @local_group: Does group contain the CPU for which we're performing
+ * load balancing?
+ * @sgs: Variable containing the statistics of the group.
+ */
+static inline void update_sd_lb_power_stats(struct lb_env *env,
+ struct sched_group *group, struct sd_lb_stats *sds,
+ int local_group, struct sg_lb_stats *sgs)
+{
+ unsigned long threshold, threshold_util;
+
+ if (env->perf_lb)
+ return;
+
+ if (sched_policy == SCHED_POLICY_POWERSAVING)
+ threshold = sgs->group_weight;
+ else
+ threshold = sgs->group_capacity;
+ threshold_util = threshold * FULL_UTIL;
+
+ /*
+ * If the local group is idle or fully loaded,
+ * there is no need to do power savings balance at this domain
+ */
+ if (local_group && (!sgs->sum_nr_running ||
+ sgs->group_utils + FULL_UTIL > threshold_util))
+ env->power_lb = 0;
+
+ /* Do performance load balance if any group overload */
+ if (sgs->group_utils > threshold_util) {
+ env->perf_lb = 1;
+ env->power_lb = 0;
+ }
+
+ /*
+ * If a group is idle,
+ * don't include that group in power savings calculations
+ */
+ if (!env->power_lb || !sgs->sum_nr_running)
+ return;
+
+ /*
+ * Calculate the group which has the least non-idle load.
+ * This is the group from where we need to pick up the load
+ * for saving power
+ */
+ if ((sgs->group_utils < sds->min_util) ||
+ (sgs->group_utils == sds->min_util &&
+ group_first_cpu(group) > group_first_cpu(sds->group_min))) {
+ sds->group_min = group;
+ sds->min_util = sgs->group_utils;
+ sds->min_load_per_task = sgs->sum_weighted_load /
+ sgs->sum_nr_running;
+ }
+
+ /*
+ * Calculate the group which is near its
+ * capacity but still has some space to pick up some load
+ * from other groups and save more power
+ */
+ if (sgs->group_utils + FULL_UTIL > threshold_util)
+ return;
+
+ if (sgs->group_utils > sds->leader_util ||
+ (sgs->group_utils == sds->leader_util && sds->group_leader &&
+ group_first_cpu(group) < group_first_cpu(sds->group_leader))) {
+ sds->group_leader = group;
+ sds->leader_util = sgs->group_utils;
+ }
+}
+
/**
* get_sd_load_idx - Obtain the load index for a given sched domain.
* @sd: The sched_domain whose load_idx is to be obtained.
@@ -4635,6 +4736,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_load += load;
sgs->sum_nr_running += nr_running;
sgs->sum_weighted_load += weighted_cpuload(i);
+
+ /* accumulate the maximum potential util */
+ if (!nr_running)
+ nr_running = 1;
+ sgs->group_utils += rq->util * nr_running;
+
if (idle_cpu(i))
sgs->idle_cpus++;
}
@@ -4743,6 +4850,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
if (child && child->flags & SD_PREFER_SIBLING)
prefer_sibling = 1;
+ init_sd_lb_power_stats(env, sds);
load_idx = get_sd_load_idx(env->sd, env->idle);
do {
@@ -4794,6 +4902,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
sds->group_imb = sgs.group_imb;
}
+ update_sd_lb_power_stats(env, sg, sds, local_group, &sgs);
sg = sg->next;
} while (sg != env->sd->groups);
}
@@ -5011,6 +5120,19 @@ find_busiest_group(struct lb_env *env, int *balance)
*/
update_sd_lb_stats(env, balance, &sds);
+ if (!env->perf_lb && !env->power_lb)
+ return NULL;
+
+ if (env->power_lb) {
+ if (sds.this == sds.group_leader &&
+ sds.group_leader != sds.group_min) {
+ env->imbalance = sds.min_load_per_task;
+ return sds.group_min;
+ }
+ env->power_lb = 0;
+ return NULL;
+ }
+
/*
* this_cpu is not the appropriate cpu to perform load balancing at
* this level.
@@ -5188,8 +5310,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.idle = idle,
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
- .power_lb = 0,
- .perf_lb = 1,
+ .power_lb = 1,
+ .perf_lb = 0,
};
cpumask_copy(cpus, cpu_active_mask);
@@ -6267,7 +6389,6 @@ void unregister_fair_sched_group(struct task_group *tg, int cpu) { }
#endif /* CONFIG_FAIR_GROUP_SCHED */
-
static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task)
{
struct sched_entity *se = &task->se;
--
1.7.12
^ permalink raw reply related [flat|nested] 88+ messages in thread
* [patch v4 18/18] sched: lazy power balance
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
` (16 preceding siblings ...)
2013-01-24 3:06 ` [patch v4 17/18] sched: power aware load balance, Alex Shi
@ 2013-01-24 3:07 ` Alex Shi
2013-01-24 9:44 ` [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Borislav Petkov
` (2 subsequent siblings)
20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24 3:07 UTC (permalink / raw)
To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi
When the number of active tasks in a sched domain hovers around the
power friendly scheduling criteria, scheduling will thrash between
power friendly balance and performance balance, bringing unnecessary
task migrations. The typical benchmark is 'make -j x'.
To remove this issue, introduce a u64 perf_lb_record variable to record
the performance load balance history. If there has been no performance
LB for 32 consecutive load balances, or no LB for 8 * max_interval ms,
or at most 4 performance LBs in the last 64 load balances, then we
accept a powersaving LB. Otherwise, give up this power aware
LB chance.
With this patch, the worst case for power scheduling -- kbuild -- gets
similar performance/power values among the different policies.
BTW, the lazy balance shows a performance gain when j goes up to 32.
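The history check can be sketched in user space. The masks mirror the patch; hweight64_sketch stands in for the kernel's hweight64(), and the 8 * max_interval timeout that clears the record is omitted:

```c
#include <assert.h>
#include <stdint.h>

#define PERF_LB_HH_MASK 0xffffffff00000000ULL	/* balances 33..64 ago */
#define PERF_LB_LH_MASK 0xffffffffULL		/* last 32 balances */

/* population count, standing in for the kernel's hweight64() */
static int hweight64_sketch(uint64_t w)
{
	int n = 0;

	for (; w; w >>= 1)
		n += (int)(w & 1);
	return n;
}

/*
 * Each balance shifts the record left and sets bit 0 iff it was a
 * performance balance. A powersaving balance is accepted only when the
 * last 32 balances saw no performance balance at all, and the 32
 * before that saw at most 4.
 */
static int powersave_allowed(uint64_t perf_lb_record)
{
	return hweight64_sketch(perf_lb_record & PERF_LB_HH_MASK) <= 4 &&
	       !(perf_lb_record & PERF_LB_LH_MASK);
}
```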
On my 2-socket SNB EP machine with 8 cores * HT, 'make -j x' results:

         powersaving         balance             performance
x = 1    175.603 /417 13     175.220 /416 13     176.073 /407 13
x = 2    192.215 /218 23     194.522 /202 25     217.393 /200 23
x = 4    205.226 /124 39     208.823 /114 42     230.425 /105 41
x = 8    236.369 /71  59     249.005 /65  61     257.661 /62  62
x = 16   283.842 /48  73     307.465 /40  81     309.336 /39  82
x = 32   325.197 /32  96     333.503 /32  93     336.138 /32  92
How to read the data, e.g. 175.603 /417 13:
175.603: average Watts
417: seconds (compile time)
13: scaled performance/power = 1000000 / seconds / watts
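The scaled metric can be checked against the table with a few lines of user-space C (the helper name is ours; the formula is the one quoted above, with the result truncated to an integer):

```c
#include <assert.h>

/* scaled performance/power = 1000000 / seconds / watts, truncated */
static int perf_per_watt(double seconds, double watts)
{
	return (int)(1000000.0 / seconds / watts);
}
```

For example, the powersaving x = 1 row gives 1000000 / 417 / 175.603, which truncates to the reported 13.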
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 67 +++++++++++++++++++++++++++++++++++++++++----------
2 files changed, 55 insertions(+), 13 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 66b05e1..5051990 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -941,6 +941,7 @@ struct sched_domain {
unsigned long last_balance; /* init to jiffies. units in jiffies */
unsigned int balance_interval; /* initialise to 1. units in ms. */
unsigned int nr_balance_failed; /* initialise to 0 */
+ u64 perf_lb_record; /* performance balance record */
u64 last_update;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a83ad90..262d7ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4497,6 +4497,58 @@ static inline void update_sd_lb_power_stats(struct lb_env *env,
}
}
+#define PERF_LB_HH_MASK 0xffffffff00000000ULL
+#define PERF_LB_LH_MASK 0xffffffffULL
+
+/**
+ * need_perf_balance - Check if the performance load balance needed
+ * in the sched_domain.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics of the sched_domain
+ */
+static int need_perf_balance(struct lb_env *env, struct sd_lb_stats *sds)
+{
+ env->sd->perf_lb_record <<= 1;
+
+ if (env->perf_lb) {
+ env->sd->perf_lb_record |= 0x1;
+ return 1;
+ }
+
+ /*
+ * The situation isn't eligible for performance balance. If this_cpu
+ * is not eligible or the timing is not suitable for lazy powersaving
+ * balance, we will stop both powersaving and performance balance.
+ */
+ if (env->power_lb && sds->this == sds->group_leader
+ && sds->group_leader != sds->group_min) {
+ int interval;
+
+ /* powersaving balance interval set as 8 * max_interval */
+ interval = msecs_to_jiffies(8 * env->sd->max_interval);
+ if (time_after(jiffies, env->sd->last_balance + interval))
+ env->sd->perf_lb_record = 0;
+
+ /*
+ * An eligible time is when there was no performance balance in
+ * the last 32 balances and no more than 4 performance balances
+ * in the last 64, or no balance at all within the powersaving
+ * interval.
+ */
+ if ((hweight64(env->sd->perf_lb_record & PERF_LB_HH_MASK) <= 4)
+ && !(env->sd->perf_lb_record & PERF_LB_LH_MASK)) {
+
+ env->imbalance = sds->min_load_per_task;
+ return 0;
+ }
+
+ }
+ env->power_lb = 0;
+ sds->group_min = NULL;
+ return 0;
+}
+
/**
* get_sd_load_idx - Obtain the load index for a given sched domain.
* @sd: The sched_domain whose load_idx is to be obtained.
@@ -5087,7 +5139,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
}
/******* find_busiest_group() helpers end here *********************/
-
/**
* find_busiest_group - Returns the busiest group within the sched_domain
* if there is an imbalance. If there isn't an imbalance, and
@@ -5120,18 +5171,8 @@ find_busiest_group(struct lb_env *env, int *balance)
*/
update_sd_lb_stats(env, balance, &sds);
- if (!env->perf_lb && !env->power_lb)
- return NULL;
-
- if (env->power_lb) {
- if (sds.this == sds.group_leader &&
- sds.group_leader != sds.group_min) {
- env->imbalance = sds.min_load_per_task;
- return sds.group_min;
- }
- env->power_lb = 0;
- return NULL;
- }
+ if (!need_perf_balance(env, &sds))
+ return sds.group_min;
/*
* this_cpu is not the appropriate cpu to perform load balancing at
--
1.7.12
^ permalink raw reply related [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
` (17 preceding siblings ...)
2013-01-24 3:07 ` [patch v4 18/18] sched: lazy power balance Alex Shi
@ 2013-01-24 9:44 ` Borislav Petkov
2013-01-24 15:07 ` Alex Shi
2013-01-28 1:28 ` Alex Shi
2013-02-04 1:35 ` Alex Shi
20 siblings, 1 reply; 88+ messages in thread
From: Borislav Petkov @ 2013-01-24 9:44 UTC (permalink / raw)
To: Alex Shi
Cc: torvalds, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
efault, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Thu, Jan 24, 2013 at 11:06:42AM +0800, Alex Shi wrote:
> Since the runnable info needs 345ms to accumulate, balancing
> doesn't do well when many tasks wake in a burst. After talking with Mike
> Galbraith, we agreed to just use the runnable avg in power friendly
> scheduling and keep the current instant load in performance scheduling for
> low latency.
>
> So the biggest change in this version is removing runnable load avg in
> balance and just using runnable data in power balance.
>
> The patchset bases on Linus' tree, includes 3 parts,
> ** 1, bug fix and fork/wake balancing clean up. patch 1~5,
> ----------------------
> the first patch remove one domain level. patch 2~5 simplified fork/wake
> balancing, it can increase 10+% hackbench performance on our 4 sockets
> SNB EP machine.
Ok, I see some benchmarking results here and there in the commit
messages but since this is touching the scheduler, you probably would
need to make sure it doesn't introduce performance regressions vs
mainline with a comprehensive set of benchmarks.
And, AFAICR, mainline does by default the 'performance' scheme by
spreading out tasks to idle cores, so have you tried comparing vanilla
mainline to your patchset in the 'performance' setting so that you can
make sure there are no problems there? And not only hackbench or a
microbenchmark but aim9 (I saw that in a commit message somewhere) and
whatever else multithreaded benchmark you can get your hands on.
Also, you might want to run it on other machines too, not only SNB :-)
And what about ARM, maybe someone there can run your patchset too?
So, it would be cool to see comprehensive results from all those runs
and see what the numbers say.
Thanks.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-24 9:44 ` [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Borislav Petkov
@ 2013-01-24 15:07 ` Alex Shi
2013-01-27 2:41 ` Alex Shi
0 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24 15:07 UTC (permalink / raw)
To: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, efault, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On 01/24/2013 05:44 PM, Borislav Petkov wrote:
> On Thu, Jan 24, 2013 at 11:06:42AM +0800, Alex Shi wrote:
>> Since the runnable info needs 345ms to accumulate, balancing
>> doesn't do well when many tasks wake in a burst. After talking with Mike
>> Galbraith, we agreed to just use the runnable avg in power friendly
>> scheduling and keep the current instant load in performance scheduling for
>> low latency.
>>
>> So the biggest change in this version is removing runnable load avg in
>> balance and just using runnable data in power balance.
>>
>> The patchset bases on Linus' tree, includes 3 parts,
>> ** 1, bug fix and fork/wake balancing clean up. patch 1~5,
>> ----------------------
>> the first patch remove one domain level. patch 2~5 simplified fork/wake
>> balancing, it can increase 10+% hackbench performance on our 4 sockets
>> SNB EP machine.
>
> Ok, I see some benchmarking results here and there in the commit
> messages but since this is touching the scheduler, you probably would
> need to make sure it doesn't introduce performance regressions vs
> mainline with a comprehensive set of benchmarks.
>
Thanks a lot for your comments, Borislav! :)
For this patchset, the code just checks the current policy; if it is
performance, the code path falls back to the original performance code at
once. So there should be no performance change under the performance policy.
I once tested the balance policy performance with the benchmarks
kbuild/hackbench/aim9/dbench/tbench on version 2; only hackbench had a
small drop of ~3%, the others showed no clear change.
> And, AFAICR, mainline does by default the 'performance' scheme by
> spreading out tasks to idle cores, so have you tried comparing vanilla
> mainline to your patchset in the 'performance' setting so that you can
> make sure there are no problems there? And not only hackbench or a
> microbenchmark but aim9 (I saw that in a commit message somewhere) and
> whatever else multithreaded benchmark you can get your hands on.
>
> Also, you might want to run it on other machines too, not only SNB :-)
Anyway, I will redo the performance testing of this version on all
machines, but I don't expect anything to change. :)
> And what about ARM, maybe someone there can run your patchset too?
>
> So, it would be cool to see comprehensive results from all those runs
> and see what the numbers say.
>
> Thanks.
>
--
Thanks
Alex
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-24 15:07 ` Alex Shi
@ 2013-01-27 2:41 ` Alex Shi
2013-01-27 4:36 ` Mike Galbraith
2013-01-27 10:40 ` Borislav Petkov
0 siblings, 2 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-27 2:41 UTC (permalink / raw)
To: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, efault, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On 01/24/2013 11:07 PM, Alex Shi wrote:
> On 01/24/2013 05:44 PM, Borislav Petkov wrote:
>> On Thu, Jan 24, 2013 at 11:06:42AM +0800, Alex Shi wrote:
>>> Since the runnable info needs 345ms to accumulate, balancing
>>> doesn't do well for many tasks burst waking. After talking with Mike
>>> Galbraith, we are agree to just use runnable avg in power friendly
>>> scheduling and keep current instant load in performance scheduling for
>>> low latency.
>>>
>>> So the biggest change in this version is removing runnable load avg in
>>> balance and just using runnable data in power balance.
>>>
>>> The patchset bases on Linus' tree, includes 3 parts,
>>> ** 1, bug fix and fork/wake balancing clean up. patch 1~5,
>>> ----------------------
>>> the first patch remove one domain level. patch 2~5 simplified fork/wake
>>> balancing, it can increase 10+% hackbench performance on our 4 sockets
>>> SNB EP machine.
>>
>> Ok, I see some benchmarking results here and there in the commit
>> messages but since this is touching the scheduler, you probably would
>> need to make sure it doesn't introduce performance regressions vs
>> mainline with a comprehensive set of benchmarks.
>>
>
> Thanks a lot for your comments, Borislav! :)
>
> For this patchset, the code will just check current policy, if it is
> performance, the code patch will back to original performance code at
> once. So there should no performance change on performance policy.
>
> I once tested the balance policy performance with benchmark
> kbuild/hackbench/aim9/dbench/tbench on version 2, only hackbench has a
> bit drop ~3%. others have no clear change.
>
>> And, AFAICR, mainline does by default the 'performance' scheme by
>> spreading out tasks to idle cores, so have you tried comparing vanilla
>> mainline to your patchset in the 'performance' setting so that you can
>> make sure there are no problems there? And not only hackbench or a
>> microbenchmark but aim9 (I saw that in a commit message somewhere) and
>> whatever else multithreaded benchmark you can get your hands on.
>>
>> Also, you might want to run it on other machines too, not only SNB :-)
>
> Anyway I will redo the performance testing on this version again on all
> machine. but doesn't expect something change. :)
Just reran some benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
hackbench, fileio-cfq of sysbench, dbench, aiostress, and multithreaded
loopback netperf, on my core2, nhm, wsm, and snb platforms. No clear
performance change found.
I also tested the balance and powersaving policies with the above
benchmarks and found that specjbb2005 drops a lot, 30~50%, under both
policies, whether with openjdk or jrockit, and hackbench drops a lot
with the powersaving policy on the snb 4-socket platform. The others
show no clear change.
>
>> And what about ARM, maybe someone there can run your patchset too?
>>
>> So, it would be cool to see comprehensive results from all those runs
>> and see what the numbers say.
>>
>> Thanks.
>>
>
>
--
Thanks
Alex
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-27 2:41 ` Alex Shi
@ 2013-01-27 4:36 ` Mike Galbraith
2013-01-27 10:35 ` Borislav Petkov
2013-01-27 10:40 ` Borislav Petkov
1 sibling, 1 reply; 88+ messages in thread
From: Mike Galbraith @ 2013-01-27 4:36 UTC (permalink / raw)
To: Alex Shi
Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Sun, 2013-01-27 at 10:41 +0800, Alex Shi wrote:
> On 01/24/2013 11:07 PM, Alex Shi wrote:
> > On 01/24/2013 05:44 PM, Borislav Petkov wrote:
> >> On Thu, Jan 24, 2013 at 11:06:42AM +0800, Alex Shi wrote:
> >>> Since the runnable info needs 345ms to accumulate, balancing
> >>> doesn't do well for many tasks burst waking. After talking with Mike
> >>> Galbraith, we are agree to just use runnable avg in power friendly
> >>> scheduling and keep current instant load in performance scheduling for
> >>> low latency.
> >>>
> >>> So the biggest change in this version is removing runnable load avg in
> >>> balance and just using runnable data in power balance.
> >>>
> >>> The patchset bases on Linus' tree, includes 3 parts,
> >>> ** 1, bug fix and fork/wake balancing clean up. patch 1~5,
> >>> ----------------------
> >>> the first patch remove one domain level. patch 2~5 simplified fork/wake
> >>> balancing, it can increase 10+% hackbench performance on our 4 sockets
> >>> SNB EP machine.
> >>
> >> Ok, I see some benchmarking results here and there in the commit
> >> messages but since this is touching the scheduler, you probably would
> >> need to make sure it doesn't introduce performance regressions vs
> >> mainline with a comprehensive set of benchmarks.
> >>
> >
> > Thanks a lot for your comments, Borislav! :)
> >
> > For this patchset, the code will just check current policy, if it is
> > performance, the code patch will back to original performance code at
> > once. So there should no performance change on performance policy.
> >
> > I once tested the balance policy performance with benchmark
> > kbuild/hackbench/aim9/dbench/tbench on version 2, only hackbench has a
> > bit drop ~3%. others have no clear change.
> >
> >> And, AFAICR, mainline does by default the 'performance' scheme by
> >> spreading out tasks to idle cores, so have you tried comparing vanilla
> >> mainline to your patchset in the 'performance' setting so that you can
> >> make sure there are no problems there? And not only hackbench or a
> >> microbenchmark but aim9 (I saw that in a commit message somewhere) and
> >> whatever else multithreaded benchmark you can get your hands on.
> >>
> >> Also, you might want to run it on other machines too, not only SNB :-)
> >
> > Anyway I will redo the performance testing on this version again on all
> > machine. but doesn't expect something change. :)
>
> Just rerun some benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
> loopback netperf. on my core2, nhm, wsm, snb, platforms. no clear
> performance change found.
With aim7 compute on a 4-node, 40-core box, I see a stable throughput
improvement at tasks = nr_cores and below with balance and powersaving.
3.8.0-performance 3.8.0-balance 3.8.0-powersaving
Tasks jobs/min jti jobs/min/task real cpu jobs/min jti jobs/min/task real cpu jobs/min jti jobs/min/task real cpu
1 432.86 100 432.8571 14.00 3.99 433.48 100 433.4764 13.98 3.97 433.17 100 433.1665 13.99 3.98
1 437.23 100 437.2294 13.86 3.85 436.60 100 436.5994 13.88 3.86 435.66 100 435.6578 13.91 3.90
1 434.10 100 434.0974 13.96 3.95 436.29 100 436.2851 13.89 3.89 436.29 100 436.2851 13.89 3.87
5 2400.95 99 480.1902 12.62 12.49 2554.81 98 510.9612 11.86 7.55 2487.68 98 497.5369 12.18 8.22
5 2341.58 99 468.3153 12.94 13.95 2578.72 99 515.7447 11.75 7.25 2527.11 99 505.4212 11.99 7.90
5 2350.66 99 470.1319 12.89 13.66 2600.86 99 520.1717 11.65 7.09 2508.28 98 501.6556 12.08 8.24
10 4291.78 99 429.1785 14.12 40.14 5334.51 99 533.4507 11.36 11.13 5183.92 98 518.3918 11.69 12.15
10 4334.76 99 433.4764 13.98 38.70 5311.13 99 531.1131 11.41 11.23 5215.15 99 521.5146 11.62 12.53
10 4273.62 99 427.3625 14.18 40.29 5287.96 99 528.7958 11.46 11.46 5144.31 98 514.4312 11.78 12.32
20 8487.39 94 424.3697 14.28 63.14 10594.41 99 529.7203 11.44 23.72 10575.92 99 528.7958 11.46 22.08
20 8387.54 97 419.3772 14.45 77.01 10575.92 98 528.7958 11.46 23.41 10520.83 99 526.0417 11.52 21.88
20 8713.16 95 435.6578 13.91 55.10 10659.63 99 532.9815 11.37 24.17 10539.13 99 526.9565 11.50 22.13
40 16786.70 99 419.6676 14.44 170.08 19469.88 98 486.7470 12.45 60.78 19967.05 98 499.1763 12.14 51.40
40 16728.78 99 418.2195 14.49 172.96 19627.53 98 490.6883 12.35 65.26 20386.88 98 509.6720 11.89 46.91
40 16763.49 99 419.0871 14.46 171.42 20033.06 98 500.8264 12.10 51.44 20682.59 98 517.0648 11.72 42.45
No deltas after that. There were also no deltas between the patched
kernel using the performance policy and virgin source.
-Mike
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-27 4:36 ` Mike Galbraith
@ 2013-01-27 10:35 ` Borislav Petkov
2013-01-27 13:25 ` Alex Shi
0 siblings, 1 reply; 88+ messages in thread
From: Borislav Petkov @ 2013-01-27 10:35 UTC (permalink / raw)
To: Mike Galbraith
Cc: Alex Shi, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Sun, Jan 27, 2013 at 05:36:25AM +0100, Mike Galbraith wrote:
> With aim7 compute on 4 node 40 core box, I see stable throughput
> improvement at tasks = nr_cores and below w. balance and powersaving.
>
> 3.8.0-performance 3.8.0-balance 3.8.0-powersaving
> Tasks jobs/min jti jobs/min/task real cpu jobs/min jti jobs/min/task real cpu jobs/min jti jobs/min/task real cpu
> 1 432.86 100 432.8571 14.00 3.99 433.48 100 433.4764 13.98 3.97 433.17 100 433.1665 13.99 3.98
> 1 437.23 100 437.2294 13.86 3.85 436.60 100 436.5994 13.88 3.86 435.66 100 435.6578 13.91 3.90
> 1 434.10 100 434.0974 13.96 3.95 436.29 100 436.2851 13.89 3.89 436.29 100 436.2851 13.89 3.87
> 5 2400.95 99 480.1902 12.62 12.49 2554.81 98 510.9612 11.86 7.55 2487.68 98 497.5369 12.18 8.22
> 5 2341.58 99 468.3153 12.94 13.95 2578.72 99 515.7447 11.75 7.25 2527.11 99 505.4212 11.99 7.90
> 5 2350.66 99 470.1319 12.89 13.66 2600.86 99 520.1717 11.65 7.09 2508.28 98 501.6556 12.08 8.24
> 10 4291.78 99 429.1785 14.12 40.14 5334.51 99 533.4507 11.36 11.13 5183.92 98 518.3918 11.69 12.15
> 10 4334.76 99 433.4764 13.98 38.70 5311.13 99 531.1131 11.41 11.23 5215.15 99 521.5146 11.62 12.53
> 10 4273.62 99 427.3625 14.18 40.29 5287.96 99 528.7958 11.46 11.46 5144.31 98 514.4312 11.78 12.32
> 20 8487.39 94 424.3697 14.28 63.14 10594.41 99 529.7203 11.44 23.72 10575.92 99 528.7958 11.46 22.08
> 20 8387.54 97 419.3772 14.45 77.01 10575.92 98 528.7958 11.46 23.41 10520.83 99 526.0417 11.52 21.88
> 20 8713.16 95 435.6578 13.91 55.10 10659.63 99 532.9815 11.37 24.17 10539.13 99 526.9565 11.50 22.13
> 40 16786.70 99 419.6676 14.44 170.08 19469.88 98 486.7470 12.45 60.78 19967.05 98 499.1763 12.14 51.40
> 40 16728.78 99 418.2195 14.49 172.96 19627.53 98 490.6883 12.35 65.26 20386.88 98 509.6720 11.89 46.91
> 40 16763.49 99 419.0871 14.46 171.42 20033.06 98 500.8264 12.10 51.44 20682.59 98 517.0648 11.72 42.45
Ok, this is sick. How are balance and powersaving better than perf? Both
have many more jobs per minute than perf; is that because we pack many
more tasks per cpu with balance and powersaving?
Thanks.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-27 10:35 ` Borislav Petkov
@ 2013-01-27 13:25 ` Alex Shi
2013-01-27 15:51 ` Mike Galbraith
0 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-27 13:25 UTC (permalink / raw)
To: Borislav Petkov, Mike Galbraith, torvalds, mingo, peterz, tglx,
akpm, arjan, pjt, namhyung, vincent.guittot, gregkh, preeti,
viresh.kumar, linux-kernel
On 01/27/2013 06:35 PM, Borislav Petkov wrote:
> On Sun, Jan 27, 2013 at 05:36:25AM +0100, Mike Galbraith wrote:
>> With aim7 compute on 4 node 40 core box, I see stable throughput
>> improvement at tasks = nr_cores and below w. balance and powersaving.
>>
>> 3.8.0-performance 3.8.0-balance 3.8.0-powersaving
>> Tasks jobs/min jti jobs/min/task real cpu jobs/min jti jobs/min/task real cpu jobs/min jti jobs/min/task real cpu
>> 1 432.86 100 432.8571 14.00 3.99 433.48 100 433.4764 13.98 3.97 433.17 100 433.1665 13.99 3.98
>> 1 437.23 100 437.2294 13.86 3.85 436.60 100 436.5994 13.88 3.86 435.66 100 435.6578 13.91 3.90
>> 1 434.10 100 434.0974 13.96 3.95 436.29 100 436.2851 13.89 3.89 436.29 100 436.2851 13.89 3.87
>> 5 2400.95 99 480.1902 12.62 12.49 2554.81 98 510.9612 11.86 7.55 2487.68 98 497.5369 12.18 8.22
>> 5 2341.58 99 468.3153 12.94 13.95 2578.72 99 515.7447 11.75 7.25 2527.11 99 505.4212 11.99 7.90
>> 5 2350.66 99 470.1319 12.89 13.66 2600.86 99 520.1717 11.65 7.09 2508.28 98 501.6556 12.08 8.24
>> 10 4291.78 99 429.1785 14.12 40.14 5334.51 99 533.4507 11.36 11.13 5183.92 98 518.3918 11.69 12.15
>> 10 4334.76 99 433.4764 13.98 38.70 5311.13 99 531.1131 11.41 11.23 5215.15 99 521.5146 11.62 12.53
>> 10 4273.62 99 427.3625 14.18 40.29 5287.96 99 528.7958 11.46 11.46 5144.31 98 514.4312 11.78 12.32
>> 20 8487.39 94 424.3697 14.28 63.14 10594.41 99 529.7203 11.44 23.72 10575.92 99 528.7958 11.46 22.08
>> 20 8387.54 97 419.3772 14.45 77.01 10575.92 98 528.7958 11.46 23.41 10520.83 99 526.0417 11.52 21.88
>> 20 8713.16 95 435.6578 13.91 55.10 10659.63 99 532.9815 11.37 24.17 10539.13 99 526.9565 11.50 22.13
>> 40 16786.70 99 419.6676 14.44 170.08 19469.88 98 486.7470 12.45 60.78 19967.05 98 499.1763 12.14 51.40
>> 40 16728.78 99 418.2195 14.49 172.96 19627.53 98 490.6883 12.35 65.26 20386.88 98 509.6720 11.89 46.91
>> 40 16763.49 99 419.0871 14.46 171.42 20033.06 98 500.8264 12.10 51.44 20682.59 98 517.0648 11.72 42.45
>
> Ok, this is sick. How is balance and powersaving better than perf? Both
> have much more jobs per minute than perf; is that because we do pack
> much more tasks per cpu with balance and powersaving?
Maybe it is due to the lazier balancing under balance/powersaving. You
can check the context-switch (CS) counts in /proc/<pid>/status.
>
> Thanks.
>
--
Thanks
Alex
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-27 13:25 ` Alex Shi
@ 2013-01-27 15:51 ` Mike Galbraith
2013-01-28 5:17 ` Mike Galbraith
0 siblings, 1 reply; 88+ messages in thread
From: Mike Galbraith @ 2013-01-27 15:51 UTC (permalink / raw)
To: Alex Shi
Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Sun, 2013-01-27 at 21:25 +0800, Alex Shi wrote:
> On 01/27/2013 06:35 PM, Borislav Petkov wrote:
> > On Sun, Jan 27, 2013 at 05:36:25AM +0100, Mike Galbraith wrote:
> >> With aim7 compute on 4 node 40 core box, I see stable throughput
> >> improvement at tasks = nr_cores and below w. balance and powersaving.
...
> > Ok, this is sick. How is balance and powersaving better than perf? Both
> > have much more jobs per minute than perf; is that because we do pack
> > much more tasks per cpu with balance and powersaving?
>
> Maybe it is due to the lazy balancing on balance/powersaving. You can
> check the CS times in /proc/pid/status.
Well, it's not the wakeup path; limiting entry frequency per waker did
zip squat nada to any policy's throughput.
-Mike
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-27 15:51 ` Mike Galbraith
@ 2013-01-28 5:17 ` Mike Galbraith
2013-01-28 5:51 ` Alex Shi
` (2 more replies)
0 siblings, 3 replies; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28 5:17 UTC (permalink / raw)
To: Alex Shi
Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Sun, 2013-01-27 at 16:51 +0100, Mike Galbraith wrote:
> On Sun, 2013-01-27 at 21:25 +0800, Alex Shi wrote:
> > On 01/27/2013 06:35 PM, Borislav Petkov wrote:
> > > On Sun, Jan 27, 2013 at 05:36:25AM +0100, Mike Galbraith wrote:
> > >> With aim7 compute on 4 node 40 core box, I see stable throughput
> > >> improvement at tasks = nr_cores and below w. balance and powersaving.
> ...
> > > Ok, this is sick. How is balance and powersaving better than perf? Both
> > > have much more jobs per minute than perf; is that because we do pack
> > > much more tasks per cpu with balance and powersaving?
> >
> > Maybe it is due to the lazy balancing on balance/powersaving. You can
> > check the CS times in /proc/pid/status.
>
> Well, it's not wakeup path, limiting entry frequency per waker did zip
> squat nada to any policy throughput.
monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
043321 00058616
043313 00058616
043318 00058968
043317 00058968
043316 00059184
043319 00059192
043320 00059048
043314 00059048
043312 00058176
043315 00058184
monteverdi:/abuild/mike/:[0]# echo balance > /sys/devices/system/cpu/sched_policy/current_sched_policy
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
043337 00053448
043333 00053456
043338 00052992
043331 00053448
043332 00053488
043335 00053496
043334 00053480
043329 00053288
043336 00053464
043330 00053496
monteverdi:/abuild/mike/:[0]# echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
043348 00052488
043344 00052488
043349 00052744
043343 00052504
043347 00052504
043352 00052888
043345 00052504
043351 00052496
043346 00052496
043350 00052304
monteverdi:/abuild/mike/:[0]#
Zzzt. Wish I could turn turbo thingy off.
-Mike
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 5:17 ` Mike Galbraith
@ 2013-01-28 5:51 ` Alex Shi
2013-01-28 6:15 ` Mike Galbraith
2013-01-28 9:55 ` Borislav Petkov
2013-01-28 15:47 ` Mike Galbraith
2 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-28 5:51 UTC (permalink / raw)
To: Mike Galbraith
Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On 01/28/2013 01:17 PM, Mike Galbraith wrote:
> On Sun, 2013-01-27 at 16:51 +0100, Mike Galbraith wrote:
>> On Sun, 2013-01-27 at 21:25 +0800, Alex Shi wrote:
>>> On 01/27/2013 06:35 PM, Borislav Petkov wrote:
>>>> On Sun, Jan 27, 2013 at 05:36:25AM +0100, Mike Galbraith wrote:
>>>>> With aim7 compute on 4 node 40 core box, I see stable throughput
>>>>> improvement at tasks = nr_cores and below w. balance and powersaving.
>> ...
>>>> Ok, this is sick. How is balance and powersaving better than perf? Both
>>>> have much more jobs per minute than perf; is that because we do pack
>>>> much more tasks per cpu with balance and powersaving?
>>>
>>> Maybe it is due to the lazy balancing on balance/powersaving. You can
>>> check the CS times in /proc/pid/status.
>>
>> Well, it's not wakeup path, limiting entry frequency per waker did zip
>> squat nada to any policy throughput.
>
> monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 043321 00058616
> 043313 00058616
> 043318 00058968
> 043317 00058968
> 043316 00059184
> 043319 00059192
> 043320 00059048
> 043314 00059048
> 043312 00058176
> 043315 00058184
> monteverdi:/abuild/mike/:[0]# echo balance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 043337 00053448
> 043333 00053456
> 043338 00052992
> 043331 00053448
> 043332 00053488
> 043335 00053496
> 043334 00053480
> 043329 00053288
> 043336 00053464
> 043330 00053496
> monteverdi:/abuild/mike/:[0]# echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 043348 00052488
> 043344 00052488
> 043349 00052744
> 043343 00052504
> 043347 00052504
> 043352 00052888
> 043345 00052504
> 043351 00052496
> 043346 00052496
> 043350 00052304
> monteverdi:/abuild/mike/:[0]#
Similar to the aim7 results. Thanks, Mike!
Would you like to collect vmstat info in the background?
>
> Zzzt. Wish I could turn turbo thingy off.
Do you mean the turbo mode of the cpu frequency? I remember some
machines can disable it in the BIOS.
>
> -Mike
>
--
Thanks Alex
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 5:51 ` Alex Shi
@ 2013-01-28 6:15 ` Mike Galbraith
2013-01-28 6:42 ` Mike Galbraith
0 siblings, 1 reply; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28 6:15 UTC (permalink / raw)
To: Alex Shi
Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Mon, 2013-01-28 at 13:51 +0800, Alex Shi wrote:
> On 01/28/2013 01:17 PM, Mike Galbraith wrote:
> > On Sun, 2013-01-27 at 16:51 +0100, Mike Galbraith wrote:
> >> On Sun, 2013-01-27 at 21:25 +0800, Alex Shi wrote:
> >>> On 01/27/2013 06:35 PM, Borislav Petkov wrote:
> >>>> On Sun, Jan 27, 2013 at 05:36:25AM +0100, Mike Galbraith wrote:
> >>>>> With aim7 compute on 4 node 40 core box, I see stable throughput
> >>>>> improvement at tasks = nr_cores and below w. balance and powersaving.
> >> ...
> >>>> Ok, this is sick. How is balance and powersaving better than perf? Both
> >>>> have much more jobs per minute than perf; is that because we do pack
> >>>> much more tasks per cpu with balance and powersaving?
> >>>
> >>> Maybe it is due to the lazy balancing on balance/powersaving. You can
> >>> check the CS times in /proc/pid/status.
> >>
> >> Well, it's not wakeup path, limiting entry frequency per waker did zip
> >> squat nada to any policy throughput.
> >
> > monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
> > monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> > 043321 00058616
> > 043313 00058616
> > 043318 00058968
> > 043317 00058968
> > 043316 00059184
> > 043319 00059192
> > 043320 00059048
> > 043314 00059048
> > 043312 00058176
> > 043315 00058184
> > monteverdi:/abuild/mike/:[0]# echo balance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> > monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> > 043337 00053448
> > 043333 00053456
> > 043338 00052992
> > 043331 00053448
> > 043332 00053488
> > 043335 00053496
> > 043334 00053480
> > 043329 00053288
> > 043336 00053464
> > 043330 00053496
> > monteverdi:/abuild/mike/:[0]# echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> > monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> > 043348 00052488
> > 043344 00052488
> > 043349 00052744
> > 043343 00052504
> > 043347 00052504
> > 043352 00052888
> > 043345 00052504
> > 043351 00052496
> > 043346 00052496
> > 043350 00052304
> > monteverdi:/abuild/mike/:[0]#
>
> similar with aim7 results. Thanks, Mike!
>
> Wold you like to collect vmstat info in background?
> >
> > Zzzt. Wish I could turn turbo thingy off.
>
> Do you mean the turbo mode of cpu frequency? I remember some of machine
> can disable it in BIOS.
Yeah, I can do that on my local x3550 box. I can't fiddle with BIOS
settings on the remote NUMA box.
This can't be anything but the turbo gizmo mucking up the numbers, I
think; not that the numbers are invalid or anything, better numbers are
better numbers no matter where/how they come about ;-)
The massive_intr load is dirt simple sleep/spin with bean counting. It
sleeps 1ms, spins 8ms. Change that to sleep 8ms, grind away for 1ms...
monteverdi:/abuild/mike/:[0]# ./massive_intr 10 60
045150 00006484
045157 00006427
045156 00006401
045152 00006428
045155 00006372
045154 00006370
045158 00006453
045149 00006372
045151 00006371
045153 00006371
monteverdi:/abuild/mike/:[0]# echo balance > /sys/devices/system/cpu/sched_policy/current_sched_policy
monteverdi:/abuild/mike/:[0]# ./massive_intr 10 60
045170 00006380
045172 00006374
045169 00006376
045175 00006376
045171 00006334
045176 00006380
045168 00006374
045174 00006334
045177 00006375
045173 00006376
monteverdi:/abuild/mike/:[0]# echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
monteverdi:/abuild/mike/:[0]# ./massive_intr 10 60
045198 00006408
045191 00006408
045197 00006408
045192 00006411
045194 00006409
045196 00006409
045195 00006336
045189 00006336
045193 00006411
045190 00006410
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 6:15 ` Mike Galbraith
@ 2013-01-28 6:42 ` Mike Galbraith
2013-01-28 7:20 ` Mike Galbraith
2013-01-29 1:17 ` Alex Shi
0 siblings, 2 replies; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28 6:42 UTC (permalink / raw)
To: Alex Shi
Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Mon, 2013-01-28 at 07:15 +0100, Mike Galbraith wrote:
> On Mon, 2013-01-28 at 13:51 +0800, Alex Shi wrote:
> > On 01/28/2013 01:17 PM, Mike Galbraith wrote:
> > > On Sun, 2013-01-27 at 16:51 +0100, Mike Galbraith wrote:
> > >> On Sun, 2013-01-27 at 21:25 +0800, Alex Shi wrote:
> > >>> On 01/27/2013 06:35 PM, Borislav Petkov wrote:
> > >>>> On Sun, Jan 27, 2013 at 05:36:25AM +0100, Mike Galbraith wrote:
> > >>>>> With aim7 compute on 4 node 40 core box, I see stable throughput
> > >>>>> improvement at tasks = nr_cores and below w. balance and powersaving.
> > >> ...
> > >>>> Ok, this is sick. How is balance and powersaving better than perf? Both
> > >>>> have much more jobs per minute than perf; is that because we do pack
> > >>>> much more tasks per cpu with balance and powersaving?
> > >>>
> > >>> Maybe it is due to the lazy balancing on balance/powersaving. You can
> > >>> check the CS times in /proc/pid/status.
> > >>
> > >> Well, it's not wakeup path, limiting entry frequency per waker did zip
> > >> squat nada to any policy throughput.
> > >
> > > monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
> > > monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> > > 043321 00058616
> > > 043313 00058616
> > > 043318 00058968
> > > 043317 00058968
> > > 043316 00059184
> > > 043319 00059192
> > > 043320 00059048
> > > 043314 00059048
> > > 043312 00058176
> > > 043315 00058184
> > > monteverdi:/abuild/mike/:[0]# echo balance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> > > monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> > > 043337 00053448
> > > 043333 00053456
> > > 043338 00052992
> > > 043331 00053448
> > > 043332 00053488
> > > 043335 00053496
> > > 043334 00053480
> > > 043329 00053288
> > > 043336 00053464
> > > 043330 00053496
> > > monteverdi:/abuild/mike/:[0]# echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> > > monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> > > 043348 00052488
> > > 043344 00052488
> > > 043349 00052744
> > > 043343 00052504
> > > 043347 00052504
> > > 043352 00052888
> > > 043345 00052504
> > > 043351 00052496
> > > 043346 00052496
> > > 043350 00052304
> > > monteverdi:/abuild/mike/:[0]#
> >
> > similar with aim7 results. Thanks, Mike!
> >
> > Wold you like to collect vmstat info in background?
> > >
> > > Zzzt. Wish I could turn turbo thingy off.
> >
> > Do you mean the turbo mode of cpu frequency? I remember some of machine
> > can disable it in BIOS.
>
> Yeah, I can do that in my local x3550 box. I can't fiddle with BIOS
> settings on the remote NUMA box.
>
> This can't be anything but turbo gizmo mucking up the numbers I think,
> not that the numbers are invalid or anything, better numbers are better
> numbers no matter where/how they come about ;-)
>
> The massive_intr load is dirt simple sleep/spin with bean counting. It
> sleeps 1ms spins 8ms. Change that to sleep 8ms, grind away for 1ms...
>
> monteverdi:/abuild/mike/:[0]# ./massive_intr 10 60
> 045150 00006484
> 045157 00006427
> 045156 00006401
> 045152 00006428
> 045155 00006372
> 045154 00006370
> 045158 00006453
> 045149 00006372
> 045151 00006371
> 045153 00006371
> monteverdi:/abuild/mike/:[0]# echo balance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# ./massive_intr 10 60
> 045170 00006380
> 045172 00006374
> 045169 00006376
> 045175 00006376
> 045171 00006334
> 045176 00006380
> 045168 00006374
> 045174 00006334
> 045177 00006375
> 045173 00006376
> monteverdi:/abuild/mike/:[0]# echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# ./massive_intr 10 60
> 045198 00006408
> 045191 00006408
> 045197 00006408
> 045192 00006411
> 045194 00006409
> 045196 00006409
> 045195 00006336
> 045189 00006336
> 045193 00006411
> 045190 00006410
Back to the original 1ms sleep, 8ms work, turning the NUMA box into a
single-node 10-core box with numactl.
monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
045286 00043872
045289 00043464
045284 00043488
045287 00043440
045283 00043416
045281 00044456
045285 00043456
045288 00044312
045280 00043048
045282 00043240
monteverdi:/abuild/mike/:[0]# echo balance > /sys/devices/system/cpu/sched_policy/current_sched_policy
monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
045300 00052536
045307 00052472
045304 00052536
045299 00052536
045305 00052520
045306 00052528
045302 00052528
045303 00052528
045308 00052512
045301 00052520
monteverdi:/abuild/mike/:[0]# echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
045339 00052600
045340 00052608
045338 00052600
045337 00052608
045343 00052600
045341 00052600
045336 00052608
045335 00052616
045334 00052576
045342 00052600
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 6:42 ` Mike Galbraith
@ 2013-01-28 7:20 ` Mike Galbraith
2013-01-29 1:17 ` Alex Shi
1 sibling, 0 replies; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28 7:20 UTC (permalink / raw)
To: Alex Shi
Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Mon, 2013-01-28 at 07:42 +0100, Mike Galbraith wrote:
> Back to original 1ms sleep, 8ms work, turning NUMA box into a single
> node 10 core box with numactl.
(aim7 in one 10-core node, so spread; no delta.)
Benchmark                            Version  Machine      Run Date
AIM Multiuser Benchmark - Suite VII  "1.1"    powersaving  Jan 28 08:04:14 2013

Tasks   Jobs/Min   JTI   Real    CPU   Jobs/sec/task
    1      441.0   100   13.7    3.7          7.3508
    5     2516.6    98   12.0    8.1          8.3887
   10     5215.1    98   11.6   11.9          8.6919
   20    10475.4    99   11.6   21.7          8.7295
   40    20216.8    99   12.0   38.2          8.4237
   80    35568.6    99   13.6   71.4          7.4101
  160    57102.5    98   17.0  138.2          5.9482
  320    82099.9    97   23.6  271.1          4.2760

Benchmark                            Version  Machine  Run Date
AIM Multiuser Benchmark - Suite VII  "1.1"    balance  Jan 28 08:06:49 2013

Tasks   Jobs/Min   JTI   Real    CPU   Jobs/sec/task
    1      439.4   100   13.8    3.8          7.3241
    5     2583.1    98   11.7    7.2          8.6104
   10     5325.1    99   11.4   11.0          8.8752
   20    10687.8    99   11.3   23.6          8.9065
   40    20200.0    99   12.0   38.7          8.4167
   80    35464.5    98   13.7   71.4          7.3884
  160    57203.5    98   16.9  137.9          5.9587
  320    82065.2    98   23.6  271.1          4.2742

Benchmark                            Version  Machine      Run Date
AIM Multiuser Benchmark - Suite VII  "1.1"    performance  Jan 28 08:09:20 2013

Tasks   Jobs/Min   JTI   Real    CPU   Jobs/sec/task
    1      438.8   100   13.8    3.8          7.3135
    5     2634.8    99   11.5    7.2          8.7826
   10     5396.3    99   11.2   11.4          8.9938
   20    10725.7    99   11.3   24.0          8.9381
   40    20183.2    99   12.0   38.5          8.4097
   80    35620.9    99   13.6   71.4          7.4210
  160    57203.5    98   16.9  137.8          5.9587
  320    81995.8    98   23.7  271.3          4.2706
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 6:42 ` Mike Galbraith
2013-01-28 7:20 ` Mike Galbraith
@ 2013-01-29 1:17 ` Alex Shi
1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-29 1:17 UTC (permalink / raw)
To: Mike Galbraith
Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On 01/28/2013 02:42 PM, Mike Galbraith wrote:
> Back to original 1ms sleep, 8ms work, turning NUMA box into a single
> node 10 core box with numactl.
>
> monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
> 045286 00043872
> 045289 00043464
> 045284 00043488
> 045287 00043440
> 045283 00043416
> 045281 00044456
> 045285 00043456
> 045288 00044312
> 045280 00043048
> 045282 00043240
Um, no idea why the powersaving data is so low.
> monteverdi:/abuild/mike/:[0]# echo balance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
> 045300 00052536
> 045307 00052472
> 045304 00052536
> 045299 00052536
> 045305 00052520
> 045306 00052528
> 045302 00052528
> 045303 00052528
> 045308 00052512
> 045301 00052520
> monteverdi:/abuild/mike/:[0]# echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
> 045339 00052600
> 045340 00052608
> 045338 00052600
> 045337 00052608
> 045343 00052600
> 045341 00052600
> 045336 00052608
> 045335 00052616
> 045334 00052576
--
Thanks Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 5:17 ` Mike Galbraith
2013-01-28 5:51 ` Alex Shi
@ 2013-01-28 9:55 ` Borislav Petkov
2013-01-28 10:44 ` Mike Galbraith
2013-01-28 15:47 ` Mike Galbraith
2 siblings, 1 reply; 88+ messages in thread
From: Borislav Petkov @ 2013-01-28 9:55 UTC (permalink / raw)
To: Mike Galbraith
Cc: Alex Shi, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Mon, Jan 28, 2013 at 06:17:46AM +0100, Mike Galbraith wrote:
> Zzzt. Wish I could turn turbo thingy off.
Try setting /sys/devices/system/cpu/cpufreq/boost to 0.
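For reference, a tiny wrapper around that knob; a sketch only, since the sysfs file exists just when the active cpufreq driver exposes boost control:

```shell
boost_file=/sys/devices/system/cpu/cpufreq/boost

set_boost() {   # set_boost 0|1
    if [ -w "$boost_file" ]; then
        echo "$1" > "$boost_file"
    else
        echo "no boost control at $boost_file" >&2
        return 1
    fi
}

# e.g.: set_boost 0; run the benchmark; set_boost 1
```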
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 9:55 ` Borislav Petkov
@ 2013-01-28 10:44 ` Mike Galbraith
2013-01-28 11:29 ` Borislav Petkov
0 siblings, 1 reply; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28 10:44 UTC (permalink / raw)
To: Borislav Petkov
Cc: Alex Shi, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Mon, 2013-01-28 at 10:55 +0100, Borislav Petkov wrote:
> On Mon, Jan 28, 2013 at 06:17:46AM +0100, Mike Galbraith wrote:
> > Zzzt. Wish I could turn turbo thingy off.
>
> Try setting /sys/devices/system/cpu/cpufreq/boost to 0.
How convenient (test) works too.
So much for turbo boost theory. Nothing changed until I turned load
balancing off at NODE. High end went to hell (gee), but low end...
Benchmark                            Version  Machine                           Run Date
AIM Multiuser Benchmark - Suite VII  "1.1"    performance-no-node-load_balance  Jan 28 11:20:12 2013

Tasks   Jobs/Min   JTI   Real     CPU   Jobs/sec/task
    1      436.3   100   13.9     3.9          7.2714
    5     2637.1    99   11.5     7.3          8.7903
   10     5415.5    99   11.2    11.3          9.0259
   20    10603.7    99   11.4    24.8          8.8364
   40    20066.2    99   12.1    40.5          8.3609
   80    35079.6    99   13.8    75.5          7.3082
  160    55884.7    98   17.3   145.6          5.8213
  320    79345.3    98   24.4   287.4          4.1326
  640   100294.8    98   38.7   570.9          2.6118
 1280   115998.2    97   66.9  1132.8          1.5104
 2560   125820.0    97  123.3  2256.6          0.8191
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 10:44 ` Mike Galbraith
@ 2013-01-28 11:29 ` Borislav Petkov
2013-01-28 11:32 ` Mike Galbraith
2013-01-29 1:36 ` Alex Shi
0 siblings, 2 replies; 88+ messages in thread
From: Borislav Petkov @ 2013-01-28 11:29 UTC (permalink / raw)
To: Mike Galbraith
Cc: Alex Shi, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Mon, Jan 28, 2013 at 11:44:44AM +0100, Mike Galbraith wrote:
> On Mon, 2013-01-28 at 10:55 +0100, Borislav Petkov wrote:
> > On Mon, Jan 28, 2013 at 06:17:46AM +0100, Mike Galbraith wrote:
> > > Zzzt. Wish I could turn turbo thingy off.
> >
> > Try setting /sys/devices/system/cpu/cpufreq/boost to 0.
>
> How convenient (test) works too.
>
> So much for turbo boost theory. Nothing changed until I turned load
> balancing off at NODE. High end went to hell (gee), but low end...
>
> Benchmark Version Machine Run Date
> AIM Multiuser Benchmark - Suite VII "1.1" performance-no-node-load_balance Jan 28 11:20:12 2013
>
> Tasks Jobs/Min JTI Real CPU Jobs/sec/task
> 1 436.3 100 13.9 3.9 7.2714
> 5 2637.1 99 11.5 7.3 8.7903
> 10 5415.5 99 11.2 11.3 9.0259
> 20 10603.7 99 11.4 24.8 8.8364
> 40 20066.2 99 12.1 40.5 8.3609
> 80 35079.6 99 13.8 75.5 7.3082
> 160 55884.7 98 17.3 145.6 5.8213
> 320 79345.3 98 24.4 287.4 4.1326
If you're talking about those results from earlier:
Benchmark                            Version  Machine      Run Date
AIM Multiuser Benchmark - Suite VII  "1.1"    performance  Jan 28 08:09:20 2013

Tasks   Jobs/Min   JTI   Real    CPU   Jobs/sec/task
    1      438.8   100   13.8    3.8          7.3135
    5     2634.8    99   11.5    7.2          8.7826
   10     5396.3    99   11.2   11.4          8.9938
   20    10725.7    99   11.3   24.0          8.9381
   40    20183.2    99   12.0   38.5          8.4097
   80    35620.9    99   13.6   71.4          7.4210
  160    57203.5    98   16.9  137.8          5.9587
  320    81995.8    98   23.7  271.3          4.2706
then the above no_node-load_balance thing suffers a small-ish dip at 320
tasks, yeah.
And AFAICR, the effect of disabling boosting will be visible in the
small count tasks cases anyway because if you saturate the cores with
tasks, the boosting algorithms tend to get the box out of boosting for
the simple reason that the power/perf headroom simply disappears due to
the SOC being busy.
> 640 100294.8 98 38.7 570.9 2.6118
> 1280 115998.2 97 66.9 1132.8 1.5104
> 2560 125820.0 97 123.3 2256.6 0.8191
I dunno about those. maybe this is expected with so many tasks or do we
want to optimize that case further?
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 11:29 ` Borislav Petkov
@ 2013-01-28 11:32 ` Mike Galbraith
2013-01-28 11:40 ` Mike Galbraith
2013-01-29 1:32 ` Alex Shi
2013-01-29 1:36 ` Alex Shi
1 sibling, 2 replies; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28 11:32 UTC (permalink / raw)
To: Borislav Petkov
Cc: Alex Shi, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Mon, 2013-01-28 at 12:29 +0100, Borislav Petkov wrote:
> On Mon, Jan 28, 2013 at 11:44:44AM +0100, Mike Galbraith wrote:
> > On Mon, 2013-01-28 at 10:55 +0100, Borislav Petkov wrote:
> > > On Mon, Jan 28, 2013 at 06:17:46AM +0100, Mike Galbraith wrote:
> > > > Zzzt. Wish I could turn turbo thingy off.
> > >
> > > Try setting /sys/devices/system/cpu/cpufreq/boost to 0.
> >
> > How convenient (test) works too.
> >
> > So much for turbo boost theory. Nothing changed until I turned load
> > balancing off at NODE. High end went to hell (gee), but low end...
> >
> > Benchmark Version Machine Run Date
> > AIM Multiuser Benchmark - Suite VII "1.1" performance-no-node-load_balance Jan 28 11:20:12 2013
> >
> > Tasks Jobs/Min JTI Real CPU Jobs/sec/task
> > 1 436.3 100 13.9 3.9 7.2714
> > 5 2637.1 99 11.5 7.3 8.7903
> > 10 5415.5 99 11.2 11.3 9.0259
> > 20 10603.7 99 11.4 24.8 8.8364
> > 40 20066.2 99 12.1 40.5 8.3609
> > 80 35079.6 99 13.8 75.5 7.3082
> > 160 55884.7 98 17.3 145.6 5.8213
> > 320 79345.3 98 24.4 287.4 4.1326
>
> If you're talking about those results from earlier:
>
> Benchmark Version Machine Run Date
> AIM Multiuser Benchmark - Suite VII "1.1" performance Jan 28 08:09:20 2013
>
> Tasks Jobs/Min JTI Real CPU Jobs/sec/task
> 1 438.8 100 13.8 3.8 7.3135
> 5 2634.8 99 11.5 7.2 8.7826
> 10 5396.3 99 11.2 11.4 8.9938
> 20 10725.7 99 11.3 24.0 8.9381
> 40 20183.2 99 12.0 38.5 8.4097
> 80 35620.9 99 13.6 71.4 7.4210
> 160 57203.5 98 16.9 137.8 5.9587
> 320 81995.8 98 23.7 271.3 4.2706
>
> then the above no_node-load_balance thing suffers a small-ish dip at 320
> tasks, yeah.
No no, that's not restricted to one node. It's just overloaded because
I turned balancing off at the NODE domain level.
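Turning balancing off at one domain level is typically done through the CONFIG_SCHED_DEBUG procfs knobs; a sketch, where the SD_LOAD_BALANCE value (0x1 in 3.x kernels) and the presence of per-domain name files are era-specific assumptions, and the write is left as a dry run:

```shell
clear_flag() {   # clear_flag <flags> <mask>: print flags with mask bits cleared
    echo $(( $1 & ~$2 ))
}

for f in /proc/sys/kernel/sched_domain/cpu*/domain*/flags; do
    [ -w "$f" ] || continue
    name_file=${f%flags}name
    # only touch the NODE-level domain
    [ -r "$name_file" ] && [ "$(cat "$name_file")" = "NODE" ] || continue
    flags=$(cat "$f")
    echo "would run: echo $(clear_flag "$flags" 0x1) > $f"
done
```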
> And AFAICR, the effect of disabling boosting will be visible in the
> small count tasks cases anyway because if you saturate the cores with
> tasks, the boosting algorithms tend to get the box out of boosting for
> the simple reason that the power/perf headroom simply disappears due to
> the SOC being busy.
>
> > 640 100294.8 98 38.7 570.9 2.6118
> > 1280 115998.2 97 66.9 1132.8 1.5104
> > 2560 125820.0 97 123.3 2256.6 0.8191
>
> I dunno about those. maybe this is expected with so many tasks or do we
> want to optimize that case further?
When using all 4 nodes properly, that's still scaling. Here, I
intentionally screwed up balancing to watch the low end. High end is
expected wreckage.
-Mike
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 11:32 ` Mike Galbraith
@ 2013-01-28 11:40 ` Mike Galbraith
2013-01-28 15:22 ` Borislav Petkov
2013-01-29 1:32 ` Alex Shi
1 sibling, 1 reply; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28 11:40 UTC (permalink / raw)
To: Borislav Petkov
Cc: Alex Shi, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Mon, 2013-01-28 at 12:32 +0100, Mike Galbraith wrote:
> On Mon, 2013-01-28 at 12:29 +0100, Borislav Petkov wrote:
> > On Mon, Jan 28, 2013 at 11:44:44AM +0100, Mike Galbraith wrote:
> > > On Mon, 2013-01-28 at 10:55 +0100, Borislav Petkov wrote:
> > > > On Mon, Jan 28, 2013 at 06:17:46AM +0100, Mike Galbraith wrote:
> > > > > Zzzt. Wish I could turn turbo thingy off.
> > > >
> > > > Try setting /sys/devices/system/cpu/cpufreq/boost to 0.
> > >
> > > How convenient (test) works too.
> > >
> > > So much for turbo boost theory. Nothing changed until I turned load
> > > balancing off at NODE. High end went to hell (gee), but low end...
> > >
> > > Benchmark Version Machine Run Date
> > > AIM Multiuser Benchmark - Suite VII "1.1" performance-no-node-load_balance Jan 28 11:20:12 2013
> > >
> > > Tasks Jobs/Min JTI Real CPU Jobs/sec/task
> > > 1 436.3 100 13.9 3.9 7.2714
> > > 5 2637.1 99 11.5 7.3 8.7903
> > > 10 5415.5 99 11.2 11.3 9.0259
> > > 20 10603.7 99 11.4 24.8 8.8364
> > > 40 20066.2 99 12.1 40.5 8.3609
> > > 80 35079.6 99 13.8 75.5 7.3082
> > > 160 55884.7 98 17.3 145.6 5.8213
> > > 320 79345.3 98 24.4 287.4 4.1326
> >
> > If you're talking about those results from earlier:
> >
> > Benchmark Version Machine Run Date
> > AIM Multiuser Benchmark - Suite VII "1.1" performance Jan 28 08:09:20 2013
> >
> > Tasks Jobs/Min JTI Real CPU Jobs/sec/task
> > 1 438.8 100 13.8 3.8 7.3135
> > 5 2634.8 99 11.5 7.2 8.7826
> > 10 5396.3 99 11.2 11.4 8.9938
> > 20 10725.7 99 11.3 24.0 8.9381
> > 40 20183.2 99 12.0 38.5 8.4097
> > 80 35620.9 99 13.6 71.4 7.4210
> > 160 57203.5 98 16.9 137.8 5.9587
> > 320 81995.8 98 23.7 271.3 4.2706
> >
> > then the above no_node-load_balance thing suffers a small-ish dip at 320
> > tasks, yeah.
>
> No no, that's not restricted to one node. It's just overloaded because
> I turned balancing off at the NODE domain level.
Which shows only that I was multitasking, and in a rush. Boy was that
dumb. Hohum.
-Mike
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 11:40 ` Mike Galbraith
@ 2013-01-28 15:22 ` Borislav Petkov
2013-01-28 15:55 ` Mike Galbraith
0 siblings, 1 reply; 88+ messages in thread
From: Borislav Petkov @ 2013-01-28 15:22 UTC (permalink / raw)
To: Mike Galbraith
Cc: Alex Shi, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Mon, Jan 28, 2013 at 12:40:46PM +0100, Mike Galbraith wrote:
> > No no, that's not restricted to one node. It's just overloaded because
> > I turned balancing off at the NODE domain level.
>
> Which shows only that I was multitasking, and in a rush. Boy was that
> dumb. Hohum.
Ok, let's take a step back and slow it down a bit so that people like me
can understand it: you want to try it with disabled load balancing on
the node level, AFAICT. But with that many tasks, perf will suck anyway,
no? Unless you want to benchmark the numa-aware aspect and see whether
load balancing on the node level feels differently, perf-wise?
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 15:22 ` Borislav Petkov
@ 2013-01-28 15:55 ` Mike Galbraith
2013-01-29 1:38 ` Alex Shi
0 siblings, 1 reply; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28 15:55 UTC (permalink / raw)
To: Borislav Petkov
Cc: Alex Shi, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Mon, 2013-01-28 at 16:22 +0100, Borislav Petkov wrote:
> On Mon, Jan 28, 2013 at 12:40:46PM +0100, Mike Galbraith wrote:
> > > No no, that's not restricted to one node. It's just overloaded because
> > > I turned balancing off at the NODE domain level.
> >
> > Which shows only that I was multitasking, and in a rush. Boy was that
> > dumb. Hohum.
>
> Ok, let's take a step back and slow it down a bit so that people like me
> can understand it: you want to try it with disabled load balancing on
> the node level, AFAICT. But with that many tasks, perf will suck anyway,
> no? Unless you want to benchmark the numa-aware aspect and see whether
> load balancing on the node level feels differently, perf-wise?
The broken thought was, since it's not wakeup path, stop node balance..
but killing all of it killed FORK/EXEC balance, oops.
I think I'm done with this thing though. See mail I just sent. There
are better things to do than letting box jerk my chain endlessly ;-)
-Mike
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 15:55 ` Mike Galbraith
@ 2013-01-29 1:38 ` Alex Shi
0 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-29 1:38 UTC (permalink / raw)
To: Mike Galbraith
Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On 01/28/2013 11:55 PM, Mike Galbraith wrote:
> On Mon, 2013-01-28 at 16:22 +0100, Borislav Petkov wrote:
>> On Mon, Jan 28, 2013 at 12:40:46PM +0100, Mike Galbraith wrote:
>>>> No no, that's not restricted to one node. It's just overloaded because
>>>> I turned balancing off at the NODE domain level.
>>>
>>> Which shows only that I was multitasking, and in a rush. Boy was that
>>> dumb. Hohum.
>>
>> Ok, let's take a step back and slow it down a bit so that people like me
>> can understand it: you want to try it with disabled load balancing on
>> the node level, AFAICT. But with that many tasks, perf will suck anyway,
>> no? Unless you want to benchmark the numa-aware aspect and see whether
>> load balancing on the node level feels differently, perf-wise?
>
> The broken thought was, since it's not wakeup path, stop node balance..
> but killing all of it killed FORK/EXEC balance, oops.
Um, sure. So I guess all of the tasks were just running on one node.
>
> I think I'm done with this thing though. See mail I just sent. There
> are better things to do than letting box jerk my chain endlessly ;-)
>
> -Mike
>
--
Thanks Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 11:32 ` Mike Galbraith
2013-01-28 11:40 ` Mike Galbraith
@ 2013-01-29 1:32 ` Alex Shi
1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-29 1:32 UTC (permalink / raw)
To: Mike Galbraith
Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
>> then the above no_node-load_balance thing suffers a small-ish dip at 320
>> tasks, yeah.
>
> No no, that's not restricted to one node. It's just overloaded because
> I turned balancing off at the NODE domain level.
>
>> And AFAICR, the effect of disabling boosting will be visible in the
>> small count tasks cases anyway because if you saturate the cores with
>> tasks, the boosting algorithms tend to get the box out of boosting for
>> the simple reason that the power/perf headroom simply disappears due to
>> the SOC being busy.
>>
>>> 640 100294.8 98 38.7 570.9 2.6118
>>> 1280 115998.2 97 66.9 1132.8 1.5104
>>> 2560 125820.0 97 123.3 2256.6 0.8191
>>
>> I dunno about those. maybe this is expected with so many tasks or do we
>> want to optimize that case further?
>
> When using all 4 nodes properly, that's still scaling. Here, I
Without regular node balancing, only the wakeup balancing in
select_task_rq_fair is left for the aim7 testing (I assume you used the
shared workfile; most of the test is CPU-bound, with only a little
exec/fork load). Since wakeup balancing only happens within the same LLC
domain, I guess that is the reason for this.
> intentionally screwed up balancing to watch the low end. High end is
> expected wreckage.
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 11:29 ` Borislav Petkov
2013-01-28 11:32 ` Mike Galbraith
@ 2013-01-29 1:36 ` Alex Shi
1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-29 1:36 UTC (permalink / raw)
To: Borislav Petkov, Mike Galbraith, torvalds, mingo, peterz, tglx,
akpm, arjan, pjt, namhyung, vincent.guittot, gregkh, preeti,
viresh.kumar, linux-kernel
> Benchmark Version Machine Run Date
> AIM Multiuser Benchmark - Suite VII "1.1" performance Jan 28 08:09:20 2013
>
> Tasks Jobs/Min JTI Real CPU Jobs/sec/task
> 1 438.8 100 13.8 3.8 7.3135
> 5 2634.8 99 11.5 7.2 8.7826
> 10 5396.3 99 11.2 11.4 8.9938
> 20 10725.7 99 11.3 24.0 8.9381
> 40 20183.2 99 12.0 38.5 8.4097
> 80 35620.9 99 13.6 71.4 7.4210
> 160 57203.5 98 16.9 137.8 5.9587
> 320 81995.8 98 23.7 271.3 4.2706
>
> then the above no_node-load_balance thing suffers a small-ish dip at 320
> tasks, yeah.
>
> And AFAICR, the effect of disabling boosting will be visible in the
> small count tasks cases anyway because if you saturate the cores with
> tasks, the boosting algorithms tend to get the box out of boosting for
> the simple reason that the power/perf headroom simply disappears due to
> the SOC being busy.
Sure. And from the context of the email series, I guess this result has
boosting enabled, right?
>
>> 640 100294.8 98 38.7 570.9 2.6118
>> 1280 115998.2 97 66.9 1132.8 1.5104
>> 2560 125820.0 97 123.3 2256.6 0.8191
>
> I dunno about those. maybe this is expected with so many tasks or do we
> want to optimize that case further?
>
--
Thanks Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 5:17 ` Mike Galbraith
2013-01-28 5:51 ` Alex Shi
2013-01-28 9:55 ` Borislav Petkov
@ 2013-01-28 15:47 ` Mike Galbraith
2013-01-29 1:45 ` Alex Shi
2013-01-29 2:27 ` Alex Shi
2 siblings, 2 replies; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28 15:47 UTC (permalink / raw)
To: Alex Shi
Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Mon, 2013-01-28 at 06:17 +0100, Mike Galbraith wrote:
Ok damnit.
> monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 043321 00058616
> 043313 00058616
> 043318 00058968
> 043317 00058968
> 043316 00059184
> 043319 00059192
> 043320 00059048
> 043314 00059048
> 043312 00058176
> 043315 00058184
That was boost if you like, and free to roam 4 nodes.
monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
monteverdi:/abuild/mike/:[0]# echo 0 > /sys/devices/system/cpu/cpufreq/boost
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
014618 00039616
014623 00039256
014617 00039256
014620 00039304
014621 00039304 (wait a minute, you said..)
014616 00039080
014625 00039064
014622 00039672
014624 00039624
014619 00039672
monteverdi:/abuild/mike/:[0]# echo 1 > /sys/devices/system/cpu/cpufreq/boost
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
014635 00058160
014633 00058592
014638 00058592
014636 00058160
014632 00058200
014634 00058704
014639 00058704
014641 00058200
014640 00058560
014637 00058560
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
014673 00059504
014676 00059504
014674 00059064
014672 00059064
014675 00058560
014671 00058560
014677 00059248
014668 00058864
014669 00059248
014670 00058864
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
014686 00043472
014689 00043472
014685 00043760
014690 00043760
014687 00043528
014688 00043528 (hmm)
014683 00043216
014692 00043208
014684 00043336
014691 00043336
monteverdi:/abuild/mike/:[0]# echo 0 > /sys/devices/system/cpu/cpufreq/boost
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
014701 00039344
014707 00039344
014709 00038976
014700 00038976
014708 00039256 (hmm)
014703 00039256
014705 00039400
014704 00039400
014706 00039320
014702 00039320
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
014713 00058552
014716 00058664
014719 00058600
014715 00058600
014718 00058520
014722 00058400
014721 00058768
014717 00058768
014714 00058552
014720 00058560
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
014732 00058736
014734 00058760
014729 00040872
014736 00059184
014728 00059184
014727 00058744
014733 00058760
014731 00059320
014730 00059280
014735 00041072
monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
014749 00040608
014748 00040616
014745 00039360
014750 00039360
014751 00039416
014747 00039416
014752 00039336
014746 00039336
014744 00039480
014753 00039480
monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
014757 00039272
014761 00039272
014765 00039528
014756 00039528
014759 00039352
014760 00039352
014764 00039248
014762 00039248
014758 00039352
014763 00039352
monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
014773 00059680
014769 00059680
014768 00059144
014777 00059144
014775 00059688
014774 00059688
014770 00059264
014771 00059264
014772 00059528
014776 00059528
Ok box, whatever blows your skirt up. I'm done.
Non
Uniform
Mysterious
Artifacts
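The boost/policy permutations run by hand above could be scripted; a sketch of the harness shape, with the sysfs writes left commented out since the sched_policy file comes from this patchset rather than mainline:

```shell
policy_file=/sys/devices/system/cpu/sched_policy/current_sched_policy
boost_file=/sys/devices/system/cpu/cpufreq/boost

run_matrix() {   # run_matrix <cmd...>: run cmd once per boost/policy combo
    for boost in 0 1; do
        for policy in powersaving balance performance; do
            echo "boost=$boost policy=$policy"
            # echo "$boost"  > "$boost_file"    # on hardware with boost control
            # echo "$policy" > "$policy_file"   # on a kernel with this patchset
            "$@"
        done
    done
}

# e.g.: run_matrix numactl --cpunodebind=0 massive_intr 10 60
```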
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 15:47 ` Mike Galbraith
@ 2013-01-29 1:45 ` Alex Shi
2013-01-29 4:03 ` Mike Galbraith
2013-01-29 2:27 ` Alex Shi
1 sibling, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-29 1:45 UTC (permalink / raw)
To: Mike Galbraith
Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On 01/28/2013 11:47 PM, Mike Galbraith wrote:
> On Mon, 2013-01-28 at 06:17 +0100, Mike Galbraith wrote:
>
> Ok damnit.
>
>> monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
>> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
>> 043321 00058616
>> 043313 00058616
>> 043318 00058968
>> 043317 00058968
>> 043316 00059184
>> 043319 00059192
>> 043320 00059048
>> 043314 00059048
>> 043312 00058176
>> 043315 00058184
>
> That was boost if you like, and free to roam 4 nodes.
>
> monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# echo 0 > /sys/devices/system/cpu/cpufreq/boost
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 014618 00039616
> 014623 00039256
> 014617 00039256
> 014620 00039304
> 014621 00039304 (wait a minute, you said..)
> 014616 00039080
> 014625 00039064
> 014622 00039672
> 014624 00039624
> 014619 00039672
> monteverdi:/abuild/mike/:[0]# echo 1 > /sys/devices/system/cpu/cpufreq/boost
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 014635 00058160
> 014633 00058592
> 014638 00058592
> 014636 00058160
> 014632 00058200
> 014634 00058704
> 014639 00058704
> 014641 00058200
> 014640 00058560
> 014637 00058560
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 014673 00059504
> 014676 00059504
> 014674 00059064
> 014672 00059064
> 014675 00058560
> 014671 00058560
> 014677 00059248
> 014668 00058864
> 014669 00059248
> 014670 00058864
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 014686 00043472
> 014689 00043472
> 014685 00043760
> 014690 00043760
> 014687 00043528
> 014688 00043528 (hmm)
> 014683 00043216
> 014692 00043208
> 014684 00043336
> 014691 00043336
I am sorry, Mike. Did the above three test runs all use the same sched
policy? And the same question for the following tests.
> monteverdi:/abuild/mike/:[0]# echo 0 > /sys/devices/system/cpu/cpufreq/boost
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 014701 00039344
> 014707 00039344
> 014709 00038976
> 014700 00038976
> 014708 00039256 (hmm)
> 014703 00039256
> 014705 00039400
> 014704 00039400
> 014706 00039320
> 014702 00039320
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 014713 00058552
> 014716 00058664
> 014719 00058600
> 014715 00058600
> 014718 00058520
> 014722 00058400
> 014721 00058768
> 014717 00058768
> 014714 00058552
> 014720 00058560
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 014732 00058736
> 014734 00058760
> 014729 00040872
> 014736 00059184
> 014728 00059184
> 014727 00058744
> 014733 00058760
> 014731 00059320
> 014730 00059280
> 014735 00041072
> monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
> 014749 00040608
> 014748 00040616
> 014745 00039360
> 014750 00039360
> 014751 00039416
> 014747 00039416
> 014752 00039336
> 014746 00039336
> 014744 00039480
> 014753 00039480
> monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
> 014757 00039272
> 014761 00039272
> 014765 00039528
> 014756 00039528
> 014759 00039352
> 014760 00039352
> 014764 00039248
> 014762 00039248
> 014758 00039352
> 014763 00039352
> monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
> 014773 00059680
> 014769 00059680
> 014768 00059144
> 014777 00059144
> 014775 00059688
> 014774 00059688
> 014770 00059264
> 014771 00059264
> 014772 00059528
> 014776 00059528
>
> Ok box, whatever blows your skirt up. I'm done.
>
> Non
> Uniform
> Mysterious
> Artifacts
>
--
Thanks Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-29 1:45 ` Alex Shi
@ 2013-01-29 4:03 ` Mike Galbraith
0 siblings, 0 replies; 88+ messages in thread
From: Mike Galbraith @ 2013-01-29 4:03 UTC (permalink / raw)
To: Alex Shi
Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Tue, 2013-01-29 at 09:45 +0800, Alex Shi wrote:
> On 01/28/2013 11:47 PM, Mike Galbraith wrote:
> > monteverdi:/abuild/mike/:[0]# echo 1 > /sys/devices/system/cpu/cpufreq/boost
> > monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> > 014635 00058160
> > 014633 00058592
> > 014638 00058592
> > 014636 00058160
> > 014632 00058200
> > 014634 00058704
> > 014639 00058704
> > 014641 00058200
> > 014640 00058560
> > 014637 00058560
> > monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> > 014673 00059504
> > 014676 00059504
> > 014674 00059064
> > 014672 00059064
> > 014675 00058560
> > 014671 00058560
> > 014677 00059248
> > 014668 00058864
> > 014669 00059248
> > 014670 00058864
> > monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> > 014686 00043472
> > 014689 00043472
> > 014685 00043760
> > 014690 00043760
> > 014687 00043528
> > 014688 00043528 (hmm)
> > 014683 00043216
> > 014692 00043208
> > 014684 00043336
> > 014691 00043336
>
> I am sorry, Mike. Did the above three test runs all use the same sched
> policy? And the same question for the following tests.
Yeah, they're back-to-back repeats. Using dirt-simple massive_intr
didn't help clarify the aim7 oddity.
aim7 is fully repeatable; it seems to be saying that consolidation of
small independent jobs is a win, and that spreading before the box is
fully saturated has its price, just as consolidation of a large
coordinated burst has its price. Seems to cut both ways.. but why not,
everything else does.
-Mike
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 15:47 ` Mike Galbraith
2013-01-29 1:45 ` Alex Shi
@ 2013-01-29 2:27 ` Alex Shi
1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-29 2:27 UTC (permalink / raw)
To: Mike Galbraith
Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On 01/28/2013 11:47 PM, Mike Galbraith wrote:
> 014776 00059528
>
> Ok box, whatever blows your skirt up. I'm done.
Many thanks for so much fruitful testing! :D
--
Thanks Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-27 2:41 ` Alex Shi
2013-01-27 4:36 ` Mike Galbraith
@ 2013-01-27 10:40 ` Borislav Petkov
2013-01-27 14:03 ` Alex Shi
2013-01-28 5:19 ` Alex Shi
1 sibling, 2 replies; 88+ messages in thread
From: Borislav Petkov @ 2013-01-27 10:40 UTC (permalink / raw)
To: Alex Shi
Cc: torvalds, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
efault, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Sun, Jan 27, 2013 at 10:41:40AM +0800, Alex Shi wrote:
> Just rerun some benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
> loopback netperf. on my core2, nhm, wsm, snb, platforms. no clear
> performance change found.
Ok, good. You could put that in one of the commit messages so that it is
there and people know that this patchset doesn't cause perf regressions
with the bunch of benchmarks.
> I also tested balance policy/powersaving policy with above benchmark,
> found, the specjbb2005 drop much 30~50% on both of policy whenever
> with openjdk or jrockit. and hackbench drops a lots with powersaving
> policy on snb 4 sockets platforms. others has no clear change.
I guess this is expected because there has to be some performance hit
when saving power...
Thanks.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-27 10:40 ` Borislav Petkov
@ 2013-01-27 14:03 ` Alex Shi
2013-01-28 5:19 ` Alex Shi
1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-27 14:03 UTC (permalink / raw)
To: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, efault, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On 01/27/2013 06:40 PM, Borislav Petkov wrote:
> On Sun, Jan 27, 2013 at 10:41:40AM +0800, Alex Shi wrote:
>> Just rerun some benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
>> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
>> loopback netperf. on my core2, nhm, wsm, snb, platforms. no clear
>> performance change found.
>
> Ok, good, You could put that in one of the commit messages so that it is
> there and people know that this patchset doesn't cause perf regressions
> with the bunch of benchmarks.
Thanks for the suggestion!
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-27 10:40 ` Borislav Petkov
2013-01-27 14:03 ` Alex Shi
@ 2013-01-28 5:19 ` Alex Shi
2013-01-28 6:49 ` Mike Galbraith
2013-01-29 6:02 ` Alex Shi
1 sibling, 2 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-28 5:19 UTC (permalink / raw)
To: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, efault, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On 01/27/2013 06:40 PM, Borislav Petkov wrote:
> On Sun, Jan 27, 2013 at 10:41:40AM +0800, Alex Shi wrote:
>> Just rerun some benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
>> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
>> loopback netperf. on my core2, nhm, wsm, snb, platforms. no clear
>> performance change found.
>
> Ok, good, You could put that in one of the commit messages so that it is
> there and people know that this patchset doesn't cause perf regressions
> with the bunch of benchmarks.
>
>> I also tested balance policy/powersaving policy with above benchmark,
>> found, the specjbb2005 drop much 30~50% on both of policy whenever
>> with openjdk or jrockit. and hackbench drops a lots with powersaving
>> policy on snb 4 sockets platforms. others has no clear change.
>
> I guess this is expected because there has to be some performance hit
> when saving power...
>
BTW, I had tested the v3 version based on sched numa, on tip/master.
There specjbb drops only about 5~7% with the balance/powersaving policies.
The power scheduling logic runs after the numa scheduling logic.
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 5:19 ` Alex Shi
@ 2013-01-28 6:49 ` Mike Galbraith
2013-01-28 7:17 ` Alex Shi
2013-01-29 6:02 ` Alex Shi
1 sibling, 1 reply; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28 6:49 UTC (permalink / raw)
To: Alex Shi
Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Mon, 2013-01-28 at 13:19 +0800, Alex Shi wrote:
> On 01/27/2013 06:40 PM, Borislav Petkov wrote:
> > On Sun, Jan 27, 2013 at 10:41:40AM +0800, Alex Shi wrote:
> >> Just rerun some benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
> >> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
> >> loopback netperf. on my core2, nhm, wsm, snb, platforms. no clear
> >> performance change found.
> >
> > Ok, good, You could put that in one of the commit messages so that it is
> > there and people know that this patchset doesn't cause perf regressions
> > with the bunch of benchmarks.
> >
> >> I also tested balance policy/powersaving policy with above benchmark,
> >> found, the specjbb2005 drop much 30~50% on both of policy whenever
> >> with openjdk or jrockit. and hackbench drops a lots with powersaving
> >> policy on snb 4 sockets platforms. others has no clear change.
> >
> > I guess this is expected because there has to be some performance hit
> > when saving power...
> >
>
> BTW, I had tested the v3 version based on sched numa -- on tip/master.
> The specjbb just has about 5~7% dropping on balance/powersaving policy.
> The power scheduling done after the numa scheduling logical.
That makes sense. How do the numa scheduling numbers compare to mainline?
Do you have all three available: mainline, and tip with and without the
powersaving policy?
-Mike
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 6:49 ` Mike Galbraith
@ 2013-01-28 7:17 ` Alex Shi
2013-01-28 7:33 ` Mike Galbraith
0 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-28 7:17 UTC (permalink / raw)
To: Mike Galbraith
Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On 01/28/2013 02:49 PM, Mike Galbraith wrote:
> On Mon, 2013-01-28 at 13:19 +0800, Alex Shi wrote:
>> On 01/27/2013 06:40 PM, Borislav Petkov wrote:
>>> On Sun, Jan 27, 2013 at 10:41:40AM +0800, Alex Shi wrote:
>>>> Just rerun some benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
>>>> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
>>>> loopback netperf. on my core2, nhm, wsm, snb, platforms. no clear
>>>> performance change found.
>>>
>>> Ok, good, You could put that in one of the commit messages so that it is
>>> there and people know that this patchset doesn't cause perf regressions
>>> with the bunch of benchmarks.
>>>
>>>> I also tested balance policy/powersaving policy with above benchmark,
>>>> found, the specjbb2005 drop much 30~50% on both of policy whenever
>>>> with openjdk or jrockit. and hackbench drops a lots with powersaving
>>>> policy on snb 4 sockets platforms. others has no clear change.
>>>
>>> I guess this is expected because there has to be some performance hit
>>> when saving power...
>>>
>>
>> BTW, I had tested the v3 version based on sched numa -- on tip/master.
>> The specjbb just has about 5~7% dropping on balance/powersaving policy.
>> The power scheduling done after the numa scheduling logical.
>
> That makes sense. How the numa scheduling numbers compare to mainline?
> Do you have all three available, mainline, and tip w. w/o powersaving
> policy?
>
I once saw a 20~40% performance increase with sched numa vs. mainline
3.7-rc5, but I have no baseline to compare balance/powersaving performance
against, since lower numbers are acceptable for balance/powersaving and
tip/master changed too quickly to follow up at that time.
:)
> -Mike
>
>
--
Thanks Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 7:17 ` Alex Shi
@ 2013-01-28 7:33 ` Mike Galbraith
0 siblings, 0 replies; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28 7:33 UTC (permalink / raw)
To: Alex Shi
Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On Mon, 2013-01-28 at 15:17 +0800, Alex Shi wrote:
> On 01/28/2013 02:49 PM, Mike Galbraith wrote:
> > On Mon, 2013-01-28 at 13:19 +0800, Alex Shi wrote:
> >> On 01/27/2013 06:40 PM, Borislav Petkov wrote:
> >>> On Sun, Jan 27, 2013 at 10:41:40AM +0800, Alex Shi wrote:
> >>>> Just rerun some benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
> >>>> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
> >>>> loopback netperf. on my core2, nhm, wsm, snb, platforms. no clear
> >>>> performance change found.
> >>>
> >>> Ok, good, You could put that in one of the commit messages so that it is
> >>> there and people know that this patchset doesn't cause perf regressions
> >>> with the bunch of benchmarks.
> >>>
> >>>> I also tested balance policy/powersaving policy with above benchmark,
> >>>> found, the specjbb2005 drop much 30~50% on both of policy whenever
> >>>> with openjdk or jrockit. and hackbench drops a lots with powersaving
> >>>> policy on snb 4 sockets platforms. others has no clear change.
> >>>
> >>> I guess this is expected because there has to be some performance hit
> >>> when saving power...
> >>>
> >>
> >> BTW, I had tested the v3 version based on sched numa -- on tip/master.
> >> The specjbb just has about 5~7% dropping on balance/powersaving policy.
> >> The power scheduling done after the numa scheduling logical.
> >
> > That makes sense. How the numa scheduling numbers compare to mainline?
> > Do you have all three available, mainline, and tip w. w/o powersaving
> > policy?
> >
>
> I once caught 20~40% performance increasing on sched numa VS mainline
> 3.7-rc5. but have no baseline to compare balance/powersaving performance
> since lower data are acceptable for balance/powersaving and
> tip/master changes too quickly to follow up at that time.
> :)
(wow. dram sucks, dram+smp sucks more, dram+smp+numa _sucks rocks_;)
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-28 5:19 ` Alex Shi
2013-01-28 6:49 ` Mike Galbraith
@ 2013-01-29 6:02 ` Alex Shi
1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-29 6:02 UTC (permalink / raw)
To: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
namhyung, efault, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On 01/28/2013 01:19 PM, Alex Shi wrote:
> On 01/27/2013 06:40 PM, Borislav Petkov wrote:
>> On Sun, Jan 27, 2013 at 10:41:40AM +0800, Alex Shi wrote:
>>> Just rerun some benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
>>> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
>>> loopback netperf. on my core2, nhm, wsm, snb, platforms. no clear
>>> performance change found.
>>
>> Ok, good, You could put that in one of the commit messages so that it is
>> there and people know that this patchset doesn't cause perf regressions
>> with the bunch of benchmarks.
>>
>>> I also tested balance policy/powersaving policy with above benchmark,
>>> found, the specjbb2005 drop much 30~50% on both of policy whenever
>>> with openjdk or jrockit. and hackbench drops a lots with powersaving
>>> policy on snb 4 sockets platforms. others has no clear change.
Sorry, the testing configuration was unfair for the specjbb2005 results
here: I had set a JVM hard pin and used hugepages for peak performance.
With the hard pin removed and no hugepages, balance/powersaving both drop
only about 5% vs. the performance policy, and the performance policy
result is similar to 3.8-rc5.
>>
>> I guess this is expected because there has to be some performance hit
>> when saving power...
>>
>
> BTW, I had tested the v3 version based on sched numa -- on tip/master.
> The specjbb just has about 5~7% dropping on balance/powersaving policy.
> The power scheduling done after the numa scheduling logical.
>
--
Thanks Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
` (18 preceding siblings ...)
2013-01-24 9:44 ` [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Borislav Petkov
@ 2013-01-28 1:28 ` Alex Shi
2013-02-04 1:35 ` Alex Shi
20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-28 1:28 UTC (permalink / raw)
To: Alex Shi
Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
efault, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On 01/24/2013 11:06 AM, Alex Shi wrote:
> Since the runnable info needs 345ms to accumulate, balancing
> doesn't do well for many tasks burst waking. After talking with Mike
> Galbraith, we are agree to just use runnable avg in power friendly
> scheduling and keep current instant load in performance scheduling for
> low latency.
>
> So the biggest change in this version is removing runnable load avg in
> balance and just using runnable data in power balance.
>
> The patchset bases on Linus' tree, includes 3 parts,
Would you like to give some comments, Ingo? :)
Best regards!
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-01-24 3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
` (19 preceding siblings ...)
2013-01-28 1:28 ` Alex Shi
@ 2013-02-04 1:35 ` Alex Shi
2013-02-04 11:09 ` Ingo Molnar
20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-02-04 1:35 UTC (permalink / raw)
To: Alex Shi
Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
efault, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
On 01/24/2013 11:06 AM, Alex Shi wrote:
> Since the runnable info needs 345ms to accumulate, balancing
> doesn't do well for many tasks burst waking. After talking with Mike
> Galbraith, we are agree to just use runnable avg in power friendly
> scheduling and keep current instant load in performance scheduling for
> low latency.
>
> So the biggest change in this version is removing runnable load avg in
> balance and just using runnable data in power balance.
>
> The patchset bases on Linus' tree, includes 3 parts,
> ** 1, bug fix and fork/wake balancing clean up. patch 1~5,
> ----------------------
> the first patch remove one domain level. patch 2~5 simplified fork/wake
> balancing, it can increase 10+% hackbench performance on our 4 sockets
> SNB EP machine.
>
> V3 change:
> a, added the first patch to remove one domain level on x86 platform.
> b, some small changes according to Namhyung Kim's comments, thanks!
>
> ** 2, bug fix of load avg and remove the CONFIG_FAIR_GROUP_SCHED limit
> ----------------------
> patch 6~8, That using runnable avg in load balancing, with
> two initial runnable variables fix.
>
> V4 change:
> a, remove runnable log avg using in balancing.
>
> V3 change:
> a, use rq->cfs.runnable_load_avg as cpu load not
> rq->avg.load_avg_contrib, since the latter need much time to accumulate
> for new forked task,
> b, a build issue fixed with Namhyung Kim's reminder.
>
> ** 3, power awareness scheduling, patch 9~18.
> ----------------------
> The subset implement/consummate the rough power aware scheduling
> proposal: https://lkml.org/lkml/2012/8/13/139.
> It defines 2 new power aware policy 'balance' and 'powersaving' and then
> try to spread or pack tasks on each sched groups level according the
> different scheduler policy. That can save much power when task number in
> system is no more then LCPU number.
>
> As mentioned in the power aware scheduler proposal, Power aware
> scheduling has 2 assumptions:
> 1, race to idle is helpful for power saving
> 2, pack tasks on less sched_groups will reduce power consumption
>
> The first assumption make performance policy take over scheduling when
> system busy.
> The second assumption make power aware scheduling try to move
> disperse tasks into fewer groups until that groups are full of tasks.
>
> Some power testing data is in the last 2 patches.
>
> V4 change:
> a, fix few bugs and clean up code according to Morten Rasmussen, Mike
> Galbraith and Namhyung Kim. Thanks!
> b, take Morten's suggestion to set different criteria for different
> policy in small task packing.
> c, shorter latency in power aware scheduling.
>
> V3 change:
> a, engaged nr_running in max potential utils consideration in periodic
> power balancing.
> b, try exec/wake small tasks on running cpu not idle cpu.
>
> V2 change:
> a, add lazy power scheduling to deal with kbuild like benchmark.
>
>
> Thanks Fengguang Wu for the build testing of this patchset!
Adding a summary of the testing reports that were posted:
Alex Shi tested the benchmarks kbuild, specjbb2005, oltp, tbench, aim9,
hackbench, fileio-cfq of sysbench, dbench, aiostress and multi-threaded
loopback netperf on core2, nhm, wsm and snb platforms:
a, no clear performance change with the performance policy
b, specjbb2005 drops 5~7% with the balance/powersaving policies on SNB/NHM platforms; hackbench drops 30~70% on an SNB EP 4-socket machine.
c, no other performance change with the balance/powersaving policies.
test result from Mike Galbraith:
---------
With aim7 compute on 4 node 40 core box, I see stable throughput
improvement at tasks = nr_cores and below w. balance and powersaving.
                3.8.0-performance      3.8.0-balance        3.8.0-powersaving
Tasks    jobs/min/task    cpu    jobs/min/task    cpu    jobs/min/task    cpu
    1         432.8571   3.99         433.4764   3.97         433.1665   3.98
    5         480.1902  12.49         510.9612   7.55         497.5369   8.22
   10         429.1785  40.14         533.4507  11.13         518.3918  12.15
   20         424.3697  63.14         529.7203  23.72         528.7958  22.08
   40         419.0871 171.42         500.8264  51.44         517.0648  42.45
No deltas after that. There were also no deltas between patched kernel
using performance policy and virgin source.
----------
Ingo, I appreciate for any comments from you. :)
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-02-04 1:35 ` Alex Shi
@ 2013-02-04 11:09 ` Ingo Molnar
2013-02-05 2:26 ` Alex Shi
0 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2013-02-04 11:09 UTC (permalink / raw)
To: Alex Shi
Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
efault, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel
* Alex Shi <alex.shi@intel.com> wrote:
> On 01/24/2013 11:06 AM, Alex Shi wrote:
> > Since the runnable info needs 345ms to accumulate, balancing
> > doesn't do well for many tasks burst waking. After talking with Mike
> > Galbraith, we are agree to just use runnable avg in power friendly
> > scheduling and keep current instant load in performance scheduling for
> > low latency.
> >
> > So the biggest change in this version is removing runnable load avg in
> > balance and just using runnable data in power balance.
> >
> > The patchset bases on Linus' tree, includes 3 parts,
> > ** 1, bug fix and fork/wake balancing clean up. patch 1~5,
> > ----------------------
> > the first patch remove one domain level. patch 2~5 simplified fork/wake
> > balancing, it can increase 10+% hackbench performance on our 4 sockets
> > SNB EP machine.
> >
> > V3 change:
> > a, added the first patch to remove one domain level on x86 platform.
> > b, some small changes according to Namhyung Kim's comments, thanks!
> >
> > ** 2, bug fix of load avg and remove the CONFIG_FAIR_GROUP_SCHED limit
> > ----------------------
> > patch 6~8, That using runnable avg in load balancing, with
> > two initial runnable variables fix.
> >
> > V4 change:
> > a, remove runnable log avg using in balancing.
> >
> > V3 change:
> > a, use rq->cfs.runnable_load_avg as cpu load not
> > rq->avg.load_avg_contrib, since the latter need much time to accumulate
> > for new forked task,
> > b, a build issue fixed with Namhyung Kim's reminder.
> >
> > ** 3, power awareness scheduling, patch 9~18.
> > ----------------------
> > The subset implement/consummate the rough power aware scheduling
> > proposal: https://lkml.org/lkml/2012/8/13/139.
> > It defines 2 new power aware policy 'balance' and 'powersaving' and then
> > try to spread or pack tasks on each sched groups level according the
> > different scheduler policy. That can save much power when task number in
> > system is no more then LCPU number.
> >
> > As mentioned in the power aware scheduler proposal, Power aware
> > scheduling has 2 assumptions:
> > 1, race to idle is helpful for power saving
> > 2, pack tasks on less sched_groups will reduce power consumption
> >
> > The first assumption make performance policy take over scheduling when
> > system busy.
> > The second assumption make power aware scheduling try to move
> > disperse tasks into fewer groups until that groups are full of tasks.
> >
> > Some power testing data is in the last 2 patches.
> >
> > V4 change:
> > a, fix few bugs and clean up code according to Morten Rasmussen, Mike
> > Galbraith and Namhyung Kim. Thanks!
> > b, take Morten's suggestion to set different criteria for different
> > policy in small task packing.
> > c, shorter latency in power aware scheduling.
> >
> > V3 change:
> > a, engaged nr_running in max potential utils consideration in periodic
> > power balancing.
> > b, try exec/wake small tasks on running cpu not idle cpu.
> >
> > V2 change:
> > a, add lazy power scheduling to deal with kbuild like benchmark.
> >
> >
> > Thanks Fengguang Wu for the build testing of this patchset!
>
>
> Add some testing report summary that were posted:
> Alex Shi tested the benchmarks: kbuild, specjbb2005, oltp, tbench, aim9, hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
> loopback netperf. on core2, nhm, wsm, snb, platforms:
> a, no clear performance change on performance balance
> b, specjbb2005 drop 5~7% on balance/powersaving policy on SNB/NHM platforms; hackbench drop 30~70% SNB EP4S machine.
> c, no other peformance change on balance/powersaving machine.
>
> test result from Mike Galbraith:
> ---------
> With aim7 compute on 4 node 40 core box, I see stable throughput
> improvement at tasks = nr_cores and below w. balance and powersaving.
>
>                 3.8.0-performance      3.8.0-balance        3.8.0-powersaving
> Tasks    jobs/min/task    cpu    jobs/min/task    cpu    jobs/min/task    cpu
>     1         432.8571   3.99         433.4764   3.97         433.1665   3.98
>     5         480.1902  12.49         510.9612   7.55         497.5369   8.22
>    10         429.1785  40.14         533.4507  11.13         518.3918  12.15
>    20         424.3697  63.14         529.7203  23.72         528.7958  22.08
>    40         419.0871 171.42         500.8264  51.44         517.0648  42.45
>
> No deltas after that. There were also no deltas between patched kernel
> using performance policy and virgin source.
> ----------
>
> Ingo, I appreciate for any comments from you. :)
Have you tried to quantify the actual real or expected power
savings with the knob enabled?
I'd also love to have an automatic policy here, with a knob that
has 3 values:
0: always disabled
1: automatic
2: always enabled
here enabled/disabled is your current knob's functionality, and
those can also be used by user-space policy daemons/handlers.
The interesting thing would be '1' which should be the default:
on laptops that are on battery it should result in a power
saving policy, on laptops that are on AC or on battery-less
systems it should mean 'performance' policy.
It should generally default to 'performance', switching to
'power saving on' only if there's positive, reliable information
somewhere in the kernel that we are operating on battery power.
A callback or two would have to go into the ACPI battery driver
I suspect.
So I'd like this feature to be a tangible improvement for laptop
users (as long as the laptop hardware is passing us battery/AC
events reliably).
Or something like that - with .config switches to influence
these values as well.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-02-04 11:09 ` Ingo Molnar
@ 2013-02-05 2:26 ` Alex Shi
2013-02-06 5:08 ` Alex Shi
0 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-02-05 2:26 UTC (permalink / raw)
To: Ingo Molnar
Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
efault, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel, Zhang, Rui
>> Ingo, I appreciate for any comments from you. :)
>
> Have you tried to quantify the actual real or expected power
> savings with the knob enabled?
Thanks a lot for your comments! :)
Yes, the following power data is copied from patch 17:
---
A test can show the effect of the different policies:
for ((i = 0; i < I; i++)) ; do while true; do :; done & done
On my SNB laptop with 4 cores * HT (the data is in Watts):
         powersaving   balance   performance
i = 2             40        54            54
i = 4             57        64*           68
i = 8             68        68            68
Note:
When i = 4 with the balance policy, the power may vary between 57~68 Watts,
since the HT capacity and core capacity are both 1.
On an SNB EP machine with 2 sockets * 8 cores * HT:
         powersaving   balance   performance
i = 4            190       201           238
i = 8            205       241           268
i = 16           271       348           376
If the system has a few continuously running tasks, using a power policy
can give a performance/power gain, like the sysbench fileio randrw test
with 16 threads on the SNB EP box.
=====
and the following from patch 18
---
On my SNB EP 2 sockets machine with 8 cores * HT: 'make -j x vmlinux'
results:
         powersaving         balance             performance
x = 1    175.603 /417 13     175.220 /416 13     176.073 /407 13
x = 2    192.215 /218 23     194.522 /202 25     217.393 /200 23
x = 4    205.226 /124 39     208.823 /114 42     230.425 /105 41
x = 8    236.369 /71  59     249.005 /65  61     257.661 /62  62
x = 16   283.842 /48  73     307.465 /40  81     309.336 /39  82
x = 32   325.197 /32  96     333.503 /32  93     336.138 /32  92
data format is: 175.603 /417 13
175.603: average Watts
417: seconds(compile time)
13: scaled performance/power = 1000000 / seconds / watts
=====
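As a sanity check, the scaled performance/power figures in the tables above
can be reproduced with a few lines of Python (a sketch; the function name is
mine, not from the patchset):

```python
def scaled_perf_per_watt(seconds, watts):
    """Scaled performance/power as used in the tables above:
    1000000 / seconds / watts, truncated to an integer."""
    return int(1000000 / seconds / watts)

# Two rows from the 'make -j x vmlinux' table (seconds, average Watts):
print(scaled_perf_per_watt(417, 175.603))  # x = 1, powersaving
print(scaled_perf_per_watt(32, 325.197))   # x = 32, powersaving
```

Plugging in 417s / 175.603W gives 13, and 32s / 325.197W gives 96, matching
the table.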
some data for parallel compress: https://lkml.org/lkml/2012/12/11/155
---
Another test of parallel compression with pigz on Linus' git tree.
The results show we get much better performance/power with the powersaving
and balance policies:
testing command:
#pigz -k -c -p$x -r linux* &> /dev/null
On a NHM EP box
         powersaving        balance            performance
x = 4    166.516 /88  68    170.515 /82  71    165.283 /103 58
x = 8    173.654 /61  94    177.693 /60  93    172.31  /76  76
On a 2 sockets SNB EP box.
         powersaving        balance            performance
x = 4    190.995 /149 35    200.6   /129 38    208.561 /135 35
x = 8    197.969 /108 46    208.885 /103 46    213.96  /108 43
x = 16   205.163 /76  64    212.144 /91  51    229.287 /97  44
data format is: 166.516 /88 68
166.516: average Watts
88: seconds(compress time)
68: scaled performance/power = 1000000 / time / power
=====
BTW, bltk-game with openarena dropped 0.3/1.5 Watts with the powersaving
policy, or 0.2/0.5 Watts with the balance policy, on my wsm/snb laptops.
>
> I'd also love to have an automatic policy here, with a knob that
> has 3 values:
>
> 0: always disabled
> 1: automatic
> 2: always enabled
>
> here enabled/disabled is your current knob's functionality, and
> those can also be used by user-space policy daemons/handlers.
Sure, this patchset has a knob for user-space policy selection:
$cat /sys/devices/system/cpu/sched_policy/available_sched_policy
performance powersaving balance
The user can change the policy with the command 'echo':
echo performance > /sys/devices/system/cpu/current_sched_policy
The 'performance' policy means power friendly scheduling is 'always disabled'.
The 'balance'/'powersaving' policies are automatic power friendly scheduling,
since the system will automatically bypass power scheduling when the cpu
utilisation in a sched domain goes beyond the domain's cpu weight
(powersaving) or beyond the domain's capacity (balance).
There is no 'always enabled' power scheduling, since the patchset is based on
'race to idle', but it is easy to add this function if needed.
>
> The interesting thing would be '1' which should be the default:
> on laptops that are on battery it should result in a power
> saving policy, on laptops that are on AC or on battery-less
> systems it should mean 'performance' policy.
Yes, with the above sysfs interface it is easy to do. :)
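As a sketch, a user-space handler for that could look like the following.
The sched_policy paths are the ones from this patchset; the
/sys/class/power_supply path is an assumption and varies by platform:

```shell
#!/bin/sh
# Pick a scheduler policy from the AC/battery state: 'performance' on AC
# power or battery-less systems, 'powersaving' on battery.

choose_policy() {
    # $1 is 1 when on AC power, 0 when on battery
    if [ "$1" = "1" ]; then
        echo performance
    else
        echo powersaving
    fi
}

apply_policy() {
    # /sys/class/power_supply/AC/online is an assumed path; fall back to
    # AC (performance) when it is missing, e.g. on battery-less systems.
    ac=$(cat /sys/class/power_supply/AC/online 2>/dev/null || echo 1)
    choose_policy "$ac" > /sys/devices/system/cpu/current_sched_policy
}
```

Such a script could be triggered from udev power_supply events instead of
being polled.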
>
> It should generally default to 'performance', switching to
> 'power saving on' only if there's positive, reliable information
> somewhere in the kernel that we are operating on battery power.
> A callback or two would have to go into the ACPI battery driver
> I suspect.
>
> So I'd like this feature to be a tangible improvement for laptop
> users (as long as the laptop hardware is passing us battery/AC
> events reliably).
Maybe it is better to let the system admin change it from user space? I am
not sure anyone would like to enable a callback in the ACPI battery driver.
CC'ing Zhang Rui.
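To illustrate, the three-value knob Ingo proposes could map to an effective
policy roughly like this. This is only a sketch: all names are illustrative,
not from the patchset, and on_battery stands in for whatever battery
information the ACPI driver would provide:

```c
#include <assert.h>

enum power_knob { KNOB_DISABLED = 0, KNOB_AUTO = 1, KNOB_ENABLED = 2 };
enum sched_policy { POLICY_PERFORMANCE, POLICY_POWERSAVING };

/* on_battery would be fed by a callback from the ACPI battery driver. */
static enum sched_policy effective_policy(enum power_knob knob, int on_battery)
{
    switch (knob) {
    case KNOB_ENABLED:
        return POLICY_POWERSAVING;
    case KNOB_AUTO:
        /* Default to performance; save power only with positive,
         * reliable information that we are on battery. */
        return on_battery ? POLICY_POWERSAVING : POLICY_PERFORMANCE;
    case KNOB_DISABLED:
    default:
        return POLICY_PERFORMANCE;
    }
}
```

Whether this decision lives in the kernel or in a user-space policy daemon
writing the existing sysfs knob is the open question above.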
>
> Or something like that - with .config switches to influence
> these values as well.
>
> Thanks,
>
> Ingo
>
--
Thanks Alex
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
2013-02-05 2:26 ` Alex Shi
@ 2013-02-06 5:08 ` Alex Shi
0 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-06 5:08 UTC (permalink / raw)
To: Ingo Molnar
Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
efault, vincent.guittot, gregkh, preeti, viresh.kumar,
linux-kernel, Zhang, Rui
BTW,
since numa balance scheduling is also a kind of cpu locality policy, it is
naturally compatible with power aware scheduling.
The v2/v3 of this patchset were developed on tip/master; testing showed the
above 2 scheduling policies work well together.
--
Thanks Alex
^ permalink raw reply [flat|nested] 88+ messages in thread