* [PATCH 0/2] sched: arch_scale_smt_powers
@ 2010-01-20 20:00 ` Joel Schopp
To: Peter Zijlstra
Cc: ego, linuxppc-dev, Ingo Molnar, linux-kernel, benh, jschopp

The new Power7 processor has 4-way SMT.  This 4-way SMT benefits from the
dynamic power updates that arch_scale_smt_power was designed to provide.

The first patch fixes a generic scheduler bug that is necessary for
arch_scale_smt_power to function properly.  The second patch implements
arch_scale_smt_power for powerpc, and in particular for Power7 processors.

---

Gautham R Shenoy (1):
      sched: Fix the place where group powers are updated.

Joel Schopp (1):
      powerpc: implement arch_scale_smt_power for Power7

 arch/powerpc/kernel/smp.c |   41 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched.c            |    7 +++----
 kernel/sched_features.h   |    2 +-
 3 files changed, 45 insertions(+), 5 deletions(-)
* [PATCH 1/2] sched: Fix the place where group powers are updated.
@ 2010-01-20 20:02 ` Joel Schopp
To: Peter Zijlstra
Cc: ego, linuxppc-dev, Ingo Molnar, linux-kernel, benh

From: Gautham R Shenoy <ego@in.ibm.com>

We want to update the sched_group_powers when balance_cpu == this_cpu.

Currently the group powers are updated only if the balance_cpu is the
first CPU in the local group.  But balance_cpu = this_cpu could also be
the first idle cpu in the group.  Hence fix the place where the group
powers are updated.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
---
 kernel/sched.c |    7 +++----
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 281da29..5d2a451 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3721,11 +3721,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	unsigned long sum_avg_load_per_task;
 	unsigned long avg_load_per_task;
 
-	if (local_group) {
+	if (local_group)
 		balance_cpu = group_first_cpu(group);
-		if (balance_cpu == this_cpu)
-			update_group_power(sd, this_cpu);
-	}
 
 	/* Tally up the load of all CPUs in the group */
 	sum_avg_load_per_task = avg_load_per_task = 0;
@@ -3773,6 +3770,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 		return;
 	}
 
+	update_group_power(sd, this_cpu);
+
 	/* Adjust by relative CPU power of the group */
 	sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;
* [tip:sched/core] sched: Fix the place where group powers are updated
@ 2010-01-21 13:54 ` tip-bot for Gautham R Shenoy
To: linux-tip-commits
Cc: linux-kernel, jschopp, ego, hpa, mingo, a.p.zijlstra, tglx

Commit-ID:  871e35bc9733f273eaf5ceb69bbd0423b58e5285
Gitweb:     http://git.kernel.org/tip/871e35bc9733f273eaf5ceb69bbd0423b58e5285
Author:     Gautham R Shenoy <ego@in.ibm.com>
AuthorDate: Wed, 20 Jan 2010 14:02:44 -0600
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 21 Jan 2010 13:40:17 +0100

sched: Fix the place where group powers are updated

We want to update the sched_group_powers when balance_cpu == this_cpu.

Currently the group powers are updated only if the balance_cpu is the
first CPU in the local group.  But balance_cpu = this_cpu could also be
the first idle cpu in the group.  Hence fix the place where the group
powers are updated.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1264017764.5717.127.camel@jschopp-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_fair.c |    7 +++----
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0b482f5..22231cc 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2418,11 +2418,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	unsigned long sum_avg_load_per_task;
 	unsigned long avg_load_per_task;
 
-	if (local_group) {
+	if (local_group)
 		balance_cpu = group_first_cpu(group);
-		if (balance_cpu == this_cpu)
-			update_group_power(sd, this_cpu);
-	}
 
 	/* Tally up the load of all CPUs in the group */
 	sum_avg_load_per_task = avg_load_per_task = 0;
@@ -2470,6 +2467,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 		return;
 	}
 
+	update_group_power(sd, this_cpu);
+
 	/* Adjust by relative CPU power of the group */
 	sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;
* [PATCHv2 1/2] sched: enable ARCH_POWER
@ 2010-01-26 23:28 ` Joel Schopp
To: Peter Zijlstra
Cc: ego, linuxppc-dev, Ingo Molnar, linux-kernel, benh, jschopp

Enable the scheduler feature that allows use of arch_scale_smt_power.
Stub out the broken x86 implementation.

Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
---
Index: linux-2.6.git/kernel/sched_features.h
===================================================================
--- linux-2.6.git.orig/kernel/sched_features.h
+++ linux-2.6.git/kernel/sched_features.h
@@ -102,7 +102,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, 1)
 /*
  * Use arch dependent cpu power functions
  */
-SCHED_FEAT(ARCH_POWER, 0)
+SCHED_FEAT(ARCH_POWER, 1)
 
 SCHED_FEAT(HRTICK, 0)
 SCHED_FEAT(DOUBLE_TICK, 0)
Index: linux-2.6.git/arch/x86/kernel/cpu/sched.c
===================================================================
--- linux-2.6.git.orig/arch/x86/kernel/cpu/sched.c
+++ linux-2.6.git/arch/x86/kernel/cpu/sched.c
@@ -44,11 +44,9 @@ unsigned long arch_scale_freq_power(stru
 unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu)
 {
 	/*
-	 * aperf/mperf already includes the smt gain
+	 * aperf/mperf already includes the smt gain, but represents capacity
+	 * as 0 when idle. So for now just return default.
 	 */
-	if (boot_cpu_has(X86_FEATURE_APERFMPERF))
-		return SCHED_LOAD_SCALE;
-
 	return default_scale_smt_power(sd, cpu);
 }
* [PATCHv3 1/2] sched: enable ARCH_POWER
@ 2010-01-28 23:20 ` Joel Schopp
To: Peter Zijlstra
Cc: ego, linuxppc-dev, Ingo Molnar, linux-kernel, benh, jschopp

Enable the scheduler feature that allows use of arch_scale_smt_power.
Stub out the broken x86 implementation.

Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
---
Index: linux-2.6.git/kernel/sched_features.h
===================================================================
--- linux-2.6.git.orig/kernel/sched_features.h
+++ linux-2.6.git/kernel/sched_features.h
@@ -102,7 +102,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, 1)
 /*
  * Use arch dependent cpu power functions
  */
-SCHED_FEAT(ARCH_POWER, 0)
+SCHED_FEAT(ARCH_POWER, 1)
 
 SCHED_FEAT(HRTICK, 0)
 SCHED_FEAT(DOUBLE_TICK, 0)
Index: linux-2.6.git/arch/x86/kernel/cpu/sched.c
===================================================================
--- linux-2.6.git.orig/arch/x86/kernel/cpu/sched.c
+++ linux-2.6.git/arch/x86/kernel/cpu/sched.c
@@ -44,11 +44,9 @@ unsigned long arch_scale_freq_power(stru
 unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu)
 {
 	/*
-	 * aperf/mperf already includes the smt gain
+	 * aperf/mperf already includes the smt gain, but represents capacity
+	 * as 0 when idle. So for now just return default.
 	 */
-	if (boot_cpu_has(X86_FEATURE_APERFMPERF))
-		return SCHED_LOAD_SCALE;
-
 	return default_scale_smt_power(sd, cpu);
 }
* [PATCHv4 1/2] sched: enable ARCH_POWER
@ 2010-02-05 20:57 ` Joel Schopp
To: Peter Zijlstra
Cc: ego, linuxppc-dev, Ingo Molnar, linux-kernel, benh, jschopp

Enable the scheduler feature that allows use of arch_scale_smt_power.
Stub out the broken x86 implementation.

Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
---
Index: linux-2.6.git/kernel/sched_features.h
===================================================================
--- linux-2.6.git.orig/kernel/sched_features.h
+++ linux-2.6.git/kernel/sched_features.h
@@ -102,7 +102,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, 1)
 /*
  * Use arch dependent cpu power functions
  */
-SCHED_FEAT(ARCH_POWER, 0)
+SCHED_FEAT(ARCH_POWER, 1)
 
 SCHED_FEAT(HRTICK, 0)
 SCHED_FEAT(DOUBLE_TICK, 0)
Index: linux-2.6.git/arch/x86/kernel/cpu/sched.c
===================================================================
--- linux-2.6.git.orig/arch/x86/kernel/cpu/sched.c
+++ linux-2.6.git/arch/x86/kernel/cpu/sched.c
@@ -44,11 +44,9 @@ unsigned long arch_scale_freq_power(stru
 unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu)
 {
 	/*
-	 * aperf/mperf already includes the smt gain
+	 * aperf/mperf already includes the smt gain, but represents capacity
+	 * as 0 when idle. So for now just return default.
 	 */
-	if (boot_cpu_has(X86_FEATURE_APERFMPERF))
-		return SCHED_LOAD_SCALE;
-
 	return default_scale_smt_power(sd, cpu);
 }
* [PATCH 2/2] powerpc: implement arch_scale_smt_power for Power7
@ 2010-01-20 20:04 ` Joel Schopp
To: Peter Zijlstra
Cc: ego, linuxppc-dev, Ingo Molnar, linux-kernel, benh

On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads
there is performance benefit to idling the higher numbered threads in
the core.

This patch implements arch_scale_smt_power to dynamically update smt
thread power in these idle cases in order to prefer threads 0,1 over
threads 2,3 within a core.

Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
---
Index: linux-2.6.git/arch/powerpc/kernel/smp.c
===================================================================
--- linux-2.6.git.orig/arch/powerpc/kernel/smp.c
+++ linux-2.6.git/arch/powerpc/kernel/smp.c
@@ -617,3 +617,44 @@ void __cpu_die(unsigned int cpu)
 		smp_ops->cpu_die(cpu);
 }
 #endif
+
+static inline int thread_in_smt4core(int x)
+{
+	return x % 4;
+}
+unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu)
+{
+	int cpu2;
+	int idle_count = 0;
+
+	struct cpumask *cpu_map = sched_domain_span(sd);
+
+	unsigned long weight = cpumask_weight(cpu_map);
+	unsigned long smt_gain = sd->smt_gain;
+
+	if (cpu_has_feature(CPU_FTRS_POWER7) && weight == 4) {
+		for_each_cpu(cpu2, cpu_map) {
+			if (idle_cpu(cpu2))
+				idle_count++;
+		}
+
+		/* the following section attempts to tweak cpu power based
+		 * on current idleness of the threads dynamically at runtime
+		 */
+		if (idle_count == 2 || idle_count == 3 || idle_count == 4) {
+			if (thread_in_smt4core(cpu) == 0 ||
+			    thread_in_smt4core(cpu) == 1) {
+				/* add 75 % to thread power */
+				smt_gain += (smt_gain >> 1) + (smt_gain >> 2);
+			} else {
+				/* subtract 75 % to thread power */
+				smt_gain = smt_gain >> 2;
+			}
+		}
+	}
+	/* default smt gain is 1178, weight is # of SMT threads */
+	smt_gain /= weight;
+
+	return smt_gain;
+
+}
Index: linux-2.6.git/kernel/sched_features.h
===================================================================
--- linux-2.6.git.orig/kernel/sched_features.h
+++ linux-2.6.git/kernel/sched_features.h
@@ -107,7 +107,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, 1)
 /*
  * Use arch dependent cpu power functions
  */
-SCHED_FEAT(ARCH_POWER, 0)
+SCHED_FEAT(ARCH_POWER, 1)
 
 SCHED_FEAT(HRTICK, 0)
 SCHED_FEAT(DOUBLE_TICK, 0)
* Re: [PATCH 2/2] powerpc: implement arch_scale_smt_power for Power7
@ 2010-01-20 20:48 ` Peter Zijlstra
To: Joel Schopp
Cc: ego, linuxppc-dev, Ingo Molnar, linux-kernel, benh, Andreas Herrmann

On Wed, 2010-01-20 at 14:04 -0600, Joel Schopp wrote:
> On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads
> there is performance benefit to idling the higher numbered threads in
> the core.

So this is an actual performance improvement, not only power savings?

> This patch implements arch_scale_smt_power to dynamically update smt
> thread power in these idle cases in order to prefer threads 0,1 over
> threads 2,3 within a core.
>
> Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
> ---
> Index: linux-2.6.git/arch/powerpc/kernel/smp.c
> ===================================================================
> --- linux-2.6.git.orig/arch/powerpc/kernel/smp.c
> +++ linux-2.6.git/arch/powerpc/kernel/smp.c
> @@ -617,3 +617,44 @@ void __cpu_die(unsigned int cpu)
>  		smp_ops->cpu_die(cpu);
>  }
>  #endif
> +
> +static inline int thread_in_smt4core(int x)
> +{
> +	return x % 4;
> +}
> +unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu)
> +{
> +	int cpu2;
> +	int idle_count = 0;
> +
> +	struct cpumask *cpu_map = sched_domain_span(sd);
> +
> +	unsigned long weight = cpumask_weight(cpu_map);
> +	unsigned long smt_gain = sd->smt_gain;
> +
> +	if (cpu_has_feature(CPU_FTRS_POWER7) && weight == 4) {
> +		for_each_cpu(cpu2, cpu_map) {
> +			if (idle_cpu(cpu2))
> +				idle_count++;
> +		}
> +
> +		/* the following section attempts to tweak cpu power based
> +		 * on current idleness of the threads dynamically at runtime
> +		 */
> +		if (idle_count == 2 || idle_count == 3 || idle_count == 4) {
> +			if (thread_in_smt4core(cpu) == 0 ||
> +			    thread_in_smt4core(cpu) == 1) {
> +				/* add 75 % to thread power */
> +				smt_gain += (smt_gain >> 1) + (smt_gain >> 2);
> +			} else {
> +				/* subtract 75 % to thread power */
> +				smt_gain = smt_gain >> 2;
> +			}
> +		}
> +	}
> +	/* default smt gain is 1178, weight is # of SMT threads */
> +	smt_gain /= weight;
> +
> +	return smt_gain;
> +
> +}

This looks to suffer significant whitespace damage.

The design goal for smt_power was to be able to actually measure the
processing gains from smt and feed that into the scheduler, not really
placement tricks like this.

Now I also heard AMD might want to have something similar to this,
something to do with powerlines and die layout.

I'm not sure playing games with cpu_power is the best, or if simply
moving tasks to lower numbered cpus using an SD_flag is the best
solution for these kinds of things.

> Index: linux-2.6.git/kernel/sched_features.h
> ===================================================================
> --- linux-2.6.git.orig/kernel/sched_features.h
> +++ linux-2.6.git/kernel/sched_features.h
> @@ -107,7 +107,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, 1)
>  /*
>   * Use arch dependent cpu power functions
>   */
> -SCHED_FEAT(ARCH_POWER, 0)
> +SCHED_FEAT(ARCH_POWER, 1)
>  
>  SCHED_FEAT(HRTICK, 0)
>  SCHED_FEAT(DOUBLE_TICK, 0)

And you just wrecked x86 ;-)

It has an smt_power implementation that tries to measure smt gains using
aperf/mperf; trouble is that this represents the actual performance, not
the capacity.  This has the problem that when idle it represents 0
capacity and will not attract work.

Coming up with something that actually works there is on the todo list;
I was thinking perhaps temporal maximums from !idle.

So if you want to go with this, you'll need to stub out
arch/x86/kernel/cpu/sched.c
* Re: [PATCH 2/2] powerpc: implement arch_scale_smt_power for Power7
@ 2010-01-20 21:58 ` Michael Neuling
To: Peter Zijlstra
Cc: Joel Schopp, ego, Andreas Herrmann, linux-kernel, Ingo Molnar, linuxppc-dev

> > On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads
> > there is performance benefit to idling the higher numbered threads in
> > the core.
>
> So this is an actual performance improvement, not only power savings?

It's primarily a performance improvement.  Any power/energy savings
would be a bonus.

Mikey
* Re: [PATCH 2/2] powerpc: implement arch_scale_smt_power for Power7
@ 2010-01-20 22:44 ` Joel Schopp
To: Peter Zijlstra
Cc: ego, linuxppc-dev, Ingo Molnar, linux-kernel, benh, Andreas Herrmann

Peter Zijlstra wrote:
> On Wed, 2010-01-20 at 14:04 -0600, Joel Schopp wrote:
>> On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads
>> there is performance benefit to idling the higher numbered threads in
>> the core.
>
> So this is an actual performance improvement, not only power savings?

Yes.

> And you just wrecked x86 ;-)
>
> It has an smt_power implementation that tries to measure smt gains using
> aperf/mperf, trouble is that this represents the actual performance not
> the capacity. This has the problem that when idle it represents 0
> capacity and will not attract work.
>
> Coming up with something that actually works there is on the todo list,
> I was thinking perhaps temporal maximums from !idle.
>
> So if you want to go with this, you'll need to stub out
> arch/x86/kernel/cpu/sched.c

OK.  Guess I now will have a 3 patch series, with a patch to stub out
the x86 broken version.

Care to take Gautham's bugfix patch (patch 1/2) now, since it just fixes
a bug?  You'll need it if you ever try to make the x86 broken version
work.
* Re: [PATCH 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-20 22:44 ` Joel Schopp @ 2010-01-21 8:27 ` Peter Zijlstra -1 siblings, 0 replies; 103+ messages in thread From: Peter Zijlstra @ 2010-01-21 8:27 UTC (permalink / raw) To: Joel Schopp Cc: ego, linuxppc-dev, Ingo Molnar, linux-kernel, benh, Andreas Herrmann On Wed, 2010-01-20 at 16:44 -0600, Joel Schopp wrote: > > Care to take Gautham's bugfix patch (patch 1/2) now, since it just fixes > a bug? You'll need it if you ever try to make the x86 broken version work. Sure, I'll take that, thanks! ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-20 20:04 ` Joel Schopp @ 2010-01-20 21:04 ` Michael Neuling -1 siblings, 0 replies; 103+ messages in thread From: Michael Neuling @ 2010-01-20 21:04 UTC (permalink / raw) To: Joel Schopp; +Cc: Peter Zijlstra, Ingo Molnar, linuxppc-dev, linux-kernel, ego > On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads > there is performance benefit to idling the higher numbered threads in > the core. > > This patch implements arch_scale_smt_power to dynamically update smt > thread power in these idle cases in order to prefer threads 0,1 over > threads 2,3 within a core. > > Signed-off-by: Joel Schopp <jschopp@austin.ibm.com> > --- > Index: linux-2.6.git/arch/powerpc/kernel/smp.c > =================================================================== > --- linux-2.6.git.orig/arch/powerpc/kernel/smp.c > +++ linux-2.6.git/arch/powerpc/kernel/smp.c > @@ -617,3 +617,44 @@ void __cpu_die(unsigned int cpu) > smp_ops->cpu_die(cpu); > } > #endif > + > +static inline int thread_in_smt4core(int x) > +{ > + return x % 4; > +} > +unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu) > +{ > + int cpu2; > + int idle_count = 0; > + > + struct cpumask *cpu_map = sched_domain_span(sd); > + > + unsigned long weight = cpumask_weight(cpu_map); > + unsigned long smt_gain = sd->smt_gain; > + > + if (cpu_has_feature(CPU_FTRS_POWER7) && weight == 4) { I think we should avoid using cpu_has_feature like this. It's better to create a new feature and add it to POWER7 in the cputable, then check for that here. The way that it is now, I think any CPU that has a superset of the POWER7 features will be true here. This is not what we want.
> + for_each_cpu(cpu2, cpu_map) { > + if (idle_cpu(cpu2)) > + idle_count++; > + } > + > + /* the following section attempts to tweak cpu power based > + * on current idleness of the threads dynamically at runtime > + */ > + if (idle_count == 2 || idle_count == 3 || idle_count == 4) { > + if (thread_in_smt4core(cpu) == 0 || > + thread_in_smt4core(cpu) == 1) { > + /* add 75 % to thread power */ > + smt_gain += (smt_gain >> 1) + (smt_gain >> 2); > + } else { > + /* subtract 75 % to thread power */ > + smt_gain = smt_gain >> 2; > + } > + } > + } > + /* default smt gain is 1178, weight is # of SMT threads */ > + smt_gain /= weight; This results in a PPC div, when most of the time it's going to be a power of two divide. You've optimised the divides a few lines above this, but not this one. Some consistency would be good. Mikey > + > + return smt_gain; > + > +} > Index: linux-2.6.git/kernel/sched_features.h > =================================================================== > --- linux-2.6.git.orig/kernel/sched_features.h > +++ linux-2.6.git/kernel/sched_features.h > @@ -107,7 +107,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, 1) > /* > * Use arch dependent cpu power functions > */ > -SCHED_FEAT(ARCH_POWER, 0) > +SCHED_FEAT(ARCH_POWER, 1) > > SCHED_FEAT(HRTICK, 0) > SCHED_FEAT(DOUBLE_TICK, 0) > > > _______________________________________________ > Linuxppc-dev mailing list > Linuxppc-dev@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/linuxppc-dev > ^ permalink raw reply [flat|nested] 103+ messages in thread
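Michael's point about the divide can be checked in isolation. The following is a minimal, standalone sketch — not the kernel code itself; the helper name `scale_gain_by_weight` is illustrative, and the default gain of 1178 is taken from the patch comment — of replacing `smt_gain /= weight` with a shift whenever the thread count is a power of two:

```c
#include <assert.h>

/* Hypothetical helper mirroring "smt_gain /= weight": use a shift for
 * the power-of-two weights (1, 2, 4) and a real divide otherwise. */
unsigned long scale_gain_by_weight(unsigned long smt_gain,
				   unsigned long weight)
{
	if (weight && !(weight & (weight - 1))) {	/* power of two */
		unsigned int shift = 0;

		while ((1UL << shift) < weight)
			shift++;
		return smt_gain >> shift;	/* same result, no div */
	}
	return smt_gain / weight;	/* odd thread counts fall back */
}
```

With the default smt_gain of 1178 this yields 294 for an SMT4 weight and 589 for SMT2, identical to the plain integer divide.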
* Re: [PATCH 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-20 21:04 ` Michael Neuling @ 2010-01-20 22:09 ` Joel Schopp -1 siblings, 0 replies; 103+ messages in thread From: Joel Schopp @ 2010-01-20 22:09 UTC (permalink / raw) To: Michael Neuling Cc: Peter Zijlstra, Ingo Molnar, linuxppc-dev, linux-kernel, ego >> + if (cpu_has_feature(CPU_FTRS_POWER7) && weight == 4) { >> > > I think we should avoid using cpu_has_feature like this. It's better to > create a new feature and add it to POWER7 in the cputable, then check > for that here. > > The way that it is now, I think any CPU that has superset of the POWER7 > features, will be true here. This is not what we want. > Any ideas for what to call this feature? ASYM_SMT4 ? > >> + smt_gain /= weight; >> > > This results in a PPC div, when most of the time it's going to be a > power of two divide. You've optimised the divides a few lines above > this, but not this one. Some consistency would be good. > > I can turn that into a conditional branch (case statement) with a shift for the common 1,2,4 cases which should cover all procs available today falling back to a divide for any theoretical future processors that do other numbers of threads. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-20 22:09 ` Joel Schopp @ 2010-01-24 3:00 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 103+ messages in thread From: Benjamin Herrenschmidt @ 2010-01-24 3:00 UTC (permalink / raw) To: Joel Schopp Cc: Michael Neuling, Ingo Molnar, ego, linuxppc-dev, Peter Zijlstra, linux-kernel On Wed, 2010-01-20 at 16:09 -0600, Joel Schopp wrote: > I can turn that into a conditional branch (case statement) with a shift > for the common 1,2,4 cases which should cover all procs available today > falling back to a divide for any theoretical future processors that do > other numbers of threads. Look at the cputhreads.h implementation ... Today we only support power-of-two numbers of threads. Cheers, Ben. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-24 3:00 ` Benjamin Herrenschmidt @ 2010-01-25 17:50 ` Joel Schopp -1 siblings, 0 replies; 103+ messages in thread From: Joel Schopp @ 2010-01-25 17:50 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Michael Neuling, Ingo Molnar, ego, linuxppc-dev, Peter Zijlstra, linux-kernel Benjamin Herrenschmidt wrote: > On Wed, 2010-01-20 at 16:09 -0600, Joel Schopp wrote: > >> I can turn that into a conditional branch (case statement) with a shift >> for the common 1,2,4 cases which should cover all procs available today >> falling back to a divide for any theoretical future processors that do >> other numbers of threads. >> > > Look at the cputhreads.h implementation ... Today we only support > power-of-two numbers of threads. > I've run 3 threads using cpu hotplug to offline 1 of the 4. It's certainly a stupid idea, but there you go. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-25 17:50 ` Joel Schopp @ 2010-01-26 4:23 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 103+ messages in thread From: Benjamin Herrenschmidt @ 2010-01-26 4:23 UTC (permalink / raw) To: Joel Schopp Cc: Michael Neuling, Ingo Molnar, ego, linuxppc-dev, Peter Zijlstra, linux-kernel On Mon, 2010-01-25 at 11:50 -0600, Joel Schopp wrote: > > Look at the cputhreads.h implementation ... Today we only support > > power-of-two numbers of threads. > > > I've run 3 threads using cpu hotplug to offline 1 of the 4. It's > certainly a stupid idea, but there you go. Oh, you mean you need to use the actual online count ? In which case, yes, you do indeed need to be a bit more careful... In this case though, you're probably better off special casing "3" :-) Cheers, Ben. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-20 20:04 ` Joel Schopp @ 2010-01-20 21:33 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 103+ messages in thread From: Benjamin Herrenschmidt @ 2010-01-20 21:33 UTC (permalink / raw) To: Joel Schopp; +Cc: Peter Zijlstra, ego, linuxppc-dev, Ingo Molnar, linux-kernel On Wed, 2010-01-20 at 14:04 -0600, Joel Schopp wrote: > On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads > there is performance benefit to idling the higher numbered threads in > the core. > > This patch implements arch_scale_smt_power to dynamically update smt > thread power in these idle cases in order to prefer threads 0,1 over > threads 2,3 within a core. > > Signed-off-by: Joel Schopp <jschopp@austin.ibm.com> So I'll leave Peter to deal with the scheduler aspects and will focus on details :-) > --- > Index: linux-2.6.git/arch/powerpc/kernel/smp.c > =================================================================== > --- linux-2.6.git.orig/arch/powerpc/kernel/smp.c > +++ linux-2.6.git/arch/powerpc/kernel/smp.c > @@ -617,3 +617,44 @@ void __cpu_die(unsigned int cpu) > smp_ops->cpu_die(cpu); > } > #endif > + > +static inline int thread_in_smt4core(int x) > +{ > + return x % 4; > +} Needs a whitespace here though I don't really like the above. Any reason why you can't use the existing cpu_thread_in_core() ? > +unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu) > +{ > + int cpu2; > + int idle_count = 0; > + > + struct cpumask *cpu_map = sched_domain_span(sd); > + > + unsigned long weight = cpumask_weight(cpu_map); > + unsigned long smt_gain = sd->smt_gain; More whitespace damage above. > + if (cpu_has_feature(CPU_FTRS_POWER7) && weight == 4) { > + for_each_cpu(cpu2, cpu_map) { > + if (idle_cpu(cpu2)) > + idle_count++; > + } I'm not 100% sure about the use of the CPU feature above. First I wonder if the right approach is to instead do something like if (!cpu_has_feature(...) ||
weight < 4) return default_scale_smt_power(sd, cpu); Though we may be better off using a ppc_md. hook here to avoid calculating the weight etc... on processors that don't need any of that. I also dislike your naming. I would suggest you change cpu_map to sibling_map() and cpu2 to sibling (or just c). One thing I wonder is how sure we are that sched_domain_span() is always going to give us the threads btw ? If we introduce another sched domain level for NUMA purposes can't we get confused ? Also, how hot is this code path ? > + /* the following section attempts to tweak cpu power based > + * on current idleness of the threads dynamically at runtime > + */ > + if (idle_count == 2 || idle_count == 3 || idle_count == 4) { if (idle_count > 1) ? :-) > + if (thread_in_smt4core(cpu) == 0 || > + thread_in_smt4core(cpu) == 1) { int thread = cpu_thread_in_core(cpu); if (thread < 2) ... > + /* add 75 % to thread power */ > + smt_gain += (smt_gain >> 1) + (smt_gain >> 2); > + } else { > + /* subtract 75 % to thread power */ > + smt_gain = smt_gain >> 2; > + } > + } > + } > + /* default smt gain is 1178, weight is # of SMT threads */ > + smt_gain /= weight; > + > + return smt_gain; Cheers, Ben. > +} > Index: linux-2.6.git/kernel/sched_features.h > =================================================================== > --- linux-2.6.git.orig/kernel/sched_features.h > +++ linux-2.6.git/kernel/sched_features.h > @@ -107,7 +107,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, 1) > /* > * Use arch dependent cpu power functions > */ > -SCHED_FEAT(ARCH_POWER, 0) > +SCHED_FEAT(ARCH_POWER, 1) > > SCHED_FEAT(HRTICK, 0) > SCHED_FEAT(DOUBLE_TICK, 0) > ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-20 21:33 ` Benjamin Herrenschmidt @ 2010-01-20 22:36 ` Joel Schopp -1 siblings, 0 replies; 103+ messages in thread From: Joel Schopp @ 2010-01-20 22:36 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Peter Zijlstra, ego, linuxppc-dev, Ingo Molnar, linux-kernel >> + >> +static inline int thread_in_smt4core(int x) >> +{ >> + return x % 4; >> +} >> > > Needs a whitespace here though I don't really like the above. Any reason > why you can't use the existing cpu_thread_in_core() ? > I will change it to cpu_thread_in_core() > >> +unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu) >> +{ >> + int cpu2; >> + int idle_count = 0; >> + >> + struct cpumask *cpu_map = sched_domain_span(sd); >> + >> + unsigned long weight = cpumask_weight(cpu_map); >> + unsigned long smt_gain = sd->smt_gain; >> > > More whitespace damage above. > You are better than checkpatch.pl, will fix. > >> + if (cpu_has_feature(CPU_FTRS_POWER7) && weight == 4) { >> + for_each_cpu(cpu2, cpu_map) { >> + if (idle_cpu(cpu2)) >> + idle_count++; >> + } >> > > I'm not 100% sure about the use of the CPU feature above. First I wonder > if the right approach is to instead do something like > > if (!cpu_has_feature(...) !! weigth < 4) > return default_scale_smt_power(sd, cpu); > > Though we may be better off using a ppc_md. hook here to avoid > calculating the weight etc... on processors that don't need any > of that. > > I also dislike your naming. I would suggest you change cpu_map to > sibling_map() and cpu2 to sibling (or just c). One thing I wonder is how > sure we are that sched_domain_span() is always going to give us the > threads btw ? If we introduce another sched domain level for NUMA > purposes can't we get confused ? > Right now it's 100% always giving us threads. My development version of the patch had a BUG_ON() to check this. 
I expect this to stay the case in the future as the name of the function is arch_scale_smt_power(), which clearly denotes threads are expected. I am not stuck on the names, I'll change it to sibling instead of cpu2 and sibling_map instead of cpu_map. It seems clear to me either way. As for testing the ! case it seems functionally equivalent, and mine seems less confusing. Having a ppc_md hook with exactly 1 user is pointless, especially since you'll still have to calculate the weight with the ability to dynamically disable smt. > Also, how hot is this code path ? > It's every load balance, which is to say not hot, but fairly frequent. I haven't been able to measure an impact from doing very hairy calculations (without actually changing the weights) here vs not having it at all in actual end workloads. > >> + /* the following section attempts to tweak cpu power based >> + * on current idleness of the threads dynamically at runtime >> + */ >> + if (idle_count == 2 || idle_count == 3 || idle_count == 4) { >> > > if (idle_count > 1) ? :-) > Yes :) Originally I had done different weightings for each of the 3 cases, which gained in some workloads but regressed some others. But since I'm not doing that anymore I'll fold it down to > 1 ^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCHv2 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-20 20:04 ` Joel Schopp @ 2010-01-26 23:28 ` Joel Schopp -1 siblings, 0 replies; 103+ messages in thread From: Joel Schopp @ 2010-01-26 23:28 UTC (permalink / raw) To: Peter Zijlstra; +Cc: ego, linuxppc-dev, Ingo Molnar, linux-kernel, benh On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads there is performance benefit to idling the higher numbered threads in the core. This patch implements arch_scale_smt_power to dynamically update smt thread power in these idle cases in order to prefer threads 0,1 over threads 2,3 within a core. v2 - Same functionality as v1, better coding style. Signed-off-by: Joel Schopp <jschopp@austin.ibm.com> --- Version 2 addresses style and optimization, same basic functionality Index: linux-2.6.git/arch/powerpc/kernel/smp.c =================================================================== --- linux-2.6.git.orig/arch/powerpc/kernel/smp.c +++ linux-2.6.git/arch/powerpc/kernel/smp.c @@ -620,3 +620,55 @@ void __cpu_die(unsigned int cpu) smp_ops->cpu_die(cpu); } #endif + +unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu) +{ + int sibling; + int idle_count = 0; + int thread; + + struct cpumask *sibling_map = sched_domain_span(sd); + + unsigned long weight = cpumask_weight(sibling_map); + unsigned long smt_gain = sd->smt_gain; + + if (cpu_has_feature(CPU_FTR_ASYNC_SMT4) && weight == 4) { + for_each_cpu(sibling, sibling_map) { + if (idle_cpu(sibling)) + idle_count++; + } + + /* the following section attempts to tweak cpu power based + * on current idleness of the threads dynamically at runtime + */ + if (idle_count > 1) { + thread = cpu_thread_in_core(cpu); + if (thread < 2) { + /* add 75 % to thread power */ + smt_gain += (smt_gain >> 1) + (smt_gain >> 2); + } else { + /* subtract 75 % to thread power */ + smt_gain = smt_gain >> 2; + } + } + } + + /* default smt gain is 1178, weight is # of SMT threads */ + switch (weight) { + case 
1: + /*divide by 1, do nothing*/ + break; + case 2: + smt_gain = smt_gain >> 1; + break; + case 4: + smt_gain = smt_gain >> 2; + break; + default: + smt_gain /= weight; + break; + } + + return smt_gain; + +} Index: linux-2.6.git/arch/powerpc/include/asm/cputable.h =================================================================== --- linux-2.6.git.orig/arch/powerpc/include/asm/cputable.h +++ linux-2.6.git/arch/powerpc/include/asm/cputable.h @@ -195,6 +195,7 @@ extern const char *powerpc_base_platform #define CPU_FTR_SAO LONG_ASM_CONST(0x0020000000000000) #define CPU_FTR_CP_USE_DCBTZ LONG_ASM_CONST(0x0040000000000000) #define CPU_FTR_UNALIGNED_LD_STD LONG_ASM_CONST(0x0080000000000000) +#define CPU_FTR_ASYNC_SMT4 LONG_ASM_CONST(0x0100000000000000) #ifndef __ASSEMBLY__ @@ -409,7 +410,7 @@ extern const char *powerpc_base_platform CPU_FTR_MMCRA | CPU_FTR_SMT | \ CPU_FTR_COHERENT_ICACHE | CPU_FTR_LOCKLESS_TLBIE | \ CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \ - CPU_FTR_DSCR | CPU_FTR_SAO) + CPU_FTR_DSCR | CPU_FTR_SAO | CPU_FTR_ASYNC_SMT4) #define CPU_FTRS_CELL (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \ CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \ CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \ ^ permalink raw reply [flat|nested] 103+ messages in thread
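The gain arithmetic in the v2 patch can be modelled outside the kernel to see the per-thread numbers it produces. This is a standalone sketch under stated assumptions — the default smt_gain of 1178 and the SMT4 weight of 4 both come from the patch comments, and `model_smt_power` is an illustrative name, not a kernel function:

```c
#include <assert.h>

/* Model of the v2 tweak: with more than one idle thread in the core,
 * threads 0-1 get +75% gain and threads 2-3 keep only 25% of it, then
 * the gain is divided by the 4 SMT threads. */
unsigned long model_smt_power(int thread, int idle_count)
{
	unsigned long smt_gain = 1178;	/* default sd->smt_gain */

	if (idle_count > 1) {
		if (thread < 2)
			smt_gain += (smt_gain >> 1) + (smt_gain >> 2);
		else
			smt_gain >>= 2;
	}
	return smt_gain >> 2;	/* weight == 4, divide by 4 */
}
```

So a mostly idle core reports roughly 515 for threads 0-1 and 73 for threads 2-3, against 294 per thread in a busy core — which is what attracts work toward the low-numbered threads.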
* Re: [PATCHv2 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-26 23:28 ` Joel Schopp @ 2010-01-27 0:52 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 103+ messages in thread From: Benjamin Herrenschmidt @ 2010-01-27 0:52 UTC (permalink / raw) To: Joel Schopp; +Cc: Peter Zijlstra, ego, linuxppc-dev, Ingo Molnar, linux-kernel On Tue, 2010-01-26 at 17:28 -0600, Joel Schopp wrote: > On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads > there is performance benefit to idling the higher numbered threads in > the core. > > This patch implements arch_scale_smt_power to dynamically update smt > thread power in these idle cases in order to prefer threads 0,1 over > threads 2,3 within a core. > > v2 - Same functionality as v1, better coding style. Better. Some more comments... > Signed-off-by: Joel Schopp <jschopp@austin.ibm.com> > --- > Version 2 addresses style and optimization, same basic functionality > Index: linux-2.6.git/arch/powerpc/kernel/smp.c > =================================================================== > --- linux-2.6.git.orig/arch/powerpc/kernel/smp.c > +++ linux-2.6.git/arch/powerpc/kernel/smp.c > @@ -620,3 +620,55 @@ void __cpu_die(unsigned int cpu) > smp_ops->cpu_die(cpu); > } > #endif > + > +unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu) > +{ > + int sibling; > + int idle_count = 0; > + int thread; > + > + struct cpumask *sibling_map = sched_domain_span(sd); What about an early exit if !cpu_has_feature(CPU_FTR_SMT) ? That would de-facto compile it out for 32-bit CPU platforms that don't support SMT at all and avoid some overhead on POWER3,4,970... > + unsigned long weight = cpumask_weight(sibling_map); > + unsigned long smt_gain = sd->smt_gain; > + > + if (cpu_has_feature(CPU_FTR_ASYNC_SMT4) && weight == 4) { So that will only handle the case where all 4 threads are online right ? 
There is no provision for the case where the user plays tricks like offlining threads, in which case it will stop trying to "push down" processes right ? Not a big deal per-se I suppose, just something to be aware of. Also, can you add a comment as to why this is done in the code itself ? above the if (cpu_has_feature(...)) statement. > + for_each_cpu(sibling, sibling_map) { > + if (idle_cpu(sibling)) > + idle_count++; > + } > + > + /* the following section attempts to tweak cpu power based > + * on current idleness of the threads dynamically at runtime > + */ > + if (idle_count > 1) { > + thread = cpu_thread_in_core(cpu); > + if (thread < 2) { > + /* add 75 % to thread power */ > + smt_gain += (smt_gain >> 1) + (smt_gain >> 2); > + } else { > + /* subtract 75 % from thread power */ > + smt_gain = smt_gain >> 2; > + } > + } > + } > + > + /* default smt gain is 1178, weight is # of SMT threads */ > + switch (weight) { > + case 1: > + /*divide by 1, do nothing*/ > + break; > + case 2: > + smt_gain = smt_gain >> 1; > + break; > + case 4: > + smt_gain = smt_gain >> 2; > + break; > + default: > + smt_gain /= weight; > + break; > + } > + > + return smt_gain; > +} Apart from that, it looks all right to me. Cheers, Ben. ^ permalink raw reply [flat|nested] 103+ messages in thread
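Ben's offlining question comes down to the weight == 4 guard: with one thread offline the sibling mask weighs 3, the CPU_FTR_ASYNC_SMT4 branch is skipped entirely, and only the default normalization runs. A userspace model of that control flow (function and parameter names are ours; the boosted flag stands in for the idle_count > 1 && thread < 2 case):

```c
#include <assert.h>

/* Model of the weight handling from the patch: the idle-based boost is
 * only attempted for a full SMT4 sibling mask; any other weight (e.g. 3
 * after offlining one thread) falls through to plain normalization. */
static unsigned long scale_smt_gain(unsigned long smt_gain,
				    unsigned long weight, int boosted)
{
	if (weight == 4 && boosted)
		smt_gain += (smt_gain >> 1) + (smt_gain >> 2);

	switch (weight) {
	case 1:
		break;
	case 2:
		smt_gain >>= 1;
		break;
	case 4:
		smt_gain >>= 2;
		break;
	default:
		smt_gain /= weight;	/* e.g. weight 3 */
		break;
	}
	return smt_gain;
}
```

With the default gain of 1178: weight 3 always returns 392 regardless of idleness, weight 4 returns 515 for a boosted thread and 294 otherwise, which is why the asymmetric placement only operates on fully-online SMT4 cores.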
* Re: [PATCHv2 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-27 0:52 ` Benjamin Herrenschmidt @ 2010-01-28 22:39 ` Joel Schopp -1 siblings, 0 replies; 103+ messages in thread From: Joel Schopp @ 2010-01-28 22:39 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Peter Zijlstra, ego, linuxppc-dev, Ingo Molnar, linux-kernel > > What about an early exit if !cpu_has_feature(CPU_FTR_SMT) ? That would > de-facto compile it out for 32-bit CPU platforms that don't support SMT > at all and avoid some overhead on POWER3,4,970... > If the SD_SHARE_CPUPOWER flag isn't set for the sched domain this function isn't called. So an extra check here is wasteful. > >> + unsigned long weight = cpumask_weight(sibling_map); >> + unsigned long smt_gain = sd->smt_gain; >> + >> + if (cpu_has_feature(CPU_FTR_ASYNC_SMT4) && weight == 4) { >> > > So that will only handle the case where all 4 threads are online right ? > There is no provision for the case where the user play tricks like > offlining thread, in which case it will stop trying to "push down" > processes right ? Not a big deal per-se I suppose, just something to be > aware of. > I've tested it with manually offlined threads and it behaves as I'd like it to. > Also, can you add a comment as to why this is done in the code itself ? > above the if (cpu_has_feature(...)) statement. > OK. v3 coming soon with the comment. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv2 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-28 22:39 ` Joel Schopp @ 2010-01-29 1:23 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 103+ messages in thread From: Benjamin Herrenschmidt @ 2010-01-29 1:23 UTC (permalink / raw) To: Joel Schopp; +Cc: Peter Zijlstra, ego, linuxppc-dev, Ingo Molnar, linux-kernel > I've tested it with manually offlined threads and it behaves as I'd like > it to. Which is ? IE. I'm happy that you like how it behaves, but I'd like to understand how that is so I can make sure I'm also happy with it :-) Cheers, Ben. ^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCHv3 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-26 23:28 ` Joel Schopp @ 2010-01-28 23:20 ` Joel Schopp -1 siblings, 0 replies; 103+ messages in thread From: Joel Schopp @ 2010-01-28 23:20 UTC (permalink / raw) To: Peter Zijlstra Cc: ego, linuxppc-dev, Ingo Molnar, linux-kernel, benh, jschopp On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads there is performance benefit to idling the higher numbered threads in the core. This patch implements arch_scale_smt_power to dynamically update smt thread power in these idle cases in order to prefer threads 0,1 over threads 2,3 within a core. Signed-off-by: Joel Schopp <jschopp@austin.ibm.com> --- Version 2 addresses style and optimization, same basic functionality Version 3 adds a comment Index: linux-2.6.git/arch/powerpc/kernel/smp.c =================================================================== --- linux-2.6.git.orig/arch/powerpc/kernel/smp.c +++ linux-2.6.git/arch/powerpc/kernel/smp.c @@ -620,3 +620,59 @@ void __cpu_die(unsigned int cpu) smp_ops->cpu_die(cpu); } #endif + +unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu) +{ + int sibling; + int idle_count = 0; + int thread; + + /* Setup the default weight and smt_gain used by most cpus for SMT + * Power. Doing this right away covers the default case and can be + * used by cpus that modify it dynamically. 
+ */ + struct cpumask *sibling_map = sched_domain_span(sd); + unsigned long weight = cpumask_weight(sibling_map); + unsigned long smt_gain = sd->smt_gain; + + + if (cpu_has_feature(CPU_FTR_ASYNC_SMT4) && weight == 4) { + for_each_cpu(sibling, sibling_map) { + if (idle_cpu(sibling)) + idle_count++; + } + + /* the following section attempts to tweak cpu power based + * on current idleness of the threads dynamically at runtime + */ + if (idle_count > 1) { + thread = cpu_thread_in_core(cpu); + if (thread < 2) { + /* add 75 % to thread power */ + smt_gain += (smt_gain >> 1) + (smt_gain >> 2); + } else { + /* subtract 75 % from thread power */ + smt_gain = smt_gain >> 2; + } + } + } + + /* default smt gain is 1178, weight is # of SMT threads */ + switch (weight) { + case 1: + /*divide by 1, do nothing*/ + break; + case 2: + smt_gain = smt_gain >> 1; + break; + case 4: + smt_gain = smt_gain >> 2; + break; + default: + smt_gain /= weight; + break; + } + + return smt_gain; + +} Index: linux-2.6.git/arch/powerpc/include/asm/cputable.h =================================================================== --- linux-2.6.git.orig/arch/powerpc/include/asm/cputable.h +++ linux-2.6.git/arch/powerpc/include/asm/cputable.h @@ -195,6 +195,7 @@ extern const char *powerpc_base_platform #define CPU_FTR_SAO LONG_ASM_CONST(0x0020000000000000) #define CPU_FTR_CP_USE_DCBTZ LONG_ASM_CONST(0x0040000000000000) #define CPU_FTR_UNALIGNED_LD_STD LONG_ASM_CONST(0x0080000000000000) +#define CPU_FTR_ASYNC_SMT4 LONG_ASM_CONST(0x0100000000000000) #ifndef __ASSEMBLY__ @@ -409,7 +410,7 @@ extern const char *powerpc_base_platform CPU_FTR_MMCRA | CPU_FTR_SMT | \ CPU_FTR_COHERENT_ICACHE | CPU_FTR_LOCKLESS_TLBIE | \ CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \ - CPU_FTR_DSCR | CPU_FTR_SAO) + CPU_FTR_DSCR | CPU_FTR_SAO | CPU_FTR_ASYNC_SMT4) #define CPU_FTRS_CELL (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \ CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \ CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \ ^ permalink 
raw reply [flat|nested] 103+ messages in thread
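The tweak in the v3 patch only fires when more than one sibling is idle; with a single idle thread the core keeps symmetric power. The gating condition can be modeled outside the kernel with a plain array standing in for idle_cpu() (the function and parameter names here are ours, for illustration only):

```c
#include <assert.h>

/* Model of the idle scan: the power tweak only kicks in when more than
 * one of the core's siblings reports idle, mirroring the patch's
 * idle_count > 1 test. */
static int want_tweak(const int idle[], int nthreads)
{
	int sibling, idle_count = 0;

	for (sibling = 0; sibling < nthreads; sibling++)
		if (idle[sibling])
			idle_count++;

	return idle_count > 1;
}
```

So a core running three busy threads is left alone, while a core with two or more idle threads starts advertising the asymmetric per-thread powers.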
* [PATCHv3 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-28 23:20 ` Joel Schopp @ 2010-01-28 23:24 ` Joel Schopp -1 siblings, 0 replies; 103+ messages in thread From: Joel Schopp @ 2010-01-28 23:24 UTC (permalink / raw) To: Peter Zijlstra; +Cc: ego, linuxppc-dev, Ingo Molnar, linux-kernel, benh On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads there is performance benefit to idling the higher numbered threads in the core. This patch implements arch_scale_smt_power to dynamically update smt thread power in these idle cases in order to prefer threads 0,1 over threads 2,3 within a core. Signed-off-by: Joel Schopp <jschopp@austin.ibm.com> --- Version 2 addresses style and optimization, same basic functionality Version 3 adds a comment, resent due to mailing format error Index: linux-2.6.git/arch/powerpc/kernel/smp.c =================================================================== --- linux-2.6.git.orig/arch/powerpc/kernel/smp.c +++ linux-2.6.git/arch/powerpc/kernel/smp.c @@ -620,3 +620,59 @@ void __cpu_die(unsigned int cpu) smp_ops->cpu_die(cpu); } #endif + +unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu) +{ + int sibling; + int idle_count = 0; + int thread; + + /* Setup the default weight and smt_gain used by most cpus for SMT + * Power. Doing this right away covers the default case and can be + * used by cpus that modify it dynamically. 
+ */ + struct cpumask *sibling_map = sched_domain_span(sd); + unsigned long weight = cpumask_weight(sibling_map); + unsigned long smt_gain = sd->smt_gain; + + + if (cpu_has_feature(CPU_FTR_ASYNC_SMT4) && weight == 4) { + for_each_cpu(sibling, sibling_map) { + if (idle_cpu(sibling)) + idle_count++; + } + + /* the following section attempts to tweak cpu power based + * on current idleness of the threads dynamically at runtime + */ + if (idle_count > 1) { + thread = cpu_thread_in_core(cpu); + if (thread < 2) { + /* add 75 % to thread power */ + smt_gain += (smt_gain >> 1) + (smt_gain >> 2); + } else { + /* subtract 75 % from thread power */ + smt_gain = smt_gain >> 2; + } + } + } + + /* default smt gain is 1178, weight is # of SMT threads */ + switch (weight) { + case 1: + /*divide by 1, do nothing*/ + break; + case 2: + smt_gain = smt_gain >> 1; + break; + case 4: + smt_gain = smt_gain >> 2; + break; + default: + smt_gain /= weight; + break; + } + + return smt_gain; + +} Index: linux-2.6.git/arch/powerpc/include/asm/cputable.h =================================================================== --- linux-2.6.git.orig/arch/powerpc/include/asm/cputable.h +++ linux-2.6.git/arch/powerpc/include/asm/cputable.h @@ -195,6 +195,7 @@ extern const char *powerpc_base_platform #define CPU_FTR_SAO LONG_ASM_CONST(0x0020000000000000) #define CPU_FTR_CP_USE_DCBTZ LONG_ASM_CONST(0x0040000000000000) #define CPU_FTR_UNALIGNED_LD_STD LONG_ASM_CONST(0x0080000000000000) +#define CPU_FTR_ASYNC_SMT4 LONG_ASM_CONST(0x0100000000000000) #ifndef __ASSEMBLY__ @@ -409,7 +410,7 @@ extern const char *powerpc_base_platform CPU_FTR_MMCRA | CPU_FTR_SMT | \ CPU_FTR_COHERENT_ICACHE | CPU_FTR_LOCKLESS_TLBIE | \ CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \ - CPU_FTR_DSCR | CPU_FTR_SAO) + CPU_FTR_DSCR | CPU_FTR_SAO | CPU_FTR_ASYNC_SMT4) #define CPU_FTRS_CELL (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \ CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \ CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \ ^ permalink 
raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv3 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-28 23:24 ` Joel Schopp @ 2010-01-29 1:23 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 103+ messages in thread From: Benjamin Herrenschmidt @ 2010-01-29 1:23 UTC (permalink / raw) To: Joel Schopp; +Cc: Peter Zijlstra, ego, linuxppc-dev, Ingo Molnar, linux-kernel On Thu, 2010-01-28 at 17:24 -0600, Joel Schopp wrote: > On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads > there is performance benefit to idling the higher numbered threads in > the core. > > This patch implements arch_scale_smt_power to dynamically update smt > thread power in these idle cases in order to prefer threads 0,1 over > threads 2,3 within a core. Almost there :-) Joel, Peter, can you help me figure something out tho ? On machines that don't have SMT, I would like to avoid calling arch_scale_smt_power() at all if possible (in addition to not compiling it in if SMT is not enabled in .config). Now, I must say I'm utterly confused by how the domains are set up and I haven't quite managed to sort it out... it looks to me that SD_SHARE_CPUPOWER is always going to be set on all CPUs when the config option is set (though each CPU will have its own domain) or am I misguided ? IE. Is there any sense in having at least a fast exit path out of arch_scale_smt_power() for non-SMT CPUs ? Joel, can you look at compiling it out when SMT is not set ? We don't want to bloat SMP kernels for 32-bit non-SMT embedded platforms. 
Oh, and one minor nit: > Signed-off-by: Joel Schopp <jschopp@austin.ibm.com> > --- > Version 2 addresses style and optimization, same basic functionality > Version 3 adds a comment, resent due to mailing format error > Index: linux-2.6.git/arch/powerpc/kernel/smp.c > =================================================================== > --- linux-2.6.git.orig/arch/powerpc/kernel/smp.c > +++ linux-2.6.git/arch/powerpc/kernel/smp.c > @@ -620,3 +620,59 @@ void __cpu_die(unsigned int cpu) > smp_ops->cpu_die(cpu); > } > #endif ^^^ Please add the /* CONFIG_CPU_HOTPLUG */ (or whatever it is) that's missing after that #endif :-) Cheers, Ben. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv3 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-29 1:23 ` Benjamin Herrenschmidt @ 2010-01-29 10:13 ` Peter Zijlstra -1 siblings, 0 replies; 103+ messages in thread From: Peter Zijlstra @ 2010-01-29 10:13 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Joel Schopp, ego, linuxppc-dev, Ingo Molnar, linux-kernel On Fri, 2010-01-29 at 12:23 +1100, Benjamin Herrenschmidt wrote: > On machine that don't have SMT, I would like to avoid calling > arch_scale_smt_power() at all if possible (in addition to not compiling > it in if SMT is not enabled in .config). > > Now, I must say I'm utterly confused by how the domains are setup and I > haven't quite managed to sort it out... it looks to me that > SD_SHARE_CPUPOWER is always going to be set on all CPUs when the config > option is set (though each CPU will have its own domain) or am I > misguided ? IE. Is there any sense in having at least a fast exit path > out of arch_scale_smt_power() for non-SMT CPUs ? The sched_domain creation code is a f'ing stink pile that hurts everybody's brain. The AMD magny-cours people sort of cleaned it up a bit but didn't go nearly far enough. Doing so is somewhere on my todo list, but sadly that thing is way larger than my spare time. Now SD_SHARE_CPUPOWER _should_ only be set for SMT domains, because only SMT siblings share cpupower. SD_SHARE_PKG_RESOURCES _should_ be set for both SMT and MC, because they all share the same cache domain. Whether it all works out that way in practice on powerpc is another question entirely ;-) That said, I'm still not entirely convinced I like this usage of cpupower, it's supposed to be a normalization scale for load-balancing, not a placement hook. I'd be much happier with a SD_GROUP_ORDER or something like that, that works together with SD_PREFER_SIBLING to pack active tasks to cpus in ascending group order. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv3 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-29 10:13 ` Peter Zijlstra @ 2010-01-29 18:34 ` Joel Schopp -1 siblings, 0 replies; 103+ messages in thread From: Joel Schopp @ 2010-01-29 18:34 UTC (permalink / raw) To: Peter Zijlstra Cc: Benjamin Herrenschmidt, ego, linuxppc-dev, Ingo Molnar, linux-kernel > That said, I'm still not entirely convinced I like this usage of > cpupower, its supposed to be a normalization scale for load-balancing, > not a placement hook. > Even if you do a placement hook you'll need to address it in the load balancing as well. Consider a single 4 thread SMT core with 4 running tasks. If 2 of them exit the remaining 2 will need to be load balanced within the core in a way that takes into account the dynamic nature of the thread power. This patch does that. > I'd be much happier with a SD_GROUP_ORDER or something like that, that > works together with SD_PREFER_SIBLING to pack active tasks to cpus in > ascending group order. > > I don't see this load-balancing patch as mutually exclusive with a patch to fix placement. But even if it is a mutually exclusive solution there is no reason we can't fix things now with this patch and then later take it out when it's fixed another way. This patch series is straightforward, non-intrusive, and without it the scheduler is broken on this processor. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv3 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-29 1:23 ` Benjamin Herrenschmidt @ 2010-01-29 18:41 ` Joel Schopp -1 siblings, 0 replies; 103+ messages in thread From: Joel Schopp @ 2010-01-29 18:41 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Peter Zijlstra, ego, linuxppc-dev, Ingo Molnar, linux-kernel Benjamin Herrenschmidt wrote: > On Thu, 2010-01-28 at 17:24 -0600, Joel Schopp wrote: > >> On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads >> there is performance benefit to idling the higher numbered threads in >> the core. >> >> This patch implements arch_scale_smt_power to dynamically update smt >> thread power in these idle cases in order to prefer threads 0,1 over >> threads 2,3 within a core. >> > > Almost there :-) Joel, Peter, can you help me figure something out tho ? > > On machine that don't have SMT, I would like to avoid calling > arch_scale_smt_power() at all if possible (in addition to not compiling > it in if SMT is not enabled in .config). > > Now, I must say I'm utterly confused by how the domains are setup and I > haven't quite managed to sort it out... it looks to me that > SD_SHARE_CPUPOWER is always going to be set on all CPUs when the config > option is set (though each CPU will have its own domain) or am I > misguided ? IE. Is there any sense in having at least a fast exit path > out of arch_scale_smt_power() for non-SMT CPUs ? > > Joel, can you look at compiling it out when SMT is not set ? We don't > want to bloat SMP kernels for 32-bit non-SMT embedded platforms. > I can wrap the powerpc definition of arch_scale_smt in an #ifdef, if it's not there the scheduler uses the default, which is the same as it uses if SMT isn't compiled. ^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-28 23:24 ` Joel Schopp @ 2010-02-05 20:57 ` Joel Schopp -1 siblings, 0 replies; 103+ messages in thread From: Joel Schopp @ 2010-02-05 20:57 UTC (permalink / raw) To: Peter Zijlstra Cc: ego, linuxppc-dev, Ingo Molnar, linux-kernel, benh, jschopp On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads there is performance benefit to idling the higher numbered threads in the core. This patch implements arch_scale_smt_power to dynamically update smt thread power in these idle cases in order to prefer threads 0,1 over threads 2,3 within a core. Signed-off-by: Joel Schopp <jschopp@austin.ibm.com> --- Version 4 adds the #ifdef to avoid compiling on kernels that don't need it Index: linux-2.6.git/arch/powerpc/kernel/smp.c =================================================================== --- linux-2.6.git.orig/arch/powerpc/kernel/smp.c +++ linux-2.6.git/arch/powerpc/kernel/smp.c @@ -620,3 +620,61 @@ void __cpu_die(unsigned int cpu) smp_ops->cpu_die(cpu); } #endif + +#ifdef CONFIG_SCHED_SMT +unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu) +{ + int sibling; + int idle_count = 0; + int thread; + + /* Setup the default weight and smt_gain used by most cpus for SMT + * Power. Doing this right away covers the default case and can be + * used by cpus that modify it dynamically. 
+ */ + struct cpumask *sibling_map = sched_domain_span(sd); + unsigned long weight = cpumask_weight(sibling_map); + unsigned long smt_gain = sd->smt_gain; + + + if (cpu_has_feature(CPU_FTR_ASYNC_SMT4) && weight == 4) { + for_each_cpu(sibling, sibling_map) { + if (idle_cpu(sibling)) + idle_count++; + } + + /* the following section attempts to tweak cpu power based + * on current idleness of the threads dynamically at runtime + */ + if (idle_count > 1) { + thread = cpu_thread_in_core(cpu); + if (thread < 2) { + /* add 75 % to thread power */ + smt_gain += (smt_gain >> 1) + (smt_gain >> 2); + } else { + /* subtract 75 % from thread power */ + smt_gain = smt_gain >> 2; + } + } + } + + /* default smt gain is 1178, weight is # of SMT threads */ + switch (weight) { + case 1: + /* divide by 1, do nothing */ + break; + case 2: + smt_gain = smt_gain >> 1; + break; + case 4: + smt_gain = smt_gain >> 2; + break; + default: + smt_gain /= weight; + break; + } + + return smt_gain; + +} +#endif Index: linux-2.6.git/arch/powerpc/include/asm/cputable.h =================================================================== --- linux-2.6.git.orig/arch/powerpc/include/asm/cputable.h +++ linux-2.6.git/arch/powerpc/include/asm/cputable.h @@ -195,6 +195,7 @@ extern const char *powerpc_base_platform #define CPU_FTR_SAO LONG_ASM_CONST(0x0020000000000000) #define CPU_FTR_CP_USE_DCBTZ LONG_ASM_CONST(0x0040000000000000) #define CPU_FTR_UNALIGNED_LD_STD LONG_ASM_CONST(0x0080000000000000) +#define CPU_FTR_ASYNC_SMT4 LONG_ASM_CONST(0x0100000000000000) #ifndef __ASSEMBLY__ @@ -409,7 +410,7 @@ extern const char *powerpc_base_platform CPU_FTR_MMCRA | CPU_FTR_SMT | \ CPU_FTR_COHERENT_ICACHE | CPU_FTR_LOCKLESS_TLBIE | \ CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \ - CPU_FTR_DSCR | CPU_FTR_SAO) + CPU_FTR_DSCR | CPU_FTR_SAO | CPU_FTR_ASYNC_SMT4) #define CPU_FTRS_CELL (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \ CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \ CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \ ^ 
permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-02-05 20:57 ` Joel Schopp @ 2010-02-14 10:12 ` Peter Zijlstra -1 siblings, 0 replies; 103+ messages in thread From: Peter Zijlstra @ 2010-02-14 10:12 UTC (permalink / raw) To: Joel Schopp; +Cc: ego, linuxppc-dev, Ingo Molnar, linux-kernel, benh On Fri, 2010-02-05 at 14:57 -0600, Joel Schopp wrote: > On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads > there is performance benefit to idling the higher numbered threads in > the core. > > This patch implements arch_scale_smt_power to dynamically update smt > thread power in these idle cases in order to prefer threads 0,1 over > threads 2,3 within a core. > > Signed-off-by: Joel Schopp <jschopp@austin.ibm.com> > --- > Index: linux-2.6.git/arch/powerpc/kernel/smp.c > =================================================================== > --- linux-2.6.git.orig/arch/powerpc/kernel/smp.c > +++ linux-2.6.git/arch/powerpc/kernel/smp.c > @@ -620,3 +620,61 @@ void __cpu_die(unsigned int cpu) > smp_ops->cpu_die(cpu); > } > #endif > + > +#ifdef CONFIG_SCHED_SMT > +unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu) > +{ > + int sibling; > + int idle_count = 0; > + int thread; > + > + /* Setup the default weight and smt_gain used by most cpus for SMT > + * Power. Doing this right away covers the default case and can be > + * used by cpus that modify it dynamically. 
> + */ > + struct cpumask *sibling_map = sched_domain_span(sd); > + unsigned long weight = cpumask_weight(sibling_map); > + unsigned long smt_gain = sd->smt_gain; > + > + > + if (cpu_has_feature(CPU_FTR_ASYNC_SMT4) && weight == 4) { > + for_each_cpu(sibling, sibling_map) { > + if (idle_cpu(sibling)) > + idle_count++; > + } > + > + /* the following section attempts to tweak cpu power based > + * on current idleness of the threads dynamically at runtime > + */ > + if (idle_count > 1) { > + thread = cpu_thread_in_core(cpu); > + if (thread < 2) { > + /* add 75 % to thread power */ > + smt_gain += (smt_gain >> 1) + (smt_gain >> 2); > + } else { > + /* subtract 75 % to thread power */ > + smt_gain = smt_gain >> 2; > + } > + } > + } > + > + /* default smt gain is 1178, weight is # of SMT threads */ > + switch (weight) { > + case 1: > + /*divide by 1, do nothing*/ > + break; > + case 2: > + smt_gain = smt_gain >> 1; > + break; > + case 4: > + smt_gain = smt_gain >> 2; > + break; > + default: > + smt_gain /= weight; > + break; > + } > + > + return smt_gain; > + > +} > +#endif Suppose for a moment we have 2 threads (hot-unplugged thread 1 and 3, we can construct an equivalent but more complex example for 4 threads), and we have 4 tasks, 3 SCHED_OTHER of equal nice level and 1 SCHED_FIFO, the SCHED_FIFO task will consume exactly 50% walltime of whatever cpu it ends up on. In that situation, provided that each cpu's cpu_power is of equal measure, scale_rt_power() ensures that we run 2 SCHED_OTHER tasks on the cpu that doesn't run the RT task, and 1 SCHED_OTHER task next to the RT task, so that each task consumes 50%, which is all fair and proper. However, if you do the above, thread 0 will have +75% = 1.75 and thread 2 will have -75% = 0.25, then if the RT task will land on thread 0, we'll be having: 0.875 vs 0.25, or on thread 3, 1.75 vs 0.125. In either case thread 0 will receive too many (if not all) SCHED_OTHER tasks. 
That is, unless these threads 2 and 3 really are _that_ weak, at which point one wonders why IBM bothered with the silicon ;-) So tell me again, why is fiddling with the cpu_power a good placement tool? ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-02-14 10:12 ` Peter Zijlstra @ 2010-02-17 22:20 ` Michael Neuling -1 siblings, 0 replies; 103+ messages in thread From: Michael Neuling @ 2010-02-17 22:20 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Joel Schopp, Ingo Molnar, linuxppc-dev, linux-kernel, ego > Suppose for a moment we have 2 threads (hot-unplugged thread 1 and 3, we > can construct an equivalent but more complex example for 4 threads), and > we have 4 tasks, 3 SCHED_OTHER of equal nice level and 1 SCHED_FIFO, the > SCHED_FIFO task will consume exactly 50% walltime of whatever cpu it > ends up on. > > In that situation, provided that each cpu's cpu_power is of equal > measure, scale_rt_power() ensures that we run 2 SCHED_OTHER tasks on the > cpu that doesn't run the RT task, and 1 SCHED_OTHER task next to the RT > task, so that each task consumes 50%, which is all fair and proper. > > However, if you do the above, thread 0 will have +75% = 1.75 and thread > 2 will have -75% = 0.25, then if the RT task will land on thread 0, > we'll be having: 0.875 vs 0.25, or on thread 3, 1.75 vs 0.125. In either > case thread 0 will receive too many (if not all) SCHED_OTHER tasks. > > That is, unless these threads 2 and 3 really are _that_ weak, at which > point one wonders why IBM bothered with the silicon ;-) Peter, 2 & 3 aren't weaker than 0 & 1 but.... The core has dynamic SMT mode switching which is controlled by the hypervisor (IBM's PHYP). There are 3 SMT modes: SMT1 uses thread 0 SMT2 uses threads 0 & 1 SMT4 uses threads 0, 1, 2 & 3 When in any particular SMT mode, all threads have the same performance as each other (ie. at any moment in time, all threads perform the same). The SMT mode switching works such that when linux has threads 2 & 3 idle and 0 & 1 active, it will cede (H_CEDE hypercall) threads 2 and 3 in the idle loop and the hypervisor will automatically switch to SMT2 for that core (independent of other cores). 
The opposite is not true, so if threads 0 & 1 are idle and 2 & 3 are active, we will stay in SMT4 mode. Similarly if thread 0 is active and threads 1, 2 & 3 are idle, we'll go into SMT1 mode. If we can get the core into a lower SMT mode (SMT1 is best), the threads will perform better (since they share less core resources). Hence when we have idle threads, we want them to be the higher ones. So to answer your question, threads 2 and 3 aren't weaker than the other threads when in SMT4 mode. It's that if we idle threads 2 & 3, threads 0 & 1 will speed up since we'll move to SMT2 mode. I'm pretty vague on linux scheduler details, so I'm a bit at sea as to how to solve this. Can you suggest any mechanisms we currently have in the kernel to reflect these properties, or do you think we need to develop something new? If so, any pointers as to where we should look? Thanks, Mikey ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-02-17 22:20 ` Michael Neuling @ 2010-02-18 13:17 ` Peter Zijlstra -1 siblings, 0 replies; 103+ messages in thread From: Peter Zijlstra @ 2010-02-18 13:17 UTC (permalink / raw) To: Michael Neuling; +Cc: Joel Schopp, Ingo Molnar, linuxppc-dev, linux-kernel, ego On Thu, 2010-02-18 at 09:20 +1100, Michael Neuling wrote: > > Suppose for a moment we have 2 threads (hot-unplugged thread 1 and 3, we > > can construct an equivalent but more complex example for 4 threads), and > > we have 4 tasks, 3 SCHED_OTHER of equal nice level and 1 SCHED_FIFO, the > > SCHED_FIFO task will consume exactly 50% walltime of whatever cpu it > > ends up on. > > > > In that situation, provided that each cpu's cpu_power is of equal > > measure, scale_rt_power() ensures that we run 2 SCHED_OTHER tasks on the > > cpu that doesn't run the RT task, and 1 SCHED_OTHER task next to the RT > > task, so that each task consumes 50%, which is all fair and proper. > > > > However, if you do the above, thread 0 will have +75% = 1.75 and thread > > 2 will have -75% = 0.25, then if the RT task will land on thread 0, > > we'll be having: 0.875 vs 0.25, or on thread 3, 1.75 vs 0.125. In either > > case thread 0 will receive too many (if not all) SCHED_OTHER tasks. > > > > That is, unless these threads 2 and 3 really are _that_ weak, at which > > point one wonders why IBM bothered with the silicon ;-) > > Peter, > > 2 & 3 aren't weaker than 0 & 1 but.... > > The core has dynamic SMT mode switching which is controlled by the > hypervisor (IBM's PHYP). There are 3 SMT modes: > SMT1 uses thread 0 > SMT2 uses threads 0 & 1 > SMT4 uses threads 0, 1, 2 & 3 > When in any particular SMT mode, all threads have the same performance > as each other (ie. at any moment in time, all threads perform the same). 
> > The SMT mode switching works such that when linux has threads 2 & 3 idle > and 0 & 1 active, it will cede (H_CEDE hypercall) threads 2 and 3 in the > idle loop and the hypervisor will automatically switch to SMT2 for that > core (independent of other cores). The opposite is not true, so if > threads 0 & 1 are idle and 2 & 3 are active, we will stay in SMT4 mode. > > Similarly if thread 0 is active and threads 1, 2 & 3 are idle, we'll go > into SMT1 mode. > > If we can get the core into a lower SMT mode (SMT1 is best), the threads > will perform better (since they share less core resources). Hence when > we have idle threads, we want them to be the higher ones. Just out of curiosity, is this a hardware constraint or a hypervisor constraint? > So to answer your question, threads 2 and 3 aren't weaker than the other > threads when in SMT4 mode. It's that if we idle threads 2 & 3, threads > 0 & 1 will speed up since we'll move to SMT2 mode. > > I'm pretty vague on linux scheduler details, so I'm a bit at sea as to > how to solve this. Can you suggest any mechanisms we currently have in > the kernel to reflect these properties, or do you think we need to > develop something new? If so, any pointers as to where we should look? Well there currently isn't one, and I've been telling people to create a new SD_flag to reflect this and influence the f_b_g() behaviour. Something like the below perhaps, totally untested and without comments so that you'll have to reverse engineer and validate my thinking. There's one fundamental assumption, and one weakness in the implementation. 
--- include/linux/sched.h | 2 +- kernel/sched_fair.c | 61 +++++++++++++++++++++++++++++++++++++++++++++--- 2 files changed, 58 insertions(+), 5 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 0eef87b..42fa5c6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -849,7 +849,7 @@ enum cpu_idle_type { #define SD_POWERSAVINGS_BALANCE 0x0100 /* Balance for power savings */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */ - +#define SD_ASYM_PACKING 0x0800 #define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */ enum powersavings_balance_level { diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index ff7692c..7e42bfe 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -2086,6 +2086,7 @@ struct sd_lb_stats { struct sched_group *this; /* Local group in this sd */ unsigned long total_load; /* Total load of all groups in sd */ unsigned long total_pwr; /* Total power of all groups in sd */ + unsigned long total_nr_running; unsigned long avg_load; /* Average load across all groups in sd */ /** Statistics of this group */ @@ -2414,10 +2415,10 @@ static inline void update_sg_lb_stats(struct sched_domain *sd, int *balance, struct sg_lb_stats *sgs) { unsigned long load, max_cpu_load, min_cpu_load; - int i; unsigned int balance_cpu = -1, first_idle_cpu = 0; unsigned long sum_avg_load_per_task; unsigned long avg_load_per_task; + int i; if (local_group) balance_cpu = group_first_cpu(group); @@ -2493,6 +2494,28 @@ static inline void update_sg_lb_stats(struct sched_domain *sd, DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE); } +static int update_sd_pick_busiest(struct sched_domain *sd, + struct sd_lb_stats *sds, + struct sched_group *sg, + struct sg_lb_stats *sgs) +{ + if (sgs->sum_nr_running > sgs->group_capacity) + return 1; + + if (sgs->group_imb) + return 1; + + if ((sd->flags & 
SD_ASYM_PACKING) && sgs->sum_nr_running) { + if (!sds->busiest) + return 1; + + if (group_first_cpu(sds->busiest) < group_first_cpu(group)) + return 1; + } + + return 0; +} + /** * update_sd_lb_stats - Update sched_group's statistics for load balancing. * @sd: sched_domain whose statistics are to be updated. @@ -2533,6 +2556,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu, sds->total_load += sgs.group_load; sds->total_pwr += group->cpu_power; + sds->total_nr_running += sgs.sum_nr_running; /* * In case the child domain prefers tasks go to siblings @@ -2547,9 +2571,8 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu, sds->this = group; sds->this_nr_running = sgs.sum_nr_running; sds->this_load_per_task = sgs.sum_weighted_load; - } else if (sgs.avg_load > sds->max_load && - (sgs.sum_nr_running > sgs.group_capacity || - sgs.group_imb)) { + } else if (sgs.avg_load >= sds->max_load && + update_sd_pick_busiest(sd, sds, group, &sgs)) { sds->max_load = sgs.avg_load; sds->busiest = group; sds->busiest_nr_running = sgs.sum_nr_running; @@ -2562,6 +2585,33 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu, } while (group != sd->groups); } +static int check_asym_packing(struct sched_domain *sd, + struct sd_lb_stats *sds, + int cpu, unsigned long *imbalance) +{ + int i, cpu, busiest_cpu; + + if (!(sd->flags & SD_ASYM_PACKING)) + return 0; + + if (!sds->busiest) + return 0; + + i = 0; + busiest_cpu = group_first_cpu(sds->busiest); + for_each_cpu(cpu, sched_domain_span(sd)) { + i++; + if (cpu == busiest_cpu) + break; + } + + if (sds->total_nr_running > i) + return 0; + + *imbalance = sds->max_load; + return 1; +} + /** * fix_small_imbalance - Calculate the minor imbalance that exists * amongst the groups of a sched_domain, during @@ -2761,6 +2811,9 @@ find_busiest_group(struct sched_domain *sd, int this_cpu, return sds.busiest; out_balanced: + if (check_asym_packing(sd, &sds, this_cpu, 
imbalance)) + return sds.busiest; + /* * There is no obvious imbalance. But check if we can do some balancing * to save power. ^ permalink raw reply related [flat|nested] 103+ messages in thread
* Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-02-18 13:17 ` Peter Zijlstra @ 2010-02-18 13:19 ` Peter Zijlstra -1 siblings, 0 replies; 103+ messages in thread From: Peter Zijlstra @ 2010-02-18 13:19 UTC (permalink / raw) To: Michael Neuling; +Cc: Joel Schopp, Ingo Molnar, linuxppc-dev, linux-kernel, ego On Thu, 2010-02-18 at 14:17 +0100, Peter Zijlstra wrote: > > There's one fundamental assumption, and one weakness in the > implementation. > Aside from bugs and the like.. ;-) ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-02-18 13:17 ` Peter Zijlstra @ 2010-02-18 16:28 ` Joel Schopp -1 siblings, 0 replies; 103+ messages in thread From: Joel Schopp @ 2010-02-18 16:28 UTC (permalink / raw) To: Peter Zijlstra Cc: Michael Neuling, Ingo Molnar, linuxppc-dev, linux-kernel, ego Sorry for the slow reply, was on vacation. Mikey seems to have answered pretty well though. >>> That is, unless these threads 2 and 3 really are _that_ weak, at which >>> point one wonders why IBM bothered with the silicon ;-) >>> >> Peter, >> >> 2 & 3 aren't weaker than 0 & 1 but.... >> >> The core has dynamic SMT mode switching which is controlled by the >> hypervisor (IBM's PHYP). There are 3 SMT modes: >> SMT1 uses thread 0 >> SMT2 uses threads 0 & 1 >> SMT4 uses threads 0, 1, 2 & 3 >> When in any particular SMT mode, all threads have the same performance >> as each other (ie. at any moment in time, all threads perform the same). >> >> The SMT mode switching works such that when linux has threads 2 & 3 idle >> and 0 & 1 active, it will cede (H_CEDE hypercall) threads 2 and 3 in the >> idle loop and the hypervisor will automatically switch to SMT2 for that >> core (independent of other cores). The opposite is not true, so if >> threads 0 & 1 are idle and 2 & 3 are active, we will stay in SMT4 mode. >> >> Similarly if thread 0 is active and threads 1, 2 & 3 are idle, we'll go >> into SMT1 mode. >> >> If we can get the core into a lower SMT mode (SMT1 is best), the threads >> will perform better (since they share less core resources). Hence when >> we have idle threads, we want them to be the higher ones. >> > > Just out of curiosity, is this a hardware constraint or a hypervisor > constraint? > hardware > >> So to answer your question, threads 2 and 3 aren't weaker than the other >> threads when in SMT4 mode. It's that if we idle threads 2 & 3, threads >> 0 & 1 will speed up since we'll move to SMT2 mode. 
>> >> I'm pretty vague on linux scheduler details, so I'm a bit at sea as to >> how to solve this. Can you suggest any mechanisms we currently have in >> the kernel to reflect these properties, or do you think we need to >> develop something new? If so, any pointers as to where we should look? >> > > Since the threads speed up we'd need to change their weights at runtime regardless of placement. It just seems to make sense to let the changed weights affect placement naturally at the same time. > Well there currently isn't one, and I've been telling people to create a > new SD_flag to reflect this and influence the f_b_g() behaviour. > > Something like the below perhaps, totally untested and without comments > so that you'll have to reverse engineer and validate my thinking. > > There's one fundamental assumption, and one weakness in the > implementation. > I'm going to guess the weakness is that it doesn't adjust the cpu power so tasks running in SMT1 mode actually get more than they account for? What's the assumption? ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-02-18 16:28 ` Joel Schopp @ 2010-02-18 17:08 ` Peter Zijlstra -1 siblings, 0 replies; 103+ messages in thread From: Peter Zijlstra @ 2010-02-18 17:08 UTC (permalink / raw) To: Joel Schopp; +Cc: Michael Neuling, Ingo Molnar, linuxppc-dev, linux-kernel, ego On Thu, 2010-02-18 at 10:28 -0600, Joel Schopp wrote: > > There's one fundamental assumption, and one weakness in the > > implementation. > > > I'm going to guess the weakness is that it doesn't adjust the cpu power > so tasks running in SMT1 mode actually get more than they account for? No, but you're right, if these SMTx modes are running at different frequencies then yes that needs to happen as well. The weakness is failing to do the right thing in the presence of a 'strategically' placed RT task. Suppose: Sibling0, Sibling1, Sibling2, Sibling3 idle OTHER OTHER FIFO it might not manage to migrate a task to 0 because it ends up selecting 3 as busiest. It doesn't at all influence RT placement, but it does look at nr_running (which does include RT tasks) > What's the assumption? That cpu_of(Sibling n) < cpu_of(Sibling n+1) ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-02-18 13:17 ` Peter Zijlstra @ 2010-02-19 6:05 ` Michael Neuling -1 siblings, 0 replies; 103+ messages in thread From: Michael Neuling @ 2010-02-19 6:05 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Joel Schopp, Ingo Molnar, linuxppc-dev, linux-kernel, ego > On Thu, 2010-02-18 at 09:20 +1100, Michael Neuling wrote: > > > Suppose for a moment we have 2 threads (hot-unplugged thread 1 and 3, we > > > can construct an equivalent but more complex example for 4 threads), and > > > we have 4 tasks, 3 SCHED_OTHER of equal nice level and 1 SCHED_FIFO, the > > > SCHED_FIFO task will consume exactly 50% walltime of whatever cpu it > > > ends up on. > > > > > > In that situation, provided that each cpu's cpu_power is of equal > > > measure, scale_rt_power() ensures that we run 2 SCHED_OTHER tasks on the > > > cpu that doesn't run the RT task, and 1 SCHED_OTHER task next to the RT > > > task, so that each task consumes 50%, which is all fair and proper. > > > > > > However, if you do the above, thread 0 will have +75% = 1.75 and thread > > > 2 will have -75% = 0.25, then if the RT task will land on thread 0, > > > we'll be having: 0.875 vs 0.25, or on thread 3, 1.75 vs 0.125. In either > > > case thread 0 will receive too many (if not all) SCHED_OTHER tasks. > > > > > > That is, unless these threads 2 and 3 really are _that_ weak, at which > > > point one wonders why IBM bothered with the silicon ;-) > > > > Peter, > > > > 2 & 3 aren't weaker than 0 & 1 but.... > > > > The core has dynamic SMT mode switching which is controlled by the > > hypervisor (IBM's PHYP). There are 3 SMT modes: > > SMT1 uses thread 0 > > SMT2 uses threads 0 & 1 > > SMT4 uses threads 0, 1, 2 & 3 > > When in any particular SMT mode, all threads have the same performance > > as each other (ie. at any moment in time, all threads perform the same). 
> > > > The SMT mode switching works such that when linux has threads 2 & 3 idle > > and 0 & 1 active, it will cede (H_CEDE hypercall) threads 2 and 3 in the > > idle loop and the hypervisor will automatically switch to SMT2 for that > > core (independent of other cores). The opposite is not true, so if > > threads 0 & 1 are idle and 2 & 3 are active, we will stay in SMT4 mode. > > > > Similarly if thread 0 is active and threads 1, 2 & 3 are idle, we'll go > > into SMT1 mode. > > > > If we can get the core into a lower SMT mode (SMT1 is best), the threads > > will perform better (since they share less core resources). Hence when > > we have idle threads, we want them to be the higher ones. > > Just out of curiosity, is this a hardware constraint or a hypervisor > constraint? > > > So to answer your question, threads 2 and 3 aren't weaker than the other > > threads when in SMT4 mode. It's that if we idle threads 2 & 3, threads > > 0 & 1 will speed up since we'll move to SMT2 mode. > > > > I'm pretty vague on linux scheduler details, so I'm a bit at sea as to > > how to solve this. Can you suggest any mechanisms we currently have in > > the kernel to reflect these properties, or do you think we need to > > develop something new? If so, any pointers as to where we should look? > > Well there currently isn't one, and I've been telling people to create a > new SD_flag to reflect this and influence the f_b_g() behaviour. > > Something like the below perhaps, totally untested and without comments > so that you'll have to reverse engineer and validate my thinking. > > There's one fundamental assumption, and one weakness in the > implementation. Thanks for the help. I'm still trying to get up to speed with how this works but while trying to cleanup and compile your patch, I had some simple questions below... 
> > --- > > include/linux/sched.h | 2 +- > kernel/sched_fair.c | 61 +++++++++++++++++++++++++++++++++++++++++++++-- - > 2 files changed, 58 insertions(+), 5 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 0eef87b..42fa5c6 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -849,7 +849,7 @@ enum cpu_idle_type { > #define SD_POWERSAVINGS_BALANCE 0x0100 /* Balance for power savings */ > #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ > #define SD_SERIALIZE 0x0400 /* Only a single load balancing instanc e */ > - > +#define SD_ASYM_PACKING 0x0800 Would we eventually add this to SD_SIBLING_INIT in a arch specific hook, or is this ok to add it generically? > #define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling d omain */ > > enum powersavings_balance_level { > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c > index ff7692c..7e42bfe 100644 > --- a/kernel/sched_fair.c > +++ b/kernel/sched_fair.c > @@ -2086,6 +2086,7 @@ struct sd_lb_stats { > struct sched_group *this; /* Local group in this sd */ > unsigned long total_load; /* Total load of all groups in sd */ > unsigned long total_pwr; /* Total power of all groups in sd */ > + unsigned long total_nr_running; > unsigned long avg_load; /* Average load across all groups in sd */ > > /** Statistics of this group */ > @@ -2414,10 +2415,10 @@ static inline void update_sg_lb_stats(struct sched_do main *sd, > int *balance, struct sg_lb_stats *sgs) > { > unsigned long load, max_cpu_load, min_cpu_load; > - int i; > unsigned int balance_cpu = -1, first_idle_cpu = 0; > unsigned long sum_avg_load_per_task; > unsigned long avg_load_per_task; > + int i; > > if (local_group) > balance_cpu = group_first_cpu(group); > @@ -2493,6 +2494,28 @@ static inline void update_sg_lb_stats(struct sched_dom ain *sd, > DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE); > } > > +static int update_sd_pick_busiest(struct sched_domain *sd, > 
+ struct sd_lb_stats *sds, > + struct sched_group *sg, > + struct sg_lb_stats *sgs) > +{ > + if (sgs->sum_nr_running > sgs->group_capacity) > + return 1; > + > + if (sgs->group_imb) > + return 1; > + > + if ((sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running) { > + if (!sds->busiest) > + return 1; > + > + if (group_first_cpu(sds->busiest) < group_first_cpu(group)) "group" => "sg" here? (I get a compile error otherwise) > + return 1; > + } > + > + return 0; > +} > + > /** > * update_sd_lb_stats - Update sched_group's statistics for load balancing. > * @sd: sched_domain whose statistics are to be updated. > @@ -2533,6 +2556,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu, > > sds->total_load += sgs.group_load; > sds->total_pwr += group->cpu_power; > + sds->total_nr_running += sgs.sum_nr_running; > > /* > * In case the child domain prefers tasks go to siblings > @@ -2547,9 +2571,8 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu, > sds->this = group; > sds->this_nr_running = sgs.sum_nr_running; > sds->this_load_per_task = sgs.sum_weighted_load; > - } else if (sgs.avg_load > sds->max_load && > - (sgs.sum_nr_running > sgs.group_capacity || > - sgs.group_imb)) { > + } else if (sgs.avg_load >= sds->max_load && > + update_sd_pick_busiest(sd, sds, group, &sgs)) { > sds->max_load = sgs.avg_load; > sds->busiest = group; > sds->busiest_nr_running = sgs.sum_nr_running; > @@ -2562,6 +2585,33 @@ static inline void update_sd_lb_stats(struct sched_dom ain *sd, int this_cpu, > } while (group != sd->groups); > } > > +static int check_asym_packing(struct sched_domain *sd, > + struct sd_lb_stats *sds, > + int cpu, unsigned long *imbalance) > +{ > + int i, cpu, busiest_cpu; Redefining cpu here. Looks like the cpu parameter is not really needed? 
> + > + if (!(sd->flags & SD_ASYM_PACKING)) > + return 0; > + > + if (!sds->busiest) > + return 0; > + > + i = 0; > + busiest_cpu = group_first_cpu(sds->busiest); > + for_each_cpu(cpu, sched_domain_span(sd)) { > + i++; > + if (cpu == busiest_cpu) > + break; > + } > + > + if (sds->total_nr_running > i) > + return 0; > + > + *imbalance = sds->max_load; > + return 1; > +} > + > /** > * fix_small_imbalance - Calculate the minor imbalance that exists > * amongst the groups of a sched_domain, during > @@ -2761,6 +2811,9 @@ find_busiest_group(struct sched_domain *sd, int this_cp u, > return sds.busiest; > > out_balanced: > + if (check_asym_packing(sd, &sds, this_cpu, imbalance)) > + return sds.busiest; > + > /* > * There is no obvious imbalance. But check if we can do some balancing > * to save power. > > ^ permalink raw reply [flat|nested] 103+ messages in thread
+ struct sd_lb_stats *sds, > + struct sched_group *sg, > + struct sg_lb_stats *sgs) > +{ > + if (sgs->sum_nr_running > sgs->group_capacity) > + return 1; > + > + if (sgs->group_imb) > + return 1; > + > + if ((sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running) { > + if (!sds->busiest) > + return 1; > + > + if (group_first_cpu(sds->busiest) < group_first_cpu(group)) "group" => "sg" here? (I get a compile error otherwise) > + return 1; > + } > + > + return 0; > +} > + > /** > * update_sd_lb_stats - Update sched_group's statistics for load balancing. > * @sd: sched_domain whose statistics are to be updated. > @@ -2533,6 +2556,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu, > > sds->total_load += sgs.group_load; > sds->total_pwr += group->cpu_power; > + sds->total_nr_running += sgs.sum_nr_running; > > /* > * In case the child domain prefers tasks go to siblings > @@ -2547,9 +2571,8 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu, > sds->this = group; > sds->this_nr_running = sgs.sum_nr_running; > sds->this_load_per_task = sgs.sum_weighted_load; > - } else if (sgs.avg_load > sds->max_load && > - (sgs.sum_nr_running > sgs.group_capacity || > - sgs.group_imb)) { > + } else if (sgs.avg_load >= sds->max_load && > + update_sd_pick_busiest(sd, sds, group, &sgs)) { > sds->max_load = sgs.avg_load; > sds->busiest = group; > sds->busiest_nr_running = sgs.sum_nr_running; > @@ -2562,6 +2585,33 @@ static inline void update_sd_lb_stats(struct sched_dom ain *sd, int this_cpu, > } while (group != sd->groups); > } > > +static int check_asym_packing(struct sched_domain *sd, > + struct sd_lb_stats *sds, > + int cpu, unsigned long *imbalance) > +{ > + int i, cpu, busiest_cpu; Redefining cpu here. Looks like the cpu parameter is not really needed? 
> + > + if (!(sd->flags & SD_ASYM_PACKING)) > + return 0; > + > + if (!sds->busiest) > + return 0; > + > + i = 0; > + busiest_cpu = group_first_cpu(sds->busiest); > + for_each_cpu(cpu, sched_domain_span(sd)) { > + i++; > + if (cpu == busiest_cpu) > + break; > + } > + > + if (sds->total_nr_running > i) > + return 0; > + > + *imbalance = sds->max_load; > + return 1; > +} > + > /** > * fix_small_imbalance - Calculate the minor imbalance that exists > * amongst the groups of a sched_domain, during > @@ -2761,6 +2811,9 @@ find_busiest_group(struct sched_domain *sd, int this_cp u, > return sds.busiest; > > out_balanced: > + if (check_asym_packing(sd, &sds, this_cpu, imbalance)) > + return sds.busiest; > + > /* > * There is no obvious imbalance. But check if we can do some balancing > * to save power. > > ^ permalink raw reply [flat|nested] 103+ messages in thread
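[Editorial note: the counting logic in the check_asym_packing() hunk above can be modeled as a standalone function. Everything below — the function name, the flat span array — is illustrative scaffolding for readers following along, not kernel code.]

```c
#include <assert.h>

/* Illustrative stand-in for the patch's check_asym_packing() core: walk
 * the domain span in CPU order, counting positions until the busiest
 * group's first CPU is reached.  If more tasks are running than there are
 * CPUs at or below that position, the low CPUs are already occupied and
 * nothing needs to move; otherwise report max_load as the imbalance so
 * find_busiest_group() returns the busiest group and tasks get pulled
 * toward lower-numbered CPUs. */
static int model_check_asym_packing(const int *span, int span_len,
                                    int busiest_first_cpu,
                                    unsigned long total_nr_running,
                                    unsigned long max_load,
                                    unsigned long *imbalance)
{
    int i = 0, k;

    for (k = 0; k < span_len; k++) {
        i++;
        if (span[k] == busiest_first_cpu)
            break;
    }

    if (total_nr_running > i)
        return 0;

    *imbalance = max_load;
    return 1;
}
```

With a span of {0, 1, 2, 3} and the busiest group starting at CPU 2, i ends up as 3, so one or two running tasks trigger a pull while four do not.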
* Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-02-19 6:05 ` Michael Neuling @ 2010-02-19 10:01 ` Peter Zijlstra -1 siblings, 0 replies; 103+ messages in thread From: Peter Zijlstra @ 2010-02-19 10:01 UTC (permalink / raw) To: Michael Neuling; +Cc: Joel Schopp, Ingo Molnar, linuxppc-dev, linux-kernel, ego On Fri, 2010-02-19 at 17:05 +1100, Michael Neuling wrote: > > include/linux/sched.h | 2 +- > > kernel/sched_fair.c | 61 +++++++++++++++++++++++++++++++++++++++++++++-- > - > > 2 files changed, 58 insertions(+), 5 deletions(-) > > > > diff --git a/include/linux/sched.h b/include/linux/sched.h > > index 0eef87b..42fa5c6 100644 > > --- a/include/linux/sched.h > > +++ b/include/linux/sched.h > > @@ -849,7 +849,7 @@ enum cpu_idle_type { > > #define SD_POWERSAVINGS_BALANCE 0x0100 /* Balance for power savings */ > > #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg > resources */ > > #define SD_SERIALIZE 0x0400 /* Only a single load balancing instanc > e */ > > - > > +#define SD_ASYM_PACKING 0x0800 > > Would we eventually add this to SD_SIBLING_INIT in a arch specific hook, > or is this ok to add it generically? I'd think we'd want to keep that limited to architectures that actually need it. > > > +static int update_sd_pick_busiest(struct sched_domain *sd, > > + struct sd_lb_stats *sds, > > + struct sched_group *sg, > > + struct sg_lb_stats *sgs) > > +{ > > + if (sgs->sum_nr_running > sgs->group_capacity) > > + return 1; > > + > > + if (sgs->group_imb) > > + return 1; > > + > > + if ((sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running) { > > + if (!sds->busiest) > > + return 1; > > + > > + if (group_first_cpu(sds->busiest) < group_first_cpu(group)) > > "group" => "sg" here? (I get a compile error otherwise) Oh, quite ;-) > > +static int check_asym_packing(struct sched_domain *sd, > > + struct sd_lb_stats *sds, > > + int cpu, unsigned long *imbalance) > > +{ > > + int i, cpu, busiest_cpu; > > Redefining cpu here. 
Looks like the cpu parameter is not really needed? Seems that way indeed, I went back and forth a few times on the actual implementation of this function (which started out live as a copy of check_power_save_busiest_group), its amazing there were only these two compile glitches ;-) > > + > > + if (!(sd->flags & SD_ASYM_PACKING)) > > + return 0; > > + > > + if (!sds->busiest) > > + return 0; > > + > > + i = 0; > > + busiest_cpu = group_first_cpu(sds->busiest); > > + for_each_cpu(cpu, sched_domain_span(sd)) { > > + i++; > > + if (cpu == busiest_cpu) > > + break; > > + } > > + > > + if (sds->total_nr_running > i) > > + return 0; > > + > > + *imbalance = sds->max_load; > > + return 1; > > +} ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-02-19 10:01 ` Peter Zijlstra @ 2010-02-19 11:01 ` Michael Neuling -1 siblings, 0 replies; 103+ messages in thread From: Michael Neuling @ 2010-02-19 11:01 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Joel Schopp, Ingo Molnar, linuxppc-dev, linux-kernel, ego In message <1266573672.1806.70.camel@laptop> you wrote: > On Fri, 2010-02-19 at 17:05 +1100, Michael Neuling wrote: > > > include/linux/sched.h | 2 +- > > > kernel/sched_fair.c | 61 +++++++++++++++++++++++++++++++++++++++++++ ++-- > > - > > > 2 files changed, 58 insertions(+), 5 deletions(-) > > > > > > diff --git a/include/linux/sched.h b/include/linux/sched.h > > > index 0eef87b..42fa5c6 100644 > > > --- a/include/linux/sched.h > > > +++ b/include/linux/sched.h > > > @@ -849,7 +849,7 @@ enum cpu_idle_type { > > > #define SD_POWERSAVINGS_BALANCE 0x0100 /* Balance for power savings */ > > > #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg > > resources */ > > > #define SD_SERIALIZE 0x0400 /* Only a single load balancing instanc > > e */ > > > - > > > +#define SD_ASYM_PACKING 0x0800 > > > > Would we eventually add this to SD_SIBLING_INIT in a arch specific hook, > > or is this ok to add it generically? > > I'd think we'd want to keep that limited to architectures that actually > need it. OK > > > > > +static int update_sd_pick_busiest(struct sched_domain *sd, > > > + struct sd_lb_stats *sds, > > > + struct sched_group *sg, > > > + struct sg_lb_stats *sgs) > > > +{ > > > + if (sgs->sum_nr_running > sgs->group_capacity) > > > + return 1; > > > + > > > + if (sgs->group_imb) > > > + return 1; > > > + > > > + if ((sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running) { > > > + if (!sds->busiest) > > > + return 1; > > > + > > > + if (group_first_cpu(sds->busiest) < group_first_cpu(group)) > > > > "group" => "sg" here? 
(I get a compile error otherwise) > > Oh, quite ;-) > > > > +static int check_asym_packing(struct sched_domain *sd, > > > + struct sd_lb_stats *sds, > > > + int cpu, unsigned long *imbalance) > > > +{ > > > + int i, cpu, busiest_cpu; > > > > Redefining cpu here. Looks like the cpu parameter is not really needed? > > Seems that way indeed, I went back and forth a few times on the actual > implementation of this function (which started out live as a copy of > check_power_save_busiest_group), its amazing there were only these two > compile glitches ;-) :-) Below are the cleanups + the arch specific bits. It doesn't change your logic at all. Obviously the PPC arch bits would need to be split into a separate patch. Compiles and boots against linux-next. Mikey --- arch/powerpc/include/asm/cputable.h | 3 + arch/powerpc/kernel/process.c | 7 +++ include/linux/sched.h | 4 +- include/linux/topology.h | 1 kernel/sched_fair.c | 64 ++++++++++++++++++++++++++++++++++-- 5 files changed, 74 insertions(+), 5 deletions(-) Index: linux-next/arch/powerpc/include/asm/cputable.h =================================================================== --- linux-next.orig/arch/powerpc/include/asm/cputable.h +++ linux-next/arch/powerpc/include/asm/cputable.h @@ -195,6 +195,7 @@ extern const char *powerpc_base_platform #define CPU_FTR_SAO LONG_ASM_CONST(0x0020000000000000) #define CPU_FTR_CP_USE_DCBTZ LONG_ASM_CONST(0x0040000000000000) #define CPU_FTR_UNALIGNED_LD_STD LONG_ASM_CONST(0x0080000000000000) +#define CPU_FTR_ASYM_SMT4 LONG_ASM_CONST(0x0100000000000000) #ifndef __ASSEMBLY__ @@ -409,7 +410,7 @@ extern const char *powerpc_base_platform CPU_FTR_MMCRA | CPU_FTR_SMT | \ CPU_FTR_COHERENT_ICACHE | CPU_FTR_LOCKLESS_TLBIE | \ CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \ - CPU_FTR_DSCR | CPU_FTR_SAO) + CPU_FTR_DSCR | CPU_FTR_SAO | CPU_FTR_ASYM_SMT4) #define CPU_FTRS_CELL (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \ CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \ CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | 
CPU_FTR_SMT | \ Index: linux-next/arch/powerpc/kernel/process.c =================================================================== --- linux-next.orig/arch/powerpc/kernel/process.c +++ linux-next/arch/powerpc/kernel/process.c @@ -1265,3 +1265,10 @@ unsigned long randomize_et_dyn(unsigned return ret; } + +int arch_sd_asym_packing(void) +{ + if (cpu_has_feature(CPU_FTR_ASYM_SMT4)) + return SD_ASYM_PACKING; + return 0; +} Index: linux-next/include/linux/sched.h =================================================================== --- linux-next.orig/include/linux/sched.h +++ linux-next/include/linux/sched.h @@ -849,7 +849,7 @@ enum cpu_idle_type { #define SD_POWERSAVINGS_BALANCE 0x0100 /* Balance for power savings */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */ - +#define SD_ASYM_PACKING 0x0800 /* Asymetric SMT packing */ #define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */ enum powersavings_balance_level { @@ -881,6 +881,8 @@ static inline int sd_balance_for_package return SD_PREFER_SIBLING; } +extern int arch_sd_asym_packing(void); + /* * Optimise SD flags for power savings: * SD_BALANCE_NEWIDLE helps agressive task consolidation and power savings. 
Index: linux-next/include/linux/topology.h =================================================================== --- linux-next.orig/include/linux/topology.h +++ linux-next/include/linux/topology.h @@ -102,6 +102,7 @@ int arch_update_cpu_topology(void); | 1*SD_SHARE_PKG_RESOURCES \ | 0*SD_SERIALIZE \ | 0*SD_PREFER_SIBLING \ + | arch_sd_asym_packing() \ , \ .last_balance = jiffies, \ .balance_interval = 1, \ Index: linux-next/kernel/sched_fair.c =================================================================== --- linux-next.orig/kernel/sched_fair.c +++ linux-next/kernel/sched_fair.c @@ -2086,6 +2086,7 @@ struct sd_lb_stats { struct sched_group *this; /* Local group in this sd */ unsigned long total_load; /* Total load of all groups in sd */ unsigned long total_pwr; /* Total power of all groups in sd */ + unsigned long total_nr_running; unsigned long avg_load; /* Average load across all groups in sd */ /** Statistics of this group */ @@ -2493,6 +2494,28 @@ static inline void update_sg_lb_stats(st DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE); } +static int update_sd_pick_busiest(struct sched_domain *sd, + struct sd_lb_stats *sds, + struct sched_group *sg, + struct sg_lb_stats *sgs) +{ + if (sgs->sum_nr_running > sgs->group_capacity) + return 1; + + if (sgs->group_imb) + return 1; + + if ((sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running) { + if (!sds->busiest) + return 1; + + if (group_first_cpu(sds->busiest) < group_first_cpu(sg)) + return 1; + } + + return 0; +} + /** * update_sd_lb_stats - Update sched_group's statistics for load balancing. * @sd: sched_domain whose statistics are to be updated. 
@@ -2533,6 +2556,7 @@ static inline void update_sd_lb_stats(st sds->total_load += sgs.group_load; sds->total_pwr += group->cpu_power; + sds->total_nr_running += sgs.sum_nr_running; /* * In case the child domain prefers tasks go to siblings @@ -2547,9 +2571,8 @@ static inline void update_sd_lb_stats(st sds->this = group; sds->this_nr_running = sgs.sum_nr_running; sds->this_load_per_task = sgs.sum_weighted_load; - } else if (sgs.avg_load > sds->max_load && - (sgs.sum_nr_running > sgs.group_capacity || - sgs.group_imb)) { + } else if (sgs.avg_load >= sds->max_load && + update_sd_pick_busiest(sd, sds, group, &sgs)) { sds->max_load = sgs.avg_load; sds->busiest = group; sds->busiest_nr_running = sgs.sum_nr_running; @@ -2562,6 +2585,38 @@ static inline void update_sd_lb_stats(st } while (group != sd->groups); } +int __weak sd_asym_packing_arch(void) +{ + return 0; +} + +static int check_asym_packing(struct sched_domain *sd, + struct sd_lb_stats *sds, + unsigned long *imbalance) +{ + int i, cpu, busiest_cpu; + + if (!(sd->flags & SD_ASYM_PACKING)) + return 0; + + if (!sds->busiest) + return 0; + + i = 0; + busiest_cpu = group_first_cpu(sds->busiest); + for_each_cpu(cpu, sched_domain_span(sd)) { + i++; + if (cpu == busiest_cpu) + break; + } + + if (sds->total_nr_running > i) + return 0; + + *imbalance = sds->max_load; + return 1; +} + /** * fix_small_imbalance - Calculate the minor imbalance that exists * amongst the groups of a sched_domain, during @@ -2761,6 +2816,9 @@ find_busiest_group(struct sched_domain * return sds.busiest; out_balanced: + if (check_asym_packing(sd, &sds, imbalance)) + return sds.busiest; + /* * There is no obvious imbalance. But check if we can do some balancing * to save power. ^ permalink raw reply [flat|nested] 103+ messages in thread
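[Editorial note: the arch-hook pattern in the patch above — a per-architecture function OR-ed into the domain flag initializer — can be sketched in isolation. The feature test below is a plain variable standing in for cpu_has_feature(CPU_FTR_ASYM_SMT4); the flag values match the patch.]

```c
#include <assert.h>

#define SD_SHARE_PKG_RESOURCES 0x0200 /* from include/linux/sched.h */
#define SD_ASYM_PACKING        0x0800 /* new flag added by the patch */

/* Stand-in for cpu_has_feature(CPU_FTR_ASYM_SMT4); in the kernel this is
 * a CPU feature bit set only on hardware with asymmetric SMT4. */
static int has_asym_smt4 = 1;

/* Mirrors the patch's arch_sd_asym_packing(): return the flag only when
 * the hardware actually benefits from asymmetric packing, so other
 * architectures contribute a zero and their topology flags are unchanged. */
static int arch_sd_asym_packing(void)
{
    return has_asym_smt4 ? SD_ASYM_PACKING : 0;
}

/* Mirrors the SD_SIBLING_INIT-style composition in topology.h, where the
 * hook's return value is OR-ed into the sibling domain's flags. */
static int sibling_domain_flags(void)
{
    return SD_SHARE_PKG_RESOURCES | arch_sd_asym_packing();
}
```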
* Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-02-19 11:01 ` Michael Neuling @ 2010-02-23 6:08 ` Michael Neuling -1 siblings, 0 replies; 103+ messages in thread From: Michael Neuling @ 2010-02-23 6:08 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Joel Schopp, Ingo Molnar, linuxppc-dev, linux-kernel, ego In message <24165.1266577276@neuling.org> you wrote: > In message <1266573672.1806.70.camel@laptop> you wrote: > > On Fri, 2010-02-19 at 17:05 +1100, Michael Neuling wrote: > > > > include/linux/sched.h | 2 +- > > > > kernel/sched_fair.c | 61 +++++++++++++++++++++++++++++++++++++++++ ++ > ++-- > > > - > > > > 2 files changed, 58 insertions(+), 5 deletions(-) > > > > > > > > diff --git a/include/linux/sched.h b/include/linux/sched.h > > > > index 0eef87b..42fa5c6 100644 > > > > --- a/include/linux/sched.h > > > > +++ b/include/linux/sched.h > > > > @@ -849,7 +849,7 @@ enum cpu_idle_type { > > > > #define SD_POWERSAVINGS_BALANCE 0x0100 /* Balance for power sa vings */ > > > > #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg > > > resources */ > > > > #define SD_SERIALIZE 0x0400 /* Only a single load balancing > instanc > > > e */ > > > > - > > > > +#define SD_ASYM_PACKING 0x0800 > > > > > > Would we eventually add this to SD_SIBLING_INIT in a arch specific hook, > > > or is this ok to add it generically? > > > > I'd think we'd want to keep that limited to architectures that actually > > need it. 
> > OK > > > > > > > > +static int update_sd_pick_busiest(struct sched_domain *sd, > > > > + struct sd_lb_stats *sds, > > > > + struct sched_group *sg, > > > > + struct sg_lb_stats *sgs) > > > > +{ > > > > + if (sgs->sum_nr_running > sgs->group_capacity) > > > > + return 1; > > > > + > > > > + if (sgs->group_imb) > > > > + return 1; > > > > + > > > > + if ((sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running) { > > > > + if (!sds->busiest) > > > > + return 1; > > > > + > > > > + if (group_first_cpu(sds->busiest) < group_first_cpu(gro up)) > > > > > > "group" => "sg" here? (I get a compile error otherwise) > > > > Oh, quite ;-) > > > > > > +static int check_asym_packing(struct sched_domain *sd, > > > > + struct sd_lb_stats *sds, > > > > + int cpu, unsigned long *imbalance) > > > > +{ > > > > + int i, cpu, busiest_cpu; > > > > > > Redefining cpu here. Looks like the cpu parameter is not really needed? > > > > Seems that way indeed, I went back and forth a few times on the actual > > implementation of this function (which started out live as a copy of > > check_power_save_busiest_group), its amazing there were only these two > > compile glitches ;-) > > :-) > > Below are the cleanups + the arch specific bits. It doesn't change your > logic at all. Obviously the PPC arch bits would need to be split into a > separate patch. > > Compiles and boots against linux-next. I have some comments on the code inline but... So when I run this, I don't get processes pulled down to the lower threads. A simple test case of running 1 CPU intensive process at SCHED_OTHER on a machine with 2 way SMT system (a POWER6 but enabling SD_ASYM_PACKING). The single processes doesn't move to lower threads as I'd hope. Also, are you sure you want to put this in generic code? It seem to be quite POWER7 specific functionality, so would be logically better in arch/powerpc. I guess some other arch *might* need it, but seems unlikely. 
> Mikey > --- > arch/powerpc/include/asm/cputable.h | 3 + > arch/powerpc/kernel/process.c | 7 +++ > include/linux/sched.h | 4 +- > include/linux/topology.h | 1 > kernel/sched_fair.c | 64 +++++++++++++++++++++++++++++++++ +-- > 5 files changed, 74 insertions(+), 5 deletions(-) > > Index: linux-next/arch/powerpc/include/asm/cputable.h > =================================================================== > --- linux-next.orig/arch/powerpc/include/asm/cputable.h > +++ linux-next/arch/powerpc/include/asm/cputable.h > @@ -195,6 +195,7 @@ extern const char *powerpc_base_platform > #define CPU_FTR_SAO LONG_ASM_CONST(0x0020000000000000) > #define CPU_FTR_CP_USE_DCBTZ LONG_ASM_CONST(0x0040000000000000) > #define CPU_FTR_UNALIGNED_LD_STD LONG_ASM_CONST(0x0080000000000000) > +#define CPU_FTR_ASYM_SMT4 LONG_ASM_CONST(0x0100000000000000) > > #ifndef __ASSEMBLY__ > > @@ -409,7 +410,7 @@ extern const char *powerpc_base_platform > CPU_FTR_MMCRA | CPU_FTR_SMT | \ > CPU_FTR_COHERENT_ICACHE | CPU_FTR_LOCKLESS_TLBIE | \ > CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \ > - CPU_FTR_DSCR | CPU_FTR_SAO) > + CPU_FTR_DSCR | CPU_FTR_SAO | CPU_FTR_ASYM_SMT4) > #define CPU_FTRS_CELL (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \ > CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \ > CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \ > Index: linux-next/arch/powerpc/kernel/process.c > =================================================================== > --- linux-next.orig/arch/powerpc/kernel/process.c > +++ linux-next/arch/powerpc/kernel/process.c > @@ -1265,3 +1265,10 @@ unsigned long randomize_et_dyn(unsigned > > return ret; > } > + > +int arch_sd_asym_packing(void) > +{ > + if (cpu_has_feature(CPU_FTR_ASYM_SMT4)) > + return SD_ASYM_PACKING; > + return 0; > +} > Index: linux-next/include/linux/sched.h > =================================================================== > --- linux-next.orig/include/linux/sched.h > +++ linux-next/include/linux/sched.h > @@ -849,7 +849,7 @@ enum cpu_idle_type { > #define 
SD_POWERSAVINGS_BALANCE 0x0100 /* Balance for power savings */ > #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ > #define SD_SERIALIZE 0x0400 /* Only a single load balancing instanc e */ > - > +#define SD_ASYM_PACKING 0x0800 /* Asymetric SMT packing */ > #define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling d omain */ > > enum powersavings_balance_level { > @@ -881,6 +881,8 @@ static inline int sd_balance_for_package > return SD_PREFER_SIBLING; > } > > +extern int arch_sd_asym_packing(void); > + > /* > * Optimise SD flags for power savings: > * SD_BALANCE_NEWIDLE helps agressive task consolidation and power savings. > Index: linux-next/include/linux/topology.h > =================================================================== > --- linux-next.orig/include/linux/topology.h > +++ linux-next/include/linux/topology.h > @@ -102,6 +102,7 @@ int arch_update_cpu_topology(void); > | 1*SD_SHARE_PKG_RESOURCES \ > | 0*SD_SERIALIZE \ > | 0*SD_PREFER_SIBLING \ > + | arch_sd_asym_packing() \ > , \ > .last_balance = jiffies, \ > .balance_interval = 1, \ > Index: linux-next/kernel/sched_fair.c > =================================================================== > --- linux-next.orig/kernel/sched_fair.c > +++ linux-next/kernel/sched_fair.c > @@ -2086,6 +2086,7 @@ struct sd_lb_stats { > struct sched_group *this; /* Local group in this sd */ > unsigned long total_load; /* Total load of all groups in sd */ > unsigned long total_pwr; /* Total power of all groups in sd */ > + unsigned long total_nr_running; > unsigned long avg_load; /* Average load across all groups in sd */ > > /** Statistics of this group */ > @@ -2493,6 +2494,28 @@ static inline void update_sg_lb_stats(st > DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE); > } > > +static int update_sd_pick_busiest(struct sched_domain *sd, > + struct sd_lb_stats *sds, > + struct sched_group *sg, > + struct sg_lb_stats *sgs) > +{ > + if (sgs->sum_nr_running > 
sgs->group_capacity) > + return 1; > + > + if (sgs->group_imb) > + return 1; > + > + if ((sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running) { If we are asymetric packing... > + if (!sds->busiest) > + return 1; This just seems to be a null pointer check. >From the tracing I've done, this is always true (always NULL) at this point so we return here. > + > + if (group_first_cpu(sds->busiest) < group_first_cpu(sg)) > + return 1; I'm a bit lost as to what this is for. Any clues you could provide would be appreciated. :-) Is the first cpu in this domain's busiest group before the first cpu in this group. If, so pick this as the busiest? Should this be the other way around if we want to pack the busiest to the first cpu? Mark it as the busiest if it's after (not before). Is group_first_cpu guaranteed to give us the first physical cpu (ie. thread 0 in our case) or are these virtualised at this point? I'm not seeing this hit anyway due to the null pointer check above. > + } > + > + return 0; > +} > + > /** > * update_sd_lb_stats - Update sched_group's statistics for load balancing. > * @sd: sched_domain whose statistics are to be updated. > @@ -2533,6 +2556,7 @@ static inline void update_sd_lb_stats(st > > sds->total_load += sgs.group_load; > sds->total_pwr += group->cpu_power; > + sds->total_nr_running += sgs.sum_nr_running; > > /* > * In case the child domain prefers tasks go to siblings > @@ -2547,9 +2571,8 @@ static inline void update_sd_lb_stats(st > sds->this = group; > sds->this_nr_running = sgs.sum_nr_running; > sds->this_load_per_task = sgs.sum_weighted_load; > - } else if (sgs.avg_load > sds->max_load && > - (sgs.sum_nr_running > sgs.group_capacity || > - sgs.group_imb)) { > + } else if (sgs.avg_load >= sds->max_load && > + update_sd_pick_busiest(sd, sds, group, &sgs)) { This is pretty clear. Moving stuff to the new function. 
> sds->max_load = sgs.avg_load; > sds->busiest = group; > sds->busiest_nr_running = sgs.sum_nr_running; > @@ -2562,6 +2585,38 @@ static inline void update_sd_lb_stats(st > } while (group != sd->groups); > } > > +int __weak sd_asym_packing_arch(void) > +{ > + return 0; > +} > + > +static int check_asym_packing(struct sched_domain *sd, > + struct sd_lb_stats *sds, > + unsigned long *imbalance) > +{ > + int i, cpu, busiest_cpu; > + > + if (!(sd->flags & SD_ASYM_PACKING)) > + return 0; > + > + if (!sds->busiest) > + return 0; > + > + i = 0; > + busiest_cpu = group_first_cpu(sds->busiest); > + for_each_cpu(cpu, sched_domain_span(sd)) { > + i++; > + if (cpu == busiest_cpu) > + break; > + } > + > + if (sds->total_nr_running > i) > + return 0; This seems to be the core of the packing logic. We make sure the busiest_cpu is not past total_nr_running. If it is we mark as imbalanced. Correct? It seems if a non zero thread/group had a pile of processes running on it and a lower thread had much less, this wouldn't fire, but I'm guessing normal load balancing would kick in that case to fix the imbalance. Any corrections to my ramblings appreciated :-) Thanks again, Mikey > + > + *imbalance = sds->max_load; > + return 1; > +} > + > /** > * fix_small_imbalance - Calculate the minor imbalance that exists > * amongst the groups of a sched_domain, during > @@ -2761,6 +2816,9 @@ find_busiest_group(struct sched_domain * > return sds.busiest; > > out_balanced: > + if (check_asym_packing(sd, &sds, imbalance)) > + return sds.busiest; > + > /* > * There is no obvious imbalance. But check if we can do some balancing > * to save power. ^ permalink raw reply [flat|nested] 103+ messages in thread
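The pick order being discussed in update_sd_pick_busiest() can be condensed into a self-contained sketch. The structs and function below are simplified stand-ins, not the kernel's types: overloaded or internally imbalanced groups win outright, and under SD_ASYM_PACKING a non-empty candidate also wins when there is no busiest yet, or when its first CPU is numbered higher than the current busiest's (so load migrates toward lower-numbered threads).

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's sd_lb_stats / sg_lb_stats. */
struct sg_stats {
	unsigned long sum_nr_running;
	unsigned long group_capacity;
	int group_imb;   /* group-internal imbalance flag */
	int first_cpu;   /* lowest-numbered CPU in the group */
};

/*
 * Mirror of the decision order quoted above. 'busiest' is NULL when no
 * busiest group has been picked yet for this domain.
 */
static int pick_busiest(int asym_packing, const struct sg_stats *busiest,
			const struct sg_stats *sg)
{
	if (sg->sum_nr_running > sg->group_capacity)
		return 1;

	if (sg->group_imb)
		return 1;

	if (asym_packing && sg->sum_nr_running) {
		if (!busiest)
			return 1;

		/* candidate sits on higher-numbered CPUs: prefer it */
		if (busiest->first_cpu < sg->first_cpu)
			return 1;
	}

	return 0;
}
```

This also makes the NULL check's role visible: with a single runnable task, the first non-empty group encountered becomes busiest immediately, which matches the tracing observation that the pointer is always NULL at that point.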

* Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-02-23 6:08 ` Michael Neuling @ 2010-02-23 16:24 ` Peter Zijlstra -1 siblings, 0 replies; 103+ messages in thread From: Peter Zijlstra @ 2010-02-23 16:24 UTC (permalink / raw) To: Michael Neuling; +Cc: Joel Schopp, Ingo Molnar, linuxppc-dev, linux-kernel, ego On Tue, 2010-02-23 at 17:08 +1100, Michael Neuling wrote: > I have some comments on the code inline but... > > So when I run this, I don't get processes pulled down to the lower > threads. A simple test case of running 1 CPU intensive process at > SCHED_OTHER on a machine with 2 way SMT system (a POWER6 but enabling > SD_ASYM_PACKING). The single processes doesn't move to lower threads as > I'd hope. > > Also, are you sure you want to put this in generic code? It seem to be > quite POWER7 specific functionality, so would be logically better in > arch/powerpc. I guess some other arch *might* need it, but seems > unlikely. Well, there are no arch hooks in the load-balancing (aside from the recent cpu_power stuff, and that really is the wrong thing to poke at for this), and I did hear some other people express interest in such a constraint. Also, load-balancing is complex enough as it is, so I prefer to keep everything in the generic code where possible, clearly things like sched_domain creation need arch topology bits, and the arch_scale* things require other arch information like cpu frequency. > > @@ -2493,6 +2494,28 @@ static inline void update_sg_lb_stats(st > > DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE); > > } > > > > +static int update_sd_pick_busiest(struct sched_domain *sd, > > + struct sd_lb_stats *sds, > > + struct sched_group *sg, > > + struct sg_lb_stats *sgs) > > +{ > > + if (sgs->sum_nr_running > sgs->group_capacity) > > + return 1; > > + > > + if (sgs->group_imb) > > + return 1; > > + > > + if ((sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running) { > > If we are asymetric packing... 
> > > > + if (!sds->busiest) > > + return 1; > > This just seems to be a null pointer check. > > From the tracing I've done, this is always true (always NULL) at this > point so we return here. Right, so we need to have a busiest group to take a task from, if there is no busiest yet, take this group. And in your scenario, with there being only a single task, we'd only hit this once at most, so yes it makes sense this is always NULL. > > + > > + if (group_first_cpu(sds->busiest) < group_first_cpu(sg)) > > + return 1; > > I'm a bit lost as to what this is for. Any clues you could provide > would be appreciated. :-) > > Is the first cpu in this domain's busiest group before the first cpu in > this group. If, so pick this as the busiest? > > Should this be the other way around if we want to pack the busiest to > the first cpu? Mark it as the busiest if it's after (not before). > > Is group_first_cpu guaranteed to give us the first physical cpu (ie. > thread 0 in our case) or are these virtualised at this point? > > I'm not seeing this hit anyway due to the null pointer check above. So this says, if all things being equal, and we already have a busiest, but this candidate (sg) is higher than the current (busiest) take this one. The idea is to move the highest SMT task down. 
> > @@ -2562,6 +2585,38 @@ static inline void update_sd_lb_stats(st > > } while (group != sd->groups); > > } > > > > +int __weak sd_asym_packing_arch(void) > > +{ > > + return 0; > > +} arch_sd_asym_packing() is what you used in topology.h > > +static int check_asym_packing(struct sched_domain *sd, > > + struct sd_lb_stats *sds, > > + unsigned long *imbalance) > > +{ > > + int i, cpu, busiest_cpu; > > + > > + if (!(sd->flags & SD_ASYM_PACKING)) > > + return 0; > > + > > + if (!sds->busiest) > > + return 0; > > + > > + i = 0; > > + busiest_cpu = group_first_cpu(sds->busiest); > > + for_each_cpu(cpu, sched_domain_span(sd)) { > > + i++; > > + if (cpu == busiest_cpu) > > + break; > > + } > > + > > + if (sds->total_nr_running > i) > > + return 0; > > This seems to be the core of the packing logic. > > We make sure the busiest_cpu is not past total_nr_running. If it is we > mark as imbalanced. Correct? > > It seems if a non zero thread/group had a pile of processes running on > it and a lower thread had much less, this wouldn't fire, but I'm > guessing normal load balancing would kick in that case to fix the > imbalance. > > Any corrections to my ramblings appreciated :-) Right, so we're concerned the scenario where there's less tasks than SMT siblings, if there's more they should all be running and the regular load-balancer will deal with it. If there's less the group will normally be balanced and we fall out and end up in check_asym_packing(). So what I tried doing with that loop is detect if there's a hole in the packing before busiest. Now that I think about it, what we need to check is if this_cpu (the removed cpu argument) is idle and less than busiest. 
So something like: static int check_asym_pacing(struct sched_domain *sd, struct sd_lb_stats *sds, int this_cpu, unsigned long *imbalance) { int busiest_cpu; if (!(sd->flags & SD_ASYM_PACKING)) return 0; if (!sds->busiest) return 0; busiest_cpu = group_first_cpu(sds->busiest); if (cpu_rq(this_cpu)->nr_running || this_cpu > busiest_cpu) return 0; *imbalance = (sds->max_load * sds->busiest->cpu_power) / SCHED_LOAD_SCALE; return 1; } Does that make sense? I still see two problems with this though,.. regular load-balancing only balances on the first cpu of a domain (see the *balance = 0, condition in update_sg_lb_stats()), this means that if SMT[12] are idle we'll not pull properly. Also, nohz balancing might mess this up further. We could maybe play some games with the balance decision in update_sg_lb_stats() for SD_ASYM_PACKING domains and idle == CPU_IDLE, no ideas yet on nohz though. ^ permalink raw reply [flat|nested] 103+ messages in thread
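Peter's proposed check_asym_packing() replacement can be exercised standalone. The sketch below flattens the kernel structures into plain parameters (a hypothetical simplification: busiest_first_cpu < 0 stands in for sds->busiest being NULL), keeping the two gating conditions and the imbalance computation intact.

```c
#include <assert.h>

#define SCHED_LOAD_SCALE 1024UL

/*
 * this_cpu may only pull for asym packing when it is idle and numbered
 * lower than the busiest group's first CPU; the imbalance handed back
 * is the busiest group's entire weighted load.
 */
static int check_asym_packing_sketch(int asym_packing,
				     int busiest_first_cpu, /* -1: no busiest */
				     unsigned long this_nr_running,
				     int this_cpu,
				     unsigned long max_load,
				     unsigned long busiest_power,
				     unsigned long *imbalance)
{
	if (!asym_packing)
		return 0;

	if (busiest_first_cpu < 0)
		return 0;

	if (this_nr_running || this_cpu > busiest_first_cpu)
		return 0;

	*imbalance = (max_load * busiest_power) / SCHED_LOAD_SCALE;
	return 1;
}
```

The `this_cpu > busiest_cpu` test is what keeps the pull one-directional: a higher-numbered idle sibling never steals from a lower-numbered one, only the reverse.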
* Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-02-23 16:24 ` Peter Zijlstra @ 2010-02-23 16:30 ` Peter Zijlstra -1 siblings, 0 replies; 103+ messages in thread From: Peter Zijlstra @ 2010-02-23 16:30 UTC (permalink / raw) To: Michael Neuling; +Cc: Joel Schopp, Ingo Molnar, linuxppc-dev, linux-kernel, ego On Tue, 2010-02-23 at 17:24 +0100, Peter Zijlstra wrote: > > busiest_cpu = group_first_cpu(sds->busiest); > if (cpu_rq(this_cpu)->nr_running || this_cpu > busiest_cpu) > return 0; Hmm, we could change the bit in find_busiest_group() to: if (idle == CPU_IDLE && check_asym_packing()) and skip the nr_running test.. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-02-23 16:24 ` Peter Zijlstra @ 2010-02-24 6:07 ` Michael Neuling -1 siblings, 0 replies; 103+ messages in thread From: Michael Neuling @ 2010-02-24 6:07 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Joel Schopp, Ingo Molnar, linuxppc-dev, linux-kernel, ego In message <1266942281.11845.521.camel@laptop> you wrote: > On Tue, 2010-02-23 at 17:08 +1100, Michael Neuling wrote: > > > I have some comments on the code inline but... > > > > So when I run this, I don't get processes pulled down to the lower > > threads. A simple test case of running 1 CPU intensive process at > > SCHED_OTHER on a machine with 2 way SMT system (a POWER6 but enabling > > SD_ASYM_PACKING). The single processes doesn't move to lower threads as > > I'd hope. > > > > Also, are you sure you want to put this in generic code? It seem to be > > quite POWER7 specific functionality, so would be logically better in > > arch/powerpc. I guess some other arch *might* need it, but seems > > unlikely. > > Well, there are no arch hooks in the load-balancing (aside from the > recent cpu_power stuff, and that really is the wrong thing to poke at > for this), and I did hear some other people express interest in such a > constraint. Interesting. Can I ask, were people interesting in this at the SMT level or higher in the hierarchy? > Also, load-balancing is complex enough as it is, so I prefer to keep > everything in the generic code where possible, clearly things like > sched_domain creation need arch topology bits, and the arch_scale* > things require other arch information like cpu frequency. 
OK > > > @@ -2493,6 +2494,28 @@ static inline void update_sg_lb_stats(st > > > DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE); > > > } > > > > > > +static int update_sd_pick_busiest(struct sched_domain *sd, > > > + struct sd_lb_stats *sds, > > > + struct sched_group *sg, > > > + struct sg_lb_stats *sgs) > > > +{ > > > + if (sgs->sum_nr_running > sgs->group_capacity) > > > + return 1; > > > + > > > + if (sgs->group_imb) > > > + return 1; > > > + > > > + if ((sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running) { > > > > If we are asymetric packing... > > > > > > > + if (!sds->busiest) > > > + return 1; > > > > This just seems to be a null pointer check. > > > > From the tracing I've done, this is always true (always NULL) at this > > point so we return here. > > Right, so we need to have a busiest group to take a task from, if there > is no busiest yet, take this group. > > And in your scenario, with there being only a single task, we'd only hit > this once at most, so yes it makes sense this is always NULL. OK > > > > + > > > + if (group_first_cpu(sds->busiest) < group_first_cpu(sg)) > > > + return 1; > > > > I'm a bit lost as to what this is for. Any clues you could provide > > would be appreciated. :-) > > > > Is the first cpu in this domain's busiest group before the first cpu in > > this group. If, so pick this as the busiest? > > > > Should this be the other way around if we want to pack the busiest to > > the first cpu? Mark it as the busiest if it's after (not before). > > > > Is group_first_cpu guaranteed to give us the first physical cpu (ie. > > thread 0 in our case) or are these virtualised at this point? > > > > I'm not seeing this hit anyway due to the null pointer check above. > > So this says, if all things being equal, and we already have a busiest, > but this candidate (sg) is higher than the current (busiest) take this > one. > > The idea is to move the highest SMT task down. 
So in the p7 case, I don't think this is required, as the threads are all of the same performance when in a given SMT mode. So we don't need to change the order if the lower groups are busy anyway. It's only once they become idle that we'd need to rebalance. That being said, it probably doesn't hurt? > > > > @@ -2562,6 +2585,38 @@ static inline void update_sd_lb_stats(st > > > } while (group != sd->groups); > > > } > > > > > > +int __weak sd_asym_packing_arch(void) > > > +{ > > > + return 0; > > > +} > > arch_sd_asym_packing() is what you used in topology.h Oops, thanks. That would make the function even weaker than I'd intended :-) > > > +static int check_asym_packing(struct sched_domain *sd, > > > + struct sd_lb_stats *sds, > > > + unsigned long *imbalance) > > > +{ > > > + int i, cpu, busiest_cpu; > > > + > > > + if (!(sd->flags & SD_ASYM_PACKING)) > > > + return 0; > > > + > > > + if (!sds->busiest) > > > + return 0; > > > + > > > + i = 0; > > > + busiest_cpu = group_first_cpu(sds->busiest); > > > + for_each_cpu(cpu, sched_domain_span(sd)) { > > > + i++; > > > + if (cpu == busiest_cpu) > > > + break; > > > + } > > > + > > > + if (sds->total_nr_running > i) > > > + return 0; > > > > This seems to be the core of the packing logic. > > > > We make sure the busiest_cpu is not past total_nr_running. If it is we > > mark as imbalanced. Correct? > > > > It seems if a non zero thread/group had a pile of processes running on > > it and a lower thread had much less, this wouldn't fire, but I'm > > guessing normal load balancing would kick in that case to fix the > > imbalance. > > > > Any corrections to my ramblings appreciated :-) > > Right, so we're concerned the scenario where there's less tasks than SMT > siblings, if there's more they should all be running and the regular > load-balancer will deal with it. Yes > If there's less the group will normally be balanced and we fall out and > end up in check_asym_packing(). 
> > So what I tried doing with that loop is detect if there's a hole in the > packing before busiest. Now that I think about it, what we need to check > is if this_cpu (the removed cpu argument) is idle and less than busiest. > > So something like: > > static int check_asym_pacing(struct sched_domain *sd, > struct sd_lb_stats *sds, > int this_cpu, unsigned long *imbalance) > { > int busiest_cpu; > > if (!(sd->flags & SD_ASYM_PACKING)) > return 0; > > if (!sds->busiest) > return 0; > > busiest_cpu = group_first_cpu(sds->busiest); > if (cpu_rq(this_cpu)->nr_running || this_cpu > busiest_cpu) > return 0; > > *imbalance = (sds->max_load * sds->busiest->cpu_power) / > SCHED_LOAD_SCALE; > return 1; > } > > Does that make sense? I think so. I'm seeing check_asym_packing do the right thing with the simple SMT2 with 1 process case. It marks cpu0 as imbalanced when cpu0 is idle and cpu1 is busy. Unfortunately the process doesn't seem to get migrated down though. Do we need to give *imbalance a higher value? FYI this version doesn't use sgs->total_nr_running anymore so I've removed it. > > I still see two problems with this though,.. regular load-balancing only > balances on the first cpu of a domain (see the *balance = 0, condition > in update_sg_lb_stats()), this means that if SMT[12] are idle we'll not > pull properly. Also, nohz balancing might mess this up further. I do have CONFIG_NO_HZ set but turning it off doesn't help with the above issue. > We could maybe play some games with the balance decision in > update_sg_lb_stats() for SD_ASYM_PACKING domains and idle == CPU_IDLE, > no ideas yet on nohz though. OK <from followup email> > Hmm, we could change the bit in find_busiest_group() to: > > if (idle == CPU_IDLE && check_asym_packing()) > > and skip the nr_running test.. ok, changed. This is the patch so far minus the trivial PPC bits. Thanks! 
Mikey --- include/linux/sched.h | 4 ++ include/linux/topology.h | 1 kernel/sched_fair.c | 65 ++++++++++++++++++++++++++++++++++++++++++++--- 3 files changed, 66 insertions(+), 4 deletions(-) Index: linux-next/include/linux/sched.h =================================================================== --- linux-next.orig/include/linux/sched.h +++ linux-next/include/linux/sched.h @@ -849,7 +849,7 @@ enum cpu_idle_type { #define SD_POWERSAVINGS_BALANCE 0x0100 /* Balance for power savings */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */ - +#define SD_ASYM_PACKING 0x0800 /* Asymmetric SMT packing */ #define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */ enum powersavings_balance_level { @@ -881,6 +881,8 @@ static inline int sd_balance_for_package return SD_PREFER_SIBLING; } +extern int arch_sd_asym_packing(void); + /* * Optimise SD flags for power savings: * SD_BALANCE_NEWIDLE helps agressive task consolidation and power savings. 
Index: linux-next/include/linux/topology.h =================================================================== --- linux-next.orig/include/linux/topology.h +++ linux-next/include/linux/topology.h @@ -102,6 +102,7 @@ int arch_update_cpu_topology(void); | 1*SD_SHARE_PKG_RESOURCES \ | 0*SD_SERIALIZE \ | 0*SD_PREFER_SIBLING \ + | arch_sd_asym_packing() \ , \ .last_balance = jiffies, \ .balance_interval = 1, \ Index: linux-next/kernel/sched_fair.c =================================================================== --- linux-next.orig/kernel/sched_fair.c +++ linux-next/kernel/sched_fair.c @@ -2494,6 +2494,32 @@ static inline void update_sg_lb_stats(st } /** + * update_sd_pick_busiest - return 1 on busiest group + */ +static int update_sd_pick_busiest(struct sched_domain *sd, + struct sd_lb_stats *sds, + struct sched_group *sg, + struct sg_lb_stats *sgs) +{ + if (sgs->sum_nr_running > sgs->group_capacity) + return 1; + + if (sgs->group_imb) + return 1; + + /* Check packing mode for this domain */ + if ((sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running) { + if (!sds->busiest) + return 1; + + if (group_first_cpu(sds->busiest) > group_first_cpu(sg)) + return 1; + } + + return 0; +} + +/** * update_sd_lb_stats - Update sched_group's statistics for load balancing. * @sd: sched_domain whose statistics are to be updated. * @this_cpu: Cpu for which load balance is currently performed. 
@@ -2547,9 +2573,8 @@ static inline void update_sd_lb_stats(st sds->this = group; sds->this_nr_running = sgs.sum_nr_running; sds->this_load_per_task = sgs.sum_weighted_load; - } else if (sgs.avg_load > sds->max_load && - (sgs.sum_nr_running > sgs.group_capacity || - sgs.group_imb)) { + } else if (sgs.avg_load >= sds->max_load && + update_sd_pick_busiest(sd, sds, group, &sgs)) { sds->max_load = sgs.avg_load; sds->busiest = group; sds->busiest_nr_running = sgs.sum_nr_running; @@ -2562,6 +2587,36 @@ static inline void update_sd_lb_stats(st } while (group != sd->groups); } +int __weak arch_sd_asym_packing(void) +{ + return 0*SD_ASYM_PACKING; +} + +/** + * check_asym_packing - Check to see if the group is packed into + * the sched domain + */ +static int check_asym_packing(struct sched_domain *sd, + struct sd_lb_stats *sds, + int this_cpu, unsigned long *imbalance) +{ + int busiest_cpu; + + if (!(sd->flags & SD_ASYM_PACKING)) + return 0; + + if (!sds->busiest) + return 0; + + busiest_cpu = group_first_cpu(sds->busiest); + if (this_cpu > busiest_cpu) + return 0; + + *imbalance = (sds->max_load * sds->busiest->cpu_power) / + SCHED_LOAD_SCALE; + return 1; +} + /** * fix_small_imbalance - Calculate the minor imbalance that exists * amongst the groups of a sched_domain, during @@ -2761,6 +2816,10 @@ find_busiest_group(struct sched_domain * return sds.busiest; out_balanced: + if (idle == CPU_IDLE && + check_asym_packing(sd, &sds, this_cpu, imbalance)) + return sds.busiest; + /* * There is no obvious imbalance. But check if we can do some balancing * to save power. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-02-24 6:07 ` Michael Neuling @ 2010-02-24 11:13 ` Michael Neuling -1 siblings, 0 replies; 103+ messages in thread From: Michael Neuling @ 2010-02-24 11:13 UTC (permalink / raw) To: Peter Zijlstra Cc: Joel Schopp, Ingo Molnar, linuxppc-dev, linux-kernel, ego, Suresh Siddha > > If there's less the group will normally be balanced and we fall out and > > end up in check_asym_packing(). > > > > So what I tried doing with that loop is detect if there's a hole in the > > packing before busiest. Now that I think about it, what we need to check > > is if this_cpu (the removed cpu argument) is idle and less than busiest. > > > > So something like: > > > > static int check_asym_pacing(struct sched_domain *sd, > > struct sd_lb_stats *sds, > > int this_cpu, unsigned long *imbalance) > > { > > int busiest_cpu; > > > > if (!(sd->flags & SD_ASYM_PACKING)) > > return 0; > > > > if (!sds->busiest) > > return 0; > > > > busiest_cpu = group_first_cpu(sds->busiest); > > if (cpu_rq(this_cpu)->nr_running || this_cpu > busiest_cpu) > > return 0; > > > > *imbalance = (sds->max_load * sds->busiest->cpu_power) / > > SCHED_LOAD_SCALE; > > return 1; > > } > > > > Does that make sense? > > I think so. > > I'm seeing check_asym_packing do the right thing with the simple SMT2 > with 1 process case. It marks cpu0 as imbalanced when cpu0 is idle and > cpu1 is busy. > > Unfortunately the process doesn't seem to be get migrated down though. > Do we need to give *imbalance a higher value? So with ego's help, I traced this down a bit more. In my simple test case (SMT2, t0 idle, t1 active) if f_b_g() hits our new case in check_asym_packing(), load_balance then runs f_b_q(). f_b_q() has this: if (capacity && rq->nr_running == 1 && wl > imbalance) continue; when check_asym_packing() hits, wl = 1783 and imbalance = 1024, so we continue and busiest remains NULL. 
load_balance then does "goto out_balanced" and it doesn't attempt to move the task. Based on this and on ego's suggestion I pulled in Suresh Siddha's patch from: http://lkml.org/lkml/2010/2/12/352. This fixes the problem. The process is moved down to t0. I've only tested SMT2 so far. Mikey ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-02-24 11:13 ` Michael Neuling @ 2010-02-24 11:58 ` Michael Neuling -1 siblings, 0 replies; 103+ messages in thread From: Michael Neuling @ 2010-02-24 11:58 UTC (permalink / raw) To: Peter Zijlstra Cc: Joel Schopp, Ingo Molnar, linuxppc-dev, linux-kernel, ego, Suresh Siddha In message <11927.1267010024@neuling.org> you wrote: > > > If there's less the group will normally be balanced and we fall out and > > > end up in check_asym_packing(). > > > > > > So what I tried doing with that loop is detect if there's a hole in the > > > packing before busiest. Now that I think about it, what we need to check > > > is if this_cpu (the removed cpu argument) is idle and less than busiest. > > > > > > So something like: > > > > > > static int check_asym_pacing(struct sched_domain *sd, > > > struct sd_lb_stats *sds, > > > int this_cpu, unsigned long *imbalance) > > > { > > > int busiest_cpu; > > > > > > if (!(sd->flags & SD_ASYM_PACKING)) > > > return 0; > > > > > > if (!sds->busiest) > > > return 0; > > > > > > busiest_cpu = group_first_cpu(sds->busiest); > > > if (cpu_rq(this_cpu)->nr_running || this_cpu > busiest_cpu) > > > return 0; > > > > > > *imbalance = (sds->max_load * sds->busiest->cpu_power) / > > > SCHED_LOAD_SCALE; > > > return 1; > > > } > > > > > > Does that make sense? > > > > I think so. > > > > I'm seeing check_asym_packing do the right thing with the simple SMT2 > > with 1 process case. It marks cpu0 as imbalanced when cpu0 is idle and > > cpu1 is busy. > > > > Unfortunately the process doesn't seem to be get migrated down though. > > Do we need to give *imbalance a higher value? > > So with ego help, I traced this down a bit more. > > In my simple test case (SMT2, t0 idle, t1 active) if f_b_g() hits our > new case in check_asym_packing(), load_balance then runs f_b_q(). 
> f_b_q() has this: > > if (capacity && rq->nr_running == 1 && wl > imbalance) > continue; > > when check_asym_packing() hits, wl = 1783 and imbalance = 1024, so we > continue and busiest remains NULL. > > load_balance then does "goto out_balanced" and it doesn't attempt to > move the task. > > Based on this and on egos suggestion I pulled in Suresh Siddha patch > from: http://lkml.org/lkml/2010/2/12/352. This fixes the problem. The > process is moved down to t0. > > I've only tested SMT2 so far. SMT4 also works in the simple test case of a single process being pulled down to thread 0. As you suspected though, unfortunately this is only working with CONFIG_NO_HZ off. If I turn NO_HZ on, my single process gets bounced around the core. Have you thought of any ideas for how to fix the NO_HZ interaction? Mikey ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-02-24 11:13 ` Michael Neuling @ 2010-02-27 10:21 ` Michael Neuling -1 siblings, 0 replies; 103+ messages in thread From: Michael Neuling @ 2010-02-27 10:21 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Joel Schopp, Ingo Molnar, linuxppc-dev, linux-kernel, ego In message <11927.1267010024@neuling.org> you wrote: > > > If there's less the group will normally be balanced and we fall out and > > > end up in check_asym_packing(). > > > > > > So what I tried doing with that loop is detect if there's a hole in the > > > packing before busiest. Now that I think about it, what we need to check > > > is if this_cpu (the removed cpu argument) is idle and less than busiest. > > > > > > So something like: > > > > > > static int check_asym_pacing(struct sched_domain *sd, > > > struct sd_lb_stats *sds, > > > int this_cpu, unsigned long *imbalance) > > > { > > > int busiest_cpu; > > > > > > if (!(sd->flags & SD_ASYM_PACKING)) > > > return 0; > > > > > > if (!sds->busiest) > > > return 0; > > > > > > busiest_cpu = group_first_cpu(sds->busiest); > > > if (cpu_rq(this_cpu)->nr_running || this_cpu > busiest_cpu) > > > return 0; > > > > > > *imbalance = (sds->max_load * sds->busiest->cpu_power) / > > > SCHED_LOAD_SCALE; > > > return 1; > > > } > > > > > > Does that make sense? > > > > I think so. > > > > I'm seeing check_asym_packing do the right thing with the simple SMT2 > > with 1 process case. It marks cpu0 as imbalanced when cpu0 is idle and > > cpu1 is busy. > > > > Unfortunately the process doesn't seem to be get migrated down though. > > Do we need to give *imbalance a higher value? > > So with ego help, I traced this down a bit more. > > In my simple test case (SMT2, t0 idle, t1 active) if f_b_g() hits our > new case in check_asym_packing(), load_balance then runs f_b_q(). 
> f_b_q() has this: > > if (capacity && rq->nr_running == 1 && wl > imbalance) > continue; > > when check_asym_packing() hits, wl = 1783 and imbalance = 1024, so we > continue and busiest remains NULL. > > load_balance then does "goto out_balanced" and it doesn't attempt to > move the task. > > Based on this and on egos suggestion I pulled in Suresh Siddha patch > from: http://lkml.org/lkml/2010/2/12/352. This fixes the problem. The > process is moved down to t0. > > I've only tested SMT2 so far. I'm finding this SMT2 result to be unreliable. Sometimes it doesn't work for the simple 1 process case. It seems to change boot to boot. Sometimes it works as expected with t0 busy and t1 idle, but other times it's the other way around. When it doesn't work, check_asym_packing() is still marking processes to be pulled down but only gets run about 1 in every 4 calls to load_balance(). For 2 of the other calls to load_balance, idle is CPU_NEWLY_IDLE and hence check_asym_packing() doesn't get called. This results in sd->nr_balance_failed being reset. When load_balance is next called and check_asym_packing() hits, need_active_balance() returns 0 as sd->nr_balance_failed is too small. This means the migration thread on t1 is not woken and the process remains there. So why does thread0 change from NEWLY_IDLE to IDLE and vice versa, when there is nothing running on it? Is this expected? This is with next-20100225 which pulled in Ingo's tip at 407b4844f2af416cd8c9edd7c44d1545c93a4938 (from 24/2/2010) Mikey ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-02-27 10:21 ` Michael Neuling @ 2010-03-02 14:44 ` Peter Zijlstra -1 siblings, 0 replies; 103+ messages in thread From: Peter Zijlstra @ 2010-03-02 14:44 UTC (permalink / raw) To: Michael Neuling; +Cc: Joel Schopp, Ingo Molnar, linuxppc-dev, linux-kernel, ego On Sat, 2010-02-27 at 21:21 +1100, Michael Neuling wrote: > In message <11927.1267010024@neuling.org> you wrote: > > > > If there's less the group will normally be balanced and we fall out and > > > > end up in check_asym_packing(). > > > > > > > > So what I tried doing with that loop is detect if there's a hole in the > > > > packing before busiest. Now that I think about it, what we need to check > > > > is if this_cpu (the removed cpu argument) is idle and less than busiest. > > > > > > > > So something like: > > > > > > > > static int check_asym_pacing(struct sched_domain *sd, > > > > struct sd_lb_stats *sds, > > > > int this_cpu, unsigned long *imbalance) > > > > { > > > > int busiest_cpu; > > > > > > > > if (!(sd->flags & SD_ASYM_PACKING)) > > > > return 0; > > > > > > > > if (!sds->busiest) > > > > return 0; > > > > > > > > busiest_cpu = group_first_cpu(sds->busiest); > > > > if (cpu_rq(this_cpu)->nr_running || this_cpu > busiest_cpu) > > > > return 0; > > > > > > > > *imbalance = (sds->max_load * sds->busiest->cpu_power) / > > > > SCHED_LOAD_SCALE; > > > > return 1; > > > > } > > > > > > > > Does that make sense? > > > > > > I think so. > > > > > > I'm seeing check_asym_packing do the right thing with the simple SMT2 > > > with 1 process case. It marks cpu0 as imbalanced when cpu0 is idle and > > > cpu1 is busy. > > > > > > Unfortunately the process doesn't seem to be get migrated down though. > > > Do we need to give *imbalance a higher value? > > > > So with ego help, I traced this down a bit more. 
> > > > In my simple test case (SMT2, t0 idle, t1 active) if f_b_g() hits our > > new case in check_asym_packing(), load_balance then runs f_b_q(). > > f_b_q() has this: > > > > if (capacity && rq->nr_running == 1 && wl > imbalance) > > continue; > > > > when check_asym_packing() hits, wl = 1783 and imbalance = 1024, so we > > continue and busiest remains NULL. > > > > load_balance then does "goto out_balanced" and it doesn't attempt to > > move the task. > > > > Based on this and on egos suggestion I pulled in Suresh Siddha patch > > from: http://lkml.org/lkml/2010/2/12/352. This fixes the problem. The > > process is moved down to t0. > > > > I've only tested SMT2 so far. > > I'm finding this SMT2 result to be unreliable. Sometimes it doesn't work > for the simple 1 process case. It seems to change boot to boot. > Sometimes it works as expected with t0 busy and t1 idle, but other times > it's the other way around. > > When it doesn't work, check_asym_packing() is still marking processes to > be pulled down but only gets run about 1 in every 4 calls to > load_balance(). > > For 2 of the other calls to load_balance, idle is CPU_NEWLY_IDLE and > hence check_asym_packing() doesn't get called. This results in > sd->nr_balance_failed being reset. When load_balance is next called and > check_asym_packing() hits, need_active_balance() returns 0 as > sd->nr_balance_failed is too small. This means the migration thread on > t1 is not woken and the process remains there. > > So why does thread0 change from NEWLY_IDLE to IDLE and visa versa, when > there is nothing running on it? Is this expected? Ah, yes, you should probably allow both those. NEWLY_IDLE is when we are about to schedule the idle thread, IDLE is when a tick hits the idle thread. I'm thinking that NEWLY_IDLE should also solve the NO_HZ case, since we'll have passed through that before we enter tickless state, just make sure SD_BALANCE_NEWIDLE is set on the relevant levels (should already be so). 
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-03-02 14:44 ` Peter Zijlstra @ 2010-03-04 22:28 ` Michael Neuling -1 siblings, 0 replies; 103+ messages in thread From: Michael Neuling @ 2010-03-04 22:28 UTC (permalink / raw) To: Peter Zijlstra Cc: Joel Schopp, Ingo Molnar, linuxppc-dev, linux-kernel, ego, Christopher Yeoh In message <1267541076.25158.60.camel@laptop> you wrote: > On Sat, 2010-02-27 at 21:21 +1100, Michael Neuling wrote: > > In message <11927.1267010024@neuling.org> you wrote: > > > > > If there's less the group will normally be balanced and we fall out and > > > > > end up in check_asym_packing(). > > > > > > > > > > So what I tried doing with that loop is detect if there's a hole in the > > > > > packing before busiest. Now that I think about it, what we need to check > > > > > is if this_cpu (the removed cpu argument) is idle and less than busiest. > > > > > > > > > > So something like: > > > > > > > > > > static int check_asym_pacing(struct sched_domain *sd, > > > > > struct sd_lb_stats *sds, > > > > > int this_cpu, unsigned long *imbalance) > > > > > { > > > > > int busiest_cpu; > > > > > > > > > > if (!(sd->flags & SD_ASYM_PACKING)) > > > > > return 0; > > > > > > > > > > if (!sds->busiest) > > > > > return 0; > > > > > > > > > > busiest_cpu = group_first_cpu(sds->busiest); > > > > > if (cpu_rq(this_cpu)->nr_running || this_cpu > busiest_cpu) > > > > > return 0; > > > > > > > > > > *imbalance = (sds->max_load * sds->busiest->cpu_power) / > > > > > SCHED_LOAD_SCALE; > > > > > return 1; > > > > > } > > > > > > > > > > Does that make sense? > > > > > > > > I think so. > > > > > > > > I'm seeing check_asym_packing do the right thing with the simple SMT2 > > > > with 1 process case. It marks cpu0 as imbalanced when cpu0 is idle and > > > > cpu1 is busy. > > > > > > > > Unfortunately the process doesn't seem to be get migrated down though. > > > > Do we need to give *imbalance a higher value? 
> > > > > > So with ego help, I traced this down a bit more. > > > > > > In my simple test case (SMT2, t0 idle, t1 active) if f_b_g() hits our > > > new case in check_asym_packing(), load_balance then runs f_b_q(). > > > f_b_q() has this: > > > > > > if (capacity && rq->nr_running == 1 && wl > imbalance) > > > continue; > > > > > > when check_asym_packing() hits, wl = 1783 and imbalance = 1024, so we > > > continue and busiest remains NULL. > > > > > > load_balance then does "goto out_balanced" and it doesn't attempt to > > > move the task. > > > > > > Based on this and on egos suggestion I pulled in Suresh Siddha patch > > > from: http://lkml.org/lkml/2010/2/12/352. This fixes the problem. The > > > process is moved down to t0. > > > > > > I've only tested SMT2 so far. > > > > I'm finding this SMT2 result to be unreliable. Sometimes it doesn't work > > for the simple 1 process case. It seems to change boot to boot. > > Sometimes it works as expected with t0 busy and t1 idle, but other times > > it's the other way around. > > > > When it doesn't work, check_asym_packing() is still marking processes to > > be pulled down but only gets run about 1 in every 4 calls to > > load_balance(). > > > > For 2 of the other calls to load_balance, idle is CPU_NEWLY_IDLE and > > hence check_asym_packing() doesn't get called. This results in > > sd->nr_balance_failed being reset. When load_balance is next called and > > check_asym_packing() hits, need_active_balance() returns 0 as > > sd->nr_balance_failed is too small. This means the migration thread on > > t1 is not woken and the process remains there. > > > > So why does thread0 change from NEWLY_IDLE to IDLE and visa versa, when > > there is nothing running on it? Is this expected? > > Ah, yes, you should probably allow both those. > > NEWLY_IDLE is when we are about to schedule the idle thread, IDLE is > when a tick hits the idle thread. 
> > I'm thinking that NEWLY_IDLE should also solve the NO_HZ case, since > we'll have passed through that before we enter tickless state, just make > sure SD_BALANCE_NEWIDLE is set on the relevant levels (should already be > so). OK, thanks. There seems to be a regression in Linus' latest tree (also -next) where new processes usually end up on thread 1 rather than thread 0 (when in SMT2 mode). This only seems to happen with newly created processes. If you pin a process to t0 and then unpin it, it stays on t0. Also if a process is migrated to another core, it can end up on t0. This happens with a vanilla Linus or -next tree on a ppc64 pseries_defconfig minus NO_HZ. I've not tried with NO_HZ. Anyway, this regression seems to be causing problems when we apply our patch. We are trying to pull down to t0, which works, but we immediately get pulled back up to t1 due to the above regression. This happens over and over, causing the process to ping-pong every few sched ticks. We've not tried to bisect this problem but that's the next step unless someone has some insights into the problem. Also, we had to change the following to get the pull down to work correctly in the original patch: @@ -2618,8 +2618,8 @@ static int check_asym_packing(struct sch if (this_cpu > busiest_cpu) return 0; - *imbalance = (sds->max_load * sds->busiest->cpu_power) / - SCHED_LOAD_SCALE; + *imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->cpu_power, + SCHED_LOAD_SCALE); return 1; We found that imbalance = 1023.8 which got rounded down to 1023 which ended up being compared to a wl of 1024 in find_busiest_queue and failing. The closest rounding fixes this. Mikey ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv3 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-28 23:20 ` Joel Schopp @ 2010-01-29 12:25 ` Gabriel Paubert -1 siblings, 0 replies; 103+ messages in thread From: Gabriel Paubert @ 2010-01-29 12:25 UTC (permalink / raw) To: Joel Schopp; +Cc: Peter Zijlstra, ego, linux-kernel, Ingo Molnar, linuxppc-dev On Thu, Jan 28, 2010 at 05:20:55PM -0600, Joel Schopp wrote: > On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads > there is performance benefit to idling the higher numbered threads in > the core. > Really 2, 3, or 4? When you have 4 idle threads out of 4, performance becomes a minor concern, no? ;-) Gabriel ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCHv3 2/2] powerpc: implement arch_scale_smt_power for Power7 2010-01-29 12:25 ` Gabriel Paubert @ 2010-01-29 16:26 ` Joel Schopp -1 siblings, 0 replies; 103+ messages in thread From: Joel Schopp @ 2010-01-29 16:26 UTC (permalink / raw) To: Gabriel Paubert Cc: Peter Zijlstra, ego, linux-kernel, Ingo Molnar, linuxppc-dev Gabriel Paubert wrote: > On Thu, Jan 28, 2010 at 05:20:55PM -0600, Joel Schopp wrote: > >> On Power7 processors running in SMT4 mode with 2, 3, or 4 idle threads >> there is performance benefit to idling the higher numbered threads in >> the core. >> >> > > Really 2, 3, or 4? When you have 4 idle threads out of 4, performance > becomes a minor concern, no? ;-) > > Gabriel > Yes, but going from 4 idle to 3 idle you want to keep the slanted weights. If you ignored 4 you'd place wrong and then correct it after the fact. ^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCHv2 0/2] sched: arch_scale_smt_powers v2 2010-01-20 20:00 ` Joel Schopp @ 2010-01-26 23:27 ` Joel Schopp -1 siblings, 0 replies; 103+ messages in thread From: Joel Schopp @ 2010-01-26 23:27 UTC (permalink / raw) To: Peter Zijlstra Cc: ego, linuxppc-dev, Ingo Molnar, linux-kernel, benh, jschopp The new Power7 processor has 4 way SMT. This 4 way SMT benefits from dynamic power updates that arch_scale_smt_power was designed to provide. The first patch fixes a generic scheduler bug necessary for arch_scale_smt to properly function. The second patch implements arch_scale_smt_power for powerpc, and in particular for Power7 processors. Version 2 changes: - Drop Patch 1 from the original series since it's in the -tip tree now - Move enabling the cpu feature into its own patch (now patch 1) - Add stubbing out broken x86 implementation to patch 1 - clean up coding style in patch 2 --- Joel Schopp (2): powerpc: implement arch_scale_smt_power for Power7 sched: enable ARCH_POWER arch/powerpc/include/asm/cputable.h | 3 +- arch/powerpc/kernel/smp.c | 52 ++++++++++++++++++++++++++++++++++++ arch/x86/kernel/cpu/sched.c | 6 +--- kernel/sched_features.h | 2 - 4 files changed, 57 insertions(+), 6 deletions(-) ^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCHv3 0/2] sched: arch_scale_smt_powers 2010-01-26 23:27 ` Joel Schopp @ 2010-01-28 23:20 ` Joel Schopp -1 siblings, 0 replies; 103+ messages in thread From: Joel Schopp @ 2010-01-28 23:20 UTC (permalink / raw) To: Peter Zijlstra Cc: ego, linuxppc-dev, Ingo Molnar, linux-kernel, benh, jschopp The new Power7 processor has 4 way SMT. This 4 way SMT benefits from dynamic power updates that arch_scale_smt_power was designed to provide. The first patch fixes a generic scheduler bug necessary for arch_scale_smt to properly function. The second patch implements arch_scale_smt_power for powerpc, and in particular for Power7 processors. Version 2 changes: - Drop Patch 1 from the original series since it's in the -tip tree now - Move enabling the cpu feature into its own patch (now patch 1) - Add stubbing out broken x86 implementation to patch 1 - clean up coding style in patch 2 Version 3 changes: - Added a comment to Patch 2 --- Joel Schopp (2): powerpc: implement arch_scale_smt_power for Power7 sched: enable ARCH_POWER arch/powerpc/include/asm/cputable.h | 3 + arch/powerpc/kernel/smp.c | 56 ++++++++++++++++++++++++++++++++++++ arch/x86/kernel/cpu/sched.c | 6 +-- kernel/sched_features.h | 2 - 4 files changed, 61 insertions(+), 6 deletions(-) ^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCHv4 0/2] sched: arch_scale_smt_powers 2010-01-28 23:20 ` Joel Schopp @ 2010-02-05 20:57 ` Joel Schopp -1 siblings, 0 replies; 103+ messages in thread From: Joel Schopp @ 2010-02-05 20:57 UTC (permalink / raw) To: Peter Zijlstra Cc: ego, linuxppc-dev, Ingo Molnar, linux-kernel, benh, jschopp The new Power7 processor has 4 way SMT. This 4 way SMT benefits from dynamic power updates that arch_scale_smt_power was designed to provide. The first patch fixes a generic scheduler bug necessary for arch_scale_smt to properly function. The second patch implements arch_scale_smt_power for powerpc, and in particular for Power7 processors. Version 2 changes: - Drop Patch 1 from the original series since it's in the -tip tree now - Move enabling the cpu feature into its own patch (now patch 1) - Add stubbing out broken x86 implementation to patch 1 - clean up coding style in patch 2 Version 3 changes: - Added a comment to Patch 2 Version 4 changes: - Wrap arch_scale_smt_power() in #ifdef CONFIG_SCHED_SMT in Patch 2 --- Joel Schopp (2): sched: enable ARCH_POWER powerpc: implement arch_scale_smt_power for Power7 arch/powerpc/include/asm/cputable.h | 3 + arch/powerpc/kernel/smp.c | 58 ++++++++++++++++++++++++++++++++++++ arch/x86/kernel/cpu/sched.c | 6 +-- kernel/sched_features.h | 2 - 4 files changed, 63 insertions(+), 6 deletions(-) ^ permalink raw reply [flat|nested] 103+ messages in thread
end of thread, other threads:[~2010-03-04 22:28 UTC | newest]

Thread overview: 103+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-01-20 20:00 [PATCH 0/2] sched: arch_scale_smt_powers Joel Schopp
2010-01-20 20:02 ` [PATCH 1/2] sched: Fix the place where group powers are updated Joel Schopp
2010-01-21 13:54 ` [tip:sched/core] " tip-bot for Gautham R Shenoy
2010-01-26 23:28 ` [PATCHv2 1/2] sched: enable ARCH_POWER Joel Schopp
2010-01-28 23:20 ` [PATCHv3 " Joel Schopp
2010-02-05 20:57 ` [PATCHv4 " Joel Schopp
2010-01-20 20:04 ` [PATCH 2/2] powerpc: implement arch_scale_smt_power for Power7 Joel Schopp
2010-01-20 20:48 ` Peter Zijlstra
2010-01-20 21:58 ` Michael Neuling
2010-01-20 22:44 ` Joel Schopp
2010-01-21  8:27 ` Peter Zijlstra
2010-01-20 21:04 ` Michael Neuling
2010-01-20 22:09 ` Joel Schopp
2010-01-24  3:00 ` Benjamin Herrenschmidt
2010-01-25 17:50 ` Joel Schopp
2010-01-26  4:23 ` Benjamin Herrenschmidt
2010-01-20 21:33 ` Benjamin Herrenschmidt
2010-01-20 22:36 ` Joel Schopp
2010-01-26 23:28 ` [PATCHv2 " Joel Schopp
2010-01-27  0:52 ` Benjamin Herrenschmidt
2010-01-28 22:39 ` Joel Schopp
2010-01-29  1:23 ` Benjamin Herrenschmidt
2010-01-28 23:20 ` [PATCHv3 " Joel Schopp
2010-01-28 23:24 ` Joel Schopp
2010-01-29  1:23 ` Benjamin Herrenschmidt
2010-01-29 10:13 ` Peter Zijlstra
2010-01-29 18:34 ` Joel Schopp
2010-01-29 18:41 ` Joel Schopp
2010-02-05 20:57 ` [PATCHv4 " Joel Schopp
2010-02-14 10:12 ` Peter Zijlstra
2010-02-17 22:20 ` Michael Neuling
2010-02-18 13:17 ` Peter Zijlstra
2010-02-18 13:19 ` Peter Zijlstra
2010-02-18 16:28 ` Joel Schopp
2010-02-18 17:08 ` Peter Zijlstra
2010-02-19  6:05 ` Michael Neuling
2010-02-19 10:01 ` Peter Zijlstra
2010-02-19 11:01 ` Michael Neuling
2010-02-23  6:08 ` Michael Neuling
2010-02-23 16:24 ` Peter Zijlstra
2010-02-23 16:30 ` Peter Zijlstra
2010-02-24  6:07 ` Michael Neuling
2010-02-24 11:13 ` Michael Neuling
2010-02-24 11:58 ` Michael Neuling
2010-02-27 10:21 ` Michael Neuling
2010-03-02 14:44 ` Peter Zijlstra
2010-03-04 22:28 ` Michael Neuling
2010-01-29 12:25 ` [PATCHv3 " Gabriel Paubert
2010-01-29 16:26 ` Joel Schopp
2010-01-26 23:27 ` [PATCHv2 0/2] sched: arch_scale_smt_powers v2 Joel Schopp
2010-01-28 23:20 ` [PATCHv3 0/2] sched: arch_scale_smt_powers Joel Schopp
2010-02-05 20:57 ` [PATCHv4 " Joel Schopp