* [PATCH 0/4] perf, x86: Fixes for v2.6.39
From: Robert Richter @ 2011-04-16 0:27 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, Stephane Eranian, LKML

This patch series contains fixes for v2.6.39. They are related to the
AMD family 15h PMU but also modify x86 generic code; see the diffstat
below.

If possible I would like to see patch #4 in .39 as well, though it is
more than a one-liner and makes many changes to the x86 event
scheduling code. Without this patch, scheduling multiple events on an
AMD family 15h PMU may fail under certain conditions even though
enough counters are available.

-Robert

The following changes since commit e566b76ed30768140df8f0023904aed5a41244f7:

  perf_event: Fix cgrp event scheduling bug in perf_enable_on_exec() (2011-04-11 11:07:55 +0200)

Andre Przywara (1):
      perf, x86: Fix pre-defined cache-misses event for AMD family 15h cpus

Robert Richter (3):
      perf, x86: Fix AMD family 15h FPU event constraints
      perf, x86: Use ALTERNATIVE() to check for X86_FEATURE_PERFCTR_CORE
      perf, x86: Fix event scheduler to solve complex scheduling problems

 arch/x86/kernel/cpu/perf_event.c     |   63 +++++++++++++++++++++++++++------
 arch/x86/kernel/cpu/perf_event_amd.c |   23 ++++++++++--
 2 files changed, 70 insertions(+), 16 deletions(-)

^ permalink raw reply	[flat|nested] 38+ messages in thread
* [PATCH 1/4] perf, x86: Fix pre-defined cache-misses event for AMD family 15h cpus
From: Robert Richter @ 2011-04-16 0:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Stephane Eranian, LKML, Andre Przywara, Robert Richter

From: Andre Przywara <andre.przywara@amd.com>

With AMD cpu family 15h a unit mask was introduced for the Data Cache
Miss event (0x041/L1-dcache-load-misses). We need to enable bit 0
(first data cache miss or streaming store to a 64 B cache line) of
this mask to properly count data cache misses. We now set this bit for
all families and models. In case a PMU does not implement a unit mask
for event 0x041, the bit is ignored.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
---
 arch/x86/kernel/cpu/perf_event_amd.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index 461f62b..4e16138 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -8,7 +8,7 @@ static __initconst const u64 amd_hw_cache_event_ids
  [ C(L1D) ] = {
 	[ C(OP_READ) ] = {
 		[ C(RESULT_ACCESS) ] = 0x0040, /* Data Cache Accesses */
-		[ C(RESULT_MISS) ] = 0x0041, /* Data Cache Misses */
+		[ C(RESULT_MISS) ] = 0x0141, /* Data Cache Misses */
 	},
 	[ C(OP_WRITE) ] = {
 		[ C(RESULT_ACCESS) ] = 0x0142, /* Data Cache Refills :system */
--
1.7.3.4
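The one-character change above flips the pre-defined config from 0x0041 to 0x0141, i.e. it sets bit 0 of the unit-mask byte while leaving the event select untouched. A small Python model of the field split (this covers only the low 16 bits of the control-register value, which is all this event needs; the helper names are illustrative, not kernel APIs):

```python
# Low 16 bits of an AMD/architectural perfctr config value:
# bits [7:0] hold the event select, bits [15:8] the unit mask.
def encode(event_select_code, umask):
    return ((umask & 0xff) << 8) | (event_select_code & 0xff)

def event_select(config):
    return config & 0xff

def unit_mask(config):
    return (config >> 8) & 0xff

old = 0x0041                # event 0x41, empty unit mask
new = encode(0x41, 0x01)    # same event, unit-mask bit 0 set

assert new == 0x0141
assert event_select(old) == event_select(new) == 0x41
assert unit_mask(old) == 0x00 and unit_mask(new) == 0x01
```

On PMUs that define no unit mask for event 0x041 the extra bit is simply ignored, which is why the patch can set it unconditionally for all families and models.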
* [tip:perf/urgent] perf, x86: Fix pre-defined cache-misses event for AMD family 15h cpus
From: tip-bot for Andre Przywara @ 2011-04-19 12:03 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, robert.richter, a.p.zijlstra, andre.przywara, tglx, mingo

Commit-ID:  83112e688f5f05dea1e63787db9a6c16b2887a1d
Gitweb:     http://git.kernel.org/tip/83112e688f5f05dea1e63787db9a6c16b2887a1d
Author:     Andre Przywara <andre.przywara@amd.com>
AuthorDate: Sat, 16 Apr 2011 02:27:53 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 19 Apr 2011 10:07:54 +0200

perf, x86: Fix pre-defined cache-misses event for AMD family 15h cpus

With AMD cpu family 15h a unit mask was introduced for the Data Cache
Miss event (0x041/L1-dcache-load-misses). We need to enable bit 0
(first data cache miss or streaming store to a 64 B cache line) of
this mask to properly count data cache misses. We now set this bit for
all families and models. In case a PMU does not implement a unit mask
for event 0x041, the bit is ignored.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1302913676-14352-2-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/cpu/perf_event_amd.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index 461f62b..4e16138 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -8,7 +8,7 @@ static __initconst const u64 amd_hw_cache_event_ids
  [ C(L1D) ] = {
 	[ C(OP_READ) ] = {
 		[ C(RESULT_ACCESS) ] = 0x0040, /* Data Cache Accesses */
-		[ C(RESULT_MISS) ] = 0x0041, /* Data Cache Misses */
+		[ C(RESULT_MISS) ] = 0x0141, /* Data Cache Misses */
 	},
 	[ C(OP_WRITE) ] = {
 		[ C(RESULT_ACCESS) ] = 0x0142, /* Data Cache Refills :system */
* [PATCH 2/4] perf, x86: Fix AMD family 15h FPU event constraints 2011-04-16 0:27 [PATCH 0/4] perf, x86: Fixes for v2.6.39 Robert Richter 2011-04-16 0:27 ` [PATCH 1/4] perf, x86: Fix pre-defined cache-misses event for AMD family 15h cpus Robert Richter @ 2011-04-16 0:27 ` Robert Richter 2011-04-19 12:04 ` [tip:perf/urgent] " tip-bot for Robert Richter 2011-04-16 0:27 ` [PATCH 3/4] perf, x86: Use ALTERNATIVE() to check for X86_FEATURE_PERFCTR_CORE Robert Richter 2011-04-16 0:27 ` [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems Robert Richter 3 siblings, 1 reply; 38+ messages in thread From: Robert Richter @ 2011-04-16 0:27 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Stephane Eranian, LKML, Robert Richter, Stephane Eranian Depending on the unit mask settings some FPU events may be scheduled only on cpu counter #3. This patch fixes this. Cc: Stephane Eranian <eranian@googlemail.com> Signed-off-by: Robert Richter <robert.richter@amd.com> --- arch/x86/kernel/cpu/perf_event_amd.c | 21 ++++++++++++++++++--- 1 files changed, 18 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c index 4e16138..758cccc 100644 --- a/arch/x86/kernel/cpu/perf_event_amd.c +++ b/arch/x86/kernel/cpu/perf_event_amd.c @@ -427,7 +427,9 @@ static __initconst const struct x86_pmu amd_pmu = { * * Exceptions: * + * 0x000 FP PERF_CTL[3], PERF_CTL[5:3] (*) * 0x003 FP PERF_CTL[3] + * 0x004 FP PERF_CTL[3], PERF_CTL[5:3] (*) * 0x00B FP PERF_CTL[3] * 0x00D FP PERF_CTL[3] * 0x023 DE PERF_CTL[2:0] @@ -448,6 +450,8 @@ static __initconst const struct x86_pmu amd_pmu = { * 0x0DF LS PERF_CTL[5:0] * 0x1D6 EX PERF_CTL[5:0] * 0x1D8 EX PERF_CTL[5:0] + * + * (*) depending on the umask all FPU counters may be used */ static struct event_constraint amd_f15_PMC0 = EVENT_CONSTRAINT(0, 0x01, 0); @@ -460,18 +464,29 @@ static struct event_constraint amd_f15_PMC53 = EVENT_CONSTRAINT(0, 0x38, 0); static struct 
event_constraint * amd_get_event_constraints_f15h(struct cpu_hw_events *cpuc, struct perf_event *event) { - unsigned int event_code = amd_get_event_code(&event->hw); + struct hw_perf_event *hwc = &event->hw; + unsigned int event_code = amd_get_event_code(hwc); switch (event_code & AMD_EVENT_TYPE_MASK) { case AMD_EVENT_FP: switch (event_code) { + case 0x000: + if (!(hwc->config & 0x0000F000ULL)) + break; + if (!(hwc->config & 0x00000F00ULL)) + break; + return &amd_f15_PMC3; + case 0x004: + if (hweight_long(hwc->config & + ARCH_PERFMON_EVENTSEL_UMASK) <= 1) + break; + return &amd_f15_PMC3; case 0x003: case 0x00B: case 0x00D: return &amd_f15_PMC3; - default: - return &amd_f15_PMC53; } + return &amd_f15_PMC53; case AMD_EVENT_LS: case AMD_EVENT_DC: case AMD_EVENT_EX_LS: -- 1.7.3.4 ^ permalink raw reply related [flat|nested] 38+ messages in thread
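The new special cases route FP events 0x000 and 0x004 to the single-counter constraint only when their unit mask demands it; otherwise all of PERF_CTL[5:3] stays usable. A Python model of that decision logic, offered as a sketch (the constants and helper names are illustrative; the unit mask sits in config bits [15:8], matching ARCH_PERFMON_EVENTSEL_UMASK as used in the patch):

```python
PMC3  = 0x08   # counter bitmask: PERF_CTL[3] only
PMC53 = 0x38   # counter bitmask: PERF_CTL[5:3]

def f15h_fp_constraint(event_code, config):
    """Model of the FP switch in amd_get_event_constraints_f15h()."""
    if event_code == 0x000:
        # Restrict to PMC3 only if both nibbles of the umask are populated.
        if not (config & 0x0000F000) or not (config & 0x00000F00):
            return PMC53
        return PMC3
    if event_code == 0x004:
        # More than one umask bit set forces PMC3 (the hweight check).
        if bin(config & 0xFF00).count("1") <= 1:
            return PMC53
        return PMC3
    if event_code in (0x003, 0x00B, 0x00D):
        return PMC3
    return PMC53

assert f15h_fp_constraint(0x000, 0x0000) == PMC53
assert f15h_fp_constraint(0x000, 0x1100) == PMC3
assert f15h_fp_constraint(0x004, 0x0104) == PMC53   # single umask bit
assert f15h_fp_constraint(0x004, 0x0304) == PMC3    # two umask bits
```

This mirrors why the fix matters: before the patch, every FP event other than the three listed exceptions landed in the PMC53 bucket, and events 0x000/0x004 with restrictive unit masks could be scheduled onto a counter they cannot count on.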
* [tip:perf/urgent] perf, x86: Fix AMD family 15h FPU event constraints 2011-04-16 0:27 ` [PATCH 2/4] perf, x86: Fix AMD family 15h FPU event constraints Robert Richter @ 2011-04-19 12:04 ` tip-bot for Robert Richter 0 siblings, 0 replies; 38+ messages in thread From: tip-bot for Robert Richter @ 2011-04-19 12:04 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, hpa, mingo, eranian, robert.richter, a.p.zijlstra, tglx, mingo Commit-ID: 855357a21744e488cbee23a47d2b124035160a87 Gitweb: http://git.kernel.org/tip/855357a21744e488cbee23a47d2b124035160a87 Author: Robert Richter <robert.richter@amd.com> AuthorDate: Sat, 16 Apr 2011 02:27:54 +0200 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Tue, 19 Apr 2011 10:07:55 +0200 perf, x86: Fix AMD family 15h FPU event constraints Depending on the unit mask settings some FPU events may be scheduled only on cpu counter #3. This patch fixes this. Signed-off-by: Robert Richter <robert.richter@amd.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@googlemail.com> Link: http://lkml.kernel.org/r/1302913676-14352-3-git-send-email-robert.richter@amd.com Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/cpu/perf_event_amd.c | 20 +++++++++++++++++--- 1 files changed, 17 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c index 4e16138..cf4e369 100644 --- a/arch/x86/kernel/cpu/perf_event_amd.c +++ b/arch/x86/kernel/cpu/perf_event_amd.c @@ -427,7 +427,9 @@ static __initconst const struct x86_pmu amd_pmu = { * * Exceptions: * + * 0x000 FP PERF_CTL[3], PERF_CTL[5:3] (*) * 0x003 FP PERF_CTL[3] + * 0x004 FP PERF_CTL[3], PERF_CTL[5:3] (*) * 0x00B FP PERF_CTL[3] * 0x00D FP PERF_CTL[3] * 0x023 DE PERF_CTL[2:0] @@ -448,6 +450,8 @@ static __initconst const struct x86_pmu amd_pmu = { * 0x0DF LS PERF_CTL[5:0] * 0x1D6 EX PERF_CTL[5:0] * 0x1D8 EX PERF_CTL[5:0] + * + * (*) depending on the umask all FPU counters may be 
used */ static struct event_constraint amd_f15_PMC0 = EVENT_CONSTRAINT(0, 0x01, 0); @@ -460,18 +464,28 @@ static struct event_constraint amd_f15_PMC53 = EVENT_CONSTRAINT(0, 0x38, 0); static struct event_constraint * amd_get_event_constraints_f15h(struct cpu_hw_events *cpuc, struct perf_event *event) { - unsigned int event_code = amd_get_event_code(&event->hw); + struct hw_perf_event *hwc = &event->hw; + unsigned int event_code = amd_get_event_code(hwc); switch (event_code & AMD_EVENT_TYPE_MASK) { case AMD_EVENT_FP: switch (event_code) { + case 0x000: + if (!(hwc->config & 0x0000F000ULL)) + break; + if (!(hwc->config & 0x00000F00ULL)) + break; + return &amd_f15_PMC3; + case 0x004: + if (hweight_long(hwc->config & ARCH_PERFMON_EVENTSEL_UMASK) <= 1) + break; + return &amd_f15_PMC3; case 0x003: case 0x00B: case 0x00D: return &amd_f15_PMC3; - default: - return &amd_f15_PMC53; } + return &amd_f15_PMC53; case AMD_EVENT_LS: case AMD_EVENT_DC: case AMD_EVENT_EX_LS: ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 3/4] perf, x86: Use ALTERNATIVE() to check for X86_FEATURE_PERFCTR_CORE 2011-04-16 0:27 [PATCH 0/4] perf, x86: Fixes for v2.6.39 Robert Richter 2011-04-16 0:27 ` [PATCH 1/4] perf, x86: Fix pre-defined cache-misses event for AMD family 15h cpus Robert Richter 2011-04-16 0:27 ` [PATCH 2/4] perf, x86: Fix AMD family 15h FPU event constraints Robert Richter @ 2011-04-16 0:27 ` Robert Richter 2011-04-18 20:00 ` Andi Kleen 2011-04-19 12:04 ` [tip:perf/core] " tip-bot for Robert Richter 2011-04-16 0:27 ` [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems Robert Richter 3 siblings, 2 replies; 38+ messages in thread From: Robert Richter @ 2011-04-16 0:27 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Ingo Molnar, Stephane Eranian, LKML, Robert Richter Using ALTERNATIVE() when checking for X86_FEATURE_PERFCTR_CORE avoids an extra pointer chase and data cache hit. Signed-off-by: Robert Richter <robert.richter@amd.com> --- arch/x86/kernel/cpu/perf_event.c | 15 +++++++++++---- 1 files changed, 11 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index eed3673a..224a84f 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -31,6 +31,7 @@ #include <asm/nmi.h> #include <asm/compat.h> #include <asm/smp.h> +#include <asm/alternative.h> #if 0 #undef wrmsrl @@ -363,12 +364,18 @@ again: return new_raw_count; } -/* using X86_FEATURE_PERFCTR_CORE to later implement ALTERNATIVE() here */ static inline int x86_pmu_addr_offset(int index) { - if (boot_cpu_has(X86_FEATURE_PERFCTR_CORE)) - return index << 1; - return index; + int offset; + + /* offset = X86_FEATURE_PERFCTR_CORE ? index << 1 : index */ + alternative_io(ASM_NOP2, + "shll $1, %%eax", + X86_FEATURE_PERFCTR_CORE, + "=a" (offset), + "a" (index)); + + return offset; } static inline unsigned int x86_pmu_config_addr(int index) -- 1.7.3.4 ^ permalink raw reply related [flat|nested] 38+ messages in thread
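alternative_io() resolves the feature check once at boot: on CPUs without PERFCTR_CORE the shll instruction is overwritten with a two-byte NOP, so at runtime the function is straight-line code with no feature-flag load. Functionally, the two patched variants compute the following (a Python model for illustration only; the real mechanism rewrites the machine code in place):

```python
def x86_pmu_addr_offset(index, has_perfctr_core):
    # With PERFCTR_CORE the per-counter MSRs are laid out in
    # interleaved pairs, so the register offset doubles; without
    # the feature the offset is simply the counter index.
    return index << 1 if has_perfctr_core else index

assert x86_pmu_addr_offset(3, True) == 6
assert x86_pmu_addr_offset(3, False) == 3
```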
* Re: [PATCH 3/4] perf, x86: Use ALTERNATIVE() to check for X86_FEATURE_PERFCTR_CORE 2011-04-16 0:27 ` [PATCH 3/4] perf, x86: Use ALTERNATIVE() to check for X86_FEATURE_PERFCTR_CORE Robert Richter @ 2011-04-18 20:00 ` Andi Kleen 2011-04-19 10:39 ` Robert Richter 2011-04-19 12:04 ` [tip:perf/core] " tip-bot for Robert Richter 1 sibling, 1 reply; 38+ messages in thread From: Andi Kleen @ 2011-04-18 20:00 UTC (permalink / raw) To: Robert Richter; +Cc: Peter Zijlstra, Ingo Molnar, Stephane Eranian, LKML Robert Richter <robert.richter@amd.com> writes: > Using ALTERNATIVE() when checking for X86_FEATURE_PERFCTR_CORE avoids > an extra pointer chase and data cache hit. Is that really a performance critical path? Seems more like unnecessary obfuscation to me. -Andi -- ak@linux.intel.com -- Speaking for myself only ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 3/4] perf, x86: Use ALTERNATIVE() to check for X86_FEATURE_PERFCTR_CORE
From: Robert Richter @ 2011-04-19 10:39 UTC (permalink / raw)
To: Andi Kleen; +Cc: Peter Zijlstra, Ingo Molnar, Stephane Eranian, LKML

On 18.04.11 16:00:57, Andi Kleen wrote:
> Robert Richter <robert.richter@amd.com> writes:
>
> > Using ALTERNATIVE() when checking for X86_FEATURE_PERFCTR_CORE avoids
> > an extra pointer chase and data cache hit.
>
> Is that really a performance critical path?
>
> Seems more like unnecessary obfuscation to me.

The hottest path is in perf_pmu_disable(), which happens at least on
every task switch when the event scheduler is called.

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
* Re: [PATCH 3/4] perf, x86: Use ALTERNATIVE() to check for X86_FEATURE_PERFCTR_CORE
From: Andi Kleen @ 2011-04-19 18:21 UTC (permalink / raw)
To: Robert Richter
Cc: Andi Kleen, Peter Zijlstra, Ingo Molnar, Stephane Eranian, LKML

On Tue, Apr 19, 2011 at 12:39:27PM +0200, Robert Richter wrote:
> On 18.04.11 16:00:57, Andi Kleen wrote:
> > Robert Richter <robert.richter@amd.com> writes:
> >
> > > Using ALTERNATIVE() when checking for X86_FEATURE_PERFCTR_CORE avoids
> > > an extra pointer chase and data cache hit.
> >
> > Is that really a performance critical path?
> >
> > Seems more like unnecessary obfuscation to me.
>
> The hottest path is in perf_pmu_disable(), which happens at least on
> every task switch when the event scheduler is called.

Yes, but that's already a slow path, isn't it? It had better be,
because the MSR accesses alone are incredibly expensive. I guess your
test and jump isn't even on the radar after that ...

-Andi
* [tip:perf/core] perf, x86: Use ALTERNATIVE() to check for X86_FEATURE_PERFCTR_CORE 2011-04-16 0:27 ` [PATCH 3/4] perf, x86: Use ALTERNATIVE() to check for X86_FEATURE_PERFCTR_CORE Robert Richter 2011-04-18 20:00 ` Andi Kleen @ 2011-04-19 12:04 ` tip-bot for Robert Richter 1 sibling, 0 replies; 38+ messages in thread From: tip-bot for Robert Richter @ 2011-04-19 12:04 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, hpa, mingo, robert.richter, a.p.zijlstra, tglx, mingo Commit-ID: c8e5910edf8bbe2e5c6c35a4ef2a578cc7893b25 Gitweb: http://git.kernel.org/tip/c8e5910edf8bbe2e5c6c35a4ef2a578cc7893b25 Author: Robert Richter <robert.richter@amd.com> AuthorDate: Sat, 16 Apr 2011 02:27:55 +0200 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Tue, 19 Apr 2011 10:08:12 +0200 perf, x86: Use ALTERNATIVE() to check for X86_FEATURE_PERFCTR_CORE Using ALTERNATIVE() when checking for X86_FEATURE_PERFCTR_CORE avoids an extra pointer chase and data cache hit. Signed-off-by: Robert Richter <robert.richter@amd.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1302913676-14352-4-git-send-email-robert.richter@amd.com Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/cpu/perf_event.c | 15 +++++++++++---- 1 files changed, 11 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index eed3673a..224a84f 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -31,6 +31,7 @@ #include <asm/nmi.h> #include <asm/compat.h> #include <asm/smp.h> +#include <asm/alternative.h> #if 0 #undef wrmsrl @@ -363,12 +364,18 @@ again: return new_raw_count; } -/* using X86_FEATURE_PERFCTR_CORE to later implement ALTERNATIVE() here */ static inline int x86_pmu_addr_offset(int index) { - if (boot_cpu_has(X86_FEATURE_PERFCTR_CORE)) - return index << 1; - return index; + int offset; + + /* offset = X86_FEATURE_PERFCTR_CORE ? 
index << 1 : index */ + alternative_io(ASM_NOP2, + "shll $1, %%eax", + X86_FEATURE_PERFCTR_CORE, + "=a" (offset), + "a" (index)); + + return offset; } static inline unsigned int x86_pmu_config_addr(int index) ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems
From: Robert Richter @ 2011-04-16 0:27 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, Stephane Eranian, LKML, Robert Richter

The current x86 event scheduler fails to resolve scheduling problems
for certain combinations of events and constraints. This happens
especially for events with complex constraints such as those of the
AMD family 15h PMU. In such cases the scheduler does not find an
existing solution. Examples:

   event  constraint  failing   possible
   code   counters    counter   solution

1) 0x043  PMC[2:0]    0         1
   0x02E  PMC[3,0]    3         0
   0x003  PMC3        FAIL      3

2) 0x02E  PMC[3,0]    0         3
   0x043  PMC[2:0]    1         0
   0x045  PMC[2:0]    2         1
   0x046  PMC[2:0]    FAIL      2

Scheduling events on counters is a Hamiltonian path problem. To find a
possible solution we must traverse all existing paths. This patch
implements this.

We need to save the state of every path already walked. If we fail to
schedule an event, we now roll back to the previous state and try
another free counter, until all paths have been analysed.

We might consider removing the constraint weight implementation
completely later on, but I left that out as it is a much bigger and
riskier change than this fix.
Cc: Stephane Eranian <eranian@google.com> Signed-off-by: Robert Richter <robert.richter@amd.com> --- arch/x86/kernel/cpu/perf_event.c | 48 +++++++++++++++++++++++++++++++------ 1 files changed, 40 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 224a84f..887a500 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -770,11 +770,19 @@ static inline int is_x86_event(struct perf_event *event) return event->pmu == &pmu; } +struct sched_log +{ + int i; + int w; + int idx; +}; + static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign) { struct event_constraint *c, *constraints[X86_PMC_IDX_MAX]; unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)]; - int i, j, w, wmax, num = 0; + struct sched_log sched_log[X86_PMC_IDX_MAX]; + int i, idx, w, wmax, num = 0, scheduled = 0; struct hw_perf_event *hwc; bitmap_zero(used_mask, X86_PMC_IDX_MAX); @@ -815,6 +823,7 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign) */ bitmap_zero(used_mask, X86_PMC_IDX_MAX); + memset(&sched_log, 0, sizeof(sched_log)); /* * weight = number of possible counters @@ -838,25 +847,48 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign) for (w = 1, num = n; num && w <= wmax; w++) { /* for each event */ for (i = 0; num && i < n; i++) { + redo: c = constraints[i]; hwc = &cpuc->event_list[i]->hw; if (c->weight != w) continue; - for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) { - if (!test_bit(j, used_mask)) + idx = sched_log[scheduled].idx; + /* for each bit in idxmsk starting from idx */ + while (idx < X86_PMC_IDX_MAX) { + idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX, + idx); + if (idx == X86_PMC_IDX_MAX) + break; + if (!__test_and_set_bit(idx, used_mask)) break; + idx++; } - if (j == X86_PMC_IDX_MAX) - break; - - __set_bit(j, used_mask); + if (idx >= X86_PMC_IDX_MAX) { + /* roll back and try next free counter */ + if 
(!scheduled) + /* no free counters anymore */ + break; + sched_log[scheduled].idx = 0; + scheduled--; + num++; + clear_bit(sched_log[scheduled].idx, used_mask); + i = sched_log[scheduled].i; + w = sched_log[scheduled].w; + sched_log[scheduled].idx++; + goto redo; + } if (assign) - assign[i] = j; + assign[i] = idx; + num--; + sched_log[scheduled].i = i; + sched_log[scheduled].w = w; + sched_log[scheduled].idx = idx; + scheduled++; } } done: -- 1.7.3.4 ^ permalink raw reply related [flat|nested] 38+ messages in thread
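The kernel patch keeps an explicit sched_log[] stack and rolls back with a goto; the same backtracking idea can be sketched more compactly with recursion. Below is a userspace Python model (not the kernel code) that assigns distinct counters to events given per-event counter bitmasks, trying events in ascending constraint weight as the kernel loop does:

```python
def schedule(constraints):
    """Assign each event (a counter bitmask) a distinct counter index,
    backtracking when an earlier choice blocks a later event.
    Returns the assignment list, or None if no solution exists."""
    # Try events in ascending weight (popcount), like the kernel's w loop.
    order = sorted(range(len(constraints)),
                   key=lambda i: bin(constraints[i]).count("1"))
    assign = [None] * len(constraints)

    def place(pos, used):
        if pos == len(order):
            return True
        i = order[pos]
        mask = constraints[i]
        idx = 0
        while mask >> idx:
            if (mask >> idx) & 1 and not (used >> idx) & 1:
                assign[i] = idx
                if place(pos + 1, used | (1 << idx)):
                    return True   # this counter choice worked
            idx += 1             # roll back: try the next free counter
        return False

    return assign if place(0, 0) else None

# Example 2 from the changelog: 0x02E -> PMC[3,0], plus three events
# constrained to PMC[2:0].  A greedy pass parks 0x02E on counter 0 and
# then fails on the last PMC[2:0] event, but a solution exists.
events = [0b1001, 0b0111, 0b0111, 0b0111]   # 0x02E, 0x043, 0x045, 0x046
assign = schedule(events)
assert assign is not None
assert len(set(assign)) == len(events)                      # all distinct
assert all((events[i] >> assign[i]) & 1 for i in range(4))  # all legal
```

The rollback step is exactly what the non-backtracking scheduler lacked: without it, the first dead end is reported as failure even though reordering the earlier choices yields a valid assignment.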
* Re: [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems
From: Peter Zijlstra @ 2011-04-16 8:51 UTC (permalink / raw)
To: Robert Richter; +Cc: Ingo Molnar, Stephane Eranian, LKML

On Sat, 2011-04-16 at 02:27 +0200, Robert Richter wrote:
> The current x86 event scheduler fails to resolve scheduling problems
> for certain combinations of events and constraints. This happens
> especially for events with complex constraints such as those of the
> AMD family 15h PMU. In such cases the scheduler does not find an
> existing solution. Examples:
>
>    event  constraint  failing   possible
>    code   counters    counter   solution
>
> 1) 0x043  PMC[2:0]    0         1
>    0x02E  PMC[3,0]    3         0
>    0x003  PMC3        FAIL      3
>
> 2) 0x02E  PMC[3,0]    0         3
>    0x043  PMC[2:0]    1         0
>    0x045  PMC[2:0]    2         1
>    0x046  PMC[2:0]    FAIL      2
>
> Scheduling events on counters is a Hamiltonian path problem. To find a
> possible solution we must traverse all existing paths. This patch
> implements this.
>
> We need to save the state of every path already walked. If we fail to
> schedule an event, we now roll back to the previous state and try
> another free counter, until all paths have been analysed.
>
> We might consider removing the constraint weight implementation
> completely later on, but I left that out as it is a much bigger and
> riskier change than this fix.

Argh, crap. That's because AMD is now the first with overlapping
constraints. Be sure to let your hardware guys know that they went
from top to bottom on my appreciation list. AMD used to have no
constraints and now they have the absolute worst.

I'd really prefer not to do this for .39, and I'll have to sit down
and actually read this code. It looks like we went from O(n^2) to
O(n!) or somesuch, which is also not much of an improvement. I'll have
to analyze the solver to see what it does for 'simple' constraint sets,
to see if it will indeed be more expensive than the O(n^2) solver we
had.

Also, I think this code could do with a tiny bit of comments ;-)
* Re: [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems 2011-04-16 8:51 ` Peter Zijlstra @ 2011-04-16 9:43 ` Ingo Molnar 2011-04-16 10:08 ` Peter Zijlstra 2011-04-17 8:15 ` Robert Richter 1 sibling, 1 reply; 38+ messages in thread From: Ingo Molnar @ 2011-04-16 9:43 UTC (permalink / raw) To: Peter Zijlstra Cc: Robert Richter, Stephane Eranian, LKML, Arnaldo Carvalho de Melo, Frédéric Weisbecker * Peter Zijlstra <peterz@infradead.org> wrote: > On Sat, 2011-04-16 at 02:27 +0200, Robert Richter wrote: > > The current x86 event scheduler fails to resolve scheduling problems > > of certain combinations of events and constraints. This happens esp. > > for events with complex constraints such as those of the AMD family > > 15h pmu. The scheduler does not find then an existing solution. > > Examples are: > > > > event code counter failure possible > > solution > > > > 1) 0x043 PMC[2:0] 0 1 > > 0x02E PMC[3,0] 3 0 > > 0x003 PMC3 FAIL 3 > > > > 2) 0x02E PMC[3,0] 0 3 > > 0x043 PMC[2:0] 1 0 > > 0x045 PMC[2:0] 2 1 > > 0x046 PMC[2:0] FAIL 2 > > > > Scheduling events on counters is a Hamiltonian path problem. To find a > > possible solution we must traverse all existing paths. This patch > > implements this. > > > > We need to save all states of already walked paths. If we fail to > > schedule an event we now rollback the previous state and try to use > > another free counter until we have analysed all paths. > > > > We might consider to later remove the constraint weight implementation > > completely, but I left this out as this is a much bigger and more > > risky change than this fix. > > Argh, crap. That's because AMD is now the first with overlapping > constraints. Be sure to let your hardware guys know that they went from > top to bottom om my appreciation list. AMD used to have no constraints > and now they have the absolute worst. > > I'd really prefer not to do this for .39, and I'll have to sit down and > actually read this code. 
It looks like we went from O(n^2) to O(n!) or > somesuch, also not much of an improvement. I'll have to analyze the > solver to see what it does for 'simple' constraints set to see if it > will indeed be more expensive than the O(n^2) solver we had. > > Also, I think this code could do with a tiny bit of comments ;-) I'd also prefer if we first had actual testcases in 'perf test' for all these failures - it took an *awfully* long time to find these regressions (the event scheduler code has been committed for months), while with proper testcases it would only take a second to run 'perf test'. Thanks, Ingo ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems 2011-04-16 9:43 ` Ingo Molnar @ 2011-04-16 10:08 ` Peter Zijlstra 2011-04-16 10:14 ` Ingo Molnar 2011-04-16 14:26 ` Valdis.Kletnieks 0 siblings, 2 replies; 38+ messages in thread From: Peter Zijlstra @ 2011-04-16 10:08 UTC (permalink / raw) To: Ingo Molnar Cc: Robert Richter, Stephane Eranian, LKML, Arnaldo Carvalho de Melo, Frédéric Weisbecker On Sat, 2011-04-16 at 11:43 +0200, Ingo Molnar wrote: > I'd also prefer if we first had actual testcases in 'perf test' for all these > failures - it took an *awfully* long time to find these regressions (the event > scheduler code has been committed for months), while with proper testcases it > would only take a second to run 'perf test'. These cases only exist on AMD F15, I don't think there's many people with such systems around. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems 2011-04-16 10:08 ` Peter Zijlstra @ 2011-04-16 10:14 ` Ingo Molnar 2011-04-16 10:15 ` Peter Zijlstra 2011-04-16 14:26 ` Valdis.Kletnieks 1 sibling, 1 reply; 38+ messages in thread From: Ingo Molnar @ 2011-04-16 10:14 UTC (permalink / raw) To: Peter Zijlstra Cc: Robert Richter, Stephane Eranian, LKML, Arnaldo Carvalho de Melo, Frédéric Weisbecker * Peter Zijlstra <peterz@infradead.org> wrote: > On Sat, 2011-04-16 at 11:43 +0200, Ingo Molnar wrote: > > I'd also prefer if we first had actual testcases in 'perf test' for all these > > failures - it took an *awfully* long time to find these regressions (the event > > scheduler code has been committed for months), while with proper testcases it > > would only take a second to run 'perf test'. > > These cases only exist on AMD F15, I don't think there's many people > with such systems around. Well, if the trend continues we'll have more twisted constraints and more bugs of this sort, so having a testsuite sure cannot hurt, right? Thanks, Ingo ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems 2011-04-16 10:14 ` Ingo Molnar @ 2011-04-16 10:15 ` Peter Zijlstra 0 siblings, 0 replies; 38+ messages in thread From: Peter Zijlstra @ 2011-04-16 10:15 UTC (permalink / raw) To: Ingo Molnar Cc: Robert Richter, Stephane Eranian, LKML, Arnaldo Carvalho de Melo, Frédéric Weisbecker On Sat, 2011-04-16 at 12:14 +0200, Ingo Molnar wrote: > * Peter Zijlstra <peterz@infradead.org> wrote: > > > On Sat, 2011-04-16 at 11:43 +0200, Ingo Molnar wrote: > > > I'd also prefer if we first had actual testcases in 'perf test' for all these > > > failures - it took an *awfully* long time to find these regressions (the event > > > scheduler code has been committed for months), while with proper testcases it > > > would only take a second to run 'perf test'. > > > > These cases only exist on AMD F15, I don't think there's many people > > with such systems around. > > Well, if the trend continues we'll have more twisted constraints and more bugs > of this sort, so having a testsuite sure cannot hurt, right? For sure. But the problem with writing test cases at the perf userspace level is that they're very hardware specific. It would be much easier to write unit tests for the solver itself. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems 2011-04-16 10:08 ` Peter Zijlstra 2011-04-16 10:14 ` Ingo Molnar @ 2011-04-16 14:26 ` Valdis.Kletnieks 1 sibling, 0 replies; 38+ messages in thread From: Valdis.Kletnieks @ 2011-04-16 14:26 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Robert Richter, Stephane Eranian, LKML, Arnaldo Carvalho de Melo, Frédéric Weisbecker [-- Attachment #1: Type: text/plain, Size: 218 bytes --] On Sat, 16 Apr 2011 12:08:40 +0200, Peter Zijlstra said: > These cases only exist on AMD F15, I don't think there's many people > with such systems around. Didn't they say similar things about the first Itaniums? :) [-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --] ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems 2011-04-16 8:51 ` Peter Zijlstra 2011-04-16 9:43 ` Ingo Molnar @ 2011-04-17 8:15 ` Robert Richter 2011-04-17 8:18 ` Ingo Molnar 1 sibling, 1 reply; 38+ messages in thread From: Robert Richter @ 2011-04-17 8:15 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Ingo Molnar, Stephane Eranian, LKML On 16.04.11 04:51:17, Peter Zijlstra wrote: > Argh, crap. That's because AMD is now the first with overlapping > constraints. Be sure to let your hardware guys know that they went from > top to bottom on my appreciation list. AMD used to have no constraints > and now they have the absolute worst. Yes, the overlapping constraints are the problem. > I'd really prefer not to do this for .39, and I'll have to sit down and > actually read this code. It looks like we went from O(n^2) to O(n!) or > somesuch, also not much of an improvement. I'll have to analyze the > solver to see what it does for 'simple' constraints set to see if it > will indeed be more expensive than the O(n^2) solver we had. It won't be more expensive if there is a solution. But if there is none, we now walk all possible paths, which is something like O(n!). Yes, we can defer this general solution until after .39. I will try to find a solution that handles the special case for family 15h as a fix for .39. > Also, I think this code could do with a tiny bit of comments ;-) I will comment on the code inline in the patch. -Robert -- Advanced Micro Devices, Inc. Operating System Research Center ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems 2011-04-17 8:15 ` Robert Richter @ 2011-04-17 8:18 ` Ingo Molnar 2011-04-17 8:53 ` Peter Zijlstra 0 siblings, 1 reply; 38+ messages in thread From: Ingo Molnar @ 2011-04-17 8:18 UTC (permalink / raw) To: Robert Richter; +Cc: Peter Zijlstra, Stephane Eranian, LKML * Robert Richter <robert.richter@amd.com> wrote: > > I'd really prefer not to do this for .39, and I'll have to sit down and > > actually read this code. It looks like we went from O(n^2) to O(n!) or > > somesuch, also not much of an improvement. I'll have to analyze the solver > > to see what it does for 'simple' constraints set to see if it will indeed > > be more expensive than the O(n^2) solver we had. > > It wont be more expensive, if there is a solution. But if there is no one we > walk all possible ways now which is something like O(n!). So with 6 counters it would be a loop of 720, with 8 counters a loop of 40320, with 10 counters a loop of 3628800 ... O(n!) is not fun. Thanks, Ingo ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems 2011-04-17 8:18 ` Ingo Molnar @ 2011-04-17 8:53 ` Peter Zijlstra 2011-04-17 11:23 ` Robert Richter 0 siblings, 1 reply; 38+ messages in thread From: Peter Zijlstra @ 2011-04-17 8:53 UTC (permalink / raw) To: Ingo Molnar; +Cc: Robert Richter, Stephane Eranian, LKML On Sun, 2011-04-17 at 10:18 +0200, Ingo Molnar wrote: > * Robert Richter <robert.richter@amd.com> wrote: > > > > I'd really prefer not to do this for .39, and I'll have to sit down and > > > actually read this code. It looks like we went from O(n^2) to O(n!) or > > > somesuch, also not much of an improvement. I'll have to analyze the solver > > > to see what it does for 'simple' constraints set to see if it will indeed > > > be more expensive than the O(n^2) solver we had. > > > > It wont be more expensive, if there is a solution. But if there is no one we > > walk all possible ways now which is something like O(n!). > > So with 6 counters it would be a loop of 720, with 8 counters a loop of 40320, > with 10 counters a loop of 3628800 ... O(n!) is not fun. Right, and we'll hit this case at least once when scheduling an over-committed system. Intel Sandy Bridge can have 8 counters per core + 3 fixed counters, giving an n=11 situation. You do _NOT_ want to have one 39916800-cycle loop before we determine the PMU isn't schedulable, that's simply unacceptable. There's a fine line between maximum PMU utilization and acceptable performance here, and an O(n!) algorithm is really not acceptable. If you can find a polynomial algorithm that improves the AMD-F15 situation we can talk. As it stands I'm tempted to have AMD suffer its terrible PMU design decisions; if you want this fixed, fix the silicon. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems 2011-04-17 8:53 ` Peter Zijlstra @ 2011-04-17 11:23 ` Robert Richter 2011-04-18 8:17 ` Robert Richter 0 siblings, 1 reply; 38+ messages in thread From: Robert Richter @ 2011-04-17 11:23 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Ingo Molnar, Stephane Eranian, LKML [-- Attachment #1: Type: text/plain, Size: 1906 bytes --] On 17.04.11 04:53:32, Peter Zijlstra wrote: > On Sun, 2011-04-17 at 10:18 +0200, Ingo Molnar wrote: > > So with 6 counters it would be a loop of 720, with 8 counters a loop of 40320, > > with 10 counters a loop of 3628800 ... O(n!) is not fun. > > Right, and we'll hit this case at least once when scheduling a > over-committed system. Intel Sandy Bridge can have 8 counters per core + > 3 fixed counters, giving an n=11 situation. You do _NOT_ want to have > one 39916800 cycle loop before we determine the PMU isn't schedulable, > that's simply unacceptable. Of course it is not that many, as the algorithm is already optimized and we only walk possible paths. Also, the more constraints we have, the fewer paths we have to walk.
So let's assume a worst case of 8 unconstrained counters. I reimplemented the algorithm in the attached perl script and counted 251 loops; the following numbers show how the loop count depends on the number of counters:

$ perl counter-scheduling.pl | grep Num
Number of counters: 2, loops: 10, redos: 4, ratio: 2.5
Number of counters: 3, loops: 26, redos: 7, ratio: 3.7
Number of counters: 4, loops: 53, redos: 11, ratio: 4.8
Number of counters: 5, loops: 89, redos: 15, ratio: 5.9
Number of counters: 6, loops: 134, redos: 19, ratio: 7.1
Number of counters: 7, loops: 188, redos: 23, ratio: 8.2
Number of counters: 8, loops: 251, redos: 27, ratio: 9.3
Number of counters: 9, loops: 323, redos: 31, ratio: 10.4
Number of counters: 10, loops: 404, redos: 35, ratio: 11.5
Number of counters: 11, loops: 494, redos: 39, ratio: 12.7
Number of counters: 12, loops: 593, redos: 43, ratio: 13.8

It seems the algorithm is about number-of-counters times slower than the current one. I think this is worth some further consideration. There is also some room for improvement in my algorithm.

-Robert

-- Advanced Micro Devices, Inc. Operating System Research Center

[-- Attachment #2: counter-scheduling.pl --]
[-- Type: text/x-perl, Size: 984 bytes --]

#!/usr/bin/perl

#$num_ctrs = 11;

for ($num_ctrs = 2; $num_ctrs <= 12; $num_ctrs++) {

$num_events = $num_ctrs + 1;
@sched_log = ();
$scheduled = 0;
$used_mask = 0;
$loops = 0;
$redos = 0;

$scheduled = 0;
while ($scheduled < $num_events) {
	for ($idx = $sched_log[$scheduled] || 0; $idx < $num_ctrs; $idx++) {
		$loops++;
		last if !((1 << $idx) & $used_mask);
	}
	if ($idx == $num_ctrs) {
		printf "Failed to schedule event #%d\n", $scheduled;
		last if (!$scheduled);
		$sched_log[$scheduled] = 0;
		$scheduled--;
		$idx = $sched_log[$scheduled];
		$sched_log[$scheduled]++;
		$used_mask &= ~(1 << $idx);
		printf "Rollback event #%d on counter #%d\n", $scheduled, $idx;
		$redos++;
		redo;
	}
	$used_mask |= (1 << $idx);
	push @sched_log, $idx;
	printf "Scheduling event #%d on counter #%d\n", $scheduled, $idx;
	$scheduled++;
}

printf("Number of counters: %2d, loops: %3d, redos: %3d, ratio: %.1f\n",
	$num_ctrs, $loops, $redos, $loops / $redos);

}

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems 2011-04-17 11:23 ` Robert Richter @ 2011-04-18 8:17 ` Robert Richter 0 siblings, 0 replies; 38+ messages in thread From: Robert Richter @ 2011-04-18 8:17 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Ingo Molnar, Stephane Eranian, LKML [-- Attachment #1: Type: text/plain, Size: 1929 bytes --] On 17.04.11 13:23:25, Robert Richter wrote:

> $ perl counter-scheduling.pl | grep Num
> Number of counters: 2, loops: 10, redos: 4, ratio: 2.5
> Number of counters: 3, loops: 26, redos: 7, ratio: 3.7
> Number of counters: 4, loops: 53, redos: 11, ratio: 4.8
> Number of counters: 5, loops: 89, redos: 15, ratio: 5.9
> Number of counters: 6, loops: 134, redos: 19, ratio: 7.1
> Number of counters: 7, loops: 188, redos: 23, ratio: 8.2
> Number of counters: 8, loops: 251, redos: 27, ratio: 9.3
> Number of counters: 9, loops: 323, redos: 31, ratio: 10.4
> Number of counters: 10, loops: 404, redos: 35, ratio: 11.5
> Number of counters: 11, loops: 494, redos: 39, ratio: 12.7
> Number of counters: 12, loops: 593, redos: 43, ratio: 13.8

Wrong, wrong, wrong!
Before you waste your time on this: the script has a bug in how it stores the backtracking state:

@@ -35,7 +35,7 @@ while ($scheduled < $num_events) {
 	}

 	$used_mask |= (1 << $idx);
-	push @sched_log, $idx;
+	$sched_log[$scheduled] = $idx;
 	printf "Scheduling event #%d on counter #%d\n", $scheduled, $idx;
 	$scheduled++;
 }

It's not possible to do the calculation for 11 counters in reasonable time:

$ perl counter-scheduling.pl | grep Num
Number of counters: 2, loops: 10, redos: 4, ratio: 2.5
Number of counters: 3, loops: 48, redos: 15, ratio: 3.2
Number of counters: 4, loops: 260, redos: 64, ratio: 4.1
Number of counters: 5, loops: 1630, redos: 325, ratio: 5.0
Number of counters: 6, loops: 11742, redos: 1956, ratio: 6.0
Number of counters: 7, loops: 95900, redos: 13699, ratio: 7.0
Number of counters: 8, loops: 876808, redos: 109600, ratio: 8.0
Number of counters: 9, loops: 8877690, redos: 986409, ratio: 9.0
Number of counters: 10, loops: 98641010, redos: 9864100, ratio: 10.0

Updated script attached.

Sorry,

-Robert

-- Advanced Micro Devices, Inc. Operating System Research Center

[-- Attachment #2: counter-scheduling.pl --]
[-- Type: text/x-perl, Size: 992 bytes --]

#!/usr/bin/perl

#$num_ctrs = 11;

for ($num_ctrs = 2; $num_ctrs <= 12; $num_ctrs++) {

$num_events = $num_ctrs + 1;
@sched_log = ();
$scheduled = 0;
$used_mask = 0;
$loops = 0;
$redos = 0;

$scheduled = 0;
while ($scheduled < $num_events) {
	for ($idx = $sched_log[$scheduled] || 0; $idx < $num_ctrs; $idx++) {
		$loops++;
		last if !((1 << $idx) & $used_mask);
	}
	if ($idx == $num_ctrs) {
		printf "Failed to schedule event #%d\n", $scheduled;
		last if (!$scheduled);
		$sched_log[$scheduled] = 0;
		$scheduled--;
		$idx = $sched_log[$scheduled];
		$sched_log[$scheduled]++;
		$used_mask &= ~(1 << $idx);
		printf "Rollback event #%d on counter #%d\n", $scheduled, $idx;
		$redos++;
		redo;
	}
	$used_mask |= (1 << $idx);
	$sched_log[$scheduled] = $idx;
	printf "Scheduling event #%d on counter #%d\n", $scheduled, $idx;
	$scheduled++;
}

printf("Number of counters: %2d, loops: %3d, redos: %3d, ratio: %.1f\n",
	$num_ctrs, $loops, $redos, $loops / $redos);

}

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems 2011-04-16 0:27 ` [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems Robert Richter 2011-04-16 8:51 ` Peter Zijlstra @ 2011-04-16 15:52 ` Stephane Eranian 2011-04-17 8:44 ` Robert Richter 2011-04-19 10:26 ` [PATCH v2] perf, x86: Fix event scheduler for constraints with overlapping counters Robert Richter 2 siblings, 1 reply; 38+ messages in thread From: Stephane Eranian @ 2011-04-16 15:52 UTC (permalink / raw) To: Robert Richter; +Cc: Peter Zijlstra, Ingo Molnar, LKML [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 7153 bytes --] On Fri, Apr 15, 2011 at 5:27 PM, Robert Richter <robert.richter@amd.com> wrote:

> The current x86 event scheduler fails to resolve scheduling problems
> of certain combinations of events and constraints. This happens esp.
> for events with complex constraints such as those of the AMD family
> 15h pmu. The scheduler then does not find an existing solution.
> Examples are:
>
>     event code    counter     failure     possible solution
>
> 1)    0x043      PMC[2:0]     0           1
>       0x02E      PMC[3,0]     3           0
>       0x003      PMC3         FAIL        3
>

I am not sure I understand this failure case. If I recall, the scheduler looks at the weight of each event first:

                            weight
1)    0x043      PMC[2:0]   3
      0x02E      PMC[3,0]   2
      0x003      PMC3       1

Then, it schedules in increasing weight order. So it will schedule weight 1, 2, 3. For weight 1, it will find counter 3; for weight 2, it will take counter 0; for weight 3, it will take counter 1 given 0 is already used. Or am I reading your example the wrong way? The fact that counters have overlapping constraints should not matter. In fact this is what happens with counters without constraints.
> 2)    0x02E      PMC[3,0]     0           3
>       0x043      PMC[2:0]     1           0
>       0x045      PMC[2:0]     2           1
>       0x046      PMC[2:0]     FAIL        2
>
> Scheduling events on counters is a Hamiltonian path problem. To find a
> possible solution we must traverse all existing paths. This patch
> implements this.
>
> We need to save all states of already walked paths. If we fail to
> schedule an event we now rollback the previous state and try to use
> another free counter until we have analysed all paths.
>
> We might consider to later remove the constraint weight implementation
> completely, but I left this out as this is a much bigger and more
> risky change than this fix.
>
> Cc: Stephane Eranian <eranian@google.com>
> Signed-off-by: Robert Richter <robert.richter@amd.com>
> ---
>  arch/x86/kernel/cpu/perf_event.c |  48 +++++++++++++++++++++++++++++++------
>  1 files changed, 40 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 224a84f..887a500 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -770,11 +770,19 @@ static inline int is_x86_event(struct perf_event *event)
>  	return event->pmu == &pmu;
>  }
>
> +struct sched_log
> +{
> +	int	i;
> +	int	w;
> +	int	idx;
> +};
> +
>  static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
>  {
>  	struct event_constraint *c, *constraints[X86_PMC_IDX_MAX];
>  	unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
> -	int i, j, w, wmax, num = 0;
> +	struct sched_log sched_log[X86_PMC_IDX_MAX];
> +	int i, idx, w, wmax, num = 0, scheduled = 0;
>  	struct hw_perf_event *hwc;
>
>  	bitmap_zero(used_mask, X86_PMC_IDX_MAX);
> @@ -815,6 +823,7 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
>  	 */
>
>  	bitmap_zero(used_mask, X86_PMC_IDX_MAX);
> +	memset(&sched_log, 0, sizeof(sched_log));
>
>  	/*
>  	 * weight = number of possible counters
> @@ -838,25 +847,48 @@
>  	for (w = 1, num = n; num && w <= wmax; w++) {
>  		/* for each event */
>  		for (i = 0; num && i < n; i++) {
> +		redo:
>  			c = constraints[i];
>  			hwc = &cpuc->event_list[i]->hw;
>
>  			if (c->weight != w)
>  				continue;
>
> -			for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
> -				if (!test_bit(j, used_mask))
> +			idx = sched_log[scheduled].idx;
> +			/* for each bit in idxmsk starting from idx */
> +			while (idx < X86_PMC_IDX_MAX) {
> +				idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX,
> +						    idx);
> +				if (idx == X86_PMC_IDX_MAX)
> +					break;
> +				if (!__test_and_set_bit(idx, used_mask))
>  					break;
> +				idx++;
>  			}
>
> -			if (j == X86_PMC_IDX_MAX)
> -				break;
> -
> -			__set_bit(j, used_mask);
> +			if (idx >= X86_PMC_IDX_MAX) {
> +				/* roll back and try next free counter */
> +				if (!scheduled)
> +					/* no free counters anymore */
> +					break;
> +				sched_log[scheduled].idx = 0;
> +				scheduled--;
> +				num++;
> +				clear_bit(sched_log[scheduled].idx, used_mask);
> +				i = sched_log[scheduled].i;
> +				w = sched_log[scheduled].w;
> +				sched_log[scheduled].idx++;
> +				goto redo;
> +			}
>
>  			if (assign)
> -				assign[i] = j;
> +				assign[i] = idx;
> +
>  			num--;
> +			sched_log[scheduled].i = i;
> +			sched_log[scheduled].w = w;
> +			sched_log[scheduled].idx = idx;
> +			scheduled++;
>  		}
>  	}
>  done:
> --
> 1.7.3.4
>
>

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems 2011-04-16 15:52 ` Stephane Eranian @ 2011-04-17 8:44 ` Robert Richter 2011-04-17 9:05 ` Stephane Eranian 0 siblings, 1 reply; 38+ messages in thread From: Robert Richter @ 2011-04-17 8:44 UTC (permalink / raw) To: Stephane Eranian; +Cc: Peter Zijlstra, Ingo Molnar, LKML On 16.04.11 11:52:54, Stephane Eranian wrote: > On Fri, Apr 15, 2011 at 5:27 PM, Robert Richter <robert.richter@amd.com> wrote: > > > event code counter failure possible solution > > > > 1) 0x043 PMC[2:0] 0 1 > > 0x02E PMC[3,0] 3 0 > > 0x003 PMC3 FAIL 3 > > > I am not sure I understand this failure case. If I recall > the scheduler looks at the weight of each event first: > > weight > 1) 0x043 PMC[2:0] 3 > 0x02E PMC[3,0] 2 > 0x003 PMC3 1 > > Then, it schedules in increasing weight order. So it will > schedule weight 1, 2, 3. For weight 1, it will find counter3, > for weight 2, it will take counter0, for weight 3, it will > take counter 1 given 0 is already used. > > Or am I reading your example the wrong way? No, I have to admit I made this one up and picked a wrong example for the special problem I was thinking about. The above works as you described because of the constraints' weights. I don't have an example with existing constraints, but consider the following (theoretical) one:

counter: 3210     Failure:     Solution:

e1         xx        xo           ox
e2         xx        xo           ox
e3        x x       o x          o x
e4        x  x       x F          x o

The special thing about the above is that two events (e1 and e2) must be rescheduled to schedule e4. This means that swapping the counters of only one already scheduled event is not sufficient. A counter of a third event must be freed, and this counter is then taken by the second event.
An event set containing constraints that satisfy the following condition is problematic:

(c1->weight <= c2->weight && ((c1->cmask & c2->cmask) != c1->cmask))

Basically this means the two constraints do not overlap completely. You then can't select the correct counter without knowing which events with higher weights will be scheduled later.

-Robert

-- Advanced Micro Devices, Inc. Operating System Research Center

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems 2011-04-17 8:44 ` Robert Richter @ 2011-04-17 9:05 ` Stephane Eranian 0 siblings, 0 replies; 38+ messages in thread From: Stephane Eranian @ 2011-04-17 9:05 UTC (permalink / raw) To: Robert Richter; +Cc: Peter Zijlstra, Ingo Molnar, LKML On Sun, Apr 17, 2011 at 1:44 AM, Robert Richter <robert.richter@amd.com> wrote: > On 16.04.11 11:52:54, Stephane Eranian wrote: >> On Fri, Apr 15, 2011 at 5:27 PM, Robert Richter <robert.richter@amd.com> wrote: > >> > event code counter failure possible solution >> > >> > 1) 0x043 PMC[2:0] 0 1 >> > 0x02E PMC[3,0] 3 0 >> > 0x003 PMC3 FAIL 3 >> > >> I am not sure I understand this failure case. If I recall >> the scheduler looks at the weight of each event first: >> >> weight >> 1) 0x043 PMC[2:0] 3 >> 0x02E PMC[3,0] 2 >> 0x003 PMC3 1 >> >> Then, it schedules in increasing weight order. So it will >> schedule weight 1, 2, 3. For weight 1, it will find counter3, >> for weight 2, it will take counter0, for weight 3, it will >> take counter 1 given 0 is already used. >> >> Or am I reading your example the wrong way? > > No, I have to admit this one was taken out of my mind and I picked > a wrong example for the special problem I was thinking about. The > above works as you described because of the constraint's weights. > > I don't have an example with existing constraints, but consider the > following (theoretical) one: > > counter: 3210 Failure: Solution: > > e1 xx xo ox > e2 xx xo ox > e3 x x o x o x > e4 x x x F x o > > The special with the above is that two events (e1 and e2) must be > rescheduled to schedule e4. This means that swapping counters of only > one already scheduled event is not sufficient. A counter of a third > event must be freed, this counter is then taken by the second event. Ok, I understand this one better now. But you admit yourself, you made it up. Is it even possible with Fam15h? 
The thing is, I think, this is not a catastrophic problem. What will happen is that e4 cannot be scheduled with e1, e2, e3. In case the events are not in the same group, e4 will eventually get a chance to be scheduled due to round-robin on the event list. In case the events are all in one group, then you won't be able to add e4 to the group because of the group scheduling sanity check on creation. So either you have to shuffle your event groups or you will incur multiplexing. It's not as if e4 will be created but never be able to count anything. >> The fact that counter have overlapping constraints >> should not matter. In fact this is what happens with >> counters without constraints. > > An event set containing constraints with following conditions is > problematic: > > (c1->weight <= c2->weight && (c1->cmask & c2->cmask != c1->cmask)) > > Basically this means both constraints do not overlap completly. You > can't select then the correct counter without knowing the events with > higher weights that will be scheduled later. > > -Robert > > -- > Advanced Micro Devices, Inc. > Operating System Research Center > > ^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH v2] perf, x86: Fix event scheduler for constraints with overlapping counters 2011-04-16 0:27 ` [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems Robert Richter 2011-04-16 8:51 ` Peter Zijlstra 2011-04-16 15:52 ` Stephane Eranian @ 2011-04-19 10:26 ` Robert Richter 2011-04-19 11:29 ` Ingo Molnar 2011-05-18 21:16 ` Peter Zijlstra 2 siblings, 2 replies; 38+ messages in thread From: Robert Richter @ 2011-04-19 10:26 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Ingo Molnar, Stephane Eranian, LKML This version 2 implements an algorithm which does not increase the loop count for systems with no overlapping-counter constraints. With overlapping-counter constraints the loop count is reduced to a minimum and is also limited by a stack depth of 2 states. See below.

-Robert

>From cd70a579995fa7da885ec25db8ebe04ba4a0c30e Mon Sep 17 00:00:00 2001
From: Robert Richter <robert.richter@amd.com>
Date: Fri, 15 Apr 2011 11:04:06 +0200
Subject: [PATCH] perf, x86: Fix event scheduler for constraints with overlapping counters

The current x86 event scheduler fails to resolve scheduling problems of certain combinations of events and constraints. This happens if the counter mask of such an event is not a subset of any other counter mask of a constraint with an equal or higher weight, e.g. constraints of the AMD family 15h pmu:

                        counter mask    weight
 amd_f15_PMC30          0x09            2      <--- overlapping counters
 amd_f15_PMC20          0x07            3
 amd_f15_PMC53          0x38            3

The scheduler then does not find an existing solution. Here is an example:

 event code     counter         failure         possible solution
 0x02E          PMC[3,0]        0               3
 0x043          PMC[2:0]        1               0
 0x045          PMC[2:0]        2               1
 0x046          PMC[2:0]        FAIL            2

The event scheduler may not select the correct counter in the first cycle because it needs to know which subsequent events will be scheduled. It may fail to schedule the events then. To solve this, we now save the scheduler state of events with overlapping counter constraints.
If we fail to schedule the events, we roll back to those states and try to use another free counter. Constraints with overlapping counters are marked with a newly introduced redo flag. We set the redo flag for such constraints to give the scheduler a hint which events to select for counter rescheduling. The EVENT_CONSTRAINT_REDO() macro can be used for this. Care must be taken as the rescheduling algorithm is O(n!), which will increase scheduling cycles for an over-committed system dramatically. The number of such EVENT_CONSTRAINT_REDO() macros and their counter masks must be kept at a minimum. Thus, the current stack is limited to 2 states to limit the number of loops the algorithm takes in the worst case. On systems with no overlapping-counter constraints, this implementation does not increase the loop count compared to the previous algorithm. Cc: Stephane Eranian <eranian@google.com> Signed-off-by: Robert Richter <robert.richter@amd.com> --- arch/x86/kernel/cpu/perf_event.c | 97 ++++++++++++++++++++++++++++++---- arch/x86/kernel/cpu/perf_event_amd.c | 2 +- 2 files changed, 87 insertions(+), 12 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 224a84f..2d4cfae 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -86,6 +86,7 @@ struct event_constraint { u64 code; u64 cmask; int weight; + int redo; }; struct amd_nb { @@ -144,15 +145,40 @@ struct cpu_hw_events { struct amd_nb *amd_nb; }; -#define __EVENT_CONSTRAINT(c, n, m, w) {\ +#define __EVENT_CONSTRAINT(c, n, m, w, r) {\ { .idxmsk64 = (n) }, \ .code = (c), \ .cmask = (m), \ .weight = (w), \ + .redo = (r), \ } #define EVENT_CONSTRAINT(c, n, m) \ - __EVENT_CONSTRAINT(c, n, m, HWEIGHT(n)) + __EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 0) + +/* + * The redo flag marks event constraints with overlapping counter + * masks.
This is the case if the counter mask of such an event is not + * a subset of any other counter mask of a constraint with an equal or + * higher weight, e.g.: + * + * c_overlaps = EVENT_CONSTRAINT_REDO(0, 0x09, 0); + * c_another1 = EVENT_CONSTRAINT(0, 0x07, 0); + * c_another2 = EVENT_CONSTRAINT(0, 0x38, 0); + * + * The event scheduler may not select the correct counter in the first + * cycle because it needs to know which subsequent events will be + * scheduled. It may fail to schedule the events then. So we set the + * redo flag for such constraints to give the scheduler a hint which + * events to select for counter rescheduling. + * + * Care must be taken as the rescheduling algorithm is O(n!) which + * will increase scheduling cycles for an over-committed system + * dramatically. The number of such EVENT_CONSTRAINT_REDO() macros + * and their counter masks must be kept at a minimum. + */ +#define EVENT_CONSTRAINT_REDO(c, n, m) \ + __EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 1) /* * Constraint on the Event code. @@ -770,11 +796,24 @@ static inline int is_x86_event(struct perf_event *event) return event->pmu == &pmu; } +struct sched_state +{ + int i; + int w; + int idx; + int num; + unsigned long used[BITS_TO_LONGS(X86_PMC_IDX_MAX)]; +}; + +/* Total max is X86_PMC_IDX_MAX, but we are O(n!)
limited */ +#define SCHED_STATES_MAX 2 + static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign) { struct event_constraint *c, *constraints[X86_PMC_IDX_MAX]; unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)]; - int i, j, w, wmax, num = 0; + struct sched_state sched_state[SCHED_STATES_MAX]; + int i, idx, w, wmax, num = 0, state = 0; struct hw_perf_event *hwc; bitmap_zero(used_mask, X86_PMC_IDX_MAX); @@ -838,25 +877,61 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign) for (w = 1, num = n; num && w <= wmax; w++) { /* for each event */ for (i = 0; num && i < n; i++) { + idx = 0; + redo: c = constraints[i]; hwc = &cpuc->event_list[i]->hw; if (c->weight != w) continue; - for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) { - if (!test_bit(j, used_mask)) + /* for each bit in idxmsk starting from idx */ + while (idx < X86_PMC_IDX_MAX) { + idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX, + idx); + if (idx == X86_PMC_IDX_MAX) break; + if (!__test_and_set_bit(idx, used_mask)) + break; + idx++; } - if (j == X86_PMC_IDX_MAX) - break; - - __set_bit(j, used_mask); + if (idx >= X86_PMC_IDX_MAX) { + /* + * roll back and try to reschedule + * events on other counters + */ + if (!state) + /* no events to reschedule */ + goto done; + state--; + /* restore previous state: */ + i = sched_state[state].i; + w = sched_state[state].w; + idx = sched_state[state].idx; + num = sched_state[state].num; + bitmap_copy(used_mask, sched_state[state].used, n); + /* try next counter: */ + clear_bit(idx, used_mask); + num++; + idx++; + goto redo; + } if (assign) - assign[i] = j; + assign[i] = idx; + num--; + + if (c->redo && state < SCHED_STATES_MAX) { + /* store scheduler state: */ + sched_state[state].i = i; + sched_state[state].w = w; + sched_state[state].idx = idx; + sched_state[state].num = num; + bitmap_copy(sched_state[state].used, used_mask, n); + state++; + } } } done: @@ -1535,7 +1610,7 @@ static int __init init_hw_perf_events(void) 
unconstrained = (struct event_constraint) __EVENT_CONSTRAINT(0, (1ULL << x86_pmu.num_counters) - 1, - 0, x86_pmu.num_counters); + 0, x86_pmu.num_counters, 0); if (x86_pmu.event_constraints) { for_each_event_constraint(c, x86_pmu.event_constraints) { diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c index 758cccc..21821eb 100644 --- a/arch/x86/kernel/cpu/perf_event_amd.c +++ b/arch/x86/kernel/cpu/perf_event_amd.c @@ -457,7 +457,7 @@ static __initconst const struct x86_pmu amd_pmu = { static struct event_constraint amd_f15_PMC0 = EVENT_CONSTRAINT(0, 0x01, 0); static struct event_constraint amd_f15_PMC20 = EVENT_CONSTRAINT(0, 0x07, 0); static struct event_constraint amd_f15_PMC3 = EVENT_CONSTRAINT(0, 0x08, 0); -static struct event_constraint amd_f15_PMC30 = EVENT_CONSTRAINT(0, 0x09, 0); +static struct event_constraint amd_f15_PMC30 = EVENT_CONSTRAINT_REDO(0, 0x09, 0); static struct event_constraint amd_f15_PMC50 = EVENT_CONSTRAINT(0, 0x3F, 0); static struct event_constraint amd_f15_PMC53 = EVENT_CONSTRAINT(0, 0x38, 0); -- 1.7.3.4 -- Advanced Micro Devices, Inc. Operating System Research Center ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2] perf, x86: Fix event scheduler for constraints with overlapping counters 2011-04-19 10:26 ` [PATCH v2] perf, x86: Fix event scheduler for constraints with overlapping counters Robert Richter @ 2011-04-19 11:29 ` Ingo Molnar 2011-04-19 13:55 ` Robert Richter 2011-05-18 21:16 ` Peter Zijlstra 1 sibling, 1 reply; 38+ messages in thread From: Ingo Molnar @ 2011-04-19 11:29 UTC (permalink / raw) To: Robert Richter; +Cc: Peter Zijlstra, Stephane Eranian, LKML * Robert Richter <robert.richter@amd.com> wrote: > + if (!state) > + /* no events to reschedule */ > + goto done; Hm, that's 5 levels of indentation. Would it be possible to factor out the gist of the logic into a helper and such? My eyes would be much happier! Thanks, Ingo ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2] perf, x86: Fix event scheduler for constraints with overlapping counters 2011-04-19 11:29 ` Ingo Molnar @ 2011-04-19 13:55 ` Robert Richter 2011-04-28 9:50 ` Robert Richter 0 siblings, 1 reply; 38+ messages in thread From: Robert Richter @ 2011-04-19 13:55 UTC (permalink / raw) To: Ingo Molnar; +Cc: Peter Zijlstra, Stephane Eranian, LKML On 19.04.11 07:29:33, Ingo Molnar wrote: > > * Robert Richter <robert.richter@amd.com> wrote: > > > + if (!state) > > + /* no events to reschedule */ > > + goto done; > > Hm, that's 5 levels of indentation. Would it be possible to factor out the gist > of the logic into a helper and such? My eyes would be much happier! Hmm, I was thinking about this too, but it is not easy since we also control the loops with goto's and updates of the loop state. Refactoring out the inner block would mean updating state variables via pointers and introducing return codes that carry control flow information. I don't think this will become easier. We could probably merge both for-loops into one while-loop, but at the cost of code readability. -Robert -- Advanced Micro Devices, Inc. Operating System Research Center ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2] perf, x86: Fix event scheduler for constraints with overlapping counters 2011-04-19 13:55 ` Robert Richter @ 2011-04-28 9:50 ` Robert Richter 0 siblings, 0 replies; 38+ messages in thread From: Robert Richter @ 2011-04-28 9:50 UTC (permalink / raw) To: Ingo Molnar; +Cc: Peter Zijlstra, Stephane Eranian, LKML On 19.04.11 15:55:18, Robert Richter wrote: > On 19.04.11 07:29:33, Ingo Molnar wrote: > > > > * Robert Richter <robert.richter@amd.com> wrote: > > > > > + if (!state) > > > + /* no events to reschedule */ > > > + goto done; > > > > Hm, that's 5 levels of indentation. Would it be possible to factor out the gist > > of the logic into a helper and such? My eyes would be much happier! > > Hmm, I was thinking about this too, but it is not easy since we > control also the loops with goto's and updates of the loop > states. Refactoring out the inner block would mean to update state > varables via pointers and to introduce return codes which contains > control flow information. Don't think this will become easier. > > We could probably merge both for-loops in one while-loop but also on > costs of code readability. Ingo, do you still want me to rework it? Any other comments on this patch? Thanks, -Robert -- Advanced Micro Devices, Inc. Operating System Research Center ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2] perf, x86: Fix event scheduler for constraints with overlapping counters 2011-04-19 10:26 ` [PATCH v2] perf, x86: Fix event scheduler for constraints with overlapping counters Robert Richter 2011-04-19 11:29 ` Ingo Molnar @ 2011-05-18 21:16 ` Peter Zijlstra 2011-05-18 21:20 ` Ingo Molnar 1 sibling, 1 reply; 38+ messages in thread From: Peter Zijlstra @ 2011-05-18 21:16 UTC (permalink / raw) To: Robert Richter; +Cc: Ingo Molnar, Stephane Eranian, LKML So I'm mostly ok with this, just a few nits.. On Tue, 2011-04-19 at 12:26 +0200, Robert Richter wrote: > @@ -838,25 +877,61 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign) > for (w = 1, num = n; num && w <= wmax; w++) { > /* for each event */ > for (i = 0; num && i < n; i++) { > + idx = 0; > + redo: labels go on column 0. > c = constraints[i]; > hwc = &cpuc->event_list[i]->hw; > > if (c->weight != w) > continue; > > - for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) { > - if (!test_bit(j, used_mask)) > + /* for each bit in idxmsk starting from idx */ > + while (idx < X86_PMC_IDX_MAX) { > + idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX, > + idx); I'd be mighty tempted to ignore that 80 column rule here ;-) > + if (idx == X86_PMC_IDX_MAX) > break; > + if (!__test_and_set_bit(idx, used_mask)) > + break; > + idx++; > } > > - if (j == X86_PMC_IDX_MAX) > - break; > - > - __set_bit(j, used_mask); > + if (idx >= X86_PMC_IDX_MAX) { > + /* > + * roll back and try to reschedule > + * events on other counters > + */ > + if (!state) > + /* no events to reschedule */ > + goto done; Multi line indents get { } even if not strictly needed, alternatively we can place the comment on the same line as the goto. 
> + state--; > + /* restore previous state: */ > + i = sched_state[state].i; > + w = sched_state[state].w; > + idx = sched_state[state].idx; > + num = sched_state[state].num; > + bitmap_copy(used_mask, sched_state[state].used, n); > + /* try next counter: */ > + clear_bit(idx, used_mask); > + num++; Suppose we put the num--; bit below the if (c->redo) block, then we can remove this num++;, right? > + idx++; > + goto redo; > + } > > if (assign) > - assign[i] = j; > + assign[i] = idx; > + > num--; > + > + if (c->redo && state < SCHED_STATES_MAX) { > + /* store scheduler state: */ > + sched_state[state].i = i; > + sched_state[state].w = w; > + sched_state[state].idx = idx; > + sched_state[state].num = num; > + bitmap_copy(sched_state[state].used, used_mask, n); > + state++; > + } > } > } ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2] perf, x86: Fix event scheduler for constraints with overlapping counters 2011-05-18 21:16 ` Peter Zijlstra @ 2011-05-18 21:20 ` Ingo Molnar 2011-05-18 21:36 ` Peter Zijlstra 0 siblings, 1 reply; 38+ messages in thread From: Ingo Molnar @ 2011-05-18 21:20 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Robert Richter, Stephane Eranian, LKML * Peter Zijlstra <peterz@infradead.org> wrote: > > if (c->weight != w) > > continue; > > > > - for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) { > > - if (!test_bit(j, used_mask)) > > + /* for each bit in idxmsk starting from idx */ > > + while (idx < X86_PMC_IDX_MAX) { > > + idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX, > > + idx); > > I'd be mighty tempted to ignore that 80 column rule here ;-) Please put the body of the loop into a helper function, the function is large and there are countless col80 uglinesses in it! Thanks, Ingo ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2] perf, x86: Fix event scheduler for constraints with overlapping counters 2011-05-18 21:20 ` Ingo Molnar @ 2011-05-18 21:36 ` Peter Zijlstra 2011-05-19 10:49 ` Robert Richter 2011-05-19 18:06 ` Ingo Molnar 0 siblings, 2 replies; 38+ messages in thread From: Peter Zijlstra @ 2011-05-18 21:36 UTC (permalink / raw) To: Ingo Molnar; +Cc: Robert Richter, Stephane Eranian, LKML On Wed, 2011-05-18 at 23:20 +0200, Ingo Molnar wrote: > * Peter Zijlstra <peterz@infradead.org> wrote: > > > > if (c->weight != w) > > > continue; > > > > > > - for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) { > > > - if (!test_bit(j, used_mask)) > > > + /* for each bit in idxmsk starting from idx */ > > > + while (idx < X86_PMC_IDX_MAX) { > > > + idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX, > > > + idx); > > > > I'd be mighty tempted to ignore that 80 column rule here ;-) > > Please put the body of the loop into a helper function, the function is large > and there are countless col80 uglinesses in it! I just tried that, its real ugly due to the amount of state you need to pass around. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2] perf, x86: Fix event scheduler for constraints with overlapping counters 2011-05-18 21:36 ` Peter Zijlstra @ 2011-05-19 10:49 ` Robert Richter 2011-05-19 18:06 ` Ingo Molnar 1 sibling, 0 replies; 38+ messages in thread From: Robert Richter @ 2011-05-19 10:49 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Ingo Molnar, Stephane Eranian, LKML On 18.05.11 17:36:53, Peter Zijlstra wrote: > On Wed, 2011-05-18 at 23:20 +0200, Ingo Molnar wrote: > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > if (c->weight != w) > > > > continue; > > > > > > > > - for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) { > > > > - if (!test_bit(j, used_mask)) > > > > + /* for each bit in idxmsk starting from idx */ > > > > + while (idx < X86_PMC_IDX_MAX) { > > > > + idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX, > > > > + idx); > > > > > > I'd be mighty tempted to ignore that 80 column rule here ;-) > > > > Please put the body of the loop into a helper function, the function is large > > and there are countless col80 uglinesses in it! > > I just tried that, its real ugly due to the amount of state you need to > pass around. I will try to improve the patch and send a new version. Thanks for review. -Robert -- Advanced Micro Devices, Inc. Operating System Research Center ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2] perf, x86: Fix event scheduler for constraints with overlapping counters 2011-05-18 21:36 ` Peter Zijlstra 2011-05-19 10:49 ` Robert Richter @ 2011-05-19 18:06 ` Ingo Molnar 2011-05-20 3:18 ` Robert Richter 1 sibling, 1 reply; 38+ messages in thread From: Ingo Molnar @ 2011-05-19 18:06 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Robert Richter, Stephane Eranian, LKML * Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, 2011-05-18 at 23:20 +0200, Ingo Molnar wrote: > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > if (c->weight != w) > > > > continue; > > > > > > > > - for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) { > > > > - if (!test_bit(j, used_mask)) > > > > + /* for each bit in idxmsk starting from idx */ > > > > + while (idx < X86_PMC_IDX_MAX) { > > > > + idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX, > > > > + idx); > > > > > > I'd be mighty tempted to ignore that 80 column rule here ;-) > > > > Please put the body of the loop into a helper function, the function is large > > and there are countless col80 uglinesses in it! > > I just tried that, its real ugly due to the amount of state you need to > pass around. Does it help if you put that state into a helper structure? Thanks, Ingo ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2] perf, x86: Fix event scheduler for constraints with overlapping counters 2011-05-19 18:06 ` Ingo Molnar @ 2011-05-20 3:18 ` Robert Richter 2011-09-01 12:56 ` Peter Zijlstra 0 siblings, 1 reply; 38+ messages in thread From: Robert Richter @ 2011-05-20 3:18 UTC (permalink / raw) To: Ingo Molnar; +Cc: Peter Zijlstra, Stephane Eranian, LKML On 19.05.11 14:06:50, Ingo Molnar wrote: > * Peter Zijlstra <peterz@infradead.org> wrote: > > > On Wed, 2011-05-18 at 23:20 +0200, Ingo Molnar wrote: > > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > > > if (c->weight != w) > > > > > continue; > > > > > > > > > > - for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) { > > > > > - if (!test_bit(j, used_mask)) > > > > > + /* for each bit in idxmsk starting from idx */ > > > > > + while (idx < X86_PMC_IDX_MAX) { > > > > > + idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX, > > > > > + idx); > > > > > > > > I'd be mighty tempted to ignore that 80 column rule here ;-) > > > > > > Please put the body of the loop into a helper function, the function is large > > > and there are countless col80 uglinesses in it! > > > > I just tried that, its real ugly due to the amount of state you need to > > pass around. > > Does it help if you put that state into a helper structure? Yes, this is what I have in mind too. We could iterate on such a state structure instead of a couple of single variables. Storing and restoring the state would then just be a matter of copying the structure. -Robert > > Thanks, > > Ingo > -- Advanced Micro Devices, Inc. Operating System Research Center ^ permalink raw reply [flat|nested] 38+ messages in thread
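[Editor's sketch] A minimal user-space illustration of the helper-structure idea discussed above: all loop state lives in one struct, so saving and restoring a backtracking level is a plain struct copy. The names and layout here are illustrative only, not the actual kernel patch:

```c
#include <assert.h>

#define SCHED_STATES_MAX 2

/* Illustrative state structure: everything the scheduling loop tracks. */
struct sched_state {
	int weight;		/* current constraint weight */
	int event;		/* event index being placed */
	int counter;		/* counter index to try next */
	int unassigned;		/* events still without a counter */
	unsigned long used;	/* bitmask of occupied counters */
};

struct sched {
	struct sched_state state;
	int saved_states;
	struct sched_state saved[SCHED_STATES_MAX];
};

/* Saving a backtracking level is a single struct copy... */
static int sched_save_state(struct sched *s)
{
	if (s->saved_states >= SCHED_STATES_MAX)
		return 0;
	s->saved[s->saved_states++] = s->state;
	return 1;
}

/* ...and so is restoring, plus freeing the counter that failed
 * and advancing to the next candidate. */
static int sched_restore_state(struct sched *s)
{
	if (!s->saved_states)
		return 0;
	s->state = s->saved[--s->saved_states];
	s->state.used &= ~(1UL << s->state.counter);	/* free it */
	s->state.counter++;				/* try the next one */
	return 1;
}
```

With this shape, the rollback path in the scheduler reduces to one restore call followed by a retry, instead of restoring half a dozen individual variables by hand.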
* Re: [PATCH v2] perf, x86: Fix event scheduler for constraints with overlapping counters 2011-05-20 3:18 ` Robert Richter @ 2011-09-01 12:56 ` Peter Zijlstra 2011-09-01 14:12 ` Robert Richter 0 siblings, 1 reply; 38+ messages in thread From: Peter Zijlstra @ 2011-09-01 12:56 UTC (permalink / raw) To: Robert Richter; +Cc: Ingo Molnar, Stephane Eranian, LKML On Fri, 2011-05-20 at 05:18 +0200, Robert Richter wrote: > On 19.05.11 14:06:50, Ingo Molnar wrote: > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > > > On Wed, 2011-05-18 at 23:20 +0200, Ingo Molnar wrote: > > > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > > > > > if (c->weight != w) > > > > > > continue; > > > > > > > > > > > > - for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) { > > > > > > - if (!test_bit(j, used_mask)) > > > > > > + /* for each bit in idxmsk starting from idx */ > > > > > > + while (idx < X86_PMC_IDX_MAX) { > > > > > > + idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX, > > > > > > + idx); > > > > > > > > > > I'd be mighty tempted to ignore that 80 column rule here ;-) > > > > > > > > Please put the body of the loop into a helper function, the function is large > > > > and there are countless col80 uglinesses in it! > > > > > > I just tried that, its real ugly due to the amount of state you need to > > > pass around. > > > > Does it help if you put that state into a helper structure? > > Yes, this is what I have in mind too. We could iterate on such a state > stucture instead of a couple of single variables. Storing and > restoring the state will then just copying the structure. Any word on this work? I just noticed we actually need this for Intel too, the fixed purpose events have overlapping but non-identical constraint masks. Now we could optimize the Intel case by always iterating from the top down, but it won't cure all cases. 
For example, suppose one counter (that could be on a FP reg) previously got scheduled on a GP register and we take the fast-path, in that case we would still end up under-utilized. ^ permalink raw reply [flat|nested] 38+ messages in thread
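[Editor's sketch] The under-utilization described here can be reproduced with a toy example (hypothetical constraint masks, not the real Intel ones): event A may run on counters {0,1}, event B only on counter {0}. A greedy first-fit pass places A on counter 0 and then cannot place B, even though the assignment A→1, B→0 fits; only a scheduler that can roll back finds it:

```c
#include <assert.h>

#define NCTR 2

/* Greedy first-fit: take the lowest free counter allowed by the mask. */
static int greedy(const unsigned masks[], int n, int assign[])
{
	unsigned used = 0;
	int i, c;

	for (i = 0; i < n; i++) {
		int placed = 0;
		for (c = 0; c < NCTR; c++) {
			if ((masks[i] & (1u << c)) && !(used & (1u << c))) {
				used |= 1u << c;
				assign[i] = c;
				placed = 1;
				break;
			}
		}
		if (!placed)
			return 0;	/* stuck: no rollback possible */
	}
	return 1;
}

/* Backtracking: on failure, undo and retry earlier events elsewhere. */
static int backtrack(const unsigned masks[], int n, unsigned used,
		     int assign[])
{
	int c;

	if (n == 0)
		return 1;
	for (c = 0; c < NCTR; c++) {
		if ((masks[0] & (1u << c)) && !(used & (1u << c))) {
			assign[0] = c;
			if (backtrack(masks + 1, n - 1,
				      used | (1u << c), assign + 1))
				return 1;
		}
	}
	return 0;
}
```

Scheduling the most constrained events first mitigates many such cases, but with overlapping, non-identical masks no fixed ordering cures them all without rollback.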
* Re: [PATCH v2] perf, x86: Fix event scheduler for constraints with overlapping counters 2011-09-01 12:56 ` Peter Zijlstra @ 2011-09-01 14:12 ` Robert Richter 2011-09-01 16:37 ` Peter Zijlstra 0 siblings, 1 reply; 38+ messages in thread From: Robert Richter @ 2011-09-01 14:12 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Ingo Molnar, Stephane Eranian, LKML On 01.09.11 08:56:46, Peter Zijlstra wrote: > On Fri, 2011-05-20 at 05:18 +0200, Robert Richter wrote: > > On 19.05.11 14:06:50, Ingo Molnar wrote: > > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > > On Wed, 2011-05-18 at 23:20 +0200, Ingo Molnar wrote: > > > > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > > > > > > > if (c->weight != w) > > > > > > > continue; > > > > > > > > > > > > > > - for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) { > > > > > > > - if (!test_bit(j, used_mask)) > > > > > > > + /* for each bit in idxmsk starting from idx */ > > > > > > > + while (idx < X86_PMC_IDX_MAX) { > > > > > > > + idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX, > > > > > > > + idx); > > > > > > > > > > > > I'd be mighty tempted to ignore that 80 column rule here ;-) > > > > > > > > > > Please put the body of the loop into a helper function, the function is large > > > > > and there are countless col80 uglinesses in it! > > > > > > > > I just tried that, its real ugly due to the amount of state you need to > > > > pass around. > > > > > > Does it help if you put that state into a helper structure? > > > > Yes, this is what I have in mind too. We could iterate on such a state > > stucture instead of a couple of single variables. Storing and > > restoring the state will then just copying the structure. > > Any word on this work? I just noticed we actually need this for Intel > too, the fixed purpose events have overlapping but non-identical > constraint masks. > > Now we could optimize the Intel case by always iterating from the top > down, but it won't cure all cases. 
For example, suppose one counter > (that could be on a FP reg) previously got scheduled on a GP register > and we take the fast-path, in that case we would still end up > under-utilized. Yes, I had this patch in mind too. Will try to post an updated version next week. Ok? -Robert -- Advanced Micro Devices, Inc. Operating System Research Center ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2] perf, x86: Fix event scheduler for constraints with overlapping counters 2011-09-01 14:12 ` Robert Richter @ 2011-09-01 16:37 ` Peter Zijlstra 0 siblings, 0 replies; 38+ messages in thread From: Peter Zijlstra @ 2011-09-01 16:37 UTC (permalink / raw) To: Robert Richter; +Cc: Ingo Molnar, Stephane Eranian, LKML On Thu, 2011-09-01 at 16:12 +0200, Robert Richter wrote: > Will try to post an updated version > next week. Ok? > Sure.. thanks! ^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2011-09-01 16:37 UTC | newest] Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2011-04-16 0:27 [PATCH 0/4] perf, x86: Fixes for v2.6.39 Robert Richter 2011-04-16 0:27 ` [PATCH 1/4] perf, x86: Fix pre-defined cache-misses event for AMD family 15h cpus Robert Richter 2011-04-19 12:03 ` [tip:perf/urgent] " tip-bot for Andre Przywara 2011-04-16 0:27 ` [PATCH 2/4] perf, x86: Fix AMD family 15h FPU event constraints Robert Richter 2011-04-19 12:04 ` [tip:perf/urgent] " tip-bot for Robert Richter 2011-04-16 0:27 ` [PATCH 3/4] perf, x86: Use ALTERNATIVE() to check for X86_FEATURE_PERFCTR_CORE Robert Richter 2011-04-18 20:00 ` Andi Kleen 2011-04-19 10:39 ` Robert Richter 2011-04-19 18:21 ` Andi Kleen 2011-04-19 12:04 ` [tip:perf/core] " tip-bot for Robert Richter 2011-04-16 0:27 ` [PATCH 4/4] perf, x86: Fix event scheduler to solve complex scheduling problems Robert Richter 2011-04-16 8:51 ` Peter Zijlstra 2011-04-16 9:43 ` Ingo Molnar 2011-04-16 10:08 ` Peter Zijlstra 2011-04-16 10:14 ` Ingo Molnar 2011-04-16 10:15 ` Peter Zijlstra 2011-04-16 14:26 ` Valdis.Kletnieks 2011-04-17 8:15 ` Robert Richter 2011-04-17 8:18 ` Ingo Molnar 2011-04-17 8:53 ` Peter Zijlstra 2011-04-17 11:23 ` Robert Richter 2011-04-18 8:17 ` Robert Richter 2011-04-16 15:52 ` Stephane Eranian 2011-04-17 8:44 ` Robert Richter 2011-04-17 9:05 ` Stephane Eranian 2011-04-19 10:26 ` [PATCH v2] perf, x86: Fix event scheduler for constraints with overlapping counters Robert Richter 2011-04-19 11:29 ` Ingo Molnar 2011-04-19 13:55 ` Robert Richter 2011-04-28 9:50 ` Robert Richter 2011-05-18 21:16 ` Peter Zijlstra 2011-05-18 21:20 ` Ingo Molnar 2011-05-18 21:36 ` Peter Zijlstra 2011-05-19 10:49 ` Robert Richter 2011-05-19 18:06 ` Ingo Molnar 2011-05-20 3:18 ` Robert Richter 2011-09-01 12:56 ` Peter Zijlstra 2011-09-01 14:12 ` Robert Richter 2011-09-01 16:37 ` Peter Zijlstra