* [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
@ 2017-02-01 20:02 Borislav Petkov
2017-02-01 21:37 ` Ghannam, Yazen
0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2017-02-01 20:02 UTC (permalink / raw)
To: x86-ml; +Cc: Yves Dionne, Brice Goglin, Peter Zijlstra, Yazen Ghannam, lkml
From: Borislav Petkov <bp@suse.de>
a33d331761bc ("x86/CPU/AMD: Fix Bulldozer topology") restored the
initial approach we had with the Fam15h topology of enumerating CU
(Compute Unit) threads as cores. And this is still correct - they're
beefier than HT threads but still have some shared functionality.
Our current approach has a problem with the Mad Max Steam game, for
example. Yves Dionne reported a certain "choppiness" while playing on
4.9.5.
That problem stems most likely from the fact that the CU threads share
resources within one CU and when we schedule to a thread of a different
compute unit, this incurs latency due to migrating the working set to a
different CU through the caches.
When the thread siblings mask mirrors that aspect of the CUs and
threads, the scheduler pays attention to it and tries to schedule within
one CU first. Which takes care of the latency, of course.
Reported-by: Yves Dionne <yves.dionne@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
---
I know Yazen is working on another issue which touches the same code
path but we'll synchronize the patches.
I'm marking it RFC for now but once we have agreed on it, it should be
CC:stable for 4.9.
Initial testing looks good; I still need to get on a bigger F15h box to
make sure everything is still kosher there too.
arch/x86/include/asm/processor.h | 1 +
arch/x86/kernel/cpu/amd.c | 11 ++++++++++-
arch/x86/kernel/smpboot.c | 10 +++++++---
3 files changed, 18 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 1be64da0384e..e6cfe7ba2d65 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -104,6 +104,7 @@ struct cpuinfo_x86 {
__u8 x86_phys_bits;
/* CPUID returned core id bits: */
__u8 x86_coreid_bits;
+ __u8 cu_id;
/* Max extended CPUID function supported: */
__u32 extended_cpuid_level;
/* Maximum supported CPUID level, -1=no CPUID: */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 80e657e89eed..e7158afb322b 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -309,8 +309,15 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
/* get information required for multi-node processors */
if (boot_cpu_has(X86_FEATURE_TOPOEXT)) {
+ u32 eax, ebx, ecx, edx;
- node_id = cpuid_ecx(0x8000001e) & 7;
+ cpuid(0x8000001e, &eax, &ebx, &ecx, &edx);
+
+ node_id = ecx & 7;
+ smp_num_siblings = ((ebx >> 8) & 0xff) + 1;
+
+ if (c->x86 == 0x15)
+ c->cu_id = ebx & 0xff;
/*
* We may have multiple LLCs if L3 caches exist, so check if we
@@ -328,6 +335,8 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
per_cpu(cpu_llc_id, cpu) = node_id;
}
}
+
+
} else if (cpu_has(c, X86_FEATURE_NODEID_MSR)) {
u64 value;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 548da5a8013e..f06fa338076b 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -433,9 +433,13 @@ static bool match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
if (c->phys_proc_id == o->phys_proc_id &&
- per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2) &&
- c->cpu_core_id == o->cpu_core_id)
- return topology_sane(c, o, "smt");
+ per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2)) {
+ if (c->cpu_core_id == o->cpu_core_id)
+ return topology_sane(c, o, "smt");
+
+ if (c->cu_id == o->cu_id)
+ return topology_sane(c, o, "smt");
+ }
} else if (c->phys_proc_id == o->phys_proc_id &&
c->cpu_core_id == o->cpu_core_id) {
--
2.11.0
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
* RE: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
2017-02-01 20:02 [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID Borislav Petkov
@ 2017-02-01 21:37 ` Ghannam, Yazen
2017-02-01 21:44 ` Borislav Petkov
0 siblings, 1 reply; 14+ messages in thread
From: Ghannam, Yazen @ 2017-02-01 21:37 UTC (permalink / raw)
To: Borislav Petkov, x86-ml; +Cc: Yves Dionne, Brice Goglin, Peter Zijlstra, lkml
> -----Original Message-----
> From: Borislav Petkov [mailto:bp@alien8.de]
> Sent: Wednesday, February 1, 2017 3:03 PM
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 548da5a8013e..f06fa338076b 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -433,9 +433,13 @@ static bool match_smt(struct cpuinfo_x86 *c, struct
> cpuinfo_x86 *o)
> int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
>
> if (c->phys_proc_id == o->phys_proc_id &&
> - per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2) &&
> - c->cpu_core_id == o->cpu_core_id)
> - return topology_sane(c, o, "smt");
> + per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2)) {
> + if (c->cpu_core_id == o->cpu_core_id)
> + return topology_sane(c, o, "smt");
> +
> + if (c->cu_id == o->cu_id)
> + return topology_sane(c, o, "smt");
> + }
>
This hunk won't work for SMT enabled systems. It'll cause all threads under
an LLC to be considered SMT siblings. For example, threads 0 & 2 will have
different cpu_core_id, so the first check will fail. But it'll match on the
second check since cu_id will be initialized to 0.
To get around this we can set cu_id for all TOPOEXT systems, and update
cpu_core_id, etc. for SMT enabled systems. This way we can just change
cpu_core_id to cu_id in match_smt().
I tested this patch, with the above changes, on a Fam17h SMT enabled
system. I'll test with SMT disabled and also on a fully-loaded Fam15h
system soon.
Thanks,
Yazen
* Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
2017-02-01 21:37 ` Ghannam, Yazen
@ 2017-02-01 21:44 ` Borislav Petkov
2017-02-01 21:55 ` Ghannam, Yazen
0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2017-02-01 21:44 UTC (permalink / raw)
To: Ghannam, Yazen; +Cc: x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml
On Wed, Feb 01, 2017 at 09:37:02PM +0000, Ghannam, Yazen wrote:
> This hunk won't work for SMT enabled systems. It'll cause all threads under
> an LLC to be considered SMT siblings. For example, threads 0 & 2 will have
> different cpu_core_id, so the first check will fail. But it'll match on the
> second check since cu_id will be initialized to 0.
Good catch.
> To get around this we can set cu_id for all TOPOEXT systems, and update
> cpu_core_id, etc. for SMT enabled systems. This way we can just change
> cpu_core_id to cu_id in match_smt().
Ok, so we want to init ->cu_id to something invalid then. -1, for
example and then do:
if (c->cu_id != -1 && o->cu_id != -1 && (c->cu_id == o->cu_id))
...
Alternatively, we can define an X86_FEATURE_COMPUTE_UNITS or so
synthetic bit which we can check.
One thing I don't want to do is reuse ->cu_id on systems which don't
have CUs.
> I tested this patch, with the above changes, on a Fam17h SMT enabled
> system. I'll test with SMT disabled and also on a fully-loaded Fam15h
> system soon.
Thanks!
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
* RE: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
2017-02-01 21:44 ` Borislav Petkov
@ 2017-02-01 21:55 ` Ghannam, Yazen
2017-02-01 22:25 ` Borislav Petkov
0 siblings, 1 reply; 14+ messages in thread
From: Ghannam, Yazen @ 2017-02-01 21:55 UTC (permalink / raw)
To: Borislav Petkov; +Cc: x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml
> -----Original Message-----
> From: Borislav Petkov [mailto:bp@alien8.de]
> Sent: Wednesday, February 1, 2017 4:44 PM
>
> > To get around this we can set cu_id for all TOPOEXT systems, and update
> > cpu_core_id, etc. for SMT enabled systems. This way we can just change
> > cpu_core_id to cu_id in match_smt().
>
> Ok, so we want to init ->cu_id to something invalid then. -1, for
> example and then do:
>
> if (c->cu_id != -1 && o->cu_id != -1 && (c->cu_id == o->cu_id))
> ...
>
> Alternatively, we can define an X86_FEATURE_COMPUTE_UNITS or so
> synthetic bit which we can check.
>
> One thing I don't want to do is reuse ->cu_id on systems which don't
> have CUs.
>
Okay, in that case I would prefer to define a synthetic bit. I think it'll be a lot
more clear.
Thanks,
Yazen
* Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
2017-02-01 21:55 ` Ghannam, Yazen
@ 2017-02-01 22:25 ` Borislav Petkov
2017-02-01 22:41 ` Borislav Petkov
0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2017-02-01 22:25 UTC (permalink / raw)
To: Ghannam, Yazen; +Cc: x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml
On Wed, Feb 01, 2017 at 09:55:44PM +0000, Ghannam, Yazen wrote:
> Okay, in that case I would prefer to define a synthetic bit. I think it'll be a lot
> more clear.
No need - it is ok this way too. Now let me apply your changes on top.
I'd like to have two separate patches for this.
Something like this, anyways.
---
From 523638f56c0975f1de1bdb67ef59421b0439b774 Mon Sep 17 00:00:00 2001
From: Borislav Petkov <bp@suse.de>
Date: Wed, 1 Feb 2017 11:14:35 +0100
Subject: [PATCH] x86/CPU/AMD: Bring back Compute Unit ID
a33d331761bc ("x86/CPU/AMD: Fix Bulldozer topology") restored the
initial approach we had with the Fam15h topology of enumerating CU
(Compute Unit) threads as cores. And this is still correct - they're
beefier than HT threads but still have some shared functionality.
Our current approach has a problem with the Mad Max Steam game, for
example. Yves Dionne reported a certain "choppiness" while playing on
4.9.5.
That problem stems most likely from the fact that the CU threads share
resources within one CU and when we schedule to a thread of a different
compute unit, this incurs latency due to migrating the working set to a
different CU through the caches.
When the thread siblings mask mirrors that aspect of the CUs and
threads, the scheduler pays attention to it and tries to schedule within
one CU first. Which takes care of the latency, of course.
Reported-by: Yves Dionne <yves.dionne@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
---
arch/x86/include/asm/processor.h | 1 +
arch/x86/kernel/cpu/amd.c | 11 ++++++++++-
arch/x86/kernel/cpu/common.c | 1 +
arch/x86/kernel/smpboot.c | 12 +++++++++---
4 files changed, 21 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 1be64da0384e..e6cfe7ba2d65 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -104,6 +104,7 @@ struct cpuinfo_x86 {
__u8 x86_phys_bits;
/* CPUID returned core id bits: */
__u8 x86_coreid_bits;
+ __u8 cu_id;
/* Max extended CPUID function supported: */
__u32 extended_cpuid_level;
/* Maximum supported CPUID level, -1=no CPUID: */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 80e657e89eed..e7158afb322b 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -309,8 +309,15 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
/* get information required for multi-node processors */
if (boot_cpu_has(X86_FEATURE_TOPOEXT)) {
+ u32 eax, ebx, ecx, edx;
- node_id = cpuid_ecx(0x8000001e) & 7;
+ cpuid(0x8000001e, &eax, &ebx, &ecx, &edx);
+
+ node_id = ecx & 7;
+ smp_num_siblings = ((ebx >> 8) & 0xff) + 1;
+
+ if (c->x86 == 0x15)
+ c->cu_id = ebx & 0xff;
/*
* We may have multiple LLCs if L3 caches exist, so check if we
@@ -328,6 +335,8 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
per_cpu(cpu_llc_id, cpu) = node_id;
}
}
+
+
} else if (cpu_has(c, X86_FEATURE_NODEID_MSR)) {
u64 value;
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f74e84ea8557..807602d36d5f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1034,6 +1034,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
c->x86_model_id[0] = '\0'; /* Unset */
c->x86_max_cores = 1;
c->x86_coreid_bits = 0;
+ c->cu_id = 0xff;
#ifdef CONFIG_X86_64
c->x86_clflush_size = 64;
c->x86_phys_bits = 36;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 548da5a8013e..3876bc555e1a 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -433,9 +433,15 @@ static bool match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
if (c->phys_proc_id == o->phys_proc_id &&
- per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2) &&
- c->cpu_core_id == o->cpu_core_id)
- return topology_sane(c, o, "smt");
+ per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2)) {
+ if (c->cpu_core_id == o->cpu_core_id)
+ return topology_sane(c, o, "smt");
+
+ if ((c->cu_id != 0xff) &&
+ (o->cu_id != 0xff) &&
+ (c->cu_id == o->cu_id))
+ return topology_sane(c, o, "smt");
+ }
} else if (c->phys_proc_id == o->phys_proc_id &&
c->cpu_core_id == o->cpu_core_id) {
--
2.11.0
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
* Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
2017-02-01 22:25 ` Borislav Petkov
@ 2017-02-01 22:41 ` Borislav Petkov
2017-02-02 12:10 ` Borislav Petkov
0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2017-02-01 22:41 UTC (permalink / raw)
To: Ghannam, Yazen; +Cc: x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml
On Wed, Feb 01, 2017 at 11:25:07PM +0100, Borislav Petkov wrote:
> On Wed, Feb 01, 2017 at 09:55:44PM +0000, Ghannam, Yazen wrote:
> > Okay, in that case I would prefer to define a synthetic bit. I think it'll be a lot
> > more clear.
>
> No need - it is ok this way too. Now let me apply your changes on top.
> I'd like to have two separate patches for this.
Ok, here are your changes on top of the first patch. We can still avoid
the division on SMT-off systems.
More playing with this tomorrow. It is late here and brain wants to
sleep now.
---
From 4b3b9626ef8a535df304aaa017b61436a3b37922 Mon Sep 17 00:00:00 2001
From: Yazen Ghannam <Yazen.Ghannam@amd.com>
Date: Wed, 1 Feb 2017 23:33:16 +0100
Subject: [PATCH] x86/CPU/AMD: Fix Zen SMT topology
After a33d331761bc ("x86/CPU/AMD: Fix Bulldozer topology"), SMT
scheduling topology for Fam17h systems is broken because the ThreadId is
included in the ApicId when SMT is enabled.
So, without further decoding cpu_core_id is unique for each thread
rather than the same for threads on the same core. This didn't affect
systems with SMT disabled. Make cpu_core_id be what it is defined to be.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
---
arch/x86/kernel/cpu/amd.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index e7158afb322b..349b7d9baf3f 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -319,6 +319,13 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
if (c->x86 == 0x15)
c->cu_id = ebx & 0xff;
+ if (c->x86 >= 0x17) {
+ c->cpu_core_id = ebx & 0xff;
+
+ if (smp_num_siblings > 1)
+ c->x86_max_cores /= smp_num_siblings;
+ }
+
/*
* We may have multiple LLCs if L3 caches exist, so check if we
* have an L3 cache by looking at the L3 cache CPUID leaf.
--
2.11.0
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
* Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
2017-02-01 22:41 ` Borislav Petkov
@ 2017-02-02 12:10 ` Borislav Petkov
2017-02-02 15:43 ` Borislav Petkov
2017-02-02 16:14 ` Ghannam, Yazen
0 siblings, 2 replies; 14+ messages in thread
From: Borislav Petkov @ 2017-02-02 12:10 UTC (permalink / raw)
To: Ghannam, Yazen; +Cc: x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml
On Wed, Feb 01, 2017 at 11:41:50PM +0100, Borislav Petkov wrote:
> More playing with this tomorrow. It is late here and brain wants to
> sleep now.
Ok, did some measurements of our favourite workload with and without
those patches on rc6+tip/master.
It is a Kaveri laptop so small, only 2 CUs. perf command was:
./tools/perf/perf stat -e task-clock,context-switches,cache-misses,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses --repeat 3 --sync --pre /path/to/pre-build-kernel.sh -- make -s -j5 bzImage
and that script is:
$ cat pre-build-kernel.sh
#!/bin/bash
make -s clean
echo 3 > /proc/sys/vm/drop_caches
Here are the results:
before:
Performance counter stats for 'make -s -j5 bzImage' (3 runs):
1457712.248049 task-clock (msec) # 3.612 CPUs utilized ( +- 1.20% )
400,872 context-switches # 0.275 K/sec ( +- 0.23% )
8,675,334,184 cache-misses ( +- 0.15% )
26,915 cpu-migrations # 0.018 K/sec ( +- 2.13% )
23,806,184 page-faults # 0.016 M/sec ( +- 0.00% )
3,648,915,008,651 cycles # 2.503 GHz ( +- 0.91% )
1,895,555,704,111 instructions # 0.52 insn per cycle ( +- 0.00% )
426,444,023,897 branches # 292.543 M/sec ( +- 0.00% )
26,127,609,710 branch-misses # 6.13% of all branches ( +- 0.02% )
403.601384883 seconds time elapsed ( +- 1.18% )
after:
Performance counter stats for 'make -s -j5 bzImage' (3 runs):
1436580.109340 task-clock (msec) # 3.614 CPUs utilized ( +- 1.37% )
396,949 context-switches # 0.276 K/sec ( +- 0.08% )
8,655,078,022 cache-misses ( +- 0.36% )
30,623 cpu-migrations # 0.021 K/sec ( +- 0.46% )
23,788,698 page-faults # 0.017 M/sec ( +- 0.00% )
3,568,254,088,919 cycles # 2.484 GHz ( +- 0.88% )
1,895,348,016,179 instructions # 0.53 insn per cycle ( +- 0.00% )
426,405,814,017 branches # 296.820 M/sec ( +- 0.00% )
26,090,473,525 branch-misses # 6.12% of all branches ( +- 0.04% )
397.531904252 seconds time elapsed ( +- 1.20% )
Context switches have dropped, cache misses are the same and we have a
rise in cpu-migrations. That last bit is interesting and I don't have an
answer yet. Maybe peterz has an idea.
Cycles have dropped too.
And we're 6 secs faster so I'll take that.
Now on to run the same thing on a bigger bulldozer.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
* Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
2017-02-02 12:10 ` Borislav Petkov
@ 2017-02-02 15:43 ` Borislav Petkov
2017-02-02 16:09 ` Ingo Molnar
2017-02-02 16:14 ` Ghannam, Yazen
1 sibling, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2017-02-02 15:43 UTC (permalink / raw)
To: Ghannam, Yazen; +Cc: x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml
On Thu, Feb 02, 2017 at 01:10:54PM +0100, Borislav Petkov wrote:
> Now on to run the same thing on a bigger bulldozer.
It looks different on the bigger box:
before:
Performance counter stats for 'make -s -j17 bzImage' (3 runs):
1333240.579196 task-clock (msec) # 9.271 CPUs utilized ( +- 0.03% )
393,521 context-switches # 0.295 K/sec ( +- 0.27% )
6,629,632,543 cache-misses ( +- 0.10% )
51,570 cpu-migrations # 0.039 K/sec ( +- 1.21% )
27,812,379 page-faults # 0.021 M/sec ( +- 0.05% )
3,633,456,270,448 cycles # 2.725 GHz ( +- 0.03% )
2,144,192,888,087 instructions # 0.59 insn per cycle ( +- 0.00% )
476,225,081,234 branches # 357.194 M/sec ( +- 0.00% )
25,294,199,758 branch-misses # 5.31% of all branches ( +- 0.02% )
143.807894000 seconds time elapsed ( +- 0.89% )
after:
Performance counter stats for 'make -s -j17 bzImage' (3 runs):
1330842.218684 task-clock (msec) # 9.047 CPUs utilized ( +- 0.14% )
396,044 context-switches # 0.298 K/sec ( +- 0.60% )
6,625,112,597 cache-misses ( +- 0.36% )
60,050 cpu-migrations # 0.045 K/sec ( +- 1.11% )
27,862,451 page-faults # 0.021 M/sec ( +- 0.08% )
3,625,822,644,429 cycles # 2.724 GHz ( +- 0.12% )
2,144,932,865,362 instructions # 0.59 insn per cycle ( +- 0.01% )
476,375,898,692 branches # 357.951 M/sec ( +- 0.01% )
25,226,837,830 branch-misses # 5.30% of all branches ( +- 0.09% )
147.109189694 seconds time elapsed ( +- 0.49% )
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
* Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
2017-02-02 15:43 ` Borislav Petkov
@ 2017-02-02 16:09 ` Ingo Molnar
2017-02-02 17:04 ` Borislav Petkov
0 siblings, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2017-02-02 16:09 UTC (permalink / raw)
To: Borislav Petkov
Cc: Ghannam, Yazen, x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml
* Borislav Petkov <bp@alien8.de> wrote:
> On Thu, Feb 02, 2017 at 01:10:54PM +0100, Borislav Petkov wrote:
> > Now on to run the same thing on a bigger bulldozer.
>
> It looks different on the bigger box:
>
> before:
>
> Performance counter stats for 'make -s -j17 bzImage' (3 runs):
> 143.807894000 seconds time elapsed ( +- 0.89% )
>
> Performance counter stats for 'make -s -j17 bzImage' (3 runs):
> 147.109189694 seconds time elapsed ( +- 0.49% )
If there's any doubt about the validity of the measurement I'd suggest doing:
perf stat -a --sync --repeat 3 ...
... so that there's no perf overhead and skew from the many processes of a kernel
build workload, plus the --sync should reduce IO related noise.
Or:
perf stat --null --sync --repeat 3 ...
... will only measure elapsed time, but will do that very precisely and with very
little overhead.
Thanks,
Ingo
* RE: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
2017-02-02 12:10 ` Borislav Petkov
2017-02-02 15:43 ` Borislav Petkov
@ 2017-02-02 16:14 ` Ghannam, Yazen
2017-02-02 16:29 ` Ingo Molnar
1 sibling, 1 reply; 14+ messages in thread
From: Ghannam, Yazen @ 2017-02-02 16:14 UTC (permalink / raw)
To: Borislav Petkov; +Cc: x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml
> -----Original Message-----
> From: Borislav Petkov [mailto:bp@alien8.de]
> Sent: Thursday, February 2, 2017 7:11 AM
>
> Context switches have dropped, cache misses are the same and we have a
> rise in cpu-migrations. That last bit is interesting and I don't have an
> answer yet. Maybe peterz has an idea.
>
Could it be that the scheduler is more lax about migrations between SMT
siblings?
> Cycles have dropped too.
>
> And we're 6 secs faster so I'll take that.
>
> Now on to run the same thing on a bigger bulldozer.
>
Here are my results on a 32C Bulldozer system with an SSD. Also, I use ccache so
I added "ccache -C" in the pre-build script so the cache gets cleared.
Before:
Performance counter stats for 'make -s -j65 bzImage' (3 runs):
2375752.777479 task-clock (msec) # 23.589 CPUs utilized ( +- 0.35% )
1,198,979 context-switches # 0.505 K/sec ( +- 0.34% )
8,964,671,259 cache-misses ( +- 0.44% )
79,399 cpu-migrations # 0.033 K/sec ( +- 1.92% )
37,840,875 page-faults # 0.016 M/sec ( +- 0.20% )
5,425,612,846,538 cycles # 2.284 GHz ( +- 0.36% )
3,367,750,745,825 instructions # 0.62 insn per cycle ( +- 0.11% )
750,591,286,261 branches # 315.938 M/sec ( +- 0.11% )
43,544,059,077 branch-misses # 5.80% of all branches ( +- 0.08% )
100.716043494 seconds time elapsed ( +- 1.97% )
After:
Performance counter stats for 'make -s -j65 bzImage' (3 runs):
1736720.488346 task-clock (msec) # 23.529 CPUs utilized ( +- 0.16% )
1,144,737 context-switches # 0.659 K/sec ( +- 0.20% )
8,570,352,975 cache-misses ( +- 0.33% )
91,817 cpu-migrations # 0.053 K/sec ( +- 1.67% )
37,688,118 page-faults # 0.022 M/sec ( +- 0.03% )
5,547,082,899,245 cycles # 3.194 GHz ( +- 0.19% )
3,363,365,420,405 instructions # 0.61 insn per cycle ( +- 0.00% )
749,676,420,820 branches # 431.662 M/sec ( +- 0.00% )
43,243,046,270 branch-misses # 5.77% of all branches ( +- 0.01% )
73.810517234 seconds time elapsed ( +- 0.02% )
Thanks,
Yazen
* Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
2017-02-02 16:14 ` Ghannam, Yazen
@ 2017-02-02 16:29 ` Ingo Molnar
0 siblings, 0 replies; 14+ messages in thread
From: Ingo Molnar @ 2017-02-02 16:29 UTC (permalink / raw)
To: Ghannam, Yazen
Cc: Borislav Petkov, x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml
* Ghannam, Yazen <Yazen.Ghannam@amd.com> wrote:
> Here are my results on a 32C Bulldozer system with an SSD. Also, I use ccache so
> I added "ccache -C" in the pre-build script so the cache gets cleared.
>
> Before:
> Performance counter stats for 'make -s -j65 bzImage' (3 runs):
>
> 2375752.777479 task-clock (msec) # 23.589 CPUs utilized ( +- 0.35% )
> 1,198,979 context-switches # 0.505 K/sec ( +- 0.34% )
> 8,964,671,259 cache-misses ( +- 0.44% )
> 79,399 cpu-migrations # 0.033 K/sec ( +- 1.92% )
> 37,840,875 page-faults # 0.016 M/sec ( +- 0.20% )
> 5,425,612,846,538 cycles # 2.284 GHz ( +- 0.36% )
> 3,367,750,745,825 instructions # 0.62 insn per cycle ( +- 0.11% )
> 750,591,286,261 branches # 315.938 M/sec ( +- 0.11% )
> 43,544,059,077 branch-misses # 5.80% of all branches ( +- 0.08% )
>
> 100.716043494 seconds time elapsed ( +- 1.97% )
>
> After:
> Performance counter stats for 'make -s -j65 bzImage' (3 runs):
>
> 1736720.488346 task-clock (msec) # 23.529 CPUs utilized ( +- 0.16% )
> 1,144,737 context-switches # 0.659 K/sec ( +- 0.20% )
> 8,570,352,975 cache-misses ( +- 0.33% )
> 91,817 cpu-migrations # 0.053 K/sec ( +- 1.67% )
> 37,688,118 page-faults # 0.022 M/sec ( +- 0.03% )
> 5,547,082,899,245 cycles # 3.194 GHz ( +- 0.19% )
> 3,363,365,420,405 instructions # 0.61 insn per cycle ( +- 0.00% )
> 749,676,420,820 branches # 431.662 M/sec ( +- 0.00% )
> 43,243,046,270 branch-misses # 5.77% of all branches ( +- 0.01% )
>
> 73.810517234 seconds time elapsed ( +- 0.02% )
That's pretty impressive: ~35% difference in wall clock performance of this
workload.
And that while both the cycles and the instructions count is within 2.5% of each
other. The only stat that differs beyond the level of noise is cache-misses:
8,964,671,259 cache-misses ( +- 0.44% )
8,570,352,975 cache-misses ( +- 0.33% )
which is 4.5%, but I have trouble believing that just 4.5% more cache misses can
have such a massive effect on performance.
So unless +4.5% more cache misses can cause a 35% difference in performance this is a
really weird result. Where did the extra performance come from - was the 'good'
workload perhaps running at higher CPU frequencies for some reason?
Thanks,
Ingo
* Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
2017-02-02 16:09 ` Ingo Molnar
@ 2017-02-02 17:04 ` Borislav Petkov
2017-02-02 18:10 ` Borislav Petkov
0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2017-02-02 17:04 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ghannam, Yazen, x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml
On Thu, Feb 02, 2017 at 05:09:16PM +0100, Ingo Molnar wrote:
> If there's any doubt about the validity of the measurement I'd suggest doing:
>
> perf stat -a --sync --repeat 3 ...
>
> ... so that there's no perf overhead and skew from the many processes of a kernel
> build workload, plus the --sync should reduce IO related noise.
>
> Or:
>
> perf stat --null --sync --repeat 3 ...
>
> ... will only measure elapsed time, but will do that very precisely and with very
> little overhead.
Yeah, I was talking to Peter about the -a thing on IRC... I think I'm going to
try that. Here's the full command I was using:
./tools/perf/perf stat -e task-clock,context-switches,cache-misses,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses --repeat 3 --sync --pre ~/bin/pre-build-kernel.sh -- make -s -j17 bzImage
I think I stole it from you from some mail thread we had in the past.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
* Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
2017-02-02 17:04 ` Borislav Petkov
@ 2017-02-02 18:10 ` Borislav Petkov
2017-02-02 20:45 ` Ghannam, Yazen
0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2017-02-02 18:10 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ghannam, Yazen, x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml
On Thu, Feb 02, 2017 at 06:04:56PM +0100, Borislav Petkov wrote:
> I think I stole it from you from some mail thread we had in the past.
Yap, --all-cpus is a bit better in that the difference between the two
kernels is smaller.
For some reason, though, with the patch the workload is a bit slower.
We have more cycles, more branches, ... It is only 2 sec slower though.
I think that's probably because it is the first Bulldozer uarch and
when you run it on newer versions of the uarch, it is better, due to
improvements in the uarch.
Yazen, what BD generation is your machine?
I have one more Bulldozer box: rev C0 on which I could run this over the
weekend.
./tools/perf/perf stat -a -e task-clock,context-switches,cache-misses,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses --repeat 3 --sync --pre ~/bin/pre-build-kernel.sh -- make -s -j17 bzImage
before:
Performance counter stats for 'system wide' (3 runs):
2279512.230871 task-clock (msec) # 15.999 CPUs utilized ( +- 0.40% )
714,492 context-switches # 0.313 K/sec ( +- 0.19% )
6,726,972,836 cache-misses ( +- 0.15% )
56,490 cpu-migrations # 0.025 K/sec ( +- 2.98% )
27,794,829 page-faults # 0.012 M/sec ( +- 0.04% )
3,719,570,726,045 cycles # 1.632 GHz ( +- 0.06% )
2,146,930,432,417 instructions # 0.58 insn per cycle ( +- 0.05% )
476,587,085,009 branches # 209.074 M/sec ( +- 0.06% )
25,286,321,575 branch-misses # 5.31% of all branches ( +- 0.07% )
142.475046735 seconds time elapsed ( +- 0.40% )
after:
Performance counter stats for 'system wide' (3 runs):
2312821.267459 task-clock (msec) # 16.000 CPUs utilized ( +- 0.20% )
760,839 context-switches # 0.329 K/sec ( +- 0.29% )
6,769,543,062 cache-misses ( +- 0.05% )
68,785 cpu-migrations # 0.030 K/sec ( +- 0.75% )
27,828,222 page-faults # 0.012 M/sec ( +- 0.04% )
3,725,704,384,061 cycles # 1.611 GHz ( +- 0.06% )
2,149,336,525,435 instructions # 0.58 insn per cycle ( +- 0.01% )
477,157,066,501 branches # 206.310 M/sec ( +- 0.01% )
25,289,357,158 branch-misses # 5.30% of all branches ( +- 0.07% )
144.551731453 seconds time elapsed ( +- 0.20% )
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
* RE: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
2017-02-02 18:10 ` Borislav Petkov
@ 2017-02-02 20:45 ` Ghannam, Yazen
0 siblings, 0 replies; 14+ messages in thread
From: Ghannam, Yazen @ 2017-02-02 20:45 UTC (permalink / raw)
To: Borislav Petkov, Ingo Molnar
Cc: x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml
> -----Original Message-----
> From: Borislav Petkov [mailto:bp@alien8.de]
> Sent: Thursday, February 2, 2017 1:11 PM
>
> Yazen, what BD generation is your machine?
>
The processors are revision C0. Also, I forgot to mention it's a 2P G34 system.
Thanks,
Yazen