linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
@ 2017-02-01 20:02 Borislav Petkov
  2017-02-01 21:37 ` Ghannam, Yazen
From: Borislav Petkov @ 2017-02-01 20:02 UTC (permalink / raw)
  To: x86-ml; +Cc: Yves Dionne, Brice Goglin, Peter Zijlstra, Yazen Ghannam, lkml

From: Borislav Petkov <bp@suse.de>

a33d331761bc ("x86/CPU/AMD: Fix Bulldozer topology") restored the
initial approach we had with the Fam15h topology of enumerating CU
(Compute Unit) threads as cores. And this is still correct - they're
beefier than HT threads but still have some shared functionality.

Our current approach has a problem with the Mad Max Steam game, for
example. Yves Dionne reported a certain "choppiness" while playing on
4.9.5.

That problem stems most likely from the fact that the CU threads share
resources within one CU and when we schedule to a thread of a different
compute unit, this incurs latency due to migrating the working set to a
different CU through the caches.

When the thread siblings mask mirrors that aspect of the CUs and
threads, the scheduler pays attention to it and tries to schedule within
one CU first. Which takes care of the latency, of course.

Reported-by: Yves Dionne <yves.dionne@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
---

I know Yazen is working on another issue which touches the same code
path but we'll synchronize the patches.

I'm marking it RFC for now but once we have agreed on it, it should be
CC:stable for 4.9.

Initial testing looks good; I still need to get on a bigger F15h box to
make sure everything is still kosher there too.
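
For reference, the leaf 0x8000001e fields the patch consumes can be dumped
from userspace with a minimal sketch like the one below. It mirrors the
decoding in the amd.c hunk and assumes an AMD box which actually implements
that extended leaf (i.e. has TOPOEXT):

  /* sketch: dump the CPUID 0x8000001e fields used by this patch */
  #include <stdio.h>
  #include <cpuid.h>

  int main(void)
  {
          unsigned int eax, ebx, ecx, edx;

          if (__get_cpuid_max(0x80000000, NULL) < 0x8000001e) {
                  fprintf(stderr, "leaf 0x8000001e not implemented\n");
                  return 1;
          }

          __cpuid_count(0x8000001e, 0, eax, ebx, ecx, edx);

          /* same field layout the patch uses */
          printf("core/compute unit id: %u\n", ebx & 0xff);
          printf("threads per core/CU : %u\n", ((ebx >> 8) & 0xff) + 1);
          printf("node id             : %u\n", ecx & 7);

          return 0;
  }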

 arch/x86/include/asm/processor.h |  1 +
 arch/x86/kernel/cpu/amd.c        | 11 ++++++++++-
 arch/x86/kernel/smpboot.c        | 10 +++++++---
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 1be64da0384e..e6cfe7ba2d65 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -104,6 +104,7 @@ struct cpuinfo_x86 {
 	__u8			x86_phys_bits;
 	/* CPUID returned core id bits: */
 	__u8			x86_coreid_bits;
+	__u8			cu_id;
 	/* Max extended CPUID function supported: */
 	__u32			extended_cpuid_level;
 	/* Maximum supported CPUID level, -1=no CPUID: */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 80e657e89eed..e7158afb322b 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -309,8 +309,15 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
 
 	/* get information required for multi-node processors */
 	if (boot_cpu_has(X86_FEATURE_TOPOEXT)) {
+		u32 eax, ebx, ecx, edx;
 
-		node_id = cpuid_ecx(0x8000001e) & 7;
+		cpuid(0x8000001e, &eax, &ebx, &ecx, &edx);
+
+		node_id  = ecx & 7;
+		smp_num_siblings = ((ebx >> 8) & 0xff) + 1;
+
+		if (c->x86 == 0x15)
+			c->cu_id = ebx & 0xff;
 
 		/*
 		 * We may have multiple LLCs if L3 caches exist, so check if we
@@ -328,6 +335,8 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
 				per_cpu(cpu_llc_id, cpu) = node_id;
 			}
 		}
+
+
 	} else if (cpu_has(c, X86_FEATURE_NODEID_MSR)) {
 		u64 value;
 
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 548da5a8013e..f06fa338076b 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -433,9 +433,13 @@ static bool match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 		int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
 
 		if (c->phys_proc_id == o->phys_proc_id &&
-		    per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2) &&
-		    c->cpu_core_id == o->cpu_core_id)
-			return topology_sane(c, o, "smt");
+		    per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2)) {
+			if (c->cpu_core_id == o->cpu_core_id)
+				return topology_sane(c, o, "smt");
+
+			if (c->cu_id == o->cu_id)
+				return topology_sane(c, o, "smt");
+		}
 
 	} else if (c->phys_proc_id == o->phys_proc_id &&
 		   c->cpu_core_id == o->cpu_core_id) {
-- 
2.11.0

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* RE: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
  2017-02-01 20:02 [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID Borislav Petkov
@ 2017-02-01 21:37 ` Ghannam, Yazen
  2017-02-01 21:44   ` Borislav Petkov
From: Ghannam, Yazen @ 2017-02-01 21:37 UTC (permalink / raw)
  To: Borislav Petkov, x86-ml; +Cc: Yves Dionne, Brice Goglin, Peter Zijlstra, lkml

> -----Original Message-----
> From: Borislav Petkov [mailto:bp@alien8.de]
> Sent: Wednesday, February 1, 2017 3:03 PM
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 548da5a8013e..f06fa338076b 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -433,9 +433,13 @@ static bool match_smt(struct cpuinfo_x86 *c, struct
> cpuinfo_x86 *o)
>  		int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
> 
>  		if (c->phys_proc_id == o->phys_proc_id &&
> -		    per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2) &&
> -		    c->cpu_core_id == o->cpu_core_id)
> -			return topology_sane(c, o, "smt");
> +		    per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2)) {
> +			if (c->cpu_core_id == o->cpu_core_id)
> +				return topology_sane(c, o, "smt");
> +
> +			if (c->cu_id == o->cu_id)
> +				return topology_sane(c, o, "smt");
> +		}
> 

This hunk won't work for SMT enabled systems. It'll cause all threads under
an LLC to be considered SMT siblings. For example, threads 0 & 2 will have
different cpu_core_id, so the first check will fail. But it'll match on the
second check since cu_id will be initialized to 0.

To get around this we can set cu_id for all TOPOEXT systems, and update
cpu_core_id, etc. for SMT enabled systems. This way we can just change
cpu_core_id to cu_id in match_smt().

I tested this patch, with the above changes, on a Fam17h SMT enabled
system. I'll test with SMT disabled and also on a fully-loaded Fam15h
system soon.
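
As a quick sanity check of the grouping that ends up in the thread sibling
masks, something like the sketch below can be used to dump each CPU's
thread_siblings_list from sysfs before and after the patch (it assumes the
usual sysfs topology layout and simply stops at the first missing CPU):

  /* sketch: print the SMT/CU sibling list of each CPU */
  #include <stdio.h>

  int main(void)
  {
          char path[128], buf[256];
          int cpu;

          for (cpu = 0; ; cpu++) {
                  FILE *f;

                  snprintf(path, sizeof(path),
                           "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                           cpu);

                  f = fopen(path, "r");
                  if (!f)
                          break;  /* assume no more CPUs */

                  if (fgets(buf, sizeof(buf), f))
                          printf("cpu%d: %s", cpu, buf); /* buf keeps its '\n' */

                  fclose(f);
          }

          return 0;
  }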

Thanks,
Yazen


* Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
  2017-02-01 21:37 ` Ghannam, Yazen
@ 2017-02-01 21:44   ` Borislav Petkov
  2017-02-01 21:55     ` Ghannam, Yazen
From: Borislav Petkov @ 2017-02-01 21:44 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml

On Wed, Feb 01, 2017 at 09:37:02PM +0000, Ghannam, Yazen wrote:
> This hunk won't work for SMT enabled systems. It'll cause all threads under
> an LLC to be considered SMT siblings. For example, threads 0 & 2 will have
> different cpu_core_id, so the first check will fail. But it'll match on the
> second check since cu_id will be initialized to 0.

Good catch.

> To get around this we can set cu_id for all TOPOEXT systems, and update
> cpu_core_id, etc. for SMT enabled systems. This way we can just change
> cpu_core_id to cu_id in match_smt().

Ok, so we want to init ->cu_id to something invalid then. -1, for
example and then do:

	if (c->cu_id != -1 && o->cu_id != -1 && (c->cu_id == o->cu_id))
		...

Alternatively, we can define an X86_FEATURE_COMPUTE_UNITS or so
synthetic bit which we can check.

One thing I don't want to do is reuse ->cu_id on systems which don't
have CUs.
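
Roughly, that synthetic bit variant would look like the sketch below
(X86_FEATURE_COMPUTE_UNITS would still need a free slot in cpufeatures.h):

	/* amd_get_topology(), after decoding leaf 0x8000001e: */
	if (c->x86 == 0x15) {
		c->cu_id = ebx & 0xff;
		set_cpu_cap(c, X86_FEATURE_COMPUTE_UNITS);
	}

	/* match_smt(): */
	if (cpu_has(c, X86_FEATURE_COMPUTE_UNITS) &&
	    cpu_has(o, X86_FEATURE_COMPUTE_UNITS) &&
	    (c->cu_id == o->cu_id))
		return topology_sane(c, o, "smt");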

> I tested this patch, with the above changes, on a Fam17h SMT enabled
> system. I'll test with SMT disabled and also on a fully-loaded Fam15h
> system soon.

Thanks!

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* RE: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
  2017-02-01 21:44   ` Borislav Petkov
@ 2017-02-01 21:55     ` Ghannam, Yazen
  2017-02-01 22:25       ` Borislav Petkov
From: Ghannam, Yazen @ 2017-02-01 21:55 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml

> -----Original Message-----
> From: Borislav Petkov [mailto:bp@alien8.de]
> Sent: Wednesday, February 1, 2017 4:44 PM
> 
> > To get around this we can set cu_id for all TOPOEXT systems, and update
> > cpu_core_id, etc. for SMT enabled systems. This way we can just change
> > cpu_core_id to cu_id in match_smt().
> 
> Ok, so we want to init ->cu_id to something invalid then. -1, for
> example and then do:
> 
> 	if (c->cu_id != -1 && o->cu_id != -1 && (c->cu_id == o->cu_id))
> 		...
> 
> Alternatively, we can define an X86_FEATURE_COMPUTE_UNITS or so
> synthetic bit which we can check.
> 
> One thing I don't want to do is reuse ->cu_id on systems which don't
> have CUs.
> 

Okay, in that case I would prefer to define a synthetic bit. I think it'll be a lot
more clear.

Thanks,
Yazen


* Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
  2017-02-01 21:55     ` Ghannam, Yazen
@ 2017-02-01 22:25       ` Borislav Petkov
  2017-02-01 22:41         ` Borislav Petkov
From: Borislav Petkov @ 2017-02-01 22:25 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml

On Wed, Feb 01, 2017 at 09:55:44PM +0000, Ghannam, Yazen wrote:
> Okay, in that case I would prefer to define a synthetic bit. I think it'll be a lot
> more clear.

No need - it is ok this way too. Now let me apply your changes on top.
I'd like to have two separate patches for this.

Something like this, anyways.

---
>From 523638f56c0975f1de1bdb67ef59421b0439b774 Mon Sep 17 00:00:00 2001
From: Borislav Petkov <bp@suse.de>
Date: Wed, 1 Feb 2017 11:14:35 +0100
Subject: [PATCH] x86/CPU/AMD: Bring back Compute Unit ID

a33d331761bc ("x86/CPU/AMD: Fix Bulldozer topology") restored the
initial approach we had with the Fam15h topology of enumerating CU
(Compute Unit) threads as cores. And this is still correct - they're
beefier than HT threads but still have some shared functionality.

Our current approach has a problem with the Mad Max Steam game, for
example. Yves Dionne reported a certain "choppiness" while playing on
4.9.5.

That problem stems most likely from the fact that the CU threads share
resources within one CU and when we schedule to a thread of a different
compute unit, this incurs latency due to migrating the working set to a
different CU through the caches.

When the thread siblings mask mirrors that aspect of the CUs and
threads, the scheduler pays attention to it and tries to schedule within
one CU first. Which takes care of the latency, of course.

Reported-by: Yves Dionne <yves.dionne@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
---
 arch/x86/include/asm/processor.h |  1 +
 arch/x86/kernel/cpu/amd.c        | 11 ++++++++++-
 arch/x86/kernel/cpu/common.c     |  1 +
 arch/x86/kernel/smpboot.c        | 12 +++++++++---
 4 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 1be64da0384e..e6cfe7ba2d65 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -104,6 +104,7 @@ struct cpuinfo_x86 {
 	__u8			x86_phys_bits;
 	/* CPUID returned core id bits: */
 	__u8			x86_coreid_bits;
+	__u8			cu_id;
 	/* Max extended CPUID function supported: */
 	__u32			extended_cpuid_level;
 	/* Maximum supported CPUID level, -1=no CPUID: */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 80e657e89eed..e7158afb322b 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -309,8 +309,15 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
 
 	/* get information required for multi-node processors */
 	if (boot_cpu_has(X86_FEATURE_TOPOEXT)) {
+		u32 eax, ebx, ecx, edx;
 
-		node_id = cpuid_ecx(0x8000001e) & 7;
+		cpuid(0x8000001e, &eax, &ebx, &ecx, &edx);
+
+		node_id  = ecx & 7;
+		smp_num_siblings = ((ebx >> 8) & 0xff) + 1;
+
+		if (c->x86 == 0x15)
+			c->cu_id = ebx & 0xff;
 
 		/*
 		 * We may have multiple LLCs if L3 caches exist, so check if we
@@ -328,6 +335,8 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
 				per_cpu(cpu_llc_id, cpu) = node_id;
 			}
 		}
+
+
 	} else if (cpu_has(c, X86_FEATURE_NODEID_MSR)) {
 		u64 value;
 
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f74e84ea8557..807602d36d5f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1034,6 +1034,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 	c->x86_model_id[0] = '\0';  /* Unset */
 	c->x86_max_cores = 1;
 	c->x86_coreid_bits = 0;
+	c->cu_id = 0xff;
 #ifdef CONFIG_X86_64
 	c->x86_clflush_size = 64;
 	c->x86_phys_bits = 36;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 548da5a8013e..3876bc555e1a 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -433,9 +433,15 @@ static bool match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 		int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
 
 		if (c->phys_proc_id == o->phys_proc_id &&
-		    per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2) &&
-		    c->cpu_core_id == o->cpu_core_id)
-			return topology_sane(c, o, "smt");
+		    per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2)) {
+			if (c->cpu_core_id == o->cpu_core_id)
+				return topology_sane(c, o, "smt");
+
+			if ((c->cu_id != 0xff) &&
+			    (o->cu_id != 0xff) &&
+			    (c->cu_id == o->cu_id))
+				return topology_sane(c, o, "smt");
+		}
 
 	} else if (c->phys_proc_id == o->phys_proc_id &&
 		   c->cpu_core_id == o->cpu_core_id) {
-- 
2.11.0

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
  2017-02-01 22:25       ` Borislav Petkov
@ 2017-02-01 22:41         ` Borislav Petkov
  2017-02-02 12:10           ` Borislav Petkov
From: Borislav Petkov @ 2017-02-01 22:41 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml

On Wed, Feb 01, 2017 at 11:25:07PM +0100, Borislav Petkov wrote:
> On Wed, Feb 01, 2017 at 09:55:44PM +0000, Ghannam, Yazen wrote:
> > Okay, in that case I would prefer to define a synthetic bit. I think it'll be a lot
> > more clear.
> 
> No need - it is ok this way too. Now let me apply your changes on top.
> I'd like to have two separate patches for this.

Ok, here are your changes on top of the first patch. We can still avoid
the division on SMT-off systems.

More playing with this tomorrow. It is late here and brain wants to
sleep now.

---
>From 4b3b9626ef8a535df304aaa017b61436a3b37922 Mon Sep 17 00:00:00 2001
From: Yazen Ghannam <Yazen.Ghannam@amd.com>
Date: Wed, 1 Feb 2017 23:33:16 +0100
Subject: [PATCH] x86/CPU/AMD: Fix Zen SMT topology

After a33d331761bc ("x86/CPU/AMD: Fix Bulldozer topology"), SMT
scheduling topology for Fam17h systems is broken because the ThreadId is
included in the ApicId when SMT is enabled.

So, without further decoding cpu_core_id is unique for each thread
rather than the same for threads on the same core. This didn't affect
systems with SMT disabled. Make cpu_core_id be what it is defined to be.

Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/cpu/amd.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index e7158afb322b..349b7d9baf3f 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -319,6 +319,13 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
 		if (c->x86 == 0x15)
 			c->cu_id = ebx & 0xff;
 
+		if (c->x86 >= 0x17) {
+			c->cpu_core_id = ebx & 0xff;
+
+			if (smp_num_siblings > 1)
+				c->x86_max_cores /= smp_num_siblings;
+		}
+
 		/*
 		 * We may have multiple LLCs if L3 caches exist, so check if we
 		 * have an L3 cache by looking at the L3 cache CPUID leaf.
-- 
2.11.0

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
  2017-02-01 22:41         ` Borislav Petkov
@ 2017-02-02 12:10           ` Borislav Petkov
  2017-02-02 15:43             ` Borislav Petkov
  2017-02-02 16:14             ` Ghannam, Yazen
From: Borislav Petkov @ 2017-02-02 12:10 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml

On Wed, Feb 01, 2017 at 11:41:50PM +0100, Borislav Petkov wrote:
> More playing with this tomorrow. It is late here and brain wants to
> sleep now.

Ok, did some measurements of our favourite workload with and without
those patches on rc6+tip/master.

It is a Kaveri laptop, so small: only 2 CUs. The perf command was:

./tools/perf/perf stat -e task-clock,context-switches,cache-misses,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses --repeat 3 --sync --pre /path/to/pre-build-kernel.sh -- make -s -j5 bzImage

and that script is:

$ cat pre-build-kernel.sh
#!/bin/bash

make -s clean
echo 3 > /proc/sys/vm/drop_caches

Here are the results:

before:

 Performance counter stats for 'make -s -j5 bzImage' (3 runs):

    1457712.248049      task-clock (msec)         #    3.612 CPUs utilized            ( +-  1.20% )
           400,872      context-switches          #    0.275 K/sec                    ( +-  0.23% )
     8,675,334,184      cache-misses                                                  ( +-  0.15% )
            26,915      cpu-migrations            #    0.018 K/sec                    ( +-  2.13% )
        23,806,184      page-faults               #    0.016 M/sec                    ( +-  0.00% )
 3,648,915,008,651      cycles                    #    2.503 GHz                      ( +-  0.91% )
 1,895,555,704,111      instructions              #    0.52  insn per cycle                                              ( +-  0.00% )
   426,444,023,897      branches                  #  292.543 M/sec                    ( +-  0.00% )
    26,127,609,710      branch-misses             #    6.13% of all branches          ( +-  0.02% )

     403.601384883 seconds time elapsed                                          ( +-  1.18% )


after:

 Performance counter stats for 'make -s -j5 bzImage' (3 runs):

    1436580.109340      task-clock (msec)         #    3.614 CPUs utilized            ( +-  1.37% )
           396,949      context-switches          #    0.276 K/sec                    ( +-  0.08% )
     8,655,078,022      cache-misses                                                  ( +-  0.36% )
            30,623      cpu-migrations            #    0.021 K/sec                    ( +-  0.46% )
        23,788,698      page-faults               #    0.017 M/sec                    ( +-  0.00% )
 3,568,254,088,919      cycles                    #    2.484 GHz                      ( +-  0.88% )
 1,895,348,016,179      instructions              #    0.53  insn per cycle                                              ( +-  0.00% )
   426,405,814,017      branches                  #  296.820 M/sec                    ( +-  0.00% )
    26,090,473,525      branch-misses             #    6.12% of all branches          ( +-  0.04% )

     397.531904252 seconds time elapsed                                          ( +-  1.20% )


Context switches have dropped, cache misses are the same and we have a
rise in cpu-migrations. That last bit is interesting and I don't have an
answer yet. Maybe peterz has an idea.

Cycles have dropped too.

And we're 6 secs faster so I'll take that.

Now on to run the same thing on a bigger bulldozer.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
  2017-02-02 12:10           ` Borislav Petkov
@ 2017-02-02 15:43             ` Borislav Petkov
  2017-02-02 16:09               ` Ingo Molnar
  2017-02-02 16:14             ` Ghannam, Yazen
From: Borislav Petkov @ 2017-02-02 15:43 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml

On Thu, Feb 02, 2017 at 01:10:54PM +0100, Borislav Petkov wrote:
> Now on to run the same thing on a bigger bulldozer.

It looks different on the bigger box:

before:

 Performance counter stats for 'make -s -j17 bzImage' (3 runs):

    1333240.579196      task-clock (msec)         #    9.271 CPUs utilized            ( +-  0.03% )
           393,521      context-switches          #    0.295 K/sec                    ( +-  0.27% )
     6,629,632,543      cache-misses                                                  ( +-  0.10% )
            51,570      cpu-migrations            #    0.039 K/sec                    ( +-  1.21% )
        27,812,379      page-faults               #    0.021 M/sec                    ( +-  0.05% )
 3,633,456,270,448      cycles                    #    2.725 GHz                      ( +-  0.03% )
 2,144,192,888,087      instructions              #    0.59  insn per cycle                                              ( +-  0.00% )
   476,225,081,234      branches                  #  357.194 M/sec                    ( +-  0.00% )
    25,294,199,758      branch-misses             #    5.31% of all branches          ( +-  0.02% )

     143.807894000 seconds time elapsed                                          ( +-  0.89% )


after:

 Performance counter stats for 'make -s -j17 bzImage' (3 runs):

    1330842.218684      task-clock (msec)         #    9.047 CPUs utilized            ( +-  0.14% )
           396,044      context-switches          #    0.298 K/sec                    ( +-  0.60% )
     6,625,112,597      cache-misses                                                  ( +-  0.36% )
            60,050      cpu-migrations            #    0.045 K/sec                    ( +-  1.11% )
        27,862,451      page-faults               #    0.021 M/sec                    ( +-  0.08% )
 3,625,822,644,429      cycles                    #    2.724 GHz                      ( +-  0.12% )
 2,144,932,865,362      instructions              #    0.59  insn per cycle                                              ( +-  0.01% )
   476,375,898,692      branches                  #  357.951 M/sec                    ( +-  0.01% )
    25,226,837,830      branch-misses             #    5.30% of all branches          ( +-  0.09% )

     147.109189694 seconds time elapsed                                          ( +-  0.49% )

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
  2017-02-02 15:43             ` Borislav Petkov
@ 2017-02-02 16:09               ` Ingo Molnar
  2017-02-02 17:04                 ` Borislav Petkov
From: Ingo Molnar @ 2017-02-02 16:09 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ghannam, Yazen, x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml


* Borislav Petkov <bp@alien8.de> wrote:

> On Thu, Feb 02, 2017 at 01:10:54PM +0100, Borislav Petkov wrote:
> > Now on to run the same thing on a bigger bulldozer.
> 
> It looks different on the bigger box:
> 
> before:
> 
>  Performance counter stats for 'make -s -j17 bzImage' (3 runs):
>      143.807894000 seconds time elapsed                                          ( +-  0.89% )
> 
>  Performance counter stats for 'make -s -j17 bzImage' (3 runs):
>      147.109189694 seconds time elapsed                                          ( +-  0.49% )

If there's any doubt about the validity of the measurement I'd suggest doing:

	perf stat -a --sync --repeat 3 ...

... so that there's no perf overhead and skew from the many processes of a kernel 
build workload, plus the --sync should reduce IO related noise.

Or:

	perf stat --null --sync --repeat 3 ...

... will only measure elapsed time, but will do that very precisely and with very 
little overhead.

Thanks,

	Ingo


* RE: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
  2017-02-02 12:10           ` Borislav Petkov
  2017-02-02 15:43             ` Borislav Petkov
@ 2017-02-02 16:14             ` Ghannam, Yazen
  2017-02-02 16:29               ` Ingo Molnar
From: Ghannam, Yazen @ 2017-02-02 16:14 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml

> -----Original Message-----
> From: Borislav Petkov [mailto:bp@alien8.de]
> Sent: Thursday, February 2, 2017 7:11 AM
> 
> Context switches have dropped, cache misses are the same and we have a
> rise in cpu-migrations. That last bit is interesting and I don't have an
> answer yet. Maybe peterz has an idea.
> 

Could it be that the scheduler is more lax about migrations between SMT
siblings?

> Cycles have dropped too.
> 
> And we're 6 secs faster so I'll take that.
> 
> Now on to run the same thing on a bigger bulldozer.
> 

Here are my results on a 32C Bulldozer system with an SSD. Also, I use ccache so
I added "ccache -C" in the pre-build script so the cache gets cleared.

Before:
Performance counter stats for 'make -s -j65 bzImage' (3 runs):

    2375752.777479      task-clock (msec)         #   23.589 CPUs utilized            ( +-  0.35% )
         1,198,979      context-switches          #    0.505 K/sec                    ( +-  0.34% )
     8,964,671,259      cache-misses                                                  ( +-  0.44% )
            79,399      cpu-migrations            #    0.033 K/sec                    ( +-  1.92% )
        37,840,875      page-faults               #    0.016 M/sec                    ( +-  0.20% )
 5,425,612,846,538      cycles                    #    2.284 GHz                      ( +-  0.36% )
 3,367,750,745,825      instructions              #    0.62  insn per cycle                                              ( +-  0.11% )
   750,591,286,261      branches                  #  315.938 M/sec                    ( +-  0.11% )
    43,544,059,077      branch-misses             #    5.80% of all branches          ( +-  0.08% )

     100.716043494 seconds time elapsed                                          ( +-  1.97% )

After:
Performance counter stats for 'make -s -j65 bzImage' (3 runs):

    1736720.488346      task-clock (msec)         #   23.529 CPUs utilized            ( +-  0.16% )
         1,144,737      context-switches          #    0.659 K/sec                    ( +-  0.20% )
     8,570,352,975      cache-misses                                                  ( +-  0.33% )
            91,817      cpu-migrations            #    0.053 K/sec                    ( +-  1.67% )
        37,688,118      page-faults               #    0.022 M/sec                    ( +-  0.03% )
 5,547,082,899,245      cycles                    #    3.194 GHz                      ( +-  0.19% )
 3,363,365,420,405      instructions              #    0.61  insn per cycle                                              ( +-  0.00% )
   749,676,420,820      branches                  #  431.662 M/sec                    ( +-  0.00% )
    43,243,046,270      branch-misses             #    5.77% of all branches          ( +-  0.01% )

      73.810517234 seconds time elapsed                                          ( +-  0.02% )

Thanks,
Yazen


* Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
  2017-02-02 16:14             ` Ghannam, Yazen
@ 2017-02-02 16:29               ` Ingo Molnar
From: Ingo Molnar @ 2017-02-02 16:29 UTC (permalink / raw)
  To: Ghannam, Yazen
  Cc: Borislav Petkov, x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml


* Ghannam, Yazen <Yazen.Ghannam@amd.com> wrote:

> Here are my results on a 32C Bulldozer system with an SSD. Also, I use ccache so 
> I added "ccache -C" in the pre-build script so the cache gets cleared.
> 
> Before:
> Performance counter stats for 'make -s -j65 bzImage' (3 runs):
> 
>     2375752.777479      task-clock (msec)         #   23.589 CPUs utilized            ( +-  0.35% )
>          1,198,979      context-switches          #    0.505 K/sec                    ( +-  0.34% )
>      8,964,671,259      cache-misses                                                  ( +-  0.44% )
>             79,399      cpu-migrations            #    0.033 K/sec                    ( +-  1.92% )
>         37,840,875      page-faults               #    0.016 M/sec                    ( +-  0.20% )
>  5,425,612,846,538      cycles                    #    2.284 GHz                      ( +-  0.36% )
>  3,367,750,745,825      instructions              #    0.62  insn per cycle                                              ( +-  0.11% )
>    750,591,286,261      branches                  #  315.938 M/sec                    ( +-  0.11% )
>     43,544,059,077      branch-misses             #    5.80% of all branches          ( +-  0.08% )
> 
>      100.716043494 seconds time elapsed                                          ( +-  1.97% )
> 
> After:
> Performance counter stats for 'make -s -j65 bzImage' (3 runs):
> 
>     1736720.488346      task-clock (msec)         #   23.529 CPUs utilized            ( +-  0.16% )
>          1,144,737      context-switches          #    0.659 K/sec                    ( +-  0.20% )
>      8,570,352,975      cache-misses                                                  ( +-  0.33% )
>             91,817      cpu-migrations            #    0.053 K/sec                    ( +-  1.67% )
>         37,688,118      page-faults               #    0.022 M/sec                    ( +-  0.03% )
>  5,547,082,899,245      cycles                    #    3.194 GHz                      ( +-  0.19% )
>  3,363,365,420,405      instructions              #    0.61  insn per cycle                                              ( +-  0.00% )
>    749,676,420,820      branches                  #  431.662 M/sec                    ( +-  0.00% )
>     43,243,046,270      branch-misses             #    5.77% of all branches          ( +-  0.01% )
> 
>       73.810517234 seconds time elapsed                                          ( +-  0.02% )

That's pretty impressive: ~35% difference in wall clock performance of this 
workload.

And that while both the cycles and the instructions counts are within 2.5% of each
other. The only stat that differs beyond the level of noise is cache-misses:

      8,964,671,259      cache-misses                                                  ( +-  0.44% )
      8,570,352,975      cache-misses                                                  ( +-  0.33% )

which is 4.5%, but I have trouble believing that just 4.5% more cachemisses can 
have such a massive effect on performance.

So unless +4.5% cachemisses can cause a 35% difference in performance this is a 
really weird result. Where did the extra performance come from - was the 'good' 
workload perhaps running at higher CPU frequencies for some reason?

Thanks,

	Ingo


* Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
  2017-02-02 16:09               ` Ingo Molnar
@ 2017-02-02 17:04                 ` Borislav Petkov
  2017-02-02 18:10                   ` Borislav Petkov
From: Borislav Petkov @ 2017-02-02 17:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ghannam, Yazen, x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml

On Thu, Feb 02, 2017 at 05:09:16PM +0100, Ingo Molnar wrote:
> If there's any doubt about the validity of the measurement I'd suggest doing:
> 
> 	perf stat -a --sync --repeat 3 ...
> 
> ... so that there's no perf overhead and skew from the many processes of a kernel 
> build workload, plus the --sync should reduce IO related noise.
> 
> Or:
> 
> 	perf stat --null --sync --repeat 3 ...
> 
> ... will only measure elapsed time, but will do that very precisely and with very 
> little overhead.

Yeah, I was talking to Peter about the -a thing on IRC... I think I'm going to
try that. Here's the full command I was using:

./tools/perf/perf stat -e task-clock,context-switches,cache-misses,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses --repeat 3 --sync --pre ~/bin/pre-build-kernel.sh -- make -s -j17 bzImage

I think I stole it from you from some mail thread we had in the past.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
  2017-02-02 17:04                 ` Borislav Petkov
@ 2017-02-02 18:10                   ` Borislav Petkov
  2017-02-02 20:45                     ` Ghannam, Yazen
From: Borislav Petkov @ 2017-02-02 18:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ghannam, Yazen, x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml

On Thu, Feb 02, 2017 at 06:04:56PM +0100, Borislav Petkov wrote:
> I think I stole it from you from some mail thread we had in the past.

Yap, --all-cpus is a bit better in that the difference between the two
kernels is smaller.

For some reason, though, the workload is a bit slower with the patch:
more cycles, more branches, ... Still, it is only about 2 sec slower.
I think that's probably because this box is the first Bulldozer uarch;
when you run it on newer versions of the uarch, the result is better,
due to improvements in the uarch.

Yazen, what BD generation is your machine?

I have one more Bulldozer box: rev C0 on which I could run this over the
weekend.

./tools/perf/perf stat -a -e task-clock,context-switches,cache-misses,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses --repeat 3 --sync --pre ~/bin/pre-build-kernel.sh -- make -s -j17 bzImage

before:

 Performance counter stats for 'system wide' (3 runs):

    2279512.230871      task-clock (msec)         #   15.999 CPUs utilized            ( +-  0.40% )
           714,492      context-switches          #    0.313 K/sec                    ( +-  0.19% )
     6,726,972,836      cache-misses                                                  ( +-  0.15% )
            56,490      cpu-migrations            #    0.025 K/sec                    ( +-  2.98% )
        27,794,829      page-faults               #    0.012 M/sec                    ( +-  0.04% )
 3,719,570,726,045      cycles                    #    1.632 GHz                      ( +-  0.06% )
 2,146,930,432,417      instructions              #    0.58  insn per cycle                                              ( +-  0.05% )
   476,587,085,009      branches                  #  209.074 M/sec                    ( +-  0.06% )
    25,286,321,575      branch-misses             #    5.31% of all branches          ( +-  0.07% )

     142.475046735 seconds time elapsed                                          ( +-  0.40% )

after:

 Performance counter stats for 'system wide' (3 runs):

    2312821.267459      task-clock (msec)         #   16.000 CPUs utilized            ( +-  0.20% )
           760,839      context-switches          #    0.329 K/sec                    ( +-  0.29% )
     6,769,543,062      cache-misses                                                  ( +-  0.05% )
            68,785      cpu-migrations            #    0.030 K/sec                    ( +-  0.75% )
        27,828,222      page-faults               #    0.012 M/sec                    ( +-  0.04% )
 3,725,704,384,061      cycles                    #    1.611 GHz                      ( +-  0.06% )
 2,149,336,525,435      instructions              #    0.58  insn per cycle                                              ( +-  0.01% )
   477,157,066,501      branches                  #  206.310 M/sec                    ( +-  0.01% )
    25,289,357,158      branch-misses             #    5.30% of all branches          ( +-  0.07% )

     144.551731453 seconds time elapsed                                          ( +-  0.20% )

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* RE: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
  2017-02-02 18:10                   ` Borislav Petkov
@ 2017-02-02 20:45                     ` Ghannam, Yazen
From: Ghannam, Yazen @ 2017-02-02 20:45 UTC (permalink / raw)
  To: Borislav Petkov, Ingo Molnar
  Cc: x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml

> -----Original Message-----
> From: Borislav Petkov [mailto:bp@alien8.de]
> Sent: Thursday, February 2, 2017 1:11 PM
> 
> Yazen, what BD generation is your machine?
> 

The processors are revision C0. Also, I forgot to mention it's a 2P G34 system. 

Thanks,
Yazen

