* [PATCH v5] x86,sched: allow topologies where NUMA nodes share an LLC
@ 2018-04-07  0:21 Alison Schofield
  2018-04-16 18:57 ` Borislav Petkov
  2018-04-17 13:46 ` [tip:x86/urgent] x86,sched: Allow " tip-bot for Alison Schofield
  0 siblings, 2 replies; 7+ messages in thread
From: Alison Schofield @ 2018-04-07  0:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar
  Cc: Dave Hansen, Tony Luck, Tim Chen, H. Peter Anvin,
	Borislav Petkov, Peter Zijlstra, David Rientjes, Igor Mammedov,
	Prarit Bhargava, brice.goglin, x86, linux-kernel

From: Alison Schofield <alison.schofield@intel.com>

Intel's Skylake Server CPUs have a different LLC topology than previous
generations. When in Sub-NUMA-Clustering (SNC) mode, the package is
divided into two "slices", each containing half the cores, half the LLC,
and one memory controller and each slice is enumerated to Linux as a
NUMA node. This is similar to how the cores and LLC were arranged
for the Cluster-On-Die (CoD) feature.

CoD allowed the same cache line to be present in each half of the LLC.
But, with SNC, each line is only ever present in *one* slice. This
means that the portion of the LLC *available* to a CPU depends on the
data being accessed:

    Remote socket: entire package LLC is shared
    Local socket->local slice: data goes into local slice LLC
    Local socket->remote slice: data goes into remote-slice LLC. Slightly
                    		higher latency than local slice LLC.

The biggest implication from this is that a process accessing all
NUMA-local memory only sees half the LLC capacity.
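
As a concrete illustration (the cache sizes here are an assumption,
roughly matching a large Skylake-SP part with 1.375MB of LLC per core):

    Package LLC:  28 cores * 1.375MB = 38.5MB
    Slice LLC:    38.5MB / 2         = 19.25MB

A workload touching only NUMA-local memory can allocate into just the
~19.25MB local slice, even though the full 38.5MB is enumerated.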

The CPU describes its cache hierarchy with the CPUID instruction. One
of the CPUID leaves enumerates the "logical processors sharing this
cache". This information is used for scheduling decisions so that tasks
move more freely between CPUs sharing the cache.
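
For reference, a minimal user-space sketch (not kernel code; assumes gcc
or clang on an x86 CPU that implements CPUID leaf 0x4) of where that
count comes from. Leaf 0x4 enumerates the cache levels, and EAX bits
25:14 hold the "maximum addressable logical processor IDs sharing this
cache" minus one:

	#include <stdio.h>
	#include <cpuid.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx, subleaf;

		for (subleaf = 0; ; subleaf++) {
			__cpuid_count(0x4, subleaf, eax, ebx, ecx, edx);
			if ((eax & 0x1f) == 0)	/* cache type 0: no more caches */
				break;
			printf("L%u cache: shared by up to %u logical CPUs\n",
			       (eax >> 5) & 0x7,	/* cache level */
			       ((eax >> 14) & 0xfff) + 1);
		}
		return 0;
	}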

But, the CPUID for the SNC configuration discussed above enumerates
the LLC as being shared by the entire package. This is not 100%
precise because the entire cache is not usable by all accesses. But,
it *is* the way the hardware enumerates itself, and this is not likely
to change.

The userspace visible impact of all the above is that the sysfs info
reports the entire LLC as being available to the entire package. As
noted above, this is not true for local socket accesses. This patch
does not correct the sysfs info. It is the same, pre and post patch.
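
(For anyone who wants to inspect that sysfs view directly, a small
sketch; "index3" is typically the L3 on these parts, but the index is
an assumption to verify per system:)

	#include <stdio.h>

	int main(void)
	{
		char buf[512];
		FILE *f = fopen("/sys/devices/system/cpu/cpu0/cache"
				"/index3/shared_cpu_list", "r");

		if (!f) {
			perror("fopen");
			return 1;
		}
		if (fgets(buf, sizeof(buf), f))
			printf("CPUs sharing cpu0's LLC: %s", buf);
		fclose(f);
		return 0;
	}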

This patch continues to allow this SNC topology and it does so without
complaint. It eliminates a warning that looks like this:

	sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.

The warning is coming from the sane_topology check() in smpboot.c.
To fix this, add a vendor and model specific check to never call
topology_sane() for these systems. Also, just like "Cluster-on-Die"
we throw out the "coregroup" sched_domain_topology_level and use
NUMA information from the SRAT alone.

This is OK at least on the hardware we are immediately concerned about
because the LLC sharing happens at both the slice and at the package
level, which are also NUMA boundaries.

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: brice.goglin@gmail.com
Cc: Ingo Molnar <mingo@kernel.org>
---

Changes in v5:

 * Removed the redundant setting of boolean x86_has_numa_in_package.
   Related comments were put back in their pre-patch locations.

Changes in v4:

 * Added this to the patch description above:

   The userspace visible impact of all the above is that the sysfs info
   reports the entire LLC as being available to the entire package. As
   noted above, this is not true for local socket accesses. This patch
   does not correct the sysfs info. It is the same, pre and post patch.

   This patch continues to allow this SNC topology and it does so without
   complaint. It eliminates a warning that looks like this:
  
 * Changed the code comment per PeterZ/DaveH discussion wrt bypassing
   that topology_sane() check in match_llc()
	/*
	 * false means 'c' does not share the LLC of 'o'.
	 * Note: this decision gets reflected all the way
	 * out to userspace
	 */
   This comment is meant to clarify what happens when we return false.
   Note that returning false is *not* new to this patch. Without
   this patch we always returned false - with a warning. This avoids
   that warning and returns false directly.

 * Remove __initconst from snc_cpu[] declaration that I had added in
   v3. This is not an init time only path. 

 * Do not deal with the wrong sysfs info. It was wrong before this
   patch and it will be the exact same 'wrongness' after this patch.

   We can address the sysfs reporting separately. Here are some options:
   1) Change the way the LLC-size is reported.  Enumerate two separate,
      half-sized LLCs shared only by the slice when SNC mode is on (see
      the illustration after this list).
   2) Do not export the sysfs info that is wrong. Prevents userspace
      from making bad decisions based on inaccurate info.
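
   To make option 1 concrete, a hypothetical before/after view of the
   relevant sysfs attribute (the CPU numbering is invented for
   illustration: one 28-core/56-thread package, two SNC slices):

	SNC on, today (pre- and post-patch):
	  /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
	    -> 0-55 (whole package)

	SNC on, under option 1:
	  /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
	    -> 0-13,28-41 (local slice only)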


Changes in v3:

 * Use x86_match_cpu() for vendor & model check and moved related
   comments to the array define. (Still just one system model)

 * Updated the comments surrounding the topology_sane() check.


Changes in v2:

 * Add vendor check (Intel) where we only had a model check (Skylake_X).
   Considered the suggestion of adding a new flag here but thought that
   to be overkill for this usage.

 * Return false, instead of true, from match_llc() per reviewer suggestion.
   That also cleaned up a topology broken bug message in sched_domain().

 * Updated the comments around the match_llc() return value, continuing to
   note that the return value doesn't actually matter because we are throwing
   away coregroups for scheduling.

 * Changed submitter. I'm following up on this patch originally submitted
   by Dave Hansen.


 arch/x86/kernel/smpboot.c | 45 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 40 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ff99e2b..45175b8 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -77,6 +77,8 @@
 #include <asm/i8259.h>
 #include <asm/misc.h>
 #include <asm/qspinlock.h>
+#include <asm/intel-family.h>
+#include <asm/cpu_device_id.h>
 
 /* Number of siblings per CPU package */
 int smp_num_siblings = 1;
@@ -390,15 +392,47 @@ static bool match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 	return false;
 }
 
+/*
+ * Define snc_cpu[] for SNC (Sub-NUMA Cluster) CPUs.
+ *
+ * These are Intel CPUs that enumerate an LLC that is shared by
+ * multiple NUMA nodes. The LLC on these systems is shared for
+ * off-package data access but private to the NUMA node (half
+ * of the package) for on-package access.
+ *
+ * CPUID (the source of the information about the LLC) can only
+ * enumerate the cache as being shared *or* unshared, but not
+ * this particular configuration. The CPU in this case enumerates
+ * the cache to be shared across the entire package (spanning both
+ * NUMA nodes).
+ */
+
+static const struct x86_cpu_id snc_cpu[] = {
+	{ X86_VENDOR_INTEL, 6, INTEL_FAM6_SKYLAKE_X },
+	{}
+};
+
 static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 {
 	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
 
-	if (per_cpu(cpu_llc_id, cpu1) != BAD_APICID &&
-	    per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2))
-		return topology_sane(c, o, "llc");
+	/* Do not match if we do not have a valid APICID for cpu: */
+	if (per_cpu(cpu_llc_id, cpu1) == BAD_APICID)
+		return false;
 
-	return false;
+	/* Do not match if LLC id does not match: */
+	if (per_cpu(cpu_llc_id, cpu1) != per_cpu(cpu_llc_id, cpu2))
+		return false;
+
+	/*
+	 * Allow the SNC topology without warning. Return of false
+	 * means 'c' does not share the LLC of 'o'. This will be
+	 * reflected to userspace.
+	 */
+	if (!topology_same_node(c, o) && x86_match_cpu(snc_cpu))
+		return false;
+
+	return topology_sane(c, o, "llc");
 }
 
 /*
@@ -456,7 +490,8 @@ static struct sched_domain_topology_level x86_topology[] = {
 
 /*
  * Set if a package/die has multiple NUMA nodes inside.
- * AMD Magny-Cours and Intel Cluster-on-Die have this.
+ * AMD Magny-Cours, Intel Cluster-on-Die, and Intel
+ * Sub-NUMA Clustering have this.
  */
 static bool x86_has_numa_in_package;
 
-- 
2.7.4


* Re: [PATCH v5] x86,sched: allow topologies where NUMA nodes share an LLC
  2018-04-07  0:21 [PATCH v5] x86,sched: allow topologies where NUMA nodes share an LLC Alison Schofield
@ 2018-04-16 18:57 ` Borislav Petkov
  2018-04-17 13:46 ` [tip:x86/urgent] x86,sched: Allow " tip-bot for Alison Schofield
  1 sibling, 0 replies; 7+ messages in thread
From: Borislav Petkov @ 2018-04-16 18:57 UTC (permalink / raw)
  To: Alison Schofield
  Cc: Thomas Gleixner, Ingo Molnar, Dave Hansen, Tony Luck, Tim Chen,
	H. Peter Anvin, Peter Zijlstra, David Rientjes, Igor Mammedov,
	Prarit Bhargava, brice.goglin, x86, linux-kernel

On Fri, Apr 06, 2018 at 05:21:30PM -0700, Alison Schofield wrote:
> From: Alison Schofield <alison.schofield@intel.com>
> 
> Intel's Skylake Server CPUs have a different LLC topology than previous
> generations. When in Sub-NUMA-Clustering (SNC) mode, the package is
> divided into two "slices", each containing half the cores, half the LLC,
> and one memory controller and each slice is enumerated to Linux as a
> NUMA node. This is similar to how the cores and LLC were arranged
> for the Cluster-On-Die (CoD) feature.
> 
> CoD allowed the same cache line to be present in each half of the LLC.
> But, with SNC, each line is only ever present in *one* slice. This
> means that the portion of the LLC *available* to a CPU depends on the
> data being accessed:
> 
>     Remote socket: entire package LLC is shared
>     Local socket->local slice: data goes into local slice LLC
>     Local socket->remote slice: data goes into remote-slice LLC. Slightly
>                     		higher latency than local slice LLC.
> 
> The biggest implication from this is that a process accessing all
> NUMA-local memory only sees half the LLC capacity.
> 
> The CPU describes its cache hierarchy with the CPUID instruction. One
> of the CPUID leaves enumerates the "logical processors sharing this
> cache". This information is used for scheduling decisions so that tasks
> move more freely between CPUs sharing the cache.
> 
> But, the CPUID for the SNC configuration discussed above enumerates
> the LLC as being shared by the entire package. This is not 100%
> precise because the entire cache is not usable by all accesses. But,
> it *is* the way the hardware enumerates itself, and this is not likely
> to change.
> 
> The userspace visible impact of all the above is that the sysfs info
> reports the entire LLC as being available to the entire package. As
> noted above, this is not true for local socket accesses. This patch
> does not correct the sysfs info. It is the same, pre and post patch.
> 
> This patch continues to allow this SNC topology and it does so without
> complaint. It eliminates a warning that looks like this:
> 
> 	sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
> 
> The warning is coming from the sane_topology check() in smpboot.c.

s/sane_topology check()/topology_sane() check/

> To fix this, add a vendor and model specific check to never call
> topology_sane() for these systems. Also, just like "Cluster-on-Die"
> we throw out the "coregroup" sched_domain_topology_level and use
> NUMA information from the SRAT alone.
> 
> This is OK at least on the hardware we are immediately concerned about
> because the LLC sharing happens at both the slice and at the package
> level, which are also NUMA boundaries.

I wish everyone would write commit messages like this. Very good and
nicely written explanation!

> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: "H. Peter Anvin" <hpa@linux.intel.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Prarit Bhargava <prarit@redhat.com>
> Cc: brice.goglin@gmail.com
> Cc: Ingo Molnar <mingo@kernel.org>
> ---

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* [tip:x86/urgent] x86,sched: Allow topologies where NUMA nodes share an LLC
  2018-04-07  0:21 [PATCH v5] x86,sched: allow topologies where NUMA nodes share an LLC Alison Schofield
  2018-04-16 18:57 ` Borislav Petkov
@ 2018-04-17 13:46 ` tip-bot for Alison Schofield
  2018-04-17 14:02   ` Fwd: [tip:x86/urgent] x86, sched: " Juergen Gross
  1 sibling, 1 reply; 7+ messages in thread
From: tip-bot for Alison Schofield @ 2018-04-17 13:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, linux-kernel, prarit, imammedo, rientjes, mingo,
	alison.schofield, tglx, tim.c.chen, tony.luck, bp, hpa,
	dave.hansen, hpa, bp

Commit-ID:  1340ccfa9a9afefdbab90d7935d4ed19817e37c2
Gitweb:     https://git.kernel.org/tip/1340ccfa9a9afefdbab90d7935d4ed19817e37c2
Author:     Alison Schofield <alison.schofield@intel.com>
AuthorDate: Fri, 6 Apr 2018 17:21:30 -0700
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Tue, 17 Apr 2018 15:39:55 +0200

x86,sched: Allow topologies where NUMA nodes share an LLC

Intel's Skylake Server CPUs have a different LLC topology than previous
generations. When in Sub-NUMA-Clustering (SNC) mode, the package is divided
into two "slices", each containing half the cores, half the LLC, and one
memory controller and each slice is enumerated to Linux as a NUMA
node. This is similar to how the cores and LLC were arranged for the
Cluster-On-Die (CoD) feature.

CoD allowed the same cache line to be present in each half of the LLC.
But, with SNC, each line is only ever present in *one* slice. This means
that the portion of the LLC *available* to a CPU depends on the data being
accessed:

    Remote socket: entire package LLC is shared
    Local socket->local slice: data goes into local slice LLC
    Local socket->remote slice: data goes into remote-slice LLC. Slightly
                    		higher latency than local slice LLC.

The biggest implication from this is that a process accessing all
NUMA-local memory only sees half the LLC capacity.

The CPU describes its cache hierarchy with the CPUID instruction. One of
the CPUID leaves enumerates the "logical processors sharing this
cache". This information is used for scheduling decisions so that tasks
move more freely between CPUs sharing the cache.

But, the CPUID for the SNC configuration discussed above enumerates the LLC
as being shared by the entire package. This is not 100% precise because the
entire cache is not usable by all accesses. But, it *is* the way the
hardware enumerates itself, and this is not likely to change.

The userspace visible impact of all the above is that the sysfs info
reports the entire LLC as being available to the entire package. As noted
above, this is not true for local socket accesses. This patch does not
correct the sysfs info. It is the same, pre and post patch.

The current code emits the following warning:

 sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.

The warning is coming from the topology_sane() check in smpboot.c because
the topology is not matching the expectations of the model for obvious
reasons.

To fix this, add a vendor and model specific check to never call
topology_sane() for these systems. Also, just like "Cluster-on-Die" disable
the "coregroup" sched_domain_topology_level and use NUMA information from
the SRAT alone.

This is OK at least on the hardware we are immediately concerned about
because the LLC sharing happens at both the slice and at the package level,
which are also NUMA boundaries.

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: brice.goglin@gmail.com
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lkml.kernel.org/r/20180407002130.GA18984@alison-desk.jf.intel.com

---
 arch/x86/kernel/smpboot.c | 45 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 40 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ff99e2b6fc54..45175b81dd5b 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -77,6 +77,8 @@
 #include <asm/i8259.h>
 #include <asm/misc.h>
 #include <asm/qspinlock.h>
+#include <asm/intel-family.h>
+#include <asm/cpu_device_id.h>
 
 /* Number of siblings per CPU package */
 int smp_num_siblings = 1;
@@ -390,15 +392,47 @@ static bool match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 	return false;
 }
 
+/*
+ * Define snc_cpu[] for SNC (Sub-NUMA Cluster) CPUs.
+ *
+ * These are Intel CPUs that enumerate an LLC that is shared by
+ * multiple NUMA nodes. The LLC on these systems is shared for
+ * off-package data access but private to the NUMA node (half
+ * of the package) for on-package access.
+ *
+ * CPUID (the source of the information about the LLC) can only
+ * enumerate the cache as being shared *or* unshared, but not
+ * this particular configuration. The CPU in this case enumerates
+ * the cache to be shared across the entire package (spanning both
+ * NUMA nodes).
+ */
+
+static const struct x86_cpu_id snc_cpu[] = {
+	{ X86_VENDOR_INTEL, 6, INTEL_FAM6_SKYLAKE_X },
+	{}
+};
+
 static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 {
 	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
 
-	if (per_cpu(cpu_llc_id, cpu1) != BAD_APICID &&
-	    per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2))
-		return topology_sane(c, o, "llc");
+	/* Do not match if we do not have a valid APICID for cpu: */
+	if (per_cpu(cpu_llc_id, cpu1) == BAD_APICID)
+		return false;
 
-	return false;
+	/* Do not match if LLC id does not match: */
+	if (per_cpu(cpu_llc_id, cpu1) != per_cpu(cpu_llc_id, cpu2))
+		return false;
+
+	/*
+	 * Allow the SNC topology without warning. Return of false
+	 * means 'c' does not share the LLC of 'o'. This will be
+	 * reflected to userspace.
+	 */
+	if (!topology_same_node(c, o) && x86_match_cpu(snc_cpu))
+		return false;
+
+	return topology_sane(c, o, "llc");
 }
 
 /*
@@ -456,7 +490,8 @@ static struct sched_domain_topology_level x86_topology[] = {
 
 /*
  * Set if a package/die has multiple NUMA nodes inside.
- * AMD Magny-Cours and Intel Cluster-on-Die have this.
+ * AMD Magny-Cours, Intel Cluster-on-Die, and Intel
+ * Sub-NUMA Clustering have this.
  */
 static bool x86_has_numa_in_package;
 


* Fwd: [tip:x86/urgent] x86, sched: Allow topologies where NUMA nodes share an LLC
  2018-04-17 13:46 ` [tip:x86/urgent] x86,sched: Allow " tip-bot for Alison Schofield
@ 2018-04-17 14:02   ` Juergen Gross
  2018-04-17 14:08     ` Andrew Cooper
  0 siblings, 1 reply; 7+ messages in thread
From: Juergen Gross @ 2018-04-17 14:02 UTC (permalink / raw)
  To: xen-devel; +Cc: George.Dunlap, Dario Faggioli

Is this something we should be aware of in Xen, too?


Juergen


-------- Forwarded Message --------
Subject: [tip:x86/urgent] x86,sched: Allow topologies where NUMA nodes
share an LLC
Date: Tue, 17 Apr 2018 06:46:12 -0700
From: tip-bot for Alison Schofield <tipbot@zytor.com>


* Re: Fwd: [tip:x86/urgent] x86, sched: Allow topologies where NUMA nodes share an LLC
  2018-04-17 14:02   ` Fwd: [tip:x86/urgent] x86, sched: " Juergen Gross
@ 2018-04-17 14:08     ` Andrew Cooper
  2018-04-17 17:02       ` Dario Faggioli
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Cooper @ 2018-04-17 14:08 UTC (permalink / raw)
  To: Juergen Gross, xen-devel; +Cc: George.Dunlap, Dario Faggioli

On 17/04/18 15:02, Juergen Gross wrote:
> Is this something we should be aware of in Xen, too?

If we had something close to a working topology representation, probably.

As things stand, I don't really know, but I doubt it will cause problems
in the default case, due to us being fairly NUMA-inefficient to begin with.

I haven't had the time to turn SNC on and see what happens.

~Andrew



* Re: Fwd: [tip:x86/urgent] x86, sched: Allow topologies where NUMA nodes share an LLC
  2018-04-17 14:08     ` Andrew Cooper
@ 2018-04-17 17:02       ` Dario Faggioli
  2018-04-18  8:11         ` Anshul Makkar
  0 siblings, 1 reply; 7+ messages in thread
From: Dario Faggioli @ 2018-04-17 17:02 UTC (permalink / raw)
  To: Andrew Cooper, Juergen Gross, xen-devel; +Cc: George.Dunlap, anshul makkar



On Tue, 2018-04-17 at 15:08 +0100, Andrew Cooper wrote:
> On 17/04/18 15:02, Juergen Gross wrote:
> > Is this something we should be aware of in Xen, too?
> 
> If we had something close to a working topology representation,
> probably.
> 
True, as far as letting the guest know about these details (in
appropriate circumstances) goes.

About using it _within_ Xen, well:

"Intel's Skylake Server CPUs have a different LLC topology than
previous
generations. When in Sub-NUMA-Clustering (SNC) mode, the package is
divided
into two "slices", each containing half the cores, half the LLC, and
one
memory controller and each slice is enumerated to Linux as a NUMA
node. This is similar to how the cores and LLC were arranged for the
Cluster-On-Die (CoD) feature."

This looks similar (but not identical) to, e.g., AMD EPYC. If Xen also
sees each slice as a NUMA node as well, we already are doing something
(although, of course, we can always improve).

If not, I see ways of taking advantage of the information that not all
the cores in the socket equally share the LLC, e.g., in Credit2, by
providing a per-LLC runqueue option and/or by taking that into account
in the 'migration resistance' mechanisms (there is already work in that
direction).
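
(Purely as a sketch of the idea, with invented names and topology, not
actual Credit2 code: a per-LLC runqueue amounts to keying the runqueue
id off a per-CPU LLC identifier, so CPUs in one SNC slice share a
runqueue:)

	#include <stdio.h>

	#define NR_CPUS 8

	/* Assumed topology: two SNC slices of four CPUs each. */
	static const unsigned int cpu_llc_id[NR_CPUS] = {
		0, 0, 0, 0, 1, 1, 1, 1
	};

	int main(void)
	{
		unsigned int cpu;

		/* One runqueue per last-level cache. */
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			printf("cpu%u -> runqueue %u\n", cpu, cpu_llc_id[cpu]);
		return 0;
	}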

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/




* Re: Fwd: [tip:x86/urgent] x86, sched: Allow topologies where NUMA nodes share an LLC
  2018-04-17 17:02       ` Dario Faggioli
@ 2018-04-18  8:11         ` Anshul Makkar
  0 siblings, 0 replies; 7+ messages in thread
From: Anshul Makkar @ 2018-04-18  8:11 UTC (permalink / raw)
  To: Dario Faggioli, Andrew Cooper, Juergen Gross, xen-devel; +Cc: George.Dunlap



On 4/17/18 6:02 PM, Dario Faggioli wrote:
> On Tue, 2018-04-17 at 15:08 +0100, Andrew Cooper wrote:
>> On 17/04/18 15:02, Juergen Gross wrote:
>>> Is this something we should be aware of in Xen, too?
>> If we had something close to a working topology representation,
>> probably.
>>
> True, as far as letting the guest know about these details (in
> appropriate circumnstaces) goes.
>
> About using it _within_ Xen, well:
>
> "Intel's Skylake Server CPUs have a different LLC topology than
> previous
> generations. When in Sub-NUMA-Clustering (SNC) mode, the package is
> divided
> into two "slices", each containing half the cores, half the LLC, and
> one
> memory controller and each slice is enumerated to Linux as a NUMA
> node. This is similar to how the cores and LLC were arranged for the
> Cluster-On-Die (CoD) feature."
>
> This looks similar (but not identical) to, e.g., AMD EPYC. If Xen also
> sees each slice as a NUMA node as well, we already are doing something
> (although, of course, we can always improve).
>
> If not, I see ways of taking advantage of the information that not all
> the cores in the socket equally share the LLC, e.g., in Credit2, by
> providing a per-LLC runqueue option and/or by taking that into account
> in the 'migration resistance' mechanisms (there is already work in that
> direction).
Yeah, a per-LLC runqueue option sounds good. We can look into that.

Yes, I am working on dynamically defining migration resistance as a
factor of cache and CPU topology, i.e., instead of having a constant
value of migration resistance, it will consider the placement of CPUs,
the cache, and the sharing of caches among CPUs.
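
A back-of-the-envelope sketch of that idea (made-up names and scale
factors, just to show the shape of it, nothing resembling actual Xen
code):

	#include <stdio.h>
	#include <stdbool.h>

	/* Scale a base migration resistance by how much topology two
	 * CPUs share, instead of using one constant for every move. */
	static unsigned int migration_resistance(unsigned int base,
						 bool same_core,
						 bool same_llc,
						 bool same_node)
	{
		if (same_core)
			return base / 4;  /* SMT sibling: cache stays hot */
		if (same_llc)
			return base / 2;  /* same slice: LLC stays warm */
		if (same_node)
			return base;      /* same memory controller, cold LLC */
		return base * 2;          /* cross-node: costliest move */
	}

	int main(void)
	{
		printf("same-llc: %u, cross-node: %u\n",
		       migration_resistance(1000, false, true, true),
		       migration_resistance(1000, false, false, false));
		return 0;
	}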
>
> Regards,
> Dario
Anshul


