* [RFC][PATCH] x86, sched: allow topologies where NUMA nodes share an LLC
@ 2017-11-06 22:15 Dave Hansen
  2017-11-07  8:30 ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Hansen @ 2017-11-06 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, tony.luck, tim.c.chen, hpa, bp, peterz, rientjes,
	imammedo, prarit, toshi.kani, brice.goglin, mingo


From: Dave Hansen <dave.hansen@linux.intel.com>

Intel's Skylake Server CPUs have a different LLC topology than previous
generations.  When in Sub-NUMA-Clustering (SNC) mode, the package is
divided into two "slices", each containing half the cores, half the LLC,
and one memory controller, and each slice is enumerated to Linux as a
NUMA node.  This is similar to how the cores and LLC were arranged
for the Cluster-On-Die (CoD) feature.

CoD allowed the same cache line to be present in each half of the LLC.
But, with SNC, each line is only ever present in *one* slice.  This
means that the portion of the LLC *available* to a CPU depends on the
data being accessed:

    Remote socket: entire package LLC is shared
    Local socket->local slice: data goes into local slice LLC
    Local socket->remote slice: data goes into remote-slice LLC.  Slightly
                    		higher latency than local slice LLC.

The biggest implication from this is that a process accessing all
NUMA-local memory only sees half the LLC capacity.

The CPU describes its cache hierarchy with the CPUID instruction.  One
of the CPUID leaves enumerates the "logical processors sharing this
cache".  This information is used for scheduling decisions so that tasks
move more freely between CPUs sharing the cache.

But, the CPUID for the SNC configuration discussed above enumerates
the LLC as being shared by the entire package.  This is not 100%
precise because the entire cache is not usable by all accesses.  But,
it *is* the way the hardware enumerates itself, and this is not likely
to change.

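For reference, the "logical processors sharing this cache" field can be
read from user space via CPUID leaf 0x4.  The program below is only an
illustrative sketch (it is not part of this patch); the field offsets
follow the SDM:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx, subleaf;

	for (subleaf = 0; ; subleaf++) {
		if (!__get_cpuid_count(4, subleaf, &eax, &ebx, &ecx, &edx))
			break;
		if (!(eax & 0x1f))	/* cache type 0: no more caches */
			break;
		/* EAX[7:5]: cache level; EAX[25:14]: max IDs sharing this cache, minus 1 */
		printf("L%u cache: shared by up to %u logical CPUs\n",
		       (eax >> 5) & 0x7, ((eax >> 14) & 0xfff) + 1);
	}
	return 0;
}

On an SNC-enabled part, the L3 line from such a tool is expected to show
a sharing count that spans the whole package, which is exactly the
enumeration described above.
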
This breaks the topology_sane() check in the smpboot.c code because
this topology is considered not-sane.  To fix this, add a model-
specific check that skips the topology_sane() call on these systems.
Also, just as with "Cluster-on-Die", throw out the "coregroup"
sched_domain_topology_level and use the NUMA information from the
SRAT alone.

This is OK, at least on the hardware we are immediately concerned
about, because the LLC sharing happens at both the slice and the
package level, which are also NUMA boundaries.

This patch eliminates a warning that looks like this:

	sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Luck, Tony <tony.luck@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: brice.goglin@gmail.com
Cc: Ingo Molnar <mingo@kernel.org>
---

 b/arch/x86/kernel/smpboot.c |   51 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 40 insertions(+), 11 deletions(-)

diff -puN arch/x86/kernel/smpboot.c~x86-numa-nodes-share-llc arch/x86/kernel/smpboot.c
--- a/arch/x86/kernel/smpboot.c~x86-numa-nodes-share-llc	2017-11-06 13:29:49.319087764 -0800
+++ b/arch/x86/kernel/smpboot.c	2017-11-06 13:45:12.902085460 -0800
@@ -77,6 +77,7 @@
 #include <asm/i8259.h>
 #include <asm/realmode.h>
 #include <asm/misc.h>
+#include <asm/intel-family.h>
 
 /* Number of siblings per CPU package */
 int smp_num_siblings = 1;
@@ -457,15 +458,50 @@ static bool match_smt(struct cpuinfo_x86
 	return false;
 }
 
+/*
+ * Set if a package/die has multiple NUMA nodes inside.
+ * AMD Magny-Cours, Intel Cluster-on-Die, and Intel
+ * Sub-NUMA Clustering have this.
+ */
+static bool x86_has_numa_in_package;
+
 static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 {
 	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
 
-	if (per_cpu(cpu_llc_id, cpu1) != BAD_APICID &&
-	    per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2))
-		return topology_sane(c, o, "llc");
+	/* Do not match if we do not have a valid APICID for cpu: */
+	if (per_cpu(cpu_llc_id, cpu1) == BAD_APICID)
+		return false;
+
+	/* Do not match if LLC id does not match: */
+	if (per_cpu(cpu_llc_id, cpu1) != per_cpu(cpu_llc_id, cpu2))
+		return false;
 
-	return false;
+	/*
+	 * Some Intel CPUs enumerate an LLC that is shared by
+	 * multiple NUMA nodes.  The LLC on these systems is
+	 * shared for off-package data access but private to the
+	 * NUMA node (half of the package) for on-package access.
+	 *
+	 * CPUID can only enumerate the cache as being shared *or*
+	 * unshared, but not this particular configuration.  The
+	 * CPU in this case enumerates the cache to be shared
+	 * across the entire package (spanning both NUMA nodes).
+	 */
+	if (!topology_same_node(c, o) &&
+	    (c->x86_model == INTEL_FAM6_SKYLAKE_X)) {
+		/* Use NUMA instead of coregroups for scheduling: */
+		x86_has_numa_in_package = true;
+
+		/*
+		 * Now, tell the truth, that the LLC matches. But,
+		 * note that throwing away coregroups for
+		 * scheduling means this will have no actual effect.
+		 */
+		return true;
+	}
+
+	return topology_sane(c, o, "llc");
 }
 
 /*
@@ -521,12 +557,6 @@ static struct sched_domain_topology_leve
 	{ NULL, },
 };
 
-/*
- * Set if a package/die has multiple NUMA nodes inside.
- * AMD Magny-Cours and Intel Cluster-on-Die have this.
- */
-static bool x86_has_numa_in_package;
-
 void set_cpu_sibling_map(int cpu)
 {
 	bool has_smt = smp_num_siblings > 1;
@@ -553,7 +583,6 @@ void set_cpu_sibling_map(int cpu)
 
 		if ((i == cpu) || (has_mp && match_llc(c, o)))
 			link_mask(cpu_llc_shared_mask, cpu, i);
-
 	}
 
 	/*
_


* Re: [RFC][PATCH] x86, sched: allow topologies where NUMA nodes share an LLC
  2017-11-06 22:15 [RFC][PATCH] x86, sched: allow topologies where NUMA nodes share an LLC Dave Hansen
@ 2017-11-07  8:30 ` Peter Zijlstra
  2017-11-07 16:22   ` Dave Hansen
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2017-11-07  8:30 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, tony.luck, tim.c.chen, hpa, bp, rientjes, imammedo,
	prarit, toshi.kani, brice.goglin, mingo

On Mon, Nov 06, 2017 at 02:15:00PM -0800, Dave Hansen wrote:

> But, the CPUID for the SNC configuration discussed above enumerates
> the LLC as being shared by the entire package.  This is not 100%
> precise because the entire cache is not usable by all accesses.  But,
> it *is* the way the hardware enumerates itself, and this is not likely
> to change.

So CPUID and SRAT will remain inconsistent; even in future products?
That would absolutely blow chunks.

If that is the case, we'd best use a fake feature like
X86_BUG_TOPOLOGY_BROKEN and use that instead of an ever growing list of
models in this code.

> +/*
> + * Set if a package/die has multiple NUMA nodes inside.
> + * AMD Magny-Cours, Intel Cluster-on-Die, and Intel
> + * Sub-NUMA Clustering have this.
> + */
> +static bool x86_has_numa_in_package;
> +
>  static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
>  {
>  	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
>  
> +	/* Do not match if we do not have a valid APICID for cpu: */
> +	if (per_cpu(cpu_llc_id, cpu1) == BAD_APICID)
> +		return false;
> +
> +	/* Do not match if LLC id does not match: */
> +	if (per_cpu(cpu_llc_id, cpu1) != per_cpu(cpu_llc_id, cpu2))
> +		return false;
>  
> +	/*
> +	 * Some Intel CPUs enumerate an LLC that is shared by
> +	 * multiple NUMA nodes.  The LLC on these systems is
> +	 * shared for off-package data access but private to the
> +	 * NUMA node (half of the package) for on-package access.
> +	 *
> +	 * CPUID can only enumerate the cache as being shared *or*
> +	 * unshared, but not this particular configuration.  The
> +	 * CPU in this case enumerates the cache to be shared
> +	 * across the entire package (spanning both NUMA nodes).
> +	 */
> +	if (!topology_same_node(c, o) &&
> +	    (c->x86_model == INTEL_FAM6_SKYLAKE_X)) {

This needs a c->x86_vendor test; imagine the fun when AMD releases a
part with model == SKX ...

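Concretely, something like this untested sketch, using the existing
X86_VENDOR_INTEL constant:

	if (!topology_same_node(c, o) &&
	    c->x86_vendor == X86_VENDOR_INTEL &&
	    c->x86_model == INTEL_FAM6_SKYLAKE_X) {
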
> +		/* Use NUMA instead of coregroups for scheduling: */
> +		x86_has_numa_in_package = true;
> +
> +		/*
> +		 * Now, tell the truth, that the LLC matches. But,
> +		 * note that throwing away coregroups for
> +		 * scheduling means this will have no actual effect.
> +		 */
> +		return true;

What are the ramifications here? Is anybody else using that cpumask
outside of the scheduler topology setup?

> +	}
> +
> +	return topology_sane(c, o, "llc");
>  }


* Re: [RFC][PATCH] x86, sched: allow topologies where NUMA nodes share an LLC
  2017-11-07  8:30 ` Peter Zijlstra
@ 2017-11-07 16:22   ` Dave Hansen
  2017-11-07 19:56     ` Borislav Petkov
  2017-11-08  9:31     ` Peter Zijlstra
  0 siblings, 2 replies; 8+ messages in thread
From: Dave Hansen @ 2017-11-07 16:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, tony.luck, tim.c.chen, hpa, bp, rientjes, imammedo,
	prarit, toshi.kani, brice.goglin, mingo

On 11/07/2017 12:30 AM, Peter Zijlstra wrote:
> On Mon, Nov 06, 2017 at 02:15:00PM -0800, Dave Hansen wrote:
> 
>> But, the CPUID for the SNC configuration discussed above enumerates
>> the LLC as being shared by the entire package.  This is not 100%
>> precise because the entire cache is not usable by all accesses.  But,
>> it *is* the way the hardware enumerates itself, and this is not likely
>> to change.
> 
> So CPUID and SRAT will remain inconsistent; even in future products?
> That would absolutely blow chunks.

It certainly isn't ideal as it stands.  If it was changed, what would it
be changed to?  You can not even represent the current L3 topology in
CPUID, at least not precisely.

I've been arguing we should optimize the CPUID information for
performance.  Right now, it's suboptimal for folks doing NUMA-local
allocations, and I think that's precisely the group of folks that needs
precise information.  I'm trying to get it changed going forward.

> If that is the case, we'd best use a fake feature like
> X86_BUG_TOPOLOGY_BROKEN and use that instead of an ever growing list of
> models in this code.

FWIW, I don't consider the current situation broken.  Nobody ever
promised the kernel that a NUMA node would never happen inside a socket,
or inside a cache boundary enumerated in CPUID.

The assumptions the kernel made were sane, but the CPU's description of
itself, *and* the BIOS-provided information are also sane.  But, the
world changed, some of those assumptions turned out to be wrong, and
somebody needs to adjust.

...
>> +	if (!topology_same_node(c, o) &&
>> +	    (c->x86_model == INTEL_FAM6_SKYLAKE_X)) {
> 
> This needs a c->x86_vendor test; imagine the fun when AMD releases a
> part with model == SKX ...

Yup, will do.

>> +		/* Use NUMA instead of coregroups for scheduling: */
>> +		x86_has_numa_in_package = true;
>> +
>> +		/*
>> +		 * Now, tell the truth, that the LLC matches. But,
>> +		 * note that throwing away coregroups for
>> +		 * scheduling means this will have no actual effect.
>> +		 */
>> +		return true;
> 
> What are the ramifications here? Is anybody else using that cpumask
> outside of the scheduler topology setup?

I looked for it and didn't see anything else.  I'll double check that
nothing has popped up since I hacked this together.


* Re: [RFC][PATCH] x86, sched: allow topologies where NUMA nodes share an LLC
  2017-11-07 16:22   ` Dave Hansen
@ 2017-11-07 19:56     ` Borislav Petkov
  2017-11-08  9:31     ` Peter Zijlstra
  1 sibling, 0 replies; 8+ messages in thread
From: Borislav Petkov @ 2017-11-07 19:56 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, linux-kernel, tony.luck, tim.c.chen, hpa,
	rientjes, imammedo, prarit, toshi.kani, brice.goglin, mingo

On Tue, Nov 07, 2017 at 08:22:19AM -0800, Dave Hansen wrote:
> FWIW, I don't consider the current situation broken.  Nobody ever
> promised the kernel that a NUMA node would never happen inside a socket,
> or inside a cache boundary enumerated in CPUID.

Right, I don't think the name matters. A synthetic flag which you set in
.../cpu/intel.c for each SNC configuration and then check said flag in
topology setup code is cleaner than having to add all those vendor/model
checks ...

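As an untested sketch of that (the flag name and bit number below are
made up; set_cpu_bug() and boot_cpu_has_bug() are the existing helpers):

/* arch/x86/include/asm/cpufeatures.h -- illustrative name and number: */
#define X86_BUG_NUMA_SHARES_LLC		X86_BUG(16)

/* arch/x86/kernel/cpu/intel.c, init_intel(), one line per SNC-capable model: */
if (c->x86_model == INTEL_FAM6_SKYLAKE_X)
	set_cpu_bug(c, X86_BUG_NUMA_SHARES_LLC);

/* arch/x86/kernel/smpboot.c, match_llc(), instead of the model check: */
if (!topology_same_node(c, o) && boot_cpu_has_bug(X86_BUG_NUMA_SHARES_LLC)) {
	/* Use NUMA instead of coregroups for scheduling: */
	x86_has_numa_in_package = true;
	return true;
}
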
-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [RFC][PATCH] x86, sched: allow topologies where NUMA nodes share an LLC
  2017-11-07 16:22   ` Dave Hansen
  2017-11-07 19:56     ` Borislav Petkov
@ 2017-11-08  9:31     ` Peter Zijlstra
  2017-11-09  0:00       ` Dave Hansen
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2017-11-08  9:31 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, tony.luck, tim.c.chen, hpa, bp, rientjes, imammedo,
	prarit, toshi.kani, brice.goglin, mingo

On Tue, Nov 07, 2017 at 08:22:19AM -0800, Dave Hansen wrote:
> On 11/07/2017 12:30 AM, Peter Zijlstra wrote:
> > On Mon, Nov 06, 2017 at 02:15:00PM -0800, Dave Hansen wrote:
> > 
> >> But, the CPUID for the SNC configuration discussed above enumerates
> >> the LLC as being shared by the entire package.  This is not 100%
> >> precise because the entire cache is not usable by all accesses.  But,
> >> it *is* the way the hardware enumerates itself, and this is not likely
> >> to change.
> > 
> > So CPUID and SRAT will remain inconsistent; even in future products?
> > That would absolutely blow chunks.
> 
> It certainly isn't ideal as it stands.  If it was changed, what would it
> be changed to?  You can not even represent the current L3 topology in
> CPUID, at least not precisely.
> 
> I've been arguing we should optimize the CPUID information for
> performance.  Right now, it's suboptimal for folks doing NUMA-local
> allocations, and I think that's precisely the group of folks that needs
> precise information.  I'm trying to get it changed going forward.

So this SNC situation is indeed very intricate and cannot be accurately
represented in CPUID. In fact it's decidedly complex, with matching
complex performance characteristics (I suspect).

People doing NUMA-local stuff care about performance (otherwise they'd
not bother dealing with the NUMA stuff to begin with); however, there
are plenty of people mostly ignoring small NUMA because the NUMA factor
is fairly low these days.

And SNC makes it even smaller; it effectively puts a cache in between
the two on-die nodes; not entirely unlike the s390 BOOK domain. Which
makes ignoring NUMA even more tempting.

What does this topology approach do for those workloads?

> > If that is the case, we'd best use a fake feature like
> > X86_BUG_TOPOLOGY_BROKEN and use that instead of an ever growing list of
> > models in this code.
> 
> FWIW, I don't consider the current situation broken.  Nobody ever
> promised the kernel that a NUMA node would never happen inside a socket,
> or inside a cache boundary enumerated in CPUID.

It's the last one that I really find dodgy...

> The assumptions the kernel made were sane, but the CPU's description of
> itself, *and* the BIOS-provided information are also sane.  But, the
> world changed, some of those assumptions turned out to be wrong, and
> somebody needs to adjust.

The thing is: this patch effectively says CPUID *is* wrong. It says we
only consider NUMA local cache slices, since that is all that is
available to the local cores.

So what does it mean for cores to share a cache? What does CPUID
describe?

Should we not look at it as a node local L3 with a socket wide L3.5
(I'm not calling it L4 to avoid confusion with the Crystal Well iGPU
stuff) which happens to be tightly coupled? Your statement that the
local slice is faster seems to support that view.

In that case CPUID really is wrong; L3 should be node local. And the
L3.5 should be represented in the NUMA topology (SLIT) as a lower factor
between the two nodes.

Is this closeness between the nodes appropriately described by current
SLIT tables? Can I get: cat /sys/devices/system/node/node*/distance
from a machine that has this enabled?

> >> +		/* Use NUMA instead of coregroups for scheduling: */
> >> +		x86_has_numa_in_package = true;
> >> +
> >> +		/*
> >> +		 * Now, tell the truth, that the LLC matches. But,
> >> +		 * note that throwing away coregroups for
> >> +		 * scheduling means this will have no actual effect.
> >> +		 */
> >> +		return true;
> > 
> > What are the ramifications here? Is anybody else using that cpumask
> > outside of the scheduler topology setup?
> 
> I looked for it and didn't see anything else.  I'll double check that
> nothing has popped up since I hacked this together.

Let's put it another way: is there a sane use-case for the
multi-node-spanning coregroups thing? It seems inconsistent at best;
with the L3/L3.5 view these cores do not in fact share a cache
(although their caches are 'close').

So I would suggest making that return false and be consistent; esp. if
there are no other users, this doesn't 'cost' anything.

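I.e., an untested sketch of that suggestion, with the vendor check from
earlier folded in:

	if (!topology_same_node(c, o) &&
	    c->x86_vendor == X86_VENDOR_INTEL &&
	    c->x86_model == INTEL_FAM6_SKYLAKE_X) {
		/* Use NUMA instead of coregroups for scheduling: */
		x86_has_numa_in_package = true;

		/*
		 * With the L3/L3.5 view these cores do not actually
		 * share a cache, so do not report an LLC match; this
		 * also skips the topology_sane() warning for this
		 * known-odd layout.
		 */
		return false;
	}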

* Re: [RFC][PATCH] x86, sched: allow topologies where NUMA nodes share an LLC
  2017-11-08  9:31     ` Peter Zijlstra
@ 2017-11-09  0:00       ` Dave Hansen
  2017-11-09 14:07         ` Borislav Petkov
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Hansen @ 2017-11-09  0:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, tony.luck, tim.c.chen, hpa, bp, rientjes, imammedo,
	prarit, toshi.kani, brice.goglin, mingo

On 11/08/2017 01:31 AM, Peter Zijlstra wrote:
> And SNC makes it even smaller; it effectively puts a cache in between
> the two on-die nodes; not entirely unlike the s390 BOOK domain. Which
> makes ignoring NUMA even more tempting.
> 
> What does this topology approach do for those workloads?

What does this L3 topology do for workloads ignoring NUMA?

Let's just assume that an app is entirely NUMA unaware and that it is
accessing a large amount of memory entirely uniformly across the entire
system.  Let's also say that we just have a 2-socket system which now
shows up as having 4 NUMA nodes (one per slice, two slices per socket,
two sockets).  Let's also just say we have 20MB of L3 per socket, so
10MB per slice.

 - 1/4 of the memory accesses will be local to the slice and will have
   access to 10MB of L3.
 - 1/4 of the memory accesses will be to the *other* slice and will have
   access to 10MB of L3 (non-conflicting with the previous 10MB).  This
   access is marginally slower than the access to the local slice.
 - 1/2 of memory accesses will be cross-node and will have access to
   20MB of L3 (both slices' L3's).

That's all OK.  Without this halved-L3 configuration (or the earlier
Cluster-on-Die one), it looked like this:

 - 1/2 of the memory accesses will be local to the socket and have
   access to 20MB of L3.
 - 1/2 of memory accesses will be cross-node and will have access to
   20MB of L3 (both slices' L3's).

I'd argue that those two end up looking pretty much the same to an app.
The only difference is that the slice-local and slice-remote cache hits
have slightly different access latencies.  I don't think it's enough to
notice.

The place where it is not optimal is where an app does NUMA-local
accesses, then sees that it has 20MB of L3 (via CPUID) and expects to
*get* 20MB of L3.


* Re: [RFC][PATCH] x86, sched: allow topologies where NUMA nodes share an LLC
  2017-11-09  0:00       ` Dave Hansen
@ 2017-11-09 14:07         ` Borislav Petkov
  2017-11-09 18:13           ` Dave Hansen
  0 siblings, 1 reply; 8+ messages in thread
From: Borislav Petkov @ 2017-11-09 14:07 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, linux-kernel, tony.luck, tim.c.chen, hpa,
	rientjes, imammedo, prarit, toshi.kani, brice.goglin, mingo

On Wed, Nov 08, 2017 at 04:00:38PM -0800, Dave Hansen wrote:
> I'd argue that those two end up looking pretty much the same to an app.
> The only difference is that the slice-local and slice-remote cache hits
> have slightly different access latencies.  I don't think it's enough to
> notice.

So if it is not enough to notice, why do we even bother? I.e., is
there any workload showing any advantages at all from the resource
partitioning?

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [RFC][PATCH] x86, sched: allow topologies where NUMA nodes share an LLC
  2017-11-09 14:07         ` Borislav Petkov
@ 2017-11-09 18:13           ` Dave Hansen
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Hansen @ 2017-11-09 18:13 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, linux-kernel, tony.luck, tim.c.chen, hpa,
	rientjes, imammedo, prarit, toshi.kani, brice.goglin, mingo

On 11/09/2017 06:07 AM, Borislav Petkov wrote:
> On Wed, Nov 08, 2017 at 04:00:38PM -0800, Dave Hansen wrote:
>> I'd argue that those two end up looking pretty much the same to an app.
>> The only difference is that the slice-local and slice-remote cache hits
>> have slightly different access latencies.  I don't think it's enough to
>> notice.
> 
> So if it is not enough to notice, why do we even bother? I.e., is
> there any workload showing any advantages at all from the resources
> partitioning?

If you want the *absolute* best latency available, you turn on SNC.
You get a small boost to slice-local access and a slight penalty to
remote-slice access compared to when Sub-NUMA-Clustering is off.

You can measure this for sure, but I'll still say that most folks will
never notice.  In addition, if you have accesses interleaved everywhere,
the "slice-local boost" and "remote-slice penalty" roughly cancel
each other out.
