* [PATCH] x86: new topology for multi-NUMA-node CPUs
@ 2014-09-18 19:33 Dave Hansen
  2014-09-18 20:49 ` Peter Zijlstra
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Dave Hansen @ 2014-09-18 19:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, dave.hansen, a.p.zijlstra, mingo, hpa, brice.goglin, bp



This is my third attempt to fix this.  It's definitely simpler
than the previous set.  This takes Peter Z's suggestion to just
create and use a different topology when we see these
"Cluster-on-Die" systems.

--
From: Dave Hansen <dave.hansen@linux.intel.com>

I'm getting the spew below when booting with Haswell (Xeon
E5-2699 v3) CPUs and the "Cluster-on-Die" (CoD) feature enabled
in the BIOS.  It seems similar to the issue that some folks from
AMD ran in to on their systems and addressed in this commit:

	http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=161270fc1f9ddfc17154e0d49291472a9cdef7db

Both these Intel and AMD systems break an assumption which is
being enforced by topology_sane(): a socket may not contain more
than one NUMA node.

AMD special-cased their system by looking for a cpuid flag.  The
Intel mode depends on BIOS options, and I do not know of any
way to enumerate it other than through the tables parsed
during CPU bringup.  In other words, we have to trust
the ACPI tables <shudder>.

This patch detects the situation where a NUMA node boundary falls
in the middle of the "CPU" sched domains.  It replaces the default
topology with one that relies on the NUMA information from the
firmware (SRAT table) for all levels of sched domains above the
hyperthreads.
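
Conceptually, the check boils down to something like the sketch
below (a simplified illustration only; the actual code is in the
patch further down):

	/*
	 * If two CPUs share a physical package but cpu_to_node()
	 * disagrees about their node, the firmware has placed a NUMA
	 * node boundary inside the package.  In that case, fall back
	 * to an SMT-only per-package topology and let the NUMA sched
	 * domains describe everything above the hyperthreads.
	 */
	if (c->phys_proc_id == o->phys_proc_id &&
	    cpu_to_node(c->cpu_index) != cpu_to_node(o->cpu_index))
		set_sched_topology(numa_inside_package_topology);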

This also fixes a sysfs bug.  We used to freak out when we saw
the "mc" group cross a node boundary, so we stopped building the
MC group.  MC gets exported as the 'core_siblings_list' in
/sys/devices/system/cpu/cpu*/topology/ and this caused CPUs with
the same 'physical_package_id' to not be listed together in
'core_siblings_list'.  This violates a statement from
Documentation/ABI/testing/sysfs-devices-system-cpu:

	core_siblings: internal kernel map of cpu#'s hardware threads
	within the same physical_package_id.

	core_siblings_list: human-readable list of the logical CPU
	numbers within the same physical_package_id as cpu#.

The sysfs effects here cause an issue with the hwloc tool where
it gets confused and thinks there are more sockets than are
physically present.

Before this patch, there are two packages:

# cd /sys/devices/system/cpu/
# cat cpu*/topology/physical_package_id | sort | uniq -c
     18 0
     18 1

But 4 _sets_ of core siblings:

# cat cpu*/topology/core_siblings_list | sort | uniq -c
      9 0-8
      9 18-26
      9 27-35
      9 9-17

After this set, there are only 2 sets of core siblings, which
is what we expect for a 2-socket system.

# cat cpu*/topology/physical_package_id | sort | uniq -c
     18 0
     18 1
# cat cpu*/topology/core_siblings_list | sort | uniq -c
     18 0-17
     18 18-35


Example spew:
...
	NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
	 #2  #3  #4  #5  #6  #7  #8
	.... node  #1, CPUs:    #9
	------------[ cut here ]------------
	WARNING: CPU: 9 PID: 0 at /home/ak/hle/linux-hle-2.6/arch/x86/kernel/smpboot.c:306 topology_sane.isra.2+0x74/0x90()
	sched: CPU #9's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
	Modules linked in:
	CPU: 9 PID: 0 Comm: swapper/9 Not tainted 3.17.0-rc1-00293-g8e01c4d-dirty #631
	Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0036.R05.1407140519 07/14/2014
	0000000000000009 ffff88046ddabe00 ffffffff8172e485 ffff88046ddabe48
	ffff88046ddabe38 ffffffff8109691d 000000000000b001 0000000000000009
	ffff88086fc12580 000000000000b020 0000000000000009 ffff88046ddabe98
	Call Trace:
	[<ffffffff8172e485>] dump_stack+0x45/0x56
	[<ffffffff8109691d>] warn_slowpath_common+0x7d/0xa0
	[<ffffffff8109698c>] warn_slowpath_fmt+0x4c/0x50
	[<ffffffff81074f94>] topology_sane.isra.2+0x74/0x90
	[<ffffffff8107530e>] set_cpu_sibling_map+0x31e/0x4f0
	[<ffffffff8107568d>] start_secondary+0x1ad/0x240
	---[ end trace 3fe5f587a9fcde61 ]---
	#10 #11 #12 #13 #14 #15 #16 #17
	.... node  #2, CPUs:   #18 #19 #20 #21 #22 #23 #24 #25 #26
	.... node  #3, CPUs:   #27 #28 #29 #30 #31 #32 #33 #34 #35

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: ak@linux.intel.com
Cc: brice.goglin@gmail.com
Cc: bp@alien8.de
---

 b/arch/x86/kernel/smpboot.c |   48 +++++++++++++++++++++++++++++++++++++-------
 1 file changed, 41 insertions(+), 7 deletions(-)

diff -puN arch/x86/kernel/smpboot.c~ignore-mc-sanity-check arch/x86/kernel/smpboot.c
--- a/arch/x86/kernel/smpboot.c~ignore-mc-sanity-check	2014-09-18 11:23:03.422187764 -0700
+++ b/arch/x86/kernel/smpboot.c	2014-09-18 12:32:01.622256575 -0700
@@ -296,11 +296,19 @@ void smp_store_cpu_info(int id)
 }
 
 static bool
+topology_same_node(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
+
+	return (cpu_to_node(cpu1) == cpu_to_node(cpu2));
+}
+
+static bool
 topology_sane(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o, const char *name)
 {
 	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
 
-	return !WARN_ONCE(cpu_to_node(cpu1) != cpu_to_node(cpu2),
+	return !WARN_ONCE(!topology_same_node(c, o),
 		"sched: CPU #%d's %s-sibling CPU #%d is not on the same node! "
 		"[node: %d != %d]. Ignoring dependency.\n",
 		cpu1, name, cpu2, cpu_to_node(cpu1), cpu_to_node(cpu2));
@@ -341,17 +349,41 @@ static bool match_llc(struct cpuinfo_x86
 	return false;
 }
 
+/*
+ * Unlike the other levels, we do not enforce keeping a
+ * multicore group inside a NUMA node.  If this happens, we will
+ * discard the MC level of the topology later.
+ */
 static bool match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 {
-	if (c->phys_proc_id == o->phys_proc_id) {
-		if (cpu_has(c, X86_FEATURE_AMD_DCM))
-			return true;
-
-		return topology_sane(c, o, "mc");
-	}
+	if (c->phys_proc_id == o->phys_proc_id)
+		return true;
 	return false;
 }
 
+static struct sched_domain_topology_level numa_inside_package_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+	{ NULL, },
+};
+/*
+ * set_sched_topology() sets the topology internal to a CPU.  The
+ * NUMA topologies are layered on top of it to build the full
+ * system topology.
+ *
+ * If NUMA nodes are observed to occur within a CPU package, this
+ * function should be called.  It forces the sched domain code to
+ * only use the SMT level for the CPU portion of the topology.
+ * This essentially falls back to relying on NUMA information
+ * from the SRAT table to describe the entire system topology
+ * (except for hyperthreads).
+ */
+static void primarily_use_numa_for_topology(void)
+{
+	set_sched_topology(numa_inside_package_topology);
+}
+
 void set_cpu_sibling_map(int cpu)
 {
 	bool has_smt = smp_num_siblings > 1;
@@ -410,6 +442,8 @@ void set_cpu_sibling_map(int cpu)
 			} else if (i != cpu && !c->booted_cores)
 				c->booted_cores = cpu_data(i).booted_cores;
 		}
+		if (match_mc(c, o) == !topology_same_node(c, o))
+			primarily_use_numa_for_topology();
 	}
 }
 
_


* Re: [PATCH] x86: new topology for multi-NUMA-node CPUs
  2014-09-18 19:33 [PATCH] x86: new topology for multi-NUMA-node CPUs Dave Hansen
@ 2014-09-18 20:49 ` Peter Zijlstra
  2014-09-18 21:57 ` Dave Hansen
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2014-09-18 20:49 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, dave.hansen, mingo, hpa, brice.goglin, bp

On Thu, Sep 18, 2014 at 12:33:34PM -0700, Dave Hansen wrote:
> 
> 
> This is my third attempt to fix this.  It's definitely simpler
> than the previous set.  This takes Peter Z's suggestion to just
> create and use a different topology when we see these
> "Cluster-on-Die" systems.


> +static struct sched_domain_topology_level numa_inside_package_topology[] = {
> +#ifdef CONFIG_SCHED_SMT
> +	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> +#endif
> +	{ NULL, },
> +};

Yes, that looks ok. The only thing is that we need to fix the MC and DIE
masks. But I can do that in a separate patch. This should ideally still
contain the MC domain (as in LLC) mask, just not the DIE domain (as in
pkg).

But as you already saw, bits are somewhat icky atm.
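
For reference, here is a rough sketch of a topology table that keeps
the LLC-backed MC level while still dropping the DIE (package) level;
it is approximately what the version merged later in this thread ends
up doing:

	static struct sched_domain_topology_level numa_inside_package_topology[] = {
	#ifdef CONFIG_SCHED_SMT
		{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
	#endif
	#ifdef CONFIG_SCHED_MC
		/* Keep the LLC-based MC level; DIE is intentionally
		 * omitted so NUMA describes everything above MC. */
		{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
	#endif
		{ NULL, },
	};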


* Re: [PATCH] x86: new topology for multi-NUMA-node CPUs
  2014-09-18 19:33 [PATCH] x86: new topology for multi-NUMA-node CPUs Dave Hansen
  2014-09-18 20:49 ` Peter Zijlstra
@ 2014-09-18 21:57 ` Dave Hansen
  2014-09-19 11:45 ` Karel Zak
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Dave Hansen @ 2014-09-18 21:57 UTC (permalink / raw)
  To: linux-kernel; +Cc: dave.hansen, a.p.zijlstra, mingo, hpa, brice.goglin, bp

On 09/18/2014 12:33 PM, Dave Hansen wrote:
> @@ -410,6 +442,8 @@ void set_cpu_sibling_map(int cpu)
>  			} else if (i != cpu && !c->booted_cores)
>  				c->booted_cores = cpu_data(i).booted_cores;
>  		}
> +		if (match_mc(c, o) == !topology_same_node(c, o))
> +			primarily_use_numa_for_topology();
>  	}
>  }

I went to test this on some more systems.  The "== !" above should be a
"!=".  The test is meant to see if a CPU pair is in the same mc group,
but not in the same node.

I'll fix it in the next version.
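
For clarity, one way to express the intended test ("same mc group,
but not the same node") directly would be something like this sketch
(illustration only, not the actual follow-up patch):

	/* Switch to the NUMA-inside-package topology only when this
	 * CPU pair shares an MC group but sits on different nodes. */
	if (match_mc(c, o) && !topology_same_node(c, o))
		primarily_use_numa_for_topology();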


* Re: [PATCH] x86: new topology for multi-NUMA-node CPUs
  2014-09-18 19:33 [PATCH] x86: new topology for multi-NUMA-node CPUs Dave Hansen
  2014-09-18 20:49 ` Peter Zijlstra
  2014-09-18 21:57 ` Dave Hansen
@ 2014-09-19 11:45 ` Karel Zak
  2014-09-19 18:15   ` Dave Hansen
  2014-09-21 17:56 ` Brice Goglin
  2014-09-24 14:58 ` [tip:sched/core] x86, sched: Add " tip-bot for Dave Hansen
  4 siblings, 1 reply; 8+ messages in thread
From: Karel Zak @ 2014-09-19 11:45 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, dave.hansen, a.p.zijlstra, mingo, hpa, brice.goglin, bp

On Thu, Sep 18, 2014 at 12:33:34PM -0700, Dave Hansen wrote:
> The sysfs effects here cause an issue with the hwloc tool where
> it gets confused and thinks there are more sockets than are
> physically present.
> 
> Before this patch, there are two packages:
> 
> # cd /sys/devices/system/cpu/
> # cat cpu*/topology/physical_package_id | sort | uniq -c
>      18 0
>      18 1
> 
> But 4 _sets_ of core siblings:
> 
> # cat cpu*/topology/core_siblings_list | sort | uniq -c
>       9 0-8
>       9 18-26
>       9 27-35
>       9 9-17
> 
> After this set, there are only 2 sets of core siblings, which
> is what we expect for a 2-socket system.
> 
> # cat cpu*/topology/physical_package_id | sort | uniq -c
>      18 0
>      18 1
> # cat cpu*/topology/core_siblings_list | sort | uniq -c
>      18 0-17
>      18 18-35

 hmm... it would also be nice to test it with lscpu(1) from
 util-linux (but it uses maps rather than lists from cpu*/topology/).

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com


* Re: [PATCH] x86: new topology for multi-NUMA-node CPUs
  2014-09-19 11:45 ` Karel Zak
@ 2014-09-19 18:15   ` Dave Hansen
  2014-09-22  7:43     ` Karel Zak
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Hansen @ 2014-09-19 18:15 UTC (permalink / raw)
  To: Karel Zak
  Cc: linux-kernel, dave.hansen, a.p.zijlstra, mingo, hpa, brice.goglin, bp

On 09/19/2014 04:45 AM, Karel Zak wrote:
>  hmm... it would also be nice to test it with lscpu(1) from
>  util-linux (but it uses maps rather than lists from cpu*/topology/).

Here's the output with and without Cluster-on-Die enabled.

Everything looks OK to me.  The cache size changes are what the CPU
actually tells us through CPUID leaves.

[root@otc-grantley-03 ~]# diff -ru lscpu.nocod lscpu.wcod
--- lscpu.nocod	2014-09-19 04:01:17.846336595 -0700
+++ lscpu.wcod	2014-09-19 04:10:56.557383761 -0700
@@ -6,18 +6,20 @@
 Thread(s) per core:    2
 Core(s) per socket:    18
 Socket(s):             2
-NUMA node(s):          2
+NUMA node(s):          4
 Vendor ID:             GenuineIntel
 CPU family:            6
 Model:                 63
 Model name:            Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
 Stepping:              2
-CPU MHz:               1340.468
-BogoMIPS:              4590.53
+CPU MHz:               1360.234
+BogoMIPS:              4590.67
 Virtualization:        VT-x
 L1d cache:             32K
 L1i cache:             32K
 L2 cache:              256K
-L3 cache:              46080K
-NUMA node0 CPU(s):     0-17,36-53
-NUMA node1 CPU(s):     18-35,54-71
+L3 cache:              23040K
+NUMA node0 CPU(s):     0-8,36-44
+NUMA node1 CPU(s):     9-17,45-53
+NUMA node2 CPU(s):     18-26,54-62
+NUMA node3 CPU(s):     27-35,63-71




* Re: [PATCH] x86: new topology for multi-NUMA-node CPUs
  2014-09-18 19:33 [PATCH] x86: new topology for multi-NUMA-node CPUs Dave Hansen
                   ` (2 preceding siblings ...)
  2014-09-19 11:45 ` Karel Zak
@ 2014-09-21 17:56 ` Brice Goglin
  2014-09-24 14:58 ` [tip:sched/core] x86, sched: Add " tip-bot for Dave Hansen
  4 siblings, 0 replies; 8+ messages in thread
From: Brice Goglin @ 2014-09-21 17:56 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel; +Cc: dave.hansen, a.p.zijlstra, mingo, hpa, bp

On 18/09/2014 21:33, Dave Hansen wrote:
> After this set, there are only 2 sets of core siblings, which
> is what we expect for a 2-socket system.
>
> # cat cpu*/topology/physical_package_id | sort | uniq -c
>      18 0
>      18 1
> # cat cpu*/topology/core_siblings_list | sort | uniq -c
>      18 0-17
>      18 18-35
>

Thanks a lot for working on this. I can't comment on the code but at
least the above core_siblings values should fix the original problem
observed with hwloc. I don't have an E5 v3 to test with, but installing
the hwloc package and running lstopo should confirm that it now sees a
single socket per group of 2 NUMA nodes as expected.

Brice



* Re: [PATCH] x86: new topology for multi-NUMA-node CPUs
  2014-09-19 18:15   ` Dave Hansen
@ 2014-09-22  7:43     ` Karel Zak
  0 siblings, 0 replies; 8+ messages in thread
From: Karel Zak @ 2014-09-22  7:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, dave.hansen, a.p.zijlstra, mingo, hpa, brice.goglin, bp

On Fri, Sep 19, 2014 at 11:15:05AM -0700, Dave Hansen wrote:
> On 09/19/2014 04:45 AM, Karel Zak wrote:
> >  hmm... it would also be nice to test it with lscpu(1) from
> >  util-linux (but it uses maps rather than lists from cpu*/topology/).
> 
> Here's the output with and without Cluster-on-Die enabled.
> 
> Everything looks OK to me.  The cache size changes are what the CPU
> actually tells us through CPUID leaves.

 Thanks Dave!

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com


* [tip:sched/core] x86, sched: Add new topology for multi-NUMA-node CPUs
  2014-09-18 19:33 [PATCH] x86: new topology for multi-NUMA-node CPUs Dave Hansen
                   ` (3 preceding siblings ...)
  2014-09-21 17:56 ` Brice Goglin
@ 2014-09-24 14:58 ` tip-bot for Dave Hansen
  4 siblings, 0 replies; 8+ messages in thread
From: tip-bot for Dave Hansen @ 2014-09-24 14:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, imammedo, torvalds, peterz, toshi.kani,
	bp, tglx, dave.hansen, hpa, prarit, rientjes

Commit-ID:  cebf15eb09a2fd2fa73ee4faa9c4d2f813cf0f09
Gitweb:     http://git.kernel.org/tip/cebf15eb09a2fd2fa73ee4faa9c4d2f813cf0f09
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Thu, 18 Sep 2014 12:33:34 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Sep 2014 14:47:14 +0200

x86, sched: Add new topology for multi-NUMA-node CPUs

I'm getting the spew below when booting with Haswell (Xeon
E5-2699 v3) CPUs and the "Cluster-on-Die" (CoD) feature enabled
in the BIOS.  It seems similar to the issue that some folks from
AMD ran in to on their systems and addressed in this commit:

  161270fc1f9d ("x86/smp: Fix topology checks on AMD MCM CPUs")

Both these Intel and AMD systems break an assumption which is
being enforced by topology_sane(): a socket may not contain more
than one NUMA node.

AMD special-cased their system by looking for a cpuid flag.  The
Intel mode depends on BIOS options, and I do not know of any
way to enumerate it other than through the tables parsed
during CPU bringup.  In other words, we have to trust
the ACPI tables <shudder>.

This patch detects the situation where a NUMA node boundary falls
in the middle of the "CPU" sched domains.  It replaces the default
topology with one that relies on the NUMA information from the
firmware (SRAT table) for all levels of sched domains above the
hyperthreads.

This also fixes a sysfs bug.  We used to freak out when we saw
the "mc" group cross a node boundary, so we stopped building the
MC group.  MC gets exported as the 'core_siblings_list' in
/sys/devices/system/cpu/cpu*/topology/ and this caused CPUs with
the same 'physical_package_id' to not be listed together in
'core_siblings_list'.  This violates a statement from
Documentation/ABI/testing/sysfs-devices-system-cpu:

	core_siblings: internal kernel map of cpu#'s hardware threads
	within the same physical_package_id.

	core_siblings_list: human-readable list of the logical CPU
	numbers within the same physical_package_id as cpu#.

The sysfs effects here cause an issue with the hwloc tool where
it gets confused and thinks there are more sockets than are
physically present.

Before this patch, there are two packages:

# cd /sys/devices/system/cpu/
# cat cpu*/topology/physical_package_id | sort | uniq -c
     18 0
     18 1

But 4 _sets_ of core siblings:

# cat cpu*/topology/core_siblings_list | sort | uniq -c
      9 0-8
      9 18-26
      9 27-35
      9 9-17

After this set, there are only 2 sets of core siblings, which
is what we expect for a 2-socket system.

# cat cpu*/topology/physical_package_id | sort | uniq -c
     18 0
     18 1
# cat cpu*/topology/core_siblings_list | sort | uniq -c
     18 0-17
     18 18-35

Example spew:
...
	NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
	 #2  #3  #4  #5  #6  #7  #8
	.... node  #1, CPUs:    #9
	------------[ cut here ]------------
	WARNING: CPU: 9 PID: 0 at /home/ak/hle/linux-hle-2.6/arch/x86/kernel/smpboot.c:306 topology_sane.isra.2+0x74/0x90()
	sched: CPU #9's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
	Modules linked in:
	CPU: 9 PID: 0 Comm: swapper/9 Not tainted 3.17.0-rc1-00293-g8e01c4d-dirty #631
	Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0036.R05.1407140519 07/14/2014
	0000000000000009 ffff88046ddabe00 ffffffff8172e485 ffff88046ddabe48
	ffff88046ddabe38 ffffffff8109691d 000000000000b001 0000000000000009
	ffff88086fc12580 000000000000b020 0000000000000009 ffff88046ddabe98
	Call Trace:
	[<ffffffff8172e485>] dump_stack+0x45/0x56
	[<ffffffff8109691d>] warn_slowpath_common+0x7d/0xa0
	[<ffffffff8109698c>] warn_slowpath_fmt+0x4c/0x50
	[<ffffffff81074f94>] topology_sane.isra.2+0x74/0x90
	[<ffffffff8107530e>] set_cpu_sibling_map+0x31e/0x4f0
	[<ffffffff8107568d>] start_secondary+0x1ad/0x240
	---[ end trace 3fe5f587a9fcde61 ]---
	#10 #11 #12 #13 #14 #15 #16 #17
	.... node  #2, CPUs:   #18 #19 #20 #21 #22 #23 #24 #25 #26
	.... node  #3, CPUs:   #27 #28 #29 #30 #31 #32 #33 #34 #35

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
[ Added LLC domain and s/match_mc/match_die/ ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: brice.goglin@gmail.com
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/20140918193334.C065EBCE@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/smpboot.c | 55 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 46 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 2d872e0..8de8eb7 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -296,11 +296,19 @@ void smp_store_cpu_info(int id)
 }
 
 static bool
+topology_same_node(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
+
+	return (cpu_to_node(cpu1) == cpu_to_node(cpu2));
+}
+
+static bool
 topology_sane(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o, const char *name)
 {
 	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
 
-	return !WARN_ONCE(cpu_to_node(cpu1) != cpu_to_node(cpu2),
+	return !WARN_ONCE(!topology_same_node(c, o),
 		"sched: CPU #%d's %s-sibling CPU #%d is not on the same node! "
 		"[node: %d != %d]. Ignoring dependency.\n",
 		cpu1, name, cpu2, cpu_to_node(cpu1), cpu_to_node(cpu2));
@@ -341,17 +349,44 @@ static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 	return false;
 }
 
-static bool match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+/*
+ * Unlike the other levels, we do not enforce keeping a
+ * multicore group inside a NUMA node.  If this happens, we will
+ * discard the MC level of the topology later.
+ */
+static bool match_die(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 {
-	if (c->phys_proc_id == o->phys_proc_id) {
-		if (cpu_has(c, X86_FEATURE_AMD_DCM))
-			return true;
-
-		return topology_sane(c, o, "mc");
-	}
+	if (c->phys_proc_id == o->phys_proc_id)
+		return true;
 	return false;
 }
 
+static struct sched_domain_topology_level numa_inside_package_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_MC
+	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+#endif
+	{ NULL, },
+};
+/*
+ * set_sched_topology() sets the topology internal to a CPU.  The
+ * NUMA topologies are layered on top of it to build the full
+ * system topology.
+ *
+ * If NUMA nodes are observed to occur within a CPU package, this
+ * function should be called.  It forces the sched domain code to
+ * only use the SMT level for the CPU portion of the topology.
+ * This essentially falls back to relying on NUMA information
+ * from the SRAT table to describe the entire system topology
+ * (except for hyperthreads).
+ */
+static void primarily_use_numa_for_topology(void)
+{
+	set_sched_topology(numa_inside_package_topology);
+}
+
 void set_cpu_sibling_map(int cpu)
 {
 	bool has_smt = smp_num_siblings > 1;
@@ -388,7 +423,7 @@ void set_cpu_sibling_map(int cpu)
 	for_each_cpu(i, cpu_sibling_setup_mask) {
 		o = &cpu_data(i);
 
-		if ((i == cpu) || (has_mp && match_mc(c, o))) {
+		if ((i == cpu) || (has_mp && match_die(c, o))) {
 			link_mask(core, cpu, i);
 
 			/*
@@ -410,6 +445,8 @@ void set_cpu_sibling_map(int cpu)
 			} else if (i != cpu && !c->booted_cores)
 				c->booted_cores = cpu_data(i).booted_cores;
 		}
+		if (match_die(c, o) == !topology_same_node(c, o))
+			primarily_use_numa_for_topology();
 	}
 }
 


