* [RFC][PATCH 0/6] fix topology for multi-NUMA-node CPUs
@ 2014-09-17 22:33 Dave Hansen
  2014-09-17 22:33 ` [RFC][PATCH 1/6] topology: rename topology_core_cpumask() to topology_package_cpumask() Dave Hansen
                   ` (7 more replies)
  0 siblings, 8 replies; 14+ messages in thread
From: Dave Hansen @ 2014-09-17 22:33 UTC (permalink / raw)
  To: a.p.zijlstra; +Cc: mingo, hpa, brice.goglin, bp, linux-kernel, Dave Hansen

This is a big fat RFC.  It takes quite a few liberties with the
multi-core topology level that I'm not completely comfortable
with.

It has only been tested lightly.

Full dmesg for a Cluster-on-Die system with this set applied,
and sched_debug on the command-line is here:

	http://sr71.net/~dave/intel/full-dmesg-hswep-20140917.txt

---

I'm getting the spew below when booting with Haswell (Xeon
E5-2699 v3) CPUs and the "Cluster-on-Die" (CoD) feature enabled
in the BIOS.  It seems similar to the issue that some folks from
AMD ran in to on their systems and addressed in this commit:

	http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=161270fc1f9ddfc17154e0d49291472a9cdef7db

Both the Intel and AMD systems break an assumption enforced by
topology_sane(): that a socket may not contain more than one NUMA
node.
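
For reference, the check being tripped boils down to roughly this
(a paraphrase of topology_sane() in arch/x86/kernel/smpboot.c, not
the verbatim code):

	static bool topology_sane(struct cpuinfo_x86 *c,
				  struct cpuinfo_x86 *o, const char *name)
	{
		int cpu1 = c->cpu_index, cpu2 = o->cpu_index;

		/* siblings at level 'name' must share a NUMA node */
		return !WARN_ONCE(cpu_to_node(cpu1) != cpu_to_node(cpu2),
			"sched: CPU #%d's %s-sibling CPU #%d is not on "
			"the same node! [node: %d != %d]. "
			"Ignoring dependency.\n",
			cpu1, name, cpu2,
			cpu_to_node(cpu1), cpu_to_node(cpu2));
	}

With CoD enabled, two CPUs can share a package but sit on different
NUMA nodes, so the "mc" variant of this check fires (see the spew
below).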

AMD special-cased their systems by looking for a cpuid flag.  The
Intel mode depends on BIOS options, and I know of no way to
enumerate it other than the tables parsed during the CPU bringup
process.

This set also fixes sysfs: CPUs with the same 'physical_package_id'
in /sys/devices/system/cpu/cpu*/topology/ are currently not listed
together in the same 'core_siblings_list', which violates a
statement from Documentation/ABI/testing/sysfs-devices-system-cpu:

	core_siblings: internal kernel map of cpu#'s hardware threads
	within the same physical_package_id.

	core_siblings_list: human-readable list of the logical CPU
	numbers within the same physical_package_id as cpu#.

The sysfs effects here cause an issue with the hwloc tool where
it gets confused and thinks there are more sockets than are
physically present.

Before this set, there are two packages:

# cd /sys/devices/system/cpu/
# cat cpu*/topology/physical_package_id | sort | uniq -c
     18 0
     18 1

But 4 _sets_ of core siblings:

# cat cpu*/topology/core_siblings_list | sort | uniq -c
      9 0-8
      9 18-26
      9 27-35
      9 9-17

After this set, there are only 2 sets of core siblings, which
is what we expect for a 2-socket system.

# cat cpu*/topology/physical_package_id | sort | uniq -c
     18 0
     18 1
# cat cpu*/topology/core_siblings_list | sort | uniq -c
     18 0-17
     18 18-35


Example spew:
...
	NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
	 #2  #3  #4  #5  #6  #7  #8
	.... node  #1, CPUs:    #9
	------------[ cut here ]------------
	WARNING: CPU: 9 PID: 0 at /home/ak/hle/linux-hle-2.6/arch/x86/kernel/smpboot.c:306 topology_sane.isra.2+0x74/0x90()
	sched: CPU #9's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
	Modules linked in:
	CPU: 9 PID: 0 Comm: swapper/9 Not tainted 3.17.0-rc1-00293-g8e01c4d-dirty #631
	Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0036.R05.1407140519 07/14/2014
	0000000000000009 ffff88046ddabe00 ffffffff8172e485 ffff88046ddabe48
	ffff88046ddabe38 ffffffff8109691d 000000000000b001 0000000000000009
	ffff88086fc12580 000000000000b020 0000000000000009 ffff88046ddabe98
	Call Trace:
	[<ffffffff8172e485>] dump_stack+0x45/0x56
	[<ffffffff8109691d>] warn_slowpath_common+0x7d/0xa0
	[<ffffffff8109698c>] warn_slowpath_fmt+0x4c/0x50
	[<ffffffff81074f94>] topology_sane.isra.2+0x74/0x90
	[<ffffffff8107530e>] set_cpu_sibling_map+0x31e/0x4f0
	[<ffffffff8107568d>] start_secondary+0x1ad/0x240
	---[ end trace 3fe5f587a9fcde61 ]---
	#10 #11 #12 #13 #14 #15 #16 #17
	.... node  #2, CPUs:   #18 #19 #20 #21 #22 #23 #24 #25 #26
	.... node  #3, CPUs:   #27 #28 #29 #30 #31 #32 #33 #34 #35

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: ak@linux.intel.com
Cc: brice.goglin@gmail.com
Cc: bp@alien8.de


* [RFC][PATCH 1/6] topology: rename topology_core_cpumask() to topology_package_cpumask()
  2014-09-17 22:33 [RFC][PATCH 0/6] fix topology for multi-NUMA-node CPUs Dave Hansen
@ 2014-09-17 22:33 ` Dave Hansen
  2014-09-17 22:33 ` [RFC][PATCH 2/6] x86: introduce cpumask specifically for the package Dave Hansen
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Dave Hansen @ 2014-09-17 22:33 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: mingo, hpa, brice.goglin, bp, linux-kernel, Dave Hansen,
	linux-arch, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

topology_core_cpumask() is a function that arch-independent code
uses.  Each (non-generic) architecture will define it in order to
map over to the arch-specific variables storing the actual mask.

I think topology_core_cpumask() is a bad name.  It makes it sound
like it is generating a cpumask for *a single* core.  It is, in
fact, generating the mask of all cores inside a CPU package.  Let's
make that clearer with the naming.
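
The confusion shows up at call sites; a hypothetical example (a
sketch only, not taken from this series' diff):

	int i;

	/*
	 * Old name: reads as if it returned the mask of a single
	 * core; it actually covers every CPU in cpu's package.
	 */
	for_each_cpu(i, topology_core_cpumask(cpu))
		pr_info("CPU %d shares a package with CPU %d\n", i, cpu);

	/* New name: says what it returns. */
	for_each_cpu(i, topology_package_cpumask(cpu))
		pr_info("CPU %d shares a package with CPU %d\n", i, cpu);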

For the non-x86 architectures, I'd appreciate a sanity check of
this.  It definitely makes sense on x86, but I'm less confident
about the others.

Cc: linux-arch@vger.kernel.org

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/arm/include/asm/topology.h         |    2 +-
 b/arch/arm64/include/asm/topology.h       |    2 +-
 b/arch/ia64/include/asm/topology.h        |    2 +-
 b/arch/metag/include/asm/topology.h       |    2 +-
 b/arch/mips/include/asm/smp.h             |    2 +-
 b/arch/powerpc/include/asm/topology.h     |    2 +-
 b/arch/s390/include/asm/topology.h        |    2 +-
 b/arch/sh/include/asm/topology.h          |    2 +-
 b/arch/sparc/include/asm/topology_64.h    |    2 +-
 b/arch/tile/include/asm/topology.h        |    2 +-
 b/arch/x86/include/asm/topology.h         |    2 +-
 b/drivers/base/topology.c                 |    8 ++++----
 b/drivers/block/nvme-core.c               |    2 +-
 b/drivers/cpufreq/arm_big_little.c        |    2 +-
 b/drivers/infiniband/hw/qib/qib_iba7322.c |    2 +-
 b/include/linux/topology.h                |    4 ++--
 b/lib/cpu_rmap.c                          |    2 +-
 17 files changed, 21 insertions(+), 21 deletions(-)

diff -puN arch/arm64/include/asm/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask arch/arm64/include/asm/topology.h
--- a/arch/arm64/include/asm/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask	2014-09-17 15:28:56.310517034 -0700
+++ b/arch/arm64/include/asm/topology.h	2014-09-17 15:28:56.341518454 -0700
@@ -17,7 +17,7 @@ extern struct cpu_topology cpu_topology[
 
 #define topology_physical_package_id(cpu)	(cpu_topology[cpu].cluster_id)
 #define topology_core_id(cpu)		(cpu_topology[cpu].core_id)
-#define topology_core_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)
+#define topology_package_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)
 #define topology_thread_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
 
 void init_cpu_topology(void);
diff -puN arch/arm/include/asm/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask arch/arm/include/asm/topology.h
--- a/arch/arm/include/asm/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask	2014-09-17 15:28:56.311517079 -0700
+++ b/arch/arm/include/asm/topology.h	2014-09-17 15:28:56.341518454 -0700
@@ -17,7 +17,7 @@ extern struct cputopo_arm cpu_topology[N
 
 #define topology_physical_package_id(cpu)	(cpu_topology[cpu].socket_id)
 #define topology_core_id(cpu)		(cpu_topology[cpu].core_id)
-#define topology_core_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)
+#define topology_package_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)
 #define topology_thread_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
 
 void init_cpu_topology(void);
diff -puN arch/ia64/include/asm/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask arch/ia64/include/asm/topology.h
--- a/arch/ia64/include/asm/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask	2014-09-17 15:28:56.313517171 -0700
+++ b/arch/ia64/include/asm/topology.h	2014-09-17 15:28:56.341518454 -0700
@@ -52,7 +52,7 @@ void build_cpu_to_node_map(void);
 #ifdef CONFIG_SMP
 #define topology_physical_package_id(cpu)	(cpu_data(cpu)->socket_id)
 #define topology_core_id(cpu)			(cpu_data(cpu)->core_id)
-#define topology_core_cpumask(cpu)		(&cpu_core_map[cpu])
+#define topology_package_cpumask(cpu)		(&cpu_core_map[cpu])
 #define topology_thread_cpumask(cpu)		(&per_cpu(cpu_sibling_map, cpu))
 #endif
 
diff -puN arch/metag/include/asm/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask arch/metag/include/asm/topology.h
--- a/arch/metag/include/asm/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask	2014-09-17 15:28:56.315517263 -0700
+++ b/arch/metag/include/asm/topology.h	2014-09-17 15:28:56.342518499 -0700
@@ -21,7 +21,7 @@ const struct cpumask *cpu_coregroup_mask
 
 extern cpumask_t cpu_core_map[NR_CPUS];
 
-#define topology_core_cpumask(cpu)	(&cpu_core_map[cpu])
+#define topology_package_cpumask(cpu)	(&cpu_core_map[cpu])
 
 #include <asm-generic/topology.h>
 
diff -puN arch/mips/include/asm/smp.h~rename-topology_cpu_cpumask-topology_package_cpumask arch/mips/include/asm/smp.h
--- a/arch/mips/include/asm/smp.h~rename-topology_cpu_cpumask-topology_package_cpumask	2014-09-17 15:28:56.316517309 -0700
+++ b/arch/mips/include/asm/smp.h	2014-09-17 15:28:56.342518499 -0700
@@ -39,7 +39,7 @@ extern int __cpu_logical_map[NR_CPUS];
 
 #define topology_physical_package_id(cpu)	(cpu_data[cpu].package)
 #define topology_core_id(cpu)			(cpu_data[cpu].core)
-#define topology_core_cpumask(cpu)		(&cpu_core_map[cpu])
+#define topology_package_cpumask(cpu)		(&cpu_core_map[cpu])
 #define topology_thread_cpumask(cpu)		(&cpu_sibling_map[cpu])
 
 #define SMP_RESCHEDULE_YOURSELF 0x1	/* XXX braindead */
diff -puN arch/powerpc/include/asm/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask arch/powerpc/include/asm/topology.h
--- a/arch/powerpc/include/asm/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask	2014-09-17 15:28:56.318517401 -0700
+++ b/arch/powerpc/include/asm/topology.h	2014-09-17 15:28:56.342518499 -0700
@@ -88,7 +88,7 @@ static inline int prrn_is_enabled(void)
 
 #define topology_physical_package_id(cpu)	(cpu_to_chip_id(cpu))
 #define topology_thread_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))
-#define topology_core_cpumask(cpu)	(per_cpu(cpu_core_map, cpu))
+#define topology_package_cpumask(cpu)	(per_cpu(cpu_core_map, cpu))
 #define topology_core_id(cpu)		(cpu_to_core_id(cpu))
 #endif
 #endif
diff -puN arch/s390/include/asm/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask arch/s390/include/asm/topology.h
--- a/arch/s390/include/asm/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask	2014-09-17 15:28:56.320517491 -0700
+++ b/arch/s390/include/asm/topology.h	2014-09-17 15:28:56.343518544 -0700
@@ -20,7 +20,7 @@ extern struct cpu_topology_s390 cpu_topo
 
 #define topology_physical_package_id(cpu)	(cpu_topology[cpu].socket_id)
 #define topology_core_id(cpu)			(cpu_topology[cpu].core_id)
-#define topology_core_cpumask(cpu)		(&cpu_topology[cpu].core_mask)
+#define topology_package_cpumask(cpu)		(&cpu_topology[cpu].core_mask)
 #define topology_book_id(cpu)			(cpu_topology[cpu].book_id)
 #define topology_book_cpumask(cpu)		(&cpu_topology[cpu].book_mask)
 
diff -puN arch/sh/include/asm/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask arch/sh/include/asm/topology.h
--- a/arch/sh/include/asm/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask	2014-09-17 15:28:56.321517537 -0700
+++ b/arch/sh/include/asm/topology.h	2014-09-17 15:28:56.343518544 -0700
@@ -21,7 +21,7 @@ const struct cpumask *cpu_coregroup_mask
 
 extern cpumask_t cpu_core_map[NR_CPUS];
 
-#define topology_core_cpumask(cpu)	(&cpu_core_map[cpu])
+#define topology_package_cpumask(cpu)	(&cpu_core_map[cpu])
 
 #include <asm-generic/topology.h>
 
diff -puN arch/sparc/include/asm/topology_64.h~rename-topology_cpu_cpumask-topology_package_cpumask arch/sparc/include/asm/topology_64.h
--- a/arch/sparc/include/asm/topology_64.h~rename-topology_cpu_cpumask-topology_package_cpumask	2014-09-17 15:28:56.323517629 -0700
+++ b/arch/sparc/include/asm/topology_64.h	2014-09-17 15:28:56.343518544 -0700
@@ -40,7 +40,7 @@ static inline int pcibus_to_node(struct
 #ifdef CONFIG_SMP
 #define topology_physical_package_id(cpu)	(cpu_data(cpu).proc_id)
 #define topology_core_id(cpu)			(cpu_data(cpu).core_id)
-#define topology_core_cpumask(cpu)		(&cpu_core_map[cpu])
+#define topology_package_cpumask(cpu)		(&cpu_core_map[cpu])
 #define topology_thread_cpumask(cpu)		(&per_cpu(cpu_sibling_map, cpu))
 #endif /* CONFIG_SMP */
 
diff -puN arch/tile/include/asm/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask arch/tile/include/asm/topology.h
--- a/arch/tile/include/asm/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask	2014-09-17 15:28:56.325517721 -0700
+++ b/arch/tile/include/asm/topology.h	2014-09-17 15:28:56.344518590 -0700
@@ -54,7 +54,7 @@ static inline const struct cpumask *cpum
 #ifdef CONFIG_SMP
 #define topology_physical_package_id(cpu)       ((void)(cpu), 0)
 #define topology_core_id(cpu)                   (cpu)
-#define topology_core_cpumask(cpu)              ((void)(cpu), cpu_online_mask)
+#define topology_package_cpumask(cpu)              ((void)(cpu), cpu_online_mask)
 #define topology_thread_cpumask(cpu)            cpumask_of(cpu)
 #endif
 
diff -puN arch/x86/include/asm/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask arch/x86/include/asm/topology.h
--- a/arch/x86/include/asm/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask	2014-09-17 15:28:56.326517767 -0700
+++ b/arch/x86/include/asm/topology.h	2014-09-17 15:28:56.344518590 -0700
@@ -123,7 +123,7 @@ extern const struct cpumask *cpu_coregro
 #define topology_core_id(cpu)			(cpu_data(cpu).cpu_core_id)
 
 #ifdef ENABLE_TOPO_DEFINES
-#define topology_core_cpumask(cpu)		(per_cpu(cpu_core_map, cpu))
+#define topology_package_cpumask(cpu)		(per_cpu(cpu_core_map, cpu))
 #define topology_thread_cpumask(cpu)		(per_cpu(cpu_sibling_map, cpu))
 #endif
 
diff -puN drivers/base/topology.c~rename-topology_cpu_cpumask-topology_package_cpumask drivers/base/topology.c
--- a/drivers/base/topology.c~rename-topology_cpu_cpumask-topology_package_cpumask	2014-09-17 15:28:56.328517857 -0700
+++ b/drivers/base/topology.c	2014-09-17 15:28:56.344518590 -0700
@@ -42,7 +42,7 @@ static ssize_t show_##name(struct device
 	return sprintf(buf, "%d\n", topology_##name(dev->id));	\
 }
 
-#if defined(topology_thread_cpumask) || defined(topology_core_cpumask) || \
+#if defined(topology_thread_cpumask) || defined(topology_package_cpumask) || \
     defined(topology_book_cpumask)
 static ssize_t show_cpumap(int type, const struct cpumask *mask, char *buf)
 {
@@ -88,9 +88,9 @@ define_siblings_show_func(thread_cpumask
 define_one_ro_named(thread_siblings, show_thread_cpumask);
 define_one_ro_named(thread_siblings_list, show_thread_cpumask_list);
 
-define_siblings_show_func(core_cpumask);
-define_one_ro_named(core_siblings, show_core_cpumask);
-define_one_ro_named(core_siblings_list, show_core_cpumask_list);
+define_siblings_show_func(package_cpumask);
+define_one_ro_named(core_siblings, show_package_cpumask);
+define_one_ro_named(core_siblings_list, show_package_cpumask_list);
 
 #ifdef CONFIG_SCHED_BOOK
 define_id_show_func(book_id);
diff -puN drivers/block/nvme-core.c~rename-topology_cpu_cpumask-topology_package_cpumask drivers/block/nvme-core.c
--- a/drivers/block/nvme-core.c~rename-topology_cpu_cpumask-topology_package_cpumask	2014-09-17 15:28:56.330517950 -0700
+++ b/drivers/block/nvme-core.c	2014-09-17 15:28:56.345518636 -0700
@@ -2054,7 +2054,7 @@ static void nvme_assign_io_queues(struct
 				nvmeq, cpus_per_queue);
 		if (cpus_weight(mask) < cpus_per_queue)
 			nvme_add_cpus(&mask, unassigned_cpus,
-				topology_core_cpumask(cpu),
+				topology_package_cpumask(cpu),
 				nvmeq, cpus_per_queue);
 		if (cpus_weight(mask) < cpus_per_queue)
 			nvme_add_cpus(&mask, unassigned_cpus,
diff -puN drivers/cpufreq/arm_big_little.c~rename-topology_cpu_cpumask-topology_package_cpumask drivers/cpufreq/arm_big_little.c
--- a/drivers/cpufreq/arm_big_little.c~rename-topology_cpu_cpumask-topology_package_cpumask	2014-09-17 15:28:56.332518042 -0700
+++ b/drivers/cpufreq/arm_big_little.c	2014-09-17 15:28:56.346518682 -0700
@@ -449,7 +449,7 @@ static int bL_cpufreq_init(struct cpufre
 	if (cur_cluster < MAX_CLUSTERS) {
 		int cpu;
 
-		cpumask_copy(policy->cpus, topology_core_cpumask(policy->cpu));
+		cpumask_copy(policy->cpus, topology_package_cpumask(policy->cpu));
 
 		for_each_cpu(cpu, policy->cpus)
 			per_cpu(physical_cluster, cpu) = cur_cluster;
diff -puN drivers/infiniband/hw/qib/qib_iba7322.c~rename-topology_cpu_cpumask-topology_package_cpumask drivers/infiniband/hw/qib/qib_iba7322.c
--- a/drivers/infiniband/hw/qib/qib_iba7322.c~rename-topology_cpu_cpumask-topology_package_cpumask	2014-09-17 15:28:56.334518132 -0700
+++ b/drivers/infiniband/hw/qib/qib_iba7322.c	2014-09-17 15:28:56.349518820 -0700
@@ -3449,7 +3449,7 @@ try_intx:
 	firstcpu = cpumask_first(local_mask);
 	if (firstcpu >= nr_cpu_ids ||
 			cpumask_weight(local_mask) == num_online_cpus()) {
-		local_mask = topology_core_cpumask(0);
+		local_mask = topology_package_cpumask(0);
 		firstcpu = cpumask_first(local_mask);
 	}
 	if (firstcpu < nr_cpu_ids) {
diff -puN include/linux/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask include/linux/topology.h
--- a/include/linux/topology.h~rename-topology_cpu_cpumask-topology_package_cpumask	2014-09-17 15:28:56.336518224 -0700
+++ b/include/linux/topology.h	2014-09-17 15:28:56.350518866 -0700
@@ -177,8 +177,8 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_thread_cpumask
 #define topology_thread_cpumask(cpu)		cpumask_of(cpu)
 #endif
-#ifndef topology_core_cpumask
-#define topology_core_cpumask(cpu)		cpumask_of(cpu)
+#ifndef topology_package_cpumask
+#define topology_package_cpumask(cpu)		cpumask_of(cpu)
 #endif
 
 #ifdef CONFIG_SCHED_SMT
diff -puN lib/cpu_rmap.c~rename-topology_cpu_cpumask-topology_package_cpumask lib/cpu_rmap.c
--- a/lib/cpu_rmap.c~rename-topology_cpu_cpumask-topology_package_cpumask	2014-09-17 15:28:56.337518270 -0700
+++ b/lib/cpu_rmap.c	2014-09-17 15:28:56.350518866 -0700
@@ -194,7 +194,7 @@ int cpu_rmap_update(struct cpu_rmap *rma
 					topology_thread_cpumask(cpu), 1))
 			continue;
 		if (cpu_rmap_copy_neigh(rmap, cpu,
-					topology_core_cpumask(cpu), 2))
+					topology_package_cpumask(cpu), 2))
 			continue;
 		if (cpu_rmap_copy_neigh(rmap, cpu,
 					cpumask_of_node(cpu_to_node(cpu)), 3))
_


* [RFC][PATCH 2/6] x86: introduce cpumask specifically for the package
  2014-09-17 22:33 [RFC][PATCH 0/6] fix topology for multi-NUMA-node CPUs Dave Hansen
  2014-09-17 22:33 ` [RFC][PATCH 1/6] topology: rename topology_core_cpumask() to topology_package_cpumask() Dave Hansen
@ 2014-09-17 22:33 ` Dave Hansen
  2014-09-18 14:57   ` Peter Zijlstra
  2014-09-17 22:33 ` [RFC][PATCH 3/6] x86: use package_map instead of core_map for sysfs Dave Hansen
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2014-09-17 22:33 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: mingo, hpa, brice.goglin, bp, linux-kernel, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

As noted by multiple reports:

	https://lkml.org/lkml/2014/9/15/1240
	https://lkml.org/lkml/2014/7/28/442

the sched domains code has some assumptions that break on newer
AMD and Intel CPUs.  Namely, the code assumes that NUMA node
boundaries always lie outside of a CPU package.  That assumption
is no longer true with Intel's Cluster-on-Die found in Haswell
CPUs (with a special BIOS config knob) and AMD's DCM feature.

Essentially, the 'cpu_core_map' is no longer suitable for
enumerating all the CPUs in a physical package.

This patch introduces a new map which is built specifically by
consulting the physical package ids instead of inferring the
information from NUMA nodes.

This still leaves us with a broken 'core_siblings_list' in sysfs,
but a later patch will fix that up too.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/smp.h      |    6 ++++++
 b/arch/x86/include/asm/topology.h |    1 +
 b/arch/x86/kernel/smpboot.c       |   13 +++++++++++++
 b/arch/x86/xen/smp.c              |    1 +
 4 files changed, 21 insertions(+)

diff -puN arch/x86/include/asm/smp.h~introduce-package-sd-level arch/x86/include/asm/smp.h
--- a/arch/x86/include/asm/smp.h~introduce-package-sd-level	2014-09-17 15:28:57.075552056 -0700
+++ b/arch/x86/include/asm/smp.h	2014-09-17 15:28:57.084552469 -0700
@@ -32,6 +32,7 @@ static inline bool cpu_has_ht_siblings(v
 
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
+DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_package_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
 DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);
@@ -52,6 +53,11 @@ static inline struct cpumask *cpu_llc_sh
 	return per_cpu(cpu_llc_shared_map, cpu);
 }
 
+static inline struct cpumask *cpu_package_mask(int cpu)
+{
+	return per_cpu(cpu_package_map, cpu);
+}
+
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid);
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_bios_cpu_apicid);
 #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86_32)
diff -puN arch/x86/include/asm/topology.h~introduce-package-sd-level arch/x86/include/asm/topology.h
--- a/arch/x86/include/asm/topology.h~introduce-package-sd-level	2014-09-17 15:28:57.077552149 -0700
+++ b/arch/x86/include/asm/topology.h	2014-09-17 15:28:57.084552469 -0700
@@ -118,6 +118,7 @@ static inline void setup_node_to_cpumask
 #include <asm-generic/topology.h>
 
 extern const struct cpumask *cpu_coregroup_mask(int cpu);
+extern const struct cpumask *cpu_package_mask_func(int cpu);
 
 #define topology_physical_package_id(cpu)	(cpu_data(cpu).phys_proc_id)
 #define topology_core_id(cpu)			(cpu_data(cpu).cpu_core_id)
diff -puN arch/x86/kernel/smpboot.c~introduce-package-sd-level arch/x86/kernel/smpboot.c
--- a/arch/x86/kernel/smpboot.c~introduce-package-sd-level	2014-09-17 15:28:57.079552240 -0700
+++ b/arch/x86/kernel/smpboot.c	2014-09-17 15:28:57.085552515 -0700
@@ -98,6 +98,8 @@ EXPORT_PER_CPU_SYMBOL(cpu_core_map);
 
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
 
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_package_map);
+
 /* Per CPU bogomips and other parameters */
 DEFINE_PER_CPU_SHARED_ALIGNED(struct cpuinfo_x86, cpu_info);
 EXPORT_PER_CPU_SYMBOL(cpu_info);
@@ -352,6 +354,13 @@ static bool match_mc(struct cpuinfo_x86
 	return false;
 }
 
+static bool match_pkg(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+	if (c->phys_proc_id == o->phys_proc_id)
+		return true;
+	return false;
+}
+
 void set_cpu_sibling_map(int cpu)
 {
 	bool has_smt = smp_num_siblings > 1;
@@ -365,6 +374,7 @@ void set_cpu_sibling_map(int cpu)
 	if (!has_mp) {
 		cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
 		cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
+		cpumask_set_cpu(cpu, cpu_package_mask(cpu));
 		cpumask_set_cpu(cpu, cpu_core_mask(cpu));
 		c->booted_cores = 1;
 		return;
@@ -410,6 +420,9 @@ void set_cpu_sibling_map(int cpu)
 			} else if (i != cpu && !c->booted_cores)
 				c->booted_cores = cpu_data(i).booted_cores;
 		}
+		if ((i == cpu) || (has_mp && match_pkg(c, o))) {
+			link_mask(package, cpu, i);
+		}
 	}
 }
 
diff -puN arch/x86/xen/smp.c~introduce-package-sd-level arch/x86/xen/smp.c
--- a/arch/x86/xen/smp.c~introduce-package-sd-level	2014-09-17 15:28:57.080552285 -0700
+++ b/arch/x86/xen/smp.c	2014-09-17 15:28:57.085552515 -0700
@@ -331,6 +331,7 @@ static void __init xen_smp_prepare_cpus(
 		zalloc_cpumask_var(&per_cpu(cpu_sibling_map, i), GFP_KERNEL);
 		zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
 		zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), GFP_KERNEL);
+		zalloc_cpumask_var(&per_cpu(cpu_package_map, i), GFP_KERNEL);
 	}
 	set_cpu_sibling_map(0);
 
_


* [RFC][PATCH 3/6] x86: use package_map instead of core_map for sysfs
  2014-09-17 22:33 [RFC][PATCH 0/6] fix topology for multi-NUMA-node CPUs Dave Hansen
  2014-09-17 22:33 ` [RFC][PATCH 1/6] topology: rename topology_core_cpumask() to topology_package_cpumask() Dave Hansen
  2014-09-17 22:33 ` [RFC][PATCH 2/6] x86: introduce cpumask specifically for the package Dave Hansen
@ 2014-09-17 22:33 ` Dave Hansen
  2014-09-17 22:33 ` [RFC][PATCH 4/6] sched: eliminate "DIE" domain level when NUMA present Dave Hansen
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Dave Hansen @ 2014-09-17 22:33 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: mingo, hpa, brice.goglin, bp, linux-kernel, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The /sys/devices/system/cpu/cpu*/topology/core_siblings*
files were previously built from the "cpu_core_map".  That
mask is deeply connected to the sched domains' internal
multi-core (MC) level, which has now become disconnected from
the actual CPU package.

We have a new "cpu_package_map" whose sole purpose is tracking
which package a CPU is in; it is unconnected to the scheduler.
We will now build those sysfs files with information from the
new package map.

Note: this also realigns the sysfs files with their documentation
in Documentation/ABI.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/topology.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN arch/x86/include/asm/topology.h~x86-use-package-map-instead-of-core-map arch/x86/include/asm/topology.h
--- a/arch/x86/include/asm/topology.h~x86-use-package-map-instead-of-core-map	2014-09-17 15:28:57.508571881 -0700
+++ b/arch/x86/include/asm/topology.h	2014-09-17 15:28:57.511572017 -0700
@@ -124,7 +124,7 @@ extern const struct cpumask *cpu_package
 #define topology_core_id(cpu)			(cpu_data(cpu).cpu_core_id)
 
 #ifdef ENABLE_TOPO_DEFINES
-#define topology_package_cpumask(cpu)		(per_cpu(cpu_core_map, cpu))
+#define topology_package_cpumask(cpu)		(per_cpu(cpu_package_map, cpu))
 #define topology_thread_cpumask(cpu)		(per_cpu(cpu_sibling_map, cpu))
 #endif
 
_


* [RFC][PATCH 4/6] sched: eliminate "DIE" domain level when NUMA present
  2014-09-17 22:33 [RFC][PATCH 0/6] fix topology for multi-NUMA-node CPUs Dave Hansen
                   ` (2 preceding siblings ...)
  2014-09-17 22:33 ` [RFC][PATCH 3/6] x86: use package_map instead of core_map for sysfs Dave Hansen
@ 2014-09-17 22:33 ` Dave Hansen
  2014-09-18 17:28   ` Peter Zijlstra
  2014-09-17 22:33 ` [RFC][PATCH 5/6] sched: keep MC domain from crossing nodes OR packages Dave Hansen
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2014-09-17 22:33 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: mingo, hpa, brice.goglin, bp, linux-kernel, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The "DIE" topology level is currently defined like this:

static inline const struct cpumask *cpu_cpu_mask(int cpu)
{
        return cpumask_of_node(cpu_to_node(cpu));
}

But that makes very little sense on a NUMA system, since the
lowest NUMA domain is guaranteed to be essentially identical to
this level.

We leave this for systems that are !CONFIG_NUMA and that
might need a top-level domain.

This also keeps us from having screwy topologies when the
smallest NUMA node is only _part_ of the die.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/kernel/sched/core.c |    2 ++
 1 file changed, 2 insertions(+)

diff -puN kernel/sched/core.c~die-is-NUMA-based-and-screwed-up kernel/sched/core.c
--- a/kernel/sched/core.c~die-is-NUMA-based-and-screwed-up	2014-09-17 15:28:57.867588315 -0700
+++ b/kernel/sched/core.c	2014-09-17 15:28:57.873588591 -0700
@@ -6141,7 +6141,9 @@ static struct sched_domain_topology_leve
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
+#ifndef CONFIG_NUMA
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+#endif
 	{ NULL, },
 };
 
_


* [RFC][PATCH 5/6] sched: keep MC domain from crossing nodes OR packages
  2014-09-17 22:33 [RFC][PATCH 0/6] fix topology for multi-NUMA-node CPUs Dave Hansen
                   ` (3 preceding siblings ...)
  2014-09-17 22:33 ` [RFC][PATCH 4/6] sched: eliminate "DIE" domain level when NUMA present Dave Hansen
@ 2014-09-17 22:33 ` Dave Hansen
  2014-09-17 22:33 ` [RFC][PATCH 6/6] sched: consolidate config options Dave Hansen
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Dave Hansen @ 2014-09-17 22:33 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: mingo, hpa, brice.goglin, bp, linux-kernel, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The MC (MultiCore) sched domain was originally intended to
represent the groupings of cores in a multi-core CPU.

The sched domains code has essentially two kinds of topology
levels:
1. CPU levels, like hyperthreads or cores that share cache
2. NUMA levels derived from the system's NUMA topology

The domains are built by first going through the CPU levels, then
through the NUMA levels.  However, we now have at least two
instances of systems where a single CPU "package" has multiple
NUMA nodes.

To fix this, we redefine the multi-core level.  Previously, it
was defined as stopping at the CPU package.  Now, we define it as
grouping similar CPUs that are both in the same package *and*
that have the same access to some set of memory.  Essentially, an
MC group must be in the same package *and* on the same NUMA node.
On a Cluster-on-Die Haswell, for example, an 18-core package is
split into two 9-core NUMA nodes, so an MC group covers 9 cores
instead of 18.

This does no harm because there is a NUMA domain precisely at the
level which "MC" used to represent.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/kernel/smpboot.c |   26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff -puN arch/x86/kernel/smpboot.c~keep-mc-from-crossing-nodes-or-package arch/x86/kernel/smpboot.c
--- a/arch/x86/kernel/smpboot.c~keep-mc-from-crossing-nodes-or-package	2014-09-17 15:28:58.226604751 -0700
+++ b/arch/x86/kernel/smpboot.c	2014-09-17 15:28:58.230604935 -0700
@@ -345,13 +345,27 @@ static bool match_llc(struct cpuinfo_x86
 
 static bool match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 {
-	if (c->phys_proc_id == o->phys_proc_id) {
-		if (cpu_has(c, X86_FEATURE_AMD_DCM))
-			return true;
+	/*
+	 * Do not allow a multi-core group
+	 * outside of a package.
+	 */
+	if (c->phys_proc_id != o->phys_proc_id)
+		return false;
 
-		return topology_sane(c, o, "mc");
-	}
-	return false;
+	/*
+	 * Do not allow a multi-core group
+	 * outside of a NUMA node.
+	 */
+	if (cpu_to_node(c->cpu_index) !=
+	    cpu_to_node(o->cpu_index))
+		return false;
+
+	/*
+	 * This pretty much repeats the NUMA node
+	 * check above, but leave it here for
+	 * consistency.
+	 */
+	return topology_sane(c, o, "mc");
 }
 
 static bool match_pkg(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
_


* [RFC][PATCH 6/6] sched: consolidate config options
  2014-09-17 22:33 [RFC][PATCH 0/6] fix topology for multi-NUMA-node CPUs Dave Hansen
                   ` (4 preceding siblings ...)
  2014-09-17 22:33 ` [RFC][PATCH 5/6] sched: keep MC domain from crossing nodes OR packages Dave Hansen
@ 2014-09-17 22:33 ` Dave Hansen
  2014-09-18 17:29   ` Peter Zijlstra
  2014-09-18  7:45 ` [RFC][PATCH 0/6] fix topology for multi-NUMA-node CPUs Borislav Petkov
       [not found] ` <CAOjmkp8EGO0jicmdO=p6ATHz-hUJmWb+xoBLjOdLBUwwGzyhhg@mail.gmail.com>
  7 siblings, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2014-09-17 22:33 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: mingo, hpa, brice.goglin, bp, linux-kernel, Dave Hansen, dave.hansen


I originally did this when renaming CONFIG_SCHED_MC.  I ended up
not renaming it, but I still think it's nice to have all the
Kconfigs consolidated like this.

--

From: Dave Hansen <dave.hansen@linux.intel.com>

We have two config options (SCHED_MC and SCHED_SMT) which are used
across a few architectures, and one (SCHED_BOOK) which is used
only on s390.

The Kconfig text for MC/SMT is copied verbatim across each of the
architectures that use it.  This consolidates it down to a single
Kconfig location.

This gives us a centrally-defined set of config options which
architectures can 'select' when needed.
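
A minimal sketch of how an architecture opts in under the new
scheme ("MYARCH" is made up; the real conversions are in the diff
below):

	config MYARCH
		bool "My example architecture"
		select ARCH_ENABLE_SCHED_MC
		select ARCH_ENABLE_SCHED_SMT

	# and, next to the other kernel Kconfig sources:
	source "kernel/sched/Kconfig"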

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/arm/Kconfig                       |   18 ++-----------
 b/arch/arm64/Kconfig                     |   18 ++-----------
 b/arch/ia64/Kconfig                      |    9 ------
 b/arch/mips/Kconfig                      |   14 +---------
 b/arch/powerpc/Kconfig                   |    8 -----
 b/arch/powerpc/platforms/Kconfig.cputype |    1 
 b/arch/s390/Kconfig                      |   14 ++--------
 b/arch/sh/Kconfig                        |    1 
 b/arch/sh/mm/Kconfig                     |    9 ------
 b/arch/sparc/Kconfig                     |   20 ++------------
 b/arch/x86/Kconfig                       |   20 ++------------
 b/kernel/sched/Kconfig                   |   42 +++++++++++++++++++++++++++++++
 12 files changed, 63 insertions(+), 111 deletions(-)

diff -puN arch/arm64/Kconfig~consolidate-config-SCHED_MC arch/arm64/Kconfig
--- a/arch/arm64/Kconfig~consolidate-config-SCHED_MC	2014-09-17 15:28:58.592621508 -0700
+++ b/arch/arm64/Kconfig	2014-09-17 15:28:58.613622467 -0700
@@ -68,6 +68,8 @@ config ARM64
 	select SPARSE_IRQ
 	select SYSCTL_EXCEPTION_TRACE
 	select HAVE_CONTEXT_TRACKING
+	select ARCH_ENABLE_SCHED_MC
+	select ARCH_ENABLE_SCHED_SMT
 	help
 	  ARM 64-bit (AArch64) Linux support.
 
@@ -235,21 +237,7 @@ config SMP
 
 	  If you don't know what to do here, say N.
 
-config SCHED_MC
-	bool "Multi-core scheduler support"
-	depends on SMP
-	help
-	  Multi-core scheduler support improves the CPU scheduler's decision
-	  making when dealing with multi-core CPU chips at a cost of slightly
-	  increased overhead in some places. If unsure say N here.
-
-config SCHED_SMT
-	bool "SMT scheduler support"
-	depends on SMP
-	help
-	  Improves the CPU scheduler's decision making when dealing with
-	  MultiThreading at a cost of slightly increased overhead in some
-	  places. If unsure say N here.
+source kernel/sched/Kconfig
 
 config NR_CPUS
 	int "Maximum number of CPUs (2-32)"
diff -puN arch/arm/Kconfig~consolidate-config-SCHED_MC arch/arm/Kconfig
--- a/arch/arm/Kconfig~consolidate-config-SCHED_MC	2014-09-17 15:28:58.593621554 -0700
+++ b/arch/arm/Kconfig	2014-09-17 15:28:58.614622513 -0700
@@ -1355,27 +1355,15 @@ config SMP_ON_UP
 config ARM_CPU_TOPOLOGY
 	bool "Support cpu topology definition"
 	depends on SMP && CPU_V7
+	select ARCH_ENABLE_SCHED_MC
+	select ARCH_ENABLE_SCHED_SMT
 	default y
 	help
 	  Support ARM cpu topology definition. The MPIDR register defines
 	  affinity between processors which is then used to describe the cpu
 	  topology of an ARM System.
 
-config SCHED_MC
-	bool "Multi-core scheduler support"
-	depends on ARM_CPU_TOPOLOGY
-	help
-	  Multi-core scheduler support improves the CPU scheduler's decision
-	  making when dealing with multi-core CPU chips at a cost of slightly
-	  increased overhead in some places. If unsure say N here.
-
-config SCHED_SMT
-	bool "SMT scheduler support"
-	depends on ARM_CPU_TOPOLOGY
-	help
-	  Improves the CPU scheduler's decision making when dealing with
-	  MultiThreading at a cost of slightly increased overhead in some
-	  places. If unsure say N here.
+source kernel/sched/Kconfig
 
 config HAVE_ARM_SCU
 	bool
diff -puN arch/ia64/Kconfig~consolidate-config-SCHED_MC arch/ia64/Kconfig
--- a/arch/ia64/Kconfig~consolidate-config-SCHED_MC	2014-09-17 15:28:58.595621644 -0700
+++ b/arch/ia64/Kconfig	2014-09-17 15:28:58.614622513 -0700
@@ -49,6 +49,7 @@ config IA64
 	select MODULES_USE_ELF_RELA
 	select ARCH_USE_CMPXCHG_LOCKREF
 	select HAVE_ARCH_AUDITSYSCALL
+	select ARCH_ENABLE_SCHED_SMT
 	default y
 	help
 	  The Itanium Processor Family is Intel's 64-bit successor to
@@ -382,14 +383,6 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
 config ARCH_ENABLE_MEMORY_HOTREMOVE
 	def_bool y
 
-config SCHED_SMT
-	bool "SMT scheduler support"
-	depends on SMP
-	help
-	  Improves the CPU scheduler's decision making when dealing with
-	  Intel IA64 chips with MultiThreading at a cost of slightly increased
-	  overhead in some places. If unsure say N here.
-
 config PERMIT_BSP_REMOVE
 	bool "Support removal of Bootstrap Processor"
 	depends on HOTPLUG_CPU
diff -puN arch/mips/Kconfig~consolidate-config-SCHED_MC arch/mips/Kconfig
--- a/arch/mips/Kconfig~consolidate-config-SCHED_MC	2014-09-17 15:28:58.597621736 -0700
+++ b/arch/mips/Kconfig	2014-09-17 15:28:58.615622560 -0700
@@ -1948,7 +1948,7 @@ config MIPS_MT_SMP
 	select SMP
 	select SMP_UP
 	select SYS_SUPPORTS_SMP
-	select SYS_SUPPORTS_SCHED_SMT
+	select ARCH_ENABLE_SCHED_SMT
 	select MIPS_PERF_SHARED_TC_COUNTERS
 	help
 	  This is a kernel model which is known as SMVP. This is supported
@@ -1960,17 +1960,7 @@ config MIPS_MT_SMP
 config MIPS_MT
 	bool
 
-config SCHED_SMT
-	bool "SMT (multithreading) scheduler support"
-	depends on SYS_SUPPORTS_SCHED_SMT
-	default n
-	help
-	  SMT scheduler support improves the CPU scheduler's decision making
-	  when dealing with MIPS MT enabled cores at a cost of slightly
-	  increased overhead in some places. If unsure say N here.
-
-config SYS_SUPPORTS_SCHED_SMT
-	bool
+source "kernel/sched/Kconfig"
 
 config SYS_SUPPORTS_MULTITHREADING
 	bool
diff -puN arch/powerpc/Kconfig~consolidate-config-SCHED_MC arch/powerpc/Kconfig
--- a/arch/powerpc/Kconfig~consolidate-config-SCHED_MC	2014-09-17 15:28:58.599621828 -0700
+++ b/arch/powerpc/Kconfig	2014-09-17 15:28:58.616622606 -0700
@@ -603,14 +603,6 @@ config PPC_SUBPAGE_PROT
 	  to set access permissions (read/write, readonly, or no access)
 	  on the 4k subpages of each 64k page.
 
-config SCHED_SMT
-	bool "SMT (Hyperthreading) scheduler support"
-	depends on PPC64 && SMP
-	help
-	  SMT scheduler support improves the CPU scheduler's decision making
-	  when dealing with POWER5 cpus at a cost of slightly increased
-	  overhead in some places. If unsure say N here.
-
 config PPC_DENORMALISATION
 	bool "PowerPC denormalisation exception handling"
 	depends on PPC_BOOK3S_64
diff -puN arch/powerpc/platforms/Kconfig.cputype~consolidate-config-SCHED_MC arch/powerpc/platforms/Kconfig.cputype
--- a/arch/powerpc/platforms/Kconfig.cputype~consolidate-config-SCHED_MC	2014-09-17 15:28:58.600621874 -0700
+++ b/arch/powerpc/platforms/Kconfig.cputype	2014-09-17 15:28:58.616622606 -0700
@@ -2,6 +2,7 @@ config PPC64
 	bool "64-bit kernel"
 	default n
 	select HAVE_VIRT_CPU_ACCOUNTING
+	select SCHED_SMT if SMP
 	help
 	  This option selects whether a 32-bit or a 64-bit kernel
 	  will be built.
diff -puN arch/s390/Kconfig~consolidate-config-SCHED_MC arch/s390/Kconfig
--- a/arch/s390/Kconfig~consolidate-config-SCHED_MC	2014-09-17 15:28:58.602621966 -0700
+++ b/arch/s390/Kconfig	2014-09-17 15:28:58.617622652 -0700
@@ -146,6 +146,8 @@ config S390
 	select VIRT_CPU_ACCOUNTING
 	select VIRT_TO_BUS
 	select ARCH_HAS_SG_CHAIN
+	select ARCH_ENABLE_SCHED_BOOK
+	select ARCH_ENABLE_SCHED_MC
 
 config SCHED_OMIT_FRAME_POINTER
 	def_bool y
@@ -372,17 +374,7 @@ config HOTPLUG_CPU
 	  can be controlled through /sys/devices/system/cpu/cpu#.
 	  Say N if you want to disable CPU hotplug.
 
-config SCHED_MC
-	def_bool n
-
-config SCHED_BOOK
-	def_bool y
-	prompt "Book scheduler support"
-	depends on SMP
-	select SCHED_MC
-	help
-	  Book scheduler support improves the CPU scheduler's decision making
-	  when dealing with machines that have several books.
+source kernel/sched/Kconfig
 
 source kernel/Kconfig.preempt
 
diff -puN arch/sh/Kconfig~consolidate-config-SCHED_MC arch/sh/Kconfig
--- a/arch/sh/Kconfig~consolidate-config-SCHED_MC	2014-09-17 15:28:58.604622056 -0700
+++ b/arch/sh/Kconfig	2014-09-17 15:28:58.618622698 -0700
@@ -43,6 +43,7 @@ config SUPERH
 	select OLD_SIGSUSPEND
 	select OLD_SIGACTION
 	select HAVE_ARCH_AUDITSYSCALL
+	select ARCH_ENABLE_SCHED_MC
 	help
 	  The SuperH is a RISC processor targeted for use in embedded systems
 	  and consumer electronics; it was also used in the Sega Dreamcast
diff -puN arch/sh/mm/Kconfig~consolidate-config-SCHED_MC arch/sh/mm/Kconfig
--- a/arch/sh/mm/Kconfig~consolidate-config-SCHED_MC	2014-09-17 15:28:58.606622148 -0700
+++ b/arch/sh/mm/Kconfig	2014-09-17 15:28:58.618622698 -0700
@@ -226,14 +226,7 @@ endchoice
 
 source "mm/Kconfig"
 
-config SCHED_MC
-	bool "Multi-core scheduler support"
-	depends on SMP
-	default y
-	help
-	  Multi-core scheduler support improves the CPU scheduler's decision
-	  making when dealing with multi-core CPU chips at a cost of slightly
-	  increased overhead in some places. If unsure say N here.
+source "kernel/sched/Kconfig"
 
 endmenu
 
diff -puN arch/sparc/Kconfig~consolidate-config-SCHED_MC arch/sparc/Kconfig
--- a/arch/sparc/Kconfig~consolidate-config-SCHED_MC	2014-09-17 15:28:58.607622194 -0700
+++ b/arch/sparc/Kconfig	2014-09-17 15:28:58.618622698 -0700
@@ -79,6 +79,8 @@ config SPARC64
 	select NO_BOOTMEM
 	select HAVE_ARCH_AUDITSYSCALL
 	select ARCH_SUPPORTS_ATOMIC_RMW
+	select ARCH_ENABLE_SCHED_MC if SMP
+	select ARCH_ENABLE_SCHED_SMT if SMP
 
 config ARCH_DEFCONFIG
 	string
@@ -306,23 +308,7 @@ if SPARC64
 source "kernel/power/Kconfig"
 endif
 
-config SCHED_SMT
-	bool "SMT (Hyperthreading) scheduler support"
-	depends on SPARC64 && SMP
-	default y
-	help
-	  SMT scheduler support improves the CPU scheduler's decision making
-	  when dealing with SPARC cpus at a cost of slightly increased overhead
-	  in some places. If unsure say N here.
-
-config SCHED_MC
-	bool "Multi-core scheduler support"
-	depends on SPARC64 && SMP
-	default y
-	help
-	  Multi-core scheduler support improves the CPU scheduler's decision
-	  making when dealing with multi-core CPU chips at a cost of slightly
-	  increased overhead in some places. If unsure say N here.
+source "kernel/sched/Kconfig"
 
 source "kernel/Kconfig.preempt"
 
diff -puN arch/x86/Kconfig~consolidate-config-SCHED_MC arch/x86/Kconfig
--- a/arch/x86/Kconfig~consolidate-config-SCHED_MC	2014-09-17 15:28:58.609622286 -0700
+++ b/arch/x86/Kconfig	2014-09-17 15:28:58.620622789 -0700
@@ -255,6 +255,8 @@ config X86_64_SMP
 config X86_HT
 	def_bool y
 	depends on SMP
+	select ARCH_ENABLE_SCHED_MC
+	select ARCH_ENABLE_SCHED_SMT
 
 config X86_32_LAZY_GS
 	def_bool y
@@ -789,23 +791,7 @@ config NR_CPUS
 	  This is purely to save memory - each supported CPU adds
 	  approximately eight kilobytes to the kernel image.
 
-config SCHED_SMT
-	bool "SMT (Hyperthreading) scheduler support"
-	depends on X86_HT
-	---help---
-	  SMT scheduler support improves the CPU scheduler's decision making
-	  when dealing with Intel Pentium 4 chips with HyperThreading at a
-	  cost of slightly increased overhead in some places. If unsure say
-	  N here.
-
-config SCHED_MC
-	def_bool y
-	prompt "Multi-core scheduler support"
-	depends on X86_HT
-	---help---
-	  Multi-core scheduler support improves the CPU scheduler's decision
-	  making when dealing with multi-core CPU chips at a cost of slightly
-	  increased overhead in some places. If unsure say N here.
+source "kernel/sched/Kconfig"
 
 source "kernel/Kconfig.preempt"
 
diff -puN /dev/null kernel/sched/Kconfig
--- /dev/null	2014-04-10 11:28:14.066815724 -0700
+++ b/kernel/sched/Kconfig	2014-09-17 15:28:58.620622789 -0700
@@ -0,0 +1,42 @@
+config ARCH_ENABLE_SCHED_MC
+	depends on SMP
+	def_bool n
+
+config ARCH_ENABLE_SCHED_BOOK
+	depends on SMP
+	def_bool n
+
+config ARCH_ENABLE_SCHED_SMT
+	depends on SMP
+	def_bool n
+
+config SCHED_MC
+	bool "Multi-core scheduler support"
+	default n if S390
+	default y
+	depends on ARCH_ENABLE_SCHED_MC
+	help
+	  Multi-core scheduler support improves the CPU scheduler's decision
+	  making when dealing with multi-core CPU chips at a cost of slightly
+	  increased overhead in some places. If unsure say N here.
+
+config SCHED_BOOK
+	def_bool y
+	prompt "Book scheduler support"
+	depends on ARCH_ENABLE_SCHED_BOOK
+	select SCHED_MC
+	help
+	  Book scheduler support improves the CPU scheduler's decision making
+	  when dealing with machines that have several books.
+
+	  Currently only used on s390, which has only a single NUMA node.
+	  Books are collections of CPUs that are grouped similarly to a NUMA
+	  node, but without the same memory properties that NUMA nodes have.
+
+config SCHED_SMT
+	bool "SMT scheduler support"
+	depends on ARCH_ENABLE_SCHED_SMT
+	help
+	  Improves the CPU scheduler's decision making when dealing with
+	  MultiThreading at a cost of slightly increased overhead in some
+	  places. If unsure say N here.
_


* Re: [RFC][PATCH 0/6] fix topology for multi-NUMA-node CPUs
  2014-09-17 22:33 [RFC][PATCH 0/6] fix topology for multi-NUMA-node CPUs Dave Hansen
                   ` (5 preceding siblings ...)
  2014-09-17 22:33 ` [RFC][PATCH 6/6] sched: consolidate config options Dave Hansen
@ 2014-09-18  7:45 ` Borislav Petkov
       [not found] ` <CAOjmkp8EGO0jicmdO=p6ATHz-hUJmWb+xoBLjOdLBUwwGzyhhg@mail.gmail.com>
  7 siblings, 0 replies; 14+ messages in thread
From: Borislav Petkov @ 2014-09-18  7:45 UTC (permalink / raw)
  To: Dave Hansen; +Cc: a.p.zijlstra, mingo, hpa, brice.goglin, linux-kernel

On Wed, Sep 17, 2014 at 03:33:10PM -0700, Dave Hansen wrote:
> This is a big fat RFC.  It takes quite a few liberties with the
> multi-core topology level that I'm not completely comfortable
> with.
> 
> It has only been tested lightly.
> 
> Full dmesg for a Cluster-on-Die system with this set applied,
> and sched_debug on the command-line is here:
> 
> 	http://sr71.net/~dave/intel/full-dmesg-hswep-20140917.txt

So how do I find out what topology this system has?

[    0.175294] .... node  #0, CPUs:        #1
[    0.190970] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
[    0.191813]   #2  #3  #4  #5  #6  #7  #8
[    0.290753] .... node  #1, CPUs:    #9 #10 #11 #12 #13 #14 #15 #16 #17
[    0.436162] .... node  #2, CPUs:   #18 #19 #20 #21 #22 #23 #24 #25 #26
[    0.660795] .... node  #3, CPUs:   #27 #28 #29 #30 #31 #32 #33 #34 #35
[    0.806365] .... node  #0, CPUs:   #36 #37 #38 #39 #40 #41 #42 #43 #44
[    0.933573] .... node  #1, CPUs:   #45 #46 #47 #48 #49 #50 #51 #52 #53
[    1.061079] .... node  #2, CPUs:   #54 #55 #56 #57 #58 #59 #60 #61 #62
[    1.188491] .... node  #3, CPUs:   #63
[    1.202620] x86: Booted up 4 nodes, 64 CPUs

SRAT says 4 nodes but I'm guessing from the context those 4 nodes are
actually in pairs in two physical sockets, right?

Btw, you'd need to increase NR_CPUS because you obviously have more
APICs than 64.

So if we pick a cpu at random:

[    1.350640] CPU49 attaching sched-domain:
[    1.350641]  domain 0: span 13,49 level SMT
[    1.350642]   groups: 49 (cpu_capacity = 588) 13 (cpu_capacity = 588)
[    1.350644]   domain 1: span 9-17,45-53 level MC
[    1.350645]    groups: 13,49 (cpu_capacity = 1176) 14,50 (cpu_capacity = 1176)
                  15,51 (cpu_capacity = 1176) 16,52 (cpu_capacity = 1176)
                  17,53 (cpu_capacity = 1176) 9,45 (cpu_capacity = 1176)
                  10,46 (cpu_capacity = 1177) 11,47 (cpu_capacity = 1176)
                  12,48 (cpu_capacity = 1176)
[    1.350654]    domain 2: span 0-17,36-53 level NUMA
[    1.350655]     groups: 9-17,45-53 (cpu_capacity = 10585) 0-8,36-44 (cpu_capacity = 10589)
[    1.350659]     domain 3: span 0-63 level NUMA
[    1.350660]      groups: 0-17,36-53 (cpu_capacity = 21174) 18-35,54-63 (cpu_capacity = 19944)

domain level 1 MC is what tells me which cores are on the internal nodes
of a socket? Or how do we find that out? Or even, do we need that info
at all...?

It might be useful for RAS and when we want to disable cores or whatever...

Thanks.

-- 
Regards/Gruss,
    Boris.
--


* Re: [RFC][PATCH 2/6] x86: introduce cpumask specifically for the package
  2014-09-17 22:33 ` [RFC][PATCH 2/6] x86: introduce cpumask specifically for the package Dave Hansen
@ 2014-09-18 14:57   ` Peter Zijlstra
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2014-09-18 14:57 UTC (permalink / raw)
  To: Dave Hansen; +Cc: mingo, hpa, brice.goglin, bp, linux-kernel, dave.hansen

On Wed, Sep 17, 2014 at 03:33:14PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> As noted by multiple reports:
> 
> 	https://lkml.org/lkml/2014/9/15/1240
> 	https://lkml.org/lkml/2014/7/28/442
> 
> the sched domains code has some assumptions that break on newer
> AMD and Intel CPUs.  Namely, the code assumes that NUMA node
> boundaries always lie outside of a CPU package.  That assumption
> is no longer true with Intel's Cluster-on-Die found in Haswell
> CPUs (with a special BIOS config knob) and AMD's DCM feature.
> 
> Essentially, the 'cpu_core_map' is no longer suitable for
> enumerating all the CPUs in a physical package.
> 
> This patch introduces a new map which is built specifically by
> consulting the physical package ids instead of inferring the
> information from NUMA nodes.
> 
> This still leaves us with a broken 'core_siblings_list' in sysfs,
> but a later patch will fix that up too.

If we do dynamic topology layout we don't need a second mask I think.
The machines that have multiple packages per node will simply present a
different sched_domain_topology than the machines that have multiple
nodes per package.

Specifically, in the former we include the package_mask as the DIE
level; in the latter we leave it out entirely and rely on the SLIT
table to build the right domain topology.
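
Roughly something like this (a sketch: numa_nodes_in_package() and
arch_set_topology() are made-up hooks here, while
set_sched_topology() and the level entries already exist):

	static struct sched_domain_topology_level pkg_numa_topology[] = {
	#ifdef CONFIG_SCHED_SMT
		{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
	#endif
	#ifdef CONFIG_SCHED_MC
		{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
	#endif
		/* no DIE level; NUMA levels from the SLIT table take over */
		{ NULL, },
	};

	void __init arch_set_topology(void)
	{
		if (numa_nodes_in_package())
			set_sched_topology(pkg_numa_topology);
		/* else keep the default topology, DIE level included */
	}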


* Re: [RFC][PATCH 4/6] sched: eliminate "DIE" domain level when NUMA present
  2014-09-17 22:33 ` [RFC][PATCH 4/6] sched: eliminate "DIE" domain level when NUMA present Dave Hansen
@ 2014-09-18 17:28   ` Peter Zijlstra
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2014-09-18 17:28 UTC (permalink / raw)
  To: Dave Hansen; +Cc: mingo, hpa, brice.goglin, bp, linux-kernel, dave.hansen

On Wed, Sep 17, 2014 at 03:33:16PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> The "DIE" topology level is currently defined like this:
> 
> static inline const struct cpumask *cpu_cpu_mask(int cpu)
> {
>         return cpumask_of_node(cpu_to_node(cpu));
> }
> 
> But that makes very little sense on a NUMA system, since the
> lowest NUMA domain is guaranteed to be essentially identical to
> this level.
> 
> We leave this for systems that are !CONFIG_NUMA and that
> might need a top-level domain.
> 
> This also keeps us from having screwy topologies when the
> smallest NUMA node is only _part_ of the die.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
> 
>  b/kernel/sched/core.c |    2 ++
>  1 file changed, 2 insertions(+)
> 
> diff -puN kernel/sched/core.c~die-is-NUMA-based-and-screwed-up kernel/sched/core.c
> --- a/kernel/sched/core.c~die-is-NUMA-based-and-screwed-up	2014-09-17 15:28:57.867588315 -0700
> +++ b/kernel/sched/core.c	2014-09-17 15:28:57.873588591 -0700
> @@ -6141,7 +6141,9 @@ static struct sched_domain_topology_leve
>  #ifdef CONFIG_SCHED_MC
>  	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
>  #endif
> +#ifndef CONFIG_NUMA
>  	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
> +#endif
>  	{ NULL, },
>  };

Yeah, no. Also, looking at it now, I see why it worked and how it's
been wrong :-) As you say, it returns the node mask, not the PKG
mask as it should have been doing.

So don't change the default topology; in general I'd say it's still
true that you get one or more packages inside a node. Instead,
override the default topology in arch code.





* Re: [RFC][PATCH 6/6] sched: consolidate config options
  2014-09-17 22:33 ` [RFC][PATCH 6/6] sched: consolidate config options Dave Hansen
@ 2014-09-18 17:29   ` Peter Zijlstra
  2014-09-19 19:15     ` Dave Hansen
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2014-09-18 17:29 UTC (permalink / raw)
  To: Dave Hansen; +Cc: mingo, hpa, brice.goglin, bp, linux-kernel, dave.hansen

On Wed, Sep 17, 2014 at 03:33:20PM -0700, Dave Hansen wrote:
> diff -puN /dev/null kernel/sched/Kconfig
> --- /dev/null	2014-04-10 11:28:14.066815724 -0700
> +++ b/kernel/sched/Kconfig	2014-09-17 15:28:58.620622789 -0700
> @@ -0,0 +1,42 @@
> +config ARCH_ENABLE_SCHED_MC
> +	depends on SMP
> +	def_bool n
> +
> +config ARCH_ENABLE_SCHED_BOOK
> +	depends on SMP
> +	def_bool n
> +
> +config ARCH_ENABLE_SCHED_SMT
> +	depends on SMP
> +	def_bool n
> +
> +config SCHED_MC
> +	bool "Multi-core scheduler support"
> +	default n if S390
> +	default y
> +	depends on ARCH_ENABLE_SCHED_MC
> +	help
> +	  Multi-core scheduler support improves the CPU scheduler's decision
> +	  making when dealing with multi-core CPU chips at a cost of slightly
> +	  increased overhead in some places. If unsure say N here.
> +
> +config SCHED_BOOK
> +	def_bool y
> +	prompt "Book scheduler support"
> +	depends on ARCH_ENABLE_SCHED_BOOK
> +	select SCHED_MC
> +	help
> +	  Book scheduler support improves the CPU scheduler's decision making
> +	  when dealing with machines that have several books.
> +
> +	  Currently only used on s390, which has only a single NUMA node.
> +	  Books are collections of CPUs that are grouped similarly to a NUMA
> +	  node, but without the same memory properties that NUMA nodes have.

Nothing outside of s390 knows about SCHED_BOOK, it doesn't make sense to
have that here.

> +config SCHED_SMT
> +	bool "SMT scheduler support"
> +	depends on ARCH_ENABLE_SCHED_SMT
> +	help
> +	  Improves the CPU scheduler's decision making when dealing with
> +	  MultiThreading at a cost of slightly increased overhead in some
> +	  places. If unsure say N here.
> _


* Re: [RFC][PATCH 6/6] sched: consolidate config options
  2014-09-18 17:29   ` Peter Zijlstra
@ 2014-09-19 19:15     ` Dave Hansen
  2014-09-19 23:03       ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2014-09-19 19:15 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, hpa, brice.goglin, bp, linux-kernel, dave.hansen

On 09/18/2014 10:29 AM, Peter Zijlstra wrote:
>> > +config SCHED_BOOK
>> > +	def_bool y
>> > +	prompt "Book scheduler support"
>> > +	depends on ARCH_ENABLE_SCHED_BOOK
>> > +	select SCHED_MC
>> > +	help
>> > +	  Book scheduler support improves the CPU scheduler's decision making
>> > +	  when dealing with machines that have several books.
>> > +
>> > +	  Currently only used on s390, which has only a single NUMA node.
>> > +	  Books are collections of CPUs that are grouped similarly to a NUMA
>> > +	  node, but without the same memory properties that NUMA nodes have.
> Nothing outside of s390 knows about SCHED_BOOK, it doesn't make sense to
> have that here.

By sticking all of them together, my hope was that folks who were going
to add a topology level could see all of the existing options in a
single place.

But, just say the word and I'll yank it out and repost.


* Re: [RFC][PATCH 6/6] sched: consolidate config options
  2014-09-19 19:15     ` Dave Hansen
@ 2014-09-19 23:03       ` Peter Zijlstra
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2014-09-19 23:03 UTC (permalink / raw)
  To: Dave Hansen; +Cc: mingo, hpa, brice.goglin, bp, linux-kernel, dave.hansen

On Fri, Sep 19, 2014 at 12:15:45PM -0700, Dave Hansen wrote:
> On 09/18/2014 10:29 AM, Peter Zijlstra wrote:
> >> > +config SCHED_BOOK
> >> > +	def_bool y
> >> > +	prompt "Book scheduler support"
> >> > +	depends on ARCH_ENABLE_SCHED_BOOK
> >> > +	select SCHED_MC
> >> > +	help
> >> > +	  Book scheduler support improves the CPU scheduler's decision making
> >> > +	  when dealing with machines that have several books.
> >> > +
> >> > +	  Currently only used on s390, which has only a single NUMA node.
> >> > +	  Books are collections of CPUs that are grouped similarly to a NUMA
> >> > +	  node, but without the same memory properties that NUMA nodes have.
> > Nothing outside of s390 knows about SCHED_BOOK, it doesn't make sense to
> > have that here.
> 
> By sticking all of them together, my hope was that folks who were going
> to add a topology level could see all of the existing options in a
> single place.
> 
> But, just say the word and I'll yank it out and repost.

Yeah, I think it's best to leave it an s390-private affair.


* Re: [RFC][PATCH 0/6] fix topology for multi-NUMA-node CPUs
       [not found] ` <CAOjmkp8EGO0jicmdO=p6ATHz-hUJmWb+xoBLjOdLBUwwGzyhhg@mail.gmail.com>
@ 2014-09-22 15:54   ` Aravind Gopalakrishnan
  0 siblings, 0 replies; 14+ messages in thread
From: Aravind Gopalakrishnan @ 2014-09-22 15:54 UTC (permalink / raw)
  To: dave, Borislav Petkov; +Cc: Ingo Molnar, hpa, brice.goglin, LKML

On 9/22/2014 9:33 AM, Aravind Gopalakrishnan wrote:
> [ full quote of the cover letter snipped ]


Hi,
I looked at the topology info from sysfs both with and without the
patch series applied, and they are identical.  So the patches seem
to work fine on an AMD MCM part.

Tested-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>

Thanks,
-Aravind.

