* [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
@ 2014-09-15 22:26 Dave Hansen
  2014-09-16  3:29 ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2014-09-15 22:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, borislav.petkov, andreas.herrmann3, a.p.zijlstra,
	mingo, hpa, ak


I'm getting the spew below when booting with Haswell (Xeon
E5-2699) CPUs and the "Cluster-on-Die" (CoD) feature enabled in
the BIOS.  It seems similar to the issue that some folks from
AMD ran into on their systems and addressed in this commit:

	http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=161270fc1f9ddfc17154e0d49291472a9cdef7db

Both these Intel and AMD systems break an assumption which is
being enforced by "topology_sane()": a socket may not contain
more than one NUMA node.
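
For reference, the check being tripped is roughly the following (a
sketch reconstructed from the warning text in the spew below; the
actual smpboot.c code may differ in detail):

	static bool topology_sane(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o,
				  const char *name)
	{
		int cpu1 = c->cpu_index, cpu2 = o->cpu_index;

		/* Warn once if the two supposed siblings sit on different nodes. */
		return !WARN_ONCE(cpu_to_node(cpu1) != cpu_to_node(cpu2),
			"sched: CPU #%d's %s-sibling CPU #%d is not on the same node! "
			"[node: %d != %d]. Ignoring dependency.\n",
			cpu1, name, cpu2, cpu_to_node(cpu1), cpu_to_node(cpu2));
	}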

AMD special-cased their systems by looking for a cpuid flag.  The
Intel mode depends on BIOS options, and I do not know of any way to
enumerate it other than the tables that get parsed during the CPU
bringup process.

One system that does this is a weirdo that can be special-cased,
but two makes for an industry trend.  This patch removes the
sanity check completely.

This also fixes sysfs because CPUs with the same 'physical_package_id'
in /sys/devices/system/cpu/cpu*/topology/ are not listed together
in the same 'core_siblings_list'.  This violates a statement from
Documentation/ABI/testing/sysfs-devices-system-cpu:

	core_siblings: internal kernel map of cpu#'s hardware threads
	within the same physical_package_id.

	core_siblings_list: human-readable list of the logical CPU
	numbers within the same physical_package_id as cpu#.

The sysfs effects here cause an issue with the hwloc tool where
it gets confused and thinks there are more sockets than are
physically present.

Before this patch, there are two packages:

# cd /sys/devices/system/cpu/
# cat cpu*/topology/physical_package_id | sort | uniq -c
     18 0
     18 1

But 4 _sets_ of core siblings:

# cat cpu*/topology/core_siblings_list | sort | uniq -c
      9 0-8
      9 18-26
      9 27-35
      9 9-17

After this patch, there are only 2 sets of core siblings, which
is what we expect for a 2-socket system.

# cat cpu*/topology/physical_package_id | sort | uniq -c
     18 0
     18 1
# cat cpu*/topology/core_siblings_list | sort | uniq -c
     18 0-17
     18 18-35


Example spew:
...
	NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
	 #2  #3  #4  #5  #6  #7  #8
	.... node  #1, CPUs:    #9
	------------[ cut here ]------------
	WARNING: CPU: 9 PID: 0 at /home/ak/hle/linux-hle-2.6/arch/x86/kernel/smpboot.c:306 topology_sane.isra.2+0x74/0x90()
	sched: CPU #9's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
	Modules linked in:
	CPU: 9 PID: 0 Comm: swapper/9 Not tainted 3.17.0-rc1-00293-g8e01c4d-dirty #631
	Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0036.R05.1407140519 07/14/2014
	0000000000000009 ffff88046ddabe00 ffffffff8172e485 ffff88046ddabe48
	ffff88046ddabe38 ffffffff8109691d 000000000000b001 0000000000000009
	ffff88086fc12580 000000000000b020 0000000000000009 ffff88046ddabe98
	Call Trace:
	[<ffffffff8172e485>] dump_stack+0x45/0x56
	[<ffffffff8109691d>] warn_slowpath_common+0x7d/0xa0
	[<ffffffff8109698c>] warn_slowpath_fmt+0x4c/0x50
	[<ffffffff81074f94>] topology_sane.isra.2+0x74/0x90
	[<ffffffff8107530e>] set_cpu_sibling_map+0x31e/0x4f0
	[<ffffffff8107568d>] start_secondary+0x1ad/0x240
	---[ end trace 3fe5f587a9fcde61 ]---
	#10 #11 #12 #13 #14 #15 #16 #17
	.... node  #2, CPUs:   #18 #19 #20 #21 #22 #23 #24 #25 #26
	.... node  #3, CPUs:   #27 #28 #29 #30 #31 #32 #33 #34 #35

Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: ak@linux.intel.com

---

 b/arch/x86/kernel/smpboot.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff -puN arch/x86/kernel/smpboot.c~hsw-cod-is-sane arch/x86/kernel/smpboot.c
--- a/arch/x86/kernel/smpboot.c~hsw-cod-is-sane	2014-09-15 14:56:20.012314468 -0700
+++ b/arch/x86/kernel/smpboot.c	2014-09-15 14:58:58.837506644 -0700
@@ -344,10 +344,13 @@ static bool match_llc(struct cpuinfo_x86
 static bool match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 {
 	if (c->phys_proc_id == o->phys_proc_id) {
-		if (cpu_has(c, X86_FEATURE_AMD_DCM))
-			return true;
-
-		return topology_sane(c, o, "mc");
+		/*
+		 * We used to enforce that 'c' and 'o' be on the
+		 * same node, but AMD's DCM and Intel's Cluster-
+		 * on-Die (CoD) support both have physical
+		 * processors that span NUMA nodes.
+		 */
+		return true;
 	}
 	return false;
 }
_


* Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
  2014-09-15 22:26 [PATCH] x86: Consider multiple nodes in a single socket to be "sane" Dave Hansen
@ 2014-09-16  3:29 ` Peter Zijlstra
  2014-09-16  6:38   ` Chuck Ebbert
                     ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-09-16  3:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, borislav.petkov, andreas.herrmann3, mingo, hpa, ak

On Mon, Sep 15, 2014 at 03:26:41PM -0700, Dave Hansen wrote:
> 
> I'm getting the spew below when booting with Haswell (Xeon
> E5-2699) CPUs and the "Cluster-on-Die" (CoD) feature enabled in
> the BIOS. 

What is that cluster-on-die thing? I've heard it before but never could
find anything on it.

> This also fixes sysfs because CPUs with the same 'physical_package_id'
> in /sys/devices/system/cpu/cpu*/topology/ are not listed together
> in the same 'core_siblings_list'.  This violates a statement from
> Documentation/ABI/testing/sysfs-devices-system-cpu:
> 
> 	core_siblings: internal kernel map of cpu#'s hardware threads
> 	within the same physical_package_id.
> 
> 	core_siblings_list: human-readable list of the logical CPU
> 	numbers within the same physical_package_id as cpu#.

No, that statement is wrong; it assumes physical_package_id is a good
identifier for nodes. Clearly this is no longer true.

The idea is that core_siblings (or rather cpu_core_mask) is a mask of
all cores on a node. 

> The sysfs effects here cause an issue with the hwloc tool where
> it gets confused and thinks there are more sockets than are
> physically present.

Meh, so then we need another mask.

The important bit you didn't show was the scheduler domain setup. I
suspect it all works by accident, not by design.

> diff -puN arch/x86/kernel/smpboot.c~hsw-cod-is-sane arch/x86/kernel/smpboot.c
> --- a/arch/x86/kernel/smpboot.c~hsw-cod-is-sane	2014-09-15 14:56:20.012314468 -0700
> +++ b/arch/x86/kernel/smpboot.c	2014-09-15 14:58:58.837506644 -0700
> @@ -344,10 +344,13 @@ static bool match_llc(struct cpuinfo_x86
>  static bool match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
>  {
>  	if (c->phys_proc_id == o->phys_proc_id) {
> -		if (cpu_has(c, X86_FEATURE_AMD_DCM))
> -			return true;
> -
> -		return topology_sane(c, o, "mc");
> +		/*
> +		 * We used to enforce that 'c' and 'o' be on the
> +		 * same node, but AMD's DCM and Intel's Cluster-
> +		 * on-Die (CoD) support both have physical
> +		 * processors that span NUMA nodes.
> +		 */
> +		return true;
>  	}
>  	return false;
>  }

This is wrong (and I suppose the AMD case was already wrong). That
function is supposed to match a multi-core group which is very much
supposed to be smaller-or-equal to a node, not spanning nodes.

The scheduler assumes: SMT <= LLC <= MC <= NODE, and if setting the MC
mask to cover multiple nodes works, it's by accident.
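
For context, the static part of that hierarchy comes from the generic
topology table in kernel/sched/core.c, which looks roughly like the
sketch below (from memory, details may vary by version); the NUMA
levels are stacked on top of it by sched_init_numa():

	static struct sched_domain_topology_level default_topology[] = {
	#ifdef CONFIG_SCHED_SMT
		{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
	#endif
	#ifdef CONFIG_SCHED_MC
		{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
	#endif
		{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
		{ NULL, },
	};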


* Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
  2014-09-16  3:29 ` Peter Zijlstra
@ 2014-09-16  6:38   ` Chuck Ebbert
  2014-09-16  6:44     ` Ingo Molnar
  2014-09-16  8:17   ` Dave Hansen
  2014-09-16 16:59   ` Brice Goglin
  2 siblings, 1 reply; 19+ messages in thread
From: Chuck Ebbert @ 2014-09-16  6:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, linux-kernel, borislav.petkov, andreas.herrmann3,
	mingo, hpa, ak

On Tue, 16 Sep 2014 05:29:20 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Sep 15, 2014 at 03:26:41PM -0700, Dave Hansen wrote:
> > 
> > I'm getting the spew below when booting with Haswell (Xeon
> > E5-2699) CPUs and the "Cluster-on-Die" (CoD) feature enabled in
> > the BIOS. 
> 
> What is that cluster-on-die thing? I've heard it before but never could
> find anything on it.
> 

Each CPU has 2.5MB of L3 connected together in a ring that makes it all act
like a single shared cache. The HW tries to place the data so it's closest to
the CPU that uses it. On the larger processors there are two rings with an
interconnect between them that adds latency if a cache fetch has to cross that.
CoD breaks that connection and effectively gives you two nodes on one die.


* Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
  2014-09-16  6:38   ` Chuck Ebbert
@ 2014-09-16  6:44     ` Ingo Molnar
  2014-09-16  7:03       ` Chuck Ebbert
  2014-09-16 15:59       ` Peter Zijlstra
  0 siblings, 2 replies; 19+ messages in thread
From: Ingo Molnar @ 2014-09-16  6:44 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Peter Zijlstra, Dave Hansen, linux-kernel, borislav.petkov,
	andreas.herrmann3, hpa, ak


* Chuck Ebbert <cebbert.lkml@gmail.com> wrote:

> On Tue, 16 Sep 2014 05:29:20 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Mon, Sep 15, 2014 at 03:26:41PM -0700, Dave Hansen wrote:
> > > 
> > > I'm getting the spew below when booting with Haswell (Xeon 
> > > E5-2699) CPUs and the "Cluster-on-Die" (CoD) feature 
> > > enabled in the BIOS.
> > 
> > What is that cluster-on-die thing? I've heard it before but 
> > never could find anything on it.
> 
> Each CPU has 2.5MB of L3 connected together in a ring that 
> makes it all act like a single shared cache. The HW tries to 
> place the data so it's closest to the CPU that uses it. On the 
> larger processors there are two rings with an interconnect 
> between them that adds latency if a cache fetch has to cross 
> that. CoD breaks that connection and effectively gives you two 
> nodes on one die.

Note that that's not really a 'NUMA node' in the way lots of 
places in the kernel assume it: permanent placement asymmetry
(and access cost asymmetry) of RAM.

It's a new topology construct that needs new handling (and 
probably a new mask): Non Uniform Cache Architecture (NUCA)
or so.

Thanks,

	Ingo


* Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
  2014-09-16  6:44     ` Ingo Molnar
@ 2014-09-16  7:03       ` Chuck Ebbert
  2014-09-16  7:05         ` Ingo Molnar
  2014-09-16 16:01         ` Peter Zijlstra
  2014-09-16 15:59       ` Peter Zijlstra
  1 sibling, 2 replies; 19+ messages in thread
From: Chuck Ebbert @ 2014-09-16  7:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Dave Hansen, linux-kernel, borislav.petkov,
	andreas.herrmann3, hpa, ak

On Tue, 16 Sep 2014 08:44:03 +0200
Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Chuck Ebbert <cebbert.lkml@gmail.com> wrote:
> 
> > On Tue, 16 Sep 2014 05:29:20 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Mon, Sep 15, 2014 at 03:26:41PM -0700, Dave Hansen wrote:
> > > > 
> > > > I'm getting the spew below when booting with Haswell (Xeon 
> > > > E5-2699) CPUs and the "Cluster-on-Die" (CoD) feature 
> > > > enabled in the BIOS.
> > > 
> > > What is that cluster-on-die thing? I've heard it before but 
> > > never could find anything on it.
> > 
> > Each CPU has 2.5MB of L3 connected together in a ring that 
> > makes it all act like a single shared cache. The HW tries to 
> > place the data so it's closest to the CPU that uses it. On the 
> > larger processors there are two rings with an interconnect 
> > between them that adds latency if a cache fetch has to cross 
> > that. CoD breaks that connection and effectively gives you two 
> > nodes on one die.
> 
> Note that that's not really a 'NUMA node' in the way lots of 
> places in the kernel assume it: permanent placement assymetry 
> (and access cost assymetry) of RAM.
> 
> It's a new topology construct that needs new handling (and 
> probably a new mask): Non Uniform Cache Architecture (NUCA)
> or so.

Hmm, looking closer at the diagram, each ring has its own memory controller, so
it really is NUMA if you break the interconnect between the caches.


* Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
  2014-09-16  7:03       ` Chuck Ebbert
@ 2014-09-16  7:05         ` Ingo Molnar
  2014-09-16 16:01         ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2014-09-16  7:05 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Peter Zijlstra, Dave Hansen, linux-kernel, borislav.petkov,
	andreas.herrmann3, hpa, ak


* Chuck Ebbert <cebbert.lkml@gmail.com> wrote:

> On Tue, 16 Sep 2014 08:44:03 +0200
> Ingo Molnar <mingo@kernel.org> wrote:
> 
> > 
> > * Chuck Ebbert <cebbert.lkml@gmail.com> wrote:
> > 
> > > On Tue, 16 Sep 2014 05:29:20 +0200
> > > Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > > On Mon, Sep 15, 2014 at 03:26:41PM -0700, Dave Hansen wrote:
> > > > > 
> > > > > I'm getting the spew below when booting with Haswell (Xeon 
> > > > > E5-2699) CPUs and the "Cluster-on-Die" (CoD) feature 
> > > > > enabled in the BIOS.
> > > > 
> > > > What is that cluster-on-die thing? I've heard it before but 
> > > > never could find anything on it.
> > > 
> > > Each CPU has 2.5MB of L3 connected together in a ring that 
> > > makes it all act like a single shared cache. The HW tries 
> > > to place the data so it's closest to the CPU that uses it. 
> > > On the larger processors there are two rings with an 
> > > interconnect between them that adds latency if a cache 
> > > fetch has to cross that. CoD breaks that connection and 
> > > effectively gives you two nodes on one die.
> > 
> > Note that that's not really a 'NUMA node' in the way lots of 
> > places in the kernel assume it: permanent placement assymetry 
> > (and access cost assymetry) of RAM.
> > 
> > It's a new topology construct that needs new handling (and 
> > probably a new mask): Non Uniform Cache Architecture (NUCA) 
> > or so.
> 
> Hmm, looking closer at the diagram, each ring has its own 
> memory controller, so it really is NUMA if you break the 
> interconnect between that caches.

Fair enough, I only went by the description.

Thanks,

	Ingo


* Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
  2014-09-16  3:29 ` Peter Zijlstra
  2014-09-16  6:38   ` Chuck Ebbert
@ 2014-09-16  8:17   ` Dave Hansen
  2014-09-16 10:07     ` Heiko Carstens
                       ` (2 more replies)
  2014-09-16 16:59   ` Brice Goglin
  2 siblings, 3 replies; 19+ messages in thread
From: Dave Hansen @ 2014-09-16  8:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, hpa, ak, Alex Chiang, Borislav Petkov,
	Rusty Russell, Mike Travis, Greg Kroah-Hartman, Heiko Carstens

On 09/15/2014 08:29 PM, Peter Zijlstra wrote:
> On Mon, Sep 15, 2014 at 03:26:41PM -0700, Dave Hansen wrote:
>> I'm getting the spew below when booting with Haswell (Xeon
>> E5-2699) CPUs and the "Cluster-on-Die" (CoD) feature enabled in
>> the BIOS. 
> 
> What is that cluster-on-die thing? I've heard it before but never could
> find anything on it.

You split each socket into halves.  Each half gets slightly lower
latency to the memory on its half's memory controller, but at an
increased latency cost to reach every other bit of memory on the
system.  I've got some data squirreled away on exactly what the
latencies are, but it really is a substantial increase, especially
for the memory that is still on package but is now "off-node".

>> This also fixes sysfs because CPUs with the same 'physical_package_id'
>> in /sys/devices/system/cpu/cpu*/topology/ are not listed together
>> in the same 'core_siblings_list'.  This violates a statement from
>> Documentation/ABI/testing/sysfs-devices-system-cpu:
>>
>> 	core_siblings: internal kernel map of cpu#'s hardware threads
>> 	within the same physical_package_id.
>>
>> 	core_siblings_list: human-readable list of the logical CPU
>> 	numbers within the same physical_package_id as cpu#.
> 
> No that statement is wrong; it assumes physical_package_id is a good
> identifier for nodes. Clearly this is no longer true.

I went back and re-read the text around there.  I don't see any mention
of nodes or NUMA.  It's just talking about the topology within the CPU
from what I can tell.

Good thing it's just in "Documentation/ABI/testing/" I guess.

Either way, are you saying that a core's "physical_package_id" should
depend on what BIOS options get set, and that different cores in the
same physical package should have different "physical_package_id"s?

> The idea is that core_siblings (or rather cpu_core_mask) is a mask of
> all cores on a node. 

Why would we need that in the CPU topology information if we can get at
it easily here?

	/sys/devices/system/cpu/cpu0/node0/cpulist
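
In-kernel, the same information is available via cpumask_of_node(); a
minimal sketch (assuming node 0 is online):

	const struct cpumask *node_cpus = cpumask_of_node(0);
	int cpu;

	/* walks the CPUs that sysfs would list in node0/cpulist */
	for_each_cpu(cpu, node_cpus)
		pr_info("cpu%d is on node 0\n", cpu);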

>> The sysfs effects here cause an issue with the hwloc tool where
>> it gets confused and thinks there are more sockets than are
>> physically present.
> 
> Meh, so then we need another mask.

s390 has this concept of a "book", although it's not at all obvious
to me where it fits into the hierarchy.  The help text and comments
are pretty funny:

	Book scheduler support improves the CPU scheduler's decision
	making when dealing with machines that have several books.

Wow, really?  Book scheduling helps with books?  I would have never
guessed.  Wish I knew what a book was. :)  Google helped a bit:

	http://sysmagazine.com/posts/150430/

but it's still not clear.  I think a book is below a node?

	SMT <= LLC <= MC <= BOOK <= NODE

Or is a book a collection of nodes?

	SMT <= LLC <= MC <= NODE <= BOOK

> The important bit you didn't show was the scheduler domain setup. I
> suspect it all works by accident, not by design.

So you want full output from a boot with CONFIG_SCHED_DEBUG=y?

>> diff -puN arch/x86/kernel/smpboot.c~hsw-cod-is-sane arch/x86/kernel/smpboot.c
>> --- a/arch/x86/kernel/smpboot.c~hsw-cod-is-sane	2014-09-15 14:56:20.012314468 -0700
>> +++ b/arch/x86/kernel/smpboot.c	2014-09-15 14:58:58.837506644 -0700
>> @@ -344,10 +344,13 @@ static bool match_llc(struct cpuinfo_x86
>>  static bool match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
>>  {
>>  	if (c->phys_proc_id == o->phys_proc_id) {
>> -		if (cpu_has(c, X86_FEATURE_AMD_DCM))
>> -			return true;
>> -
>> -		return topology_sane(c, o, "mc");
>> +		/*
>> +		 * We used to enforce that 'c' and 'o' be on the
>> +		 * same node, but AMD's DCM and Intel's Cluster-
>> +		 * on-Die (CoD) support both have physical
>> +		 * processors that span NUMA nodes.
>> +		 */
>> +		return true;
>>  	}
>>  	return false;
>>  }
> 
> This is wrong (and I suppose the AMD case was already wrong). That
> function is supposed to match a multi-core group which is very much
> supposed to be smaller-or-equal to a node, not spanning nodes.
> 
> The scheduler assumes: SMT <= LLC <= MC <= NODE and if setting the MC
> mask to cover multiple nodes works, its by accident.

This makes a lot of sense to enforce inside the scheduler.  But, do we
really need to expose all the naming all the way out to userspace?



* Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
  2014-09-16  8:17   ` Dave Hansen
@ 2014-09-16 10:07     ` Heiko Carstens
  2014-09-16 17:58     ` Peter Zijlstra
  2014-09-17 12:55     ` Borislav Petkov
  2 siblings, 0 replies; 19+ messages in thread
From: Heiko Carstens @ 2014-09-16 10:07 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, linux-kernel, mingo, hpa, ak, Alex Chiang,
	Borislav Petkov, Rusty Russell, Mike Travis, Greg Kroah-Hartman

On Tue, Sep 16, 2014 at 01:17:44AM -0700, Dave Hansen wrote:
> >> The sysfs effects here cause an issue with the hwloc tool where
> >> it gets confused and thinks there are more sockets than are
> >> physically present.
> > 
> > Meh, so then we need another mask.
> 
> s390 has this concept of a "book" although it's not obvious at all where
> this fits in to the hierarchy to me.  The help text and comments are
> pretty funny:
> 
> 	Book scheduler support improves the CPU scheduler's decision
> 	making when dealing with machines that have several books.
> 
> Wow, really?  Book scheduling helps with books?  I would have never
> guessed.

I'm glad the help text did enlighten you.

>  Wish I knew what a book was. :)  Google helped a bit:
> 	http://sysmagazine.com/posts/150430/
> but it's still not clear.  I think a book is below a node?
> 
> 	SMT <= LLC <= MC <= BOOK <= NODE

This one is correct. s390 does not have NUMA support. So we have one
node with multiple books.
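
For reference, the s390 override slots BOOK in between MC and the
whole machine, approximately like this (a sketch; the exact
arch/s390 table may differ):

	static struct sched_domain_topology_level s390_topology[] = {
		{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
		{ cpu_book_mask, SD_INIT_NAME(BOOK) },
		{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
		{ NULL, },
	};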



* Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
  2014-09-16  6:44     ` Ingo Molnar
  2014-09-16  7:03       ` Chuck Ebbert
@ 2014-09-16 15:59       ` Peter Zijlstra
  2014-09-16 16:36         ` Dave Hansen
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-09-16 15:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chuck Ebbert, Dave Hansen, linux-kernel, borislav.petkov,
	andreas.herrmann3, hpa, ak

On Tue, Sep 16, 2014 at 08:44:03AM +0200, Ingo Molnar wrote:
> 
> * Chuck Ebbert <cebbert.lkml@gmail.com> wrote:
> 
> > On Tue, 16 Sep 2014 05:29:20 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Mon, Sep 15, 2014 at 03:26:41PM -0700, Dave Hansen wrote:
> > > > 
> > > > I'm getting the spew below when booting with Haswell (Xeon 
> > > > E5-2699) CPUs and the "Cluster-on-Die" (CoD) feature 
> > > > enabled in the BIOS.
> > > 
> > > What is that cluster-on-die thing? I've heard it before but 
> > > never could find anything on it.
> > 
> > Each CPU has 2.5MB of L3 connected together in a ring that 
> > makes it all act like a single shared cache. The HW tries to 
> > place the data so it's closest to the CPU that uses it. On the 
> > larger processors there are two rings with an interconnect 
> > between them that adds latency if a cache fetch has to cross 
> > that. CoD breaks that connection and effectively gives you two 
> > nodes on one die.
> 
> Note that that's not really a 'NUMA node' in the way lots of 
> places in the kernel assume it: permanent placement assymetry 
> (and access cost assymetry) of RAM.

Agreed, that is not NUMA; both groups will have the exact same local
DRAM latency (unlike the AMD thing which has two memory busses on the
single package, and therefore really has two nodes on a single chip).

This also means the CoD thing sets up the NUMA masks incorrectly.

> It's a new topology construct that needs new handling (and 
> probably a new mask): Non Uniform Cache Architecture (NUCA)
> or so.

It's not new; many chips do this, most notably the Core2Quad, which
was basically two Core2Duo dies in a single package. So you have 2
cores sharing cache, and 2x2 cores sharing the memory bus.

Also, the Silvermont 'modules' do basically the same thing: multiple
groups of cores that share cache sit on the same memory bus.

This is not new, and we can represent it just fine; it just looks like
the Intel CoD stuff sets up the NUMA masks wrong.


* Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
  2014-09-16  7:03       ` Chuck Ebbert
  2014-09-16  7:05         ` Ingo Molnar
@ 2014-09-16 16:01         ` Peter Zijlstra
  2014-09-16 16:46           ` Dave Hansen
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-09-16 16:01 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Ingo Molnar, Dave Hansen, linux-kernel, borislav.petkov,
	andreas.herrmann3, hpa, ak

On Tue, Sep 16, 2014 at 02:03:00AM -0500, Chuck Ebbert wrote:
> Hmm, looking closer at the diagram, each ring has its own memory controller, so
> it really is NUMA if you break the interconnect between that caches.

How does it do that? Does it split the DIMM slots in two as well, with
half for the one node and the other half for the other? Or will both
'nodes' share the same local memory?

In the first case it's full NUMA; in the second case, not so much.


* Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
  2014-09-16 15:59       ` Peter Zijlstra
@ 2014-09-16 16:36         ` Dave Hansen
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Hansen @ 2014-09-16 16:36 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Chuck Ebbert, linux-kernel, borislav.petkov, andreas.herrmann3, hpa, ak

On 09/16/2014 08:59 AM, Peter Zijlstra wrote:
> On Tue, Sep 16, 2014 at 08:44:03AM +0200, Ingo Molnar wrote:
>> Note that that's not really a 'NUMA node' in the way lots of 
>> places in the kernel assume it: permanent placement assymetry 
>> (and access cost assymetry) of RAM.
> 
> Agreed, that is not NUMA, both groups will have the exact same local
> DRAM latency (unlike the AMD thing which has two memory busses on the
> single package, and therefore really has two nodes on a single chip).

I don't think this is correct.

From my testing, each ring of CPUs has a "close" and "far" memory
controller in the socket.

> This also means the CoD thing sets up the NUMA masks incorrectly.

I used this publicly-available Intel tool:

https://software.intel.com/en-us/articles/intelr-memory-latency-checker

And ran various combinations pinning the latency checker to various CPUs
and NUMA nodes.

Here's what I think the SLIT table should look like with cluster-on-die
disabled.  There is one node per socket and the latency to the other
node is 1.5x the latency to the local node:

*      0     1
0     10    15
1     15    10

or, measured in ns:
*      0     1
0     76   119
1    114    76

Enabling cluster-on-die, we get 4 nodes.  The local memory in the same
socket gets faster, and remote memory in the same socket gets both
absolutely and relatively slower:

*      0     1     2     3
0     10    20    26    26
1     20    10    26    26
2     26    26    10    20
3     26    26    20    10

and in ns:
*      0     1     2     3
0   74.8 152.3 190.6 200.4
1  146.2  75.6 190.8 200.6
2  185.1 195.5  74.5 150.1
3  186.6 195.6 147.3  75.6

So I think it really is reasonable to say that there are 2 NUMA nodes in
a socket.

BTW, these numbers are only approximate.  They were not run under
particularly controlled conditions and I don't even remember what kernel
they were under.



* Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
  2014-09-16 16:01         ` Peter Zijlstra
@ 2014-09-16 16:46           ` Dave Hansen
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Hansen @ 2014-09-16 16:46 UTC (permalink / raw)
  To: Peter Zijlstra, Chuck Ebbert
  Cc: Ingo Molnar, linux-kernel, borislav.petkov, andreas.herrmann3, hpa, ak

On 09/16/2014 09:01 AM, Peter Zijlstra wrote:
> On Tue, Sep 16, 2014 at 02:03:00AM -0500, Chuck Ebbert wrote:
>> Hmm, looking closer at the diagram, each ring has its own memory controller, so
>> it really is NUMA if you break the interconnect between that caches.
> 
> How does it do that? Does it split the DIMM slots in two as well, with
> half for the one node and the other half for the other? Or will both
> 'nodes' share the same local memory?

I think the diagrams in here are accurate in describing the rings:

http://www.enterprisetech.com/2014/09/08/intel-ups-performance-ante-haswell-xeon-chips/

The "nodes" each get their own memory controller and exclusive set of DIMMs.


* Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
  2014-09-16  3:29 ` Peter Zijlstra
  2014-09-16  6:38   ` Chuck Ebbert
  2014-09-16  8:17   ` Dave Hansen
@ 2014-09-16 16:59   ` Brice Goglin
  2 siblings, 0 replies; 19+ messages in thread
From: Brice Goglin @ 2014-09-16 16:59 UTC (permalink / raw)
  To: Peter Zijlstra, Dave Hansen
  Cc: linux-kernel, borislav.petkov, andreas.herrmann3, mingo, hpa, ak

On 16/09/2014 05:29, Peter Zijlstra wrote:
>
>> This also fixes sysfs because CPUs with the same 'physical_package_id'
>> in /sys/devices/system/cpu/cpu*/topology/ are not listed together
>> in the same 'core_siblings_list'.  This violates a statement from
>> Documentation/ABI/testing/sysfs-devices-system-cpu:
>>
>> 	core_siblings: internal kernel map of cpu#'s hardware threads
>> 	within the same physical_package_id.
>>
>> 	core_siblings_list: human-readable list of the logical CPU
>> 	numbers within the same physical_package_id as cpu#.
> No that statement is wrong; it assumes physical_package_id is a good
> identifier for nodes. Clearly this is no longer true.
>
> The idea is that core_siblings (or rather cpu_core_mask) is a mask of
> all cores on a node.

or rather node_core_mask since it's not clear whether a CPU is a package
or a die or a node or whatever?

Aside from the bad naming, I fail to see why core_siblings couldn't be
package-wide.

Node-wide masks are already in /sys/devices/system/node/node*/cpumap

>> The sysfs effects here cause an issue with the hwloc tool where
>> it gets confused and thinks there are more sockets than are
>> physically present.
> Meh, so then we need another mask.

Note that AMD tried to add a "cpu_node" mask back in 2009 for the same
reason for their 6xxx CPUs but it was rejected :/
https://lkml.org/lkml/2009/6/3/244

>> diff -puN arch/x86/kernel/smpboot.c~hsw-cod-is-sane arch/x86/kernel/smpboot.c
>> --- a/arch/x86/kernel/smpboot.c~hsw-cod-is-sane	2014-09-15 14:56:20.012314468 -0700
>> +++ b/arch/x86/kernel/smpboot.c	2014-09-15 14:58:58.837506644 -0700
>> @@ -344,10 +344,13 @@ static bool match_llc(struct cpuinfo_x86
>>  static bool match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
>>  {
>>  	if (c->phys_proc_id == o->phys_proc_id) {
>> -		if (cpu_has(c, X86_FEATURE_AMD_DCM))
>> -			return true;
>> -
>> -		return topology_sane(c, o, "mc");
>> +		/*
>> +		 * We used to enforce that 'c' and 'o' be on the
>> +		 * same node, but AMD's DCM and Intel's Cluster-
>> +		 * on-Die (CoD) support both have physical
>> +		 * processors that span NUMA nodes.
>> +		 */
>> +		return true;
>>  	}
>>  	return false;
>>  }
> This is wrong (and I suppose the AMD case was already wrong). That
> function is supposed to match a multi-core group which is very much
> supposed to be smaller-or-equal to a node, not spanning nodes.

Well, even if it's wrong, it matches what the sysfs doc says. We have
been using this ABI in userspace tools for years and it always properly
detected AMD CPUs. So if you want to break things, please remove
core_siblings entirely and add new well-defined package-wide and
node-wide masks :/

Brice



* Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
  2014-09-16  8:17   ` Dave Hansen
  2014-09-16 10:07     ` Heiko Carstens
@ 2014-09-16 17:58     ` Peter Zijlstra
  2014-09-16 23:49       ` Dave Hansen
  2014-09-17 12:55     ` Borislav Petkov
  2 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-09-16 17:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, mingo, hpa, ak, Alex Chiang, Borislav Petkov,
	Rusty Russell, Mike Travis, Greg Kroah-Hartman, Heiko Carstens

On Tue, Sep 16, 2014 at 01:17:44AM -0700, Dave Hansen wrote:
> On 09/15/2014 08:29 PM, Peter Zijlstra wrote:

> > What is that cluster-on-die thing? I've heard it before but never could
> > find anything on it.
> 
> You split each socket in to halves.  Each half gets slightly lower
> latency to the memory on its half's memory controller.  But, at an
> increased latency cost to go to every other bit of memory on the system.
>  I've got some data squirreled away on exactly what the latencies are,
> but it really is a substantial increase for memory, especially for the
> memory that is still on package but is now "off-node".

So it really splits the local memory DIMMs in two as well, ok.

> I went back and re-read the text around there.  I don't see any mention
> of nodes or NUMA.  It's just talking about the topology within the CPU
> from what I can tell.
> 
> Good thing it's just in "Documentation/ABI/testing/" I guess.
> 
> Either way, are you saying that a core's "physical_package_id" should
> depend on what BIOS options get set, and that different cores in the
> same physical package should have different "physical_package_id"s?

Yah, I more or less changed my mind halfway through the reply. I suppose
it makes sense to have this mask as defined.  It just doesn't match
how the mask is used in the scheduler (which is the main in-kernel
user of the thing).

> > The idea is that core_siblings (or rather cpu_core_mask) is a mask of
> > all cores on a node. 
> 
> Why would we need that in the CPU topology information if we can get at
> it easily here?
> 
> 	/sys/devices/system/cpu/cpu0/node0/cpulist

Good point.  The 'problem' is that we currently have a static order of
the masks; if we were to flip the MC and NUMA masks, we would need a
condition to do that on and have to make sure everything is aware of
it.

CoD not being detectable sucks arse for sure :/

Also, I think we want to rename the mask to not be MC but PKG or so.

> s390 has this concept of a "book" although it's not obvious at all where
> this fits in to the hierarchy to me.  The help text and comments are
> pretty funny:
> 
> 	Book scheduler support improves the CPU scheduler's decision
> 	making when dealing with machines that have several books.
> 
> Wow, really?  Book scheduling helps with books?  I would have never
> guessed.  Wish I knew what a book was. :)  Google helped a bit:
> 
> 	http://sysmagazine.com/posts/150430/
> 
> but it's still not clear.  I think a book is below a node?
> 
> 	SMT <= LLC <= MC <= BOOK <= NODE
> 
> Or is a book a collection of nodes?
> 
> 	SMT <= LLC <= MC <= NODE <= BOOK

s390 is 'special', you should know that :-)  Since s390 is a
virtualized platform, they cannot do NUMA, because the vCPUs can move
around freely. So what they do to 'hide' stuff is big L4 caches, which
they call BOOKS.

> > The scheduler assumes: SMT <= LLC <= MC <= NODE and if setting the MC
> > mask to cover multiple nodes works, its by accident.
> 
> This makes a lot of sense to enforce inside the scheduler.  But, do we
> really need to expose all the naming all the way out to userspace?

Dunno... let's start out by not doing that (the safe option).


* Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
  2014-09-16 17:58     ` Peter Zijlstra
@ 2014-09-16 23:49       ` Dave Hansen
  2014-09-17 22:57         ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2014-09-16 23:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, hpa, ak, Alex Chiang, Borislav Petkov,
	Rusty Russell, Mike Travis, Greg Kroah-Hartman, Heiko Carstens

On 09/16/2014 10:58 AM, Peter Zijlstra wrote:
> the 'problem' is that we currently have the static order of
> the masks, if we were to flip the MC and NUMA masks we need a condition
> to do that on and make sure everything is aware of that.
> 
> CoD not being detectable sucks arse for sure :/

It's not like we can't detect it.  We just can't detect it *explicitly*.
 I think when we see nodes inside a package now, we have to trust that
they're OK.
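
Implicit detection could be as simple as noticing that the package
mask spans more than one node; a hypothetical sketch (the helper name
and placement are made up):

	/* Does this CPU's physical package contain more than one NUMA node? */
	static bool numa_nodes_in_package(int cpu)
	{
		int sibling, node = cpu_to_node(cpu);

		for_each_cpu(sibling, topology_core_cpumask(cpu))
			if (cpu_to_node(sibling) != node)
				return true;
		return false;
	}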

One other data point here.  With an unpatched mainline, here's how the
sched domains look with cluster-on-die enabled:

# grep . /proc/sys/kernel/sched_domain/cpu9/domain?/name
/proc/sys/kernel/sched_domain/cpu9/domain0/name:SMT  // 2 threads
/proc/sys/kernel/sched_domain/cpu9/domain1/name:MC   // 18 threads cores
/proc/sys/kernel/sched_domain/cpu9/domain2/name:NUMA // 36 threads
/proc/sys/kernel/sched_domain/cpu9/domain3/name:NUMA // 72 threads

and with cluster-on-die disabled:

# grep . /proc/sys/kernel/sched_domain/cpu9/domain?/name
/proc/sys/kernel/sched_domain/cpu9/domain0/name:SMT
/proc/sys/kernel/sched_domain/cpu9/domain1/name:MC
/proc/sys/kernel/sched_domain/cpu9/domain2/name:NUMA

So, shockingly, the domains seem to be set up at least conceptually
OK in both cases.
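
That is presumably because sched_init_numa() builds the NUMA levels
from the SLIT distances (node_distance()) rather than from the SMT/MC
topology table: roughly one NUMA domain per distinct remote distance,
which gives two NUMA levels for the CoD SLIT shown earlier (10/20/26)
and one without CoD (10/15).  A simplified sketch of the "distinct
distances" part (not the real sched_init_numa() code):

	static int count_distinct_node_distances(void)
	{
		int i, j, k, nr = 0, seen[8];

		for_each_online_node(i)
			for_each_online_node(j) {
				int d = node_distance(i, j);
				bool dup = false;

				for (k = 0; k < nr; k++)
					if (seen[k] == d)
						dup = true;
				if (!dup && nr < 8)
					seen[nr++] = d;
			}
		return nr;	/* 2 without CoD, 3 with CoD on this box */
	}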

I think the domains in this case should _probably_ be conceptually:

	SMT -> COD_NUMA -> PKG -> SOCKET_NUMA

We could probably rig up sched_init_numa() to mix topology levels
between the ones that come out of sched_domain_topology and the NUMA
levels, although that doesn't sound very appealing.

Another option would be to:
1. Add a new "PKG" level and actually _build_ it with phys_proc_id
2. Make sure to tie the sysfs 'core_siblings' file to PKG
3. Leave the "MC" level as it is now, but define it as being the lowest-
   common-denominator of core grouping.  In other words, the "MC" group
   will stop at a NUMA node or a socket boundary, whichever it sees
   first.
4. Chop the "COD_NUMA" level off in sched_init_numa()

That seems a bit hackish though.


* Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
  2014-09-16  8:17   ` Dave Hansen
  2014-09-16 10:07     ` Heiko Carstens
  2014-09-16 17:58     ` Peter Zijlstra
@ 2014-09-17 12:55     ` Borislav Petkov
  2014-09-18  7:32       ` Borislav Petkov
  2 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2014-09-17 12:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, linux-kernel, mingo, hpa, ak, Alex Chiang,
	Borislav Petkov, Rusty Russell, Mike Travis, Greg Kroah-Hartman,
	Heiko Carstens

On Tue, Sep 16, 2014 at 01:17:44AM -0700, Dave Hansen wrote:
> >> This also fixes sysfs because CPUs with the same 'physical_package_id'
> >> in /sys/devices/system/cpu/cpu*/topology/ are not listed together
> >> in the same 'core_siblings_list'.  This violates a statement from
> >> Documentation/ABI/testing/sysfs-devices-system-cpu:
> >>
> >> 	core_siblings: internal kernel map of cpu#'s hardware threads
> >> 	within the same physical_package_id.
> >>
> >> 	core_siblings_list: human-readable list of the logical CPU
> >> 	numbers within the same physical_package_id as cpu#.
> > 
> > No that statement is wrong; it assumes physical_package_id is a good
> > identifier for nodes. Clearly this is no longer true.
> 
> I went back and re-read the text around there.  I don't see any mention
> of nodes or NUMA.  It's just talking about the topology within the CPU
> from what I can tell.
> 
> Good thing it's just in "Documentation/ABI/testing/" I guess.
> 
> Either way, are you saying that a core's "physical_package_id" should
> depend on what BIOS options get set, and that different cores in the
> same physical package should have different "physical_package_id"s?

This sounds misleading to me. If I had to explain how I understand
physical_package_id, I'd say it is the physical piece of silicon
containing the core, which is consistent with what Peter says: using
it to identify NUMA nodes is wrong.

Btw, I'm trying to get on an AMD MCM box to dump those fields but it is
kinda hard currently. Will report back once I have something...

-- 
Regards/Gruss,
    Boris.
--


* Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
  2014-09-16 23:49       ` Dave Hansen
@ 2014-09-17 22:57         ` Peter Zijlstra
  2014-09-18  0:33           ` Dave Hansen
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-09-17 22:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, mingo, hpa, ak, Alex Chiang, Borislav Petkov,
	Rusty Russell, Mike Travis, Greg Kroah-Hartman, Heiko Carstens

On Tue, Sep 16, 2014 at 04:49:03PM -0700, Dave Hansen wrote:
> On 09/16/2014 10:58 AM, Peter Zijlstra wrote:
> > the 'problem' is that we currently have the static order of
> > the masks, if we were to flip the MC and NUMA masks we need a condition
> > to do that on and make sure everything is aware of that.
> > 
> > CoD not being detectable sucks arse for sure :/
> 
> It's not like we can't detect it.  We just can't detect it *explicitly*.

That's still suckage :-)

>  I think when we see nodes inside a package now, we have to trust that
> they're OK.

It's all we can do.

> One other data point here.  With an unpatched mainline, here's how the
> sched domains look with cluster-on-die enabled:
> 
> # grep . /proc/sys/kernel/sched_domain/cpu9/domain?/name
> /proc/sys/kernel/sched_domain/cpu9/domain0/name:SMT  // 2 threads
> /proc/sys/kernel/sched_domain/cpu9/domain1/name:MC   // 18 threads cores
> /proc/sys/kernel/sched_domain/cpu9/domain2/name:NUMA // 36 threads
> /proc/sys/kernel/sched_domain/cpu9/domain3/name:NUMA // 72 threads
> 
> and with cluster-on-die disabled:
> 
> # grep . /proc/sys/kernel/sched_domain/cpu9/domain?/name
> /proc/sys/kernel/sched_domain/cpu9/domain0/name:SMT
> /proc/sys/kernel/sched_domain/cpu9/domain1/name:MC
> /proc/sys/kernel/sched_domain/cpu9/domain2/name:NUMA
> 
> So, shockingly, the domains seem to be set up at at least conceptually
> OK in both cases.

Yeah, it works out by accident. It had to, otherwise these patches
wouldn't work.

> We could probably rig up sched_init_numa() to mix topology levels
> between the ones that come out of sched_domain_topology and the NUMA
> levels, although that doesn't sound very appealing.
> 
> Another option would be to:
> 1. Add a new "PKG" level and actually _build_ it with phys_proc_id
> 2. Make sure to tie the sysfs 'core_siblings' file to PKG
> 3. Leave the "MC" level as it is now, but define it as being the lowest-
>    common-denominator of core grouping.  In other words, the "MC" group
>    will stop at a NUMA node or a socket boundary, whichever it sees
>    first.
> 4. Chop the "COD_NUMA" level off in sched_init_numa()

No, we should provide an arch override for sched_domain_topology which
has the right setup for the detected topology.

See arm, powerpc and s390, which already have this.


* Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
  2014-09-17 22:57         ` Peter Zijlstra
@ 2014-09-18  0:33           ` Dave Hansen
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Hansen @ 2014-09-18  0:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, hpa, ak, Alex Chiang, Borislav Petkov,
	Rusty Russell, Mike Travis, Greg Kroah-Hartman, Heiko Carstens

On 09/17/2014 03:57 PM, Peter Zijlstra wrote:
>> > Another option would be to:
>> > 1. Add a new "PKG" level and actually _build_ it with phys_proc_id
>> > 2. Make sure to tie the sysfs 'core_siblings' file to PKG
>> > 3. Leave the "MC" level as it is now, but define it as being the lowest-
>> >    common-denominator of core grouping.  In other words, the "MC" group
>> >    will stop at a NUMA node or a socket boundary, whichever it sees
>> >    first.
>> > 4. Chop the "COD_NUMA" level off in sched_init_numa()
> No, we should provide an arch override for sched_domain_topology which
> has the right setup for the detected topology.
> 
> See arm,powerpc and s390, which already have his.

I _think_ the problem here is that all of the existing topologies
describe something below the smallest NUMA node.  With the
cluster-on-die stuff, we've got a node boundary smack in the middle of
the existing topology levels.

So, if we're going to do this with a new sched_domain_topology, we've
got to make sure that we eliminate all of the levels below that first
NUMA node.  I think that means just have a topology with the SMT level
alone in it.

Does that work?
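
For illustration, a minimal sketch of what that could look like
(hypothetical names, untested): a cut-down level list that stops
below the in-package node boundary, installed via set_sched_topology()
when nodes are detected inside a package:

	static struct sched_domain_topology_level x86_numa_in_package_topology[] = {
	#ifdef CONFIG_SCHED_SMT
		{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
	#endif
		{ NULL, },
	};

	static void __init x86_topology_fixup(void)
	{
		/* 'nodes inside a package' check as sketched earlier in the thread */
		if (numa_nodes_in_package(0))
			set_sched_topology(x86_numa_in_package_topology);
	}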




* Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"
  2014-09-17 12:55     ` Borislav Petkov
@ 2014-09-18  7:32       ` Borislav Petkov
  0 siblings, 0 replies; 19+ messages in thread
From: Borislav Petkov @ 2014-09-18  7:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, linux-kernel, mingo, hpa, ak, Alex Chiang,
	Borislav Petkov, Rusty Russell, Mike Travis, Greg Kroah-Hartman,
	Heiko Carstens

On Wed, Sep 17, 2014 at 02:55:26PM +0200, Borislav Petkov wrote:
> This sounds misleading to me. If I would have to explain how I
> understand physical_package_id, I'd say it is the physical piece of
> silicon containing the core. Which is consistent with what Peter says
> that using it to identify NUMA nodes is wrong.
> 
> Btw, I'm trying to get on an AMD MCM box to dump those fields but it is
> kinda hard currently. Will report back once I have something...

Ok, so Brice sent me some AMD MCM data from a 4-socket box.
physical_package_id there really denotes the physical package, i.e.
silicon, i.e. physical socket which contains a core.

So on a 4-socket machine with 16 cores in each socket, you have:

	   physical_package_id
cpu0-15 :  0
cpu16-31:  1
cpu32-47:  2
cpu48-63:  3

which all looks nicely regular and clean.

core_siblings mirrors exactly that too, so you don't see the internal
nodes from that either, i.e:

cpu0/topology/core_siblings_list:0-15
...
cpu16/topology/core_siblings_list:16-31
...
cpu32/topology/core_siblings_list:32-47
...
cpu50/topology/core_siblings_list:48-63

-- 
Regards/Gruss,
    Boris.
--
