* [PATCH] Fix northbridge quirk to assign correct NUMA node
@ 2014-03-13 11:43 Daniel J Blueman
  2014-03-14  9:06 ` Borislav Petkov
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Daniel J Blueman @ 2014-03-13 11:43 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: x86, Borislav Petkov, linux-kernel, Steffen Persvold, Daniel J Blueman

For systems with multiple servers and routed fabric, all northbridges get
assigned to the first server. Fix this by also using the node reported from
the PCI bus. For single-fabric systems, the northbridges are on PCI bus 0
by definition, which is on NUMA node 0 by definition, so this is invariant
on most systems.

Tested on fam10h and fam15h single and multi-fabric systems and candidate
for stable.

Signed-off-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Steffen Persvold <sp@numascale.com>
---
 arch/x86/kernel/quirks.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
index 04ee1e2..52dbf1e 100644
--- a/arch/x86/kernel/quirks.c
+++ b/arch/x86/kernel/quirks.c
@@ -529,7 +529,7 @@ static void quirk_amd_nb_node(struct pci_dev *dev)
 		return;
 
 	pci_read_config_dword(nb_ht, 0x60, &val);
-	node = val & 7;
+	node = pcibus_to_node(dev->bus) | (val & 7);
 	/*
 	 * Some hardware may return an invalid node ID,
 	 * so check it first:
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] Fix northbridge quirk to assign correct NUMA node
  2014-03-13 11:43 [PATCH] Fix northbridge quirk to assign correct NUMA node Daniel J Blueman
@ 2014-03-14  9:06 ` Borislav Petkov
  2014-03-14  9:57   ` Daniel J Blueman
  2014-03-14 10:09 ` [tip:x86/urgent] x86/amd/numa: " tip-bot for Daniel J Blueman
  2014-03-20 22:07 ` [PATCH] " Bjorn Helgaas
  2 siblings, 1 reply; 12+ messages in thread
From: Borislav Petkov @ 2014-03-14  9:06 UTC (permalink / raw)
  To: Daniel J Blueman
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Borislav Petkov, linux-kernel, Steffen Persvold

On Thu, Mar 13, 2014 at 07:43:01PM +0800, Daniel J Blueman wrote:
> For systems with multiple servers and routed fabric, all northbridges get
> assigned to the first server. Fix this by also using the node reported from
> the PCI bus. For single-fabric systems, the northbridges are on PCI bus 0
> by definition, which is on NUMA node 0 by definition, so this is invariant
> on most systems.

Yeah, I think this is of very low risk for !Numascale setups. :-) So

Acked-by: Borislav Petkov <bp@suse.de>

> Tested on fam10h and fam15h single and multi-fabric systems and candidate
> for stable.

I'm not sure about it - this is only reporting the wrong node, right?
Does anything depend on that node setting being correct and breaks due
to this?

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Fix northbridge quirk to assign correct NUMA node
  2014-03-14  9:06 ` Borislav Petkov
@ 2014-03-14  9:57   ` Daniel J Blueman
  0 siblings, 0 replies; 12+ messages in thread
From: Daniel J Blueman @ 2014-03-14  9:57 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Borislav Petkov, linux-kernel, Steffen Persvold

Hi Boris,

On 14/03/2014 17:06, Borislav Petkov wrote:
> On Thu, Mar 13, 2014 at 07:43:01PM +0800, Daniel J Blueman wrote:
>> For systems with multiple servers and routed fabric, all northbridges get
>> assigned to the first server. Fix this by also using the node reported from
>> the PCI bus. For single-fabric systems, the northbridges are on PCI bus 0
>> by definition, which is on NUMA node 0 by definition, so this is invariant
>> on most systems.
>
> Yeah, I think this is of very low risk for !Numascale setups. :-) So
>
> Acked-by: Borislav Petkov <bp@suse.de>
>
>> Tested on fam10h and fam15h single and multi-fabric systems and candidate
>> for stable.
>
> I'm not sure about it - this is only reporting the wrong node, right?
> Does anything depend on that node setting being correct and breaks due
> to this?

It's only reporting the wrong node, yes. The irqbalance daemon uses 
/sys/devices/.../numa_node, and we found we had to disable it to prevent 
hangs on certain systems after a while. I haven't established a definite 
link yet, but I did find the reported node to be incorrect.

Thanks,
   Daniel
-- 
Daniel J Blueman
Principal Software Engineer, Numascale

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [tip:x86/urgent] x86/amd/numa: Fix northbridge quirk to assign correct NUMA node
  2014-03-13 11:43 [PATCH] Fix northbridge quirk to assign correct NUMA node Daniel J Blueman
  2014-03-14  9:06 ` Borislav Petkov
@ 2014-03-14 10:09 ` tip-bot for Daniel J Blueman
  2014-03-20 22:07 ` [PATCH] " Bjorn Helgaas
  2 siblings, 0 replies; 12+ messages in thread
From: tip-bot for Daniel J Blueman @ 2014-03-14 10:09 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, stable, sp, daniel, tglx, bp

Commit-ID:  847d7970defb45540735b3fb4e88471c27cacd85
Gitweb:     http://git.kernel.org/tip/847d7970defb45540735b3fb4e88471c27cacd85
Author:     Daniel J Blueman <daniel@numascale.com>
AuthorDate: Thu, 13 Mar 2014 19:43:01 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 14 Mar 2014 11:05:36 +0100

x86/amd/numa: Fix northbridge quirk to assign correct NUMA node

For systems with multiple servers and routed fabric, all
northbridges get assigned to the first server. Fix this by also
using the node reported from the PCI bus. For single-fabric
systems, the northbridges are on PCI bus 0 by definition, which
is on NUMA node 0 by definition, so this is invariant on most
systems.

Tested on fam10h and fam15h single and multi-fabric systems and
candidate for stable.

Signed-off-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Steffen Persvold <sp@numascale.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1394710981-3596-1-git-send-email-daniel@numascale.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/quirks.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
index 7c6acd4..ff898bb 100644
--- a/arch/x86/kernel/quirks.c
+++ b/arch/x86/kernel/quirks.c
@@ -529,7 +529,7 @@ static void quirk_amd_nb_node(struct pci_dev *dev)
 		return;
 
 	pci_read_config_dword(nb_ht, 0x60, &val);
-	node = val & 7;
+	node = pcibus_to_node(dev->bus) | (val & 7);
 	/*
 	 * Some hardware may return an invalid node ID,
 	 * so check it first:

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] Fix northbridge quirk to assign correct NUMA node
  2014-03-13 11:43 [PATCH] Fix northbridge quirk to assign correct NUMA node Daniel J Blueman
  2014-03-14  9:06 ` Borislav Petkov
  2014-03-14 10:09 ` [tip:x86/urgent] x86/amd/numa: " tip-bot for Daniel J Blueman
@ 2014-03-20 22:07 ` Bjorn Helgaas
  2014-03-21  3:38   ` Daniel J Blueman
  2014-03-21  3:51   ` Suravee Suthikulpanit
  2 siblings, 2 replies; 12+ messages in thread
From: Bjorn Helgaas @ 2014-03-20 22:07 UTC (permalink / raw)
  To: Daniel J Blueman
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Borislav Petkov, linux-kernel, Steffen Persvold, linux-pci,
	Suravee Suthikulpanit, kim.naru, Aravind Gopalakrishnan,
	Myron Stowe

[+cc linux-pci, Myron, Suravee, Kim, Aravind]

On Thu, Mar 13, 2014 at 5:43 AM, Daniel J Blueman <daniel@numascale.com> wrote:
> For systems with multiple servers and routed fabric, all northbridges get
> assigned to the first server. Fix this by also using the node reported from
> the PCI bus. For single-fabric systems, the northbridges are on PCI bus 0
> by definition, which is on NUMA node 0 by definition, so this is invariant
> on most systems.
>
> Tested on fam10h and fam15h single and multi-fabric systems and candidate
> for stable.

I wish this had been cc'd to linux-pci.  We're talking about a related
change by Suravee there.  In fact, we were hoping this quirk could be
removed altogether.

I don't understand what this quirk is doing.  Normally we discover the
NUMA node for a PCI host bridge via the ACPI _PXM method.  The way
_PXM works is that every PCI device in the hierarchy below the bridge
inherits the same node number as the host bridge.  I first thought
this might be a workaround for a system that lacks _PXM, but I don't
think that can be right, because you're only changing the node for a
few devices, not the whole hierarchy.

So I suspect the problem is more complicated, and maybe _PXM is
insufficient to describe the topology?  Are there subtrees that should
have nodes different from the host bridge?

I know this patch is already in v3.14-rc7, but I'd still like to
understand it so we can do the right thing with Suravee's patch.

Bjorn

> Signed-off-by: Daniel J Blueman <daniel@numascale.com>
> Acked-by: Steffen Persvold <sp@numascale.com>
> ---
>  arch/x86/kernel/quirks.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
> index 04ee1e2..52dbf1e 100644
> --- a/arch/x86/kernel/quirks.c
> +++ b/arch/x86/kernel/quirks.c
> @@ -529,7 +529,7 @@ static void quirk_amd_nb_node(struct pci_dev *dev)
>                 return;
>
>         pci_read_config_dword(nb_ht, 0x60, &val);
> -       node = val & 7;
> +       node = pcibus_to_node(dev->bus) | (val & 7);
>         /*
>          * Some hardware may return an invalid node ID,
>          * so check it first:
> --
> 1.8.3.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Fix northbridge quirk to assign correct NUMA node
  2014-03-20 22:07 ` [PATCH] " Bjorn Helgaas
@ 2014-03-21  3:38   ` Daniel J Blueman
  2014-03-21 16:11     ` Bjorn Helgaas
  2014-03-21 17:16     ` Suravee Suthikulpanit
  2014-03-21  3:51   ` Suravee Suthikulpanit
  1 sibling, 2 replies; 12+ messages in thread
From: Daniel J Blueman @ 2014-03-21  3:38 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Borislav Petkov, linux-kernel, Steffen Persvold, linux-pci,
	Suravee Suthikulpanit, kim.naru, Aravind Gopalakrishnan,
	Myron Stowe

On 21/03/2014 06:07, Bjorn Helgaas wrote:
> [+cc linux-pci, Myron, Suravee, Kim, Aravind]
>
> On Thu, Mar 13, 2014 at 5:43 AM, Daniel J Blueman <daniel@numascale.com> wrote:
>> For systems with multiple servers and routed fabric, all northbridges get
>> assigned to the first server. Fix this by also using the node reported from
>> the PCI bus. For single-fabric systems, the northbridges are on PCI bus 0
>> by definition, which is on NUMA node 0 by definition, so this is invariant
>> on most systems.
>>
>> Tested on fam10h and fam15h single and multi-fabric systems and candidate
>> for stable.

> I wish this had been cc'd to linux-pci.  We're talking about a related
> change by Suravee there.  In fact, we were hoping this quirk could be
> removed altogether.

Noted.

> I don't understand what this quirk is doing.  Normally we discover the
> NUMA node for a PCI host bridge via the ACPI _PXM method.  The way
> _PXM works is that every PCI device in the hierarchy below the bridge
> inherits the same node number as the host bridge.  I first thought
> this might be a workaround for a system that lacks _PXM, but I don't
> think that can be right, because you're only changing the node for a
> few devices, not the whole hierarchy.
 >
> So I suspect the problem is more complicated, and maybe _PXM is
> insufficient to describe the topology?  Are there subtrees that should
> have nodes different from the host bridge?

Yes; see below.

> I know this patch is already in v3.14-rc7, but I'd still like to
> understand it so we can do the right thing with Suravee's patch.

The _PXM method associates each northbridge with the first NUMA node of 
its server: 0 in single-fabric systems, and e.g. 4 for the second server 
in a multi-fabric system with 2 dual-module Opterons per server (each 
with 2 NUMA nodes internally), since the northbridges appear in the PCI 
tree under the host bridge, not above it [1].

With _PXM, the rest of the PCI bus hierarchy has the right NUMA node 
associated, but the northbridge PCI devices should be associated with 
their actual NUMA node, 0, 1, 2, 3 for the first server in this example. 
The quirk fixes this up; irqbalance at least uses this NUMA data exposed 
in /sys.
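
As an illustration only (a minimal sketch, not the kernel function, with 
hypothetical numbers for a 2-server fabric with 4 NUMA nodes per server), 
this is the effect of the change:

#include <stdio.h>

/*
 * Sketch of the node computation after the quirk change.  "base" stands
 * in for pcibus_to_node(dev->bus), i.e. the first node of the owning
 * server per _PXM/SRAT; "ht_id" stands in for the low 3 bits of HT
 * config register 0x60 (the local HyperTransport NodeId).
 */
static unsigned int nb_node(unsigned int base, unsigned int ht_id)
{
	return base | (ht_id & 7);
}

int main(void)
{
	unsigned int server, ht_id;

	for (server = 0; server < 2; server++)
		for (ht_id = 0; ht_id < 4; ht_id++)
			printf("server %u NB %u -> node %u\n",
			       server, ht_id, nb_node(server * 4, ht_id));
	/* Prints nodes 0..3 for server 0 and 4..7 for server 1; without
	 * the quirk, server 1's northbridges would also report 0..3. */
	return 0;
}

Note the OR only composes cleanly when each server's base node is a 
multiple of its per-server node count.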

The alternative to the quirk may be to explicitly express the 
northbridge PCI devices in the AML with their own _PXM methods. If it's 
valid, it may be the honest approach, though the quirk may be needed for 
most BIOSes; I can check the AML on a few servers to confirm if helpful.

Thanks,
   Daniel

[1] http://quora.org/2014/lspci.txt
-- 
Daniel J Blueman
Principal Software Engineer, Numascale

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Fix northbridge quirk to assign correct NUMA node
  2014-03-20 22:07 ` [PATCH] " Bjorn Helgaas
  2014-03-21  3:38   ` Daniel J Blueman
@ 2014-03-21  3:51   ` Suravee Suthikulpanit
  2014-03-21  4:14     ` Daniel J Blueman
  1 sibling, 1 reply; 12+ messages in thread
From: Suravee Suthikulpanit @ 2014-03-21  3:51 UTC (permalink / raw)
  To: Bjorn Helgaas, Daniel J Blueman
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Borislav Petkov, linux-kernel, Steffen Persvold, linux-pci,
	kim.naru, Aravind Gopalakrishnan, Myron Stowe, Hurwitz, Sherry

Bjorn,

On a typical AMD system, there are two types of host bridges:
* PCI Root Complex Host bridge (e.g. RD890, SR56xx, etc.)
* CPU Host bridge

Here is an example from a 2 sockets system:

$ lspci
00:00.0 Host bridge: Advanced Micro Devices [AMD] nee ATI RD890 PCI to PCI bridge (external gfx0 port A) (rev 02)
00:00.2 IOMMU: Advanced Micro Devices [AMD] nee ATI RD990 I/O Memory Management Unit (IOMMU)
00:04.0 PCI bridge: Advanced Micro Devices [AMD] nee ATI RD890 PCI to PCI bridge (PCI express gpp port D)
00:11.0 SATA controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
00:12.0 USB controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
00:12.1 USB controller: Advanced Micro Devices [AMD] nee ATI SB7x0 USB OHCI1 Controller
00:12.2 USB controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 USB EHCI Controller
00:13.0 USB controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
00:13.1 USB controller: Advanced Micro Devices [AMD] nee ATI SB7x0 USB OHCI1 Controller
00:13.2 USB controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 USB EHCI Controller
00:14.0 SMBus: Advanced Micro Devices [AMD] nee ATI SBx00 SMBus Controller (rev 3d)
00:14.1 IDE interface: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 IDE Controller
00:14.3 ISA bridge: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 LPC host controller
00:14.4 PCI bridge: Advanced Micro Devices [AMD] nee ATI SBx00 PCI to PCI Bridge
00:14.5 USB controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 USB OHCI2 Controller
00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 15h Processor Function 0
00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 15h Processor Function 1
00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 15h Processor Function 2
00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 15h Processor Function 3
00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 15h Processor Function 4
00:18.5 Host bridge: Advanced Micro Devices [AMD] Family 15h Processor Function 5
00:19.0 Host bridge: Advanced Micro Devices [AMD] Family 15h Processor Function 0
00:19.1 Host bridge: Advanced Micro Devices [AMD] Family 15h Processor Function 1
00:19.2 Host bridge: Advanced Micro Devices [AMD] Family 15h Processor Function 2
00:19.3 Host bridge: Advanced Micro Devices [AMD] Family 15h Processor Function 3
00:19.4 Host bridge: Advanced Micro Devices [AMD] Family 15h Processor Function 4
00:19.5 Host bridge: Advanced Micro Devices [AMD] Family 15h Processor Function 5
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
02:06.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI ES1000 (rev 02)

The host bridge 00:00.0 is basically the PCI root complex, which connects to the actual PCI bus with
PCI devices hanging off of it.  However, the host bridges 00:[18,19].x are the CPU host bridges,
each of which represents a CPU node within the system. In a system with a single root complex,
the root complex is normally connected to node 0 (i.e. 00:18.0) via a non-coherent HT (I/O) link.

Even though the CPU host bridges 00:[18,19].x are on the same bus as the PCI root complex, they should
not be using the NUMA information from the PCI root complex host bridge.
Therefore, I don't think we should be using pcibus_to_node(dev->bus) here.
Only the "val" from pci_read_config_dword(nb_ht, 0x60, &val) should be used here.

Please see section 2.2 of the BIOS and Kernel Developer's Guide (BKDG) for more info:
(http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf)

Suravee

On 3/20/2014 5:07 PM, Bjorn Helgaas wrote:
> [+cc linux-pci, Myron, Suravee, Kim, Aravind]
>
> On Thu, Mar 13, 2014 at 5:43 AM, Daniel J Blueman <daniel@numascale.com> wrote:
>> For systems with multiple servers and routed fabric, all northbridges get
>> assigned to the first server. Fix this by also using the node reported from
>> the PCI bus. For single-fabric systems, the northbridges are on PCI bus 0
>> by definition, which is on NUMA node 0 by definition, so this is invariant
>> on most systems.
>>
>> Tested on fam10h and fam15h single and multi-fabric systems and candidate
>> for stable.
>
> I wish this had been cc'd to linux-pci.  We're talking about a related
> change by Suravee there.  In fact, we were hoping this quirk could be
> removed altogether.
>
> I don't understand what this quirk is doing.  Normally we discover the
> NUMA node for a PCI host bridge via the ACPI _PXM method.  The way
> _PXM works is that every PCI device in the hierarchy below the bridge
> inherits the same node number as the host bridge.  I first thought
> this might be a workaround for a system that lacks _PXM, but I don't
> think that can be right, because you're only changing the node for a
> few devices, not the whole hierarchy.
>
> So I suspect the problem is more complicated, and maybe _PXM is
> insufficient to describe the topology?  Are there subtrees that should
> have nodes different from the host bridge?
>
> I know this patch is already in v3.14-rc7, but I'd still like to
> understand it so we can do the right thing with Suravee's patch.
>
> Bjorn
>
>> Signed-off-by: Daniel J Blueman <daniel@numascale.com>
>> Acked-by: Steffen Persvold <sp@numascale.com>
>> ---
>>   arch/x86/kernel/quirks.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
>> index 04ee1e2..52dbf1e 100644
>> --- a/arch/x86/kernel/quirks.c
>> +++ b/arch/x86/kernel/quirks.c
>> @@ -529,7 +529,7 @@ static void quirk_amd_nb_node(struct pci_dev *dev)
>>                  return;
>>
>>          pci_read_config_dword(nb_ht, 0x60, &val);
>> -       node = val & 7;
>> +       node = pcibus_to_node(dev->bus) | (val & 7);
>>          /*
>>           * Some hardware may return an invalid node ID,
>>           * so check it first:
>> --
>> 1.8.3.2
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Fix northbridge quirk to assign correct NUMA node
  2014-03-21  3:51   ` Suravee Suthikulpanit
@ 2014-03-21  4:14     ` Daniel J Blueman
  0 siblings, 0 replies; 12+ messages in thread
From: Daniel J Blueman @ 2014-03-21  4:14 UTC (permalink / raw)
  To: Suravee Suthikulpanit, Bjorn Helgaas
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Borislav Petkov, linux-kernel, Steffen Persvold, linux-pci,
	kim.naru, Aravind Gopalakrishnan, Myron Stowe, Hurwitz, Sherry

On 21/03/2014 11:51, Suravee Suthikulpanit wrote:
> Bjorn,
>
> On a typical AMD system, there are two types of host bridges:
> * PCI Root Complex Host bridge (e.g. RD890, SR56xx, etc.)
> * CPU Host bridge
>
> Here is an example from a 2 sockets system:
>
> $ lspci
[]

> The host bridge 00:00.0 is basically the PCI root complex which connects
> to the actual PCI bus with
> PCI devices hanging off of it.  However, the host bridge 00:[18,19].x
> are the CPU host bridges,
> each of which represents a CPU node within the system. In system with
> single root complex,
> the root complex is normally connected to node 0 (i.e. 00:18.0) via
> non-coherent HT (I/O) link.

> Even though the CPU host bridge 00:[18,19].x is on the same bus as the
> PCI root complex, it should
> not be using the NUMA information from the PCI root complex host bridge.

This is unavoidable unless we special-case it via another mechanism (i.e. 
not quirks), since the northbridges/CPU host bridges logically sit under 
the host bridge that carries the _PXM method.

> Therefore, I don't think we should be using the pcibus_to_node(dev->bus)
> here.
> Only the "val" from pci_read_config_dword(nb_ht, 0x60, &val), should be
> used here.

Effectively using only the NUMA node ID (the HT node ID here) would 
associate all the northbridges with the first fabric, which is false 
information. If there were no quirk, they'd all be associated with the 
first NUMA node in each fabric, as you'd expect.

This was the only safe and defensible one-liner approach I could 
prepare; if you find it introduces a regression or you can find a better 
approach, do tell. If not, we can decouple this fix from an overall new 
approach, since it's unlikely that'll get backported to stable kernels.

Thanks,
   Daniel

> On 3/20/2014 5:07 PM, Bjorn Helgaas wrote:
>> [+cc linux-pci, Myron, Suravee, Kim, Aravind]
>>
>> On Thu, Mar 13, 2014 at 5:43 AM, Daniel J Blueman
>> <daniel@numascale.com> wrote:
>>> For systems with multiple servers and routed fabric, all northbridges
>>> get
>>> assigned to the first server. Fix this by also using the node
>>> reported from
>>> the PCI bus. For single-fabric systems, the northbridges are on PCI bus 0
>>> by definition, which is on NUMA node 0 by definition, so this is
>>> invariant
>>> on most systems.
>>>
>>> Tested on fam10h and fam15h single and multi-fabric systems and
>>> candidate
>>> for stable.
>>
>> I wish this had been cc'd to linux-pci.  We're talking about a related
>> change by Suravee there.  In fact, we were hoping this quirk could be
>> removed altogether.
>>
>> I don't understand what this quirk is doing.  Normally we discover the
>> NUMA node for a PCI host bridge via the ACPI _PXM method.  The way
>> _PXM works is that every PCI device in the hierarchy below the bridge
>> inherits the same node number as the host bridge.  I first thought
>> this might be a workaround for a system that lacks _PXM, but I don't
>> think that can be right, because you're only changing the node for a
>> few devices, not the whole hierarchy.
>>
>> So I suspect the problem is more complicated, and maybe _PXM is
>> insufficient to describe the topology?  Are there subtrees that should
>> have nodes different from the host bridge?
>>
>> I know this patch is already in v3.14-rc7, but I'd still like to
>> understand it so we can do the right thing with Suravee's patch.
>>
>> Bjorn
>>
>>> Signed-off-by: Daniel J Blueman <daniel@numascale.com>
>>> Acked-by: Steffen Persvold <sp@numascale.com>
>>> ---
>>>   arch/x86/kernel/quirks.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
>>> index 04ee1e2..52dbf1e 100644
>>> --- a/arch/x86/kernel/quirks.c
>>> +++ b/arch/x86/kernel/quirks.c
>>> @@ -529,7 +529,7 @@ static void quirk_amd_nb_node(struct pci_dev *dev)
>>>                  return;
>>>
>>>          pci_read_config_dword(nb_ht, 0x60, &val);
>>> -       node = val & 7;
>>> +       node = pcibus_to_node(dev->bus) | (val & 7);
>>>          /*
>>>           * Some hardware may return an invalid node ID,
>>>           * so check it first:
>>> --
>>> 1.8.3.2
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe
>>> linux-kernel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at  http://www.tux.org/lkml/
>>
>
>


-- 
Daniel J Blueman
Principal Software Engineer, Numascale

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Fix northbridge quirk to assign correct NUMA node
  2014-03-21  3:38   ` Daniel J Blueman
@ 2014-03-21 16:11     ` Bjorn Helgaas
  2014-03-24  6:03       ` Daniel J Blueman
  2014-03-21 17:16     ` Suravee Suthikulpanit
  1 sibling, 1 reply; 12+ messages in thread
From: Bjorn Helgaas @ 2014-03-21 16:11 UTC (permalink / raw)
  To: Daniel J Blueman
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Borislav Petkov, linux-kernel, Steffen Persvold, linux-pci,
	Suravee Suthikulpanit, kim.naru, Aravind Gopalakrishnan,
	Myron Stowe, Rafael J. Wysocki, linux-acpi

[+cc Rafael, linux-acpi for _PXM questions]

On Thu, Mar 20, 2014 at 9:38 PM, Daniel J Blueman <daniel@numascale.com> wrote:
> On 21/03/2014 06:07, Bjorn Helgaas wrote:
>> On Thu, Mar 13, 2014 at 5:43 AM, Daniel J Blueman <daniel@numascale.com>
>> wrote:
>>>
>>> For systems with multiple servers and routed fabric, all northbridges get
>>> assigned to the first server. Fix this by also using the node reported
>>> from
>>> the PCI bus. For single-fabric systems, the northbridges are on PCI bus 0
>>> by definition, which is on NUMA node 0 by definition, so this is
>>> invariant
>>> on most systems.
>>>
>>> Tested on fam10h and fam15h single and multi-fabric systems and candidate
>>> for stable.

>> So I suspect the problem is more complicated, and maybe _PXM is
>> insufficient to describe the topology?  Are there subtrees that should
>> have nodes different from the host bridge?
>
> Yes; see below.
> ...
> The _PXM method associates each northbridge with the first NUMA node, 0 in
> single-fabric systems, and eg 4 for the second server in a multi-fabric
> system with 2 dual-module Opterons (with 2 NUMA nodes internally) etc, since
> the northbridges appear in the PCI tree, under the host bridge, not above it
> [1].
>
> With _PXM, the rest of the PCI bus hierarchy has the right NUMA node
> associated, but the northbridge PCI devices should be associated with their
> actual NUMA node, 0, 1, 2, 3 for the first server in this example. The quirk
> fixes this up; irqbalance at least uses this NUMA data exposed in /sys.

I'm confused about which devices we're talking about.  We currently
look at _PXM for PNP0A08 (and PNP0A03) ACPI devices.  The resulting
node is associated with every PCI device we enumerate below the
PNP0A08 bridge.  This association is made in pci_device_add().

When you say "northbridge PCI devices should be associated with their
actual NUMA node," I assume you mean the 00:18.x and 00:19.x devices
("AMD Family 10h Processor ..."), since those seem to be what the
quirk applies to.  You are *not* talking about 00:00.0 ("ATI RD890
Northbridge"), right?

You mention irqbalance; is the NUMA node information for the 00:18.x
and 00:19.x devices important because you get a lot of interrupts from
those devices?  Or is the issue with actual I/O devices (NICs, SCSI
adapters, etc.)?  If so, I don't see how this quirk would affect
those, because the node information for them comes from the PNP0A08
bridge (in pci_device_add()), not from the 00:00.0, 00:18.x, or
00:19.x devices.

> The alternative to the quirk may be to explicitly express the northbridge
> PCI devices in the AML with their own _PXM methods. If it's valid, it may be
> the honest approach, though the quirk may be needed for most BIOSs; I can
> check the AML on a few servers to confirm if helpful.

ACPI allows _PXM for any device, so this might be a possible approach.
 However, it looks like Linux only pays attention to _PXM for
PNP0A08/03, CPUs, memory and IOAPICs (which seems like a Linux defect
to me).

I'm really worried about the approach here:

        pci_read_config_dword(nb_ht, 0x60, &val);
        node = pcibus_to_node(dev->bus) | (val & 7);

because the pcibus_to_node() information comes indirectly from _PXM,
and the "val" part comes from the hardware, and I don't think these
are the same node number space.  If I understand correctly, the BIOS
can synthesize whatever numbers it wants for _PXM, which returns a
"proximity domain," and then Linux can make up its own mapping of
"proximity domain" to "logical Linux node."  So I don't see why we can
assume that it's valid to OR in the bits from a PCI config register to
this logical Linux node number.
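
To make the number-space point concrete, the proximity-domain-to-node
mapping behaves roughly like the sketch below (paraphrased, not the
actual ACPI NUMA code): domains are assigned logical nodes in order of
first appearance in the SRAT, independent of their numeric PXM values.

#include <string.h>

#define MAX_PXM 256

/* Sketch: logical nodes are handed out in SRAT discovery order. */
struct pxm_map {
	int pxm_to_node[MAX_PXM];
	int nodes_found;
};

static void pxm_map_init(struct pxm_map *m)
{
	memset(m->pxm_to_node, -1, sizeof(m->pxm_to_node));
	m->nodes_found = 0;
}

static int map_pxm_to_node(struct pxm_map *m, int pxm)
{
	if (m->pxm_to_node[pxm] < 0)
		m->pxm_to_node[pxm] = m->nodes_found++;	/* first unused node */
	return m->pxm_to_node[pxm];
}

So the PXM space and the logical node space need not line up
numerically, which is why the OR looks suspicious to me.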

> [1] http://quora.org/2014/lspci.txt

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Fix northbridge quirk to assign correct NUMA node
  2014-03-21  3:38   ` Daniel J Blueman
  2014-03-21 16:11     ` Bjorn Helgaas
@ 2014-03-21 17:16     ` Suravee Suthikulpanit
  2014-03-23 14:30       ` Daniel J Blueman
  1 sibling, 1 reply; 12+ messages in thread
From: Suravee Suthikulpanit @ 2014-03-21 17:16 UTC (permalink / raw)
  To: Daniel J Blueman, Bjorn Helgaas
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Borislav Petkov, linux-kernel, Steffen Persvold, linux-pci,
	kim.naru, Aravind Gopalakrishnan, Myron Stowe

On 3/20/2014 10:38 PM, Daniel J Blueman wrote:
> On 21/03/2014 06:07, Bjorn Helgaas wrote:
>> [+cc linux-pci, Myron, Suravee, Kim, Aravind]
>>
>> On Thu, Mar 13, 2014 at 5:43 AM, Daniel J Blueman <daniel@numascale.com> wrote:
>>> For systems with multiple servers and routed fabric, all northbridges get
>>> assigned to the first server. Fix this by also using the node reported from
>>> the PCI bus. For single-fabric systems, the northbridges are on PCI bus 0
>>> by definition, which is on NUMA node 0 by definition, so this is invariant
>>> on most systems.
>>>
>>> Tested on fam10h and fam15h single and multi-fabric systems and candidate
>>> for stable.
>
>> I wish this had been cc'd to linux-pci.  We're talking about a related
>> change by Suravee there.  In fact, we were hoping this quirk could be
>> removed altogether.
>
> Noted.
>
>> I don't understand what this quirk is doing.  Normally we discover the
>> NUMA node for a PCI host bridge via the ACPI _PXM method.  The way
>> _PXM works is that every PCI device in the hierarchy below the bridge
>> inherits the same node number as the host bridge.  I first thought
>> this might be a workaround for a system that lacks _PXM, but I don't
>> think that can be right, because you're only changing the node for a
>> few devices, not the whole hierarchy.
>  >
>> So I suspect the problem is more complicated, and maybe _PXM is
>> insufficient to describe the topology?  Are there subtrees that should
>> have nodes different from the host bridge?
>
> Yes; see below.
>
>> I know this patch is already in v3.14-rc7, but I'd still like to
>> understand it so we can do the right thing with Suravee's patch.
>
> The _PXM method associates each northbridge with the first NUMA node, 0 in single-fabric systems, and eg 4 for the second server in a multi-fabric system with 2 dual-module Opterons (with 2 NUMA nodes internally) etc, since the northbridges appear in the
> PCI tree, under the host bridge, not above it [1].
Daniel,

That lspci looks interesting; what is the value returned from pci_bus_to_node() on your system for each fabric?

Suravee

>
> With _PXM, the rest of the PCI bus hierarchy has the right NUMA node associated, but the northbridge PCI devices should be associated with their actual NUMA node, 0, 1, 2, 3 for the first server in this example. The quirk fixes this up; irqbalance at least
> uses this NUMA data exposed in /sys.
>
> The alternative to the quirk may be to explicitly express the northbridge PCI devices in the AML with their own _PXM methods. If it's valid, it may be the honest approach, though the quirk may be needed for most BIOSs; I can check the AML on a few servers
> to confirm if helpful.
>
> Thanks,
>    Daniel
>
> [1] http://quora.org/2014/lspci.txt



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Fix northbridge quirk to assign correct NUMA node
  2014-03-21 17:16     ` Suravee Suthikulpanit
@ 2014-03-23 14:30       ` Daniel J Blueman
  0 siblings, 0 replies; 12+ messages in thread
From: Daniel J Blueman @ 2014-03-23 14:30 UTC (permalink / raw)
  To: Suravee Suthikulpanit
  Cc: Bjorn Helgaas, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Borislav Petkov, linux-kernel, Steffen Persvold, linux-pci,
	kim.naru, Aravind Gopalakrishnan, Myron Stowe

On 03/22/2014 01:16 AM, Suravee Suthikulpanit wrote:
> On 3/20/2014 10:38 PM, Daniel J Blueman wrote:
>> On 21/03/2014 06:07, Bjorn Helgaas wrote:
>>> [+cc linux-pci, Myron, Suravee, Kim, Aravind]
>>>
>>> On Thu, Mar 13, 2014 at 5:43 AM, Daniel J Blueman
>>> <daniel@numascale.com> wrote:
>>>> For systems with multiple servers and routed fabric, all
>>>> northbridges get
>>>> assigned to the first server. Fix this by also using the node
>>>> reported from
>>>> the PCI bus. For single-fabric systems, the northbridges are on PCI
>>>> bus 0
>>>> by definition, which is on NUMA node 0 by definition, so this is
>>>> invariant
>>>> on most systems.
>>>>
>>>> Tested on fam10h and fam15h single and multi-fabric systems and
>>>> candidate
>>>> for stable.
>>
>>> I wish this had been cc'd to linux-pci.  We're talking about a related
>>> change by Suravee there.  In fact, we were hoping this quirk could be
>>> removed altogether.
>>
>> Noted.
>>
>>> I don't understand what this quirk is doing.  Normally we discover the
>>> NUMA node for a PCI host bridge via the ACPI _PXM method.  The way
>>> _PXM works is that every PCI device in the hierarchy below the bridge
>>> inherits the same node number as the host bridge.  I first thought
>>> this might be a workaround for a system that lacks _PXM, but I don't
>>> think that can be right, because you're only changing the node for a
>>> few devices, not the whole hierarchy.
>>  >
>>> So I suspect the problem is more complicated, and maybe _PXM is
>>> insufficient to describe the topology?  Are there subtrees that should
>>> have nodes different from the host bridge?
>>
>> Yes; see below.
>>
>>> I know this patch is already in v3.14-rc7, but I'd still like to
>>> understand it so we can do the right thing with Suravee's patch.
>>
>> The _PXM method associates each northbridge with the first NUMA node,
>> 0 in single-fabric systems, and eg 4 for the second server in a
>> multi-fabric system with 2 dual-module Opterons (with 2 NUMA nodes
>> internally) etc, since the northbridges appear in the
>> PCI tree, under the host bridge, not above it [1].
> Daniel,
>
> That lspci looks interesting, what is the value returned from
> pci_bus_to_node() on your system for each fabric?

pci_bus_to_node returns 0 for PCI domain 0000, 2 for PCI domain 0001, 4 
for PCI domain 0002 and so on.

Our processor fabric interconnect has HyperTransport NodeId 2 on each 
server (the HT devices start at bus 0, device 0x18, of course):
0000:00:1a.0 Host bridge: Device 1b47:0601 (rev 02)
0000:00:1a.1 Host bridge: Device 1b47:0602 (rev 02)
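
To spell out the arithmetic for this box (assuming two CPU nodes per 
server, so the quirk only ever ORs in NodeId 0 or 1):

  domain 0000: pci_bus_to_node() = 0, node = 0 | {0,1} = 0 or 1
  domain 0001: pci_bus_to_node() = 2, node = 2 | {0,1} = 2 or 3
  domain 0002: pci_bus_to_node() = 4, node = 4 | {0,1} = 4 or 5

The OR composes cleanly here because each base is a multiple of the 
per-server node count.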

Thanks,
   Daniel
-- 
Daniel J Blueman
Principal Software Engineer, Numascale

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Fix northbridge quirk to assign correct NUMA node
  2014-03-21 16:11     ` Bjorn Helgaas
@ 2014-03-24  6:03       ` Daniel J Blueman
  0 siblings, 0 replies; 12+ messages in thread
From: Daniel J Blueman @ 2014-03-24  6:03 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Borislav Petkov, linux-kernel, Steffen Persvold, linux-pci,
	Suravee Suthikulpanit, kim.naru, Aravind Gopalakrishnan,
	Myron Stowe, Rafael J. Wysocki, linux-acpi

On 03/22/2014 12:11 AM, Bjorn Helgaas wrote:
> [+cc Rafael, linux-acpi for _PXM questions]
>
> On Thu, Mar 20, 2014 at 9:38 PM, Daniel J Blueman <daniel@numascale.com> wrote:
>> On 21/03/2014 06:07, Bjorn Helgaas wrote:
>>> On Thu, Mar 13, 2014 at 5:43 AM, Daniel J Blueman <daniel@numascale.com>
>>> wrote:
>>>>
>>>> For systems with multiple servers and routed fabric, all northbridges get
>>>> assigned to the first server. Fix this by also using the node reported
>>>> from
>>>> the PCI bus. For single-fabric systems, the northbridges are on PCI bus 0
>>>> by definition, which is on NUMA node 0 by definition, so this is
>>>> invariant
>>>> on most systems.
>>>>
>>>> Tested on fam10h and fam15h single and multi-fabric systems and candidate
>>>> for stable.
>
>>> So I suspect the problem is more complicated, and maybe _PXM is
>>> insufficient to describe the topology?  Are there subtrees that should
>>> have nodes different from the host bridge?
>>
>> Yes; see below.
>> ...
>> The _PXM method associates each northbridge with the first NUMA node, 0 in
>> single-fabric systems, and eg 4 for the second server in a multi-fabric
>> system with 2 dual-module Opterons (with 2 NUMA nodes internally) etc, since
>> the northbridges appear in the PCI tree, under the host bridge, not above it
>> [1].
>>
>> With _PXM, the rest of the PCI bus hierarchy has the right NUMA node
>> associated, but the northbridge PCI devices should be associated with their
>> actual NUMA node, 0, 1, 2, 3 for the first server in this example. The quirk
>> fixes this up; irqbalance at least uses this NUMA data exposed in /sys.
>
> I'm confused about which devices we're talking about.  We currently
> look at _PXM for PNP0A08 (and PNP0A03) ACPI devices.  The resulting
> node is associated with every PCI device we enumerate below the
> PNP0A08 bridge.  This association is made in pci_device_add().
>
> When you say "northbridge PCI devices should be associated with their
> actual NUMA node," I assume you mean the 00:18.x and 00:19.x devices
> ("AMD Family 10h Processor ..."), since those seem to be what the
> quirk applies to.  You are *not* talking about 00:00.0 ("ATI RD890
> Northbridge"), right?

Yes, on bus 0, devices 0x18 to 0x1f decode to the (up to) eight 
HyperTransport devices in the processor fabric, normally all processor 
northbridges.

> You mention irqbalance; is the NUMA node information for the 00:18.x
> and 00:19.x devices important because you get a lot of interrupts from
> those devices?  Or is the issue with actual I/O devices (NICs, SCSI
> adapters, etc.)?  If so, I don't see how this quirk would affect
> those, because the node information for them comes from the PNP0A08
> bridge (in pci_device_add()), not from the 00:00.0, 00:18.x, or
> 00:19.x devices.

I need to investigate the lockups irqbalance was causing on a customer 
system; I'm not yet sure which interrupt affinity it rewrote that caused 
the hangs, but disabling the daemon prevented them.

>> The alternative to the quirk may be to explicitly express the northbridge
>> PCI devices in the AML with their own _PXM methods. If it's valid, it may be
>> the honest approach, though the quirk may be needed for most BIOSs; I can
>> check the AML on a few servers to confirm if helpful.
>
> ACPI allows _PXM for any device, so this might be a possible approach.
>   However, it looks like Linux only pays attention to _PXM for
> PNP0A08/03, CPUs, memory and IOAPICs (which seems like a Linux defect
> to me).

> I'm really worried about the approach here:
>
>          pci_read_config_dword(nb_ht, 0x60, &val);
>          node = pcibus_to_node(dev->bus) | (val & 7);
>
> because the pcibus_to_node() information comes indirectly from _PXM,
> and the "val" part comes from the hardware, and I don't think these
> are the same node number space.  If I understand correctly, the BIOS
> can synthesize whatever numbers it wants for _PXM, which returns a
> "proximity domain," and then Linux can make up its own mapping of
> "proximity domain" to "logical Linux node."  So I don't see why we can
> assume that it's valid to OR in the bits from a PCI config register to
> this logical Linux node number.

pcibus_to_node() uses the proximity domain values in the ACPI SRAT table, 
which are thus correctly mapped to Linux NUMA node IDs, so my one-liner 
is still progress.

Linux allocates NUMA node IDs in the order the PXM values are seen in 
the SRAT table, i.e. via first_unset_node(nodes_found_map). The APIC IDs 
are initialised from the HyperTransport NodeId [1, p263 and p465], but 
the NodeId can be reprogrammed after the APIC IDs are set (which also 
changes the PCI configuration device ID, starting from 0x18 on bus 0, 
that it responds to), and the SRAT table needn't be emitted in order, 
except perhaps for the bootstrap core.
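
As a hypothetical example of that corner case: if the SRAT listed the 
proximity domains in the order 1, 0, 2, 3, Linux would number them as 
logical nodes 0, 1, 2, 3 respectively, so pcibus_to_node() for bus 0 
(PXM 0) would return 1, and 1 | NodeId would yield 1, 1, 3, 3 for 
NodeIds 0-3 instead of the correct 1, 0, 2, 3.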

I guess fixing the original quirk depends on how important these cases 
really are.

Thanks,
   Daniel

[1] http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf
-- 
Daniel J Blueman
Principal Software Engineer, Numascale

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread

Thread overview: 12+ messages
2014-03-13 11:43 [PATCH] Fix northbridge quirk to assign correct NUMA node Daniel J Blueman
2014-03-14  9:06 ` Borislav Petkov
2014-03-14  9:57   ` Daniel J Blueman
2014-03-14 10:09 ` [tip:x86/urgent] x86/amd/numa: " tip-bot for Daniel J Blueman
2014-03-20 22:07 ` [PATCH] " Bjorn Helgaas
2014-03-21  3:38   ` Daniel J Blueman
2014-03-21 16:11     ` Bjorn Helgaas
2014-03-24  6:03       ` Daniel J Blueman
2014-03-21 17:16     ` Suravee Suthikulpanit
2014-03-23 14:30       ` Daniel J Blueman
2014-03-21  3:51   ` Suravee Suthikulpanit
2014-03-21  4:14     ` Daniel J Blueman
