linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: Yunsheng Lin <linyunsheng@huawei.com>
Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	mhocko@kernel.org, peterz@infradead.org, robin.murphy@arm.com,
	geert@linux-m68k.org, gregkh@linuxfoundation.org,
	paul.burton@mips.com
Subject: Re: [PATCH] PCI: Warn about host bridge device when its numa node is NO_NODE
Date: Tue, 22 Oct 2019 16:04:20 -0500	[thread overview]
Message-ID: <20191022210420.GA17717@google.com> (raw)
In-Reply-To: <1571467543-26125-1-git-send-email-linyunsheng@huawei.com>

On Sat, Oct 19, 2019 at 02:45:43PM +0800, Yunsheng Lin wrote:
> As the disscusion in [1]:

We need to justify this patch right here in the commit log, not with a
pointer to a 50+ message email thread.

> A PCI device really _MUST_ have a node assigned. 

No, it's not really essential.  It's *nice* if we know the node
closest to a PCI device, but the system should function correctly even
if we don't.  The only problem is that it will be slower.

I think the underlying problem you're addressing is that:

  - NUMA_NO_NODE == -1,
  - dev_to_node(dev) may return NUMA_NO_NODE,
  - kmalloc(dev) relies on cpumask_of_node(dev_to_node(dev)), and
  - cpumask_of_node(NUMA_NO_NODE) makes an invalid array reference

For example, on arm64, mips loongson, s390, and x86,
cpumask_of_node(node) returns "node_to_cpumask_map[node]", and -1 is
an invalid array index.

That problem can't be solved by emitting a warning, of course.  I
assume some variation of your "numa: make node_to_cpumask_map()
NUMA_NO_NODE aware" patch [a] will solve that problem.

[a] https://lore.kernel.org/linux-mips/1568535656-158979-1-git-send-email-linyunsheng@huawei.com/T/#u

It is probably a good idea to emit a warning about the performance
issue.

When I run your patch on qemu, I see this:

  ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
  acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
  acpi PNP0A08:00: _OSC: platform does not support [LTR]
  acpi PNP0A08:00: _OSC: OS now controls [PME AER PCIeCapability]
  PCI host bridge to bus 0000:00
  pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
  pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
  pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
  pci_bus 0000:00: root bus resource [mem 0xc0000000-0xfebfffff window]
  pci_bus 0000:00: root bus resource [mem 0x100000000-0x8ffffffff window]
  pci_bus 0000:00: root bus resource [bus 00-ff]
   pci0000:00: [Firmware Bug]: No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.

I didn't debug it to see what's wrong with the " pci0000:00" string.
Ideally it would be connected with "acpi PNP0A08:00" since that's the
place where BIOS would make a fix but I suppose "pci_bus 0000:00"
would be adequate.

> It is possible to
> have a PCI bridge shared between two nodes, such that the PCI
> devices have equidistance. But the moment you scale this out, you
> either get devices that are 'local' to a package while having
> multiple packages, or if you maintain a single bridge in a big
> system, things become so slow it all doesn't matter anyway.
> Assigning a node (one of the shared) is, in the generic ase of
> multiple packages, the better solution over assigning all nodes.
> 
> As pci_device_add() will assign the pci device' node according to
> the bus the device is on, which is decided by pcibus_to_node().
> Currently different arch may implement the pcibus_to_node() based
> on bus->sysdata or bus device' node, which has the same node as
> the bridge device.
>
> And for devices behind another bridge case, the child bus device
> is setup with proper parent bus device and inherit its parent'
> sysdata in pci_alloc_child_bus(), so the pcie device under the
> child bus should have the same node as the parent bridge when
> device_add() is called, which will set the node to its parent's
> node when the child device' node is NUMA_NO_NODE.
> 
> So this patch only warns about the case when a host bridge device
> is registered with a node of NO_NODE in pci_register_host_bridge().
> And it only warns about that when there are more than one numa
> nodes in the system.


> [1] https://lore.kernel.org/lkml/1568724534-146242-1-git-send-email-linyunsheng@huawei.com/
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
>  drivers/pci/probe.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 3d5271a..22be96a 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -927,6 +927,9 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
>  	list_add_tail(&bus->node, &pci_root_buses);
>  	up_write(&pci_bus_sem);
>  
> +	if (nr_node_ids > 1 && dev_to_node(bus->bridge) == NUMA_NO_NODE)
> +		dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.\n");
> +
>  	return 0;
>  
>  unregister:

  parent reply	other threads:[~2019-10-22 21:04 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-19  6:45 [PATCH] PCI: Warn about host bridge device when its numa node is NO_NODE Yunsheng Lin
2019-10-19  8:34 ` Christoph Hellwig
2019-10-21  4:05   ` Yunsheng Lin
2019-10-22 13:55     ` Robin Murphy
2019-10-23  8:24       ` Yunsheng Lin
2019-10-22 21:04 ` Bjorn Helgaas [this message]
2019-10-23  8:22   ` Yunsheng Lin
2019-10-23 17:10     ` Bjorn Helgaas
2019-10-24  9:20       ` Michal Hocko
2019-10-24 17:40         ` Bjorn Helgaas
2019-10-25  8:16           ` Michal Hocko
2019-10-25  8:51             ` Yunsheng Lin
2019-10-24  9:39       ` Yunsheng Lin
2019-10-24 10:16       ` Robin Murphy
2019-10-25 12:51         ` Bjorn Helgaas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191022210420.GA17717@google.com \
    --to=helgaas@kernel.org \
    --cc=geert@linux-m68k.org \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linyunsheng@huawei.com \
    --cc=mhocko@kernel.org \
    --cc=paul.burton@mips.com \
    --cc=peterz@infradead.org \
    --cc=robin.murphy@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).