From: Bjorn Helgaas <helgaas@kernel.org>
To: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: "Dave Hansen" <dave.hansen@intel.com>,
	linux-pci@vger.kernel.org, x86@kernel.org, linuxarm@huawei.com,
	"Ingo Molnar" <mingo@kernel.org>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	"Andy Lutomirski" <luto@kernel.org>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Martin Hundebøll" <martin@geanix.com>,
	"Linux Memory Management List" <linux-mm@kvack.org>,
	"ACPI Devel Mailing List" <linux-acpi@vger.kernel.org>
Subject: Re: [PATCH V2] x86: Fix an issue with invalid ACPI NUMA config
Date: Tue, 29 Jan 2019 13:05:56 -0600	[thread overview]
Message-ID: <20190129190556.GB91506@google.com> (raw)
In-Reply-To: <20190129095105.00000374@huawei.com>

On Tue, Jan 29, 2019 at 09:51:05AM +0000, Jonathan Cameron wrote:
> On Mon, 28 Jan 2019 17:13:22 -0600
> Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Mon, Jan 28, 2019 at 11:31:08AM +0000, Jonathan Cameron wrote:
> > > On Thu, 20 Dec 2018 13:57:14 -0600
> > > Bjorn Helgaas <helgaas@kernel.org> wrote:  
> > > > On Thu, Dec 20, 2018 at 09:13:12AM -0800, Dave Hansen wrote:  
> > > > > On 12/20/18 7:12 AM, Bjorn Helgaas wrote:    

> > The current patch proposes setting "numa_off=1" in the x86 version of
> > dummy_numa_init(), on the assumption (from the changelog) that:
> > 
> >   It is invalid under the ACPI spec to specify new NUMA nodes using
> >   _PXM if they have no presence in SRAT.
> > 
> > Do you have a reference for this?  I looked and couldn't find a clear
> > statement in the spec to that effect.  The _PXM description (ACPI
> > v6.2, sec 6.1.14) says that two devices with the same _PXM value are
> > in the same proximity domain, but it doesn't seem to require an SRAT.
> 
> No comment (feel free to guess why). *sigh*

Secret interpretations of the spec are out of bounds.  But I think
it's a waste of time to argue about whether _PXM without SRAT is
valid.  Systems like that exist, and I think it's possible to do
something sensible with them.

> > Maybe it results in an issue when we call kmalloc_node() using this
> > _PXM value that SRAT didn't tell us about?  If so, that's reminiscent
> > of these earlier discussions about kmalloc_node() returning something
> > useless if the requested node is not online:
> > 
> >   https://lkml.kernel.org/r/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com
> >   https://lore.kernel.org/linux-arm-kernel/20180801173132.19739-1-punit.agrawal@arm.com/
> > 
> > As far as I know, that was never really resolved.  The immediate
> > problem of passing an invalid node number to kmalloc_node() was
> > avoided by using kmalloc() instead.
> 
> Yes, that's definitely still a problem (or was, the last time I checked).
> 
> > > Dave's response was that we needed to fix the underlying issue of
> > > trying to allocate from non-existent NUMA nodes.

> > Bottom line, I totally agree that it would be better to fix the
> > underlying issue without trying to avoid it by disabling NUMA.
> 
> I don't agree on this point.  I think two layers make sense.
> 
> If there is no NUMA description in DT or ACPI, why not just stop anything
> from using it at all?  The firmware has basically declared there is no
> point, so why not save a bit of complexity (and use an existing, tested
> code path) by setting numa_off?

Firmware with a _PXM does have a NUMA description.

> However, if there is a NUMA description but it has bugs, then we should
> defend in depth.  A simple example: we declare 2 nodes but then use _PXM
> for a third.  I've done that by accident, and it blows up in a nasty
> fashion (I haven't tried it for a while, but that's probably still true).
> 
> Given that the DSDT is only parsed long after SRAT, we could just check
> each _PXM as it is queried.  Or I suppose we could do a verification pass
> over all _PXM entries and warn when they don't match SRAT?
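
If we went the warning route, I'd imagine something roughly like this
wherever we evaluate _PXM (completely untested sketch; I'm relying on
pxm_to_node() returning NUMA_NO_NODE for a proximity domain that SRAT
never described, and "handle" stands in for whatever ACPI handle the
caller has):

        unsigned long long pxm;
        int node;

        if (ACPI_FAILURE(acpi_evaluate_integer(handle, "_PXM", NULL, &pxm)))
                return NUMA_NO_NODE;

        node = pxm_to_node(pxm);
        if (node == NUMA_NO_NODE)
                pr_warn_once(FW_BUG "_PXM %llu not described in SRAT\n", pxm);

        return node;

That way we complain once but still limp along instead of blowing up.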

I'm assuming the crash happens when we call kmalloc_node() with a node
not mentioned in SRAT.  I think that's just a sub-optimal implementation
in kmalloc_node().

We *could* fail the allocation and return a NULL pointer, but I think
even that is excessive.  I think we should simply fall back to
kmalloc().  We could print a one-time warning if that's useful.

If kmalloc_node() for an unknown node fell back to kmalloc(), would
anything else be required?
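
Concretely, something like this is what I'm picturing at the call site
for now, until kmalloc_node() itself learns to do it (untested sketch;
"nid", "size", and "buf" stand in for whatever the caller has, and
node_online() may not be exactly the right test):

        void *buf;

        /* If SRAT never set this node up, don't insist on local memory. */
        if (nid != NUMA_NO_NODE && !node_online(nid)) {
                pr_warn_once("node %d not online, using default allocation\n",
                             nid);
                nid = NUMA_NO_NODE;
        }
        buf = kmalloc_node(size, GFP_KERNEL, nid);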

> > > Whilst I agree with that in principle (having managed to provide
> > > tables doing exactly that during development a few times!), I'm not
> > > sure the path to doing so is clear, so this has been stalled for a
> > > few months.  To my mind there is still a strong argument, even with
> > > such protection in place, for short-cutting it so that you get the
> > > same paths whether you deliberately disable NUMA or you have no SRAT
> > > and hence can't have NUMA.
> > 
> > I guess we need to resolve the question of whether NUMA without SRAT
> > is possible.
> 
> It's certainly unclear whether it has any meaning.  If we accept that
> the intent of ACPI was never to allow this (a bit of history checking
> verified this, as best as anyone can remember), then what do we do with
> the few platforms that do use _PXM to refer to nodes that haven't been
> defined?

We *could* ignore any _PXM that refers to a proximity domain not
described by SRAT.  That seems a little heavy-handed because it means
every possible proximity domain must be described up front in the SRAT,
which limits the flexibility of hot-adding entire nodes
(CPU/memory/IO).

But I think it's possible to make sense of a _PXM that adds a proximity
domain not mentioned in SRAT: e.g., if a new memory device and a new
I/O device supply the same _PXM value, we can assume they're close
together.  If a new I/O device has a previously unknown _PXM, we may
not be able to allocate memory near it, but we should at least be able
to allocate from a default zone.
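
Roughly what I have in mind when we bind such a device (just a sketch;
"pxm" and "dev" are whatever the enumeration code has in hand, and I'm
assuming acpi_map_pxm_to_node() is still willing to hand out a new node
id for a proximity domain SRAT didn't describe):

        int node;

        /*
         * Accept a proximity domain SRAT never mentioned: give it a
         * node id anyway.  There may be no memory behind it, so fall
         * back to the default zone for allocations.
         */
        node = acpi_map_pxm_to_node(pxm);
        if (node == NUMA_NO_NODE || !node_online(node))
                node = NUMA_NO_NODE;

        set_dev_node(dev, node);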

Bjorn


Thread overview: 15+ messages
2018-12-11  9:47 [PATCH V2] x86: Fix an issue with invalid ACPI NUMA config Jonathan Cameron
2018-12-11 18:19 ` Dave Hansen
2018-12-12  9:39   ` Jonathan Cameron
2018-12-20 15:12     ` Bjorn Helgaas
2018-12-20 17:13       ` Dave Hansen
2018-12-20 19:57         ` Bjorn Helgaas
2019-01-28 11:31           ` Jonathan Cameron
2019-01-28 23:13             ` Bjorn Helgaas
2019-01-29  9:51               ` Jonathan Cameron
2019-01-29 19:05                 ` Bjorn Helgaas [this message]
2019-01-29 19:45                   ` Jonathan Cameron
2019-01-29 21:10                     ` Bjorn Helgaas
2019-02-07 10:12                   ` Jonathan Cameron
