Re: [PATCH V2] x86: Fix an issue with invalid ACPI NUMA config

From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: "Dave Hansen" <dave.hansen@intel.com>,
	linux-pci@vger.kernel.org, x86@kernel.org, linuxarm@huawei.com,
	"Ingo Molnar" <mingo@kernel.org>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	"Andy Lutomirski" <luto@kernel.org>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Martin Hundebøll" <martin@geanix.com>,
	"Linux Memory Management List" <linux-mm@kvack.org>,
	"ACPI Devel Mailing List" <linux-acpi@vger.kernel.org>
Subject: Re: [PATCH V2] x86: Fix an issue with invalid ACPI NUMA config
Date: Thu, 7 Feb 2019 10:12:57 +0000	[thread overview]
Message-ID: <20190207101257.00000a98@huawei.com> (raw)
In-Reply-To: <20190129190556.GB91506@google.com>

On Tue, 29 Jan 2019 13:05:56 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Tue, Jan 29, 2019 at 09:51:05AM +0000, Jonathan Cameron wrote:
> > On Mon, 28 Jan 2019 17:13:22 -0600
> > Bjorn Helgaas <helgaas@kernel.org> wrote:  
> > > On Mon, Jan 28, 2019 at 11:31:08AM +0000, Jonathan Cameron wrote:  
> > > > On Thu, 20 Dec 2018 13:57:14 -0600
> > > > Bjorn Helgaas <helgaas@kernel.org> wrote:    
> > > > > On Thu, Dec 20, 2018 at 09:13:12AM -0800, Dave Hansen wrote:    
> > > > > > On 12/20/18 7:12 AM, Bjorn Helgaas wrote:      
> 
> > > The current patch proposes setting "numa_off=1" in the x86 version of
> > > dummy_numa_init(), on the assumption (from the changelog) that:
> > > 
> > >   It is invalid under the ACPI spec to specify new NUMA nodes using
> > >   _PXM if they have no presence in SRAT.
> > > 
> > > Do you have a reference for this?  I looked and couldn't find a clear
> > > statement in the spec to that effect.  The _PXM description (ACPI
> > > v6.2, sec 6.1.14) says that two devices with the same _PXM value are
> > > in the same proximity domain, but it doesn't seem to require an SRAT.  
> > 
> > No comment (feel free to guess why). *sigh*  
> 
> Secret interpretations of the spec are out of bounds.  But I think
> it's a waste of time to argue about whether _PXM without SRAT is
> valid.  Systems like that exist, and I think it's possible to do
> something sensible with them.

Now less secret :)

https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf

Specifically 
6.2B Errata 1951 _PXM Clarifications

Adds lots of statements including:

(5.2.16)
Note: SRAT is the place where proximity domains are defined, and _PXM provides
a mechanism to associate a device object (and its children) to an SRAT-defined
proximity domain. 

6.2.14 _PXM (Proximity)
This optional object is used to describe proximity domain associations within a
machine. _PXM evaluates to an integer that identifies a device as belonging
to a Proximity Domain defined in the System Resource Affinity Table (SRAT).

Obviously this doesn't necessarily change the fact there 'might' be
a platform out there with tables written against earlier ACPI specs that
does 'deliberately' provide _PXM entries that don't match entries in SRAT.
What is does mean is that going forwards we "shouldn't" see any new ones.

Note that the usecase that was conjectured below is now accounted for with
the new Generic Initiator Domains (5.2.16.6).  There is some juggling done via
an OSC bit to ensure that firmware can 'adjust' it's _PXM entries to account
for whether or not these Generic Initiator domains are supported by the OS.
I'll clean up my patches for that and post soon (if no one beats me to it!)

One thing I will note though, is I'm not going to propose we drop the
numa_off = true line in the arm code, given there aren't any arm platforms
known to have _PXMs not matching entries in SRAT and now we have a spec
that says it isn't right to do it anyway.

Jonathan

> 
> > > Maybe it results in an issue when we call kmalloc_node() using this
> > > _PXM value that SRAT didn't tell us about?  If so, that's reminiscent
> > > of these earlier discussions about kmalloc_node() returning something
> > > useless if the requested node is not online:
> > > 
> > >   https://lkml.kernel.org/r/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com
> > >   https://lore.kernel.org/linux-arm-kernel/20180801173132.19739-1-punit.agrawal@arm.com/
> > > 
> > > As far as I know, that was never really resolved.  The immediate
> > > problem of using passing an invalid node number to kmalloc_node() was
> > > avoided by using kmalloc() instead.  
> > 
> > Yes, that's definitely still a problem (or was last time I checked)
> >   
> > > > Dave's response was that we needed to fix the underlying issue of
> > > > trying to allocate from non existent NUMA nodes.    
> 
> > > Bottom line, I totally agree that it would be better to fix the
> > > underlying issue without trying to avoid it by disabling NUMA.  
> > 
> > I don't agree on this point.  I think two layers make sense.
> > 
> > If there is no NUMA description in DT or ACPI, why not just stop anything
> > from using it at all?  The firmware has basically declared there is no
> > point, why not save a bit of complexity (and use an existing tested code
> > path) but setting numa_off?  
> 
> Firmware with a _PXM does have a NUMA description.
> 
> > However, if there is NUMA description, but with bugs then we should
> > protect in depth.  A simple example being that we declare 2 nodes, but
> > then use _PXM for a third. I've done that by accident and blows up
> > in a nasty fashion (not done it for a while, but probably still true).
> > 
> > Given DSDT is only parsed long after SRAT we can just check on _PXM
> > queries.  Or I suppose we could do a verification parse for all _PXM
> > entries and put out some warnings if they don't match SRAT entries?  
> 
> I'm assuming the crash happens when we call kmalloc_node() with a node
> not mentioned in SRAT.  I think that's just sub-optimal implementation
> in kmalloc_node().
> 
> We *could* fail the allocation and return a NULL pointer, but I think
> even that is excessive.  I think we should simply fall back to
> kmalloc().  We could print a one-time warning if that's useful.
> 
> If kmalloc_node() for an unknown node fell back to kmalloc(), would
> anything else be required?
> 
> > > > Whilst I agree with that in principle (having managed to provide
> > > > tables doing exactly that during development a few times!), I'm not
> > > > sure the path to doing so is clear and so this has been stalled for
> > > > a few months.  There is to my mind still a strong argument, even
> > > > with such protection in place, that we should still be short cutting
> > > > it so that you get the same paths if you deliberately disable numa,
> > > > and if you have no SRAT and hence can't have NUMA.    
> > > 
> > > I guess we need to resolve the question of whether NUMA without SRAT
> > > is possible.  
> > 
> > It's certainly unclear of whether it has any meaning.  If we allow for
> > the fact that the intent of ACPI was never to allow this (and a bit
> > of history checking verified this as best as anyone can remember),
> > then what do we do with the few platforms that do use _PXM to nodes that
> > haven't been defined?  
> 
> We *could* ignore any _PXM that mentions a proximity domain not
> mentioned by an SRAT.  That seems a little heavy-handed because it
> means every possible proximity domain must be described up front in
> the SRAT, which limits the flexibility of hot-adding entire nodes
> (CPU/memory/IO).
> 
> But I think it's possible to make sense of a _PXM that adds a
> proximity domain not mentioned in an SRAT, e.g., if a new memory
> device and a new I/O device supply the same _PXM value, we can assume
> they're close together.  If a new I/O device has a previously unknown
> _PXM, we may not be able to allocate memory near it, but we should at
> least be able to allocate from a default zone.
> 
> Bjorn