Re: [PATCH] ACPI: NUMA: Add a node and memblk for each CFMWS not in SRAT

From: Dan Williams <dan.j.williams@intel.com>
To: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: "Schofield, Alison" <alison.schofield@intel.com>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Len Brown <lenb@kernel.org>,
	Vishal Verma <vishal.l.verma@intel.com>,
	Ira Weiny <ira.weiny@intel.com>,
	Ben Widawsky <ben.widawsky@intel.com>,
	linux-cxl@vger.kernel.org,
	Linux ACPI <linux-acpi@vger.kernel.org>
Subject: Re: [PATCH] ACPI: NUMA: Add a node and memblk for each CFMWS not in SRAT
Date: Mon, 18 Oct 2021 11:15:36 -0700
Message-ID: <CAPcyv4g=gVeJtSAMPH5VTZfDk+eoL0zkgnQMny=T+xX8RyQKjQ@mail.gmail.com>
In-Reply-To: <20211018102538.00007023@Huawei.com>

On Mon, Oct 18, 2021 at 2:25 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Fri, 15 Oct 2021 11:58:36 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
>
> > On Fri, Oct 15, 2021 at 10:00 AM Jonathan Cameron
> > <Jonathan.Cameron@huawei.com> wrote:
> > >
> > > On Fri, 8 Oct 2021 18:53:39 -0700
> > > <alison.schofield@intel.com> wrote:
> > >
> > > > From: Alison Schofield <alison.schofield@intel.com>
> > > >
> > > > During NUMA init, CXL memory defined in the SRAT Memory Affinity
> > > > subtable may be assigned to a NUMA node. Since there is no
> > > > requirement that the SRAT be comprehensive for CXL memory another
> > > > mechanism is needed to assign NUMA nodes to CXL memory not identified
> > > > in the SRAT.
> > > >
> > > > Use the CXL Fixed Memory Window Structures (CFMWS) of the ACPI CXL
> > > > Early Discovery Table (CEDT) to find all CXL memory ranges. Create a
> > > > NUMA node for each range that is not already assigned to a NUMA node.
> > > > Add a memblk attaching its host physical address range to the node.
> > > >
> > > > Note that these ranges may not actually map any memory at boot time.
> > > > They may describe persistent capacity or may be present to enable
> > > > hot-plug.
> > > >
> > > > Consumers can use phys_to_target_node() to discover the NUMA node.
> > > >
> > > > Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> > > Hi Alison,
> > >
> > > I'm not sure that a CFMWS entry should map to a single NUMA node...
> > >
> > > Each entry corresponds to a contiguous HPA range into which CXL devices
> > > below a set of ports (if interleaved) or one port should be mapped.
> > >
> > > That could be multiple devices, each with its own performance characteristics,
> > > or potentially a mix of persistent and volatile memory on a system with limited
> > > qtg groups.
> > >
> > > Maybe it's the best we can do though given information available
> > > before any devices are present.
> > >
> >
> > Regardless of the performance of the individual devices they can only
> > map to one of the available CFMWS entries. So the maximum number of
> > degrees of freedom is one node per CFMWS. Now if you have only one
> > entry to pick from, but have interleave sets with widely different
> > performance characteristics to online it becomes a policy decision
> > about whether to force map those interleave sets into the same node,
> > and that policy can be maintained outside the kernel.
> >
> > The alternative is to rework NUMA nodes to be something that can be
> > declared dynamically as currently there are assumptions throughout the
> > kernel that num_possible_nodes() is statically determined early in
> > boot. I am not seeing strong evidence that complexity needs to be
> > tackled in the near term, and "NUMA-node per CFMWS" should (famous
> > last words) serve CXL needs for the foreseeable future.
>
> I'm less optimistic we won't end up revisiting this in the medium
> term but can tackle that when we have better visibility of what
> people are actually building.

Agree. When we were game planning this patch internally the two options
were: build full support for defining new NUMA nodes after boot, or
just extend the boot-time NUMA node possibilities minimally by the
declared degrees of freedom in the CFMWS. The latter path was taken
because it gets us "80%" of what CXL needs without precluding going
the former path later if that remaining "20%" proves critical enough
to justify adding finer grained dynamic support.
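
To make the "one node per CFMWS not already covered" idea concrete, here
is a rough userspace model of the boot-time pass. It is only a sketch,
not the patch itself: struct numa_model, struct range, and the model_*()
helpers are hypothetical stand-ins invented for illustration (the real
code works against the kernel's memblk and proximity-domain machinery),
and the sketch assumes that checking only the start address of each
window is enough to decide whether it is already covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_NODE  (-1)
#define MAX_BLKS 16

/* Half-open host physical address range: [start, end). */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Hypothetical stand-in for the boot-time memblk-to-node table. */
struct numa_model {
	struct range blk[MAX_BLKS];
	int blk_node[MAX_BLKS];
	size_t nr_blks;
	int nr_nodes;	/* node IDs 0..nr_nodes-1 are already in use */
};

/* Return the node that owns addr, or NO_NODE if no memblk covers it. */
int model_target_node(const struct numa_model *m, uint64_t addr)
{
	for (size_t i = 0; i < m->nr_blks; i++)
		if (addr >= m->blk[i].start && addr < m->blk[i].end)
			return m->blk_node[i];
	return NO_NODE;
}

/* Record that [start, end) belongs to node. */
void model_add_memblk(struct numa_model *m, int node,
		      uint64_t start, uint64_t end)
{
	assert(m->nr_blks < MAX_BLKS);
	m->blk[m->nr_blks].start = start;
	m->blk[m->nr_blks].end = end;
	m->blk_node[m->nr_blks] = node;
	m->nr_blks++;
}

/*
 * Walk the CFMWS windows once. Any window whose start address is not
 * already assigned to a node gets a fresh node and a memblk spanning
 * the window. Returns the number of nodes added.
 */
int model_add_cfmws_nodes(struct numa_model *m,
			  const struct range *win, size_t nr_win)
{
	int added = 0;

	for (size_t i = 0; i < nr_win; i++) {
		if (model_target_node(m, win[i].start) != NO_NODE)
			continue;	/* already described, e.g. by SRAT */

		int node = m->nr_nodes++;
		model_add_memblk(m, node, win[i].start, win[i].end);
		added++;
	}
	return added;
}
```

The point of the model is that the set of possible nodes only grows by
the number of uncovered windows, and it is fixed once this pass is
done, which is why nothing else in the kernel has to cope with nodes
appearing later.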