Re: [RFC PATCH 05/15] cxl/acpi: Reserve CXL resources from request_free_mem_region

From: Dan Williams <dan.j.williams@intel.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Ben Widawsky <ben.widawsky@intel.com>,
	linux-cxl@vger.kernel.org, Linux NVDIMM <nvdimm@lists.linux.dev>,
	patches@lists.linux.dev,
	Alison Schofield <alison.schofield@intel.com>,
	Ira Weiny <ira.weiny@intel.com>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	Christoph Hellwig <hch@infradead.org>,
	John Hubbard <jhubbard@nvidia.com>
Subject: Re: [RFC PATCH 05/15] cxl/acpi: Reserve CXL resources from request_free_mem_region
Date: Tue, 19 Apr 2022 14:59:46 -0700	[thread overview]
Message-ID: <CAPcyv4hyTRm7K8gu4wdL_gaMm2C+Agg1V2-BbnmJ8Kf0OH4sng@mail.gmail.com> (raw)
In-Reply-To: <CAPcyv4hPBw0yJmu7qzSZ_gsbVuj+_R7-_r3+_W9-JsLTD6Uscw@mail.gmail.com>

On Tue, Apr 19, 2022 at 2:50 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Tue, Apr 19, 2022 at 9:43 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Mon, Apr 18, 2022 at 09:42:00AM -0700, Dan Williams wrote:
> > > [ add the usual HMM suspects Christoph, Jason, and John ]
> > >
> > > On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > >
> > > > Define an API which allows CXL drivers to manage CXL address space.
> > > > CXL is unique in that the address space and various properties are only
> > > > known after CXL drivers come up, and therefore cannot be part of core
> > > > memory enumeration.
> > >
> > > I think this buries the lead on the problem introduced by
> > > MEMORY_DEVICE_PRIVATE in the first place. Let's revisit that history
> > > before diving into what CXL needs.
> > >
> > >
> > > Commit 4ef589dc9b10 ("mm/hmm/devmem: device memory hotplug using
> > > ZONE_DEVICE") introduced the concept of MEMORY_DEVICE_PRIVATE. At its
> > > core MEMORY_DEVICE_PRIVATE uses the ZONE_DEVICE capability to annotate
> > > an "unused" physical address range with 'struct page' for the purpose
> > > of coordinating migration of buffers onto and off of a GPU /
> > > accelerator. The determination of "unused" was based on a heuristic,
> > > not a guarantee, that any address range not expressly conveyed in the
> > > platform firmware map of the system can be repurposed for software
> > > use. The CXL Fixed Memory Windows Structure  (CFMWS) definition
> > > explicitly breaks the assumptions of that heuristic.
> >
> > So CXL defines an address map that is not part of the FW list?
>
> It defines a super-set of 'potential' address space and a subset that
> is active in the FW list. It's similar to memory hotplug where an
> address range may come online after the fact, but unlike ACPI memory
> hotplug, FW is not involved in the hotplug path, and FW cannot predict
> what address ranges will come online. For example ACPI hotplug knows
> in advance to publish the ranges that can experience an online /
> insert event, CXL has many more degrees of freedom.
>
> >
> > > > It would be desirable to simply insert this address space into
> > > > iomem_resource with a new flag to denote this is CXL memory. This would
> > > > permit request_free_mem_region() to be reused for CXL memory provided it
> > > > learned some new tricks. For that, it is tempting to simply use
> > > > insert_resource(). The API was designed specifically for cases where new
> > > > devices may offer new address space. This cannot work in the general
> > > > case. Boot firmware can pass, some, none, or all of the CFMWS range as
> > > > various types of memory to the kernel, and this may be left alone,
> > > > merged, or even expanded.
> >
> > And then we understand that on CXL the FW is might pass stuff that
> > intersects with the actual CXL ranges?
> >
> > > > As a result iomem_resource may intersect CFMWS
> > > > regions in ways insert_resource cannot handle [2]. Similar reasoning
> > > > applies to allocate_resource().
> > > >
> > > > With the insert_resource option out, the only reasonable approach left
> > > > is to let the CXL driver manage the address space independently of
> > > > iomem_resource and attempt to prevent users of device private memory
> >
> > And finally due to all these FW problems we are going to make a 2nd
> > allocator for physical address space and just disable the normal one?
>
> No, or I am misunderstanding this comment. The CXL address space
> allocator is managing space that can be populated and become an
> iomem_resource. So it's not supplanting iomem_resource it is
> coordinating dynamic extensions to the FW map.
>
> > Then since DEVICE_PRIVATE is a notable user of this allocator we now
> > understand it becomes broken?
> >
> > Sounds horrible. IMHO you should fix the normal allocator somehow to
> > understand that the ranges from FW have been reprogrammed by Linux
>
> There is no reprogramming of the ranges from FW. CXL memory that is
> mapped as System RAM at boot will have the CXL decode configuration
> locked in all the participating devices. The remaining CXL decode
> space is then available for dynamic reconfiguration of CXL resources
> from the devices that the FW explicitly ignores, which is all
> hot-added devices and all persistent-memory capacity.
>
> > and
> > not try to build a whole different allocator in CXL code.
>
> I am not seeing much overlap for DEVICE_PRIVATE and CXL to share an
> allocator. CXL explicitly wants ranges that have been set aside for
> CXL and are related to 1 or more CXL host bridges. DEVICE_PRIVATE
> wants to consume an unused physical address range to proxy
> device-local-memory with no requirements on what range is chosen as
> long as it does not collide with anything else.

...or are you suggesting to represent CXL free memory capacity in
iomem_resource and augment the FW list early with CXL ranges. That
seems doable, but it would only represent the free CXL ranges in
iomem_resource as the populated CXL ranges cannot have their resources
reparented after the fact, and there is plenty of code that expects
"System RAM" to be a top-level resource.