From: Dan Williams <dan.j.williams@intel.com>
To: David Hildenbrand <david@redhat.com>
Cc: Vikram Sethi <vsethi@nvidia.com>,
	"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
	"Natu, Mahesh" <mahesh.natu@intel.com>,
	"Rudoff, Andy" <andy.rudoff@intel.com>,
	Jeff Smith <JSMITH@nvidia.com>,
	Mark Hairgrove <mhairgrove@nvidia.com>,
	"jglisse@redhat.com" <jglisse@redhat.com>,
	Linux MM <linux-mm@kvack.org>,
	Linux ACPI <linux-acpi@vger.kernel.org>,
	Anshuman Khandual <anshuman.khandual@arm.com>
Subject: Re: Onlining CXL Type2 device coherent memory
Date: Sat, 31 Oct 2020 09:51:23 -0700	[thread overview]
Message-ID: <CAPcyv4jX1tedjuU-vCSKgvhQeNFukyq9d0ddmsk7jAjWMX+iBQ@mail.gmail.com> (raw)
In-Reply-To: <451b2571-c3e8-97d8-bfd0-f8054a1b75c5@redhat.com>

On Sat, Oct 31, 2020 at 3:21 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 30.10.20 21:37, Dan Williams wrote:
> > On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi <vsethi@nvidia.com> wrote:
> >>
> >> Hello,
> >>
> >> I wanted to kick off a discussion on how Linux onlining of coherent memory,
> >> aka Host-managed Device Memory (HDM), will work for CXL [1] type 2 devices
> >> that are available/plugged in at boot. A type 2 CXL device can simply be
> >> thought of as an accelerator with coherent device memory that also has a
> >> CXL.cache to cache system memory.
> >>
> >> One could envision that BIOS/UEFI could expose the HDM in the EFI memory map
> >> as conventional memory, as well as in ACPI SRAT/SLIT/HMAT. However, at least
> >> on some architectures (arm64), EFI conventional memory that is present at
> >> kernel boot cannot be offlined later, so this may not be suitable on all
> >> architectures.
> >
> > That seems an odd restriction. Adding David, linux-mm, and linux-acpi, as
> > they might be interested in / have comments on this restriction as well.
> >
>
> I am missing some important details.
>
> a) What happens after offlining? Will the memory be remove_memory()'ed?
> Will the device get physically unplugged?
>
> b) What's the general purpose of the memory and its intended usage when
> *not* exposed as system RAM? What's the main point of treating it like
> ordinary system RAM by default?
>
> Also, can you be sure that you can offline that memory? If it's
> ZONE_NORMAL (as is usual for all system RAM in the initial map), there
> are no such guarantees, especially once the system has run for long
> enough, but also in other cases (e.g., shuffling), or if allocation
> policies change in the future.
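
For reference, offlining is attempted per memory block via sysfs, and it
simply fails (typically with -EBUSY) when the block contains unmovable
allocations; the block number below is illustrative:

  echo offline > /sys/devices/system/memory/memory42/state
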
>
> So I *guess* you would already have to use kernel cmdline hacks like
> "movablecore" to make it work. In that case, you can directly specify
> what you *actually* want (which I am not yet sure I completely
> understand) - e.g., something like "memmap=16G!16G" ... or something
> similar.
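
As a hedged illustration of what such cmdline directives look like (the
sizes and offsets are purely illustrative):

  movablecore=16G   keep 16G of boot memory in ZONE_MOVABLE, so it
                    remains offlineable
  memmap=16G!16G    carve 16G at offset 16G out of the map as a
                    persistent-memory-like range for a driver to claim
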
>
> I consider offlining+removing *boot* memory when the goal is not to
> physically unplug it (e.g., a DIMM actually getting removed) an abuse of
> the memory hotunplug infrastructure. It's a different thing when memory
> is added manually, as dax_kmem does via add_memory_driver_managed().
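
As a point of reference, a minimal sketch of that dax_kmem-style flow
from a driver's perspective; the node id, range, and driver name are
illustrative, and the four-argument signature shown is the v5.8-era one
(newer kernels add an mhp_t flags argument):

  #include <linux/memory_hotplug.h>
  #include <linux/types.h>

  /*
   * Hedged sketch: hot-add an accelerator's coherent memory range as
   * driver-managed "System RAM", modeled on what dax_kmem does. The
   * nid/start/size values and the "hdm_example" name are illustrative.
   */
  static int hdm_example_add_memory(int nid, u64 start, u64 size)
  {
          /*
           * The resource name must have the form "System RAM ($DRIVER)".
           * The range is flagged IORESOURCE_SYSRAM_DRIVER_MANAGED so
           * that, e.g., kexec_file_load() does not treat it as ordinary
           * boot memory.
           */
          return add_memory_driver_managed(nid, start, size,
                                           "System RAM (hdm_example)");
  }

A driver that wants a private carve-out would simply pass only the
remainder of the HDM range here.
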
>
>
> Now, back to your original question: arm64 does not support physically
> unplugging DIMMs that were part of the initial map. If you'd reboot
> after unplugging a DIMM, your system would crash. We achieve that by
> disallowing the offlining of boot memory - we could also try to handle it in
> ACPI code. But again, most uses of offlining+removing boot memory are
> abusing the memory hotunplug infrastructure and should rather be solved
> cleanly via a different mechanism (firmware, kernel cmdline, ...).
>
> Just recently discussed in
>
> https://lkml.kernel.org/r/de8388df2fbc5a6a33aab95831ba7db4@codeaurora.org
>
> >> Further, the device driver associated with the type 2 device/accelerator may
> >> want to save off a chunk of HDM for driver-private use.
> >> So it seems the more appropriate model may be something like the dev-dax
> >> model, where the device driver probe/open calls add_memory_driver_managed(),
> >> and the driver could choose how much of the HDM it wants to reserve and how
> >> much to make generally available for application mmap/malloc.
> >
> > Sure, it can always be driver managed. The trick will be getting the
> > platform firmware to agree to not map it by default, but I suspect
> > you'll have a hard time convincing platform-firmware to take that
> > stance. The BIOS does not know, and should not care, what OS is booting
> > when it produces the memory map. So I think CXL memory unplug after
> > the fact is more realistic than trying to get the BIOS not to map it.
> > So, to me it looks like arm64 needs to reconsider its unplug stance.
>
> My personal opinion is, if memory isn't just "ordinary system RAM", then
> let the system know early that memory is special (as we do with
> soft-reserved).
>
> Ideally, you could configure the firmware (e.g., via BIOS setup) on what
> to do; that's the cleanest solution, but I can understand that's rather
> hard to achieve.

Yes, my hope, which is about the most influence I can have on
platform-firmware implementations, is that firmware marks CXL-attached
memory as soft-reserved by default and allows OS policy to decide where
it goes. Barring that, for the configuration that Vikram mentioned, the
only other way to get this differentiated / not-ordinary system-ram
back to being driver managed would be to unplug it. The soft-reserved
path is cleaner.
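
For illustration, that soft-reserved flow can already be emulated on x86
and then driven from userspace, assuming a kernel built with the
soft-reserve / device-dax (hmem) support; the sizes, offsets, and device
name below are illustrative:

  efi_fake_mem=16G@16G:0x40000    boot option tagging the range with the
                                  EFI_MEMORY_SP ("soft reserved") attribute

  daxctl reconfigure-device --mode=system-ram dax0.0

The tagged range shows up as a "Soft Reserved" resource claimed by
device-dax, and the daxctl command then hands it to dax_kmem, which
onlines it as system RAM via add_memory_driver_managed().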

