Linux-CXL Archive on lore.kernel.org
* Onlining CXL Type2 device coherent memory
@ 2020-10-28 23:05 Vikram Sethi
  2020-10-29 14:50 ` Ben Widawsky
  2020-10-30 20:37 ` Dan Williams
  0 siblings, 2 replies; 17+ messages in thread
From: Vikram Sethi @ 2020-10-28 23:05 UTC (permalink / raw)
  To: linux-cxl
  Cc: Dan Williams, Natu, Mahesh, Rudoff, Andy, Jeff Smith,
	Mark Hairgrove, jglisse, Vikram Sethi

Hello, 
 
I wanted to kick off a discussion on how Linux onlining of CXL [1] type 2 device 
Coherent memory aka Host managed device memory (HDM) will work for type 2 CXL 
devices which are available/plugged in at boot. A type 2 CXL device can be simply 
thought of as an accelerator with coherent device memory, that also has a 
CXL.cache to cache system memory. 
 
One could envision that BIOS/UEFI could expose the HDM in EFI memory map 
as conventional memory as well as in ACPI SRAT/SLIT/HMAT. However, at least 
on some architectures (arm64) EFI conventional memory available at kernel boot
cannot be offlined, so this may not be suitable on all architectures.
 
Further, the device driver associated with the type 2 device/accelerator may 
want to save off a chunk of HDM for driver private use. 
So it seems the more appropriate model may be something like the dev dax model
where the device driver probe/open calls add_memory_driver_managed(), and
the driver could choose how much of the HDM it wants to reserve and how 
much to make generally available for application mmap/malloc. 
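
A minimal sketch of the split described above, written as plain userspace C
so the arithmetic can be checked in isolation. The 128 MiB block size, the
struct, and the helper name hdm_plan_split() are assumptions made for
illustration; a real driver would pass the resulting range to
add_memory_driver_managed() and would take the block size from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed hotplug granularity; the real value depends on arch and config. */
#define MEMORY_BLOCK_SIZE (128ULL << 20)

/* Hypothetical result type: how one HDM range is divided. */
struct hdm_split {
	uint64_t private_start, private_size; /* kept for driver-private use */
	uint64_t sysram_start, sysram_size;   /* candidate for onlining as RAM */
};

/*
 * Keep at least want_private bytes at the bottom of the HDM range for the
 * driver and offer the block-aligned remainder as system RAM. Any unaligned
 * tail past the last full block is simply not offered. Returns 0 on success,
 * -1 if no full memory block is left to hand over. Overflow of base + size
 * is not checked in this sketch.
 */
static int hdm_plan_split(uint64_t base, uint64_t size, uint64_t want_private,
			  struct hdm_split *out)
{
	uint64_t mask = MEMORY_BLOCK_SIZE - 1;
	uint64_t start, aligned_end;

	assert(out != NULL);
	if (want_private > size)
		return -1;

	/* First shareable address, rounded up to a block boundary. */
	start = (base + want_private + mask) & ~mask;
	/* End of the range, rounded down to a block boundary. */
	aligned_end = (base + size) & ~mask;
	if (start >= aligned_end)
		return -1;

	out->private_start = base;
	out->private_size = start - base;
	out->sysram_start = start;
	out->sysram_size = aligned_end - start;
	return 0;
}
```

With a 16 GiB HDM range based at 64 GiB and a private request of 1 GiB + 1 MiB,
this sketch keeps 1152 MiB private (the request rounded up to a block) and
offers the remaining 15232 MiB.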
 
Another thing to think about is whether the kernel relies on UEFI having fully
described NUMA proximity domains and end-to-end NUMA distances for HDM,
or whether the kernel will provide some infrastructure to make use of the
device-local affinity information provided by the device in the Coherent Device
Attribute Table (CDAT) via a mailbox. The kernel would use that to add a new
NUMA node ID for the HDM, with the NUMA distances calculated by adding the
device-local distance to the NUMA distance of the host bridge/root port. At least
that's how I think CDAT is supposed to work when the kernel doesn't want to rely
on BIOS tables.
 
A similar question on NUMA node ID and distances for HDM arises for CXL hotplug. 
Will the kernel rely on CDAT, and create its own NUMA node ID and patch up 
distances, or will it rely on the BIOS providing a PXM domain, reserved at boot in
SRAT, to be used later on hotplug?
 
Thanks,
Vikram
 
[1] https://www.computeexpresslink.org/download-the-specification



* Re: Onlining CXL Type2 device coherent memory
  2020-10-28 23:05 Onlining CXL Type2 device coherent memory Vikram Sethi
@ 2020-10-29 14:50 ` Ben Widawsky
  2020-10-30 20:37 ` Dan Williams
  1 sibling, 0 replies; 17+ messages in thread
From: Ben Widawsky @ 2020-10-29 14:50 UTC (permalink / raw)
  To: Vikram Sethi
  Cc: linux-cxl, Dan Williams, Natu, Mahesh, Rudoff, Andy, Jeff Smith,
	Mark Hairgrove, jglisse

On 20-10-28 23:05:48, Vikram Sethi wrote:
> Hello, 
>  
> I wanted to kick off a discussion on how Linux onlining of CXL [1] type 2 device 
> Coherent memory aka Host managed device memory (HDM) will work for type 2 CXL 
> devices which are available/plugged in at boot. A type 2 CXL device can be simply 
> thought of as an accelerator with coherent device memory, that also has a 
> CXL.cache to cache system memory. 
>  
> One could envision that BIOS/UEFI could expose the HDM in EFI memory map 
> as conventional memory as well as in ACPI SRAT/SLIT/HMAT. However, at least 
> on some architectures (arm64) EFI conventional memory available at kernel boot 
> memory cannot be offlined, so this may not be suitable on all architectures. 

If the expectation is that BIOS/UEFI is setting up these regions, the HDM
decoder registers themselves can be read to determine the regions. We'll have to
do this anyway for certain cases.

>  
> Further, the device driver associated with the type 2 device/accelerator may 
> want to save off a chunk of HDM for driver private use. 
> So it seems the more appropriate model may be something like dev dax model 
> where the device driver probe/open calls add_memory_driver_managed, and 
> the driver could choose how much of the HDM it wants to reserve and how 
> much to make generally available for application mmap/malloc. 

To me it seems whether the BIOS reports the HDM in the memory map is an
implementation detail that's up to the platform vendor. It would be unwise for
the device driver, and perhaps the bus driver, to skip verification of the
register programming in the HDM decoders even if it is in the memory map.

>  
> Another thing to think about is whether the kernel relies on UEFI having fully 
> described NUMA proximity domains and end-end NUMA distances for HDM,
> or whether the kernel will provide some infrastructure to make use of the 
> device-local affinity information provided by the device in the Coherent Device 
> Attribute Table (CDAT) via a mailbox, and use that to add a new NUMA node ID
> for the HDM, and with the NUMA distances calculated by adding to the NUMA 
> distance of the host bridge/Root port with the device local distance. At least 
> that's how I think CDAT is supposed to work when kernel doesn't want to rely 
> on BIOS tables.

If/when hotplug is a thing, CDAT will be the only viable mechanism to obtain
this information and so the kernel would have to make use of it. I hadn't really
thought about forgoing BIOS-provided tables altogether and only using CDAT.
That's interesting...

The one thing I'll lament while I'm here is the decision to put CDAT
behind DOE...

>  
> A similar question on NUMA node ID and distances for HDM arises for CXL hotplug. 
> Will the kernel rely on CDAT, and create its own NUMA node ID and patch up 
> distances, or will it rely on BIOS providing PXM domain reserved at boot in 
> SRAT to be used later on hotplug?

I don't have enough knowledge here, but it's an interesting question to me as
well.



* Re: Onlining CXL Type2 device coherent memory
  2020-10-28 23:05 Onlining CXL Type2 device coherent memory Vikram Sethi
  2020-10-29 14:50 ` Ben Widawsky
@ 2020-10-30 20:37 ` Dan Williams
  2020-10-30 20:59   ` Matthew Wilcox
                     ` (2 more replies)
  1 sibling, 3 replies; 17+ messages in thread
From: Dan Williams @ 2020-10-30 20:37 UTC (permalink / raw)
  To: Vikram Sethi
  Cc: linux-cxl, Natu, Mahesh, Rudoff, Andy, Jeff Smith,
	Mark Hairgrove, jglisse, David Hildenbrand, Linux MM, Linux ACPI

On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi <vsethi@nvidia.com> wrote:
>
> Hello,
>
> I wanted to kick off a discussion on how Linux onlining of CXL [1] type 2 device
> Coherent memory aka Host managed device memory (HDM) will work for type 2 CXL
> devices which are available/plugged in at boot. A type 2 CXL device can be simply
> thought of as an accelerator with coherent device memory, that also has a
> CXL.cache to cache system memory.
>
> One could envision that BIOS/UEFI could expose the HDM in EFI memory map
> as conventional memory as well as in ACPI SRAT/SLIT/HMAT. However, at least
> on some architectures (arm64) EFI conventional memory available at kernel boot
> memory cannot be offlined, so this may not be suitable on all architectures.

That seems an odd restriction. Add David, linux-mm, and linux-acpi as
they might be interested / have comments on this restriction as well.

> Further, the device driver associated with the type 2 device/accelerator may
> want to save off a chunk of HDM for driver private use.
> So it seems the more appropriate model may be something like dev dax model
> where the device driver probe/open calls add_memory_driver_managed, and
> the driver could choose how much of the HDM it wants to reserve and how
> much to make generally available for application mmap/malloc.

Sure, it can always be driver managed. The trick will be getting the
platform firmware to agree to not map it by default, but I suspect
you'll have a hard time convincing platform-firmware to take that
stance. The BIOS does not know, and should not care what OS is booting
when it produces the memory map. So I think CXL memory unplug after
the fact is more realistic than trying to get the BIOS not to map it.
So, to me it looks like arm64 needs to reconsider its unplug stance.

> Another thing to think about is whether the kernel relies on UEFI having fully
> described NUMA proximity domains and end-end NUMA distances for HDM,
> or whether the kernel will provide some infrastructure to make use of the
> device-local affinity information provided by the device in the Coherent Device
> Attribute Table (CDAT) via a mailbox, and use that to add a new NUMA node ID
> for the HDM, and with the NUMA distances calculated by adding to the NUMA
> distance of the host bridge/Root port with the device local distance. At least
> that's how I think CDAT is supposed to work when kernel doesn't want to rely
> on BIOS tables.

The kernel can supplement the NUMA configuration from CDAT, but not if
the memory is already enumerated in the EFI Memory Map and ACPI
SRAT/HMAT. At that point CDAT is a nop because the BIOS has precluded
the OS from consuming it.

> A similar question on NUMA node ID and distances for HDM arises for CXL hotplug.
> Will the kernel rely on CDAT, and create its own NUMA node ID and patch up
> distances, or will it rely on BIOS providing PXM domain reserved at boot in
> SRAT to be used later on hotplug?

I don't expect the kernel to merge any CDAT data into the ACPI tables.
Instead the kernel will optionally use CDAT as an alternative method
to generate Linux NUMA topology independent of ACPI SRAT. Think of it
like Linux supporting both ACPI and Open Firmware NUMA descriptions at
the same time. CDAT is its own NUMA description domain unless BIOS has
blurred the lines and pre-incorporated it into SRAT/HMAT. That said I
think the CXL attached memory not described by EFI / ACPI is currently
the NULL set.


* Re: Onlining CXL Type2 device coherent memory
  2020-10-30 20:37 ` Dan Williams
@ 2020-10-30 20:59   ` Matthew Wilcox
  2020-10-30 23:38     ` Dan Williams
  2020-10-30 22:39   ` Vikram Sethi
  2020-10-31 10:21   ` David Hildenbrand
  2 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2020-10-30 20:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: Vikram Sethi, linux-cxl, Natu, Mahesh, Rudoff, Andy, Jeff Smith,
	Mark Hairgrove, jglisse, David Hildenbrand, Linux MM, Linux ACPI

On Fri, Oct 30, 2020 at 01:37:18PM -0700, Dan Williams wrote:
> On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi <vsethi@nvidia.com> wrote:
> > CXL 
> > (HDM) CXL
> > CXL 
> > CXL.cache 
> >
> > BIOS/UEFI HDM EFI 
> > ACPI SRAT/SLIT/HMAT
> > EFI 
> > HDM 
> > HDM 
> 
> BIOS OS 
> CXL 
> BIOS 
> 
> > UEFI 
> > NUMA NUMA HDM,
> > (CDAT) NUMA 
> > HDM, NUMA NUMA
> > CDAT 
> > BIOS 
> 
> NUMA CDAT
> EFI ACPI
> SRAT/HMAT CDAT BIOS 
> OS 
> 
> > NUMA HDM CXL 
> > CDAT, NUMA 
> > BIOS PXM 
> > SRAT 
> 
> CDAT ACPI 
> CDAT 
> NUMA ACPI SRAT
> ACPI NUMA 
> CDAT NUMA BIOS 
> SRAT/HMAT. 
> CXL EFI / ACPI 

i don't know what you're talking about but you must both work for
hardware manufacturers!


* RE: Onlining CXL Type2 device coherent memory
  2020-10-30 20:37 ` Dan Williams
  2020-10-30 20:59   ` Matthew Wilcox
@ 2020-10-30 22:39   ` Vikram Sethi
  2020-11-02 17:47     ` Dan Williams
  2020-10-31 10:21   ` David Hildenbrand
  2 siblings, 1 reply; 17+ messages in thread
From: Vikram Sethi @ 2020-10-30 22:39 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Natu, Mahesh, Rudoff, Andy, Jeff Smith,
	Mark Hairgrove, jglisse, David Hildenbrand, Linux MM, Linux ACPI,
	will, anshuman.khandual, catalin.marinas, Ard Biesheuvel

Hi Dan, 
> From: Dan Williams <dan.j.williams@intel.com>
> On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi <vsethi@nvidia.com> wrote:
> >
> > Hello,
> >
> > I wanted to kick off a discussion on how Linux onlining of CXL [1] type 2 device
> > Coherent memory aka Host managed device memory (HDM) will work for type 2 CXL
> > devices which are available/plugged in at boot. A type 2 CXL device can be simply
> > thought of as an accelerator with coherent device memory, that also has a
> > CXL.cache to cache system memory.
> >
> > One could envision that BIOS/UEFI could expose the HDM in EFI memory map
> > as conventional memory as well as in ACPI SRAT/SLIT/HMAT. However, at least
> > on some architectures (arm64) EFI conventional memory available at kernel boot
> > memory cannot be offlined, so this may not be suitable on all architectures.
> 
> That seems an odd restriction. Add David, linux-mm, and linux-acpi as
> they might be interested / have comments on this restriction as well.
> 
> > Further, the device driver associated with the type 2 device/accelerator may
> > want to save off a chunk of HDM for driver private use.
> > So it seems the more appropriate model may be something like dev dax model
> > where the device driver probe/open calls add_memory_driver_managed, and
> > the driver could choose how much of the HDM it wants to reserve and how
> > much to make generally available for application mmap/malloc.
> 
> Sure, it can always be driver managed. The trick will be getting the
> platform firmware to agree to not map it by default, but I suspect
> you'll have a hard time convincing platform-firmware to take that
> stance. The BIOS does not know, and should not care what OS is booting
> when it produces the memory map. So I think CXL memory unplug after
> the fact is more realistic than trying to get the BIOS not to map it.
> So, to me it looks like arm64 needs to reconsider its unplug stance.

Agree. Cc'ing Anshuman, Will, Catalin, and Ard in case I missed something in
Anshuman's patches adding arm64 memory remove, or in case there are any plans
to remove the limitation.
 
> > Another thing to think about is whether the kernel relies on UEFI having fully
> > described NUMA proximity domains and end-end NUMA distances for HDM,
> > or whether the kernel will provide some infrastructure to make use of the
> > device-local affinity information provided by the device in the Coherent Device
> > Attribute Table (CDAT) via a mailbox, and use that to add a new NUMA node ID
> > for the HDM, and with the NUMA distances calculated by adding to the NUMA
> > distance of the host bridge/Root port with the device local distance. At least
> > that's how I think CDAT is supposed to work when kernel doesn't want to rely
> > on BIOS tables.
> 
> The kernel can supplement the NUMA configuration from CDAT, but not if
> the memory is already enumerated in the EFI Memory Map and ACPI
> SRAT/HMAT. At that point CDAT is a nop because the BIOS has precluded
> the OS from consuming it.

That makes sense.

> > A similar question on NUMA node ID and distances for HDM arises for CXL hotplug.
> > Will the kernel rely on CDAT, and create its own NUMA node ID and patch up
> > distances, or will it rely on BIOS providing PXM domain reserved at boot in
> > SRAT to be used later on hotplug?
> 
> I don't expect the kernel to merge any CDAT data into the ACPI tables.
> Instead the kernel will optionally use CDAT as an alternative method
> to generate Linux NUMA topology independent of ACPI SRAT. Think of it
> like Linux supporting both ACPI and Open Firmware NUMA descriptions at
> the same time. CDAT is its own NUMA description domain unless BIOS has
> blurred the lines and pre-incorporated it into SRAT/HMAT. That said I
> think the CXL attached memory not described by EFI / ACPI is currently
> the NULL set.

What I meant by patch/merge was this: take a dual socket system with distance 40
between the sockets (not getting into HMAT vs SLIT description of latency).
If you hotplug a CXL type 2/3 device whose CDAT says the device-local 'distance'
is 80, then the kernel still merges that 80 with the 40 to the remote socket,
giving 120 from a remote-socket CPU to this socket's CXL device. That is, whether
the 40 came from SLIT or HMAT, it is still merged with the data the kernel had
obtained from ACPI. I think you're saying the same thing in a different way:
that the device-local part is not being merged with anything ACPI provided for
the device, for example _SLI at the time of hotplug (which I agree with).
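
The arithmetic in that example can be sketched as follows. This is only an
illustration in userspace C, not kernel code: MAX_NODES, the matrix layout,
and the helper name add_hdm_node() are hypothetical, and using 10 as the
node-to-itself distance follows the usual SLIT convention. Adding the attach
node's own local distance (10) for CPUs on the same socket is an assumption
of this sketch.

```c
#include <assert.h>

#define MAX_NODES 8
#define LOCAL_DISTANCE 10 /* conventional distance from a node to itself */

/*
 * Append a NUMA node for hot-added HDM to a distance matrix that currently
 * describes nr_nodes nodes. The distance from each existing node to the new
 * node is that node's distance to attach_node (the node of the host bridge /
 * root port) plus the device-local distance reported by CDAT.
 * Returns the new node ID, or -1 on invalid input.
 */
static int add_hdm_node(unsigned int dist[MAX_NODES][MAX_NODES], int nr_nodes,
			int attach_node, unsigned int device_local)
{
	int new_node = nr_nodes;

	if (nr_nodes < 1 || nr_nodes >= MAX_NODES)
		return -1;
	if (attach_node < 0 || attach_node >= nr_nodes)
		return -1;

	for (int i = 0; i < nr_nodes; i++) {
		unsigned int d = dist[i][attach_node] + device_local;

		dist[i][new_node] = d; /* existing node -> HDM node */
		dist[new_node][i] = d; /* assumed symmetric */
	}
	dist[new_node][new_node] = LOCAL_DISTANCE;
	return new_node;
}
```

Starting from a two-socket matrix {{10, 40}, {40, 10}} and attaching a device
with device-local distance 80 to socket 0, the remote socket (node 1) ends up
at 40 + 80 = 120 from the new node, matching the numbers above.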

Vikram


* Re: Onlining CXL Type2 device coherent memory
  2020-10-30 20:59   ` Matthew Wilcox
@ 2020-10-30 23:38     ` Dan Williams
  0 siblings, 0 replies; 17+ messages in thread
From: Dan Williams @ 2020-10-30 23:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Vikram Sethi, linux-cxl, Natu, Mahesh, Rudoff, Andy, Jeff Smith,
	Mark Hairgrove, jglisse, David Hildenbrand, Linux MM, Linux ACPI

On Fri, Oct 30, 2020 at 2:00 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Oct 30, 2020 at 01:37:18PM -0700, Dan Williams wrote:
> > On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi <vsethi@nvidia.com> wrote:
[..]
>
> i don't know what you're talking about but you must both work for
> hardware manufacturers!

If only there was some software project that could comprehend these
details and synthesize an interface for applications to use. Even
better if this project had mailing lists with experts that could parse
the acronym soup into code patches.


* Re: Onlining CXL Type2 device coherent memory
  2020-10-30 20:37 ` Dan Williams
  2020-10-30 20:59   ` Matthew Wilcox
  2020-10-30 22:39   ` Vikram Sethi
@ 2020-10-31 10:21   ` David Hildenbrand
  2020-10-31 16:51     ` Dan Williams
  2 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2020-10-31 10:21 UTC (permalink / raw)
  To: Dan Williams, Vikram Sethi
  Cc: linux-cxl, Natu, Mahesh, Rudoff, Andy, Jeff Smith,
	Mark Hairgrove, jglisse, Linux MM, Linux ACPI, Anshuman Khandual

On 30.10.20 21:37, Dan Williams wrote:
> On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi <vsethi@nvidia.com> wrote:
>>
>> Hello,
>>
>> I wanted to kick off a discussion on how Linux onlining of CXL [1] type 2 device
>> Coherent memory aka Host managed device memory (HDM) will work for type 2 CXL
>> devices which are available/plugged in at boot. A type 2 CXL device can be simply
>> thought of as an accelerator with coherent device memory, that also has a
>> CXL.cache to cache system memory.
>>
>> One could envision that BIOS/UEFI could expose the HDM in EFI memory map
>> as conventional memory as well as in ACPI SRAT/SLIT/HMAT. However, at least
>> on some architectures (arm64) EFI conventional memory available at kernel boot
>> memory cannot be offlined, so this may not be suitable on all architectures.
> 
> That seems an odd restriction. Add David, linux-mm, and linux-acpi as
> they might be interested / have comments on this restriction as well.
> 

I am missing some important details.

a) What happens after offlining? Will the memory be remove_memory()'ed? 
Will the device get physically unplugged?

b) What's the general purpose of the memory and its intended usage when 
*not* exposed as system RAM? What's the main point of treating it like 
ordinary system RAM as default?

Also, can you be sure that you can offline that memory? If it's 
ZONE_NORMAL (as usually all system RAM in the initial map), there are no 
such guarantees, especially once the system ran for long enough, but 
also in other cases (e.g., shuffling), or if allocation policies change 
in the future.

So I *guess* you would already have to use kernel cmdline hacks like 
"movablecore" to make it work. In that case, you can directly specify 
what you *actually* want (which I am not sure yet I completely 
understood) - e.g., something like "memmap=16G!16G" ... or something 
similar.

I consider offlining+removing *boot* memory without physically unplugging it
(e.g., a DIMM getting unplugged) to be abusing the memory hotunplug
infrastructure. It's a different thing when manually adding memory like
dax_kmem does via add_memory_driver_managed().


Now, back to your original question: arm64 does not support physically 
unplugging DIMMs that were part of the initial map. If you'd reboot 
after unplugging a DIMM, your system would crash. We prevent that by
disallowing offlining of boot memory - we could also try to handle it in
ACPI code. But again, most uses of offlining+removing boot memory are 
abusing the memory hotunplug infrastructure and should rather be solved 
cleanly via a different mechanism (firmware, kernel cmdline, ...).

Just recently discussed in

https://lkml.kernel.org/r/de8388df2fbc5a6a33aab95831ba7db4@codeaurora.org

>> Further, the device driver associated with the type 2 device/accelerator may
>> want to save off a chunk of HDM for driver private use.
>> So it seems the more appropriate model may be something like dev dax model
>> where the device driver probe/open calls add_memory_driver_managed, and
>> the driver could choose how much of the HDM it wants to reserve and how
>> much to make generally available for application mmap/malloc.
> 
> Sure, it can always be driver managed. The trick will be getting the
> platform firmware to agree to not map it by default, but I suspect
> you'll have a hard time convincing platform-firmware to take that
> stance. The BIOS does not know, and should not care what OS is booting
> when it produces the memory map. So I think CXL memory unplug after
> the fact is more realistic than trying to get the BIOS not to map it.
> So, to me it looks like arm64 needs to reconsider its unplug stance.

My personal opinion is, if memory isn't just "ordinary system RAM", then 
let the system know early that memory is special (as we do with 
soft-reserved).

Ideally, you could configure the firmware (e.g., via BIOS setup) on what 
to do, that's the cleanest solution, but I can understand that's rather 
hard to achieve.

-- 
Thanks,

David / dhildenb



* Re: Onlining CXL Type2 device coherent memory
  2020-10-31 10:21   ` David Hildenbrand
@ 2020-10-31 16:51     ` Dan Williams
  2020-11-02  9:51       ` David Hildenbrand
  2020-11-02 18:34       ` Jonathan Cameron
  0 siblings, 2 replies; 17+ messages in thread
From: Dan Williams @ 2020-10-31 16:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vikram Sethi, linux-cxl, Natu, Mahesh, Rudoff, Andy, Jeff Smith,
	Mark Hairgrove, jglisse, Linux MM, Linux ACPI, Anshuman Khandual

On Sat, Oct 31, 2020 at 3:21 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 30.10.20 21:37, Dan Williams wrote:
> > On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi <vsethi@nvidia.com> wrote:
> >>
> >> Hello,
> >>
> >> I wanted to kick off a discussion on how Linux onlining of CXL [1] type 2 device
> >> Coherent memory aka Host managed device memory (HDM) will work for type 2 CXL
> >> devices which are available/plugged in at boot. A type 2 CXL device can be simply
> >> thought of as an accelerator with coherent device memory, that also has a
> >> CXL.cache to cache system memory.
> >>
> >> One could envision that BIOS/UEFI could expose the HDM in EFI memory map
> >> as conventional memory as well as in ACPI SRAT/SLIT/HMAT. However, at least
> >> on some architectures (arm64) EFI conventional memory available at kernel boot
> >> memory cannot be offlined, so this may not be suitable on all architectures.
> >
> > That seems an odd restriction. Add David, linux-mm, and linux-acpi as
> > they might be interested / have comments on this restriction as well.
> >
>
> I am missing some important details.
>
> a) What happens after offlining? Will the memory be remove_memory()'ed?
> Will the device get physically unplugged?
>
> b) What's the general purpose of the memory and its intended usage when
> *not* exposed as system RAM? What's the main point of treating it like
> ordinary system RAM as default?
>
> Also, can you be sure that you can offline that memory? If it's
> ZONE_NORMAL (as usually all system RAM in the initial map), there are no
> such guarantees, especially once the system ran for long enough, but
> also in other cases (e.g., shuffling), or if allocation policies change
> in the future.
>
> So I *guess* you would already have to use kernel cmdline hacks like
> "movablecore" to make it work. In that case, you can directly specify
> what you *actually* want (which I am not sure yet I completely
> understood) - e.g., something like "memmap=16G!16G" ... or something
> similar.
>
> I consider offlining+removing *boot* memory to not physically unplug it
> (e.g., a DIMM getting unplugged) abusing the memory hotunplug
> infrastructure. It's a different thing when manually adding memory like
> dax_kmem does via add_memory_driver_managed().
>
>
> Now, back to your original question: arm64 does not support physically
> unplugging DIMMs that were part of the initial map. If you'd reboot
> after unplugging a DIMM, your system would crash. We achieve that by
> disallowing to offline boot memory - we could also try to handle it in
> ACPI code. But again, most uses of offlining+removing boot memory are
> abusing the memory hotunplug infrastructure and should rather be solved
> cleanly via a different mechanism (firmware, kernel cmdline, ...).
>
> Just recently discussed in
>
> https://lkml.kernel.org/r/de8388df2fbc5a6a33aab95831ba7db4@codeaurora.org
>
> >> Further, the device driver associated with the type 2 device/accelerator may
> >> want to save off a chunk of HDM for driver private use.
> >> So it seems the more appropriate model may be something like dev dax model
> >> where the device driver probe/open calls add_memory_driver_managed, and
> >> the driver could choose how much of the HDM it wants to reserve and how
> >> much to make generally available for application mmap/malloc.
> >
> > Sure, it can always be driver managed. The trick will be getting the
> > platform firmware to agree to not map it by default, but I suspect
> > you'll have a hard time convincing platform-firmware to take that
> > stance. The BIOS does not know, and should not care what OS is booting
> > when it produces the memory map. So I think CXL memory unplug after
> > the fact is more realistic than trying to get the BIOS not to map it.
> > So, to me it looks like arm64 needs to reconsider its unplug stance.
>
> My personal opinion is, if memory isn't just "ordinary system RAM", then
> let the system know early that memory is special (as we do with
> soft-reserved).
>
> Ideally, you could configure the firmware (e.g., via BIOS setup) on what
> to do, that's the cleanest solution, but I can understand that's rather
> hard to achieve.

Yes, my hope, which is about the most influence I can have on
platform-firmware implementations, is that it marks CXL attached
memory as soft-reserved by default and allows OS policy to decide where it
goes. Barring that, for the configuration that Vikram mentioned, the
only other way to get this differentiated / not-ordinary system-ram
back to being driver managed would be to unplug it. The soft-reserved
path is cleaner.
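
To make the soft-reserved idea concrete, here is a rough userspace C sketch of
the classification an OS could apply to EFI memory map entries. The descriptor
struct is a simplified stand-in and the names classify() and enum mem_policy
are hypothetical. The two constants are the UEFI values for
EfiConventionalMemory and the EFI_MEMORY_SP ("specific purpose") attribute.

```c
#include <stdint.h>

#define EFI_CONVENTIONAL_MEMORY 7                     /* EfiConventionalMemory */
#define EFI_MEMORY_SP           0x0000000000040000ULL /* specific-purpose memory */

/* Simplified stand-in for an EFI memory descriptor. */
struct efi_mem_desc {
	uint32_t type;
	uint64_t phys_start;
	uint64_t num_pages;
	uint64_t attribute;
};

/* Hypothetical policy outcome for one memory map entry. */
enum mem_policy {
	MEM_SYSTEM_RAM,    /* handed to the page allocator at boot */
	MEM_SOFT_RESERVED, /* left for a driver / OS policy to claim later */
	MEM_OTHER,         /* not conventional memory at all */
};

/*
 * Conventional memory that firmware tagged as specific-purpose is kept out
 * of the boot-time allocator, so a driver can later decide how much of it to
 * online (for example as driver-managed memory) and how much to keep private.
 */
static enum mem_policy classify(const struct efi_mem_desc *md)
{
	if (md->type != EFI_CONVENTIONAL_MEMORY)
		return MEM_OTHER;
	if (md->attribute & EFI_MEMORY_SP)
		return MEM_SOFT_RESERVED;
	return MEM_SYSTEM_RAM;
}
```

The point of the sketch is that the decision is made before the memory ever
becomes ordinary boot RAM, so nothing has to be offlined afterwards.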


* Re: Onlining CXL Type2 device coherent memory
  2020-10-31 16:51     ` Dan Williams
@ 2020-11-02  9:51       ` David Hildenbrand
  2020-11-02 16:17         ` Vikram Sethi
  2020-11-02 18:34       ` Jonathan Cameron
  1 sibling, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2020-11-02  9:51 UTC (permalink / raw)
  To: Dan Williams
  Cc: Vikram Sethi, linux-cxl, Natu, Mahesh, Rudoff, Andy, Jeff Smith,
	Mark Hairgrove, jglisse, Linux MM, Linux ACPI, Anshuman Khandual

On 31.10.20 17:51, Dan Williams wrote:
> On Sat, Oct 31, 2020 at 3:21 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 30.10.20 21:37, Dan Williams wrote:
>>> On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi <vsethi@nvidia.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I wanted to kick off a discussion on how Linux onlining of CXL [1] type 2 device
>>>> Coherent memory aka Host managed device memory (HDM) will work for type 2 CXL
>>>> devices which are available/plugged in at boot. A type 2 CXL device can be simply
>>>> thought of as an accelerator with coherent device memory, that also has a
>>>> CXL.cache to cache system memory.
>>>>
>>>> One could envision that BIOS/UEFI could expose the HDM in EFI memory map
>>>> as conventional memory as well as in ACPI SRAT/SLIT/HMAT. However, at least
>>>> on some architectures (arm64) EFI conventional memory available at kernel boot
>>>> memory cannot be offlined, so this may not be suitable on all architectures.
>>>
>>> That seems an odd restriction. Add David, linux-mm, and linux-acpi as
>>> they might be interested / have comments on this restriction as well.
>>>
>>
>> I am missing some important details.
>>
>> a) What happens after offlining? Will the memory be remove_memory()'ed?
>> Will the device get physically unplugged?
>>
>> b) What's the general purpose of the memory and its intended usage when
>> *not* exposed as system RAM? What's the main point of treating it like
>> ordinary system RAM as default?
>>
>> Also, can you be sure that you can offline that memory? If it's
>> ZONE_NORMAL (as usually all system RAM in the initial map), there are no
>> such guarantees, especially once the system ran for long enough, but
>> also in other cases (e.g., shuffling), or if allocation policies change
>> in the future.
>>
>> So I *guess* you would already have to use kernel cmdline hacks like
>> "movablecore" to make it work. In that case, you can directly specify
>> what you *actually* want (which I am not sure yet I completely
>> understood) - e.g., something like "memmap=16G!16G" ... or something
>> similar.
>>
>> I consider offlining+removing *boot* memory to not physically unplug it
>> (e.g., a DIMM getting unplugged) abusing the memory hotunplug
>> infrastructure. It's a different thing when manually adding memory like
>> dax_kmem does via add_memory_driver_managed().
>>
>>
>> Now, back to your original question: arm64 does not support physically
>> unplugging DIMMs that were part of the initial map. If you'd reboot
>> after unplugging a DIMM, your system would crash. We achieve that by
>> disallowing to offline boot memory - we could also try to handle it in
>> ACPI code. But again, most uses of offlining+removing boot memory are
>> abusing the memory hotunplug infrastructure and should rather be solved
>> cleanly via a different mechanism (firmware, kernel cmdline, ...).
>>
>> Just recently discussed in
>>
>> https://lkml.kernel.org/r/de8388df2fbc5a6a33aab95831ba7db4@codeaurora.org
>>
>>>> Further, the device driver associated with the type 2 device/accelerator may
>>>> want to save off a chunk of HDM for driver private use.
>>>> So it seems the more appropriate model may be something like dev dax model
>>>> where the device driver probe/open calls add_memory_driver_managed, and
>>>> the driver could choose how much of the HDM it wants to reserve and how
>>>> much to make generally available for application mmap/malloc.
>>>
>>> Sure, it can always be driver managed. The trick will be getting the
>>> platform firmware to agree to not map it by default, but I suspect
>>> you'll have a hard time convincing platform-firmware to take that
>>> stance. The BIOS does not know, and should not care what OS is booting
>>> when it produces the memory map. So I think CXL memory unplug after
>>> the fact is more realistic than trying to get the BIOS not to map it.
>>> So, to me it looks like arm64 needs to reconsider its unplug stance.
>>
>> My personal opinion is, if memory isn't just "ordinary system RAM", then
>> let the system know early that memory is special (as we do with
>> soft-reserved).
>>
>> Ideally, you could configure the firmware (e.g., via BIOS setup) on what
>> to do, that's the cleanest solution, but I can understand that's rather
>> hard to achieve.
> 
> Yes, my hope, which is about the most influence I can have on
> platform-firmware implementations, is that it marks CXL attached
> memory as soft-reserved by default and allow OS policy decide where it
> goes. Barring that, for the configuration that Vikram mentioned, the
> only other way to get this differentiated / not-ordinary system-ram
> back to being driver managed would be to unplug it. The soft-reserved
> path is cleaner.

If we already need kernel cmdline parameters (movablecore), we can 
handle this differently via the cmdline. That sets expectations for 
people implementing the firmware - we shouldn't make their life too easy 
with such decisions.

The paragraph started with

"One could envision that BIOS/UEFI could expose the HDM in EFI memory 
map ..." Let's not envision it, but instead suggest people to not do it ;)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Onlining CXL Type2 device coherent memory
  2020-11-02  9:51       ` David Hildenbrand
@ 2020-11-02 16:17         ` Vikram Sethi
  2020-11-02 17:53           ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Vikram Sethi @ 2020-11-02 16:17 UTC (permalink / raw)
  To: David Hildenbrand, Dan Williams
  Cc: linux-cxl, Natu, Mahesh, Rudoff, Andy, Jeff Smith,
	Mark Hairgrove, jglisse, Linux MM, Linux ACPI, Anshuman Khandual,
	alex.williamson, Samer El-Haj-Mahmoud, Shanker Donthineni

Hi David,
> From: David Hildenbrand <david@redhat.com>
> On 31.10.20 17:51, Dan Williams wrote:
> > On Sat, Oct 31, 2020 at 3:21 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 30.10.20 21:37, Dan Williams wrote:
> >>> On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi <vsethi@nvidia.com> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> I wanted to kick off a discussion on how Linux onlining of CXL [1] type 2
> device
> >>>> Coherent memory aka Host managed device memory (HDM) will work for
> type 2 CXL
> >>>> devices which are available/plugged in at boot. A type 2 CXL device can be
> simply
> >>>> thought of as an accelerator with coherent device memory, that also has a
> >>>> CXL.cache to cache system memory.
> >>>>
> >>>> One could envision that BIOS/UEFI could expose the HDM in EFI memory map
> >>>> as conventional memory as well as in ACPI SRAT/SLIT/HMAT. However, at
> least
> >>>> on some architectures (arm64) EFI conventional memory available at kernel
> boot
> >>>> memory cannot be offlined, so this may not be suitable on all architectures.
> >>>
> >>> That seems an odd restriction. Add David, linux-mm, and linux-acpi as
> >>> they might be interested / have comments on this restriction as well.
> >>>
> >>
> >> I am missing some important details.
> >>
> >> a) What happens after offlining? Will the memory be remove_memory()'ed?
> >> Will the device get physically unplugged?
> >>
Not always IMO. If the device was getting reset, the HDM memory is going to be
unavailable while device is reset. Offlining the memory around the reset would
be sufficient, but depending if driver had done the add_memory in probe, 
it perhaps would be onerous to have to remove_memory as well before reset, 
and then add it back after reset. I realize you’re saying such a procedure
would be abusing hotplug framework, and we could perhaps require that memory
be removed prior to reset, but not clear to me that it *must* be removed for 
correctness. 
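
For illustration, here is a minimal userspace sketch of "offline without
remove", assuming the HDM range is already exposed as memory blocks under
/sys/devices/system/memory and that the block size has been read from
block_size_bytes there. The helper names (memblock_index, memblock_state_path,
memblock_set_state, offline_range) are hypothetical, and this is not what a
kernel driver would do around a reset; it only shows that offlining is a
separate step from remove_memory().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Index of the memory block that covers a physical address. */
unsigned long memblock_index(uint64_t phys_addr, uint64_t block_size)
{
    assert(block_size != 0);
    return (unsigned long)(phys_addr / block_size);
}

/* Build the sysfs "state" path for a memory block; -1 if buf is too small. */
int memblock_state_path(char *buf, size_t len, unsigned long index)
{
    int n = snprintf(buf, len,
                     "/sys/devices/system/memory/memory%lu/state", index);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "online" or "offline" to one block's state file (needs root). */
int memblock_set_state(unsigned long index, const char *state)
{
    char path[128];
    FILE *f;
    int rc = 0;

    if (memblock_state_path(path, sizeof(path), index) != 0)
        return -1;
    f = fopen(path, "w");
    if (!f)
        return -1;
    if (fputs(state, f) == EOF)
        rc = -1;
    if (fclose(f) == EOF)   /* the kernel's verdict may only show up here */
        rc = -1;
    return rc;
}

/* Offline every block in [start, start + size); stop at the first failure. */
int offline_range(uint64_t start, uint64_t size, uint64_t block_size)
{
    unsigned long first, last, i;

    assert(size != 0);
    first = memblock_index(start, block_size);
    last = memblock_index(start + size - 1, block_size);
    for (i = first; i <= last; i++)
        if (memblock_set_state(i, "offline") != 0)
            return -1;
    return 0;
}
```

Offlining can fail for any block that holds unmovable allocations, so a caller
would have to handle a partially offlined range.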

Another usecase of offlining without removing HDM could be around 
Virtualization/passing entire device with its memory to a VM. If device was
being used in the host kernel, and is then unbound, and bound to vfio-pci
(vfio-cxl?), would we expect vfio-pci to add_memory_driver_managed?
IMO the coherent device memory should be onlined in the host, for example, to 
handle memory_failure flows and passing on to userspace/the VM when poison is 
consumed by the VM on load to "bad" HDM. I realize it *could* be done with 
vfio adding+onlining the memory to host kernel, and perhaps makes sense if 
the device had never been used in the host kernel/bound to its "native" 
device driver to begin with. Alex?

> >> b) What's the general purpose of the memory and its intended usage when
> >> *not* exposed as system RAM? What's the main point of treating it like
> >> ordinary system RAM as default?
> >>
> >> Also, can you be sure that you can offline that memory? If it's
> >> ZONE_NORMAL (as usually all system RAM in the initial map), there are no
> >> such guarantees, especially once the system ran for long enough, but
> >> also in other cases (e.g., shuffling), or if allocation policies change
> >> in the future.
> >>
> >> So I *guess* you would already have to use kernel cmdline hacks like
> >> "movablecore" to make it work. In that case, you can directly specify
> >> what you *actually* want (which I am not sure yet I completely
> >> understood) - e.g., something like "memmap=16G!16G" ... or something
> >> similar.
> >>
> >> I consider offlining+removing *boot* memory to not physically unplug it
> >> (e.g., a DIMM getting unplugged) abusing the memory hotunplug
> >> infrastructure. It's a different thing when manually adding memory like
> >> dax_kmem does via add_memory_driver_managed().
> >>
> >>
> >> Now, back to your original question: arm64 does not support physically
> >> unplugging DIMMs that were part of the initial map. If you'd reboot
> >> after unplugging a DIMM, your system would crash. We achieve that by
> >> disallowing to offline boot memory - we could also try to handle it in
> >> ACPI code. But again, most uses of offlining+removing boot memory are
> >> abusing the memory hotunplug infrastructure and should rather be solved
> >> cleanly via a different mechanism (firmware, kernel cmdline, ...).
> >>
> >> Just recently discussed in
> >>
> >>
> https://lkml.kernel.org/r/de8388df2fbc5a6a33aab95831ba7db4@codeaurora.org
> >>
> >>>> Further, the device driver associated with the type 2 device/accelerator may
> >>>> want to save off a chunk of HDM for driver private use.
> >>>> So it seems the more appropriate model may be something like dev dax
> model
> >>>> where the device driver probe/open calls add_memory_driver_managed,
> and
> >>>> the driver could choose how much of the HDM it wants to reserve and how
> >>>> much to make generally available for application mmap/malloc.
> >>>
> >>> Sure, it can always be driver managed. The trick will be getting the
> >>> platform firmware to agree to not map it by default, but I suspect
> >>> you'll have a hard time convincing platform-firmware to take that
> >>> stance. The BIOS does not know, and should not care what OS is booting
> >>> when it produces the memory map. So I think CXL memory unplug after
> >>> the fact is more realistic than trying to get the BIOS not to map it.
> >>> So, to me it looks like arm64 needs to reconsider its unplug stance.
> >>
> >> My personal opinion is, if memory isn't just "ordinary system RAM", then
> >> let the system know early that memory is special (as we do with
> >> soft-reserved).
> >>
> >> Ideally, you could configure the firmware (e.g., via BIOS setup) on what
> >> to do, that's the cleanest solution, but I can understand that's rather
> >> hard to achieve.
> >
> > Yes, my hope, which is about the most influence I can have on
> > platform-firmware implementations, is that it marks CXL attached
> > memory as soft-reserved by default and allow OS policy decide where it
> > goes. Barring that, for the configuration that Vikram mentioned, the
> > only other way to get this differentiated / not-ordinary system-ram
> > back to being driver managed would be to unplug it. The soft-reserved
> > path is cleaner.
> 
> If we already need kernel cmdline parameters (movablecore), we can
> handle this differently via the cmdline. That sets expectations for
> people implementing the firmware - we shouldn't make their life too easy
> with such decisions.
> 
> The paragraph started with
> 
> "One could envision that BIOS/UEFI could expose the HDM in EFI memory
> map ..." Let's not envision it, but instead suggest people to not do it ;)
> 

Sounds good to me! Mahesh, let’s line this topic up for discussion in a CXL
UEFI/ACPI subteam meeting, and find a way to add an ECR implementation note
to the spec stating that UEFI/BIOS should NOT expose HDM in the EFI memory map.

Thanks,
Vikram

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Onlining CXL Type2 device coherent memory
  2020-10-30 22:39   ` Vikram Sethi
@ 2020-11-02 17:47     ` Dan Williams
  0 siblings, 0 replies; 17+ messages in thread
From: Dan Williams @ 2020-11-02 17:47 UTC (permalink / raw)
  To: Vikram Sethi
  Cc: linux-cxl, Natu, Mahesh, Rudoff, Andy, Jeff Smith,
	Mark Hairgrove, jglisse, David Hildenbrand, Linux MM, Linux ACPI,
	will, anshuman.khandual, catalin.marinas, Ard Biesheuvel,
	Dave Hansen

On Fri, Oct 30, 2020 at 3:40 PM Vikram Sethi <vsethi@nvidia.com> wrote:
>
> Hi Dan,
> > From: Dan Williams <dan.j.williams@intel.com>
> > On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi <vsethi@nvidia.com> wrote:
> > >
> > > Hello,
> > >
> > > I wanted to kick off a discussion on how Linux onlining of CXL [1] type 2 device
> > > Coherent memory aka Host managed device memory (HDM) will work for type 2
> > CXL
> > > devices which are available/plugged in at boot. A type 2 CXL device can be
> > simply
> > > thought of as an accelerator with coherent device memory, that also has a
> > > CXL.cache to cache system memory.
> > >
> > > One could envision that BIOS/UEFI could expose the HDM in EFI memory map
> > > as conventional memory as well as in ACPI SRAT/SLIT/HMAT. However, at least
> > > on some architectures (arm64) EFI conventional memory available at kernel boot
> > > memory cannot be offlined, so this may not be suitable on all architectures.
> >
> > That seems an odd restriction. Add David, linux-mm, and linux-acpi as
> > they might be interested / have comments on this restriction as well.
> >
> > > Further, the device driver associated with the type 2 device/accelerator may
> > > want to save off a chunk of HDM for driver private use.
> > > So it seems the more appropriate model may be something like dev dax model
> > > where the device driver probe/open calls add_memory_driver_managed, and
> > > the driver could choose how much of the HDM it wants to reserve and how
> > > much to make generally available for application mmap/malloc.
> >
> > Sure, it can always be driver managed. The trick will be getting the
> > platform firmware to agree to not map it by default, but I suspect
> > you'll have a hard time convincing platform-firmware to take that
> > stance. The BIOS does not know, and should not care what OS is booting
> > when it produces the memory map. So I think CXL memory unplug after
> > the fact is more realistic than trying to get the BIOS not to map it.
> > So, to me it looks like arm64 needs to reconsider its unplug stance.
>
> Agree. Cc Anshuman, Will, Catalin, Ard, in case I missed something in
> Anshuman's patches adding arm64 memory remove, or if any plans to remove
> the limitation.
>
> > > Another thing to think about is whether the kernel relies on UEFI having fully
> > > described NUMA proximity domains and end-end NUMA distances for HDM,
> > > or whether the kernel will provide some infrastructure to make use of the
> > > device-local affinity information provided by the device in the Coherent Device
> > > Attribute Table (CDAT) via a mailbox, and use that to add a new NUMA node ID
> > > for the HDM, and with the NUMA distances calculated by adding to the NUMA
> > > distance of the host bridge/Root port with the device local distance. At least
> > > that's how I think CDAT is supposed to work when kernel doesn't want to rely
> > > on BIOS tables.
> >
> > The kernel can supplement the NUMA configuration from CDAT, but not if
> > the memory is already enumerated in the EFI Memory Map and ACPI
> > SRAT/HMAT. At that point CDAT is a nop because the BIOS has precluded
> > the OS from consuming it.
>
> That makes sense.
>
> > > A similar question on NUMA node ID and distances for HDM arises for CXL
> > hotplug.
> > > Will the kernel rely on CDAT, and create its own NUMA node ID and patch up
> > > distances, or will it rely on BIOS providing PXM domain reserved at boot in
> > > SRAT to be used later on hotplug?
> >
> > I don't expect the kernel to merge any CDAT data into the ACPI tables.
> > Instead the kernel will optionally use CDAT as an alternative method
> > to generate Linux NUMA topology independent of ACPI SRAT. Think of it
> > like Linux supporting both ACPI and Open Firmware NUMA descriptions at
> > the same time. CDAT is its own NUMA description domain unless BIOS has
> > blurred the lines and pre-incorporated it into SRAT/HMAT. That said I
> > think the CXL attached memory not described by EFI / ACPI is currently
> > the NULL set.
>
> What I meant by patch/merge was if on a dual socket system with distance 40
> between the sockets (not getting into HMAT vs SLIT description of latency),
> if you hotplugged in a CXL type2/3 device whose CDAT says device local 'distance'
> is 80, then the kernel is still merging that 80 in with the 40 to the remote socket
> to say 120 from remote socket CPU to this socket's CXL device i.e whether the
> 40 came from SLIT or HMAT, it is still merged into the data kernel had obtained
> from ACPI. I think you're saying the same thing in a different way:
> that the device local part is not being merged with anything ACPI provided for
> the device, example _SLI at time of hotplug (which I agree with).

Thankfully CDAT abandons the broken and gamed system of distance
values (i.e. firmware sometimes reverse engineering OS behavior) in
favor of nominal performance values like HMAT. With that in hand I
think it simplifies the kernel's responsibility to worry less about
"distance" values and instead identify whether the memory range is
"Linux-local" or "Linux-remote" and where to order it in the
allocation fallback lists.

As Dave implemented in his "migrate in lieu of discard" series [1],
find_next_best_node() establishes this ordering for memory tiering, so
the rough plan is to teach each CXL supporting arch how to incorporate
CDAT data into its find_next_best_node() implementation.

[1]: https://lore.kernel.org/linux-mm/20201007161736.ACC6E387@viggo.jf.intel.com/
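
To make the ordering idea concrete, here is a standalone sketch that reuses
the numbers from the quoted example (40 between sockets, 80 device-local,
120 end to end). It is not the kernel's find_next_best_node(); struct
mem_node, compose_cost and build_fallback_order are hypothetical names, and
the simple additive cost is only the model from the quoted example.

```c
#include <stddef.h>
#include <stdlib.h>

/* One candidate memory node as seen from a given CPU. */
struct mem_node {
    int id;             /* NUMA node ID */
    unsigned int cost;  /* access cost from the CPU; lower is better */
};

/* End-to-end cost: cost to reach the host bridge plus device-local cost. */
unsigned int compose_cost(unsigned int host_cost, unsigned int device_local_cost)
{
    return host_cost + device_local_cost;
}

/* Order by ascending cost; break ties by node ID for a stable result. */
static int cmp_cost(const void *a, const void *b)
{
    const struct mem_node *x = a;
    const struct mem_node *y = b;

    if (x->cost != y->cost)
        return x->cost < y->cost ? -1 : 1;
    return (x->id > y->id) - (x->id < y->id);
}

/* Sort the candidates in place into allocation fallback order. */
void build_fallback_order(struct mem_node *nodes, size_t n)
{
    qsort(nodes, n, sizeof(*nodes), cmp_cost);
}
```

With performance values taken from CDAT/HMAT instead of SLIT-style distances,
only the cost inputs would change; the ordering step stays the same.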

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Onlining CXL Type2 device coherent memory
  2020-11-02 16:17         ` Vikram Sethi
@ 2020-11-02 17:53           ` David Hildenbrand
  2020-11-02 18:03             ` Dan Williams
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2020-11-02 17:53 UTC (permalink / raw)
  To: Vikram Sethi, Dan Williams
  Cc: linux-cxl, Natu, Mahesh, Rudoff, Andy, Jeff Smith,
	Mark Hairgrove, jglisse, Linux MM, Linux ACPI, Anshuman Khandual,
	alex.williamson, Samer El-Haj-Mahmoud, Shanker Donthineni

On 02.11.20 17:17, Vikram Sethi wrote:
> Hi David,
>> From: David Hildenbrand <david@redhat.com>
>> On 31.10.20 17:51, Dan Williams wrote:
>>> On Sat, Oct 31, 2020 at 3:21 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 30.10.20 21:37, Dan Williams wrote:
>>>>> On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi <vsethi@nvidia.com> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I wanted to kick off a discussion on how Linux onlining of CXL [1] type 2
>> device
>>>>>> Coherent memory aka Host managed device memory (HDM) will work for
>> type 2 CXL
>>>>>> devices which are available/plugged in at boot. A type 2 CXL device can be
>> simply
>>>>>> thought of as an accelerator with coherent device memory, that also has a
>>>>>> CXL.cache to cache system memory.
>>>>>>
>>>>>> One could envision that BIOS/UEFI could expose the HDM in EFI memory map
>>>>>> as conventional memory as well as in ACPI SRAT/SLIT/HMAT. However, at
>> least
>>>>>> on some architectures (arm64) EFI conventional memory available at kernel
>> boot
>>>>>> memory cannot be offlined, so this may not be suitable on all architectures.
>>>>>
>>>>> That seems an odd restriction. Add David, linux-mm, and linux-acpi as
>>>>> they might be interested / have comments on this restriction as well.
>>>>>
>>>>
>>>> I am missing some important details.
>>>>
>>>> a) What happens after offlining? Will the memory be remove_memory()'ed?
>>>> Will the device get physically unplugged?
>>>>
> Not always IMO. If the device was getting reset, the HDM memory is going to be
> unavailable while device is reset. Offlining the memory around the reset would

Ouch, that speaks IMHO completely against exposing it as System RAM as 
default.

> be sufficient, but depending if driver had done the add_memory in probe,
> it perhaps would be onerous to have to remove_memory as well before reset,
> and then add it back after reset. I realize you’re saying such a procedure
> would be abusing hotplug framework, and we could perhaps require that memory
> be removed prior to reset, but not clear to me that it *must* be removed for
> correctness.
> 
> Another usecase of offlining without removing HDM could be around
> Virtualization/passing entire device with its memory to a VM. If device was
> being used in the host kernel, and is then unbound, and bound to vfio-pci
> (vfio-cxl?), would we expect vfio-pci to add_memory_driver_managed?

At least for passing through memory to VMs (via KVM), you don't actually 
need struct pages / memory exposed to the buddy via 
add_memory_driver_managed(). Actually, doing that sounds like the wrong 
approach.

E.g., you would "allocate" the memory via devdax/dax_hmat and directly 
map the resulting device into guest address space. At least that's what 
some people are doing with

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Onlining CXL Type2 device coherent memory
  2020-11-02 17:53           ` David Hildenbrand
@ 2020-11-02 18:03             ` Dan Williams
  2020-11-02 19:25               ` Vikram Sethi
  0 siblings, 1 reply; 17+ messages in thread
From: Dan Williams @ 2020-11-02 18:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vikram Sethi, linux-cxl, Natu, Mahesh, Rudoff, Andy, Jeff Smith,
	Mark Hairgrove, jglisse, Linux MM, Linux ACPI, Anshuman Khandual,
	alex.williamson, Samer El-Haj-Mahmoud, Shanker Donthineni,
	Joao Martins

On Mon, Nov 2, 2020 at 9:53 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 02.11.20 17:17, Vikram Sethi wrote:
> > Hi David,
> >> From: David Hildenbrand <david@redhat.com>
> >> On 31.10.20 17:51, Dan Williams wrote:
> >>> On Sat, Oct 31, 2020 at 3:21 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 30.10.20 21:37, Dan Williams wrote:
> >>>>> On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi <vsethi@nvidia.com> wrote:
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> I wanted to kick off a discussion on how Linux onlining of CXL [1] type 2
> >> device
> >>>>>> Coherent memory aka Host managed device memory (HDM) will work for
> >> type 2 CXL
> >>>>>> devices which are available/plugged in at boot. A type 2 CXL device can be
> >> simply
> >>>>>> thought of as an accelerator with coherent device memory, that also has a
> >>>>>> CXL.cache to cache system memory.
> >>>>>>
> >>>>>> One could envision that BIOS/UEFI could expose the HDM in EFI memory map
> >>>>>> as conventional memory as well as in ACPI SRAT/SLIT/HMAT. However, at
> >> least
> >>>>>> on some architectures (arm64) EFI conventional memory available at kernel
> >> boot
> >>>>>> memory cannot be offlined, so this may not be suitable on all architectures.
> >>>>>
> >>>>> That seems an odd restriction. Add David, linux-mm, and linux-acpi as
> >>>>> they might be interested / have comments on this restriction as well.
> >>>>>
> >>>>
> >>>> I am missing some important details.
> >>>>
> >>>> a) What happens after offlining? Will the memory be remove_memory()'ed?
> >>>> Will the device get physically unplugged?
> >>>>
> > Not always IMO. If the device was getting reset, the HDM memory is going to be
> > unavailable while device is reset. Offlining the memory around the reset would
>
> Ouch, that speaks IMHO completely against exposing it as System RAM as
> default.
>
> > be sufficient, but depending if driver had done the add_memory in probe,
> > it perhaps would be onerous to have to remove_memory as well before reset,
> > and then add it back after reset. I realize you’re saying such a procedure
> > would be abusing hotplug framework, and we could perhaps require that memory
> > be removed prior to reset, but not clear to me that it *must* be removed for
> > correctness.
> >
> > Another usecase of offlining without removing HDM could be around
> > Virtualization/passing entire device with its memory to a VM. If device was
> > being used in the host kernel, and is then unbound, and bound to vfio-pci
> > (vfio-cxl?), would we expect vfio-pci to add_memory_driver_managed?
>
> At least for passing through memory to VMs (via KVM), you don't actually
> need struct pages / memory exposed to the buddy via
> add_memory_driver_managed(). Actually, doing that sounds like the wrong
> approach.
>
> E.g., you would "allocate" the memory via devdax/dax_hmat and directly
> map the resulting device into guest address space. At least that's what
> some people are doing with

...and Joao is working to see if the host kernel can skip allocating
'struct page' or do it on demand if the guest ever requests host
kernel services on its memory. Typically it does not, so the host 'struct
page' space for devdax memory ranges goes to waste.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Onlining CXL Type2 device coherent memory
  2020-10-31 16:51     ` Dan Williams
  2020-11-02  9:51       ` David Hildenbrand
@ 2020-11-02 18:34       ` Jonathan Cameron
  1 sibling, 0 replies; 17+ messages in thread
From: Jonathan Cameron @ 2020-11-02 18:34 UTC (permalink / raw)
  To: Dan Williams
  Cc: David Hildenbrand, Vikram Sethi, linux-cxl, Natu, Mahesh, Rudoff,
	Andy, Jeff Smith, Mark Hairgrove, jglisse, Linux MM, Linux ACPI,
	Anshuman Khandual

On Sat, 31 Oct 2020 09:51:23 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> On Sat, Oct 31, 2020 at 3:21 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 30.10.20 21:37, Dan Williams wrote:  
> > > On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi <vsethi@nvidia.com> wrote:  
> > >>
> > >> Hello,
> > >>
> > >> I wanted to kick off a discussion on how Linux onlining of CXL [1] type 2 device
> > >> Coherent memory aka Host managed device memory (HDM) will work for type 2 CXL
> > >> devices which are available/plugged in at boot. A type 2 CXL device can be simply
> > >> thought of as an accelerator with coherent device memory, that also has a
> > >> CXL.cache to cache system memory.
> > >>
> > >> One could envision that BIOS/UEFI could expose the HDM in EFI memory map
> > >> as conventional memory as well as in ACPI SRAT/SLIT/HMAT. However, at least
> > >> on some architectures (arm64) EFI conventional memory available at kernel boot
> > >> memory cannot be offlined, so this may not be suitable on all architectures.  
> > >
> > > That seems an odd restriction. Add David, linux-mm, and linux-acpi as
> > > they might be interested / have comments on this restriction as well.
> > >  
> >
> > I am missing some important details.
> >
> > a) What happens after offlining? Will the memory be remove_memory()'ed?
> > Will the device get physically unplugged?
> >
> > b) What's the general purpose of the memory and its intended usage when
> > *not* exposed as system RAM? What's the main point of treating it like
> > ordinary system RAM as default?
> >
> > Also, can you be sure that you can offline that memory? If it's
> > ZONE_NORMAL (as usually all system RAM in the initial map), there are no
> > such guarantees, especially once the system ran for long enough, but
> > also in other cases (e.g., shuffling), or if allocation policies change
> > in the future.
> >
> > So I *guess* you would already have to use kernel cmdline hacks like
> > "movablecore" to make it work. In that case, you can directly specify
> > what you *actually* want (which I am not sure yet I completely
> > understood) - e.g., something like "memmap=16G!16G" ... or something
> > similar.
> >
> > I consider offlining+removing *boot* memory to not physically unplug it
> > (e.g., a DIMM getting unplugged) abusing the memory hotunplug
> > infrastructure. It's a different thing when manually adding memory like
> > dax_kmem does via add_memory_driver_managed().
> >
> >
> > Now, back to your original question: arm64 does not support physically
> > unplugging DIMMs that were part of the initial map. If you'd reboot
> > after unplugging a DIMM, your system would crash. We achieve that by
> > disallowing to offline boot memory - we could also try to handle it in
> > ACPI code. But again, most uses of offlining+removing boot memory are
> > abusing the memory hotunplug infrastructure and should rather be solved
> > cleanly via a different mechanism (firmware, kernel cmdline, ...).
> >
> > Just recently discussed in
> >
> > https://lkml.kernel.org/r/de8388df2fbc5a6a33aab95831ba7db4@codeaurora.org
> >  
> > >> Further, the device driver associated with the type 2 device/accelerator may
> > >> want to save off a chunk of HDM for driver private use.
> > >> So it seems the more appropriate model may be something like dev dax model
> > >> where the device driver probe/open calls add_memory_driver_managed, and
> > >> the driver could choose how much of the HDM it wants to reserve and how
> > >> much to make generally available for application mmap/malloc.  
> > >
> > > Sure, it can always be driver managed. The trick will be getting the
> > > platform firmware to agree to not map it by default, but I suspect
> > > you'll have a hard time convincing platform-firmware to take that
> > > stance. The BIOS does not know, and should not care what OS is booting
> > > when it produces the memory map. So I think CXL memory unplug after
> > > the fact is more realistic than trying to get the BIOS not to map it.
> > > So, to me it looks like arm64 needs to reconsider its unplug stance.  
> >
> > My personal opinion is, if memory isn't just "ordinary system RAM", then
> > let the system know early that memory is special (as we do with
> > soft-reserved).
> >
> > Ideally, you could configure the firmware (e.g., via BIOS setup) on what
> > to do, that's the cleanest solution, but I can understand that's rather
> > hard to achieve.  
> 
> Yes, my hope, which is about the most influence I can have on
> platform-firmware implementations, is that it marks CXL attached
> memory as soft-reserved by default and allow OS policy decide where it
> goes. Barring that, for the configuration that Vikram mentioned, the
> only other way to get this differentiated / not-ordinary system-ram
> back to being driver managed would be to unplug it. The soft-reserved
> path is cleaner.
> 

The whole reason soft-reserved (SPM) was introduced into UEFI in the first
place was to handle this case.  SPM is still in the EFI_MEMORY_MAP but should
not be used for non-movable general purpose allocations.  It was intended for
RAM that you 'could' use for general purposes if it wasn't being used for what
it was put there for (i.e. your GPU or similar isn't using it).

So agreed, soft-reserved / SPM.  
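
As a rough illustration of what "soft-reserved" means at the memory map
level, the sketch below flags conventional memory that carries the
EFI_MEMORY_SP attribute. The type and attribute values are the UEFI ones;
struct efi_mem_desc is a simplified, hypothetical stand-in for the real
descriptor layout, and is_soft_reserved is not a kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFI_CONVENTIONAL_MEMORY 7u                     /* EfiConventionalMemory */
#define EFI_MEMORY_WB           0x0000000000000008ULL  /* write-back cacheable */
#define EFI_MEMORY_SP           0x0000000000040000ULL  /* specific-purpose memory */

/* Simplified descriptor: only the fields this sketch needs. */
struct efi_mem_desc {
    uint32_t type;
    uint64_t phys_addr;
    uint64_t num_pages;
    uint64_t attribute;
};

/*
 * Conventional memory tagged EFI_MEMORY_SP is usable RAM that firmware asks
 * the OS to keep out of the general purpose pool by default.
 */
bool is_soft_reserved(const struct efi_mem_desc *md)
{
    return md->type == EFI_CONVENTIONAL_MEMORY &&
           (md->attribute & EFI_MEMORY_SP) != 0;
}
```

A range that passes this check stays out of the page allocator until OS
policy (for example a driver) decides where it goes.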

Jonathan




^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Onlining CXL Type2 device coherent memory
  2020-11-02 18:03             ` Dan Williams
@ 2020-11-02 19:25               ` Vikram Sethi
  2020-11-02 19:45                 ` Dan Williams
  2020-11-03  3:56                 ` Alistair Popple
  0 siblings, 2 replies; 17+ messages in thread
From: Vikram Sethi @ 2020-11-02 19:25 UTC (permalink / raw)
  To: Dan Williams, David Hildenbrand
  Cc: linux-cxl, Natu, Mahesh, Rudoff, Andy, Jeff Smith,
	Mark Hairgrove, jglisse, Linux MM, Linux ACPI, Anshuman Khandual,
	alex.williamson, Samer El-Haj-Mahmoud, Shanker Donthineni,
	Joao Martins



> From: Dan Williams <dan.j.williams@intel.com>
> On Mon, Nov 2, 2020 at 9:53 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 02.11.20 17:17, Vikram Sethi wrote:
> > > Hi David,
> > >> From: David Hildenbrand <david@redhat.com>
> > >> On 31.10.20 17:51, Dan Williams wrote:
> > >>> On Sat, Oct 31, 2020 at 3:21 AM David Hildenbrand <david@redhat.com>
> wrote:
> > >>>>
> > >>>> On 30.10.20 21:37, Dan Williams wrote:
> > >>>>> On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi <vsethi@nvidia.com>
> wrote:
> > >>>>>>
> > >>>>>> Hello,
> > >>>>>>
> > >>>>>> I wanted to kick off a discussion on how Linux onlining of CXL [1] type 2
> > >> device
> > >>>>>> Coherent memory aka Host managed device memory (HDM) will work
> for
> > >> type 2 CXL
> > >>>>>> devices which are available/plugged in at boot. A type 2 CXL device can
> be
> > >> simply
> > >>>>>> thought of as an accelerator with coherent device memory, that also has
> a
> > >>>>>> CXL.cache to cache system memory.
> > >>>>>>
> > >>>>>> One could envision that BIOS/UEFI could expose the HDM in EFI memory
> map
> > >>>>>> as conventional memory as well as in ACPI SRAT/SLIT/HMAT. However,
> at
> > >> least
> > >>>>>> on some architectures (arm64) EFI conventional memory available at
> kernel
> > >> boot
> > >>>>>> memory cannot be offlined, so this may not be suitable on all
> architectures.
> > >>>>>
> > >>>>> That seems an odd restriction. Add David, linux-mm, and linux-acpi as
> > >>>>> they might be interested / have comments on this restriction as well.
> > >>>>>
> > >>>>
> > >>>> I am missing some important details.
> > >>>>
> > >>>> a) What happens after offlining? Will the memory be
> remove_memory()'ed?
> > >>>> Will the device get physically unplugged?
> > >>>>
> > > Not always IMO. If the device was getting reset, the HDM memory is going to
> be
> > > unavailable while device is reset. Offlining the memory around the reset would
> >
> > Ouch, that speaks IMHO completely against exposing it as System RAM as
> > default.
> >
I should have clarified that memory becomes unavailable on the new "CXL Reset" in CXL 2.0.
FLR does not make device memory unavailable, but there could be devices that
implement CXL Reset but not FLR, as FLR is optional.

> > > be sufficient, but depending if driver had done the add_memory in probe,
> > > it perhaps would be onerous to have to remove_memory as well before reset,
> > > and then add it back after reset. I realize you’re saying such a procedure
> > > would be abusing hotplug framework, and we could perhaps require that
> memory
> > > be removed prior to reset, but not clear to me that it *must* be removed for
> > > correctness.
> > >
> > > Another usecase of offlining without removing HDM could be around
> > > Virtualization/passing entire device with its memory to a VM. If device was
> > > being used in the host kernel, and is then unbound, and bound to vfio-pci
> > > (vfio-cxl?), would we expect vfio-pci to add_memory_driver_managed?
> >
> > At least for passing through memory to VMs (via KVM), you don't actually
> > need struct pages / memory exposed to the buddy via
> > add_memory_driver_managed(). Actually, doing that sounds like the wrong
> > approach.
> >
> > E.g., you would "allocate" the memory via devdax/dax_hmat and directly
> > map the resulting device into guest address space. At least that's what
> > some people are doing with

How does memory_failure forwarding to guest work in that case?
IIUC it doesn't without a struct page in the host. 
For normal memory, when the VM consumes poison, the host kernel signals
userspace with SIGBUS and an si_code that says Action Required, which
QEMU injects into the guest.
IBM had done something like you suggest with coherent GPU memory and IIUC
memory_failure forwarding to guest VM does not work there.

kernel https://lkml.org/lkml/2018/12/20/103 
QEMU: https://patchwork.kernel.org/patch/10831455/
I would think we *do want* memory errors to be sent to a VM. 
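
For reference, a minimal sketch of the userspace side of that flow on Linux:
a SIGBUS handler that checks si_code for BUS_MCEERR_AR (action required) or
BUS_MCEERR_AO (action optional) and records the faulting address. The names
classify_sigbus, install_sigbus_handler and the poison_action enum are
hypothetical; this is not QEMU's code, and the injection into the guest is
left out.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>

enum poison_action {
    POISON_NONE,            /* not a memory-failure SIGBUS */
    POISON_ACTION_REQUIRED, /* poison was consumed; must be handled now */
    POISON_ACTION_OPTIONAL  /* poison was found but not yet consumed */
};

/* Map a siginfo_t onto the memory-failure cases described above. */
enum poison_action classify_sigbus(const siginfo_t *info)
{
    if (info->si_signo != SIGBUS)
        return POISON_NONE;
    switch (info->si_code) {
    case BUS_MCEERR_AR:
        return POISON_ACTION_REQUIRED;
    case BUS_MCEERR_AO:
        return POISON_ACTION_OPTIONAL;
    default:
        return POISON_NONE;
    }
}

/* Last event seen by the handler, for the main loop to pick up. */
static volatile sig_atomic_t last_action = POISON_NONE;
static void *volatile last_addr;

static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    last_action = classify_sigbus(info);
    last_addr = info->si_addr;  /* host virtual address of the poisoned page */
}

/* Install the handler with SA_SIGINFO so that si_code and si_addr are set. */
int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

A VMM would then translate last_addr to a guest physical address and raise
the matching error in the guest.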
> 
> ...and Joao is working to see if the host kernel can skip allocating
> 'struct page' or do it on demand if the guest ever requests host
> kernel services on its memory. Typically it does not so host 'struct
> page' space for devdax memory ranges goes wasted.
Is memory_failure forwarded to and handled by the guest?


* Re: Onlining CXL Type2 device coherent memory
  2020-11-02 19:25               ` Vikram Sethi
@ 2020-11-02 19:45                 ` Dan Williams
  2020-11-03  3:56                 ` Alistair Popple
  1 sibling, 0 replies; 17+ messages in thread
From: Dan Williams @ 2020-11-02 19:45 UTC (permalink / raw)
  To: Vikram Sethi
  Cc: David Hildenbrand, linux-cxl, Natu, Mahesh, Rudoff, Andy,
	Jeff Smith, Mark Hairgrove, jglisse, Linux MM, Linux ACPI,
	Anshuman Khandual, alex.williamson, Samer El-Haj-Mahmoud,
	Shanker Donthineni, Joao Martins

On Mon, Nov 2, 2020 at 11:25 AM Vikram Sethi <vsethi@nvidia.com> wrote:
[..]
> > > At least for passing through memory to VMs (via KVM), you don't actually
> > > need struct pages / memory exposed to the buddy via
> > > add_memory_driver_managed(). Actually, doing that sounds like the wrong
> > > approach.
> > >
> > > E.g., you would "allocate" the memory via devdax/dax_hmat and directly
> > > map the resulting device into guest address space. At least that's what
> > > some people are doing with
>
> How does memory_failure forwarding to guest work in that case?
> IIUC it doesn't without a struct page in the host.
> For normal memory, when VM consumes poison, host kernel signals
> Userspace with SIGBUS and si-code that says Action Required, which
> QEMU injects to the guest.
> IBM had done something like you suggest with coherent GPU memory and IIUC
> memory_failure forwarding to guest VM does not work there.
>
> kernel https://lkml.org/lkml/2018/12/20/103
> QEMU: https://patchwork.kernel.org/patch/10831455/
> I would think we *do want* memory errors to be sent to a VM.
> >
> > ...and Joao is working to see if the host kernel can skip allocating
> > 'struct page' or do it on demand if the guest ever requests host
> > kernel services on its memory. Typically it does not so host 'struct
> > page' space for devdax memory ranges goes wasted.
> Is memory_failure forwarded to and handled by guest?

This dovetails with one of the DAX enabling backlog items to remove
dependencies on page->mapping and page->index for the memory-failure
path because that also gets in the way of reflink. For devdax it's
easy to drop the page->mapping dependency. For fsdax we still need
something to redirect the lookup into the proper filesystem code.

Certainly memory-failure support will not regress; it just means we're
stuck with 'struct page' in this path in the meantime.


* Re: Onlining CXL Type2 device coherent memory
  2020-11-02 19:25               ` Vikram Sethi
  2020-11-02 19:45                 ` Dan Williams
@ 2020-11-03  3:56                 ` Alistair Popple
  1 sibling, 0 replies; 17+ messages in thread
From: Alistair Popple @ 2020-11-03  3:56 UTC (permalink / raw)
  To: Vikram Sethi
  Cc: Dan Williams, David Hildenbrand, linux-cxl, Natu, Mahesh, Rudoff,
	Andy, Jeff Smith, Mark Hairgrove, jglisse, Linux MM, Linux ACPI,
	Anshuman Khandual, alex.williamson, Samer El-Haj-Mahmoud,
	Shanker Donthineni, Joao Martins

On Tuesday, 3 November 2020 6:25:23 AM AEDT Vikram Sethi wrote:
> > > > be sufficient, but depending if driver had done the add_memory in probe,
> > > > it perhaps would be onerous to have to remove_memory as well before reset,
> > > > and then add it back after reset. I realize you’re saying such a procedure
> > > > would be abusing hotplug framework, and we could perhaps require that memory
> > > > be removed prior to reset, but not clear to me that it *must* be removed for
> > > > correctness.

I'm not sure exactly what you meant by "unavailable", but on some platforms
(e.g. PowerPC) it must be removed for correctness if hardware access to the
memory is going away for any period of time. remove_memory() is what makes it
safe to physically remove the memory, as it triggers things like cache
flushing. Without this, PPC would see memory failure machine checks if it ever
tried to write back any dirty cache lines to the now inaccessible memory.
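
For reference, the step that has to succeed before remove_memory() can run is
offlining each memory block, and that step is also reachable from userspace
through sysfs. Below is a rough sketch, assuming the standard
/sys/devices/system/memory/memoryN/state interface; the helper names are
hypothetical. It only offlines a block. It does not replace the in-kernel
remove_memory() call described above, and it makes no claim about what cache
maintenance a given architecture performs at offline time.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Memory block id = physical start address / block size. The block size is
 * exported (in hex) by /sys/devices/system/memory/block_size_bytes. */
uint64_t memory_block_id(uint64_t phys_addr, uint64_t block_size)
{
    return phys_addr / block_size;
}

/* Build the sysfs path of a block's state file; returns 0 on success and
 * -1 if the buffer is too small. */
int memory_block_state_path(char *buf, size_t len, uint64_t id)
{
    int n = snprintf(buf, len, "/sys/devices/system/memory/memory%llu/state",
                     (unsigned long long)id);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Ask the kernel to offline one memory block. Needs root, and fails if the
 * block holds pages that cannot be migrated away. */
int offline_memory_block(uint64_t id)
{
    static const char cmd[] = "offline";
    char path[64];
    ssize_t written;
    int fd;

    if (memory_block_state_path(path, sizeof(path), id) != 0)
        return -1;

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    written = write(fd, cmd, sizeof(cmd) - 1);
    close(fd);
    return written == (ssize_t)(sizeof(cmd) - 1) ? 0 : -1;
}
```

A driver-managed HDM range would span several such blocks, so a caller would
loop over every block id in the range and stop the reset if any of them
refuses to go offline.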

> > > > Another usecase of offlining without removing HDM could be around
> > > > Virtualization/passing entire device with its memory to a VM. If device was
> > > > being used in the host kernel, and is then unbound, and bound to vfio-pci
> > > > (vfio-cxl?), would we expect vfio-pci to add_memory_driver_managed?
> > >
> > > At least for passing through memory to VMs (via KVM), you don't actually
> > > need struct pages / memory exposed to the buddy via
> > > add_memory_driver_managed(). Actually, doing that sounds like the wrong
> > > approach.
> > >
> > > E.g., you would "allocate" the memory via devdax/dax_hmat and directly
> > > map the resulting device into guest address space. At least that's what
> > > some people are doing with
> 
> How does memory_failure forwarding to guest work in that case?
> IIUC it doesn't without a struct page in the host. 
> For normal memory, when VM consumes poison, host kernel signals
> Userspace with SIGBUS and si-code that says Action Required, which 
> QEMU injects to the guest.
> IBM had done something like you suggest with coherent GPU memory and IIUC
> memory_failure forwarding to guest VM does not work there.
> 
> kernel https://lkml.org/lkml/2018/12/20/103 
> QEMU: https://patchwork.kernel.org/patch/10831455/

The above patches simply allow the coherent GPU physical memory ranges to get
mapped into a guest VM in a similar way to an MMIO range (i.e. without a struct
page in the host). So you are correct in that they do not deal with forwarding
failures to a guest VM.

Any GPU memory failure on PPC would currently get sent to the host in the same
way as a normal system memory failure (i.e. machine check). So in theory
notification to a guest would work the same as a normal system memory failure.
I say in theory because when I last looked at this some time back, a guest
kernel on PPC was not notified of memory errors.

 - Alistair

> I would think we *do want* memory errors to be sent to a VM. 
>
> > 
> > ...and Joao is working to see if the host kernel can skip allocating
> > 'struct page' or do it on demand if the guest ever requests host
> > kernel services on its memory. Typically it does not so host 'struct
> > page' space for devdax memory ranges goes wasted.
> Is memory_failure forwarded to and handled by guest?
> 

Thread overview: 17+ messages
2020-10-28 23:05 Onlining CXL Type2 device coherent memory Vikram Sethi
2020-10-29 14:50 ` Ben Widawsky
2020-10-30 20:37 ` Dan Williams
2020-10-30 20:59   ` Matthew Wilcox
2020-10-30 23:38     ` Dan Williams
2020-10-30 22:39   ` Vikram Sethi
2020-11-02 17:47     ` Dan Williams
2020-10-31 10:21   ` David Hildenbrand
2020-10-31 16:51     ` Dan Williams
2020-11-02  9:51       ` David Hildenbrand
2020-11-02 16:17         ` Vikram Sethi
2020-11-02 17:53           ` David Hildenbrand
2020-11-02 18:03             ` Dan Williams
2020-11-02 19:25               ` Vikram Sethi
2020-11-02 19:45                 ` Dan Williams
2020-11-03  3:56                 ` Alistair Popple
2020-11-02 18:34       ` Jonathan Cameron
