* Re: CXL 1.1 Support Plan
  2021-08-11  0:18 CXL 1.1 Support Plan johnny
@ 2021-08-10 15:21 ` Dan Williams
  2021-09-13 14:46   ` John Groves
  0 siblings, 1 reply; 12+ messages in thread
From: Dan Williams @ 2021-08-10 15:21 UTC (permalink / raw)
  To: johnny; +Cc: linux-cxl, Jonathan Cameron, Ben Widawsky

On Tue, Aug 10, 2021 at 5:24 AM johnny <johnny.li@montage-tech.com> wrote:
>
> Hello,
>
> Current CXL patches [1] focus on CXL 2.0 support. In the real world, however,
> CXL 1.1-capable hosts and EPs are the first CXL components end users can use.
> May I ask about the community's plan for CXL 1.1 support?

The expectation is that CXL 1.1 is BIOS / platform-firmware supported.
Just like the OS does not have a driver for DDR and relies on the BIOS
to describe DDR resources via generic ACPI tables, the same
expectation holds for CXL 1.1. CXL 2.0 explicitly adds features that
exceed what platform-firmware can support with static ACPI tables.

Something is broken if the OS requires a driver for CXL 1.1 device
operation, at least for the generic memory expander use case.

Ben did have some improvements to lspci to dump the range registers of
a CXL 1.1 device:

https://github.com/pciutils/pciutils/pull/59

...but it does not seem the pciutils project has accepted that work.


* CXL 1.1 Support Plan
@ 2021-08-11  0:18 johnny
  2021-08-10 15:21 ` Dan Williams
  0 siblings, 1 reply; 12+ messages in thread
From: johnny @ 2021-08-11  0:18 UTC (permalink / raw)
  To: linux-cxl; +Cc: Dan Williams, Jonathan.Cameron, ben.widawsky

Hello,

Current CXL patches [1] focus on CXL 2.0 support. In the real world, however,
CXL 1.1-capable hosts and EPs are the first CXL components end users can use.
May I ask about the community's plan for CXL 1.1 support?
  - CXL 1.1 host + CXL 1.1 EP
  - CXL 1.1 host + CXL 2.0 EP

[1] https://lwn.net/Articles/846061/


Thanks
Johnny




* Re: CXL 1.1 Support Plan
  2021-08-10 15:21 ` Dan Williams
@ 2021-09-13 14:46   ` John Groves
  2021-09-13 19:08     ` Dan Williams
  0 siblings, 1 reply; 12+ messages in thread
From: John Groves @ 2021-09-13 14:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: johnny, linux-cxl, Jonathan Cameron, Ben Widawsky, John Groves

On Tue, Aug 10, 2021 at 8:21 AM Dan Williams <Dan.J.Willaims@intel.com> wrote:
> On Tue, Aug 10, 2021 at 5:24 AM johnny <johnny.li@montage-tech.com> wrote:
> >
> Hello,
> >
> > Current CXL patches [1] focus on CXL 2.0 support. In the real world, however,
> > CXL 1.1-capable hosts and EPs are the first CXL components end users can use.
> > May I ask about the community's plan for CXL 1.1 support?
> 
> The expectation is that CXL 1.1 is BIOS / platform-firmware supported.
> Just like the OS does not have a driver for DDR and relies on the BIOS
> to describe DDR resources via generic ACPI tables, the same
> expectation holds for CXL 1.1. CXL 2.0 explicitly adds features that
> exceed what platform-firmware can support with static ACPI tables.
> 
> Something is broken if the OS requires a driver for CXL 1.1 device
> operation, at least for the generic memory expander use case.
> 
> Ben did have some improvements to lspci to dump the range registers of
> a CXL 1.1 device:
> 
> https://github.com/pciutils/pciutils/pull/59
> 
> ...but it does not seem the pciutils project has accepted that work.

To probe a bit further here... On a system with a CXL 1.1 memory device,
we see that the BIOS maps it, but even though there is an SRAT entry
describing a new NUMA node containing the CXL 1.1 memory, the memory
shows up as "soft reserved" in the e820 memory map, and we observe
that Linux (5.13 at the moment) does not use it.

We can map the memory via /dev/mem (with CONFIG_STRICT_DEVMEM off), so
we are able to hack tests & benchmarks to use the memory - but clearly
this is not the correct or "intended" way to use CXL 1.1 memory.

We could certainly add memmap=size!offset to the kernel command line to
get it treated like a simulated nvdimm - which is great if 1) the memory
is non-volatile, and 2) it's reliably always at the same address (which
it appears to be, but that feels like a bit of a squirrely assumption).
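
For reference, the format is memmap=<size>!<start>, so the kernel command
line entry would look something like this (size and base address purely
illustrative):

memmap=32G!0x880000000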

How are people thinking we'll map and use CXL 1.1 pmem? And what about
CXL 1.1 DRAM? Is the current Linux behavior considered adequate
(ignoring the memory unless you use memmap or efi_fake_mem)?

...or am I missing something?

Thanks,
John





* Re: CXL 1.1 Support Plan
  2021-09-13 14:46   ` John Groves
@ 2021-09-13 19:08     ` Dan Williams
       [not found]       ` <dd56f465-560e-9408-d7e6-9f8de59bc792@jagalactic.com>
  0 siblings, 1 reply; 12+ messages in thread
From: Dan Williams @ 2021-09-13 19:08 UTC (permalink / raw)
  To: john; +Cc: johnny, linux-cxl, Jonathan Cameron, Ben Widawsky, John Groves

[-- Attachment #1: Type: text/plain, Size: 3840 bytes --]

On Mon, Sep 13, 2021 at 7:46 AM John Groves <john@jagalactic.com> wrote:
>
> On Tue, Aug 10, 2021 at 8:21 AM Dan Williams <Dan.J.Willaims@intel.com> wrote:
> > On Tue, Aug 10, 2021 at 5:24 AM johnny <johnny.li@montage-tech.com> wrote:
> > >
> > Hello,
> > >
> > > Current CXL patches [1] focus on CXL 2.0 support. In the real world, however,
> > > CXL 1.1-capable hosts and EPs are the first CXL components end users can use.
> > > May I ask about the community's plan for CXL 1.1 support?
> >
> > The expectation is that CXL 1.1 is BIOS / platform-firmware supported.
> > Just like the OS does not have a driver for DDR and relies on the BIOS
> > to describe DDR resources via generic ACPI tables, the same
> > expectation holds for CXL 1.1. CXL 2.0 explicitly adds features that
> > exceed what platform-firmware can support with static ACPI tables.
> >
> > Something is broken if the OS requires a driver for CXL 1.1 device
> > operation, at least for the generic memory expander use case.
> >
> > Ben did have some improvements to lspci to dump the range registers of
> > a CXL 1.1 device:
> >
> > https://github.com/pciutils/pciutils/pull/59
> >
> > ...but it does not seem the pciutils project has accepted that work.
>
> To probe a bit further here... On a system with a cxl 1.1 memory device,
> we see that bios maps it, but even though there is an SRAT entry
> describing a new NUMA node containing the cxl 1.1 memory, the memory
> shows up as "soft reserved" in the e820 memory map. And we observe
> that Linux (5.13 ATM) does not use it.
>

Looks like I need to improve the documentation here, please review the
attached patch and let me know if it answers your concerns.

In short, Linux *does* use it. Soft-reserve memory is accessed through
a dedicated device-file interface called device-dax by default. You
should be able to run:

cat /proc/iomem | grep dax

...and see your address range. I.e. something like:

# cat /proc/iomem | grep dax
    340000000-43fffffff : dax0.0

...and then /dev/dax0.0 can be directly accessed via mmap() (subject
to device-dax's strict mapping alignment restriction (default 2MB)).
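
You can also cross-check the size and alignment the kernel assigned with the
daxctl utility from the ndctl project (assuming it is installed), e.g.:

# daxctl list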


> We can map the memory via /dev/mem (with CONFIG_STRICT_DEVMEM off), so
> we are able to hack tests & benchmarks to use the memory - but clearly
> this is not the correct or "intended" way to use cxl 1.1 memory.

Yeah, /dev/mem is superseded by device-dax for dedicated access and is
not recommended for CXL memory access.

>
> We could certainly add memmap=size!offset to the kernel command line to
> get it treated like a simulated nvdimm - which is great if 1) the memory
> is non-volatile, and 2) it's reliably always at the same address (which
> it appears to be, but that feels like a bit of a squirrely assumption).

Correct, the BIOS is within its rights to move the location of that
memory from one boot to the next. This is unlikely in typical
situations, but hardware removal or failure can cause the address map
to change across reboots.

However, as mentioned in the attached patch, you can simply ask Linux
to ignore the BIOS soft-reservation request (efi=nosoftreserve) if you
want all of that memory to be handed to the OS page allocator as
typical memory. Alternatively, if you only want a subset of the memory
to be given to the OS and the rest to remain available for dedicated
access via /dev/dax, then you can subdivide the dax range into multiple
devices and use 'daxctl reconfigure-device' to assign that slice of
memory to the OS page allocator.
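
For example, reconfiguring the whole (undivided) device hands all of that
range to the page allocator (device name assumed from the example above):

# daxctl reconfigure-device --mode=system-ram dax0.0

...while booting with efi=nosoftreserve skips the soft-reservation handling
entirely.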

>
> How are people thinking we'll map and use cxl 1.1 pmem?  And what about
> cxl 1.1 dram? Is the current Linux behavior considered adequate
> (ignoring the memory unless you use memmap, or efi_fake_mem)?
>
> ...or am I missing something?

Hopefully the attached document clears that up a bit, but let me know
what else is missing in the description.

[-- Attachment #2: 0001-daxctl-Add-Soft-Reservation-theory-of-operation.patch --]
[-- Type: text/x-patch, Size: 7977 bytes --]

From 2bcc9bd2e5751060977f2bc1ea8bb4d74ec3f26d Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams@intel.com>
Date: Wed, 26 May 2021 12:21:12 -0700
Subject: [PATCH] daxctl: Add "Soft Reservation" theory of operation

As systems are starting to ship memory with the EFI "Special Purpose"
attribute that Linux optionally turns into "Soft Reserved" ranges, one of
the first questions is "where is my special memory, and how do I
access it". Add some documentation to explain the default behaviour of
"Soft Reserved".

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 .../daxctl/daxctl-reconfigure-device.txt      | 127 ++++++++++++------
 1 file changed, 88 insertions(+), 39 deletions(-)

diff --git a/Documentation/daxctl/daxctl-reconfigure-device.txt b/Documentation/daxctl/daxctl-reconfigure-device.txt
index f112b3c6120f..132684c193f0 100644
--- a/Documentation/daxctl/daxctl-reconfigure-device.txt
+++ b/Documentation/daxctl/daxctl-reconfigure-device.txt
@@ -12,6 +12,94 @@ SYNOPSIS
 [verse]
 'daxctl reconfigure-device' <dax0.0> [<dax1.0>...<daxY.Z>] [<options>]
 
+DESCRIPTION
+-----------
+
+Reconfigure the operational mode of a dax device. This can be used to convert
+a regular 'devdax' mode device to the 'system-ram' mode which arranges for the
+dax range to be hot-plugged into the system as regular memory.
+
+NOTE: This is a destructive operation. Any data on the dax device *will* be
+lost.
+
+NOTE: Device reconfiguration depends on the dax-bus device model. See
+linkdaxctl:daxctl-migrate-device-model[1] for more information. If dax-class is
+in use (via the dax_pmem_compat driver), the reconfiguration will fail with an
+error such as the following:
+----
+# daxctl reconfigure-device --mode=system-ram --region=0 all
+libdaxctl: daxctl_dev_disable: dax3.0: error: device model is dax-class
+dax3.0: disable failed: Operation not supported
+error reconfiguring devices: Operation not supported
+reconfigured 0 devices
+----
+
+'daxctl-reconfigure-device' nominally expects that it will online new memory
+blocks as 'movable', so that kernel data doesn't make it into this memory.
+However, there are other potential agents that may be configured to
+automatically online new hot-plugged memory as it appears. Most notably,
+these are the '/sys/devices/system/memory/auto_online_blocks' configuration,
+or system udev rules. If such an agent races to online memory sections, daxctl
+checks if the blocks were onlined as 'movable' memory. If this was not the
+case, and the memory blocks are found to be in a different zone, then a
+warning is displayed. If it is desired that a different agent control the
+onlining of memory blocks, and the associated memory zone, then it is
+recommended to use the --no-online option described below. This will abridge
+the device reconfiguration operation to just hotplugging the memory, and
+refrain from then onlining it.
+
+In case daxctl detects that there is a kernel policy to auto-online blocks
+(via /sys/devices/system/memory/auto_online_blocks), then reconfiguring to
+system-ram will result in a failure. This can be overridden with '--force'.
+
+
+THEORY OF OPERATION
+-------------------
+The kernel device-dax subsystem surfaces character devices
+that provide DAX-access (direct mappings sans page-cache buffering) to a
+given memory region. The devices are named /dev/daxX.Y where X is a
+region-id and Y is an instance-id within that region. There are 2
+mechanisms that trigger device-dax instances to appear:
+
+1. Persistent Memory (PMEM) namespace configured in "devdax" mode. See
+"ndctl create-namspace --help" and
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dax/Kconfig[CONFIG_DEV_DAX_PMEM].
+In this case the device-dax instance is statically sized to its host
+memory region which is bounded to the physical address range of the host
+namespace.
+
+2. Soft Reserved memory enumerated by platform firmware. On EFI systems
+this is communicated via the so called EFI_MEMORY_SP "Special Purpose"
+attribute. See
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dax/Kconfig[CONFIG_DEV_DAX_HMEM].
+In this case the device-dax instance(s) associated with the given memory
+region can be resized and divided into multiple devices.
+
+In the Soft Reservation case the expectation for EFI + ACPI based
+platforms is that in addition to the EFI_MEMORY_SP attribute the
+firmware also creates distinct ACPI proximity domains for any address
+range that has different performance characteristics than default
+"System RAM". So, the SRAT will define the proximity domain, the SLIT
+communicates relative distance to other proximity domains, and the HMAT
+is populated with nominal read/write latency and read/write bandwidth
+data. That HMAT data is emitted to the kernel log on bootup, and also
+exported to sysfs. See
+https://www.kernel.org/doc/html/latest/admin-guide/mm/numaperf.html[NUMAPERF],
+for the runtime representation of CPU to Memory node performance
+details.
+
+Outside of the NUMA performance details linked above the other method to
+detect the presence of "Soft Reserved" memory is to dump /proc/iomem and
+look for "Soft Reserved" ranges. If the kernel was not built with
+CONFIG_EFI_SOFTRESERVE, predates the introduction of
+CONFIG_EFI_SOFTRESERVE (v5.5), or was booted with the efi=nosoftreserve
+command line then device-dax will not attach and the expectation is that
+the memory shows up as a memory-only NUMA node. Otherwise the memory
+shows up as a device-dax instance and DAXCTL(1) can be used to
+optionally partition it and assign the memory back to the kernel as
+"System RAM", or the device can be mapped directly as the back end of a
+userspace memory allocator like https://pmem.io/vmem/libvmem/[LIBVMEM].
+
 EXAMPLES
 --------
 
@@ -83,45 +171,6 @@ reconfigured 1 device
 reconfigured 1 device
 ----
 
-DESCRIPTION
------------
-
-Reconfigure the operational mode of a dax device. This can be used to convert
-a regular 'devdax' mode device to the 'system-ram' mode which arranges for the
-dax range to be hot-plugged into the system as regular memory.
-
-NOTE: This is a destructive operation. Any data on the dax device *will* be
-lost.
-
-NOTE: Device reconfiguration depends on the dax-bus device model. See
-linkdaxctl:daxctl-migrate-device-model[1] for more information. If dax-class is
-in use (via the dax_pmem_compat driver), the reconfiguration will fail with an
-error such as the following:
-----
-# daxctl reconfigure-device --mode=system-ram --region=0 all
-libdaxctl: daxctl_dev_disable: dax3.0: error: device model is dax-class
-dax3.0: disable failed: Operation not supported
-error reconfiguring devices: Operation not supported
-reconfigured 0 devices
-----
-
-'daxctl-reconfigure-device' nominally expects that it will online new memory
-blocks as 'movable', so that kernel data doesn't make it into this memory.
-However, there are other potential agents that may be configured to
-automatically online new hot-plugged memory as it appears. Most notably,
-these are the '/sys/devices/system/memory/auto_online_blocks' configuration,
-or system udev rules. If such an agent races to online memory sections, daxctl
-checks if the blocks were onlined as 'movable' memory. If this was not the
-case, and the memory blocks are found to be in a different zone, then a
-warning is displayed. If it is desired that a different agent control the
-onlining of memory blocks, and the associated memory zone, then it is
-recommended to use the --no-online option described below. This will abridge
-the device reconfiguration operation to just hotplugging the memory, and
-refrain from then onlining it.
-
-In case daxctl detects that there is a kernel policy to auto-online blocks
-(via /sys/devices/system/memory/auto_online_blocks), then reconfiguring to
-system-ram will result in a failure. This can be overridden with '--force'.
 
 OPTIONS
 -------
-- 
2.31.1



* Re: CXL 1.1 Support Plan
       [not found]       ` <dd56f465-560e-9408-d7e6-9f8de59bc792@jagalactic.com>
@ 2021-09-14 19:55         ` John Groves
  2021-09-14 20:20           ` Dan Williams
  0 siblings, 1 reply; 12+ messages in thread
From: John Groves @ 2021-09-14 19:55 UTC (permalink / raw)
  To: Dan Williams
  Cc: johnny, linux-cxl, Jonathan Cameron, Ben Widawsky, John Groves

On 9/13/21 2:08 PM, Dan Williams wrote:
> On Mon, Sep 13, 2021 at 7:46 AM John Groves <john@jagalactic.com> wrote:
>> On Tue, Aug 10, 2021 at 8:21 AM Dan Williams <Dan.J.Willaims@intel.com> wrote:
>>> On Tue, Aug 10, 2021 at 5:24 AM johnny <johnny.li@montage-tech.com> wrote:
>>> Hello,
>>>> Current CXL patches [1] focus on CXL 2.0 support. In the real world, however,
>>>> CXL 1.1-capable hosts and EPs are the first CXL components end users can use.
>>>> May I ask about the community's plan for CXL 1.1 support?
>>> The expectation is that CXL 1.1 is BIOS / platform-firmware supported.
>>> Just like the OS does not have a driver for DDR and relies on the BIOS
>>> to describe DDR resources via generic ACPI tables, the same
>>> expectation holds for CXL 1.1. CXL 2.0 explicitly adds features that
>>> exceed what platform-firmware can support with static ACPI tables.
>>>
>>> Something is broken if the OS requires a driver for CXL 1.1 device
>>> operation, at least for the generic memory expander use case.
>>>
>>> Ben did have some improvements to lspci to dump the range registers of
>>> a CXL 1.1 device:
>>>
>>> https://github.com/pciutils/pciutils/pull/59
>>>
>>> ...but it does not seem the pciutils project has accepted that work.
>> To probe a bit further here... On a system with a cxl 1.1 memory device,
>> we see that bios maps it, but even though there is an SRAT entry
>> describing a new NUMA node containing the cxl 1.1 memory, the memory
>> shows up as "soft reserved" in the e820 memory map. And we observe
>> that Linux (5.13 ATM) does not use it.
>>
> Looks like I need to improve the documentation here, please review the
> attached patch and let me know if it answers your concerns.
>
> In short, Linux *does* use it. Soft-reserve memory is accessed through
> a dedicated device-file interface called device-dax by default. You
> should be able to run:
>
> cat /proc/iomem | grep dax
>
> ...and see your address range. I.e. something like:
>
> # cat /proc/iomem | grep dax
>     340000000-43fffffff : dax0.0
>
> ...and then /dev/dax0.0 can be directly accessed via mmap() (subject
> to device-dax's strict mapping alignment restriction (default 2MB)).

Thanks Dan! This plus the doc patch clears up a lot. I'm able to map
via /dev/dax0.0 now.

I built the cxl-2.0v3 branch of the ndctl repo (had to find source for
kmod, which isn't packaged for rhel8?!).

Thanks,
John

<snip>




* Re: CXL 1.1 Support Plan
  2021-09-14 19:55         ` John Groves
@ 2021-09-14 20:20           ` Dan Williams
       [not found]             ` <2093cae0-ff0b-1d6d-2ff2-ba1bb41510af@jagalactic.com>
  0 siblings, 1 reply; 12+ messages in thread
From: Dan Williams @ 2021-09-14 20:20 UTC (permalink / raw)
  To: John Groves
  Cc: johnny, linux-cxl, Jonathan Cameron, Ben Widawsky, John Groves

On Tue, Sep 14, 2021 at 12:56 PM John Groves <john@jagalactic.com> wrote:
>
> On 9/13/21 2:08 PM, Dan Williams wrote:
> > On Mon, Sep 13, 2021 at 7:46 AM John Groves <john@jagalactic.com> wrote:
> >> On Tue, Aug 10, 2021 at 8:21 AM Dan Williams <Dan.J.Willaims@intel.com> wrote:
> >>> On Tue, Aug 10, 2021 at 5:24 AM johnny <johnny.li@montage-tech.com> wrote:
> >>> Hello,
> >>>> Current CXL patches [1] focus on CXL 2.0 support. In the real world, however,
> >>>> CXL 1.1-capable hosts and EPs are the first CXL components end users can use.
> >>>> May I ask about the community's plan for CXL 1.1 support?
> >>> The expectation is that CXL 1.1 is BIOS / platform-firmware supported.
> >>> Just like the OS does not have a driver for DDR and relies on the BIOS
> >>> to describe DDR resources via generic ACPI tables, the same
> >>> expectation holds for CXL 1.1. CXL 2.0 explicitly adds features that
> >>> exceed what platform-firmware can support with static ACPI tables.
> >>>
> >>> Something is broken if the OS requires a driver for CXL 1.1 device
> >>> operation, at least for the generic memory expander use case.
> >>>
> >>> Ben did have some improvements to lspci to dump the range registers of
> >>> a CXL 1.1 device:
> >>>
> >>> https://github.com/pciutils/pciutils/pull/59
> >>>
> >>> ...but it does not seem the pciutils project has accepted that work.
> >> To probe a bit further here... On a system with a cxl 1.1 memory device,
> >> we see that bios maps it, but even though there is an SRAT entry
> >> describing a new NUMA node containing the cxl 1.1 memory, the memory
> >> shows up as "soft reserved" in the e820 memory map. And we observe
> >> that Linux (5.13 ATM) does not use it.
> >>
> > Looks like I need to improve the documentation here, please review the
> > attached patch and let me know if it answers your concerns.
> >
> > In short, Linux *does* use it. Soft-reserve memory is accessed through
> > a dedicated device-file interface called device-dax by default. You
> > should be able to run:
> >
> > cat /proc/iomem | grep dax
> >
> > ...and see your address range. I.e. something like:
> >
> > # cat /proc/iomem | grep dax
> >     340000000-43fffffff : dax0.0
> >
> > ...and then /dev/dax0.0 can be directly accessed via mmap() (subject
> > to device-dax's strict mapping alignment restriction (default 2MB)).
>
> Thanks Dan! This plus the doc patch clears up a lot. I'm able to map
> via /dev/dax0.0 now.

Great!

>
> I built the cxl-2.0v3 branch of the ndctl repo (had to find source for
> kmod, which isn't packaged for rhel8?!).
>

That doesn't sound right...

# mock -r centos-stream-8-x86_64 -n --dnf-cmd whatprovides kmod-devel
[..]
CentOS Stream 8 - Extras
  19 kB/s |  15 kB     00:00

kmod-devel-25-17.el8.x86_64 : Header files for kmod development
Repo        : powertools
Matched from:
Provide    : kmod-devel = 25-17.el8

kmod-devel-25-18.el8.x86_64 : Header files for kmod development
Repo        : powertools
Matched from:
Provide    : kmod-devel = 25-18.el8
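
So on a CentOS 8 / RHEL 8 box something like the following should pull it in
(repo name taken from the output above; the equivalent repo on RHEL proper may
be named differently):

# dnf install --enablerepo=powertools kmod-devel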


* Re: CXL 1.1 Support Plan
       [not found]             ` <2093cae0-ff0b-1d6d-2ff2-ba1bb41510af@jagalactic.com>
@ 2021-09-30 22:17               ` John Groves
  2021-10-07 21:30                 ` Verma, Vishal L
  0 siblings, 1 reply; 12+ messages in thread
From: John Groves @ 2021-09-30 22:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: johnny, linux-cxl, Jonathan Cameron, Ben Widawsky, John Groves

On 9/14/21 3:20 PM, Dan Williams wrote:
> On Tue, Sep 14, 2021 at 12:56 PM John Groves <john@jagalactic.com> wrote:
>> On 9/13/21 2:08 PM, Dan Williams wrote:
>>> On Mon, Sep 13, 2021 at 7:46 AM John Groves <john@jagalactic.com> wrote:
>>>> On Tue, Aug 10, 2021 at 8:21 AM Dan Williams <Dan.J.Willaims@intel.com> wrote:
>>>>> On Tue, Aug 10, 2021 at 5:24 AM johnny <johnny.li@montage-tech.com> wrote:
>>>>> Hello,
>>>>>> Current CXL patches [1] focus on CXL 2.0 support. In the real world, however,
>>>>>> CXL 1.1-capable hosts and EPs are the first CXL components end users can use.
>>>>>> May I ask about the community's plan for CXL 1.1 support?
>>>>> The expectation is that CXL 1.1 is BIOS / platform-firmware supported.
>>>>> Just like the OS does not have a driver for DDR and relies on the BIOS
>>>>> to describe DDR resources via generic ACPI tables, the same
>>>>> expectation holds for CXL 1.1. CXL 2.0 explicitly adds features that
>>>>> exceed what platform-firmware can support with static ACPI tables.
>>>>>
>>>>> Something is broken if the OS requires a driver for CXL 1.1 device
>>>>> operation, at least for the generic memory expander use case.
>>>>>
>>>>> Ben did have some improvements to lspci to dump the range registers of
>>>>> a CXL 1.1 device:
>>>>>
>>>>> https://github.com/pciutils/pciutils/pull/59
>>>>>
>>>>> ...but it does not seem the pciutils project has accepted that work.
>>>> To probe a bit further here... On a system with a cxl 1.1 memory device,
>>>> we see that bios maps it, but even though there is an SRAT entry
>>>> describing a new NUMA node containing the cxl 1.1 memory, the memory
>>>> shows up as "soft reserved" in the e820 memory map. And we observe
>>>> that Linux (5.13 ATM) does not use it.
>>>>
>>> Looks like I need to improve the documentation here, please review the
>>> attached patch and let me know if it answers your concerns.
>>>
>>> In short, Linux *does* use it. Soft-reserve memory is accessed through
>>> a dedicated device-file interface called device-dax by default. You
>>> should be able to run:
>>>
>>> cat /proc/iomem | grep dax
>>>
>>> ...and see your address range. I.e. something like:
>>>
>>> # cat /proc/iomem | grep dax
>>>     340000000-43fffffff : dax0.0
>>>
>>> ...and then /dev/dax0.0 can be directly accessed via mmap() (subject
>>> to device-dax's strict mapping alignment restriction (default 2MB)).
>> Thanks Dan! This plus the doc patch clears up a lot. I'm able to map
>> via /dev/dax0.0 now.
> Great!

I don't see your doc patch yet in the ndctl repo, but I recommend merging it there.

A related question: "daxctl reconfigure-device --mode=system-ram ..." works for me,
with --force, but going the other way (--mode=devdax) fails.  But a reboot puts it back
into devdax mode regardless of the pre-boot setting (i.e. --mode=system-ram reverts
back to devdax on reboot).

Is reconfigure-device capable of remaining in effect after a reboot? If so, I'm curious
how you persist the info if the device is volatile; I believe it's the LSA on a non-volatile
device.

Back to the patch: you mention --mode=system-ram, but it might be helpful to add
a mention of --mode=devdax, along with a mention of whether (and how) these
changes are persisted.

Again, thanks!


>
>> I built the cxl-2.0v3 branch of the ndctl repo (had to find source for
>> kmod, which isn't packaged for rhel8?!).
>>
> That doesn't sound right...
>
> # mock -r centos-stream-8-x86_64 -n --dnf-cmd whatprovides kmod-devel
> [..]
> CentOS Stream 8 - Extras
>   19 kB/s |  15 kB     00:00
>
> kmod-devel-25-17.el8.x86_64 : Header files for kmod development
> Repo        : powertools
> Matched from:
> Provide    : kmod-devel = 25-17.el8
>
> kmod-devel-25-18.el8.x86_64 : Header files for kmod development
> Repo        : powertools
> Matched from:
> Provide    : kmod-devel = 25-18.el8

I'm a failure at understanding the dis-aggregated repos in modern rhel.
I had found evidence that kmod-devel was packaged for cent8, but didn't
know how to ... bla bla.  It's a personal problem :D

Thanks,
John





* Re: CXL 1.1 Support Plan
  2021-09-30 22:17               ` John Groves
@ 2021-10-07 21:30                 ` Verma, Vishal L
       [not found]                   ` <c55b69bd-45e3-4dec-91af-02ca4eeb054a@jagalactic.com>
       [not found]                   ` <945a5996-961e-a159-2168-8da355b6d945@jagalactic.com>
  0 siblings, 2 replies; 12+ messages in thread
From: Verma, Vishal L @ 2021-10-07 21:30 UTC (permalink / raw)
  To: john, Williams, Dan J
  Cc: johnny.li, Widawsky, Ben, linux-cxl, Jonathan.Cameron, jgroves

On Thu, 2021-09-30 at 22:17 +0000, John Groves wrote:
> > > 
> I don't see your doc patch yet in the ndctl repo, but I recommend merging it there.
> 
> A related question: "daxctl reconfigure-device --mode=system-ram ..." works for me,
> with --force, but going the other way (--mode=devdax) fails.  But a reboot puts it back
> into devdax mode regardless of the pre-boot setting (i.e. --mode=system-ram reverts
> back to devdax on reboot).

Oh, I'm a bit confused. --force only applies when going from system-ram
to devdax -- it offlines the memory for you. Without force, you're
responsible for a prior 'daxctl offline-memory daxX.Y' step.

Going from devdax to system-ram should not need --force, and I don't
think force actually does anything there.
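
In other words, the expected sequence to go back to devdax is something like
(device name assumed):

# daxctl offline-memory dax0.0
# daxctl reconfigure-device --mode=devdax dax0.0

...or skip the offline step and pass --force to reconfigure-device, which
offlines the memory for you.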

> 
> Is reconfigure-device capable of remaining in effect after a reboot (in which case I'm curious
> how you persist the info if the device is volatile; I believe it's the LSA on a non-volatile
> device).

The beginnings of this automation are in progress, at least for
converting pmem devices:

  https://lore.kernel.org/nvdimm/20210831090459.2306727-1-vishal.l.verma@intel.com/

This is yet to be extended to volatile dax devices after figuring out
what property we can key off of - one idea is the 'target_node' from
ACPI.

> 
> Back to the patch: you mention --mode=system-ram, but it might be helpful to add
> a mention of --mode=devdax, along with a mention of whether (and how) these
> changes are persisted.

The rest of the man page talks a little bit about the modes:

  https://pmem.io/ndctl/daxctl-reconfigure-device.html

Dan - you haven't posted the patch you attached to the list yet, have
you? /me couldn't find it on lore - wasn't sure if I'd just missed it.




* Re: CXL 1.1 Support Plan
       [not found]                   ` <c55b69bd-45e3-4dec-91af-02ca4eeb054a@jagalactic.com>
@ 2021-10-08 18:06                     ` John Groves
       [not found]                       ` <dec7b376-71dd-1499-c46c-3fbc37fc2057@jagalactic.com>
  0 siblings, 1 reply; 12+ messages in thread
From: John Groves @ 2021-10-08 18:06 UTC (permalink / raw)
  To: Verma, Vishal L, Williams, Dan J
  Cc: johnny.li, Widawsky, Ben, linux-cxl, Jonathan.Cameron, jgroves

On 10/7/21 4:30 PM, Verma, Vishal L wrote:
> On Thu, 2021-09-30 at 22:17 +0000, John Groves wrote:
>> I don't see your doc patch yet in the ndctl repo, but I recommend merging it there.
>>
>> A related question: "daxctl reconfigure-device --mode=system-ram ..." works for me,
>> with --force, but going the other way (--mode=devdax) fails.  But a reboot puts it back
>> into devdax mode regardless of the pre-boot setting (i.e. --mode=system-ram reverts
>> back to devdax on reboot).
> Oh, I'm a bit confused. --force only applies when going from system-ram
> to devdax -- it offlines the memory for you. Without force, you're
> responsible for a prior 'daxctl offline-memory daxX.Y' step.
>
> Going from devdax to system-ram should not need --force, and I don't
> think force actually does anything there.

I definitely might be doing something wrong.  Here is a "typescript".

# grep dax /proc/iomem
  880000000-107fffffff : dax0.0

# ls -al /dev/dax0.0 crw------- 1 root root 252, 2 Sep 16 11:40 /dev/dax0.0

# numastat                  node0 numa_hit       5493270 numa_miss            0 numa_foreign         0 interleave_hit   14593 local_node     5493269 other_node           0

 

# daxctl reconfigure-device --mode=system-ram --region=0 dax0.0 dax0.0: error: kernel policy will auto-online memory, aborting error reconfiguring devices: Device or resource busy reconfigured 0 devices # daxctl reconfigure-device --mode=system-ram --region=0 --force dax0.0 dax0.0: WARNING: detected a race while onlining memory Some memory may not be in the expected zone. It is recommended to disable any other onlining mechanisms, and retry. If onlining is to be left to other agents, use the --no-online option to suppress this warning dax0.0: all memory sections (256) already online [   {     "chardev":"dax0.0",     "size":34359738368,     "target_node":1,     "align":2097152,     "mode":"system-ram",     "online_memblocks":256,     "total_memblocks":256,     "movable":false   } ] reconfigured 1 device # numastat                   node0 node1 numa_hit        8080981     0 numa_miss             0     0 numa_foreign          0     0 interleave_hit    14593     0 local_node     
8080981     0 other_node            0     0

 


##

I tried offlining the device and memory before reconfigure-device, but those failed.
Am I doing something wrong or missing a step here?


>
>> Is reconfigure-device capable of remaining in effect after a reboot (in which case I'm curious
>> how you persist the info if the device is volatile; I believe it's the LSA on a non-volatile
>> device).
> The beginnings of this automation are in progress, at least for
> converting pmem devices:
>
>   https://lore.kernel.org/nvdimm/20210831090459.2306727-1-vishal.l.verma@intel.com/
>
> This is yet to be extended to volatile dax devices after figuring out
> what property we can key off of - one idea is the 'target_node' from
> ACPI.

Thank you!


>
>> Back to the patch: you mention --mode=system-ram, but it might be helpful to add
>> a mention of --mode=devdax, along with a mention of whether (and how) these
>> changes are persisted.
> The rest of the man page talks a little bit about the modes:
>
>   https://pmem.io/ndctl/daxctl-reconfigure-device.html
>
> Dan - you haven't posted the patch you attached to the list yet have
> you? /me couldn't find it on lore - wasn't sure if I'd just missed it.
>
I'm pretty sure that's where I found --mode=devdax, though I couldn't figure out how to get
conversion back to devdax to work. I'm sure there is something I'm missing.

Starting in the state where I left off above:


# daxctl reconfigure-device --mode=devdax dax0.0

error reconfiguring devices: Device or resource busy

reconfigured 0 devices


# daxctl reconfigure-device --mode=devdax --force dax0.0

libdaxctl: offline_one_memblock: dax0.0: Failed to offline /sys/devices/system/node/node1/memory272/state: Device or resource busy

dax0.0: failed to offline memory: Device or resource busy

error reconfiguring devices: Device or resource busy

reconfigured 0 devices


# daxctl offline-memory dax0.0

dax0.0: 4 memory sections already offline

libdaxctl: offline_one_memblock: dax0.0: Failed to offline /sys/devices/system/node/node1/memory272/state: Device or resource busy

dax0.0: failed to offline memory: Device or resource busy

error offlining memory: Device or resource busy

offlined memory for 0 devices


# daxctl offline-memory --force dax0.0

  Error: unknown option `force'

 

 usage: daxctl offline-memory <device> [<options>]

 

    -r, --region <region-id>

                          filter by region

    -u, --human           use human friendly number formats

    -v, --verbose         emit more debug messages

 

I have to reboot to get back to dax memory, though it's very possible that my
[questionable] doc reading skills are at fault.

Thanks,
John







* Re: CXL 1.1 Support Plan
       [not found]                       ` <dec7b376-71dd-1499-c46c-3fbc37fc2057@jagalactic.com>
@ 2021-10-08 18:10                         ` John Groves
  0 siblings, 0 replies; 12+ messages in thread
From: John Groves @ 2021-10-08 18:10 UTC (permalink / raw)
  To: Verma, Vishal L, Williams, Dan J
  Cc: johnny.li, Widawsky, Ben, linux-cxl, Jonathan.Cameron, jgroves

On 10/8/21 1:06 PM, John Groves wrote:

[snip mangled message]

Will try to un-mangle and re-send that last message.

John



* Re: CXL 1.1 Support Plan
       [not found]                   ` <945a5996-961e-a159-2168-8da355b6d945@jagalactic.com>
@ 2021-10-08 18:43                     ` John Groves
  2021-10-08 20:20                       ` Verma, Vishal L
  0 siblings, 1 reply; 12+ messages in thread
From: John Groves @ 2021-10-08 18:43 UTC (permalink / raw)
  To: Verma, Vishal L, Williams, Dan J
  Cc: johnny.li, Widawsky, Ben, linux-cxl, Jonathan.Cameron, jgroves

On 10/7/21 4:30 PM, Verma, Vishal L wrote:
> On Thu, 2021-09-30 at 22:17 +0000, John Groves wrote:
>> I don't see your doc patch yet in the ndctl repo, but I recommend merging it there.
>>
>> A related question: "daxctl reconfigure-device --mode=system-ram ..." works for me,
>> with --force, but going the other way (--mode=devdax) fails.  But a reboot puts it back
>> into devdax mode regardless of the pre-boot setting (i.e. --mode=system-ram reverts
>> back to devdax on reboot).
> Oh, I'm a bit confused. --force only applies when going from system-ram
> to devdax -- it offlines the memory for you. Without force, you're
> responsible for a prior 'daxctl offline-memory daxX.Y' step.
>
> Going from devdax to system-ram should not need --force, and I don't
> think force actually does anything there.

Hoping I've successfully de-mangled this message...

I definitely might be doing something wrong.  Here is a "typescript".

# grep dax /proc/iomem
    880000000-107fffffff : dax0.0

# ls -al /dev/dax0.0
crw------- 1 root root 252, 2 Oct  8 13:14 /dev/dax0.0

# numastat
                           node0
numa_hit                 2698949
numa_miss                      0
numa_foreign                   0
interleave_hit             14565
local_node               2698949
other_node                     0

# daxctl reconfigure-device --mode=system-ram --region=0 dax0.0
dax0.0: error: kernel policy will auto-online memory, aborting
error reconfiguring devices: Device or resource busy
reconfigured 0 devices

# daxctl reconfigure-device --mode=system-ram --region=0 --force dax0.0
dax0.0:
  WARNING: detected a race while onlining memory
  Some memory may not be in the expected zone. It is
  recommended to disable any other onlining mechanisms,
  and retry. If onlining is to be left to other agents,
  use the --no-online option to suppress this warning
dax0.0: all memory sections (256) already online
[
  {
    "chardev":"dax0.0",
    "size":34359738368,
    "target_node":1,
    "align":2097152,
    "mode":"system-ram",
    "online_memblocks":256,
    "total_memblocks":256,
    "movable":false
  }
]
reconfigured 1 device

# numastat
                           node0           node1
numa_hit                 2878872               0
numa_miss                      0               0
numa_foreign                   0               0
interleave_hit             14565               0
local_node               2878872               0
other_node                     0               0

I tried offlining the device and memory before reconfigure-device, but those failed.
Am I doing something wrong or missing a step here?


>
>> Is reconfigure-device capable of remaining in effect after a reboot (in which case I'm curious
>> how you persist the info if the device is volatile; I believe it's the LSA on a non-volatile
>> device).
> The beginnings of this automation are in progress, at least for
> converting pmem devices:
>
>   https://lore.kernel.org/nvdimm/20210831090459.2306727-1-vishal.l.verma@intel.com/
>
> This is yet to be extended to volatile dax devices after figuring out
> what property we can key off of - one idea is the 'target_node' from
> ACPI.
Thank you!
>
>> Back to the patch: you mention --mode=system-ram, but it might be helpful to add
>> a mention of --mode=devdax, along with a mention of whether (and how) these
>> changes are persisted.
> The rest of the man page talks a little bit about the modes:
>
>   https://pmem.io/ndctl/daxctl-reconfigure-device.html
>
> Dan - you haven't posted the patch you attached to the list yet have
> you? /me couldn't find it on lore - wasn't sure if I'd just missed it.
>

I'm pretty sure that's where I found --mode=devdax, though I couldn't figure out how to get
conversion back to devdax to work. I'm sure there is something I'm missing.

Starting in the state where I left off above:


# daxctl reconfigure-device --mode=devdax dax0.0
error reconfiguring devices: Device or resource busy
reconfigured 0 devices


# daxctl reconfigure-device --mode=devdax --force dax0.0
libdaxctl: offline_one_memblock: dax0.0: Failed to offline /sys/devices/system/node/node1/memory272/state: Device or resource busy
dax0.0: failed to offline memory: Device or resource busy

error reconfiguring devices: Device or resource busy
reconfigured 0 devices


# daxctl offline-memory dax0.0
dax0.0: 4 memory sections already offline
libdaxctl: offline_one_memblock: dax0.0: Failed to offline /sys/devices/system/node/node1/memory272/state: Device or resource busy
dax0.0: failed to offline memory: Device or resource busy
error offlining memory: Device or resource busy
offlined memory for 0 devices

# daxctl offline-memory --force dax0.0
  Error: unknown option `force'
 usage: daxctl offline-memory <device> [<options>]

    -r, --region <region-id>
                          filter by region
    -u, --human           use human friendly number formats
    -v, --verbose         emit more debug messages

 

I have to reboot to get back to dax memory, though it's very possible that my
[questionable] doc reading skills are at fault.

Thanks,
John








* Re: CXL 1.1 Support Plan
  2021-10-08 18:43                     ` John Groves
@ 2021-10-08 20:20                       ` Verma, Vishal L
  0 siblings, 0 replies; 12+ messages in thread
From: Verma, Vishal L @ 2021-10-08 20:20 UTC (permalink / raw)
  To: john, Williams, Dan J
  Cc: johnny.li, Widawsky, Ben, linux-cxl, Jonathan.Cameron, jgroves

On Fri, 2021-10-08 at 18:43 +0000, John Groves wrote:
> On 10/7/21 4:30 PM, Verma, Vishal L wrote:
> > On Thu, 2021-09-30 at 22:17 +0000, John Groves wrote:
> > > I don't see your doc patch yet in the ndctl repo, but I recommend merging it there.
> > > 
> > > A related question: "daxctl reconfigure-device --mode=system-ram ..." works for me,
> > > with --force, but going the other way (--mode=devdax) fails.  But a reboot puts it back
> > > into devdax mode regardless of the pre-boot setting (i.e. --mode=system-ram reverts
> > > back to devdax on reboot).
> > Oh, I'm a bit confused. --force only applies when going from system-ram
> > to devdax -- it offlines the memory for you. Without force, you're
> > responsible for a prior 'daxctl offline-memory daxX.Y' step.
> > 
> > Going from devdax to system-ram should not need --force, and I don't
> > think force actually does anything there.
> 
> Hoping I've successfully de-mangled this message...
> 
> I definitely might be doing something wrong.  Here is a "typescript".
> 
> # grep dax /proc/iomem
>     880000000-107fffffff : dax0.0
> 
> # ls -al /dev/dax0.0
> crw------- 1 root root 252, 2 Oct  8 13:14 /dev/dax0.0
> 
> # numastat
>                            node0
> numa_hit                 2698949
> numa_miss                      0
> numa_foreign                   0
> interleave_hit             14565
> local_node               2698949
> other_node                     0
> 
> # daxctl reconfigure-device --mode=system-ram --region=0 dax0.0
> dax0.0: error: kernel policy will auto-online memory, aborting
> error reconfiguring devices: Device or resource busy
> reconfigured 0 devices

Ah yes - so this points to either

  CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y

or 

  $ cat /sys/devices/system/memory/auto_online_blocks 
  online

daxctl wants to online the new memory in ZONE_MOVABLE by default, but
either of the above would race daxctl to online it in ZONE_NORMAL -
that's the warning you got above. 
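
If nothing else on the system depends on auto-onlining, one way to avoid the
race is to switch that policy off before retrying, e.g. (assuming the sysfs
knob rather than the Kconfig default is what is onlining the blocks):

# echo offline > /sys/devices/system/memory/auto_online_blocks
# daxctl reconfigure-device --mode=system-ram dax0.0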

> 
> # daxctl reconfigure-device --mode=system-ram --region=0 --force dax0.0
> 
> dax0.0:
>   WARNING: detected a race while onlining memory
>   Some memory may not be in the expected zone. It is
>   recommended to disable any other onlining mechanisms,
>   and retry. If onlining is to be left to other agents,
>   use the --no-online option to suppress this warning

I guess the force does work in this case :)

> dax0.0: all memory sections (256) already online
> [
>   {
>     "chardev":"dax0.0",
>     "size":34359738368,
>     "target_node":1,
>     "align":2097152,
>     "mode":"system-ram",
>     "online_memblocks":256,
>     "total_memblocks":256,
>     "movable":false

This indicates that the new memory went into ZONE_NORMAL.
That can make it hard to convert back to devdax, but..
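
One way to double-check which zone a given block ended up in (block number
taken from the errors quoted below) is something like:

# cat /sys/devices/system/memory/memory272/valid_zones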


[snip]

> Starting in the state where I left off above:
> 
> # daxctl reconfigure-device --mode=devdax dax0.0
> error reconfiguring devices: Device or resource busy
> reconfigured 0 devices
> 
> # daxctl reconfigure-device --mode=devdax --force dax0.0
> libdaxctl: offline_one_memblock: dax0.0: Failed to offline /sys/devices/system/node/node1/memory272/state: Device or resource busy
> dax0.0: failed to offline memory: Device or resource busy
> 
> error reconfiguring devices: Device or resource busy
> reconfigured 0 devices
> 
> # daxctl offline-memory dax0.0
> dax0.0: 4 memory sections already offline
> libdaxctl: offline_one_memblock: dax0.0: Failed to offline /sys/devices/system/node/node1/memory272/state: Device or resource busy
> dax0.0: failed to offline memory: Device or resource busy
> error offlining memory: Device or resource busy
> offlined memory for 0 devices

This confused me a bit - I wonder if 'daxctl offline-memory' claiming
that it is already offline is a bug, as a subsequent reconfigure then
says 'failed to offline...'

What is this range backed by? Most of the testing I did with this was
using a pmem device that gets its own 'target_node' (i.e. the new
memory ends up in a new NUMA node of its own).

If you carve out memory using memmap or efi_fake_mem, it would end up
getting hotplugged into an existing NUMA node. I wonder if that causes
problems with the hot-unplug. Let me take another look at this.
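
(For reference, a soft-reserved carve-out is typically simulated with
something like efi_fake_mem=4G@9G:0x40000 on the kernel command line - values
illustrative, with 0x40000 being the EFI_MEMORY_SP attribute.)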

> 
> # daxctl offline-memory --force dax0.0
>   Error: unknown option `force'
>  usage: daxctl offline-memory <device> [<options>]
> 
>     -r, --region <region-id>
>                           filter by region
>     -u, --human           use human friendly number formats
>     -v, --verbose         emit more debug messages
> 
>  
> 
> I have to reboot to get back to dax memory, though it's very possible that my
> [questionable] doc reading skills are at fault.
> 
> Thanks,
> John
> 
> 
> 
> 
> 
> 


