linux-acpi.vger.kernel.org archive mirror
* [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default
@ 2023-02-10  9:05 Dan Williams
  2023-02-10  9:05 ` [PATCH v2 01/20] cxl/memdev: Fix endpoint port removal Dan Williams
                   ` (21 more replies)
  0 siblings, 22 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:05 UTC (permalink / raw)
  To: linux-cxl
  Cc: Ira Weiny, David Hildenbrand, Dave Jiang, Davidlohr Bueso,
	Kees Cook, Jonathan Cameron, Vishal Verma, Dave Hansen,
	Michal Hocko, Jonathan Cameron, Gregory Price, Fan Ni, linux-mm,
	linux-acpi

Changes since v1: [1]
- Add a fix for memdev removal racing port removal (found by unit tests)
- Add a fix to unwind region target list updates on error in
  cxl_region_attach() (Jonathan)
- Move the passthrough decoder fix for submission for v6.2-final (Greg)
- Fix wrong initcall for cxl_core (Gregory and Davidlohr)
- Add an endpoint decoder state (CXL_DECODER_STATE_AUTO) to replace
  the flag CXL_DECODER_F_AUTO (Jonathan)
- Reflow cmp_decode_pos() to reduce levels of indentation (Jonathan)
- Fix a leaked reference count in cxl_add_to_region() (Jonathan)
- Make cxl_add_to_region() return an error (Jonathan)
- Fix several spurious whitespace changes (Jonathan)
- Cleanup some spurious changes from the tools/testing/cxl update
  (Jonathan)
- Test for == CXL_CONFIG_COMMIT rather than >= CXL_CONFIG_COMMIT
  (Jonathan)
- Add comment to clarify device_attach() return code expectation in
  cxl_add_to_region() (Jonathan)
- Add a patch to split cxl_port_probe() into switch and endpoint port
  probe calls (Jonathan)
- Collect reviewed-by and tested-by tags

[1]: http://lore.kernel.org/r/167564534874.847146.5222419648551436750.stgit@dwillia2-xfh.jf.intel.com

---
Cover letter same as v1

Summary:
--------

CXL RAM support allows for the dynamic provisioning of new CXL RAM
regions, and more routinely, assembling a region from an existing
configuration established by platform-firmware. The latter is motivated
by CXL memory RAS (Reliability, Availability and Serviceability)
support, which requires associating device events with System Physical
Address ranges and vice versa.

The 'Soft Reserved' policy rework arranges for performance
differentiated memory like CXL attached DRAM, or high-bandwidth memory,
to be designated for 'System RAM' by default, rather than the device-dax
dedicated access mode. The current device-dax default is confusing and
surprising for the majority of users, who do not expect memory to be
quarantined for dedicated access by default. Most users expect all
'System RAM'-capable memory to show up in free(1).


Details:
--------

Recall that the Linux 'Soft Reserved' designation for memory is a
reaction to platform-firmware, like EFI EDK2, delineating memory with
the EFI Specific Purpose Memory attribute (EFI_MEMORY_SP). An
alternative way to think of that attribute is that it designates memory
that is *not* part of the general-purpose pool: memory that may be too
precious for general usage, or not performant enough for some hot data
structures.
However, in the absence of explicit policy it should just be 'System
RAM' by default.

Rather than require every distribution to ship a udev policy to assign
dax devices to dax_kmem (the device-memory hotplug driver), just make
that assignment the kernel default. This is similar to the rationale in:

commit 8604d9e534a3 ("memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE")

With this change the relatively niche use case of accessing this memory
via mapping a device-dax instance can be achieved by building with
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n, or specifying
memhp_default_state=offline at boot, and then using:

    daxctl reconfigure-device $device -m devdax --force

...to shift the corresponding address range to device-dax access.
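For reference, the kind of per-distribution udev policy that this new
kernel default replaces might look something like the following fragment
(purely hypothetical; no distribution necessarily ships this exact rule,
and the match keys are illustrative):

```
# Hypothetical udev rule: hand any newly added device-dax instance to
# daxctl to be reconfigured for use as System RAM via dax_kmem.
ACTION=="add", SUBSYSTEM=="dax", \
  RUN+="/usr/bin/daxctl reconfigure-device --mode=system-ram %k"
```

With the in-kernel default, no such rule is required.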

The process of assembling a device-dax instance for a given CXL region
device configuration is similar to the process of assembling a
Device-Mapper or MDRAID storage-device array. Specifically, asynchronous
probing by the PCI and driver core enumerates all CXL endpoints and
their decoders. Then, once enough decoders have arrived to describe a
given region, that region is passed to the device-dax subsystem where it
is subject to the above 'dax_kmem' policy. This assignment and policy
choice is only possible if memory is set aside by the 'Soft Reserved'
designation. Otherwise, CXL memory that is mapped as 'System RAM'
becomes immutable by CXL driver mechanisms, but is still enumerated for
RAS purposes.
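The arrival-driven assembly described above can be modeled in a few
lines of Python (an illustrative sketch with invented names, not driver
code): a region is handed off to device-dax only once every interleave
position has been filled by an arriving endpoint decoder.

```python
class Region:
    """Toy model: a region is ready once all interleave positions are filled."""

    def __init__(self, interleave_ways):
        self.interleave_ways = interleave_ways
        self.targets = {}

    def attach(self, position, decoder):
        # Record the decoder at its interleave position; report completeness.
        self.targets[position] = decoder
        return self.complete()

    def complete(self):
        return len(self.targets) == self.interleave_ways


region = Region(interleave_ways=2)
assert not region.attach(0, "decoder3.0")  # first endpoint arrives, keep waiting
assert region.attach(1, "decoder4.0")      # last endpoint arrives, hand off to dax
```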

This series is also available via:

https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region

...and has gone through some preview testing in various forms.

---

Dan Williams (20):
      cxl/memdev: Fix endpoint port removal
      cxl/Documentation: Update references to attributes added in v6.0
      cxl/region: Add a mode attribute for regions
      cxl/region: Support empty uuids for non-pmem regions
      cxl/region: Validate region mode vs decoder mode
      cxl/region: Add volatile region creation support
      cxl/region: Refactor attach_target() for autodiscovery
      cxl/region: Cleanup target list on attach error
      cxl/region: Move region-position validation to a helper
      kernel/range: Uplevel the cxl subsystem's range_contains() helper
      cxl/region: Enable CONFIG_CXL_REGION to be toggled
      cxl/port: Split endpoint and switch port probe
      cxl/region: Add region autodiscovery
      tools/testing/cxl: Define a fixed volatile configuration to parse
      dax/hmem: Move HMAT and Soft reservation probe initcall level
      dax/hmem: Drop unnecessary dax_hmem_remove()
      dax/hmem: Convey the dax range via memregion_info()
      dax/hmem: Move hmem device registration to dax_hmem.ko
      dax: Assign RAM regions to memory-hotplug by default
      cxl/dax: Create dax devices for CXL RAM regions


 Documentation/ABI/testing/sysfs-bus-cxl |   64 +-
 MAINTAINERS                             |    1 
 drivers/acpi/numa/hmat.c                |    4 
 drivers/cxl/Kconfig                     |   12 
 drivers/cxl/acpi.c                      |    3 
 drivers/cxl/core/core.h                 |    7 
 drivers/cxl/core/hdm.c                  |   25 +
 drivers/cxl/core/memdev.c               |    1 
 drivers/cxl/core/pci.c                  |    5 
 drivers/cxl/core/port.c                 |   92 ++-
 drivers/cxl/core/region.c               |  851 ++++++++++++++++++++++++++++---
 drivers/cxl/cxl.h                       |   57 ++
 drivers/cxl/cxlmem.h                    |    5 
 drivers/cxl/port.c                      |  113 +++-
 drivers/dax/Kconfig                     |   17 +
 drivers/dax/Makefile                    |    2 
 drivers/dax/bus.c                       |   53 +-
 drivers/dax/bus.h                       |   12 
 drivers/dax/cxl.c                       |   53 ++
 drivers/dax/device.c                    |    3 
 drivers/dax/hmem/Makefile               |    3 
 drivers/dax/hmem/device.c               |  102 ++--
 drivers/dax/hmem/hmem.c                 |  148 +++++
 drivers/dax/kmem.c                      |    1 
 include/linux/dax.h                     |    7 
 include/linux/memregion.h               |    2 
 include/linux/range.h                   |    5 
 lib/stackinit_kunit.c                   |    6 
 tools/testing/cxl/test/cxl.c            |  147 +++++
 29 files changed, 1484 insertions(+), 317 deletions(-)
 create mode 100644 drivers/dax/cxl.c

base-commit: 172738bbccdb4ef76bdd72fc72a315c741c39161


* [PATCH v2 01/20] cxl/memdev: Fix endpoint port removal
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
@ 2023-02-10  9:05 ` Dan Williams
  2023-02-10 17:28   ` Jonathan Cameron
  2023-02-10 23:17   ` Verma, Vishal L
  2023-02-10  9:05 ` [PATCH v2 02/20] cxl/Documentation: Update references to attributes added in v6.0 Dan Williams
                   ` (20 subsequent siblings)
  21 siblings, 2 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:05 UTC (permalink / raw)
  To: linux-cxl; +Cc: vishal.l.verma, dave.hansen, linux-mm, linux-acpi

Testing of ram region support [1] triggers a long-standing bug in
cxl_detach_ep() where some cxl_ep_remove() cleanup is skipped due to an
inability to walk ports after dports have been unregistered. That
results in a failure to re-register a memdev after the port is
re-enabled, leading to a crash like the following:

    cxl_port_setup_targets: cxl region4: cxl_host_bridge.0:port4 iw: 1 ig: 256
    general protection fault, ...
    [..]
    RIP: 0010:cxl_region_setup_targets+0x897/0x9e0 [cxl_core]
    dev_name at include/linux/device.h:700
    (inlined by) cxl_port_setup_targets at drivers/cxl/core/region.c:1155
    (inlined by) cxl_region_setup_targets at drivers/cxl/core/region.c:1249
    [..]
    Call Trace:
     <TASK>
     attach_target+0x39a/0x760 [cxl_core]
     ? __mutex_unlock_slowpath+0x3a/0x290
     cxl_add_to_region+0xb8/0x340 [cxl_core]
     ? lockdep_hardirqs_on+0x7d/0x100
     discover_region+0x4b/0x80 [cxl_port]
     ? __pfx_discover_region+0x10/0x10 [cxl_port]
     device_for_each_child+0x58/0x90
     cxl_port_probe+0x10e/0x130 [cxl_port]
     cxl_bus_probe+0x17/0x50 [cxl_core]

Change the port ancestry walk to be by depth rather than by dport. This
ensures that even if a port has unregistered its dports, a deferred
memdev cleanup will still be able to clean up the memdev's interest in
that port.

The parent_port->dev.driver check is only needed for determining if the
bottom up removal beat the top-down removal, but cxl_ep_remove() can
always proceed.
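A toy Python model of the reworked walk (invented names, not the driver
logic): ports are located by a registry lookup keyed on depth, so
cleanup no longer depends on parent/dport pointers that a top-down
removal may have already torn down.

```python
# Illustrative registry of ports keyed by their depth in the topology.
ports_by_depth = {
    1: {"name": "host-bridge-port", "endpoints": {"mem0"}},
    2: {"name": "switch-port", "endpoints": {"mem0"}},
}


def detach_ep(memdev, endpoint_depth):
    # Walk from just above the endpoint port down toward depth 1,
    # mirroring the 'for (int i = cxlmd->depth - 1; i >= 1; i--)' loop.
    for depth in range(endpoint_depth - 1, 0, -1):
        port = ports_by_depth.get(depth)
        if port is None or memdev not in port["endpoints"]:
            continue  # no port at this depth tracks the memdev
        port["endpoints"].discard(memdev)


detach_ep("mem0", endpoint_depth=3)
assert all(not p["endpoints"] for p in ports_by_depth.values())
```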

Fixes: 2703c16c75ae ("cxl/core/port: Add switch port enumeration")
Link: http://lore.kernel.org/r/167564534874.847146.5222419648551436750.stgit@dwillia2-xfh.jf.intel.com [1]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/memdev.c |    1 +
 drivers/cxl/core/port.c   |   58 +++++++++++++++++++++++++--------------------
 drivers/cxl/cxlmem.h      |    2 ++
 3 files changed, 35 insertions(+), 26 deletions(-)

diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index a74a93310d26..3a8bc2b06047 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -246,6 +246,7 @@ static struct cxl_memdev *cxl_memdev_alloc(struct cxl_dev_state *cxlds,
 	if (rc < 0)
 		goto err;
 	cxlmd->id = rc;
+	cxlmd->depth = -1;
 
 	dev = &cxlmd->dev;
 	device_initialize(dev);
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 410c036c09fa..317bcf4dbd9d 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1207,6 +1207,7 @@ int cxl_endpoint_autoremove(struct cxl_memdev *cxlmd, struct cxl_port *endpoint)
 
 	get_device(&endpoint->dev);
 	dev_set_drvdata(dev, endpoint);
+	cxlmd->depth = endpoint->depth;
 	return devm_add_action_or_reset(dev, delete_endpoint, cxlmd);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_endpoint_autoremove, CXL);
@@ -1241,50 +1242,55 @@ static void reap_dports(struct cxl_port *port)
 	}
 }
 
+struct detach_ctx {
+	struct cxl_memdev *cxlmd;
+	int depth;
+};
+
+static int port_has_memdev(struct device *dev, const void *data)
+{
+	const struct detach_ctx *ctx = data;
+	struct cxl_port *port;
+
+	if (!is_cxl_port(dev))
+		return 0;
+
+	port = to_cxl_port(dev);
+	if (port->depth != ctx->depth)
+		return 0;
+
+	return !!cxl_ep_load(port, ctx->cxlmd);
+}
+
 static void cxl_detach_ep(void *data)
 {
 	struct cxl_memdev *cxlmd = data;
-	struct device *iter;
 
-	for (iter = &cxlmd->dev; iter; iter = grandparent(iter)) {
-		struct device *dport_dev = grandparent(iter);
+	for (int i = cxlmd->depth - 1; i >= 1; i--) {
 		struct cxl_port *port, *parent_port;
+		struct detach_ctx ctx = {
+			.cxlmd = cxlmd,
+			.depth = i,
+		};
+		struct device *dev;
 		struct cxl_ep *ep;
 		bool died = false;
 
-		if (!dport_dev)
-			break;
-
-		port = find_cxl_port(dport_dev, NULL);
-		if (!port)
-			continue;
-
-		if (is_cxl_root(port)) {
-			put_device(&port->dev);
+		dev = bus_find_device(&cxl_bus_type, NULL, &ctx,
+				      port_has_memdev);
+		if (!dev)
 			continue;
-		}
+		port = to_cxl_port(dev);
 
 		parent_port = to_cxl_port(port->dev.parent);
 		device_lock(&parent_port->dev);
-		if (!parent_port->dev.driver) {
-			/*
-			 * The bottom-up race to delete the port lost to a
-			 * top-down port disable, give up here, because the
-			 * parent_port ->remove() will have cleaned up all
-			 * descendants.
-			 */
-			device_unlock(&parent_port->dev);
-			put_device(&port->dev);
-			continue;
-		}
-
 		device_lock(&port->dev);
 		ep = cxl_ep_load(port, cxlmd);
 		dev_dbg(&cxlmd->dev, "disconnect %s from %s\n",
 			ep ? dev_name(ep->ep) : "", dev_name(&port->dev));
 		cxl_ep_remove(port, ep);
 		if (ep && !port->dead && xa_empty(&port->endpoints) &&
-		    !is_cxl_root(parent_port)) {
+		    !is_cxl_root(parent_port) && parent_port->dev.driver) {
 			/*
 			 * This was the last ep attached to a dynamically
 			 * enumerated port. Block new cxl_add_ep() and garbage
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index ab138004f644..c9da3c699a21 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -38,6 +38,7 @@
  * @cxl_nvb: coordinate removal of @cxl_nvd if present
  * @cxl_nvd: optional bridge to an nvdimm if the device supports pmem
  * @id: id number of this memdev instance.
+ * @depth: endpoint port depth
  */
 struct cxl_memdev {
 	struct device dev;
@@ -47,6 +48,7 @@ struct cxl_memdev {
 	struct cxl_nvdimm_bridge *cxl_nvb;
 	struct cxl_nvdimm *cxl_nvd;
 	int id;
+	int depth;
 };
 
 static inline struct cxl_memdev *to_cxl_memdev(struct device *dev)



* [PATCH v2 02/20] cxl/Documentation: Update references to attributes added in v6.0
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
  2023-02-10  9:05 ` [PATCH v2 01/20] cxl/memdev: Fix endpoint port removal Dan Williams
@ 2023-02-10  9:05 ` Dan Williams
  2023-02-10  9:05 ` [PATCH v2 03/20] cxl/region: Add a mode attribute for regions Dan Williams
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:05 UTC (permalink / raw)
  To: linux-cxl
  Cc: Vishal Verma, Dave Jiang, Gregory Price, Ira Weiny,
	Davidlohr Bueso, Jonathan Cameron, Fan Ni, dave.hansen, linux-mm,
	linux-acpi

Prior to Linus deciding that the kernel following v5.19 would be v6.0,
the CXL ABI documentation already referenced v5.20. In preparation for
updating these entries, update the kernel version references to v6.0.

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Gregory Price <gregory.price@memverge.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/167564535494.847146.12120939572640882946.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 Documentation/ABI/testing/sysfs-bus-cxl |   30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 329a7e46c805..5be032313e29 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -198,7 +198,7 @@ Description:
 
 What:		/sys/bus/cxl/devices/endpointX/CDAT
 Date:		July, 2022
-KernelVersion:	v5.20
+KernelVersion:	v6.0
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RO) If this sysfs entry is not present no DOE mailbox was
@@ -209,7 +209,7 @@ Description:
 
 What:		/sys/bus/cxl/devices/decoderX.Y/mode
 Date:		May, 2022
-KernelVersion:	v5.20
+KernelVersion:	v6.0
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
@@ -229,7 +229,7 @@ Description:
 
 What:		/sys/bus/cxl/devices/decoderX.Y/dpa_resource
 Date:		May, 2022
-KernelVersion:	v5.20
+KernelVersion:	v6.0
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RO) When a CXL decoder is of devtype "cxl_decoder_endpoint",
@@ -240,7 +240,7 @@ Description:
 
 What:		/sys/bus/cxl/devices/decoderX.Y/dpa_size
 Date:		May, 2022
-KernelVersion:	v5.20
+KernelVersion:	v6.0
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
@@ -260,7 +260,7 @@ Description:
 
 What:		/sys/bus/cxl/devices/decoderX.Y/interleave_ways
 Date:		May, 2022
-KernelVersion:	v5.20
+KernelVersion:	v6.0
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RO) The number of targets across which this decoder's host
@@ -275,7 +275,7 @@ Description:
 
 What:		/sys/bus/cxl/devices/decoderX.Y/interleave_granularity
 Date:		May, 2022
-KernelVersion:	v5.20
+KernelVersion:	v6.0
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RO) The number of consecutive bytes of host physical address
@@ -287,7 +287,7 @@ Description:
 
 What:		/sys/bus/cxl/devices/decoderX.Y/create_pmem_region
 Date:		May, 2022
-KernelVersion:	v5.20
+KernelVersion:	v6.0
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RW) Write a string in the form 'regionZ' to start the process
@@ -303,7 +303,7 @@ Description:
 
 What:		/sys/bus/cxl/devices/decoderX.Y/delete_region
 Date:		May, 2022
-KernelVersion:	v5.20
+KernelVersion:	v6.0
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(WO) Write a string in the form 'regionZ' to delete that region,
@@ -312,7 +312,7 @@ Description:
 
 What:		/sys/bus/cxl/devices/regionZ/uuid
 Date:		May, 2022
-KernelVersion:	v5.20
+KernelVersion:	v6.0
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RW) Write a unique identifier for the region. This field must
@@ -322,7 +322,7 @@ Description:
 
 What:		/sys/bus/cxl/devices/regionZ/interleave_granularity
 Date:		May, 2022
-KernelVersion:	v5.20
+KernelVersion:	v6.0
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RW) Set the number of consecutive bytes each device in the
@@ -333,7 +333,7 @@ Description:
 
 What:		/sys/bus/cxl/devices/regionZ/interleave_ways
 Date:		May, 2022
-KernelVersion:	v5.20
+KernelVersion:	v6.0
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RW) Configures the number of devices participating in the
@@ -343,7 +343,7 @@ Description:
 
 What:		/sys/bus/cxl/devices/regionZ/size
 Date:		May, 2022
-KernelVersion:	v5.20
+KernelVersion:	v6.0
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RW) System physical address space to be consumed by the region.
@@ -360,7 +360,7 @@ Description:
 
 What:		/sys/bus/cxl/devices/regionZ/resource
 Date:		May, 2022
-KernelVersion:	v5.20
+KernelVersion:	v6.0
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RO) A region is a contiguous partition of a CXL root decoder
@@ -372,7 +372,7 @@ Description:
 
 What:		/sys/bus/cxl/devices/regionZ/target[0..N]
 Date:		May, 2022
-KernelVersion:	v5.20
+KernelVersion:	v6.0
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RW) Write an endpoint decoder object name to 'targetX' where X
@@ -391,7 +391,7 @@ Description:
 
 What:		/sys/bus/cxl/devices/regionZ/commit
 Date:		May, 2022
-KernelVersion:	v5.20
+KernelVersion:	v6.0
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RW) Write a boolean 'true' string value to this attribute to



* [PATCH v2 03/20] cxl/region: Add a mode attribute for regions
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
  2023-02-10  9:05 ` [PATCH v2 01/20] cxl/memdev: Fix endpoint port removal Dan Williams
  2023-02-10  9:05 ` [PATCH v2 02/20] cxl/Documentation: Update references to attributes added in v6.0 Dan Williams
@ 2023-02-10  9:05 ` Dan Williams
  2023-02-10  9:05 ` [PATCH v2 04/20] cxl/region: Support empty uuids for non-pmem regions Dan Williams
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:05 UTC (permalink / raw)
  To: linux-cxl
  Cc: Vishal Verma, Dave Jiang, Gregory Price, Ira Weiny,
	Jonathan Cameron, Fan Ni, dave.hansen, linux-mm, linux-acpi

In preparation for a new region type, "ram" regions, add a mode
attribute to clarify the mode of the decoders that can be added to a
region. Share the internals of mode_show() (for decoders) with the
region case.

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Gregory Price <gregory.price@memverge.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/167564536041.847146.11330354943211409793.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 Documentation/ABI/testing/sysfs-bus-cxl |   11 +++++++++++
 drivers/cxl/core/port.c                 |   12 +-----------
 drivers/cxl/core/region.c               |   10 ++++++++++
 drivers/cxl/cxl.h                       |   14 ++++++++++++++
 4 files changed, 36 insertions(+), 11 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 5be032313e29..058b0c45001f 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -358,6 +358,17 @@ Description:
 		results in the same address being allocated.
 
 
+What:		/sys/bus/cxl/devices/regionZ/mode
+Date:		January, 2023
+KernelVersion:	v6.3
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		(RO) The mode of a region is established at region creation time
+		and dictates the mode of the endpoint decoders that comprise the
+		region. For more details on the possible modes see
+		/sys/bus/cxl/devices/decoderX.Y/mode
+
+
 What:		/sys/bus/cxl/devices/regionZ/resource
 Date:		May, 2022
 KernelVersion:	v6.0
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 317bcf4dbd9d..1e541956f605 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -180,17 +180,7 @@ static ssize_t mode_show(struct device *dev, struct device_attribute *attr,
 {
 	struct cxl_endpoint_decoder *cxled = to_cxl_endpoint_decoder(dev);
 
-	switch (cxled->mode) {
-	case CXL_DECODER_RAM:
-		return sysfs_emit(buf, "ram\n");
-	case CXL_DECODER_PMEM:
-		return sysfs_emit(buf, "pmem\n");
-	case CXL_DECODER_NONE:
-		return sysfs_emit(buf, "none\n");
-	case CXL_DECODER_MIXED:
-	default:
-		return sysfs_emit(buf, "mixed\n");
-	}
+	return sysfs_emit(buf, "%s\n", cxl_decoder_mode_name(cxled->mode));
 }
 
 static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 60828d01972a..17d2d0c12725 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -458,6 +458,15 @@ static ssize_t resource_show(struct device *dev, struct device_attribute *attr,
 }
 static DEVICE_ATTR_RO(resource);
 
+static ssize_t mode_show(struct device *dev, struct device_attribute *attr,
+			 char *buf)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	return sysfs_emit(buf, "%s\n", cxl_decoder_mode_name(cxlr->mode));
+}
+static DEVICE_ATTR_RO(mode);
+
 static int alloc_hpa(struct cxl_region *cxlr, resource_size_t size)
 {
 	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
@@ -585,6 +594,7 @@ static struct attribute *cxl_region_attrs[] = {
 	&dev_attr_interleave_granularity.attr,
 	&dev_attr_resource.attr,
 	&dev_attr_size.attr,
+	&dev_attr_mode.attr,
 	NULL,
 };
 
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index aa3af3bb73b2..ca76879af1de 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -320,6 +320,20 @@ enum cxl_decoder_mode {
 	CXL_DECODER_DEAD,
 };
 
+static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
+{
+	static const char * const names[] = {
+		[CXL_DECODER_NONE] = "none",
+		[CXL_DECODER_RAM] = "ram",
+		[CXL_DECODER_PMEM] = "pmem",
+		[CXL_DECODER_MIXED] = "mixed",
+	};
+
+	if (mode >= CXL_DECODER_NONE && mode <= CXL_DECODER_MIXED)
+		return names[mode];
+	return "mixed";
+}
+
 /**
  * struct cxl_endpoint_decoder - Endpoint  / SPA to DPA decoder
  * @cxld: base cxl_decoder_object



* [PATCH v2 04/20] cxl/region: Support empty uuids for non-pmem regions
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (2 preceding siblings ...)
  2023-02-10  9:05 ` [PATCH v2 03/20] cxl/region: Add a mode attribute for regions Dan Williams
@ 2023-02-10  9:05 ` Dan Williams
  2023-02-10 17:30   ` Jonathan Cameron
  2023-02-10 23:34   ` Ira Weiny
  2023-02-10  9:05 ` [PATCH v2 05/20] cxl/region: Validate region mode vs decoder mode Dan Williams
                   ` (17 subsequent siblings)
  21 siblings, 2 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:05 UTC (permalink / raw)
  To: linux-cxl; +Cc: Vishal Verma, Fan Ni, dave.hansen, linux-mm, linux-acpi

Shipping versions of the cxl-cli utility expect all regions to have a
'uuid' attribute. In preparation for 'ram' regions, update the 'uuid'
attribute to return an empty string which satisfies the current
expectations of 'cxl list -R'. Otherwise, 'cxl list -R' fails in the
presence of regions with the 'uuid' attribute missing. Force the
attribute to be read-only as there is no facility or expectation for a
'ram' region to recall its uuid from one boot to the next.
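A small Python model of the sysfs policy this patch establishes (an
illustrative sketch, not driver code; the 0o644 default for other
attributes is an assumption for the model): non-pmem regions keep a
'uuid' attribute for cxl-cli compatibility, but it is forced read-only
and reads as an empty string.

```python
def cxl_region_visible(attr, region_mode, default_mode=0o644):
    # Previously this returned 0 (attribute hidden) for non-pmem regions,
    # which broke 'cxl list -R'; now it returns read-only (0444).
    if attr == "uuid" and region_mode != "pmem":
        return 0o444
    return default_mode


def uuid_show(region):
    # Non-pmem regions emit an empty string rather than a uuid.
    if region["mode"] != "pmem":
        return "\n"
    return f"{region['uuid']}\n"


ram_region = {"mode": "ram", "uuid": None}
assert cxl_region_visible("uuid", ram_region["mode"]) == 0o444
assert uuid_show(ram_region) == "\n"
```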

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/167564536587.847146.12703125206459604597.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 Documentation/ABI/testing/sysfs-bus-cxl |    3 ++-
 drivers/cxl/core/region.c               |   11 +++++++++--
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 058b0c45001f..4c4e1cbb1169 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -317,7 +317,8 @@ Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RW) Write a unique identifier for the region. This field must
 		be set for persistent regions and it must not conflict with the
-		UUID of another region.
+		UUID of another region. For volatile ram regions this
+		attribute is a read-only empty string.
 
 
 What:		/sys/bus/cxl/devices/regionZ/interleave_granularity
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 17d2d0c12725..0fc80478ff6b 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -45,7 +45,10 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
 	rc = down_read_interruptible(&cxl_region_rwsem);
 	if (rc)
 		return rc;
-	rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
+	if (cxlr->mode != CXL_DECODER_PMEM)
+		rc = sysfs_emit(buf, "\n");
+	else
+		rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
 	up_read(&cxl_region_rwsem);
 
 	return rc;
@@ -300,8 +303,12 @@ static umode_t cxl_region_visible(struct kobject *kobj, struct attribute *a,
 	struct device *dev = kobj_to_dev(kobj);
 	struct cxl_region *cxlr = to_cxl_region(dev);
 
+	/*
+	 * Support tooling that expects to find a 'uuid' attribute for all
+	 * regions regardless of mode.
+	 */
 	if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_DECODER_PMEM)
-		return 0;
+		return 0444;
 	return a->mode;
 }
 



* [PATCH v2 05/20] cxl/region: Validate region mode vs decoder mode
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (3 preceding siblings ...)
  2023-02-10  9:05 ` [PATCH v2 04/20] cxl/region: Support empty uuids for non-pmem regions Dan Williams
@ 2023-02-10  9:05 ` Dan Williams
  2023-02-10  9:05 ` [PATCH v2 06/20] cxl/region: Add volatile region creation support Dan Williams
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:05 UTC (permalink / raw)
  To: linux-cxl
  Cc: Vishal Verma, Gregory Price, Dave Jiang, Ira Weiny,
	Jonathan Cameron, Fan Ni, dave.hansen, linux-mm, linux-acpi

In preparation for a new region mode, validate the mode of an endpoint
decoder against the mode of the region it is joining: for example, do
not allow 'ram' decoders to be assigned to 'pmem' regions, and vice
versa.
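A minimal sketch of the check this patch adds to cxl_region_attach()
(illustrative Python, not the driver code; string modes stand in for the
enum cxl_decoder_mode values):

```python
import errno


def cxl_region_attach(region_mode, decoder_mode):
    # A decoder may only join a region whose mode matches its own.
    if decoder_mode != region_mode:
        return -errno.EINVAL  # e.g. 'ram' decoder offered to a 'pmem' region
    if decoder_mode == "dead":
        return -errno.ENODEV  # decoder already torn down
    return 0


assert cxl_region_attach("pmem", "ram") == -errno.EINVAL
assert cxl_region_attach("ram", "ram") == 0
```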

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Gregory Price <gregory.price@memverge.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/167564537131.847146.9020072654741860107.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/region.c |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 0fc80478ff6b..285835145e9b 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1221,6 +1221,12 @@ static int cxl_region_attach(struct cxl_region *cxlr,
 	struct cxl_dport *dport;
 	int i, rc = -ENXIO;
 
+	if (cxled->mode != cxlr->mode) {
+		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
+			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
+		return -EINVAL;
+	}
+
 	if (cxled->mode == CXL_DECODER_DEAD) {
 		dev_dbg(&cxlr->dev, "%s dead\n", dev_name(&cxled->cxld.dev));
 		return -ENODEV;



* [PATCH v2 06/20] cxl/region: Add volatile region creation support
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (4 preceding siblings ...)
  2023-02-10  9:05 ` [PATCH v2 05/20] cxl/region: Validate region mode vs decoder mode Dan Williams
@ 2023-02-10  9:05 ` Dan Williams
  2023-02-10  9:06 ` [PATCH v2 07/20] cxl/region: Refactor attach_target() for autodiscovery Dan Williams
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:05 UTC (permalink / raw)
  To: linux-cxl
  Cc: Vishal Verma, Gregory Price, Dave Jiang, Ira Weiny,
	Jonathan Cameron, Fan Ni, dave.hansen, linux-mm, linux-acpi

Expand the region creation infrastructure to enable 'ram'
(volatile-memory) regions. The internals of create_pmem_region_store()
and create_pmem_region_show() are factored out into helpers
__create_region() and __create_region_show() for the 'ram' case to
reuse.
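The compare-exchange id handoff that the ABI text below describes for
create_{pmem,ram}_region can be modeled in Python (an illustrative
sketch, not driver code): reading the attribute advertises the next
region name, and a write must echo that name back or fail with EBUSY.

```python
import errno


class RootDecoder:
    """Toy model of the create_{pmem,ram}_region region-id handoff."""

    def __init__(self):
        self.next_id = 0

    def create_region_show(self):
        # Reading the attribute advertises the next available region name.
        return f"region{self.next_id}"

    def create_region_store(self, name):
        # The write must match the advertised name; on success the id is
        # consumed and the next creation attempt gets a fresh one.
        if name != f"region{self.next_id}":
            raise OSError(errno.EBUSY, "stale region name")
        self.next_id += 1
        return name


d = RootDecoder()
assert d.create_region_show() == "region0"
assert d.create_region_store("region0") == "region0"
try:
    d.create_region_store("region0")  # name is now stale
    raise AssertionError("expected EBUSY")
except OSError as e:
    assert e.errno == errno.EBUSY
```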

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Gregory Price <gregory.price@memverge.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/167564537678.847146.4066579806086171091.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 Documentation/ABI/testing/sysfs-bus-cxl |   22 +++++-----
 drivers/cxl/core/core.h                 |    1 
 drivers/cxl/core/port.c                 |   14 ++++++
 drivers/cxl/core/region.c               |   71 +++++++++++++++++++++++++------
 4 files changed, 83 insertions(+), 25 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 4c4e1cbb1169..3acf2f17a73f 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -285,20 +285,20 @@ Description:
 		interleave_granularity).
 
 
-What:		/sys/bus/cxl/devices/decoderX.Y/create_pmem_region
-Date:		May, 2022
-KernelVersion:	v6.0
+What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
+Date:		May, 2022, January, 2023
+KernelVersion:	v6.0 (pmem), v6.3 (ram)
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RW) Write a string in the form 'regionZ' to start the process
-		of defining a new persistent memory region (interleave-set)
-		within the decode range bounded by root decoder 'decoderX.Y'.
-		The value written must match the current value returned from
-		reading this attribute. An atomic compare exchange operation is
-		done on write to assign the requested id to a region and
-		allocate the region-id for the next creation attempt. EBUSY is
-		returned if the region name written does not match the current
-		cached value.
+		of defining a new persistent, or volatile memory region
+		(interleave-set) within the decode range bounded by root decoder
+		'decoderX.Y'. The value written must match the current value
+		returned from reading this attribute. An atomic compare exchange
+		operation is done on write to assign the requested id to a
+		region and allocate the region-id for the next creation attempt.
+		EBUSY is returned if the region name written does not match the
+		current cached value.
 
 
 What:		/sys/bus/cxl/devices/decoderX.Y/delete_region
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 8c04672dca56..5eb873da5a30 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -11,6 +11,7 @@ extern struct attribute_group cxl_base_attribute_group;
 
 #ifdef CONFIG_CXL_REGION
 extern struct device_attribute dev_attr_create_pmem_region;
+extern struct device_attribute dev_attr_create_ram_region;
 extern struct device_attribute dev_attr_delete_region;
 extern struct device_attribute dev_attr_region;
 extern const struct device_type cxl_pmem_region_type;
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 1e541956f605..9e5df64ea6b5 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -294,6 +294,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
 	&dev_attr_cap_type3.attr,
 	&dev_attr_target_list.attr,
 	SET_CXL_REGION_ATTR(create_pmem_region)
+	SET_CXL_REGION_ATTR(create_ram_region)
 	SET_CXL_REGION_ATTR(delete_region)
 	NULL,
 };
@@ -305,6 +306,13 @@ static bool can_create_pmem(struct cxl_root_decoder *cxlrd)
 	return (cxlrd->cxlsd.cxld.flags & flags) == flags;
 }
 
+static bool can_create_ram(struct cxl_root_decoder *cxlrd)
+{
+	unsigned long flags = CXL_DECODER_F_TYPE3 | CXL_DECODER_F_RAM;
+
+	return (cxlrd->cxlsd.cxld.flags & flags) == flags;
+}
+
 static umode_t cxl_root_decoder_visible(struct kobject *kobj, struct attribute *a, int n)
 {
 	struct device *dev = kobj_to_dev(kobj);
@@ -313,7 +321,11 @@ static umode_t cxl_root_decoder_visible(struct kobject *kobj, struct attribute *
 	if (a == CXL_REGION_ATTR(create_pmem_region) && !can_create_pmem(cxlrd))
 		return 0;
 
-	if (a == CXL_REGION_ATTR(delete_region) && !can_create_pmem(cxlrd))
+	if (a == CXL_REGION_ATTR(create_ram_region) && !can_create_ram(cxlrd))
+		return 0;
+
+	if (a == CXL_REGION_ATTR(delete_region) &&
+	    !(can_create_pmem(cxlrd) || can_create_ram(cxlrd)))
 		return 0;
 
 	return a->mode;
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 285835145e9b..e440db8611a4 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1689,6 +1689,15 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
 	struct device *dev;
 	int rc;
 
+	switch (mode) {
+	case CXL_DECODER_RAM:
+	case CXL_DECODER_PMEM:
+		break;
+	default:
+		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
+		return ERR_PTR(-EINVAL);
+	}
+
 	cxlr = cxl_region_alloc(cxlrd, id);
 	if (IS_ERR(cxlr))
 		return cxlr;
@@ -1717,12 +1726,38 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
 	return ERR_PTR(rc);
 }
 
+static ssize_t __create_region_show(struct cxl_root_decoder *cxlrd, char *buf)
+{
+	return sysfs_emit(buf, "region%u\n", atomic_read(&cxlrd->region_id));
+}
+
 static ssize_t create_pmem_region_show(struct device *dev,
 				       struct device_attribute *attr, char *buf)
 {
-	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
+	return __create_region_show(to_cxl_root_decoder(dev), buf);
+}
 
-	return sysfs_emit(buf, "region%u\n", atomic_read(&cxlrd->region_id));
+static ssize_t create_ram_region_show(struct device *dev,
+				      struct device_attribute *attr, char *buf)
+{
+	return __create_region_show(to_cxl_root_decoder(dev), buf);
+}
+
+static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
+					  enum cxl_decoder_mode mode, int id)
+{
+	int rc;
+
+	rc = memregion_alloc(GFP_KERNEL);
+	if (rc < 0)
+		return ERR_PTR(rc);
+
+	if (atomic_cmpxchg(&cxlrd->region_id, id, rc) != id) {
+		memregion_free(rc);
+		return ERR_PTR(-EBUSY);
+	}
+
+	return devm_cxl_add_region(cxlrd, id, mode, CXL_DECODER_EXPANDER);
 }
 
 static ssize_t create_pmem_region_store(struct device *dev,
@@ -1731,29 +1766,39 @@ static ssize_t create_pmem_region_store(struct device *dev,
 {
 	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
 	struct cxl_region *cxlr;
-	int id, rc;
+	int rc, id;
 
 	rc = sscanf(buf, "region%d\n", &id);
 	if (rc != 1)
 		return -EINVAL;
 
-	rc = memregion_alloc(GFP_KERNEL);
-	if (rc < 0)
-		return rc;
+	cxlr = __create_region(cxlrd, CXL_DECODER_PMEM, id);
+	if (IS_ERR(cxlr))
+		return PTR_ERR(cxlr);
 
-	if (atomic_cmpxchg(&cxlrd->region_id, id, rc) != id) {
-		memregion_free(rc);
-		return -EBUSY;
-	}
+	return len;
+}
+DEVICE_ATTR_RW(create_pmem_region);
+
+static ssize_t create_ram_region_store(struct device *dev,
+				       struct device_attribute *attr,
+				       const char *buf, size_t len)
+{
+	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
+	struct cxl_region *cxlr;
+	int rc, id;
 
-	cxlr = devm_cxl_add_region(cxlrd, id, CXL_DECODER_PMEM,
-				   CXL_DECODER_EXPANDER);
+	rc = sscanf(buf, "region%d\n", &id);
+	if (rc != 1)
+		return -EINVAL;
+
+	cxlr = __create_region(cxlrd, CXL_DECODER_RAM, id);
 	if (IS_ERR(cxlr))
 		return PTR_ERR(cxlr);
 
 	return len;
 }
-DEVICE_ATTR_RW(create_pmem_region);
+DEVICE_ATTR_RW(create_ram_region);
 
 static ssize_t region_show(struct device *dev, struct device_attribute *attr,
 			   char *buf)


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 07/20] cxl/region: Refactor attach_target() for autodiscovery
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (5 preceding siblings ...)
  2023-02-10  9:05 ` [PATCH v2 06/20] cxl/region: Add volatile region creation support Dan Williams
@ 2023-02-10  9:06 ` Dan Williams
  2023-02-10  9:06 ` [PATCH v2 08/20] cxl/region: Cleanup target list on attach error Dan Williams
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:06 UTC (permalink / raw)
  To: linux-cxl
  Cc: Vishal Verma, Dave Jiang, Ira Weiny, Jonathan Cameron, Fan Ni,
	dave.hansen, linux-mm, linux-acpi

Region autodiscovery is the process of the kernel creating 'struct
cxl_region' objects to represent CXL memory ranges it finds already
active in hardware when the driver loads. Typically this happens when
platform firmware establishes CXL memory regions and then publishes
them in the memory map. However, this can also happen in the case of a
kexec-reboot after the kernel has created regions.

In the autodiscovery case the region creation process starts with a
known endpoint decoder. Refactor attach_target() into a helper that is
suitable to be called from either sysfs, for runtime region creation, or
from cxl_port_probe() after it has enumerated all endpoint decoders.

The cxl_port_probe() context is an async device-core probing context, so
it is not appropriate to allow SIGTERM to interrupt the assembly
process. Refactor attach_target() to take @cxled and @state as arguments
where @state indicates whether waiting for the region rwsem is
interruptible or not.

No behavior change is intended.

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/167564538227.847146.16305045998592488364.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/region.c |   47 +++++++++++++++++++++++++++------------------
 1 file changed, 28 insertions(+), 19 deletions(-)

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index e440db8611a4..040bbd39c81d 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1422,31 +1422,25 @@ void cxl_decoder_kill_region(struct cxl_endpoint_decoder *cxled)
 	up_write(&cxl_region_rwsem);
 }
 
-static int attach_target(struct cxl_region *cxlr, const char *decoder, int pos)
+static int attach_target(struct cxl_region *cxlr,
+			 struct cxl_endpoint_decoder *cxled, int pos,
+			 unsigned int state)
 {
-	struct device *dev;
-	int rc;
-
-	dev = bus_find_device_by_name(&cxl_bus_type, NULL, decoder);
-	if (!dev)
-		return -ENODEV;
-
-	if (!is_endpoint_decoder(dev)) {
-		put_device(dev);
-		return -EINVAL;
-	}
+	int rc = 0;
 
-	rc = down_write_killable(&cxl_region_rwsem);
+	if (state == TASK_INTERRUPTIBLE)
+		rc = down_write_killable(&cxl_region_rwsem);
+	else
+		down_write(&cxl_region_rwsem);
 	if (rc)
-		goto out;
+		return rc;
+
 	down_read(&cxl_dpa_rwsem);
-	rc = cxl_region_attach(cxlr, to_cxl_endpoint_decoder(dev), pos);
+	rc = cxl_region_attach(cxlr, cxled, pos);
 	if (rc == 0)
 		set_bit(CXL_REGION_F_INCOHERENT, &cxlr->flags);
 	up_read(&cxl_dpa_rwsem);
 	up_write(&cxl_region_rwsem);
-out:
-	put_device(dev);
 	return rc;
 }
 
@@ -1484,8 +1478,23 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
 
 	if (sysfs_streq(buf, "\n"))
 		rc = detach_target(cxlr, pos);
-	else
-		rc = attach_target(cxlr, buf, pos);
+	else {
+		struct device *dev;
+
+		dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
+		if (!dev)
+			return -ENODEV;
+
+		if (!is_endpoint_decoder(dev)) {
+			rc = -EINVAL;
+			goto out;
+		}
+
+		rc = attach_target(cxlr, to_cxl_endpoint_decoder(dev), pos,
+				   TASK_INTERRUPTIBLE);
+out:
+		put_device(dev);
+	}
 
 	if (rc < 0)
 		return rc;


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 08/20] cxl/region: Cleanup target list on attach error
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (6 preceding siblings ...)
  2023-02-10  9:06 ` [PATCH v2 07/20] cxl/region: Refactor attach_target() for autodiscovery Dan Williams
@ 2023-02-10  9:06 ` Dan Williams
  2023-02-10 17:31   ` Jonathan Cameron
                     ` (2 more replies)
  2023-02-10  9:06 ` [PATCH v2 09/20] cxl/region: Move region-position validation to a helper Dan Williams
                   ` (13 subsequent siblings)
  21 siblings, 3 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:06 UTC (permalink / raw)
  To: linux-cxl
  Cc: Jonathan Cameron, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

Jonathan noticed that the target list setup is not unwound completely
upon error. Undo all the setup in the 'err_decrement:' exit path.

Fixes: 27b3f8d13830 ("cxl/region: Program target lists")
Reported-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Link: http://lore.kernel.org/r/20230208123031.00006990@Huawei.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/region.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 040bbd39c81d..ae7d3adcd41a 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1347,6 +1347,8 @@ static int cxl_region_attach(struct cxl_region *cxlr,
 
 err_decrement:
 	p->nr_targets--;
+	cxled->pos = -1;
+	p->targets[pos] = NULL;
 err:
 	for (iter = ep_port; !is_cxl_root(iter);
 	     iter = to_cxl_port(iter->dev.parent))


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 09/20] cxl/region: Move region-position validation to a helper
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (7 preceding siblings ...)
  2023-02-10  9:06 ` [PATCH v2 08/20] cxl/region: Cleanup target list on attach error Dan Williams
@ 2023-02-10  9:06 ` Dan Williams
  2023-02-10 17:34   ` Jonathan Cameron
  2023-02-10  9:06 ` [PATCH v2 10/20] kernel/range: Uplevel the cxl subsystem's range_contains() helper Dan Williams
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:06 UTC (permalink / raw)
  To: linux-cxl; +Cc: Vishal Verma, Fan Ni, dave.hansen, linux-mm, linux-acpi

In preparation for region autodiscovery, which needs all devices
discovered before their relative position in the region can be
determined, consolidate all position-dependent validation in a helper.

Recall that in the on-demand region creation flow the end-user picks the
position of a given endpoint decoder in a region. In the autodiscovery
case the position of an endpoint decoder can only be determined after
all other endpoint decoders that claim to decode the region's address
range have been enumerated and attached. So, in the autodiscovery case
endpoint decoders may be attached before their relative position is
known. Once all decoders arrive, positions can be determined and
validated with cxl_region_validate_position(), the same as for
user-initiated on-demand creation.

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/167564538779.847146.8356062886811511706.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/region.c |  119 +++++++++++++++++++++++++++++----------------
 1 file changed, 76 insertions(+), 43 deletions(-)

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index ae7d3adcd41a..691605f1e120 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1211,35 +1211,13 @@ static int cxl_region_setup_targets(struct cxl_region *cxlr)
 	return 0;
 }
 
-static int cxl_region_attach(struct cxl_region *cxlr,
-			     struct cxl_endpoint_decoder *cxled, int pos)
+static int cxl_region_validate_position(struct cxl_region *cxlr,
+					struct cxl_endpoint_decoder *cxled,
+					int pos)
 {
-	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
 	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
-	struct cxl_port *ep_port, *root_port, *iter;
 	struct cxl_region_params *p = &cxlr->params;
-	struct cxl_dport *dport;
-	int i, rc = -ENXIO;
-
-	if (cxled->mode != cxlr->mode) {
-		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
-			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
-		return -EINVAL;
-	}
-
-	if (cxled->mode == CXL_DECODER_DEAD) {
-		dev_dbg(&cxlr->dev, "%s dead\n", dev_name(&cxled->cxld.dev));
-		return -ENODEV;
-	}
-
-	/* all full of members, or interleave config not established? */
-	if (p->state > CXL_CONFIG_INTERLEAVE_ACTIVE) {
-		dev_dbg(&cxlr->dev, "region already active\n");
-		return -EBUSY;
-	} else if (p->state < CXL_CONFIG_INTERLEAVE_ACTIVE) {
-		dev_dbg(&cxlr->dev, "interleave config missing\n");
-		return -ENXIO;
-	}
+	int i;
 
 	if (pos < 0 || pos >= p->interleave_ways) {
 		dev_dbg(&cxlr->dev, "position %d out of range %d\n", pos,
@@ -1278,6 +1256,71 @@ static int cxl_region_attach(struct cxl_region *cxlr,
 		}
 	}
 
+	return 0;
+}
+
+static int cxl_region_attach_position(struct cxl_region *cxlr,
+				      struct cxl_root_decoder *cxlrd,
+				      struct cxl_endpoint_decoder *cxled,
+				      const struct cxl_dport *dport, int pos)
+{
+	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
+	struct cxl_port *iter;
+	int rc;
+
+	if (cxlrd->calc_hb(cxlrd, pos) != dport) {
+		dev_dbg(&cxlr->dev, "%s:%s invalid target position for %s\n",
+			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
+			dev_name(&cxlrd->cxlsd.cxld.dev));
+		return -ENXIO;
+	}
+
+	for (iter = cxled_to_port(cxled); !is_cxl_root(iter);
+	     iter = to_cxl_port(iter->dev.parent)) {
+		rc = cxl_port_attach_region(iter, cxlr, cxled, pos);
+		if (rc)
+			goto err;
+	}
+
+	return 0;
+
+err:
+	for (iter = cxled_to_port(cxled); !is_cxl_root(iter);
+	     iter = to_cxl_port(iter->dev.parent))
+		cxl_port_detach_region(iter, cxlr, cxled);
+	return rc;
+}
+
+static int cxl_region_attach(struct cxl_region *cxlr,
+			     struct cxl_endpoint_decoder *cxled, int pos)
+{
+	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
+	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
+	struct cxl_region_params *p = &cxlr->params;
+	struct cxl_port *ep_port, *root_port;
+	struct cxl_dport *dport;
+	int rc = -ENXIO;
+
+	if (cxled->mode != cxlr->mode) {
+		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
+			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
+		return -EINVAL;
+	}
+
+	if (cxled->mode == CXL_DECODER_DEAD) {
+		dev_dbg(&cxlr->dev, "%s dead\n", dev_name(&cxled->cxld.dev));
+		return -ENODEV;
+	}
+
+	/* all full of members, or interleave config not established? */
+	if (p->state > CXL_CONFIG_INTERLEAVE_ACTIVE) {
+		dev_dbg(&cxlr->dev, "region already active\n");
+		return -EBUSY;
+	} else if (p->state < CXL_CONFIG_INTERLEAVE_ACTIVE) {
+		dev_dbg(&cxlr->dev, "interleave config missing\n");
+		return -ENXIO;
+	}
+
 	ep_port = cxled_to_port(cxled);
 	root_port = cxlrd_to_port(cxlrd);
 	dport = cxl_find_dport_by_dev(root_port, ep_port->host_bridge);
@@ -1288,13 +1331,6 @@ static int cxl_region_attach(struct cxl_region *cxlr,
 		return -ENXIO;
 	}
 
-	if (cxlrd->calc_hb(cxlrd, pos) != dport) {
-		dev_dbg(&cxlr->dev, "%s:%s invalid target position for %s\n",
-			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
-			dev_name(&cxlrd->cxlsd.cxld.dev));
-		return -ENXIO;
-	}
-
 	if (cxled->cxld.target_type != cxlr->type) {
 		dev_dbg(&cxlr->dev, "%s:%s type mismatch: %d vs %d\n",
 			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
@@ -1318,12 +1354,13 @@ static int cxl_region_attach(struct cxl_region *cxlr,
 		return -EINVAL;
 	}
 
-	for (iter = ep_port; !is_cxl_root(iter);
-	     iter = to_cxl_port(iter->dev.parent)) {
-		rc = cxl_port_attach_region(iter, cxlr, cxled, pos);
-		if (rc)
-			goto err;
-	}
+	rc = cxl_region_validate_position(cxlr, cxled, pos);
+	if (rc)
+		return rc;
+
+	rc = cxl_region_attach_position(cxlr, cxlrd, cxled, dport, pos);
+	if (rc)
+		return rc;
 
 	p->targets[pos] = cxled;
 	cxled->pos = pos;
@@ -1349,10 +1386,6 @@ static int cxl_region_attach(struct cxl_region *cxlr,
 	p->nr_targets--;
 	cxled->pos = -1;
 	p->targets[pos] = NULL;
-err:
-	for (iter = ep_port; !is_cxl_root(iter);
-	     iter = to_cxl_port(iter->dev.parent))
-		cxl_port_detach_region(iter, cxlr, cxled);
 	return rc;
 }
 


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 10/20] kernel/range: Uplevel the cxl subsystem's range_contains() helper
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (8 preceding siblings ...)
  2023-02-10  9:06 ` [PATCH v2 09/20] cxl/region: Move region-position validation to a helper Dan Williams
@ 2023-02-10  9:06 ` Dan Williams
  2023-02-10  9:06 ` [PATCH v2 11/20] cxl/region: Enable CONFIG_CXL_REGION to be toggled Dan Williams
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:06 UTC (permalink / raw)
  To: linux-cxl
  Cc: Kees Cook, Vishal Verma, Jonathan Cameron, Dave Jiang,
	Gregory Price, Ira Weiny, Fan Ni, dave.hansen, linux-mm,
	linux-acpi

In support of the CXL subsystem's use of 'struct range' to track decode
address ranges, add a common range_contains() implementation with
semantics identical to resource_contains().

The existing 'range_contains()' in lib/stackinit_kunit.c is namespaced
with a 'stackinit_' prefix.

Cc: Kees Cook <keescook@chromium.org>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Gregory Price <gregory.price@memverge.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/167564539327.847146.788601375229324484.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/pci.c |    5 -----
 include/linux/range.h  |    5 +++++
 lib/stackinit_kunit.c  |    6 +++---
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 1d1492440287..9ed2120dbf8a 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -214,11 +214,6 @@ static int devm_cxl_enable_mem(struct device *host, struct cxl_dev_state *cxlds)
 	return devm_add_action_or_reset(host, clear_mem_enable, cxlds);
 }
 
-static bool range_contains(struct range *r1, struct range *r2)
-{
-	return r1->start <= r2->start && r1->end >= r2->end;
-}
-
 /* require dvsec ranges to be covered by a locked platform window */
 static int dvsec_range_allowed(struct device *dev, void *arg)
 {
diff --git a/include/linux/range.h b/include/linux/range.h
index 274681cc3154..7efb6a9b069b 100644
--- a/include/linux/range.h
+++ b/include/linux/range.h
@@ -13,6 +13,11 @@ static inline u64 range_len(const struct range *range)
 	return range->end - range->start + 1;
 }
 
+static inline bool range_contains(struct range *r1, struct range *r2)
+{
+	return r1->start <= r2->start && r1->end >= r2->end;
+}
+
 int add_range(struct range *range, int az, int nr_range,
 		u64 start, u64 end);
 
diff --git a/lib/stackinit_kunit.c b/lib/stackinit_kunit.c
index 4591d6cf5e01..05947a2feb93 100644
--- a/lib/stackinit_kunit.c
+++ b/lib/stackinit_kunit.c
@@ -31,8 +31,8 @@ static volatile u8 forced_mask = 0xff;
 static void *fill_start, *target_start;
 static size_t fill_size, target_size;
 
-static bool range_contains(char *haystack_start, size_t haystack_size,
-			   char *needle_start, size_t needle_size)
+static bool stackinit_range_contains(char *haystack_start, size_t haystack_size,
+				     char *needle_start, size_t needle_size)
 {
 	if (needle_start >= haystack_start &&
 	    needle_start + needle_size <= haystack_start + haystack_size)
@@ -175,7 +175,7 @@ static noinline void test_ ## name (struct kunit *test)		\
 								\
 	/* Validate that compiler lined up fill and target. */	\
 	KUNIT_ASSERT_TRUE_MSG(test,				\
-		range_contains(fill_start, fill_size,		\
+		stackinit_range_contains(fill_start, fill_size,	\
 			    target_start, target_size),		\
 		"stack fill missed target!? "			\
 		"(fill %zu wide, target offset by %d)\n",	\


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 11/20] cxl/region: Enable CONFIG_CXL_REGION to be toggled
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (9 preceding siblings ...)
  2023-02-10  9:06 ` [PATCH v2 10/20] kernel/range: Uplevel the cxl subsystem's range_contains() helper Dan Williams
@ 2023-02-10  9:06 ` Dan Williams
  2023-02-10  9:06 ` [PATCH v2 12/20] cxl/port: Split endpoint and switch port probe Dan Williams
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:06 UTC (permalink / raw)
  To: linux-cxl
  Cc: Vishal Verma, Jonathan Cameron, Dave Jiang, Gregory Price,
	Fan Ni, dave.hansen, linux-mm, linux-acpi

Add help text and a label so the CXL_REGION config option can be
toggled. This is mainly to enable compile testing without region
support.

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Gregory Price <gregory.price@memverge.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/167564539875.847146.16213498614174558767.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/Kconfig |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 0ac53c422c31..163c094e67ae 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -104,12 +104,22 @@ config CXL_SUSPEND
 	depends on SUSPEND && CXL_MEM
 
 config CXL_REGION
-	bool
+	bool "CXL: Region Support"
 	default CXL_BUS
 	# For MAX_PHYSMEM_BITS
 	depends on SPARSEMEM
 	select MEMREGION
 	select GET_FREE_REGION
+	help
+	  Enable the CXL core to enumerate and provision CXL regions. A CXL
+	  region is defined by one or more CXL expanders that decode a given
+	  system-physical address range. For CXL regions established by
+	  platform-firmware this option enables memory error handling to
+	  identify the devices participating in a given interleaved memory
+	  range. Otherwise, platform-firmware managed CXL is enabled by being
+	  placed in the system address map and does not need a driver.
+
+	  If unsure say 'y'
 
 config CXL_REGION_INVALIDATION_TEST
 	bool "CXL: Region Cache Management Bypass (TEST)"


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 12/20] cxl/port: Split endpoint and switch port probe
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (10 preceding siblings ...)
  2023-02-10  9:06 ` [PATCH v2 11/20] cxl/region: Enable CONFIG_CXL_REGION to be toggled Dan Williams
@ 2023-02-10  9:06 ` Dan Williams
  2023-02-10 17:41   ` Jonathan Cameron
  2023-02-10 23:21   ` Verma, Vishal L
  2023-02-10  9:06 ` [PATCH v2 13/20] cxl/region: Add region autodiscovery Dan Williams
                   ` (9 subsequent siblings)
  21 siblings, 2 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:06 UTC (permalink / raw)
  To: linux-cxl
  Cc: Jonathan Cameron, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

Jonathan points out that the shared code between the switch and endpoint
case is small. Before adding another is_cxl_endpoint() conditional,
just split the two cases.

Rather than duplicate the "Couldn't enumerate decoders" error message,
take the opportunity to improve the error messages in
devm_cxl_enumerate_decoders().

Reported-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Link: http://lore.kernel.org/r/20230208170724.000067ec@Huawei.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/hdm.c |   11 ++++++--
 drivers/cxl/port.c     |   69 +++++++++++++++++++++++++++---------------------
 2 files changed, 47 insertions(+), 33 deletions(-)

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index dcc16d7cb8f3..a0891c3464f1 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -826,7 +826,8 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
 			cxled = cxl_endpoint_decoder_alloc(port);
 			if (IS_ERR(cxled)) {
 				dev_warn(&port->dev,
-					 "Failed to allocate the decoder\n");
+					 "Failed to allocate decoder%d.%d\n",
+					 port->id, i);
 				return PTR_ERR(cxled);
 			}
 			cxld = &cxled->cxld;
@@ -836,7 +837,8 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
 			cxlsd = cxl_switch_decoder_alloc(port, target_count);
 			if (IS_ERR(cxlsd)) {
 				dev_warn(&port->dev,
-					 "Failed to allocate the decoder\n");
+					 "Failed to allocate decoder%d.%d\n",
+					 port->id, i);
 				return PTR_ERR(cxlsd);
 			}
 			cxld = &cxlsd->cxld;
@@ -844,13 +846,16 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
 
 		rc = init_hdm_decoder(port, cxld, target_map, hdm, i, &dpa_base);
 		if (rc) {
+			dev_warn(&port->dev,
+				 "Failed to initialize decoder%d.%d\n",
+				 port->id, i);
 			put_device(&cxld->dev);
 			return rc;
 		}
 		rc = add_hdm_decoder(port, cxld, target_map);
 		if (rc) {
 			dev_warn(&port->dev,
-				 "Failed to add decoder to port\n");
+				 "Failed to add decoder%d.%d\n", port->id, i);
 			return rc;
 		}
 	}
diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
index 5453771bf330..a8d46a67b45e 100644
--- a/drivers/cxl/port.c
+++ b/drivers/cxl/port.c
@@ -30,55 +30,64 @@ static void schedule_detach(void *cxlmd)
 	schedule_cxl_memdev_detach(cxlmd);
 }
 
-static int cxl_port_probe(struct device *dev)
+static int cxl_switch_port_probe(struct cxl_port *port)
 {
-	struct cxl_port *port = to_cxl_port(dev);
 	struct cxl_hdm *cxlhdm;
 	int rc;
 
+	rc = devm_cxl_port_enumerate_dports(port);
+	if (rc < 0)
+		return rc;
 
-	if (!is_cxl_endpoint(port)) {
-		rc = devm_cxl_port_enumerate_dports(port);
-		if (rc < 0)
-			return rc;
-		if (rc == 1)
-			return devm_cxl_add_passthrough_decoder(port);
-	}
+	if (rc == 1)
+		return devm_cxl_add_passthrough_decoder(port);
 
 	cxlhdm = devm_cxl_setup_hdm(port);
 	if (IS_ERR(cxlhdm))
 		return PTR_ERR(cxlhdm);
 
-	if (is_cxl_endpoint(port)) {
-		struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport);
-		struct cxl_dev_state *cxlds = cxlmd->cxlds;
+	return devm_cxl_enumerate_decoders(cxlhdm);
+}
 
-		/* Cache the data early to ensure is_visible() works */
-		read_cdat_data(port);
+static int cxl_endpoint_port_probe(struct cxl_port *port)
+{
+	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport);
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+	struct cxl_hdm *cxlhdm;
+	int rc;
+
+	cxlhdm = devm_cxl_setup_hdm(port);
+	if (IS_ERR(cxlhdm))
+		return PTR_ERR(cxlhdm);
 
-		get_device(&cxlmd->dev);
-		rc = devm_add_action_or_reset(dev, schedule_detach, cxlmd);
-		if (rc)
-			return rc;
+	/* Cache the data early to ensure is_visible() works */
+	read_cdat_data(port);
 
-		rc = cxl_hdm_decode_init(cxlds, cxlhdm);
-		if (rc)
-			return rc;
+	get_device(&cxlmd->dev);
+	rc = devm_add_action_or_reset(&port->dev, schedule_detach, cxlmd);
+	if (rc)
+		return rc;
 
-		rc = cxl_await_media_ready(cxlds);
-		if (rc) {
-			dev_err(dev, "Media not active (%d)\n", rc);
-			return rc;
-		}
-	}
+	rc = cxl_hdm_decode_init(cxlds, cxlhdm);
+	if (rc)
+		return rc;
 
-	rc = devm_cxl_enumerate_decoders(cxlhdm);
+	rc = cxl_await_media_ready(cxlds);
 	if (rc) {
-		dev_err(dev, "Couldn't enumerate decoders (%d)\n", rc);
+		dev_err(&port->dev, "Media not active (%d)\n", rc);
 		return rc;
 	}
 
-	return 0;
+	return devm_cxl_enumerate_decoders(cxlhdm);
+}
+
+static int cxl_port_probe(struct device *dev)
+{
+	struct cxl_port *port = to_cxl_port(dev);
+
+	if (is_cxl_endpoint(port))
+		return cxl_endpoint_port_probe(port);
+	return cxl_switch_port_probe(port);
 }
 
 static ssize_t CDAT_read(struct file *filp, struct kobject *kobj,


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 13/20] cxl/region: Add region autodiscovery
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (11 preceding siblings ...)
  2023-02-10  9:06 ` [PATCH v2 12/20] cxl/port: Split endpoint and switch port probe Dan Williams
@ 2023-02-10  9:06 ` Dan Williams
  2023-02-10 18:09   ` Jonathan Cameron
                     ` (3 more replies)
  2023-02-10  9:06 ` [PATCH v2 14/20] tools/testing/cxl: Define a fixed volatile configuration to parse Dan Williams
                   ` (8 subsequent siblings)
  21 siblings, 4 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:06 UTC (permalink / raw)
  To: linux-cxl; +Cc: Fan Ni, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

Region autodiscovery is an asynchronous state machine advanced by
cxl_port_probe(). After the decoders on an endpoint port are enumerated
they are scanned for actively enabled instances. Each active decoder is
flagged for auto-assembly (CXL_DECODER_STATE_AUTO) and attached to a
region. If a region does not already exist for the address range setting
of the decoder, one is created. That creation process may race with other
decoders of the same region being discovered since cxl_port_probe() is
asynchronous. A new 'struct cxl_root_decoder' lock, @range_lock, is
introduced to mitigate that race.
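The create-versus-attach race that @range_lock resolves can be sketched
with a small userspace model. This is an illustrative sketch only (the
struct, `find_region()`, and `add_to_region()` names here are stand-ins,
not the driver's API): the first caller for a given range constructs the
region under the lock, and later callers find and join the same object.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a CXL region keyed by its HPA range */
struct region {
	unsigned long long start, end;
	int nr_targets;
};

static pthread_mutex_t range_lock = PTHREAD_MUTEX_INITIALIZER;
static struct region *regions[8];
static int nr_regions;

/* Lookup an existing region for [start, end] (caller holds range_lock) */
struct region *find_region(unsigned long long start, unsigned long long end)
{
	int i;

	for (i = 0; i < nr_regions; i++)
		if (regions[i]->start == start && regions[i]->end == end)
			return regions[i];
	return NULL;
}

/*
 * Serialize construct-vs-attach so that racing decoders of the same
 * range agree on a single region object: one caller constructs, the
 * rest attach as additional targets.
 */
struct region *add_to_region(unsigned long long start, unsigned long long end)
{
	struct region *r;

	pthread_mutex_lock(&range_lock);
	r = find_region(start, end);
	if (!r) {
		r = calloc(1, sizeof(*r));
		r->start = start;
		r->end = end;
		regions[nr_regions++] = r;
	}
	r->nr_targets++;
	pthread_mutex_unlock(&range_lock);
	return r;
}
```

Two callers with the same range get the same region pointer; a caller
with a different range constructs a new one.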

Once all decoders have arrived, "p->nr_targets == p->interleave_ways",
they are sorted by their relative decode position. The sort algorithm
involves finding the point in the cxl_port topology where one leg of the
decode leads to deviceA and the other deviceB. At that point in the
topology the target order in the 'struct cxl_switch_decoder' indicates
the relative position of those endpoint decoders in the region.
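That divergence-point comparison can be modeled in miniature. In this
hedged sketch, `struct node`, `link_node()`, and `cmp_pos()` are
simplified stand-ins for the port / switch-decoder types, not the
driver's: two leaves are ordered by walking up to the closest shared
ancestor and comparing where each leg sits in that ancestor's ordered
target list.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a cxl_port: a parent pointer plus an ordered
 * target list that encodes relative decode position. */
struct node {
	struct node *parent;
	struct node *targets[4];
	int nr_targets;
};

void link_node(struct node *parent, struct node *child)
{
	child->parent = parent;
	parent->targets[parent->nr_targets++] = child;
}

int depth(struct node *n)
{
	int d = 0;

	while (n->parent) {
		n = n->parent;
		d++;
	}
	return d;
}

/*
 * Walk @a and @b up the tree until they share a parent, then compare
 * the positions of the two diverging legs in that shared parent's
 * target list: negative means @a decodes before @b.
 */
int cmp_pos(struct node *a, struct node *b)
{
	int a_pos = -1, b_pos = -1, i;
	struct node *anc;

	while (a->parent != b->parent) {
		if (depth(a) >= depth(b))
			a = a->parent;
		else
			b = b->parent;
	}
	anc = a->parent;
	for (i = 0; i < anc->nr_targets; i++) {
		if (anc->targets[i] == a)
			a_pos = i;
		if (anc->targets[i] == b)
			b_pos = i;
	}
	return a_pos - b_pos;
}
```

With a root fanning out to two switches and endpoints below them, the
comparison orders endpoints under one switch by their switch-decoder
slot, and endpoints under different switches by the root's slot order.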

From that point the region goes through the same setup and validation
steps as user-created regions, but instead of programming the decoders
it validates that the driver would have written the same values to the
decoders as were already present.

Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/167564540972.847146.17096178433176097831.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/hdm.c    |   11 +
 drivers/cxl/core/port.c   |    2 
 drivers/cxl/core/region.c |  497 ++++++++++++++++++++++++++++++++++++++++++++-
 drivers/cxl/cxl.h         |   29 +++
 drivers/cxl/port.c        |   48 ++++
 5 files changed, 576 insertions(+), 11 deletions(-)

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index a0891c3464f1..8c29026a4b9d 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -676,6 +676,14 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld)
 	port->commit_end--;
 	cxld->flags &= ~CXL_DECODER_F_ENABLE;
 
+	/* Userspace is now responsible for reconfiguring this decoder */
+	if (is_endpoint_decoder(&cxld->dev)) {
+		struct cxl_endpoint_decoder *cxled;
+
+		cxled = to_cxl_endpoint_decoder(&cxld->dev);
+		cxled->state = CXL_DECODER_STATE_MANUAL;
+	}
+
 	return 0;
 }
 
@@ -783,6 +791,9 @@ static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
 		return rc;
 	}
 	*dpa_base += dpa_size + skip;
+
+	cxled->state = CXL_DECODER_STATE_AUTO;
+
 	return 0;
 }
 
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 9e5df64ea6b5..59620528571a 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -446,6 +446,7 @@ bool is_endpoint_decoder(struct device *dev)
 {
 	return dev->type == &cxl_decoder_endpoint_type;
 }
+EXPORT_SYMBOL_NS_GPL(is_endpoint_decoder, CXL);
 
 bool is_root_decoder(struct device *dev)
 {
@@ -1628,6 +1629,7 @@ struct cxl_root_decoder *cxl_root_decoder_alloc(struct cxl_port *port,
 	}
 
 	cxlrd->calc_hb = calc_hb;
+	mutex_init(&cxlrd->range_lock);
 
 	cxld = &cxlsd->cxld;
 	cxld->dev.type = &cxl_decoder_root_type;
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 691605f1e120..3f6453da2c51 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -6,6 +6,7 @@
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/uuid.h>
+#include <linux/sort.h>
 #include <linux/idr.h>
 #include <cxlmem.h>
 #include <cxl.h>
@@ -524,7 +525,12 @@ static void cxl_region_iomem_release(struct cxl_region *cxlr)
 	if (device_is_registered(&cxlr->dev))
 		lockdep_assert_held_write(&cxl_region_rwsem);
 	if (p->res) {
-		remove_resource(p->res);
+		/*
+		 * Autodiscovered regions may not have been able to insert their
+		 * resource.
+		 */
+		if (p->res->parent)
+			remove_resource(p->res);
 		kfree(p->res);
 		p->res = NULL;
 	}
@@ -1105,12 +1111,35 @@ static int cxl_port_setup_targets(struct cxl_port *port,
 		return rc;
 	}
 
-	cxld->interleave_ways = iw;
-	cxld->interleave_granularity = ig;
-	cxld->hpa_range = (struct range) {
-		.start = p->res->start,
-		.end = p->res->end,
-	};
+	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
+		if (cxld->interleave_ways != iw ||
+		    cxld->interleave_granularity != ig ||
+		    cxld->hpa_range.start != p->res->start ||
+		    cxld->hpa_range.end != p->res->end ||
+		    ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) {
+			dev_err(&cxlr->dev,
+				"%s:%s %s expected iw: %d ig: %d %pr\n",
+				dev_name(port->uport), dev_name(&port->dev),
+				__func__, iw, ig, p->res);
+			dev_err(&cxlr->dev,
+				"%s:%s %s got iw: %d ig: %d state: %s %#llx:%#llx\n",
+				dev_name(port->uport), dev_name(&port->dev),
+				__func__, cxld->interleave_ways,
+				cxld->interleave_granularity,
+				(cxld->flags & CXL_DECODER_F_ENABLE) ?
+					"enabled" :
+					"disabled",
+				cxld->hpa_range.start, cxld->hpa_range.end);
+			return -ENXIO;
+		}
+	} else {
+		cxld->interleave_ways = iw;
+		cxld->interleave_granularity = ig;
+		cxld->hpa_range = (struct range) {
+			.start = p->res->start,
+			.end = p->res->end,
+		};
+	}
 	dev_dbg(&cxlr->dev, "%s:%s iw: %d ig: %d\n", dev_name(port->uport),
 		dev_name(&port->dev), iw, ig);
 add_target:
@@ -1121,7 +1150,17 @@ static int cxl_port_setup_targets(struct cxl_port *port,
 			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev), pos);
 		return -ENXIO;
 	}
-	cxlsd->target[cxl_rr->nr_targets_set] = ep->dport;
+	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
+		if (cxlsd->target[cxl_rr->nr_targets_set] != ep->dport) {
+			dev_dbg(&cxlr->dev, "%s:%s: %s expected %s at %d\n",
+				dev_name(port->uport), dev_name(&port->dev),
+				dev_name(&cxlsd->cxld.dev),
+				dev_name(ep->dport->dport),
+				cxl_rr->nr_targets_set);
+			return -ENXIO;
+		}
+	} else
+		cxlsd->target[cxl_rr->nr_targets_set] = ep->dport;
 	inc = 1;
 out_target_set:
 	cxl_rr->nr_targets_set += inc;
@@ -1163,6 +1202,13 @@ static void cxl_region_teardown_targets(struct cxl_region *cxlr)
 	struct cxl_ep *ep;
 	int i;
 
+	/*
+	 * In the auto-discovery case, skip automatic teardown since the
+	 * address space is already active
+	 */
+	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags))
+		return;
+
 	for (i = 0; i < p->nr_targets; i++) {
 		cxled = p->targets[i];
 		cxlmd = cxled_to_memdev(cxled);
@@ -1195,8 +1241,8 @@ static int cxl_region_setup_targets(struct cxl_region *cxlr)
 			iter = to_cxl_port(iter->dev.parent);
 
 		/*
-		 * Descend the topology tree programming targets while
-		 * looking for conflicts.
+		 * Descend the topology tree programming / validating
+		 * targets while looking for conflicts.
 		 */
 		for (ep = cxl_ep_load(iter, cxlmd); iter;
 		     iter = ep->next, ep = cxl_ep_load(iter, cxlmd)) {
@@ -1291,6 +1337,185 @@ static int cxl_region_attach_position(struct cxl_region *cxlr,
 	return rc;
 }
 
+static int cxl_region_attach_auto(struct cxl_region *cxlr,
+				  struct cxl_endpoint_decoder *cxled, int pos)
+{
+	struct cxl_region_params *p = &cxlr->params;
+
+	if (cxled->state != CXL_DECODER_STATE_AUTO) {
+		dev_err(&cxlr->dev,
+			"%s: unable to add decoder to autodetected region\n",
+			dev_name(&cxled->cxld.dev));
+		return -EINVAL;
+	}
+
+	if (pos >= 0) {
+		dev_dbg(&cxlr->dev, "%s: expected auto position, not %d\n",
+			dev_name(&cxled->cxld.dev), pos);
+		return -EINVAL;
+	}
+
+	if (p->nr_targets >= p->interleave_ways) {
+		dev_err(&cxlr->dev, "%s: no more target slots available\n",
+			dev_name(&cxled->cxld.dev));
+		return -ENXIO;
+	}
+
+	/*
+	 * Temporarily record the endpoint decoder into the target array. Yes,
+	 * this means that userspace can view devices in the wrong position
+	 * before the region activates, and must be careful to understand when
+	 * it might be racing region autodiscovery.
+	 */
+	pos = p->nr_targets;
+	p->targets[pos] = cxled;
+	cxled->pos = pos;
+	p->nr_targets++;
+
+	return 0;
+}
+
+static struct cxl_port *next_port(struct cxl_port *port)
+{
+	if (!port->parent_dport)
+		return NULL;
+	return port->parent_dport->port;
+}
+
+static int decoder_match_range(struct device *dev, void *data)
+{
+	struct cxl_endpoint_decoder *cxled = data;
+	struct cxl_switch_decoder *cxlsd;
+
+	if (!is_switch_decoder(dev))
+		return 0;
+
+	cxlsd = to_cxl_switch_decoder(dev);
+	return range_contains(&cxlsd->cxld.hpa_range, &cxled->cxld.hpa_range);
+}
+
+static void find_positions(const struct cxl_switch_decoder *cxlsd,
+			   const struct cxl_port *iter_a,
+			   const struct cxl_port *iter_b, int *a_pos,
+			   int *b_pos)
+{
+	int i;
+
+	for (i = 0, *a_pos = -1, *b_pos = -1; i < cxlsd->nr_targets; i++) {
+		if (cxlsd->target[i] == iter_a->parent_dport)
+			*a_pos = i;
+		else if (cxlsd->target[i] == iter_b->parent_dport)
+			*b_pos = i;
+		if (*a_pos >= 0 && *b_pos >= 0)
+			break;
+	}
+}
+
+static int cmp_decode_pos(const void *a, const void *b)
+{
+	struct cxl_endpoint_decoder *cxled_a = *(typeof(cxled_a) *)a;
+	struct cxl_endpoint_decoder *cxled_b = *(typeof(cxled_b) *)b;
+	struct cxl_memdev *cxlmd_a = cxled_to_memdev(cxled_a);
+	struct cxl_memdev *cxlmd_b = cxled_to_memdev(cxled_b);
+	struct cxl_port *port_a = cxled_to_port(cxled_a);
+	struct cxl_port *port_b = cxled_to_port(cxled_b);
+	struct cxl_port *iter_a, *iter_b, *port = NULL;
+	struct cxl_switch_decoder *cxlsd;
+	struct device *dev;
+	int a_pos, b_pos;
+	unsigned int seq;
+
+	/* Exit early if any prior sorting failed */
+	if (cxled_a->pos < 0 || cxled_b->pos < 0)
+		return 0;
+
+	/*
+	 * Walk up the hierarchy to find a shared port, find the decoder that
+	 * maps the range, compare the relative position of those dport
+	 * mappings.
+	 */
+	for (iter_a = port_a; iter_a; iter_a = next_port(iter_a)) {
+		struct cxl_port *next_a, *next_b;
+
+		next_a = next_port(iter_a);
+		if (!next_a)
+			break;
+
+		for (iter_b = port_b; iter_b; iter_b = next_port(iter_b)) {
+			next_b = next_port(iter_b);
+			if (next_a != next_b)
+				continue;
+			port = next_a;
+			break;
+		}
+
+		if (port)
+			break;
+	}
+
+	if (!port) {
+		dev_err(cxlmd_a->dev.parent,
+			"failed to find shared port with %s\n",
+			dev_name(cxlmd_b->dev.parent));
+		goto err;
+	}
+
+	dev = device_find_child(&port->dev, cxled_a, decoder_match_range);
+	if (!dev) {
+		struct range *range = &cxled_a->cxld.hpa_range;
+
+		dev_err(port->uport,
+			"failed to find decoder that maps %#llx-%#llx\n",
+			range->start, range->end);
+		goto err;
+	}
+
+	cxlsd = to_cxl_switch_decoder(dev);
+	do {
+		seq = read_seqbegin(&cxlsd->target_lock);
+		find_positions(cxlsd, iter_a, iter_b, &a_pos, &b_pos);
+	} while (read_seqretry(&cxlsd->target_lock, seq));
+
+	put_device(dev);
+
+	if (a_pos < 0 || b_pos < 0) {
+		dev_err(port->uport,
+			"failed to find shared decoder for %s and %s\n",
+			dev_name(cxlmd_a->dev.parent),
+			dev_name(cxlmd_b->dev.parent));
+		goto err;
+	}
+
+	dev_dbg(port->uport, "%s comes %s %s\n", dev_name(cxlmd_a->dev.parent),
+		a_pos - b_pos < 0 ? "before" : "after",
+		dev_name(cxlmd_b->dev.parent));
+
+	return a_pos - b_pos;
+err:
+	cxled_a->pos = -1;
+	return 0;
+}
+
+static int cxl_region_sort_targets(struct cxl_region *cxlr)
+{
+	struct cxl_region_params *p = &cxlr->params;
+	int i, rc = 0;
+
+	sort(p->targets, p->nr_targets, sizeof(p->targets[0]), cmp_decode_pos,
+	     NULL);
+
+	for (i = 0; i < p->nr_targets; i++) {
+		struct cxl_endpoint_decoder *cxled = p->targets[i];
+
+		if (cxled->pos < 0)
+			rc = -ENXIO;
+		cxled->pos = i;
+	}
+
+	dev_dbg(&cxlr->dev, "region sort %s\n", rc ? "failed" : "successful");
+	return rc;
+}
+
 static int cxl_region_attach(struct cxl_region *cxlr,
 			     struct cxl_endpoint_decoder *cxled, int pos)
 {
@@ -1354,6 +1579,50 @@ static int cxl_region_attach(struct cxl_region *cxlr,
 		return -EINVAL;
 	}
 
+	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
+		int i;
+
+		rc = cxl_region_attach_auto(cxlr, cxled, pos);
+		if (rc)
+			return rc;
+
+		/* await more targets to arrive... */
+		if (p->nr_targets < p->interleave_ways)
+			return 0;
+
+		/*
+		 * All targets are here, which implies all PCI enumeration that
+		 * affects this region has been completed. Walk the topology to
+		 * sort the devices into their relative region decode position.
+		 */
+		rc = cxl_region_sort_targets(cxlr);
+		if (rc)
+			return rc;
+
+		for (i = 0; i < p->nr_targets; i++) {
+			cxled = p->targets[i];
+			ep_port = cxled_to_port(cxled);
+			dport = cxl_find_dport_by_dev(root_port,
+						      ep_port->host_bridge);
+			rc = cxl_region_attach_position(cxlr, cxlrd, cxled,
+							dport, i);
+			if (rc)
+				return rc;
+		}
+
+		rc = cxl_region_setup_targets(cxlr);
+		if (rc)
+			return rc;
+
+		/*
+		 * If target setup succeeds in the autodiscovery case
+		 * then the region is already committed.
+		 */
+		p->state = CXL_CONFIG_COMMIT;
+
+		return 0;
+	}
+
 	rc = cxl_region_validate_position(cxlr, cxled, pos);
 	if (rc)
 		return rc;
@@ -2087,6 +2356,193 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
 	return rc;
 }
 
+static int match_decoder_by_range(struct device *dev, void *data)
+{
+	struct range *r1, *r2 = data;
+	struct cxl_root_decoder *cxlrd;
+
+	if (!is_root_decoder(dev))
+		return 0;
+
+	cxlrd = to_cxl_root_decoder(dev);
+	r1 = &cxlrd->cxlsd.cxld.hpa_range;
+	return range_contains(r1, r2);
+}
+
+static int match_region_by_range(struct device *dev, void *data)
+{
+	struct cxl_region_params *p;
+	struct cxl_region *cxlr;
+	struct range *r = data;
+	int rc = 0;
+
+	if (!is_cxl_region(dev))
+		return 0;
+
+	cxlr = to_cxl_region(dev);
+	p = &cxlr->params;
+
+	down_read(&cxl_region_rwsem);
+	if (p->res && p->res->start == r->start && p->res->end == r->end)
+		rc = 1;
+	up_read(&cxl_region_rwsem);
+
+	return rc;
+}
+
+/* Establish an empty region covering the given HPA range */
+static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
+					   struct cxl_endpoint_decoder *cxled)
+{
+	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
+	struct cxl_port *port = cxlrd_to_port(cxlrd);
+	struct range *hpa = &cxled->cxld.hpa_range;
+	struct cxl_region_params *p;
+	struct cxl_region *cxlr;
+	struct resource *res;
+	int rc;
+
+	do {
+		cxlr = __create_region(cxlrd, cxled->mode,
+				       atomic_read(&cxlrd->region_id));
+	} while (IS_ERR(cxlr) && PTR_ERR(cxlr) == -EBUSY);
+
+	if (IS_ERR(cxlr)) {
+		dev_err(cxlmd->dev.parent,
+			"%s:%s: %s failed assign region: %ld\n",
+			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
+			__func__, PTR_ERR(cxlr));
+		return cxlr;
+	}
+
+	down_write(&cxl_region_rwsem);
+	p = &cxlr->params;
+	if (p->state >= CXL_CONFIG_INTERLEAVE_ACTIVE) {
+		dev_err(cxlmd->dev.parent,
+			"%s:%s: %s autodiscovery interrupted\n",
+			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
+			__func__);
+		rc = -EBUSY;
+		goto err;
+	}
+
+	set_bit(CXL_REGION_F_AUTO, &cxlr->flags);
+
+	res = kmalloc(sizeof(*res), GFP_KERNEL);
+	if (!res) {
+		rc = -ENOMEM;
+		goto err;
+	}
+
+	*res = DEFINE_RES_MEM_NAMED(hpa->start, range_len(hpa),
+				    dev_name(&cxlr->dev));
+	rc = insert_resource(cxlrd->res, res);
+	if (rc) {
+		/*
+		 * Platform firmware may not have split resources like "System
+		 * RAM" on CXL window boundaries, see cxl_region_iomem_release()
+		 */
+		dev_warn(cxlmd->dev.parent,
+			 "%s:%s: %s %s cannot insert resource\n",
+			 dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
+			 __func__, dev_name(&cxlr->dev));
+	}
+
+	p->res = res;
+	p->interleave_ways = cxled->cxld.interleave_ways;
+	p->interleave_granularity = cxled->cxld.interleave_granularity;
+	p->state = CXL_CONFIG_INTERLEAVE_ACTIVE;
+
+	rc = sysfs_update_group(&cxlr->dev.kobj, get_cxl_region_target_group());
+	if (rc)
+		goto err;
+
+	dev_dbg(cxlmd->dev.parent, "%s:%s: %s %s res: %pr iw: %d ig: %d\n",
+		dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev), __func__,
+		dev_name(&cxlr->dev), p->res, p->interleave_ways,
+		p->interleave_granularity);
+
+	/* ...to match put_device() in cxl_add_to_region() */
+	get_device(&cxlr->dev);
+	up_write(&cxl_region_rwsem);
+
+	return cxlr;
+
+err:
+	up_write(&cxl_region_rwsem);
+	devm_release_action(port->uport, unregister_region, cxlr);
+	return ERR_PTR(rc);
+}
+
+int cxl_add_to_region(struct cxl_port *root, struct cxl_endpoint_decoder *cxled)
+{
+	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
+	struct range *hpa = &cxled->cxld.hpa_range;
+	struct cxl_decoder *cxld = &cxled->cxld;
+	struct cxl_root_decoder *cxlrd;
+	struct cxl_region_params *p;
+	struct cxl_region *cxlr;
+	bool attach = false;
+	struct device *dev;
+	int rc = 0;
+
+	dev = device_find_child(&root->dev, &cxld->hpa_range,
+				match_decoder_by_range);
+	if (!dev) {
+		dev_err(cxlmd->dev.parent,
+			"%s:%s no CXL window for range %#llx:%#llx\n",
+			dev_name(&cxlmd->dev), dev_name(&cxld->dev),
+			cxld->hpa_range.start, cxld->hpa_range.end);
+		return -ENXIO;
+	}
+
+	cxlrd = to_cxl_root_decoder(dev);
+
+	/*
+	 * Ensure that if multiple threads race to construct_region() for @hpa
+	 * one does the construction and the others add to that.
+	 */
+	mutex_lock(&cxlrd->range_lock);
+	dev = device_find_child(&cxlrd->cxlsd.cxld.dev, hpa,
+				match_region_by_range);
+	if (!dev)
+		cxlr = construct_region(cxlrd, cxled);
+	else
+		cxlr = to_cxl_region(dev);
+	mutex_unlock(&cxlrd->range_lock);
+
+	if (IS_ERR(cxlr)) {
+		rc = PTR_ERR(cxlr);
+		goto out;
+	}
+
+	attach_target(cxlr, cxled, -1, TASK_UNINTERRUPTIBLE);
+
+	down_read(&cxl_region_rwsem);
+	p = &cxlr->params;
+	attach = p->state == CXL_CONFIG_COMMIT;
+	up_read(&cxl_region_rwsem);
+
+	if (attach) {
+		int rc = device_attach(&cxlr->dev);
+
+		/*
+		 * If device_attach() fails the range may still be active via
+		 * the platform-firmware memory map, otherwise the driver for
+		 * regions is local to this file, so driver matching can't fail.
+		 */
+		if (rc < 0)
+			dev_err(&cxlr->dev, "failed to enable, range: %pr\n",
+				p->res);
+	}
+
+	put_device(&cxlr->dev);
+out:
+	put_device(&cxlrd->cxlsd.cxld.dev);
+	return rc;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_add_to_region, CXL);
+
 static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
 {
 	if (!test_bit(CXL_REGION_F_INCOHERENT, &cxlr->flags))
@@ -2111,6 +2567,15 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
 	return 0;
 }
 
+static int is_system_ram(struct resource *res, void *arg)
+{
+	struct cxl_region *cxlr = arg;
+	struct cxl_region_params *p = &cxlr->params;
+
+	dev_dbg(&cxlr->dev, "%pr has System RAM: %pr\n", p->res, res);
+	return 1;
+}
+
 static int cxl_region_probe(struct device *dev)
 {
 	struct cxl_region *cxlr = to_cxl_region(dev);
@@ -2144,6 +2609,18 @@ static int cxl_region_probe(struct device *dev)
 	switch (cxlr->mode) {
 	case CXL_DECODER_PMEM:
 		return devm_cxl_add_pmem_region(cxlr);
+	case CXL_DECODER_RAM:
+		/*
+		 * The region cannot be managed by CXL if any portion of
+		 * it is already online as 'System RAM'
+		 */
+		if (walk_iomem_res_desc(IORES_DESC_NONE,
+					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+					p->res->start, p->res->end, cxlr,
+					is_system_ram) > 0)
+			return 0;
+		dev_dbg(dev, "TODO: hookup devdax\n");
+		return 0;
 	default:
 		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
 			cxlr->mode);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index ca76879af1de..c8ee4bb8cce6 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -261,6 +261,8 @@ resource_size_t cxl_rcrb_to_component(struct device *dev,
  * cxl_decoder flags that define the type of memory / devices this
  * decoder supports as well as configuration lock status See "CXL 2.0
  * 8.2.5.12.7 CXL HDM Decoder 0 Control Register" for details.
+ * Additionally indicates whether the decoder settings were autodetected
+ * or user customized.
  */
 #define CXL_DECODER_F_RAM   BIT(0)
 #define CXL_DECODER_F_PMEM  BIT(1)
@@ -334,12 +336,22 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
 	return "mixed";
 }
 
+/*
+ * Track whether this decoder is reserved for region autodiscovery, or
+ * free for userspace provisioning.
+ */
+enum cxl_decoder_state {
+	CXL_DECODER_STATE_MANUAL,
+	CXL_DECODER_STATE_AUTO,
+};
+
 /**
  * struct cxl_endpoint_decoder - Endpoint  / SPA to DPA decoder
  * @cxld: base cxl_decoder_object
  * @dpa_res: actively claimed DPA span of this decoder
  * @skip: offset into @dpa_res where @cxld.hpa_range maps
  * @mode: which memory type / access-mode-partition this decoder targets
+ * @state: autodiscovery state
  * @pos: interleave position in @cxld.region
  */
 struct cxl_endpoint_decoder {
@@ -347,6 +359,7 @@ struct cxl_endpoint_decoder {
 	struct resource *dpa_res;
 	resource_size_t skip;
 	enum cxl_decoder_mode mode;
+	enum cxl_decoder_state state;
 	int pos;
 };
 
@@ -380,6 +393,7 @@ typedef struct cxl_dport *(*cxl_calc_hb_fn)(struct cxl_root_decoder *cxlrd,
  * @region_id: region id for next region provisioning event
  * @calc_hb: which host bridge covers the n'th position by granularity
  * @platform_data: platform specific configuration data
+ * @range_lock: sync region autodiscovery by address range
  * @cxlsd: base cxl switch decoder
  */
 struct cxl_root_decoder {
@@ -387,6 +401,7 @@ struct cxl_root_decoder {
 	atomic_t region_id;
 	cxl_calc_hb_fn calc_hb;
 	void *platform_data;
+	struct mutex range_lock;
 	struct cxl_switch_decoder cxlsd;
 };
 
@@ -436,6 +451,13 @@ struct cxl_region_params {
  */
 #define CXL_REGION_F_INCOHERENT 0
 
+/*
+ * Indicate whether this region has been assembled by autodetection or
+ * userspace assembly. Prevent endpoint decoders outside of automatic
+ * detection from being added to the region.
+ */
+#define CXL_REGION_F_AUTO 1
+
 /**
  * struct cxl_region - CXL region
  * @dev: This region's device
@@ -699,6 +721,8 @@ struct cxl_nvdimm_bridge *cxl_find_nvdimm_bridge(struct device *dev);
 #ifdef CONFIG_CXL_REGION
 bool is_cxl_pmem_region(struct device *dev);
 struct cxl_pmem_region *to_cxl_pmem_region(struct device *dev);
+int cxl_add_to_region(struct cxl_port *root,
+		      struct cxl_endpoint_decoder *cxled);
 #else
 static inline bool is_cxl_pmem_region(struct device *dev)
 {
@@ -708,6 +732,11 @@ static inline struct cxl_pmem_region *to_cxl_pmem_region(struct device *dev)
 {
 	return NULL;
 }
+static inline int cxl_add_to_region(struct cxl_port *root,
+				    struct cxl_endpoint_decoder *cxled)
+{
+	return 0;
+}
 #endif
 
 /*
diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
index a8d46a67b45e..d88518836c2d 100644
--- a/drivers/cxl/port.c
+++ b/drivers/cxl/port.c
@@ -30,6 +30,34 @@ static void schedule_detach(void *cxlmd)
 	schedule_cxl_memdev_detach(cxlmd);
 }
 
+static int discover_region(struct device *dev, void *root)
+{
+	struct cxl_endpoint_decoder *cxled;
+	int rc;
+
+	if (!is_endpoint_decoder(dev))
+		return 0;
+
+	cxled = to_cxl_endpoint_decoder(dev);
+	if ((cxled->cxld.flags & CXL_DECODER_F_ENABLE) == 0)
+		return 0;
+
+	if (cxled->state != CXL_DECODER_STATE_AUTO)
+		return 0;
+
+	/*
+	 * Region enumeration is opportunistic; if this add-event fails,
+	 * continue to the next endpoint decoder.
+	 */
+	rc = cxl_add_to_region(root, cxled);
+	if (rc)
+		dev_dbg(dev, "failed to add to region: %#llx-%#llx\n",
+			cxled->cxld.hpa_range.start, cxled->cxld.hpa_range.end);
+
+	return 0;
+}
+
+
 static int cxl_switch_port_probe(struct cxl_port *port)
 {
 	struct cxl_hdm *cxlhdm;
@@ -54,6 +82,7 @@ static int cxl_endpoint_port_probe(struct cxl_port *port)
 	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	struct cxl_hdm *cxlhdm;
+	struct cxl_port *root;
 	int rc;
 
 	cxlhdm = devm_cxl_setup_hdm(port);
@@ -78,7 +107,24 @@ static int cxl_endpoint_port_probe(struct cxl_port *port)
 		return rc;
 	}
 
-	return devm_cxl_enumerate_decoders(cxlhdm);
+	rc = devm_cxl_enumerate_decoders(cxlhdm);
+	if (rc)
+		return rc;
+
+	/*
+	 * This can't fail in practice as CXL root exit unregisters all
+	 * descendant ports and that in turn synchronizes with cxl_port_probe()
+	 */
+	root = find_cxl_root(&cxlmd->dev);
+
+	/*
+	 * Now that all endpoint decoders are successfully enumerated, try to
+	 * assemble regions from committed decoders
+	 */
+	device_for_each_child(&port->dev, root, discover_region);
+	put_device(&root->dev);
+
+	return 0;
 }
 
 static int cxl_port_probe(struct device *dev)



* [PATCH v2 14/20] tools/testing/cxl: Define a fixed volatile configuration to parse
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (12 preceding siblings ...)
  2023-02-10  9:06 ` [PATCH v2 13/20] cxl/region: Add region autodiscovery Dan Williams
@ 2023-02-10  9:06 ` Dan Williams
  2023-02-10 18:12   ` Jonathan Cameron
                     ` (2 more replies)
  2023-02-10  9:06 ` [PATCH v2 15/20] dax/hmem: Move HMAT and Soft reservation probe initcall level Dan Williams
                   ` (7 subsequent siblings)
  21 siblings, 3 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:06 UTC (permalink / raw)
  To: linux-cxl; +Cc: Fan Ni, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

Take two endpoints attached to the first switch on the first host-bridge
in the cxl_test topology and define a pre-initialized region. This is an
x2 interleave underneath an x1 CXL Window.

$ modprobe cxl_test
$ # cxl list -Ru
{
  "region":"region3",
  "resource":"0xf010000000",
  "size":"512.00 MiB (536.87 MB)",
  "interleave_ways":2,
  "interleave_granularity":4096,
  "decode_state":"commit"
}
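For reference, the decode math behind that listing (an x2 interleave at
4096-byte granularity) can be sketched with plain modulo interleave
arithmetic. This is an illustrative model only, with hypothetical helper
names; real HDM decoders encode ways/granularity per the CXL
specification and may apply XOR-based target selection instead.

```c
/* Which interleave target services a given region offset */
unsigned int hpa_to_target(unsigned long long offset,
			   unsigned int ways, unsigned int gran)
{
	return (offset / gran) % ways;
}

/* Device-local (DPA) offset for a given region offset: each device
 * holds every @ways'th @gran-sized chunk of the region. */
unsigned long long hpa_to_dpa_offset(unsigned long long offset,
				     unsigned int ways, unsigned int gran)
{
	unsigned long long chunk = offset / ((unsigned long long)gran * ways);

	return chunk * gran + offset % gran;
}
```

With ways=2 and gran=4096, region offset 4096 lands on the second
target at its DPA offset 0, and region offset 8192 is back on the first
target at DPA offset 4096.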

Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/167564541523.847146.12199636368812381475.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/core.h      |    3 -
 drivers/cxl/core/hdm.c       |    3 +
 drivers/cxl/core/port.c      |    2 +
 drivers/cxl/cxl.h            |    2 +
 drivers/cxl/cxlmem.h         |    3 +
 tools/testing/cxl/test/cxl.c |  147 +++++++++++++++++++++++++++++++++++++++---
 6 files changed, 146 insertions(+), 14 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 5eb873da5a30..479f01da6d35 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -57,9 +57,6 @@ resource_size_t cxl_dpa_size(struct cxl_endpoint_decoder *cxled);
 resource_size_t cxl_dpa_resource_start(struct cxl_endpoint_decoder *cxled);
 extern struct rw_semaphore cxl_dpa_rwsem;
 
-bool is_switch_decoder(struct device *dev);
-struct cxl_switch_decoder *to_cxl_switch_decoder(struct device *dev);
-
 int cxl_memdev_init(void);
 void cxl_memdev_exit(void);
 void cxl_mbox_init(void);
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 8c29026a4b9d..80eccae6ba9e 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -279,7 +279,7 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 	return 0;
 }
 
-static int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
+int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 				resource_size_t base, resource_size_t len,
 				resource_size_t skipped)
 {
@@ -295,6 +295,7 @@ static int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 
 	return devm_add_action_or_reset(&port->dev, cxl_dpa_release, cxled);
 }
+EXPORT_SYMBOL_NS_GPL(devm_cxl_dpa_reserve, CXL);
 
 resource_size_t cxl_dpa_size(struct cxl_endpoint_decoder *cxled)
 {
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 59620528571a..b45d2796ef35 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -458,6 +458,7 @@ bool is_switch_decoder(struct device *dev)
 {
 	return is_root_decoder(dev) || dev->type == &cxl_decoder_switch_type;
 }
+EXPORT_SYMBOL_NS_GPL(is_switch_decoder, CXL);
 
 struct cxl_decoder *to_cxl_decoder(struct device *dev)
 {
@@ -485,6 +486,7 @@ struct cxl_switch_decoder *to_cxl_switch_decoder(struct device *dev)
 		return NULL;
 	return container_of(dev, struct cxl_switch_decoder, cxld.dev);
 }
+EXPORT_SYMBOL_NS_GPL(to_cxl_switch_decoder, CXL);
 
 static void cxl_ep_release(struct cxl_ep *ep)
 {
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index c8ee4bb8cce6..2ac344235235 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -653,8 +653,10 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
 
 struct cxl_decoder *to_cxl_decoder(struct device *dev);
 struct cxl_root_decoder *to_cxl_root_decoder(struct device *dev);
+struct cxl_switch_decoder *to_cxl_switch_decoder(struct device *dev);
 struct cxl_endpoint_decoder *to_cxl_endpoint_decoder(struct device *dev);
 bool is_root_decoder(struct device *dev);
+bool is_switch_decoder(struct device *dev);
 bool is_endpoint_decoder(struct device *dev);
 struct cxl_root_decoder *cxl_root_decoder_alloc(struct cxl_port *port,
 						unsigned int nr_targets,
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index c9da3c699a21..bf7d4c5c8612 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -81,6 +81,9 @@ static inline bool is_cxl_endpoint(struct cxl_port *port)
 }
 
 struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds);
+int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
+			 resource_size_t base, resource_size_t len,
+			 resource_size_t skipped);
 
 static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port,
 					 struct cxl_memdev *cxlmd)
diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
index 920bd969c554..5342f69d70d2 100644
--- a/tools/testing/cxl/test/cxl.c
+++ b/tools/testing/cxl/test/cxl.c
@@ -703,6 +703,142 @@ static int mock_decoder_reset(struct cxl_decoder *cxld)
 	return 0;
 }
 
+static void default_mock_decoder(struct cxl_decoder *cxld)
+{
+	cxld->hpa_range = (struct range){
+		.start = 0,
+		.end = -1,
+	};
+
+	cxld->interleave_ways = 1;
+	cxld->interleave_granularity = 256;
+	cxld->target_type = CXL_DECODER_EXPANDER;
+	cxld->commit = mock_decoder_commit;
+	cxld->reset = mock_decoder_reset;
+}
+
+static int first_decoder(struct device *dev, void *data)
+{
+	struct cxl_decoder *cxld;
+
+	if (!is_switch_decoder(dev))
+		return 0;
+	cxld = to_cxl_decoder(dev);
+	if (cxld->id == 0)
+		return 1;
+	return 0;
+}
+
+static void mock_init_hdm_decoder(struct cxl_decoder *cxld)
+{
+	struct acpi_cedt_cfmws *window = mock_cfmws[0];
+	struct platform_device *pdev = NULL;
+	struct cxl_endpoint_decoder *cxled;
+	struct cxl_switch_decoder *cxlsd;
+	struct cxl_port *port, *iter;
+	const int size = SZ_512M;
+	struct cxl_memdev *cxlmd;
+	struct cxl_dport *dport;
+	struct device *dev;
+	bool hb0 = false;
+	u64 base;
+	int i;
+
+	if (is_endpoint_decoder(&cxld->dev)) {
+		cxled = to_cxl_endpoint_decoder(&cxld->dev);
+		cxlmd = cxled_to_memdev(cxled);
+		WARN_ON(!dev_is_platform(cxlmd->dev.parent));
+		pdev = to_platform_device(cxlmd->dev.parent);
+
+		/* check if the endpoint is attached to host-bridge0 */
+		port = cxled_to_port(cxled);
+		do {
+			if (port->uport == &cxl_host_bridge[0]->dev) {
+				hb0 = true;
+				break;
+			}
+			if (is_cxl_port(port->dev.parent))
+				port = to_cxl_port(port->dev.parent);
+			else
+				port = NULL;
+		} while (port);
+		port = cxled_to_port(cxled);
+	}
+
+	/*
+	 * The first decoder on each of the first 2 devices on the first switch
+	 * attached to host-bridge0 mocks a fake / static RAM region. All
+	 * other decoders are disabled by default. Given the round-robin
+	 * assignment, those devices are named cxl_mem.0 and cxl_mem.4.
+	 *
+	 * See 'cxl list -BMPu -m cxl_mem.0,cxl_mem.4'
+	 */
+	if (!hb0 || pdev->id % 4 || pdev->id > 4 || cxld->id > 0) {
+		default_mock_decoder(cxld);
+		return;
+	}
+
+	base = window->base_hpa;
+	cxld->hpa_range = (struct range) {
+		.start = base,
+		.end = base + size - 1,
+	};
+
+	cxld->interleave_ways = 2;
+	eig_to_granularity(window->granularity, &cxld->interleave_granularity);
+	cxld->target_type = CXL_DECODER_EXPANDER;
+	cxld->flags = CXL_DECODER_F_ENABLE;
+	cxled->state = CXL_DECODER_STATE_AUTO;
+	port->commit_end = cxld->id;
+	devm_cxl_dpa_reserve(cxled, 0, size / cxld->interleave_ways, 0);
+	cxld->commit = mock_decoder_commit;
+	cxld->reset = mock_decoder_reset;
+
+	/*
+	 * Now that the endpoint decoder is set up, walk up the hierarchy
+	 * and set up the switch and root-port decoders targeting @cxlmd.
+	 */
+	iter = port;
+	for (i = 0; i < 2; i++) {
+		dport = iter->parent_dport;
+		iter = dport->port;
+		dev = device_find_child(&iter->dev, NULL, first_decoder);
+		/*
+		 * Ancestor ports are guaranteed to be enumerated before
+		 * @port, and all ports have at least one decoder.
+		 */
+		if (WARN_ON(!dev))
+			continue;
+		cxlsd = to_cxl_switch_decoder(dev);
+		if (i == 0) {
+			/* put cxl_mem.4 second in the decode order */
+			if (pdev->id == 4)
+				cxlsd->target[1] = dport;
+			else
+				cxlsd->target[0] = dport;
+		} else
+			cxlsd->target[0] = dport;
+		cxld = &cxlsd->cxld;
+		cxld->target_type = CXL_DECODER_EXPANDER;
+		cxld->flags = CXL_DECODER_F_ENABLE;
+		iter->commit_end = 0;
+		/*
+		 * The switch targets 2 endpoints, while the host bridge
+		 * targets one root port
+		 */
+		if (i == 0)
+			cxld->interleave_ways = 2;
+		else
+			cxld->interleave_ways = 1;
+		cxld->interleave_granularity = 256;
+		cxld->hpa_range = (struct range) {
+			.start = base,
+			.end = base + size - 1,
+		};
+		put_device(dev);
+	}
+}
+
 static int mock_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
 {
 	struct cxl_port *port = cxlhdm->port;
@@ -748,16 +884,7 @@ static int mock_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
 			cxld = &cxled->cxld;
 		}
 
-		cxld->hpa_range = (struct range) {
-			.start = 0,
-			.end = -1,
-		};
-
-		cxld->interleave_ways = min_not_zero(target_count, 1);
-		cxld->interleave_granularity = SZ_4K;
-		cxld->target_type = CXL_DECODER_EXPANDER;
-		cxld->commit = mock_decoder_commit;
-		cxld->reset = mock_decoder_reset;
+		mock_init_hdm_decoder(cxld);
 
 		if (target_count) {
 			rc = device_for_each_child(port->uport, &ctx,


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 15/20] dax/hmem: Move HMAT and Soft reservation probe initcall level
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (13 preceding siblings ...)
  2023-02-10  9:06 ` [PATCH v2 14/20] tools/testing/cxl: Define a fixed volatile configuration to parse Dan Williams
@ 2023-02-10  9:06 ` Dan Williams
  2023-02-10 21:53   ` Dave Jiang
  2023-02-11  0:40   ` Verma, Vishal L
  2023-02-10  9:06 ` [PATCH v2 16/20] dax/hmem: Drop unnecessary dax_hmem_remove() Dan Williams
                   ` (6 subsequent siblings)
  21 siblings, 2 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:06 UTC (permalink / raw)
  To: linux-cxl; +Cc: Fan Ni, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

In preparation for moving more filtering of "hmem" ranges into the
dax_hmem.ko module, update the initcall levels. HMAT range registration
moves to subsys_initcall() to be done before Soft Reservation probing,
and Soft Reservation probing is moved to device_initcall() to be done
before dax_hmem.ko initialization if it is built-in.

Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/167564542109.847146.10113972881782419363.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/numa/hmat.c  |    2 +-
 drivers/dax/hmem/Makefile |    3 ++-
 drivers/dax/hmem/device.c |    2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index 605a0c7053be..ff24282301ab 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -869,4 +869,4 @@ static __init int hmat_init(void)
 	acpi_put_table(tbl);
 	return 0;
 }
-device_initcall(hmat_init);
+subsys_initcall(hmat_init);
diff --git a/drivers/dax/hmem/Makefile b/drivers/dax/hmem/Makefile
index 57377b4c3d47..d4c4cd6bccd7 100644
--- a/drivers/dax/hmem/Makefile
+++ b/drivers/dax/hmem/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
-obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
+# device_hmem.o deliberately precedes dax_hmem.o for initcall ordering
 obj-$(CONFIG_DEV_DAX_HMEM_DEVICES) += device_hmem.o
+obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
 
 device_hmem-y := device.o
 dax_hmem-y := hmem.o
diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
index 903325aac991..20749c7fab81 100644
--- a/drivers/dax/hmem/device.c
+++ b/drivers/dax/hmem/device.c
@@ -104,4 +104,4 @@ static __init int hmem_init(void)
  * As this is a fallback for address ranges unclaimed by the ACPI HMAT
  * parsing it must be at an initcall level greater than hmat_init().
  */
-late_initcall(hmem_init);
+device_initcall(hmem_init);



* [PATCH v2 16/20] dax/hmem: Drop unnecessary dax_hmem_remove()
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (14 preceding siblings ...)
  2023-02-10  9:06 ` [PATCH v2 15/20] dax/hmem: Move HMAT and Soft reservation probe initcall level Dan Williams
@ 2023-02-10  9:06 ` Dan Williams
  2023-02-10 21:59   ` Dave Jiang
  2023-02-11  0:41   ` Verma, Vishal L
  2023-02-10  9:07 ` [PATCH v2 17/20] dax/hmem: Convey the dax range via memregion_info() Dan Williams
                   ` (5 subsequent siblings)
  21 siblings, 2 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:06 UTC (permalink / raw)
  To: linux-cxl
  Cc: Jonathan Cameron, Gregory Price, Fan Ni, vishal.l.verma,
	dave.hansen, linux-mm, linux-acpi

Empty driver remove callbacks can just be elided.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Gregory Price <gregory.price@memverge.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/167564542679.847146.17174404738816053065.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/hmem/hmem.c |    7 -------
 1 file changed, 7 deletions(-)

diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 1bf040dbc834..c7351e0dc8ff 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -44,15 +44,8 @@ static int dax_hmem_probe(struct platform_device *pdev)
 	return 0;
 }
 
-static int dax_hmem_remove(struct platform_device *pdev)
-{
-	/* devm handles teardown */
-	return 0;
-}
-
 static struct platform_driver dax_hmem_driver = {
 	.probe = dax_hmem_probe,
-	.remove = dax_hmem_remove,
 	.driver = {
 		.name = "hmem",
 	},



* [PATCH v2 17/20] dax/hmem: Convey the dax range via memregion_info()
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (15 preceding siblings ...)
  2023-02-10  9:06 ` [PATCH v2 16/20] dax/hmem: Drop unnecessary dax_hmem_remove() Dan Williams
@ 2023-02-10  9:07 ` Dan Williams
  2023-02-10 22:03   ` Dave Jiang
  2023-02-11  4:25   ` Verma, Vishal L
  2023-02-10  9:07 ` [PATCH v2 18/20] dax/hmem: Move hmem device registration to dax_hmem.ko Dan Williams
                   ` (4 subsequent siblings)
  21 siblings, 2 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:07 UTC (permalink / raw)
  To: linux-cxl
  Cc: Jonathan Cameron, Fan Ni, vishal.l.verma, dave.hansen, linux-mm,
	linux-acpi

In preparation for hmem platform devices to be unregistered, stop using
platform_device_add_resources() to convey the address range. The
platform_device_add_resources() API causes an existing "Soft Reserved"
iomem resource to be re-parented under an inserted platform device
resource. When that platform device is deleted it removes the platform
device resource and all children.

Instead, it is sufficient to convey just the address range and let
request_mem_region() insert resources to indicate the devices active in
the range. This allows the "Soft Reserved" resource to be re-enumerated
upon the next probe event.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/167564543303.847146.11045895213318648441.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/hmem/device.c |   37 ++++++++++++++-----------------------
 drivers/dax/hmem/hmem.c   |   14 +++-----------
 include/linux/memregion.h |    2 ++
 3 files changed, 19 insertions(+), 34 deletions(-)

diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
index 20749c7fab81..b1b339bccfe5 100644
--- a/drivers/dax/hmem/device.c
+++ b/drivers/dax/hmem/device.c
@@ -15,15 +15,8 @@ static struct resource hmem_active = {
 	.flags = IORESOURCE_MEM,
 };
 
-void hmem_register_device(int target_nid, struct resource *r)
+void hmem_register_device(int target_nid, struct resource *res)
 {
-	/* define a clean / non-busy resource for the platform device */
-	struct resource res = {
-		.start = r->start,
-		.end = r->end,
-		.flags = IORESOURCE_MEM,
-		.desc = IORES_DESC_SOFT_RESERVED,
-	};
 	struct platform_device *pdev;
 	struct memregion_info info;
 	int rc, id;
@@ -31,55 +24,53 @@ void hmem_register_device(int target_nid, struct resource *r)
 	if (nohmem)
 		return;
 
-	rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
-			IORES_DESC_SOFT_RESERVED);
+	rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
+			       IORES_DESC_SOFT_RESERVED);
 	if (rc != REGION_INTERSECTS)
 		return;
 
 	id = memregion_alloc(GFP_KERNEL);
 	if (id < 0) {
-		pr_err("memregion allocation failure for %pr\n", &res);
+		pr_err("memregion allocation failure for %pr\n", res);
 		return;
 	}
 
 	pdev = platform_device_alloc("hmem", id);
 	if (!pdev) {
-		pr_err("hmem device allocation failure for %pr\n", &res);
+		pr_err("hmem device allocation failure for %pr\n", res);
 		goto out_pdev;
 	}
 
-	if (!__request_region(&hmem_active, res.start, resource_size(&res),
+	if (!__request_region(&hmem_active, res->start, resource_size(res),
 			      dev_name(&pdev->dev), 0)) {
-		dev_dbg(&pdev->dev, "hmem range %pr already active\n", &res);
+		dev_dbg(&pdev->dev, "hmem range %pr already active\n", res);
 		goto out_active;
 	}
 
 	pdev->dev.numa_node = numa_map_to_online_node(target_nid);
 	info = (struct memregion_info) {
 		.target_node = target_nid,
+		.range = {
+			.start = res->start,
+			.end = res->end,
+		},
 	};
 	rc = platform_device_add_data(pdev, &info, sizeof(info));
 	if (rc < 0) {
-		pr_err("hmem memregion_info allocation failure for %pr\n", &res);
-		goto out_resource;
-	}
-
-	rc = platform_device_add_resources(pdev, &res, 1);
-	if (rc < 0) {
-		pr_err("hmem resource allocation failure for %pr\n", &res);
+		pr_err("hmem memregion_info allocation failure for %pr\n", res);
 		goto out_resource;
 	}
 
 	rc = platform_device_add(pdev);
 	if (rc < 0) {
-		dev_err(&pdev->dev, "device add failed for %pr\n", &res);
+		dev_err(&pdev->dev, "device add failed for %pr\n", res);
 		goto out_resource;
 	}
 
 	return;
 
 out_resource:
-	__release_region(&hmem_active, res.start, resource_size(&res));
+	__release_region(&hmem_active, res->start, resource_size(res));
 out_active:
 	platform_device_put(pdev);
 out_pdev:
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index c7351e0dc8ff..5025a8c9850b 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -15,25 +15,17 @@ static int dax_hmem_probe(struct platform_device *pdev)
 	struct memregion_info *mri;
 	struct dev_dax_data data;
 	struct dev_dax *dev_dax;
-	struct resource *res;
-	struct range range;
-
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	if (!res)
-		return -ENOMEM;
 
 	mri = dev->platform_data;
-	range.start = res->start;
-	range.end = res->end;
-	dax_region = alloc_dax_region(dev, pdev->id, &range, mri->target_node,
-			PMD_SIZE, 0);
+	dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
+				      mri->target_node, PMD_SIZE, 0);
 	if (!dax_region)
 		return -ENOMEM;
 
 	data = (struct dev_dax_data) {
 		.dax_region = dax_region,
 		.id = -1,
-		.size = region_idle ? 0 : resource_size(res),
+		.size = region_idle ? 0 : range_len(&mri->range),
 	};
 	dev_dax = devm_create_dev_dax(&data);
 	if (IS_ERR(dev_dax))
diff --git a/include/linux/memregion.h b/include/linux/memregion.h
index bf83363807ac..c01321467789 100644
--- a/include/linux/memregion.h
+++ b/include/linux/memregion.h
@@ -3,10 +3,12 @@
 #define _MEMREGION_H_
 #include <linux/types.h>
 #include <linux/errno.h>
+#include <linux/range.h>
 #include <linux/bug.h>
 
 struct memregion_info {
 	int target_node;
+	struct range range;
 };
 
 #ifdef CONFIG_MEMREGION



* [PATCH v2 18/20] dax/hmem: Move hmem device registration to dax_hmem.ko
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (16 preceding siblings ...)
  2023-02-10  9:07 ` [PATCH v2 17/20] dax/hmem: Convey the dax range via memregion_info() Dan Williams
@ 2023-02-10  9:07 ` Dan Williams
  2023-02-10 18:25   ` Jonathan Cameron
                     ` (2 more replies)
  2023-02-10  9:07 ` [PATCH v2 19/20] dax: Assign RAM regions to memory-hotplug by default Dan Williams
                   ` (3 subsequent siblings)
  21 siblings, 3 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:07 UTC (permalink / raw)
  To: linux-cxl; +Cc: Fan Ni, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

In preparation for the CXL region driver to take over the responsibility
of registering device-dax instances for CXL regions, move the
registration of "hmem" devices to dax_hmem.ko.

Previously, the built-in component of this enabling
(drivers/dax/hmem/device.o) would register platform devices for each
address range and trigger the dax_hmem.ko module to load and attach
device-dax instances to those devices. Now, the ranges are still
collected from HMAT parsing and EFI memory map walking, but device
creation is deferred. A new "hmem_platform" device is created instead,
which triggers dax_hmem.ko to load and register the platform devices.

Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/167564543923.847146.9030380223622044744.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/numa/hmat.c  |    2 -
 drivers/dax/Kconfig       |    2 -
 drivers/dax/hmem/device.c |   91 +++++++++++++++++++--------------------
 drivers/dax/hmem/hmem.c   |  105 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dax.h       |    7 ++-
 5 files changed, 155 insertions(+), 52 deletions(-)

diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index ff24282301ab..bba268ecd802 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -718,7 +718,7 @@ static void hmat_register_target_devices(struct memory_target *target)
 	for (res = target->memregions.child; res; res = res->sibling) {
 		int target_nid = pxm_to_node(target->memory_pxm);
 
-		hmem_register_device(target_nid, res);
+		hmem_register_resource(target_nid, res);
 	}
 }
 
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 5fdf269a822e..d13c889c2a64 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -46,7 +46,7 @@ config DEV_DAX_HMEM
 	  Say M if unsure.
 
 config DEV_DAX_HMEM_DEVICES
-	depends on DEV_DAX_HMEM && DAX=y
+	depends on DEV_DAX_HMEM && DAX
 	def_bool y
 
 config DEV_DAX_KMEM
diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
index b1b339bccfe5..f9e1a76a04a9 100644
--- a/drivers/dax/hmem/device.c
+++ b/drivers/dax/hmem/device.c
@@ -8,6 +8,8 @@
 static bool nohmem;
 module_param_named(disable, nohmem, bool, 0444);
 
+static bool platform_initialized;
+static DEFINE_MUTEX(hmem_resource_lock);
 static struct resource hmem_active = {
 	.name = "HMEM devices",
 	.start = 0,
@@ -15,71 +17,66 @@ static struct resource hmem_active = {
 	.flags = IORESOURCE_MEM,
 };
 
-void hmem_register_device(int target_nid, struct resource *res)
+int walk_hmem_resources(struct device *host, walk_hmem_fn fn)
+{
+	struct resource *res;
+	int rc = 0;
+
+	mutex_lock(&hmem_resource_lock);
+	for (res = hmem_active.child; res; res = res->sibling) {
+		rc = fn(host, (int) res->desc, res);
+		if (rc)
+			break;
+	}
+	mutex_unlock(&hmem_resource_lock);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(walk_hmem_resources);
+
+static void __hmem_register_resource(int target_nid, struct resource *res)
 {
 	struct platform_device *pdev;
-	struct memregion_info info;
-	int rc, id;
+	struct resource *new;
+	int rc;
 
-	if (nohmem)
+	new = __request_region(&hmem_active, res->start, resource_size(res), "",
+			       0);
+	if (!new) {
+		pr_debug("hmem range %pr already active\n", res);
 		return;
+	}
 
-	rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
-			       IORES_DESC_SOFT_RESERVED);
-	if (rc != REGION_INTERSECTS)
-		return;
+	new->desc = target_nid;
 
-	id = memregion_alloc(GFP_KERNEL);
-	if (id < 0) {
-		pr_err("memregion allocation failure for %pr\n", res);
+	if (platform_initialized)
 		return;
-	}
 
-	pdev = platform_device_alloc("hmem", id);
+	pdev = platform_device_alloc("hmem_platform", 0);
 	if (!pdev) {
-		pr_err("hmem device allocation failure for %pr\n", res);
-		goto out_pdev;
-	}
-
-	if (!__request_region(&hmem_active, res->start, resource_size(res),
-			      dev_name(&pdev->dev), 0)) {
-		dev_dbg(&pdev->dev, "hmem range %pr already active\n", res);
-		goto out_active;
-	}
-
-	pdev->dev.numa_node = numa_map_to_online_node(target_nid);
-	info = (struct memregion_info) {
-		.target_node = target_nid,
-		.range = {
-			.start = res->start,
-			.end = res->end,
-		},
-	};
-	rc = platform_device_add_data(pdev, &info, sizeof(info));
-	if (rc < 0) {
-		pr_err("hmem memregion_info allocation failure for %pr\n", res);
-		goto out_resource;
+		pr_err_once("failed to register device-dax hmem_platform device\n");
+		return;
 	}
 
 	rc = platform_device_add(pdev);
-	if (rc < 0) {
-		dev_err(&pdev->dev, "device add failed for %pr\n", res);
-		goto out_resource;
-	}
+	if (rc)
+		platform_device_put(pdev);
+	else
+		platform_initialized = true;
+}
 
-	return;
+void hmem_register_resource(int target_nid, struct resource *res)
+{
+	if (nohmem)
+		return;
 
-out_resource:
-	__release_region(&hmem_active, res->start, resource_size(res));
-out_active:
-	platform_device_put(pdev);
-out_pdev:
-	memregion_free(id);
+	mutex_lock(&hmem_resource_lock);
+	__hmem_register_resource(target_nid, res);
+	mutex_unlock(&hmem_resource_lock);
 }
 
 static __init int hmem_register_one(struct resource *res, void *data)
 {
-	hmem_register_device(phys_to_target_node(res->start), res);
+	hmem_register_resource(phys_to_target_node(res->start), res);
 
 	return 0;
 }
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 5025a8c9850b..e7bdff3132fa 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -3,6 +3,7 @@
 #include <linux/memregion.h>
 #include <linux/module.h>
 #include <linux/pfn_t.h>
+#include <linux/dax.h>
 #include "../bus.h"
 
 static bool region_idle;
@@ -43,8 +44,110 @@ static struct platform_driver dax_hmem_driver = {
 	},
 };
 
-module_platform_driver(dax_hmem_driver);
+static void release_memregion(void *data)
+{
+	memregion_free((long) data);
+}
+
+static void release_hmem(void *pdev)
+{
+	platform_device_unregister(pdev);
+}
+
+static int hmem_register_device(struct device *host, int target_nid,
+				const struct resource *res)
+{
+	struct platform_device *pdev;
+	struct memregion_info info;
+	long id;
+	int rc;
+
+	rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
+			       IORES_DESC_SOFT_RESERVED);
+	if (rc != REGION_INTERSECTS)
+		return 0;
+
+	id = memregion_alloc(GFP_KERNEL);
+	if (id < 0) {
+		dev_err(host, "memregion allocation failure for %pr\n", res);
+		return -ENOMEM;
+	}
+	rc = devm_add_action_or_reset(host, release_memregion, (void *) id);
+	if (rc)
+		return rc;
+
+	pdev = platform_device_alloc("hmem", id);
+	if (!pdev) {
+		dev_err(host, "device allocation failure for %pr\n", res);
+		return -ENOMEM;
+	}
+
+	pdev->dev.numa_node = numa_map_to_online_node(target_nid);
+	info = (struct memregion_info) {
+		.target_node = target_nid,
+		.range = {
+			.start = res->start,
+			.end = res->end,
+		},
+	};
+	rc = platform_device_add_data(pdev, &info, sizeof(info));
+	if (rc < 0) {
+		dev_err(host, "memregion_info allocation failure for %pr\n",
+		       res);
+		goto out_put;
+	}
+
+	rc = platform_device_add(pdev);
+	if (rc < 0) {
+		dev_err(host, "%s add failed for %pr\n", dev_name(&pdev->dev),
+			res);
+		goto out_put;
+	}
+
+	return devm_add_action_or_reset(host, release_hmem, pdev);
+
+out_put:
+	platform_device_put(pdev);
+	return rc;
+}
+
+static int dax_hmem_platform_probe(struct platform_device *pdev)
+{
+	return walk_hmem_resources(&pdev->dev, hmem_register_device);
+}
+
+static struct platform_driver dax_hmem_platform_driver = {
+	.probe = dax_hmem_platform_probe,
+	.driver = {
+		.name = "hmem_platform",
+	},
+};
+
+static __init int dax_hmem_init(void)
+{
+	int rc;
+
+	rc = platform_driver_register(&dax_hmem_platform_driver);
+	if (rc)
+		return rc;
+
+	rc = platform_driver_register(&dax_hmem_driver);
+	if (rc)
+		platform_driver_unregister(&dax_hmem_platform_driver);
+
+	return rc;
+}
+
+static __exit void dax_hmem_exit(void)
+{
+	platform_driver_unregister(&dax_hmem_driver);
+	platform_driver_unregister(&dax_hmem_platform_driver);
+}
+
+module_init(dax_hmem_init);
+module_exit(dax_hmem_exit);
 
 MODULE_ALIAS("platform:hmem*");
+MODULE_ALIAS("platform:hmem_platform*");
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR("Intel Corporation");
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 2b5ecb591059..bf6258472e49 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -262,11 +262,14 @@ static inline bool dax_mapping(struct address_space *mapping)
 }
 
 #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
-void hmem_register_device(int target_nid, struct resource *r);
+void hmem_register_resource(int target_nid, struct resource *r);
 #else
-static inline void hmem_register_device(int target_nid, struct resource *r)
+static inline void hmem_register_resource(int target_nid, struct resource *r)
 {
 }
 #endif
 
+typedef int (*walk_hmem_fn)(struct device *dev, int target_nid,
+			    const struct resource *res);
+int walk_hmem_resources(struct device *dev, walk_hmem_fn fn);
 #endif



* [PATCH v2 19/20] dax: Assign RAM regions to memory-hotplug by default
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (17 preceding siblings ...)
  2023-02-10  9:07 ` [PATCH v2 18/20] dax/hmem: Move hmem device registration to dax_hmem.ko Dan Williams
@ 2023-02-10  9:07 ` Dan Williams
  2023-02-10 22:19   ` Dave Jiang
  2023-02-11  5:57   ` Verma, Vishal L
  2023-02-10  9:07 ` [PATCH v2 20/20] cxl/dax: Create dax devices for CXL RAM regions Dan Williams
                   ` (2 subsequent siblings)
  21 siblings, 2 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:07 UTC (permalink / raw)
  To: linux-cxl
  Cc: Michal Hocko, David Hildenbrand, Dave Hansen, Gregory Price,
	Fan Ni, vishal.l.verma, linux-mm, linux-acpi

The default mode for device-dax instances is backwards for RAM regions,
as evidenced by the fact that it tends to catch end users by surprise:
"Where is my memory?". Recall that platforms are increasingly shipping
with performance-differentiated memory pools beyond typical DRAM and
NUMA effects. This includes HBM (high-bandwidth-memory) and CXL (dynamic
interleave, varied media types, and future fabric attached
possibilities).

For this reason the EFI_MEMORY_SP (EFI Special Purpose Memory => Linux
'Soft Reserved') attribute is expected to be applied to all memory-pools
that are not the general purpose pool. This designation gives an
Operating System a chance to defer usage of a memory pool until later in
the boot process where its performance properties can be interrogated
and administrator policy can be applied.

'Soft Reserved' memory can range from memory too limited and precious
to be part of the general purpose pool (HBM), to memory too slow to
host hot kernel data structures (some PMEM media), and anything in
between. However, in the absence of an explicit policy, the memory
should at least be made usable by default. The current device-dax
default instead hides all non-general-purpose memory behind a device
interface.

The expectation is that the distribution of users that want the memory
online by default vs device-dedicated-access by default follows the
Pareto principle. A small number of enlightened users may want to do
userspace memory management through a device, but general users just
want the kernel to make the memory available with an option to get more
advanced later.

Arrange for all device-dax instances not backed by PMEM to default to
attaching to the dax_kmem driver. From there the baseline memory hotplug
policy (CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE / memhp_default_state=)
gates whether the memory comes online or stays offline. Where, if it
stays offline, it can be reliably converted back to device-mode where it
can be partitioned, or fronted by a userspace allocator.

So, if someone wants device-dax instances for their 'Soft Reserved'
memory:

1/ Build a kernel with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n or boot
   with memhp_default_state=offline, or roll the dice and hope that the
   kernel has not pinned a page in that memory before step 2.

2/ Write a udev rule to convert the target dax device(s) from
   'system-ram' mode to 'devdax' mode:

   daxctl reconfigure-device $dax -m devdax -f

Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Gregory Price <gregory.price@memverge.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/167564544513.847146.4645646177864365755.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/Kconfig     |    2 +-
 drivers/dax/bus.c       |   53 ++++++++++++++++++++---------------------------
 drivers/dax/bus.h       |   12 +++++++++--
 drivers/dax/device.c    |    3 +--
 drivers/dax/hmem/hmem.c |   12 ++++++++++-
 drivers/dax/kmem.c      |    1 +
 6 files changed, 46 insertions(+), 37 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index d13c889c2a64..1163eb62e5f6 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -50,7 +50,7 @@ config DEV_DAX_HMEM_DEVICES
 	def_bool y
 
 config DEV_DAX_KMEM
-	tristate "KMEM DAX: volatile-use of persistent memory"
+	tristate "KMEM DAX: map dax-devices as System-RAM"
 	default DEV_DAX
 	depends on DEV_DAX
 	depends on MEMORY_HOTPLUG # for add_memory() and friends
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1dad813ee4a6..012d576004e9 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -56,6 +56,25 @@ static int dax_match_id(struct dax_device_driver *dax_drv, struct device *dev)
 	return match;
 }
 
+static int dax_match_type(struct dax_device_driver *dax_drv, struct device *dev)
+{
+	enum dax_driver_type type = DAXDRV_DEVICE_TYPE;
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+
+	if (dev_dax->region->res.flags & IORESOURCE_DAX_KMEM)
+		type = DAXDRV_KMEM_TYPE;
+
+	if (dax_drv->type == type)
+		return 1;
+
+	/* default to device mode if dax_kmem is disabled */
+	if (dax_drv->type == DAXDRV_DEVICE_TYPE &&
+	    !IS_ENABLED(CONFIG_DEV_DAX_KMEM))
+		return 1;
+
+	return 0;
+}
+
 enum id_action {
 	ID_REMOVE,
 	ID_ADD,
@@ -216,14 +235,9 @@ static int dax_bus_match(struct device *dev, struct device_driver *drv)
 {
 	struct dax_device_driver *dax_drv = to_dax_drv(drv);
 
-	/*
-	 * All but the 'device-dax' driver, which has 'match_always'
-	 * set, requires an exact id match.
-	 */
-	if (dax_drv->match_always)
+	if (dax_match_id(dax_drv, dev))
 		return 1;
-
-	return dax_match_id(dax_drv, dev);
+	return dax_match_type(dax_drv, dev);
 }
 
 /*
@@ -1413,13 +1427,10 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 }
 EXPORT_SYMBOL_GPL(devm_create_dev_dax);
 
-static int match_always_count;
-
 int __dax_driver_register(struct dax_device_driver *dax_drv,
 		struct module *module, const char *mod_name)
 {
 	struct device_driver *drv = &dax_drv->drv;
-	int rc = 0;
 
 	/*
 	 * dax_bus_probe() calls dax_drv->probe() unconditionally.
@@ -1434,26 +1445,7 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
 	drv->mod_name = mod_name;
 	drv->bus = &dax_bus_type;
 
-	/* there can only be one default driver */
-	mutex_lock(&dax_bus_lock);
-	match_always_count += dax_drv->match_always;
-	if (match_always_count > 1) {
-		match_always_count--;
-		WARN_ON(1);
-		rc = -EINVAL;
-	}
-	mutex_unlock(&dax_bus_lock);
-	if (rc)
-		return rc;
-
-	rc = driver_register(drv);
-	if (rc && dax_drv->match_always) {
-		mutex_lock(&dax_bus_lock);
-		match_always_count -= dax_drv->match_always;
-		mutex_unlock(&dax_bus_lock);
-	}
-
-	return rc;
+	return driver_register(drv);
 }
 EXPORT_SYMBOL_GPL(__dax_driver_register);
 
@@ -1463,7 +1455,6 @@ void dax_driver_unregister(struct dax_device_driver *dax_drv)
 	struct dax_id *dax_id, *_id;
 
 	mutex_lock(&dax_bus_lock);
-	match_always_count -= dax_drv->match_always;
 	list_for_each_entry_safe(dax_id, _id, &dax_drv->ids, list) {
 		list_del(&dax_id->list);
 		kfree(dax_id);
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index fbb940293d6d..8cd79ab34292 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -11,7 +11,10 @@ struct dax_device;
 struct dax_region;
 void dax_region_put(struct dax_region *dax_region);
 
-#define IORESOURCE_DAX_STATIC (1UL << 0)
+/* dax bus specific ioresource flags */
+#define IORESOURCE_DAX_STATIC BIT(0)
+#define IORESOURCE_DAX_KMEM BIT(1)
+
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 		struct range *range, int target_node, unsigned int align,
 		unsigned long flags);
@@ -25,10 +28,15 @@ struct dev_dax_data {
 
 struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data);
 
+enum dax_driver_type {
+	DAXDRV_KMEM_TYPE,
+	DAXDRV_DEVICE_TYPE,
+};
+
 struct dax_device_driver {
 	struct device_driver drv;
 	struct list_head ids;
-	int match_always;
+	enum dax_driver_type type;
 	int (*probe)(struct dev_dax *dev);
 	void (*remove)(struct dev_dax *dev);
 };
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 5494d745ced5..ecdff79e31f2 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -475,8 +475,7 @@ EXPORT_SYMBOL_GPL(dev_dax_probe);
 
 static struct dax_device_driver device_dax_driver = {
 	.probe = dev_dax_probe,
-	/* all probe actions are unwound by devm, so .remove isn't necessary */
-	.match_always = 1,
+	.type = DAXDRV_DEVICE_TYPE,
 };
 
 static int __init dax_init(void)
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index e7bdff3132fa..5ec08f9f8a57 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -11,15 +11,25 @@ module_param_named(region_idle, region_idle, bool, 0644);
 
 static int dax_hmem_probe(struct platform_device *pdev)
 {
+	unsigned long flags = IORESOURCE_DAX_KMEM;
 	struct device *dev = &pdev->dev;
 	struct dax_region *dax_region;
 	struct memregion_info *mri;
 	struct dev_dax_data data;
 	struct dev_dax *dev_dax;
 
+	/*
+	 * @region_idle == true indicates that an administrative agent
+	 * wants to manipulate the range partitioning before the devices
+	 * are created, so do not send them to the dax_kmem driver by
+	 * default.
+	 */
+	if (region_idle)
+		flags = 0;
+
 	mri = dev->platform_data;
 	dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
-				      mri->target_node, PMD_SIZE, 0);
+				      mri->target_node, PMD_SIZE, flags);
 	if (!dax_region)
 		return -ENOMEM;
 
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 4852a2dbdb27..918d01d3fbaa 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -239,6 +239,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 static struct dax_device_driver device_dax_kmem_driver = {
 	.probe = dev_dax_kmem_probe,
 	.remove = dev_dax_kmem_remove,
+	.type = DAXDRV_KMEM_TYPE,
 };
 
 static int __init dax_kmem_init(void)


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 20/20] cxl/dax: Create dax devices for CXL RAM regions
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (18 preceding siblings ...)
  2023-02-10  9:07 ` [PATCH v2 19/20] dax: Assign RAM regions to memory-hotplug by default Dan Williams
@ 2023-02-10  9:07 ` Dan Williams
  2023-02-10 18:38   ` Jonathan Cameron
  2023-02-10 22:42   ` Dave Jiang
  2023-02-10 17:53 ` [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
  2023-02-13 18:22 ` Gregory Price
  21 siblings, 2 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10  9:07 UTC (permalink / raw)
  To: linux-cxl; +Cc: Fan Ni, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

While platform firmware takes some responsibility for mapping the RAM
capacity of CXL devices present at boot, the OS is responsible for
mapping the remainder and hot-added devices. Platform firmware is also
responsible for identifying the platform's general-purpose memory pool,
typically DDR-attached DRAM, and for arranging for the remainder to be
'Soft Reserved'. That reservation allows the CXL subsystem to route the
memory to core-mm via memory-hotplug (dax_kmem), or leave it for
dedicated access (device-dax).

The new 'struct cxl_dax_region' object allows a CXL memory resource
(region) to be published, and also allows udev and module policy to act
on that event. It also prevents cxl_core.ko from having a module-loading
dependency on any drivers/dax/ modules.
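As a usage sketch of where this lands (device and region names are
hypothetical and vary per system; the daxctl flow shown is the
pre-existing dax_kmem path, not something this patch adds):

```sh
# The cxl:t8* modalias normally auto-loads the driver; shown explicitly:
modprobe dax_cxl

# Inspect the dax_region published for a committed CXL RAM region:
daxctl list -r 0

# Either keep dedicated device-dax access, or route the capacity to
# core-mm as "System RAM" via dax_kmem:
daxctl reconfigure-device --mode=system-ram dax0.0
```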

Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/167564545116.847146.4741351262959589920.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 MAINTAINERS               |    1 
 drivers/cxl/acpi.c        |    3 +
 drivers/cxl/core/core.h   |    3 +
 drivers/cxl/core/port.c   |    4 +-
 drivers/cxl/core/region.c |  108 ++++++++++++++++++++++++++++++++++++++++++++-
 drivers/cxl/cxl.h         |   12 +++++
 drivers/dax/Kconfig       |   13 +++++
 drivers/dax/Makefile      |    2 +
 drivers/dax/cxl.c         |   53 ++++++++++++++++++++++
 drivers/dax/hmem/hmem.c   |   14 ++++++
 10 files changed, 209 insertions(+), 4 deletions(-)
 create mode 100644 drivers/dax/cxl.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 7f86d02cb427..73a9f3401e0e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6035,6 +6035,7 @@ M:	Dan Williams <dan.j.williams@intel.com>
 M:	Vishal Verma <vishal.l.verma@intel.com>
 M:	Dave Jiang <dave.jiang@intel.com>
 L:	nvdimm@lists.linux.dev
+L:	linux-cxl@vger.kernel.org
 S:	Supported
 F:	drivers/dax/
 
diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index ad0849af42d7..8ebb9a74790d 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -731,7 +731,8 @@ static void __exit cxl_acpi_exit(void)
 	cxl_bus_drain();
 }
 
-module_init(cxl_acpi_init);
+/* load before dax_hmem sees 'Soft Reserved' CXL ranges */
+subsys_initcall(cxl_acpi_init);
 module_exit(cxl_acpi_exit);
 MODULE_LICENSE("GPL v2");
 MODULE_IMPORT_NS(CXL);
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 479f01da6d35..cde475e13216 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -15,12 +15,14 @@ extern struct device_attribute dev_attr_create_ram_region;
 extern struct device_attribute dev_attr_delete_region;
 extern struct device_attribute dev_attr_region;
 extern const struct device_type cxl_pmem_region_type;
+extern const struct device_type cxl_dax_region_type;
 extern const struct device_type cxl_region_type;
 void cxl_decoder_kill_region(struct cxl_endpoint_decoder *cxled);
 #define CXL_REGION_ATTR(x) (&dev_attr_##x.attr)
 #define CXL_REGION_TYPE(x) (&cxl_region_type)
 #define SET_CXL_REGION_ATTR(x) (&dev_attr_##x.attr),
 #define CXL_PMEM_REGION_TYPE(x) (&cxl_pmem_region_type)
+#define CXL_DAX_REGION_TYPE(x) (&cxl_dax_region_type)
 int cxl_region_init(void);
 void cxl_region_exit(void);
 #else
@@ -38,6 +40,7 @@ static inline void cxl_region_exit(void)
 #define CXL_REGION_TYPE(x) NULL
 #define SET_CXL_REGION_ATTR(x)
 #define CXL_PMEM_REGION_TYPE(x) NULL
+#define CXL_DAX_REGION_TYPE(x) NULL
 #endif
 
 struct cxl_send_command;
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index b45d2796ef35..0bb7a5ff724b 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -46,6 +46,8 @@ static int cxl_device_id(struct device *dev)
 		return CXL_DEVICE_NVDIMM;
 	if (dev->type == CXL_PMEM_REGION_TYPE())
 		return CXL_DEVICE_PMEM_REGION;
+	if (dev->type == CXL_DAX_REGION_TYPE())
+		return CXL_DEVICE_DAX_REGION;
 	if (is_cxl_port(dev)) {
 		if (is_cxl_root(to_cxl_port(dev)))
 			return CXL_DEVICE_ROOT;
@@ -2015,6 +2017,6 @@ static void cxl_core_exit(void)
 	debugfs_remove_recursive(cxl_debugfs);
 }
 
-module_init(cxl_core_init);
+subsys_initcall(cxl_core_init);
 module_exit(cxl_core_exit);
 MODULE_LICENSE("GPL v2");
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 3f6453da2c51..91d334080cab 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2272,6 +2272,75 @@ static struct cxl_pmem_region *cxl_pmem_region_alloc(struct cxl_region *cxlr)
 	return cxlr_pmem;
 }
 
+static void cxl_dax_region_release(struct device *dev)
+{
+	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
+
+	kfree(cxlr_dax);
+}
+
+static const struct attribute_group *cxl_dax_region_attribute_groups[] = {
+	&cxl_base_attribute_group,
+	NULL,
+};
+
+const struct device_type cxl_dax_region_type = {
+	.name = "cxl_dax_region",
+	.release = cxl_dax_region_release,
+	.groups = cxl_dax_region_attribute_groups,
+};
+
+static bool is_cxl_dax_region(struct device *dev)
+{
+	return dev->type == &cxl_dax_region_type;
+}
+
+struct cxl_dax_region *to_cxl_dax_region(struct device *dev)
+{
+	if (dev_WARN_ONCE(dev, !is_cxl_dax_region(dev),
+			  "not a cxl_dax_region device\n"))
+		return NULL;
+	return container_of(dev, struct cxl_dax_region, dev);
+}
+EXPORT_SYMBOL_NS_GPL(to_cxl_dax_region, CXL);
+
+static struct lock_class_key cxl_dax_region_key;
+
+static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
+{
+	struct cxl_region_params *p = &cxlr->params;
+	struct cxl_dax_region *cxlr_dax;
+	struct device *dev;
+
+	down_read(&cxl_region_rwsem);
+	if (p->state != CXL_CONFIG_COMMIT) {
+		cxlr_dax = ERR_PTR(-ENXIO);
+		goto out;
+	}
+
+	cxlr_dax = kzalloc(sizeof(*cxlr_dax), GFP_KERNEL);
+	if (!cxlr_dax) {
+		cxlr_dax = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+
+	cxlr_dax->hpa_range.start = p->res->start;
+	cxlr_dax->hpa_range.end = p->res->end;
+
+	dev = &cxlr_dax->dev;
+	cxlr_dax->cxlr = cxlr;
+	device_initialize(dev);
+	lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
+	device_set_pm_not_required(dev);
+	dev->parent = &cxlr->dev;
+	dev->bus = &cxl_bus_type;
+	dev->type = &cxl_dax_region_type;
+out:
+	up_read(&cxl_region_rwsem);
+
+	return cxlr_dax;
+}
+
 static void cxlr_pmem_unregister(void *_cxlr_pmem)
 {
 	struct cxl_pmem_region *cxlr_pmem = _cxlr_pmem;
@@ -2356,6 +2425,42 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
 	return rc;
 }
 
+static void cxlr_dax_unregister(void *_cxlr_dax)
+{
+	struct cxl_dax_region *cxlr_dax = _cxlr_dax;
+
+	device_unregister(&cxlr_dax->dev);
+}
+
+static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
+{
+	struct cxl_dax_region *cxlr_dax;
+	struct device *dev;
+	int rc;
+
+	cxlr_dax = cxl_dax_region_alloc(cxlr);
+	if (IS_ERR(cxlr_dax))
+		return PTR_ERR(cxlr_dax);
+
+	dev = &cxlr_dax->dev;
+	rc = dev_set_name(dev, "dax_region%d", cxlr->id);
+	if (rc)
+		goto err;
+
+	rc = device_add(dev);
+	if (rc)
+		goto err;
+
+	dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
+		dev_name(dev));
+
+	return devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
+					cxlr_dax);
+err:
+	put_device(dev);
+	return rc;
+}
+
 static int match_decoder_by_range(struct device *dev, void *data)
 {
 	struct range *r1, *r2 = data;
@@ -2619,8 +2724,7 @@ static int cxl_region_probe(struct device *dev)
 					p->res->start, p->res->end, cxlr,
 					is_system_ram) > 0)
 			return 0;
-		dev_dbg(dev, "TODO: hookup devdax\n");
-		return 0;
+		return devm_cxl_add_dax_region(cxlr);
 	default:
 		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
 			cxlr->mode);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 2ac344235235..b1395c46baec 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -513,6 +513,12 @@ struct cxl_pmem_region {
 	struct cxl_pmem_region_mapping mapping[];
 };
 
+struct cxl_dax_region {
+	struct device dev;
+	struct cxl_region *cxlr;
+	struct range hpa_range;
+};
+
 /**
  * struct cxl_port - logical collection of upstream port devices and
  *		     downstream port devices to construct a CXL memory
@@ -707,6 +713,7 @@ void cxl_driver_unregister(struct cxl_driver *cxl_drv);
 #define CXL_DEVICE_MEMORY_EXPANDER	5
 #define CXL_DEVICE_REGION		6
 #define CXL_DEVICE_PMEM_REGION		7
+#define CXL_DEVICE_DAX_REGION		8
 
 #define MODULE_ALIAS_CXL(type) MODULE_ALIAS("cxl:t" __stringify(type) "*")
 #define CXL_MODALIAS_FMT "cxl:t%d"
@@ -725,6 +732,7 @@ bool is_cxl_pmem_region(struct device *dev);
 struct cxl_pmem_region *to_cxl_pmem_region(struct device *dev);
 int cxl_add_to_region(struct cxl_port *root,
 		      struct cxl_endpoint_decoder *cxled);
+struct cxl_dax_region *to_cxl_dax_region(struct device *dev);
 #else
 static inline bool is_cxl_pmem_region(struct device *dev)
 {
@@ -739,6 +747,10 @@ static inline int cxl_add_to_region(struct cxl_port *root,
 {
 	return 0;
 }
+static inline struct cxl_dax_region *to_cxl_dax_region(struct device *dev)
+{
+	return NULL;
+}
 #endif
 
 /*
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 1163eb62e5f6..bd06e16c7ac8 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -45,6 +45,19 @@ config DEV_DAX_HMEM
 
 	  Say M if unsure.
 
+config DEV_DAX_CXL
+	tristate "CXL DAX: direct access to CXL RAM regions"
+	depends on CXL_REGION && DEV_DAX
+	default CXL_REGION && DEV_DAX
+	help
+	  CXL RAM regions are either mapped by platform-firmware
+	  and published in the initial system-memory map as "System RAM", mapped
+	  by platform-firmware as "Soft Reserved", or dynamically provisioned
+	  after boot by the CXL driver. In the latter two cases a device-dax
+	  instance is created to access that unmapped-by-default address range.
+	  Per usual it can remain as dedicated access via a device interface, or
+	  converted to "System RAM" via the dax_kmem facility.
+
 config DEV_DAX_HMEM_DEVICES
 	depends on DEV_DAX_HMEM && DAX
 	def_bool y
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
index 90a56ca3b345..5ed5c39857c8 100644
--- a/drivers/dax/Makefile
+++ b/drivers/dax/Makefile
@@ -3,10 +3,12 @@ obj-$(CONFIG_DAX) += dax.o
 obj-$(CONFIG_DEV_DAX) += device_dax.o
 obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
 obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
+obj-$(CONFIG_DEV_DAX_CXL) += dax_cxl.o
 
 dax-y := super.o
 dax-y += bus.o
 device_dax-y := device.o
 dax_pmem-y := pmem.o
+dax_cxl-y := cxl.o
 
 obj-y += hmem/
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
new file mode 100644
index 000000000000..ccdf8de85bd5
--- /dev/null
+++ b/drivers/dax/cxl.c
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2023 Intel Corporation. All rights reserved. */
+#include <linux/module.h>
+#include <linux/dax.h>
+
+#include "../cxl/cxl.h"
+#include "bus.h"
+
+static int cxl_dax_region_probe(struct device *dev)
+{
+	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
+	int nid = phys_to_target_node(cxlr_dax->hpa_range.start);
+	struct cxl_region *cxlr = cxlr_dax->cxlr;
+	struct dax_region *dax_region;
+	struct dev_dax_data data;
+	struct dev_dax *dev_dax;
+
+	if (nid == NUMA_NO_NODE)
+		nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
+
+	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
+				      PMD_SIZE, IORESOURCE_DAX_KMEM);
+	if (!dax_region)
+		return -ENOMEM;
+
+	data = (struct dev_dax_data) {
+		.dax_region = dax_region,
+		.id = -1,
+		.size = range_len(&cxlr_dax->hpa_range),
+	};
+	dev_dax = devm_create_dev_dax(&data);
+	if (IS_ERR(dev_dax))
+		return PTR_ERR(dev_dax);
+
+	/* child dev_dax instances now own the lifetime of the dax_region */
+	dax_region_put(dax_region);
+	return 0;
+}
+
+static struct cxl_driver cxl_dax_region_driver = {
+	.name = "cxl_dax_region",
+	.probe = cxl_dax_region_probe,
+	.id = CXL_DEVICE_DAX_REGION,
+	.drv = {
+		.suppress_bind_attrs = true,
+	},
+};
+
+module_cxl_driver(cxl_dax_region_driver);
+MODULE_ALIAS_CXL(CXL_DEVICE_DAX_REGION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Intel Corporation");
+MODULE_IMPORT_NS(CXL);
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 5ec08f9f8a57..e5fe8b39fb94 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -72,6 +72,13 @@ static int hmem_register_device(struct device *host, int target_nid,
 	long id;
 	int rc;
 
+	if (IS_ENABLED(CONFIG_CXL_REGION) &&
+	    region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
+			      IORES_DESC_CXL) != REGION_DISJOINT) {
+		dev_dbg(host, "deferring range to CXL: %pr\n", res);
+		return 0;
+	}
+
 	rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
 			       IORES_DESC_SOFT_RESERVED);
 	if (rc != REGION_INTERSECTS)
@@ -157,6 +164,13 @@ static __exit void dax_hmem_exit(void)
 module_init(dax_hmem_init);
 module_exit(dax_hmem_exit);
 
+/* Allow for CXL to define its own dax regions */
+#if IS_ENABLED(CONFIG_CXL_REGION)
+#if IS_MODULE(CONFIG_CXL_ACPI)
+MODULE_SOFTDEP("pre: cxl_acpi");
+#endif
+#endif
+
 MODULE_ALIAS("platform:hmem*");
 MODULE_ALIAS("platform:hmem_platform*");
 MODULE_LICENSE("GPL v2");



* Re: [PATCH v2 01/20] cxl/memdev: Fix endpoint port removal
  2023-02-10  9:05 ` [PATCH v2 01/20] cxl/memdev: Fix endpoint port removal Dan Williams
@ 2023-02-10 17:28   ` Jonathan Cameron
  2023-02-10 21:14     ` Dan Williams
  2023-02-10 23:17   ` Verma, Vishal L
  1 sibling, 1 reply; 65+ messages in thread
From: Jonathan Cameron @ 2023-02-10 17:28 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-cxl, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

On Fri, 10 Feb 2023 01:05:27 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Testing of ram region support [1] triggers a long-standing bug in
> cxl_detach_ep() where some cxl_ep_remove() cleanup is skipped due to an
> inability to walk ports after dports have been unregistered. That
> results in a failure to re-register a memdev after the port is
> re-enabled, leading to a crash like the following:
> 
>     cxl_port_setup_targets: cxl region4: cxl_host_bridge.0:port4 iw: 1 ig: 256
>     general protection fault, ...
>     [..]
>     RIP: 0010:cxl_region_setup_targets+0x897/0x9e0 [cxl_core]
>     dev_name at include/linux/device.h:700
>     (inlined by) cxl_port_setup_targets at drivers/cxl/core/region.c:1155
>     (inlined by) cxl_region_setup_targets at drivers/cxl/core/region.c:1249
>     [..]
>     Call Trace:
>      <TASK>
>      attach_target+0x39a/0x760 [cxl_core]
>      ? __mutex_unlock_slowpath+0x3a/0x290
>      cxl_add_to_region+0xb8/0x340 [cxl_core]
>      ? lockdep_hardirqs_on+0x7d/0x100
>      discover_region+0x4b/0x80 [cxl_port]
>      ? __pfx_discover_region+0x10/0x10 [cxl_port]
>      device_for_each_child+0x58/0x90
>      cxl_port_probe+0x10e/0x130 [cxl_port]
>      cxl_bus_probe+0x17/0x50 [cxl_core]
> 
> Change the port ancestry walk to be by depth rather than by dport. This
> ensures that even if a port has unregistered its dports a deferred
> memdev cleanup will still be able to cleanup the memdev's interest in
> that port.
> 
> The parent_port->dev.driver check is only needed for determining if the
> bottom up removal beat the top-down removal, but cxl_ep_remove() can
> always proceed.

Why can cxl_ep_remove() always proceed?  What stops it racing?
Is it that we are holding a reference to the port at the time of the
call so the release callback can't be called until we drop that?
Anyhow, good to have a little more detail on the 'why' in the patch
description (particularly for those reading this when half asleep like me ;)

> 
> Fixes: 2703c16c75ae ("cxl/core/port: Add switch port enumeration")
> Link: http://lore.kernel.org/r/167564534874.847146.5222419648551436750.stgit@dwillia2-xfh.jf.intel.com [1]
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/cxl/core/memdev.c |    1 +
>  drivers/cxl/core/port.c   |   58 +++++++++++++++++++++++++--------------------
>  drivers/cxl/cxlmem.h      |    2 ++
>  3 files changed, 35 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index a74a93310d26..3a8bc2b06047 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -246,6 +246,7 @@ static struct cxl_memdev *cxl_memdev_alloc(struct cxl_dev_state *cxlds,
>  	if (rc < 0)
>  		goto err;
>  	cxlmd->id = rc;
> +	cxlmd->depth = -1;
>  
>  	dev = &cxlmd->dev;
>  	device_initialize(dev);
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 410c036c09fa..317bcf4dbd9d 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -1207,6 +1207,7 @@ int cxl_endpoint_autoremove(struct cxl_memdev *cxlmd, struct cxl_port *endpoint)
>  
>  	get_device(&endpoint->dev);
>  	dev_set_drvdata(dev, endpoint);
> +	cxlmd->depth = endpoint->depth;
>  	return devm_add_action_or_reset(dev, delete_endpoint, cxlmd);
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_endpoint_autoremove, CXL);
> @@ -1241,50 +1242,55 @@ static void reap_dports(struct cxl_port *port)
>  	}
>  }
>  
> +struct detach_ctx {
> +	struct cxl_memdev *cxlmd;
> +	int depth;
> +};

>  static void cxl_detach_ep(void *data)
>  {
>  	struct cxl_memdev *cxlmd = data;
> -	struct device *iter;
>  
> -	for (iter = &cxlmd->dev; iter; iter = grandparent(iter)) {
> -		struct device *dport_dev = grandparent(iter);
> +	for (int i = cxlmd->depth - 1; i >= 1; i--) {
>  		struct cxl_port *port, *parent_port;
> +		struct detach_ctx ctx = {
> +			.cxlmd = cxlmd,
> +			.depth = i,
> +		};
> +		struct device *dev;
>  		struct cxl_ep *ep;
>  		bool died = false;
>  
> -		if (!dport_dev)
> -			break;
> -
> -		port = find_cxl_port(dport_dev, NULL);
> -		if (!port)
> -			continue;
> -
> -		if (is_cxl_root(port)) {
> -			put_device(&port->dev);
> +		dev = bus_find_device(&cxl_bus_type, NULL, &ctx,
> +				      port_has_memdev);
> +		if (!dev)
>  			continue;
> -		}
> +		port = to_cxl_port(dev);
>  
>  		parent_port = to_cxl_port(port->dev.parent);
>  		device_lock(&parent_port->dev);
> -		if (!parent_port->dev.driver) {
> -			/*
> -			 * The bottom-up race to delete the port lost to a
> -			 * top-down port disable, give up here, because the
> -			 * parent_port ->remove() will have cleaned up all
> -			 * descendants.
> -			 */
> -			device_unlock(&parent_port->dev);
> -			put_device(&port->dev);
> -			continue;
> -		}
> -
>  		device_lock(&port->dev);
>  		ep = cxl_ep_load(port, cxlmd);
>  		dev_dbg(&cxlmd->dev, "disconnect %s from %s\n",
>  			ep ? dev_name(ep->ep) : "", dev_name(&port->dev));
>  		cxl_ep_remove(port, ep);
>  		if (ep && !port->dead && xa_empty(&port->endpoints) &&
> -		    !is_cxl_root(parent_port)) {
> +		    !is_cxl_root(parent_port) && parent_port->dev.driver) {
>  			/*
>  			 * This was the last ep attached to a dynamically
>  			 * enumerated port. Block new cxl_add_ep() and garbage




* Re: [PATCH v2 04/20] cxl/region: Support empty uuids for non-pmem regions
  2023-02-10  9:05 ` [PATCH v2 04/20] cxl/region: Support empty uuids for non-pmem regions Dan Williams
@ 2023-02-10 17:30   ` Jonathan Cameron
  2023-02-10 23:34   ` Ira Weiny
  1 sibling, 0 replies; 65+ messages in thread
From: Jonathan Cameron @ 2023-02-10 17:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Vishal Verma, Fan Ni, dave.hansen, linux-mm, linux-acpi

On Fri, 10 Feb 2023 01:05:45 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Shipping versions of the cxl-cli utility expect all regions to have a
> 'uuid' attribute. In preparation for 'ram' regions, update the 'uuid'
> attribute to return an empty string which satisfies the current
> expectations of 'cxl list -R'. Otherwise, 'cxl list -R' fails in the
> presence of regions with the 'uuid' attribute missing. Force the
> attribute to be read-only as there is no facility or expectation for a
> 'ram' region to recall its uuid from one boot to the next.
> 
> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564536587.847146.12703125206459604597.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
After your explanations in v1 thread, I'm fine with this.
Just another bit of slightly inelegant ABI that we are stuck with.
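The resulting ABI can be sketched from the sysfs side (paths
illustrative, region numbering varies per system):

```sh
# pmem region: 'uuid' round-trips the administrator-assigned UUID;
# ram region: 'uuid' is present but read-only and reads as an empty line,
# so 'cxl list -R' keeps working:
cat /sys/bus/cxl/devices/region0/uuid    # ram region: empty line
cat /sys/bus/cxl/devices/region1/uuid    # pmem region: a UUID
```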

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl |    3 ++-
>  drivers/cxl/core/region.c               |   11 +++++++++--
>  2 files changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 058b0c45001f..4c4e1cbb1169 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -317,7 +317,8 @@ Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) Write a unique identifier for the region. This field must
>  		be set for persistent regions and it must not conflict with the
> -		UUID of another region.
> +		UUID of another region. For volatile ram regions this
> +		attribute is a read-only empty string.
>  
>  
>  What:		/sys/bus/cxl/devices/regionZ/interleave_granularity
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 17d2d0c12725..0fc80478ff6b 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -45,7 +45,10 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
>  	rc = down_read_interruptible(&cxl_region_rwsem);
>  	if (rc)
>  		return rc;
> -	rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
> +	if (cxlr->mode != CXL_DECODER_PMEM)
> +		rc = sysfs_emit(buf, "\n");
> +	else
> +		rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
>  	up_read(&cxl_region_rwsem);
>  
>  	return rc;
> @@ -300,8 +303,12 @@ static umode_t cxl_region_visible(struct kobject *kobj, struct attribute *a,
>  	struct device *dev = kobj_to_dev(kobj);
>  	struct cxl_region *cxlr = to_cxl_region(dev);
>  
> +	/*
> +	 * Support tooling that expects to find a 'uuid' attribute for all
> +	 * regions regardless of mode.
> +	 */
>  	if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_DECODER_PMEM)
> -		return 0;
> +		return 0444;
>  	return a->mode;
>  }
>  
> 
> 



* Re: [PATCH v2 08/20] cxl/region: Cleanup target list on attach error
  2023-02-10  9:06 ` [PATCH v2 08/20] cxl/region: Cleanup target list on attach error Dan Williams
@ 2023-02-10 17:31   ` Jonathan Cameron
  2023-02-10 23:17   ` Verma, Vishal L
  2023-02-10 23:46   ` Ira Weiny
  2 siblings, 0 replies; 65+ messages in thread
From: Jonathan Cameron @ 2023-02-10 17:31 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-cxl, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

On Fri, 10 Feb 2023 01:06:09 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan noticed that the target list setup is not unwound completely
> upon error. Undo all the setup in the 'err_decrement:' exit path.
> 
> Fixes: 27b3f8d13830 ("cxl/region: Program target lists")
> Reported-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> Link: http://lore.kernel.org/r/20230208123031.00006990@Huawei.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---
>  drivers/cxl/core/region.c |    2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 040bbd39c81d..ae7d3adcd41a 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1347,6 +1347,8 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  
>  err_decrement:
>  	p->nr_targets--;
> +	cxled->pos = -1;
> +	p->targets[pos] = NULL;
>  err:
>  	for (iter = ep_port; !is_cxl_root(iter);
>  	     iter = to_cxl_port(iter->dev.parent))
> 
> 



* Re: [PATCH v2 09/20] cxl/region: Move region-position validation to a helper
  2023-02-10  9:06 ` [PATCH v2 09/20] cxl/region: Move region-position validation to a helper Dan Williams
@ 2023-02-10 17:34   ` Jonathan Cameron
  0 siblings, 0 replies; 65+ messages in thread
From: Jonathan Cameron @ 2023-02-10 17:34 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Vishal Verma, Fan Ni, dave.hansen, linux-mm, linux-acpi

On Fri, 10 Feb 2023 01:06:15 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> In preparation for region autodiscovery, which needs all devices
> discovered before their relative position in the region can be
> determined, consolidate all position-dependent validation in a helper.
> 
> Recall that in the on-demand region creation flow the end-user picks the
> position of a given endpoint decoder in a region. In the autodiscovery
> case the position of an endpoint decoder can only be determined after
> all other endpoint decoders that claim to decode the region's address
> range have been enumerated and attached. So, in the autodiscovery case
> endpoint decoders may be attached before their relative position is
> known. Once all decoders arrive, then positions can be determined and
> validated with cxl_region_validate_position() the same as user initiated
> on-demand creation.
> 
> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564538779.847146.8356062886811511706.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>



* Re: [PATCH v2 12/20] cxl/port: Split endpoint and switch port probe
  2023-02-10  9:06 ` [PATCH v2 12/20] cxl/port: Split endpoint and switch port probe Dan Williams
@ 2023-02-10 17:41   ` Jonathan Cameron
  2023-02-10 23:21   ` Verma, Vishal L
  1 sibling, 0 replies; 65+ messages in thread
From: Jonathan Cameron @ 2023-02-10 17:41 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-cxl, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

On Fri, 10 Feb 2023 01:06:33 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan points out that the shared code between the switch and endpoint
> case is small. Before adding another is_cxl_endpoint() conditional,
> just split the two cases.
> 
> Rather than duplicate the "Couldn't enumerate decoders" error message
> take the opportunity to improve the error messages in
> devm_cxl_enumerate_decoders().
> 
> Reported-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> Link: http://lore.kernel.org/r/20230208170724.000067ec@Huawei.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
LGTM.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---
>  drivers/cxl/core/hdm.c |   11 ++++++--
>  drivers/cxl/port.c     |   69 +++++++++++++++++++++++++++---------------------
>  2 files changed, 47 insertions(+), 33 deletions(-)
> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index dcc16d7cb8f3..a0891c3464f1 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -826,7 +826,8 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
>  			cxled = cxl_endpoint_decoder_alloc(port);
>  			if (IS_ERR(cxled)) {
>  				dev_warn(&port->dev,
> -					 "Failed to allocate the decoder\n");
> +					 "Failed to allocate decoder%d.%d\n",
> +					 port->id, i);
>  				return PTR_ERR(cxled);
>  			}
>  			cxld = &cxled->cxld;
> @@ -836,7 +837,8 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
>  			cxlsd = cxl_switch_decoder_alloc(port, target_count);
>  			if (IS_ERR(cxlsd)) {
>  				dev_warn(&port->dev,
> -					 "Failed to allocate the decoder\n");
> +					 "Failed to allocate decoder%d.%d\n",
> +					 port->id, i);
>  				return PTR_ERR(cxlsd);
>  			}
>  			cxld = &cxlsd->cxld;
> @@ -844,13 +846,16 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
>  
>  		rc = init_hdm_decoder(port, cxld, target_map, hdm, i, &dpa_base);
>  		if (rc) {
> +			dev_warn(&port->dev,
> +				 "Failed to initialize decoder%d.%d\n",
> +				 port->id, i);
>  			put_device(&cxld->dev);
>  			return rc;
>  		}
>  		rc = add_hdm_decoder(port, cxld, target_map);
>  		if (rc) {
>  			dev_warn(&port->dev,
> -				 "Failed to add decoder to port\n");
> +				 "Failed to add decoder%d.%d\n", port->id, i);
>  			return rc;
>  		}
>  	}
> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
> index 5453771bf330..a8d46a67b45e 100644
> --- a/drivers/cxl/port.c
> +++ b/drivers/cxl/port.c
> @@ -30,55 +30,64 @@ static void schedule_detach(void *cxlmd)
>  	schedule_cxl_memdev_detach(cxlmd);
>  }
>  
> -static int cxl_port_probe(struct device *dev)
> +static int cxl_switch_port_probe(struct cxl_port *port)
>  {
> -	struct cxl_port *port = to_cxl_port(dev);
>  	struct cxl_hdm *cxlhdm;
>  	int rc;
>  
> +	rc = devm_cxl_port_enumerate_dports(port);
> +	if (rc < 0)
> +		return rc;
>  
> -	if (!is_cxl_endpoint(port)) {
> -		rc = devm_cxl_port_enumerate_dports(port);
> -		if (rc < 0)
> -			return rc;
> -		if (rc == 1)
> -			return devm_cxl_add_passthrough_decoder(port);
> -	}
> +	if (rc == 1)
> +		return devm_cxl_add_passthrough_decoder(port);
>  
>  	cxlhdm = devm_cxl_setup_hdm(port);
>  	if (IS_ERR(cxlhdm))
>  		return PTR_ERR(cxlhdm);
>  
> -	if (is_cxl_endpoint(port)) {
> -		struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport);
> -		struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	return devm_cxl_enumerate_decoders(cxlhdm);
> +}
>  
> -		/* Cache the data early to ensure is_visible() works */
> -		read_cdat_data(port);
> +static int cxl_endpoint_port_probe(struct cxl_port *port)
> +{
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport);
> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	struct cxl_hdm *cxlhdm;
> +	int rc;
> +
> +	cxlhdm = devm_cxl_setup_hdm(port);
> +	if (IS_ERR(cxlhdm))
> +		return PTR_ERR(cxlhdm);
>  
> -		get_device(&cxlmd->dev);
> -		rc = devm_add_action_or_reset(dev, schedule_detach, cxlmd);
> -		if (rc)
> -			return rc;
> +	/* Cache the data early to ensure is_visible() works */
> +	read_cdat_data(port);
>  
> -		rc = cxl_hdm_decode_init(cxlds, cxlhdm);
> -		if (rc)
> -			return rc;
> +	get_device(&cxlmd->dev);
> +	rc = devm_add_action_or_reset(&port->dev, schedule_detach, cxlmd);
> +	if (rc)
> +		return rc;
>  
> -		rc = cxl_await_media_ready(cxlds);
> -		if (rc) {
> -			dev_err(dev, "Media not active (%d)\n", rc);
> -			return rc;
> -		}
> -	}
> +	rc = cxl_hdm_decode_init(cxlds, cxlhdm);
> +	if (rc)
> +		return rc;
>  
> -	rc = devm_cxl_enumerate_decoders(cxlhdm);
> +	rc = cxl_await_media_ready(cxlds);
>  	if (rc) {
> -		dev_err(dev, "Couldn't enumerate decoders (%d)\n", rc);
> +		dev_err(&port->dev, "Media not active (%d)\n", rc);
>  		return rc;
>  	}
>  
> -	return 0;
> +	return devm_cxl_enumerate_decoders(cxlhdm);
> +}
> +
> +static int cxl_port_probe(struct device *dev)
> +{
> +	struct cxl_port *port = to_cxl_port(dev);
> +
> +	if (is_cxl_endpoint(port))
> +		return cxl_endpoint_port_probe(port);
> +	return cxl_switch_port_probe(port);
>  }
>  
>  static ssize_t CDAT_read(struct file *filp, struct kobject *kobj,
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (19 preceding siblings ...)
  2023-02-10  9:07 ` [PATCH v2 20/20] cxl/dax: Create dax devices for CXL RAM regions Dan Williams
@ 2023-02-10 17:53 ` Dan Williams
  2023-02-11 14:04   ` Gregory Price
  2023-02-13 18:22 ` Gregory Price
  21 siblings, 1 reply; 65+ messages in thread
From: Dan Williams @ 2023-02-10 17:53 UTC (permalink / raw)
  To: Dan Williams, linux-cxl
  Cc: Ira Weiny, David Hildenbrand, Dave Jiang, Davidlohr Bueso,
	Kees Cook, Jonathan Cameron, Vishal Verma, Dave Hansen,
	Michal Hocko, Gregory Price, Fan Ni, linux-mm, linux-acpi

Dan Williams wrote:
> Changes since v1: [1]
> [..]
> 
> [1]: http://lore.kernel.org/r/167564534874.847146.5222419648551436750.stgit@dwillia2-xfh.jf.intel.com
> 
> ---
> Cover letter same as v1

Thanks for all the review so far! The outstanding backlog is still too
high to definitively say this will make v6.3:

http://lore.kernel.org/r/167601992789.1924368.8083994227892600608.stgit@dwillia2-xfh.jf.intel.com
http://lore.kernel.org/r/167601996980.1924368.390423634911157277.stgit@dwillia2-xfh.jf.intel.com
http://lore.kernel.org/r/167601999378.1924368.15071142145866277623.stgit@dwillia2-xfh.jf.intel.com
http://lore.kernel.org/r/167601999958.1924368.9366954455835735048.stgit@dwillia2-xfh.jf.intel.com
http://lore.kernel.org/r/167602000547.1924368.11613151863880268868.stgit@dwillia2-xfh.jf.intel.com
http://lore.kernel.org/r/167602001107.1924368.11562316181038595611.stgit@dwillia2-xfh.jf.intel.com
http://lore.kernel.org/r/167602002771.1924368.5653558226424530127.stgit@dwillia2-xfh.jf.intel.com
http://lore.kernel.org/r/167602003896.1924368.10335442077318970468.stgit@dwillia2-xfh.jf.intel.com

...what I plan to do is provisionally include it in -next and then make
a judgement call next Friday.

I am encouraged by Fan's test results:

http://lore.kernel.org/r/20230208173720.GA709329@bgt-140510-bm03

...and am reminded that there are some non-trivial TODOs pent up behind
region enumeration:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9ea4dcf49878


* Re: [PATCH v2 13/20] cxl/region: Add region autodiscovery
  2023-02-10  9:06 ` [PATCH v2 13/20] cxl/region: Add region autodiscovery Dan Williams
@ 2023-02-10 18:09   ` Jonathan Cameron
  2023-02-10 21:35     ` Dan Williams
  2023-02-10 21:49     ` Dan Williams
  2023-02-11  0:29   ` Verma, Vishal L
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 65+ messages in thread
From: Jonathan Cameron @ 2023-02-10 18:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Fan Ni, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

On Fri, 10 Feb 2023 01:06:39 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Region autodiscovery is an asynchronous state machine advanced by
> cxl_port_probe(). After the decoders on an endpoint port are enumerated
> they are scanned for actively enabled instances. Each active decoder is
> flagged for auto-assembly CXL_DECODER_F_AUTO and attached to a region.
> If a region does not already exist for the address range setting of the
> decoder one is created. That creation process may race with other
> decoders of the same region being discovered since cxl_port_probe() is
> asynchronous. A new 'struct cxl_root_decoder' lock, @range_lock, is
> introduced to mitigate that race.
> 
> Once all decoders have arrived, "p->nr_targets == p->interleave_ways",
> they are sorted by their relative decode position. The sort algorithm
> involves finding the point in the cxl_port topology where one leg of the
> decode leads to deviceA and the other deviceB. At that point in the
> topology the target order in the 'struct cxl_switch_decoder' indicates
> the relative position of those endpoint decoders in the region.
> 
> >From that point the region goes through the same setup and validation 
Why the >? 
> steps as user-created regions, but instead of programming the decoders
> it validates that driver would have written the same values to the
> decoders as were already present.
> 
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564540972.847146.17096178433176097831.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

A few trivial things inline. This being complex code, I'm not as
confident about it as the rest of the series, but with that in mind,
and given that I didn't find anything that looked broken...

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

...



> +
> +static int cxl_region_sort_targets(struct cxl_region *cxlr)
> +{
> +	struct cxl_region_params *p = &cxlr->params;
> +	int i, rc = 0;
> +
> +	sort(p->targets, p->nr_targets, sizeof(p->targets[0]), cmp_decode_pos,
> +	     NULL);
> +
> +	for (i = 0; i < p->nr_targets; i++) {
> +		struct cxl_endpoint_decoder *cxled = p->targets[i];
> +
> +		if (cxled->pos < 0)
> +			rc = -ENXIO;

If it makes sense to carry on after pos < 0 I'd like to see a comment here
on why.  If not, it would be nicer to have a separate dev_dbg() for the
failed case and a direct return here.

> +		cxled->pos = i;
> +	}
> +
> +	dev_dbg(&cxlr->dev, "region sort %s\n", rc ? "failed" : "successful");
> +	return rc;
> +}
> +

> +
> +int cxl_add_to_region(struct cxl_port *root, struct cxl_endpoint_decoder *cxled)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct range *hpa = &cxled->cxld.hpa_range;
> +	struct cxl_decoder *cxld = &cxled->cxld;
> +	struct cxl_root_decoder *cxlrd;
> +	struct cxl_region_params *p;
> +	struct cxl_region *cxlr;
> +	bool attach = false;
> +	struct device *dev;
> +	int rc;
> +
> +	dev = device_find_child(&root->dev, &cxld->hpa_range,
> +				match_decoder_by_range);
> +	if (!dev) {
> +		dev_err(cxlmd->dev.parent,
> +			"%s:%s no CXL window for range %#llx:%#llx\n",
> +			dev_name(&cxlmd->dev), dev_name(&cxld->dev),
> +			cxld->hpa_range.start, cxld->hpa_range.end);
> +		return -ENXIO;
> +	}
> +
> +	cxlrd = to_cxl_root_decoder(dev);
> +
> +	/*
> +	 * Ensure that if multiple threads race to construct_region() for @hpa
> +	 * one does the construction and the others add to that.
> +	 */
> +	mutex_lock(&cxlrd->range_lock);
> +	dev = device_find_child(&cxlrd->cxlsd.cxld.dev, hpa,
> +				match_region_by_range);
> +	if (!dev)
> +		cxlr = construct_region(cxlrd, cxled);
> +	else
> +		cxlr = to_cxl_region(dev);
> +	mutex_unlock(&cxlrd->range_lock);
> +
> +	if (IS_ERR(cxlr)) {
> +		rc = PTR_ERR(cxlr);
> +		goto out;
> +	}
> +
> +	attach_target(cxlr, cxled, -1, TASK_UNINTERRUPTIBLE);
> +
> +	down_read(&cxl_region_rwsem);
> +	p = &cxlr->params;
> +	attach = p->state == CXL_CONFIG_COMMIT;
> +	up_read(&cxl_region_rwsem);
> +
> +	if (attach) {
> +		int rc = device_attach(&cxlr->dev);

Shadowing int rc isn't great for readability. Just call it rc2 or something :)
Or given you don't make use of the value...

		/*
		 * If device_attach() fails the range may still be active via
		 * the platform-firmware memory map, otherwise the driver for
		 * regions is local to this file, so driver matching can't fail
+                * and hence device_attach() cannot return 1.

//very much not obvious otherwise to anyone who isn't far too familiar with device_attach()

		 */
		if (device_attach(&cxlr->dev) < 0)
			dev_err()
> +
> +		/*
> +		 * If device_attach() fails the range may still be active via
> +		 * the platform-firmware memory map, otherwise the driver for
> +		 * regions is local to this file, so driver matching can't fail.
> +		 */
> +		if (rc < 0)
> +			dev_err(&cxlr->dev, "failed to enable, range: %pr\n",
> +				p->res);
> +	}
> +
> +	put_device(&cxlr->dev);
> +out:
> +	put_device(&cxlrd->cxlsd.cxld.dev);

Moderately horrible.  Maybe just keep an extra local variable around for the first
use of struct device *dev?  or maybe add a put_cxl_root_decoder() helper?

There are lots of other deep structure access like this I guess, so I don't mind
if you just leave this as yet another one.


> +	return rc;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_add_to_region, CXL);

...

> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
> index a8d46a67b45e..d88518836c2d 100644
> --- a/drivers/cxl/port.c
> +++ b/drivers/cxl/port.c
> @@ -30,6 +30,34 @@ static void schedule_detach(void *cxlmd)
>  	schedule_cxl_memdev_detach(cxlmd);
>  }
>  
> +static int discover_region(struct device *dev, void *root)
> +{
> +	struct cxl_endpoint_decoder *cxled;
> +	int rc;
> +
> +	if (!is_endpoint_decoder(dev))
> +		return 0;
> +
> +	cxled = to_cxl_endpoint_decoder(dev);
> +	if ((cxled->cxld.flags & CXL_DECODER_F_ENABLE) == 0)
> +		return 0;
> +
> +	if (cxled->state != CXL_DECODER_STATE_AUTO)
> +		return 0;
> +
> +	/*
> +	 * Region enumeration is opportunistic, if this add-event fails,
> +	 * continue to the next endpoint decoder.
> +	 */
> +	rc = cxl_add_to_region(root, cxled);
> +	if (rc)
> +		dev_dbg(dev, "failed to add to region: %#llx-%#llx\n",
> +			cxled->cxld.hpa_range.start, cxled->cxld.hpa_range.end);
> +
> +	return 0;
> +}
> +
> +

Two blank lines?

>  static int cxl_switch_port_probe(struct cxl_port *port)
>  {


* Re: [PATCH v2 14/20] tools/testing/cxl: Define a fixed volatile configuration to parse
  2023-02-10  9:06 ` [PATCH v2 14/20] tools/testing/cxl: Define a fixed volatile configuration to parse Dan Williams
@ 2023-02-10 18:12   ` Jonathan Cameron
  2023-02-10 18:36   ` Dave Jiang
  2023-02-11  0:39   ` Verma, Vishal L
  2 siblings, 0 replies; 65+ messages in thread
From: Jonathan Cameron @ 2023-02-10 18:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Fan Ni, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

On Fri, 10 Feb 2023 01:06:45 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Take two endpoints attached to the first switch on the first host-bridge
> in the cxl_test topology and define a pre-initialized region. This is a
> x2 interleave underneath a x1 CXL Window.
> 
> $ modprobe cxl_test
> $ # cxl list -Ru
> {
>   "region":"region3",
>   "resource":"0xf010000000",
>   "size":"512.00 MiB (536.87 MB)",
>   "interleave_ways":2,
>   "interleave_granularity":4096,
>   "decode_state":"commit"
> }
> 
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564541523.847146.12199636368812381475.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The few things I commented on in v1 are resolved, so
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---
>  drivers/cxl/core/core.h      |    3 -
>  drivers/cxl/core/hdm.c       |    3 +
>  drivers/cxl/core/port.c      |    2 +
>  drivers/cxl/cxl.h            |    2 +
>  drivers/cxl/cxlmem.h         |    3 +
>  tools/testing/cxl/test/cxl.c |  147 +++++++++++++++++++++++++++++++++++++++---
>  6 files changed, 146 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 5eb873da5a30..479f01da6d35 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -57,9 +57,6 @@ resource_size_t cxl_dpa_size(struct cxl_endpoint_decoder *cxled);
>  resource_size_t cxl_dpa_resource_start(struct cxl_endpoint_decoder *cxled);
>  extern struct rw_semaphore cxl_dpa_rwsem;
>  
> -bool is_switch_decoder(struct device *dev);
> -struct cxl_switch_decoder *to_cxl_switch_decoder(struct device *dev);
> -
>  int cxl_memdev_init(void);
>  void cxl_memdev_exit(void);
>  void cxl_mbox_init(void);
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 8c29026a4b9d..80eccae6ba9e 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -279,7 +279,7 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	return 0;
>  }
>  
> -static int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> +int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  				resource_size_t base, resource_size_t len,
>  				resource_size_t skipped)
>  {
> @@ -295,6 +295,7 @@ static int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  
>  	return devm_add_action_or_reset(&port->dev, cxl_dpa_release, cxled);
>  }
> +EXPORT_SYMBOL_NS_GPL(devm_cxl_dpa_reserve, CXL);
>  
>  resource_size_t cxl_dpa_size(struct cxl_endpoint_decoder *cxled)
>  {
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 59620528571a..b45d2796ef35 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -458,6 +458,7 @@ bool is_switch_decoder(struct device *dev)
>  {
>  	return is_root_decoder(dev) || dev->type == &cxl_decoder_switch_type;
>  }
> +EXPORT_SYMBOL_NS_GPL(is_switch_decoder, CXL);
>  
>  struct cxl_decoder *to_cxl_decoder(struct device *dev)
>  {
> @@ -485,6 +486,7 @@ struct cxl_switch_decoder *to_cxl_switch_decoder(struct device *dev)
>  		return NULL;
>  	return container_of(dev, struct cxl_switch_decoder, cxld.dev);
>  }
> +EXPORT_SYMBOL_NS_GPL(to_cxl_switch_decoder, CXL);
>  
>  static void cxl_ep_release(struct cxl_ep *ep)
>  {
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index c8ee4bb8cce6..2ac344235235 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -653,8 +653,10 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
>  
>  struct cxl_decoder *to_cxl_decoder(struct device *dev);
>  struct cxl_root_decoder *to_cxl_root_decoder(struct device *dev);
> +struct cxl_switch_decoder *to_cxl_switch_decoder(struct device *dev);
>  struct cxl_endpoint_decoder *to_cxl_endpoint_decoder(struct device *dev);
>  bool is_root_decoder(struct device *dev);
> +bool is_switch_decoder(struct device *dev);
>  bool is_endpoint_decoder(struct device *dev);
>  struct cxl_root_decoder *cxl_root_decoder_alloc(struct cxl_port *port,
>  						unsigned int nr_targets,
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index c9da3c699a21..bf7d4c5c8612 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -81,6 +81,9 @@ static inline bool is_cxl_endpoint(struct cxl_port *port)
>  }
>  
>  struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds);
> +int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> +			 resource_size_t base, resource_size_t len,
> +			 resource_size_t skipped);
>  
>  static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port,
>  					 struct cxl_memdev *cxlmd)
> diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
> index 920bd969c554..5342f69d70d2 100644
> --- a/tools/testing/cxl/test/cxl.c
> +++ b/tools/testing/cxl/test/cxl.c
> @@ -703,6 +703,142 @@ static int mock_decoder_reset(struct cxl_decoder *cxld)
>  	return 0;
>  }
>  
> +static void default_mock_decoder(struct cxl_decoder *cxld)
> +{
> +	cxld->hpa_range = (struct range){
> +		.start = 0,
> +		.end = -1,
> +	};
> +
> +	cxld->interleave_ways = 1;
> +	cxld->interleave_granularity = 256;
> +	cxld->target_type = CXL_DECODER_EXPANDER;
> +	cxld->commit = mock_decoder_commit;
> +	cxld->reset = mock_decoder_reset;
> +}
> +
> +static int first_decoder(struct device *dev, void *data)
> +{
> +	struct cxl_decoder *cxld;
> +
> +	if (!is_switch_decoder(dev))
> +		return 0;
> +	cxld = to_cxl_decoder(dev);
> +	if (cxld->id == 0)
> +		return 1;
> +	return 0;
> +}
> +
> +static void mock_init_hdm_decoder(struct cxl_decoder *cxld)
> +{
> +	struct acpi_cedt_cfmws *window = mock_cfmws[0];
> +	struct platform_device *pdev = NULL;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct cxl_switch_decoder *cxlsd;
> +	struct cxl_port *port, *iter;
> +	const int size = SZ_512M;
> +	struct cxl_memdev *cxlmd;
> +	struct cxl_dport *dport;
> +	struct device *dev;
> +	bool hb0 = false;
> +	u64 base;
> +	int i;
> +
> +	if (is_endpoint_decoder(&cxld->dev)) {
> +		cxled = to_cxl_endpoint_decoder(&cxld->dev);
> +		cxlmd = cxled_to_memdev(cxled);
> +		WARN_ON(!dev_is_platform(cxlmd->dev.parent));
> +		pdev = to_platform_device(cxlmd->dev.parent);
> +
> +		/* check if endpoint is attached to host-bridge0 */
> +		port = cxled_to_port(cxled);
> +		do {
> +			if (port->uport == &cxl_host_bridge[0]->dev) {
> +				hb0 = true;
> +				break;
> +			}
> +			if (is_cxl_port(port->dev.parent))
> +				port = to_cxl_port(port->dev.parent);
> +			else
> +				port = NULL;
> +		} while (port);
> +		port = cxled_to_port(cxled);
> +	}
> +
> +	/*
> +	 * The first decoder on the first 2 devices on the first switch
> +	 * attached to host-bridge0 mock a fake / static RAM region. All
> +	 * other decoders are default disabled. Given the round robin
> +	 * assignment those devices are named cxl_mem.0, and cxl_mem.4.
> +	 *
> +	 * See 'cxl list -BMPu -m cxl_mem.0,cxl_mem.4'
> +	 */
> +	if (!hb0 || pdev->id % 4 || pdev->id > 4 || cxld->id > 0) {
> +		default_mock_decoder(cxld);
> +		return;
> +	}
> +
> +	base = window->base_hpa;
> +	cxld->hpa_range = (struct range) {
> +		.start = base,
> +		.end = base + size - 1,
> +	};
> +
> +	cxld->interleave_ways = 2;
> +	eig_to_granularity(window->granularity, &cxld->interleave_granularity);
> +	cxld->target_type = CXL_DECODER_EXPANDER;
> +	cxld->flags = CXL_DECODER_F_ENABLE;
> +	cxled->state = CXL_DECODER_STATE_AUTO;
> +	port->commit_end = cxld->id;
> +	devm_cxl_dpa_reserve(cxled, 0, size / cxld->interleave_ways, 0);
> +	cxld->commit = mock_decoder_commit;
> +	cxld->reset = mock_decoder_reset;
> +
> +	/*
> +	 * Now that endpoint decoder is set up, walk up the hierarchy
> +	 * and setup the switch and root port decoders targeting @cxlmd.
> +	 */
> +	iter = port;
> +	for (i = 0; i < 2; i++) {
> +		dport = iter->parent_dport;
> +		iter = dport->port;
> +		dev = device_find_child(&iter->dev, NULL, first_decoder);
> +		/*
> +		 * Ancestor ports are guaranteed to be enumerated before
> +		 * @port, and all ports have at least one decoder.
> +		 */
> +		if (WARN_ON(!dev))
> +			continue;
> +		cxlsd = to_cxl_switch_decoder(dev);
> +		if (i == 0) {
> +			/* put cxl_mem.4 second in the decode order */
> +			if (pdev->id == 4)
> +				cxlsd->target[1] = dport;
> +			else
> +				cxlsd->target[0] = dport;
> +		} else
> +			cxlsd->target[0] = dport;
> +		cxld = &cxlsd->cxld;
> +		cxld->target_type = CXL_DECODER_EXPANDER;
> +		cxld->flags = CXL_DECODER_F_ENABLE;
> +		iter->commit_end = 0;
> +		/*
> +		 * Switch targets 2 endpoints, while host bridge targets
> +		 * one root port
> +		 */
> +		if (i == 0)
> +			cxld->interleave_ways = 2;
> +		else
> +			cxld->interleave_ways = 1;
> +		cxld->interleave_granularity = 256;
> +		cxld->hpa_range = (struct range) {
> +			.start = base,
> +			.end = base + size - 1,
> +		};
> +		put_device(dev);
> +	}
> +}
> +
>  static int mock_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
>  {
>  	struct cxl_port *port = cxlhdm->port;
> @@ -748,16 +884,7 @@ static int mock_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
>  			cxld = &cxled->cxld;
>  		}
>  
> -		cxld->hpa_range = (struct range) {
> -			.start = 0,
> -			.end = -1,
> -		};
> -
> -		cxld->interleave_ways = min_not_zero(target_count, 1);
> -		cxld->interleave_granularity = SZ_4K;
> -		cxld->target_type = CXL_DECODER_EXPANDER;
> -		cxld->commit = mock_decoder_commit;
> -		cxld->reset = mock_decoder_reset;
> +		mock_init_hdm_decoder(cxld);
>  
>  		if (target_count) {
>  			rc = device_for_each_child(port->uport, &ctx,
> 
> 



* Re: [PATCH v2 18/20] dax/hmem: Move hmem device registration to dax_hmem.ko
  2023-02-10  9:07 ` [PATCH v2 18/20] dax/hmem: Move hmem device registration to dax_hmem.ko Dan Williams
@ 2023-02-10 18:25   ` Jonathan Cameron
  2023-02-10 22:09   ` Dave Jiang
  2023-02-11  4:41   ` Verma, Vishal L
  2 siblings, 0 replies; 65+ messages in thread
From: Jonathan Cameron @ 2023-02-10 18:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Fan Ni, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

On Fri, 10 Feb 2023 01:07:07 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> In preparation for the CXL region driver to take over the responsibility
> of registering device-dax instances for CXL regions, move the
> registration of "hmem" devices to dax_hmem.ko.
> 
> Previously the builtin component of this enabling
> (drivers/dax/hmem/device.o) would register platform devices for each
> address range and trigger the dax_hmem.ko module to load and attach
> device-dax instances to those devices. Now, the ranges are collected
> from the HMAT and EFI memory map walking, but the device creation is
> deferred. A new "hmem_platform" device is created which triggers
> dax_hmem.ko to load and register the platform devices.
> 
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564543923.847146.9030380223622044744.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

I'm not particularly familiar with this code, but your changes indeed
reflect what you describe above and appear correct to me.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


* Re: [PATCH v2 14/20] tools/testing/cxl: Define a fixed volatile configuration to parse
  2023-02-10  9:06 ` [PATCH v2 14/20] tools/testing/cxl: Define a fixed volatile configuration to parse Dan Williams
  2023-02-10 18:12   ` Jonathan Cameron
@ 2023-02-10 18:36   ` Dave Jiang
  2023-02-11  0:39   ` Verma, Vishal L
  2 siblings, 0 replies; 65+ messages in thread
From: Dave Jiang @ 2023-02-10 18:36 UTC (permalink / raw)
  To: Dan Williams, linux-cxl
  Cc: Fan Ni, vishal.l.verma, dave.hansen, linux-mm, linux-acpi



On 2/10/23 2:06 AM, Dan Williams wrote:
> Take two endpoints attached to the first switch on the first host-bridge
> in the cxl_test topology and define a pre-initialized region. This is a
> x2 interleave underneath a x1 CXL Window.
> 
> $ modprobe cxl_test
> $ # cxl list -Ru
> {
>    "region":"region3",
>    "resource":"0xf010000000",
>    "size":"512.00 MiB (536.87 MB)",
>    "interleave_ways":2,
>    "interleave_granularity":4096,
>    "decode_state":"commit"
> }
> 
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564541523.847146.12199636368812381475.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> [..]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 20/20] cxl/dax: Create dax devices for CXL RAM regions
  2023-02-10  9:07 ` [PATCH v2 20/20] cxl/dax: Create dax devices for CXL RAM regions Dan Williams
@ 2023-02-10 18:38   ` Jonathan Cameron
  2023-02-10 22:42   ` Dave Jiang
  1 sibling, 0 replies; 65+ messages in thread
From: Jonathan Cameron @ 2023-02-10 18:38 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Fan Ni, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

On Fri, 10 Feb 2023 01:07:19 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> While platform firmware takes some responsibility for mapping the RAM
> capacity of CXL devices present at boot, the OS is responsible for
> mapping the remainder and hot-added devices. Platform firmware is also
> responsible for identifying the platform general purpose memory pool,
> typically DDR attached DRAM, and arranging for the remainder to be 'Soft
> Reserved'. That reservation allows the CXL subsystem to route the memory
> to core-mm via memory-hotplug (dax_kmem), or leave it for dedicated
> access (device-dax).
> 
> The new 'struct cxl_dax_region' object allows for a CXL memory resource
> (region) to be published, but also allow for udev and module policy to
> act on that event. It also prevents cxl_core.ko from having a module
> loading dependency on any drivers/dax/ modules.
> 
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564545116.847146.4741351262959589920.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

I've not yet gotten around to testing this version, but from a read-through
it looks fine.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

I've skipped a patch or two where I felt I didn't have the expertise
to cover them adequately (and not enough time for now to get it...)
in particular the policy patch. Hopefully that will get
good review from others.

Jonathan




* Re: [PATCH v2 01/20] cxl/memdev: Fix endpoint port removal
  2023-02-10 17:28   ` Jonathan Cameron
@ 2023-02-10 21:14     ` Dan Williams
  0 siblings, 0 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10 21:14 UTC (permalink / raw)
  To: Jonathan Cameron, Dan Williams
  Cc: linux-cxl, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

Jonathan Cameron wrote:
> On Fri, 10 Feb 2023 01:05:27 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Testing of ram region support [1] stimulates a long-standing bug in
> > cxl_detach_ep() where some cxl_ep_remove() cleanup is skipped due to
> > an inability to walk ports after dports have been unregistered. That
> > results in a failure to re-register a memdev after the port is
> > re-enabled leading to a crash like the following:
> > 
> >     cxl_port_setup_targets: cxl region4: cxl_host_bridge.0:port4 iw: 1 ig: 256
> >     general protection fault, ...
> >     [..]
> >     RIP: 0010:cxl_region_setup_targets+0x897/0x9e0 [cxl_core]
> >     dev_name at include/linux/device.h:700
> >     (inlined by) cxl_port_setup_targets at drivers/cxl/core/region.c:1155
> >     (inlined by) cxl_region_setup_targets at drivers/cxl/core/region.c:1249
> >     [..]
> >     Call Trace:
> >      <TASK>
> >      attach_target+0x39a/0x760 [cxl_core]
> >      ? __mutex_unlock_slowpath+0x3a/0x290
> >      cxl_add_to_region+0xb8/0x340 [cxl_core]
> >      ? lockdep_hardirqs_on+0x7d/0x100
> >      discover_region+0x4b/0x80 [cxl_port]
> >      ? __pfx_discover_region+0x10/0x10 [cxl_port]
> >      device_for_each_child+0x58/0x90
> >      cxl_port_probe+0x10e/0x130 [cxl_port]
> >      cxl_bus_probe+0x17/0x50 [cxl_core]
> > 
> > Change the port ancestry walk to be by depth rather than by dport. This
> > ensures that even if a port has unregistered its dports a deferred
> > memdev cleanup will still be able to cleanup the memdev's interest in
> > that port.
> > 
> > The parent_port->dev.driver check is only needed for determining if the
> > bottom up removal beat the top-down removal, but cxl_ep_remove() can
> > always proceed.
> 
> Why can cxl_ep_remove() always proceed?  What stops it racing?
> Is it that we are holding a reference to the port at the time of the
> call so the release callback can't be called until we drop that?

Right, as long as a port reference is held, the cxl_ep_remove() at
cxl_port_release() cannot race this one from the memdev-removal path.
The result of cxl_ep_load() is guaranteed to stay stable until
the subsequent put_device().
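The guarantee being relied on here can be sketched in plain C. This is a
userspace mock of the get_device()/put_device() pattern, not the real
driver-core code: the release callback only fires when the last reference
drops, so state read while a reference is held stays valid.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace sketch (not kernel code) of the reference-count guarantee:
 * "release" cannot run while any reference is held, so data read under
 * a reference stays valid until the matching put.
 */
struct mock_device {
	int refcount;
	bool released;
};

static void mock_get_device(struct mock_device *d)
{
	d->refcount++;
}

static void mock_put_device(struct mock_device *d)
{
	if (--d->refcount == 0)
		d->released = true;	/* release callback fires here */
}
```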

> Anyhow, good to have a little more detail on the 'why' in the patch
> description (particularly for those reading this when half asleep like me ;)

Long day for you, I appreciate it!


* Re: [PATCH v2 13/20] cxl/region: Add region autodiscovery
  2023-02-10 18:09   ` Jonathan Cameron
@ 2023-02-10 21:35     ` Dan Williams
  2023-02-14 13:23       ` Jonathan Cameron
  2023-02-10 21:49     ` Dan Williams
  1 sibling, 1 reply; 65+ messages in thread
From: Dan Williams @ 2023-02-10 21:35 UTC (permalink / raw)
  To: Jonathan Cameron, Dan Williams
  Cc: linux-cxl, Fan Ni, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

Jonathan Cameron wrote:
> On Fri, 10 Feb 2023 01:06:39 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Region autodiscovery is an asynchronous state machine advanced by
> > cxl_port_probe(). After the decoders on an endpoint port are enumerated
> > they are scanned for actively enabled instances. Each active decoder is
> > flagged for auto-assembly CXL_DECODER_F_AUTO and attached to a region.
> > If a region does not already exist for the address range setting of the
> > decoder one is created. That creation process may race with other
> > decoders of the same region being discovered since cxl_port_probe() is
> > asynchronous. A new 'struct cxl_root_decoder' lock, @range_lock, is
> > introduced to mitigate that race.
> > 
> > Once all decoders have arrived, "p->nr_targets == p->interleave_ways",
> > they are sorted by their relative decode position. The sort algorithm
> > involves finding the point in the cxl_port topology where one leg of the
> > decode leads to deviceA and the other deviceB. At that point in the
> > topology the target order in the 'struct cxl_switch_decoder' indicates
> > the relative position of those endpoint decoders in the region.
> > 
> > >From that point the region goes through the same setup and validation 
> Why the >? 

I believe this is auto-added by git send-email or public-inbox to make
sure that a sentence that begins with "From" is not misinterpreted as a
"From:" header. You can see this throughout the kernel commit history.
In this case I pulled the patches back down from lore before editing
them to collect review tags.
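The artifact comes from classic mbox "From-quoting": a body line that begins
with "From " would look like a new-message delimiter when the mail is stored,
so a '>' is prepended. A hedged sketch of that escaping (assumed mboxo-style
behavior, not the exact code public-inbox runs):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch of mbox "From " escaping: lines starting with "From " are
 * prefixed with '>' so they cannot be confused with the mbox
 * message-separator line.
 */
static void mbox_escape_line(const char *line, char *out, size_t outlen)
{
	if (strncmp(line, "From ", 5) == 0)
		snprintf(out, outlen, ">%s", line);
	else
		snprintf(out, outlen, "%s", line);
}
```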

> > steps as user-created regions, but instead of programming the decoders
> > it validates that the driver would have written the same values to the
> > decoders as were already present.
> > 
> > Tested-by: Fan Ni <fan.ni@samsung.com>
> > Link: https://lore.kernel.org/r/167564540972.847146.17096178433176097831.stgit@dwillia2-xfh.jf.intel.com
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> 
> A few trivial things inline and this being complex code I'm not
> as confident about it as the rest of the series but with that in mind
> and the fact I didn't find anything that looked broken...
> 
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> ...
> 
> 
> 
> > +
> > +static int cxl_region_sort_targets(struct cxl_region *cxlr)
> > +{
> > +	struct cxl_region_params *p = &cxlr->params;
> > +	int i, rc = 0;
> > +
> > +	sort(p->targets, p->nr_targets, sizeof(p->targets[0]), cmp_decode_pos,
> > +	     NULL);
> > +
> > +	for (i = 0; i < p->nr_targets; i++) {
> > +		struct cxl_endpoint_decoder *cxled = p->targets[i];
> > +
> > +		if (cxled->pos < 0)
> > +			rc = -ENXIO;
> 
> If it makes sense to carry on after pos < 0 I'd like to see a comment here
> on why.  If not, nicer to have a separate dev_dbg() for the failed case and
> a direct return here.

Ok, I'll add:

/*
 * Record that sorting failed, but still continue to restore cxled->pos
 * with its ->targets[] position so that follow-on code paths can reliably
 * do p->targets[cxled->pos] to self-reference their entry.
 */
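The "record the failure, but still fix up ->pos" pattern that comment
describes can be sketched in userspace with qsort(). Names here are
hypothetical stand-ins for the kernel structures, and -1 stands in for
-ENXIO:

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace sketch of the "record failure, still fix up ->pos" pattern. */
struct mock_decoder {
	int pos;	/* -1 when the relative position was unresolved */
};

static int cmp_pos(const void *a, const void *b)
{
	const struct mock_decoder *da = *(const struct mock_decoder **)a;
	const struct mock_decoder *db = *(const struct mock_decoder **)b;

	return da->pos - db->pos;
}

static int sort_targets(struct mock_decoder **targets, int nr)
{
	int i, rc = 0;

	qsort(targets, nr, sizeof(targets[0]), cmp_pos);
	for (i = 0; i < nr; i++) {
		/* record that sorting failed... */
		if (targets[i]->pos < 0)
			rc = -1;
		/* ...but keep ->pos consistent with targets[] anyway */
		targets[i]->pos = i;
	}
	return rc;
}
```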

> 
> > +		cxled->pos = i;
> > +	}
> > +
> > +	dev_dbg(&cxlr->dev, "region sort %s\n", rc ? "failed" : "successful");
> > +	return rc;
> > +}
> > +
> 
> > +
> > +int cxl_add_to_region(struct cxl_port *root, struct cxl_endpoint_decoder *cxled)
> > +{
> > +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > +	struct range *hpa = &cxled->cxld.hpa_range;
> > +	struct cxl_decoder *cxld = &cxled->cxld;
> > +	struct cxl_root_decoder *cxlrd;
> > +	struct cxl_region_params *p;
> > +	struct cxl_region *cxlr;
> > +	bool attach = false;
> > +	struct device *dev;
> > +	int rc;
> > +
> > +	dev = device_find_child(&root->dev, &cxld->hpa_range,
> > +				match_decoder_by_range);
> > +	if (!dev) {
> > +		dev_err(cxlmd->dev.parent,
> > +			"%s:%s no CXL window for range %#llx:%#llx\n",
> > +			dev_name(&cxlmd->dev), dev_name(&cxld->dev),
> > +			cxld->hpa_range.start, cxld->hpa_range.end);
> > +		return -ENXIO;
> > +	}
> > +
> > +	cxlrd = to_cxl_root_decoder(dev);
> > +
> > +	/*
> > +	 * Ensure that if multiple threads race to construct_region() for @hpa
> > +	 * one does the construction and the others add to that.
> > +	 */
> > +	mutex_lock(&cxlrd->range_lock);
> > +	dev = device_find_child(&cxlrd->cxlsd.cxld.dev, hpa,
> > +				match_region_by_range);
> > +	if (!dev)
> > +		cxlr = construct_region(cxlrd, cxled);
> > +	else
> > +		cxlr = to_cxl_region(dev);
> > +	mutex_unlock(&cxlrd->range_lock);
> > +
> > +	if (IS_ERR(cxlr)) {
> > +		rc = PTR_ERR(cxlr);
> > +		goto out;
> > +	}
> > +
> > +	attach_target(cxlr, cxled, -1, TASK_UNINTERRUPTIBLE);
> > +
> > +	down_read(&cxl_region_rwsem);
> > +	p = &cxlr->params;
> > +	attach = p->state == CXL_CONFIG_COMMIT;
> > +	up_read(&cxl_region_rwsem);
> > +
> > +	if (attach) {
> > +		int rc = device_attach(&cxlr->dev);
> 
> Shadowing int rc isn't great for readability. Just call it rc2 or something :)
> Or given you don't make use of the value...

0day did not like this either...

> 
> 		/*
> 		 * If device_attach() fails the range may still be active via
> 		 * the platform-firmware memory map, otherwise the driver for
> 		 * regions is local to this file, so driver matching can't fail
> +                * and hence device_attach() cannot return 1.
> 
> //very much not obvious otherwise to anyone who isn't far too familiar with device_attach()

Hence the comment? Not sure what else can be said here about why
device_attach() < 0 is a sufficient check.
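For readers less familiar with device_attach(): it returns a negative errno
on probe failure, 0 when no driver matched, and 1 when a driver bound.
Because the region driver is registered in the same file, matching cannot
fail here, which is why checking only for a negative return suffices. A
userspace sketch of that contract (not the real driver-core code):

```c
#include <assert.h>

/*
 * Sketch of the device_attach() return contract: negative errno on
 * probe failure, 0 when no driver matched, 1 when a driver bound.
 */
#define MOCK_ENXIO (-6)

static int mock_device_attach(int have_matching_driver, int probe_rc)
{
	if (!have_matching_driver)
		return 0;		/* no match: not an error */
	if (probe_rc < 0)
		return probe_rc;	/* probe failed */
	return 1;			/* driver bound */
}
```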

> 
> 		 */
> 		if (device_attach(&cxlr->dev) < 0)
> 			dev_err()
> > +
> > +		/*
> > +		 * If device_attach() fails the range may still be active via
> > +		 * the platform-firmware memory map, otherwise the driver for
> > +		 * regions is local to this file, so driver matching can't fail.
> > +		 */
> > +		if (rc < 0)
> > +			dev_err(&cxlr->dev, "failed to enable, range: %pr\n",
> > +				p->res);
> > +	}
> > +
> > +	put_device(&cxlr->dev);
> > +out:
> > +	put_device(&cxlrd->cxlsd.cxld.dev);
> 
> Moderately horrible.  Maybe just keep an extra local variable around for the first
> use of struct device *dev?  or maybe add a put_cxl_root_decoder() helper?
> 
> There are lots of other deep structure access like this I guess, so I don't mind
> if you just leave this as yet another one.

Yeah, it's difficult to have symmetry here, but I think I'll switch to
using an @cxlrd_dev variable to better match the get with the put.

> 
> 
> > +	return rc;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_add_to_region, CXL);
> 
> ...
> 
> > diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
> > index a8d46a67b45e..d88518836c2d 100644
> > --- a/drivers/cxl/port.c
> > +++ b/drivers/cxl/port.c
> > @@ -30,6 +30,34 @@ static void schedule_detach(void *cxlmd)
> >  	schedule_cxl_memdev_detach(cxlmd);
> >  }
> >  
> > +static int discover_region(struct device *dev, void *root)
> > +{
> > +	struct cxl_endpoint_decoder *cxled;
> > +	int rc;
> > +
> > +	if (!is_endpoint_decoder(dev))
> > +		return 0;
> > +
> > +	cxled = to_cxl_endpoint_decoder(dev);
> > +	if ((cxled->cxld.flags & CXL_DECODER_F_ENABLE) == 0)
> > +		return 0;
> > +
> > +	if (cxled->state != CXL_DECODER_STATE_AUTO)
> > +		return 0;
> > +
> > +	/*
> > +	 * Region enumeration is opportunistic, if this add-event fails,
> > +	 * continue to the next endpoint decoder.
> > +	 */
> > +	rc = cxl_add_to_region(root, cxled);
> > +	if (rc)
> > +		dev_dbg(dev, "failed to add to region: %#llx-%#llx\n",
> > +			cxled->cxld.hpa_range.start, cxled->cxld.hpa_range.end);
> > +
> > +	return 0;
> > +}
> > +
> > +
> 
> Two blank lines?

Just stashing this here so I can introduce a spurious whitespace removal
in the next patch. Will clean up.


* Re: [PATCH v2 13/20] cxl/region: Add region autodiscovery
  2023-02-10 18:09   ` Jonathan Cameron
  2023-02-10 21:35     ` Dan Williams
@ 2023-02-10 21:49     ` Dan Williams
  1 sibling, 0 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-10 21:49 UTC (permalink / raw)
  To: Jonathan Cameron, Dan Williams
  Cc: linux-cxl, Fan Ni, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

Jonathan Cameron wrote:
> On Fri, 10 Feb 2023 01:06:39 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Region autodiscovery is an asynchronous state machine advanced by
> > cxl_port_probe(). After the decoders on an endpoint port are enumerated
> > they are scanned for actively enabled instances. Each active decoder is
> > flagged for auto-assembly CXL_DECODER_F_AUTO and attached to a region.
> > If a region does not already exist for the address range setting of the
> > decoder one is created. That creation process may race with other
> > decoders of the same region being discovered since cxl_port_probe() is
> > asynchronous. A new 'struct cxl_root_decoder' lock, @range_lock, is
> > introduced to mitigate that race.
> > 
> > Once all decoders have arrived, "p->nr_targets == p->interleave_ways",
> > they are sorted by their relative decode position. The sort algorithm
> > involves finding the point in the cxl_port topology where one leg of the
> > decode leads to deviceA and the other deviceB. At that point in the
> > topology the target order in the 'struct cxl_switch_decoder' indicates
> > the relative position of those endpoint decoders in the region.
> > 
> > >From that point the region goes through the same setup and validation 
> Why the >? 
> > steps as user-created regions, but instead of programming the decoders
> > it validates that the driver would have written the same values to the
> > decoders as were already present.
> > 
> > Tested-by: Fan Ni <fan.ni@samsung.com>
> > Link: https://lore.kernel.org/r/167564540972.847146.17096178433176097831.stgit@dwillia2-xfh.jf.intel.com
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> 
> A few trivial things inline and this being complex code I'm not
> as confident about it as the rest of the series but with that in mind
> and the fact I didn't find anything that looked broken...
> 
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Folded the following:

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 3f6453da2c51..1580170d5bdb 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2479,16 +2479,16 @@ int cxl_add_to_region(struct cxl_port *root, struct cxl_endpoint_decoder *cxled)
 	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
 	struct range *hpa = &cxled->cxld.hpa_range;
 	struct cxl_decoder *cxld = &cxled->cxld;
+	struct device *cxlrd_dev, *region_dev;
 	struct cxl_root_decoder *cxlrd;
 	struct cxl_region_params *p;
 	struct cxl_region *cxlr;
 	bool attach = false;
-	struct device *dev;
 	int rc;
 
-	dev = device_find_child(&root->dev, &cxld->hpa_range,
-				match_decoder_by_range);
-	if (!dev) {
+	cxlrd_dev = device_find_child(&root->dev, &cxld->hpa_range,
+				      match_decoder_by_range);
+	if (!cxlrd_dev) {
 		dev_err(cxlmd->dev.parent,
 			"%s:%s no CXL window for range %#llx:%#llx\n",
 			dev_name(&cxlmd->dev), dev_name(&cxld->dev),
@@ -2496,19 +2496,20 @@ int cxl_add_to_region(struct cxl_port *root, struct cxl_endpoint_decoder *cxled)
 		return -ENXIO;
 	}
 
-	cxlrd = to_cxl_root_decoder(dev);
+	cxlrd = to_cxl_root_decoder(cxlrd_dev);
 
 	/*
 	 * Ensure that if multiple threads race to construct_region() for @hpa
 	 * one does the construction and the others add to that.
 	 */
 	mutex_lock(&cxlrd->range_lock);
-	dev = device_find_child(&cxlrd->cxlsd.cxld.dev, hpa,
-				match_region_by_range);
-	if (!dev)
+	region_dev = device_find_child(&cxlrd->cxlsd.cxld.dev, hpa,
+				       match_region_by_range);
+	if (!region_dev) {
 		cxlr = construct_region(cxlrd, cxled);
-	else
-		cxlr = to_cxl_region(dev);
+		region_dev = &cxlr->dev;
+	} else
+		cxlr = to_cxl_region(region_dev);
 	mutex_unlock(&cxlrd->range_lock);
 
 	if (IS_ERR(cxlr)) {
@@ -2524,21 +2525,19 @@ int cxl_add_to_region(struct cxl_port *root, struct cxl_endpoint_decoder *cxled)
 	up_read(&cxl_region_rwsem);
 
 	if (attach) {
-		int rc = device_attach(&cxlr->dev);
-
 		/*
 		 * If device_attach() fails the range may still be active via
 		 * the platform-firmware memory map, otherwise the driver for
 		 * regions is local to this file, so driver matching can't fail.
 		 */
-		if (rc < 0)
+		if (device_attach(&cxlr->dev) < 0)
 			dev_err(&cxlr->dev, "failed to enable, range: %pr\n",
 				p->res);
 	}
 
-	put_device(&cxlr->dev);
+	put_device(region_dev);
 out:
-	put_device(&cxlrd->cxlsd.cxld.dev);
+	put_device(cxlrd_dev);
 	return rc;
 }
 EXPORT_SYMBOL_NS_GPL(cxl_add_to_region, CXL);
diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
index d88518836c2d..d6c151dabaa7 100644
--- a/drivers/cxl/port.c
+++ b/drivers/cxl/port.c
@@ -57,7 +57,6 @@ static int discover_region(struct device *dev, void *root)
 	return 0;
 }
 
-
 static int cxl_switch_port_probe(struct cxl_port *port)
 {
 	struct cxl_hdm *cxlhdm;


* Re: [PATCH v2 15/20] dax/hmem: Move HMAT and Soft reservation probe initcall level
  2023-02-10  9:06 ` [PATCH v2 15/20] dax/hmem: Move HMAT and Soft reservation probe initcall level Dan Williams
@ 2023-02-10 21:53   ` Dave Jiang
  2023-02-10 21:57     ` Dave Jiang
  2023-02-11  0:40   ` Verma, Vishal L
  1 sibling, 1 reply; 65+ messages in thread
From: Dave Jiang @ 2023-02-10 21:53 UTC (permalink / raw)
  To: Dan Williams, linux-cxl
  Cc: Fan Ni, vishal.l.verma, dave.hansen, linux-mm, linux-acpi



On 2/10/23 2:06 AM, Dan Williams wrote:
> In preparation for moving more filtering of "hmem" ranges into the
> dax_hmem.ko module, update the initcall levels. HMAT range registration
> moves to subsys_initcall() to be done before Soft Reservation probing,
> and Soft Reservation probing is moved to device_initcall() to be done
> before dax_hmem.ko initialization if it is built-in.
> 
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564542109.847146.10113972881782419363.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>   drivers/acpi/numa/hmat.c  |    2 +-
>   drivers/dax/hmem/Makefile |    3 ++-
>   drivers/dax/hmem/device.c |    2 +-
>   3 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index 605a0c7053be..ff24282301ab 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -869,4 +869,4 @@ static __init int hmat_init(void)
>   	acpi_put_table(tbl);
>   	return 0;
>   }
> -device_initcall(hmat_init);
> +subsys_initcall(hmat_init);
> diff --git a/drivers/dax/hmem/Makefile b/drivers/dax/hmem/Makefile
> index 57377b4c3d47..d4c4cd6bccd7 100644
> --- a/drivers/dax/hmem/Makefile
> +++ b/drivers/dax/hmem/Makefile
> @@ -1,6 +1,7 @@
>   # SPDX-License-Identifier: GPL-2.0
> -obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
> +# device_hmem.o deliberately precedes dax_hmem.o for initcall ordering
>   obj-$(CONFIG_DEV_DAX_HMEM_DEVICES) += device_hmem.o
> +obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
>   
>   device_hmem-y := device.o
>   dax_hmem-y := hmem.o
> diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
> index 903325aac991..20749c7fab81 100644
> --- a/drivers/dax/hmem/device.c
> +++ b/drivers/dax/hmem/device.c
> @@ -104,4 +104,4 @@ static __init int hmem_init(void)
>    * As this is a fallback for address ranges unclaimed by the ACPI HMAT
>    * parsing it must be at an initcall level greater than hmat_init().
>    */
> -late_initcall(hmem_init);
> +device_initcall(hmem_init);
> 
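The initcall ordering this patch relies on, subsys_initcall() before
device_initcall() before late_initcall(), with link order deciding within a
level (hence the Makefile comment), can be sketched in userspace. This is an
assumed simplification of the kernel's do_initcalls() loop, with hypothetical
stand-in functions for hmat_init(), hmem_init(), and the dax_hmem init:

```c
#include <assert.h>

/* Sketch of level-ordered initcalls: lower levels run first,
 * and within a level, registration (link) order decides. */
enum level { SUBSYS, DEVICE, LATE, NR_LEVELS };

struct initcall {
	enum level level;
	void (*fn)(void);
};

static char order[4];
static int order_idx;

static void mock_hmat_init(void) { order[order_idx++] = 'h'; }
static void mock_hmem_init(void) { order[order_idx++] = 'm'; }
static void mock_dax_hmem_init(void) { order[order_idx++] = 'd'; }

static void run_initcalls(struct initcall *calls, int nr)
{
	for (int lvl = 0; lvl < NR_LEVELS; lvl++)
		for (int i = 0; i < nr; i++)
			if (calls[i].level == lvl)
				calls[i].fn();
}
```

With hmat at subsys level and both hmem and a built-in dax_hmem at device
level, hmem still precedes dax_hmem because of link order.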


* Re: [PATCH v2 15/20] dax/hmem: Move HMAT and Soft reservation probe initcall level
  2023-02-10 21:53   ` Dave Jiang
@ 2023-02-10 21:57     ` Dave Jiang
  0 siblings, 0 replies; 65+ messages in thread
From: Dave Jiang @ 2023-02-10 21:57 UTC (permalink / raw)
  To: Dan Williams, linux-cxl
  Cc: Fan Ni, vishal.l.verma, dave.hansen, linux-mm, linux-acpi



On 2/10/23 2:53 PM, Dave Jiang wrote:
> 
> 
> On 2/10/23 2:06 AM, Dan Williams wrote:
>> In preparation for moving more filtering of "hmem" ranges into the
>> dax_hmem.ko module, update the initcall levels. HMAT range registration
>> moves to subsys_initcall() to be done before Soft Reservation probing,
>> and Soft Reservation probing is moved to device_initcall() to be done
>> before dax_hmem.ko initialization if it is built-in.
>>
>> Tested-by: Fan Ni <fan.ni@samsung.com>
>> Link: https://lore.kernel.org/r/167564542109.847146.10113972881782419363.stgit@dwillia2-xfh.jf.intel.com
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> 
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> 
>> ---
>>   drivers/acpi/numa/hmat.c  |    2 +-
>>   drivers/dax/hmem/Makefile |    3 ++-
>>   drivers/dax/hmem/device.c |    2 +-
>>   3 files changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
>> index 605a0c7053be..ff24282301ab 100644
>> --- a/drivers/acpi/numa/hmat.c
>> +++ b/drivers/acpi/numa/hmat.c
>> @@ -869,4 +869,4 @@ static __init int hmat_init(void)
>>       acpi_put_table(tbl);
>>       return 0;
>>   }
>> -device_initcall(hmat_init);
>> +subsys_initcall(hmat_init);
>> diff --git a/drivers/dax/hmem/Makefile b/drivers/dax/hmem/Makefile
>> index 57377b4c3d47..d4c4cd6bccd7 100644
>> --- a/drivers/dax/hmem/Makefile
>> +++ b/drivers/dax/hmem/Makefile
>> @@ -1,6 +1,7 @@
>>   # SPDX-License-Identifier: GPL-2.0
>> -obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
>> +# device_hmem.o deliberately precedes dax_hmem.o for initcall ordering
>>   obj-$(CONFIG_DEV_DAX_HMEM_DEVICES) += device_hmem.o
>> +obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
>>   device_hmem-y := device.o
>>   dax_hmem-y := hmem.o
>> diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
>> index 903325aac991..20749c7fab81 100644
>> --- a/drivers/dax/hmem/device.c
>> +++ b/drivers/dax/hmem/device.c
>> @@ -104,4 +104,4 @@ static __init int hmem_init(void)
>>    * As this is a fallback for address ranges unclaimed by the ACPI HMAT
>>    * parsing it must be at an initcall level greater than hmat_init().
>>    */
>> -late_initcall(hmem_init);
>> +device_initcall(hmem_init);
>>


* Re: [PATCH v2 16/20] dax/hmem: Drop unnecessary dax_hmem_remove()
  2023-02-10  9:06 ` [PATCH v2 16/20] dax/hmem: Drop unnecessary dax_hmem_remove() Dan Williams
@ 2023-02-10 21:59   ` Dave Jiang
  2023-02-11  0:41   ` Verma, Vishal L
  1 sibling, 0 replies; 65+ messages in thread
From: Dave Jiang @ 2023-02-10 21:59 UTC (permalink / raw)
  To: Dan Williams, linux-cxl
  Cc: Jonathan Cameron, Gregory Price, Fan Ni, vishal.l.verma,
	dave.hansen, linux-mm, linux-acpi



On 2/10/23 2:06 AM, Dan Williams wrote:
> Empty driver remove callbacks can just be elided.
> 
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Gregory Price <gregory.price@memverge.com>
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564542679.847146.17174404738816053065.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>   drivers/dax/hmem/hmem.c |    7 -------
>   1 file changed, 7 deletions(-)
> 
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index 1bf040dbc834..c7351e0dc8ff 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -44,15 +44,8 @@ static int dax_hmem_probe(struct platform_device *pdev)
>   	return 0;
>   }
>   
> -static int dax_hmem_remove(struct platform_device *pdev)
> -{
> -	/* devm handles teardown */
> -	return 0;
> -}
> -
>   static struct platform_driver dax_hmem_driver = {
>   	.probe = dax_hmem_probe,
> -	.remove = dax_hmem_remove,
>   	.driver = {
>   		.name = "hmem",
>   	},
> 


* Re: [PATCH v2 17/20] dax/hmem: Convey the dax range via memregion_info()
  2023-02-10  9:07 ` [PATCH v2 17/20] dax/hmem: Convey the dax range via memregion_info() Dan Williams
@ 2023-02-10 22:03   ` Dave Jiang
  2023-02-11  4:25   ` Verma, Vishal L
  1 sibling, 0 replies; 65+ messages in thread
From: Dave Jiang @ 2023-02-10 22:03 UTC (permalink / raw)
  To: Dan Williams, linux-cxl
  Cc: Jonathan Cameron, Fan Ni, vishal.l.verma, dave.hansen, linux-mm,
	linux-acpi



On 2/10/23 2:07 AM, Dan Williams wrote:
> In preparation for hmem platform devices to be unregistered, stop using
> platform_device_add_resources() to convey the address range. The
> platform_device_add_resources() API causes an existing "Soft Reserved"
> iomem resource to be re-parented under an inserted platform device
> resource. When that platform device is deleted it removes the platform
> device resource and all children.
> 
> Instead, it is sufficient to convey just the address range and let
> request_mem_region() insert resources to indicate the devices active in
> the range. This allows the "Soft Reserved" resource to be re-enumerated
> upon the next probe event.
> 
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564543303.847146.11045895213318648441.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>   drivers/dax/hmem/device.c |   37 ++++++++++++++-----------------------
>   drivers/dax/hmem/hmem.c   |   14 +++-----------
>   include/linux/memregion.h |    2 ++
>   3 files changed, 19 insertions(+), 34 deletions(-)
> 
> diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
> index 20749c7fab81..b1b339bccfe5 100644
> --- a/drivers/dax/hmem/device.c
> +++ b/drivers/dax/hmem/device.c
> @@ -15,15 +15,8 @@ static struct resource hmem_active = {
>   	.flags = IORESOURCE_MEM,
>   };
>   
> -void hmem_register_device(int target_nid, struct resource *r)
> +void hmem_register_device(int target_nid, struct resource *res)
>   {
> -	/* define a clean / non-busy resource for the platform device */
> -	struct resource res = {
> -		.start = r->start,
> -		.end = r->end,
> -		.flags = IORESOURCE_MEM,
> -		.desc = IORES_DESC_SOFT_RESERVED,
> -	};
>   	struct platform_device *pdev;
>   	struct memregion_info info;
>   	int rc, id;
> @@ -31,55 +24,53 @@ void hmem_register_device(int target_nid, struct resource *r)
>   	if (nohmem)
>   		return;
>   
> -	rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
> -			IORES_DESC_SOFT_RESERVED);
> +	rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> +			       IORES_DESC_SOFT_RESERVED);
>   	if (rc != REGION_INTERSECTS)
>   		return;
>   
>   	id = memregion_alloc(GFP_KERNEL);
>   	if (id < 0) {
> -		pr_err("memregion allocation failure for %pr\n", &res);
> +		pr_err("memregion allocation failure for %pr\n", res);
>   		return;
>   	}
>   
>   	pdev = platform_device_alloc("hmem", id);
>   	if (!pdev) {
> -		pr_err("hmem device allocation failure for %pr\n", &res);
> +		pr_err("hmem device allocation failure for %pr\n", res);
>   		goto out_pdev;
>   	}
>   
> -	if (!__request_region(&hmem_active, res.start, resource_size(&res),
> +	if (!__request_region(&hmem_active, res->start, resource_size(res),
>   			      dev_name(&pdev->dev), 0)) {
> -		dev_dbg(&pdev->dev, "hmem range %pr already active\n", &res);
> +		dev_dbg(&pdev->dev, "hmem range %pr already active\n", res);
>   		goto out_active;
>   	}
>   
>   	pdev->dev.numa_node = numa_map_to_online_node(target_nid);
>   	info = (struct memregion_info) {
>   		.target_node = target_nid,
> +		.range = {
> +			.start = res->start,
> +			.end = res->end,
> +		},
>   	};
>   	rc = platform_device_add_data(pdev, &info, sizeof(info));
>   	if (rc < 0) {
> -		pr_err("hmem memregion_info allocation failure for %pr\n", &res);
> -		goto out_resource;
> -	}
> -
> -	rc = platform_device_add_resources(pdev, &res, 1);
> -	if (rc < 0) {
> -		pr_err("hmem resource allocation failure for %pr\n", &res);
> +		pr_err("hmem memregion_info allocation failure for %pr\n", res);
>   		goto out_resource;
>   	}
>   
>   	rc = platform_device_add(pdev);
>   	if (rc < 0) {
> -		dev_err(&pdev->dev, "device add failed for %pr\n", &res);
> +		dev_err(&pdev->dev, "device add failed for %pr\n", res);
>   		goto out_resource;
>   	}
>   
>   	return;
>   
>   out_resource:
> -	__release_region(&hmem_active, res.start, resource_size(&res));
> +	__release_region(&hmem_active, res->start, resource_size(res));
>   out_active:
>   	platform_device_put(pdev);
>   out_pdev:
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index c7351e0dc8ff..5025a8c9850b 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -15,25 +15,17 @@ static int dax_hmem_probe(struct platform_device *pdev)
>   	struct memregion_info *mri;
>   	struct dev_dax_data data;
>   	struct dev_dax *dev_dax;
> -	struct resource *res;
> -	struct range range;
> -
> -	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> -	if (!res)
> -		return -ENOMEM;
>   
>   	mri = dev->platform_data;
> -	range.start = res->start;
> -	range.end = res->end;
> -	dax_region = alloc_dax_region(dev, pdev->id, &range, mri->target_node,
> -			PMD_SIZE, 0);
> +	dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
> +				      mri->target_node, PMD_SIZE, 0);
>   	if (!dax_region)
>   		return -ENOMEM;
>   
>   	data = (struct dev_dax_data) {
>   		.dax_region = dax_region,
>   		.id = -1,
> -		.size = region_idle ? 0 : resource_size(res),
> +		.size = region_idle ? 0 : range_len(&mri->range),
>   	};
>   	dev_dax = devm_create_dev_dax(&data);
>   	if (IS_ERR(dev_dax))
> diff --git a/include/linux/memregion.h b/include/linux/memregion.h
> index bf83363807ac..c01321467789 100644
> --- a/include/linux/memregion.h
> +++ b/include/linux/memregion.h
> @@ -3,10 +3,12 @@
>   #define _MEMREGION_H_
>   #include <linux/types.h>
>   #include <linux/errno.h>
> +#include <linux/range.h>
>   #include <linux/bug.h>
>   
>   struct memregion_info {
>   	int target_node;
> +	struct range range;
>   };
>   
>   #ifdef CONFIG_MEMREGION
> 


* Re: [PATCH v2 18/20] dax/hmem: Move hmem device registration to dax_hmem.ko
  2023-02-10  9:07 ` [PATCH v2 18/20] dax/hmem: Move hmem device registration to dax_hmem.ko Dan Williams
  2023-02-10 18:25   ` Jonathan Cameron
@ 2023-02-10 22:09   ` Dave Jiang
  2023-02-11  4:41   ` Verma, Vishal L
  2 siblings, 0 replies; 65+ messages in thread
From: Dave Jiang @ 2023-02-10 22:09 UTC (permalink / raw)
  To: Dan Williams, linux-cxl
  Cc: Fan Ni, vishal.l.verma, dave.hansen, linux-mm, linux-acpi



On 2/10/23 2:07 AM, Dan Williams wrote:
> In preparation for the CXL region driver to take over the responsibility
> of registering device-dax instances for CXL regions, move the
> registration of "hmem" devices to dax_hmem.ko.
> 
> Previously the builtin component of this enabling
> (drivers/dax/hmem/device.o) would register platform devices for each
> address range and trigger the dax_hmem.ko module to load and attach
> device-dax instances to those devices. Now, the ranges are collected
> from the HMAT and EFI memory map walking, but the device creation is
> deferred. A new "hmem_platform" device is created which triggers
> dax_hmem.ko to load and register the platform devices.
> 
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564543923.847146.9030380223622044744.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>   drivers/acpi/numa/hmat.c  |    2 -
>   drivers/dax/Kconfig       |    2 -
>   drivers/dax/hmem/device.c |   91 +++++++++++++++++++--------------------
>   drivers/dax/hmem/hmem.c   |  105 +++++++++++++++++++++++++++++++++++++++++++++
>   include/linux/dax.h       |    7 ++-
>   5 files changed, 155 insertions(+), 52 deletions(-)
> 
> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index ff24282301ab..bba268ecd802 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -718,7 +718,7 @@ static void hmat_register_target_devices(struct memory_target *target)
>   	for (res = target->memregions.child; res; res = res->sibling) {
>   		int target_nid = pxm_to_node(target->memory_pxm);
>   
> -		hmem_register_device(target_nid, res);
> +		hmem_register_resource(target_nid, res);
>   	}
>   }
>   
> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> index 5fdf269a822e..d13c889c2a64 100644
> --- a/drivers/dax/Kconfig
> +++ b/drivers/dax/Kconfig
> @@ -46,7 +46,7 @@ config DEV_DAX_HMEM
>   	  Say M if unsure.
>   
>   config DEV_DAX_HMEM_DEVICES
> -	depends on DEV_DAX_HMEM && DAX=y
> +	depends on DEV_DAX_HMEM && DAX
>   	def_bool y
>   
>   config DEV_DAX_KMEM
> diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
> index b1b339bccfe5..f9e1a76a04a9 100644
> --- a/drivers/dax/hmem/device.c
> +++ b/drivers/dax/hmem/device.c
> @@ -8,6 +8,8 @@
>   static bool nohmem;
>   module_param_named(disable, nohmem, bool, 0444);
>   
> +static bool platform_initialized;
> +static DEFINE_MUTEX(hmem_resource_lock);
>   static struct resource hmem_active = {
>   	.name = "HMEM devices",
>   	.start = 0,
> @@ -15,71 +17,66 @@ static struct resource hmem_active = {
>   	.flags = IORESOURCE_MEM,
>   };
>   
> -void hmem_register_device(int target_nid, struct resource *res)
> +int walk_hmem_resources(struct device *host, walk_hmem_fn fn)
> +{
> +	struct resource *res;
> +	int rc = 0;
> +
> +	mutex_lock(&hmem_resource_lock);
> +	for (res = hmem_active.child; res; res = res->sibling) {
> +		rc = fn(host, (int) res->desc, res);
> +		if (rc)
> +			break;
> +	}
> +	mutex_unlock(&hmem_resource_lock);
> +	return rc;
> +}
> +EXPORT_SYMBOL_GPL(walk_hmem_resources);
> +
> +static void __hmem_register_resource(int target_nid, struct resource *res)
>   {
>   	struct platform_device *pdev;
> -	struct memregion_info info;
> -	int rc, id;
> +	struct resource *new;
> +	int rc;
>   
> -	if (nohmem)
> +	new = __request_region(&hmem_active, res->start, resource_size(res), "",
> +			       0);
> +	if (!new) {
> +		pr_debug("hmem range %pr already active\n", res);
>   		return;
> +	}
>   
> -	rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> -			       IORES_DESC_SOFT_RESERVED);
> -	if (rc != REGION_INTERSECTS)
> -		return;
> +	new->desc = target_nid;
>   
> -	id = memregion_alloc(GFP_KERNEL);
> -	if (id < 0) {
> -		pr_err("memregion allocation failure for %pr\n", res);
> +	if (platform_initialized)
>   		return;
> -	}
>   
> -	pdev = platform_device_alloc("hmem", id);
> +	pdev = platform_device_alloc("hmem_platform", 0);
>   	if (!pdev) {
> -		pr_err("hmem device allocation failure for %pr\n", res);
> -		goto out_pdev;
> -	}
> -
> -	if (!__request_region(&hmem_active, res->start, resource_size(res),
> -			      dev_name(&pdev->dev), 0)) {
> -		dev_dbg(&pdev->dev, "hmem range %pr already active\n", res);
> -		goto out_active;
> -	}
> -
> -	pdev->dev.numa_node = numa_map_to_online_node(target_nid);
> -	info = (struct memregion_info) {
> -		.target_node = target_nid,
> -		.range = {
> -			.start = res->start,
> -			.end = res->end,
> -		},
> -	};
> -	rc = platform_device_add_data(pdev, &info, sizeof(info));
> -	if (rc < 0) {
> -		pr_err("hmem memregion_info allocation failure for %pr\n", res);
> -		goto out_resource;
> +		pr_err_once("failed to register device-dax hmem_platform device\n");
> +		return;
>   	}
>   
>   	rc = platform_device_add(pdev);
> -	if (rc < 0) {
> -		dev_err(&pdev->dev, "device add failed for %pr\n", res);
> -		goto out_resource;
> -	}
> +	if (rc)
> +		platform_device_put(pdev);
> +	else
> +		platform_initialized = true;
> +}
>   
> -	return;
> +void hmem_register_resource(int target_nid, struct resource *res)
> +{
> +	if (nohmem)
> +		return;
>   
> -out_resource:
> -	__release_region(&hmem_active, res->start, resource_size(res));
> -out_active:
> -	platform_device_put(pdev);
> -out_pdev:
> -	memregion_free(id);
> +	mutex_lock(&hmem_resource_lock);
> +	__hmem_register_resource(target_nid, res);
> +	mutex_unlock(&hmem_resource_lock);
>   }
>   
>   static __init int hmem_register_one(struct resource *res, void *data)
>   {
> -	hmem_register_device(phys_to_target_node(res->start), res);
> +	hmem_register_resource(phys_to_target_node(res->start), res);
>   
>   	return 0;
>   }
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index 5025a8c9850b..e7bdff3132fa 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -3,6 +3,7 @@
>   #include <linux/memregion.h>
>   #include <linux/module.h>
>   #include <linux/pfn_t.h>
> +#include <linux/dax.h>
>   #include "../bus.h"
>   
>   static bool region_idle;
> @@ -43,8 +44,110 @@ static struct platform_driver dax_hmem_driver = {
>   	},
>   };
>   
> -module_platform_driver(dax_hmem_driver);
> +static void release_memregion(void *data)
> +{
> +	memregion_free((long) data);
> +}
> +
> +static void release_hmem(void *pdev)
> +{
> +	platform_device_unregister(pdev);
> +}
> +
> +static int hmem_register_device(struct device *host, int target_nid,
> +				const struct resource *res)
> +{
> +	struct platform_device *pdev;
> +	struct memregion_info info;
> +	long id;
> +	int rc;
> +
> +	rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> +			       IORES_DESC_SOFT_RESERVED);
> +	if (rc != REGION_INTERSECTS)
> +		return 0;
> +
> +	id = memregion_alloc(GFP_KERNEL);
> +	if (id < 0) {
> +		dev_err(host, "memregion allocation failure for %pr\n", res);
> +		return -ENOMEM;
> +	}
> +	rc = devm_add_action_or_reset(host, release_memregion, (void *) id);
> +	if (rc)
> +		return rc;
> +
> +	pdev = platform_device_alloc("hmem", id);
> +	if (!pdev) {
> +		dev_err(host, "device allocation failure for %pr\n", res);
> +		return -ENOMEM;
> +	}
> +
> +	pdev->dev.numa_node = numa_map_to_online_node(target_nid);
> +	info = (struct memregion_info) {
> +		.target_node = target_nid,
> +		.range = {
> +			.start = res->start,
> +			.end = res->end,
> +		},
> +	};
> +	rc = platform_device_add_data(pdev, &info, sizeof(info));
> +	if (rc < 0) {
> +		dev_err(host, "memregion_info allocation failure for %pr\n",
> +		       res);
> +		goto out_put;
> +	}
> +
> +	rc = platform_device_add(pdev);
> +	if (rc < 0) {
> +		dev_err(host, "%s add failed for %pr\n", dev_name(&pdev->dev),
> +			res);
> +		goto out_put;
> +	}
> +
> +	return devm_add_action_or_reset(host, release_hmem, pdev);
> +
> +out_put:
> +	platform_device_put(pdev);
> +	return rc;
> +}
> +
> +static int dax_hmem_platform_probe(struct platform_device *pdev)
> +{
> +	return walk_hmem_resources(&pdev->dev, hmem_register_device);
> +}
> +
> +static struct platform_driver dax_hmem_platform_driver = {
> +	.probe = dax_hmem_platform_probe,
> +	.driver = {
> +		.name = "hmem_platform",
> +	},
> +};
> +
> +static __init int dax_hmem_init(void)
> +{
> +	int rc;
> +
> +	rc = platform_driver_register(&dax_hmem_platform_driver);
> +	if (rc)
> +		return rc;
> +
> +	rc = platform_driver_register(&dax_hmem_driver);
> +	if (rc)
> +		platform_driver_unregister(&dax_hmem_platform_driver);
> +
> +	return rc;
> +}
> +
> +static __exit void dax_hmem_exit(void)
> +{
> +	platform_driver_unregister(&dax_hmem_driver);
> +	platform_driver_unregister(&dax_hmem_platform_driver);
> +}
> +
> +module_init(dax_hmem_init);
> +module_exit(dax_hmem_exit);
>   
>   MODULE_ALIAS("platform:hmem*");
> +MODULE_ALIAS("platform:hmem_platform*");
>   MODULE_LICENSE("GPL v2");
>   MODULE_AUTHOR("Intel Corporation");
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 2b5ecb591059..bf6258472e49 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -262,11 +262,14 @@ static inline bool dax_mapping(struct address_space *mapping)
>   }
>   
>   #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
> -void hmem_register_device(int target_nid, struct resource *r);
> +void hmem_register_resource(int target_nid, struct resource *r);
>   #else
> -static inline void hmem_register_device(int target_nid, struct resource *r)
> +static inline void hmem_register_resource(int target_nid, struct resource *r)
>   {
>   }
>   #endif
>   
> +typedef int (*walk_hmem_fn)(struct device *dev, int target_nid,
> +			    const struct resource *res);
> +int walk_hmem_resources(struct device *dev, walk_hmem_fn fn);
>   #endif
> 


* Re: [PATCH v2 19/20] dax: Assign RAM regions to memory-hotplug by default
  2023-02-10  9:07 ` [PATCH v2 19/20] dax: Assign RAM regions to memory-hotplug by default Dan Williams
@ 2023-02-10 22:19   ` Dave Jiang
  2023-02-11  5:57   ` Verma, Vishal L
  1 sibling, 0 replies; 65+ messages in thread
From: Dave Jiang @ 2023-02-10 22:19 UTC (permalink / raw)
  To: Dan Williams, linux-cxl
  Cc: Michal Hocko, David Hildenbrand, Dave Hansen, Gregory Price,
	Fan Ni, vishal.l.verma, linux-mm, linux-acpi



On 2/10/23 2:07 AM, Dan Williams wrote:
> The default mode for device-dax instances is backwards for RAM-regions
> as evidenced by the fact that it tends to catch end users by surprise.
> "Where is my memory?". Recall that platforms are increasingly shipping
> with performance-differentiated memory pools beyond typical DRAM and
> NUMA effects. This includes HBM (high-bandwidth-memory) and CXL (dynamic
> interleave, varied media types, and future fabric attached
> possibilities).
> 
> For this reason the EFI_MEMORY_SP (EFI Special Purpose Memory => Linux
> 'Soft Reserved') attribute is expected to be applied to all memory-pools
> that are not the general purpose pool. This designation gives an
> Operating System a chance to defer usage of a memory pool until later in
> the boot process where its performance properties can be interrogated
> and administrator policy can be applied.
> 
> 'Soft Reserved' memory can be anything from too limited and precious to
> be part of the general purpose pool (HBM), too slow to host hot kernel
> data structures (some PMEM media), or anything in between. However, in
> the absence of an explicit policy, the memory should at least be made
> usable by default. The current device-dax default hides all
> non-general-purpose memory behind a device interface.
> 
> The expectation is that the distribution of users that want the memory
> online by default vs device-dedicated-access by default follows the
> Pareto principle. A small number of enlightened users may want to do
> userspace memory management through a device, but general users just
> want the kernel to make the memory available with an option to get more
> advanced later.
> 
> Arrange for all device-dax instances not backed by PMEM to default to
> attaching to the dax_kmem driver. From there the baseline memory hotplug
> policy (CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE / memhp_default_state=)
> gates whether the memory comes online or stays offline. Where, if it
> stays offline, it can be reliably converted back to device-mode where it
> can be partitioned, or fronted by a userspace allocator.
> 
> So, if someone wants device-dax instances for their 'Soft Reserved'
> memory:
> 
> 1/ Build a kernel with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n or boot
>     with memhp_default_state=offline, or roll the dice and hope that the
>     kernel has not pinned a page in that memory before step 2.
> 
> 2/ Write a udev rule to convert the target dax device(s) from
>     'system-ram' mode to 'devdax' mode:
> 
>     daxctl reconfigure-device $dax -m devdax -f
> 
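The udev rule in step 2/ is not spelled out here; a minimal sketch of one (the rule filename, match keys, and daxctl install path are assumptions for illustration, not taken from the patch) might look like:

```
# /etc/udev/rules.d/90-daxctl-devdax.rules  -- hypothetical name/location.
# Assumes device-dax instances appear with SUBSYSTEM=="dax" and that
# daxctl is installed at /usr/bin/daxctl; narrow the match keys to hit
# only the intended region's devices before deploying anything like this.
ACTION=="add", SUBSYSTEM=="dax", KERNEL=="dax*", \
    RUN+="/usr/bin/daxctl reconfigure-device %k -m devdax -f"
```

Here `%k` expands to the kernel device name (standing in for the `$dax` placeholder in the daxctl invocation quoted above).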
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Gregory Price <gregory.price@memverge.com>
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564544513.847146.4645646177864365755.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>   drivers/dax/Kconfig     |    2 +-
>   drivers/dax/bus.c       |   53 ++++++++++++++++++++---------------------------
>   drivers/dax/bus.h       |   12 +++++++++--
>   drivers/dax/device.c    |    3 +--
>   drivers/dax/hmem/hmem.c |   12 ++++++++++-
>   drivers/dax/kmem.c      |    1 +
>   6 files changed, 46 insertions(+), 37 deletions(-)
> 
> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> index d13c889c2a64..1163eb62e5f6 100644
> --- a/drivers/dax/Kconfig
> +++ b/drivers/dax/Kconfig
> @@ -50,7 +50,7 @@ config DEV_DAX_HMEM_DEVICES
>   	def_bool y
>   
>   config DEV_DAX_KMEM
> -	tristate "KMEM DAX: volatile-use of persistent memory"
> +	tristate "KMEM DAX: map dax-devices as System-RAM"
>   	default DEV_DAX
>   	depends on DEV_DAX
>   	depends on MEMORY_HOTPLUG # for add_memory() and friends
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 1dad813ee4a6..012d576004e9 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -56,6 +56,25 @@ static int dax_match_id(struct dax_device_driver *dax_drv, struct device *dev)
>   	return match;
>   }
>   
> +static int dax_match_type(struct dax_device_driver *dax_drv, struct device *dev)
> +{
> +	enum dax_driver_type type = DAXDRV_DEVICE_TYPE;
> +	struct dev_dax *dev_dax = to_dev_dax(dev);
> +
> +	if (dev_dax->region->res.flags & IORESOURCE_DAX_KMEM)
> +		type = DAXDRV_KMEM_TYPE;
> +
> +	if (dax_drv->type == type)
> +		return 1;
> +
> +	/* default to device mode if dax_kmem is disabled */
> +	if (dax_drv->type == DAXDRV_DEVICE_TYPE &&
> +	    !IS_ENABLED(CONFIG_DEV_DAX_KMEM))
> +		return 1;
> +
> +	return 0;
> +}
> +
>   enum id_action {
>   	ID_REMOVE,
>   	ID_ADD,
> @@ -216,14 +235,9 @@ static int dax_bus_match(struct device *dev, struct device_driver *drv)
>   {
>   	struct dax_device_driver *dax_drv = to_dax_drv(drv);
>   
> -	/*
> -	 * All but the 'device-dax' driver, which has 'match_always'
> -	 * set, requires an exact id match.
> -	 */
> -	if (dax_drv->match_always)
> +	if (dax_match_id(dax_drv, dev))
>   		return 1;
> -
> -	return dax_match_id(dax_drv, dev);
> +	return dax_match_type(dax_drv, dev);
>   }
>   
>   /*
> @@ -1413,13 +1427,10 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
>   }
>   EXPORT_SYMBOL_GPL(devm_create_dev_dax);
>   
> -static int match_always_count;
> -
>   int __dax_driver_register(struct dax_device_driver *dax_drv,
>   		struct module *module, const char *mod_name)
>   {
>   	struct device_driver *drv = &dax_drv->drv;
> -	int rc = 0;
>   
>   	/*
>   	 * dax_bus_probe() calls dax_drv->probe() unconditionally.
> @@ -1434,26 +1445,7 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
>   	drv->mod_name = mod_name;
>   	drv->bus = &dax_bus_type;
>   
> -	/* there can only be one default driver */
> -	mutex_lock(&dax_bus_lock);
> -	match_always_count += dax_drv->match_always;
> -	if (match_always_count > 1) {
> -		match_always_count--;
> -		WARN_ON(1);
> -		rc = -EINVAL;
> -	}
> -	mutex_unlock(&dax_bus_lock);
> -	if (rc)
> -		return rc;
> -
> -	rc = driver_register(drv);
> -	if (rc && dax_drv->match_always) {
> -		mutex_lock(&dax_bus_lock);
> -		match_always_count -= dax_drv->match_always;
> -		mutex_unlock(&dax_bus_lock);
> -	}
> -
> -	return rc;
> +	return driver_register(drv);
>   }
>   EXPORT_SYMBOL_GPL(__dax_driver_register);
>   
> @@ -1463,7 +1455,6 @@ void dax_driver_unregister(struct dax_device_driver *dax_drv)
>   	struct dax_id *dax_id, *_id;
>   
>   	mutex_lock(&dax_bus_lock);
> -	match_always_count -= dax_drv->match_always;
>   	list_for_each_entry_safe(dax_id, _id, &dax_drv->ids, list) {
>   		list_del(&dax_id->list);
>   		kfree(dax_id);
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index fbb940293d6d..8cd79ab34292 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -11,7 +11,10 @@ struct dax_device;
>   struct dax_region;
>   void dax_region_put(struct dax_region *dax_region);
>   
> -#define IORESOURCE_DAX_STATIC (1UL << 0)
> +/* dax bus specific ioresource flags */
> +#define IORESOURCE_DAX_STATIC BIT(0)
> +#define IORESOURCE_DAX_KMEM BIT(1)
> +
>   struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>   		struct range *range, int target_node, unsigned int align,
>   		unsigned long flags);
> @@ -25,10 +28,15 @@ struct dev_dax_data {
>   
>   struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data);
>   
> +enum dax_driver_type {
> +	DAXDRV_KMEM_TYPE,
> +	DAXDRV_DEVICE_TYPE,
> +};
> +
>   struct dax_device_driver {
>   	struct device_driver drv;
>   	struct list_head ids;
> -	int match_always;
> +	enum dax_driver_type type;
>   	int (*probe)(struct dev_dax *dev);
>   	void (*remove)(struct dev_dax *dev);
>   };
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 5494d745ced5..ecdff79e31f2 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -475,8 +475,7 @@ EXPORT_SYMBOL_GPL(dev_dax_probe);
>   
>   static struct dax_device_driver device_dax_driver = {
>   	.probe = dev_dax_probe,
> -	/* all probe actions are unwound by devm, so .remove isn't necessary */
> -	.match_always = 1,
> +	.type = DAXDRV_DEVICE_TYPE,
>   };
>   
>   static int __init dax_init(void)
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index e7bdff3132fa..5ec08f9f8a57 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -11,15 +11,25 @@ module_param_named(region_idle, region_idle, bool, 0644);
>   
>   static int dax_hmem_probe(struct platform_device *pdev)
>   {
> +	unsigned long flags = IORESOURCE_DAX_KMEM;
>   	struct device *dev = &pdev->dev;
>   	struct dax_region *dax_region;
>   	struct memregion_info *mri;
>   	struct dev_dax_data data;
>   	struct dev_dax *dev_dax;
>   
> +	/*
> +	 * @region_idle == true indicates that an administrative agent
> +	 * wants to manipulate the range partitioning before the devices
> +	 * are created, so do not send them to the dax_kmem driver by
> +	 * default.
> +	 */
> +	if (region_idle)
> +		flags = 0;
> +
>   	mri = dev->platform_data;
>   	dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
> -				      mri->target_node, PMD_SIZE, 0);
> +				      mri->target_node, PMD_SIZE, flags);
>   	if (!dax_region)
>   		return -ENOMEM;
>   
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 4852a2dbdb27..918d01d3fbaa 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -239,6 +239,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
>   static struct dax_device_driver device_dax_kmem_driver = {
>   	.probe = dev_dax_kmem_probe,
>   	.remove = dev_dax_kmem_remove,
> +	.type = DAXDRV_KMEM_TYPE,
>   };
>   
>   static int __init dax_kmem_init(void)
> 


* Re: [PATCH v2 20/20] cxl/dax: Create dax devices for CXL RAM regions
  2023-02-10  9:07 ` [PATCH v2 20/20] cxl/dax: Create dax devices for CXL RAM regions Dan Williams
  2023-02-10 18:38   ` Jonathan Cameron
@ 2023-02-10 22:42   ` Dave Jiang
  1 sibling, 0 replies; 65+ messages in thread
From: Dave Jiang @ 2023-02-10 22:42 UTC (permalink / raw)
  To: Dan Williams, linux-cxl
  Cc: Fan Ni, vishal.l.verma, dave.hansen, linux-mm, linux-acpi



On 2/10/23 2:07 AM, Dan Williams wrote:
> While platform firmware takes some responsibility for mapping the RAM
> capacity of CXL devices present at boot, the OS is responsible for
> mapping the remainder and hot-added devices. Platform firmware is also
> responsible for identifying the platform general purpose memory pool,
> typically DDR attached DRAM, and arranging for the remainder to be 'Soft
> Reserved'. That reservation allows the CXL subsystem to route the memory
> to core-mm via memory-hotplug (dax_kmem), or leave it for dedicated
> access (device-dax).
> 
> The new 'struct cxl_dax_region' object allows for a CXL memory resource
> (region) to be published, but also allow for udev and module policy to
> act on that event. It also prevents cxl_core.ko from having a module
> loading dependency on any drivers/dax/ modules.
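To make the udev/module-policy point concrete, an illustrative (hardware-dependent, not from the patch; the sysfs device name is an assumption) view of how the new device type reaches dax_cxl.ko:

```
# Illustrative only: cxl_dax_region devices carry id CXL_DEVICE_DAX_REGION
# (8 in this series), so the device publishes modalias "cxl:t8" on the CXL
# bus, and MODULE_ALIAS_CXL(CXL_DEVICE_DAX_REGION) in dax_cxl.ko lets
# modprobe map that uevent alias back to the module.
cat /sys/bus/cxl/devices/dax_region0/modalias   # hypothetical device name
modprobe --resolve-alias cxl:t8
```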
> 
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564545116.847146.4741351262959589920.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>   MAINTAINERS               |    1
>   drivers/cxl/acpi.c        |    3 +
>   drivers/cxl/core/core.h   |    3 +
>   drivers/cxl/core/port.c   |    4 +-
>   drivers/cxl/core/region.c |  108 ++++++++++++++++++++++++++++++++++++++++++++-
>   drivers/cxl/cxl.h         |   12 +++++
>   drivers/dax/Kconfig       |   13 +++++
>   drivers/dax/Makefile      |    2 +
>   drivers/dax/cxl.c         |   53 ++++++++++++++++++++++
>   drivers/dax/hmem/hmem.c   |   14 ++++++
>   10 files changed, 209 insertions(+), 4 deletions(-)
>   create mode 100644 drivers/dax/cxl.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7f86d02cb427..73a9f3401e0e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6035,6 +6035,7 @@ M:	Dan Williams <dan.j.williams@intel.com>
>   M:	Vishal Verma <vishal.l.verma@intel.com>
>   M:	Dave Jiang <dave.jiang@intel.com>
>   L:	nvdimm@lists.linux.dev
> +L:	linux-cxl@vger.kernel.org
>   S:	Supported
>   F:	drivers/dax/
>   
> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index ad0849af42d7..8ebb9a74790d 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -731,7 +731,8 @@ static void __exit cxl_acpi_exit(void)
>   	cxl_bus_drain();
>   }
>   
> -module_init(cxl_acpi_init);
> +/* load before dax_hmem sees 'Soft Reserved' CXL ranges */
> +subsys_initcall(cxl_acpi_init);
>   module_exit(cxl_acpi_exit);
>   MODULE_LICENSE("GPL v2");
>   MODULE_IMPORT_NS(CXL);
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 479f01da6d35..cde475e13216 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -15,12 +15,14 @@ extern struct device_attribute dev_attr_create_ram_region;
>   extern struct device_attribute dev_attr_delete_region;
>   extern struct device_attribute dev_attr_region;
>   extern const struct device_type cxl_pmem_region_type;
> +extern const struct device_type cxl_dax_region_type;
>   extern const struct device_type cxl_region_type;
>   void cxl_decoder_kill_region(struct cxl_endpoint_decoder *cxled);
>   #define CXL_REGION_ATTR(x) (&dev_attr_##x.attr)
>   #define CXL_REGION_TYPE(x) (&cxl_region_type)
>   #define SET_CXL_REGION_ATTR(x) (&dev_attr_##x.attr),
>   #define CXL_PMEM_REGION_TYPE(x) (&cxl_pmem_region_type)
> +#define CXL_DAX_REGION_TYPE(x) (&cxl_dax_region_type)
>   int cxl_region_init(void);
>   void cxl_region_exit(void);
>   #else
> @@ -38,6 +40,7 @@ static inline void cxl_region_exit(void)
>   #define CXL_REGION_TYPE(x) NULL
>   #define SET_CXL_REGION_ATTR(x)
>   #define CXL_PMEM_REGION_TYPE(x) NULL
> +#define CXL_DAX_REGION_TYPE(x) NULL
>   #endif
>   
>   struct cxl_send_command;
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index b45d2796ef35..0bb7a5ff724b 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -46,6 +46,8 @@ static int cxl_device_id(struct device *dev)
>   		return CXL_DEVICE_NVDIMM;
>   	if (dev->type == CXL_PMEM_REGION_TYPE())
>   		return CXL_DEVICE_PMEM_REGION;
> +	if (dev->type == CXL_DAX_REGION_TYPE())
> +		return CXL_DEVICE_DAX_REGION;
>   	if (is_cxl_port(dev)) {
>   		if (is_cxl_root(to_cxl_port(dev)))
>   			return CXL_DEVICE_ROOT;
> @@ -2015,6 +2017,6 @@ static void cxl_core_exit(void)
>   	debugfs_remove_recursive(cxl_debugfs);
>   }
>   
> -module_init(cxl_core_init);
> +subsys_initcall(cxl_core_init);
>   module_exit(cxl_core_exit);
>   MODULE_LICENSE("GPL v2");
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 3f6453da2c51..91d334080cab 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -2272,6 +2272,75 @@ static struct cxl_pmem_region *cxl_pmem_region_alloc(struct cxl_region *cxlr)
>   	return cxlr_pmem;
>   }
>   
> +static void cxl_dax_region_release(struct device *dev)
> +{
> +	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
> +
> +	kfree(cxlr_dax);
> +}
> +
> +static const struct attribute_group *cxl_dax_region_attribute_groups[] = {
> +	&cxl_base_attribute_group,
> +	NULL,
> +};
> +
> +const struct device_type cxl_dax_region_type = {
> +	.name = "cxl_dax_region",
> +	.release = cxl_dax_region_release,
> +	.groups = cxl_dax_region_attribute_groups,
> +};
> +
> +static bool is_cxl_dax_region(struct device *dev)
> +{
> +	return dev->type == &cxl_dax_region_type;
> +}
> +
> +struct cxl_dax_region *to_cxl_dax_region(struct device *dev)
> +{
> +	if (dev_WARN_ONCE(dev, !is_cxl_dax_region(dev),
> +			  "not a cxl_dax_region device\n"))
> +		return NULL;
> +	return container_of(dev, struct cxl_dax_region, dev);
> +}
> +EXPORT_SYMBOL_NS_GPL(to_cxl_dax_region, CXL);
> +
> +static struct lock_class_key cxl_dax_region_key;
> +
> +static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
> +{
> +	struct cxl_region_params *p = &cxlr->params;
> +	struct cxl_dax_region *cxlr_dax;
> +	struct device *dev;
> +
> +	down_read(&cxl_region_rwsem);
> +	if (p->state != CXL_CONFIG_COMMIT) {
> +		cxlr_dax = ERR_PTR(-ENXIO);
> +		goto out;
> +	}
> +
> +	cxlr_dax = kzalloc(sizeof(*cxlr_dax), GFP_KERNEL);
> +	if (!cxlr_dax) {
> +		cxlr_dax = ERR_PTR(-ENOMEM);
> +		goto out;
> +	}
> +
> +	cxlr_dax->hpa_range.start = p->res->start;
> +	cxlr_dax->hpa_range.end = p->res->end;
> +
> +	dev = &cxlr_dax->dev;
> +	cxlr_dax->cxlr = cxlr;
> +	device_initialize(dev);
> +	lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
> +	device_set_pm_not_required(dev);
> +	dev->parent = &cxlr->dev;
> +	dev->bus = &cxl_bus_type;
> +	dev->type = &cxl_dax_region_type;
> +out:
> +	up_read(&cxl_region_rwsem);
> +
> +	return cxlr_dax;
> +}
> +
>   static void cxlr_pmem_unregister(void *_cxlr_pmem)
>   {
>   	struct cxl_pmem_region *cxlr_pmem = _cxlr_pmem;
> @@ -2356,6 +2425,42 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
>   	return rc;
>   }
>   
> +static void cxlr_dax_unregister(void *_cxlr_dax)
> +{
> +	struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> +
> +	device_unregister(&cxlr_dax->dev);
> +}
> +
> +static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> +{
> +	struct cxl_dax_region *cxlr_dax;
> +	struct device *dev;
> +	int rc;
> +
> +	cxlr_dax = cxl_dax_region_alloc(cxlr);
> +	if (IS_ERR(cxlr_dax))
> +		return PTR_ERR(cxlr_dax);
> +
> +	dev = &cxlr_dax->dev;
> +	rc = dev_set_name(dev, "dax_region%d", cxlr->id);
> +	if (rc)
> +		goto err;
> +
> +	rc = device_add(dev);
> +	if (rc)
> +		goto err;
> +
> +	dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
> +		dev_name(dev));
> +
> +	return devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
> +					cxlr_dax);
> +err:
> +	put_device(dev);
> +	return rc;
> +}
> +
>   static int match_decoder_by_range(struct device *dev, void *data)
>   {
>   	struct range *r1, *r2 = data;
> @@ -2619,8 +2724,7 @@ static int cxl_region_probe(struct device *dev)
>   					p->res->start, p->res->end, cxlr,
>   					is_system_ram) > 0)
>   			return 0;
> -		dev_dbg(dev, "TODO: hookup devdax\n");
> -		return 0;
> +		return devm_cxl_add_dax_region(cxlr);
>   	default:
>   		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
>   			cxlr->mode);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 2ac344235235..b1395c46baec 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -513,6 +513,12 @@ struct cxl_pmem_region {
>   	struct cxl_pmem_region_mapping mapping[];
>   };
>   
> +struct cxl_dax_region {
> +	struct device dev;
> +	struct cxl_region *cxlr;
> +	struct range hpa_range;
> +};
> +
>   /**
>    * struct cxl_port - logical collection of upstream port devices and
>    *		     downstream port devices to construct a CXL memory
> @@ -707,6 +713,7 @@ void cxl_driver_unregister(struct cxl_driver *cxl_drv);
>   #define CXL_DEVICE_MEMORY_EXPANDER	5
>   #define CXL_DEVICE_REGION		6
>   #define CXL_DEVICE_PMEM_REGION		7
> +#define CXL_DEVICE_DAX_REGION		8
>   
>   #define MODULE_ALIAS_CXL(type) MODULE_ALIAS("cxl:t" __stringify(type) "*")
>   #define CXL_MODALIAS_FMT "cxl:t%d"
> @@ -725,6 +732,7 @@ bool is_cxl_pmem_region(struct device *dev);
>   struct cxl_pmem_region *to_cxl_pmem_region(struct device *dev);
>   int cxl_add_to_region(struct cxl_port *root,
>   		      struct cxl_endpoint_decoder *cxled);
> +struct cxl_dax_region *to_cxl_dax_region(struct device *dev);
>   #else
>   static inline bool is_cxl_pmem_region(struct device *dev)
>   {
> @@ -739,6 +747,10 @@ static inline int cxl_add_to_region(struct cxl_port *root,
>   {
>   	return 0;
>   }
> +static inline struct cxl_dax_region *to_cxl_dax_region(struct device *dev)
> +{
> +	return NULL;
> +}
>   #endif
>   
>   /*
> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> index 1163eb62e5f6..bd06e16c7ac8 100644
> --- a/drivers/dax/Kconfig
> +++ b/drivers/dax/Kconfig
> @@ -45,6 +45,19 @@ config DEV_DAX_HMEM
>   
>   	  Say M if unsure.
>   
> +config DEV_DAX_CXL
> +	tristate "CXL DAX: direct access to CXL RAM regions"
> +	depends on CXL_REGION && DEV_DAX
> +	default CXL_REGION && DEV_DAX
> +	help
> +	  CXL RAM regions are either mapped by platform-firmware
> +	  and published in the initial system-memory map as "System RAM", mapped
> +	  by platform-firmware as "Soft Reserved", or dynamically provisioned
> +	  after boot by the CXL driver. In the latter two cases a device-dax
> +	  instance is created to access that unmapped-by-default address range.
> +	  Per usual it can remain as dedicated access via a device interface, or
> +	  converted to "System RAM" via the dax_kmem facility.
> +
>   config DEV_DAX_HMEM_DEVICES
>   	depends on DEV_DAX_HMEM && DAX
>   	def_bool y
> diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
> index 90a56ca3b345..5ed5c39857c8 100644
> --- a/drivers/dax/Makefile
> +++ b/drivers/dax/Makefile
> @@ -3,10 +3,12 @@ obj-$(CONFIG_DAX) += dax.o
>   obj-$(CONFIG_DEV_DAX) += device_dax.o
>   obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
>   obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
> +obj-$(CONFIG_DEV_DAX_CXL) += dax_cxl.o
>   
>   dax-y := super.o
>   dax-y += bus.o
>   device_dax-y := device.o
>   dax_pmem-y := pmem.o
> +dax_cxl-y := cxl.o
>   
>   obj-y += hmem/
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> new file mode 100644
> index 000000000000..ccdf8de85bd5
> --- /dev/null
> +++ b/drivers/dax/cxl.c
> @@ -0,0 +1,53 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2023 Intel Corporation. All rights reserved. */
> +#include <linux/module.h>
> +#include <linux/dax.h>
> +
> +#include "../cxl/cxl.h"
> +#include "bus.h"
> +
> +static int cxl_dax_region_probe(struct device *dev)
> +{
> +	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
> +	int nid = phys_to_target_node(cxlr_dax->hpa_range.start);
> +	struct cxl_region *cxlr = cxlr_dax->cxlr;
> +	struct dax_region *dax_region;
> +	struct dev_dax_data data;
> +	struct dev_dax *dev_dax;
> +
> +	if (nid == NUMA_NO_NODE)
> +		nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
> +
> +	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> +				      PMD_SIZE, IORESOURCE_DAX_KMEM);
> +	if (!dax_region)
> +		return -ENOMEM;
> +
> +	data = (struct dev_dax_data) {
> +		.dax_region = dax_region,
> +		.id = -1,
> +		.size = range_len(&cxlr_dax->hpa_range),
> +	};
> +	dev_dax = devm_create_dev_dax(&data);
> +	if (IS_ERR(dev_dax))
> +		return PTR_ERR(dev_dax);
> +
> +	/* child dev_dax instances now own the lifetime of the dax_region */
> +	dax_region_put(dax_region);
> +	return 0;
> +}
> +
> +static struct cxl_driver cxl_dax_region_driver = {
> +	.name = "cxl_dax_region",
> +	.probe = cxl_dax_region_probe,
> +	.id = CXL_DEVICE_DAX_REGION,
> +	.drv = {
> +		.suppress_bind_attrs = true,
> +	},
> +};
> +
> +module_cxl_driver(cxl_dax_region_driver);
> +MODULE_ALIAS_CXL(CXL_DEVICE_DAX_REGION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Intel Corporation");
> +MODULE_IMPORT_NS(CXL);
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index 5ec08f9f8a57..e5fe8b39fb94 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -72,6 +72,13 @@ static int hmem_register_device(struct device *host, int target_nid,
>   	long id;
>   	int rc;
>   
> +	if (IS_ENABLED(CONFIG_CXL_REGION) &&
> +	    region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> +			      IORES_DESC_CXL) != REGION_DISJOINT) {
> +		dev_dbg(host, "deferring range to CXL: %pr\n", res);
> +		return 0;
> +	}
> +
>   	rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>   			       IORES_DESC_SOFT_RESERVED);
>   	if (rc != REGION_INTERSECTS)
> @@ -157,6 +164,13 @@ static __exit void dax_hmem_exit(void)
>   module_init(dax_hmem_init);
>   module_exit(dax_hmem_exit);
>   
> +/* Allow for CXL to define its own dax regions */
> +#if IS_ENABLED(CONFIG_CXL_REGION)
> +#if IS_MODULE(CONFIG_CXL_ACPI)
> +MODULE_SOFTDEP("pre: cxl_acpi");
> +#endif
> +#endif
> +
>   MODULE_ALIAS("platform:hmem*");
>   MODULE_ALIAS("platform:hmem_platform*");
>   MODULE_LICENSE("GPL v2");
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 01/20] cxl/memdev: Fix endpoint port removal
  2023-02-10  9:05 ` [PATCH v2 01/20] cxl/memdev: Fix endpoint port removal Dan Williams
  2023-02-10 17:28   ` Jonathan Cameron
@ 2023-02-10 23:17   ` Verma, Vishal L
  1 sibling, 0 replies; 65+ messages in thread
From: Verma, Vishal L @ 2023-02-10 23:17 UTC (permalink / raw)
  To: Williams, Dan J, linux-cxl; +Cc: linux-mm, dave.hansen, linux-acpi

On Fri, 2023-02-10 at 01:05 -0800, Dan Williams wrote:
> Testing of ram region support [1] triggers a long-standing bug in
> cxl_detach_ep() where some cxl_ep_remove() cleanup is skipped due to an
> inability to walk ports after dports have been unregistered. That
> results in a failure to re-register a memdev after the port is
> re-enabled, leading to a crash like the following:
> 
>     cxl_port_setup_targets: cxl region4: cxl_host_bridge.0:port4 iw: 1 ig: 256
>     general protection fault, ...
>     [..]
>     RIP: 0010:cxl_region_setup_targets+0x897/0x9e0 [cxl_core]
>     dev_name at include/linux/device.h:700
>     (inlined by) cxl_port_setup_targets at drivers/cxl/core/region.c:1155
>     (inlined by) cxl_region_setup_targets at drivers/cxl/core/region.c:1249
>     [..]
>     Call Trace:
>      <TASK>
>      attach_target+0x39a/0x760 [cxl_core]
>      ? __mutex_unlock_slowpath+0x3a/0x290
>      cxl_add_to_region+0xb8/0x340 [cxl_core]
>      ? lockdep_hardirqs_on+0x7d/0x100
>      discover_region+0x4b/0x80 [cxl_port]
>      ? __pfx_discover_region+0x10/0x10 [cxl_port]
>      device_for_each_child+0x58/0x90
>      cxl_port_probe+0x10e/0x130 [cxl_port]
>      cxl_bus_probe+0x17/0x50 [cxl_core]
> 
> Change the port ancestry walk to be by depth rather than by dport. This
> ensures that even if a port has unregistered its dports, a deferred
> memdev cleanup will still be able to clean up the memdev's interest in
> that port.
> 
> The parent_port->dev.driver check is only needed for determining if the
> bottom-up removal beat the top-down removal; cxl_ep_remove() itself can
> always proceed.
> 
> Fixes: 2703c16c75ae ("cxl/core/port: Add switch port enumeration")
> Link: http://lore.kernel.org/r/167564534874.847146.5222419648551436750.stgit@dwillia2-xfh.jf.intel.com [1]
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/cxl/core/memdev.c |    1 +
>  drivers/cxl/core/port.c   |   58 +++++++++++++++++++++++++--------------------
>  drivers/cxl/cxlmem.h      |    2 ++
>  3 files changed, 35 insertions(+), 26 deletions(-)

Looks good,

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>

> 
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index a74a93310d26..3a8bc2b06047 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -246,6 +246,7 @@ static struct cxl_memdev *cxl_memdev_alloc(struct cxl_dev_state *cxlds,
>         if (rc < 0)
>                 goto err;
>         cxlmd->id = rc;
> +       cxlmd->depth = -1;
>  
>         dev = &cxlmd->dev;
>         device_initialize(dev);
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 410c036c09fa..317bcf4dbd9d 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -1207,6 +1207,7 @@ int cxl_endpoint_autoremove(struct cxl_memdev *cxlmd, struct cxl_port *endpoint)
>  
>         get_device(&endpoint->dev);
>         dev_set_drvdata(dev, endpoint);
> +       cxlmd->depth = endpoint->depth;
>         return devm_add_action_or_reset(dev, delete_endpoint, cxlmd);
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_endpoint_autoremove, CXL);
> @@ -1241,50 +1242,55 @@ static void reap_dports(struct cxl_port *port)
>         }
>  }
>  
> +struct detach_ctx {
> +       struct cxl_memdev *cxlmd;
> +       int depth;
> +};
> +
> +static int port_has_memdev(struct device *dev, const void *data)
> +{
> +       const struct detach_ctx *ctx = data;
> +       struct cxl_port *port;
> +
> +       if (!is_cxl_port(dev))
> +               return 0;
> +
> +       port = to_cxl_port(dev);
> +       if (port->depth != ctx->depth)
> +               return 0;
> +
> +       return !!cxl_ep_load(port, ctx->cxlmd);
> +}
> +
>  static void cxl_detach_ep(void *data)
>  {
>         struct cxl_memdev *cxlmd = data;
> -       struct device *iter;
>  
> -       for (iter = &cxlmd->dev; iter; iter = grandparent(iter)) {
> -               struct device *dport_dev = grandparent(iter);
> +       for (int i = cxlmd->depth - 1; i >= 1; i--) {
>                 struct cxl_port *port, *parent_port;
> +               struct detach_ctx ctx = {
> +                       .cxlmd = cxlmd,
> +                       .depth = i,
> +               };
> +               struct device *dev;
>                 struct cxl_ep *ep;
>                 bool died = false;
>  
> -               if (!dport_dev)
> -                       break;
> -
> -               port = find_cxl_port(dport_dev, NULL);
> -               if (!port)
> -                       continue;
> -
> -               if (is_cxl_root(port)) {
> -                       put_device(&port->dev);
> +               dev = bus_find_device(&cxl_bus_type, NULL, &ctx,
> +                                     port_has_memdev);
> +               if (!dev)
>                         continue;
> -               }
> +               port = to_cxl_port(dev);
>  
>                 parent_port = to_cxl_port(port->dev.parent);
>                 device_lock(&parent_port->dev);
> -               if (!parent_port->dev.driver) {
> -                       /*
> -                        * The bottom-up race to delete the port lost to a
> -                        * top-down port disable, give up here, because the
> -                        * parent_port ->remove() will have cleaned up all
> -                        * descendants.
> -                        */
> -                       device_unlock(&parent_port->dev);
> -                       put_device(&port->dev);
> -                       continue;
> -               }
> -
>                 device_lock(&port->dev);
>                 ep = cxl_ep_load(port, cxlmd);
>                 dev_dbg(&cxlmd->dev, "disconnect %s from %s\n",
>                         ep ? dev_name(ep->ep) : "", dev_name(&port->dev));
>                 cxl_ep_remove(port, ep);
>                 if (ep && !port->dead && xa_empty(&port->endpoints) &&
> -                   !is_cxl_root(parent_port)) {
> +                   !is_cxl_root(parent_port) && parent_port->dev.driver) {
>                         /*
>                          * This was the last ep attached to a dynamically
>                          * enumerated port. Block new cxl_add_ep() and garbage
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index ab138004f644..c9da3c699a21 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -38,6 +38,7 @@
>   * @cxl_nvb: coordinate removal of @cxl_nvd if present
>   * @cxl_nvd: optional bridge to an nvdimm if the device supports pmem
>   * @id: id number of this memdev instance.
> + * @depth: endpoint port depth
>   */
>  struct cxl_memdev {
>         struct device dev;
> @@ -47,6 +48,7 @@ struct cxl_memdev {
>         struct cxl_nvdimm_bridge *cxl_nvb;
>         struct cxl_nvdimm *cxl_nvd;
>         int id;
> +       int depth;
>  };
>  
>  static inline struct cxl_memdev *to_cxl_memdev(struct device *dev)
> 



* Re: [PATCH v2 08/20] cxl/region: Cleanup target list on attach error
  2023-02-10  9:06 ` [PATCH v2 08/20] cxl/region: Cleanup target list on attach error Dan Williams
  2023-02-10 17:31   ` Jonathan Cameron
@ 2023-02-10 23:17   ` Verma, Vishal L
  2023-02-10 23:46   ` Ira Weiny
  2 siblings, 0 replies; 65+ messages in thread
From: Verma, Vishal L @ 2023-02-10 23:17 UTC (permalink / raw)
  To: Williams, Dan J, linux-cxl
  Cc: linux-mm, Jonathan.Cameron, dave.hansen, linux-acpi

On Fri, 2023-02-10 at 01:06 -0800, Dan Williams wrote:
> Jonathan noticed that the target list setup is not unwound completely
> upon error. Undo all the setup in the 'err_decrement:' exit path.
> 
> Fixes: 27b3f8d13830 ("cxl/region: Program target lists")
> Reported-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> Link: http://lore.kernel.org/r/20230208123031.00006990@Huawei.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/cxl/core/region.c |    2 ++
>  1 file changed, 2 insertions(+)

Looks good,

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>

> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 040bbd39c81d..ae7d3adcd41a 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1347,6 +1347,8 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  
>  err_decrement:
>         p->nr_targets--;
> +       cxled->pos = -1;
> +       p->targets[pos] = NULL;
>  err:
>         for (iter = ep_port; !is_cxl_root(iter);
>              iter = to_cxl_port(iter->dev.parent))
> 



* Re: [PATCH v2 12/20] cxl/port: Split endpoint and switch port probe
  2023-02-10  9:06 ` [PATCH v2 12/20] cxl/port: Split endpoint and switch port probe Dan Williams
  2023-02-10 17:41   ` Jonathan Cameron
@ 2023-02-10 23:21   ` Verma, Vishal L
  1 sibling, 0 replies; 65+ messages in thread
From: Verma, Vishal L @ 2023-02-10 23:21 UTC (permalink / raw)
  To: Williams, Dan J, linux-cxl
  Cc: linux-mm, Jonathan.Cameron, dave.hansen, linux-acpi

On Fri, 2023-02-10 at 01:06 -0800, Dan Williams wrote:
> Jonathan points out that the shared code between the switch and endpoint
> case is small. Before adding another is_cxl_endpoint() conditional,
> just split the two cases.
> 
> Rather than duplicate the "Couldn't enumerate decoders" error message
> take the opportunity to improve the error messages in
> devm_cxl_enumerate_decoders().
> 
> Reported-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> Link: http://lore.kernel.org/r/20230208170724.000067ec@Huawei.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/cxl/core/hdm.c |   11 ++++++--
>  drivers/cxl/port.c     |   69 +++++++++++++++++++++++++++---------------------
>  2 files changed, 47 insertions(+), 33 deletions(-)

Looks good,

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>

> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index dcc16d7cb8f3..a0891c3464f1 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -826,7 +826,8 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
>                         cxled = cxl_endpoint_decoder_alloc(port);
>                         if (IS_ERR(cxled)) {
>                                 dev_warn(&port->dev,
> -                                        "Failed to allocate the decoder\n");
> +                                        "Failed to allocate decoder%d.%d\n",
> +                                        port->id, i);
>                                 return PTR_ERR(cxled);
>                         }
>                         cxld = &cxled->cxld;
> @@ -836,7 +837,8 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
>                         cxlsd = cxl_switch_decoder_alloc(port, target_count);
>                         if (IS_ERR(cxlsd)) {
>                                 dev_warn(&port->dev,
> -                                        "Failed to allocate the decoder\n");
> +                                        "Failed to allocate decoder%d.%d\n",
> +                                        port->id, i);
>                                 return PTR_ERR(cxlsd);
>                         }
>                         cxld = &cxlsd->cxld;
> @@ -844,13 +846,16 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
>  
>                 rc = init_hdm_decoder(port, cxld, target_map, hdm, i, &dpa_base);
>                 if (rc) {
> +                       dev_warn(&port->dev,
> +                                "Failed to initialize decoder%d.%d\n",
> +                                port->id, i);
>                         put_device(&cxld->dev);
>                         return rc;
>                 }
>                 rc = add_hdm_decoder(port, cxld, target_map);
>                 if (rc) {
>                         dev_warn(&port->dev,
> -                                "Failed to add decoder to port\n");
> +                                "Failed to add decoder%d.%d\n", port->id, i);
>                         return rc;
>                 }
>         }
> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
> index 5453771bf330..a8d46a67b45e 100644
> --- a/drivers/cxl/port.c
> +++ b/drivers/cxl/port.c
> @@ -30,55 +30,64 @@ static void schedule_detach(void *cxlmd)
>         schedule_cxl_memdev_detach(cxlmd);
>  }
>  
> -static int cxl_port_probe(struct device *dev)
> +static int cxl_switch_port_probe(struct cxl_port *port)
>  {
> -       struct cxl_port *port = to_cxl_port(dev);
>         struct cxl_hdm *cxlhdm;
>         int rc;
>  
> +       rc = devm_cxl_port_enumerate_dports(port);
> +       if (rc < 0)
> +               return rc;
>  
> -       if (!is_cxl_endpoint(port)) {
> -               rc = devm_cxl_port_enumerate_dports(port);
> -               if (rc < 0)
> -                       return rc;
> -               if (rc == 1)
> -                       return devm_cxl_add_passthrough_decoder(port);
> -       }
> +       if (rc == 1)
> +               return devm_cxl_add_passthrough_decoder(port);
>  
>         cxlhdm = devm_cxl_setup_hdm(port);
>         if (IS_ERR(cxlhdm))
>                 return PTR_ERR(cxlhdm);
>  
> -       if (is_cxl_endpoint(port)) {
> -               struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport);
> -               struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +       return devm_cxl_enumerate_decoders(cxlhdm);
> +}
>  
> -               /* Cache the data early to ensure is_visible() works */
> -               read_cdat_data(port);
> +static int cxl_endpoint_port_probe(struct cxl_port *port)
> +{
> +       struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport);
> +       struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +       struct cxl_hdm *cxlhdm;
> +       int rc;
> +
> +       cxlhdm = devm_cxl_setup_hdm(port);
> +       if (IS_ERR(cxlhdm))
> +               return PTR_ERR(cxlhdm);
>  
> -               get_device(&cxlmd->dev);
> -               rc = devm_add_action_or_reset(dev, schedule_detach, cxlmd);
> -               if (rc)
> -                       return rc;
> +       /* Cache the data early to ensure is_visible() works */
> +       read_cdat_data(port);
>  
> -               rc = cxl_hdm_decode_init(cxlds, cxlhdm);
> -               if (rc)
> -                       return rc;
> +       get_device(&cxlmd->dev);
> +       rc = devm_add_action_or_reset(&port->dev, schedule_detach, cxlmd);
> +       if (rc)
> +               return rc;
>  
> -               rc = cxl_await_media_ready(cxlds);
> -               if (rc) {
> -                       dev_err(dev, "Media not active (%d)\n", rc);
> -                       return rc;
> -               }
> -       }
> +       rc = cxl_hdm_decode_init(cxlds, cxlhdm);
> +       if (rc)
> +               return rc;
>  
> -       rc = devm_cxl_enumerate_decoders(cxlhdm);
> +       rc = cxl_await_media_ready(cxlds);
>         if (rc) {
> -               dev_err(dev, "Couldn't enumerate decoders (%d)\n", rc);
> +               dev_err(&port->dev, "Media not active (%d)\n", rc);
>                 return rc;
>         }
>  
> -       return 0;
> +       return devm_cxl_enumerate_decoders(cxlhdm);
> +}
> +
> +static int cxl_port_probe(struct device *dev)
> +{
> +       struct cxl_port *port = to_cxl_port(dev);
> +
> +       if (is_cxl_endpoint(port))
> +               return cxl_endpoint_port_probe(port);
> +       return cxl_switch_port_probe(port);
>  }
>  
>  static ssize_t CDAT_read(struct file *filp, struct kobject *kobj,
> 



* Re: [PATCH v2 04/20] cxl/region: Support empty uuids for non-pmem regions
  2023-02-10  9:05 ` [PATCH v2 04/20] cxl/region: Support empty uuids for non-pmem regions Dan Williams
  2023-02-10 17:30   ` Jonathan Cameron
@ 2023-02-10 23:34   ` Ira Weiny
  1 sibling, 0 replies; 65+ messages in thread
From: Ira Weiny @ 2023-02-10 23:34 UTC (permalink / raw)
  To: Dan Williams, linux-cxl
  Cc: Vishal Verma, Fan Ni, dave.hansen, linux-mm, linux-acpi

Dan Williams wrote:
> Shipping versions of the cxl-cli utility expect all regions to have a
> 'uuid' attribute. In preparation for 'ram' regions, update the 'uuid'
> attribute to return an empty string which satisfies the current
> expectations of 'cxl list -R'. Otherwise, 'cxl list -R' fails in the
> presence of regions with the 'uuid' attribute missing. Force the
> attribute to be read-only as there is no facility or expectation for a
> 'ram' region to recall its uuid from one boot to the next.
> 
> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564536587.847146.12703125206459604597.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl |    3 ++-
>  drivers/cxl/core/region.c               |   11 +++++++++--
>  2 files changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 058b0c45001f..4c4e1cbb1169 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -317,7 +317,8 @@ Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) Write a unique identifier for the region. This field must
>  		be set for persistent regions and it must not conflict with the
> -		UUID of another region.
> +		UUID of another region. For volatile ram regions this
> +		attribute is a read-only empty string.
>  
>  
>  What:		/sys/bus/cxl/devices/regionZ/interleave_granularity
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 17d2d0c12725..0fc80478ff6b 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -45,7 +45,10 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
>  	rc = down_read_interruptible(&cxl_region_rwsem);
>  	if (rc)
>  		return rc;
> -	rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
> +	if (cxlr->mode != CXL_DECODER_PMEM)
> +		rc = sysfs_emit(buf, "\n");
> +	else
> +		rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
>  	up_read(&cxl_region_rwsem);
>  
>  	return rc;
> @@ -300,8 +303,12 @@ static umode_t cxl_region_visible(struct kobject *kobj, struct attribute *a,
>  	struct device *dev = kobj_to_dev(kobj);
>  	struct cxl_region *cxlr = to_cxl_region(dev);
>  
> +	/*
> +	 * Support tooling that expects to find a 'uuid' attribute for all
> +	 * regions regardless of mode.
> +	 */
>  	if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_DECODER_PMEM)
> -		return 0;
> +		return 0444;
>  	return a->mode;
>  }
>  
> 
> 




* Re: [PATCH v2 08/20] cxl/region: Cleanup target list on attach error
  2023-02-10  9:06 ` [PATCH v2 08/20] cxl/region: Cleanup target list on attach error Dan Williams
  2023-02-10 17:31   ` Jonathan Cameron
  2023-02-10 23:17   ` Verma, Vishal L
@ 2023-02-10 23:46   ` Ira Weiny
  2 siblings, 0 replies; 65+ messages in thread
From: Ira Weiny @ 2023-02-10 23:46 UTC (permalink / raw)
  To: Dan Williams, linux-cxl
  Cc: Jonathan Cameron, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

Dan Williams wrote:
> Jonathan noticed that the target list setup is not unwound completely
> upon error. Undo all the setup in the 'err_decrement:' exit path.
> 
> Fixes: 27b3f8d13830 ("cxl/region: Program target lists")
> Reported-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

> Link: http://lore.kernel.org/r/20230208123031.00006990@Huawei.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/cxl/core/region.c |    2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 040bbd39c81d..ae7d3adcd41a 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1347,6 +1347,8 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  
>  err_decrement:
>  	p->nr_targets--;
> +	cxled->pos = -1;
> +	p->targets[pos] = NULL;
>  err:
>  	for (iter = ep_port; !is_cxl_root(iter);
>  	     iter = to_cxl_port(iter->dev.parent))
> 
> 




* Re: [PATCH v2 13/20] cxl/region: Add region autodiscovery
  2023-02-10  9:06 ` [PATCH v2 13/20] cxl/region: Add region autodiscovery Dan Williams
  2023-02-10 18:09   ` Jonathan Cameron
@ 2023-02-11  0:29   ` Verma, Vishal L
  2023-02-11  1:03     ` Dan Williams
       [not found]   ` <CGME20230213192752uscas1p1c49508da4b100c9ba6a1a3aa92ca03e5@uscas1p1.samsung.com>
       [not found]   ` <CGME20230228185348uscas1p1a5314a077383ee81ac228c1b9f1da2f8@uscas1p1.samsung.com>
  3 siblings, 1 reply; 65+ messages in thread
From: Verma, Vishal L @ 2023-02-11  0:29 UTC (permalink / raw)
  To: Williams, Dan J, linux-cxl; +Cc: linux-mm, fan.ni, dave.hansen, linux-acpi

On Fri, 2023-02-10 at 01:06 -0800, Dan Williams wrote:
> Region autodiscovery is an asynchronous state machine advanced by
> cxl_port_probe(). After the decoders on an endpoint port are enumerated
> they are scanned for actively enabled instances. Each active decoder is
> flagged for auto-assembly CXL_DECODER_F_AUTO and attached to a region.
> If a region does not already exist for the address range setting of the
> decoder, one is created. That creation process may race with other
> decoders of the same region being discovered since cxl_port_probe() is
> asynchronous. A new 'struct cxl_root_decoder' lock, @range_lock, is
> introduced to mitigate that race.
> 
> Once all decoders have arrived, "p->nr_targets == p->interleave_ways",
> they are sorted by their relative decode position. The sort algorithm
> involves finding the point in the cxl_port topology where one leg of the
> decode leads to deviceA and the other to deviceB. At that point in the
> topology the target order in the 'struct cxl_switch_decoder' indicates
> the relative position of those endpoint decoders in the region.
> 
> From that point the region goes through the same setup and validation
> steps as user-created regions, but instead of programming the decoders
> it validates that the driver would have written the same values to the
> decoders as were already present.
> 
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564540972.847146.17096178433176097831.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/cxl/core/hdm.c    |   11 +
>  drivers/cxl/core/port.c   |    2 
>  drivers/cxl/core/region.c |  497 ++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/cxl/cxl.h         |   29 +++
>  drivers/cxl/port.c        |   48 ++++
>  5 files changed, 576 insertions(+), 11 deletions(-)
> 
> 
One question below, but otherwise looks good,

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>

<..>

>  
> +static int cxl_region_attach_auto(struct cxl_region *cxlr,
> +                                 struct cxl_endpoint_decoder *cxled, int pos)
> +{
> +       struct cxl_region_params *p = &cxlr->params;
> +
> +       if (cxled->state != CXL_DECODER_STATE_AUTO) {
> +               dev_err(&cxlr->dev,
> +                       "%s: unable to add decoder to autodetected region\n",
> +                       dev_name(&cxled->cxld.dev));
> +               return -EINVAL;
> +       }
> +
> +       if (pos >= 0) {
> +               dev_dbg(&cxlr->dev, "%s: expected auto position, not %d\n",
> +                       dev_name(&cxled->cxld.dev), pos);
> +               return -EINVAL;
> +       }
> +
> +       if (p->nr_targets >= p->interleave_ways) {
> +               dev_err(&cxlr->dev, "%s: no more target slots available\n",
> +                       dev_name(&cxled->cxld.dev));
> +               return -ENXIO;
> +       }
> +
> +       /*
> +        * Temporarily record the endpoint decoder into the target array. Yes,
> +        * this means that userspace can view devices in the wrong position
> +        * before the region activates, and must be careful to understand when
> +        * it might be racing region autodiscovery.
> +        */

Would it be worthwhile adding an attribute around this - either to
distinguish an auto-assembled region from a user-created one, or
perhaps better - something to mark the assembly complete? cxl-list
doesn't have to display this attribute as is, but maybe it can make a
decision to mark it as idle while assembly is pending, or maybe even
refuse to add_cxl_region() for it entirely?

This can be a follow-on too.

> +       pos = p->nr_targets;
> +       p->targets[pos] = cxled;
> +       cxled->pos = pos;
> +       p->nr_targets++;
> +
> +       return 0;
> +}
> +
> 



* Re: [PATCH v2 14/20] tools/testing/cxl: Define a fixed volatile configuration to parse
  2023-02-10  9:06 ` [PATCH v2 14/20] tools/testing/cxl: Define a fixed volatile configuration to parse Dan Williams
  2023-02-10 18:12   ` Jonathan Cameron
  2023-02-10 18:36   ` Dave Jiang
@ 2023-02-11  0:39   ` Verma, Vishal L
  2 siblings, 0 replies; 65+ messages in thread
From: Verma, Vishal L @ 2023-02-11  0:39 UTC (permalink / raw)
  To: Williams, Dan J, linux-cxl; +Cc: linux-mm, fan.ni, dave.hansen, linux-acpi

On Fri, 2023-02-10 at 01:06 -0800, Dan Williams wrote:
> Take two endpoints attached to the first switch on the first host-bridge
> in the cxl_test topology and define a pre-initialized region. This is a
> x2 interleave underneath a x1 CXL Window.
> 
> $ modprobe cxl_test
> $ # cxl list -Ru
> {
>   "region":"region3",
>   "resource":"0xf010000000",
>   "size":"512.00 MiB (536.87 MB)",
>   "interleave_ways":2,
>   "interleave_granularity":4096,
>   "decode_state":"commit"
> }
> 
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564541523.847146.12199636368812381475.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/cxl/core/core.h      |    3 -
>  drivers/cxl/core/hdm.c       |    3 +
>  drivers/cxl/core/port.c      |    2 +
>  drivers/cxl/cxl.h            |    2 +
>  drivers/cxl/cxlmem.h         |    3 +
>  tools/testing/cxl/test/cxl.c |  147 +++++++++++++++++++++++++++++++++++++++---
>  6 files changed, 146 insertions(+), 14 deletions(-)

Looks good,

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>

> 
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 5eb873da5a30..479f01da6d35 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -57,9 +57,6 @@ resource_size_t cxl_dpa_size(struct cxl_endpoint_decoder *cxled);
>  resource_size_t cxl_dpa_resource_start(struct cxl_endpoint_decoder *cxled);
>  extern struct rw_semaphore cxl_dpa_rwsem;
>  
> -bool is_switch_decoder(struct device *dev);
> -struct cxl_switch_decoder *to_cxl_switch_decoder(struct device *dev);
> -
>  int cxl_memdev_init(void);
>  void cxl_memdev_exit(void);
>  void cxl_mbox_init(void);
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 8c29026a4b9d..80eccae6ba9e 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -279,7 +279,7 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>         return 0;
>  }
>  
> -static int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> +int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>                                 resource_size_t base, resource_size_t len,
>                                 resource_size_t skipped)
>  {
> @@ -295,6 +295,7 @@ static int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  
>         return devm_add_action_or_reset(&port->dev, cxl_dpa_release, cxled);
>  }
> +EXPORT_SYMBOL_NS_GPL(devm_cxl_dpa_reserve, CXL);
>  
>  resource_size_t cxl_dpa_size(struct cxl_endpoint_decoder *cxled)
>  {
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 59620528571a..b45d2796ef35 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -458,6 +458,7 @@ bool is_switch_decoder(struct device *dev)
>  {
>         return is_root_decoder(dev) || dev->type == &cxl_decoder_switch_type;
>  }
> +EXPORT_SYMBOL_NS_GPL(is_switch_decoder, CXL);
>  
>  struct cxl_decoder *to_cxl_decoder(struct device *dev)
>  {
> @@ -485,6 +486,7 @@ struct cxl_switch_decoder *to_cxl_switch_decoder(struct device *dev)
>                 return NULL;
>         return container_of(dev, struct cxl_switch_decoder, cxld.dev);
>  }
> +EXPORT_SYMBOL_NS_GPL(to_cxl_switch_decoder, CXL);
>  
>  static void cxl_ep_release(struct cxl_ep *ep)
>  {
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index c8ee4bb8cce6..2ac344235235 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -653,8 +653,10 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
>  
>  struct cxl_decoder *to_cxl_decoder(struct device *dev);
>  struct cxl_root_decoder *to_cxl_root_decoder(struct device *dev);
> +struct cxl_switch_decoder *to_cxl_switch_decoder(struct device *dev);
>  struct cxl_endpoint_decoder *to_cxl_endpoint_decoder(struct device *dev);
>  bool is_root_decoder(struct device *dev);
> +bool is_switch_decoder(struct device *dev);
>  bool is_endpoint_decoder(struct device *dev);
>  struct cxl_root_decoder *cxl_root_decoder_alloc(struct cxl_port *port,
>                                                 unsigned int nr_targets,
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index c9da3c699a21..bf7d4c5c8612 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -81,6 +81,9 @@ static inline bool is_cxl_endpoint(struct cxl_port *port)
>  }
>  
>  struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds);
> +int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> +                        resource_size_t base, resource_size_t len,
> +                        resource_size_t skipped);
>  
>  static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port,
>                                          struct cxl_memdev *cxlmd)
> diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
> index 920bd969c554..5342f69d70d2 100644
> --- a/tools/testing/cxl/test/cxl.c
> +++ b/tools/testing/cxl/test/cxl.c
> @@ -703,6 +703,142 @@ static int mock_decoder_reset(struct cxl_decoder *cxld)
>         return 0;
>  }
>  
> +static void default_mock_decoder(struct cxl_decoder *cxld)
> +{
> +       cxld->hpa_range = (struct range){
> +               .start = 0,
> +               .end = -1,
> +       };
> +
> +       cxld->interleave_ways = 1;
> +       cxld->interleave_granularity = 256;
> +       cxld->target_type = CXL_DECODER_EXPANDER;
> +       cxld->commit = mock_decoder_commit;
> +       cxld->reset = mock_decoder_reset;
> +}
> +
> +static int first_decoder(struct device *dev, void *data)
> +{
> +       struct cxl_decoder *cxld;
> +
> +       if (!is_switch_decoder(dev))
> +               return 0;
> +       cxld = to_cxl_decoder(dev);
> +       if (cxld->id == 0)
> +               return 1;
> +       return 0;
> +}
> +
> +static void mock_init_hdm_decoder(struct cxl_decoder *cxld)
> +{
> +       struct acpi_cedt_cfmws *window = mock_cfmws[0];
> +       struct platform_device *pdev = NULL;
> +       struct cxl_endpoint_decoder *cxled;
> +       struct cxl_switch_decoder *cxlsd;
> +       struct cxl_port *port, *iter;
> +       const int size = SZ_512M;
> +       struct cxl_memdev *cxlmd;
> +       struct cxl_dport *dport;
> +       struct device *dev;
> +       bool hb0 = false;
> +       u64 base;
> +       int i;
> +
> +       if (is_endpoint_decoder(&cxld->dev)) {
> +               cxled = to_cxl_endpoint_decoder(&cxld->dev);
> +               cxlmd = cxled_to_memdev(cxled);
> +               WARN_ON(!dev_is_platform(cxlmd->dev.parent));
> +               pdev = to_platform_device(cxlmd->dev.parent);
> +
> > +               /* check if the endpoint is attached to host-bridge0 */
> +               port = cxled_to_port(cxled);
> +               do {
> +                       if (port->uport == &cxl_host_bridge[0]->dev) {
> +                               hb0 = true;
> +                               break;
> +                       }
> +                       if (is_cxl_port(port->dev.parent))
> +                               port = to_cxl_port(port->dev.parent);
> +                       else
> +                               port = NULL;
> +               } while (port);
> +               port = cxled_to_port(cxled);
> +       }
> +
> +       /*
> > +        * The first decoder on each of the first 2 devices on the first
> > +        * switch attached to host-bridge0 mocks a fake / static RAM
> > +        * region. All other decoders are left default disabled. Given the
> > +        * round-robin assignment those devices are named cxl_mem.0 and
> > +        * cxl_mem.4.
> +        *
> +        * See 'cxl list -BMPu -m cxl_mem.0,cxl_mem.4'
> +        */
> +       if (!hb0 || pdev->id % 4 || pdev->id > 4 || cxld->id > 0) {
> +               default_mock_decoder(cxld);
> +               return;
> +       }
> +
> +       base = window->base_hpa;
> +       cxld->hpa_range = (struct range) {
> +               .start = base,
> +               .end = base + size - 1,
> +       };
> +
> +       cxld->interleave_ways = 2;
> +       eig_to_granularity(window->granularity, &cxld->interleave_granularity);
> +       cxld->target_type = CXL_DECODER_EXPANDER;
> +       cxld->flags = CXL_DECODER_F_ENABLE;
> +       cxled->state = CXL_DECODER_STATE_AUTO;
> +       port->commit_end = cxld->id;
> +       devm_cxl_dpa_reserve(cxled, 0, size / cxld->interleave_ways, 0);
> +       cxld->commit = mock_decoder_commit;
> +       cxld->reset = mock_decoder_reset;
> +
> +       /*
> > +        * Now that the endpoint decoder is set up, walk up the hierarchy
> > +        * and set up the switch and root port decoders targeting @cxlmd.
> +        */
> +       iter = port;
> +       for (i = 0; i < 2; i++) {
> +               dport = iter->parent_dport;
> +               iter = dport->port;
> +               dev = device_find_child(&iter->dev, NULL, first_decoder);
> +               /*
> +                * Ancestor ports are guaranteed to be enumerated before
> +                * @port, and all ports have at least one decoder.
> +                */
> +               if (WARN_ON(!dev))
> +                       continue;
> +               cxlsd = to_cxl_switch_decoder(dev);
> +               if (i == 0) {
> +                       /* put cxl_mem.4 second in the decode order */
> +                       if (pdev->id == 4)
> +                               cxlsd->target[1] = dport;
> +                       else
> +                               cxlsd->target[0] = dport;
> +               } else
> +                       cxlsd->target[0] = dport;
> +               cxld = &cxlsd->cxld;
> +               cxld->target_type = CXL_DECODER_EXPANDER;
> +               cxld->flags = CXL_DECODER_F_ENABLE;
> +               iter->commit_end = 0;
> +               /*
> > +                * The switch targets 2 endpoints, while the host bridge
> > +                * targets one root port
> +                */
> +               if (i == 0)
> +                       cxld->interleave_ways = 2;
> +               else
> +                       cxld->interleave_ways = 1;
> +               cxld->interleave_granularity = 256;
> +               cxld->hpa_range = (struct range) {
> +                       .start = base,
> +                       .end = base + size - 1,
> +               };
> +               put_device(dev);
> +       }
> +}
> +
>  static int mock_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
>  {
>         struct cxl_port *port = cxlhdm->port;
> @@ -748,16 +884,7 @@ static int mock_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
>                         cxld = &cxled->cxld;
>                 }
>  
> -               cxld->hpa_range = (struct range) {
> -                       .start = 0,
> -                       .end = -1,
> -               };
> -
> -               cxld->interleave_ways = min_not_zero(target_count, 1);
> -               cxld->interleave_granularity = SZ_4K;
> -               cxld->target_type = CXL_DECODER_EXPANDER;
> -               cxld->commit = mock_decoder_commit;
> -               cxld->reset = mock_decoder_reset;
> +               mock_init_hdm_decoder(cxld);
>  
>                 if (target_count) {
>                         rc = device_for_each_child(port->uport, &ctx,
> 
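The selection predicate in mock_init_hdm_decoder() above is compact; here is a standalone sketch of just that check (hb0 folded in as a plain boolean, decoder id passed directly), useful for convincing oneself it picks exactly decoder 0 on cxl_mem.0 and cxl_mem.4 and nothing else:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Mirror of the mock selection logic, inverted from the early-return
 * form in the patch: only decoder 0 on the devices whose platform ids
 * are 0 and 4 (round-robin assignment across 4 host bridges) mocks the
 * static RAM region; everything else stays default disabled.
 */
static bool mocks_ram_region(bool hb0, int pdev_id, int decoder_id)
{
	return hb0 && !(pdev_id % 4) && pdev_id <= 4 && decoder_id == 0;
}
```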


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 15/20] dax/hmem: Move HMAT and Soft reservation probe initcall level
  2023-02-10  9:06 ` [PATCH v2 15/20] dax/hmem: Move HMAT and Soft reservation probe initcall level Dan Williams
  2023-02-10 21:53   ` Dave Jiang
@ 2023-02-11  0:40   ` Verma, Vishal L
  1 sibling, 0 replies; 65+ messages in thread
From: Verma, Vishal L @ 2023-02-11  0:40 UTC (permalink / raw)
  To: Williams, Dan J, linux-cxl; +Cc: linux-mm, fan.ni, dave.hansen, linux-acpi

On Fri, 2023-02-10 at 01:06 -0800, Dan Williams wrote:
> In preparation for moving more filtering of "hmem" ranges into the
> dax_hmem.ko module, update the initcall levels. HMAT range registration
> moves to subsys_initcall() to be done before Soft Reservation probing,
> and Soft Reservation probing is moved to device_initcall() to be done
> before dax_hmem.ko initialization if it is built-in.
> 
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564542109.847146.10113972881782419363.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/acpi/numa/hmat.c  |    2 +-
>  drivers/dax/hmem/Makefile |    3 ++-
>  drivers/dax/hmem/device.c |    2 +-
>  3 files changed, 4 insertions(+), 3 deletions(-)
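Initcall levels are just ordered buckets walked front to back; a userspace sketch (stand-in function names, not the real kernel machinery) of why moving hmat_init() to subsys_initcall() guarantees it runs before a device_initcall() hmem_init(), regardless of registration order:

```c
#include <assert.h>
#include <string.h>

/* simplified initcall levels, in kernel run order */
enum { SUBSYS, DEVICE, LATE, NR_LEVELS };

static const char *order[4];
static int nr;

static void ran(const char *name) { order[nr++] = name; }

/* hypothetical stand-ins for the real hmat_init()/hmem_init() */
static void hmat_init(void) { ran("hmat"); }
static void hmem_init(void) { ran("hmem"); }

struct initcall { int level; void (*fn)(void); };

/* deliberately registered "backwards" to show level ordering wins */
static struct initcall calls[] = {
	{ DEVICE, hmem_init },   /* device_initcall(hmem_init) */
	{ SUBSYS, hmat_init },   /* subsys_initcall(hmat_init) */
};

static void do_initcalls(void)
{
	for (int lvl = 0; lvl < NR_LEVELS; lvl++)
		for (unsigned i = 0; i < sizeof(calls) / sizeof(calls[0]); i++)
			if (calls[i].level == lvl)
				calls[i].fn();
}
```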

Looks good,

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>

> 
> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index 605a0c7053be..ff24282301ab 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -869,4 +869,4 @@ static __init int hmat_init(void)
>         acpi_put_table(tbl);
>         return 0;
>  }
> -device_initcall(hmat_init);
> +subsys_initcall(hmat_init);
> diff --git a/drivers/dax/hmem/Makefile b/drivers/dax/hmem/Makefile
> index 57377b4c3d47..d4c4cd6bccd7 100644
> --- a/drivers/dax/hmem/Makefile
> +++ b/drivers/dax/hmem/Makefile
> @@ -1,6 +1,7 @@
>  # SPDX-License-Identifier: GPL-2.0
> -obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
> +# device_hmem.o deliberately precedes dax_hmem.o for initcall ordering
>  obj-$(CONFIG_DEV_DAX_HMEM_DEVICES) += device_hmem.o
> +obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
>  
>  device_hmem-y := device.o
>  dax_hmem-y := hmem.o
> diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
> index 903325aac991..20749c7fab81 100644
> --- a/drivers/dax/hmem/device.c
> +++ b/drivers/dax/hmem/device.c
> @@ -104,4 +104,4 @@ static __init int hmem_init(void)
>   * As this is a fallback for address ranges unclaimed by the ACPI HMAT
>   * parsing it must be at an initcall level greater than hmat_init().
>   */
> -late_initcall(hmem_init);
> +device_initcall(hmem_init);
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 16/20] dax/hmem: Drop unnecessary dax_hmem_remove()
  2023-02-10  9:06 ` [PATCH v2 16/20] dax/hmem: Drop unnecessary dax_hmem_remove() Dan Williams
  2023-02-10 21:59   ` Dave Jiang
@ 2023-02-11  0:41   ` Verma, Vishal L
  1 sibling, 0 replies; 65+ messages in thread
From: Verma, Vishal L @ 2023-02-11  0:41 UTC (permalink / raw)
  To: Williams, Dan J, linux-cxl
  Cc: gregory.price, Jonathan.Cameron, fan.ni, linux-mm, dave.hansen,
	linux-acpi

On Fri, 2023-02-10 at 01:06 -0800, Dan Williams wrote:
> Empty driver remove callbacks can just be elided.
> 
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Gregory Price <gregory.price@memverge.com>
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564542679.847146.17174404738816053065.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/dax/hmem/hmem.c |    7 -------
>  1 file changed, 7 deletions(-)

Looks good,

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>

> 
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index 1bf040dbc834..c7351e0dc8ff 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -44,15 +44,8 @@ static int dax_hmem_probe(struct platform_device *pdev)
>         return 0;
>  }
>  
> -static int dax_hmem_remove(struct platform_device *pdev)
> -{
> -       /* devm handles teardown */
> -       return 0;
> -}
> -
>  static struct platform_driver dax_hmem_driver = {
>         .probe = dax_hmem_probe,
> -       .remove = dax_hmem_remove,
>         .driver = {
>                 .name = "hmem",
>         },
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 13/20] cxl/region: Add region autodiscovery
  2023-02-11  0:29   ` Verma, Vishal L
@ 2023-02-11  1:03     ` Dan Williams
  0 siblings, 0 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-11  1:03 UTC (permalink / raw)
  To: Verma, Vishal L, Williams, Dan J, linux-cxl
  Cc: linux-mm, fan.ni, dave.hansen, linux-acpi

Verma, Vishal L wrote:
> On Fri, 2023-02-10 at 01:06 -0800, Dan Williams wrote:
> > Region autodiscovery is an asynchronous state machine advanced by
> > cxl_port_probe(). After the decoders on an endpoint port are enumerated
> > they are scanned for actively enabled instances. Each active decoder is
> > flagged for auto-assembly CXL_DECODER_F_AUTO and attached to a region.
> > If a region does not already exist for the address range setting of the
> > decoder one is created. That creation process may race with other
> > decoders of the same region being discovered since cxl_port_probe() is
> > asynchronous. A new 'struct cxl_root_decoder' lock, @range_lock, is
> > introduced to mitigate that race.
> > 
> > Once all decoders have arrived, "p->nr_targets == p->interleave_ways",
> > they are sorted by their relative decode position. The sort algorithm
> > involves finding the point in the cxl_port topology where one leg of the
> > decode leads to deviceA and the other deviceB. At that point in the
> > topology the target order in the 'struct cxl_switch_decoder' indicates
> > the relative position of those endpoint decoders in the region.
> > 
> > From that point the region goes through the same setup and validation
> > steps as user-created regions, but instead of programming the decoders
> > it validates that driver would have written the same values to the
> > decoders as were already present.
> > 
> > Tested-by: Fan Ni <fan.ni@samsung.com>
> > Link: https://lore.kernel.org/r/167564540972.847146.17096178433176097831.stgit@dwillia2-xfh.jf.intel.com
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  drivers/cxl/core/hdm.c    |   11 +
> >  drivers/cxl/core/port.c   |    2 
> >  drivers/cxl/core/region.c |  497 ++++++++++++++++++++++++++++++++++++++++++++-
> >  drivers/cxl/cxl.h         |   29 +++
> >  drivers/cxl/port.c        |   48 ++++
> >  5 files changed, 576 insertions(+), 11 deletions(-)
> > 
> > 
> One question below, but otherwise looks good,
> 
> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
> 
> <..>
> 
> >  
> > +static int cxl_region_attach_auto(struct cxl_region *cxlr,
> > +                                 struct cxl_endpoint_decoder *cxled, int pos)
> > +{
> > +       struct cxl_region_params *p = &cxlr->params;
> > +
> > +       if (cxled->state != CXL_DECODER_STATE_AUTO) {
> > +               dev_err(&cxlr->dev,
> > +                       "%s: unable to add decoder to autodetected region\n",
> > +                       dev_name(&cxled->cxld.dev));
> > +               return -EINVAL;
> > +       }
> > +
> > +       if (pos >= 0) {
> > +               dev_dbg(&cxlr->dev, "%s: expected auto position, not %d\n",
> > +                       dev_name(&cxled->cxld.dev), pos);
> > +               return -EINVAL;
> > +       }
> > +
> > +       if (p->nr_targets >= p->interleave_ways) {
> > +               dev_err(&cxlr->dev, "%s: no more target slots available\n",
> > +                       dev_name(&cxled->cxld.dev));
> > +               return -ENXIO;
> > +       }
> > +
> > +       /*
> > +        * Temporarily record the endpoint decoder into the target array. Yes,
> > +        * this means that userspace can view devices in the wrong position
> > +        * before the region activates, and must be careful to understand when
> > +        * it might be racing region autodiscovery.
> > +        */
> 
> Would it be worthwhile adding an attribute around this - either to
> distinguish an auto-assembled region from a user-created one, or
> perhaps better - something to mark the assembly complete? cxl-list
> doesn't have to display this attribute as is, but maybe it can make a
> decision to mark it as idle while assembly is pending, or maybe even
> refuse to add_cxl_region() for it entirely?
> 
> This can be a follow-on too.

"Assembly complete" is determined by "commit" going active. What about
all of the "targetX" attributes printing the decoder-name out with a prefix
like:

    "auto:decoderX.Y"

...that way userspace can both see what candidate decoders the kernel is
considering, and the fact that assembly is still in progress (or
stalled).

The concern, though, is breaking legacy userspace that only ever expects
to read decoder names there... guess it depends on how bad the failure
mode is.

Instead, maybe a new attribute like "origin" that returns "manual" vs
"auto"?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 17/20] dax/hmem: Convey the dax range via memregion_info()
  2023-02-10  9:07 ` [PATCH v2 17/20] dax/hmem: Convey the dax range via memregion_info() Dan Williams
  2023-02-10 22:03   ` Dave Jiang
@ 2023-02-11  4:25   ` Verma, Vishal L
  1 sibling, 0 replies; 65+ messages in thread
From: Verma, Vishal L @ 2023-02-11  4:25 UTC (permalink / raw)
  To: Williams, Dan J, linux-cxl
  Cc: linux-mm, Jonathan.Cameron, fan.ni, dave.hansen, linux-acpi

On Fri, 2023-02-10 at 01:07 -0800, Dan Williams wrote:
> In preparation for hmem platform devices to be unregistered, stop using
> platform_device_add_resources() to convey the address range. The
> platform_device_add_resources() API causes an existing "Soft Reserved"
> iomem resource to be re-parented under an inserted platform device
> resource. When that platform device is deleted it removes the platform
> device resource and all children.
> 
> Instead, it is sufficient to convey just the address range and let
> request_mem_region() insert resources to indicate the devices active in
> the range. This allows the "Soft Reserved" resource to be re-enumerated
> upon the next probe event.
> 
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564543303.847146.11045895213318648441.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/dax/hmem/device.c |   37 ++++++++++++++-----------------------
>  drivers/dax/hmem/hmem.c   |   14 +++-----------
>  include/linux/memregion.h |    2 ++
>  3 files changed, 19 insertions(+), 34 deletions(-)
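The new 'struct range' member replaces a full 'struct resource' as the platform-data payload; a minimal sketch of that payload and the range_len() arithmetic the hmem driver now relies on (the end is inclusive, matching the kernel's linux/range.h convention):

```c
#include <assert.h>
#include <stdint.h>

struct range {
	uint64_t start;
	uint64_t end;	/* inclusive, as in linux/range.h */
};

/* shape of the payload handed to the "hmem" platform device */
struct memregion_info {
	int target_node;
	struct range range;
};

/* same arithmetic as the kernel's range_len() helper */
static uint64_t range_len(const struct range *r)
{
	return r->end - r->start + 1;
}
```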

Looks good,

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>

> 
> diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
> index 20749c7fab81..b1b339bccfe5 100644
> --- a/drivers/dax/hmem/device.c
> +++ b/drivers/dax/hmem/device.c
> @@ -15,15 +15,8 @@ static struct resource hmem_active = {
>         .flags = IORESOURCE_MEM,
>  };
>  
> -void hmem_register_device(int target_nid, struct resource *r)
> +void hmem_register_device(int target_nid, struct resource *res)
>  {
> -       /* define a clean / non-busy resource for the platform device */
> -       struct resource res = {
> -               .start = r->start,
> -               .end = r->end,
> -               .flags = IORESOURCE_MEM,
> -               .desc = IORES_DESC_SOFT_RESERVED,
> -       };
>         struct platform_device *pdev;
>         struct memregion_info info;
>         int rc, id;
> @@ -31,55 +24,53 @@ void hmem_register_device(int target_nid, struct resource *r)
>         if (nohmem)
>                 return;
>  
> -       rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
> -                       IORES_DESC_SOFT_RESERVED);
> +       rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> +                              IORES_DESC_SOFT_RESERVED);
>         if (rc != REGION_INTERSECTS)
>                 return;
>  
>         id = memregion_alloc(GFP_KERNEL);
>         if (id < 0) {
> -               pr_err("memregion allocation failure for %pr\n", &res);
> +               pr_err("memregion allocation failure for %pr\n", res);
>                 return;
>         }
>  
>         pdev = platform_device_alloc("hmem", id);
>         if (!pdev) {
> -               pr_err("hmem device allocation failure for %pr\n", &res);
> +               pr_err("hmem device allocation failure for %pr\n", res);
>                 goto out_pdev;
>         }
>  
> -       if (!__request_region(&hmem_active, res.start, resource_size(&res),
> +       if (!__request_region(&hmem_active, res->start, resource_size(res),
>                               dev_name(&pdev->dev), 0)) {
> -               dev_dbg(&pdev->dev, "hmem range %pr already active\n", &res);
> +               dev_dbg(&pdev->dev, "hmem range %pr already active\n", res);
>                 goto out_active;
>         }
>  
>         pdev->dev.numa_node = numa_map_to_online_node(target_nid);
>         info = (struct memregion_info) {
>                 .target_node = target_nid,
> +               .range = {
> +                       .start = res->start,
> +                       .end = res->end,
> +               },
>         };
>         rc = platform_device_add_data(pdev, &info, sizeof(info));
>         if (rc < 0) {
> -               pr_err("hmem memregion_info allocation failure for %pr\n", &res);
> -               goto out_resource;
> -       }
> -
> -       rc = platform_device_add_resources(pdev, &res, 1);
> -       if (rc < 0) {
> -               pr_err("hmem resource allocation failure for %pr\n", &res);
> +               pr_err("hmem memregion_info allocation failure for %pr\n", res);
>                 goto out_resource;
>         }
>  
>         rc = platform_device_add(pdev);
>         if (rc < 0) {
> -               dev_err(&pdev->dev, "device add failed for %pr\n", &res);
> +               dev_err(&pdev->dev, "device add failed for %pr\n", res);
>                 goto out_resource;
>         }
>  
>         return;
>  
>  out_resource:
> -       __release_region(&hmem_active, res.start, resource_size(&res));
> +       __release_region(&hmem_active, res->start, resource_size(res));
>  out_active:
>         platform_device_put(pdev);
>  out_pdev:
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index c7351e0dc8ff..5025a8c9850b 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -15,25 +15,17 @@ static int dax_hmem_probe(struct platform_device *pdev)
>         struct memregion_info *mri;
>         struct dev_dax_data data;
>         struct dev_dax *dev_dax;
> -       struct resource *res;
> -       struct range range;
> -
> -       res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> -       if (!res)
> -               return -ENOMEM;
>  
>         mri = dev->platform_data;
> -       range.start = res->start;
> -       range.end = res->end;
> -       dax_region = alloc_dax_region(dev, pdev->id, &range, mri->target_node,
> -                       PMD_SIZE, 0);
> +       dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
> +                                     mri->target_node, PMD_SIZE, 0);
>         if (!dax_region)
>                 return -ENOMEM;
>  
>         data = (struct dev_dax_data) {
>                 .dax_region = dax_region,
>                 .id = -1,
> -               .size = region_idle ? 0 : resource_size(res),
> +               .size = region_idle ? 0 : range_len(&mri->range),
>         };
>         dev_dax = devm_create_dev_dax(&data);
>         if (IS_ERR(dev_dax))
> diff --git a/include/linux/memregion.h b/include/linux/memregion.h
> index bf83363807ac..c01321467789 100644
> --- a/include/linux/memregion.h
> +++ b/include/linux/memregion.h
> @@ -3,10 +3,12 @@
>  #define _MEMREGION_H_
>  #include <linux/types.h>
>  #include <linux/errno.h>
> +#include <linux/range.h>
>  #include <linux/bug.h>
>  
>  struct memregion_info {
>         int target_node;
> +       struct range range;
>  };
>  
>  #ifdef CONFIG_MEMREGION
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 18/20] dax/hmem: Move hmem device registration to dax_hmem.ko
  2023-02-10  9:07 ` [PATCH v2 18/20] dax/hmem: Move hmem device registration to dax_hmem.ko Dan Williams
  2023-02-10 18:25   ` Jonathan Cameron
  2023-02-10 22:09   ` Dave Jiang
@ 2023-02-11  4:41   ` Verma, Vishal L
  2 siblings, 0 replies; 65+ messages in thread
From: Verma, Vishal L @ 2023-02-11  4:41 UTC (permalink / raw)
  To: Williams, Dan J, linux-cxl; +Cc: linux-mm, fan.ni, dave.hansen, linux-acpi

On Fri, 2023-02-10 at 01:07 -0800, Dan Williams wrote:
> In preparation for the CXL region driver to take over the responsibility
> of registering device-dax instances for CXL regions, move the
> registration of "hmem" devices to dax_hmem.ko.
> 
> Previously the builtin component of this enabling
> (drivers/dax/hmem/device.o) would register platform devices for each
> address range and trigger the dax_hmem.ko module to load and attach
> device-dax instances to those devices. Now, the ranges are collected
> from the HMAT and EFI memory map walking, but the device creation is
> deferred. A new "hmem_platform" device is created which triggers
> dax_hmem.ko to load and register the platform devices.
> 
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564543923.847146.9030380223622044744.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/acpi/numa/hmat.c  |    2 -
>  drivers/dax/Kconfig       |    2 -
>  drivers/dax/hmem/device.c |   91 +++++++++++++++++++--------------------
>  drivers/dax/hmem/hmem.c   |  105 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/dax.h       |    7 ++-
>  5 files changed, 155 insertions(+), 52 deletions(-)
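walk_hmem_resources() in this patch iterates the hmem_active children under a lock and aborts at the first non-zero callback return; the shape of that walk, minus the locking and with a stand-in resource type, sketched in plain C:

```c
#include <assert.h>
#include <stddef.h>

struct res {
	int desc;		/* the patch stashes the target node here */
	struct res *sibling;
};

typedef int (*walk_fn)(void *host, int target_nid, struct res *r);

/* mirrors walk_hmem_resources(): first non-zero return stops the walk */
static int walk_resources(struct res *head, void *host, walk_fn fn)
{
	int rc = 0;

	for (struct res *r = head; r; r = r->sibling) {
		rc = fn(host, r->desc, r);
		if (rc)
			break;
	}
	return rc;
}

/* example callback: count visits, fail on target node 2 */
static int visited;
static int stop_at_two(void *host, int target_nid, struct res *r)
{
	(void)host;
	(void)r;
	visited++;
	return target_nid == 2 ? -1 : 0;
}

/* three-entry sibling list: nodes 1 -> 2 -> 3 */
static struct res c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
```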

Looks good,

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>

> 
> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index ff24282301ab..bba268ecd802 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -718,7 +718,7 @@ static void hmat_register_target_devices(struct memory_target *target)
>         for (res = target->memregions.child; res; res = res->sibling) {
>                 int target_nid = pxm_to_node(target->memory_pxm);
>  
> -               hmem_register_device(target_nid, res);
> +               hmem_register_resource(target_nid, res);
>         }
>  }
>  
> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> index 5fdf269a822e..d13c889c2a64 100644
> --- a/drivers/dax/Kconfig
> +++ b/drivers/dax/Kconfig
> @@ -46,7 +46,7 @@ config DEV_DAX_HMEM
>           Say M if unsure.
>  
>  config DEV_DAX_HMEM_DEVICES
> -       depends on DEV_DAX_HMEM && DAX=y
> +       depends on DEV_DAX_HMEM && DAX
>         def_bool y
>  
>  config DEV_DAX_KMEM
> diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
> index b1b339bccfe5..f9e1a76a04a9 100644
> --- a/drivers/dax/hmem/device.c
> +++ b/drivers/dax/hmem/device.c
> @@ -8,6 +8,8 @@
>  static bool nohmem;
>  module_param_named(disable, nohmem, bool, 0444);
>  
> +static bool platform_initialized;
> +static DEFINE_MUTEX(hmem_resource_lock);
>  static struct resource hmem_active = {
>         .name = "HMEM devices",
>         .start = 0,
> @@ -15,71 +17,66 @@ static struct resource hmem_active = {
>         .flags = IORESOURCE_MEM,
>  };
>  
> -void hmem_register_device(int target_nid, struct resource *res)
> +int walk_hmem_resources(struct device *host, walk_hmem_fn fn)
> +{
> +       struct resource *res;
> +       int rc = 0;
> +
> +       mutex_lock(&hmem_resource_lock);
> +       for (res = hmem_active.child; res; res = res->sibling) {
> +               rc = fn(host, (int) res->desc, res);
> +               if (rc)
> +                       break;
> +       }
> +       mutex_unlock(&hmem_resource_lock);
> +       return rc;
> +}
> +EXPORT_SYMBOL_GPL(walk_hmem_resources);
> +
> +static void __hmem_register_resource(int target_nid, struct resource *res)
>  {
>         struct platform_device *pdev;
> -       struct memregion_info info;
> -       int rc, id;
> +       struct resource *new;
> +       int rc;
>  
> -       if (nohmem)
> +       new = __request_region(&hmem_active, res->start, resource_size(res), "",
> +                              0);
> +       if (!new) {
> +               pr_debug("hmem range %pr already active\n", res);
>                 return;
> +       }
>  
> -       rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> -                              IORES_DESC_SOFT_RESERVED);
> -       if (rc != REGION_INTERSECTS)
> -               return;
> +       new->desc = target_nid;
>  
> -       id = memregion_alloc(GFP_KERNEL);
> -       if (id < 0) {
> -               pr_err("memregion allocation failure for %pr\n", res);
> +       if (platform_initialized)
>                 return;
> -       }
>  
> -       pdev = platform_device_alloc("hmem", id);
> +       pdev = platform_device_alloc("hmem_platform", 0);
>         if (!pdev) {
> -               pr_err("hmem device allocation failure for %pr\n", res);
> -               goto out_pdev;
> -       }
> -
> -       if (!__request_region(&hmem_active, res->start, resource_size(res),
> -                             dev_name(&pdev->dev), 0)) {
> -               dev_dbg(&pdev->dev, "hmem range %pr already active\n", res);
> -               goto out_active;
> -       }
> -
> -       pdev->dev.numa_node = numa_map_to_online_node(target_nid);
> -       info = (struct memregion_info) {
> -               .target_node = target_nid,
> -               .range = {
> -                       .start = res->start,
> -                       .end = res->end,
> -               },
> -       };
> -       rc = platform_device_add_data(pdev, &info, sizeof(info));
> -       if (rc < 0) {
> -               pr_err("hmem memregion_info allocation failure for %pr\n", res);
> -               goto out_resource;
> +               pr_err_once("failed to register device-dax hmem_platform device\n");
> +               return;
>         }
>  
>         rc = platform_device_add(pdev);
> -       if (rc < 0) {
> -               dev_err(&pdev->dev, "device add failed for %pr\n", res);
> -               goto out_resource;
> -       }
> +       if (rc)
> +               platform_device_put(pdev);
> +       else
> +               platform_initialized = true;
> +}
>  
> -       return;
> +void hmem_register_resource(int target_nid, struct resource *res)
> +{
> +       if (nohmem)
> +               return;
>  
> -out_resource:
> -       __release_region(&hmem_active, res->start, resource_size(res));
> -out_active:
> -       platform_device_put(pdev);
> -out_pdev:
> -       memregion_free(id);
> +       mutex_lock(&hmem_resource_lock);
> +       __hmem_register_resource(target_nid, res);
> +       mutex_unlock(&hmem_resource_lock);
>  }
>  
>  static __init int hmem_register_one(struct resource *res, void *data)
>  {
> -       hmem_register_device(phys_to_target_node(res->start), res);
> +       hmem_register_resource(phys_to_target_node(res->start), res);
>  
>         return 0;
>  }
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index 5025a8c9850b..e7bdff3132fa 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -3,6 +3,7 @@
>  #include <linux/memregion.h>
>  #include <linux/module.h>
>  #include <linux/pfn_t.h>
> +#include <linux/dax.h>
>  #include "../bus.h"
>  
>  static bool region_idle;
> @@ -43,8 +44,110 @@ static struct platform_driver dax_hmem_driver = {
>         },
>  };
>  
> -module_platform_driver(dax_hmem_driver);
> +static void release_memregion(void *data)
> +{
> +       memregion_free((long) data);
> +}
> +
> +static void release_hmem(void *pdev)
> +{
> +       platform_device_unregister(pdev);
> +}
> +
> +static int hmem_register_device(struct device *host, int target_nid,
> +                               const struct resource *res)
> +{
> +       struct platform_device *pdev;
> +       struct memregion_info info;
> +       long id;
> +       int rc;
> +
> +       rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> +                              IORES_DESC_SOFT_RESERVED);
> +       if (rc != REGION_INTERSECTS)
> +               return 0;
> +
> +       id = memregion_alloc(GFP_KERNEL);
> +       if (id < 0) {
> +               dev_err(host, "memregion allocation failure for %pr\n", res);
> +               return -ENOMEM;
> +       }
> +       rc = devm_add_action_or_reset(host, release_memregion, (void *) id);
> +       if (rc)
> +               return rc;
> +
> +       pdev = platform_device_alloc("hmem", id);
> +       if (!pdev) {
> +               dev_err(host, "device allocation failure for %pr\n", res);
> +               return -ENOMEM;
> +       }
> +
> +       pdev->dev.numa_node = numa_map_to_online_node(target_nid);
> +       info = (struct memregion_info) {
> +               .target_node = target_nid,
> +               .range = {
> +                       .start = res->start,
> +                       .end = res->end,
> +               },
> +       };
> +       rc = platform_device_add_data(pdev, &info, sizeof(info));
> +       if (rc < 0) {
> +               dev_err(host, "memregion_info allocation failure for %pr\n",
> +                      res);
> +               goto out_put;
> +       }
> +
> +       rc = platform_device_add(pdev);
> +       if (rc < 0) {
> +               dev_err(host, "%s add failed for %pr\n", dev_name(&pdev->dev),
> +                       res);
> +               goto out_put;
> +       }
> +
> +       return devm_add_action_or_reset(host, release_hmem, pdev);
> +
> +out_put:
> +       platform_device_put(pdev);
> +       return rc;
> +}
> +
> +static int dax_hmem_platform_probe(struct platform_device *pdev)
> +{
> +       return walk_hmem_resources(&pdev->dev, hmem_register_device);
> +}
> +
> +static struct platform_driver dax_hmem_platform_driver = {
> +       .probe = dax_hmem_platform_probe,
> +       .driver = {
> +               .name = "hmem_platform",
> +       },
> +};
> +
> +static __init int dax_hmem_init(void)
> +{
> +       int rc;
> +
> +       rc = platform_driver_register(&dax_hmem_platform_driver);
> +       if (rc)
> +               return rc;
> +
> +       rc = platform_driver_register(&dax_hmem_driver);
> +       if (rc)
> +               platform_driver_unregister(&dax_hmem_platform_driver);
> +
> +       return rc;
> +}
> +
> +static __exit void dax_hmem_exit(void)
> +{
> +       platform_driver_unregister(&dax_hmem_driver);
> +       platform_driver_unregister(&dax_hmem_platform_driver);
> +}
> +
> +module_init(dax_hmem_init);
> +module_exit(dax_hmem_exit);
>  
>  MODULE_ALIAS("platform:hmem*");
> +MODULE_ALIAS("platform:hmem_platform*");
>  MODULE_LICENSE("GPL v2");
>  MODULE_AUTHOR("Intel Corporation");
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 2b5ecb591059..bf6258472e49 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -262,11 +262,14 @@ static inline bool dax_mapping(struct address_space *mapping)
>  }
>  
>  #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
> -void hmem_register_device(int target_nid, struct resource *r);
> +void hmem_register_resource(int target_nid, struct resource *r);
>  #else
> -static inline void hmem_register_device(int target_nid, struct resource *r)
> +static inline void hmem_register_resource(int target_nid, struct resource *r)
>  {
>  }
>  #endif
>  
> +typedef int (*walk_hmem_fn)(struct device *dev, int target_nid,
> +                           const struct resource *res);
> +int walk_hmem_resources(struct device *dev, walk_hmem_fn fn);
>  #endif
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 19/20] dax: Assign RAM regions to memory-hotplug by default
  2023-02-10  9:07 ` [PATCH v2 19/20] dax: Assign RAM regions to memory-hotplug by default Dan Williams
  2023-02-10 22:19   ` Dave Jiang
@ 2023-02-11  5:57   ` Verma, Vishal L
  1 sibling, 0 replies; 65+ messages in thread
From: Verma, Vishal L @ 2023-02-11  5:57 UTC (permalink / raw)
  To: Williams, Dan J, linux-cxl
  Cc: gregory.price, Hocko, Michal, fan.ni, linux-mm, david,
	dave.hansen, linux-acpi

On Fri, 2023-02-10 at 01:07 -0800, Dan Williams wrote:
> The default mode for device-dax instances is backwards for RAM-regions
> as evidenced by the fact that it tends to catch end users by surprise.
> "Where is my memory?". Recall that platforms are increasingly shipping
> with performance-differentiated memory pools beyond typical DRAM and
> NUMA effects. This includes HBM (high-bandwidth-memory) and CXL (dynamic
> interleave, varied media types, and future fabric attached
> possibilities).
> 
> For this reason the EFI_MEMORY_SP (EFI Special Purpose Memory => Linux
> 'Soft Reserved') attribute is expected to be applied to all memory-pools
> that are not the general purpose pool. This designation gives an
> Operating System a chance to defer usage of a memory pool until later in
> the boot process where its performance properties can be interrogated
> and administrator policy can be applied.
> 
> 'Soft Reserved' memory can be anything from too limited and precious to
> be part of the general purpose pool (HBM), too slow to host hot kernel
> data structures (some PMEM media), or anything in between. However, in
> the absence of an explicit policy, the memory should at least be made
> usable by default. The current device-dax default hides all
> non-general-purpose memory behind a device interface.
> 
> The expectation is that the distribution of users that want the memory
> online by default vs device-dedicated-access by default follows the
> Pareto principle. A small number of enlightened users may want to do
> userspace memory management through a device, but general users just
> want the kernel to make the memory available with an option to get more
> advanced later.
> 
> Arrange for all device-dax instances not backed by PMEM to default to
> attaching to the dax_kmem driver. From there the baseline memory hotplug
> policy (CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE / memhp_default_state=)
> gates whether the memory comes online or stays offline. Where, if it
> stays offline, it can be reliably converted back to device-mode where it
> can be partitioned, or fronted by a userspace allocator.
> 
> So, if someone wants device-dax instances for their 'Soft Reserved'
> memory:
> 
> 1/ Build a kernel with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n or boot
>    with memhp_default_state=offline, or roll the dice and hope that the
>    kernel has not pinned a page in that memory before step 2.
> 
> 2/ Write a udev rule to convert the target dax device(s) from
>    'system-ram' mode to 'devdax' mode:
> 
>    daxctl reconfigure-device $dax -m devdax -f
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Gregory Price <gregory.price@memverge.com>
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564544513.847146.4645646177864365755.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/dax/Kconfig     |    2 +-
>  drivers/dax/bus.c       |   53 ++++++++++++++++++++---------------------------
>  drivers/dax/bus.h       |   12 +++++++++--
>  drivers/dax/device.c    |    3 +--
>  drivers/dax/hmem/hmem.c |   12 ++++++++++-
>  drivers/dax/kmem.c      |    1 +
>  6 files changed, 46 insertions(+), 37 deletions(-)

Looks good,

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>

> 
> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> index d13c889c2a64..1163eb62e5f6 100644
> --- a/drivers/dax/Kconfig
> +++ b/drivers/dax/Kconfig
> @@ -50,7 +50,7 @@ config DEV_DAX_HMEM_DEVICES
>         def_bool y
>  
>  config DEV_DAX_KMEM
> -       tristate "KMEM DAX: volatile-use of persistent memory"
> +       tristate "KMEM DAX: map dax-devices as System-RAM"
>         default DEV_DAX
>         depends on DEV_DAX
>         depends on MEMORY_HOTPLUG # for add_memory() and friends
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 1dad813ee4a6..012d576004e9 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -56,6 +56,25 @@ static int dax_match_id(struct dax_device_driver *dax_drv, struct device *dev)
>         return match;
>  }
>  
> +static int dax_match_type(struct dax_device_driver *dax_drv, struct device *dev)
> +{
> +       enum dax_driver_type type = DAXDRV_DEVICE_TYPE;
> +       struct dev_dax *dev_dax = to_dev_dax(dev);
> +
> +       if (dev_dax->region->res.flags & IORESOURCE_DAX_KMEM)
> +               type = DAXDRV_KMEM_TYPE;
> +
> +       if (dax_drv->type == type)
> +               return 1;
> +
> +       /* default to device mode if dax_kmem is disabled */
> +       if (dax_drv->type == DAXDRV_DEVICE_TYPE &&
> +           !IS_ENABLED(CONFIG_DEV_DAX_KMEM))
> +               return 1;
> +
> +       return 0;
> +}
> +
>  enum id_action {
>         ID_REMOVE,
>         ID_ADD,
> @@ -216,14 +235,9 @@ static int dax_bus_match(struct device *dev, struct device_driver *drv)
>  {
>         struct dax_device_driver *dax_drv = to_dax_drv(drv);
>  
> -       /*
> -        * All but the 'device-dax' driver, which has 'match_always'
> -        * set, requires an exact id match.
> -        */
> -       if (dax_drv->match_always)
> +       if (dax_match_id(dax_drv, dev))
>                 return 1;
> -
> -       return dax_match_id(dax_drv, dev);
> +       return dax_match_type(dax_drv, dev);
>  }
>  
>  /*
> @@ -1413,13 +1427,10 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
>  }
>  EXPORT_SYMBOL_GPL(devm_create_dev_dax);
>  
> -static int match_always_count;
> -
>  int __dax_driver_register(struct dax_device_driver *dax_drv,
>                 struct module *module, const char *mod_name)
>  {
>         struct device_driver *drv = &dax_drv->drv;
> -       int rc = 0;
>  
>         /*
>          * dax_bus_probe() calls dax_drv->probe() unconditionally.
> @@ -1434,26 +1445,7 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
>         drv->mod_name = mod_name;
>         drv->bus = &dax_bus_type;
>  
> -       /* there can only be one default driver */
> -       mutex_lock(&dax_bus_lock);
> -       match_always_count += dax_drv->match_always;
> -       if (match_always_count > 1) {
> -               match_always_count--;
> -               WARN_ON(1);
> -               rc = -EINVAL;
> -       }
> -       mutex_unlock(&dax_bus_lock);
> -       if (rc)
> -               return rc;
> -
> -       rc = driver_register(drv);
> -       if (rc && dax_drv->match_always) {
> -               mutex_lock(&dax_bus_lock);
> -               match_always_count -= dax_drv->match_always;
> -               mutex_unlock(&dax_bus_lock);
> -       }
> -
> -       return rc;
> +       return driver_register(drv);
>  }
>  EXPORT_SYMBOL_GPL(__dax_driver_register);
>  
> @@ -1463,7 +1455,6 @@ void dax_driver_unregister(struct dax_device_driver *dax_drv)
>         struct dax_id *dax_id, *_id;
>  
>         mutex_lock(&dax_bus_lock);
> -       match_always_count -= dax_drv->match_always;
>         list_for_each_entry_safe(dax_id, _id, &dax_drv->ids, list) {
>                 list_del(&dax_id->list);
>                 kfree(dax_id);
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index fbb940293d6d..8cd79ab34292 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -11,7 +11,10 @@ struct dax_device;
>  struct dax_region;
>  void dax_region_put(struct dax_region *dax_region);
>  
> -#define IORESOURCE_DAX_STATIC (1UL << 0)
> +/* dax bus specific ioresource flags */
> +#define IORESOURCE_DAX_STATIC BIT(0)
> +#define IORESOURCE_DAX_KMEM BIT(1)
> +
>  struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>                 struct range *range, int target_node, unsigned int align,
>                 unsigned long flags);
> @@ -25,10 +28,15 @@ struct dev_dax_data {
>  
>  struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data);
>  
> +enum dax_driver_type {
> +       DAXDRV_KMEM_TYPE,
> +       DAXDRV_DEVICE_TYPE,
> +};
> +
>  struct dax_device_driver {
>         struct device_driver drv;
>         struct list_head ids;
> -       int match_always;
> +       enum dax_driver_type type;
>         int (*probe)(struct dev_dax *dev);
>         void (*remove)(struct dev_dax *dev);
>  };
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 5494d745ced5..ecdff79e31f2 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -475,8 +475,7 @@ EXPORT_SYMBOL_GPL(dev_dax_probe);
>  
>  static struct dax_device_driver device_dax_driver = {
>         .probe = dev_dax_probe,
> -       /* all probe actions are unwound by devm, so .remove isn't necessary */
> -       .match_always = 1,
> +       .type = DAXDRV_DEVICE_TYPE,
>  };
>  
>  static int __init dax_init(void)
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index e7bdff3132fa..5ec08f9f8a57 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -11,15 +11,25 @@ module_param_named(region_idle, region_idle, bool, 0644);
>  
>  static int dax_hmem_probe(struct platform_device *pdev)
>  {
> +       unsigned long flags = IORESOURCE_DAX_KMEM;
>         struct device *dev = &pdev->dev;
>         struct dax_region *dax_region;
>         struct memregion_info *mri;
>         struct dev_dax_data data;
>         struct dev_dax *dev_dax;
>  
> +       /*
> +        * @region_idle == true indicates that an administrative agent
> +        * wants to manipulate the range partitioning before the devices
> +        * are created, so do not send them to the dax_kmem driver by
> +        * default.
> +        */
> +       if (region_idle)
> +               flags = 0;
> +
>         mri = dev->platform_data;
>         dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
> -                                     mri->target_node, PMD_SIZE, 0);
> +                                     mri->target_node, PMD_SIZE, flags);
>         if (!dax_region)
>                 return -ENOMEM;
>  
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 4852a2dbdb27..918d01d3fbaa 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -239,6 +239,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
>  static struct dax_device_driver device_dax_kmem_driver = {
>         .probe = dev_dax_kmem_probe,
>         .remove = dev_dax_kmem_remove,
> +       .type = DAXDRV_KMEM_TYPE,
>  };
>  
>  static int __init dax_kmem_init(void)
> 



* Re: [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default
  2023-02-10 17:53 ` [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
@ 2023-02-11 14:04   ` Gregory Price
  0 siblings, 0 replies; 65+ messages in thread
From: Gregory Price @ 2023-02-11 14:04 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Ira Weiny, David Hildenbrand, Dave Jiang,
	Davidlohr Bueso, Kees Cook, Jonathan Cameron, Vishal Verma,
	Dave Hansen, Michal Hocko, Fan Ni, linux-mm, linux-acpi

On Fri, Feb 10, 2023 at 09:53:35AM -0800, Dan Williams wrote:
> Dan Williams wrote:
> > Changes since v1: [1]
> > - Add a fix for memdev removal racing port removal (found by unit tests)
> > - Add a fix to unwind region target list updates on error in
> >   cxl_region_attach() (Jonathan)
> > - Move the passthrough decoder fix for submission for v6.2-final (Greg)
> > - Fix wrong initcall for cxl_core (Gregory and Davidlohr)
> > - Add an endpoint decoder state (CXL_DECODER_STATE_AUTO) to replace
> >   the flag CXL_DECODER_F_AUTO (Jonathan)
> > - Reflow cmp_decode_pos() to reduce levels of indentation (Jonathan)
> > - Fix a leaked reference count in cxl_add_to_region() (Jonathan)
> > - Make cxl_add_to_region() return an error (Jonathan)
> > - Fix several spurious whitespace changes (Jonathan)
> > - Cleanup some spurious changes from the tools/testing/cxl update
> >   (Jonathan)
> > - Test for == CXL_CONFIG_COMMIT rather than >= CXL_CONFIG_COMMIT
> >   (Jonathan)
> > - Add comment to clarify device_attach() return code expectation in
> >   cxl_add_to_region() (Jonathan)
> > - Add a patch to split cxl_port_probe() into switch and endpoint port
> >   probe calls (Jonathan)
> > - Collect reviewed-by and tested-by tags
> > 
> > [1]: http://lore.kernel.org/r/167564534874.847146.5222419648551436750.stgit@dwillia2-xfh.jf.intel.com
> > 
> > ---
> > Cover letter same as v1
> 
> Thanks for all the review so far! The outstanding backlog is still too
> high to definitively say this will make v6.3:
> 
> http://lore.kernel.org/r/167601992789.1924368.8083994227892600608.stgit@dwillia2-xfh.jf.intel.com
> http://lore.kernel.org/r/167601996980.1924368.390423634911157277.stgit@dwillia2-xfh.jf.intel.com
> http://lore.kernel.org/r/167601999378.1924368.15071142145866277623.stgit@dwillia2-xfh.jf.intel.com
> http://lore.kernel.org/r/167601999958.1924368.9366954455835735048.stgit@dwillia2-xfh.jf.intel.com
> http://lore.kernel.org/r/167602000547.1924368.11613151863880268868.stgit@dwillia2-xfh.jf.intel.com
> http://lore.kernel.org/r/167602001107.1924368.11562316181038595611.stgit@dwillia2-xfh.jf.intel.com
> http://lore.kernel.org/r/167602002771.1924368.5653558226424530127.stgit@dwillia2-xfh.jf.intel.com
> http://lore.kernel.org/r/167602003896.1924368.10335442077318970468.stgit@dwillia2-xfh.jf.intel.com
> 
> ...what I plan to do is provisionally include it in -next and then make
> a judgement call next Friday.
> 
> I am encouraged by Fan's test results:
> 
> http://lore.kernel.org/r/20230208173720.GA709329@bgt-140510-bm03
> 
> ...and am reminded that there are some non-trivial TODOs pent up behind
> region enumeration:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9ea4dcf49878

I have also been testing with generally positive results; I've just been
traveling, so internet has been spotty.

I'll post a full breakdown of what I've been doing on Monday, including
poking the new kernel boot parameter and such.

~Gregory


* Re: [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default
  2023-02-10  9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
                   ` (20 preceding siblings ...)
  2023-02-10 17:53 ` [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
@ 2023-02-13 18:22 ` Gregory Price
  2023-02-13 18:31   ` Gregory Price
  2023-02-14 13:35   ` Jonathan Cameron
  21 siblings, 2 replies; 65+ messages in thread
From: Gregory Price @ 2023-02-13 18:22 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Ira Weiny, David Hildenbrand, Dave Jiang,
	Davidlohr Bueso, Kees Cook, Jonathan Cameron, Vishal Verma,
	Dave Hansen, Michal Hocko, Fan Ni, linux-mm, linux-acpi

On Fri, Feb 10, 2023 at 01:05:21AM -0800, Dan Williams wrote:
> Changes since v1: [1]
> [... snip ...]

For a single attached device, I have been finding general success.

For multiple attached devices, I'm seeing some strange behaviors.

With multiple root ports, I got some stack traces before deciding
I needed multiple CFMWs to do this "correctly", and just attached
multiple pxb-cxl to the root bus.

Obviously this configuration is "not great", and some form of
"impossible in the real world", but it's worth examining, I think.

/opt/qemu-cxl/bin/qemu-system-x86_64 \
-drive file=/data/qemu/images/pool/pool1.qcow2,format=qcow2,index=0,media=disk,id=hd \
-m 4G,slots=4,maxmem=16G \
-smp 4 \
-machine type=q35,accel=kvm,cxl=on \
-enable-kvm \
-nographic \
-netdev bridge,id=hn0,br=virbr0 \
-device virtio-net-pci,netdev=hn0,id=nic1,mac=52:54:00:12:34:56 \
-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
-device pxb-cxl,id=cxl.1,bus=pcie.0,bus_nr=191 \
-device pxb-cxl,id=cxl.2,bus=pcie.0,bus_nr=230 \
-device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
-device cxl-rp,id=rp1,bus=cxl.1,chassis=0,port=1,slot=1 \
-device cxl-rp,id=rp2,bus=cxl.2,chassis=0,port=2,slot=2 \
-object memory-backend-ram,id=mem0,size=4G \
-object memory-backend-ram,id=mem1,size=4G \
-object memory-backend-ram,id=mem2,size=4G \
-device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
-device cxl-type3,bus=rp1,volatile-memdev=mem1,id=cxl-mem1 \
-device cxl-type3,bus=rp2,volatile-memdev=mem2,id=cxl-mem2 \
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.size=4G,cxl-fmw.2.targets.0=cxl.2,cxl-fmw.2.size=4G
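For what it's worth, with three 4 GiB windows declared back-to-back, the
expected layout is easy to sanity-check; a minimal sketch (the 0x590000000
base is just what this particular QEMU firmware layout happened to pick,
taken from the decoder resources reported further down):

```python
# Sketch: expected layout of three back-to-back 4 GiB CXL fixed memory
# windows. The base address is platform-dependent; 0x590000000 matches
# the decoder0.0 resource this VM reports later in this mail.
GIB = 1 << 30
base = 0x590000000
windows = [(base + i * 4 * GIB, base + (i + 1) * 4 * GIB - 1)
           for i in range(3)]
for i, (start, end) in enumerate(windows):
    print(f"cxl-fmw.{i}: {start:#x}-{end:#x}")
```

Those starts line up with the 0x590000000 / 0x690000000 / 0x790000000
region resources that show up below.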

The goal here should be to have 3 different memory expanders have their
regions created and mapped to 3 different numa nodes.

One piece I'm not confident about is my CFMWs
(listed more readably:)

-M cxl-fmw.0.targets.0=cxl.0,
   cxl-fmw.0.size=4G,
   cxl-fmw.1.targets.0=cxl.1,
   cxl-fmw.1.size=4G,
   cxl-fmw.2.targets.0=cxl.2,
   cxl-fmw.2.size=4G

should targets in this case be targets.0/1/2, or all of them targets.0?

Either way, I would expect 3 root decoders and 3 memory devices.

[root@fedora ~]# ls /sys/bus/cxl/devices/
decoder0.0  decoder1.0  decoder4.0  endpoint4  mem0  nvdimm-bridge0  port3
decoder0.1  decoder2.0  decoder5.0  endpoint5  mem1  port1           root0
decoder0.2  decoder3.0  decoder6.0  endpoint6  mem2  port2

I see the devices I expect, but I would also expect the following mappings:
(cxl list output at the bottom)

decoder0.0 -> mem0
decoder0.1 -> mem1
decoder0.2 -> mem2

root0 -> [decoder0.0, 0.1, 0.2]
root0 -> [port1, 2, 3]
port1 -> mem0
port2 -> mem1
port3 -> mem2

In reality, I see these decoder and device mappings set up:
port1 -> mem2
port2 -> mem1
port3 -> mem0

Therefore I should expect
decoder0.0 -> mem2
decoder0.1 -> mem1
decoder0.2 -> mem0

This bears out: attempting to use any other combination produces ndctl errors.

So the numbers are backwards; maybe that's relevant, maybe it's not.
The devices are otherwise completely the same, so for the most part
everything might "just work".  Let's keep testing.
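One way to take the guesswork out of the decoder-to-memdev pairing is to
walk the `cxl list -vvvv` JSON: memdev -> endpoint port -> host bridge ->
root decoder target. A minimal sketch with made-up topology data (the
port/bridge/memdev names here are illustrative, not this VM's; the real
listing is at the bottom of this mail):

```python
import json

# Illustrative cxl-list-shaped topology (hypothetical names; only the
# key layout mirrors real `cxl list -vvvv` output).
listing = json.loads("""
[{
  "bus": "root0",
  "ports:root0": [
    {"port": "portA", "host": "pci0000:10",
     "endpoints:portA": [{"endpoint": "endpointA", "host": "memA"}]},
    {"port": "portB", "host": "pci0000:20",
     "endpoints:portB": [{"endpoint": "endpointB", "host": "memB"}]}
  ],
  "decoders:root0": [
    {"decoder": "decoder0.0", "targets": [{"target": "pci0000:20"}]},
    {"decoder": "decoder0.1", "targets": [{"target": "pci0000:10"}]}
  ]
}]
""")

def decoder_to_memdev(bus):
    # host bridge (e.g. "pci0000:20") -> memdev attached below that port
    bridge_to_mem = {
        port["host"]: port["endpoints:" + port["port"]][0]["host"]
        for port in bus["ports:root0"]
    }
    # root decoder -> memdev, via the decoder's host-bridge target
    return {
        dec["decoder"]: bridge_to_mem[dec["targets"][0]["target"]]
        for dec in bus["decoders:root0"]
    }

print(decoder_to_memdev(listing[0]))
# decoder numbering need not follow memdev numbering
```

Run against the real topology, this would show whether the inverted
numbering is just enumeration order or a genuine mis-wiring.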


[root@fedora ~]# cat create_region.sh
./ndctl/build/cxl/cxl \
  create-region \
  -m \
  -t ram \
  -d decoder0.$1 \
  -w 1 \
  -g 4096 \
  mem$2

[root@fedora ~]# ./create_region.sh 2 0
[   34.424931] cxl_region region2: Bypassing cpu_cache_invalidate_memregion() for testing!
{
  "region":"region2",
  "resource":"0x790000000",
  "size":"4.00 GiB (4.29 GB)",
  "type":"ram",
  "interleave_ways":1,
  "interleave_granularity":4096,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem0",
      "decoder":"decoder4.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region

[   34.486668] Fallback order for Node 3: 3 0
[   34.487568] Built 1 zonelists, mobility grouping on.  Total pages: 979669
[   34.488206] Policy zone: Normal
[   34.501938] Fallback order for Node 0: 0 3
[   34.502405] Fallback order for Node 1: 1 3 0
[   34.502832] Fallback order for Node 2: 2 3 0
[   34.503251] Fallback order for Node 3: 3 0
[   34.503649] Built 2 zonelists, mobility grouping on.  Total pages: 1012437
[   34.504296] Policy zone: Normal



Cool, looks good.  Let's try mem1.



[root@fedora ~]# ./create_region.sh 1 1

[   98.787029] Fallback order for Node 2: 2 3 0
[   98.787630] Built 2 zonelists, mobility grouping on.  Total pages: 2019798
[   98.788483] Policy zone: Normal
[  128.301580] Fallback order for Node 0: 0 2 3
[  128.302084] Fallback order for Node 1: 1 3 2 0
[  128.302547] Fallback order for Node 2: 2 3 0
[  128.303009] Fallback order for Node 3: 3 2 0
[  128.303436] Built 3 zonelists, mobility grouping on.  Total pages: 2052566
[  128.304071] Policy zone: Normal
[ .... wait 20-30 more seconds .... ]
{
  "region":"region1",
  "resource":"0x690000000",
  "size":"4.00 GiB (4.29 GB)",
  "type":"ram",
  "interleave_ways":1,
  "interleave_granularity":4096,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem1",
      "decoder":"decoder5.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region


This takes a LONG time to complete. Maybe that's expected, I don't know.


Let's online mem2.


[root@fedora ~]# ./create_region.sh 0 2
extra data[7]: 0x0000000000000000
emulation failure
RAX=0000000000000000 RBX=ffff8a6f90006800 RCX=0000000000100001 RDX=0000000080100010
RSI=ffffca291a400000 RDI=0000000040000000 RBP=ffff9684c0017a60 RSP=ffff9684c0017a30
R8 =ffff8a6f90006800 R9 =0000000000100001 R10=0000000000000000 R11=0000000000000001
R12=ffffca291a400000 R13=0000000000100001 R14=0000000000000000 R15=0000000080100010
RIP=ffffffffb71c5831 RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00c00000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 ffffffff 00c00000
FS =0000 00007fd03025db40 ffffffff 00c00000
GS =0000 ffff8a6a7bd00000 ffffffff 00c00000
LDT=0000 0000000000000000 ffffffff 00c00000
TR =0040 fffffe46e6e25000 00004087 00008b00 DPL=0 TSS64-busy
GDT=     fffffe46e6e23000 0000007f
IDT=     fffffe0000000000 00000fff
CR0=80050033 CR2=00005604371ab0c8 CR3=0000000102ece000 CR4=000006e0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=83 ec 08 81 e7 00 00 00 40 74 2c 48 89 d0 48 89 ca 4c 89 c9 <f0> 48 0f c7 4e 20 0f 84 85 00 00 00 f3 90 48 83 c4 08 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d


Well that seems bad lol.  I'm not sure what to make of this since my
scrollback cuts off and the machine completely locks up.  I have never
seen "emulation failure" before.


Reboot and attempt to online that region by itself:

[root@fedora ~]# ./create_region.sh 0 2
[   21.292598] cxl_region region0: Bypassing cpu_cache_invalidate_memregion() for testing!
[   21.341753] Fallback order for Node 1: 1 0
[   21.342462] Built 1 zonelists, mobility grouping on.  Total pages: 979670
[   21.343085] Policy zone: Normal
[   21.355166] Fallback order for Node 0: 0 1
[   21.355613] Fallback order for Node 1: 1 0
[   21.356009] Fallback order for Node 2: 2 1 0
[   21.356441] Fallback order for Node 3: 3 1 0
[   21.356874] Built 2 zonelists, mobility grouping on.  Total pages: 1012438
[   21.357501] Policy zone: Normal
{
  "region":"region0",
  "resource":"0x590000000",
  "size":"4.00 GiB (4.29 GB)",
  "type":"ram",
  "interleave_ways":1,
  "interleave_granularity":4096,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem2",
      "decoder":"decoder6.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region


That works fine, and works just like onlining the first region (2,0).

This suggests the issue is actually with creating multiple regions in
this topology.



Bonus round: booting with memhp_default_state=offline

All regions successfully get created without error.



I have a few guesses, but haven't dived in yet:

1) There's a QEMU error in the way this configuration routes to various
   components of the CXL structure, and/or multiple pxb-cxl's do bad
   things and I should feel bad for doing this configuration.

2) There's something going on when creating the topology, leading to the
   inverted [decoder0.2, mem0], [decoder0.1, mem1], [decoder0.0, mem2]
   mappings, leading to inconsistent device control.  Or I'm making a
   bad assumption and this is expected behavior.

3) The memory block creation / online code is getting hung up somewhere.
   Why does the second region take forever to online?

4) Something else completely.


My gut at the moment tells me my configuration is bad, but I have no
idea why.  If anyone has an idea of what I should look for, let me know.


cxl list output for completeness:

[root@fedora ~]# ./ndctl/build/cxl/cxl list -vvvv
[
  {
    "bus":"root0",
    "provider":"ACPI.CXL",
    "nr_dports":3,
    "dports":[
      {
        "dport":"pci0000:e6",
        "alias":"ACPI0016:00",
        "id":230
      },
      {
        "dport":"pci0000:bf",
        "alias":"ACPI0016:01",
        "id":191
      },
      {
        "dport":"pci0000:34",
        "alias":"ACPI0016:02",
        "id":52
      }
    ],
    "ports:root0":[
      {
        "port":"port1",
        "host":"pci0000:e6",
        "depth":1,
        "nr_dports":1,
        "dports":[
          {
            "dport":"0000:e6:00.0",
            "id":2
          }
        ],
        "endpoints:port1":[
          {
            "endpoint":"endpoint5",
            "host":"mem1",
            "depth":2,
            "memdev":{
              "memdev":"mem1",
              "ram_size":4294967296,
              "serial":0,
              "host":"0000:e7:00.0",
              "partition_info":{
                "total_size":4294967296,
                "volatile_only_size":4294967296,
                "persistent_only_size":0,
                "partition_alignment_size":0
              }
            },
            "decoders:endpoint5":[
              {
                "decoder":"decoder5.0",
                "interleave_ways":1,
                "state":"disabled"
              }
            ]
          }
        ],
        "decoders:port1":[
          {
            "decoder":"decoder1.0",
            "interleave_ways":1,
            "state":"disabled",
            "nr_targets":1,
            "targets":[
              {
                "target":"0000:e6:00.0",
                "position":0,
                "id":2
              }
            ]
          }
        ]
      },
      {
        "port":"port3",
        "host":"pci0000:34",
        "depth":1,
        "nr_dports":1,
        "dports":[
          {
            "dport":"0000:34:00.0",
            "id":0
          }
        ],
        "endpoints:port3":[
          {
            "endpoint":"endpoint4",
            "host":"mem0",
            "depth":2,
            "memdev":{
              "memdev":"mem0",
              "ram_size":4294967296,
              "serial":0,
              "host":"0000:35:00.0",
              "partition_info":{
                "total_size":4294967296,
                "volatile_only_size":4294967296,
                "persistent_only_size":0,
                "partition_alignment_size":0
              }
            },
            "decoders:endpoint4":[
              {
                "decoder":"decoder4.0",
                "interleave_ways":1,
                "state":"disabled"
              }
            ]
          }
        ],
        "decoders:port3":[
          {
            "decoder":"decoder3.0",
            "interleave_ways":1,
            "state":"disabled",
            "nr_targets":1,
            "targets":[
              {
                "target":"0000:34:00.0",
                "position":0,
                "id":0
              }
            ]
          }
        ]
      },
      {
        "port":"port2",
        "host":"pci0000:bf",
        "depth":1,
        "nr_dports":1,
        "dports":[
          {
            "dport":"0000:bf:00.0",
            "id":1
          }
        ],
        "endpoints:port2":[
          {
            "endpoint":"endpoint6",
            "host":"mem2",
            "depth":2,
            "memdev":{
              "memdev":"mem2",
              "ram_size":4294967296,
              "serial":0,
              "host":"0000:c0:00.0",
              "partition_info":{
                "total_size":4294967296,
                "volatile_only_size":4294967296,
                "persistent_only_size":0,
                "partition_alignment_size":0
              }
            },
            "decoders:endpoint6":[
              {
                "decoder":"decoder6.0",
                "interleave_ways":1,
                "state":"disabled"
              }
            ]
          }
        ],
        "decoders:port2":[
          {
            "decoder":"decoder2.0",
            "interleave_ways":1,
            "state":"disabled",
            "nr_targets":1,
            "targets":[
              {
                "target":"0000:bf:00.0",
                "position":0,
                "id":1
              }
            ]
          }
        ]
      }
    ],
    "decoders:root0":[
      {
        "decoder":"decoder0.0",
        "resource":23890755584,
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":4294967296,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:34",
            "alias":"ACPI0016:02",
            "position":0,
            "id":52
          }
        ]
      },
      {
        "decoder":"decoder0.1",
        "resource":28185722880,
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":4294967296,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:bf",
            "alias":"ACPI0016:01",
            "position":0,
            "id":191
          }
        ]
      },
      {
        "decoder":"decoder0.2",
        "resource":32480690176,
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":4294967296,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:e6",
            "alias":"ACPI0016:00",
            "position":0,
            "id":230
          }
        ]
      }
    ]
  }
]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default
  2023-02-13 18:22 ` Gregory Price
@ 2023-02-13 18:31   ` Gregory Price
       [not found]     ` <CGME20230222214151uscas1p26d53b2e198f63a1f382fe575c6c25070@uscas1p2.samsung.com>
  2023-02-14 13:35   ` Jonathan Cameron
  1 sibling, 1 reply; 65+ messages in thread
From: Gregory Price @ 2023-02-13 18:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Ira Weiny, David Hildenbrand, Dave Jiang,
	Davidlohr Bueso, Kees Cook, Jonathan Cameron, Vishal Verma,
	Dave Hansen, Michal Hocko, Fan Ni, linux-mm, linux-acpi

On Mon, Feb 13, 2023 at 01:22:17PM -0500, Gregory Price wrote:
> On Fri, Feb 10, 2023 at 01:05:21AM -0800, Dan Williams wrote:
> > Changes since v1: [1]
> > [... snip ...]
> [... snip ...]
> Really I see these decoders and device mappings set up:
> port1 -> mem2
> port2 -> mem1
> port3 -> mem0

small correction:
port1 -> mem1
port3 -> mem0
port2 -> mem2

> 
> Therefore I should expect
> decoder0.0 -> mem2
> decoder0.1 -> mem1
> decoder0.2 -> mem0
> 

This ends up mapping this way, which is still further jumbled.

Something feels like there's an off-by-one somewhere.


* Re: [PATCH v2 13/20] cxl/region: Add region autodiscovery
       [not found]   ` <CGME20230213192752uscas1p1c49508da4b100c9ba6a1a3aa92ca03e5@uscas1p1.samsung.com>
@ 2023-02-13 19:27     ` Fan Ni
  0 siblings, 0 replies; 65+ messages in thread
From: Fan Ni @ 2023-02-13 19:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, vishal.l.verma, dave.hansen, linux-mm, linux-acpi, arnd

On Fri, Feb 10, 2023 at 01:06:39AM -0800, Dan Williams wrote:
> Region autodiscovery is an asynchronous state machine advanced by
> cxl_port_probe(). After the decoders on an endpoint port are enumerated
> they are scanned for actively enabled instances. Each active decoder is
> flagged for auto-assembly CXL_DECODER_F_AUTO and attached to a region.
> If a region does not already exist for the address range setting of the
> decoder, one is created. That creation process may race with other
> decoders of the same region being discovered since cxl_port_probe() is
> asynchronous. A new 'struct cxl_root_decoder' lock, @range_lock, is
> introduced to mitigate that race.
> 
> Once all decoders have arrived, "p->nr_targets == p->interleave_ways",
> they are sorted by their relative decode position. The sort algorithm
> involves finding the point in the cxl_port topology where one leg of the
> decode leads to deviceA and the other deviceB. At that point in the
> topology the target order in the 'struct cxl_switch_decoder' indicates
> the relative position of those endpoint decoders in the region.
> 
> From that point the region goes through the same setup and validation
> steps as user-created regions, but instead of programming the decoders
> it validates that the driver would have written the same values to the
> decoders as were already present.
> 
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564540972.847146.17096178433176097831.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/cxl/core/hdm.c    |   11 +
>  drivers/cxl/core/port.c   |    2 
>  drivers/cxl/core/region.c |  497 ++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/cxl/cxl.h         |   29 +++
>  drivers/cxl/port.c        |   48 ++++
>  5 files changed, 576 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index a0891c3464f1..8c29026a4b9d 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -676,6 +676,14 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld)
>  	port->commit_end--;
>  	cxld->flags &= ~CXL_DECODER_F_ENABLE;
>  
> +	/* Userspace is now responsible for reconfiguring this decoder */
> +	if (is_endpoint_decoder(&cxld->dev)) {
> +		struct cxl_endpoint_decoder *cxled;
> +
> +		cxled = to_cxl_endpoint_decoder(&cxld->dev);
> +		cxled->state = CXL_DECODER_STATE_MANUAL;
> +	}
> +
>  	return 0;
>  }
>  
> @@ -783,6 +791,9 @@ static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
>  		return rc;
>  	}
>  	*dpa_base += dpa_size + skip;
> +
> +	cxled->state = CXL_DECODER_STATE_AUTO;
> +
>  	return 0;
>  }
>  
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 9e5df64ea6b5..59620528571a 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -446,6 +446,7 @@ bool is_endpoint_decoder(struct device *dev)
>  {
>  	return dev->type == &cxl_decoder_endpoint_type;
>  }
> +EXPORT_SYMBOL_NS_GPL(is_endpoint_decoder, CXL);
>  
>  bool is_root_decoder(struct device *dev)
>  {
> @@ -1628,6 +1629,7 @@ struct cxl_root_decoder *cxl_root_decoder_alloc(struct cxl_port *port,
>  	}
>  
>  	cxlrd->calc_hb = calc_hb;
> +	mutex_init(&cxlrd->range_lock);
>  
>  	cxld = &cxlsd->cxld;
>  	cxld->dev.type = &cxl_decoder_root_type;
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 691605f1e120..3f6453da2c51 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -6,6 +6,7 @@
>  #include <linux/module.h>
>  #include <linux/slab.h>
>  #include <linux/uuid.h>
> +#include <linux/sort.h>
>  #include <linux/idr.h>
>  #include <cxlmem.h>
>  #include <cxl.h>
> @@ -524,7 +525,12 @@ static void cxl_region_iomem_release(struct cxl_region *cxlr)
>  	if (device_is_registered(&cxlr->dev))
>  		lockdep_assert_held_write(&cxl_region_rwsem);
>  	if (p->res) {
> -		remove_resource(p->res);
> +		/*
> +		 * Autodiscovered regions may not have been able to insert their
> +		 * resource.
> +		 */
> +		if (p->res->parent)
> +			remove_resource(p->res);
>  		kfree(p->res);
>  		p->res = NULL;
>  	}
> @@ -1105,12 +1111,35 @@ static int cxl_port_setup_targets(struct cxl_port *port,
>  		return rc;
>  	}
>  
> -	cxld->interleave_ways = iw;
> -	cxld->interleave_granularity = ig;
> -	cxld->hpa_range = (struct range) {
> -		.start = p->res->start,
> -		.end = p->res->end,
> -	};
> +	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
> +		if (cxld->interleave_ways != iw ||
> +		    cxld->interleave_granularity != ig ||
> +		    cxld->hpa_range.start != p->res->start ||
> +		    cxld->hpa_range.end != p->res->end ||
> +		    ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) {
> +			dev_err(&cxlr->dev,
> +				"%s:%s %s expected iw: %d ig: %d %pr\n",
> +				dev_name(port->uport), dev_name(&port->dev),
> +				__func__, iw, ig, p->res);
> +			dev_err(&cxlr->dev,
> +				"%s:%s %s got iw: %d ig: %d state: %s %#llx:%#llx\n",
> +				dev_name(port->uport), dev_name(&port->dev),
> +				__func__, cxld->interleave_ways,
> +				cxld->interleave_granularity,
> +				(cxld->flags & CXL_DECODER_F_ENABLE) ?
> +					"enabled" :
> +					"disabled",
> +				cxld->hpa_range.start, cxld->hpa_range.end);
> +			return -ENXIO;
> +		}
> +	} else {
> +		cxld->interleave_ways = iw;
> +		cxld->interleave_granularity = ig;
> +		cxld->hpa_range = (struct range) {
> +			.start = p->res->start,
> +			.end = p->res->end,
> +		};
> +	}
>  	dev_dbg(&cxlr->dev, "%s:%s iw: %d ig: %d\n", dev_name(port->uport),
>  		dev_name(&port->dev), iw, ig);
>  add_target:
> @@ -1121,7 +1150,17 @@ static int cxl_port_setup_targets(struct cxl_port *port,
>  			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev), pos);
>  		return -ENXIO;
>  	}
> -	cxlsd->target[cxl_rr->nr_targets_set] = ep->dport;
> +	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
> +		if (cxlsd->target[cxl_rr->nr_targets_set] != ep->dport) {
> +			dev_dbg(&cxlr->dev, "%s:%s: %s expected %s at %d\n",
> +				dev_name(port->uport), dev_name(&port->dev),
> +				dev_name(&cxlsd->cxld.dev),
> +				dev_name(ep->dport->dport),
> +				cxl_rr->nr_targets_set);
> +			return -ENXIO;
> +		}
> +	} else
> +		cxlsd->target[cxl_rr->nr_targets_set] = ep->dport;
>  	inc = 1;
>  out_target_set:
>  	cxl_rr->nr_targets_set += inc;
> @@ -1163,6 +1202,13 @@ static void cxl_region_teardown_targets(struct cxl_region *cxlr)
>  	struct cxl_ep *ep;
>  	int i;
>  
> +	/*
> +	 * In the auto-discovery case skip automatic teardown since the
> +	 * address space is already active
> +	 */
> +	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags))
> +		return;
> +
>  	for (i = 0; i < p->nr_targets; i++) {
>  		cxled = p->targets[i];
>  		cxlmd = cxled_to_memdev(cxled);
> @@ -1195,8 +1241,8 @@ static int cxl_region_setup_targets(struct cxl_region *cxlr)
>  			iter = to_cxl_port(iter->dev.parent);
>  
>  		/*
> -		 * Descend the topology tree programming targets while
> -		 * looking for conflicts.
> +		 * Descend the topology tree programming / validating
> +		 * targets while looking for conflicts.
>  		 */
>  		for (ep = cxl_ep_load(iter, cxlmd); iter;
>  		     iter = ep->next, ep = cxl_ep_load(iter, cxlmd)) {
> @@ -1291,6 +1337,185 @@ static int cxl_region_attach_position(struct cxl_region *cxlr,
>  	return rc;
>  }
>  
> +static int cxl_region_attach_auto(struct cxl_region *cxlr,
> +				  struct cxl_endpoint_decoder *cxled, int pos)
> +{
> +	struct cxl_region_params *p = &cxlr->params;
> +
> +	if (cxled->state != CXL_DECODER_STATE_AUTO) {
> +		dev_err(&cxlr->dev,
> +			"%s: unable to add decoder to autodetected region\n",
> +			dev_name(&cxled->cxld.dev));
> +		return -EINVAL;
> +	}
> +
> +	if (pos >= 0) {
> +		dev_dbg(&cxlr->dev, "%s: expected auto position, not %d\n",
> +			dev_name(&cxled->cxld.dev), pos);
> +		return -EINVAL;
> +	}
> +
> +	if (p->nr_targets >= p->interleave_ways) {
> +		dev_err(&cxlr->dev, "%s: no more target slots available\n",
> +			dev_name(&cxled->cxld.dev));
> +		return -ENXIO;
> +	}
> +
> +	/*
> +	 * Temporarily record the endpoint decoder into the target array. Yes,
> +	 * this means that userspace can view devices in the wrong position
> +	 * before the region activates, and must be careful to understand when
> +	 * it might be racing region autodiscovery.
> +	 */
> +	pos = p->nr_targets;
> +	p->targets[pos] = cxled;
> +	cxled->pos = pos;
> +	p->nr_targets++;
> +
> +	return 0;
> +}
> +
> +static struct cxl_port *next_port(struct cxl_port *port)
> +{
> +	if (!port->parent_dport)
> +		return NULL;
> +	return port->parent_dport->port;
> +}
> +
> +static int decoder_match_range(struct device *dev, void *data)
> +{
> +	struct cxl_endpoint_decoder *cxled = data;
> +	struct cxl_switch_decoder *cxlsd;
> +
> +	if (!is_switch_decoder(dev))
> +		return 0;
> +
> +	cxlsd = to_cxl_switch_decoder(dev);
> +	return range_contains(&cxlsd->cxld.hpa_range, &cxled->cxld.hpa_range);
> +}
> +
> +static void find_positions(const struct cxl_switch_decoder *cxlsd,
> +			   const struct cxl_port *iter_a,
> +			   const struct cxl_port *iter_b, int *a_pos,
> +			   int *b_pos)
> +{
> +	int i;
> +
> +	for (i = 0, *a_pos = -1, *b_pos = -1; i < cxlsd->nr_targets; i++) {
> +		if (cxlsd->target[i] == iter_a->parent_dport)
> +			*a_pos = i;
> +		else if (cxlsd->target[i] == iter_b->parent_dport)
> +			*b_pos = i;
> +		if (*a_pos >= 0 && *b_pos >= 0)
> +			break;
> +	}
> +}
> +
> +static int cmp_decode_pos(const void *a, const void *b)
> +{
> +	struct cxl_endpoint_decoder *cxled_a = *(typeof(cxled_a) *)a;
> +	struct cxl_endpoint_decoder *cxled_b = *(typeof(cxled_b) *)b;
> +	struct cxl_memdev *cxlmd_a = cxled_to_memdev(cxled_a);
> +	struct cxl_memdev *cxlmd_b = cxled_to_memdev(cxled_b);
> +	struct cxl_port *port_a = cxled_to_port(cxled_a);
> +	struct cxl_port *port_b = cxled_to_port(cxled_b);
> +	struct cxl_port *iter_a, *iter_b, *port = NULL;
> +	struct cxl_switch_decoder *cxlsd;
> +	struct device *dev;
> +	int a_pos, b_pos;
> +	unsigned int seq;
> +
> +	/* Exit early if any prior sorting failed */
> +	if (cxled_a->pos < 0 || cxled_b->pos < 0)
> +		return 0;
> +
> +	/*
> +	 * Walk up the hierarchy to find a shared port, find the decoder that
> +	 * maps the range, compare the relative position of those dport
> +	 * mappings.
> +	 */
> +	for (iter_a = port_a; iter_a; iter_a = next_port(iter_a)) {
> +		struct cxl_port *next_a, *next_b;
> +
> +		next_a = next_port(iter_a);
> +		if (!next_a)
> +			break;
> +
> +		for (iter_b = port_b; iter_b; iter_b = next_port(iter_b)) {
> +			next_b = next_port(iter_b);
> +			if (next_a != next_b)
> +				continue;
> +			port = next_a;
> +			break;
> +		}
> +
> +		if (port)
> +			break;
> +	}
> +
> +	if (!port) {
> +		dev_err(cxlmd_a->dev.parent,
> +			"failed to find shared port with %s\n",
> +			dev_name(cxlmd_b->dev.parent));
> +		goto err;
> +	}
> +
> +	dev = device_find_child(&port->dev, cxled_a, decoder_match_range);
> +	if (!dev) {
> +		struct range *range = &cxled_a->cxld.hpa_range;
> +
> +		dev_err(port->uport,
> +			"failed to find decoder that maps %#llx-%#llx\n",
> +			range->start, range->end);
> +		goto err;
> +	}
> +
> +	cxlsd = to_cxl_switch_decoder(dev);
> +	do {
> +		seq = read_seqbegin(&cxlsd->target_lock);
> +		find_positions(cxlsd, iter_a, iter_b, &a_pos, &b_pos);
> +	} while (read_seqretry(&cxlsd->target_lock, seq));
> +
> +	put_device(dev);
> +
> +	if (a_pos < 0 || b_pos < 0) {
> +		dev_err(port->uport,
> +			"failed to find shared decoder for %s and %s\n",
> +			dev_name(cxlmd_a->dev.parent),
> +			dev_name(cxlmd_b->dev.parent));
> +		goto err;
> +	}
> +
> +	dev_dbg(port->uport, "%s comes %s %s\n", dev_name(cxlmd_a->dev.parent),
> +		a_pos - b_pos < 0 ? "before" : "after",
> +		dev_name(cxlmd_b->dev.parent));
> +
> +	return a_pos - b_pos;
> +err:
> +	cxled_a->pos = -1;
> +	return 0;
> +}
> +
> +static int cxl_region_sort_targets(struct cxl_region *cxlr)
> +{
> +	struct cxl_region_params *p = &cxlr->params;
> +	int i, rc = 0;
> +
> +	sort(p->targets, p->nr_targets, sizeof(p->targets[0]), cmp_decode_pos,
> +	     NULL);
> +
> +	for (i = 0; i < p->nr_targets; i++) {
> +		struct cxl_endpoint_decoder *cxled = p->targets[i];
> +
> +		if (cxled->pos < 0)
> +			rc = -ENXIO;
> +		cxled->pos = i;
> +	}
> +
> +	dev_dbg(&cxlr->dev, "region sort %s\n", rc ? "failed" : "successful");
> +	return rc;
> +}
> +
>  static int cxl_region_attach(struct cxl_region *cxlr,
>  			     struct cxl_endpoint_decoder *cxled, int pos)
>  {
> @@ -1354,6 +1579,50 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  		return -EINVAL;
>  	}
>  
> +	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
> +		int i;
> +
> +		rc = cxl_region_attach_auto(cxlr, cxled, pos);
> +		if (rc)
> +			return rc;
> +
> +		/* await more targets to arrive... */
> +		if (p->nr_targets < p->interleave_ways)
> +			return 0;
> +
> +		/*
> +		 * All targets are here, which implies all PCI enumeration that
> +		 * affects this region has been completed. Walk the topology to
> +		 * sort the devices into their relative region decode position.
> +		 */
> +		rc = cxl_region_sort_targets(cxlr);
> +		if (rc)
> +			return rc;
> +
> +		for (i = 0; i < p->nr_targets; i++) {
> +			cxled = p->targets[i];
> +			ep_port = cxled_to_port(cxled);
> +			dport = cxl_find_dport_by_dev(root_port,
> +						      ep_port->host_bridge);
> +			rc = cxl_region_attach_position(cxlr, cxlrd, cxled,
> +							dport, i);
> +			if (rc)
> +				return rc;
> +		}
> +
> +		rc = cxl_region_setup_targets(cxlr);
> +		if (rc)
> +			return rc;
> +
> +		/*
> +		 * If target setup succeeds in the autodiscovery case
> +		 * then the region is already committed.
> +		 */
> +		p->state = CXL_CONFIG_COMMIT;
> +
> +		return 0;
> +	}
> +
>  	rc = cxl_region_validate_position(cxlr, cxled, pos);
>  	if (rc)
>  		return rc;
> @@ -2087,6 +2356,193 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
>  	return rc;
>  }
>  
> +static int match_decoder_by_range(struct device *dev, void *data)
> +{
> +	struct range *r1, *r2 = data;
> +	struct cxl_root_decoder *cxlrd;
> +
> +	if (!is_root_decoder(dev))
> +		return 0;
> +
> +	cxlrd = to_cxl_root_decoder(dev);
> +	r1 = &cxlrd->cxlsd.cxld.hpa_range;
> +	return range_contains(r1, r2);
> +}
> +
> +static int match_region_by_range(struct device *dev, void *data)
> +{
> +	struct cxl_region_params *p;
> +	struct cxl_region *cxlr;
> +	struct range *r = data;
> +	int rc = 0;
> +
> +	if (!is_cxl_region(dev))
> +		return 0;
> +
> +	cxlr = to_cxl_region(dev);
> +	p = &cxlr->params;
> +
> +	down_read(&cxl_region_rwsem);
> +	if (p->res && p->res->start == r->start && p->res->end == r->end)
> +		rc = 1;
> +	up_read(&cxl_region_rwsem);
> +
> +	return rc;
> +}
> +
> +/* Establish an empty region covering the given HPA range */
> +static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
> +					   struct cxl_endpoint_decoder *cxled)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct cxl_port *port = cxlrd_to_port(cxlrd);
> +	struct range *hpa = &cxled->cxld.hpa_range;
> +	struct cxl_region_params *p;
> +	struct cxl_region *cxlr;
> +	struct resource *res;
> +	int rc;
> +
> +	do {
> +		cxlr = __create_region(cxlrd, cxled->mode,
> +				       atomic_read(&cxlrd->region_id));
> +	} while (IS_ERR(cxlr) && PTR_ERR(cxlr) == -EBUSY);
> +
> +	if (IS_ERR(cxlr)) {
> +		dev_err(cxlmd->dev.parent,
> +			"%s:%s: %s failed assign region: %ld\n",
> +			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> +			__func__, PTR_ERR(cxlr));
> +		return cxlr;
> +	}
> +
> +	down_write(&cxl_region_rwsem);
> +	p = &cxlr->params;
> +	if (p->state >= CXL_CONFIG_INTERLEAVE_ACTIVE) {
> +		dev_err(cxlmd->dev.parent,
> +			"%s:%s: %s autodiscovery interrupted\n",
> +			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> +			__func__);
> +		rc = -EBUSY;
> +		goto err;
> +	}
> +
> +	set_bit(CXL_REGION_F_AUTO, &cxlr->flags);
> +
> +	res = kmalloc(sizeof(*res), GFP_KERNEL);
> +	if (!res) {
> +		rc = -ENOMEM;
> +		goto err;
> +	}
> +
> +	*res = DEFINE_RES_MEM_NAMED(hpa->start, range_len(hpa),
> +				    dev_name(&cxlr->dev));
> +	rc = insert_resource(cxlrd->res, res);
> +	if (rc) {
> +		/*
> +		 * Platform-firmware may not have split resources like "System
> +		 * RAM" on CXL window boundaries see cxl_region_iomem_release()
> +		 */
> +		dev_warn(cxlmd->dev.parent,
> +			 "%s:%s: %s %s cannot insert resource\n",
> +			 dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> +			 __func__, dev_name(&cxlr->dev));
> +	}
> +
> +	p->res = res;
> +	p->interleave_ways = cxled->cxld.interleave_ways;
> +	p->interleave_granularity = cxled->cxld.interleave_granularity;
> +	p->state = CXL_CONFIG_INTERLEAVE_ACTIVE;
> +
> +	rc = sysfs_update_group(&cxlr->dev.kobj, get_cxl_region_target_group());
> +	if (rc)
> +		goto err;
> +
> +	dev_dbg(cxlmd->dev.parent, "%s:%s: %s %s res: %pr iw: %d ig: %d\n",
> +		dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev), __func__,
> +		dev_name(&cxlr->dev), p->res, p->interleave_ways,
> +		p->interleave_granularity);
> +
> +	/* ...to match put_device() in cxl_add_to_region() */
> +	get_device(&cxlr->dev);
> +	up_write(&cxl_region_rwsem);
> +
> +	return cxlr;
> +
> +err:
> +	up_write(&cxl_region_rwsem);
> +	devm_release_action(port->uport, unregister_region, cxlr);
> +	return ERR_PTR(rc);
> +}
> +
> +int cxl_add_to_region(struct cxl_port *root, struct cxl_endpoint_decoder *cxled)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct range *hpa = &cxled->cxld.hpa_range;
> +	struct cxl_decoder *cxld = &cxled->cxld;
> +	struct cxl_root_decoder *cxlrd;
> +	struct cxl_region_params *p;
> +	struct cxl_region *cxlr;
> +	bool attach = false;
> +	struct device *dev;
> +	int rc;
> +
> +	dev = device_find_child(&root->dev, &cxld->hpa_range,
> +				match_decoder_by_range);
> +	if (!dev) {
> +		dev_err(cxlmd->dev.parent,
> +			"%s:%s no CXL window for range %#llx:%#llx\n",
> +			dev_name(&cxlmd->dev), dev_name(&cxld->dev),
> +			cxld->hpa_range.start, cxld->hpa_range.end);
> +		return -ENXIO;
> +	}
> +
> +	cxlrd = to_cxl_root_decoder(dev);
> +
> +	/*
> +	 * Ensure that if multiple threads race to construct_region() for @hpa
> +	 * one does the construction and the others add to that.
> +	 */
> +	mutex_lock(&cxlrd->range_lock);
> +	dev = device_find_child(&cxlrd->cxlsd.cxld.dev, hpa,
> +				match_region_by_range);
> +	if (!dev)
> +		cxlr = construct_region(cxlrd, cxled);
> +	else
> +		cxlr = to_cxl_region(dev);
> +	mutex_unlock(&cxlrd->range_lock);
> +
> +	if (IS_ERR(cxlr)) {
> +		rc = PTR_ERR(cxlr);
> +		goto out;
> +	}
> +
> +	attach_target(cxlr, cxled, -1, TASK_UNINTERRUPTIBLE);
> +
> +	down_read(&cxl_region_rwsem);
> +	p = &cxlr->params;
> +	attach = p->state == CXL_CONFIG_COMMIT;
> +	up_read(&cxl_region_rwsem);
> +
> +	if (attach) {
> +		int rc = device_attach(&cxlr->dev);
> +
> +		/*
> +		 * If device_attach() fails the range may still be active via
> +		 * the platform-firmware memory map, otherwise the driver for
> +		 * regions is local to this file, so driver matching can't fail.
> +		 */
> +		if (rc < 0)
> +			dev_err(&cxlr->dev, "failed to enable, range: %pr\n",
> +				p->res);
> +	}
> +
> +	put_device(&cxlr->dev);
> +out:
> +	put_device(&cxlrd->cxlsd.cxld.dev);
> +	return rc;

rc can be returned uninitialized here, as mentioned by Arnd Bergmann in
https://lore.kernel.org/linux-cxl/20230213101220.3821689-1-arnd@kernel.org/T/#u

> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_add_to_region, CXL);
> +
>  static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
>  {
>  	if (!test_bit(CXL_REGION_F_INCOHERENT, &cxlr->flags))
> @@ -2111,6 +2567,15 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
>  	return 0;
>  }
>  
> +static int is_system_ram(struct resource *res, void *arg)
> +{
> +	struct cxl_region *cxlr = arg;
> +	struct cxl_region_params *p = &cxlr->params;
> +
> +	dev_dbg(&cxlr->dev, "%pr has System RAM: %pr\n", p->res, res);
> +	return 1;
> +}
> +
>  static int cxl_region_probe(struct device *dev)
>  {
>  	struct cxl_region *cxlr = to_cxl_region(dev);
> @@ -2144,6 +2609,18 @@ static int cxl_region_probe(struct device *dev)
>  	switch (cxlr->mode) {
>  	case CXL_DECODER_PMEM:
>  		return devm_cxl_add_pmem_region(cxlr);
> +	case CXL_DECODER_RAM:
> +		/*
> +		 * The region can not be manged by CXL if any portion of
> +		 * it is already online as 'System RAM'
> +		 */
> +		if (walk_iomem_res_desc(IORES_DESC_NONE,
> +					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> +					p->res->start, p->res->end, cxlr,
> +					is_system_ram) > 0)
> +			return 0;
> +		dev_dbg(dev, "TODO: hookup devdax\n");
> +		return 0;
>  	default:
>  		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
>  			cxlr->mode);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index ca76879af1de..c8ee4bb8cce6 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -261,6 +261,8 @@ resource_size_t cxl_rcrb_to_component(struct device *dev,
>   * cxl_decoder flags that define the type of memory / devices this
>   * decoder supports as well as configuration lock status See "CXL 2.0
>   * 8.2.5.12.7 CXL HDM Decoder 0 Control Register" for details.
> + * Additionally indicate whether decoder settings were autodetected,
> > + * or user customized.
>   */
>  #define CXL_DECODER_F_RAM   BIT(0)
>  #define CXL_DECODER_F_PMEM  BIT(1)
> @@ -334,12 +336,22 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  	return "mixed";
>  }
>  
> +/*
> + * Track whether this decoder is reserved for region autodiscovery, or
> + * free for userspace provisioning.
> + */
> +enum cxl_decoder_state {
> +	CXL_DECODER_STATE_MANUAL,
> +	CXL_DECODER_STATE_AUTO,
> +};
> +
>  /**
>   * struct cxl_endpoint_decoder - Endpoint  / SPA to DPA decoder
>   * @cxld: base cxl_decoder_object
>   * @dpa_res: actively claimed DPA span of this decoder
>   * @skip: offset into @dpa_res where @cxld.hpa_range maps
>   * @mode: which memory type / access-mode-partition this decoder targets
> + * @state: autodiscovery state
>   * @pos: interleave position in @cxld.region
>   */
>  struct cxl_endpoint_decoder {
> @@ -347,6 +359,7 @@ struct cxl_endpoint_decoder {
>  	struct resource *dpa_res;
>  	resource_size_t skip;
>  	enum cxl_decoder_mode mode;
> +	enum cxl_decoder_state state;
>  	int pos;
>  };
>  
> @@ -380,6 +393,7 @@ typedef struct cxl_dport *(*cxl_calc_hb_fn)(struct cxl_root_decoder *cxlrd,
>   * @region_id: region id for next region provisioning event
>   * @calc_hb: which host bridge covers the n'th position by granularity
>   * @platform_data: platform specific configuration data
> + * @range_lock: sync region autodiscovery by address range
>   * @cxlsd: base cxl switch decoder
>   */
>  struct cxl_root_decoder {
> @@ -387,6 +401,7 @@ struct cxl_root_decoder {
>  	atomic_t region_id;
>  	cxl_calc_hb_fn calc_hb;
>  	void *platform_data;
> +	struct mutex range_lock;
>  	struct cxl_switch_decoder cxlsd;
>  };
>  
> @@ -436,6 +451,13 @@ struct cxl_region_params {
>   */
>  #define CXL_REGION_F_INCOHERENT 0
>  
> +/*
> + * Indicate whether this region has been assembled by autodetection or
> + * userspace assembly. Prevent endpoint decoders outside of automatic
> + * detection from being added to the region.
> + */
> +#define CXL_REGION_F_AUTO 1
> +
>  /**
>   * struct cxl_region - CXL region
>   * @dev: This region's device
> @@ -699,6 +721,8 @@ struct cxl_nvdimm_bridge *cxl_find_nvdimm_bridge(struct device *dev);
>  #ifdef CONFIG_CXL_REGION
>  bool is_cxl_pmem_region(struct device *dev);
>  struct cxl_pmem_region *to_cxl_pmem_region(struct device *dev);
> +int cxl_add_to_region(struct cxl_port *root,
> +		      struct cxl_endpoint_decoder *cxled);
>  #else
>  static inline bool is_cxl_pmem_region(struct device *dev)
>  {
> @@ -708,6 +732,11 @@ static inline struct cxl_pmem_region *to_cxl_pmem_region(struct device *dev)
>  {
>  	return NULL;
>  }
> +static inline int cxl_add_to_region(struct cxl_port *root,
> +				    struct cxl_endpoint_decoder *cxled)
> +{
> +	return 0;
> +}
>  #endif
>  
>  /*
> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
> index a8d46a67b45e..d88518836c2d 100644
> --- a/drivers/cxl/port.c
> +++ b/drivers/cxl/port.c
> @@ -30,6 +30,34 @@ static void schedule_detach(void *cxlmd)
>  	schedule_cxl_memdev_detach(cxlmd);
>  }
>  
> +static int discover_region(struct device *dev, void *root)
> +{
> +	struct cxl_endpoint_decoder *cxled;
> +	int rc;
> +
> +	if (!is_endpoint_decoder(dev))
> +		return 0;
> +
> +	cxled = to_cxl_endpoint_decoder(dev);
> +	if ((cxled->cxld.flags & CXL_DECODER_F_ENABLE) == 0)
> +		return 0;
> +
> +	if (cxled->state != CXL_DECODER_STATE_AUTO)
> +		return 0;
> +
> +	/*
> +	 * Region enumeration is opportunistic, if this add-event fails,
> +	 * continue to the next endpoint decoder.
> +	 */
> +	rc = cxl_add_to_region(root, cxled);
> +	if (rc)
> +		dev_dbg(dev, "failed to add to region: %#llx-%#llx\n",
> +			cxled->cxld.hpa_range.start, cxled->cxld.hpa_range.end);
> +
> +	return 0;
> +}
> +
> +
>  static int cxl_switch_port_probe(struct cxl_port *port)
>  {
>  	struct cxl_hdm *cxlhdm;
> @@ -54,6 +82,7 @@ static int cxl_endpoint_port_probe(struct cxl_port *port)
>  	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct cxl_hdm *cxlhdm;
> +	struct cxl_port *root;
>  	int rc;
>  
>  	cxlhdm = devm_cxl_setup_hdm(port);
> @@ -78,7 +107,24 @@ static int cxl_endpoint_port_probe(struct cxl_port *port)
>  		return rc;
>  	}
>  
> -	return devm_cxl_enumerate_decoders(cxlhdm);
> +	rc = devm_cxl_enumerate_decoders(cxlhdm);
> +	if (rc)
> +		return rc;
> +
> +	/*
> +	 * This can't fail in practice as CXL root exit unregisters all
> +	 * descendant ports and that in turn synchronizes with cxl_port_probe()
> +	 */
> +	root = find_cxl_root(&cxlmd->dev);
> +
> +	/*
> +	 * Now that all endpoint decoders are successfully enumerated, try to
> +	 * assemble regions from committed decoders
> +	 */
> +	device_for_each_child(&port->dev, root, discover_region);
> +	put_device(&root->dev);
> +
> +	return 0;
>  }
>  
>  static int cxl_port_probe(struct device *dev)
> 
> 


* Re: [PATCH v2 13/20] cxl/region: Add region autodiscovery
  2023-02-10 21:35     ` Dan Williams
@ 2023-02-14 13:23       ` Jonathan Cameron
  2023-02-14 16:43         ` Dan Williams
  0 siblings, 1 reply; 65+ messages in thread
From: Jonathan Cameron @ 2023-02-14 13:23 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Fan Ni, vishal.l.verma, dave.hansen, linux-mm, linux-acpi


> > 
> > 		/*
> > 		 * If device_attach() fails the range may still be active via
> > 		 * the platform-firmware memory map, otherwise the driver for
> > 		 * regions is local to this file, so driver matching can't fail
> > +                * and hence device_attach() cannot return 1.
> > 
> > //very much not obvious otherwise to anyone who isn't far too familiar with device_attach()  
> 
> Hence the comment? Not sure what else can be said here about why
> device_attach() < 0 is a sufficient check.

I'd just add the bit I wrote above, which calls out that the condition you
are describing is indicated by a return of 1.





* Re: [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default
  2023-02-13 18:22 ` Gregory Price
  2023-02-13 18:31   ` Gregory Price
@ 2023-02-14 13:35   ` Jonathan Cameron
  1 sibling, 0 replies; 65+ messages in thread
From: Jonathan Cameron @ 2023-02-14 13:35 UTC (permalink / raw)
  To: Gregory Price
  Cc: Dan Williams, linux-cxl, Ira Weiny, David Hildenbrand,
	Dave Jiang, Davidlohr Bueso, Kees Cook, Vishal Verma,
	Dave Hansen, Michal Hocko, Fan Ni, linux-mm, linux-acpi

On Mon, 13 Feb 2023 13:22:17 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> On Fri, Feb 10, 2023 at 01:05:21AM -0800, Dan Williams wrote:
> > Changes since v1: [1]
> > [... snip ...]  
> 
> For a single attached device - I have been finding general success.
> 
> For multiple attached devices, I'm seeing some strange behaviors.
> 
> With multiple root ports, I got some stack traces before deciding
> I needed multiple CFMWS to do this "correctly", and just attached
> multiple pxb-cxl to the root bus.

Hmm. I should get on with doing multiple HDM decoders in the host
bridge at least. (also useful to support in switch and EP obviously)

> 
> Obviously this configuration is "not great", and some form of
> "impossible in the real world", but it's worth examining I think.

Seems a reasonable simplification of what you might see on a 3-socket
system: one host bridge and one CFMWS per socket.

> 
> /opt/qemu-cxl/bin/qemu-system-x86_64 \
> -drive file=/data/qemu/images/pool/pool1.qcow2,format=qcow2,index=0,media=disk,id=hd \
> -m 4G,slots=4,maxmem=16G \
> -smp 4 \
> -machine type=q35,accel=kvm,cxl=on \
> -enable-kvm \
> -nographic \
> -netdev bridge,id=hn0,br=virbr0 \
> -device virtio-net-pci,netdev=hn0,id=nic1,mac=52:54:00:12:34:56 \
> -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> -device pxb-cxl,id=cxl.1,bus=pcie.0,bus_nr=191 \
> -device pxb-cxl,id=cxl.2,bus=pcie.0,bus_nr=230 \
> -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
> -device cxl-rp,id=rp1,bus=cxl.1,chassis=0,port=1,slot=1 \
> -device cxl-rp,id=rp2,bus=cxl.2,chassis=0,port=2,slot=2 \
> -object memory-backend-ram,id=mem0,size=4G \
> -object memory-backend-ram,id=mem1,size=4G \
> -object memory-backend-ram,id=mem2,size=4G \
> -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
> -device cxl-type3,bus=rp1,volatile-memdev=mem1,id=cxl-mem1 \
> -device cxl-type3,bus=rp2,volatile-memdev=mem2,id=cxl-mem2 \
> -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.size=4G,cxl-fmw.2.targets.0=cxl.2,cxl-fmw.2.size=4G
> 
> The goal here should be to have 3 different memory expanders with their
> regions created and mapped to 3 different NUMA nodes.
> 
> One piece i'm not confident about is my CFMW's
> (listed more readably:)
> 
> -M cxl-fmw.0.targets.0=cxl.0,
>    cxl-fmw.0.size=4G,
>    cxl-fmw.1.targets.0=cxl.1,
>    cxl-fmw.1.size=4G,
>    cxl-fmw.2.targets.0=cxl.2,
>    cxl-fmw.2.size=4G
> 
> should targets in this case be targets.0/1/2, or all of them targets.0?

targets.0 for all of them.  Fairly sure it will moan at you if you don't
do that, as some of the targets array for each CFMWS will be unspecified.
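To make the distinction concrete, here is a sketch of both shapes (a sketch only; sizes and ids are illustrative, following the cxl-fmw syntax already used in the command line above):

```shell
# One CFMWS per host bridge, each window with a single target: every
# window uses targets.0, as in the command line above.
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,\
cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.size=4G

# By contrast, a single 2-way interleaved window spanning two host
# bridges enumerates targets.0 and targets.1 on the same window:
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.targets.1=cxl.1,cxl-fmw.0.size=8G
```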

> 
> Either way, i would expect 3 root decoders, and 3 memory devices
> 
> [root@fedora ~]# ls /sys/bus/cxl/devices/
> decoder0.0  decoder1.0  decoder4.0  endpoint4  mem0  nvdimm-bridge0  port3
> decoder0.1  decoder2.0  decoder5.0  endpoint5  mem1  port1           root0
> decoder0.2  decoder3.0  decoder6.0  endpoint6  mem2  port2
> 
> I see the devices I expect, but I would expect the following:
> (cxl list output at the bottom)
> 
> decoder0.0 -> mem0
> decoder0.1 -> mem1
> decoder0.2 -> mem2

I don't think there is any enforcement of ordering across the various
elements.  It just depends on the exact ordering of the racing probe calls.


> 
> root0 -> [decoder0.0, 0.1, 0.2]
> root0 -> [port1, 2, 3]
> port1 -> mem0
> port2 -> mem1
> port3 -> mem2
> 
> Really i see these decoders and device mappings setup:
> port1 -> mem2
> port2 -> mem1
> port3 -> mem0
> 
> Therefore I should expect
> decoder0.0 -> mem2
> decoder0.1 -> mem1
> decoder0.2 -> mem0
> 
> This bears out: attempting to use any other combination produces ndctl errors.
> 
> So the numbers are backwards; maybe that's relevant, maybe it's not.
> The devices are otherwise completely the same, so for the most part
> everything might "just work".  Let's keep testing.
> 
> 
> [root@fedora ~]# cat create_region.sh
> ./ndctl/build/cxl/cxl \
>   create-region \
>   -m \
>   -t ram \
>   -d decoder0.$1 \
>   -w 1 \
>   -g 4096 \
>   mem$2
> 
> [root@fedora ~]# ./create_region.sh 2 0
> [   34.424931] cxl_region region2: Bypassing cpu_cache_invalidate_memregion() for testing!
> {
>   "region":"region2",
>   "resource":"0x790000000",
>   "size":"4.00 GiB (4.29 GB)",
>   "type":"ram",
>   "interleave_ways":1,
>   "interleave_granularity":4096,
>   "decode_state":"commit",
>   "mappings":[
>     {
>       "position":0,
>       "memdev":"mem0",
>       "decoder":"decoder4.0"
>     }
>   ]
> }
> cxl region: cmd_create_region: created 1 region
> 
> [   34.486668] Fallback order for Node 3: 3 0
> [   34.487568] Built 1 zonelists, mobility grouping on.  Total pages: 979669
> [   34.488206] Policy zone: Normal
> [   34.501938] Fallback order for Node 0: 0 3
> [   34.502405] Fallback order for Node 1: 1 3 0
> [   34.502832] Fallback order for Node 2: 2 3 0
> [   34.503251] Fallback order for Node 3: 3 0
> [   34.503649] Built 2 zonelists, mobility grouping on.  Total pages: 1012437
> [   34.504296] Policy zone: Normal
> 
> 
> 
> Cool, looks good.  Let's try mem1
> 
> 
> 
> [root@fedora ~]# ./create_region.sh 1 1
> 
> [   98.787029] Fallback order for Node 2: 2 3 0
> [   98.787630] Built 2 zonelists, mobility grouping on.  Total pages: 2019798
> [   98.788483] Policy zone: Normal
> [  128.301580] Fallback order for Node 0: 0 2 3
> [  128.302084] Fallback order for Node 1: 1 3 2 0
> [  128.302547] Fallback order for Node 2: 2 3 0
> [  128.303009] Fallback order for Node 3: 3 2 0
> [  128.303436] Built 3 zonelists, mobility grouping on.  Total pages: 2052566
> [  128.304071] Policy zone: Normal
> [ .... wait 20-30 more seconds .... ]
> {
>   "region":"region1",
>   "resource":"0x690000000",
>   "size":"4.00 GiB (4.29 GB)",
>   "type":"ram",
>   "interleave_ways":1,
>   "interleave_granularity":4096,
>   "decode_state":"commit",
>   "mappings":[
>     {
>       "position":0,
>       "memdev":"mem1",
>       "decoder":"decoder5.0"
>     }
>   ]
> }
> cxl region: cmd_create_region: created 1 region
> 
> 
> This takes a LONG time to complete. Maybe that's expected, I don't know.
> 
> 
> Let's online mem2.
> 
> 
> [root@fedora ~]# ./create_region.sh 0 2
> extra data[7]: 0x0000000000000000
> emulation failure
> RAX=0000000000000000 RBX=ffff8a6f90006800 RCX=0000000000100001 RDX=0000000080100010
> RSI=ffffca291a400000 RDI=0000000040000000 RBP=ffff9684c0017a60 RSP=ffff9684c0017a30
> R8 =ffff8a6f90006800 R9 =0000000000100001 R10=0000000000000000 R11=0000000000000001
> R12=ffffca291a400000 R13=0000000000100001 R14=0000000000000000 R15=0000000080100010
> RIP=ffffffffb71c5831 RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0000 0000000000000000 ffffffff 00c00000
> CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
> SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> DS =0000 0000000000000000 ffffffff 00c00000
> FS =0000 00007fd03025db40 ffffffff 00c00000
> GS =0000 ffff8a6a7bd00000 ffffffff 00c00000
> LDT=0000 0000000000000000 ffffffff 00c00000
> TR =0040 fffffe46e6e25000 00004087 00008b00 DPL=0 TSS64-busy
> GDT=     fffffe46e6e23000 0000007f
> IDT=     fffffe0000000000 00000fff
> CR0=80050033 CR2=00005604371ab0c8 CR3=0000000102ece000 CR4=000006e0
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> DR6=00000000fffe0ff0 DR7=0000000000000400
> EFER=0000000000000d01
> Code=83 ec 08 81 e7 00 00 00 40 74 2c 48 89 d0 48 89 ca 4c 89 c9 <f0> 48 0f c7 4e 20 0f 84 85 00 00 00 f3 90 48 83 c4 08 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d
> 
> 
> Well that seems bad lol.  I'm not sure what to make of this since my
> scrollback cuts off and the machine completely locks up.  I have never
> seen "emulation failure" before.
> 
> 
> Reboot and attempt to online that region by itself:
> 
> [root@fedora ~]# ./create_region.sh 0 2
> [   21.292598] cxl_region region0: Bypassing cpu_cache_invalidate_memregion() for testing!
> [   21.341753] Fallback order for Node 1: 1 0
> [   21.342462] Built 1 zonelists, mobility grouping on.  Total pages: 979670
> [   21.343085] Policy zone: Normal
> [   21.355166] Fallback order for Node 0: 0 1
> [   21.355613] Fallback order for Node 1: 1 0
> [   21.356009] Fallback order for Node 2: 2 1 0
> [   21.356441] Fallback order for Node 3: 3 1 0
> [   21.356874] Built 2 zonelists, mobility grouping on.  Total pages: 1012438
> [   21.357501] Policy zone: Normal
> {
>   "region":"region0",
>   "resource":"0x590000000",
>   "size":"4.00 GiB (4.29 GB)",
>   "type":"ram",
>   "interleave_ways":1,
>   "interleave_granularity":4096,
>   "decode_state":"commit",
>   "mappings":[
>     {
>       "position":0,
>       "memdev":"mem2",
>       "decoder":"decoder6.0"
>     }
>   ]
> }
> cxl region: cmd_create_region: created 1 region
> 
> 
> That works fine, and works just like onlining the first region (2,0).
> 
> This suggests the issue is actually with creating multiple regions in
> this topology.
> 
> 
> 
> Bonus round: booting with memhp_default_state=offline
> 
> All regions successfully get created without error.
> 
> 
> 
> I have a few guesses, but haven't dived in yet:
> 
> 1) There's a QEMU error in the way this configuration routes to various
>    components of the CXL structure, and/or multiple pxb-cxl's do bad
>    things and I should feel bad for doing this configuration.

Nothing looks like it should be broken in your command line etc.
There may well be a bug in qemu though.
> 
> 2) There's something going on when creating the topology, leading to the
>    inverted [decoder0.2, mem0], [decoder0.1, mem1], [decoder0.0, mem2]
>    mappings, leading to inconsistent device control.  Or I'm making a
>    bad assumption and this is expected behavior.

Expected I think.

> 
> 3) The memory block creation / online code is getting hung up somewhere.
>    Why does the second region take forever to online?

Maybe try some smaller devices, just in case it's running out of memory somewhere.

> 
> 4) Something else completely.
> 
> 
> My gut at the moment tells me my configuration is bad, but I have no
> idea why.  Anyone with an idea on what I should look for, let me know.
> 
> 
> cxl list output for completeness:
> 
> [root@fedora ~]# ./ndctl/build/cxl/cxl list -vvvv
> [
>   {
>     "bus":"root0",
>     "provider":"ACPI.CXL",
>     "nr_dports":3,
>     "dports":[
>       {
>         "dport":"pci0000:e6",
>         "alias":"ACPI0016:00",
>         "id":230
>       },
>       {
>         "dport":"pci0000:bf",
>         "alias":"ACPI0016:01",
>         "id":191
>       },
>       {
>         "dport":"pci0000:34",
>         "alias":"ACPI0016:02",
>         "id":52
>       }
>     ],
>     "ports:root0":[
>       {
>         "port":"port1",
>         "host":"pci0000:e6",
>         "depth":1,
>         "nr_dports":1,
>         "dports":[
>           {
>             "dport":"0000:e6:00.0",
>             "id":2
>           }
>         ],
>         "endpoints:port1":[
>           {
>             "endpoint":"endpoint5",
>             "host":"mem1",
>             "depth":2,
>             "memdev":{
>               "memdev":"mem1",
>               "ram_size":4294967296,
>               "serial":0,
>               "host":"0000:e7:00.0",
>               "partition_info":{
>                 "total_size":4294967296,
>                 "volatile_only_size":4294967296,
>                 "persistent_only_size":0,
>                 "partition_alignment_size":0
>               }
>             },
>             "decoders:endpoint5":[
>               {
>                 "decoder":"decoder5.0",
>                 "interleave_ways":1,
>                 "state":"disabled"
>               }
>             ]
>           }
>         ],
>         "decoders:port1":[
>           {
>             "decoder":"decoder1.0",
>             "interleave_ways":1,
>             "state":"disabled",
>             "nr_targets":1,
>             "targets":[
>               {
>                 "target":"0000:e6:00.0",
>                 "position":0,
>                 "id":2
>               }
>             ]
>           }
>         ]
>       },
>       {
>         "port":"port3",
>         "host":"pci0000:34",
>         "depth":1,
>         "nr_dports":1,
>         "dports":[
>           {
>             "dport":"0000:34:00.0",
>             "id":0
>           }
>         ],
>         "endpoints:port3":[
>           {
>             "endpoint":"endpoint4",
>             "host":"mem0",
>             "depth":2,
>             "memdev":{
>               "memdev":"mem0",
>               "ram_size":4294967296,
>               "serial":0,
>               "host":"0000:35:00.0",
>               "partition_info":{
>                 "total_size":4294967296,
>                 "volatile_only_size":4294967296,
>                 "persistent_only_size":0,
>                 "partition_alignment_size":0
>               }
>             },
>             "decoders:endpoint4":[
>               {
>                 "decoder":"decoder4.0",
>                 "interleave_ways":1,
>                 "state":"disabled"
>               }
>             ]
>           }
>         ],
>         "decoders:port3":[
>           {
>             "decoder":"decoder3.0",
>             "interleave_ways":1,
>             "state":"disabled",
>             "nr_targets":1,
>             "targets":[
>               {
>                 "target":"0000:34:00.0",
>                 "position":0,
>                 "id":0
>               }
>             ]
>           }
>         ]
>       },
>       {
>         "port":"port2",
>         "host":"pci0000:bf",
>         "depth":1,
>         "nr_dports":1,
>         "dports":[
>           {
>             "dport":"0000:bf:00.0",
>             "id":1
>           }
>         ],
>         "endpoints:port2":[
>           {
>             "endpoint":"endpoint6",
>             "host":"mem2",
>             "depth":2,
>             "memdev":{
>               "memdev":"mem2",
>               "ram_size":4294967296,
>               "serial":0,
>               "host":"0000:c0:00.0",
>               "partition_info":{
>                 "total_size":4294967296,
>                 "volatile_only_size":4294967296,
>                 "persistent_only_size":0,
>                 "partition_alignment_size":0
>               }
>             },
>             "decoders:endpoint6":[
>               {
>                 "decoder":"decoder6.0",
>                 "interleave_ways":1,
>                 "state":"disabled"
>               }
>             ]
>           }
>         ],
>         "decoders:port2":[
>           {
>             "decoder":"decoder2.0",
>             "interleave_ways":1,
>             "state":"disabled",
>             "nr_targets":1,
>             "targets":[
>               {
>                 "target":"0000:bf:00.0",
>                 "position":0,
>                 "id":1
>               }
>             ]
>           }
>         ]
>       }
>     ],
>     "decoders:root0":[
>       {
>         "decoder":"decoder0.0",
>         "resource":23890755584,
>         "size":4294967296,
>         "interleave_ways":1,
>         "max_available_extent":4294967296,
>         "pmem_capable":true,
>         "volatile_capable":true,
>         "accelmem_capable":true,
>         "nr_targets":1,
>         "targets":[
>           {
>             "target":"pci0000:34",
>             "alias":"ACPI0016:02",
>             "position":0,
>             "id":52
>           }
>         ]
>       },
>       {
>         "decoder":"decoder0.1",
>         "resource":28185722880,
>         "size":4294967296,
>         "interleave_ways":1,
>         "max_available_extent":4294967296,
>         "pmem_capable":true,
>         "volatile_capable":true,
>         "accelmem_capable":true,
>         "nr_targets":1,
>         "targets":[
>           {
>             "target":"pci0000:bf",
>             "alias":"ACPI0016:01",
>             "position":0,
>             "id":191
>           }
>         ]
>       },
>       {
>         "decoder":"decoder0.2",
>         "resource":32480690176,
>         "size":4294967296,
>         "interleave_ways":1,
>         "max_available_extent":4294967296,
>         "pmem_capable":true,
>         "volatile_capable":true,
>         "accelmem_capable":true,
>         "nr_targets":1,
>         "targets":[
>           {
>             "target":"pci0000:e6",
>             "alias":"ACPI0016:00",
>             "position":0,
>             "id":230
>           }
>         ]
>       }
>     ]
>   }
> ]



* Re: [PATCH v2 13/20] cxl/region: Add region autodiscovery
  2023-02-14 13:23       ` Jonathan Cameron
@ 2023-02-14 16:43         ` Dan Williams
  0 siblings, 0 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-14 16:43 UTC (permalink / raw)
  To: Jonathan Cameron, Dan Williams
  Cc: linux-cxl, Fan Ni, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

Jonathan Cameron wrote:
> 
> > > 
> > > 		/*
> > > 		 * If device_attach() fails the range may still be active via
> > > 		 * the platform-firmware memory map, otherwise the driver for
> > > 		 * regions is local to this file, so driver matching can't fail
> > > +                * and hence device_attach() cannot return 1.
> > > 
> > > //very much not obvious otherwise to anyone who isn't far too familiar with device_attach()  
> > 
> > Hence the comment? Not sure what else can be said here about why
> > device_attach() < 0 is a sufficient check.
> 
> I'd just add the bit I added above that calls out the condition you are
> describing is indicated by a return of 1.

Got it.


* Re: [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default
       [not found]     ` <CGME20230222214151uscas1p26d53b2e198f63a1f382fe575c6c25070@uscas1p2.samsung.com>
@ 2023-02-22 21:41       ` Fan Ni
  2023-02-22 22:18         ` Dan Williams
  0 siblings, 1 reply; 65+ messages in thread
From: Fan Ni @ 2023-02-22 21:41 UTC (permalink / raw)
  To: Gregory Price
  Cc: Dan Williams, linux-cxl, Ira Weiny, David Hildenbrand,
	Dave Jiang, Davidlohr Bueso, Kees Cook, Jonathan Cameron,
	Vishal Verma, Dave Hansen, Michal Hocko, linux-mm, linux-acpi,
	Adam Manzanares

On Mon, Feb 13, 2023 at 01:31:17PM -0500, Gregory Price wrote:

> On Mon, Feb 13, 2023 at 01:22:17PM -0500, Gregory Price wrote:
> > On Fri, Feb 10, 2023 at 01:05:21AM -0800, Dan Williams wrote:
> > > Changes since v1: [1]
> > > [... snip ...]
> > [... snip ...]
> > Really i see these decoders and device mappings setup:
> > port1 -> mem2
> > port2 -> mem1
> > port3 -> mem0
> 
> small correction:
> port1 -> mem1
> port3 -> mem0
> port2 -> mem2
> 
> > 
> > Therefore I should expect
> > decoder0.0 -> mem2
> > decoder0.1 -> mem1
> > decoder0.2 -> mem0
> > 
> 
> this ends up mapping this way, which is still further jumbled.
> 
> Something feels like there's an off-by-one
> 

Currently, the naming of memdevs can be out of order for the following
two reasons.
1. On the kernel side, the cxl port driver does asynchronous device
probe, which can change the memdev naming even within a single OS boot,
across multiple rounds of device enumeration. The pattern can be
observed with the following steps in the guest:
	loop(){
	a) modprobe cxl_xxx
	b) cxl list  --> you will see the memdev name change (like mem0->mem1).
	c) rmmod cxl_xxx
	}
This behaviour can be avoided by switching to synchronous device probe
with the following change:
--------------------------------------------
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 258004f34281..f3f90fad62b5 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -663,7 +663,7 @@ static struct pci_driver cxl_pci_driver = {
 	.probe			= cxl_pci_probe,
 	.err_handler		= &cxl_error_handlers,
 	.driver	= {
-		.probe_type	= PROBE_PREFER_ASYNCHRONOUS,
+		.probe_type = PROBE_FORCE_SYNCHRONOUS,
 	},
 };
-------------------------------------------

With the above patch you will see consistent memdev naming within one
OS boot; however, the order can still be different from what we expect
given the QEMU config options we use. We also need to make a change on
the QEMU side, as shown below.

2. Currently in QEMU, multiple components at the same topology level
are stored in a data structure called QLIST, as defined in
include/qemu/queue.h. When enqueuing a component, the current QEMU code
uses QLIST_INSERT_HEAD to insert the item at the head, but when
iterating it uses QLIST_FOREACH/QLIST_FOREACH_SAFE, which also starts
from the head of the list. That is to say, if we enqueue items P1, P2,
P3 in order, then when iterating we get P3, P2, P1. With a simple test
of the code change below (always insert at the list tail), the order
issue is fixed.

----------------------------------------------------------------------------
diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index e029e7bf66..15491960e1 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -130,7 +130,7 @@ struct {                                                                \
         (listelm)->field.le_prev = &(elm)->field.le_next;               \
 } while (/*CONSTCOND*/0)
 
-#define QLIST_INSERT_HEAD(head, elm, field) do {                        \
+#define QLIST_INSERT_HEAD_OLD(head, elm, field) do {                    \
         if (((elm)->field.le_next = (head)->lh_first) != NULL)          \
                 (head)->lh_first->field.le_prev = &(elm)->field.le_next;\
         (head)->lh_first = (elm);                                       \
@@ -146,6 +146,20 @@ struct {                                                                \
         (elm)->field.le_prev = NULL;                                    \
 } while (/*CONSTCOND*/0)
 
+#define QLIST_INSERT_TAIL(head, elm, field) do {                        \
+        typeof(elm) last_p = (head)->lh_first;                          \
+        while (last_p && last_p->field.le_next)                         \
+            last_p = last_p->field.le_next;                             \
+        if (last_p)                                                     \
+            QLIST_INSERT_AFTER(last_p, elm, field);                     \
+        else                                                            \
+            QLIST_INSERT_HEAD_OLD(head, elm, field);                    \
+} while (/*CONSTCOND*/0)
+
+#define QLIST_INSERT_HEAD(head, elm, field) do {                        \
+        QLIST_INSERT_TAIL(head, elm, field);                            \
+} while (/*CONSTCOND*/0)
+
 /*
  * Like QLIST_REMOVE() but safe to call when elm is not in a list
  */
-----------------------------------------------------------------------------

The memdev naming order can also cause confusion when creating regions
for multiple memdevs under different host bridges, as the kernel code
enforces a host-bridge check to ensure the target position matches the
CFMWS configuration. To avoid the confusion, we can use "cxl list -TD"
to find out the target position for a memdev, but it is kind of
annoying to do that before creating a region.


* Re: [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default
  2023-02-22 21:41       ` Fan Ni
@ 2023-02-22 22:18         ` Dan Williams
  0 siblings, 0 replies; 65+ messages in thread
From: Dan Williams @ 2023-02-22 22:18 UTC (permalink / raw)
  To: Fan Ni, Gregory Price
  Cc: Dan Williams, linux-cxl, Ira Weiny, David Hildenbrand,
	Dave Jiang, Davidlohr Bueso, Kees Cook, Jonathan Cameron,
	Vishal Verma, Dave Hansen, Michal Hocko, linux-mm, linux-acpi,
	Adam Manzanares

Fan Ni wrote:
> On Mon, Feb 13, 2023 at 01:31:17PM -0500, Gregory Price wrote:
> 
> > On Mon, Feb 13, 2023 at 01:22:17PM -0500, Gregory Price wrote:
> > > On Fri, Feb 10, 2023 at 01:05:21AM -0800, Dan Williams wrote:
> > > > Changes since v1: [1]
> > > > [... snip ...]
> > > [... snip ...]
> > > Really i see these decoders and device mappings setup:
> > > port1 -> mem2
> > > port2 -> mem1
> > > port3 -> mem0
> > 
> > small correction:
> > port1 -> mem1
> > port3 -> mem0
> > port2 -> mem2
> > 
> > > 
> > > Therefore I should expect
> > > decoder0.0 -> mem2
> > > decoder0.1 -> mem1
> > > decoder0.2 -> mem0
> > > 
> > 
> > this end up mapping this way, which is still further jumbled.
> > 
> > Something feels like there's an off-by-one
> > 
> 
> Currently, the naming of memdevs can be out of order for the following
> two reasons.
> 1. On the kernel side, the cxl port driver does asynchronous device
> probe, which can change the memdev naming even within a single OS boot,
> across multiple rounds of device enumeration. The pattern can be
> observed with the following steps in the guest,
> 	loop(){
> 	a) modprobe cxl_xxx
> 	b) cxl list  --> you will see the memdev name change (like mem0->mem1).
> 	c) rmmod cxl_xxx
> 	}
> This behaviour can be avoided by switching to synchronous device probe
> with the following change
> --------------------------------------------
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 258004f34281..f3f90fad62b5 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -663,7 +663,7 @@ static struct pci_driver cxl_pci_driver = {
>  	.probe			= cxl_pci_probe,
>  	.err_handler		= &cxl_error_handlers,
>  	.driver	= {
> -		.probe_type	= PROBE_PREFER_ASYNCHRONOUS,
> +		.probe_type = PROBE_FORCE_SYNCHRONOUS,
>  	},
>  };
> -------------------------------------------
> 
> With the above patch you will see consistent memdev naming within one
> OS boot; however, the order can still be different from what we expect
> given the QEMU config options we use. We also need to make a change on
> the QEMU side, as shown below.

This is by design. Kernel device name order is not guaranteed even with
synchronous probing, and the async probing acts to make sure these names
are always random for memdevs. For a memdev the recommendation is to
identify it by 'host'/'path' or by 'serial':

# cxl list -u -m 0000:35:00.0
{
  "memdev":"mem0",
  "pmem_size":"512.00 MiB (536.87 MB)",
  "serial":"0",
  "host":"0000:35:00.0"
}


# cxl list -u -s 0
{
  "memdev":"mem0",
  "pmem_size":"512.00 MiB (536.87 MB)",
  "serial":"0",
  "host":"0000:35:00.0"
}

Although, in real life a CXL device will have a non-zero unique serial
number.

> 2. Currently in QEMU, multiple components at the same topology level
> are stored in a data structure called QLIST, as defined in
> include/qemu/queue.h. When enqueuing a component, the current QEMU code
> uses QLIST_INSERT_HEAD to insert the item at the head, but when
> iterating it uses QLIST_FOREACH/QLIST_FOREACH_SAFE, which also starts
> from the head of the list. That is to say, if we enqueue items P1, P2,
> P3 in order, then when iterating we get P3, P2, P1. With a simple test
> of the code change below (always insert at the list tail), the order
> issue is fixed.

Again, the kernel does not, and should not be expected to, guarantee
device name ordering. Perhaps this merits /dev/cxl/by-path and
/dev/cxl/by-id, similar to /dev/disk/by-path and /dev/disk/by-id, for
semi-persistent / persistent naming.

That's a conversation to have with the systemd-udev folks.


* Re: [PATCH v2 13/20] cxl/region: Add region autodiscovery
       [not found]   ` <CGME20230228185348uscas1p1a5314a077383ee81ac228c1b9f1da2f8@uscas1p1.samsung.com>
@ 2023-02-28 18:53     ` Fan Ni
  0 siblings, 0 replies; 65+ messages in thread
From: Fan Ni @ 2023-02-28 18:53 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-cxl, vishal.l.verma, dave.hansen, linux-mm, linux-acpi

On Fri, Feb 10, 2023 at 01:06:39AM -0800, Dan Williams wrote:
> Region autodiscovery is an asynchronous state machine advanced by
> cxl_port_probe(). After the decoders on an endpoint port are enumerated
> they are scanned for actively enabled instances. Each active decoder is
> flagged for auto-assembly CXL_DECODER_F_AUTO and attached to a region.
> If a region does not already exist for the address range setting of the
> decoder one is created. That creation process may race with other
> decoders of the same region being discovered since cxl_port_probe() is
> asynchronous. A new 'struct cxl_root_decoder' lock, @range_lock, is
> introduced to mitigate that race.
> 
> Once all decoders have arrived, "p->nr_targets == p->interleave_ways",
> they are sorted by their relative decode position. The sort algorithm
> involves finding the point in the cxl_port topology where one leg of the
> decode leads to deviceA and the other deviceB. At that point in the
> topology the target order in the 'struct cxl_switch_decoder' indicates
> the relative position of those endpoint decoders in the region.
> 
> From that point the region goes through the same setup and validation
> steps as user-created regions, but instead of programming the decoders
> it validates that driver would have written the same values to the
> decoders as were already present.
> 
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564540972.847146.17096178433176097831.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/cxl/core/hdm.c    |   11 +
>  drivers/cxl/core/port.c   |    2 
>  drivers/cxl/core/region.c |  497 ++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/cxl/cxl.h         |   29 +++
>  drivers/cxl/port.c        |   48 ++++
>  5 files changed, 576 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index a0891c3464f1..8c29026a4b9d 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -676,6 +676,14 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld)
>  	port->commit_end--;
>  	cxld->flags &= ~CXL_DECODER_F_ENABLE;
>  
> +	/* Userspace is now responsible for reconfiguring this decoder */
> +	if (is_endpoint_decoder(&cxld->dev)) {
> +		struct cxl_endpoint_decoder *cxled;
> +
> +		cxled = to_cxl_endpoint_decoder(&cxld->dev);
> +		cxled->state = CXL_DECODER_STATE_MANUAL;
> +	}
> +
>  	return 0;
>  }
>  
> @@ -783,6 +791,9 @@ static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
>  		return rc;
>  	}
>  	*dpa_base += dpa_size + skip;
> +
> +	cxled->state = CXL_DECODER_STATE_AUTO;
> +
>  	return 0;
>  }
>  
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 9e5df64ea6b5..59620528571a 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -446,6 +446,7 @@ bool is_endpoint_decoder(struct device *dev)
>  {
>  	return dev->type == &cxl_decoder_endpoint_type;
>  }
> +EXPORT_SYMBOL_NS_GPL(is_endpoint_decoder, CXL);
>  
>  bool is_root_decoder(struct device *dev)
>  {
> @@ -1628,6 +1629,7 @@ struct cxl_root_decoder *cxl_root_decoder_alloc(struct cxl_port *port,
>  	}
>  
>  	cxlrd->calc_hb = calc_hb;
> +	mutex_init(&cxlrd->range_lock);
>  
>  	cxld = &cxlsd->cxld;
>  	cxld->dev.type = &cxl_decoder_root_type;
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 691605f1e120..3f6453da2c51 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -6,6 +6,7 @@
>  #include <linux/module.h>
>  #include <linux/slab.h>
>  #include <linux/uuid.h>
> +#include <linux/sort.h>
>  #include <linux/idr.h>
>  #include <cxlmem.h>
>  #include <cxl.h>
> @@ -524,7 +525,12 @@ static void cxl_region_iomem_release(struct cxl_region *cxlr)
>  	if (device_is_registered(&cxlr->dev))
>  		lockdep_assert_held_write(&cxl_region_rwsem);
>  	if (p->res) {
> -		remove_resource(p->res);
> +		/*
> +		 * Autodiscovered regions may not have been able to insert their
> +		 * resource.
> +		 */
> +		if (p->res->parent)
> +			remove_resource(p->res);
>  		kfree(p->res);
>  		p->res = NULL;
>  	}
> @@ -1105,12 +1111,35 @@ static int cxl_port_setup_targets(struct cxl_port *port,
>  		return rc;
>  	}
>  
> -	cxld->interleave_ways = iw;
> -	cxld->interleave_granularity = ig;
> -	cxld->hpa_range = (struct range) {
> -		.start = p->res->start,
> -		.end = p->res->end,
> -	};
> +	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
> +		if (cxld->interleave_ways != iw ||
> +		    cxld->interleave_granularity != ig ||
> +		    cxld->hpa_range.start != p->res->start ||
> +		    cxld->hpa_range.end != p->res->end ||
> +		    ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) {
> +			dev_err(&cxlr->dev,
> +				"%s:%s %s expected iw: %d ig: %d %pr\n",
> +				dev_name(port->uport), dev_name(&port->dev),
> +				__func__, iw, ig, p->res);
> +			dev_err(&cxlr->dev,
> +				"%s:%s %s got iw: %d ig: %d state: %s %#llx:%#llx\n",
> +				dev_name(port->uport), dev_name(&port->dev),
> +				__func__, cxld->interleave_ways,
> +				cxld->interleave_granularity,
> +				(cxld->flags & CXL_DECODER_F_ENABLE) ?
> +					"enabled" :
> +					"disabled",
> +				cxld->hpa_range.start, cxld->hpa_range.end);
> +			return -ENXIO;
> +		}
> +	} else {
> +		cxld->interleave_ways = iw;
> +		cxld->interleave_granularity = ig;
> +		cxld->hpa_range = (struct range) {
> +			.start = p->res->start,
> +			.end = p->res->end,
> +		};
> +	}
>  	dev_dbg(&cxlr->dev, "%s:%s iw: %d ig: %d\n", dev_name(port->uport),
>  		dev_name(&port->dev), iw, ig);
>  add_target:
> @@ -1121,7 +1150,17 @@ static int cxl_port_setup_targets(struct cxl_port *port,
>  			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev), pos);
>  		return -ENXIO;
>  	}
> -	cxlsd->target[cxl_rr->nr_targets_set] = ep->dport;
> +	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
> +		if (cxlsd->target[cxl_rr->nr_targets_set] != ep->dport) {
> +			dev_dbg(&cxlr->dev, "%s:%s: %s expected %s at %d\n",
> +				dev_name(port->uport), dev_name(&port->dev),
> +				dev_name(&cxlsd->cxld.dev),
> +				dev_name(ep->dport->dport),
> +				cxl_rr->nr_targets_set);
> +			return -ENXIO;
> +		}
> +	} else
> +		cxlsd->target[cxl_rr->nr_targets_set] = ep->dport;
>  	inc = 1;
>  out_target_set:
>  	cxl_rr->nr_targets_set += inc;
> @@ -1163,6 +1202,13 @@ static void cxl_region_teardown_targets(struct cxl_region *cxlr)
>  	struct cxl_ep *ep;
>  	int i;
>  
> +	/*
> +	 * In the auto-discovery case, skip automatic teardown since the
> +	 * address space is already active
> +	 */
> +	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags))
> +		return;
> +
>  	for (i = 0; i < p->nr_targets; i++) {
>  		cxled = p->targets[i];
>  		cxlmd = cxled_to_memdev(cxled);
> @@ -1195,8 +1241,8 @@ static int cxl_region_setup_targets(struct cxl_region *cxlr)
>  			iter = to_cxl_port(iter->dev.parent);
>  
>  		/*
> -		 * Descend the topology tree programming targets while
> -		 * looking for conflicts.
> +		 * Descend the topology tree programming / validating
> +		 * targets while looking for conflicts.
>  		 */
>  		for (ep = cxl_ep_load(iter, cxlmd); iter;
>  		     iter = ep->next, ep = cxl_ep_load(iter, cxlmd)) {
> @@ -1291,6 +1337,185 @@ static int cxl_region_attach_position(struct cxl_region *cxlr,
>  	return rc;
>  }
>  
> +static int cxl_region_attach_auto(struct cxl_region *cxlr,
> +				  struct cxl_endpoint_decoder *cxled, int pos)
> +{
> +	struct cxl_region_params *p = &cxlr->params;
> +
> +	if (cxled->state != CXL_DECODER_STATE_AUTO) {
> +		dev_err(&cxlr->dev,
> +			"%s: unable to add decoder to autodetected region\n",
> +			dev_name(&cxled->cxld.dev));
> +		return -EINVAL;
> +	}
> +
> +	if (pos >= 0) {
> +		dev_dbg(&cxlr->dev, "%s: expected auto position, not %d\n",
> +			dev_name(&cxled->cxld.dev), pos);
> +		return -EINVAL;
> +	}
> +
> +	if (p->nr_targets >= p->interleave_ways) {
> +		dev_err(&cxlr->dev, "%s: no more target slots available\n",
> +			dev_name(&cxled->cxld.dev));
> +		return -ENXIO;
> +	}
> +
> +	/*
> +	 * Temporarily record the endpoint decoder into the target array. Yes,
> +	 * this means that userspace can view devices in the wrong position
> +	 * before the region activates, and must be careful to understand when
> +	 * it might be racing region autodiscovery.
> +	 */
> +	pos = p->nr_targets;
> +	p->targets[pos] = cxled;
> +	cxled->pos = pos;
> +	p->nr_targets++;
> +
> +	return 0;
> +}
> +
> +static struct cxl_port *next_port(struct cxl_port *port)
> +{
> +	if (!port->parent_dport)
> +		return NULL;
> +	return port->parent_dport->port;
> +}
> +
> +static int decoder_match_range(struct device *dev, void *data)
> +{
> +	struct cxl_endpoint_decoder *cxled = data;
> +	struct cxl_switch_decoder *cxlsd;
> +
> +	if (!is_switch_decoder(dev))
> +		return 0;
> +
> +	cxlsd = to_cxl_switch_decoder(dev);
> +	return range_contains(&cxlsd->cxld.hpa_range, &cxled->cxld.hpa_range);
> +}
> +
> +static void find_positions(const struct cxl_switch_decoder *cxlsd,
> +			   const struct cxl_port *iter_a,
> +			   const struct cxl_port *iter_b, int *a_pos,
> +			   int *b_pos)
> +{
> +	int i;
> +
> +	for (i = 0, *a_pos = -1, *b_pos = -1; i < cxlsd->nr_targets; i++) {
> +		if (cxlsd->target[i] == iter_a->parent_dport)
> +			*a_pos = i;
> +		else if (cxlsd->target[i] == iter_b->parent_dport)
> +			*b_pos = i;
> +		if (*a_pos >= 0 && *b_pos >= 0)
> +			break;
> +	}
> +}
> +
> +static int cmp_decode_pos(const void *a, const void *b)
> +{
> +	struct cxl_endpoint_decoder *cxled_a = *(typeof(cxled_a) *)a;
> +	struct cxl_endpoint_decoder *cxled_b = *(typeof(cxled_b) *)b;
> +	struct cxl_memdev *cxlmd_a = cxled_to_memdev(cxled_a);
> +	struct cxl_memdev *cxlmd_b = cxled_to_memdev(cxled_b);
> +	struct cxl_port *port_a = cxled_to_port(cxled_a);
> +	struct cxl_port *port_b = cxled_to_port(cxled_b);
> +	struct cxl_port *iter_a, *iter_b, *port = NULL;
> +	struct cxl_switch_decoder *cxlsd;
> +	struct device *dev;
> +	int a_pos, b_pos;
> +	unsigned int seq;
> +
> +	/* Exit early if any prior sorting failed */
> +	if (cxled_a->pos < 0 || cxled_b->pos < 0)
> +		return 0;
> +
> +	/*
> +	 * Walk up the hierarchy to find a shared port, find the decoder that
> +	 * maps the range, compare the relative position of those dport
> +	 * mappings.
> +	 */
> +	for (iter_a = port_a; iter_a; iter_a = next_port(iter_a)) {
> +		struct cxl_port *next_a, *next_b;
> +
> +		next_a = next_port(iter_a);
> +		if (!next_a)
> +			break;
> +
> +		for (iter_b = port_b; iter_b; iter_b = next_port(iter_b)) {
> +			next_b = next_port(iter_b);
> +			if (next_a != next_b)
> +				continue;
> +			port = next_a;
> +			break;
> +		}
> +
> +		if (port)
> +			break;
> +	}
> +
> +	if (!port) {
> +		dev_err(cxlmd_a->dev.parent,
> +			"failed to find shared port with %s\n",
> +			dev_name(cxlmd_b->dev.parent));
> +		goto err;
> +	}
> +
> +	dev = device_find_child(&port->dev, cxled_a, decoder_match_range);
> +	if (!dev) {
> +		struct range *range = &cxled_a->cxld.hpa_range;
> +
> +		dev_err(port->uport,
> +			"failed to find decoder that maps %#llx-%#llx\n",
> +			range->start, range->end);
> +		goto err;
> +	}
> +
> +	cxlsd = to_cxl_switch_decoder(dev);
> +	do {
> +		seq = read_seqbegin(&cxlsd->target_lock);
> +		find_positions(cxlsd, iter_a, iter_b, &a_pos, &b_pos);
> +	} while (read_seqretry(&cxlsd->target_lock, seq));
> +
> +	put_device(dev);
> +
> +	if (a_pos < 0 || b_pos < 0) {
> +		dev_err(port->uport,
> +			"failed to find shared decoder for %s and %s\n",
> +			dev_name(cxlmd_a->dev.parent),
> +			dev_name(cxlmd_b->dev.parent));
> +		goto err;
> +	}
> +
> +	dev_dbg(port->uport, "%s comes %s %s\n", dev_name(cxlmd_a->dev.parent),
> +		a_pos - b_pos < 0 ? "before" : "after",
> +		dev_name(cxlmd_b->dev.parent));
> +
> +	return a_pos - b_pos;
> +err:
> +	cxled_a->pos = -1;
> +	return 0;
> +}
> +
> +static int cxl_region_sort_targets(struct cxl_region *cxlr)
> +{
> +	struct cxl_region_params *p = &cxlr->params;
> +	int i, rc = 0;
> +
> +	sort(p->targets, p->nr_targets, sizeof(p->targets[0]), cmp_decode_pos,
> +	     NULL);
> +
> +	for (i = 0; i < p->nr_targets; i++) {
> +		struct cxl_endpoint_decoder *cxled = p->targets[i];
> +
> +		if (cxled->pos < 0)
> +			rc = -ENXIO;
> +		cxled->pos = i;
> +	}
> +
> +	dev_dbg(&cxlr->dev, "region sort %s\n", rc ? "failed" : "successful");
> +	return rc;
> +}
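
For readers following the sort logic: cmp_decode_pos() orders two endpoint decoders by walking both up to the nearest shared port and comparing which dport slot each path occupies there. The same idea can be sketched in plain userspace C with qsort(3) (the structure and function names below are invented for illustration and are not part of the patch):

```c
#include <stdlib.h>

#define MAX_DEPTH 16

/* Toy stand-in for the port hierarchy: each node records its parent
 * and the target slot it occupies in that parent. */
struct node {
	struct node *parent;
	int slot;
};

/* Fill @path with the chain from the root down to @n, root first. */
static int path_to_root(const struct node *n, const struct node **path)
{
	const struct node *tmp[MAX_DEPTH];
	int len = 0;

	while (n) {
		tmp[len++] = n;
		n = n->parent;
	}
	for (int i = 0; i < len; i++)
		path[i] = tmp[len - 1 - i];
	return len;
}

/* qsort() comparator: order by slot under the nearest shared ancestor,
 * i.e. at the first point where the two root-paths diverge. */
static int cmp_pos(const void *pa, const void *pb)
{
	const struct node *a = *(const struct node * const *)pa;
	const struct node *b = *(const struct node * const *)pb;
	const struct node *path_a[MAX_DEPTH], *path_b[MAX_DEPTH];
	int la = path_to_root(a, path_a);
	int lb = path_to_root(b, path_b);

	for (int i = 0; i < la && i < lb; i++) {
		if (path_a[i] == path_b[i])
			continue;
		return path_a[i]->slot - path_b[i]->slot;
	}
	return 0;
}
```

As in the patch, a comparison that cannot be resolved should poison the sort rather than abort it; the kernel version records pos = -1 in the comparator and cxl_region_sort_targets() turns that into -ENXIO afterward, since sort() itself has no error return.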
> +
>  static int cxl_region_attach(struct cxl_region *cxlr,
>  			     struct cxl_endpoint_decoder *cxled, int pos)
>  {
> @@ -1354,6 +1579,50 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  		return -EINVAL;
>  	}
>  
> +	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
> +		int i;
> +
> +		rc = cxl_region_attach_auto(cxlr, cxled, pos);
> +		if (rc)
> +			return rc;
> +
> +		/* await more targets to arrive... */
> +		if (p->nr_targets < p->interleave_ways)
> +			return 0;
> +
> +		/*
> +		 * All targets are here, which implies all PCI enumeration that
> +		 * affects this region has been completed. Walk the topology to
> +		 * sort the devices into their relative region decode position.
> +		 */
> +		rc = cxl_region_sort_targets(cxlr);
> +		if (rc)
> +			return rc;
> +
> +		for (i = 0; i < p->nr_targets; i++) {
> +			cxled = p->targets[i];
> +			ep_port = cxled_to_port(cxled);
> +			dport = cxl_find_dport_by_dev(root_port,
> +						      ep_port->host_bridge);
> +			rc = cxl_region_attach_position(cxlr, cxlrd, cxled,
> +							dport, i);
> +			if (rc)
> +				return rc;
> +		}
> +
> +		rc = cxl_region_setup_targets(cxlr);
> +		if (rc)
> +			return rc;
> +
> +		/*
> +		 * If target setup succeeds in the autodiscovery case
> +		 * then the region is already committed.
> +		 */
> +		p->state = CXL_CONFIG_COMMIT;
> +
> +		return 0;
> +	}
> +
>  	rc = cxl_region_validate_position(cxlr, cxled, pos);
>  	if (rc)
>  		return rc;
> @@ -2087,6 +2356,193 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
>  	return rc;
>  }
>  
> +static int match_decoder_by_range(struct device *dev, void *data)
> +{
> +	struct range *r1, *r2 = data;
> +	struct cxl_root_decoder *cxlrd;
> +
> +	if (!is_root_decoder(dev))
> +		return 0;
> +
> +	cxlrd = to_cxl_root_decoder(dev);
> +	r1 = &cxlrd->cxlsd.cxld.hpa_range;
> +	return range_contains(r1, r2);
> +}
> +
> +static int match_region_by_range(struct device *dev, void *data)
> +{
> +	struct cxl_region_params *p;
> +	struct cxl_region *cxlr;
> +	struct range *r = data;
> +	int rc = 0;
> +
> +	if (!is_cxl_region(dev))
> +		return 0;
> +
> +	cxlr = to_cxl_region(dev);
> +	p = &cxlr->params;
> +
> +	down_read(&cxl_region_rwsem);
> +	if (p->res && p->res->start == r->start && p->res->end == r->end)
> +		rc = 1;
> +	up_read(&cxl_region_rwsem);
> +
> +	return rc;
> +}
> +
> +/* Establish an empty region covering the given HPA range */
> +static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
> +					   struct cxl_endpoint_decoder *cxled)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct cxl_port *port = cxlrd_to_port(cxlrd);
> +	struct range *hpa = &cxled->cxld.hpa_range;
> +	struct cxl_region_params *p;
> +	struct cxl_region *cxlr;
> +	struct resource *res;
> +	int rc;
> +
> +	do {
> +		cxlr = __create_region(cxlrd, cxled->mode,
> +				       atomic_read(&cxlrd->region_id));
> +	} while (IS_ERR(cxlr) && PTR_ERR(cxlr) == -EBUSY);
> +
> +	if (IS_ERR(cxlr)) {
> +		dev_err(cxlmd->dev.parent,
> +			"%s:%s: %s failed assign region: %ld\n",
> +			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> +			__func__, PTR_ERR(cxlr));
> +		return cxlr;
> +	}
> +
> +	down_write(&cxl_region_rwsem);
> +	p = &cxlr->params;
> +	if (p->state >= CXL_CONFIG_INTERLEAVE_ACTIVE) {
> +		dev_err(cxlmd->dev.parent,
> +			"%s:%s: %s autodiscovery interrupted\n",
> +			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> +			__func__);
> +		rc = -EBUSY;
> +		goto err;
> +	}
> +
> +	set_bit(CXL_REGION_F_AUTO, &cxlr->flags);
> +
> +	res = kmalloc(sizeof(*res), GFP_KERNEL);
> +	if (!res) {
> +		rc = -ENOMEM;
> +		goto err;
> +	}
> +
> +	*res = DEFINE_RES_MEM_NAMED(hpa->start, range_len(hpa),
> +				    dev_name(&cxlr->dev));
> +	rc = insert_resource(cxlrd->res, res);
> +	if (rc) {
> +		/*
> +		 * Platform-firmware may not have split resources like "System
> +		 * RAM" on CXL window boundaries, see cxl_region_iomem_release()
> +		 */
> +		dev_warn(cxlmd->dev.parent,
> +			 "%s:%s: %s %s cannot insert resource\n",
> +			 dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> +			 __func__, dev_name(&cxlr->dev));
> +	}
> +
> +	p->res = res;
> +	p->interleave_ways = cxled->cxld.interleave_ways;
> +	p->interleave_granularity = cxled->cxld.interleave_granularity;
> +	p->state = CXL_CONFIG_INTERLEAVE_ACTIVE;
> +
> +	rc = sysfs_update_group(&cxlr->dev.kobj, get_cxl_region_target_group());
> +	if (rc)
> +		goto err;
> +
> +	dev_dbg(cxlmd->dev.parent, "%s:%s: %s %s res: %pr iw: %d ig: %d\n",
> +		dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev), __func__,
> +		dev_name(&cxlr->dev), p->res, p->interleave_ways,
> +		p->interleave_granularity);
> +
> +	/* ...to match put_device() in cxl_add_to_region() */
> +	get_device(&cxlr->dev);
> +	up_write(&cxl_region_rwsem);
> +
> +	return cxlr;
> +
> +err:
> +	up_write(&cxl_region_rwsem);
> +	devm_release_action(port->uport, unregister_region, cxlr);
> +	return ERR_PTR(rc);
> +}
> +
> +int cxl_add_to_region(struct cxl_port *root, struct cxl_endpoint_decoder *cxled)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct range *hpa = &cxled->cxld.hpa_range;
> +	struct cxl_decoder *cxld = &cxled->cxld;
> +	struct cxl_root_decoder *cxlrd;
> +	struct cxl_region_params *p;
> +	struct cxl_region *cxlr;
> +	bool attach = false;
> +	struct device *dev;
> +	int rc;
> +
> +	dev = device_find_child(&root->dev, &cxld->hpa_range,
> +				match_decoder_by_range);
> +	if (!dev) {
> +		dev_err(cxlmd->dev.parent,
> +			"%s:%s no CXL window for range %#llx:%#llx\n",
> +			dev_name(&cxlmd->dev), dev_name(&cxld->dev),
> +			cxld->hpa_range.start, cxld->hpa_range.end);
> +		return -ENXIO;
> +	}
> +
> +	cxlrd = to_cxl_root_decoder(dev);
> +
> +	/*
> +	 * Ensure that if multiple threads race to construct_region() for @hpa
> +	 * one does the construction and the others add to that.
> +	 */
> +	mutex_lock(&cxlrd->range_lock);
> +	dev = device_find_child(&cxlrd->cxlsd.cxld.dev, hpa,
> +				match_region_by_range);
> +	if (!dev)
> +		cxlr = construct_region(cxlrd, cxled);
> +	else
> +		cxlr = to_cxl_region(dev);
> +	mutex_unlock(&cxlrd->range_lock);
> +
> +	if (IS_ERR(cxlr)) {
> +		rc = PTR_ERR(cxlr);
> +		goto out;
> +	}
> +
> +	attach_target(cxlr, cxled, -1, TASK_UNINTERRUPTIBLE);
> +
> +	down_read(&cxl_region_rwsem);
> +	p = &cxlr->params;
> +	attach = p->state == CXL_CONFIG_COMMIT;
> +	up_read(&cxl_region_rwsem);
> +
> +	if (attach) {
> +		int rc = device_attach(&cxlr->dev);
> +
> +		/*
> +		 * If device_attach() fails the range may still be active via
> +		 * the platform-firmware memory map, otherwise the driver for
> +		 * regions is local to this file, so driver matching can't fail.
> +		 */
> +		if (rc < 0)
> +			dev_err(&cxlr->dev, "failed to enable, range: %pr\n",
> +				p->res);
> +	}
> +
> +	put_device(&cxlr->dev);
> +out:
> +	put_device(&cxlrd->cxlsd.cxld.dev);
> +	return rc;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_add_to_region, CXL);
> +
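
Note the concurrency shape here: cxl_add_to_region() takes cxlrd->range_lock so that when several endpoint decoders race in for the same HPA range, exactly one caller runs construct_region() and the rest find and attach to the existing region. Stripped of the CXL specifics, that find-or-create pattern reduces to the following sketch (all names invented for illustration):

```c
#include <pthread.h>
#include <stdlib.h>

/* Toy registry of regions keyed by an HPA-like [start, end] range. */
struct region {
	unsigned long long start, end;
	int creations; /* times the construct step ran for this range */
	struct region *next;
};

static struct region *regions;
static pthread_mutex_t range_lock = PTHREAD_MUTEX_INITIALIZER;

static struct region *find_region(unsigned long long start,
				  unsigned long long end)
{
	for (struct region *r = regions; r; r = r->next)
		if (r->start == start && r->end == end)
			return r;
	return NULL;
}

/*
 * Find-or-create under one lock: of all racing callers for a given
 * range, exactly one constructs and the others get the existing object.
 */
static struct region *add_to_region(unsigned long long start,
				    unsigned long long end)
{
	struct region *r;

	pthread_mutex_lock(&range_lock);
	r = find_region(start, end);
	if (!r) {
		r = calloc(1, sizeof(*r));
		if (r) {
			r->start = start;
			r->end = end;
			r->creations = 1;
			r->next = regions;
			regions = r;
		}
	}
	pthread_mutex_unlock(&range_lock);
	return r;
}
```

The kernel version does the "find" via device_find_child() with match_region_by_range() and the "create" via construct_region(), but the invariant is the same: construction happens at most once per range while the lock is held.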
>  static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
>  {
>  	if (!test_bit(CXL_REGION_F_INCOHERENT, &cxlr->flags))
> @@ -2111,6 +2567,15 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
>  	return 0;
>  }
>  
> +static int is_system_ram(struct resource *res, void *arg)
> +{
> +	struct cxl_region *cxlr = arg;
> +	struct cxl_region_params *p = &cxlr->params;
> +
> +	dev_dbg(&cxlr->dev, "%pr has System RAM: %pr\n", p->res, res);
> +	return 1;
> +}
> +
>  static int cxl_region_probe(struct device *dev)
>  {
>  	struct cxl_region *cxlr = to_cxl_region(dev);
> @@ -2144,6 +2609,18 @@ static int cxl_region_probe(struct device *dev)
>  	switch (cxlr->mode) {
>  	case CXL_DECODER_PMEM:
>  		return devm_cxl_add_pmem_region(cxlr);
> +	case CXL_DECODER_RAM:
> +		/*
> +		 * The region cannot be managed by CXL if any portion of
> +		 * it is already online as 'System RAM'
> +		 */
> +		if (walk_iomem_res_desc(IORES_DESC_NONE,
> +					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> +					p->res->start, p->res->end, cxlr,
> +					is_system_ram) > 0)
> +			return 0;
> +		dev_dbg(dev, "TODO: hookup devdax\n");
> +		return 0;
>  	default:
>  		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
>  			cxlr->mode);
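
The CXL_DECODER_RAM probe path above bails out, leaving the range to the platform-firmware memory map, when walk_iomem_res_desc() reports that any busy 'System RAM' resource intersects the region. The gating test is plain interval overlap; a minimal sketch over a flat resource table (types and names invented for illustration):

```c
#include <stdbool.h>
#include <stddef.h>

struct res {
	unsigned long long start, end; /* inclusive, like struct resource */
	bool busy_system_ram;
};

/*
 * Return true if [start, end] intersects any busy "System RAM" entry,
 * mirroring the walk_iomem_res_desc() check: two inclusive ranges
 * overlap iff each one starts at or before the other ends.
 */
static bool overlaps_system_ram(const struct res *table, size_t n,
				unsigned long long start,
				unsigned long long end)
{
	for (size_t i = 0; i < n; i++) {
		if (!table[i].busy_system_ram)
			continue;
		if (table[i].start <= end && start <= table[i].end)
			return true;
	}
	return false;
}
```

When the overlap test fires, cxl_region_probe() simply returns 0 and leaves the memory under the existing 'System RAM' handling rather than handing it to devdax.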
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index ca76879af1de..c8ee4bb8cce6 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -261,6 +261,8 @@ resource_size_t cxl_rcrb_to_component(struct device *dev,
>   * cxl_decoder flags that define the type of memory / devices this
>   * decoder supports as well as configuration lock status. See "CXL 2.0
>   * 8.2.5.12.7 CXL HDM Decoder 0 Control Register" for details.
> + * Additionally indicate whether decoder settings were autodetected
> + * vs user customized.
>   */
>  #define CXL_DECODER_F_RAM   BIT(0)
>  #define CXL_DECODER_F_PMEM  BIT(1)
> @@ -334,12 +336,22 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  	return "mixed";
>  }
>  
> +/*
> + * Track whether this decoder is reserved for region autodiscovery, or
> + * free for userspace provisioning.
> + */
> +enum cxl_decoder_state {
> +	CXL_DECODER_STATE_MANUAL,
> +	CXL_DECODER_STATE_AUTO,
> +};
> +
>  /**
>   * struct cxl_endpoint_decoder - Endpoint / SPA to DPA decoder
>   * @cxld: base cxl_decoder_object
>   * @dpa_res: actively claimed DPA span of this decoder
>   * @skip: offset into @dpa_res where @cxld.hpa_range maps
>   * @mode: which memory type / access-mode-partition this decoder targets
> + * @state: autodiscovery state
>   * @pos: interleave position in @cxld.region
>   */
>  struct cxl_endpoint_decoder {
> @@ -347,6 +359,7 @@ struct cxl_endpoint_decoder {
>  	struct resource *dpa_res;
>  	resource_size_t skip;
>  	enum cxl_decoder_mode mode;
> +	enum cxl_decoder_state state;
>  	int pos;
>  };
>  
> @@ -380,6 +393,7 @@ typedef struct cxl_dport *(*cxl_calc_hb_fn)(struct cxl_root_decoder *cxlrd,
>   * @region_id: region id for next region provisioning event
>   * @calc_hb: which host bridge covers the n'th position by granularity
>   * @platform_data: platform specific configuration data
> + * @range_lock: sync region autodiscovery by address range
>   * @cxlsd: base cxl switch decoder
>   */
>  struct cxl_root_decoder {
> @@ -387,6 +401,7 @@ struct cxl_root_decoder {
>  	atomic_t region_id;
>  	cxl_calc_hb_fn calc_hb;
>  	void *platform_data;
> +	struct mutex range_lock;
>  	struct cxl_switch_decoder cxlsd;
>  };
>  
> @@ -436,6 +451,13 @@ struct cxl_region_params {
>   */
>  #define CXL_REGION_F_INCOHERENT 0
>  
> +/*
> + * Indicate whether this region has been assembled by autodetection or
> + * userspace assembly. Prevent endpoint decoders outside of automatic
> + * detection from being added to the region.
> + */
> +#define CXL_REGION_F_AUTO 1
> +
>  /**
>   * struct cxl_region - CXL region
>   * @dev: This region's device
> @@ -699,6 +721,8 @@ struct cxl_nvdimm_bridge *cxl_find_nvdimm_bridge(struct device *dev);
>  #ifdef CONFIG_CXL_REGION
>  bool is_cxl_pmem_region(struct device *dev);
>  struct cxl_pmem_region *to_cxl_pmem_region(struct device *dev);
> +int cxl_add_to_region(struct cxl_port *root,
> +		      struct cxl_endpoint_decoder *cxled);
>  #else
>  static inline bool is_cxl_pmem_region(struct device *dev)
>  {
> @@ -708,6 +732,11 @@ static inline struct cxl_pmem_region *to_cxl_pmem_region(struct device *dev)
>  {
>  	return NULL;
>  }
> +static inline int cxl_add_to_region(struct cxl_port *root,
> +				    struct cxl_endpoint_decoder *cxled)
> +{
> +	return 0;
> +}
>  #endif
>  
>  /*
> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
> index a8d46a67b45e..d88518836c2d 100644
> --- a/drivers/cxl/port.c
> +++ b/drivers/cxl/port.c
> @@ -30,6 +30,34 @@ static void schedule_detach(void *cxlmd)
>  	schedule_cxl_memdev_detach(cxlmd);
>  }
>  
> +static int discover_region(struct device *dev, void *root)
> +{
> +	struct cxl_endpoint_decoder *cxled;
> +	int rc;
> +
> +	if (!is_endpoint_decoder(dev))
> +		return 0;
> +
> +	cxled = to_cxl_endpoint_decoder(dev);
> +	if ((cxled->cxld.flags & CXL_DECODER_F_ENABLE) == 0)
> +		return 0;
> +
> +	if (cxled->state != CXL_DECODER_STATE_AUTO)
> +		return 0;
> +
> +	/*
> +	 * Region enumeration is opportunistic; if this add-event fails,
> +	 * continue to the next endpoint decoder.
> +	 */
> +	rc = cxl_add_to_region(root, cxled);
> +	if (rc)
> +		dev_dbg(dev, "failed to add to region: %#llx-%#llx\n",
> +			cxled->cxld.hpa_range.start, cxled->cxld.hpa_range.end);
> +
> +	return 0;
should we return rc here?
> +}
> +
> +
>  static int cxl_switch_port_probe(struct cxl_port *port)
>  {
>  	struct cxl_hdm *cxlhdm;
> @@ -54,6 +82,7 @@ static int cxl_endpoint_port_probe(struct cxl_port *port)
>  	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct cxl_hdm *cxlhdm;
> +	struct cxl_port *root;
>  	int rc;
>  
>  	cxlhdm = devm_cxl_setup_hdm(port);
> @@ -78,7 +107,24 @@ static int cxl_endpoint_port_probe(struct cxl_port *port)
>  		return rc;
>  	}
>  
> -	return devm_cxl_enumerate_decoders(cxlhdm);
> +	rc = devm_cxl_enumerate_decoders(cxlhdm);
> +	if (rc)
> +		return rc;
> +
> +	/*
> +	 * This can't fail in practice as CXL root exit unregisters all
> +	 * descendant ports and that in turn synchronizes with cxl_port_probe()
> +	 */
> +	root = find_cxl_root(&cxlmd->dev);
> +
> +	/*
> +	 * Now that all endpoint decoders are successfully enumerated, try to
> +	 * assemble regions from committed decoders
> +	 */
> +	device_for_each_child(&port->dev, root, discover_region);
> +	put_device(&root->dev);
> +
> +	return 0;
>  }
>  
>  static int cxl_port_probe(struct device *dev)
> 
> 
