* [RFC PATCH 00/15] Region driver
@ 2022-04-13 18:37 Ben Widawsky
  2022-04-13 18:37 ` [RFC PATCH 01/15] cxl/core: Use is_endpoint_decoder Ben Widawsky
                   ` (15 more replies)
  0 siblings, 16 replies; 53+ messages in thread
From: Ben Widawsky @ 2022-04-13 18:37 UTC (permalink / raw)
  To: linux-cxl, nvdimm
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma

Spring cleaning is here and we're starting fresh, so I won't be referencing
previous postings, and I've removed the revision history from the commit
messages.

This patch series introduces the CXL region driver as well as associated APIs
in CXL core to create and configure regions. Regions are defined by the CXL 2.0
specification [1]; a summary follows.

A region surfaces a swath of RAM (persistent or volatile) that appears to the
operating system as normal memory. Unless programmed by the BIOS or a previous
operating system, the memory is inaccessible until the CXL driver creates a
region for it. A region may be strided (interleave granularity) across multiple
devices (interleave ways), and the interleaving may traverse multiple levels of
the CXL hierarchy.
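As a toy model of the interleave terms above (not the kernel's implementation,
which decodes register bit fields per the CXL 2.0 spec): for power-of-2 ways
and granularity, the device backing a given host physical address offset is
selected by (offset / granularity) mod ways.

```shell
# Toy model only: which interleave-set position serves a given HPA offset.
# ig = interleave granularity in bytes, iw = interleave ways.
hpa_to_target() {
	hpa=$1; ig=$2; iw=$3
	echo $(( (hpa / ig) % iw ))
}

# 2-way, 256B granularity: bytes 0-255 land on target 0, 256-511 on target 1
hpa_to_target 0 256 2      # -> 0
hpa_to_target 256 256 2    # -> 1
hpa_to_target 512 256 2    # -> 0
```

For power-of-2 values this is equivalent to extracting a bit field from the
address, which is how the hardware actually routes accesses.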

+-------------------------+      +-------------------------+
|                         |      |                         |
|   CXL 2.0 Host Bridge   |      |   CXL 2.0 Host Bridge   |
|                         |      |                         |
|  +------+     +------+  |      |  +------+     +------+  |
|  |  RP  |     |  RP  |  |      |  |  RP  |     |  RP  |  |
+--+------+-----+------+--+      +--+------+-----+------+--+
      |            |                   |               \--
      |            |                   |        +-------+-\--+------+
   +------+    +-------+            +-------+   |       |USP |      |
   |Type 3|    |Type 3 |            |Type 3 |   |       +----+      |
   |Device|    |Device |            |Device |   |     CXL Switch    |
   +------+    +-------+            +-------+   | +----+     +----+ |
                                                | |DSP |     |DSP | |
                                                +-+-|--+-----+-|--+-+
                                                    |          |
                                                +------+    +-------+
                                                |Type 3|    |Type 3 |
                                                |Device|    |Device |
                                                +------+    +-------+

Region verification and programming state are owned by the cxl_region driver
(implemented in the cxl_region module). Much of the region driver is an
implementation of algorithms described in the CXL Type 3 Memory Device Software
Guide [2].

The region driver is responsible for configuring regions found on persistent
capacities in the Label Storage Area (LSA). It will also enumerate regions
configured by the BIOS (usually volatile capacities) and will allow dynamic
region creation (which can then be stored in the LSA). Only dynamically
created regions are implemented thus far.

Dan has previously stated that he doesn't want to merge ABI until the whole
series is posted and reviewed, to make sure we have no gaps. As such, the goal
of posting this series is *not* to discuss the ABI specifically (it has been
discussed previously), though feedback is of course welcome. The goal is to
find architectural flaws in the implementation of the ABI that may prove
problematic for cases we haven't yet conceived.

Since region creation is done via sysfs, it is left to userspace to prevent
races over resource usage. Here is an overview of dynamically creating and
programming a x1 256M region to be used by userspace clients. In this
example, the following topology is used (cropped for brevity):
/sys/bus/cxl/devices/
├── decoder0.0 -> ../../../devices/platform/ACPI0017:00/root0/decoder0.0
├── decoder0.1 -> ../../../devices/platform/ACPI0017:00/root0/decoder0.1
├── decoder1.0 -> ../../../devices/platform/ACPI0017:00/root0/port1/decoder1.0
├── decoder2.0 -> ../../../devices/platform/ACPI0017:00/root0/port2/decoder2.0
├── decoder3.0 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint3/decoder3.0
├── decoder4.0 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint4/decoder4.0
├── decoder5.0 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint5/decoder5.0
├── decoder6.0 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint6/decoder6.0
├── endpoint3 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint3
├── endpoint4 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint4
├── endpoint5 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint5
├── endpoint6 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint6
...

1. Select a root decoder whose interleave spans the desired interleave config
   - devices, IG, IW, and a large enough address space
   - ie. pick decoder0.0
2. Program the decoders for the endpoints comprising the interleave set.
   - ie. echo $((256 << 20)) > /sys/bus/cxl/devices/decoder3.0
3. Create a region
   - ie. echo $(cat create_pmem_region) >| create_pmem_region
4. Configure a region
   - ie. echo 256 >| interleave_granularity
	 echo 1 >| interleave_ways
	 echo $((256 << 20)) >| size
	 echo decoder3.0 >| target0
5. Bind the region driver to the region
   - ie. echo region0 > /sys/bus/cxl/drivers/cxl_region/bind
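Collected in one place, the five steps above can be sketched as a small
script. This is only a sketch: the decoder and region names (decoder0.0,
decoder3.0, region0) come from the example topology, the sysfs target of
step 2 is abbreviated exactly as in the original, and the commands are
printed rather than executed so the sequence can be inspected first.

```shell
#!/bin/sh
# Dry-run sketch of steps 1-5 for a x1 256M pmem region. Swap 'echo' for
# 'eval "$*"' in run() to actually apply the commands.
CXL=/sys/bus/cxl/devices
run() { echo "+ $*"; }

# 1. decoder0.0 is the chosen root decoder
# 2. program the endpoint decoder(s) comprising the interleave set
run "echo \$((256 << 20)) > $CXL/decoder3.0"
# 3. create a region (reading create_pmem_region returns the next region name)
run "echo \$(cat $CXL/decoder0.0/create_pmem_region) > $CXL/decoder0.0/create_pmem_region"
# 4. configure the region (assuming it surfaced as region0)
run "echo 256 > $CXL/region0/interleave_granularity"
run "echo 1 > $CXL/region0/interleave_ways"
run "echo \$((256 << 20)) > $CXL/region0/size"
run "echo decoder3.0 > $CXL/region0/target0"
# 5. bind the region driver
run "echo region0 > /sys/bus/cxl/drivers/cxl_region/bind"
```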


[1]: https://www.computeexpresslink.org/download-the-specification
[2]: https://cdrdv2.intel.com/v1/dl/getContent/643805?wapkw=CXL%20memory%20device%20sw%20guide

Ben Widawsky (15):
  cxl/core: Use is_endpoint_decoder
  cxl/core/hdm: Bail on endpoint init fail
  Revert "cxl/core: Convert decoder range to resource"
  cxl/core: Create distinct decoder structs
  cxl/acpi: Reserve CXL resources from request_free_mem_region
  cxl/acpi: Manage root decoder's address space
  cxl/port: Surface ram and pmem resources
  cxl/core/hdm: Allocate resources from the media
  cxl/core/port: Add attrs for size and volatility
  cxl/core: Extract IW/IG decoding
  cxl/acpi: Use common IW/IG decoding
  cxl/region: Add region creation ABI
  cxl/core/port: Add attrs for root ways & granularity
  cxl/region: Introduce configuration
  cxl/region: Introduce a cxl_region driver

 Documentation/ABI/testing/sysfs-bus-cxl       |  96 ++-
 .../driver-api/cxl/memory-devices.rst         |  14 +
 drivers/cxl/Kconfig                           |  10 +
 drivers/cxl/Makefile                          |   2 +
 drivers/cxl/acpi.c                            |  83 ++-
 drivers/cxl/core/Makefile                     |   1 +
 drivers/cxl/core/core.h                       |   4 +
 drivers/cxl/core/hdm.c                        |  44 +-
 drivers/cxl/core/port.c                       | 363 ++++++++--
 drivers/cxl/core/region.c                     | 669 ++++++++++++++++++
 drivers/cxl/cxl.h                             | 168 ++++-
 drivers/cxl/mem.c                             |   7 +-
 drivers/cxl/region.c                          | 333 +++++++++
 drivers/cxl/region.h                          | 105 +++
 include/linux/ioport.h                        |   1 +
 kernel/resource.c                             |  11 +-
 tools/testing/cxl/Kbuild                      |   1 +
 tools/testing/cxl/test/cxl.c                  |   2 +-
 18 files changed, 1810 insertions(+), 104 deletions(-)
 create mode 100644 drivers/cxl/core/region.c
 create mode 100644 drivers/cxl/region.c
 create mode 100644 drivers/cxl/region.h


base-commit: 7dc1d11d7abae52aada5340fb98885f0ddbb7c37
-- 
2.35.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [RFC PATCH 01/15] cxl/core: Use is_endpoint_decoder
  2022-04-13 18:37 [RFC PATCH 00/15] Region driver Ben Widawsky
@ 2022-04-13 18:37 ` Ben Widawsky
  2022-04-13 21:22   ` Dan Williams
       [not found]   ` <CGME20220415205052uscas1p209e03abf95b9c80b2ba1f287c82dfd80@uscas1p2.samsung.com>
  2022-04-13 18:37 ` [RFC PATCH 02/15] cxl/core/hdm: Bail on endpoint init fail Ben Widawsky
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 53+ messages in thread
From: Ben Widawsky @ 2022-04-13 18:37 UTC (permalink / raw)
  To: linux-cxl, nvdimm
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma

Save some characters and directly check decoder type rather than port
type. There's no need to check if the port is an endpoint port since we
already know the decoder, after alloc, has a specified type.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/core/hdm.c  | 2 +-
 drivers/cxl/core/port.c | 2 +-
 drivers/cxl/cxl.h       | 1 +
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 0e89a7a932d4..bfc8ee876278 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -197,7 +197,7 @@ static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
 	else
 		cxld->target_type = CXL_DECODER_ACCELERATOR;
 
-	if (is_cxl_endpoint(to_cxl_port(cxld->dev.parent)))
+	if (is_endpoint_decoder(&cxld->dev))
 		return 0;
 
 	target_list.value =
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 2ab1ba4499b3..74c8e47bf915 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -272,7 +272,7 @@ static const struct device_type cxl_decoder_root_type = {
 	.groups = cxl_decoder_root_attribute_groups,
 };
 
-static bool is_endpoint_decoder(struct device *dev)
+bool is_endpoint_decoder(struct device *dev)
 {
 	return dev->type == &cxl_decoder_endpoint_type;
 }
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 990b6670222e..5102491e8d13 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -340,6 +340,7 @@ struct cxl_dport *cxl_find_dport_by_dev(struct cxl_port *port,
 
 struct cxl_decoder *to_cxl_decoder(struct device *dev);
 bool is_root_decoder(struct device *dev);
+bool is_endpoint_decoder(struct device *dev);
 bool is_cxl_decoder(struct device *dev);
 struct cxl_decoder *cxl_root_decoder_alloc(struct cxl_port *port,
 					   unsigned int nr_targets);
-- 
2.35.1



* [RFC PATCH 02/15] cxl/core/hdm: Bail on endpoint init fail
  2022-04-13 18:37 [RFC PATCH 00/15] Region driver Ben Widawsky
  2022-04-13 18:37 ` [RFC PATCH 01/15] cxl/core: Use is_endpoint_decoder Ben Widawsky
@ 2022-04-13 18:37 ` Ben Widawsky
  2022-04-13 21:31   ` Dan Williams
  2022-04-13 18:37 ` [RFC PATCH 03/15] Revert "cxl/core: Convert decoder range to resource" Ben Widawsky
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Ben Widawsky @ 2022-04-13 18:37 UTC (permalink / raw)
  To: linux-cxl, nvdimm
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma

Endpoint decoder enumeration is the only way in which we can determine
Device Physical Address (DPA) -> Host Physical Address (HPA) mappings.
Information is obtained only when the register state can be read
sequentially. If a failure occurs while enumerating the decoders, all
subsequent decoders must also fail since the decoders can no longer be
accurately managed (unless it's the last decoder, in which case things can
still work).

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/core/hdm.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index bfc8ee876278..c3c021b54079 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -255,6 +255,8 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
 				      cxlhdm->regs.hdm_decoder, i);
 		if (rc) {
 			put_device(&cxld->dev);
+			if (is_endpoint_decoder(&cxld->dev))
+				return rc;
 			failed++;
 			continue;
 		}
-- 
2.35.1



* [RFC PATCH 03/15] Revert "cxl/core: Convert decoder range to resource"
  2022-04-13 18:37 [RFC PATCH 00/15] Region driver Ben Widawsky
  2022-04-13 18:37 ` [RFC PATCH 01/15] cxl/core: Use is_endpoint_decoder Ben Widawsky
  2022-04-13 18:37 ` [RFC PATCH 02/15] cxl/core/hdm: Bail on endpoint init fail Ben Widawsky
@ 2022-04-13 18:37 ` Ben Widawsky
  2022-04-13 21:43   ` Dan Williams
  2022-04-13 18:37 ` [RFC PATCH 04/15] cxl/core: Create distinct decoder structs Ben Widawsky
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Ben Widawsky @ 2022-04-13 18:37 UTC (permalink / raw)
  To: linux-cxl, nvdimm
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma

This reverts commit 608135db1b790170d22848815c4671407af74e37. All
decoders do have a host physical address space and the revert allows us
to keep that uniformity. Decoder disambiguation will allow for decoder
type-specific members which is needed, but will be handled separately.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>

---
The explanation for why it is impossible to make CFMWS ranges be
iomem_resources is explained in a later patch.
---
 drivers/cxl/acpi.c      | 17 ++++++++++-------
 drivers/cxl/core/hdm.c  |  2 +-
 drivers/cxl/core/port.c | 28 ++++++----------------------
 drivers/cxl/cxl.h       |  8 ++------
 4 files changed, 19 insertions(+), 36 deletions(-)

diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index d15a6aec0331..9b69955b90cb 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -108,8 +108,10 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
 
 	cxld->flags = cfmws_to_decoder_flags(cfmws->restrictions);
 	cxld->target_type = CXL_DECODER_EXPANDER;
-	cxld->platform_res = (struct resource)DEFINE_RES_MEM(cfmws->base_hpa,
-							     cfmws->window_size);
+	cxld->range = (struct range){
+		.start = cfmws->base_hpa,
+		.end = cfmws->base_hpa + cfmws->window_size - 1,
+	};
 	cxld->interleave_ways = CFMWS_INTERLEAVE_WAYS(cfmws);
 	cxld->interleave_granularity = CFMWS_INTERLEAVE_GRANULARITY(cfmws);
 
@@ -119,13 +121,14 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
 	else
 		rc = cxl_decoder_autoremove(dev, cxld);
 	if (rc) {
-		dev_err(dev, "Failed to add decoder for %pr\n",
-			&cxld->platform_res);
+		dev_err(dev, "Failed to add decoder for %#llx-%#llx\n",
+			cfmws->base_hpa,
+			cfmws->base_hpa + cfmws->window_size - 1);
 		return 0;
 	}
-	dev_dbg(dev, "add: %s node: %d range %pr\n", dev_name(&cxld->dev),
-		phys_to_target_node(cxld->platform_res.start),
-		&cxld->platform_res);
+	dev_dbg(dev, "add: %s node: %d range %#llx-%#llx\n",
+		dev_name(&cxld->dev), phys_to_target_node(cxld->range.start),
+		cfmws->base_hpa, cfmws->base_hpa + cfmws->window_size - 1);
 
 	return 0;
 }
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index c3c021b54079..3055e246aab9 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -172,7 +172,7 @@ static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
 		return -ENXIO;
 	}
 
-	cxld->decoder_range = (struct range) {
+	cxld->range = (struct range) {
 		.start = base,
 		.end = base + size - 1,
 	};
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 74c8e47bf915..86f451ecb7ed 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -73,14 +73,8 @@ static ssize_t start_show(struct device *dev, struct device_attribute *attr,
 			  char *buf)
 {
 	struct cxl_decoder *cxld = to_cxl_decoder(dev);
-	u64 start;
 
-	if (is_root_decoder(dev))
-		start = cxld->platform_res.start;
-	else
-		start = cxld->decoder_range.start;
-
-	return sysfs_emit(buf, "%#llx\n", start);
+	return sysfs_emit(buf, "%#llx\n", cxld->range.start);
 }
 static DEVICE_ATTR_ADMIN_RO(start);
 
@@ -88,14 +82,8 @@ static ssize_t size_show(struct device *dev, struct device_attribute *attr,
 			char *buf)
 {
 	struct cxl_decoder *cxld = to_cxl_decoder(dev);
-	u64 size;
 
-	if (is_root_decoder(dev))
-		size = resource_size(&cxld->platform_res);
-	else
-		size = range_len(&cxld->decoder_range);
-
-	return sysfs_emit(buf, "%#llx\n", size);
+	return sysfs_emit(buf, "%#llx\n", range_len(&cxld->range));
 }
 static DEVICE_ATTR_RO(size);
 
@@ -1228,7 +1216,10 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
 	cxld->interleave_ways = 1;
 	cxld->interleave_granularity = PAGE_SIZE;
 	cxld->target_type = CXL_DECODER_EXPANDER;
-	cxld->platform_res = (struct resource)DEFINE_RES_MEM(0, 0);
+	cxld->range = (struct range) {
+		.start = 0,
+		.end = -1,
+	};
 
 	return cxld;
 err:
@@ -1342,13 +1333,6 @@ int cxl_decoder_add_locked(struct cxl_decoder *cxld, int *target_map)
 	if (rc)
 		return rc;
 
-	/*
-	 * Platform decoder resources should show up with a reasonable name. All
-	 * other resources are just sub ranges within the main decoder resource.
-	 */
-	if (is_root_decoder(dev))
-		cxld->platform_res.name = dev_name(dev);
-
 	return device_add(dev);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_decoder_add_locked, CXL);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 5102491e8d13..6517d5cdf5ee 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -197,8 +197,7 @@ enum cxl_decoder_type {
  * struct cxl_decoder - CXL address range decode configuration
  * @dev: this decoder's device
  * @id: kernel device name id
- * @platform_res: address space resources considered by root decoder
- * @decoder_range: address space resources considered by midlevel decoder
+ * @range: address range considered by this decoder
  * @interleave_ways: number of cxl_dports in this decode
  * @interleave_granularity: data stride per dport
  * @target_type: accelerator vs expander (type2 vs type3) selector
@@ -210,10 +209,7 @@ enum cxl_decoder_type {
 struct cxl_decoder {
 	struct device dev;
 	int id;
-	union {
-		struct resource platform_res;
-		struct range decoder_range;
-	};
+	struct range range;
 	int interleave_ways;
 	int interleave_granularity;
 	enum cxl_decoder_type target_type;
-- 
2.35.1



* [RFC PATCH 04/15] cxl/core: Create distinct decoder structs
  2022-04-13 18:37 [RFC PATCH 00/15] Region driver Ben Widawsky
                   ` (2 preceding siblings ...)
  2022-04-13 18:37 ` [RFC PATCH 03/15] Revert "cxl/core: Convert decoder range to resource" Ben Widawsky
@ 2022-04-13 18:37 ` Ben Widawsky
  2022-04-15  1:45   ` Dan Williams
  2022-04-13 18:37 ` [RFC PATCH 05/15] cxl/acpi: Reserve CXL resources from request_free_mem_region Ben Widawsky
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Ben Widawsky @ 2022-04-13 18:37 UTC (permalink / raw)
  To: linux-cxl, nvdimm
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma

CXL HDM decoders have distinct properties at each level in the
hierarchy. Root decoders manage host physical address space. Switch
decoders manage demultiplexing of data to downstream targets. Endpoint
decoders must be aware of physical media size constraints. To properly
support these distinct needs, create distinct structures for each.

CXL HDM decoders do have similar architectural properties at all levels:
interleave properties, flags, types and consumption of host physical
address space. Those are retained and, when possible, still utilized.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/core/hdm.c       |   3 +-
 drivers/cxl/core/port.c      | 102 ++++++++++++++++++++++++-----------
 drivers/cxl/cxl.h            |  69 +++++++++++++++++++++---
 tools/testing/cxl/test/cxl.c |   2 +-
 4 files changed, 137 insertions(+), 39 deletions(-)

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 3055e246aab9..37c09c77e9a7 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -6,6 +6,7 @@
 
 #include "cxlmem.h"
 #include "core.h"
+#include "cxl.h"
 
 /**
  * DOC: cxl core hdm
@@ -242,7 +243,7 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
 		struct cxl_decoder *cxld;
 
 		if (is_cxl_endpoint(port))
-			cxld = cxl_endpoint_decoder_alloc(port);
+			cxld = &cxl_endpoint_decoder_alloc(port)->base;
 		else
 			cxld = cxl_switch_decoder_alloc(port, target_count);
 		if (IS_ERR(cxld)) {
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 86f451ecb7ed..8dd29c97e318 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -121,18 +121,19 @@ static DEVICE_ATTR_RO(target_type);
 
 static ssize_t emit_target_list(struct cxl_decoder *cxld, char *buf)
 {
+	struct cxl_decoder_targets *t = cxl_get_decoder_targets(cxld);
 	ssize_t offset = 0;
 	int i, rc = 0;
 
 	for (i = 0; i < cxld->interleave_ways; i++) {
-		struct cxl_dport *dport = cxld->target[i];
+		struct cxl_dport *dport = t->target[i];
 		struct cxl_dport *next = NULL;
 
 		if (!dport)
 			break;
 
 		if (i + 1 < cxld->interleave_ways)
-			next = cxld->target[i + 1];
+			next = t->target[i + 1];
 		rc = sysfs_emit_at(buf, offset, "%d%s", dport->port_id,
 				   next ? "," : "");
 		if (rc < 0)
@@ -147,14 +148,15 @@ static ssize_t target_list_show(struct device *dev,
 				struct device_attribute *attr, char *buf)
 {
 	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+	struct cxl_decoder_targets *t = cxl_get_decoder_targets(cxld);
 	ssize_t offset;
 	unsigned int seq;
 	int rc;
 
 	do {
-		seq = read_seqbegin(&cxld->target_lock);
+		seq = read_seqbegin(&t->target_lock);
 		rc = emit_target_list(cxld, buf);
-	} while (read_seqretry(&cxld->target_lock, seq));
+	} while (read_seqretry(&t->target_lock, seq));
 
 	if (rc < 0)
 		return rc;
@@ -199,23 +201,6 @@ static const struct attribute_group *cxl_decoder_root_attribute_groups[] = {
 	NULL,
 };
 
-static struct attribute *cxl_decoder_switch_attrs[] = {
-	&dev_attr_target_type.attr,
-	&dev_attr_target_list.attr,
-	NULL,
-};
-
-static struct attribute_group cxl_decoder_switch_attribute_group = {
-	.attrs = cxl_decoder_switch_attrs,
-};
-
-static const struct attribute_group *cxl_decoder_switch_attribute_groups[] = {
-	&cxl_decoder_switch_attribute_group,
-	&cxl_decoder_base_attribute_group,
-	&cxl_base_attribute_group,
-	NULL,
-};
-
 static struct attribute *cxl_decoder_endpoint_attrs[] = {
 	&dev_attr_target_type.attr,
 	NULL,
@@ -232,6 +217,12 @@ static const struct attribute_group *cxl_decoder_endpoint_attribute_groups[] = {
 	NULL,
 };
 
+static const struct attribute_group *cxl_decoder_switch_attribute_groups[] = {
+	&cxl_decoder_base_attribute_group,
+	&cxl_base_attribute_group,
+	NULL,
+};
+
 static void cxl_decoder_release(struct device *dev)
 {
 	struct cxl_decoder *cxld = to_cxl_decoder(dev);
@@ -264,6 +255,7 @@ bool is_endpoint_decoder(struct device *dev)
 {
 	return dev->type == &cxl_decoder_endpoint_type;
 }
+EXPORT_SYMBOL_NS_GPL(is_endpoint_decoder, CXL);
 
 bool is_root_decoder(struct device *dev)
 {
@@ -1136,6 +1128,7 @@ EXPORT_SYMBOL_NS_GPL(cxl_find_dport_by_dev, CXL);
 static int decoder_populate_targets(struct cxl_decoder *cxld,
 				    struct cxl_port *port, int *target_map)
 {
+	struct cxl_decoder_targets *t = cxl_get_decoder_targets(cxld);
 	int i, rc = 0;
 
 	if (!target_map)
@@ -1146,21 +1139,72 @@ static int decoder_populate_targets(struct cxl_decoder *cxld,
 	if (list_empty(&port->dports))
 		return -EINVAL;
 
-	write_seqlock(&cxld->target_lock);
-	for (i = 0; i < cxld->nr_targets; i++) {
+	write_seqlock(&t->target_lock);
+	for (i = 0; i < t->nr_targets; i++) {
 		struct cxl_dport *dport = find_dport(port, target_map[i]);
 
 		if (!dport) {
 			rc = -ENXIO;
 			break;
 		}
-		cxld->target[i] = dport;
+		t->target[i] = dport;
 	}
-	write_sequnlock(&cxld->target_lock);
+	write_sequnlock(&t->target_lock);
 
 	return rc;
 }
 
+static struct cxl_decoder *__cxl_decoder_alloc(struct cxl_port *port,
+					       unsigned int nr_targets)
+{
+	struct cxl_decoder *cxld;
+
+	if (is_cxl_endpoint(port)) {
+		struct cxl_endpoint_decoder *cxled;
+
+		cxled = kzalloc(sizeof(*cxled), GFP_KERNEL);
+		if (!cxled)
+			return NULL;
+		cxld = &cxled->base;
+	} else if (is_cxl_root(port)) {
+		struct cxl_root_decoder *cxlrd;
+
+		cxlrd = kzalloc(sizeof(*cxlrd), GFP_KERNEL);
+		if (!cxlrd)
+			return NULL;
+
+		cxlrd->targets =
+			kzalloc(struct_size(cxlrd->targets, target, nr_targets),
+				GFP_KERNEL);
+		if (!cxlrd->targets) {
+			kfree(cxlrd);
+			return NULL;
+		}
+		cxlrd->targets->nr_targets = nr_targets;
+		seqlock_init(&cxlrd->targets->target_lock);
+		cxld = &cxlrd->base;
+	} else {
+		struct cxl_switch_decoder *cxlsd;
+
+		cxlsd = kzalloc(sizeof(*cxlsd), GFP_KERNEL);
+		if (!cxlsd)
+			return NULL;
+
+		cxlsd->targets =
+			kzalloc(struct_size(cxlsd->targets, target, nr_targets),
+				GFP_KERNEL);
+		if (!cxlsd->targets) {
+			kfree(cxlsd);
+			return NULL;
+		}
+		cxlsd->targets->nr_targets = nr_targets;
+		seqlock_init(&cxlsd->targets->target_lock);
+		cxld = &cxlsd->base;
+	}
+
+	return cxld;
+}
+
 /**
  * cxl_decoder_alloc - Allocate a new CXL decoder
  * @port: owning port of this decoder
@@ -1186,7 +1230,7 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
 	if (nr_targets > CXL_DECODER_MAX_INTERLEAVE)
 		return ERR_PTR(-EINVAL);
 
-	cxld = kzalloc(struct_size(cxld, target, nr_targets), GFP_KERNEL);
+	cxld = __cxl_decoder_alloc(port, nr_targets);
 	if (!cxld)
 		return ERR_PTR(-ENOMEM);
 
@@ -1198,8 +1242,6 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
 	get_device(&port->dev);
 	cxld->id = rc;
 
-	cxld->nr_targets = nr_targets;
-	seqlock_init(&cxld->target_lock);
 	dev = &cxld->dev;
 	device_initialize(dev);
 	device_set_pm_not_required(dev);
@@ -1274,12 +1316,12 @@ EXPORT_SYMBOL_NS_GPL(cxl_switch_decoder_alloc, CXL);
  *
  * Return: A new cxl decoder to be registered by cxl_decoder_add()
  */
-struct cxl_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
+struct cxl_endpoint_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
 {
 	if (!is_cxl_endpoint(port))
 		return ERR_PTR(-EINVAL);
 
-	return cxl_decoder_alloc(port, 0);
+	return to_cxl_endpoint_decoder(cxl_decoder_alloc(port, 0));
 }
 EXPORT_SYMBOL_NS_GPL(cxl_endpoint_decoder_alloc, CXL);
 
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 6517d5cdf5ee..85fd5e84f978 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -193,6 +193,18 @@ enum cxl_decoder_type {
  */
 #define CXL_DECODER_MAX_INTERLEAVE 16
 
+/**
+ * struct cxl_decoder_targets - Target information for root and switch decoders.
+ * @target_lock: coordinate coherent reads of the target list
+ * @nr_targets: number of elements in @target
+ * @target: active ordered target list in current decoder configuration
+ */
+struct cxl_decoder_targets {
+	seqlock_t target_lock;
+	int nr_targets;
+	struct cxl_dport *target[];
+};
+
 /**
  * struct cxl_decoder - CXL address range decode configuration
  * @dev: this decoder's device
@@ -202,9 +214,6 @@ enum cxl_decoder_type {
  * @interleave_granularity: data stride per dport
  * @target_type: accelerator vs expander (type2 vs type3) selector
  * @flags: memory type capabilities and locking
- * @target_lock: coordinate coherent reads of the target list
- * @nr_targets: number of elements in @target
- * @target: active ordered target list in current decoder configuration
  */
 struct cxl_decoder {
 	struct device dev;
@@ -214,11 +223,46 @@ struct cxl_decoder {
 	int interleave_granularity;
 	enum cxl_decoder_type target_type;
 	unsigned long flags;
-	seqlock_t target_lock;
-	int nr_targets;
-	struct cxl_dport *target[];
 };
 
+/**
+ * struct cxl_endpoint_decoder - An decoder residing in a CXL endpoint.
+ * @base: Base class decoder
+ */
+struct cxl_endpoint_decoder {
+	struct cxl_decoder base;
+};
+
+/**
+ * struct cxl_switch_decoder - A decoder in a switch or hostbridge.
+ * @base: Base class decoder
+ * @targets: Downstream targets for this switch.
+ */
+struct cxl_switch_decoder {
+	struct cxl_decoder base;
+	struct cxl_decoder_targets *targets;
+};
+
+/**
+ * struct cxl_root_decoder - A toplevel/platform decoder
+ * @base: Base class decoder
+ * @targets: Downstream targets (ie. hostbridges).
+ */
+struct cxl_root_decoder {
+	struct cxl_decoder base;
+	struct cxl_decoder_targets *targets;
+};
+
+#define _to_cxl_decoder(x)                                                     \
+	static inline struct cxl_##x##_decoder *to_cxl_##x##_decoder(          \
+		struct cxl_decoder *cxld)                                      \
+	{                                                                      \
+		return container_of(cxld, struct cxl_##x##_decoder, base);     \
+	}
+
+_to_cxl_decoder(root)
+_to_cxl_decoder(switch)
+_to_cxl_decoder(endpoint)
 
 /**
  * enum cxl_nvdimm_brige_state - state machine for managing bus rescans
@@ -343,11 +387,22 @@ struct cxl_decoder *cxl_root_decoder_alloc(struct cxl_port *port,
 struct cxl_decoder *cxl_switch_decoder_alloc(struct cxl_port *port,
 					     unsigned int nr_targets);
 int cxl_decoder_add(struct cxl_decoder *cxld, int *target_map);
-struct cxl_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port);
+struct cxl_endpoint_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port);
 int cxl_decoder_add_locked(struct cxl_decoder *cxld, int *target_map);
 int cxl_decoder_autoremove(struct device *host, struct cxl_decoder *cxld);
 int cxl_endpoint_autoremove(struct cxl_memdev *cxlmd, struct cxl_port *endpoint);
 
+static inline struct cxl_decoder_targets *
+cxl_get_decoder_targets(struct cxl_decoder *cxld)
+{
+	if (is_root_decoder(&cxld->dev))
+		return to_cxl_root_decoder(cxld)->targets;
+	else if (is_endpoint_decoder(&cxld->dev))
+		return NULL;
+	else
+		return to_cxl_switch_decoder(cxld)->targets;
+}
+
 struct cxl_hdm;
 struct cxl_hdm *devm_cxl_setup_hdm(struct cxl_port *port);
 int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm);
diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
index 431f2bddf6c8..0534d96486eb 100644
--- a/tools/testing/cxl/test/cxl.c
+++ b/tools/testing/cxl/test/cxl.c
@@ -454,7 +454,7 @@ static int mock_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
 		if (target_count)
 			cxld = cxl_switch_decoder_alloc(port, target_count);
 		else
-			cxld = cxl_endpoint_decoder_alloc(port);
+			cxld = &cxl_endpoint_decoder_alloc(port)->base;
 		if (IS_ERR(cxld)) {
 			dev_warn(&port->dev,
 				 "Failed to allocate the decoder\n");
-- 
2.35.1



* [RFC PATCH 05/15] cxl/acpi: Reserve CXL resources from request_free_mem_region
  2022-04-13 18:37 [RFC PATCH 00/15] Region driver Ben Widawsky
                   ` (3 preceding siblings ...)
  2022-04-13 18:37 ` [RFC PATCH 04/15] cxl/core: Create distinct decoder structs Ben Widawsky
@ 2022-04-13 18:37 ` Ben Widawsky
  2022-04-18 16:42   ` Dan Williams
  2022-04-13 18:37 ` [RFC PATCH 06/15] cxl/acpi: Manage root decoder's address space Ben Widawsky
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Ben Widawsky @ 2022-04-13 18:37 UTC (permalink / raw)
  To: linux-cxl, nvdimm
  Cc: patches, Ben Widawsky, Dan Williams, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma

Define an API which allows CXL drivers to manage CXL address space.
CXL is unique in that the address space and various properties are only
known after CXL drivers come up, and therefore cannot be part of core
memory enumeration.

Compute Express Link 2.0 [ECN] defines a concept called CXL Fixed Memory
Window Structures (CFMWS). Each CFMWS conveys a region of host physical
address (HPA) space which has certain properties that are familiar to
CXL, mainly interleave properties, and restrictions, such as
persistence. The HPA ranges therefore should be owned, or at least
guided by the relevant CXL driver, cxl_acpi [1].

It would be desirable to simply insert this address space into
iomem_resource with a new flag to denote this is CXL memory. This would
permit request_free_mem_region() to be reused for CXL memory provided it
learned some new tricks. For that, it is tempting to simply use
insert_resource(), an API designed specifically for cases where new
devices may offer new address space. This cannot work in the general
case, however. Boot firmware can pass some, none, or all of the CFMWS
range as various types of memory to the kernel, and this may be left
alone, merged, or even expanded. As a result iomem_resource may
intersect CFMWS regions in ways insert_resource() cannot handle [2].
Similar reasoning applies to allocate_resource().

With the insert_resource option out, the only reasonable approach left
is to let the CXL driver manage the address space independently of
iomem_resource and attempt to prevent users of device private memory
APIs from using CXL memory. In the case where cxl_acpi comes up first,
the new API allows cxl to block use of any CFMWS defined address space
by assuming everything above the highest CFMWS entry is fair game. It is
expected that this effectively will prevent usage of device private
memory, but if such behavior is undesired, cxl_acpi can be blocked from
loading, or unloaded. When device private memory is used before CXL
comes up, or, there are intersections as described above, the CXL driver
will have to make sure to not reuse sysram that is BUSY.

[1]: The specification defines enumeration via ACPI, however, one could
envision devicetree, or some other hardcoded mechanisms for doing the
same thing.

[2]: A common way to hit this case is when BIOS creates a volatile
region with extra space for hotplug. In this case, you're likely to have

|<--------------HPA space---------------------->|
|<---iomem_resource -->|
| DDR  | CXL Volatile  |
|      | CFMWS for volatile w/ hotplug |
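
The floor introduced by set_request_free_min_base() can be modeled in
userspace. The sketch below is illustrative only (not the kernel code):
it mimics __request_free_mem_region()'s top-down scan, which now refuses
to dip below the end of the highest CFMWS window; the function name and
busy-range representation are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the changed loop in __request_free_mem_region(): scan
 * downward from the top of the span, skipping busy ranges, but never
 * return an address below min_base (the CFMWS floor). size must be a
 * power of two. busy[i] = { start, exclusive end }.
 */
static uint64_t find_free_region(uint64_t span_start, uint64_t span_end,
				 uint64_t size, uint64_t min_base,
				 const uint64_t busy[][2], int nbusy)
{
	uint64_t addr = (span_end + 1 - size) & ~(size - 1);

	for (; addr > size && addr >= span_start && addr >= min_base;
	     addr -= size) {
		int i, clash = 0;

		for (i = 0; i < nbusy; i++)
			if (addr < busy[i][1] && busy[i][0] < addr + size)
				clash = 1;
		if (!clash)
			return addr;
	}
	return 0; /* no room above the floor */
}
```

With a floor at the top of a 1GB CFMWS window, a 256MB device-private
request either lands above the floor or fails outright.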

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/acpi.c     | 26 ++++++++++++++++++++++++++
 include/linux/ioport.h |  1 +
 kernel/resource.c      | 11 ++++++++++-
 3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index 9b69955b90cb..0870904fe4b5 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -76,6 +76,7 @@ static int cxl_acpi_cfmws_verify(struct device *dev,
 struct cxl_cfmws_context {
 	struct device *dev;
 	struct cxl_port *root_port;
+	struct acpi_cedt_cfmws *high_cfmws;
 };
 
 static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
@@ -126,6 +127,14 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
 			cfmws->base_hpa + cfmws->window_size - 1);
 		return 0;
 	}
+
+	if (ctx->high_cfmws) {
+		if (cfmws->base_hpa > ctx->high_cfmws->base_hpa)
+			ctx->high_cfmws = cfmws;
+	} else {
+		ctx->high_cfmws = cfmws;
+	}
+
 	dev_dbg(dev, "add: %s node: %d range %#llx-%#llx\n",
 		dev_name(&cxld->dev), phys_to_target_node(cxld->range.start),
 		cfmws->base_hpa, cfmws->base_hpa + cfmws->window_size - 1);
@@ -299,6 +308,7 @@ static int cxl_acpi_probe(struct platform_device *pdev)
 	ctx = (struct cxl_cfmws_context) {
 		.dev = host,
 		.root_port = root_port,
+		.high_cfmws = NULL,
 	};
 	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, cxl_parse_cfmws, &ctx);
 
@@ -317,10 +327,25 @@ static int cxl_acpi_probe(struct platform_device *pdev)
 	if (rc < 0)
 		return rc;
 
+	if (ctx.high_cfmws) {
+		resource_size_t end =
+			ctx.high_cfmws->base_hpa + ctx.high_cfmws->window_size;
+		dev_dbg(host,
+			"Disabling free device private regions below %#llx\n",
+			end);
+		set_request_free_min_base(end);
+	}
+
 	/* In case PCI is scanned before ACPI re-trigger memdev attach */
 	return cxl_bus_rescan();
 }
 
+static int cxl_acpi_remove(struct platform_device *pdev)
+{
+	set_request_free_min_base(0);
+	return 0;
+}
+
 static const struct acpi_device_id cxl_acpi_ids[] = {
 	{ "ACPI0017" },
 	{ },
@@ -329,6 +354,7 @@ MODULE_DEVICE_TABLE(acpi, cxl_acpi_ids);
 
 static struct platform_driver cxl_acpi_driver = {
 	.probe = cxl_acpi_probe,
+	.remove = cxl_acpi_remove,
 	.driver = {
 		.name = KBUILD_MODNAME,
 		.acpi_match_table = cxl_acpi_ids,
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index ec5f71f7135b..dc41e4be5635 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -325,6 +325,7 @@ extern int
 walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start, u64 end,
 		    void *arg, int (*func)(struct resource *, void *));
 
+void set_request_free_min_base(resource_size_t val);
 struct resource *devm_request_free_mem_region(struct device *dev,
 		struct resource *base, unsigned long size);
 struct resource *request_free_mem_region(struct resource *base,
diff --git a/kernel/resource.c b/kernel/resource.c
index 34eaee179689..a4750689e529 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1774,6 +1774,14 @@ void resource_list_free(struct list_head *head)
 EXPORT_SYMBOL(resource_list_free);
 
 #ifdef CONFIG_DEVICE_PRIVATE
+static resource_size_t request_free_min_base;
+
+void set_request_free_min_base(resource_size_t val)
+{
+	request_free_min_base = val;
+}
+EXPORT_SYMBOL_GPL(set_request_free_min_base);
+
 static struct resource *__request_free_mem_region(struct device *dev,
 		struct resource *base, unsigned long size, const char *name)
 {
@@ -1799,7 +1807,8 @@ static struct resource *__request_free_mem_region(struct device *dev,
 	}
 
 	write_lock(&resource_lock);
-	for (; addr > size && addr >= base->start; addr -= size) {
+	for (; addr > size && addr >= max(base->start, request_free_min_base);
+	     addr -= size) {
 		if (__region_intersects(addr, size, 0, IORES_DESC_NONE) !=
 				REGION_DISJOINT)
 			continue;
-- 
2.35.1



* [RFC PATCH 06/15] cxl/acpi: Manage root decoder's address space
  2022-04-13 18:37 [RFC PATCH 00/15] Region driver Ben Widawsky
                   ` (4 preceding siblings ...)
  2022-04-13 18:37 ` [RFC PATCH 05/15] cxl/acpi: Reserve CXL resources from request_free_mem_region Ben Widawsky
@ 2022-04-13 18:37 ` Ben Widawsky
  2022-04-18 22:15   ` Dan Williams
  2022-04-13 18:37 ` [RFC PATCH 07/15] cxl/port: Surface ram and pmem resources Ben Widawsky
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Ben Widawsky @ 2022-04-13 18:37 UTC (permalink / raw)
  To: linux-cxl, nvdimm
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma

Use a gen_pool to manage the physical address space that is routed by
the platform decoder (root decoder). As described in 'cxl/acpi: Reserve
CXL resources from request_free_mem_region', the address space does not
coexist well if part or all of it is conveyed in the memory map to the
kernel.

Since the existing resource APIs of interest all rely on the root
decoder's address space being in iomem_resource, the choices are to roll
a new allocator based on struct resource, or to use gen_pool. gen_pool
is a good choice because it already has all the capabilities needed to
satisfy CXL programming.
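
As a rough picture of what the gen_pool setup accomplishes, here is a
plain-C toy model (this is not the genalloc API; the fixed 4GB window,
256MB chunk size standing in for the ilog2(SZ_256M) pool granularity,
and all names are invented for illustration): the whole window starts
free, and every range already claimed in iomem_resource is allocated in
place, mirroring fill_busy_mem(), so later region allocations cannot
hand it out again.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CHUNK	0x10000000ULL	/* 256M granularity */
#define NCHUNKS	16		/* 4G window */

struct toy_window {
	uint64_t base;
	unsigned char busy[NCHUNKS];
};

static void window_init(struct toy_window *w, uint64_t base)
{
	w->base = base;
	memset(w->busy, 0, sizeof(w->busy));
}

/* Analog of a gen_pool_fixed_alloc() of [start, start + len). */
static int window_claim_fixed(struct toy_window *w, uint64_t start,
			      uint64_t len)
{
	uint64_t first = (start - w->base) / CHUNK;
	uint64_t last = (start - w->base + len - 1) / CHUNK;
	uint64_t c;

	if (len == 0 || start < w->base || last >= NCHUNKS)
		return -1;
	/* Check first, then mark, so a failed claim changes nothing. */
	for (c = first; c <= last; c++)
		if (w->busy[c])
			return -1;
	for (c = first; c <= last; c++)
		w->busy[c] = 1;
	return 0;
}
```

A BIOS-claimed range carved out at init makes any later fixed
allocation of the same chunks fail, which is the property the driver
relies on.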

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/acpi.c | 36 ++++++++++++++++++++++++++++++++++++
 drivers/cxl/cxl.h  |  2 ++
 2 files changed, 38 insertions(+)

diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index 0870904fe4b5..a6b0c3181d0e 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /* Copyright(c) 2021 Intel Corporation. All rights reserved. */
 #include <linux/platform_device.h>
+#include <linux/genalloc.h>
 #include <linux/module.h>
 #include <linux/device.h>
 #include <linux/kernel.h>
@@ -79,6 +80,25 @@ struct cxl_cfmws_context {
 	struct acpi_cedt_cfmws *high_cfmws;
 };
 
+static int cfmws_cookie;
+
+static int fill_busy_mem(struct resource *res, void *_window)
+{
+	struct gen_pool *window = _window;
+	struct genpool_data_fixed gpdf;
+	unsigned long addr;
+	void *type;
+
+	gpdf.offset = res->start;
+	addr = gen_pool_alloc_algo_owner(window, resource_size(res),
+					 gen_pool_fixed_alloc, &gpdf, &type);
+	if (addr != res->start || (res->start == 0 && type != &cfmws_cookie))
+		return -ENXIO;
+
+	pr_devel("%pR removed from CFMWS\n", res);
+	return 0;
+}
+
 static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
 			   const unsigned long end)
 {
@@ -88,6 +108,8 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
 	struct device *dev = ctx->dev;
 	struct acpi_cedt_cfmws *cfmws;
 	struct cxl_decoder *cxld;
+	struct gen_pool *window;
+	char name[64];
 	int rc, i;
 
 	cfmws = (struct acpi_cedt_cfmws *) header;
@@ -116,6 +138,20 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
 	cxld->interleave_ways = CFMWS_INTERLEAVE_WAYS(cfmws);
 	cxld->interleave_granularity = CFMWS_INTERLEAVE_GRANULARITY(cfmws);
 
+	sprintf(name, "cfmws@%#llx", cfmws->base_hpa);
+	window = devm_gen_pool_create(dev, ilog2(SZ_256M), NUMA_NO_NODE, name);
+	if (IS_ERR(window))
+		return 0;
+
+	gen_pool_add_owner(window, cfmws->base_hpa, -1, cfmws->window_size,
+			   NUMA_NO_NODE, &cfmws_cookie);
+
+	/* Area claimed by other resources, remove those from the gen_pool. */
+	walk_iomem_res_desc(IORES_DESC_NONE, 0, cfmws->base_hpa,
+			    cfmws->base_hpa + cfmws->window_size - 1, window,
+			    fill_busy_mem);
+	to_cxl_root_decoder(cxld)->window = window;
+
 	rc = cxl_decoder_add(cxld, target_map);
 	if (rc)
 		put_device(&cxld->dev);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 85fd5e84f978..0e1c65761ead 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -246,10 +246,12 @@ struct cxl_switch_decoder {
 /**
  * struct cxl_root_decoder - A toplevel/platform decoder
  * @base: Base class decoder
+ * @window: host address space allocator
  * @targets: Downstream targets (ie. hostbridges).
  */
 struct cxl_root_decoder {
 	struct cxl_decoder base;
+	struct gen_pool *window;
 	struct cxl_decoder_targets *targets;
 };
 
-- 
2.35.1



* [RFC PATCH 07/15] cxl/port: Surface ram and pmem resources
  2022-04-13 18:37 [RFC PATCH 00/15] Region driver Ben Widawsky
                   ` (5 preceding siblings ...)
  2022-04-13 18:37 ` [RFC PATCH 06/15] cxl/acpi: Manage root decoder's address space Ben Widawsky
@ 2022-04-13 18:37 ` Ben Widawsky
  2022-04-13 18:37 ` [RFC PATCH 08/15] cxl/core/hdm: Allocate resources from the media Ben Widawsky
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 53+ messages in thread
From: Ben Widawsky @ 2022-04-13 18:37 UTC (permalink / raw)
  To: linux-cxl, nvdimm
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma

CXL Type 2 and 3 endpoints may contain Host-managed Device Memory (HDM).
This memory can be either volatile, persistent, or some combination of
both. Similar to the root decoder, the port's resources can be considered
the host memory from which decoders allocate. Unlike the root decoder
resource, device resources are in the device physical address space
domain.

The CXL specification mandates a specific partitioning of volatile vs.
persistent capacities. While an endpoint may contain one or both
capacities, the volatile capacity will always be first. To accommodate
this, two parameters are added to port creation: the offset of the
split, and the total capacity.
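
A minimal sketch of how the two parameters split device physical
address space (the helper and struct names are invented for
illustration, not part of the patch): volatile capacity occupies
[0, pmem_offset - 1] and persistent capacity occupies
[pmem_offset, capacity - 1].

```c
#include <assert.h>
#include <stdint.h>

struct dpa_part {
	uint64_t ram_start, ram_len;
	uint64_t pmem_start, pmem_len;
};

/* Volatile capacity always comes first in DPA space. */
static struct dpa_part dpa_partition(uint64_t capacity, uint64_t pmem_offset)
{
	struct dpa_part p = {
		.ram_start = 0,
		.ram_len = pmem_offset,
		.pmem_start = pmem_offset,
		.pmem_len = capacity - pmem_offset,
	};
	return p;
}
```

A 1GB device with a 256MB volatile partition yields 256MB of ram
followed by 768MB of pmem, which matches the ram_range/pmem_range math
in create_endpoint().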

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/core/port.c | 19 +++++++++++++++++++
 drivers/cxl/cxl.h       | 11 +++++++++++
 drivers/cxl/mem.c       |  7 +++++--
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 8dd29c97e318..0d946711685b 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -2,6 +2,7 @@
 /* Copyright(c) 2020 Intel Corporation. All rights reserved. */
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/workqueue.h>
+#include <linux/genalloc.h>
 #include <linux/device.h>
 #include <linux/module.h>
 #include <linux/pci.h>
@@ -469,6 +470,24 @@ struct cxl_port *devm_cxl_add_port(struct device *host, struct device *uport,
 }
 EXPORT_SYMBOL_NS_GPL(devm_cxl_add_port, CXL);
 
+struct cxl_port *devm_cxl_add_endpoint_port(struct device *host,
+					    struct device *uport,
+					    resource_size_t component_reg_phys,
+					    u64 capacity, u64 pmem_offset,
+					    struct cxl_port *parent_port)
+{
+	struct cxl_port *ep =
+		devm_cxl_add_port(host, uport, component_reg_phys, parent_port);
+	if (IS_ERR(ep) || !capacity)
+		return ep;
+
+	ep->capacity = capacity;
+	ep->pmem_offset = pmem_offset;
+
+	return ep;
+}
+EXPORT_SYMBOL_NS_GPL(devm_cxl_add_endpoint_port, CXL);
+
 struct pci_bus *cxl_port_to_pci_bus(struct cxl_port *port)
 {
 	/* There is no pci_bus associated with a CXL platform-root port */
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 0e1c65761ead..52295548a071 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -309,6 +309,9 @@ struct cxl_nvdimm {
  * @component_reg_phys: component register capability base address (optional)
  * @dead: last ep has been removed, force port re-creation
  * @depth: How deep this port is relative to the root. depth 0 is the root.
+ * @capacity: How much total storage the media can hold (endpoint only)
+ * @pmem_offset: Partition dividing volatile, [0, pmem_offset -1 ], and persistent
+ *		 [pmem_offset, capacity - 1] addresses.
  */
 struct cxl_port {
 	struct device dev;
@@ -320,6 +323,9 @@ struct cxl_port {
 	resource_size_t component_reg_phys;
 	bool dead;
 	unsigned int depth;
+
+	u64 capacity;
+	u64 pmem_offset;
 };
 
 /**
@@ -368,6 +374,11 @@ struct pci_bus *cxl_port_to_pci_bus(struct cxl_port *port);
 struct cxl_port *devm_cxl_add_port(struct device *host, struct device *uport,
 				   resource_size_t component_reg_phys,
 				   struct cxl_port *parent_port);
+struct cxl_port *devm_cxl_add_endpoint_port(struct device *host,
+					    struct device *uport,
+					    resource_size_t component_reg_phys,
+					    u64 capacity, u64 pmem_offset,
+					    struct cxl_port *parent_port);
 struct cxl_port *find_cxl_root(struct device *dev);
 int devm_cxl_enumerate_ports(struct cxl_memdev *cxlmd);
 int cxl_bus_rescan(void);
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 49a4b1c47299..b27ce13c1872 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -50,9 +50,12 @@ static int create_endpoint(struct cxl_memdev *cxlmd,
 {
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	struct cxl_port *endpoint;
+	u64 partition = range_len(&cxlds->ram_range);
+	u64 size = range_len(&cxlds->ram_range) + range_len(&cxlds->pmem_range);
 
-	endpoint = devm_cxl_add_port(&parent_port->dev, &cxlmd->dev,
-				     cxlds->component_reg_phys, parent_port);
+	endpoint = devm_cxl_add_endpoint_port(&parent_port->dev, &cxlmd->dev,
+					      cxlds->component_reg_phys, size,
+					      partition, parent_port);
 	if (IS_ERR(endpoint))
 		return PTR_ERR(endpoint);
 
-- 
2.35.1



* [RFC PATCH 08/15] cxl/core/hdm: Allocate resources from the media
  2022-04-13 18:37 [RFC PATCH 00/15] Region driver Ben Widawsky
                   ` (6 preceding siblings ...)
  2022-04-13 18:37 ` [RFC PATCH 07/15] cxl/port: Surface ram and pmem resources Ben Widawsky
@ 2022-04-13 18:37 ` Ben Widawsky
  2022-04-13 18:37 ` [RFC PATCH 09/15] cxl/core/port: Add attrs for size and volatility Ben Widawsky
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 53+ messages in thread
From: Ben Widawsky @ 2022-04-13 18:37 UTC (permalink / raw)
  To: linux-cxl, nvdimm
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma

Similar to how decoders consume address space for the root decoder, they
also consume space on the device's physical media. For future
allocations, it's required to mark those as used/busy.

The CXL specification requires that HDM decoders are programmed in
ascending physical address order. The device's address space can
therefore be managed by a simple allocator. Fragmentation may occur if
devices are taken in and out of active decoding. Fixing this is left to
userspace to handle.
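
The ascending-order rule means a running base is the whole allocator. A
userspace sketch (types and names invented; it mirrors the accounting in
this series' enumeration loop, where the base advances by each decoder's
skip count plus its decode length):

```c
#include <assert.h>
#include <stdint.h>

struct toy_decoder {
	uint64_t skip;	/* DPA skip count read from the decoder */
	uint64_t len;	/* decode length */
	uint64_t start;	/* computed DPA start of the extent */
};

/* Place decoders in ascending DPA order; returns total DPA consumed. */
static uint64_t place_decoders(struct toy_decoder *d, int n)
{
	uint64_t base = 0;
	int i;

	for (i = 0; i < n; i++) {
		d[i].start = base;
		base += d[i].skip + d[i].len;
	}
	return base;
}
```

Fragmentation arises exactly because this allocator never backfills: a
freed middle extent leaves a hole that only userspace reconfiguration
can reclaim.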

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/core/core.h |  3 +++
 drivers/cxl/core/hdm.c  | 26 +++++++++++++++++++++++++-
 drivers/cxl/core/port.c |  9 ++++++++-
 drivers/cxl/cxl.h       | 10 ++++++++++
 4 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 1a50c0fc399c..a507a2502127 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -9,6 +9,9 @@ extern const struct device_type cxl_nvdimm_type;
 
 extern struct attribute_group cxl_base_attribute_group;
 
+extern struct device_attribute dev_attr_create_pmem_region;
+extern struct device_attribute dev_attr_delete_region;
+
 struct cxl_send_command;
 struct cxl_mem_query_commands;
 int cxl_query_cmd(struct cxl_memdev *cxlmd,
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 37c09c77e9a7..5326a2cd6968 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /* Copyright(c) 2022 Intel Corporation. All rights reserved. */
 #include <linux/io-64-nonatomic-hi-lo.h>
+#include <linux/genalloc.h>
 #include <linux/device.h>
 #include <linux/delay.h>
 
@@ -198,8 +199,11 @@ static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
 	else
 		cxld->target_type = CXL_DECODER_ACCELERATOR;
 
-	if (is_endpoint_decoder(&cxld->dev))
+	if (is_endpoint_decoder(&cxld->dev)) {
+		to_cxl_endpoint_decoder(cxld)->skip =
+			ioread64_hi_lo(hdm + CXL_HDM_DECODER0_TL_LOW(which));
 		return 0;
+	}
 
 	target_list.value =
 		ioread64_hi_lo(hdm + CXL_HDM_DECODER0_TL_LOW(which));
@@ -218,6 +222,7 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
 	void __iomem *hdm = cxlhdm->regs.hdm_decoder;
 	struct cxl_port *port = cxlhdm->port;
 	int i, committed, failed;
+	u64 base = 0;
 	u32 ctrl;
 
 	/*
@@ -240,6 +245,7 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
 	for (i = 0, failed = 0; i < cxlhdm->decoder_count; i++) {
 		int target_map[CXL_DECODER_MAX_INTERLEAVE] = { 0 };
 		int rc, target_count = cxlhdm->target_count;
+		struct cxl_endpoint_decoder *cxled;
 		struct cxl_decoder *cxld;
 
 		if (is_cxl_endpoint(port))
@@ -267,6 +273,24 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
 				 "Failed to add decoder to port\n");
 			return rc;
 		}
+
+		if (!is_cxl_endpoint(port))
+			continue;
+
+		cxled = to_cxl_endpoint_decoder(cxld);
+		cxled->drange = (struct range) {
+			.start = base,
+			.end = base + range_len(&cxld->range) - 1,
+		};
+
+		if (!range_len(&cxld->range))
+			continue;
+
+		dev_dbg(&cxld->dev,
+			"Enumerated decoder with DPA range %#llx-%#llx\n", base,
+			base + range_len(&cxled->drange));
+		base += cxled->skip + range_len(&cxld->range);
+		port->last_cxled = cxled;
 	}
 
 	if (failed == cxlhdm->decoder_count) {
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 0d946711685b..9ef8d69dbfa5 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -84,7 +84,14 @@ static ssize_t size_show(struct device *dev, struct device_attribute *attr,
 {
 	struct cxl_decoder *cxld = to_cxl_decoder(dev);
 
-	return sysfs_emit(buf, "%#llx\n", range_len(&cxld->range));
+	if (is_endpoint_decoder(dev)) {
+		struct cxl_endpoint_decoder *cxled;
+
+		cxled = to_cxl_endpoint_decoder(cxld);
+		return sysfs_emit(buf, "%#llx\n", range_len(&cxled->drange));
+	} else {
+		return sysfs_emit(buf, "%#llx\n", range_len(&cxld->range));
+	}
 }
 static DEVICE_ATTR_RO(size);
 
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 52295548a071..33f8a55f2f84 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -228,9 +228,13 @@ struct cxl_decoder {
 /**
  * struct cxl_endpoint_decoder - An decoder residing in a CXL endpoint.
  * @base: Base class decoder
+ * @drange: Device physical address space this decoder is using
+ * @skip: The skip count as specified in the CXL specification.
  */
 struct cxl_endpoint_decoder {
 	struct cxl_decoder base;
+	struct range drange;
+	u64 skip;
 };
 
 /**
@@ -248,11 +252,15 @@ struct cxl_switch_decoder {
  * @base: Base class decoder
  * @window: host address space allocator
  * @targets: Downstream targets (ie. hostbridges).
+ * @next_region_id: The pre-cached next region id.
+ * @id_lock: Protects next_region_id
  */
 struct cxl_root_decoder {
 	struct cxl_decoder base;
 	struct gen_pool *window;
 	struct cxl_decoder_targets *targets;
+	int next_region_id;
+	struct mutex id_lock; /* synchronizes access to next_region_id */
 };
 
 #define _to_cxl_decoder(x)                                                     \
@@ -312,6 +320,7 @@ struct cxl_nvdimm {
  * @capacity: How much total storage the media can hold (endpoint only)
  * @pmem_offset: Partition dividing volatile, [0, pmem_offset -1 ], and persistent
  *		 [pmem_offset, capacity - 1] addresses.
+ * @last_cxled: Last active decoder doing decode (endpoint only)
  */
 struct cxl_port {
 	struct device dev;
@@ -326,6 +335,7 @@ struct cxl_port {
 
 	u64 capacity;
 	u64 pmem_offset;
+	struct cxl_endpoint_decoder *last_cxled;
 };
 
 /**
-- 
2.35.1



* [RFC PATCH 09/15] cxl/core/port: Add attrs for size and volatility
  2022-04-13 18:37 [RFC PATCH 00/15] Region driver Ben Widawsky
                   ` (7 preceding siblings ...)
  2022-04-13 18:37 ` [RFC PATCH 08/15] cxl/core/hdm: Allocate resources from the media Ben Widawsky
@ 2022-04-13 18:37 ` Ben Widawsky
  2022-04-13 18:37 ` [RFC PATCH 10/15] cxl/core: Extract IW/IG decoding Ben Widawsky
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 53+ messages in thread
From: Ben Widawsky @ 2022-04-13 18:37 UTC (permalink / raw)
  To: linux-cxl, nvdimm
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma

Endpoint decoders have two decoder-unique properties: their range is
constrained by the media they're a part of, and they have a concrete
need to disambiguate between volatile and persistent capacity (due to
partitioning). As part of region programming, these decoders will be
required to be pre-configured, ie, have their size and volatility set.

Endpoint decoders must consider two different address spaces for address
allocation. Sysram will need to be mapped for use of this memory if not
set up in the EFI memory map. Additionally, the CXL device itself has
its own address space domain, which requires allocation and management.
Device address space is managed with a simple allocator and host
physical address space is managed by the region driver/core.

/sys/bus/cxl/devices/decoder3.0
├── devtype
├── interleave_granularity
├── interleave_ways
├── locked
├── modalias
├── size
├── start
├── subsystem -> ../../../../../../../bus/cxl
├── target_type
├── uevent
└── volatile

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 Documentation/ABI/testing/sysfs-bus-cxl |  13 ++-
 drivers/cxl/Kconfig                     |   3 +-
 drivers/cxl/core/port.c                 | 145 +++++++++++++++++++++++-
 drivers/cxl/cxl.h                       |   6 +
 4 files changed, 163 insertions(+), 4 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 7c2b846521f3..01fee09b8473 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -117,7 +117,9 @@ Description:
 		range is fixed. For decoders of devtype "cxl_decoder_switch" the
 		address is bounded by the decode range of the cxl_port ancestor
 		of the decoder's cxl_port, and dynamically updates based on the
-		active memory regions in that address space.
+		active memory regions in that address space. For decoders of
+		devtype "cxl_decoder_endpoint", size is a mutable value which
+		carves out space from the physical media.
 
 What:		/sys/bus/cxl/devices/decoderX.Y/locked
 Date:		June, 2021
@@ -163,3 +165,12 @@ Description:
 		memory (type-3). The 'target_type' attribute indicates the
 		current setting which may dynamically change based on what
 		memory regions are activated in this decode hierarchy.
+
+What:		/sys/bus/cxl/devices/decoderX.Y/volatile
+Date:		March, 2022
+KernelVersion:	v5.19
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		Provide a knob to set/get whether the desired media is volatile
+		or persistent. This applies only to decoders of devtype
+		"cxl_decoder_endpoint".
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index b88ab956bb7c..8796fd4b22bc 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -95,7 +95,8 @@ config CXL_MEM
 	  If unsure say 'm'.
 
 config CXL_PORT
-	default CXL_BUS
 	tristate
+	default CXL_BUS
+	select DEVICE_PRIVATE
 
 endif
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 9ef8d69dbfa5..bdafdec80d98 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -4,6 +4,7 @@
 #include <linux/workqueue.h>
 #include <linux/genalloc.h>
 #include <linux/device.h>
+#include <linux/ioport.h>
 #include <linux/module.h>
 #include <linux/pci.h>
 #include <linux/slab.h>
@@ -80,7 +81,7 @@ static ssize_t start_show(struct device *dev, struct device_attribute *attr,
 static DEVICE_ATTR_ADMIN_RO(start);
 
 static ssize_t size_show(struct device *dev, struct device_attribute *attr,
-			char *buf)
+			 char *buf)
 {
 	struct cxl_decoder *cxld = to_cxl_decoder(dev);
 
@@ -93,7 +94,144 @@ static ssize_t size_show(struct device *dev, struct device_attribute *attr,
 		return sysfs_emit(buf, "%#llx\n", range_len(&cxld->range));
 	}
 }
-static DEVICE_ATTR_RO(size);
+
+static struct cxl_endpoint_decoder *
+get_prev_decoder(struct cxl_endpoint_decoder *cxled)
+{
+	struct cxl_port *port = to_cxl_port(cxled->base.dev.parent);
+	struct device *cxldd;
+	char *name;
+
+	if (cxled->base.id == 0)
+		return NULL;
+
+	name = kasprintf(GFP_KERNEL, "decoder%u.%u", port->id, cxled->base.id);
+	if (!name)
+		return ERR_PTR(-ENOMEM);
+
+	cxldd = device_find_child_by_name(&port->dev, name);
+	kfree(name);
+	if (cxldd) {
+		struct cxl_decoder *cxld = to_cxl_decoder(cxldd);
+
+		if (dev_WARN_ONCE(&port->dev,
+				  (cxld->flags & CXL_DECODER_F_ENABLE) == 0,
+				  "%s should be enabled\n",
+				  dev_name(&cxld->dev)))
+			return NULL;
+		return to_cxl_endpoint_decoder(cxld);
+	}
+
+	return NULL;
+}
+
+static ssize_t size_store(struct device *dev, struct device_attribute *attr,
+			  const char *buf, size_t len)
+{
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+	struct cxl_endpoint_decoder *cxled = to_cxl_endpoint_decoder(cxld);
+	struct cxl_port *port = to_cxl_port(cxled->base.dev.parent);
+	struct cxl_endpoint_decoder *prev = get_prev_decoder(cxled);
+	u64 size, dpa_base = 0;
+	int rc;
+
+	rc = kstrtou64(buf, 0, &size);
+	if (rc)
+		return rc;
+
+	if (size % SZ_256M)
+		return -EINVAL;
+
+	rc = mutex_lock_interruptible(&cxled->res_lock);
+	if (rc)
+		return rc;
+
+	/* No change */
+	if (range_len(&cxled->drange) == size)
+		goto out;
+
+	rc = mutex_lock_interruptible(&port->media_lock);
+	if (rc)
+		goto out;
+
+	/* Extent was previously set */
+	if (port->last_cxled == cxled) {
+		if (size == range_len(&cxled->drange)) {
+			mutex_unlock(&port->media_lock);
+			goto out;
+		}
+
+		if (!size) {
+			dev_dbg(dev,
+				"freeing previous reservation %#llx-%#llx\n",
+				cxled->drange.start, cxled->drange.end);
+			port->last_cxled = prev;
+			mutex_unlock(&port->media_lock);
+			goto out;
+		}
+	}
+
+	if (prev)
+		dpa_base = port->last_cxled->drange.end + 1;
+
+	if ((dpa_base + size) > port->capacity)
+		rc = -ENOSPC;
+	else
+		port->last_cxled = cxled;
+
+	mutex_unlock(&port->media_lock);
+	if (rc)
+		goto out;
+
+	cxled->drange = (struct range) {
+		.start = dpa_base,
+		.end = dpa_base + size - 1
+	};
+
+	dev_dbg(dev, "Allocated %#llx-%#llx from media\n", cxled->drange.start,
+		cxled->drange.end);
+
+out:
+	mutex_unlock(&cxled->res_lock);
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(size);
+
+static ssize_t volatile_show(struct device *dev, struct device_attribute *attr,
+			     char *buf)
+{
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+	struct cxl_endpoint_decoder *cxled = to_cxl_endpoint_decoder(cxld);
+
+	return sysfs_emit(buf, "%u\n", cxled->volatil);
+}
+
+static ssize_t volatile_store(struct device *dev, struct device_attribute *attr,
+			      const char *buf, size_t len)
+{
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+	struct cxl_endpoint_decoder *cxled = to_cxl_endpoint_decoder(cxld);
+	bool p;
+	int rc;
+
+	rc = kstrtobool(buf, &p);
+	if (rc)
+		return rc;
+
+	rc = mutex_lock_interruptible(&cxled->res_lock);
+	if (rc)
+		return rc;
+
+	if (range_len(&cxled->drange) > 0)
+		rc = -EBUSY;
+	mutex_unlock(&cxled->res_lock);
+	if (rc)
+		return rc;
+
+	cxled->volatil = p;
+	return len;
+}
+static DEVICE_ATTR_RW(volatile);
 
 #define CXL_DECODER_FLAG_ATTR(name, flag)                            \
 static ssize_t name##_show(struct device *dev,                       \
@@ -211,6 +349,7 @@ static const struct attribute_group *cxl_decoder_root_attribute_groups[] = {
 
 static struct attribute *cxl_decoder_endpoint_attrs[] = {
 	&dev_attr_target_type.attr,
+	&dev_attr_volatile.attr,
 	NULL,
 };
 
@@ -413,6 +552,7 @@ static struct cxl_port *cxl_port_alloc(struct device *uport,
 	ida_init(&port->decoder_ida);
 	INIT_LIST_HEAD(&port->dports);
 	INIT_LIST_HEAD(&port->endpoints);
+	mutex_init(&port->media_lock);
 
 	device_initialize(dev);
 	device_set_pm_not_required(dev);
@@ -1191,6 +1331,7 @@ static struct cxl_decoder *__cxl_decoder_alloc(struct cxl_port *port,
 		cxled = kzalloc(sizeof(*cxled), GFP_KERNEL);
 		if (!cxled)
 			return NULL;
+		mutex_init(&cxled->res_lock);
 		cxld = &cxled->base;
 	} else if (is_cxl_root(port)) {
 		struct cxl_root_decoder *cxlrd;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 33f8a55f2f84..07df13f05d3d 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -230,11 +230,15 @@ struct cxl_decoder {
  * @base: Base class decoder
  * @drange: Device physical address space this decoder is using
  * @skip: The skip count as specified in the CXL specification.
+ * @res_lock: Synchronize device's resource usage
+ * @volatil: Configuration param. Decoder target is non-persistent mem
  */
 struct cxl_endpoint_decoder {
 	struct cxl_decoder base;
 	struct range drange;
 	u64 skip;
+	struct mutex res_lock; /* sync access to decoder's resource */
+	bool volatil;
 };
 
 /**
@@ -321,6 +325,7 @@ struct cxl_nvdimm {
  * @pmem_offset: Partition dividing volatile, [0, pmem_offset -1 ], and persistent
  *		 [pmem_offset, capacity - 1] addresses.
  * @last_cxled: Last active decoder doing decode (endpoint only)
+ * @media_lock: Synchronizes use of allocation of media (endpoint only)
  */
 struct cxl_port {
 	struct device dev;
@@ -336,6 +341,7 @@ struct cxl_port {
 	u64 capacity;
 	u64 pmem_offset;
 	struct cxl_endpoint_decoder *last_cxled;
+	struct mutex media_lock; /* sync access to media allocator */
 };
 
 /**
-- 
2.35.1



* [RFC PATCH 10/15] cxl/core: Extract IW/IG decoding
  2022-04-13 18:37 [RFC PATCH 00/15] Region driver Ben Widawsky
                   ` (8 preceding siblings ...)
  2022-04-13 18:37 ` [RFC PATCH 09/15] cxl/core/port: Add attrs for size and volatility Ben Widawsky
@ 2022-04-13 18:37 ` Ben Widawsky
  2022-04-13 18:37 ` [RFC PATCH 11/15] cxl/acpi: Use common " Ben Widawsky
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 53+ messages in thread
From: Ben Widawsky @ 2022-04-13 18:37 UTC (permalink / raw)
  To: linux-cxl, nvdimm
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma

Interleave granularity and ways have specification-defined encodings.
Extracting this functionality into the common header file allows other
consumers to make use of it.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/core/hdm.c | 11 ++---------
 drivers/cxl/cxl.h      | 17 +++++++++++++++++
 2 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 5326a2cd6968..b4b65aa55bd2 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -134,21 +134,14 @@ static int to_interleave_granularity(u32 ctrl)
 {
 	int val = FIELD_GET(CXL_HDM_DECODER0_CTRL_IG_MASK, ctrl);
 
-	return 256 << val;
+	return cxl_to_interleave_granularity(val);
 }
 
 static int to_interleave_ways(u32 ctrl)
 {
 	int val = FIELD_GET(CXL_HDM_DECODER0_CTRL_IW_MASK, ctrl);
 
-	switch (val) {
-	case 0 ... 4:
-		return 1 << val;
-	case 8 ... 10:
-		return 3 << (val - 8);
-	default:
-		return 0;
-	}
+	return cxl_to_interleave_ways(val);
 }
 
 static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 07df13f05d3d..0586c3d4592c 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -64,6 +64,23 @@ static inline int cxl_hdm_decoder_count(u32 cap_hdr)
 	return val ? val * 2 : 1;
 }
 
+static inline int cxl_to_interleave_granularity(u16 ig)
+{
+	return 256 << ig;
+}
+
+static inline int cxl_to_interleave_ways(u8 eniw)
+{
+	switch (eniw) {
+	case 0 ... 4:
+		return 1 << eniw;
+	case 8 ... 10:
+		return 3 << (eniw - 8);
+	default:
+		return 0;
+	}
+}
+
 /* CXL 2.0 8.2.8.1 Device Capabilities Array Register */
 #define CXLDEV_CAP_ARRAY_OFFSET 0x0
 #define   CXLDEV_CAP_ARRAY_CAP_ID 0
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 11/15] cxl/acpi: Use common IW/IG decoding
  2022-04-13 18:37 [RFC PATCH 00/15] Region driver Ben Widawsky
                   ` (9 preceding siblings ...)
  2022-04-13 18:37 ` [RFC PATCH 10/15] cxl/core: Extract IW/IG decoding Ben Widawsky
@ 2022-04-13 18:37 ` Ben Widawsky
  2022-04-13 18:37 ` [RFC PATCH 12/15] cxl/region: Add region creation ABI Ben Widawsky
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 53+ messages in thread
From: Ben Widawsky @ 2022-04-13 18:37 UTC (permalink / raw)
  To: linux-cxl, nvdimm
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma

Now that functionality to decode interleave ways and granularity is in
a common place, use that functionality in the cxl_acpi driver.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/acpi.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index a6b0c3181d0e..50e54e5d58c0 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -11,8 +11,8 @@
 #include "cxl.h"
 
 /* Encode defined in CXL 2.0 8.2.5.12.7 HDM Decoder Control Register */
-#define CFMWS_INTERLEAVE_WAYS(x)	(1 << (x)->interleave_ways)
-#define CFMWS_INTERLEAVE_GRANULARITY(x)	((x)->granularity + 8)
+#define CFMWS_INTERLEAVE_WAYS(x)	(cxl_to_interleave_ways((x)->interleave_ways))
+#define CFMWS_INTERLEAVE_GRANULARITY(x)	(cxl_to_interleave_granularity((x)->granularity))
 
 static unsigned long cfmws_to_decoder_flags(int restrictions)
 {
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 12/15] cxl/region: Add region creation ABI
  2022-04-13 18:37 [RFC PATCH 00/15] Region driver Ben Widawsky
                   ` (10 preceding siblings ...)
  2022-04-13 18:37 ` [RFC PATCH 11/15] cxl/acpi: Use common " Ben Widawsky
@ 2022-04-13 18:37 ` Ben Widawsky
  2022-05-04 22:56   ` Verma, Vishal L
  2022-04-13 18:37 ` [RFC PATCH 13/15] cxl/core/port: Add attrs for root ways & granularity Ben Widawsky
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Ben Widawsky @ 2022-04-13 18:37 UTC (permalink / raw)
  To: linux-cxl, nvdimm
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma

Regions are created as children of the root decoder that encompasses an
address space with constraints. Regions have a number of attributes that
must be configured before the region can be activated.

Multiple processes may race to create a region; the kernel arbitrates
those attempts, so no special userspace synchronization is required.

// Allocate a new region name
region=$(cat /sys/bus/cxl/devices/decoder0.0/create_pmem_region)

// Create a new region by name
while
region=$(cat /sys/bus/cxl/devices/decoder0.0/create_pmem_region)
! echo $region > /sys/bus/cxl/devices/decoder0.0/create_pmem_region
do true; done

// Region now exists in sysfs
stat -t /sys/bus/cxl/devices/decoder0.0/$region

// Delete the region, and name
echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 Documentation/ABI/testing/sysfs-bus-cxl       |  23 ++
 .../driver-api/cxl/memory-devices.rst         |  11 +
 drivers/cxl/Kconfig                           |   5 +
 drivers/cxl/core/Makefile                     |   1 +
 drivers/cxl/core/port.c                       |  39 ++-
 drivers/cxl/core/region.c                     | 234 ++++++++++++++++++
 drivers/cxl/cxl.h                             |   7 +
 drivers/cxl/region.h                          |  29 +++
 tools/testing/cxl/Kbuild                      |   1 +
 9 files changed, 347 insertions(+), 3 deletions(-)
 create mode 100644 drivers/cxl/core/region.c
 create mode 100644 drivers/cxl/region.h

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 01fee09b8473..5229f4bd109a 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -174,3 +174,26 @@ Description:
 		Provide a knob to set/get whether the desired media is volatile
 		or persistent. This applies only to decoders of devtype
 		"cxl_decoder_endpoint",
+
+What:		/sys/bus/cxl/devices/decoderX.Y/create_pmem_region
+Date:		January, 2022
+KernelVersion:	v5.19
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		Write an integer value to instantiate a new region, to be named
+		regionZ, within the decode range bounded by decoderX.Y, where X,
+		Y, and Z are unsigned integers and decoderX.Y exists in the CXL
+		sysfs topology. The value written must match the current value
+		returned from reading this attribute. This behavior lets the
+		kernel arbitrate racing attempts to create a region: the thread
+		that fails to write loops and tries the next value. Regions must
+		subsequently be configured and bound to a region driver before
+		they can be used.
+
+What:		/sys/bus/cxl/devices/decoderX.Y/delete_region
+Date:		January, 2022
+KernelVersion:	v5.19
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		Deletes the named region. The attribute expects a region number
+		as an integer.
diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst
index db476bb170b6..66ddc58a21b1 100644
--- a/Documentation/driver-api/cxl/memory-devices.rst
+++ b/Documentation/driver-api/cxl/memory-devices.rst
@@ -362,6 +362,17 @@ CXL Core
 .. kernel-doc:: drivers/cxl/core/mbox.c
    :doc: cxl mbox
 
+CXL Regions
+-----------
+.. kernel-doc:: drivers/cxl/region.h
+   :identifiers:
+
+.. kernel-doc:: drivers/cxl/core/region.c
+   :doc: cxl core region
+
+.. kernel-doc:: drivers/cxl/core/region.c
+   :identifiers:
+
 External Interfaces
 ===================
 
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 8796fd4b22bc..7ce86eee8bda 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -99,4 +99,9 @@ config CXL_PORT
 	default CXL_BUS
 	select DEVICE_PRIVATE
 
+config CXL_REGION
+	tristate
+	default CXL_BUS
+	select MEMREGION
+
 endif
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 6d37cd78b151..39ce8f2f2373 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -4,6 +4,7 @@ obj-$(CONFIG_CXL_BUS) += cxl_core.o
 ccflags-y += -I$(srctree)/drivers/cxl
 cxl_core-y := port.o
 cxl_core-y += pmem.o
+cxl_core-y += region.o
 cxl_core-y += regs.o
 cxl_core-y += memdev.o
 cxl_core-y += mbox.o
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index bdafdec80d98..5ef8a6e1ea23 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /* Copyright(c) 2020 Intel Corporation. All rights reserved. */
 #include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/memregion.h>
 #include <linux/workqueue.h>
 #include <linux/genalloc.h>
 #include <linux/device.h>
@@ -11,6 +12,7 @@
 #include <linux/idr.h>
 #include <cxlmem.h>
 #include <cxlpci.h>
+#include <region.h>
 #include <cxl.h>
 #include "core.h"
 
@@ -328,6 +330,8 @@ static struct attribute_group cxl_decoder_base_attribute_group = {
 };
 
 static struct attribute *cxl_decoder_root_attrs[] = {
+	&dev_attr_create_pmem_region.attr,
+	&dev_attr_delete_region.attr,
 	&dev_attr_cap_pmem.attr,
 	&dev_attr_cap_ram.attr,
 	&dev_attr_cap_type2.attr,
@@ -375,6 +379,8 @@ static void cxl_decoder_release(struct device *dev)
 	struct cxl_decoder *cxld = to_cxl_decoder(dev);
 	struct cxl_port *port = to_cxl_port(dev->parent);
 
+	if (is_root_decoder(dev))
+		memregion_free(to_cxl_root_decoder(cxld)->next_region_id);
 	ida_free(&port->decoder_ida, cxld->id);
 	kfree(cxld);
 	put_device(&port->dev);
@@ -1414,12 +1420,22 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
 	device_set_pm_not_required(dev);
 	dev->parent = &port->dev;
 	dev->bus = &cxl_bus_type;
-	if (is_cxl_root(port))
+	if (is_cxl_root(port)) {
+		struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxld);
+
 		cxld->dev.type = &cxl_decoder_root_type;
-	else if (is_cxl_endpoint(port))
+		mutex_init(&cxlrd->id_lock);
+		rc = memregion_alloc(GFP_KERNEL);
+		if (rc < 0)
+			goto err;
+
+		cxlrd->next_region_id = rc;
+	} else if (is_cxl_endpoint(port)) {
 		cxld->dev.type = &cxl_decoder_endpoint_type;
-	else
+	} else {
 		cxld->dev.type = &cxl_decoder_switch_type;
+	}
 
 	/* Pre initialize an "empty" decoder */
 	cxld->interleave_ways = 1;
@@ -1582,6 +1598,17 @@ EXPORT_SYMBOL_NS_GPL(cxl_decoder_add, CXL);
 
 static void cxld_unregister(void *dev)
 {
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+	struct cxl_endpoint_decoder *cxled = container_of(cxld, typeof(*cxled), base);
+
+	if (!is_endpoint_decoder(&cxld->dev) || !cxled->cxlr)
+		goto out;
+
+	mutex_lock(&cxled->cxlr->remove_lock);
+	device_release_driver(&cxled->cxlr->dev);
+	mutex_unlock(&cxled->cxlr->remove_lock);
+
+out:
 	device_unregister(dev);
 }
 
@@ -1681,6 +1708,12 @@ bool schedule_cxl_memdev_detach(struct cxl_memdev *cxlmd)
 }
 EXPORT_SYMBOL_NS_GPL(schedule_cxl_memdev_detach, CXL);
 
+bool schedule_cxl_region_unregister(struct cxl_region *cxlr)
+{
+	return queue_work(cxl_bus_wq, &cxlr->detach_work);
+}
+EXPORT_SYMBOL_NS_GPL(schedule_cxl_region_unregister, CXL);
+
 /* for user tooling to ensure port disable work has completed */
 static ssize_t flush_store(struct bus_type *bus, const char *buf, size_t count)
 {
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
new file mode 100644
index 000000000000..16829bf2f73a
--- /dev/null
+++ b/drivers/cxl/core/region.c
@@ -0,0 +1,234 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2022 Intel Corporation. All rights reserved. */
+#include <linux/memregion.h>
+#include <linux/genalloc.h>
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/idr.h>
+#include <region.h>
+#include <cxl.h>
+#include "core.h"
+
+/**
+ * DOC: cxl core region
+ *
+ * CXL Regions represent mapped memory capacity in system physical address
+ * space. Whereas the CXL Root Decoders identify the bounds of potential CXL
+ * Memory ranges, Regions represent the active mapped capacity by the HDM
+ * Decoder Capability structures throughout the Host Bridges, Switches, and
+ * Endpoints in the topology.
+ */
+
+static struct cxl_region *to_cxl_region(struct device *dev);
+
+static void cxl_region_release(struct device *dev)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	memregion_free(cxlr->id);
+	kfree(cxlr);
+}
+
+static const struct device_type cxl_region_type = {
+	.name = "cxl_region",
+	.release = cxl_region_release,
+};
+
+bool is_cxl_region(struct device *dev)
+{
+	return dev->type == &cxl_region_type;
+}
+EXPORT_SYMBOL_NS_GPL(is_cxl_region, CXL);
+
+static struct cxl_region *to_cxl_region(struct device *dev)
+{
+	if (dev_WARN_ONCE(dev, dev->type != &cxl_region_type,
+			  "not a cxl_region device\n"))
+		return NULL;
+
+	return container_of(dev, struct cxl_region, dev);
+}
+
+static void unregister_region(struct work_struct *work)
+{
+	struct cxl_region *cxlr;
+
+	cxlr = container_of(work, typeof(*cxlr), detach_work);
+	device_unregister(&cxlr->dev);
+}
+
+static void schedule_unregister(void *cxlr)
+{
+	schedule_cxl_region_unregister(cxlr);
+}
+
+static struct cxl_region *cxl_region_alloc(struct cxl_decoder *cxld)
+{
+	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxld);
+	struct cxl_region *cxlr;
+	struct device *dev;
+	int rc;
+
+	lockdep_assert_held(&cxlrd->id_lock);
+
+	rc = memregion_alloc(GFP_KERNEL);
+	if (rc < 0) {
+		dev_dbg(&cxld->dev, "Failed to get next cached id (%d)\n", rc);
+		return ERR_PTR(rc);
+	}
+
+	cxlr = kzalloc(sizeof(*cxlr), GFP_KERNEL);
+	if (!cxlr) {
+		memregion_free(rc);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	cxlr->id = cxlrd->next_region_id;
+	cxlrd->next_region_id = rc;
+
+	dev = &cxlr->dev;
+	device_initialize(dev);
+	dev->parent = &cxld->dev;
+	device_set_pm_not_required(dev);
+	dev->bus = &cxl_bus_type;
+	dev->type = &cxl_region_type;
+	INIT_WORK(&cxlr->detach_work, unregister_region);
+	mutex_init(&cxlr->remove_lock);
+
+	return cxlr;
+}
+
+/**
+ * devm_cxl_add_region - Adds a region to a decoder
+ * @cxld: Parent decoder.
+ *
+ * This is the second step of region initialization. Regions exist within an
+ * address space which is mapped by a @cxld. That @cxld must be a root decoder,
+ * and it enforces constraints upon the region as it is configured.
+ *
+ * Return: a pointer to the new region on success, else an ERR_PTR() encoded
+ * error. The region will be named "regionZ" where Z is the globally unique
+ * region id assigned at creation time.
+ */
+static struct cxl_region *devm_cxl_add_region(struct cxl_decoder *cxld)
+{
+	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
+	struct cxl_region *cxlr;
+	struct device *dev;
+	int rc;
+
+	cxlr = cxl_region_alloc(cxld);
+	if (IS_ERR(cxlr))
+		return cxlr;
+
+	dev = &cxlr->dev;
+
+	rc = dev_set_name(dev, "region%d", cxlr->id);
+	if (rc)
+		goto err_out;
+
+	rc = device_add(dev);
+	if (rc)
+		goto err_put;
+
+	rc = devm_add_action_or_reset(port->uport, schedule_unregister, cxlr);
+	if (rc)
+		goto err_put;
+
+	return cxlr;
+
+err_put:
+	put_device(&cxld->dev);
+
+err_out:
+	put_device(dev);
+	return ERR_PTR(rc);
+}
+
+static ssize_t create_pmem_region_show(struct device *dev,
+				       struct device_attribute *attr, char *buf)
+{
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxld);
+	ssize_t rc;
+
+	/*
+	 * There's no point in returning known bad answers while the lock is
+	 * held on the store side. Even though the answer given here may be
+	 * invalidated as soon as the lock is dropped, it is still useful to
+	 * throttle readers in the presence of writers.
+	 */
+	rc = mutex_lock_interruptible(&cxlrd->id_lock);
+	if (rc)
+		return rc;
+	rc = sysfs_emit(buf, "%d\n", cxlrd->next_region_id);
+	mutex_unlock(&cxlrd->id_lock);
+
+	return rc;
+}
+
+static ssize_t create_pmem_region_store(struct device *dev,
+					struct device_attribute *attr,
+					const char *buf, size_t len)
+{
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxld);
+	struct cxl_region *cxlr;
+	unsigned long id;
+	ssize_t rc;
+	rc = kstrtoul(buf, 10, &id);
+	if (rc)
+		return rc;
+
+	rc = mutex_lock_interruptible(&cxlrd->id_lock);
+	if (rc)
+		return rc;
+
+	if (cxlrd->next_region_id != id) {
+		rc = -EINVAL;
+		goto out;
+	}
+
+	cxlr = devm_cxl_add_region(cxld);
+	if (IS_ERR(cxlr))
+		rc = PTR_ERR(cxlr);
+	else
+		dev_dbg(dev, "Created %s\n", dev_name(&cxlr->dev));
+
+out:
+	mutex_unlock(&cxlrd->id_lock);
+	if (rc)
+		return rc;
+	return len;
+}
+DEVICE_ATTR_RW(create_pmem_region);
+
+static struct cxl_region *cxl_find_region_by_name(struct cxl_decoder *cxld,
+						  const char *name)
+{
+	struct device *region_dev;
+
+	region_dev = device_find_child_by_name(&cxld->dev, name);
+	if (!region_dev)
+		return ERR_PTR(-ENOENT);
+
+	return to_cxl_region(region_dev);
+}
+
+static ssize_t delete_region_store(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t len)
+{
+	struct cxl_port *port = to_cxl_port(dev->parent);
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+	struct cxl_region *cxlr;
+
+	cxlr = cxl_find_region_by_name(cxld, buf);
+	if (IS_ERR(cxlr))
+		return PTR_ERR(cxlr);
+
+	/* Reference held for wq */
+	devm_release_action(port->uport, schedule_unregister, cxlr);
+
+	return len;
+}
+DEVICE_ATTR_WO(delete_region);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 0586c3d4592c..3abc8b0cf8f4 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -249,6 +249,7 @@ struct cxl_decoder {
  * @skip: The skip count as specified in the CXL specification.
  * @res_lock: Synchronize device's resource usage
  * @volatil: Configuration param. Decoder target is non-persistent mem
+ * @cxlr: Region this decoder belongs to.
  */
 struct cxl_endpoint_decoder {
 	struct cxl_decoder base;
@@ -256,6 +257,7 @@ struct cxl_endpoint_decoder {
 	u64 skip;
 	struct mutex res_lock; /* sync access to decoder's resource */
 	bool volatil;
+	struct cxl_region *cxlr;
 };
 
 /**
@@ -454,6 +456,8 @@ struct cxl_hdm *devm_cxl_setup_hdm(struct cxl_port *port);
 int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm);
 int devm_cxl_add_passthrough_decoder(struct cxl_port *port);
 
+bool is_cxl_region(struct device *dev);
+
 extern struct bus_type cxl_bus_type;
 
 struct cxl_driver {
@@ -508,6 +512,7 @@ enum cxl_lock_class {
 	CXL_ANON_LOCK,
 	CXL_NVDIMM_LOCK,
 	CXL_NVDIMM_BRIDGE_LOCK,
+	CXL_REGION_LOCK,
 	CXL_PORT_LOCK,
 	/*
 	 * Be careful to add new lock classes here, CXL_PORT_LOCK is
@@ -536,6 +541,8 @@ static inline void cxl_nested_lock(struct device *dev)
 		mutex_lock_nested(&dev->lockdep_mutex, CXL_NVDIMM_BRIDGE_LOCK);
 	else if (is_cxl_nvdimm(dev))
 		mutex_lock_nested(&dev->lockdep_mutex, CXL_NVDIMM_LOCK);
+	else if (is_cxl_region(dev))
+		mutex_lock_nested(&dev->lockdep_mutex, CXL_REGION_LOCK);
 	else
 		mutex_lock_nested(&dev->lockdep_mutex, CXL_ANON_LOCK);
 }
diff --git a/drivers/cxl/region.h b/drivers/cxl/region.h
new file mode 100644
index 000000000000..66d9ba195c34
--- /dev/null
+++ b/drivers/cxl/region.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright(c) 2021 Intel Corporation. */
+#ifndef __CXL_REGION_H__
+#define __CXL_REGION_H__
+
+#include <linux/uuid.h>
+
+#include "cxl.h"
+
+/**
+ * struct cxl_region - CXL region
+ * @dev: This region's device.
+ * @id: This region's id. Id is globally unique across all regions.
+ * @flags: Flags representing the current state of the region.
+ * @detach_work: Async unregister to allow attrs to take device_lock.
+ * @remove_lock: Coordinates region removal against decoder removal
+ */
+struct cxl_region {
+	struct device dev;
+	int id;
+	unsigned long flags;
+#define REGION_DEAD 0
+	struct work_struct detach_work;
+	struct mutex remove_lock; /* serialize region removal */
+};
+
+bool schedule_cxl_region_unregister(struct cxl_region *cxlr);
+
+#endif
diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
index 82e49ab0937d..3fe6d34e6d59 100644
--- a/tools/testing/cxl/Kbuild
+++ b/tools/testing/cxl/Kbuild
@@ -46,6 +46,7 @@ cxl_core-y += $(CXL_CORE_SRC)/memdev.o
 cxl_core-y += $(CXL_CORE_SRC)/mbox.o
 cxl_core-y += $(CXL_CORE_SRC)/pci.o
 cxl_core-y += $(CXL_CORE_SRC)/hdm.o
+cxl_core-y += $(CXL_CORE_SRC)/region.o
 cxl_core-y += config_check.o
 
 obj-m += test/
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 13/15] cxl/core/port: Add attrs for root ways & granularity
  2022-04-13 18:37 [RFC PATCH 00/15] Region driver Ben Widawsky
                   ` (11 preceding siblings ...)
  2022-04-13 18:37 ` [RFC PATCH 12/15] cxl/region: Add region creation ABI Ben Widawsky
@ 2022-04-13 18:37 ` Ben Widawsky
  2022-04-13 18:37 ` [RFC PATCH 14/15] cxl/region: Introduce configuration Ben Widawsky
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 53+ messages in thread
From: Ben Widawsky @ 2022-04-13 18:37 UTC (permalink / raw)
  To: linux-cxl, nvdimm
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma

Region programming requires knowledge of root decoder attributes. For
example, if the root decoder supports only 256B granularity then a
region with > 256B granularity cannot work. Add sysfs attributes in
order to provide this information to userspace. The CXL driver controls
programming of switch and endpoint decoders, but the attributes are also
exported for informational purposes.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/core/port.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 5ef8a6e1ea23..19cf1fd16118 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -318,10 +318,31 @@ static ssize_t target_list_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(target_list);
 
+static ssize_t interleave_granularity_show(struct device *dev,
+					   struct device_attribute *attr,
+					   char *buf)
+{
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+
+	return sysfs_emit(buf, "%d\n", cxld->interleave_granularity);
+}
+static DEVICE_ATTR_RO(interleave_granularity);
+
+static ssize_t interleave_ways_show(struct device *dev,
+				    struct device_attribute *attr, char *buf)
+{
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+
+	return sysfs_emit(buf, "%d\n", cxld->interleave_ways);
+}
+static DEVICE_ATTR_RO(interleave_ways);
+
 static struct attribute *cxl_decoder_base_attrs[] = {
 	&dev_attr_start.attr,
 	&dev_attr_size.attr,
 	&dev_attr_locked.attr,
+	&dev_attr_interleave_granularity.attr,
+	&dev_attr_interleave_ways.attr,
 	NULL,
 };
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 14/15] cxl/region: Introduce configuration
  2022-04-13 18:37 [RFC PATCH 00/15] Region driver Ben Widawsky
                   ` (12 preceding siblings ...)
  2022-04-13 18:37 ` [RFC PATCH 13/15] cxl/core/port: Add attrs for root ways & granularity Ben Widawsky
@ 2022-04-13 18:37 ` Ben Widawsky
  2022-04-13 18:37 ` [RFC PATCH 15/15] cxl/region: Introduce a cxl_region driver Ben Widawsky
  2022-05-20 16:23 ` [RFC PATCH 00/15] Region driver Jonathan Cameron
  15 siblings, 0 replies; 53+ messages in thread
From: Ben Widawsky @ 2022-04-13 18:37 UTC (permalink / raw)
  To: linux-cxl, nvdimm
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma

The region creation APIs create a vacant region. Configuring the region
works in the same way as similar subsystems such as devdax. Sysfs attrs
will be provided to allow userspace to configure the region. Finally
once all configuration is complete, userspace may activate the region by
binding the driver.

Introduced here are the most basic attributes needed to configure a
region. Details of these attributes are described in the ABI
documentation.

An example is provided below:

/sys/bus/cxl/devices/region0
├── devtype
├── interleave_granularity
├── interleave_ways
├── modalias
├── offset
├── size
├── subsystem -> ../../../../../../bus/cxl
├── target0
├── uevent
└── uuid

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 Documentation/ABI/testing/sysfs-bus-cxl |  64 +++-
 drivers/cxl/core/region.c               | 455 +++++++++++++++++++++++-
 drivers/cxl/cxl.h                       |  15 +
 drivers/cxl/region.h                    |  76 ++++
 4 files changed, 598 insertions(+), 12 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 5229f4bd109a..9ace58635942 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -195,5 +195,65 @@ Date:		January, 2022
 KernelVersion:	v5.19
 Contact:	linux-cxl@vger.kernel.org
 Description:
-		Deletes the named region. The attribute expects a region number
-		as an integer.
+		Deletes the named region. The attribute expects a region name in
+		the form regionZ where Z is an integer value.
+
+What:		/sys/bus/cxl/devices/decoderX.Y/regionZ/resource
+Date:		January, 2022
+KernelVersion:	v5.19
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		A region is a contiguous partition of a CXL root decoder address
+		space. Region capacity is allocated by writing to the size
+		attribute; the resulting physical address space determined by
+		the driver is reflected here.
+
+What:		/sys/bus/cxl/devices/decoderX.Y/regionZ/size
+Date:		January, 2022
+KernelVersion:	v5.19
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		System physical address space to be consumed by the region. When
+		written to, this attribute will allocate space out of the CXL
+		root decoder's address space. When read, the size of the address
+		space is reported and should match the span of the region's
+		resource attribute.
+
+What:		/sys/bus/cxl/devices/decoderX.Y/regionZ/interleave_ways
+Date:		January, 2022
+KernelVersion:	v5.19
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		Configure the number of devices participating in the region by
+		writing this value. Each device will provide
+		1/interleave_ways of storage for the region.
+
+What:		/sys/bus/cxl/devices/decoderX.Y/regionZ/interleave_granularity
+Date:		January, 2022
+KernelVersion:	v5.19
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		Set the number of consecutive bytes each device in the
+		interleave set will claim. The possible interleave granularity
+		values are determined by the CXL spec and the participating
+		devices.
+
+What:		/sys/bus/cxl/devices/decoderX.Y/regionZ/uuid
+Date:		January, 2022
+KernelVersion:	v5.19
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		Write a unique identifier for the region. This field must be set
+		for persistent regions and it must not conflict with the UUID of
+		another region. If this field is set for volatile regions, the
+		value is ignored.
+
+What:		/sys/bus/cxl/devices/decoderX.Y/regionZ/target[0..interleave_ways]
+Date:		January, 2022
+KernelVersion:	v5.19
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		Write an [endpoint] decoder object that is unused and will
+		participate in decoding memory transactions for the interleave
+		set, i.e. decoderX.Y. All required attributes of the decoder
+		must be populated.
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 16829bf2f73a..4766d897f4bf 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -4,9 +4,12 @@
 #include <linux/genalloc.h>
 #include <linux/device.h>
 #include <linux/module.h>
+#include <linux/sizes.h>
 #include <linux/slab.h>
+#include <linux/uuid.h>
 #include <linux/idr.h>
 #include <region.h>
+#include <cxlmem.h>
 #include <cxl.h>
 #include "core.h"
 
@@ -18,21 +21,453 @@
  * Memory ranges, Regions represent the active mapped capacity by the HDM
  * Decoder Capability structures throughout the Host Bridges, Switches, and
  * Endpoints in the topology.
+ *
+ * Region configuration has ordering constraints:
+ * - Targets: Must be set after size
+ * - Size: Must be set after interleave ways
+ * - Interleave ways: Must be set after Interleave Granularity
+ *
+ * UUID may be set at any time before binding the driver to the region.
  */
 
-static struct cxl_region *to_cxl_region(struct device *dev);
+static const struct attribute_group region_interleave_group;
+
+static void remove_target(struct cxl_region *cxlr, int target)
+{
+	struct cxl_endpoint_decoder *cxled;
+
+	mutex_lock(&cxlr->remove_lock);
+	cxled = cxlr->targets[target];
+	if (cxled) {
+		cxled->cxlr = NULL;
+		put_device(&cxled->base.dev);
+	}
+	cxlr->targets[target] = NULL;
+	mutex_unlock(&cxlr->remove_lock);
+}
 
 static void cxl_region_release(struct device *dev)
 {
 	struct cxl_region *cxlr = to_cxl_region(dev);
+	int i;
 
 	memregion_free(cxlr->id);
+	for (i = 0; i < cxlr->interleave_ways; i++)
+		remove_target(cxlr, i);
 	kfree(cxlr);
 }
 
+static ssize_t interleave_ways_show(struct device *dev,
+				    struct device_attribute *attr, char *buf)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	return sysfs_emit(buf, "%d\n", cxlr->interleave_ways);
+}
+
+static ssize_t interleave_ways_store(struct device *dev,
+				     struct device_attribute *attr,
+				     const char *buf, size_t len)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+	struct cxl_decoder *rootd;
+	int rc, val;
+
+	rc = kstrtoint(buf, 0, &val);
+	if (rc)
+		return rc;
+
+	cxl_device_lock(dev);
+
+	if (dev->driver) {
+		cxl_device_unlock(dev);
+		return -EBUSY;
+	}
+
+	if (cxlr->interleave_ways) {
+		cxl_device_unlock(dev);
+		return -EEXIST;
+	}
+
+	if (!cxlr->interleave_granularity) {
+		dev_dbg(&cxlr->dev, "IG must be set before IW\n");
+		cxl_device_unlock(dev);
+		return -EILSEQ;
+	}
+
+	rootd = to_cxl_decoder(cxlr->dev.parent);
+	if (!cxl_region_ways_valid(rootd, val, cxlr->interleave_granularity)) {
+		cxl_device_unlock(dev);
+		return -EINVAL;
+	}
+
+	cxlr->interleave_ways = val;
+	cxl_device_unlock(dev);
+
+	rc = sysfs_update_group(&cxlr->dev.kobj, &region_interleave_group);
+	if (rc < 0) {
+		cxlr->interleave_ways = 0;
+		return rc;
+	}
+
+	return len;
+}
+static DEVICE_ATTR_RW(interleave_ways);
+
+static ssize_t interleave_granularity_show(struct device *dev,
+					   struct device_attribute *attr,
+					   char *buf)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	return sysfs_emit(buf, "%d\n", cxlr->interleave_granularity);
+}
+
+static ssize_t interleave_granularity_store(struct device *dev,
+					    struct device_attribute *attr,
+					    const char *buf, size_t len)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+	struct cxl_decoder *rootd;
+	int val, ret;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	cxl_device_lock(dev);
+
+	if (dev->driver) {
+		cxl_device_unlock(dev);
+		return -EBUSY;
+	}
+
+	if (cxlr->interleave_granularity) {
+		cxl_device_unlock(dev);
+		return -EEXIST;
+	}
+
+	rootd = to_cxl_decoder(cxlr->dev.parent);
+	if (!cxl_region_granularity_valid(rootd, val)) {
+		cxl_device_unlock(dev);
+		return -EINVAL;
+	}
+
+	cxlr->interleave_granularity = val;
+	cxl_device_unlock(dev);
+
+	return len;
+}
+static DEVICE_ATTR_RW(interleave_granularity);
+
+static ssize_t resource_show(struct device *dev, struct device_attribute *attr,
+			     char *buf)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	return sysfs_emit(buf, "%#llx\n", cxlr->range.start);
+}
+static DEVICE_ATTR_RO(resource);
+
+static ssize_t size_store(struct device *dev, struct device_attribute *attr,
+			  const char *buf, size_t len)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+	struct cxl_root_decoder *cxlrd;
+	struct cxl_decoder *rootd;
+	unsigned long addr;
+	u64 val;
+	int rc;
+
+	rc = kstrtou64(buf, 0, &val);
+	if (rc)
+		return rc;
+
+	if (!cxl_region_size_valid(val, cxlr->interleave_ways)) {
+		dev_dbg(&cxlr->dev, "Size must be a multiple of %dM\n",
+			cxlr->interleave_ways * 256);
+		return -EINVAL;
+	}
+
+	cxl_device_lock(dev);
+
+	if (dev->driver) {
+		cxl_device_unlock(dev);
+		return -EBUSY;
+	}
+
+	if (!cxlr->interleave_ways) {
+		dev_dbg(&cxlr->dev, "IW must be set before size\n");
+		cxl_device_unlock(dev);
+		return -EILSEQ;
+	}
+
+	rootd = to_cxl_decoder(cxlr->dev.parent);
+	cxlrd = to_cxl_root_decoder(rootd);
+
+	addr = gen_pool_alloc(cxlrd->window, val);
+	if (addr == 0 && rootd->range.start != 0) {
+		rc = -ENOSPC;
+		goto out;
+	}
+
+	cxlr->range = (struct range) {
+		.start = addr,
+		.end = addr + val - 1,
+	};
+
+out:
+	cxl_device_unlock(dev);
+	return rc ? rc : len;
+}
+
+static ssize_t size_show(struct device *dev, struct device_attribute *attr,
+			 char *buf)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	return sysfs_emit(buf, "%#llx\n", range_len(&cxlr->range));
+}
+static DEVICE_ATTR_RW(size);
+
+static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
+			 char *buf)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	return sysfs_emit(buf, "%pUb\n", &cxlr->uuid);
+}
+
+static int is_dupe(struct device *match, void *_cxlr)
+{
+	struct cxl_region *c, *cxlr = _cxlr;
+
+	if (!is_cxl_region(match))
+		return 0;
+
+	if (&cxlr->dev == match)
+		return 0;
+
+	c = to_cxl_region(match);
+	if (uuid_equal(&c->uuid, &cxlr->uuid))
+		return -EEXIST;
+
+	return 0;
+}
+
+static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
+			  const char *buf, size_t len)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+	ssize_t rc;
+	uuid_t temp;
+
+	if (len != UUID_STRING_LEN + 1)
+		return -EINVAL;
+
+	rc = uuid_parse(buf, &temp);
+	if (rc)
+		return rc;
+
+	cxl_device_lock(dev);
+
+	if (dev->driver) {
+		cxl_device_unlock(dev);
+		return -EBUSY;
+	}
+
+	if (!uuid_is_null(&cxlr->uuid)) {
+		cxl_device_unlock(dev);
+		return -EEXIST;
+	}
+
+	rc = bus_for_each_dev(&cxl_bus_type, NULL, cxlr, is_dupe);
+	if (rc < 0) {
+		cxl_device_unlock(dev);
+		return rc;
+	}
+
+	cxlr->uuid = temp;
+	cxl_device_unlock(dev);
+	return len;
+}
+static DEVICE_ATTR_RW(uuid);
+
+static struct attribute *region_attrs[] = {
+	&dev_attr_resource.attr,
+	&dev_attr_interleave_ways.attr,
+	&dev_attr_interleave_granularity.attr,
+	&dev_attr_size.attr,
+	&dev_attr_uuid.attr,
+	NULL,
+};
+
+static const struct attribute_group region_group = {
+	.attrs = region_attrs,
+};
+
+static ssize_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
+{
+	if (!cxlr->targets[n])
+		return sysfs_emit(buf, "\n");
+
+	return sysfs_emit(buf, "%s\n", dev_name(&cxlr->targets[n]->base.dev));
+}
+
+static ssize_t store_targetN(struct cxl_region *cxlr, const char *buf, int n,
+			     size_t len)
+{
+	struct cxl_endpoint_decoder *cxled;
+	struct cxl_decoder *cxld;
+	struct device *cxld_dev;
+	struct cxl_port *port;
+
+	cxl_device_lock(&cxlr->dev);
+
+	if (cxlr->dev.driver) {
+		cxl_device_unlock(&cxlr->dev);
+		return -EBUSY;
+	}
+
+	/* The target attrs don't exist until ways are set. No need to check */
+
+	if (cxlr->targets[n]) {
+		cxl_device_unlock(&cxlr->dev);
+		return -EEXIST;
+	}
+
+	cxld_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
+	if (!cxld_dev) {
+		cxl_device_unlock(&cxlr->dev);
+		return -ENOENT;
+	}
+
+	if (!is_cxl_decoder(cxld_dev)) {
+		dev_info(cxld_dev, "Not a decoder\n");
+		put_device(cxld_dev);
+		cxl_device_unlock(&cxlr->dev);
+		return -EINVAL;
+	}
+
+	if (!is_cxl_endpoint(to_cxl_port(cxld_dev->parent))) {
+		dev_info(cxld_dev, "Not an endpoint decoder\n");
+		put_device(cxld_dev);
+		cxl_device_unlock(&cxlr->dev);
+		return -EINVAL;
+	}
+
+	cxld = to_cxl_decoder(cxld_dev);
+	if (cxld->flags & CXL_DECODER_F_ENABLE) {
+		put_device(cxld_dev);
+		cxl_device_unlock(&cxlr->dev);
+		return -EBUSY;
+	}
+
+	/* Decoder reference is held until region probe can complete. */
+	cxled = to_cxl_endpoint_decoder(cxld);
+
+	if (range_len(&cxled->drange) !=
+	    range_len(&cxlr->range) / cxlr->interleave_ways) {
+		dev_info(cxld_dev, "Decoder is the wrong size\n");
+		put_device(cxld_dev);
+		cxl_device_unlock(&cxlr->dev);
+		return -EINVAL;
+	}
+
+	port = to_cxl_port(cxld->dev.parent);
+	if (port->last_cxled &&
+	    cxlr->range.start <= port->last_cxled->drange.start) {
+		dev_info(cxld_dev, "Decoder in set has higher HPA than region. Try a different device\n");
+		put_device(cxld_dev);
+		cxl_device_unlock(&cxlr->dev);
+		return -EINVAL;
+	}
+
+	cxlr->targets[n] = cxled;
+	cxled->cxlr = cxlr;
+
+	cxl_device_unlock(&cxlr->dev);
+
+	return len;
+}
+
+#define TARGET_ATTR_RW(n)                                                      \
+	static ssize_t target##n##_show(                                       \
+		struct device *dev, struct device_attribute *attr, char *buf)  \
+	{                                                                      \
+		return show_targetN(to_cxl_region(dev), buf, (n));             \
+	}                                                                      \
+	static ssize_t target##n##_store(struct device *dev,                   \
+					 struct device_attribute *attr,        \
+					 const char *buf, size_t len)          \
+	{                                                                      \
+		return store_targetN(to_cxl_region(dev), buf, (n), len);       \
+	}                                                                      \
+	static DEVICE_ATTR_RW(target##n)
+
+TARGET_ATTR_RW(0);
+TARGET_ATTR_RW(1);
+TARGET_ATTR_RW(2);
+TARGET_ATTR_RW(3);
+TARGET_ATTR_RW(4);
+TARGET_ATTR_RW(5);
+TARGET_ATTR_RW(6);
+TARGET_ATTR_RW(7);
+TARGET_ATTR_RW(8);
+TARGET_ATTR_RW(9);
+TARGET_ATTR_RW(10);
+TARGET_ATTR_RW(11);
+TARGET_ATTR_RW(12);
+TARGET_ATTR_RW(13);
+TARGET_ATTR_RW(14);
+TARGET_ATTR_RW(15);
+
+static struct attribute *interleave_attrs[] = {
+	&dev_attr_target0.attr,
+	&dev_attr_target1.attr,
+	&dev_attr_target2.attr,
+	&dev_attr_target3.attr,
+	&dev_attr_target4.attr,
+	&dev_attr_target5.attr,
+	&dev_attr_target6.attr,
+	&dev_attr_target7.attr,
+	&dev_attr_target8.attr,
+	&dev_attr_target9.attr,
+	&dev_attr_target10.attr,
+	&dev_attr_target11.attr,
+	&dev_attr_target12.attr,
+	&dev_attr_target13.attr,
+	&dev_attr_target14.attr,
+	&dev_attr_target15.attr,
+	NULL,
+};
+
+static umode_t visible_targets(struct kobject *kobj, struct attribute *a, int n)
+{
+	struct device *dev = container_of(kobj, struct device, kobj);
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	if (n < cxlr->interleave_ways)
+		return a->mode;
+	return 0;
+}
+
+static const struct attribute_group region_interleave_group = {
+	.attrs = interleave_attrs,
+	.is_visible = visible_targets,
+};
+
+static const struct attribute_group *region_groups[] = {
+	&region_group,
+	&region_interleave_group,
+	&cxl_base_attribute_group,
+	NULL,
+};
+
 static const struct device_type cxl_region_type = {
 	.name = "cxl_region",
 	.release = cxl_region_release,
+	.groups = region_groups
 };
 
 bool is_cxl_region(struct device *dev)
@@ -41,7 +476,7 @@ bool is_cxl_region(struct device *dev)
 }
 EXPORT_SYMBOL_NS_GPL(is_cxl_region, CXL);
 
-static struct cxl_region *to_cxl_region(struct device *dev)
+struct cxl_region *to_cxl_region(struct device *dev)
 {
 	if (dev_WARN_ONCE(dev, dev->type != &cxl_region_type,
 			  "not a cxl_region device\n"))
@@ -49,6 +484,7 @@ static struct cxl_region *to_cxl_region(struct device *dev)
 
 	return container_of(dev, struct cxl_region, dev);
 }
+EXPORT_SYMBOL_NS_GPL(to_cxl_region, CXL);
 
 static void unregister_region(struct work_struct *work)
 {
@@ -96,20 +532,20 @@ static struct cxl_region *cxl_region_alloc(struct cxl_decoder *cxld)
 	INIT_WORK(&cxlr->detach_work, unregister_region);
 	mutex_init(&cxlr->remove_lock);
 
+	cxlr->range = (struct range) {
+		.start = 0,
+		.end = -1,
+	};
+
 	return cxlr;
 }
 
 /**
  * devm_cxl_add_region - Adds a region to a decoder
- * @cxld: Parent decoder.
- *
- * This is the second step of region initialization. Regions exist within an
- * address space which is mapped by a @cxld. That @cxld must be a root decoder,
- * and it enforces constraints upon the region as it is configured.
+ * @cxld: Root decoder.
  *
  * Return: 0 if the region was added to the @cxld, else returns negative error
- * code. The region will be named "regionX.Y.Z" where X is the port, Y is the
- * decoder id, and Z is the region number.
+ * code. The region will be named "regionX" where X is the region number.
  */
 static struct cxl_region *devm_cxl_add_region(struct cxl_decoder *cxld)
 {
@@ -191,7 +627,6 @@ static ssize_t create_pmem_region_store(struct device *dev,
 	}
 
 	cxlr = devm_cxl_add_region(cxld);
-	rc = 0;
 	dev_dbg(dev, "Created %s\n", dev_name(&cxlr->dev));
 
 out:
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 3abc8b0cf8f4..db69dfa16f71 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -81,6 +81,19 @@ static inline int cxl_to_interleave_ways(u8 eniw)
 	}
 }
 
+static inline int cxl_from_ways(u8 ways)
+{
+	if (is_power_of_2(ways))
+		return ilog2(ways);
+
+	return ilog2(ways / 3) + 8;
+}
+
+static inline int cxl_from_granularity(u16 g)
+{
+	return ilog2(g) - 8;
+}
+
 /* CXL 2.0 8.2.8.1 Device Capabilities Array Register */
 #define CXLDEV_CAP_ARRAY_OFFSET 0x0
 #define   CXLDEV_CAP_ARRAY_CAP_ID 0
@@ -277,6 +290,7 @@ struct cxl_switch_decoder {
  * @targets: Downstream targets (ie. hostbridges).
  * @next_region_id: The pre-cached next region id.
  * @id_lock: Protects next_region_id
+ * @regions: List of active regions in this decoder's address space
  */
 struct cxl_root_decoder {
 	struct cxl_decoder base;
@@ -284,6 +298,7 @@ struct cxl_root_decoder {
 	struct cxl_decoder_targets *targets;
 	int next_region_id;
 	struct mutex id_lock; /* synchronizes access to next_region_id */
+	struct list_head regions;
 };
 
 #define _to_cxl_decoder(x)                                                     \
diff --git a/drivers/cxl/region.h b/drivers/cxl/region.h
index 66d9ba195c34..e6457ea3d388 100644
--- a/drivers/cxl/region.h
+++ b/drivers/cxl/region.h
@@ -14,6 +14,12 @@
  * @flags: Flags representing the current state of the region.
  * @detach_work: Async unregister to allow attrs to take device_lock.
  * @remove_lock: Coordinates region removal against decoder removal
+ * @list: Node in decoder's region list.
+ * @range: Resource this region carves out of the platform decode range.
+ * @uuid: The UUID for this region.
+ * @interleave_ways: Number of interleave ways this region is configured for.
+ * @interleave_granularity: Interleave granularity of region
+ * @targets: The memory devices comprising the region.
  */
 struct cxl_region {
 	struct device dev;
@@ -22,8 +28,78 @@ struct cxl_region {
 #define REGION_DEAD 0
 	struct work_struct detach_work;
 	struct mutex remove_lock; /* serialize region removal */
+
+	struct list_head list;
+	struct range range;
+
+	uuid_t uuid;
+	int interleave_ways;
+	int interleave_granularity;
+	struct cxl_endpoint_decoder *targets[CXL_DECODER_MAX_INTERLEAVE];
 };
 
+bool is_cxl_region(struct device *dev);
+struct cxl_region *to_cxl_region(struct device *dev);
 bool schedule_cxl_region_unregister(struct cxl_region *cxlr);
 
+/**
+ * cxl_region_ways_valid() - Determine if @ways is valid for the given
+ *			     decoder
+ * @rootd: The root decoder against which validity will be checked
+ * @ways: The number of interleave ways to validate
+ * @granularity: The granularity at which the region will be interleaved
+ */
+static inline bool cxl_region_ways_valid(const struct cxl_decoder *rootd,
+					 u8 ways, u16 granularity)
+{
+	int root_ig, region_ig, root_eniw;
+
+	switch (ways) {
+	case 1 ... 4:
+	case 6:
+	case 8:
+	case 12:
+	case 16:
+		break;
+	default:
+		return false;
+	}
+
+	if (rootd->interleave_ways == 1)
+		return true;
+
+	root_ig = cxl_from_granularity(rootd->interleave_granularity);
+	region_ig = cxl_from_granularity(granularity);
+	root_eniw = cxl_from_ways(rootd->interleave_ways);
+
+	return ((1 << (root_ig - region_ig)) * (1 << root_eniw)) <= ways;
+}
+
+static inline bool cxl_region_granularity_valid(const struct cxl_decoder *rootd,
+						int ig)
+{
+	int rootd_hbig;
+
+	if (!is_power_of_2(ig))
+		return false;
+
+	/* 16K is the max */
+	if (ig >> 15)
+		return false;
+
+	rootd_hbig = cxl_from_granularity(rootd->interleave_granularity);
+	if (rootd_hbig < cxl_from_granularity(ig))
+		return false;
+
+	return true;
+}
+
+static inline bool cxl_region_size_valid(u64 size, int ways)
+{
+	int rem;
+
+	div_u64_rem(size, SZ_256M * ways, &rem);
+	return rem == 0;
+}
+
 #endif
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 15/15] cxl/region: Introduce a cxl_region driver
  2022-04-13 18:37 [RFC PATCH 00/15] Region driver Ben Widawsky
                   ` (13 preceding siblings ...)
  2022-04-13 18:37 ` [RFC PATCH 14/15] cxl/region: Introduce configuration Ben Widawsky
@ 2022-04-13 18:37 ` Ben Widawsky
  2022-05-20 16:23 ` [RFC PATCH 00/15] Region driver Jonathan Cameron
  15 siblings, 0 replies; 53+ messages in thread
From: Ben Widawsky @ 2022-04-13 18:37 UTC (permalink / raw)
  To: linux-cxl, nvdimm
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma

The cxl_region driver is responsible for managing the HDM decoder
programming in the CXL topology. Once a region is created it must be
configured and bound to the driver in order to activate it.

The following is a sample of how such controls might work:

region=$(cat /sys/bus/cxl/devices/decoder0.0/create_pmem_region)
echo $region > /sys/bus/cxl/devices/decoder0.0/create_pmem_region
echo 256 > /sys/bus/cxl/devices/decoder0.0/region0/interleave_granularity
echo 2 > /sys/bus/cxl/devices/decoder0.0/region0/interleave_ways
echo $((512<<20)) > /sys/bus/cxl/devices/decoder0.0/region0/size
echo decoder3.0 > /sys/bus/cxl/devices/decoder0.0/region0/target0
echo decoder4.0 > /sys/bus/cxl/devices/decoder0.0/region0/target1
echo region0 > /sys/bus/cxl/drivers/cxl_region/bind

Note that the above is not complete as the endpoint decoders also need
configuration.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 .../driver-api/cxl/memory-devices.rst         |   3 +
 drivers/cxl/Kconfig                           |   4 +
 drivers/cxl/Makefile                          |   2 +
 drivers/cxl/core/core.h                       |   1 +
 drivers/cxl/core/port.c                       |   2 +
 drivers/cxl/core/region.c                     |   2 +-
 drivers/cxl/cxl.h                             |   6 +
 drivers/cxl/region.c                          | 333 ++++++++++++++++++
 8 files changed, 352 insertions(+), 1 deletion(-)
 create mode 100644 drivers/cxl/region.c

diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst
index 66ddc58a21b1..8cb4dece5b17 100644
--- a/Documentation/driver-api/cxl/memory-devices.rst
+++ b/Documentation/driver-api/cxl/memory-devices.rst
@@ -364,6 +364,9 @@ CXL Core
 
 CXL Regions
 -----------
+.. kernel-doc:: drivers/cxl/region.c
+   :doc: cxl region
+
 .. kernel-doc:: drivers/cxl/region.h
    :identifiers:
 
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 7ce86eee8bda..d5c41c96971f 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -104,4 +104,8 @@ config CXL_REGION
 	default CXL_BUS
 	select MEMREGION
 
+config CXL_REGION
+	default CXL_PORT
+	tristate
+
 endif
diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
index ce267ef11d93..02a4776e7ab9 100644
--- a/drivers/cxl/Makefile
+++ b/drivers/cxl/Makefile
@@ -5,9 +5,11 @@ obj-$(CONFIG_CXL_MEM) += cxl_mem.o
 obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
 obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o
 obj-$(CONFIG_CXL_PORT) += cxl_port.o
+obj-$(CONFIG_CXL_REGION) += cxl_region.o
 
 cxl_mem-y := mem.o
 cxl_pci-y := pci.o
 cxl_acpi-y := acpi.o
 cxl_pmem-y := pmem.o
 cxl_port-y := port.o
+cxl_region-y := region.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index a507a2502127..8871a3385604 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -6,6 +6,7 @@
 
 extern const struct device_type cxl_nvdimm_bridge_type;
 extern const struct device_type cxl_nvdimm_type;
+extern const struct device_type cxl_region_type;
 
 extern struct attribute_group cxl_base_attribute_group;
 
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 19cf1fd16118..f22579cd031d 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -53,6 +53,8 @@ static int cxl_device_id(struct device *dev)
 	}
 	if (is_cxl_memdev(dev))
 		return CXL_DEVICE_MEMORY_EXPANDER;
+	if (dev->type == &cxl_region_type)
+		return CXL_DEVICE_REGION;
 	return 0;
 }
 
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 4766d897f4bf..1c28d9623cb8 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -464,7 +464,7 @@ static const struct attribute_group *region_groups[] = {
 	NULL,
 };
 
-static const struct device_type cxl_region_type = {
+const struct device_type cxl_region_type = {
 	.name = "cxl_region",
 	.release = cxl_region_release,
 	.groups = region_groups
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index db69dfa16f71..184af920113d 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -212,6 +212,10 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
 #define CXL_DECODER_F_ENABLE    BIT(5)
 #define CXL_DECODER_F_MASK  GENMASK(5, 0)
 
+#define cxl_is_pmem_t3(flags)                                                  \
+	(((flags) & (CXL_DECODER_F_TYPE3 | CXL_DECODER_F_PMEM)) ==             \
+	 (CXL_DECODER_F_TYPE3 | CXL_DECODER_F_PMEM))
+
 enum cxl_decoder_type {
        CXL_DECODER_ACCELERATOR = 2,
        CXL_DECODER_EXPANDER = 3,
@@ -440,6 +444,7 @@ struct cxl_dport *devm_cxl_add_dport(struct cxl_port *port,
 				     resource_size_t component_reg_phys);
 struct cxl_dport *cxl_find_dport_by_dev(struct cxl_port *port,
 					const struct device *dev);
+struct cxl_port *ep_find_cxl_port(struct cxl_memdev *cxlmd, unsigned int depth);
 
 struct cxl_decoder *to_cxl_decoder(struct device *dev);
 bool is_root_decoder(struct device *dev);
@@ -501,6 +506,7 @@ void cxl_driver_unregister(struct cxl_driver *cxl_drv);
 #define CXL_DEVICE_PORT			3
 #define CXL_DEVICE_ROOT			4
 #define CXL_DEVICE_MEMORY_EXPANDER	5
+#define CXL_DEVICE_REGION		6
 
 #define MODULE_ALIAS_CXL(type) MODULE_ALIAS("cxl:t" __stringify(type) "*")
 #define CXL_MODALIAS_FMT "cxl:t%d"
diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
new file mode 100644
index 000000000000..f5de640623c0
--- /dev/null
+++ b/drivers/cxl/region.c
@@ -0,0 +1,333 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2021 Intel Corporation. All rights reserved. */
+#include <linux/platform_device.h>
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include "cxlmem.h"
+#include "region.h"
+#include "cxl.h"
+
+/**
+ * DOC: cxl region
+ *
+ * This module implements a region driver that is capable of programming CXL
+ * hardware to setup regions.
+ *
+ * A CXL region encompasses a chunk of host physical address space that may be
+ * consumed by a single device (x1 interleave aka linear) or across multiple
+ * devices (xN interleaved). The region driver has the following
+ * responsibilities:
+ *
+ * * Walk topology to obtain decoder resources for region configuration.
+ * * Program decoder resources based on region configuration.
+ * * Bridge CXL regions to LIBNVDIMM
+ * * Initiate reading and configuring LSA regions
+ * * Enumerate regions created by BIOS (typically volatile)
+ */
+
+#define for_each_cxled(cxled, idx, cxlr) \
+	for (idx = 0; idx < cxlr->interleave_ways && (cxled = cxlr->targets[idx]); idx++)
+
+static struct cxl_decoder *rootd_from_region(const struct cxl_region *cxlr)
+{
+	struct device *d = cxlr->dev.parent;
+
+	if (WARN_ONCE(!is_root_decoder(d),
+		      "Corrupt topology for root region\n"))
+		return NULL;
+
+	return to_cxl_decoder(d);
+}
+
+static struct cxl_port *get_hostbridge(const struct cxl_memdev *ep)
+{
+	struct cxl_port *port = dev_get_drvdata(&ep->dev);
+
+	while (!is_cxl_root(port)) {
+		port = to_cxl_port(port->dev.parent);
+		if (port->depth == 1)
+			return port;
+	}
+
+	BUG();
+	return NULL;
+}
+
+static struct cxl_port *get_root_decoder(const struct cxl_memdev *endpoint)
+{
+	struct cxl_port *hostbridge = get_hostbridge(endpoint);
+
+	if (hostbridge)
+		return to_cxl_port(hostbridge->dev.parent);
+
+	return NULL;
+}
+
+/**
+ * validate_region() - Check if a region is reasonably configured
+ * @cxlr: The region to check
+ *
+ * Whether or not a region can possibly be configured is determined by the
+ * algorithms described in the CXL Memory Device SW Guide. Before those
+ * algorithms can be applied, certain more basic configuration parameters must
+ * be validated. That is what this function accomplishes.
+ *
+ * Returns 0 if the region is reasonably configured, else returns a negative
+ * error code.
+ */
+static int validate_region(const struct cxl_region *cxlr)
+{
+	const struct cxl_decoder *rootd = rootd_from_region(cxlr);
+	const int gran = cxlr->interleave_granularity;
+	const int ways = cxlr->interleave_ways;
+	struct cxl_endpoint_decoder *cxled;
+	int i;
+
+	/*
+	 * Interleave attributes should be caught by later math, but it's
+	 * easiest to find those issues here, now.
+	 */
+	if (!cxl_region_granularity_valid(rootd, gran)) {
+		dev_dbg(&cxlr->dev, "Invalid interleave granularity\n");
+		return -ENXIO;
+	}
+
+	if (!cxl_region_ways_valid(rootd, ways, gran)) {
+		dev_dbg(&cxlr->dev, "Invalid number of ways\n");
+		return -ENXIO;
+	}
+
+	if (!cxl_region_size_valid(range_len(&cxlr->range), ways)) {
+		dev_dbg(&cxlr->dev, "Invalid size. Must be multiple of %uM\n",
+			256 * ways);
+		return -ENXIO;
+	}
+
+	for_each_cxled(cxled, i, cxlr) {
+		struct cxl_memdev *cxlmd;
+		struct cxl_port *port;
+
+		port = to_cxl_port(cxled->base.dev.parent);
+		cxlmd = to_cxl_memdev(port->uport);
+		if (!cxlmd->dev.driver) {
+			dev_dbg(&cxlr->dev, "%s isn't CXL.mem capable\n",
+				dev_name(&cxled->base.dev));
+			return -ENODEV;
+		}
+
+		if ((range_len(&cxlr->range) / ways) !=
+		    range_len(&cxled->drange)) {
+			dev_dbg(&cxlr->dev, "%s is the wrong size\n",
+				dev_name(&cxled->base.dev));
+			return -ENXIO;
+		}
+	}
+
+	if (i != cxlr->interleave_ways) {
+		dev_dbg(&cxlr->dev, "Missing memory device target%u\n", i);
+		return -ENXIO;
+	}
+
+	return 0;
+}
+
+/**
+ * find_cdat_dsmas() - Find a valid DSMAS for the region
+ * @cxlr: The region
+ */
+static bool find_cdat_dsmas(const struct cxl_region *cxlr)
+{
+	return true;
+}
+
+/**
+ * qtg_match() - Does this root decoder have desirable QTG for the endpoint
+ * @rootd: The root decoder for the region
+ *
+ * Prior to calling this function, the caller should verify that all endpoints
+ * in the region have the same QTG ID.
+ *
+ * Returns true if the QTG ID of the root decoder matches the endpoint
+ */
+static bool qtg_match(const struct cxl_decoder *rootd)
+{
+	/* TODO: */
+	return true;
+}
+
+/**
+ * region_xhb_config_valid() - determine cross host bridge validity
+ * @cxlr: The region being programmed
+ * @rootd: The root decoder to check against
+ *
+ * The algorithm is outlined in 2.13.14 "Verify XHB configuration sequence" of
+ * the CXL Memory Device SW Guide (Rev1p0).
+ *
+ * Returns true if the configuration is valid.
+ */
+static bool region_xhb_config_valid(const struct cxl_region *cxlr,
+				    const struct cxl_decoder *rootd)
+{
+	/* TODO: */
+	return true;
+}
+
+/**
+ * region_hb_rp_config_valid() - determine whether root port ordering is correct
+ * @cxlr: Region to validate
+ * @rootd: root decoder for this @cxlr
+ *
+ * The algorithm is outlined in 2.13.15 "Verify HB root port configuration
+ * sequence" of the CXL Memory Device SW Guide (Rev1p0).
+ *
+ * Returns true if the configuration is valid.
+ */
+static bool region_hb_rp_config_valid(const struct cxl_region *cxlr,
+				      const struct cxl_decoder *rootd)
+{
+	/* TODO: */
+	return true;
+}
+
+/**
+ * rootd_contains() - determine if this region can exist in the root decoder
+ * @cxlr: region to be routed by the @rootd
+ * @rootd: root decoder that potentially decodes to this region
+ */
+static bool rootd_contains(const struct cxl_region *cxlr,
+			   const struct cxl_decoder *rootd)
+{
+	/* TODO: */
+	return true;
+}
+
+static bool rootd_valid(const struct cxl_region *cxlr,
+			const struct cxl_decoder *rootd)
+{
+	if (!qtg_match(rootd))
+		return false;
+
+	if (!cxl_is_pmem_t3(rootd->flags))
+		return false;
+
+	if (!region_xhb_config_valid(cxlr, rootd))
+		return false;
+
+	if (!region_hb_rp_config_valid(cxlr, rootd))
+		return false;
+
+	if (!rootd_contains(cxlr, rootd))
+		return false;
+
+	return true;
+}
+
+struct rootd_context {
+	const struct cxl_region *cxlr;
+	struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];
+	int count;
+};
+
+static int rootd_match(struct device *dev, void *data)
+{
+	struct rootd_context *ctx = (struct rootd_context *)data;
+	const struct cxl_region *cxlr = ctx->cxlr;
+
+	if (!is_root_decoder(dev))
+		return 0;
+
+	return !!rootd_valid(cxlr, to_cxl_decoder(dev));
+}
+
+/*
+ * This is a roughly equivalent implementation to "Figure 45 - High-level
+ * sequence: Finding CFMWS for region" from the CXL Memory Device SW Guide
+ * Rev1p0.
+ */
+static struct cxl_decoder *find_rootd(const struct cxl_region *cxlr,
+				      const struct cxl_port *root)
+{
+	struct rootd_context ctx;
+	struct device *ret;
+
+	ctx.cxlr = cxlr;
+
+	ret = device_find_child((struct device *)&root->dev, &ctx, rootd_match);
+	if (ret)
+		return to_cxl_decoder(ret);
+
+	return NULL;
+}
+
+static int bind_region(const struct cxl_region *cxlr)
+{
+	struct cxl_endpoint_decoder *cxled;
+	int i;
+	/* TODO: */
+
+	/*
+	 * Natural decoder teardown can occur at this point, put the
+	 * reference which was taken when the target was set.
+	 */
+	for_each_cxled(cxled, i, cxlr)
+		put_device(&cxled->base.dev);
+
+	WARN_ON(i != cxlr->interleave_ways);
+	return 0;
+}
+
+static int cxl_region_probe(struct device *dev)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+	struct cxl_port *root_port, *ep_port;
+	struct cxl_decoder *rootd, *ours;
+	struct cxl_memdev *cxlmd;
+	int ret;
+
+	if (uuid_is_null(&cxlr->uuid))
+		uuid_gen(&cxlr->uuid);
+
+	/* TODO: What about volatile, and LSA generated regions? */
+
+	ret = validate_region(cxlr);
+	if (ret)
+		return ret;
+
+	if (!find_cdat_dsmas(cxlr))
+		return -ENXIO;
+
+	rootd = rootd_from_region(cxlr);
+	if (!rootd) {
+		dev_err(dev, "Couldn't find root decoder\n");
+		return -ENXIO;
+	}
+
+	if (!rootd_valid(cxlr, rootd)) {
+		dev_err(dev, "Picked invalid rootd\n");
+		return -ENXIO;
+	}
+
+	ep_port = to_cxl_port(cxlr->targets[0]->base.dev.parent);
+	cxlmd = to_cxl_memdev(ep_port->uport);
+	root_port = get_root_decoder(cxlmd);
+	ours = find_rootd(cxlr, root_port);
+	if (ours && ours != rootd)
+		dev_dbg(dev, "Picked different rootd %s %s\n",
+			dev_name(&rootd->dev), dev_name(&ours->dev));
+	if (ours)
+		put_device(&ours->dev);
+
+	return bind_region(cxlr);
+}
+
+static struct cxl_driver cxl_region_driver = {
+	.name = "cxl_region",
+	.probe = cxl_region_probe,
+	.id = CXL_DEVICE_REGION,
+};
+module_cxl_driver(cxl_region_driver);
+
+MODULE_LICENSE("GPL");
+MODULE_IMPORT_NS(CXL);
+MODULE_ALIAS_CXL(CXL_DEVICE_REGION);
-- 
2.35.1



* Re: [RFC PATCH 01/15] cxl/core: Use is_endpoint_decoder
  2022-04-13 18:37 ` [RFC PATCH 01/15] cxl/core: Use is_endpoint_decoder Ben Widawsky
@ 2022-04-13 21:22   ` Dan Williams
       [not found]   ` <CGME20220415205052uscas1p209e03abf95b9c80b2ba1f287c82dfd80@uscas1p2.samsung.com>
  1 sibling, 0 replies; 53+ messages in thread
From: Dan Williams @ 2022-04-13 21:22 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, Linux NVDIMM, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma

On Wed, Apr 13, 2022 at 11:37 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> Save some characters and directly check decoder type rather than port
> type. There's no need to check if the port is an endpoint port since we
> already know the decoder, after alloc, has a specified type.

...a smidge more clarity:

s/we already know the decoder, after alloc,/,by this point,
cxl_endpoint_decoder_alloc()/

Otherwise, looks good to me.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>


* Re: [RFC PATCH 02/15] cxl/core/hdm: Bail on endpoint init fail
  2022-04-13 18:37 ` [RFC PATCH 02/15] cxl/core/hdm: Bail on endpoint init fail Ben Widawsky
@ 2022-04-13 21:31   ` Dan Williams
       [not found]     ` <CGME20220418163713uscas1p17b3b1b45c7d27e54e3ecb62eb8af2469@uscas1p1.samsung.com>
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2022-04-13 21:31 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, Linux NVDIMM, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma

On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> Endpoint decoder enumeration is the only way in which we can determine
> Device Physical Address (DPA) -> Host Physical Address (HPA) mappings.
> Information is obtained only when the register state can be read
> sequentially. If when enumerating the decoders a failure occurs, all
> other decoders must also fail since the decoders can no longer be
> accurately managed (unless it's the last decoder in which case it can
> still work).

I think this should be expanded to fail if any decoder fails to
allocate anywhere in the topology otherwise it leaves a mess for
future address translation code to work through cases where decoder
information is missing.

The current approach is based around the current expectation that
nothing is enumerating pre-existing regions, and nothing is performing
address translation.

>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> ---
>  drivers/cxl/core/hdm.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index bfc8ee876278..c3c021b54079 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -255,6 +255,8 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
>                                       cxlhdm->regs.hdm_decoder, i);
>                 if (rc) {
>                         put_device(&cxld->dev);
> +                       if (is_endpoint_decoder(&cxld->dev))
> +                               return rc;
>                         failed++;
>                         continue;
>                 }
> --
> 2.35.1
>


* Re: [RFC PATCH 03/15] Revert "cxl/core: Convert decoder range to resource"
  2022-04-13 18:37 ` [RFC PATCH 03/15] Revert "cxl/core: Convert decoder range to resource" Ben Widawsky
@ 2022-04-13 21:43   ` Dan Williams
  2022-05-12 16:09     ` Ben Widawsky
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2022-04-13 21:43 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, Linux NVDIMM, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma

On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> This reverts commit 608135db1b790170d22848815c4671407af74e37. All

Did checkpatch not complain about this being in "commit
<12-character-commit-id> <commit summary format>"? However, I'd rather
just drop the revert language and say:

Change root decoders to reuse the existing ->range field to track the
decoder's programmed HPA range. The infrastructure to track the
allocations out of the root decoder range is still a work-in-progress,
but in the meantime it simplifies the code to always represent the
current decoder range setting in the ->range field regardless of
decoder type.

> decoders do have a host physical address space and the revert allows us
> to keep that uniformity. Decoder disambiguation will allow for decoder
> type-specific members which is needed, but will be handled separately.
>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
>
> ---
> The explanation for why it is impossible to make CFMWS ranges be
> iomem_resources is explain in a later patch.

This change stands alone / is independent of any iomem_resource concerns, right?

> ---
>  drivers/cxl/acpi.c      | 17 ++++++++++-------
>  drivers/cxl/core/hdm.c  |  2 +-
>  drivers/cxl/core/port.c | 28 ++++++----------------------
>  drivers/cxl/cxl.h       |  8 ++------
>  4 files changed, 19 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index d15a6aec0331..9b69955b90cb 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -108,8 +108,10 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
>
>         cxld->flags = cfmws_to_decoder_flags(cfmws->restrictions);
>         cxld->target_type = CXL_DECODER_EXPANDER;
> -       cxld->platform_res = (struct resource)DEFINE_RES_MEM(cfmws->base_hpa,
> -                                                            cfmws->window_size);
> +       cxld->range = (struct range){
> +               .start = cfmws->base_hpa,
> +               .end = cfmws->base_hpa + cfmws->window_size - 1,
> +       };
>         cxld->interleave_ways = CFMWS_INTERLEAVE_WAYS(cfmws);
>         cxld->interleave_granularity = CFMWS_INTERLEAVE_GRANULARITY(cfmws);
>
> @@ -119,13 +121,14 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
>         else
>                 rc = cxl_decoder_autoremove(dev, cxld);
>         if (rc) {
> -               dev_err(dev, "Failed to add decoder for %pr\n",
> -                       &cxld->platform_res);
> +               dev_err(dev, "Failed to add decoder for %#llx-%#llx\n",
> +                       cfmws->base_hpa,
> +                       cfmws->base_hpa + cfmws->window_size - 1);
>                 return 0;
>         }
> -       dev_dbg(dev, "add: %s node: %d range %pr\n", dev_name(&cxld->dev),
> -               phys_to_target_node(cxld->platform_res.start),
> -               &cxld->platform_res);
> +       dev_dbg(dev, "add: %s node: %d range %#llx-%#llx\n",
> +               dev_name(&cxld->dev), phys_to_target_node(cxld->range.start),
> +               cfmws->base_hpa, cfmws->base_hpa + cfmws->window_size - 1);
>
>         return 0;
>  }
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index c3c021b54079..3055e246aab9 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -172,7 +172,7 @@ static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
>                 return -ENXIO;
>         }
>
> -       cxld->decoder_range = (struct range) {
> +       cxld->range = (struct range) {
>                 .start = base,
>                 .end = base + size - 1,
>         };
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 74c8e47bf915..86f451ecb7ed 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -73,14 +73,8 @@ static ssize_t start_show(struct device *dev, struct device_attribute *attr,
>                           char *buf)
>  {
>         struct cxl_decoder *cxld = to_cxl_decoder(dev);
> -       u64 start;
>
> -       if (is_root_decoder(dev))
> -               start = cxld->platform_res.start;
> -       else
> -               start = cxld->decoder_range.start;
> -
> -       return sysfs_emit(buf, "%#llx\n", start);
> +       return sysfs_emit(buf, "%#llx\n", cxld->range.start);
>  }
>  static DEVICE_ATTR_ADMIN_RO(start);
>
> @@ -88,14 +82,8 @@ static ssize_t size_show(struct device *dev, struct device_attribute *attr,
>                         char *buf)
>  {
>         struct cxl_decoder *cxld = to_cxl_decoder(dev);
> -       u64 size;
>
> -       if (is_root_decoder(dev))
> -               size = resource_size(&cxld->platform_res);
> -       else
> -               size = range_len(&cxld->decoder_range);
> -
> -       return sysfs_emit(buf, "%#llx\n", size);
> +       return sysfs_emit(buf, "%#llx\n", range_len(&cxld->range));
>  }
>  static DEVICE_ATTR_RO(size);
>
> @@ -1228,7 +1216,10 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
>         cxld->interleave_ways = 1;
>         cxld->interleave_granularity = PAGE_SIZE;
>         cxld->target_type = CXL_DECODER_EXPANDER;
> -       cxld->platform_res = (struct resource)DEFINE_RES_MEM(0, 0);
> +       cxld->range = (struct range) {
> +               .start = 0,
> +               .end = -1,
> +       };
>
>         return cxld;
>  err:
> @@ -1342,13 +1333,6 @@ int cxl_decoder_add_locked(struct cxl_decoder *cxld, int *target_map)
>         if (rc)
>                 return rc;
>
> -       /*
> -        * Platform decoder resources should show up with a reasonable name. All
> -        * other resources are just sub ranges within the main decoder resource.
> -        */
> -       if (is_root_decoder(dev))
> -               cxld->platform_res.name = dev_name(dev);
> -
>         return device_add(dev);
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_decoder_add_locked, CXL);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 5102491e8d13..6517d5cdf5ee 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -197,8 +197,7 @@ enum cxl_decoder_type {
>   * struct cxl_decoder - CXL address range decode configuration
>   * @dev: this decoder's device
>   * @id: kernel device name id
> - * @platform_res: address space resources considered by root decoder
> - * @decoder_range: address space resources considered by midlevel decoder
> + * @range: address range considered by this decoder
>   * @interleave_ways: number of cxl_dports in this decode
>   * @interleave_granularity: data stride per dport
>   * @target_type: accelerator vs expander (type2 vs type3) selector
> @@ -210,10 +209,7 @@ enum cxl_decoder_type {
>  struct cxl_decoder {
>         struct device dev;
>         int id;
> -       union {
> -               struct resource platform_res;
> -               struct range decoder_range;
> -       };
> +       struct range range;
>         int interleave_ways;
>         int interleave_granularity;
>         enum cxl_decoder_type target_type;
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 04/15] cxl/core: Create distinct decoder structs
  2022-04-13 18:37 ` [RFC PATCH 04/15] cxl/core: Create distinct decoder structs Ben Widawsky
@ 2022-04-15  1:45   ` Dan Williams
  2022-04-18 20:43     ` Dan Williams
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2022-04-15  1:45 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, Linux NVDIMM, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma

On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> CXL HDM decoders have distinct properties at each level in the
> hierarchy. Root decoders manage host physical address space. Switch
> decoders manage demultiplexing of data to downstream targets. Endpoint
> decoders must be aware of physical media size constraints. To properly
> support these distinct needs, create a distinct structure for each
> decoder type.
>
> CXL HDM decoders do have similar architectural properties at all levels:
> interleave properties, flags, types and consumption of host physical
> address space. Those are retained and when possible, still utilized.
>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> ---
>  drivers/cxl/core/hdm.c       |   3 +-
>  drivers/cxl/core/port.c      | 102 ++++++++++++++++++++++++-----------
>  drivers/cxl/cxl.h            |  69 +++++++++++++++++++++---
>  tools/testing/cxl/test/cxl.c |   2 +-
>  4 files changed, 137 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 3055e246aab9..37c09c77e9a7 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -6,6 +6,7 @@
>
>  #include "cxlmem.h"
>  #include "core.h"
> +#include "cxl.h"
>
>  /**
>   * DOC: cxl core hdm
> @@ -242,7 +243,7 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
>                 struct cxl_decoder *cxld;
>
>                 if (is_cxl_endpoint(port))
> -                       cxld = cxl_endpoint_decoder_alloc(port);
> +                       cxld = &cxl_endpoint_decoder_alloc(port)->base;

Please split to:

cxled = cxl_endpoint_decoder_alloc(port);
cxld = &cxled->base;

>                 else
>                         cxld = cxl_switch_decoder_alloc(port, target_count);
>                 if (IS_ERR(cxld)) {
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 86f451ecb7ed..8dd29c97e318 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -121,18 +121,19 @@ static DEVICE_ATTR_RO(target_type);
>
>  static ssize_t emit_target_list(struct cxl_decoder *cxld, char *buf)
>  {
> +       struct cxl_decoder_targets *t = cxl_get_decoder_targets(cxld);
>         ssize_t offset = 0;
>         int i, rc = 0;
>
>         for (i = 0; i < cxld->interleave_ways; i++) {
> -               struct cxl_dport *dport = cxld->target[i];
> +               struct cxl_dport *dport = t->target[i];
>                 struct cxl_dport *next = NULL;
>
>                 if (!dport)
>                         break;
>
>                 if (i + 1 < cxld->interleave_ways)
> -                       next = cxld->target[i + 1];
> +                       next = t->target[i + 1];
>                 rc = sysfs_emit_at(buf, offset, "%d%s", dport->port_id,
>                                    next ? "," : "");
>                 if (rc < 0)
> @@ -147,14 +148,15 @@ static ssize_t target_list_show(struct device *dev,
>                                 struct device_attribute *attr, char *buf)
>  {
>         struct cxl_decoder *cxld = to_cxl_decoder(dev);
> +       struct cxl_decoder_targets *t = cxl_get_decoder_targets(cxld);
>         ssize_t offset;
>         unsigned int seq;
>         int rc;
>
>         do {
> -               seq = read_seqbegin(&cxld->target_lock);
> +               seq = read_seqbegin(&t->target_lock);
>                 rc = emit_target_list(cxld, buf);
> -       } while (read_seqretry(&cxld->target_lock, seq));
> +       } while (read_seqretry(&t->target_lock, seq));
>
>         if (rc < 0)
>                 return rc;
> @@ -199,23 +201,6 @@ static const struct attribute_group *cxl_decoder_root_attribute_groups[] = {
>         NULL,
>  };
>
> -static struct attribute *cxl_decoder_switch_attrs[] = {
> -       &dev_attr_target_type.attr,
> -       &dev_attr_target_list.attr,
> -       NULL,
> -};
> -
> -static struct attribute_group cxl_decoder_switch_attribute_group = {
> -       .attrs = cxl_decoder_switch_attrs,
> -};
> -
> -static const struct attribute_group *cxl_decoder_switch_attribute_groups[] = {
> -       &cxl_decoder_switch_attribute_group,
> -       &cxl_decoder_base_attribute_group,
> -       &cxl_base_attribute_group,
> -       NULL,
> -};
> -
>  static struct attribute *cxl_decoder_endpoint_attrs[] = {
>         &dev_attr_target_type.attr,
>         NULL,
> @@ -232,6 +217,12 @@ static const struct attribute_group *cxl_decoder_endpoint_attribute_groups[] = {
>         NULL,
>  };
>
> +static const struct attribute_group *cxl_decoder_switch_attribute_groups[] = {
> +       &cxl_decoder_base_attribute_group,
> +       &cxl_base_attribute_group,
> +       NULL,
> +};
> +
>  static void cxl_decoder_release(struct device *dev)
>  {
>         struct cxl_decoder *cxld = to_cxl_decoder(dev);
> @@ -264,6 +255,7 @@ bool is_endpoint_decoder(struct device *dev)
>  {
>         return dev->type == &cxl_decoder_endpoint_type;
>  }
> +EXPORT_SYMBOL_NS_GPL(is_endpoint_decoder, CXL);
>
>  bool is_root_decoder(struct device *dev)
>  {
> @@ -1136,6 +1128,7 @@ EXPORT_SYMBOL_NS_GPL(cxl_find_dport_by_dev, CXL);
>  static int decoder_populate_targets(struct cxl_decoder *cxld,
>                                     struct cxl_port *port, int *target_map)
>  {
> +       struct cxl_decoder_targets *t = cxl_get_decoder_targets(cxld);
>         int i, rc = 0;
>
>         if (!target_map)
> @@ -1146,21 +1139,72 @@ static int decoder_populate_targets(struct cxl_decoder *cxld,
>         if (list_empty(&port->dports))
>                 return -EINVAL;
>
> -       write_seqlock(&cxld->target_lock);
> -       for (i = 0; i < cxld->nr_targets; i++) {
> +       write_seqlock(&t->target_lock);
> +       for (i = 0; i < t->nr_targets; i++) {
>                 struct cxl_dport *dport = find_dport(port, target_map[i]);
>
>                 if (!dport) {
>                         rc = -ENXIO;
>                         break;
>                 }
> -               cxld->target[i] = dport;
> +               t->target[i] = dport;
>         }
> -       write_sequnlock(&cxld->target_lock);
> +       write_sequnlock(&t->target_lock);
>
>         return rc;
>  }
>
> +static struct cxl_decoder *__cxl_decoder_alloc(struct cxl_port *port,
> +                                              unsigned int nr_targets)
> +{
> +       struct cxl_decoder *cxld;
> +
> +       if (is_cxl_endpoint(port)) {
> +               struct cxl_endpoint_decoder *cxled;
> +
> +               cxled = kzalloc(sizeof(*cxled), GFP_KERNEL);
> +               if (!cxled)
> +                       return NULL;
> +               cxld = &cxled->base;
> +       } else if (is_cxl_root(port)) {
> +               struct cxl_root_decoder *cxlrd;
> +
> +               cxlrd = kzalloc(sizeof(*cxlrd), GFP_KERNEL);
> +               if (!cxlrd)
> +                       return NULL;
> +
> +               cxlrd->targets =
> +                       kzalloc(struct_size(cxlrd->targets, target, nr_targets),
> +                               GFP_KERNEL);
> +               if (!cxlrd->targets) {
> +                       kfree(cxlrd);
> +                       return NULL;
> +               }
> +               cxlrd->targets->nr_targets = nr_targets;
> +               seqlock_init(&cxlrd->targets->target_lock);
> +               cxld = &cxlrd->base;
> +       } else {
> +               struct cxl_switch_decoder *cxlsd;
> +
> +               cxlsd = kzalloc(sizeof(*cxlsd), GFP_KERNEL);
> +               if (!cxlsd)
> +                       return NULL;
> +
> +               cxlsd->targets =
> +                       kzalloc(struct_size(cxlsd->targets, target, nr_targets),
> +                               GFP_KERNEL);
> +               if (!cxlsd->targets) {
> +                       kfree(cxlsd);
> +                       return NULL;
> +               }
> +               cxlsd->targets->nr_targets = nr_targets;
> +               seqlock_init(&cxlsd->targets->target_lock);
> +               cxld = &cxlsd->base;
> +       }
> +
> +       return cxld;
> +}
> +
>  /**
>   * cxl_decoder_alloc - Allocate a new CXL decoder
>   * @port: owning port of this decoder
> @@ -1186,7 +1230,7 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
>         if (nr_targets > CXL_DECODER_MAX_INTERLEAVE)
>                 return ERR_PTR(-EINVAL);
>
> -       cxld = kzalloc(struct_size(cxld, target, nr_targets), GFP_KERNEL);
> +       cxld = __cxl_decoder_alloc(port, nr_targets);
>         if (!cxld)
>                 return ERR_PTR(-ENOMEM);
>
> @@ -1198,8 +1242,6 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
>         get_device(&port->dev);
>         cxld->id = rc;
>
> -       cxld->nr_targets = nr_targets;
> -       seqlock_init(&cxld->target_lock);
>         dev = &cxld->dev;
>         device_initialize(dev);
>         device_set_pm_not_required(dev);
> @@ -1274,12 +1316,12 @@ EXPORT_SYMBOL_NS_GPL(cxl_switch_decoder_alloc, CXL);
>   *
>   * Return: A new cxl decoder to be registered by cxl_decoder_add()
>   */
> -struct cxl_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
> +struct cxl_endpoint_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
>  {
>         if (!is_cxl_endpoint(port))
>                 return ERR_PTR(-EINVAL);
>
> -       return cxl_decoder_alloc(port, 0);
> +       return to_cxl_endpoint_decoder(cxl_decoder_alloc(port, 0));
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_endpoint_decoder_alloc, CXL);
>
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 6517d5cdf5ee..85fd5e84f978 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -193,6 +193,18 @@ enum cxl_decoder_type {
>   */
>  #define CXL_DECODER_MAX_INTERLEAVE 16
>
> +/**
> + * struct cxl_decoder_targets - Target information for root and switch decoders.
> + * @target_lock: coordinate coherent reads of the target list
> + * @nr_targets: number of elements in @target
> + * @target: active ordered target list in current decoder configuration
> + */
> +struct cxl_decoder_targets {
> +       seqlock_t target_lock;
> +       int nr_targets;
> +       struct cxl_dport *target[];
> +};
> +
>  /**
>   * struct cxl_decoder - CXL address range decode configuration
>   * @dev: this decoder's device
> @@ -202,9 +214,6 @@ enum cxl_decoder_type {
>   * @interleave_granularity: data stride per dport
>   * @target_type: accelerator vs expander (type2 vs type3) selector
>   * @flags: memory type capabilities and locking
> - * @target_lock: coordinate coherent reads of the target list
> - * @nr_targets: number of elements in @target
> - * @target: active ordered target list in current decoder configuration
>   */
>  struct cxl_decoder {
>         struct device dev;
> @@ -214,11 +223,46 @@ struct cxl_decoder {
>         int interleave_granularity;
>         enum cxl_decoder_type target_type;
>         unsigned long flags;
> -       seqlock_t target_lock;
> -       int nr_targets;
> -       struct cxl_dport *target[];
>  };
>
> +/**
> + * struct cxl_endpoint_decoder - A decoder residing in a CXL endpoint.
> + * @base: Base class decoder
> + */
> +struct cxl_endpoint_decoder {
> +       struct cxl_decoder base;
> +};
> +
> +/**
> + * struct cxl_switch_decoder - A decoder in a switch or hostbridge.
> + * @base: Base class decoder
> + * @targets: Downstream targets for this switch.
> + */
> +struct cxl_switch_decoder {
> +       struct cxl_decoder base;
> +       struct cxl_decoder_targets *targets;

Please no double-allocation when not necessary. This can be

struct cxl_switch_decoder {
       struct cxl_decoder base;
       struct cxl_decoder_targets targets;
};

...and then allocated with a single:

cxlsd = kzalloc(struct_size(cxlsd, targets.target, nr_targets), GFP_KERNEL);

...or something like that (not compile tested).
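
A minimal userspace sketch of that single-allocation layout can be
compiled to check the idea; kzalloc() and struct_size() are replaced by
calloc() and an open-coded size computation, and the struct members are
trimmed stand-ins for the ones in the patch, not the real kernel
definitions:

```c
#include <assert.h>
#include <stdlib.h>

struct cxl_dport;			/* opaque, pointers only */

struct cxl_decoder {
	int id;
};

struct cxl_decoder_targets {
	int nr_targets;
	struct cxl_dport *target[];	/* flexible array member */
};

/*
 * Embedding the flexible-array struct as the *last* member lets the
 * whole object come from one allocation. Note: a struct containing a
 * flexible array member nested inside another struct is a GNU C
 * extension, one the kernel already relies on elsewhere.
 */
struct cxl_switch_decoder {
	struct cxl_decoder base;
	struct cxl_decoder_targets targets;	/* must stay last */
};

static struct cxl_switch_decoder *switch_decoder_alloc(int nr_targets)
{
	/* open-coded equivalent of struct_size(cxlsd, targets.target, n) */
	size_t size = sizeof(struct cxl_switch_decoder) +
		      (size_t)nr_targets * sizeof(struct cxl_dport *);
	struct cxl_switch_decoder *cxlsd = calloc(1, size);

	if (cxlsd)
		cxlsd->targets.nr_targets = nr_targets;
	return cxlsd;
}
```

The same shape would apply to cxl_root_decoder, which is what makes the
double-allocation in the posted version avoidable.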

> +};
> +
> +/**
> + * struct cxl_root_decoder - A toplevel/platform decoder
> + * @base: Base class decoder
> + * @targets: Downstream targets (i.e. host bridges).
> + */
> +struct cxl_root_decoder {
> +       struct cxl_decoder base;
> +       struct cxl_decoder_targets *targets;
> +};

Ditto single allocation feedback...

...although now that struct cxl_root_decoder is identical to
cxl_switch_decoder is there any benefit to making them distinct types
beyond being pedantic? Making them the same type means
cxl_get_decoder_targets() can go because all callers of that are
already in known !cxl_endpoint_decoder paths. But of course ignore
this if cxl_root_decoder is going to game some differentiating
attributes down the road, I have not looked ahead in this series.

> +
> +#define _to_cxl_decoder(x)                                                     \
> +       static inline struct cxl_##x##_decoder *to_cxl_##x##_decoder(          \
> +               struct cxl_decoder *cxld)                                      \
> +       {                                                                      \
> +               return container_of(cxld, struct cxl_##x##_decoder, base);     \
> +       }
> +
> +_to_cxl_decoder(root)
> +_to_cxl_decoder(switch)
> +_to_cxl_decoder(endpoint)
>
>  /**
>   * enum cxl_nvdimm_brige_state - state machine for managing bus rescans
> @@ -343,11 +387,22 @@ struct cxl_decoder *cxl_root_decoder_alloc(struct cxl_port *port,
>  struct cxl_decoder *cxl_switch_decoder_alloc(struct cxl_port *port,
>                                              unsigned int nr_targets);
>  int cxl_decoder_add(struct cxl_decoder *cxld, int *target_map);
> -struct cxl_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port);
> +struct cxl_endpoint_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port);
>  int cxl_decoder_add_locked(struct cxl_decoder *cxld, int *target_map);
>  int cxl_decoder_autoremove(struct device *host, struct cxl_decoder *cxld);
>  int cxl_endpoint_autoremove(struct cxl_memdev *cxlmd, struct cxl_port *endpoint);
>
> +static inline struct cxl_decoder_targets *
> +cxl_get_decoder_targets(struct cxl_decoder *cxld)
> +{
> +       if (is_root_decoder(&cxld->dev))
> +               return to_cxl_root_decoder(cxld)->targets;
> +       else if (is_endpoint_decoder(&cxld->dev))
> +               return NULL;
> +       else
> +               return to_cxl_switch_decoder(cxld)->targets;
> +}
> +
>  struct cxl_hdm;
>  struct cxl_hdm *devm_cxl_setup_hdm(struct cxl_port *port);
>  int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm);
> diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
> index 431f2bddf6c8..0534d96486eb 100644
> --- a/tools/testing/cxl/test/cxl.c
> +++ b/tools/testing/cxl/test/cxl.c
> @@ -454,7 +454,7 @@ static int mock_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
>                 if (target_count)
>                         cxld = cxl_switch_decoder_alloc(port, target_count);
>                 else
> -                       cxld = cxl_endpoint_decoder_alloc(port);
> +                       cxld = &cxl_endpoint_decoder_alloc(port)->base;

Same preference to split this into two steps.

>                 if (IS_ERR(cxld)) {
>                         dev_warn(&port->dev,
>                                  "Failed to allocate the decoder\n");
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 01/15] cxl/core: Use is_endpoint_decoder
       [not found]   ` <CGME20220415205052uscas1p209e03abf95b9c80b2ba1f287c82dfd80@uscas1p2.samsung.com>
@ 2022-04-15 20:50     ` Adam Manzanares
  0 siblings, 0 replies; 53+ messages in thread
From: Adam Manzanares @ 2022-04-15 20:50 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, nvdimm, patches, Alison Schofield, Dan Williams,
	Ira Weiny, Jonathan Cameron, Vishal Verma

On Wed, Apr 13, 2022 at 11:37:06AM -0700, Ben Widawsky wrote:
> Save some characters and directly check decoder type rather than port
> type. There's no need to check if the port is an endpoint port since we
> already know the decoder, after alloc, has a specified type.
> 
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> ---
>  drivers/cxl/core/hdm.c  | 2 +-
>  drivers/cxl/core/port.c | 2 +-
>  drivers/cxl/cxl.h       | 1 +
>  3 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 0e89a7a932d4..bfc8ee876278 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -197,7 +197,7 @@ static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
>  	else
>  		cxld->target_type = CXL_DECODER_ACCELERATOR;
>  
> -	if (is_cxl_endpoint(to_cxl_port(cxld->dev.parent)))
> +	if (is_endpoint_decoder(&cxld->dev))
>  		return 0;
>  
>  	target_list.value =
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 2ab1ba4499b3..74c8e47bf915 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -272,7 +272,7 @@ static const struct device_type cxl_decoder_root_type = {
>  	.groups = cxl_decoder_root_attribute_groups,
>  };
>  
> -static bool is_endpoint_decoder(struct device *dev)
> +bool is_endpoint_decoder(struct device *dev)
>  {
>  	return dev->type == &cxl_decoder_endpoint_type;
>  }
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 990b6670222e..5102491e8d13 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -340,6 +340,7 @@ struct cxl_dport *cxl_find_dport_by_dev(struct cxl_port *port,
>  
>  struct cxl_decoder *to_cxl_decoder(struct device *dev);
>  bool is_root_decoder(struct device *dev);
> +bool is_endpoint_decoder(struct device *dev);
>  bool is_cxl_decoder(struct device *dev);
>  struct cxl_decoder *cxl_root_decoder_alloc(struct cxl_port *port,
>  					   unsigned int nr_targets);
> -- 
> 2.35.1
> 
>


Looks good.

Reviewed-by: Adam Manzanares <a.manzanares@samsung.com>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 02/15] cxl/core/hdm: Bail on endpoint init fail
       [not found]     ` <CGME20220418163713uscas1p17b3b1b45c7d27e54e3ecb62eb8af2469@uscas1p1.samsung.com>
@ 2022-04-18 16:37       ` Adam Manzanares
  2022-05-12 15:50         ` Ben Widawsky
  0 siblings, 1 reply; 53+ messages in thread
From: Adam Manzanares @ 2022-04-18 16:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ben Widawsky, linux-cxl, Linux NVDIMM, patches, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, mcgrof

On Wed, Apr 13, 2022 at 02:31:42PM -0700, Dan Williams wrote:
> On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > Endpoint decoder enumeration is the only way in which we can determine
> > Device Physical Address (DPA) -> Host Physical Address (HPA) mappings.
> > Information is obtained only when the register state can be read
> > sequentially. If a failure occurs while enumerating the decoders, all
> > other decoders must also fail, since the decoders can no longer be
> > accurately managed (unless it is the last decoder, in which case it
> > can still work).
> 
> I think this should be expanded to fail if any decoder fails to
> allocate anywhere in the topology otherwise it leaves a mess for
> future address translation code to work through cases where decoder
> information is missing.
> 
> The current approach is based around the current expectation that
> nothing is enumerating pre-existing regions, and nothing is performing
> address translation.

Does the QEMU support currently allow testing of this patch? If so, it
would be good to reference QEMU configurations. Any other alternatives
would be welcome as well.

+Luis on cc.

> 
> >
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > ---
> >  drivers/cxl/core/hdm.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index bfc8ee876278..c3c021b54079 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -255,6 +255,8 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
> >                                       cxlhdm->regs.hdm_decoder, i);
> >                 if (rc) {
> >                         put_device(&cxld->dev);
> > +                       if (is_endpoint_decoder(&cxld->dev))
> > +                               return rc;
> >                         failed++;
> >                         continue;
> >                 }
> > --
> > 2.35.1
> >
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 05/15] cxl/acpi: Reserve CXL resources from request_free_mem_region
  2022-04-13 18:37 ` [RFC PATCH 05/15] cxl/acpi: Reserve CXL resources from request_free_mem_region Ben Widawsky
@ 2022-04-18 16:42   ` Dan Williams
  2022-04-19 16:43     ` Jason Gunthorpe
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2022-04-18 16:42 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, Linux NVDIMM, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Christoph Hellwig,
	Jason Gunthorpe, John Hubbard

[ add the usual HMM suspects Christoph, Jason, and John ]

On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> Define an API which allows CXL drivers to manage CXL address space.
> CXL is unique in that the address space and various properties are only
> known after CXL drivers come up, and therefore cannot be part of core
> memory enumeration.

I think this buries the lead on the problem introduced by
MEMORY_DEVICE_PRIVATE in the first place. Let's revisit that history
before diving into what CXL needs.

---

Commit 4ef589dc9b10 ("mm/hmm/devmem: device memory hotplug using
ZONE_DEVICE") introduced the concept of MEMORY_DEVICE_PRIVATE. At its
core MEMORY_DEVICE_PRIVATE uses the ZONE_DEVICE capability to annotate
an "unused" physical address range with 'struct page' for the purpose
of coordinating migration of buffers onto and off of a GPU /
accelerator. The determination of "unused" was based on a heuristic,
not a guarantee, that any address range not expressly conveyed in the
platform firmware map of the system can be repurposed for software
use. The CXL Fixed Memory Window Structure (CFMWS) definition
explicitly breaks the assumptions of that heuristic.

---

...and then jump into what CFMWS is and the proposal to coordinate
with request_free_mem_region().


>
> Compute Express Link 2.0 [ECN] defines a concept called CXL Fixed Memory
> Window Structures (CFMWS). Each CFMWS conveys a region of host physical
> address (HPA) space which has certain properties that are familiar to
> CXL, mainly interleave properties, and restrictions, such as
> persistence. The HPA ranges therefore should be owned, or at least
> guided by the relevant CXL driver, cxl_acpi [1].
>
> It would be desirable to simply insert this address space into
> iomem_resource with a new flag to denote this is CXL memory. This would
> permit request_free_mem_region() to be reused for CXL memory provided it
> learned some new tricks. For that, it is tempting to simply use
> insert_resource(). The API was designed specifically for cases where new
> devices may offer new address space. This cannot work in the general
> case. Boot firmware can pass some, none, or all of the CFMWS range as
> various types of memory to the kernel, and this may be left alone,
> merged, or even expanded.

s/expanded/expanded as the memory map is parsed and reconciled/

> As a result iomem_resource may intersect CFMWS
> regions in ways insert_resource cannot handle [2]. Similar reasoning
> applies to allocate_resource().
>
> With the insert_resource option out, the only reasonable approach left
> is to let the CXL driver manage the address space independently of
> iomem_resource and attempt to prevent users of device private memory

s/device private memory/MEMORY_DEVICE_PRIVATE/

> APIs from using CXL memory. In the case where cxl_acpi comes up first,
> the new API allows cxl to block use of any CFMWS defined address space
> by assuming everything above the highest CFMWS entry is fair game. It is
> expected that this effectively will prevent usage of device private
> memory,

No, only if CFMWS consumes the full 64-bit address space which is
unlikely. It's also unlikely going forward to need
MEMORY_DEVICE_PRIVATE when hardware supports CXL for fully coherent
migration of buffers onto and off of an accelearator.

> but if such behavior is undesired, cxl_acpi can be blocked from
> loading, or unloaded.

I would just say that if MEMORY_DEVICE_PRIVATE needs exceed the memory
space left over by CXL, then the loading of the dynamic CXL address
space allocation infrastructure can be deferred until after
MEMORY_DEVICE_PRIVATE consumers have loaded.

> When device private memory is used before CXL
> comes up, or, there are intersections as described above, the CXL driver
> will have to make sure to not reuse sysram that is BUSY.
>
> [1]: The specification defines enumeration via ACPI, however, one could
> envision devicetree, or some other hardcoded mechanisms for doing the
> same thing.
>
> [2]: A common way to hit this case is when BIOS creates a volatile
> region with extra space for hotplug. In this case, you're likely to have
>
> |<--------------HPA space---------------------->|
> |<---iomem_resource -->|
> | DDR  | CXL Volatile  |
> |      | CFMWS for volatile w/ hotplug |
>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> ---
>  drivers/cxl/acpi.c     | 26 ++++++++++++++++++++++++++
>  include/linux/ioport.h |  1 +
>  kernel/resource.c      | 11 ++++++++++-
>  3 files changed, 37 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index 9b69955b90cb..0870904fe4b5 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -76,6 +76,7 @@ static int cxl_acpi_cfmws_verify(struct device *dev,
>  struct cxl_cfmws_context {
>         struct device *dev;
>         struct cxl_port *root_port;
> +       struct acpi_cedt_cfmws *high_cfmws;

Seems more straightforward to track the max 'end' address seen so far
rather than the "highest" cfmws entry.

>  };
>
>  static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
> @@ -126,6 +127,14 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
>                         cfmws->base_hpa + cfmws->window_size - 1);
>                 return 0;
>         }
> +
> +       if (ctx->high_cfmws) {
> +               if (cfmws->base_hpa > ctx->high_cfmws->base_hpa)
> +                       ctx->high_cfmws = cfmws;

I'd expect:

end = cfmws->base_hpa + window_size;
if (ctx->cfmws_max < end)
   ctx->cfmws_max = end;

> +       } else {
> +               ctx->high_cfmws = cfmws;
> +       }
> +
>         dev_dbg(dev, "add: %s node: %d range %#llx-%#llx\n",
>                 dev_name(&cxld->dev), phys_to_target_node(cxld->range.start),
>                 cfmws->base_hpa, cfmws->base_hpa + cfmws->window_size - 1);
> @@ -299,6 +308,7 @@ static int cxl_acpi_probe(struct platform_device *pdev)
>         ctx = (struct cxl_cfmws_context) {
>                 .dev = host,
>                 .root_port = root_port,
> +               .high_cfmws = NULL,
>         };
>         acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, cxl_parse_cfmws, &ctx);
>
> @@ -317,10 +327,25 @@ static int cxl_acpi_probe(struct platform_device *pdev)
>         if (rc < 0)
>                 return rc;
>
> +       if (ctx.high_cfmws) {

Even if there are zero CFMWS entries there will always be a max end
address to call set_request_free_min_base().

> +               resource_size_t end =
> +                       ctx.high_cfmws->base_hpa + ctx.high_cfmws->window_size;
> +               dev_dbg(host,
> +                       "Disabling free device private regions below %#llx\n",
> +                       end);
> +               set_request_free_min_base(end);
> +       }
> +
>         /* In case PCI is scanned before ACPI re-trigger memdev attach */
>         return cxl_bus_rescan();
>  }
>
> +static int cxl_acpi_remove(struct platform_device *pdev)

No need for a .remove() method, just use devm_add_action_or_reset() to
unreserve CXL address space as cxl_acpi unloads.

> +{
> +       set_request_free_min_base(0);
> +       return 0;
> +}
> +
>  static const struct acpi_device_id cxl_acpi_ids[] = {
>         { "ACPI0017" },
>         { },
> @@ -329,6 +354,7 @@ MODULE_DEVICE_TABLE(acpi, cxl_acpi_ids);
>
>  static struct platform_driver cxl_acpi_driver = {
>         .probe = cxl_acpi_probe,
> +       .remove = cxl_acpi_remove,
>         .driver = {
>                 .name = KBUILD_MODNAME,
>                 .acpi_match_table = cxl_acpi_ids,
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index ec5f71f7135b..dc41e4be5635 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -325,6 +325,7 @@ extern int
>  walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start, u64 end,
>                     void *arg, int (*func)(struct resource *, void *));
>
> +void set_request_free_min_base(resource_size_t val);

Shouldn't there also be a static inline empty routine in the
CONFIG_DEVICE_PRIVATE=n case?

>  struct resource *devm_request_free_mem_region(struct device *dev,
>                 struct resource *base, unsigned long size);
>  struct resource *request_free_mem_region(struct resource *base,
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 34eaee179689..a4750689e529 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -1774,6 +1774,14 @@ void resource_list_free(struct list_head *head)
>  EXPORT_SYMBOL(resource_list_free);
>
>  #ifdef CONFIG_DEVICE_PRIVATE
> +static resource_size_t request_free_min_base;
> +
> +void set_request_free_min_base(resource_size_t val)
> +{
> +       request_free_min_base = val;
> +}
> +EXPORT_SYMBOL_GPL(set_request_free_min_base);
> +
>  static struct resource *__request_free_mem_region(struct device *dev,
>                 struct resource *base, unsigned long size, const char *name)
>  {
> @@ -1799,7 +1807,8 @@ static struct resource *__request_free_mem_region(struct device *dev,
>         }
>
>         write_lock(&resource_lock);
> -       for (; addr > size && addr >= base->start; addr -= size) {
> +       for (; addr > size && addr >= max(base->start, request_free_min_base);
> +            addr -= size) {
>                 if (__region_intersects(addr, size, 0, IORES_DESC_NONE) !=
>                                 REGION_DISJOINT)
>                         continue;
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 04/15] cxl/core: Create distinct decoder structs
  2022-04-15  1:45   ` Dan Williams
@ 2022-04-18 20:43     ` Dan Williams
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2022-04-18 20:43 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, Linux NVDIMM, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma

On Thu, Apr 14, 2022 at 6:45 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > CXL HDM decoders have distinct properties at each level in the
> > hierarchy. Root decoders manage host physical address space. Switch
> > decoders manage demultiplexing of data to downstream targets. Endpoint
> > decoders must be aware of physical media size constraints. To properly
> > support these unique needs, create these unique structures.
> >
> > CXL HDM decoders do have similar architectural properties at all levels:
> > interleave properties, flags, types and consumption of host physical
> > address space. Those are retained and when possible, still utilized.
> >
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > ---
> >  drivers/cxl/core/hdm.c       |   3 +-
> >  drivers/cxl/core/port.c      | 102 ++++++++++++++++++++++++-----------
> >  drivers/cxl/cxl.h            |  69 +++++++++++++++++++++---
> >  tools/testing/cxl/test/cxl.c |   2 +-
> >  4 files changed, 137 insertions(+), 39 deletions(-)
> >
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index 3055e246aab9..37c09c77e9a7 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -6,6 +6,7 @@
> >
> >  #include "cxlmem.h"
> >  #include "core.h"
> > +#include "cxl.h"
> >
> >  /**
> >   * DOC: cxl core hdm
> > @@ -242,7 +243,7 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
> >                 struct cxl_decoder *cxld;
> >
> >                 if (is_cxl_endpoint(port))
> > -                       cxld = cxl_endpoint_decoder_alloc(port);
> > +                       cxld = &cxl_endpoint_decoder_alloc(port)->base;
>
> Please split to:
>
> cxled = cxl_endpoint_decoder_alloc(port);
> cxld = &cxled->base;
>
> >                 else
> >                         cxld = cxl_switch_decoder_alloc(port, target_count);
> >                 if (IS_ERR(cxld)) {
> > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > index 86f451ecb7ed..8dd29c97e318 100644
> > --- a/drivers/cxl/core/port.c
> > +++ b/drivers/cxl/core/port.c
> > @@ -121,18 +121,19 @@ static DEVICE_ATTR_RO(target_type);
> >
> >  static ssize_t emit_target_list(struct cxl_decoder *cxld, char *buf)
> >  {
> > +       struct cxl_decoder_targets *t = cxl_get_decoder_targets(cxld);
> >         ssize_t offset = 0;
> >         int i, rc = 0;
> >
> >         for (i = 0; i < cxld->interleave_ways; i++) {
> > -               struct cxl_dport *dport = cxld->target[i];
> > +               struct cxl_dport *dport = t->target[i];
> >                 struct cxl_dport *next = NULL;
> >
> >                 if (!dport)
> >                         break;
> >
> >                 if (i + 1 < cxld->interleave_ways)
> > -                       next = cxld->target[i + 1];
> > +                       next = t->target[i + 1];
> >                 rc = sysfs_emit_at(buf, offset, "%d%s", dport->port_id,
> >                                    next ? "," : "");
> >                 if (rc < 0)
> > @@ -147,14 +148,15 @@ static ssize_t target_list_show(struct device *dev,
> >                                 struct device_attribute *attr, char *buf)
> >  {
> >         struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > +       struct cxl_decoder_targets *t = cxl_get_decoder_targets(cxld);
> >         ssize_t offset;
> >         unsigned int seq;
> >         int rc;
> >
> >         do {
> > -               seq = read_seqbegin(&cxld->target_lock);
> > +               seq = read_seqbegin(&t->target_lock);
> >                 rc = emit_target_list(cxld, buf);
> > -       } while (read_seqretry(&cxld->target_lock, seq));
> > +       } while (read_seqretry(&t->target_lock, seq));
> >
> >         if (rc < 0)
> >                 return rc;
> > @@ -199,23 +201,6 @@ static const struct attribute_group *cxl_decoder_root_attribute_groups[] = {
> >         NULL,
> >  };
> >
> > -static struct attribute *cxl_decoder_switch_attrs[] = {
> > -       &dev_attr_target_type.attr,
> > -       &dev_attr_target_list.attr,
> > -       NULL,
> > -};
> > -
> > -static struct attribute_group cxl_decoder_switch_attribute_group = {
> > -       .attrs = cxl_decoder_switch_attrs,
> > -};
> > -
> > -static const struct attribute_group *cxl_decoder_switch_attribute_groups[] = {
> > -       &cxl_decoder_switch_attribute_group,
> > -       &cxl_decoder_base_attribute_group,
> > -       &cxl_base_attribute_group,
> > -       NULL,
> > -};
> > -
> >  static struct attribute *cxl_decoder_endpoint_attrs[] = {
> >         &dev_attr_target_type.attr,
> >         NULL,
> > @@ -232,6 +217,12 @@ static const struct attribute_group *cxl_decoder_endpoint_attribute_groups[] = {
> >         NULL,
> >  };
> >
> > +static const struct attribute_group *cxl_decoder_switch_attribute_groups[] = {
> > +       &cxl_decoder_base_attribute_group,
> > +       &cxl_base_attribute_group,
> > +       NULL,
> > +};
> > +
> >  static void cxl_decoder_release(struct device *dev)
> >  {
> >         struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > @@ -264,6 +255,7 @@ bool is_endpoint_decoder(struct device *dev)
> >  {
> >         return dev->type == &cxl_decoder_endpoint_type;
> >  }
> > +EXPORT_SYMBOL_NS_GPL(is_endpoint_decoder, CXL);
> >
> >  bool is_root_decoder(struct device *dev)
> >  {
> > @@ -1136,6 +1128,7 @@ EXPORT_SYMBOL_NS_GPL(cxl_find_dport_by_dev, CXL);
> >  static int decoder_populate_targets(struct cxl_decoder *cxld,
> >                                     struct cxl_port *port, int *target_map)
> >  {
> > +       struct cxl_decoder_targets *t = cxl_get_decoder_targets(cxld);
> >         int i, rc = 0;
> >
> >         if (!target_map)
> > @@ -1146,21 +1139,72 @@ static int decoder_populate_targets(struct cxl_decoder *cxld,
> >         if (list_empty(&port->dports))
> >                 return -EINVAL;
> >
> > -       write_seqlock(&cxld->target_lock);
> > -       for (i = 0; i < cxld->nr_targets; i++) {
> > +       write_seqlock(&t->target_lock);
> > +       for (i = 0; i < t->nr_targets; i++) {
> >                 struct cxl_dport *dport = find_dport(port, target_map[i]);
> >
> >                 if (!dport) {
> >                         rc = -ENXIO;
> >                         break;
> >                 }
> > -               cxld->target[i] = dport;
> > +               t->target[i] = dport;
> >         }
> > -       write_sequnlock(&cxld->target_lock);
> > +       write_sequnlock(&t->target_lock);
> >
> >         return rc;
> >  }
> >
> > +static struct cxl_decoder *__cxl_decoder_alloc(struct cxl_port *port,
> > +                                              unsigned int nr_targets)
> > +{
> > +       struct cxl_decoder *cxld;
> > +
> > +       if (is_cxl_endpoint(port)) {
> > +               struct cxl_endpoint_decoder *cxled;
> > +
> > +               cxled = kzalloc(sizeof(*cxled), GFP_KERNEL);
> > +               if (!cxled)
> > +                       return NULL;
> > +               cxld = &cxled->base;
> > +       } else if (is_cxl_root(port)) {
> > +               struct cxl_root_decoder *cxlrd;
> > +
> > +               cxlrd = kzalloc(sizeof(*cxlrd), GFP_KERNEL);
> > +               if (!cxlrd)
> > +                       return NULL;
> > +
> > +               cxlrd->targets =
> > +                       kzalloc(struct_size(cxlrd->targets, target, nr_targets),
> > +                               GFP_KERNEL);
> > +               if (!cxlrd->targets) {
> > +                       kfree(cxlrd);
> > +                       return NULL;
> > +               }
> > +               cxlrd->targets->nr_targets = nr_targets;
> > +               seqlock_init(&cxlrd->targets->target_lock);
> > +               cxld = &cxlrd->base;
> > +       } else {
> > +               struct cxl_switch_decoder *cxlsd;
> > +
> > +               cxlsd = kzalloc(sizeof(*cxlsd), GFP_KERNEL);
> > +               if (!cxlsd)
> > +                       return NULL;
> > +
> > +               cxlsd->targets =
> > +                       kzalloc(struct_size(cxlsd->targets, target, nr_targets),
> > +                               GFP_KERNEL);
> > +               if (!cxlsd->targets) {
> > +                       kfree(cxlsd);
> > +                       return NULL;
> > +               }
> > +               cxlsd->targets->nr_targets = nr_targets;
> > +               seqlock_init(&cxlsd->targets->target_lock);
> > +               cxld = &cxlsd->base;
> > +       }
> > +
> > +       return cxld;
> > +}
> > +
> >  /**
> >   * cxl_decoder_alloc - Allocate a new CXL decoder
> >   * @port: owning port of this decoder
> > @@ -1186,7 +1230,7 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
> >         if (nr_targets > CXL_DECODER_MAX_INTERLEAVE)
> >                 return ERR_PTR(-EINVAL);
> >
> > -       cxld = kzalloc(struct_size(cxld, target, nr_targets), GFP_KERNEL);
> > +       cxld = __cxl_decoder_alloc(port, nr_targets);
> >         if (!cxld)
> >                 return ERR_PTR(-ENOMEM);
> >
> > @@ -1198,8 +1242,6 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
> >         get_device(&port->dev);
> >         cxld->id = rc;
> >
> > -       cxld->nr_targets = nr_targets;
> > -       seqlock_init(&cxld->target_lock);
> >         dev = &cxld->dev;
> >         device_initialize(dev);
> >         device_set_pm_not_required(dev);
> > @@ -1274,12 +1316,12 @@ EXPORT_SYMBOL_NS_GPL(cxl_switch_decoder_alloc, CXL);
> >   *
> >   * Return: A new cxl decoder to be registered by cxl_decoder_add()
> >   */
> > -struct cxl_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
> > +struct cxl_endpoint_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
> >  {
> >         if (!is_cxl_endpoint(port))
> >                 return ERR_PTR(-EINVAL);
> >
> > -       return cxl_decoder_alloc(port, 0);
> > +       return to_cxl_endpoint_decoder(cxl_decoder_alloc(port, 0));
> >  }
> >  EXPORT_SYMBOL_NS_GPL(cxl_endpoint_decoder_alloc, CXL);
> >
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index 6517d5cdf5ee..85fd5e84f978 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -193,6 +193,18 @@ enum cxl_decoder_type {
> >   */
> >  #define CXL_DECODER_MAX_INTERLEAVE 16
> >
> > +/**
> > + * struct cxl_decoder_targets - Target information for root and switch decoders.
> > + * @target_lock: coordinate coherent reads of the target list
> > + * @nr_targets: number of elements in @target
> > + * @target: active ordered target list in current decoder configuration
> > + */
> > +struct cxl_decoder_targets {
> > +       seqlock_t target_lock;
> > +       int nr_targets;
> > +       struct cxl_dport *target[];
> > +};
> > +
> >  /**
> >   * struct cxl_decoder - CXL address range decode configuration
> >   * @dev: this decoder's device
> > @@ -202,9 +214,6 @@ enum cxl_decoder_type {
> >   * @interleave_granularity: data stride per dport
> >   * @target_type: accelerator vs expander (type2 vs type3) selector
> >   * @flags: memory type capabilities and locking
> > - * @target_lock: coordinate coherent reads of the target list
> > - * @nr_targets: number of elements in @target
> > - * @target: active ordered target list in current decoder configuration
> >   */
> >  struct cxl_decoder {
> >         struct device dev;
> > @@ -214,11 +223,46 @@ struct cxl_decoder {
> >         int interleave_granularity;
> >         enum cxl_decoder_type target_type;
> >         unsigned long flags;
> > -       seqlock_t target_lock;
> > -       int nr_targets;
> > -       struct cxl_dport *target[];
> >  };
> >
> > +/**
> > + * struct cxl_endpoint_decoder - A decoder residing in a CXL endpoint.
> > + * @base: Base class decoder
> > + */
> > +struct cxl_endpoint_decoder {
> > +       struct cxl_decoder base;
> > +};
> > +
> > +/**
> > + * struct cxl_switch_decoder - A decoder in a switch or hostbridge.
> > + * @base: Base class decoder
> > + * @targets: Downstream targets for this switch.
> > + */
> > +struct cxl_switch_decoder {
> > +       struct cxl_decoder base;
> > +       struct cxl_decoder_targets *targets;
>
> Please no double-allocation when not necessary. This can be
>
> struct cxl_switch_decoder {
>        struct cxl_decoder base;
>        struct cxl_decoder_targets targets;
> };
>
> ...and then allocated with a single:
>
> cxlsd = kzalloc(struct_size(cxlsd, targets.target, nr_targets), GFP_KERNEL);
>
> ...or something like that (not compile tested).
>
> > +};
>
> > +};
> > +
> > +/**
> > + * struct cxl_root_decoder - A toplevel/platform decoder
> > + * @base: Base class decoder
> > + * @targets: Downstream targets (ie. hostbridges).
> > + */
> > +struct cxl_root_decoder {
> > +       struct cxl_decoder base;
> > +       struct cxl_decoder_targets *targets;
> > +};
>
> Ditto single allocation feedback...
>
> ...although now that struct cxl_root_decoder is identical to
> cxl_switch_decoder is there any benefit to making them distinct types
> beyond being pedantic? Making them the same type means
> cxl_get_decoder_targets() can go because all callers of that are
> already in known !cxl_endpoint_decoder paths. But of course ignore
> this if cxl_root_decoder is going to game some differentiating
> attributes down the road, I have not looked ahead in this series.

Actually regardless of whether root decoders gain additional
attributes, root decoders should simply be a strict super-set of
switch decoders:

struct cxl_endpoint_decoder {
       /* future endpoint decoder specific attributes go here */
      struct cxl_decoder base;
};

struct cxl_switch_decoder {
       struct cxl_decoder base;
       seqlock_t target_lock;
       int nr_targets;
       struct cxl_dport *target[];
};

struct cxl_root_decoder {
       /* future root decoder specific attributes go here */
       struct cxl_switch_decoder cxlsd;
};

> > +#define _to_cxl_decoder(x)                                                     \
> > +       static inline struct cxl_##x##_decoder *to_cxl_##x##_decoder(          \
> > +               struct cxl_decoder *cxld)                                      \
> > +       {                                                                      \
> > +               return container_of(cxld, struct cxl_##x##_decoder, base);     \
> > +       }
> > +
> > +_to_cxl_decoder(root)
> > +_to_cxl_decoder(switch)
> > +_to_cxl_decoder(endpoint)

Per above root decoders could no longer use this macro which was
already borderline not worth doing in my view.


* Re: [RFC PATCH 06/15] cxl/acpi: Manage root decoder's address space
  2022-04-13 18:37 ` [RFC PATCH 06/15] cxl/acpi: Manage root decoder's address space Ben Widawsky
@ 2022-04-18 22:15   ` Dan Williams
  2022-05-12 19:18     ` Ben Widawsky
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2022-04-18 22:15 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, Linux NVDIMM, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma

On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> Use a gen_pool to manage the physical address space that is routed by
> the platform decoder (root decoder). As described in 'cxl/acpi: Reserve
> CXL resources from request_free_mem_region' the address space does not
> coexist well if part or all of it is conveyed in the memory map to the
> kernel.
>
> Since the existing resource APIs of interest all rely on the root
> decoder's address space being in iomem_resource,

I do not understand what this is trying to convey. Nothing requires
that a given 'struct resource' be managed under iomem_resource.

> the choices are to roll
> a new allocator based on struct resource, or use gen_pool. gen_pool is
> a good choice because it already has all the capabilities needed to
> satisfy CXL programming.

Not sure what comparison to 'struct resource' is being made here, what
is the tradeoff as you see it? In other words, why mention 'struct
resource' as a consideration?

>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> ---
>  drivers/cxl/acpi.c | 36 ++++++++++++++++++++++++++++++++++++
>  drivers/cxl/cxl.h  |  2 ++
>  2 files changed, 38 insertions(+)
>
> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index 0870904fe4b5..a6b0c3181d0e 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -1,6 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  /* Copyright(c) 2021 Intel Corporation. All rights reserved. */
>  #include <linux/platform_device.h>
> +#include <linux/genalloc.h>
>  #include <linux/module.h>
>  #include <linux/device.h>
>  #include <linux/kernel.h>
> @@ -79,6 +80,25 @@ struct cxl_cfmws_context {
>         struct acpi_cedt_cfmws *high_cfmws;
>  };
>
> +static int cfmws_cookie;
> +
> +static int fill_busy_mem(struct resource *res, void *_window)
> +{
> +       struct gen_pool *window = _window;
> +       struct genpool_data_fixed gpdf;
> +       unsigned long addr;
> +       void *type;
> +
> +       gpdf.offset = res->start;
> +       addr = gen_pool_alloc_algo_owner(window, resource_size(res),
> +                                        gen_pool_fixed_alloc, &gpdf, &type);

The "_owner" variant of gen_pool was only added for p2pdma as a way to
coordinate reference counts across p2pdma space allocation and a
'struct dev_pagemap' instance. The use here seems completely
vestigial and can just move to gen_pool_alloc_algo.

> +       if (addr != res->start || (res->start == 0 && type != &cfmws_cookie))
> +               return -ENXIO;

How can the second condition ever be true?

> +
> +       pr_devel("%pR removed from CFMWS\n", res);
> +       return 0;
> +}
> +
>  static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
>                            const unsigned long end)
>  {
> @@ -88,6 +108,8 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
>         struct device *dev = ctx->dev;
>         struct acpi_cedt_cfmws *cfmws;
>         struct cxl_decoder *cxld;
> +       struct gen_pool *window;
> +       char name[64];
>         int rc, i;
>
>         cfmws = (struct acpi_cedt_cfmws *) header;
> @@ -116,6 +138,20 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
>         cxld->interleave_ways = CFMWS_INTERLEAVE_WAYS(cfmws);
>         cxld->interleave_granularity = CFMWS_INTERLEAVE_GRANULARITY(cfmws);
>
> +       sprintf(name, "cfmws@%#llx", cfmws->base_hpa);
> +       window = devm_gen_pool_create(dev, ilog2(SZ_256M), NUMA_NO_NODE, name);
> +       if (IS_ERR(window))
> +               return 0;
> +
> +       gen_pool_add_owner(window, cfmws->base_hpa, -1, cfmws->window_size,
> +                          NUMA_NO_NODE, &cfmws_cookie);

Similar comment about the "_owner" variant serving no visible purpose.

These seems to pre-suppose that only the allocator will ever want to
interrogate the state of free space, it might be worth registering
objects for each intersection that are not cxl_regions so that
userspace explicitly sees what the cxl_acpi driver sees in terms of
available resources.

> +
> +       /* Area claimed by other resources, remove those from the gen_pool. */
> +       walk_iomem_res_desc(IORES_DESC_NONE, 0, cfmws->base_hpa,
> +                           cfmws->base_hpa + cfmws->window_size - 1, window,
> +                           fill_busy_mem);
> +       to_cxl_root_decoder(cxld)->window = window;
> +
>         rc = cxl_decoder_add(cxld, target_map);
>         if (rc)
>                 put_device(&cxld->dev);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 85fd5e84f978..0e1c65761ead 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -246,10 +246,12 @@ struct cxl_switch_decoder {
>  /**
>   * struct cxl_root_decoder - A toplevel/platform decoder
>   * @base: Base class decoder
> + * @window: host address space allocator
>   * @targets: Downstream targets (ie. hostbridges).
>   */
>  struct cxl_root_decoder {
>         struct cxl_decoder base;
> +       struct gen_pool *window;
>         struct cxl_decoder_targets *targets;
>  };
>
> --
> 2.35.1
>


* Re: [RFC PATCH 05/15] cxl/acpi: Reserve CXL resources from request_free_mem_region
  2022-04-18 16:42   ` Dan Williams
@ 2022-04-19 16:43     ` Jason Gunthorpe
  2022-04-19 21:50       ` Dan Williams
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Gunthorpe @ 2022-04-19 16:43 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ben Widawsky, linux-cxl, Linux NVDIMM, patches, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Christoph Hellwig,
	John Hubbard

On Mon, Apr 18, 2022 at 09:42:00AM -0700, Dan Williams wrote:
> [ add the usual HMM suspects Christoph, Jason, and John ]
> 
> On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > Define an API which allows CXL drivers to manage CXL address space.
> > CXL is unique in that the address space and various properties are only
> > known after CXL drivers come up, and therefore cannot be part of core
> > memory enumeration.
> 
> I think this buries the lead on the problem introduced by
> MEMORY_DEVICE_PRIVATE in the first place. Let's revisit that history
> before diving into what CXL needs.
> 
> 
> Commit 4ef589dc9b10 ("mm/hmm/devmem: device memory hotplug using
> ZONE_DEVICE") introduced the concept of MEMORY_DEVICE_PRIVATE. At its
> core MEMORY_DEVICE_PRIVATE uses the ZONE_DEVICE capability to annotate
> an "unused" physical address range with 'struct page' for the purpose
> of coordinating migration of buffers onto and off of a GPU /
> accelerator. The determination of "unused" was based on a heuristic,
> not a guarantee, that any address range not expressly conveyed in the
> platform firmware map of the system can be repurposed for software
> use. The CXL Fixed Memory Windows Structure  (CFMWS) definition
> explicitly breaks the assumptions of that heuristic.

So CXL defines an address map that is not part of the FW list?

> > It would be desirable to simply insert this address space into
> > iomem_resource with a new flag to denote this is CXL memory. This would
> > permit request_free_mem_region() to be reused for CXL memory provided it
> > learned some new tricks. For that, it is tempting to simply use
> > insert_resource(). The API was designed specifically for cases where new
> > devices may offer new address space. This cannot work in the general
> > case. Boot firmware can pass, some, none, or all of the CFMWS range as
> > various types of memory to the kernel, and this may be left alone,
> > merged, or even expanded.

And then we understand that on CXL the FW might pass stuff that
intersects with the actual CXL ranges?

> > As a result iomem_resource may intersect CFMWS
> > regions in ways insert_resource cannot handle [2]. Similar reasoning
> > applies to allocate_resource().
> >
> > With the insert_resource option out, the only reasonable approach left
> > is to let the CXL driver manage the address space independently of
> > iomem_resource and attempt to prevent users of device private memory

And finally due to all these FW problems we are going to make a 2nd
allocator for physical address space and just disable the normal one?

Then since DEVICE_PRIVATE is a notable user of this allocator we now
understand it becomes broken?

Sounds horrible. IMHO you should fix the normal allocator somehow to
understand that the ranges from FW have been reprogrammed by Linux and
not try to build a whole different allocator in CXL code.

Jason


* Re: [RFC PATCH 05/15] cxl/acpi: Reserve CXL resources from request_free_mem_region
  2022-04-19 16:43     ` Jason Gunthorpe
@ 2022-04-19 21:50       ` Dan Williams
  2022-04-19 21:59         ` Dan Williams
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2022-04-19 21:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Ben Widawsky, linux-cxl, Linux NVDIMM, patches, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Christoph Hellwig,
	John Hubbard

On Tue, Apr 19, 2022 at 9:43 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Apr 18, 2022 at 09:42:00AM -0700, Dan Williams wrote:
> > [ add the usual HMM suspects Christoph, Jason, and John ]
> >
> > On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > >
> > > Define an API which allows CXL drivers to manage CXL address space.
> > > CXL is unique in that the address space and various properties are only
> > > known after CXL drivers come up, and therefore cannot be part of core
> > > memory enumeration.
> >
> > I think this buries the lead on the problem introduced by
> > MEMORY_DEVICE_PRIVATE in the first place. Let's revisit that history
> > before diving into what CXL needs.
> >
> >
> > Commit 4ef589dc9b10 ("mm/hmm/devmem: device memory hotplug using
> > ZONE_DEVICE") introduced the concept of MEMORY_DEVICE_PRIVATE. At its
> > core MEMORY_DEVICE_PRIVATE uses the ZONE_DEVICE capability to annotate
> > an "unused" physical address range with 'struct page' for the purpose
> > of coordinating migration of buffers onto and off of a GPU /
> > accelerator. The determination of "unused" was based on a heuristic,
> > not a guarantee, that any address range not expressly conveyed in the
> > platform firmware map of the system can be repurposed for software
> > use. The CXL Fixed Memory Windows Structure  (CFMWS) definition
> > explicitly breaks the assumptions of that heuristic.
>
> So CXL defines an address map that is not part of the FW list?

It defines a super-set of 'potential' address space and a subset that
is active in the FW list. It's similar to memory hotplug where an
address range may come online after the fact, but unlike ACPI memory
hotplug, FW is not involved in the hotplug path, and FW cannot predict
what address ranges will come online. For example ACPI hotplug knows
in advance to publish the ranges that can experience an online /
insert event, CXL has many more degrees of freedom.

>
> > > It would be desirable to simply insert this address space into
> > > iomem_resource with a new flag to denote this is CXL memory. This would
> > > permit request_free_mem_region() to be reused for CXL memory provided it
> > > learned some new tricks. For that, it is tempting to simply use
> > > insert_resource(). The API was designed specifically for cases where new
> > > devices may offer new address space. This cannot work in the general
> > > case. Boot firmware can pass, some, none, or all of the CFMWS range as
> > > various types of memory to the kernel, and this may be left alone,
> > > merged, or even expanded.
>
> And then we understand that on CXL the FW is might pass stuff that
> intersects with the actual CXL ranges?
>
> > > As a result iomem_resource may intersect CFMWS
> > > regions in ways insert_resource cannot handle [2]. Similar reasoning
> > > applies to allocate_resource().
> > >
> > > With the insert_resource option out, the only reasonable approach left
> > > is to let the CXL driver manage the address space independently of
> > > iomem_resource and attempt to prevent users of device private memory
>
> And finally due to all these FW problems we are going to make a 2nd
> allocator for physical address space and just disable the normal one?

No, or I am misunderstanding this comment. The CXL address space
allocator is managing space that can be populated and become an
iomem_resource. So it's not supplanting iomem_resource; it is
coordinating dynamic extensions to the FW map.

> Then since DEVICE_PRIVATE is a notable user of this allocator we now
> understand it becomes broken?
>
> Sounds horrible. IMHO you should fix the normal allocator somehow to
> understand that the ranges from FW have been reprogrammed by Linux

There is no reprogramming of the ranges from FW. CXL memory that is
mapped as System RAM at boot will have the CXL decode configuration
locked in all the participating devices. The remaining CXL decode
space is then available for dynamic reconfiguration of CXL resources
from the devices that the FW explicitly ignores, which is all
hot-added devices and all persistent-memory capacity.

> and
> not try to build a whole different allocator in CXL code.

I am not seeing much overlap for DEVICE_PRIVATE and CXL to share an
allocator. CXL explicitly wants ranges that have been set aside for
CXL and are related to 1 or more CXL host bridges. DEVICE_PRIVATE
wants to consume an unused physical address range to proxy
device-local-memory with no requirements on what range is chosen as
long as it does not collide with anything else.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 05/15] cxl/acpi: Reserve CXL resources from request_free_mem_region
  2022-04-19 21:50       ` Dan Williams
@ 2022-04-19 21:59         ` Dan Williams
  2022-04-19 23:04           ` Jason Gunthorpe
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2022-04-19 21:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Ben Widawsky, linux-cxl, Linux NVDIMM, patches, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Christoph Hellwig,
	John Hubbard

On Tue, Apr 19, 2022 at 2:50 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Tue, Apr 19, 2022 at 9:43 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Mon, Apr 18, 2022 at 09:42:00AM -0700, Dan Williams wrote:
> > > [ add the usual HMM suspects Christoph, Jason, and John ]
> > >
> > > On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > >
> > > > Define an API which allows CXL drivers to manage CXL address space.
> > > > CXL is unique in that the address space and various properties are only
> > > > known after CXL drivers come up, and therefore cannot be part of core
> > > > memory enumeration.
> > >
> > > I think this buries the lead on the problem introduced by
> > > MEMORY_DEVICE_PRIVATE in the first place. Let's revisit that history
> > > before diving into what CXL needs.
> > >
> > >
> > > Commit 4ef589dc9b10 ("mm/hmm/devmem: device memory hotplug using
> > > ZONE_DEVICE") introduced the concept of MEMORY_DEVICE_PRIVATE. At its
> > > core MEMORY_DEVICE_PRIVATE uses the ZONE_DEVICE capability to annotate
> > > an "unused" physical address range with 'struct page' for the purpose
> > > of coordinating migration of buffers onto and off of a GPU /
> > > accelerator. The determination of "unused" was based on a heuristic,
> > > not a guarantee, that any address range not expressly conveyed in the
> > > platform firmware map of the system can be repurposed for software
> > > use. The CXL Fixed Memory Windows Structure  (CFMWS) definition
> > > explicitly breaks the assumptions of that heuristic.
> >
> > So CXL defines an address map that is not part of the FW list?
>
> It defines a super-set of 'potential' address space and a subset that
> is active in the FW list. It's similar to memory hotplug where an
> address range may come online after the fact, but unlike ACPI memory
> hotplug, FW is not involved in the hotplug path, and FW cannot predict
> what address ranges will come online. For example, ACPI hotplug knows
> in advance to publish the ranges that can experience an online /
> insert event, whereas CXL has many more degrees of freedom.
>
> >
> > > > It would be desirable to simply insert this address space into
> > > > iomem_resource with a new flag to denote this is CXL memory. This would
> > > > permit request_free_mem_region() to be reused for CXL memory provided it
> > > > learned some new tricks. For that, it is tempting to simply use
> > > > insert_resource(). The API was designed specifically for cases where new
> > > > devices may offer new address space. This cannot work in the general
> > > > case. Boot firmware can pass some, none, or all of the CFMWS range as
> > > > various types of memory to the kernel, and this may be left alone,
> > > > merged, or even expanded.
> >
> > And then we understand that on CXL the FW might pass stuff that
> > intersects with the actual CXL ranges?
> >
> > > > As a result iomem_resource may intersect CFMWS
> > > > regions in ways insert_resource cannot handle [2]. Similar reasoning
> > > > applies to allocate_resource().
> > > >
> > > > With the insert_resource option out, the only reasonable approach left
> > > > is to let the CXL driver manage the address space independently of
> > > > iomem_resource and attempt to prevent users of device private memory
> >
> > And finally due to all these FW problems we are going to make a 2nd
> > allocator for physical address space and just disable the normal one?
>
> No, or I am misunderstanding this comment. The CXL address space
> allocator is managing space that can be populated and become an
> iomem_resource. So it's not supplanting iomem_resource; it is
> coordinating dynamic extensions to the FW map.
>
> > Then since DEVICE_PRIVATE is a notable user of this allocator we now
> > understand it becomes broken?
> >
> > Sounds horrible. IMHO you should fix the normal allocator somehow to
> > understand that the ranges from FW have been reprogrammed by Linux
>
> There is no reprogramming of the ranges from FW. CXL memory that is
> mapped as System RAM at boot will have the CXL decode configuration
> locked in all the participating devices. The remaining CXL decode
> space is then available for dynamic reconfiguration of CXL resources
> from the devices that the FW explicitly ignores, which is all
> hot-added devices and all persistent-memory capacity.
>
> > and
> > not try to build a whole different allocator in CXL code.
>
> I am not seeing much overlap for DEVICE_PRIVATE and CXL to share an
> allocator. CXL explicitly wants ranges that have been set aside for
> CXL and are related to 1 or more CXL host bridges. DEVICE_PRIVATE
> wants to consume an unused physical address range to proxy
> device-local-memory with no requirements on what range is chosen as
> long as it does not collide with anything else.

...or are you suggesting to represent CXL free memory capacity in
iomem_resource and augment the FW list early with CXL ranges. That
seems doable, but it would only represent the free CXL ranges in
iomem_resource as the populated CXL ranges cannot have their resources
reparented after the fact, and there is plenty of code that expects
"System RAM" to be a top-level resource.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 05/15] cxl/acpi: Reserve CXL resources from request_free_mem_region
  2022-04-19 21:59         ` Dan Williams
@ 2022-04-19 23:04           ` Jason Gunthorpe
  2022-04-20  0:47             ` Dan Williams
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Gunthorpe @ 2022-04-19 23:04 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ben Widawsky, linux-cxl, Linux NVDIMM, patches, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Christoph Hellwig,
	John Hubbard

On Tue, Apr 19, 2022 at 02:59:46PM -0700, Dan Williams wrote:

> ...or are you suggesting to represent CXL free memory capacity in
> iomem_resource and augment the FW list early with CXL ranges. That
> seems doable, but it would only represent the free CXL ranges in
> iomem_resource as the populated CXL ranges cannot have their resources
> reparented after the fact, and there is plenty of code that expects
> "System RAM" to be a top-level resource.

Yes, something more like this. iomem_resource should represent stuff
actually in use and CXL shouldn't leave behind an 'IOW' for address
space it isn't actually able to currently use.

Your whole description sounds like the same problems PCI hotplug has
adjusting the bridge windows.

Jason

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 05/15] cxl/acpi: Reserve CXL resources from request_free_mem_region
  2022-04-19 23:04           ` Jason Gunthorpe
@ 2022-04-20  0:47             ` Dan Williams
  2022-04-20 14:34               ` Jason Gunthorpe
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2022-04-20  0:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Ben Widawsky, linux-cxl, Linux NVDIMM, patches, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Christoph Hellwig,
	John Hubbard

On Tue, Apr 19, 2022 at 4:04 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Apr 19, 2022 at 02:59:46PM -0700, Dan Williams wrote:
>
> > ...or are you suggesting to represent CXL free memory capacity in
> > iomem_resource and augment the FW list early with CXL ranges. That
> > seems doable, but it would only represent the free CXL ranges in
> > iomem_resource as the populated CXL ranges cannot have their resources
> > reparented after the fact, and there is plenty of code that expects
> > "System RAM" to be a top-level resource.
>
> Yes, something more like this. iomem_resource should represent stuff
> actually in use and CXL shouldn't leave behind an 'IOW' for address
> space it isn't actually able to currently use.

So that's the problem, these gigantic windows need to support someone
showing up unannounced with a stack of multi-terabyte devices to add
to the system. The address space is idle before that event, but it
needs to be reserved for CXL because the top-level system decode
imposes mandates like "CXL cards of type X with performance Y inserted
underneath CXL host-bridge Z can only use CXL address ranges 1, 4 and 5".

> Your whole description sounds like the same problems PCI hotplug has
> adjusting the bridge windows.

...but even there the base bounds (AFAICS) are coming from FW (_CRS
entries for ACPI described PCIe host bridges). So if CXL follows that
model then the entire unmapped portion of the CXL ranges should be
marked as an idle resource in iomem_resource.

The improvement that offers over this current proposal is that it
allows for global visibility of CXL hotplug resources, but it does set
up a discontinuity between FW mapped and OS mapped CXL. FW mapped will
have top-level "System RAM" resources indistinguishable from typical
DRAM while OS mapped CXL will look like this:

100000000-1ffffffff : CXL Range 0
  108000000-1ffffffff : region5
    108000000-1ffffffff : System RAM (CXL)

...even though to FW "range 0" spans across a BIOS mapped portion and
"free for OS to use" portion.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 05/15] cxl/acpi: Reserve CXL resources from request_free_mem_region
  2022-04-20  0:47             ` Dan Williams
@ 2022-04-20 14:34               ` Jason Gunthorpe
  2022-04-20 15:32                 ` Dan Williams
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Gunthorpe @ 2022-04-20 14:34 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ben Widawsky, linux-cxl, Linux NVDIMM, patches, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Christoph Hellwig,
	John Hubbard

On Tue, Apr 19, 2022 at 05:47:56PM -0700, Dan Williams wrote:
> On Tue, Apr 19, 2022 at 4:04 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Tue, Apr 19, 2022 at 02:59:46PM -0700, Dan Williams wrote:
> >
> > > ...or are you suggesting to represent CXL free memory capacity in
> > > iomem_resource and augment the FW list early with CXL ranges. That
> > > seems doable, but it would only represent the free CXL ranges in
> > > iomem_resource as the populated CXL ranges cannot have their resources
> > > reparented after the fact, and there is plenty of code that expects
> > > "System RAM" to be a top-level resource.
> >
> > Yes, something more like this. iomem_resource should represent stuff
> > actually in use and CXL shouldn't leave behind an 'IOW' for address
> > space it isn't actually able to currently use.
> 
> So that's the problem, these gigantic windows need to support someone
> showing up unannounced with a stack of multi-terabyte devices to add
> to the system.

In my experience PCIe hotplug is already extremely rare; you may need
to do this reservation on systems with hotplug slots, but not
generally. In PCIe world the BIOS often figures this out and bridge
windows are not significantly over allocated on non-hotplug HW.

(though even PCIe has the resizable bar extension and other things
that are quite like hotplug and do trigger huge resource requirements)

> > Your whole description sounds like the same problems PCI hotplug has
> > adjusting the bridge windows.
> 
> ...but even there the base bounds (AFAICS) are coming from FW (_CRS
> entries for ACPI described PCIe host bridges). So if CXL follows that
> model then the entire unmapped portion of the CXL ranges should be
> marked as an idle resource in iomem_resource.

And possibly yes, because part of the point of this stuff is to
declare where HW is actually using the address space. So if FW has
left a host bridge decoder setup to actually consume this space then
it really has to be set aside to prevent hotplug of other bus types
from trying to claim the same address space for their own usages.

If no actual decoder is setup then it maybe it shouldn't be left as an
IOW in the resource tree. In this case it might be better to teach the
io resource allocator to leave gaps for future hotplug.

> The improvement that offers over this current proposal is that it
> allows for global visibility of CXL hotplug resources, but it does set
> up a discontinuity between FW mapped and OS mapped CXL. FW mapped will
> have top-level "System RAM" resources indistinguishable from typical
> DRAM while OS mapped CXL will look like this:

Maybe this can be retroactively fixed up in the resource tree?

Jason

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 05/15] cxl/acpi: Reserve CXL resources from request_free_mem_region
  2022-04-20 14:34               ` Jason Gunthorpe
@ 2022-04-20 15:32                 ` Dan Williams
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2022-04-20 15:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Ben Widawsky, linux-cxl, Linux NVDIMM, patches, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Christoph Hellwig,
	John Hubbard

On Wed, Apr 20, 2022 at 7:35 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Apr 19, 2022 at 05:47:56PM -0700, Dan Williams wrote:
> > On Tue, Apr 19, 2022 at 4:04 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Tue, Apr 19, 2022 at 02:59:46PM -0700, Dan Williams wrote:
> > >
> > > > ...or are you suggesting to represent CXL free memory capacity in
> > > > iomem_resource and augment the FW list early with CXL ranges. That
> > > > seems doable, but it would only represent the free CXL ranges in
> > > > iomem_resource as the populated CXL ranges cannot have their resources
> > > > reparented after the fact, and there is plenty of code that expects
> > > > "System RAM" to be a top-level resource.
> > >
> > > Yes, something more like this. iomem_resource should represent stuff
> > > actually in use and CXL shouldn't leave behind an 'IOW' for address
> > > space it isn't actually able to currently use.
> >
> > So that's the problem, these gigantic windows need to support someone
> > showing up unannounced with a stack of multi-terabyte devices to add
> > to the system.
>
> In my experience PCIe hotplug is already extremely rare; you may need
> to do this reservation on systems with hotplug slots, but not
> generally. In PCIe world the BIOS often figures this out and bridge
> windows are not significantly over allocated on non-hotplug HW.
>
> (though even PCIe has the resizable bar extension and other things
> that are quite like hotplug and do trigger huge resource requirements)
>
> > > Your whole description sounds like the same problems PCI hotplug has
> > > adjusting the bridge windows.
> >
> > ...but even there the base bounds (AFAICS) are coming from FW (_CRS
> > entries for ACPI described PCIe host bridges). So if CXL follows that
> > model then the entire unmapped portion of the CXL ranges should be
> > marked as an idle resource in iomem_resource.
>
> And possibly yes, because part of the point of this stuff is to
> declare where HW is actually using the address space. So if FW has
> left a host bridge decoder setup to actually consume this space then
> it really has to be set aside to prevent hotplug of other bus types
> from trying to claim the same address space for their own usages.
>
> If no actual decoder is setup then it maybe it shouldn't be left as an
> IOW in the resource tree. In this case it might be better to teach the
> io resource allocator to leave gaps for future hotplug.

Yeah, it is the former. These CXL ranges are all actively decoded by
the CPU complex memory controller as "this range goes to DDR and this
other range is interleaved across this set of CXL host bridges". Even
if there is nothing behind those host bridges there is hardware
actively routing requests that fall into those ranges to those
downstream devices.

> > The improvement that offers over this current proposal is that it
> > allows for global visibility of CXL hotplug resources, but it does set
> > up a discontinuity between FW mapped and OS mapped CXL. FW mapped will
> > have top-level "System RAM" resources indistinguishable from typical
> > DRAM while OS mapped CXL will look like this:
>
> Maybe this can be retroactively fixed up in the resource tree?

I had been discouraged to go that route considering some code only
scans top-level iomem_resource entries, but it is probably better to
try to fix that legacy code to operate correctly when System RAM is
parented by a CXL Range.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 12/15] cxl/region: Add region creation ABI
  2022-04-13 18:37 ` [RFC PATCH 12/15] cxl/region: Add region creation ABI Ben Widawsky
@ 2022-05-04 22:56   ` Verma, Vishal L
  2022-05-05  5:17     ` Dan Williams
  0 siblings, 1 reply; 53+ messages in thread
From: Verma, Vishal L @ 2022-05-04 22:56 UTC (permalink / raw)
  To: Widawsky, Ben, linux-cxl, nvdimm
  Cc: patches, Schofield, Alison, Jonathan.Cameron, Williams, Dan J,
	Weiny, Ira

On Wed, 2022-04-13 at 11:37 -0700, Ben Widawsky wrote:
> Regions are created as a child of the decoder that encompasses an
> address space with constraints. Regions have a number of attributes that
> must be configured before the region can be activated.
> 
> Multiple processes which are trying not to race with each other
> shouldn't need special userspace synchronization to do so.
> 
> // Allocate a new region name
> region=$(cat /sys/bus/cxl/devices/decoder0.0/create_pmem_region)
> 
> // Create a new region by name
> while
> region=$(cat /sys/bus/cxl/devices/decoder0.0/create_pmem_region)
> ! echo $region > /sys/bus/cxl/devices/decoder0.0/create_pmem_region
> do true; done
> 
> // Region now exists in sysfs
> stat -t /sys/bus/cxl/devices/decoder0.0/$region
> 
> // Delete the region, and name
> echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region

I noticed a slight ABI inconsistency here while working on the libcxl
side of this - see below.

> +
> +static ssize_t create_pmem_region_show(struct device *dev,
> +                                      struct device_attribute *attr, char *buf)
> +{
> +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> +       struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxld);
> +       size_t rc;
> +
> +       /*
> +        * There's no point in returning known bad answers when the lock is held
> +        * on the store side, even though the answer given here may be
> +        * immediately invalidated as soon as the lock is dropped it's still
> +        * useful to throttle readers in the presence of writers.
> +        */
> +       rc = mutex_lock_interruptible(&cxlrd->id_lock);
> +       if (rc)
> +               return rc;
> +       rc = sysfs_emit(buf, "%d\n", cxlrd->next_region_id);

This emits a numeric region ID, e.g. "0", whereas

> +       mutex_unlock(&cxlrd->id_lock);
> +
> +       return rc;
> +}
> +

<snip>

> +static ssize_t delete_region_store(struct device *dev,
> +                                  struct device_attribute *attr,
> +                                  const char *buf, size_t len)
> +{
> +       struct cxl_port *port = to_cxl_port(dev->parent);
> +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> +       struct cxl_region *cxlr;
> +
> +       cxlr = cxl_find_region_by_name(cxld, buf);

This expects a full region name string e.g. "region0"

Was this intentional? I don't think it's a huge deal; the library can
certainly deal with it if needed, but would it be better to have the
ABI symmetrical between create and delete?


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 12/15] cxl/region: Add region creation ABI
  2022-05-04 22:56   ` Verma, Vishal L
@ 2022-05-05  5:17     ` Dan Williams
  2022-05-12 15:54       ` Ben Widawsky
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2022-05-05  5:17 UTC (permalink / raw)
  To: Verma, Vishal L
  Cc: Widawsky, Ben, linux-cxl, nvdimm, patches, Schofield, Alison,
	Jonathan.Cameron, Weiny, Ira

On Wed, May 4, 2022 at 3:57 PM Verma, Vishal L <vishal.l.verma@intel.com> wrote:
>
> On Wed, 2022-04-13 at 11:37 -0700, Ben Widawsky wrote:
> > Regions are created as a child of the decoder that encompasses an
> > address space with constraints. Regions have a number of attributes that
> > must be configured before the region can be activated.
> >
> > Multiple processes which are trying not to race with each other
> > shouldn't need special userspace synchronization to do so.
> >
> > // Allocate a new region name
> > region=$(cat /sys/bus/cxl/devices/decoder0.0/create_pmem_region)
> >
> > // Create a new region by name
> > while
> > region=$(cat /sys/bus/cxl/devices/decoder0.0/create_pmem_region)
> > ! echo $region > /sys/bus/cxl/devices/decoder0.0/create_pmem_region
> > do true; done
> >
> > // Region now exists in sysfs
> > stat -t /sys/bus/cxl/devices/decoder0.0/$region
> >
> > // Delete the region, and name
> > echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region
>
> I noticed a slight ABI inconsistency here while working on the libcxl
> side of this - see below.
>
> > +
> > +static ssize_t create_pmem_region_show(struct device *dev,
> > +                                      struct device_attribute *attr, char *buf)
> > +{
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > +       struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxld);
> > +       size_t rc;
> > +
> > +       /*
> > +        * There's no point in returning known bad answers when the lock is held
> > +        * on the store side, even though the answer given here may be
> > +        * immediately invalidated as soon as the lock is dropped it's still
> > +        * useful to throttle readers in the presence of writers.
> > +        */
> > +       rc = mutex_lock_interruptible(&cxlrd->id_lock);
> > +       if (rc)
> > +               return rc;
> > +       rc = sysfs_emit(buf, "%d\n", cxlrd->next_region_id);
>
> This emits a numeric region ID, e.g. "0", whereas
>
> > +       mutex_unlock(&cxlrd->id_lock);
> > +
> > +       return rc;
> > +}
> > +
>
> <snip>
>
> > +static ssize_t delete_region_store(struct device *dev,
> > +                                  struct device_attribute *attr,
> > +                                  const char *buf, size_t len)
> > +{
> > +       struct cxl_port *port = to_cxl_port(dev->parent);
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > +       struct cxl_region *cxlr;
> > +
> > +       cxlr = cxl_find_region_by_name(cxld, buf);
>
> This expects a full region name string e.g. "region0"
>
> Was this intentional? I don't think it's a huge deal, the library can
> certainly deal with it if needed - but would it be better to have the
> ABI symmetrical between create and delete?

Yes, makes sense to me.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 02/15] cxl/core/hdm: Bail on endpoint init fail
  2022-04-18 16:37       ` Adam Manzanares
@ 2022-05-12 15:50         ` Ben Widawsky
  2022-05-12 17:27           ` Luis Chamberlain
  0 siblings, 1 reply; 53+ messages in thread
From: Ben Widawsky @ 2022-05-12 15:50 UTC (permalink / raw)
  To: Adam Manzanares
  Cc: Dan Williams, linux-cxl, Linux NVDIMM, patches, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, mcgrof

On 22-04-18 16:37:12, Adam Manzanares wrote:
> On Wed, Apr 13, 2022 at 02:31:42PM -0700, Dan Williams wrote:
> > On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > >
> > > Endpoint decoder enumeration is the only way in which we can determine
> > > Device Physical Address (DPA) -> Host Physical Address (HPA) mappings.
> > > Information is obtained only when the register state can be read
> > > sequentially. If a failure occurs while enumerating the decoders, all
> > > remaining decoders must also fail since the decoders can no longer be
> > > accurately managed (unless it's the last decoder in which case it can
> > > still work).
> > 
> > I think this should be expanded to fail if any decoder fails to
> > allocate anywhere in the topology otherwise it leaves a mess for
> > future address translation code to work through cases where decoder
> > information is missing.
> > 
> > The current approach is based around the current expectation that
> > nothing is enumerating pre-existing regions, and nothing is performing
> > address translation.
> 
> Does the qemu support currently allow testing of this patch? If so, it would 
> be good to reference qemu configurations. Any other alternatives would be 
> welcome as well. 
> 
> +Luis on cc.
> 

No. This type of error injection would be cool to have, but I'm not sure of a
scalable way to support it. Maybe Jonathan has some ideas?

> > 
> > >
> > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > > ---
> > >  drivers/cxl/core/hdm.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > > index bfc8ee876278..c3c021b54079 100644
> > > --- a/drivers/cxl/core/hdm.c
> > > +++ b/drivers/cxl/core/hdm.c
> > > @@ -255,6 +255,8 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
> > >                                       cxlhdm->regs.hdm_decoder, i);
> > >                 if (rc) {
> > >                         put_device(&cxld->dev);
> > > +                       if (is_endpoint_decoder(&cxld->dev))
> > > +                               return rc;
> > >                         failed++;
> > >                         continue;
> > >                 }
> > > --
> > > 2.35.1
> > >
> > 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 12/15] cxl/region: Add region creation ABI
  2022-05-05  5:17     ` Dan Williams
@ 2022-05-12 15:54       ` Ben Widawsky
  0 siblings, 0 replies; 53+ messages in thread
From: Ben Widawsky @ 2022-05-12 15:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: Verma, Vishal L, linux-cxl, nvdimm, patches, Schofield, Alison,
	Jonathan.Cameron, Weiny, Ira

On 22-05-04 22:17:49, Dan Williams wrote:
> On Wed, May 4, 2022 at 3:57 PM Verma, Vishal L <vishal.l.verma@intel.com> wrote:
> >
> > On Wed, 2022-04-13 at 11:37 -0700, Ben Widawsky wrote:
> > > Regions are created as a child of the decoder that encompasses an
> > > address space with constraints. Regions have a number of attributes that
> > > must be configured before the region can be activated.
> > >
> > > Multiple processes which are trying not to race with each other
> > > shouldn't need special userspace synchronization to do so.
> > >
> > > // Allocate a new region name
> > > region=$(cat /sys/bus/cxl/devices/decoder0.0/create_pmem_region)
> > >
> > > // Create a new region by name
> > > while
> > > region=$(cat /sys/bus/cxl/devices/decoder0.0/create_pmem_region)
> > > ! echo $region > /sys/bus/cxl/devices/decoder0.0/create_pmem_region
> > > do true; done
> > >
> > > // Region now exists in sysfs
> > > stat -t /sys/bus/cxl/devices/decoder0.0/$region
> > >
> > > // Delete the region, and name
> > > echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region
> >
> > I noticed a slight ABI inconsistency here while working on the libcxl
> > side of this - see below.
> >
> > > +
> > > +static ssize_t create_pmem_region_show(struct device *dev,
> > > +                                      struct device_attribute *attr, char *buf)
> > > +{
> > > +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > > +       struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxld);
> > > +       size_t rc;
> > > +
> > > +       /*
> > > +        * There's no point in returning known bad answers when the lock is held
> > > +        * on the store side, even though the answer given here may be
> > > +        * immediately invalidated as soon as the lock is dropped it's still
> > > +        * useful to throttle readers in the presence of writers.
> > > +        */
> > > +       rc = mutex_lock_interruptible(&cxlrd->id_lock);
> > > +       if (rc)
> > > +               return rc;
> > > +       rc = sysfs_emit(buf, "%d\n", cxlrd->next_region_id);
> >
> > This emits a numeric region ID, e.g. "0", whereas
> >
> > > +       mutex_unlock(&cxlrd->id_lock);
> > > +
> > > +       return rc;
> > > +}
> > > +
> >
> > <snip>
> >
> > > +static ssize_t delete_region_store(struct device *dev,
> > > +                                  struct device_attribute *attr,
> > > +                                  const char *buf, size_t len)
> > > +{
> > > +       struct cxl_port *port = to_cxl_port(dev->parent);
> > > +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > > +       struct cxl_region *cxlr;
> > > +
> > > +       cxlr = cxl_find_region_by_name(cxld, buf);
> >
> > This expects a full region name string e.g. "region0"
> >
> > Was this intentional? I don't think it's a huge deal, the library can
> > certainly deal with it if needed - but would it be better to have the
> > ABI symmetrical between create and delete?
> 
> Yes, makes sense to me.

It was not intentional. It's "region%u" for both create and delete now.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 03/15] Revert "cxl/core: Convert decoder range to resource"
  2022-04-13 21:43   ` Dan Williams
@ 2022-05-12 16:09     ` Ben Widawsky
  0 siblings, 0 replies; 53+ messages in thread
From: Ben Widawsky @ 2022-05-12 16:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Linux NVDIMM, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma

On 22-04-13 14:43:48, Dan Williams wrote:
> On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > This reverts commit 608135db1b790170d22848815c4671407af74e37. All
> 
> Did checkpatch not complain about this being in "commit
> <12-character-commit-id> <commit summary format>"? However, I'd rather
> just drop the revert language and say:
> 
> Change root decoders to reuse the existing ->range field to track the
> decoder's programmed HPA range. The infrastructure to track the
> allocations out of the root decoder range is still a work-in-progress,
> but in the meantime it simplifies the code to always represent the
> current decoder range setting in the ->range field regardless of
> decoder type.
> 
> > decoders do have a host physical address space and the revert allows us
> > to keep that uniformity. Decoder disambiguation will allow for decoder
> > type-specific members which is needed, but will be handled separately.
> >
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> >
> > ---
> > The explanation for why it is impossible to make CFMWS ranges be
> > iomem_resources is explain in a later patch.
> 
> This change stands alone / is independent of any iomem_resource concerns, right?
> 

I think I need to revisit this per the discussion at LSFMM. Ideally a CFMWS
resource would just be insert_resource()'d, but that won't work. I'm going to
attempt what we discussed and this patch will likely go away.

> > ---
> >  drivers/cxl/acpi.c      | 17 ++++++++++-------
> >  drivers/cxl/core/hdm.c  |  2 +-
> >  drivers/cxl/core/port.c | 28 ++++++----------------------
> >  drivers/cxl/cxl.h       |  8 ++------
> >  4 files changed, 19 insertions(+), 36 deletions(-)
> >
> > diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> > index d15a6aec0331..9b69955b90cb 100644
> > --- a/drivers/cxl/acpi.c
> > +++ b/drivers/cxl/acpi.c
> > @@ -108,8 +108,10 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
> >
> >         cxld->flags = cfmws_to_decoder_flags(cfmws->restrictions);
> >         cxld->target_type = CXL_DECODER_EXPANDER;
> > -       cxld->platform_res = (struct resource)DEFINE_RES_MEM(cfmws->base_hpa,
> > -                                                            cfmws->window_size);
> > +       cxld->range = (struct range){
> > +               .start = cfmws->base_hpa,
> > +               .end = cfmws->base_hpa + cfmws->window_size - 1,
> > +       };
> >         cxld->interleave_ways = CFMWS_INTERLEAVE_WAYS(cfmws);
> >         cxld->interleave_granularity = CFMWS_INTERLEAVE_GRANULARITY(cfmws);
> >
> > @@ -119,13 +121,14 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
> >         else
> >                 rc = cxl_decoder_autoremove(dev, cxld);
> >         if (rc) {
> > -               dev_err(dev, "Failed to add decoder for %pr\n",
> > -                       &cxld->platform_res);
> > +               dev_err(dev, "Failed to add decoder for %#llx-%#llx\n",
> > +                       cfmws->base_hpa,
> > +                       cfmws->base_hpa + cfmws->window_size - 1);
> >                 return 0;
> >         }
> > -       dev_dbg(dev, "add: %s node: %d range %pr\n", dev_name(&cxld->dev),
> > -               phys_to_target_node(cxld->platform_res.start),
> > -               &cxld->platform_res);
> > +       dev_dbg(dev, "add: %s node: %d range %#llx-%#llx\n",
> > +               dev_name(&cxld->dev), phys_to_target_node(cxld->range.start),
> > +               cfmws->base_hpa, cfmws->base_hpa + cfmws->window_size - 1);
> >
> >         return 0;
> >  }
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index c3c021b54079..3055e246aab9 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -172,7 +172,7 @@ static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
> >                 return -ENXIO;
> >         }
> >
> > -       cxld->decoder_range = (struct range) {
> > +       cxld->range = (struct range) {
> >                 .start = base,
> >                 .end = base + size - 1,
> >         };
> > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > index 74c8e47bf915..86f451ecb7ed 100644
> > --- a/drivers/cxl/core/port.c
> > +++ b/drivers/cxl/core/port.c
> > @@ -73,14 +73,8 @@ static ssize_t start_show(struct device *dev, struct device_attribute *attr,
> >                           char *buf)
> >  {
> >         struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > -       u64 start;
> >
> > -       if (is_root_decoder(dev))
> > -               start = cxld->platform_res.start;
> > -       else
> > -               start = cxld->decoder_range.start;
> > -
> > -       return sysfs_emit(buf, "%#llx\n", start);
> > +       return sysfs_emit(buf, "%#llx\n", cxld->range.start);
> >  }
> >  static DEVICE_ATTR_ADMIN_RO(start);
> >
> > @@ -88,14 +82,8 @@ static ssize_t size_show(struct device *dev, struct device_attribute *attr,
> >                         char *buf)
> >  {
> >         struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > -       u64 size;
> >
> > -       if (is_root_decoder(dev))
> > -               size = resource_size(&cxld->platform_res);
> > -       else
> > -               size = range_len(&cxld->decoder_range);
> > -
> > -       return sysfs_emit(buf, "%#llx\n", size);
> > +       return sysfs_emit(buf, "%#llx\n", range_len(&cxld->range));
> >  }
> >  static DEVICE_ATTR_RO(size);
> >
> > @@ -1228,7 +1216,10 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
> >         cxld->interleave_ways = 1;
> >         cxld->interleave_granularity = PAGE_SIZE;
> >         cxld->target_type = CXL_DECODER_EXPANDER;
> > -       cxld->platform_res = (struct resource)DEFINE_RES_MEM(0, 0);
> > +       cxld->range = (struct range) {
> > +               .start = 0,
> > +               .end = -1,
> > +       };
> >
> >         return cxld;
> >  err:
> > @@ -1342,13 +1333,6 @@ int cxl_decoder_add_locked(struct cxl_decoder *cxld, int *target_map)
> >         if (rc)
> >                 return rc;
> >
> > -       /*
> > -        * Platform decoder resources should show up with a reasonable name. All
> > -        * other resources are just sub ranges within the main decoder resource.
> > -        */
> > -       if (is_root_decoder(dev))
> > -               cxld->platform_res.name = dev_name(dev);
> > -
> >         return device_add(dev);
> >  }
> >  EXPORT_SYMBOL_NS_GPL(cxl_decoder_add_locked, CXL);
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index 5102491e8d13..6517d5cdf5ee 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -197,8 +197,7 @@ enum cxl_decoder_type {
> >   * struct cxl_decoder - CXL address range decode configuration
> >   * @dev: this decoder's device
> >   * @id: kernel device name id
> > - * @platform_res: address space resources considered by root decoder
> > - * @decoder_range: address space resources considered by midlevel decoder
> > + * @range: address range considered by this decoder
> >   * @interleave_ways: number of cxl_dports in this decode
> >   * @interleave_granularity: data stride per dport
> >   * @target_type: accelerator vs expander (type2 vs type3) selector
> > @@ -210,10 +209,7 @@ enum cxl_decoder_type {
> >  struct cxl_decoder {
> >         struct device dev;
> >         int id;
> > -       union {
> > -               struct resource platform_res;
> > -               struct range decoder_range;
> > -       };
> > +       struct range range;
> >         int interleave_ways;
> >         int interleave_granularity;
> >         enum cxl_decoder_type target_type;
> > --
> > 2.35.1
> >

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 02/15] cxl/core/hdm: Bail on endpoint init fail
  2022-05-12 15:50         ` Ben Widawsky
@ 2022-05-12 17:27           ` Luis Chamberlain
  2022-05-13 12:09             ` Jonathan Cameron
  0 siblings, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2022-05-12 17:27 UTC (permalink / raw)
  To: Ben Widawsky, Klaus Jensen, Josef Bacik
  Cc: Adam Manzanares, Dan Williams, linux-cxl, Linux NVDIMM, patches,
	Alison Schofield, Ira Weiny, Jonathan Cameron, Vishal Verma

On Thu, May 12, 2022 at 08:50:14AM -0700, Ben Widawsky wrote:
> On 22-04-18 16:37:12, Adam Manzanares wrote:
> > On Wed, Apr 13, 2022 at 02:31:42PM -0700, Dan Williams wrote:
> > > On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > >
> > > > Endpoint decoder enumeration is the only way in which we can determine
> > > > Device Physical Address (DPA) -> Host Physical Address (HPA) mappings.
> > > > Information is obtained only when the register state can be read
> > > > sequentially. If when enumerating the decoders a failure occurs, all
> > > > other decoders must also fail since the decoders can no longer be
> > > > accurately managed (unless it's the last decoder in which case it can
> > > > still work).
> > > 
> > > I think this should be expanded to fail if any decoder fails to
> > > allocate anywhere in the topology otherwise it leaves a mess for
> > > future address translation code to work through cases where decoder
> > > information is missing.
> > > 
> > > The current approach is based around the current expectation that
> > > nothing is enumerating pre-existing regions, and nothing is performing
> > > address translation.
> > 
> > Does the qemu support currently allow testing of this patch? If so, it would 
> > be good to reference qemu configurations. Any other alternatives would be 
> > welcome as well. 
> > 
> > +Luis on cc.
> > 
> 
> No. This type of error injection would be cool to have, but I'm not sure of a
> good way to support that in a scalable way. Maybe Jonathan has some ideas?

In case it helps on the Linux front the least intrusive way is to use
ALLOW_ERROR_INJECTION(). It's what I hope we'll slowly strive for on
the block layer and filesystems. That incurs one macro call per error
routine you want to allow error injection on.

Then you use debugfs to dynamically enable / disable the error
injection / rate etc.

So I think this begs the question, what error injection mechanisms
exist for qemu and would new functionality be welcomed?

Linux builds off a brilliantly simple interface borrowed from
failmalloc [0]. The initial implementation on Linux then was also really
simple [1] [2] [3] however it required adding stubs on each call with a
respective build option to enable failure injection. Configuration was done
through debugfs.

Later Josef enabled us to use BPF to allow overriding kprobed functions
to return arbitrary values[4], and further generalized away from kprobes
by Masami [5].

If no failure injection is present in qemu something as simple as the initial
approach could be considered [1] [2] [3], but a dynamic interface
would certainly be wonderful long term.

[0] http://www.nongnu.org/failmalloc/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de1ba09b214056365d9082982905b255caafb7a2
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6ff1cb355e628f8fc55fa2d01e269e5e1bbc2fe9
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8a8b6502fb669c3a0638a08955442814cedc86b1
[4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=92ace9991da08827e809c2d120108a96a281e7fc
[5] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=540adea3809f61115d2a1ea4ed6e627613452ba1

  Luis

> > > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > > > ---
> > > >  drivers/cxl/core/hdm.c | 2 ++
> > > >  1 file changed, 2 insertions(+)
> > > >
> > > > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > > > index bfc8ee876278..c3c021b54079 100644
> > > > --- a/drivers/cxl/core/hdm.c
> > > > +++ b/drivers/cxl/core/hdm.c
> > > > @@ -255,6 +255,8 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
> > > >                                       cxlhdm->regs.hdm_decoder, i);
> > > >                 if (rc) {
> > > >                         put_device(&cxld->dev);
> > > > +                       if (is_endpoint_decoder(&cxld->dev))
> > > > +                               return rc;
> > > >                         failed++;
> > > >                         continue;
> > > >                 }
> > > > --
> > > > 2.35.1
> > > >
> > > 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 06/15] cxl/acpi: Manage root decoder's address space
  2022-04-18 22:15   ` Dan Williams
@ 2022-05-12 19:18     ` Ben Widawsky
  0 siblings, 0 replies; 53+ messages in thread
From: Ben Widawsky @ 2022-05-12 19:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Linux NVDIMM, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma

On 22-04-18 15:15:47, Dan Williams wrote:
> On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > Use a gen_pool to manage the physical address space that is routed by
> > the platform decoder (root decoder). As described in 'cxl/acpi: Resereve
> > CXL resources from request_free_mem_region' the address space does not
> > coexist well if part of all of it is conveyed in the memory map to the
> > kernel.
> >
> > Since the existing resource APIs of interest all rely on the root
> > decoder's address space being in iomem_resource,
> 
> I do not understand what this is trying to convey. Nothing requires
> that a given 'struct resource' be managed under iomem_resource.
> 
> > the choices are to roll
> > a new allocator because on struct resource, or use gen_pool. gen_pool is
> > a good choice because it already has all the capabilities needed to
> > satisfy CXL programming.
> 
> Not sure what comparison to 'struct resource' is being made here, what
> is the tradeoff as you see it? In other words, why mention 'struct
> resource' as a consideration?
> 
> >
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > ---
> >  drivers/cxl/acpi.c | 36 ++++++++++++++++++++++++++++++++++++
> >  drivers/cxl/cxl.h  |  2 ++
> >  2 files changed, 38 insertions(+)
> >
> > diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> > index 0870904fe4b5..a6b0c3181d0e 100644
> > --- a/drivers/cxl/acpi.c
> > +++ b/drivers/cxl/acpi.c
> > @@ -1,6 +1,7 @@
> >  // SPDX-License-Identifier: GPL-2.0-only
> >  /* Copyright(c) 2021 Intel Corporation. All rights reserved. */
> >  #include <linux/platform_device.h>
> > +#include <linux/genalloc.h>
> >  #include <linux/module.h>
> >  #include <linux/device.h>
> >  #include <linux/kernel.h>
> > @@ -79,6 +80,25 @@ struct cxl_cfmws_context {
> >         struct acpi_cedt_cfmws *high_cfmws;
> >  };
> >
> > +static int cfmws_cookie;
> > +
> > +static int fill_busy_mem(struct resource *res, void *_window)
> > +{
> > +       struct gen_pool *window = _window;
> > +       struct genpool_data_fixed gpdf;
> > +       unsigned long addr;
> > +       void *type;
> > +
> > +       gpdf.offset = res->start;
> > +       addr = gen_pool_alloc_algo_owner(window, resource_size(res),
> > +                                        gen_pool_fixed_alloc, &gpdf, &type);
> 
> The "_owner" variant of gen_pool was only added for p2pdma as a way to
> coordinate reference counts across p2pdma space allocation and a
> 'strcuct dev_pagemap' instance. The use here seems completely
> vestigial and can just move to gen_pool_alloc_algo.
> 

The problem it's trying to solve is the case where gpdf.offset is 0. I
think that's a highly unlikely case with the current plan; however, if
reparenting comes into play, you could very likely have offset 0, and then you
have no way to distinguish error from success without the cookie.

Thoughts?

> > +       if (addr != res->start || (res->start == 0 && type != &cfmws_cookie))
> > +               return -ENXIO;
> 
> How can the second condition ever be true?
> 

0 offset but failure.

> > +
> > +       pr_devel("%pR removed from CFMWS\n", res);
> > +       return 0;
> > +}
> > +
> >  static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
> >                            const unsigned long end)
> >  {
> > @@ -88,6 +108,8 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
> >         struct device *dev = ctx->dev;
> >         struct acpi_cedt_cfmws *cfmws;
> >         struct cxl_decoder *cxld;
> > +       struct gen_pool *window;
> > +       char name[64];
> >         int rc, i;
> >
> >         cfmws = (struct acpi_cedt_cfmws *) header;
> > @@ -116,6 +138,20 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
> >         cxld->interleave_ways = CFMWS_INTERLEAVE_WAYS(cfmws);
> >         cxld->interleave_granularity = CFMWS_INTERLEAVE_GRANULARITY(cfmws);
> >
> > +       sprintf(name, "cfmws@%#llx", cfmws->base_hpa);
> > +       window = devm_gen_pool_create(dev, ilog2(SZ_256M), NUMA_NO_NODE, name);
> > +       if (IS_ERR(window))
> > +               return 0;
> > +
> > +       gen_pool_add_owner(window, cfmws->base_hpa, -1, cfmws->window_size,
> > +                          NUMA_NO_NODE, &cfmws_cookie);
> 
> Similar comment about the "_owner" variant serving no visible purpose.
> 
> These seems to pre-suppose that only the allocator will ever want to
> interrogate the state of free space, it might be worth registering
> objects for each intersection that are not cxl_regions so that
> userspace explicitly sees what the cxl_acpi driver sees in terms of
> available resources.
> 
> > +
> > +       /* Area claimed by other resources, remove those from the gen_pool. */
> > +       walk_iomem_res_desc(IORES_DESC_NONE, 0, cfmws->base_hpa,
> > +                           cfmws->base_hpa + cfmws->window_size - 1, window,
> > +                           fill_busy_mem);
> > +       to_cxl_root_decoder(cxld)->window = window;
> > +
> >         rc = cxl_decoder_add(cxld, target_map);
> >         if (rc)
> >                 put_device(&cxld->dev);
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index 85fd5e84f978..0e1c65761ead 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -246,10 +246,12 @@ struct cxl_switch_decoder {
> >  /**
> >   * struct cxl_root_decoder - A toplevel/platform decoder
> >   * @base: Base class decoder
> > + * @window: host address space allocator
> >   * @targets: Downstream targets (ie. hostbridges).
> >   */
> >  struct cxl_root_decoder {
> >         struct cxl_decoder base;
> > +       struct gen_pool *window;
> >         struct cxl_decoder_targets *targets;
> >  };
> >
> > --
> > 2.35.1
> >

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 02/15] cxl/core/hdm: Bail on endpoint init fail
  2022-05-12 17:27           ` Luis Chamberlain
@ 2022-05-13 12:09             ` Jonathan Cameron
  2022-05-13 15:03               ` Dan Williams
  2022-05-13 15:12               ` Luis Chamberlain
  0 siblings, 2 replies; 53+ messages in thread
From: Jonathan Cameron @ 2022-05-13 12:09 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Ben Widawsky, Klaus Jensen, Josef Bacik, Adam Manzanares,
	Dan Williams, linux-cxl, Linux NVDIMM, patches, Alison Schofield,
	Ira Weiny, Vishal Verma

On Thu, 12 May 2022 10:27:38 -0700
Luis Chamberlain <mcgrof@kernel.org> wrote:

> On Thu, May 12, 2022 at 08:50:14AM -0700, Ben Widawsky wrote:
> > On 22-04-18 16:37:12, Adam Manzanares wrote:  
> > > On Wed, Apr 13, 2022 at 02:31:42PM -0700, Dan Williams wrote:  
> > > > On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:  
> > > > >
> > > > > Endpoint decoder enumeration is the only way in which we can determine
> > > > > Device Physical Address (DPA) -> Host Physical Address (HPA) mappings.
> > > > > Information is obtained only when the register state can be read
> > > > > sequentially. If when enumerating the decoders a failure occurs, all
> > > > > other decoders must also fail since the decoders can no longer be
> > > > > accurately managed (unless it's the last decoder in which case it can
> > > > > still work).  
> > > > 
> > > > I think this should be expanded to fail if any decoder fails to
> > > > allocate anywhere in the topology otherwise it leaves a mess for
> > > > future address translation code to work through cases where decoder
> > > > information is missing.
> > > > 
> > > > The current approach is based around the current expectation that
> > > > nothing is enumerating pre-existing regions, and nothing is performing
> > > > address translation.  
> > > 
> > > Does the qemu support currently allow testing of this patch? If so, it would 
> > > be good to reference qemu configurations. Any other alternatives would be 
> > > welcome as well. 
> > > 
> > > +Luis on cc.
> > >   
> > 
> > No. This type of error injection would be cool to have, but I'm not sure of a
> > good way to support that in a scalable way. Maybe Jonathan has some ideas?  
> 
> In case it helps on the Linux front the least intrusive way is to use
> ALLOW_ERROR_INJECTION(). It's what I hope we'll slowly strive for on
> the block layer and filesystems slowly. That incurs one macro call per error
> routine you want to allow error injection on.
> 
> Then you use debugfs to dynamically enable / disable the error
> injection / rate etc.
> 
> So I think this begs the question, what error injection mechanisms
> exist for qemu and would new functionality be welcomed?

So what paths can actually cause this to fail? Looking at the upstream
code in init_hdm_decoder() looks like there are only a few things that
are checked. 

base or size being all fs or interleave ways not being a value the
kernel understands.

For all fs, I'm not sure how we'd get that value?

For interleave ways:
Our current verification of writes to these registers in QEMU is very
limited; I think you can currently push in an invalid value. We are only
masking writes, not checking for mid-range values that don't exist.
However, that's something I'll be looking to restrict soon as we add
more input verification so I wouldn't rely on it.

I'm not aware of anything general affecting QEMU devices emulation.
I've hacked cases in as temporary tests but not sure
we'd want to carry something specific for this one.

> 
> Linux builds off a brilliantly simple simple interface borrowed from
> failmalloc [0]. The initial implementation on Linux then was also really
> simple [1] [2] [3] however it required adding stubs on each call with a
> respective build option to enable failure injection. Configuration was done
> through debugfs.
> 
> Later Josef enabled us to use BPF to allow overriding kprobed functions
> to return arbitrary values[4], and further generalized away from kprobes
> by Masami [5].
> 
> If no failure injection is present in qemu something as simple as the initial
> approach could be considered [1] [2] [3], but a dynamic interface
> would certainly be wonderful long term.
> 
> [0] http://www.nongnu.org/failmalloc/
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de1ba09b214056365d9082982905b255caafb7a2
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6ff1cb355e628f8fc55fa2d01e269e5e1bbc2fe9
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8a8b6502fb669c3a0638a08955442814cedc86b1
> [4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=92ace9991da08827e809c2d120108a96a281e7fc
> [5] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=540adea3809f61115d2a1ea4ed6e627613452ba1
> 
>   Luis
> 
> > > > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > > > > ---
> > > > >  drivers/cxl/core/hdm.c | 2 ++
> > > > >  1 file changed, 2 insertions(+)
> > > > >
> > > > > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > > > > index bfc8ee876278..c3c021b54079 100644
> > > > > --- a/drivers/cxl/core/hdm.c
> > > > > +++ b/drivers/cxl/core/hdm.c
> > > > > @@ -255,6 +255,8 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
> > > > >                                       cxlhdm->regs.hdm_decoder, i);
> > > > >                 if (rc) {
> > > > >                         put_device(&cxld->dev);
> > > > > +                       if (is_endpoint_decoder(&cxld->dev))
> > > > > +                               return rc;
> > > > >                         failed++;
> > > > >                         continue;
> > > > >                 }
> > > > > --
> > > > > 2.35.1
> > > > >  
> > > >   


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 02/15] cxl/core/hdm: Bail on endpoint init fail
  2022-05-13 12:09             ` Jonathan Cameron
@ 2022-05-13 15:03               ` Dan Williams
  2022-05-13 15:12               ` Luis Chamberlain
  1 sibling, 0 replies; 53+ messages in thread
From: Dan Williams @ 2022-05-13 15:03 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Luis Chamberlain, Ben Widawsky, Klaus Jensen, Josef Bacik,
	Adam Manzanares, linux-cxl, Linux NVDIMM, patches,
	Alison Schofield, Ira Weiny, Vishal Verma

On Fri, May 13, 2022 at 5:09 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Thu, 12 May 2022 10:27:38 -0700
> Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> > On Thu, May 12, 2022 at 08:50:14AM -0700, Ben Widawsky wrote:
> > > On 22-04-18 16:37:12, Adam Manzanares wrote:
> > > > On Wed, Apr 13, 2022 at 02:31:42PM -0700, Dan Williams wrote:
> > > > > On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > > > >
> > > > > > Endpoint decoder enumeration is the only way in which we can determine
> > > > > > Device Physical Address (DPA) -> Host Physical Address (HPA) mappings.
> > > > > > Information is obtained only when the register state can be read
> > > > > > sequentially. If when enumerating the decoders a failure occurs, all
> > > > > > other decoders must also fail since the decoders can no longer be
> > > > > > accurately managed (unless it's the last decoder in which case it can
> > > > > > still work).
> > > > >
> > > > > I think this should be expanded to fail if any decoder fails to
> > > > > allocate anywhere in the topology otherwise it leaves a mess for
> > > > > future address translation code to work through cases where decoder
> > > > > information is missing.
> > > > >
> > > > > The current approach is based around the current expectation that
> > > > > nothing is enumerating pre-existing regions, and nothing is performing
> > > > > address translation.
> > > >
> > > > Does the qemu support currently allow testing of this patch? If so, it would
> > > > be good to reference qemu configurations. Any other alternatives would be
> > > > welcome as well.
> > > >
> > > > +Luis on cc.
> > > >
> > >
> > > No. This type of error injection would be cool to have, but I'm not sure of a
> > > good way to support that in a scalable way. Maybe Jonathan has some ideas?
> >
> > In case it helps on the Linux front the least intrusive way is to use
> > ALLOW_ERROR_INJECTION(). It's what I hope we'll slowly strive for on
> > the block layer and filesystems slowly. That incurs one macro call per error
> > routine you want to allow error injection on.
> >
> > Then you use debugfs to dynamically enable / disable the error
> > injection / rate etc.
> >
> > So I think this begs the question, what error injection mechanisms
> > exist for qemu and would new functionality be welcomed?
>
> So what paths can actually cause this to fail? Looking at the upstream
> code in init_hdm_decoder() looks like there are only a few things that
> are checked.
>
> base or size being all fs or interleave ways not being a value the
> kernel understands.
>
> For all fs, I'm not sure how we'd get that value?
>
> For interleave ways:
> Our current verification of writes to these registers in QEMU is very
> limited I think you can currently push in an invalid value. We are only
> masking writes, not checking for mid range values that don't exist.
> However, that's something I'll be looking to restrict soon as we add
> more input verification so I wouldn't rely on it.
>
> I'm not aware of anything general affecting QEMU devices emulation.
> I've hacked cases in as temporary tests but not sure
> we'd want to carry something specific for this one.

This is another motivation for cxl_test. QEMU is meant to faithfully
emulate the hardware, not unit test drivers.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 02/15] cxl/core/hdm: Bail on endpoint init fail
  2022-05-13 12:09             ` Jonathan Cameron
  2022-05-13 15:03               ` Dan Williams
@ 2022-05-13 15:12               ` Luis Chamberlain
  2022-05-13 19:14                 ` Dan Williams
  1 sibling, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2022-05-13 15:12 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Ben Widawsky, Klaus Jensen, Josef Bacik, Adam Manzanares,
	Dan Williams, linux-cxl, Linux NVDIMM, patches, Alison Schofield,
	Ira Weiny, Vishal Verma

On Fri, May 13, 2022 at 01:09:09PM +0100, Jonathan Cameron wrote:
> On Thu, 12 May 2022 10:27:38 -0700
> Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> > On Thu, May 12, 2022 at 08:50:14AM -0700, Ben Widawsky wrote:
> > > On 22-04-18 16:37:12, Adam Manzanares wrote:  
> > > > On Wed, Apr 13, 2022 at 02:31:42PM -0700, Dan Williams wrote:  
> > > > > On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:  
> > > > > >
> > > > > > Endpoint decoder enumeration is the only way in which we can determine
> > > > > > Device Physical Address (DPA) -> Host Physical Address (HPA) mappings.
> > > > > > Information is obtained only when the register state can be read
> > > > > > sequentially. If when enumerating the decoders a failure occurs, all
> > > > > > other decoders must also fail since the decoders can no longer be
> > > > > > accurately managed (unless it's the last decoder in which case it can
> > > > > > still work).  
> > > > > 
> > > > > I think this should be expanded to fail if any decoder fails to
> > > > > allocate anywhere in the topology otherwise it leaves a mess for
> > > > > future address translation code to work through cases where decoder
> > > > > information is missing.
> > > > > 
> > > > > The current approach is based around the current expectation that
> > > > > nothing is enumerating pre-existing regions, and nothing is performing
> > > > > address translation.  
> > > > 
> > > > Does the qemu support currently allow testing of this patch? If so, it would 
> > > > be good to reference qemu configurations. Any other alternatives would be 
> > > > welcome as well. 
> > > > 
> > > > +Luis on cc.
> > > >   
> > > 
> > > No. This type of error injection would be cool to have, but I'm not sure of a
> > > good way to support that in a scalable way. Maybe Jonathan has some ideas?  
> > 
> > In case it helps on the Linux front the least intrusive way is to use
> > ALLOW_ERROR_INJECTION(). It's what I hope we'll slowly strive for on
> > the block layer and filesystems slowly. That incurs one macro call per error
> > routine you want to allow error injection on.
> > 
> > Then you use debugfs to dynamically enable / disable the error
> > injection / rate etc.
> > 
> > So I think this begs the question, what error injection mechanisms
> > exist for qemu and would new functionality be welcomed?
> 
> So what paths can actually cause this to fail?

If you are asking about adopting something like the failmalloc
should_fail() strategy in qemu, you'd essentially open code a call to
a should_fail() and pass into it the arguments you want from your
own call path. If you want to ignore size you can just pass 0,
for instance.

> Looking at the upstream
> code in init_hdm_decoder() looks like there are only a few things that
> are checked. 

If you mean in Linux, you would open code a should_fail()
specific to the area as in this old commit example, adding a
respective Kconfig entry for it:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8a8b6502fb669c3a0638a08955442814cedc86b1

Each of these knobs then gets its own probability, times, and space
debugfs entries which let the routine should_fail() fail when the
parameters set meet the criteria set by debugfs.

There are ways to make this much more scalable, though I had not
seen many efforts to do so. I did start such an approach using debugfs
specific to *one* kconfig entry; for instance, see this proposed block
layer change, which would in turn enable many different ways to inject
failures when CONFIG_FAIL_ADD_DISK is set:

https://lore.kernel.org/linux-block/20210512064629.13899-9-mcgrof@kernel.org/

However, at the recent LSFMM discussion of this we decided instead
to just sprinkle ALLOW_ERROR_INJECTION() after each routine. Otherwise
you are open coding tons of new should_fail() calls in your runtime
path, which can make patches hard to review and is just a lot of
noise in the code.

But with CONFIG_FAIL_FUNCTION you don't have to open code
should_fail() calls; instead, for each routine you want failure
injection support on, you just add one ALLOW_ERROR_INJECTION() annotation.

Read Documentation/fault-injection/fault-injection.rst on
fail_function/injectable and fail_function/<function-name>/retval.
For instance, we could do the following (to avoid a namespace clash
I just added the cxl_ prefix):

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 0e89a7a932d4..2aff3bace698 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -149,8 +149,8 @@ static int to_interleave_ways(u32 ctrl)
 	}
 }
 
-static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
-			    int *target_map, void __iomem *hdm, int which)
+static int cxl_init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
+				int *target_map, void __iomem *hdm, int which)
 {
 	u64 size, base;
 	u32 ctrl;
@@ -207,6 +207,7 @@ static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
 
 	return 0;
 }
+ALLOW_ERROR_INJECTION(cxl_init_hdm_decoder, ERRNO);
 
 /**
  * devm_cxl_enumerate_decoders - add decoder objects per HDM register set
@@ -251,8 +252,8 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
 			return PTR_ERR(cxld);
 		}
 
-		rc = init_hdm_decoder(port, cxld, target_map,
-				      cxlhdm->regs.hdm_decoder, i);
+		rc = cxl_init_hdm_decoder(port, cxld, target_map,
+					  cxlhdm->regs.hdm_decoder, i);
 		if (rc) {
 			put_device(&cxld->dev);
 			failed++;

This lets us modprobe the cxl modules and then:

root@kdevops-dev ~ # grep ^cxl /sys/kernel/debug/fail_function/injectable 
cxl_init_hdm_decoder [cxl_core] ERRNO

echo cxl_init_hdm_decoder > /sys/kernel/debug/fail_function/inject
printf %#x -6 > /sys/kernel/debug/fail_function/cxl_init_hdm_decoder/retval

Now this routine will return -ENXIO (-6) when called, but I think
you still need to set the probability so it is not 0:

cat /sys/kernel/debug/fail_function/probability 
0

In the world where ALLOW_ERROR_INJECTION() is not used, we'd get one
each of the probability, space, and times knobs for every new failure
injection function.
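Tying the steps above together, a complete sequence might look like the following. It is a sketch: the cxl_init_hdm_decoder knob assumes the ALLOW_ERROR_INJECTION() hunk above is applied, and each step is printed and only executed when the fail_function directory actually exists, so the sequence can be reviewed on any machine:

```shell
#!/bin/sh
# Sketch: enable fail_function for the (hypothetical) cxl_init_hdm_decoder
# knob. Needs CONFIG_FAIL_FUNCTION=y, a mounted debugfs, and root.
FF=/sys/kernel/debug/fail_function

step() {
	# Print the command; run it only if fail_function is available.
	echo "+ $1"
	if [ -d "$FF" ]; then
		sh -c "$1"
	fi
}

step "echo cxl_init_hdm_decoder > $FF/inject"           # register the function
step "printf %#x -6 > $FF/cxl_init_hdm_decoder/retval"  # fake -ENXIO
step "echo 100 > $FF/probability"                       # fail every considered call
step "echo 0 > $FF/interval"
step "echo -1 > $FF/times"                              # no limit on failures
```

Writing the function name back into the inject file (or echoing an empty string) tears the injection down again.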

> base or size being all fs or interleave ways not being a value the
> kernel understands.

I think that if you want to make use of variable failures depending on
input data you might as well open code your own should_fail() calls
as I did in the above referenced block layer patches for add_disk with
your own kconfig entry.

ALLOW_ERROR_INJECTION() seems to be useful if you are OK with simple
failures based on a return code plus the probability / times values
(per the fault-injection docs, times caps how many failures may happen
at most; -1 means no limit).

There are odd uses of ALLOW_ERROR_INJECTION() that I can't say I'm a
fan of, e.g. drivers/platform/surface/aggregator/ssh_packet_layer.c

> For all fs, I'm not sure how we'd get that value?

You'd echo the return value you want to fake a failure for into the
debugfs retval. If you open code your own should_fail() you can use
size, space, and probability however you see fit.

> For interleave ways:
> Our current verification of writes to these registers in QEMU is very
> limited I think you can currently push in an invalid value. We are only
> masking writes, not checking for mid range values that don't exist.
> However, that's something I'll be looking to restrict soon as we add
> more input verification so I wouldn't rely on it.
> 
> I'm not aware of anything general affecting QEMU devices emulation.
> I've hacked cases in as temporary tests but not sure
> we'd want to carry something specific for this one.

I'd imagine, as in any subsystem, finding the few key areas you *do*
want failure injection coverage for will be a fun next step. It is
understandable if init_hdm_decoder() is not it.

  Luis

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 02/15] cxl/core/hdm: Bail on endpoint init fail
  2022-05-13 15:12               ` Luis Chamberlain
@ 2022-05-13 19:14                 ` Dan Williams
  2022-05-13 19:31                   ` Luis Chamberlain
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2022-05-13 19:14 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Jonathan Cameron, Ben Widawsky, Klaus Jensen, Josef Bacik,
	Adam Manzanares, linux-cxl, Linux NVDIMM, patches,
	Alison Schofield, Ira Weiny, Vishal Verma

On Fri, May 13, 2022 at 8:12 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Fri, May 13, 2022 at 01:09:09PM +0100, Jonathan Cameron wrote:
> > On Thu, 12 May 2022 10:27:38 -0700
> > Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > > On Thu, May 12, 2022 at 08:50:14AM -0700, Ben Widawsky wrote:
> > > > On 22-04-18 16:37:12, Adam Manzanares wrote:
> > > > > On Wed, Apr 13, 2022 at 02:31:42PM -0700, Dan Williams wrote:
> > > > > > On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > > > > >
> > > > > > > Endpoint decoder enumeration is the only way in which we can determine
> > > > > > > Device Physical Address (DPA) -> Host Physical Address (HPA) mappings.
> > > > > > > Information is obtained only when the register state can be read
> > > > > > > sequentially. If when enumerating the decoders a failure occurs, all
> > > > > > > other decoders must also fail since the decoders can no longer be
> > > > > > > accurately managed (unless it's the last decoder in which case it can
> > > > > > > still work).
> > > > > >
> > > > > > I think this should be expanded to fail if any decoder fails to
> > > > > > allocate anywhere in the topology otherwise it leaves a mess for
> > > > > > future address translation code to work through cases where decoder
> > > > > > information is missing.
> > > > > >
> > > > > > The current approach is based around the current expectation that
> > > > > > nothing is enumerating pre-existing regions, and nothing is performing
> > > > > > address translation.
> > > > >
> > > > > Does the qemu support currently allow testing of this patch? If so, it would
> > > > > be good to reference qemu configurations. Any other alternatives would be
> > > > > welcome as well.
> > > > >
> > > > > +Luis on cc.
> > > > >
> > > >
> > > > No. This type of error injection would be cool to have, but I'm not sure of a
> > > > good way to support that in a scalable way. Maybe Jonathan has some ideas?
> > >
> > > In case it helps on the Linux front the least intrusive way is to use
> > > ALLOW_ERROR_INJECTION(). It's what I hope we'll slowly strive for on
> > > the block layer and filesystems. That incurs one macro call per
> > > routine you want to allow error injection on.
> > >
> > > Then you use debugfs to dynamically enable / disable the error
> > > injection / rate etc.
> > >
> > > So I think this begs the question, what error injection mechanisms
> > > exist for qemu and would new functionality be welcomed?
> >
> > So what paths can actually cause this to fail?
>
> If you are asking about adopting something like the failmalloc
> should_fail() strategy in qemu, you'd essentially open code a call to
> a should_fail() and in it pass the arguments you want from your
> own call down. If you want to ignore size you can just pass 0
> for instance.
>
> > Looking at the upstream
> > code in init_hdm_decoder() looks like there are only a few things that
> > are checked.
>
> If you mean in Linux, you would open code a should_fail()
> specific to the area, as in this old commit example, adding a
> respective kconfig entry for it:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8a8b6502fb669c3a0638a08955442814cedc86b1
>
> Each of these knobs then gets its own probability, times, and space
> debugfs entries, which let should_fail() fail when the call meets
> the criteria set via debugfs.
>
> There are ways to make this much more scalable though, but I had not
> seen many efforts to do so. I did start such an approach using debugfs
> specific to *one* kconfig entry, for instance see this block layer proposed
> change, which would in turn enable tons of different ways to enable failing
> if CONFIG_FAIL_ADD_DISK would be used:
>
> https://lore.kernel.org/linux-block/20210512064629.13899-9-mcgrof@kernel.org/
>
> However, at the recent discussion at LSFMM for this we decided instead
> to just sprinkle ALLOW_ERROR_INJECTION() after each routine. Otherwise
> you are open coding tons of new "should_fail()" calls in your runtime
> path and that can make it hard to review patches and is just a lot of
> noise in code.
>
> But with CONFIG_FAIL_FUNCTION this means you don't have to open code
> should_fail() calls, but instead for each routine you want to add a failure
> injection support you'd just use ALLOW_ERROR_INJECTION() per call.

So cxl_test takes the opposite approach and tries not to pollute the
production code with test instrumentation. All of the infrastructure
to replace calls and inject mocked values is self contained in
tools/testing/cxl/ where it builds replacement modules with test
instrumentation. Otherwise it's a maintenance burden, in my view, to
read the error injection macros in the nominal code paths.
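For context, the substitution cxl_test performs happens at link time: tools/testing/cxl/Kbuild uses the linker's symbol wrapping to swap production routines for mocks. Roughly (devm_cxl_enumerate_decoders is one example of the wrapped set; check the file for the authoritative list):

```make
# Redirect every call to devm_cxl_enumerate_decoders() in the modules built
# here to __wrap_devm_cxl_enumerate_decoders(), which the test harness
# defines and which can return mocked topology or injected errors --
# production code under drivers/cxl/ is left untouched.
ldflags-y += --wrap=devm_cxl_enumerate_decoders
```

Error-injection hooks could then live entirely inside the __wrap_* definitions.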

> Read Documentation/fault-injection/fault-injection.rst on
> fail_function/injectable and fail_function/<function-name>/retval,
> so we could do for instance, to avoid a namespace clash I just
> added the cxl_ prefix:

Certainly those would be good to use behind the mocked interfaces.


* Re: [RFC PATCH 02/15] cxl/core/hdm: Bail on endpoint init fail
  2022-05-13 19:14                 ` Dan Williams
@ 2022-05-13 19:31                   ` Luis Chamberlain
  2022-05-19  5:09                     ` Dan Williams
  0 siblings, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2022-05-13 19:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jonathan Cameron, Ben Widawsky, Klaus Jensen, Josef Bacik,
	Adam Manzanares, linux-cxl, Linux NVDIMM, patches,
	Alison Schofield, Ira Weiny, Vishal Verma

On Fri, May 13, 2022 at 12:14:51PM -0700, Dan Williams wrote:
> On Fri, May 13, 2022 at 8:12 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > But with CONFIG_FAIL_FUNCTION this means you don't have to open code
> > should_fail() calls, but instead for each routine you want to add a failure
> > injection support you'd just use ALLOW_ERROR_INJECTION() per call.
> 
> So cxl_test takes the opposite approach and tries not to pollute the
> production code with test instrumentation. All of the infrastructure
> to replace calls and inject mocked values is self contained in
> tools/testing/cxl/ where it builds replacement modules with test
> instrumentation. Otherwise it's a maintenance burden, in my view, to
> read the error injection macros in the nominal code paths.

Is relying on just ALLOW_ERROR_INJECTION() per routine you'd want
to enable error injection for really too much to swallow?

  Luis


* Re: [RFC PATCH 02/15] cxl/core/hdm: Bail on endpoint init fail
  2022-05-13 19:31                   ` Luis Chamberlain
@ 2022-05-19  5:09                     ` Dan Williams
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2022-05-19  5:09 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Jonathan Cameron, Ben Widawsky, Klaus Jensen, Josef Bacik,
	Adam Manzanares, linux-cxl, Linux NVDIMM, patches,
	Alison Schofield, Ira Weiny, Vishal Verma

On Fri, May 13, 2022 at 12:32 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Fri, May 13, 2022 at 12:14:51PM -0700, Dan Williams wrote:
> > On Fri, May 13, 2022 at 8:12 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > > But with CONFIG_FAIL_FUNCTION this means you don't have to open code
> > > should_fail() calls, but instead for each routine you want to add a failure
> > > injection support you'd just use ALLOW_ERROR_INJECTION() per call.
> >
> > So cxl_test takes the opposite approach and tries not to pollute the
> > production code with test instrumentation. All of the infrastructure
> > to replace calls and inject mocked values is self contained in
> > tools/testing/cxl/ where it builds replacement modules with test
> > instrumentation. Otherwise it's a maintenance burden, in my view, to
> > read the error injection macros in the nominal code paths.
>
> Is relying on just ALLOW_ERROR_INJECTION() per routine you'd want
> to enable error injection for really too much to swallow?

Inline? To me, yes. However, it seems the perfect thing to hide
out-of-line in a mocked call injected from tools/testing/.


* Re: [RFC PATCH 00/15] Region driver
  2022-04-13 18:37 [RFC PATCH 00/15] Region driver Ben Widawsky
                   ` (14 preceding siblings ...)
  2022-04-13 18:37 ` [RFC PATCH 15/15] cxl/region: Introduce a cxl_region driver Ben Widawsky
@ 2022-05-20 16:23 ` Jonathan Cameron
  2022-05-20 16:41   ` Dan Williams
  15 siblings, 1 reply; 53+ messages in thread
From: Jonathan Cameron @ 2022-05-20 16:23 UTC (permalink / raw)
  To: Ben Widawsky, linux-cxl
  Cc: nvdimm, patches, Alison Schofield, Dan Williams, Ira Weiny, Vishal Verma

On Wed, 13 Apr 2022 11:37:05 -0700
Ben Widawsky <ben.widawsky@intel.com> wrote:

> Spring cleaning is here and we're starting fresh so I won't be referencing
> previous postings and I've removed revision history from commit messages.
> 
> This patch series introduces the CXL region driver as well as associated APIs in
> CXL core to create and configure regions. Regions are defined by the CXL 2.0
> specification [1], a summary follows.
> 
> A region surfaces a swath of RAM (persistent or volatile) that appears as normal
> memory to the operating system. The memory, unless programmed by BIOS or a
> previous operating system, is inaccessible until the CXL driver creates a region
> for it. A region may be strided (interleave granularity) across multiple devices
> (interleave ways). The interleaving may traverse multiple levels of the CXL
> hierarchy.
> 
> +-------------------------+      +-------------------------+
> |                         |      |                         |
> |   CXL 2.0 Host Bridge   |      |   CXL 2.0 Host Bridge   |
> |                         |      |                         |
> |  +------+     +------+  |      |  +------+     +------+  |
> |  |  RP  |     |  RP  |  |      |  |  RP  |     |  RP  |  |
> +--+------+-----+------+--+      +--+------+-----+------+--+
>       |            |                   |               \--
>       |            |                   |        +-------+-\--+------+
>    +------+    +-------+            +-------+   |       |USP |      |
>    |Type 3|    |Type 3 |            |Type 3 |   |       +----+      |
>    |Device|    |Device |            |Device |   |     CXL Switch    |
>    +------+    +-------+            +-------+   | +----+     +----+ |
>                                                 | |DSP |     |DSP | |
>                                                 +-+-|--+-----+-|--+-+
>                                                     |          |
>                                                 +------+    +-------+
>                                                 |Type 3|    |Type 3 |
>                                                 |Device|    |Device |
>                                                 +------+    +-------+
> 
> Region verification and programming state are owned by the cxl_region driver
> (implemented in the cxl_region module). Much of the region driver is an
> implementation of algorithms described in the CXL Type 3 Memory Device Software
> Guide [2].
> 
> The region driver is responsible for configuring regions found on persistent
> capacities in the Label Storage Area (LSA), it will also enumerate regions
> configured by BIOS, usually volatile capacities, and will allow for dynamic
> region creation (which can then be stored in the LSA). Only dynamically created
> regions are implemented thus far.
> 
> Dan has previously stated that he doesn't want to merge ABI until the whole
> series is posted and reviewed, to make sure we have no gaps. As such, the goal
> of posting this series is *not* to discuss the ABI specifically; feedback is of
> course welcome. In other words, it has been discussed previously. The goal is to find
> architectural flaws in the implementation of the ABI that may prove problematic
> for cases we haven't yet conceived.
> 
> Since region creation is done via sysfs, it is left to userspace to prevent
> racing for resource usage. Here is an overview of creating an x1 256M
> dynamically created region to be used by userspace clients. In this
> example, the following topology is used (cropped for brevity):
> /sys/bus/cxl/devices/
> ├── decoder0.0 -> ../../../devices/platform/ACPI0017:00/root0/decoder0.0
> ├── decoder0.1 -> ../../../devices/platform/ACPI0017:00/root0/decoder0.1
> ├── decoder1.0 -> ../../../devices/platform/ACPI0017:00/root0/port1/decoder1.0
> ├── decoder2.0 -> ../../../devices/platform/ACPI0017:00/root0/port2/decoder2.0
> ├── decoder3.0 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint3/decoder3.0
> ├── decoder4.0 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint4/decoder4.0
> ├── decoder5.0 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint5/decoder5.0
> ├── decoder6.0 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint6/decoder6.0
> ├── endpoint3 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint3
> ├── endpoint4 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint4
> ├── endpoint5 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint5
> ├── endpoint6 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint6
> ...
> 
> 1. Select a Root Decoder whose interleave spans the desired interleave config
>    - devices, IG, IW, Large enough address space.
>    - ie. pick decoder0.0
> 2. Program the decoders for the endpoints comprising the interleave set.
>    - ie. echo $((256 << 20)) > /sys/bus/cxl/devices/decoder3.0
> 3. Create a region
>    - ie. echo $(cat create_pmem_region) >| create_pmem_region
> 4. Configure a region
>    - ie. echo 256 >| interleave_granularity
> 	 echo 1 >| interleave_ways
> 	 echo $((256 << 20)) >| size
> 	 echo decoder3.0 >| target0
> 5. Bind the region driver to the region
>    - ie. echo region0 > /sys/bus/cxl/drivers/cxl_region/bind
> 
Hi Ben,

I finally got around to actually trying this out on top of Dan's recent fix set
(I rebased it from the cxl/preview branch on kernel.org).

I'm not having much luck actually bringing up a region.

The patch set refers to configuring the endpoint decoders, but all their
sysfs attributes are read only.  Am I missing a dependency somewhere or
is the intent that this series is part of the solution only?

I'm confused!

Jonathan

> 
> [1]: https://www.computeexpresslink.org/download-the-specification
> [2]: https://cdrdv2.intel.com/v1/dl/getContent/643805?wapkw=CXL%20memory%20device%20sw%20guide
> 
> Ben Widawsky (15):
>   cxl/core: Use is_endpoint_decoder
>   cxl/core/hdm: Bail on endpoint init fail
>   Revert "cxl/core: Convert decoder range to resource"
>   cxl/core: Create distinct decoder structs
>   cxl/acpi: Reserve CXL resources from request_free_mem_region
>   cxl/acpi: Manage root decoder's address space
>   cxl/port: Surface ram and pmem resources
>   cxl/core/hdm: Allocate resources from the media
>   cxl/core/port: Add attrs for size and volatility
>   cxl/core: Extract IW/IG decoding
>   cxl/acpi: Use common IW/IG decoding
>   cxl/region: Add region creation ABI
>   cxl/core/port: Add attrs for root ways & granularity
>   cxl/region: Introduce configuration
>   cxl/region: Introduce a cxl_region driver
> 
>  Documentation/ABI/testing/sysfs-bus-cxl       |  96 ++-
>  .../driver-api/cxl/memory-devices.rst         |  14 +
>  drivers/cxl/Kconfig                           |  10 +
>  drivers/cxl/Makefile                          |   2 +
>  drivers/cxl/acpi.c                            |  83 ++-
>  drivers/cxl/core/Makefile                     |   1 +
>  drivers/cxl/core/core.h                       |   4 +
>  drivers/cxl/core/hdm.c                        |  44 +-
>  drivers/cxl/core/port.c                       | 363 ++++++++--
>  drivers/cxl/core/region.c                     | 669 ++++++++++++++++++
>  drivers/cxl/cxl.h                             | 168 ++++-
>  drivers/cxl/mem.c                             |   7 +-
>  drivers/cxl/region.c                          | 333 +++++++++
>  drivers/cxl/region.h                          | 105 +++
>  include/linux/ioport.h                        |   1 +
>  kernel/resource.c                             |  11 +-
>  tools/testing/cxl/Kbuild                      |   1 +
>  tools/testing/cxl/test/cxl.c                  |   2 +-
>  18 files changed, 1810 insertions(+), 104 deletions(-)
>  create mode 100644 drivers/cxl/core/region.c
>  create mode 100644 drivers/cxl/region.c
>  create mode 100644 drivers/cxl/region.h
> 
> 
> base-commit: 7dc1d11d7abae52aada5340fb98885f0ddbb7c37



* Re: [RFC PATCH 00/15] Region driver
  2022-05-20 16:23 ` [RFC PATCH 00/15] Region driver Jonathan Cameron
@ 2022-05-20 16:41   ` Dan Williams
  2022-05-31 12:21     ` Jonathan Cameron
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2022-05-20 16:41 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Ben Widawsky, linux-cxl, nvdimm, patches, Alison Schofield,
	Ira Weiny, Vishal Verma

On Fri, May 20, 2022 at 9:23 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Wed, 13 Apr 2022 11:37:05 -0700
> Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> > Spring cleaning is here and we're starting fresh so I won't be referencing
> > previous postings and I've removed revision history from commit messages.
> >
> > This patch series introduces the CXL region driver as well as associated APIs in
> > CXL core to create and configure regions. Regions are defined by the CXL 2.0
> > specification [1], a summary follows.
> >
> > A region surfaces a swath of RAM (persistent or volatile) that appears as normal
> > memory to the operating system. The memory, unless programmed by BIOS or a
> > previous operating system, is inaccessible until the CXL driver creates a region
> > for it. A region may be strided (interleave granularity) across multiple devices
> > (interleave ways). The interleaving may traverse multiple levels of the CXL
> > hierarchy.
> >
> > +-------------------------+      +-------------------------+
> > |                         |      |                         |
> > |   CXL 2.0 Host Bridge   |      |   CXL 2.0 Host Bridge   |
> > |                         |      |                         |
> > |  +------+     +------+  |      |  +------+     +------+  |
> > |  |  RP  |     |  RP  |  |      |  |  RP  |     |  RP  |  |
> > +--+------+-----+------+--+      +--+------+-----+------+--+
> >       |            |                   |               \--
> >       |            |                   |        +-------+-\--+------+
> >    +------+    +-------+            +-------+   |       |USP |      |
> >    |Type 3|    |Type 3 |            |Type 3 |   |       +----+      |
> >    |Device|    |Device |            |Device |   |     CXL Switch    |
> >    +------+    +-------+            +-------+   | +----+     +----+ |
> >                                                 | |DSP |     |DSP | |
> >                                                 +-+-|--+-----+-|--+-+
> >                                                     |          |
> >                                                 +------+    +-------+
> >                                                 |Type 3|    |Type 3 |
> >                                                 |Device|    |Device |
> >                                                 +------+    +-------+
> >
> > Region verification and programming state are owned by the cxl_region driver
> > (implemented in the cxl_region module). Much of the region driver is an
> > implementation of algorithms described in the CXL Type 3 Memory Device Software
> > Guide [2].
> >
> > The region driver is responsible for configuring regions found on persistent
> > capacities in the Label Storage Area (LSA), it will also enumerate regions
> > configured by BIOS, usually volatile capacities, and will allow for dynamic
> > region creation (which can then be stored in the LSA). Only dynamically created
> > regions are implemented thus far.
> >
> > Dan has previously stated that he doesn't want to merge ABI until the whole
> > series is posted and reviewed, to make sure we have no gaps. As such, the goal
> > of posting this series is *not* to discuss the ABI specifically; feedback is of
> > course welcome. In other words, it has been discussed previously. The goal is to find
> > architectural flaws in the implementation of the ABI that may prove problematic
> > for cases we haven't yet conceived.
> >
> > Since region creation is done via sysfs, it is left to userspace to prevent
> > racing for resource usage. Here is an overview of creating an x1 256M
> > dynamically created region to be used by userspace clients. In this
> > example, the following topology is used (cropped for brevity):
> > /sys/bus/cxl/devices/
> > ├── decoder0.0 -> ../../../devices/platform/ACPI0017:00/root0/decoder0.0
> > ├── decoder0.1 -> ../../../devices/platform/ACPI0017:00/root0/decoder0.1
> > ├── decoder1.0 -> ../../../devices/platform/ACPI0017:00/root0/port1/decoder1.0
> > ├── decoder2.0 -> ../../../devices/platform/ACPI0017:00/root0/port2/decoder2.0
> > ├── decoder3.0 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint3/decoder3.0
> > ├── decoder4.0 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint4/decoder4.0
> > ├── decoder5.0 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint5/decoder5.0
> > ├── decoder6.0 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint6/decoder6.0
> > ├── endpoint3 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint3
> > ├── endpoint4 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint4
> > ├── endpoint5 -> ../../../devices/platform/ACPI0017:00/root0/port1/endpoint5
> > ├── endpoint6 -> ../../../devices/platform/ACPI0017:00/root0/port2/endpoint6
> > ...
> >
> > 1. Select a Root Decoder whose interleave spans the desired interleave config
> >    - devices, IG, IW, Large enough address space.
> >    - ie. pick decoder0.0
> > 2. Program the decoders for the endpoints comprising the interleave set.
> >    - ie. echo $((256 << 20)) > /sys/bus/cxl/devices/decoder3.0
> > 3. Create a region
> >    - ie. echo $(cat create_pmem_region) >| create_pmem_region
> > 4. Configure a region
> >    - ie. echo 256 >| interleave_granularity
> >        echo 1 >| interleave_ways
> >        echo $((256 << 20)) >| size
> >        echo decoder3.0 >| target0
> > 5. Bind the region driver to the region
> >    - ie. echo region0 > /sys/bus/cxl/drivers/cxl_region/bind
> >
> Hi Ben,
>
> I finally got around to actually trying this out on top of Dan's recent fix set
> (I rebased it from the cxl/preview branch on kernel.org).
>
> I'm not having much luck actually bring up a region.
>
> The patch set refers to configuring the end point decoders, but all their
> sysfs attributes are read only.  Am I missing a dependency somewhere or
> is the intent that this series is part of the solution only?
>
> I'm confused!

There's a new series that's being reviewed internally before going to the list:

https://gitlab.com/bwidawsk/linux/-/tree/cxl_region-redux3

Given the proximity to the merge window opening and the need to get
the "mem_enabled" series staged, I asked Ben to hold it back from the
list for now.

There are some changes I am folding into it, but I hope to send it out
in the next few days after "mem_enabled" is finalized.


* Re: [RFC PATCH 00/15] Region driver
  2022-05-20 16:41   ` Dan Williams
@ 2022-05-31 12:21     ` Jonathan Cameron
  2022-06-23  5:40       ` Dan Williams
  0 siblings, 1 reply; 53+ messages in thread
From: Jonathan Cameron @ 2022-05-31 12:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ben Widawsky, linux-cxl, nvdimm, patches, Alison Schofield,
	Ira Weiny, Vishal Verma

....

> > Hi Ben,
> >
> > I finally got around to actually trying this out on top of Dan's recent fix set
> > (I rebased it from the cxl/preview branch on kernel.org).
> >
> > I'm not having much luck actually bring up a region.
> >
> > The patch set refers to configuring the end point decoders, but all their
> > sysfs attributes are read only.  Am I missing a dependency somewhere or
> > is the intent that this series is part of the solution only?
> >
> > I'm confused!  
> 
> There's a new series that's being reviewed internally before going to the list:
> 
> https://gitlab.com/bwidawsk/linux/-/tree/cxl_region-redux3
> 
> Given the proximity to the merge window opening and the need to get
> the "mem_enabled" series staged, I asked Ben to hold it back from the
> list for now.
> 
> There are some changes I am folding into it, but I hope to send it out
> in the next few days after "mem_enabled" is finalized.

Hi Dan,

I switched from an earlier version of the region code over to a rebase of the tree.
Two issues below you may already have fixed.

The second is a carry-over from an earlier set, so I haven't tested
without it, but it looks like it's still valid.

Anyhow, thought it might save some cycles to preempt you sending
out the series if these issues are still present.

Minimal testing so far on these with 2 HBs, 2 RPs, and 4 directly connected
devices, but once you post I'll test more extensively.  I've not
really thought about the below much, so it might not be the best way to fix.

Found a bug in QEMU code as well (missing write masks for the
target list registers) - will post fix for that shortly.

Thanks,

Jonathan


From fa31f37214fcb121428be1ceb87ae335209fa4cc Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Date: Tue, 31 May 2022 13:13:51 +0100
Subject: [PATCH] Fixes for region code

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/cxl/region.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
index e81c5b1339ec..fbbf084004d9 100644
--- a/drivers/cxl/region.c
+++ b/drivers/cxl/region.c
@@ -229,10 +229,10 @@ static struct cxl_decoder *stage_decoder(struct cxl_region *cxlr,

        return cxld;
 }
-
+// calculating whilst working down the tree - so divide granularity of previous level by local ways.
 static int calculate_ig(struct cxl_decoder *pcxld)
 {
-       return cxl_to_interleave_granularity(cxl_from_granularity(pcxld->cip.g) + cxl_from_ways(pcxld->cip.w));
+       return cxl_to_interleave_granularity(cxl_from_granularity(pcxld->cip.g) - cxl_from_ways(pcxld->cip.w));
 }

 static void unstage_decoder(struct cxl_decoder *cxld)
@@ -302,7 +302,8 @@ static struct cxl_decoder *stage_hb(struct cxl_region *cxlr,
                          t->nr_targets))
                return NULL;

-       t->target[port_grouping] = root_port;
+       //      t->target[port_grouping] = root_port;
+       t->target[hbd->cip.w] = root_port;
        hbd->cip.w++;

        /* If no switch, root port is connected to memdev */
--
2.32.0

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 00/15] Region driver
  2022-05-31 12:21     ` Jonathan Cameron
@ 2022-06-23  5:40       ` Dan Williams
  2022-06-23 15:08         ` Jonathan Cameron
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2022-06-23  5:40 UTC (permalink / raw)
  To: Jonathan Cameron, Dan Williams
  Cc: Ben Widawsky, linux-cxl, nvdimm, patches, Alison Schofield,
	Ira Weiny, Vishal Verma

Jonathan Cameron wrote:
> ....
> 
> > > Hi Ben,
> > >
> > > I finally got around to actually trying this out on top of Dan's recent fix set
> > > (I rebased it from the cxl/preview branch on kernel.org).
> > >
> > > I'm not having much luck actually bring up a region.
> > >
> > > The patch set refers to configuring the end point decoders, but all their
> > > sysfs attributes are read only.  Am I missing a dependency somewhere or
> > > is the intent that this series is part of the solution only?
> > >
> > > I'm confused!  
> > 
> > There's a new series that's being reviewed internally before going to the list:
> > 
> > https://gitlab.com/bwidawsk/linux/-/tree/cxl_region-redux3
> > 
> > Given the proximity to the merge window opening and the need to get
> > the "mem_enabled" series staged, I asked Ben to hold it back from the
> > list for now.
> > 
> > There are some changes I am folding into it, but I hope to send it out
> > in the next few days after "mem_enabled" is finalized.
> 
> Hi Dan,
> 
> I switched from an earlier version of the region code over to a rebase of the tree.
> Two issues below you may already have fixed.
> 
> The second is a carry over from an earlier set so I haven't tested
> without it but looks like it's still valid.
> 
> Anyhow, thought it might save some cycles to preempt you sending
> out the series if these issues are still present.
> 
> Minimal testing so far on these with 2 hb, 2 rp, 4 directly connected
> devices, but once you post I'll test more extensively.  I've not
> really thought about the below much, so might not be best way to fix.
> 
> Found a bug in QEMU code as well (missing write masks for the
> target list registers) - will post fix for that shortly.

Hi Jonathan,

Tomorrow I'll post the tranche to the list, but wanted to let you and
others watching know that the 'preview' branch [1] now has the proposed
initial region support. Once the bots give the thumbs up I'll send it
along.

To date I've only tested it with cxl_test and an internal test vehicle.
The cxl_test script I used to set up and tear down an x8 interleave across
x2 host bridges and x4 switches is:

---

#!/bin/bash
modprobe cxl_test
udevadm settle
decoder=$(cxl list -b cxl_test -D -d root | jq -r ".[] |
          select(.pmem_capable == true) | 
          select(.nr_targets == 2) |
          .decoder")

readarray -t mem < <(cxl list -M -d $decoder | jq -r ".[].memdev")
readarray -t endpoint < <(cxl reserve-dpa -t pmem ${mem[*]} -s $((256<<20)) |
                          jq -r ".[] | .decoder.decoder")
region=$(cat /sys/bus/cxl/devices/$decoder/create_pmem_region)
echo $region > /sys/bus/cxl/devices/$decoder/create_pmem_region
uuidgen > /sys/bus/cxl/devices/$region/uuid
nr_targets=${#endpoint[@]}
echo $nr_targets > /sys/bus/cxl/devices/$region/interleave_ways
g=$(cat /sys/bus/cxl/devices/$decoder/interleave_granularity)
echo $g > /sys/bus/cxl/devices/$region/interleave_granularity
echo $((nr_targets * (256<<20))) > /sys/bus/cxl/devices/$region/size
port_dev0=$(cxl list -T -d $decoder | jq -r ".[] |
            .targets | .[] | select(.position == 0) | .target")
port_dev1=$(cxl list -T -d $decoder | jq -r ".[] |
            .targets | .[] | select(.position == 1) | .target")
readarray -t mem_sort0 < <(cxl list -M -p $port_dev0 | jq -r ".[] | .memdev")
readarray -t mem_sort1 < <(cxl list -M -p $port_dev1 | jq -r ".[] | .memdev")
mem_sort=()
mem_sort[0]=${mem_sort0[0]}
mem_sort[1]=${mem_sort1[0]}
mem_sort[2]=${mem_sort0[2]}
mem_sort[3]=${mem_sort1[2]}
mem_sort[4]=${mem_sort0[1]}
mem_sort[5]=${mem_sort1[1]}
mem_sort[6]=${mem_sort0[3]}
mem_sort[7]=${mem_sort1[3]}

#mem_sort[2]=${mem_sort0[0]}
#mem_sort[1]=${mem_sort1[0]}
#mem_sort[0]=${mem_sort0[2]}
#mem_sort[3]=${mem_sort1[2]}
#mem_sort[4]=${mem_sort0[1]}
#mem_sort[5]=${mem_sort1[1]}
#mem_sort[6]=${mem_sort0[3]}
#mem_sort[7]=${mem_sort1[3]}

endpoint=()
for i in ${mem_sort[@]}
do
        readarray -O ${#endpoint[@]} -t endpoint < <(cxl list -Di -d endpoint -m $i | jq -r ".[] |
                                                     select(.mode == \"pmem\") | .decoder")
done
pos=0
for i in ${endpoint[@]}
do
        echo $i > /sys/bus/cxl/devices/$region/target$pos
        pos=$((pos+1))
done
echo "$region added ${#endpoint[@]} targets: ${endpoint[@]}"

echo 1 > /sys/bus/cxl/devices/$region/commit
echo 0 > /sys/bus/cxl/devices/$region/commit

pos=0
for i in ${endpoint[@]}
do
        echo "" > /sys/bus/cxl/devices/$region/target$pos
        pos=$((pos+1))
done
readarray -t endpoint < <(cxl free-dpa -t pmem ${mem[*]} |
                          jq -r ".[] | .decoder.decoder")
echo "$region released ${#endpoint[@]} targets: ${endpoint[@]}"

---

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=preview

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 00/15] Region driver
  2022-06-23  5:40       ` Dan Williams
@ 2022-06-23 15:08         ` Jonathan Cameron
  2022-06-23 17:33           ` Dan Williams
  0 siblings, 1 reply; 53+ messages in thread
From: Jonathan Cameron @ 2022-06-23 15:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ben Widawsky, linux-cxl, nvdimm, patches, Alison Schofield,
	Ira Weiny, Vishal Verma

On Wed, 22 Jun 2022 22:40:48 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan Cameron wrote:
> > ....
> >   
> > > > Hi Ben,
> > > >
> > > > I finally got around to actually trying this out on top of Dan's recent fix set
> > > > (I rebased it from the cxl/preview branch on kernel.org).
> > > >
> > > > I'm not having much luck actually bring up a region.
> > > >
> > > > The patch set refers to configuring the end point decoders, but all their
> > > > sysfs attributes are read only.  Am I missing a dependency somewhere or
> > > > is the intent that this series is part of the solution only?
> > > >
> > > > I'm confused!    
> > > 
> > > There's a new series that's being reviewed internally before going to the list:
> > > 
> > > https://gitlab.com/bwidawsk/linux/-/tree/cxl_region-redux3
> > > 
> > > Given the proximity to the merge window opening and the need to get
> > > the "mem_enabled" series staged, I asked Ben to hold it back from the
> > > list for now.
> > > 
> > > There are some changes I am folding into it, but I hope to send it out
> > > in the next few days after "mem_enabled" is finalized.  
> > 
> > Hi Dan,
> > 
> > I switched from an earlier version of the region code over to a rebase of the tree.
> > Two issues below you may already have fixed.
> > 
> > The second is a carry over from an earlier set so I haven't tested
> > without it but looks like it's still valid.
> > 
> > Anyhow, thought it might save some cycles to preempt you sending
> > out the series if these issues are still present.
> > 
> > Minimal testing so far on these with 2 hb, 2 rp, 4 directly connected
> > devices, but once you post I'll test more extensively.  I've not
> > really thought about the below much, so might not be best way to fix.
> > 
> > Found a bug in QEMU code as well (missing write masks for the
> > target list registers) - will post fix for that shortly.  
> 
> Hi Jonathan,
> 
> Tomorrow I'll post the tranche to the list, but wanted to let you and
> others watching that that the 'preview' branch [1] now has the proposed
> initial region support. Once the bots give the thumbs up I'll send it
> along.
> 
> To date I've only tested it with cxl_test and an internal test vehicle.
> The cxl_test script I used to setup and teardown a x8 interleave across
> x2 host bridges and x4 switches is:

Thanks.  Trivial feedback from a very quick play (busy day).

Bit odd that regionX/size is write-once - I get an error even if
writing the same value to it twice.

Also not debugged yet, but I just got a null pointer dereference on

echo decoder3.0 > target0

Beyond a stack trace pointing at store_targetN and a dereference of
0x00008, no idea yet.

I was testing with a slightly modified version of a nasty script
I was using to test Ben's code previously.  Might well be
doing something wrong, but obviously we need to fix that crash anyway!
Will move to your nicer script below at some point, as I've been lazy
enough that I'm still hand-editing a few lines depending on the
numbering on a particular run.

Should have some time tomorrow to debug, but definitely 'here be
dragons' at the moment.

Jonathan

> 
> ---
> 
> #!/bin/bash
> modprobe cxl_test
> udevadm settle
> decoder=$(cxl list -b cxl_test -D -d root | jq -r ".[] |
>           select(.pmem_capable == true) | 
>           select(.nr_targets == 2) |
>           .decoder")
> 
> readarray -t mem < <(cxl list -M -d $decoder | jq -r ".[].memdev")
> readarray -t endpoint < <(cxl reserve-dpa -t pmem ${mem[*]} -s $((256<<20)) |
>                           jq -r ".[] | .decoder.decoder")
> region=$(cat /sys/bus/cxl/devices/$decoder/create_pmem_region)
> echo $region > /sys/bus/cxl/devices/$decoder/create_pmem_region
> uuidgen > /sys/bus/cxl/devices/$region/uuid
> nr_targets=${#endpoint[@]}
> echo $nr_targets > /sys/bus/cxl/devices/$region/interleave_ways
> g=$(cat /sys/bus/cxl/devices/$decoder/interleave_granularity)
> echo $g > /sys/bus/cxl/devices/$region/interleave_granularity
> echo $((nr_targets * (256<<20))) > /sys/bus/cxl/devices/$region/size
> port_dev0=$(cxl list -T -d $decoder | jq -r ".[] |
>             .targets | .[] | select(.position == 0) | .target")
> port_dev1=$(cxl list -T -d $decoder | jq -r ".[] |
>             .targets | .[] | select(.position == 1) | .target")
> readarray -t mem_sort0 < <(cxl list -M -p $port_dev0 | jq -r ".[] | .memdev")
> readarray -t mem_sort1 < <(cxl list -M -p $port_dev1 | jq -r ".[] | .memdev")
> mem_sort=()
> mem_sort[0]=${mem_sort0[0]}
> mem_sort[1]=${mem_sort1[0]}
> mem_sort[2]=${mem_sort0[2]}
> mem_sort[3]=${mem_sort1[2]}
> mem_sort[4]=${mem_sort0[1]}
> mem_sort[5]=${mem_sort1[1]}
> mem_sort[6]=${mem_sort0[3]}
> mem_sort[7]=${mem_sort1[3]}
> 
> #mem_sort[2]=${mem_sort0[0]}
> #mem_sort[1]=${mem_sort1[0]}
> #mem_sort[0]=${mem_sort0[2]}
> #mem_sort[3]=${mem_sort1[2]}
> #mem_sort[4]=${mem_sort0[1]}
> #mem_sort[5]=${mem_sort1[1]}
> #mem_sort[6]=${mem_sort0[3]}
> #mem_sort[7]=${mem_sort1[3]}
> 
> endpoint=()
> for i in ${mem_sort[@]}
> do
>         readarray -O ${#endpoint[@]} -t endpoint < <(cxl list -Di -d endpoint -m $i | jq -r ".[] |
>                                                      select(.mode == \"pmem\") | .decoder")
> done
> pos=0
> for i in ${endpoint[@]}
> do
>         echo $i > /sys/bus/cxl/devices/$region/target$pos
>         pos=$((pos+1))
> done
> echo "$region added ${#endpoint[@]} targets: ${endpoint[@]}"
> 
> echo 1 > /sys/bus/cxl/devices/$region/commit
> echo 0 > /sys/bus/cxl/devices/$region/commit
> 
> pos=0
> for i in ${endpoint[@]}
> do
>         echo "" > /sys/bus/cxl/devices/$region/target$pos
>         pos=$((pos+1))
> done
> readarray -t endpoint < <(cxl free-dpa -t pmem ${mem[*]} |
>                           jq -r ".[] | .decoder.decoder")
> echo "$region released ${#endpoint[@]} targets: ${endpoint[@]}"
> 
> ---
> 
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=preview


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 00/15] Region driver
  2022-06-23 15:08         ` Jonathan Cameron
@ 2022-06-23 17:33           ` Dan Williams
  2022-06-23 23:44             ` Dan Williams
  2022-06-24  9:08             ` Jonathan Cameron
  0 siblings, 2 replies; 53+ messages in thread
From: Dan Williams @ 2022-06-23 17:33 UTC (permalink / raw)
  To: Jonathan Cameron, Dan Williams
  Cc: Ben Widawsky, linux-cxl, nvdimm, patches, Alison Schofield,
	Ira Weiny, Vishal Verma

Jonathan Cameron wrote:
> On Wed, 22 Jun 2022 22:40:48 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Jonathan Cameron wrote:
> > > ....
> > >   
> > > > > Hi Ben,
> > > > >
> > > > > I finally got around to actually trying this out on top of Dan's recent fix set
> > > > > (I rebased it from the cxl/preview branch on kernel.org).
> > > > >
> > > > > I'm not having much luck actually bring up a region.
> > > > >
> > > > > The patch set refers to configuring the end point decoders, but all their
> > > > > sysfs attributes are read only.  Am I missing a dependency somewhere or
> > > > > is the intent that this series is part of the solution only?
> > > > >
> > > > > I'm confused!    
> > > > 
> > > > There's a new series that's being reviewed internally before going to the list:
> > > > 
> > > > https://gitlab.com/bwidawsk/linux/-/tree/cxl_region-redux3
> > > > 
> > > > Given the proximity to the merge window opening and the need to get
> > > > the "mem_enabled" series staged, I asked Ben to hold it back from the
> > > > list for now.
> > > > 
> > > > There are some changes I am folding into it, but I hope to send it out
> > > > in the next few days after "mem_enabled" is finalized.  
> > > 
> > > Hi Dan,
> > > 
> > > I switched from an earlier version of the region code over to a rebase of the tree.
> > > Two issues below you may already have fixed.
> > > 
> > > The second is a carry over from an earlier set so I haven't tested
> > > without it but looks like it's still valid.
> > > 
> > > Anyhow, thought it might save some cycles to preempt you sending
> > > out the series if these issues are still present.
> > > 
> > > Minimal testing so far on these with 2 hb, 2 rp, 4 directly connected
> > > devices, but once you post I'll test more extensively.  I've not
> > > really thought about the below much, so might not be best way to fix.
> > > 
> > > Found a bug in QEMU code as well (missing write masks for the
> > > target list registers) - will post fix for that shortly.  
> > 
> > Hi Jonathan,
> > 
> > Tomorrow I'll post the tranche to the list, but wanted to let you and
> > others watching that that the 'preview' branch [1] now has the proposed
> > initial region support. Once the bots give the thumbs up I'll send it
> > along.
> > 
> > To date I've only tested it with cxl_test and an internal test vehicle.
> > The cxl_test script I used to setup and teardown a x8 interleave across
> > x2 host bridges and x4 switches is:
> 
> Thanks.  Trivial feedback from a very quick play (busy day).
> 
> Bit odd that regionX/size is once write - get an error even if
> writing same value to it twice.

Ah true, that should just silently succeed.

> Also not debugged yet but on just got a null pointer dereference on
> 
> echo decoder3.0 > target0
> 
> Beyond a stacktrace pointing at store_targetN and dereference is of
> 0x00008 no idea yet.

The compiler unfortunately does a good job inlining all the leaf
functions beneath store_targetN(), so I have found myself needing to
sprinkle "noinline" to get better back traces.

> 
> I was testing with a slightly modified version of a nasty script
> I was using to test with Ben's code previously.  Might well be
> doing something wrong but obviously need to fix that crash anyway!

Most definitely.

> Will move to your nicer script below at somepoint as I've been lazy
> enough I'm still hand editing a few lines depending on number on
> a particular run.
> 
> Should have some time tomorrow to debug, but definitely 'here be
> dragons' at the moment.

Yes. Even before this posting I had shaken out a few crash scenarios just from
moving from my old QEMU baseline to "jic123/cxl-rework-draft-2", which did
things like collide PCI MMIO with cxl_test's fake CXL ranges. By the way, is
there a "latest" tag I should be following to stay in sync with what you are
running for QEMU+CXL?  If only to reproduce the same crash scenarios.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 00/15] Region driver
  2022-06-23 17:33           ` Dan Williams
@ 2022-06-23 23:44             ` Dan Williams
  2022-06-24  9:08             ` Jonathan Cameron
  1 sibling, 0 replies; 53+ messages in thread
From: Dan Williams @ 2022-06-23 23:44 UTC (permalink / raw)
  To: Dan Williams, Jonathan Cameron
  Cc: Ben Widawsky, linux-cxl, nvdimm, patches, Alison Schofield,
	Ira Weiny, Vishal Verma

Dan Williams wrote:
> Jonathan Cameron wrote:
> > On Wed, 22 Jun 2022 22:40:48 -0700
> > Dan Williams <dan.j.williams@intel.com> wrote:
> > 
> > > Jonathan Cameron wrote:
> > > > ....
> > > >   
> > > > > > Hi Ben,
> > > > > >
> > > > > > I finally got around to actually trying this out on top of Dan's recent fix set
> > > > > > (I rebased it from the cxl/preview branch on kernel.org).
> > > > > >
> > > > > > I'm not having much luck actually bring up a region.
> > > > > >
> > > > > > The patch set refers to configuring the end point decoders, but all their
> > > > > > sysfs attributes are read only.  Am I missing a dependency somewhere or
> > > > > > is the intent that this series is part of the solution only?
> > > > > >
> > > > > > I'm confused!    
> > > > > 
> > > > > There's a new series that's being reviewed internally before going to the list:
> > > > > 
> > > > > https://gitlab.com/bwidawsk/linux/-/tree/cxl_region-redux3
> > > > > 
> > > > > Given the proximity to the merge window opening and the need to get
> > > > > the "mem_enabled" series staged, I asked Ben to hold it back from the
> > > > > list for now.
> > > > > 
> > > > > There are some changes I am folding into it, but I hope to send it out
> > > > > in the next few days after "mem_enabled" is finalized.  
> > > > 
> > > > Hi Dan,
> > > > 
> > > > I switched from an earlier version of the region code over to a rebase of the tree.
> > > > Two issues below you may already have fixed.
> > > > 
> > > > The second is a carry over from an earlier set so I haven't tested
> > > > without it but looks like it's still valid.
> > > > 
> > > > Anyhow, thought it might save some cycles to preempt you sending
> > > > out the series if these issues are still present.
> > > > 
> > > > Minimal testing so far on these with 2 hb, 2 rp, 4 directly connected
> > > > devices, but once you post I'll test more extensively.  I've not
> > > > really thought about the below much, so might not be best way to fix.
> > > > 
> > > > Found a bug in QEMU code as well (missing write masks for the
> > > > target list registers) - will post fix for that shortly.  
> > > 
> > > Hi Jonathan,
> > > 
> > > Tomorrow I'll post the tranche to the list, but wanted to let you and
> > > others watching that that the 'preview' branch [1] now has the proposed
> > > initial region support. Once the bots give the thumbs up I'll send it
> > > along.
> > > 
> > > To date I've only tested it with cxl_test and an internal test vehicle.
> > > The cxl_test script I used to setup and teardown a x8 interleave across
> > > x2 host bridges and x4 switches is:
> > 
> > Thanks.  Trivial feedback from a very quick play (busy day).
> > 
> > Bit odd that regionX/size is once write - get an error even if
> > writing same value to it twice.
> 
> Ah true, that should just silently succeed.

I fixed this one.

> 
> > Also not debugged yet but on just got a null pointer dereference on
> > 
> > echo decoder3.0 > target0
> > 
> > Beyond a stacktrace pointing at store_targetN and dereference is of
> > 0x00008 no idea yet.
> 
> The compiler unfortunately does a good job inlining the entirety of all the
> leaf functions beneath store_targetN() so I have found myself needing to
> sprinkle "noinline" to get better back traces.
> 
> > 
> > I was testing with a slightly modified version of a nasty script
> > I was using to test with Ben's code previously.  Might well be
> > doing something wrong but obviously need to fix that crash anyway!
> 
> Most definitely.

I tried to reproduce this one, but unfortunately it "worked for me". So
send along more reproduction details when you get a chance, but I'll
proceed with posting the series for now. I tried the following on my
QEMU config to reproduce:

# cxl reserve-dpa -t pmem mem0
{
  "memdev":"mem0",
  "pmem_size":"512.00 MiB (536.87 MB)",
  "ram_size":0,
  "serial":"0",
  "host":"0000:35:00.0",
  "decoder":{
    "decoder":"decoder2.0",
    "state":"disabled",
    "dpa_size":"512.00 MiB (536.87 MB)",
    "mode":"pmem"
  }
}
# echo region1 > /sys/bus/cxl/devices/decoder0.0/create_pmem_region
# echo 1 > /sys/bus/cxl/devices/region1/interleave_ways 
# echo 256 > /sys/bus/cxl/devices/region1/interleave_granularity 
# echo $((512 << 20)) > /sys/bus/cxl/devices/region1/size
# echo decoder2.0 > /sys/bus/cxl/devices/region1/target0 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 00/15] Region driver
  2022-06-23 17:33           ` Dan Williams
  2022-06-23 23:44             ` Dan Williams
@ 2022-06-24  9:08             ` Jonathan Cameron
  1 sibling, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2022-06-24  9:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ben Widawsky, linux-cxl, nvdimm, patches, Alison Schofield,
	Ira Weiny, Vishal Verma

On Thu, 23 Jun 2022 10:33:38 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan Cameron wrote:
> > On Wed, 22 Jun 2022 22:40:48 -0700
> > Dan Williams <dan.j.williams@intel.com> wrote:
> >   
> > > Jonathan Cameron wrote:  
> > > > ....
> > > >     
> > > > > > Hi Ben,
> > > > > >
> > > > > > I finally got around to actually trying this out on top of Dan's recent fix set
> > > > > > (I rebased it from the cxl/preview branch on kernel.org).
> > > > > >
> > > > > > I'm not having much luck actually bring up a region.
> > > > > >
> > > > > > The patch set refers to configuring the end point decoders, but all their
> > > > > > sysfs attributes are read only.  Am I missing a dependency somewhere or
> > > > > > is the intent that this series is part of the solution only?
> > > > > >
> > > > > > I'm confused!      
> > > > > 
> > > > > There's a new series that's being reviewed internally before going to the list:
> > > > > 
> > > > > https://gitlab.com/bwidawsk/linux/-/tree/cxl_region-redux3
> > > > > 
> > > > > Given the proximity to the merge window opening and the need to get
> > > > > the "mem_enabled" series staged, I asked Ben to hold it back from the
> > > > > list for now.
> > > > > 
> > > > > There are some changes I am folding into it, but I hope to send it out
> > > > > in the next few days after "mem_enabled" is finalized.    
> > > > 
> > > > Hi Dan,
> > > > 
> > > > I switched from an earlier version of the region code over to a rebase of the tree.
> > > > Two issues below you may already have fixed.
> > > > 
> > > > The second is a carry over from an earlier set so I haven't tested
> > > > without it but looks like it's still valid.
> > > > 
> > > > Anyhow, thought it might save some cycles to preempt you sending
> > > > out the series if these issues are still present.
> > > > 
> > > > Minimal testing so far on these with 2 hb, 2 rp, 4 directly connected
> > > > devices, but once you post I'll test more extensively.  I've not
> > > > really thought about the below much, so might not be best way to fix.
> > > > 
> > > > Found a bug in QEMU code as well (missing write masks for the
> > > > target list registers) - will post fix for that shortly.    
> > > 
> > > Hi Jonathan,
> > > 
> > > Tomorrow I'll post the tranche to the list, but wanted to let you and
> > > others watching that that the 'preview' branch [1] now has the proposed
> > > initial region support. Once the bots give the thumbs up I'll send it
> > > along.
> > > 
> > > To date I've only tested it with cxl_test and an internal test vehicle.
> > > The cxl_test script I used to setup and teardown a x8 interleave across
> > > x2 host bridges and x4 switches is:  
> > 
> > Thanks.  Trivial feedback from a very quick play (busy day).
> > 
> > Bit odd that regionX/size is once write - get an error even if
> > writing same value to it twice.  
> 
> Ah true, that should just silently succeed.
> 
> > Also not debugged yet but on just got a null pointer dereference on
> > 
> > echo decoder3.0 > target0
> > 
> > Beyond a stacktrace pointing at store_targetN and dereference is of
> > 0x00008 no idea yet.  
> 
> The compiler unfortunately does a good job inlining the entirety of all the
> leaf functions beneath store_targetN() so I have found myself needing to
> sprinkle "noinline" to get better back traces.
> 
> > 
> > I was testing with a slightly modified version of a nasty script
> > I was using to test with Ben's code previously.  Might well be
> > doing something wrong but obviously need to fix that crash anyway!  
> 
> Most definitely.
> 
> > Will move to your nicer script below at somepoint as I've been lazy
> > enough I'm still hand editing a few lines depending on number on
> > a particular run.
> > 
> > Should have some time tomorrow to debug, but definitely 'here be
> > dragons' at the moment.  
> 
> Yes. Even before this posting I had shaken out a few crash scenarios just from
> moving from my old QEMU baseline to "jic123/cxl-rework-draft-2" which did
> things like collide PCI MMIO with cxl_test fake CXL ranges. By the way, is
> there a "latest" tag I should be following to stay in sync with what you are
> running for QEMU+CXL? 

For this particular feature it should just be mainline QEMU now.
Switch support was picked up a few days ago and I haven't pushed out a rebase
on top of that yet. Famous last words, but I don't 'think' that anything
that isn't yet in upstream QEMU should affect the region code.

I am testing on ARM (which requires the arch and board support that is
awaiting review) but I doubt that causes this problem...

> If only to reproduce the same crash scenarios.


^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2022-06-24  9:08 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-13 18:37 [RFC PATCH 00/15] Region driver Ben Widawsky
2022-04-13 18:37 ` [RFC PATCH 01/15] cxl/core: Use is_endpoint_decoder Ben Widawsky
2022-04-13 21:22   ` Dan Williams
     [not found]   ` <CGME20220415205052uscas1p209e03abf95b9c80b2ba1f287c82dfd80@uscas1p2.samsung.com>
2022-04-15 20:50     ` Adam Manzanares
2022-04-13 18:37 ` [RFC PATCH 02/15] cxl/core/hdm: Bail on endpoint init fail Ben Widawsky
2022-04-13 21:31   ` Dan Williams
     [not found]     ` <CGME20220418163713uscas1p17b3b1b45c7d27e54e3ecb62eb8af2469@uscas1p1.samsung.com>
2022-04-18 16:37       ` Adam Manzanares
2022-05-12 15:50         ` Ben Widawsky
2022-05-12 17:27           ` Luis Chamberlain
2022-05-13 12:09             ` Jonathan Cameron
2022-05-13 15:03               ` Dan Williams
2022-05-13 15:12               ` Luis Chamberlain
2022-05-13 19:14                 ` Dan Williams
2022-05-13 19:31                   ` Luis Chamberlain
2022-05-19  5:09                     ` Dan Williams
2022-04-13 18:37 ` [RFC PATCH 03/15] Revert "cxl/core: Convert decoder range to resource" Ben Widawsky
2022-04-13 21:43   ` Dan Williams
2022-05-12 16:09     ` Ben Widawsky
2022-04-13 18:37 ` [RFC PATCH 04/15] cxl/core: Create distinct decoder structs Ben Widawsky
2022-04-15  1:45   ` Dan Williams
2022-04-18 20:43     ` Dan Williams
2022-04-13 18:37 ` [RFC PATCH 05/15] cxl/acpi: Reserve CXL resources from request_free_mem_region Ben Widawsky
2022-04-18 16:42   ` Dan Williams
2022-04-19 16:43     ` Jason Gunthorpe
2022-04-19 21:50       ` Dan Williams
2022-04-19 21:59         ` Dan Williams
2022-04-19 23:04           ` Jason Gunthorpe
2022-04-20  0:47             ` Dan Williams
2022-04-20 14:34               ` Jason Gunthorpe
2022-04-20 15:32                 ` Dan Williams
2022-04-13 18:37 ` [RFC PATCH 06/15] cxl/acpi: Manage root decoder's address space Ben Widawsky
2022-04-18 22:15   ` Dan Williams
2022-05-12 19:18     ` Ben Widawsky
2022-04-13 18:37 ` [RFC PATCH 07/15] cxl/port: Surface ram and pmem resources Ben Widawsky
2022-04-13 18:37 ` [RFC PATCH 08/15] cxl/core/hdm: Allocate resources from the media Ben Widawsky
2022-04-13 18:37 ` [RFC PATCH 09/15] cxl/core/port: Add attrs for size and volatility Ben Widawsky
2022-04-13 18:37 ` [RFC PATCH 10/15] cxl/core: Extract IW/IG decoding Ben Widawsky
2022-04-13 18:37 ` [RFC PATCH 11/15] cxl/acpi: Use common " Ben Widawsky
2022-04-13 18:37 ` [RFC PATCH 12/15] cxl/region: Add region creation ABI Ben Widawsky
2022-05-04 22:56   ` Verma, Vishal L
2022-05-05  5:17     ` Dan Williams
2022-05-12 15:54       ` Ben Widawsky
2022-04-13 18:37 ` [RFC PATCH 13/15] cxl/core/port: Add attrs for root ways & granularity Ben Widawsky
2022-04-13 18:37 ` [RFC PATCH 14/15] cxl/region: Introduce configuration Ben Widawsky
2022-04-13 18:37 ` [RFC PATCH 15/15] cxl/region: Introduce a cxl_region driver Ben Widawsky
2022-05-20 16:23 ` [RFC PATCH 00/15] Region driver Jonathan Cameron
2022-05-20 16:41   ` Dan Williams
2022-05-31 12:21     ` Jonathan Cameron
2022-06-23  5:40       ` Dan Williams
2022-06-23 15:08         ` Jonathan Cameron
2022-06-23 17:33           ` Dan Williams
2022-06-23 23:44             ` Dan Williams
2022-06-24  9:08             ` Jonathan Cameron
