All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD)
@ 2023-06-14 19:16 ira.weiny
  2023-06-14 19:16 ` [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device ira.weiny
                   ` (7 more replies)
  0 siblings, 8 replies; 55+ messages in thread
From: ira.weiny @ 2023-06-14 19:16 UTC (permalink / raw)
  To: Navneet Singh, Fan Ni, Jonathan Cameron, Ira Weiny, Dan Williams,
	linux-cxl

I'm submitting these on behalf of Navneet.  There was a round of
internal discussion which left a few questions but we want to get the
public discussion going.  A first public preview was posted by Dan.[1]

The series has been rebased on the type-2 work posted from Dan.[2]  As
discussed in the community call, not all of that series is required for
these patches.  This will get rebased on the subset of those patches he
is targeting for 6.5.  The series was tested using Fan Ni's Qemu DCD
series.[3]

[cover letter]

A Dynamic Capacity Device (DCD) (CXL 3.0 spec 9.13.3) is a CXL memory
device that implements dynamic capacity.  Dynamic capacity feature
allows memory capacity to change dynamically, without the need for
resetting the device.

Provide initial patches to enable DCD on non interleaving regions.
Details:

- Get the dynamic capacity region information from cxl device and add
  the advertised DC memory to driver managed resources
- Get the device dynamic capacity extent list from the device and
  maintain it in the host and add the preallocated memory to the host
- Dynamic capacity region support
- DCD region provisioning via Dax
- Dynamic capacity event records
        a. Add capacity Events
	b. Release capacity events
	c. Add the memory to the host dc region
	d. Release the memory from the host dc region
- Trace Dynamic Capacity events
- Send add capacity response to device
- Send release dynamic capacity to device

Cc: Navneet Singh <navneet.singh@intel.com>
Cc: Fan Ni <fan.ni@samsung.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-cxl@vger.kernel.org

[1] https://lore.kernel.org/all/64326437c1496_934b2949f@dwillia2-mobl3.amr.corp.intel.com.notmuch/
[2] https://lore.kernel.org/all/168592149709.1948938.8663425987110396027.stgit@dwillia2-xfh.jf.intel.com/
[3] https://lore.kernel.org/all/6483946e8152f_f1132294a2@iweiny-mobl.notmuch/

---
Navneet Singh (5):
      cxl/mem : Read Dynamic capacity configuration from the device
      cxl/region: Add dynamic capacity cxl region support.
      cxl/mem : Expose dynamic capacity configuration to userspace
      cxl/mem: Add support to handle DCD add and release capacity events.
      cxl/mem: Trace Dynamic capacity Event Record

 drivers/cxl/Kconfig       |  11 +
 drivers/cxl/core/core.h   |   7 +
 drivers/cxl/core/hdm.c    | 234 ++++++++++++++++++--
 drivers/cxl/core/mbox.c   | 540 +++++++++++++++++++++++++++++++++++++++++++++-
 drivers/cxl/core/memdev.c |  72 +++++++
 drivers/cxl/core/port.c   |  18 ++
 drivers/cxl/core/region.c | 337 ++++++++++++++++++++++++++++-
 drivers/cxl/core/trace.h  |  68 +++++-
 drivers/cxl/cxl.h         |  32 ++-
 drivers/cxl/cxlmem.h      | 146 ++++++++++++-
 drivers/cxl/pci.c         |  14 +-
 drivers/dax/bus.c         |  11 +-
 drivers/dax/bus.h         |   5 +-
 drivers/dax/cxl.c         |   4 +
 14 files changed, 1453 insertions(+), 46 deletions(-)
---
base-commit: 034a16d0165be3e092d60685be7b1b05e6f3059b
change-id: 20230604-dcd-type2-upstream-0cd15f6216fd

Best regards,
-- 
Ira Weiny <ira.weiny@intel.com>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device
  2023-06-14 19:16 [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD) ira.weiny
@ 2023-06-14 19:16 ` ira.weiny
  2023-06-14 22:53   ` Dave Jiang
                     ` (4 more replies)
  2023-06-14 19:16 ` [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support ira.weiny
                   ` (6 subsequent siblings)
  7 siblings, 5 replies; 55+ messages in thread
From: ira.weiny @ 2023-06-14 19:16 UTC (permalink / raw)
  To: Navneet Singh, Fan Ni, Jonathan Cameron, Ira Weiny, Dan Williams,
	linux-cxl

From: Navneet Singh <navneet.singh@intel.com>

Read the Dynamic capacity configuration and store dynamic capacity region
information in the device state which driver will use to map into the HDM
ranges.

Implement Get Dynamic Capacity Configuration (opcode 4800h) mailbox
command as specified in CXL 3.0 spec section 8.2.9.8.9.1.

Signed-off-by: Navneet Singh <navneet.singh@intel.com>

---
[iweiny: ensure all mds->dc_region's are named]
---
 drivers/cxl/core/mbox.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++--
 drivers/cxl/cxlmem.h    |  70 +++++++++++++++++-
 drivers/cxl/pci.c       |   4 +
 3 files changed, 256 insertions(+), 8 deletions(-)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 3ca0bf12c55f..c5b696737c87 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -111,6 +111,37 @@ static u8 security_command_sets[] = {
 	0x46, /* Security Passthrough */
 };
 
+static bool cxl_is_dcd_command(u16 opcode)
+{
+#define CXL_MBOX_OP_DCD_CMDS 0x48
+
+	if ((opcode >> 8) == CXL_MBOX_OP_DCD_CMDS)
+		return true;
+
+	return false;
+}
+
+static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
+					u16 opcode)
+{
+	switch (opcode) {
+	case CXL_MBOX_OP_GET_DC_CONFIG:
+		set_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
+		break;
+	case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
+		set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds);
+		break;
+	case CXL_MBOX_OP_ADD_DC_RESPONSE:
+		set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds);
+		break;
+	case CXL_MBOX_OP_RELEASE_DC:
+		set_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
+		break;
+	default:
+		break;
+	}
+}
+
 static bool cxl_is_security_command(u16 opcode)
 {
 	int i;
@@ -666,6 +697,7 @@ static int cxl_xfer_log(struct cxl_memdev_state *mds, uuid_t *uuid,
 static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
 {
 	struct cxl_cel_entry *cel_entry;
+	struct cxl_mem_command *cmd;
 	const int cel_entries = size / sizeof(*cel_entry);
 	struct device *dev = mds->cxlds.dev;
 	int i;
@@ -674,11 +706,12 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
 
 	for (i = 0; i < cel_entries; i++) {
 		u16 opcode = le16_to_cpu(cel_entry[i].opcode);
-		struct cxl_mem_command *cmd = cxl_mem_find_command(opcode);
+		cmd = cxl_mem_find_command(opcode);
 
-		if (!cmd && !cxl_is_poison_command(opcode)) {
-			dev_dbg(dev,
-				"Opcode 0x%04x unsupported by driver\n", opcode);
+		if (!cmd && !cxl_is_poison_command(opcode) &&
+		    !cxl_is_dcd_command(opcode)) {
+			dev_dbg(dev, "Opcode 0x%04x unsupported by driver\n",
+				opcode);
 			continue;
 		}
 
@@ -688,6 +721,9 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
 		if (cxl_is_poison_command(opcode))
 			cxl_set_poison_cmd_enabled(&mds->poison, opcode);
 
+		if (cxl_is_dcd_command(opcode))
+			cxl_set_dcd_cmd_enabled(mds, opcode);
+
 		dev_dbg(dev, "Opcode 0x%04x enabled\n", opcode);
 	}
 }
@@ -1059,7 +1095,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
 	if (rc < 0)
 		return rc;
 
-	mds->total_bytes =
+	mds->total_static_capacity =
 		le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
 	mds->volatile_only_bytes =
 		le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
@@ -1077,10 +1113,137 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
 		mds->poison.max_errors = min_t(u32, val, CXL_POISON_LIST_MAX);
 	}
 
+	mds->dc_event_log_size = le16_to_cpu(id.dc_event_log_size);
+
 	return 0;
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dev_state_identify, CXL);
 
+/**
+ * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
+ * information from the device.
+ * @mds: The memory device state
+ * Return: 0 if identify was executed successfully.
+ *
+ * This will dispatch the get_dynamic_capacity command to the device
+ * and on success populate structures to be exported to sysfs.
+ */
+int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
+{
+	struct cxl_dev_state *cxlds = &mds->cxlds;
+	struct device *dev = cxlds->dev;
+	struct cxl_mbox_dynamic_capacity *dc;
+	struct cxl_mbox_get_dc_config get_dc;
+	struct cxl_mbox_cmd mbox_cmd;
+	u64 next_dc_region_start;
+	int rc, i;
+
+	for (i = 0; i < CXL_MAX_DC_REGION; i++)
+		sprintf(mds->dc_region[i].name, "dc%d", i);
+
+	/* Check GET_DC_CONFIG is supported by device */
+	if (!test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds)) {
+		dev_dbg(dev, "unsupported cmd: get_dynamic_capacity_config\n");
+		return 0;
+	}
+
+	dc = kvmalloc(mds->payload_size, GFP_KERNEL);
+	if (!dc)
+		return -ENOMEM;
+
+	get_dc = (struct cxl_mbox_get_dc_config) {
+		.region_count = CXL_MAX_DC_REGION,
+		.start_region_index = 0,
+	};
+
+	mbox_cmd = (struct cxl_mbox_cmd) {
+		.opcode = CXL_MBOX_OP_GET_DC_CONFIG,
+		.payload_in = &get_dc,
+		.size_in = sizeof(get_dc),
+		.size_out = mds->payload_size,
+		.payload_out = dc,
+		.min_out = 1,
+	};
+	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
+	if (rc < 0)
+		goto dc_error;
+
+	mds->nr_dc_region = dc->avail_region_count;
+
+	if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
+		dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
+			mds->nr_dc_region);
+		rc = -EINVAL;
+		goto dc_error;
+	}
+
+	for (i = 0; i < mds->nr_dc_region; i++) {
+		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
+
+		dcr->base = le64_to_cpu(dc->region[i].region_base);
+		dcr->decode_len =
+			le64_to_cpu(dc->region[i].region_decode_length);
+		dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
+		dcr->len = le64_to_cpu(dc->region[i].region_length);
+		dcr->blk_size = le64_to_cpu(dc->region[i].region_block_size);
+
+		/* Check regions are in increasing DPA order */
+		if ((i + 1) < mds->nr_dc_region) {
+			next_dc_region_start =
+				le64_to_cpu(dc->region[i + 1].region_base);
+			if ((dcr->base > next_dc_region_start) ||
+			    ((dcr->base + dcr->decode_len) > next_dc_region_start)) {
+				dev_err(dev,
+					"DPA ordering violation for DC region %d and %d\n",
+					i, i + 1);
+				rc = -EINVAL;
+				goto dc_error;
+			}
+		}
+
+		/* Check the region is 256 MB aligned */
+		if (!IS_ALIGNED(dcr->base, SZ_256M)) {
+			dev_err(dev, "DC region %d not aligned to 256MB\n", i);
+			rc = -EINVAL;
+			goto dc_error;
+		}
+
+		/* Check Region base and length are aligned to block size */
+		if (!IS_ALIGNED(dcr->base, dcr->blk_size) ||
+		    !IS_ALIGNED(dcr->len, dcr->blk_size)) {
+			dev_err(dev, "DC region %d not aligned to %#llx\n", i,
+				dcr->blk_size);
+			rc = -EINVAL;
+			goto dc_error;
+		}
+
+		dcr->dsmad_handle =
+			le32_to_cpu(dc->region[i].region_dsmad_handle);
+		dcr->flags = dc->region[i].flags;
+		sprintf(dcr->name, "dc%d", i);
+
+		dev_dbg(dev,
+			"DC region %s DPA: %#llx LEN: %#llx BLKSZ: %#llx\n",
+			dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
+	}
+
+	/*
+	 * Calculate entire DPA range of all configured regions which will be mapped by
+	 * one or more HDM decoders
+	 */
+	mds->total_dynamic_capacity =
+		mds->dc_region[mds->nr_dc_region - 1].base +
+		mds->dc_region[mds->nr_dc_region - 1].decode_len -
+		mds->dc_region[0].base;
+	dev_dbg(dev, "Total dynamic capacity: %#llx\n",
+		mds->total_dynamic_capacity);
+
+dc_error:
+	kvfree(dc);
+	return rc;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
+
 static int add_dpa_res(struct device *dev, struct resource *parent,
 		       struct resource *res, resource_size_t start,
 		       resource_size_t size, const char *type)
@@ -1112,6 +1275,11 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
 	struct cxl_dev_state *cxlds = &mds->cxlds;
 	struct device *dev = cxlds->dev;
 	int rc;
+	size_t untenanted_mem =
+		mds->dc_region[0].base - mds->total_static_capacity;
+
+	mds->total_capacity = mds->total_static_capacity +
+			untenanted_mem + mds->total_dynamic_capacity;
 
 	if (!cxlds->media_ready) {
 		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
@@ -1121,13 +1289,23 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
 	}
 
 	cxlds->dpa_res =
-		(struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
+		(struct resource)DEFINE_RES_MEM(0, mds->total_capacity);
+
+	for (int i = 0; i < CXL_MAX_DC_REGION; i++) {
+		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
+
+		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
+				 dcr->base, dcr->decode_len, dcr->name);
+		if (rc)
+			return rc;
+	}
 
 	if (mds->partition_align_bytes == 0) {
 		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
 				 mds->volatile_only_bytes, "ram");
 		if (rc)
 			return rc;
+
 		return add_dpa_res(dev, &cxlds->dpa_res, &cxlds->pmem_res,
 				   mds->volatile_only_bytes,
 				   mds->persistent_only_bytes, "pmem");
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 89e560ea14c0..9c0b2fa72bdd 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -239,6 +239,15 @@ struct cxl_event_state {
 	struct mutex log_lock;
 };
 
+/* Device enabled DCD commands */
+enum dcd_cmd_enabled_bits {
+	CXL_DCD_ENABLED_GET_CONFIG,
+	CXL_DCD_ENABLED_GET_EXTENT_LIST,
+	CXL_DCD_ENABLED_ADD_RESPONSE,
+	CXL_DCD_ENABLED_RELEASE,
+	CXL_DCD_ENABLED_MAX
+};
+
 /* Device enabled poison commands */
 enum poison_cmd_enabled_bits {
 	CXL_POISON_ENABLED_LIST,
@@ -284,6 +293,9 @@ enum cxl_devtype {
 	CXL_DEVTYPE_CLASSMEM,
 };
 
+#define CXL_MAX_DC_REGION 8
+#define CXL_DC_REGION_SRTLEN 8
+
 /**
  * struct cxl_dev_state - The driver device state
  *
@@ -300,6 +312,8 @@ enum cxl_devtype {
  * @dpa_res: Overall DPA resource tree for the device
  * @pmem_res: Active Persistent memory capacity configuration
  * @ram_res: Active Volatile memory capacity configuration
+ * @dc_res: Active Dynamic Capacity memory configuration for each possible
+ *          region
  * @component_reg_phys: register base of component registers
  * @info: Cached DVSEC information about the device.
  * @serial: PCIe Device Serial Number
@@ -315,6 +329,7 @@ struct cxl_dev_state {
 	struct resource dpa_res;
 	struct resource pmem_res;
 	struct resource ram_res;
+	struct resource dc_res[CXL_MAX_DC_REGION];
 	resource_size_t component_reg_phys;
 	u64 serial;
 	enum cxl_devtype type;
@@ -334,9 +349,12 @@ struct cxl_dev_state {
  *                (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
  * @mbox_mutex: Mutex to synchronize mailbox access.
  * @firmware_version: Firmware version for the memory device.
+ * @dcd_cmds: List of DCD commands implemented by memory device
  * @enabled_cmds: Hardware commands found enabled in CEL.
  * @exclusive_cmds: Commands that are kernel-internal only
- * @total_bytes: sum of all possible capacities
+ * @total_capacity: Sum of static and dynamic capacities
+ * @total_static_capacity: Sum of RAM and PMEM capacities
+ * @total_dynamic_capacity: Complete DPA range occupied by DC regions
  * @volatile_only_bytes: hard volatile capacity
  * @persistent_only_bytes: hard persistent capacity
  * @partition_align_bytes: alignment size for partition-able capacity
@@ -344,6 +362,10 @@ struct cxl_dev_state {
  * @active_persistent_bytes: sum of hard + soft persistent
  * @next_volatile_bytes: volatile capacity change pending device reset
  * @next_persistent_bytes: persistent capacity change pending device reset
+ * @nr_dc_region: number of DC regions implemented in the memory device
+ * @dc_region: array containing info about the DC regions
+ * @dc_event_log_size: The number of events the device can store in the
+ * Dynamic Capacity Event Log before it overflows
  * @event: event log driver state
  * @poison: poison driver state info
  * @mbox_send: @dev specific transport for transmitting mailbox commands
@@ -357,9 +379,13 @@ struct cxl_memdev_state {
 	size_t lsa_size;
 	struct mutex mbox_mutex; /* Protects device mailbox and firmware */
 	char firmware_version[0x10];
+	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
 	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
 	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
-	u64 total_bytes;
+
+	u64 total_capacity;
+	u64 total_static_capacity;
+	u64 total_dynamic_capacity;
 	u64 volatile_only_bytes;
 	u64 persistent_only_bytes;
 	u64 partition_align_bytes;
@@ -367,6 +393,20 @@ struct cxl_memdev_state {
 	u64 active_persistent_bytes;
 	u64 next_volatile_bytes;
 	u64 next_persistent_bytes;
+
+	u8 nr_dc_region;
+
+	struct cxl_dc_region_info {
+		u8 name[CXL_DC_REGION_SRTLEN];
+		u64 base;
+		u64 decode_len;
+		u64 len;
+		u64 blk_size;
+		u32 dsmad_handle;
+		u8 flags;
+	} dc_region[CXL_MAX_DC_REGION];
+
+	size_t dc_event_log_size;
 	struct cxl_event_state event;
 	struct cxl_poison_state poison;
 	int (*mbox_send)(struct cxl_memdev_state *mds,
@@ -415,6 +455,10 @@ enum cxl_opcode {
 	CXL_MBOX_OP_UNLOCK		= 0x4503,
 	CXL_MBOX_OP_FREEZE_SECURITY	= 0x4504,
 	CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE	= 0x4505,
+	CXL_MBOX_OP_GET_DC_CONFIG	= 0x4800,
+	CXL_MBOX_OP_GET_DC_EXTENT_LIST	= 0x4801,
+	CXL_MBOX_OP_ADD_DC_RESPONSE	= 0x4802,
+	CXL_MBOX_OP_RELEASE_DC		= 0x4803,
 	CXL_MBOX_OP_MAX			= 0x10000
 };
 
@@ -462,6 +506,7 @@ struct cxl_mbox_identify {
 	__le16 inject_poison_limit;
 	u8 poison_caps;
 	u8 qos_telemetry_caps;
+	__le16 dc_event_log_size;
 } __packed;
 
 /*
@@ -617,7 +662,27 @@ struct cxl_mbox_set_partition_info {
 	u8 flags;
 } __packed;
 
+struct cxl_mbox_get_dc_config {
+	u8 region_count;
+	u8 start_region_index;
+} __packed;
+
+/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
+struct cxl_mbox_dynamic_capacity {
+	u8 avail_region_count;
+	u8 rsvd[7];
+	struct cxl_dc_region_config {
+		__le64 region_base;
+		__le64 region_decode_length;
+		__le64 region_length;
+		__le64 region_block_size;
+		__le32 region_dsmad_handle;
+		u8 flags;
+		u8 rsvd[3];
+	} __packed region[];
+} __packed;
 #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
+#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
 
 /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
 struct cxl_mbox_set_timestamp_in {
@@ -742,6 +807,7 @@ enum {
 int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
 			  struct cxl_mbox_cmd *cmd);
 int cxl_dev_state_identify(struct cxl_memdev_state *mds);
+int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
 int cxl_await_media_ready(struct cxl_dev_state *cxlds);
 int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
 int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 4e2845b7331a..ac1a41bc083d 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -742,6 +742,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (rc)
 		return rc;
 
+	rc = cxl_dev_dynamic_capacity_identify(mds);
+	if (rc)
+		return rc;
+
 	rc = cxl_mem_create_range_info(mds);
 	if (rc)
 		return rc;

-- 
2.40.0


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support.
  2023-06-14 19:16 [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD) ira.weiny
  2023-06-14 19:16 ` [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device ira.weiny
@ 2023-06-14 19:16 ` ira.weiny
  2023-06-14 23:37   ` Dave Jiang
                     ` (6 more replies)
  2023-06-14 19:16 ` [PATCH 3/5] cxl/mem : Expose dynamic capacity configuration to userspace ira.weiny
                   ` (5 subsequent siblings)
  7 siblings, 7 replies; 55+ messages in thread
From: ira.weiny @ 2023-06-14 19:16 UTC (permalink / raw)
  To: Navneet Singh, Fan Ni, Jonathan Cameron, Ira Weiny, Dan Williams,
	linux-cxl

From: Navneet Singh <navneet.singh@intel.com>

CXL devices optionally support dynamic capacity. CXL Regions must be
created to access this capacity.

Add sysfs entries to create dynamic capacity cxl regions. Provide a new
Dynamic Capacity decoder mode which targets dynamic capacity on devices
which are added to that region.

Below are the steps to create and delete dynamic capacity region0
(example).

    region=$(cat /sys/bus/cxl/devices/decoder0.0/create_dc_region)
    echo $region> /sys/bus/cxl/devices/decoder0.0/create_dc_region
    echo 256 > /sys/bus/cxl/devices/$region/interleave_granularity
    echo 1 > /sys/bus/cxl/devices/$region/interleave_ways

    echo "dc0" >/sys/bus/cxl/devices/decoder1.0/mode
    echo 0x400000000 >/sys/bus/cxl/devices/decoder1.0/dpa_size

    echo 0x400000000 > /sys/bus/cxl/devices/$region/size
    echo  "decoder1.0" > /sys/bus/cxl/devices/$region/target0
    echo 1 > /sys/bus/cxl/devices/$region/commit
    echo $region > /sys/bus/cxl/drivers/cxl_region/bind

    echo $region> /sys/bus/cxl/devices/decoder0.0/delete_region

Signed-off-by: Navneet Singh <navneet.singh@intel.com>

---
[iweiny: fixups]
[iweiny: remove unused CXL_DC_REGION_MODE macro]
[iweiny: Make dc_mode_to_region_index static]
[iweiny: simplify <sysfs>/create_dc_region]
[iweiny: introduce decoder_mode_is_dc]
[djbw: fixups, no sign-off: preview only]
---
 drivers/cxl/Kconfig       |  11 +++
 drivers/cxl/core/core.h   |   7 ++
 drivers/cxl/core/hdm.c    | 234 ++++++++++++++++++++++++++++++++++++++++++----
 drivers/cxl/core/port.c   |  18 ++++
 drivers/cxl/core/region.c | 135 ++++++++++++++++++++++++--
 drivers/cxl/cxl.h         |  28 ++++++
 drivers/dax/cxl.c         |   4 +
 7 files changed, 409 insertions(+), 28 deletions(-)

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index ff4e78117b31..df034889d053 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -121,6 +121,17 @@ config CXL_REGION
 
 	  If unsure say 'y'
 
+config CXL_DCD
+	bool "CXL: DCD Support"
+	default CXL_BUS
+	depends on CXL_REGION
+	help
+	  Enable the CXL core to provision CXL DCD regions.
+	  CXL devices optionally support dynamic capacity and DCD region
+	  maps the dynamic capacity regions DPA's into Host HPA ranges.
+
+	  If unsure say 'y'
+
 config CXL_REGION_INVALIDATION_TEST
 	bool "CXL: Region Cache Management Bypass (TEST)"
 	depends on CXL_REGION
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 27f0968449de..725700ab5973 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -9,6 +9,13 @@ extern const struct device_type cxl_nvdimm_type;
 
 extern struct attribute_group cxl_base_attribute_group;
 
+#ifdef CONFIG_CXL_DCD
+extern struct device_attribute dev_attr_create_dc_region;
+#define SET_CXL_DC_REGION_ATTR(x) (&dev_attr_##x.attr),
+#else
+#define SET_CXL_DC_REGION_ATTR(x)
+#endif
+
 #ifdef CONFIG_CXL_REGION
 extern struct device_attribute dev_attr_create_pmem_region;
 extern struct device_attribute dev_attr_create_ram_region;
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 514d30131d92..29649b47d177 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -233,14 +233,23 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	struct resource *res = cxled->dpa_res;
 	resource_size_t skip_start;
+	resource_size_t skipped = cxled->skip;
 
 	lockdep_assert_held_write(&cxl_dpa_rwsem);
 
 	/* save @skip_start, before @res is released */
-	skip_start = res->start - cxled->skip;
+	skip_start = res->start - skipped;
 	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
-	if (cxled->skip)
-		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
+	if (cxled->skip != 0) {
+		while (skipped != 0) {
+			res = xa_load(&cxled->skip_res, skip_start);
+			__release_region(&cxlds->dpa_res, skip_start,
+							resource_size(res));
+			xa_erase(&cxled->skip_res, skip_start);
+			skip_start += resource_size(res);
+			skipped -= resource_size(res);
+			}
+	}
 	cxled->skip = 0;
 	cxled->dpa_res = NULL;
 	put_device(&cxled->cxld.dev);
@@ -267,6 +276,19 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
 	__cxl_dpa_release(cxled);
 }
 
+static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
+{
+	int index = 0;
+
+	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
+		if (mode == i)
+			return index;
+		index++;
+	}
+
+	return -EINVAL;
+}
+
 static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 			     resource_size_t base, resource_size_t len,
 			     resource_size_t skipped)
@@ -275,7 +297,11 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 	struct cxl_port *port = cxled_to_port(cxled);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	struct device *dev = &port->dev;
+	struct device *ed_dev = &cxled->cxld.dev;
+	struct resource *dpa_res = &cxlds->dpa_res;
+	resource_size_t skip_len = 0;
 	struct resource *res;
+	int rc, index;
 
 	lockdep_assert_held_write(&cxl_dpa_rwsem);
 
@@ -304,28 +330,119 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 	}
 
 	if (skipped) {
-		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
-				       dev_name(&cxled->cxld.dev), 0);
-		if (!res) {
-			dev_dbg(dev,
-				"decoder%d.%d: failed to reserve skipped space\n",
-				port->id, cxled->cxld.id);
-			return -EBUSY;
+		resource_size_t skip_base = base - skipped;
+
+		if (decoder_mode_is_dc(cxled->mode)) {
+			if (resource_size(&cxlds->ram_res) &&
+					skip_base <= cxlds->ram_res.end) {
+				skip_len = cxlds->ram_res.end - skip_base + 1;
+				res = __request_region(dpa_res, skip_base,
+						skip_len, dev_name(ed_dev), 0);
+				if (!res)
+					goto error;
+
+				rc = xa_insert(&cxled->skip_res, skip_base, res,
+								GFP_KERNEL);
+				skip_base += skip_len;
+			}
+
+			if (resource_size(&cxlds->ram_res) &&
+					skip_base <= cxlds->pmem_res.end) {
+				skip_len = cxlds->pmem_res.end - skip_base + 1;
+				res = __request_region(dpa_res, skip_base,
+						skip_len, dev_name(ed_dev), 0);
+				if (!res)
+					goto error;
+
+				rc = xa_insert(&cxled->skip_res, skip_base, res,
+								GFP_KERNEL);
+				skip_base += skip_len;
+			}
+
+			index = dc_mode_to_region_index(cxled->mode);
+			for (int i = 0; i <= index; i++) {
+				struct resource *dcr = &cxlds->dc_res[i];
+
+				if (skip_base < dcr->start) {
+					skip_len = dcr->start - skip_base;
+					res = __request_region(dpa_res,
+							skip_base, skip_len,
+							dev_name(ed_dev), 0);
+					if (!res)
+						goto error;
+
+					rc = xa_insert(&cxled->skip_res, skip_base,
+							res, GFP_KERNEL);
+					skip_base += skip_len;
+				}
+
+				if (skip_base == base) {
+					dev_dbg(dev, "skip done!\n");
+					break;
+				}
+
+				if (resource_size(dcr) &&
+						skip_base <= dcr->end) {
+					if (skip_base > base)
+						dev_err(dev, "Skip error\n");
+
+					skip_len = dcr->end - skip_base + 1;
+					res = __request_region(dpa_res, skip_base,
+							skip_len,
+							dev_name(ed_dev), 0);
+					if (!res)
+						goto error;
+
+					rc = xa_insert(&cxled->skip_res, skip_base,
+							res, GFP_KERNEL);
+					skip_base += skip_len;
+				}
+			}
+		} else	{
+			res = __request_region(dpa_res, base - skipped, skipped,
+							dev_name(ed_dev), 0);
+			if (!res)
+				goto error;
+
+			rc = xa_insert(&cxled->skip_res, skip_base, res,
+								GFP_KERNEL);
 		}
 	}
-	res = __request_region(&cxlds->dpa_res, base, len,
-			       dev_name(&cxled->cxld.dev), 0);
+
+	res = __request_region(dpa_res, base, len, dev_name(ed_dev), 0);
 	if (!res) {
 		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
-			port->id, cxled->cxld.id);
-		if (skipped)
-			__release_region(&cxlds->dpa_res, base - skipped,
-					 skipped);
+				port->id, cxled->cxld.id);
+		if (skipped) {
+			resource_size_t skip_base = base - skipped;
+
+			while (skipped != 0) {
+				if (skip_base > base)
+					dev_err(dev, "Skip error\n");
+
+				res = xa_load(&cxled->skip_res, skip_base);
+				__release_region(dpa_res, skip_base,
+							resource_size(res));
+				xa_erase(&cxled->skip_res, skip_base);
+				skip_base += resource_size(res);
+				skipped -= resource_size(res);
+			}
+		}
 		return -EBUSY;
 	}
 	cxled->dpa_res = res;
 	cxled->skip = skipped;
 
+	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
+		int index = dc_mode_to_region_index(mode);
+
+		if (resource_contains(&cxlds->dc_res[index], res)) {
+			cxled->mode = mode;
+			dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id,
+				cxled->cxld.id, cxled->dpa_res, cxled->mode);
+			goto success;
+		}
+	}
 	if (resource_contains(&cxlds->pmem_res, res))
 		cxled->mode = CXL_DECODER_PMEM;
 	else if (resource_contains(&cxlds->ram_res, res))
@@ -336,9 +453,16 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 		cxled->mode = CXL_DECODER_MIXED;
 	}
 
+success:
 	port->hdm_end++;
 	get_device(&cxled->cxld.dev);
 	return 0;
+
+error:
+	dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space\n",
+			port->id, cxled->cxld.id);
+	return -EBUSY;
+
 }
 
 int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
@@ -429,6 +553,14 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
 	switch (mode) {
 	case CXL_DECODER_RAM:
 	case CXL_DECODER_PMEM:
+	case CXL_DECODER_DC0:
+	case CXL_DECODER_DC1:
+	case CXL_DECODER_DC2:
+	case CXL_DECODER_DC3:
+	case CXL_DECODER_DC4:
+	case CXL_DECODER_DC5:
+	case CXL_DECODER_DC6:
+	case CXL_DECODER_DC7:
 		break;
 	default:
 		dev_dbg(dev, "unsupported mode: %d\n", mode);
@@ -456,6 +588,16 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
 		goto out;
 	}
 
+	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
+		int index = dc_mode_to_region_index(i);
+
+		if (mode == i && !resource_size(&cxlds->dc_res[index])) {
+			dev_dbg(dev, "no available dynamic capacity\n");
+			rc = -ENXIO;
+			goto out;
+		}
+	}
+
 	cxled->mode = mode;
 	rc = 0;
 out:
@@ -469,10 +611,12 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
 					 resource_size_t *skip_out)
 {
 	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
-	resource_size_t free_ram_start, free_pmem_start;
+	resource_size_t free_ram_start, free_pmem_start, free_dc_start;
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+	struct device *dev = &cxled->cxld.dev;
 	resource_size_t start, avail, skip;
 	struct resource *p, *last;
+	int index;
 
 	lockdep_assert_held(&cxl_dpa_rwsem);
 
@@ -490,6 +634,20 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
 	else
 		free_pmem_start = cxlds->pmem_res.start;
 
+	/*
+	 * One HDM Decoder per DC region to map memory with different
+	 * DSMAS entry.
+	 */
+	index = dc_mode_to_region_index(cxled->mode);
+	if (index >= 0) {
+		if (cxlds->dc_res[index].child) {
+			dev_err(dev, "Cannot allocated DPA from DC Region: %d\n",
+					index);
+			return -EINVAL;
+		}
+		free_dc_start = cxlds->dc_res[index].start;
+	}
+
 	if (cxled->mode == CXL_DECODER_RAM) {
 		start = free_ram_start;
 		avail = cxlds->ram_res.end - start + 1;
@@ -511,6 +669,29 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
 		else
 			skip_end = start - 1;
 		skip = skip_end - skip_start + 1;
+	} else if (decoder_mode_is_dc(cxled->mode)) {
+		resource_size_t skip_start, skip_end;
+
+		start = free_dc_start;
+		avail = cxlds->dc_res[index].end - start + 1;
+		if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
+			skip_start = free_ram_start;
+		else
+			skip_start = free_pmem_start;
+		/*
+		 * If some dc region is already mapped, then that allocation
+		 * already handled the RAM and PMEM skip.Check for DC region
+		 * skip.
+		 */
+		for (int i = index - 1; i >= 0 ; i--) {
+			if (cxlds->dc_res[i].child) {
+				skip_start = cxlds->dc_res[i].child->end + 1;
+				break;
+			}
+		}
+
+		skip_end = start - 1;
+		skip = skip_end - skip_start + 1;
 	} else {
 		dev_dbg(cxled_dev(cxled), "mode not set\n");
 		avail = 0;
@@ -548,10 +729,25 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
 
 	avail = cxl_dpa_freespace(cxled, &start, &skip);
 
+	dev_dbg(dev, "DPA Allocation start: %llx len: %llx Skip: %llx\n",
+						start, size, skip);
 	if (size > avail) {
+		static const char * const names[] = {
+			[CXL_DECODER_NONE] = "none",
+			[CXL_DECODER_RAM] = "ram",
+			[CXL_DECODER_PMEM] = "pmem",
+			[CXL_DECODER_MIXED] = "mixed",
+			[CXL_DECODER_DC0] = "dc0",
+			[CXL_DECODER_DC1] = "dc1",
+			[CXL_DECODER_DC2] = "dc2",
+			[CXL_DECODER_DC3] = "dc3",
+			[CXL_DECODER_DC4] = "dc4",
+			[CXL_DECODER_DC5] = "dc5",
+			[CXL_DECODER_DC6] = "dc6",
+			[CXL_DECODER_DC7] = "dc7",
+		};
 		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
-			cxled->mode == CXL_DECODER_RAM ? "ram" : "pmem",
-			&avail);
+			names[cxled->mode], &avail);
 		rc = -ENOSPC;
 		goto out;
 	}
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 5e21b53362e6..a1a98aba24ed 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -195,6 +195,22 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
 		mode = CXL_DECODER_PMEM;
 	else if (sysfs_streq(buf, "ram"))
 		mode = CXL_DECODER_RAM;
+	else if (sysfs_streq(buf, "dc0"))
+		mode = CXL_DECODER_DC0;
+	else if (sysfs_streq(buf, "dc1"))
+		mode = CXL_DECODER_DC1;
+	else if (sysfs_streq(buf, "dc2"))
+		mode = CXL_DECODER_DC2;
+	else if (sysfs_streq(buf, "dc3"))
+		mode = CXL_DECODER_DC3;
+	else if (sysfs_streq(buf, "dc4"))
+		mode = CXL_DECODER_DC4;
+	else if (sysfs_streq(buf, "dc5"))
+		mode = CXL_DECODER_DC5;
+	else if (sysfs_streq(buf, "dc6"))
+		mode = CXL_DECODER_DC6;
+	else if (sysfs_streq(buf, "dc7"))
+		mode = CXL_DECODER_DC7;
 	else
 		return -EINVAL;
 
@@ -296,6 +312,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
 	&dev_attr_target_list.attr,
 	SET_CXL_REGION_ATTR(create_pmem_region)
 	SET_CXL_REGION_ATTR(create_ram_region)
+	SET_CXL_DC_REGION_ATTR(create_dc_region)
 	SET_CXL_REGION_ATTR(delete_region)
 	NULL,
 };
@@ -1691,6 +1708,7 @@ struct cxl_endpoint_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
 		return ERR_PTR(-ENOMEM);
 
 	cxled->pos = -1;
+	xa_init(&cxled->skip_res);
 	cxld = &cxled->cxld;
 	rc = cxl_decoder_init(port, cxld);
 	if (rc)	 {
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 543c4499379e..144232c8305e 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1733,7 +1733,7 @@ static int cxl_region_attach(struct cxl_region *cxlr,
 	lockdep_assert_held_write(&cxl_region_rwsem);
 	lockdep_assert_held_read(&cxl_dpa_rwsem);
 
-	if (cxled->mode != cxlr->mode) {
+	if (decoder_mode_is_dc(cxlr->mode) && !decoder_mode_is_dc(cxled->mode)) {
 		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
 			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
 		return -EINVAL;
@@ -2211,6 +2211,14 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
 	switch (mode) {
 	case CXL_DECODER_RAM:
 	case CXL_DECODER_PMEM:
+	case CXL_DECODER_DC0:
+	case CXL_DECODER_DC1:
+	case CXL_DECODER_DC2:
+	case CXL_DECODER_DC3:
+	case CXL_DECODER_DC4:
+	case CXL_DECODER_DC5:
+	case CXL_DECODER_DC6:
+	case CXL_DECODER_DC7:
 		break;
 	default:
 		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
@@ -2321,6 +2329,43 @@ static ssize_t create_ram_region_store(struct device *dev,
 }
 DEVICE_ATTR_RW(create_ram_region);
 
+static ssize_t store_dcN_region(struct cxl_root_decoder *cxlrd,
+				const char *buf, enum cxl_decoder_mode mode,
+				size_t len)
+{
+	struct cxl_region *cxlr;
+	int rc, id;
+
+	rc = sscanf(buf, "region%d\n", &id);
+	if (rc != 1)
+		return -EINVAL;
+
+	cxlr = __create_region(cxlrd, id, mode, CXL_DECODER_HOSTMEM);
+	if (IS_ERR(cxlr))
+		return PTR_ERR(cxlr);
+
+	return len;
+}
+
+static ssize_t create_dc_region_show(struct device *dev,
+				     struct device_attribute *attr, char *buf)
+{
+	return __create_region_show(to_cxl_root_decoder(dev), buf);
+}
+
+static ssize_t create_dc_region_store(struct device *dev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t len)
+{
+	/*
+	 * All DC regions use decoder mode DC0 as the region does not need the
+	 * index information
+	 */
+	return store_dcN_region(to_cxl_root_decoder(dev), buf,
+				CXL_DECODER_DC0, len);
+}
+DEVICE_ATTR_RW(create_dc_region);
+
 static ssize_t region_show(struct device *dev, struct device_attribute *attr,
 			   char *buf)
 {
@@ -2799,6 +2844,61 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
 	return rc;
 }
 
+static void cxl_dc_region_release(void *data)
+{
+	struct cxl_region *cxlr = data;
+	struct cxl_dc_region *cxlr_dc = cxlr->cxlr_dc;
+
+	xa_destroy(&cxlr_dc->dax_dev_list);
+	kfree(cxlr_dc);
+}
+
+static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
+{
+	struct cxl_dc_region *cxlr_dc;
+	struct cxl_dax_region *cxlr_dax;
+	struct device *dev;
+	int rc = 0;
+
+	cxlr_dax = cxl_dax_region_alloc(cxlr);
+	if (IS_ERR(cxlr_dax))
+		return PTR_ERR(cxlr_dax);
+
+	cxlr_dc = kzalloc(sizeof(*cxlr_dc), GFP_KERNEL);
+	if (!cxlr_dc) {
+		rc = -ENOMEM;
+		goto err;
+	}
+
+	dev = &cxlr_dax->dev;
+	rc = dev_set_name(dev, "dax_region%d", cxlr->id);
+	if (rc)
+		goto err;
+
+	rc = device_add(dev);
+	if (rc)
+		goto err;
+
+	dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
+		dev_name(dev));
+
+	rc = devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
+					cxlr_dax);
+	if (rc)
+		goto err;
+
+	cxlr_dc->cxlr_dax = cxlr_dax;
+	xa_init(&cxlr_dc->dax_dev_list);
+	cxlr->cxlr_dc = cxlr_dc;
+	rc = devm_add_action_or_reset(&cxlr->dev, cxl_dc_region_release, cxlr);
+	if (!rc)
+		return 0;
+err:
+	put_device(dev);
+	kfree(cxlr_dc);
+	return rc;
+}
+
 static int match_decoder_by_range(struct device *dev, void *data)
 {
 	struct range *r1, *r2 = data;
@@ -3140,6 +3240,19 @@ static int is_system_ram(struct resource *res, void *arg)
 	return 1;
 }
 
+/*
+ * The region can not be manged by CXL if any portion of
+ * it is already online as 'System RAM'
+ */
+static bool region_is_system_ram(struct cxl_region *cxlr,
+				 struct cxl_region_params *p)
+{
+	return (walk_iomem_res_desc(IORES_DESC_NONE,
+				    IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+				    p->res->start, p->res->end, cxlr,
+				    is_system_ram) > 0);
+}
+
 static int cxl_region_probe(struct device *dev)
 {
 	struct cxl_region *cxlr = to_cxl_region(dev);
@@ -3174,14 +3287,7 @@ static int cxl_region_probe(struct device *dev)
 	case CXL_DECODER_PMEM:
 		return devm_cxl_add_pmem_region(cxlr);
 	case CXL_DECODER_RAM:
-		/*
-		 * The region can not be manged by CXL if any portion of
-		 * it is already online as 'System RAM'
-		 */
-		if (walk_iomem_res_desc(IORES_DESC_NONE,
-					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
-					p->res->start, p->res->end, cxlr,
-					is_system_ram) > 0)
+		if (region_is_system_ram(cxlr, p))
 			return 0;
 
 		/*
@@ -3193,6 +3299,17 @@ static int cxl_region_probe(struct device *dev)
 
 		/* HDM-H routes to device-dax */
 		return devm_cxl_add_dax_region(cxlr);
+	case CXL_DECODER_DC0:
+	case CXL_DECODER_DC1:
+	case CXL_DECODER_DC2:
+	case CXL_DECODER_DC3:
+	case CXL_DECODER_DC4:
+	case CXL_DECODER_DC5:
+	case CXL_DECODER_DC6:
+	case CXL_DECODER_DC7:
+		if (region_is_system_ram(cxlr, p))
+			return 0;
+		return devm_cxl_add_dc_region(cxlr);
 	default:
 		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
 			cxlr->mode);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 8400af85d99f..7ac1237938b7 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -335,6 +335,14 @@ enum cxl_decoder_mode {
 	CXL_DECODER_NONE,
 	CXL_DECODER_RAM,
 	CXL_DECODER_PMEM,
+	CXL_DECODER_DC0,
+	CXL_DECODER_DC1,
+	CXL_DECODER_DC2,
+	CXL_DECODER_DC3,
+	CXL_DECODER_DC4,
+	CXL_DECODER_DC5,
+	CXL_DECODER_DC6,
+	CXL_DECODER_DC7,
 	CXL_DECODER_MIXED,
 	CXL_DECODER_DEAD,
 };
@@ -345,6 +353,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
 		[CXL_DECODER_NONE] = "none",
 		[CXL_DECODER_RAM] = "ram",
 		[CXL_DECODER_PMEM] = "pmem",
+		[CXL_DECODER_DC0] = "dc0",
+		[CXL_DECODER_DC1] = "dc1",
+		[CXL_DECODER_DC2] = "dc2",
+		[CXL_DECODER_DC3] = "dc3",
+		[CXL_DECODER_DC4] = "dc4",
+		[CXL_DECODER_DC5] = "dc5",
+		[CXL_DECODER_DC6] = "dc6",
+		[CXL_DECODER_DC7] = "dc7",
 		[CXL_DECODER_MIXED] = "mixed",
 	};
 
@@ -353,6 +369,11 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
 	return "mixed";
 }
 
+static inline bool decoder_mode_is_dc(enum cxl_decoder_mode mode)
+{
+	return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
+}
+
 /*
  * Track whether this decoder is reserved for region autodiscovery, or
  * free for userspace provisioning.
@@ -375,6 +396,7 @@ struct cxl_endpoint_decoder {
 	struct cxl_decoder cxld;
 	struct resource *dpa_res;
 	resource_size_t skip;
+	struct xarray skip_res;
 	enum cxl_decoder_mode mode;
 	enum cxl_decoder_state state;
 	int pos;
@@ -475,6 +497,11 @@ struct cxl_region_params {
  */
 #define CXL_REGION_F_AUTO 1
 
+struct cxl_dc_region {
+	struct xarray dax_dev_list;
+	struct cxl_dax_region *cxlr_dax;
+};
+
 /**
  * struct cxl_region - CXL region
  * @dev: This region's device
@@ -493,6 +520,7 @@ struct cxl_region {
 	enum cxl_decoder_type type;
 	struct cxl_nvdimm_bridge *cxl_nvb;
 	struct cxl_pmem_region *cxlr_pmem;
+	struct cxl_dc_region *cxlr_dc;
 	unsigned long flags;
 	struct cxl_region_params params;
 };
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index ccdf8de85bd5..eb5eb81bfbd7 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -23,11 +23,15 @@ static int cxl_dax_region_probe(struct device *dev)
 	if (!dax_region)
 		return -ENOMEM;
 
+	if (decoder_mode_is_dc(cxlr->mode))
+		return 0;
+
 	data = (struct dev_dax_data) {
 		.dax_region = dax_region,
 		.id = -1,
 		.size = range_len(&cxlr_dax->hpa_range),
 	};
+
 	dev_dax = devm_create_dev_dax(&data);
 	if (IS_ERR(dev_dax))
 		return PTR_ERR(dev_dax);

-- 
2.40.0


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 3/5] cxl/mem : Expose dynamic capacity configuration to userspace
  2023-06-14 19:16 [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD) ira.weiny
  2023-06-14 19:16 ` [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device ira.weiny
  2023-06-14 19:16 ` [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support ira.weiny
@ 2023-06-14 19:16 ` ira.weiny
  2023-06-15  0:40   ` Alison Schofield
  2023-06-15 15:41   ` Dave Jiang
  2023-06-14 19:16 ` [PATCH 4/5] cxl/mem: Add support to handle DCD add and release capacity events ira.weiny
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 55+ messages in thread
From: ira.weiny @ 2023-06-14 19:16 UTC (permalink / raw)
  To: Navneet Singh, Fan Ni, Jonathan Cameron, Ira Weiny, Dan Williams,
	linux-cxl

From: Navneet Singh <navneet.singh@intel.com>

Exposing driver cached dynamic capacity configuration through sysfs
attributes.User will create one or more dynamic capacity
cxl regions based on this information and map the dynamic capacity of
the device into HDM ranges using one or more HDM decoders.

Signed-off-by: Navneet Singh <navneet.singh@intel.com>

---
[iweiny: fixups]
[djbw: fixups, no sign-off: preview only]
---
 drivers/cxl/core/memdev.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 5d1ba7a72567..beeb5fa3a0aa 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -99,6 +99,20 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
 static struct device_attribute dev_attr_pmem_size =
 	__ATTR(size, 0444, pmem_size_show, NULL);
 
+static ssize_t dc_regions_count_show(struct device *dev, struct device_attribute *attr,
+		char *buf)
+{
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+	int len = 0;
+
+	len = sysfs_emit(buf, "0x%x\n", mds->nr_dc_region);
+	return len;
+}
+
+struct device_attribute dev_attr_dc_regions_count =
+	__ATTR(dc_regions_count, 0444, dc_regions_count_show, NULL);
+
 static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
 			   char *buf)
 {
@@ -362,6 +376,57 @@ static struct attribute *cxl_memdev_ram_attributes[] = {
 	NULL,
 };
 
+static ssize_t show_size_regionN(struct cxl_memdev *cxlmd, char *buf, int pos)
+{
+	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+
+	return sysfs_emit(buf, "0x%llx\n", mds->dc_region[pos].decode_len);
+}
+
+#define SIZE_ATTR_RO(n)                                              \
+static ssize_t dc##n##_size_show(                                       \
+	struct device *dev, struct device_attribute *attr, char *buf)  \
+{                                                                      \
+	return show_size_regionN(to_cxl_memdev(dev), buf, (n));             \
+}                                                                      \
+static DEVICE_ATTR_RO(dc##n##_size)
+SIZE_ATTR_RO(0);
+SIZE_ATTR_RO(1);
+SIZE_ATTR_RO(2);
+SIZE_ATTR_RO(3);
+SIZE_ATTR_RO(4);
+SIZE_ATTR_RO(5);
+SIZE_ATTR_RO(6);
+SIZE_ATTR_RO(7);
+
+static struct attribute *cxl_memdev_dc_attributes[] = {
+	&dev_attr_dc0_size.attr,
+	&dev_attr_dc1_size.attr,
+	&dev_attr_dc2_size.attr,
+	&dev_attr_dc3_size.attr,
+	&dev_attr_dc4_size.attr,
+	&dev_attr_dc5_size.attr,
+	&dev_attr_dc6_size.attr,
+	&dev_attr_dc7_size.attr,
+	&dev_attr_dc_regions_count.attr,
+	NULL,
+};
+
+static umode_t cxl_dc_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+
+	if (a == &dev_attr_dc_regions_count.attr)
+		return a->mode;
+
+	if (n < mds->nr_dc_region)
+		return a->mode;
+
+	return 0;
+}
+
 static umode_t cxl_memdev_visible(struct kobject *kobj, struct attribute *a,
 				  int n)
 {
@@ -385,10 +450,17 @@ static struct attribute_group cxl_memdev_pmem_attribute_group = {
 	.attrs = cxl_memdev_pmem_attributes,
 };
 
+static struct attribute_group cxl_memdev_dc_attribute_group = {
+	.name = "dc",
+	.attrs = cxl_memdev_dc_attributes,
+	.is_visible = cxl_dc_visible,
+};
+
 static const struct attribute_group *cxl_memdev_attribute_groups[] = {
 	&cxl_memdev_attribute_group,
 	&cxl_memdev_ram_attribute_group,
 	&cxl_memdev_pmem_attribute_group,
+	&cxl_memdev_dc_attribute_group,
 	NULL,
 };
 

-- 
2.40.0


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 4/5] cxl/mem: Add support to handle DCD add and release capacity events.
  2023-06-14 19:16 [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD) ira.weiny
                   ` (2 preceding siblings ...)
  2023-06-14 19:16 ` [PATCH 3/5] cxl/mem : Expose dynamic capacity configuration to userspace ira.weiny
@ 2023-06-14 19:16 ` ira.weiny
  2023-06-15  2:19   ` Alison Schofield
                     ` (4 more replies)
  2023-06-14 19:16 ` [PATCH 5/5] cxl/mem: Trace Dynamic capacity Event Record ira.weiny
                   ` (3 subsequent siblings)
  7 siblings, 5 replies; 55+ messages in thread
From: ira.weiny @ 2023-06-14 19:16 UTC (permalink / raw)
  To: Navneet Singh, Fan Ni, Jonathan Cameron, Ira Weiny, Dan Williams,
	linux-cxl

From: Navneet Singh <navneet.singh@intel.com>

A dynamic capacity device utilizes events to signal the host about the
changes to the allocation of DC blocks. The device communicates the
state of these blocks of dynamic capacity through an extent list that
describes the starting DPA and length of all blocks the host can access.

Based on the dynamic capacity add or release event type,
dynamic memory represented by the extents are either added
or removed as devdax device.

Process the dynamic capacity add and release events.

Signed-off-by: Navneet Singh <navneet.singh@intel.com>

---
[iweiny: Remove invalid comment]
---
 drivers/cxl/core/mbox.c   | 345 +++++++++++++++++++++++++++++++++++++++++++++-
 drivers/cxl/core/region.c | 214 +++++++++++++++++++++++++++-
 drivers/cxl/core/trace.h  |   3 +-
 drivers/cxl/cxl.h         |   4 +-
 drivers/cxl/cxlmem.h      |  76 ++++++++++
 drivers/cxl/pci.c         |  10 +-
 drivers/dax/bus.c         |  11 +-
 drivers/dax/bus.h         |   5 +-
 8 files changed, 652 insertions(+), 16 deletions(-)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index c5b696737c87..db9295216de5 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -767,6 +767,14 @@ static const uuid_t log_uuid[] = {
 	[VENDOR_DEBUG_UUID] = DEFINE_CXL_VENDOR_DEBUG_UUID,
 };
 
+/* See CXL 3.0 8.2.9.2.1.5 */
+enum dc_event {
+	ADD_CAPACITY,
+	RELEASE_CAPACITY,
+	FORCED_CAPACITY_RELEASE,
+	REGION_CONFIGURATION_UPDATED,
+};
+
 /**
  * cxl_enumerate_cmds() - Enumerate commands for a device.
  * @mds: The driver data for the operation
@@ -852,6 +860,14 @@ static const uuid_t mem_mod_event_uuid =
 	UUID_INIT(0xfe927475, 0xdd59, 0x4339,
 		  0xa5, 0x86, 0x79, 0xba, 0xb1, 0x13, 0xb7, 0x74);
 
+/*
+ * Dynamic Capacity Event Record
+ * CXL rev 3.0 section 8.2.9.2.1.3; Table 8-45
+ */
+static const uuid_t dc_event_uuid =
+	UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c,
+		0x2f, 0x95, 0x26, 0x8e, 0x10, 0x1a, 0x2a);
+
 static void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 				   enum cxl_event_log_type type,
 				   struct cxl_event_record_raw *record)
@@ -945,6 +961,188 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
 	return rc;
 }
 
+static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
+				struct cxl_mbox_dc_response *res,
+				int extent_cnt, int opcode)
+{
+	struct cxl_mbox_cmd mbox_cmd;
+	int rc, size;
+
+	size = struct_size(res, extent_list, extent_cnt);
+	res->extent_list_size = cpu_to_le32(extent_cnt);
+
+	mbox_cmd = (struct cxl_mbox_cmd) {
+		.opcode = opcode,
+		.size_in = size,
+		.payload_in = res,
+	};
+
+	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
+
+	return rc;
+
+}
+
+static int cxl_prepare_ext_list(struct cxl_mbox_dc_response **res,
+					int *n, struct range *extent)
+{
+	struct cxl_mbox_dc_response *dc_res;
+	unsigned int size;
+
+	if (!extent)
+		size = struct_size(dc_res, extent_list, 0);
+	else
+		size = struct_size(dc_res, extent_list, *n + 1);
+
+	dc_res = krealloc(*res, size, GFP_KERNEL);
+	if (!dc_res)
+		return -ENOMEM;
+
+	if (extent) {
+		dc_res->extent_list[*n].dpa_start = cpu_to_le64(extent->start);
+		memset(dc_res->extent_list[*n].reserved, 0, 8);
+		dc_res->extent_list[*n].length =
+				cpu_to_le64(range_len(extent));
+		(*n)++;
+	}
+
+	*res = dc_res;
+	return 0;
+}
+/**
+ * cxl_handle_dcd_event_records() - Read DCD event records.
+ * @mds: The memory device state
+ *
+ * Returns 0 if enumerate completed successfully.
+ *
+ * CXL devices can generate DCD events to add or remove extents in the list.
+ */
+static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
+					struct cxl_event_record_raw *rec)
+{
+	struct cxl_mbox_dc_response *dc_res = NULL;
+	struct device *dev = mds->cxlds.dev;
+	uuid_t *id = &rec->hdr.id;
+	struct dcd_event_dyn_cap *record =
+			(struct dcd_event_dyn_cap *)rec;
+	int extent_cnt = 0, rc = 0;
+	struct cxl_dc_extent_data *extent;
+	struct range alloc_range, rel_range;
+	resource_size_t dpa, size;
+
+	if (!uuid_equal(id, &dc_event_uuid))
+		return -EINVAL;
+
+	switch (record->data.event_type) {
+	case ADD_CAPACITY:
+		extent = devm_kzalloc(dev, sizeof(*extent), GFP_ATOMIC);
+		if (!extent)
+			return -ENOMEM;
+
+		extent->dpa_start = le64_to_cpu(record->data.extent.start_dpa);
+		extent->length = le64_to_cpu(record->data.extent.length);
+		memcpy(extent->tag, record->data.extent.tag,
+				sizeof(record->data.extent.tag));
+		extent->shared_extent_seq =
+			le16_to_cpu(record->data.extent.shared_extn_seq);
+		dev_dbg(dev, "Add DC extent DPA:0x%llx LEN:%llx\n",
+					extent->dpa_start, extent->length);
+		alloc_range = (struct range) {
+			.start = extent->dpa_start,
+			.end = extent->dpa_start + extent->length - 1,
+		};
+
+		rc = cxl_add_dc_extent(mds, &alloc_range);
+		if (rc < 0) {
+			dev_dbg(dev, "unconsumed DC extent DPA:0x%llx LEN:%llx\n",
+					extent->dpa_start, extent->length);
+			rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, NULL);
+			if (rc < 0) {
+				dev_err(dev, "Couldn't create extent list %d\n",
+									rc);
+				devm_kfree(dev, extent);
+				return rc;
+			}
+
+			rc = cxl_send_dc_cap_response(mds, dc_res,
+					extent_cnt, CXL_MBOX_OP_ADD_DC_RESPONSE);
+			if (rc < 0) {
+				devm_kfree(dev, extent);
+				goto out;
+			}
+
+			kfree(dc_res);
+			devm_kfree(dev, extent);
+
+			return 0;
+		}
+
+		rc = xa_insert(&mds->dc_extent_list, extent->dpa_start, extent,
+				GFP_KERNEL);
+		if (rc < 0)
+			goto out;
+
+		mds->num_dc_extents++;
+		rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, &alloc_range);
+		if (rc < 0) {
+			dev_err(dev, "Couldn't create extent list %d\n", rc);
+			return rc;
+		}
+
+		rc = cxl_send_dc_cap_response(mds, dc_res, extent_cnt,
+					      CXL_MBOX_OP_ADD_DC_RESPONSE);
+		if (rc < 0)
+			goto out;
+
+		break;
+
+	case RELEASE_CAPACITY:
+		dpa = le64_to_cpu(record->data.extent.start_dpa);
+		size = le64_to_cpu(record->data.extent.length);
+		dev_dbg(dev, "Release DC extents DPA:0x%llx LEN:%llx\n",
+				dpa, size);
+		extent = xa_load(&mds->dc_extent_list, dpa);
+		if (!extent) {
+			dev_err(dev, "No extent found with DPA:0x%llx\n", dpa);
+			return -EINVAL;
+		}
+
+		rel_range = (struct range) {
+			.start = dpa,
+			.end = dpa + size - 1,
+		};
+
+		rc = cxl_release_dc_extent(mds, &rel_range);
+		if (rc < 0) {
+			dev_dbg(dev, "withhold DC extent DPA:0x%llx LEN:%llx\n",
+									dpa, size);
+			return 0;
+		}
+
+		xa_erase(&mds->dc_extent_list, dpa);
+		devm_kfree(dev, extent);
+		mds->num_dc_extents--;
+		rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, &rel_range);
+		if (rc < 0) {
+			dev_err(dev, "Couldn't create extent list %d\n", rc);
+			return rc;
+		}
+
+		rc = cxl_send_dc_cap_response(mds, dc_res, extent_cnt,
+					      CXL_MBOX_OP_RELEASE_DC);
+		if (rc < 0)
+			goto out;
+
+		break;
+
+	default:
+		return -EINVAL;
+	}
+out:
+	kfree(dc_res);
+	return rc;
+}
+
 static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
 				    enum cxl_event_log_type type)
 {
@@ -982,9 +1180,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
 		if (!nr_rec)
 			break;
 
-		for (i = 0; i < nr_rec; i++)
+		for (i = 0; i < nr_rec; i++) {
 			cxl_event_trace_record(cxlmd, type,
 					       &payload->records[i]);
+			if (type == CXL_EVENT_TYPE_DCD) {
+				rc = cxl_handle_dcd_event_records(mds,
+						&payload->records[i]);
+				if (rc)
+					dev_err_ratelimited(dev,
+						"dcd event failed: %d\n", rc);
+			}
+		}
 
 		if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
 			trace_cxl_overflow(cxlmd, type, payload);
@@ -1024,6 +1230,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
 		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_WARN);
 	if (status & CXLDEV_EVENT_STATUS_INFO)
 		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_INFO);
+	if (status & CXLDEV_EVENT_STATUS_DCD)
+		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_mem_get_event_records, CXL);
 
@@ -1244,6 +1452,140 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
 
+int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
+			      unsigned int *extent_gen_num)
+{
+	struct device *dev = mds->cxlds.dev;
+	struct cxl_mbox_dc_extents *dc_extents;
+	struct cxl_mbox_get_dc_extent get_dc_extent;
+	unsigned int total_extent_cnt;
+	struct cxl_mbox_cmd mbox_cmd;
+	int rc;
+
+	/* Check GET_DC_EXTENT_LIST is supported by device */
+	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
+		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
+		return 0;
+	}
+
+	dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);
+	if (!dc_extents)
+		return -ENOMEM;
+
+	get_dc_extent = (struct cxl_mbox_get_dc_extent) {
+		.extent_cnt = 0,
+		.start_extent_index = 0,
+	};
+
+	mbox_cmd = (struct cxl_mbox_cmd) {
+		.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
+		.payload_in = &get_dc_extent,
+		.size_in = sizeof(get_dc_extent),
+		.size_out = mds->payload_size,
+		.payload_out = dc_extents,
+		.min_out = 1,
+	};
+	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
+	if (rc < 0)
+		goto out;
+
+	total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
+	*extent_gen_num = le32_to_cpu(dc_extents->extent_list_num);
+	dev_dbg(dev, "Total extent count :%d Extent list Generation Num: %d\n",
+			total_extent_cnt, *extent_gen_num);
+out:
+
+	kvfree(dc_extents);
+	if (rc < 0)
+		return rc;
+
+	return total_extent_cnt;
+
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dev_get_dc_extent_cnt, CXL);
+
+int cxl_dev_get_dc_extents(struct cxl_memdev_state *mds,
+			   unsigned int index, unsigned int cnt)
+{
+	/* See CXL 3.0 Table 125 dynamic capacity config  Output Payload */
+	struct device *dev = mds->cxlds.dev;
+	struct cxl_mbox_dc_extents *dc_extents;
+	struct cxl_mbox_get_dc_extent get_dc_extent;
+	unsigned int extent_gen_num, available_extents, total_extent_cnt;
+	int rc;
+	struct cxl_dc_extent_data *extent;
+	struct cxl_mbox_cmd mbox_cmd;
+	struct range alloc_range;
+
+	/* Check GET_DC_EXTENT_LIST is supported by device */
+	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
+		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
+		return 0;
+	}
+
+	dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);
+	if (!dc_extents)
+		return -ENOMEM;
+	get_dc_extent = (struct cxl_mbox_get_dc_extent) {
+		.extent_cnt = cnt,
+		.start_extent_index = index,
+	};
+
+	mbox_cmd = (struct cxl_mbox_cmd) {
+		.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
+		.payload_in = &get_dc_extent,
+		.size_in = sizeof(get_dc_extent),
+		.size_out = mds->payload_size,
+		.payload_out = dc_extents,
+		.min_out = 1,
+	};
+	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
+	if (rc < 0)
+		goto out;
+
+	available_extents = le32_to_cpu(dc_extents->ret_extent_cnt);
+	total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
+	extent_gen_num = le32_to_cpu(dc_extents->extent_list_num);
+	dev_dbg(dev, "No Total extent count :%d Extent list Generation Num:%d\n",
+			total_extent_cnt, extent_gen_num);
+
+
+	for (int i = 0; i < available_extents ; i++) {
+		extent = devm_kzalloc(dev, sizeof(*extent), GFP_KERNEL);
+		if (!extent) {
+			rc = -ENOMEM;
+			goto out;
+		}
+		extent->dpa_start = le64_to_cpu(dc_extents->extent[i].start_dpa);
+		extent->length = le64_to_cpu(dc_extents->extent[i].length);
+		memcpy(extent->tag, dc_extents->extent[i].tag,
+					sizeof(dc_extents->extent[i].tag));
+		extent->shared_extent_seq =
+				le16_to_cpu(dc_extents->extent[i].shared_extn_seq);
+		dev_dbg(dev, "dynamic capacity extent[%d] DPA:0x%llx LEN:%llx\n",
+				i, extent->dpa_start, extent->length);
+
+		alloc_range = (struct range){
+			.start = extent->dpa_start,
+			.end = extent->dpa_start + extent->length - 1,
+		};
+
+		rc = cxl_add_dc_extent(mds, &alloc_range);
+		if (rc < 0)
+			goto out;
+		rc = xa_insert(&mds->dc_extent_list, extent->dpa_start, extent,
+				GFP_KERNEL);
+	}
+
+out:
+	kvfree(dc_extents);
+	if (rc < 0)
+		return rc;
+
+	return available_extents;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dev_get_dc_extents, CXL);
+
 static int add_dpa_res(struct device *dev, struct resource *parent,
 		       struct resource *res, resource_size_t start,
 		       resource_size_t size, const char *type)
@@ -1452,6 +1794,7 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
 	mutex_init(&mds->event.log_lock);
 	mds->cxlds.dev = dev;
 	mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
+	xa_init(&mds->dc_extent_list);
 
 	return mds;
 }
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 144232c8305e..ba45c1c3b0a9 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /* Copyright(c) 2022 Intel Corporation. All rights reserved. */
 #include <linux/memregion.h>
+#include <linux/interrupt.h>
 #include <linux/genalloc.h>
 #include <linux/device.h>
 #include <linux/module.h>
@@ -11,6 +12,8 @@
 #include <cxlmem.h>
 #include <cxl.h>
 #include "core.h"
+#include "../../dax/bus.h"
+#include "../../dax/dax-private.h"
 
 /**
  * DOC: cxl core region
@@ -166,6 +169,38 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
 	return 0;
 }
 
+static int cxl_region_manage_dc(struct cxl_region *cxlr)
+{
+	struct cxl_region_params *p = &cxlr->params;
+	unsigned int extent_gen_num;
+	int i, rc;
+
+	/* Designed for Non Interleaving flow with the assumption one
+	 * cxl_region will map the complete device DC region's DPA range
+	 */
+	for (i = 0; i < p->nr_targets; i++) {
+		struct cxl_endpoint_decoder *cxled = p->targets[i];
+		struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
+		struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+
+		rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
+		if (rc < 0)
+			goto err;
+		else if (rc > 1) {
+			rc = cxl_dev_get_dc_extents(mds, rc, 0);
+			if (rc < 0)
+				goto err;
+			mds->num_dc_extents = rc;
+			mds->dc_extents_index = rc - 1;
+		}
+		mds->dc_list_gen_num = extent_gen_num;
+		dev_dbg(mds->cxlds.dev, "No of preallocated extents :%d\n", rc);
+	}
+	return 0;
+err:
+	return rc;
+}
+
 static int commit_decoder(struct cxl_decoder *cxld)
 {
 	struct cxl_switch_decoder *cxlsd = NULL;
@@ -2865,11 +2900,14 @@ static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
 		return PTR_ERR(cxlr_dax);
 
 	cxlr_dc = kzalloc(sizeof(*cxlr_dc), GFP_KERNEL);
-	if (!cxlr_dc) {
-		rc = -ENOMEM;
-		goto err;
-	}
+	if (!cxlr_dc)
+		return -ENOMEM;
 
+	rc = request_module("dax_cxl");
+	if (rc) {
+		dev_err(dev, "failed to load dax-ctl module\n");
+		goto load_err;
+	}
 	dev = &cxlr_dax->dev;
 	rc = dev_set_name(dev, "dax_region%d", cxlr->id);
 	if (rc)
@@ -2891,10 +2929,24 @@ static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
 	xa_init(&cxlr_dc->dax_dev_list);
 	cxlr->cxlr_dc = cxlr_dc;
 	rc = devm_add_action_or_reset(&cxlr->dev, cxl_dc_region_release, cxlr);
-	if (!rc)
-		return 0;
+	if (rc)
+		goto err;
+
+	if (!dev->driver) {
+		dev_err(dev, "%s Driver not attached\n", dev_name(dev));
+		rc = -ENXIO;
+		goto err;
+	}
+
+	rc = cxl_region_manage_dc(cxlr);
+	if (rc)
+		goto err;
+
+	return 0;
+
 err:
 	put_device(dev);
+load_err:
 	kfree(cxlr_dc);
 	return rc;
 }
@@ -3076,6 +3128,156 @@ struct cxl_region *cxl_create_region(struct cxl_root_decoder *cxlrd,
 }
 EXPORT_SYMBOL_NS_GPL(cxl_create_region, CXL);
 
+static int match_ep_decoder_by_range(struct device *dev, void *data)
+{
+	struct cxl_endpoint_decoder *cxled;
+	struct range *dpa_range = data;
+
+	if (!is_endpoint_decoder(dev))
+		return 0;
+
+	cxled = to_cxl_endpoint_decoder(dev);
+	if (!cxled->cxld.region)
+		return 0;
+
+	if (cxled->dpa_res->start <= dpa_range->start &&
+				cxled->dpa_res->end >= dpa_range->end)
+		return 1;
+
+	return 0;
+}
+
+int cxl_release_dc_extent(struct cxl_memdev_state *mds,
+			  struct range *rel_range)
+{
+	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
+	struct cxl_endpoint_decoder *cxled;
+	struct cxl_dc_region *cxlr_dc;
+	struct dax_region *dax_region;
+	resource_size_t dpa_offset;
+	struct cxl_region *cxlr;
+	struct range hpa_range;
+	struct dev_dax *dev_dax;
+	resource_size_t hpa;
+	struct device *dev;
+	int ranges, rc = 0;
+
+	/*
+	 * Find the cxl endpoind decoder with which has the extent dpa range and
+	 * get the cxl_region, dax_region refrences.
+	 */
+	dev = device_find_child(&cxlmd->endpoint->dev, rel_range,
+				match_ep_decoder_by_range);
+	if (!dev) {
+		dev_err(mds->cxlds.dev, "%pr not mapped\n", rel_range);
+		return PTR_ERR(dev);
+	}
+
+	cxled = to_cxl_endpoint_decoder(dev);
+	hpa_range = cxled->cxld.hpa_range;
+	cxlr = cxled->cxld.region;
+	cxlr_dc = cxlr->cxlr_dc;
+
+	/* DPA to HPA translation */
+	if (cxled->cxld.interleave_ways == 1) {
+		dpa_offset = rel_range->start - cxled->dpa_res->start;
+		hpa = hpa_range.start + dpa_offset;
+	} else {
+		dev_err(mds->cxlds.dev, "Interleaving DC not supported\n");
+		return -EINVAL;
+	}
+
+	dev_dax = xa_load(&cxlr_dc->dax_dev_list, hpa);
+	if (!dev_dax)
+		return -EINVAL;
+
+	dax_region = dev_dax->region;
+	ranges = dev_dax->nr_range;
+
+	while (ranges) {
+		int i = ranges - 1;
+		struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
+
+		devm_release_action(dax_region->dev, unregister_dax_mapping,
+								&mapping->dev);
+		ranges--;
+	}
+
+	dev_dbg(mds->cxlds.dev, "removing devdax device:%s\n",
+						dev_name(&dev_dax->dev));
+	devm_release_action(dax_region->dev, unregister_dev_dax,
+							&dev_dax->dev);
+	xa_erase(&cxlr_dc->dax_dev_list, hpa);
+
+	return rc;
+}
+
+int cxl_add_dc_extent(struct cxl_memdev_state *mds, struct range *alloc_range)
+{
+	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
+	struct cxl_endpoint_decoder *cxled;
+	struct cxl_dax_region *cxlr_dax;
+	struct cxl_dc_region *cxlr_dc;
+	struct dax_region *dax_region;
+	resource_size_t dpa_offset;
+	struct dev_dax_data data;
+	struct dev_dax *dev_dax;
+	struct cxl_region *cxlr;
+	struct range hpa_range;
+	resource_size_t hpa;
+	struct device *dev;
+	int rc;
+
+	/*
+	 * Find the cxl endpoind decoder with which has the extent dpa range and
+	 * get the cxl_region, dax_region refrences.
+	 */
+	dev = device_find_child(&cxlmd->endpoint->dev, alloc_range,
+				match_ep_decoder_by_range);
+	if (!dev) {
+		dev_err(mds->cxlds.dev, "%pr not mapped\n",	alloc_range);
+		return PTR_ERR(dev);
+	}
+
+	cxled = to_cxl_endpoint_decoder(dev);
+	hpa_range = cxled->cxld.hpa_range;
+	cxlr = cxled->cxld.region;
+	cxlr_dc = cxlr->cxlr_dc;
+	cxlr_dax = cxlr_dc->cxlr_dax;
+	dax_region = dev_get_drvdata(&cxlr_dax->dev);
+
+	/* DPA to HPA translation */
+	if (cxled->cxld.interleave_ways == 1) {
+		dpa_offset = alloc_range->start - cxled->dpa_res->start;
+		hpa = hpa_range.start + dpa_offset;
+	} else {
+		dev_err(mds->cxlds.dev, "Interleaving DC not supported\n");
+		return -EINVAL;
+	}
+
+	data = (struct dev_dax_data) {
+		.dax_region = dax_region,
+		.id = -1,
+		.size = 0,
+	};
+
+	dev_dax = devm_create_dev_dax(&data);
+	if (IS_ERR(dev_dax))
+		return PTR_ERR(dev_dax);
+
+	if (IS_ALIGNED(range_len(alloc_range), max_t(unsigned long,
+				dev_dax->align, memremap_compat_align()))) {
+		rc = alloc_dev_dax_range(dev_dax, hpa,
+					range_len(alloc_range));
+		if (rc)
+			return rc;
+	}
+
+	rc = xa_insert(&cxlr_dc->dax_dev_list, hpa, dev_dax, GFP_KERNEL);
+
+	return rc;
+}
+
 /* Establish an empty region covering the given HPA range */
 static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
 					   struct cxl_endpoint_decoder *cxled)
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index a0b5819bc70b..e11651255780 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -122,7 +122,8 @@ TRACE_EVENT(cxl_aer_correctable_error,
 		{ CXL_EVENT_TYPE_INFO, "Informational" },	\
 		{ CXL_EVENT_TYPE_WARN, "Warning" },		\
 		{ CXL_EVENT_TYPE_FAIL, "Failure" },		\
-		{ CXL_EVENT_TYPE_FATAL, "Fatal" })
+		{ CXL_EVENT_TYPE_FATAL, "Fatal" },		\
+		{ CXL_EVENT_TYPE_DCD, "DCD" })
 
 TRACE_EVENT(cxl_overflow,
 
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 7ac1237938b7..60c436b7ebb1 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -163,11 +163,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
 #define CXLDEV_EVENT_STATUS_WARN		BIT(1)
 #define CXLDEV_EVENT_STATUS_FAIL		BIT(2)
 #define CXLDEV_EVENT_STATUS_FATAL		BIT(3)
+#define CXLDEV_EVENT_STATUS_DCD			BIT(4)
 
 #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO |	\
 				 CXLDEV_EVENT_STATUS_WARN |	\
 				 CXLDEV_EVENT_STATUS_FAIL |	\
-				 CXLDEV_EVENT_STATUS_FATAL)
+				 CXLDEV_EVENT_STATUS_FATAL|	\
+				 CXLDEV_EVENT_STATUS_DCD)
 
 /* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
 #define CXLDEV_EVENT_INT_MODE_MASK	GENMASK(1, 0)
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 9c0b2fa72bdd..0440b5c04ef6 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -5,6 +5,7 @@
 #include <uapi/linux/cxl_mem.h>
 #include <linux/cdev.h>
 #include <linux/uuid.h>
+#include <linux/xarray.h>
 #include "cxl.h"
 
 /* CXL 2.0 8.2.8.5.1.1 Memory Device Status Register */
@@ -226,6 +227,7 @@ struct cxl_event_interrupt_policy {
 	u8 warn_settings;
 	u8 failure_settings;
 	u8 fatal_settings;
+	u8 dyncap_settings;
 } __packed;
 
 /**
@@ -296,6 +298,13 @@ enum cxl_devtype {
 #define CXL_MAX_DC_REGION 8
 #define CXL_DC_REGION_SRTLEN 8
 
+struct cxl_dc_extent_data {
+	u64 dpa_start;
+	u64 length;
+	u8 tag[16];
+	u16 shared_extent_seq;
+};
+
 /**
  * struct cxl_dev_state - The driver device state
  *
@@ -406,6 +415,11 @@ struct cxl_memdev_state {
 		u8 flags;
 	} dc_region[CXL_MAX_DC_REGION];
 
+	u32 dc_list_gen_num;
+	u32 dc_extents_index;
+	struct xarray dc_extent_list;
+	u32 num_dc_extents;
+
 	size_t dc_event_log_size;
 	struct cxl_event_state event;
 	struct cxl_poison_state poison;
@@ -470,6 +484,17 @@ enum cxl_opcode {
 	UUID_INIT(0xe1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19,     \
 		  0x40, 0x3d, 0x86)
 
+
+struct cxl_mbox_dc_response {
+	__le32 extent_list_size;
+	u8 reserved[4];
+	struct updated_extent_list {
+		__le64 dpa_start;
+		__le64 length;
+		u8 reserved[8];
+	} __packed extent_list[];
+} __packed;
+
 struct cxl_mbox_get_supported_logs {
 	__le16 entries;
 	u8 rsvd[6];
@@ -555,6 +580,7 @@ enum cxl_event_log_type {
 	CXL_EVENT_TYPE_WARN,
 	CXL_EVENT_TYPE_FAIL,
 	CXL_EVENT_TYPE_FATAL,
+	CXL_EVENT_TYPE_DCD,
 	CXL_EVENT_TYPE_MAX
 };
 
@@ -639,6 +665,35 @@ struct cxl_event_mem_module {
 	u8 reserved[0x3d];
 } __packed;
 
+/*
+ * Dynamic Capacity Event Record
+ * CXL rev 3.0 section 8.2.9.2.1.5; Table 8-47
+ */
+
+#define CXL_EVENT_DC_TAG_SIZE	0x10
+struct cxl_dc_extent {
+	__le64 start_dpa;
+	__le64 length;
+	u8 tag[CXL_EVENT_DC_TAG_SIZE];
+	__le16 shared_extn_seq;
+	u8 reserved[6];
+} __packed;
+
+struct dcd_record_data {
+	u8 event_type;
+	u8 reserved;
+	__le16 host_id;
+	u8 region_index;
+	u8 reserved1[3];
+	struct cxl_dc_extent extent;
+	u8 reserved2[32];
+} __packed;
+
+struct dcd_event_dyn_cap {
+	struct cxl_event_record_hdr hdr;
+	struct dcd_record_data data;
+} __packed;
+
 struct cxl_mbox_get_partition_info {
 	__le64 active_volatile_cap;
 	__le64 active_persistent_cap;
@@ -684,6 +739,19 @@ struct cxl_mbox_dynamic_capacity {
 #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
 #define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
 
+struct cxl_mbox_get_dc_extent {
+	__le32 extent_cnt;
+	__le32 start_extent_index;
+} __packed;
+
+struct cxl_mbox_dc_extents {
+	__le32 ret_extent_cnt;
+	__le32 total_extent_cnt;
+	__le32 extent_list_num;
+	u8 rsvd[4];
+	struct cxl_dc_extent extent[];
+}  __packed;
+
 /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
 struct cxl_mbox_set_timestamp_in {
 	__le64 timestamp;
@@ -826,6 +894,14 @@ int cxl_trigger_poison_list(struct cxl_memdev *cxlmd);
 int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa);
 int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa);
 
+/* FIXME why not have these be static in mbox.c? */
+int cxl_add_dc_extent(struct cxl_memdev_state *mds, struct range *alloc_range);
+int cxl_release_dc_extent(struct cxl_memdev_state *mds, struct range *rel_range);
+int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
+			      unsigned int *extent_gen_num);
+int cxl_dev_get_dc_extents(struct cxl_memdev_state *mds, unsigned int cnt,
+			   unsigned int index);
+
 #ifdef CONFIG_CXL_SUSPEND
 void cxl_mem_active_inc(void);
 void cxl_mem_active_dec(void);
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index ac1a41bc083d..558ffbcb9b34 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -522,8 +522,8 @@ static int cxl_event_req_irq(struct cxl_dev_state *cxlds, u8 setting)
 		return irq;
 
 	return devm_request_threaded_irq(dev, irq, NULL, cxl_event_thread,
-					 IRQF_SHARED | IRQF_ONESHOT, NULL,
-					 dev_id);
+					IRQF_SHARED | IRQF_ONESHOT, NULL,
+					dev_id);
 }
 
 static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
@@ -555,6 +555,7 @@ static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
 		.warn_settings = CXL_INT_MSI_MSIX,
 		.failure_settings = CXL_INT_MSI_MSIX,
 		.fatal_settings = CXL_INT_MSI_MSIX,
+		.dyncap_settings = CXL_INT_MSI_MSIX,
 	};
 
 	mbox_cmd = (struct cxl_mbox_cmd) {
@@ -608,6 +609,11 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds)
 		return rc;
 	}
 
+	rc = cxl_event_req_irq(cxlds, policy.dyncap_settings);
+	if (rc) {
+		dev_err(cxlds->dev, "Failed to get interrupt for event dc log\n");
+		return rc;
+	}
 	return 0;
 }
 
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 227800053309..b2b27033f589 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -434,7 +434,7 @@ static void free_dev_dax_ranges(struct dev_dax *dev_dax)
 		trim_dev_dax_range(dev_dax);
 }
 
-static void unregister_dev_dax(void *dev)
+void unregister_dev_dax(void *dev)
 {
 	struct dev_dax *dev_dax = to_dev_dax(dev);
 
@@ -445,6 +445,7 @@ static void unregister_dev_dax(void *dev)
 	free_dev_dax_ranges(dev_dax);
 	put_device(dev);
 }
+EXPORT_SYMBOL_GPL(unregister_dev_dax);
 
 /* a return value >= 0 indicates this invocation invalidated the id */
 static int __free_dev_dax_id(struct dev_dax *dev_dax)
@@ -641,7 +642,7 @@ static void dax_mapping_release(struct device *dev)
 	kfree(mapping);
 }
 
-static void unregister_dax_mapping(void *data)
+void unregister_dax_mapping(void *data)
 {
 	struct device *dev = data;
 	struct dax_mapping *mapping = to_dax_mapping(dev);
@@ -658,7 +659,7 @@ static void unregister_dax_mapping(void *data)
 	device_del(dev);
 	put_device(dev);
 }
-
+EXPORT_SYMBOL_GPL(unregister_dax_mapping);
 static struct dev_dax_range *get_dax_range(struct device *dev)
 {
 	struct dax_mapping *mapping = to_dax_mapping(dev);
@@ -793,7 +794,7 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
 	return 0;
 }
 
-static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
+int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
 		resource_size_t size)
 {
 	struct dax_region *dax_region = dev_dax->region;
@@ -853,6 +854,8 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
 
 	return rc;
 }
+EXPORT_SYMBOL_GPL(alloc_dev_dax_range);
+
 
 static int adjust_dev_dax_range(struct dev_dax *dev_dax, struct resource *res, resource_size_t size)
 {
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 8cd79ab34292..aa8418c7aead 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -47,8 +47,11 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
 	__dax_driver_register(driver, THIS_MODULE, KBUILD_MODNAME)
 void dax_driver_unregister(struct dax_device_driver *dax_drv);
 void kill_dev_dax(struct dev_dax *dev_dax);
+void unregister_dev_dax(void *dev);
+void unregister_dax_mapping(void *data);
 bool static_dev_dax(struct dev_dax *dev_dax);
-
+int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
+					resource_size_t size);
 /*
  * While run_dax() is potentially a generic operation that could be
  * defined in include/linux/dax.h we don't want to grow any users

-- 
2.40.0


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 5/5] cxl/mem: Trace Dynamic capacity Event Record
  2023-06-14 19:16 [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD) ira.weiny
                   ` (3 preceding siblings ...)
  2023-06-14 19:16 ` [PATCH 4/5] cxl/mem: Add support to handle DCD add and release capacity events ira.weiny
@ 2023-06-14 19:16 ` ira.weiny
  2023-06-15 17:08   ` Dave Jiang
  2023-06-15  0:56 ` [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD) Alison Schofield
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 55+ messages in thread
From: ira.weiny @ 2023-06-14 19:16 UTC (permalink / raw)
  To: Navneet Singh, Fan Ni, Jonathan Cameron, Ira Weiny, Dan Williams,
	linux-cxl

From: Navneet Singh <navneet.singh@intel.com>

CXL rev 3.0 section 8.2.9.2.1.5 defines the Dynamic Capacity Event Record
Determine if the event read is a Dynamic capacity event record and
if so trace the record for the debug purpose.

Add DC trace points to the trace log.

Signed-off-by: Navneet Singh <navneet.singh@intel.com>

---
[iweiny: fixups]
[djbw: no sign-off: preview only]
---
 drivers/cxl/core/mbox.c  |  5 ++++
 drivers/cxl/core/trace.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index db9295216de5..802dacd09772 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -888,6 +888,11 @@ static void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 				(struct cxl_event_mem_module *)record;
 
 		trace_cxl_memory_module(cxlmd, type, rec);
+	} else if (uuid_equal(id, &dc_event_uuid)) {
+		struct dcd_event_dyn_cap *rec =
+				(struct dcd_event_dyn_cap *)record;
+
+		trace_cxl_dynamic_capacity(cxlmd, type, rec);
 	} else {
 		/* For unknown record types print just the header */
 		trace_cxl_generic_event(cxlmd, type, record);
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index e11651255780..468c2c8b4347 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -704,6 +704,71 @@ TRACE_EVENT(cxl_poison,
 	)
 );
 
+/*
+ * DYNAMIC CAPACITY Event Record - DER
+ *
+ * CXL rev 3.0 section 8.2.9.2.1.5 Table 8-47
+ */
+
+#define CXL_DC_ADD_CAPACITY			0x00
+#define CXL_DC_REL_CAPACITY			0x01
+#define CXL_DC_FORCED_REL_CAPACITY		0x02
+#define CXL_DC_REG_CONF_UPDATED			0x03
+#define show_dc_evt_type(type)	__print_symbolic(type,		\
+	{ CXL_DC_ADD_CAPACITY,	"Add capacity"},		\
+	{ CXL_DC_REL_CAPACITY,	"Release capacity"},		\
+	{ CXL_DC_FORCED_REL_CAPACITY,	"Forced capacity release"},	\
+	{ CXL_DC_REG_CONF_UPDATED,	"Region Configuration Updated"	} \
+)
+
+TRACE_EVENT(cxl_dynamic_capacity,
+
+	TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
+		 struct dcd_event_dyn_cap  *rec),
+
+	TP_ARGS(cxlmd, log, rec),
+
+	TP_STRUCT__entry(
+		CXL_EVT_TP_entry
+
+		/* Dynamic capacity Event */
+		__field(u8, event_type)
+		__field(u16, hostid)
+		__field(u8, region_id)
+		__field(u64, dpa_start)
+		__field(u64, length)
+		__array(u8, tag, CXL_EVENT_DC_TAG_SIZE)
+		__field(u16, sh_extent_seq)
+	),
+
+	TP_fast_assign(
+		CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
+
+		/* Dynamic_capacity Event */
+		__entry->event_type = rec->data.event_type;
+
+		/* DCD event record data */
+		__entry->hostid = le16_to_cpu(rec->data.host_id);
+		__entry->region_id = rec->data.region_index;
+		__entry->dpa_start = le64_to_cpu(rec->data.extent.start_dpa);
+		__entry->length = le64_to_cpu(rec->data.extent.length);
+		memcpy(__entry->tag, &rec->data.extent.tag, CXL_EVENT_DC_TAG_SIZE);
+		__entry->sh_extent_seq = le16_to_cpu(rec->data.extent.shared_extn_seq);
+	),
+
+	CXL_EVT_TP_printk("event_type='%s' host_id='%d' region_id='%d' " \
+		"starting_dpa=%llx length=%llx tag=%s " \
+		"shared_extent_sequence=%d",
+		show_dc_evt_type(__entry->event_type),
+		__entry->hostid,
+		__entry->region_id,
+		__entry->dpa_start,
+		__entry->length,
+		__print_hex(__entry->tag, CXL_EVENT_DC_TAG_SIZE),
+		__entry->sh_extent_seq
+	)
+);
+
 #endif /* _CXL_EVENTS_H */
 
 #define TRACE_INCLUDE_FILE trace

-- 
2.40.0


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device
  2023-06-14 19:16 ` [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device ira.weiny
@ 2023-06-14 22:53   ` Dave Jiang
  2023-06-15 15:04     ` Ira Weiny
  2023-06-14 23:49   ` Alison Schofield
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 55+ messages in thread
From: Dave Jiang @ 2023-06-14 22:53 UTC (permalink / raw)
  To: ira.weiny, Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams,
	linux-cxl



On 6/14/23 12:16, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Read the Dynamic capacity configuration and store dynamic capacity region
> information in the device state which driver will use to map into the HDM
> ranges.
> 
> Implement Get Dynamic Capacity Configuration (opcode 4800h) mailbox
> command as specified in CXL 3.0 spec section 8.2.9.8.9.1.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> 
> ---
> [iweiny: ensure all mds->dc_region's are named]
> ---
>   drivers/cxl/core/mbox.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++--
>   drivers/cxl/cxlmem.h    |  70 +++++++++++++++++-
>   drivers/cxl/pci.c       |   4 +
>   3 files changed, 256 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 3ca0bf12c55f..c5b696737c87 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -111,6 +111,37 @@ static u8 security_command_sets[] = {
>   	0x46, /* Security Passthrough */
>   };
>   
> +static bool cxl_is_dcd_command(u16 opcode)
> +{
> +#define CXL_MBOX_OP_DCD_CMDS 0x48

Move this to cxlmem.h?

> +
> +	if ((opcode >> 8) == CXL_MBOX_OP_DCD_CMDS)
> +		return true;
> +
> +	return false;

I think you simplify by:

return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;


> +}
> +
> +static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
> +					u16 opcode)
> +{
> +	switch (opcode) {
> +	case CXL_MBOX_OP_GET_DC_CONFIG:
> +		set_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
> +		set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_ADD_DC_RESPONSE:
> +		set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_RELEASE_DC:
> +		set_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
> +		break;
> +	default:
> +		break;
> +	}
> +}
> +
>   static bool cxl_is_security_command(u16 opcode)
>   {
>   	int i;
> @@ -666,6 +697,7 @@ static int cxl_xfer_log(struct cxl_memdev_state *mds, uuid_t *uuid,
>   static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>   {
>   	struct cxl_cel_entry *cel_entry;
> +	struct cxl_mem_command *cmd;
>   	const int cel_entries = size / sizeof(*cel_entry);
>   	struct device *dev = mds->cxlds.dev;
>   	int i;
> @@ -674,11 +706,12 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>   
>   	for (i = 0; i < cel_entries; i++) {
>   		u16 opcode = le16_to_cpu(cel_entry[i].opcode);
> -		struct cxl_mem_command *cmd = cxl_mem_find_command(opcode);
> +		cmd = cxl_mem_find_command(opcode);
>   
> -		if (!cmd && !cxl_is_poison_command(opcode)) {
> -			dev_dbg(dev,
> -				"Opcode 0x%04x unsupported by driver\n", opcode);
> +		if (!cmd && !cxl_is_poison_command(opcode) &&
> +		    !cxl_is_dcd_command(opcode)) {
> +			dev_dbg(dev, "Opcode 0x%04x unsupported by driver\n",
> +				opcode);
>   			continue;
>   		}
>   
> @@ -688,6 +721,9 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>   		if (cxl_is_poison_command(opcode))
>   			cxl_set_poison_cmd_enabled(&mds->poison, opcode);
>   
> +		if (cxl_is_dcd_command(opcode))
> +			cxl_set_dcd_cmd_enabled(mds, opcode);
> +
>   		dev_dbg(dev, "Opcode 0x%04x enabled\n", opcode);
>   	}
>   }
> @@ -1059,7 +1095,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
>   	if (rc < 0)
>   		return rc;
>   
> -	mds->total_bytes =
> +	mds->total_static_capacity =
>   		le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
>   	mds->volatile_only_bytes =
>   		le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
> @@ -1077,10 +1113,137 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
>   		mds->poison.max_errors = min_t(u32, val, CXL_POISON_LIST_MAX);
>   	}
>   
> +	mds->dc_event_log_size = le16_to_cpu(id.dc_event_log_size);
> +
>   	return 0;
>   }
>   EXPORT_SYMBOL_NS_GPL(cxl_dev_state_identify, CXL);
>   
> +/**
> + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> + * information from the device.
> + * @mds: The memory device state
> + * Return: 0 if identify was executed successfully.
> + *
> + * This will dispatch the get_dynamic_capacity command to the device
> + * and on success populate structures to be exported to sysfs.
> + */
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> +{
> +	struct cxl_dev_state *cxlds = &mds->cxlds;
> +	struct device *dev = cxlds->dev;
> +	struct cxl_mbox_dynamic_capacity *dc;
> +	struct cxl_mbox_get_dc_config get_dc;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	u64 next_dc_region_start;
> +	int rc, i;
> +
> +	for (i = 0; i < CXL_MAX_DC_REGION; i++)
> +		sprintf(mds->dc_region[i].name, "dc%d", i);
> +
> +	/* Check GET_DC_CONFIG is supported by device */
> +	if (!test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds)) {
> +		dev_dbg(dev, "unsupported cmd: get_dynamic_capacity_config\n");
> +		return 0;
> +	}
> +
> +	dc = kvmalloc(mds->payload_size, GFP_KERNEL);
> +	if (!dc)
> +		return -ENOMEM;
> +
> +	get_dc = (struct cxl_mbox_get_dc_config) {
> +		.region_count = CXL_MAX_DC_REGION,
> +		.start_region_index = 0,
> +	};
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> +		.payload_in = &get_dc,
> +		.size_in = sizeof(get_dc),
> +		.size_out = mds->payload_size,
> +		.payload_out = dc,
> +		.min_out = 1,
> +	};
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	if (rc < 0)
> +		goto dc_error;
> +
> +	mds->nr_dc_region = dc->avail_region_count;
> +
> +	if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> +		dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> +			mds->nr_dc_region);
> +		rc = -EINVAL;
> +		goto dc_error;
> +	}
> +
> +	for (i = 0; i < mds->nr_dc_region; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> +		dcr->base = le64_to_cpu(dc->region[i].region_base);
> +		dcr->decode_len =
> +			le64_to_cpu(dc->region[i].region_decode_length);
> +		dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
> +		dcr->len = le64_to_cpu(dc->region[i].region_length);
> +		dcr->blk_size = le64_to_cpu(dc->region[i].region_block_size);
> +
> +		/* Check regions are in increasing DPA order */
> +		if ((i + 1) < mds->nr_dc_region) {
> +			next_dc_region_start =
> +				le64_to_cpu(dc->region[i + 1].region_base);
> +			if ((dcr->base > next_dc_region_start) ||
> +			    ((dcr->base + dcr->decode_len) > next_dc_region_start)) {
> +				dev_err(dev,
> +					"DPA ordering violation for DC region %d and %d\n",
> +					i, i + 1);
> +				rc = -EINVAL;
> +				goto dc_error;
> +			}
> +		}
> +
> +		/* Check the region is 256 MB aligned */
> +		if (!IS_ALIGNED(dcr->base, SZ_256M)) {
> +			dev_err(dev, "DC region %d not aligned to 256MB\n", i);
> +			rc = -EINVAL;
> +			goto dc_error;
> +		}
> +
> +		/* Check Region base and length are aligned to block size */
> +		if (!IS_ALIGNED(dcr->base, dcr->blk_size) ||
> +		    !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> +			dev_err(dev, "DC region %d not aligned to %#llx\n", i,
> +				dcr->blk_size);
> +			rc = -EINVAL;
> +			goto dc_error;
> +		}
> +
> +		dcr->dsmad_handle =
> +			le32_to_cpu(dc->region[i].region_dsmad_handle);
> +		dcr->flags = dc->region[i].flags;
> +		sprintf(dcr->name, "dc%d", i);
> +
> +		dev_dbg(dev,
> +			"DC region %s DPA: %#llx LEN: %#llx BLKSZ: %#llx\n",
> +			dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
> +	}
> +
> +	/*
> +	 * Calculate entire DPA range of all configured regions which will be mapped by
> +	 * one or more HDM decoders
> +	 */
> +	mds->total_dynamic_capacity =
> +		mds->dc_region[mds->nr_dc_region - 1].base +
> +		mds->dc_region[mds->nr_dc_region - 1].decode_len -
> +		mds->dc_region[0].base;
> +	dev_dbg(dev, "Total dynamic capacity: %#llx\n",
> +		mds->total_dynamic_capacity);
> +
> +dc_error:
> +	kvfree(dc);
> +	return rc;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> +
>   static int add_dpa_res(struct device *dev, struct resource *parent,
>   		       struct resource *res, resource_size_t start,
>   		       resource_size_t size, const char *type)
> @@ -1112,6 +1275,11 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>   	struct cxl_dev_state *cxlds = &mds->cxlds;
>   	struct device *dev = cxlds->dev;
>   	int rc;
> +	size_t untenanted_mem =
> +		mds->dc_region[0].base - mds->total_static_capacity;
> +
> +	mds->total_capacity = mds->total_static_capacity +
> +			untenanted_mem + mds->total_dynamic_capacity;
>   
>   	if (!cxlds->media_ready) {
>   		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> @@ -1121,13 +1289,23 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>   	}
>   
>   	cxlds->dpa_res =
> -		(struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
> +		(struct resource)DEFINE_RES_MEM(0, mds->total_capacity);
> +
> +	for (int i = 0; i < CXL_MAX_DC_REGION; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
> +				 dcr->base, dcr->decode_len, dcr->name);
> +		if (rc)
> +			return rc;
> +	}
>   
>   	if (mds->partition_align_bytes == 0) {
>   		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
>   				 mds->volatile_only_bytes, "ram");
>   		if (rc)
>   			return rc;
> +

Stray blank line?

DJ

>   		return add_dpa_res(dev, &cxlds->dpa_res, &cxlds->pmem_res,
>   				   mds->volatile_only_bytes,
>   				   mds->persistent_only_bytes, "pmem");
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 89e560ea14c0..9c0b2fa72bdd 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -239,6 +239,15 @@ struct cxl_event_state {
>   	struct mutex log_lock;
>   };
>   
> +/* Device enabled DCD commands */
> +enum dcd_cmd_enabled_bits {
> +	CXL_DCD_ENABLED_GET_CONFIG,
> +	CXL_DCD_ENABLED_GET_EXTENT_LIST,
> +	CXL_DCD_ENABLED_ADD_RESPONSE,
> +	CXL_DCD_ENABLED_RELEASE,
> +	CXL_DCD_ENABLED_MAX
> +};
> +
>   /* Device enabled poison commands */
>   enum poison_cmd_enabled_bits {
>   	CXL_POISON_ENABLED_LIST,
> @@ -284,6 +293,9 @@ enum cxl_devtype {
>   	CXL_DEVTYPE_CLASSMEM,
>   };
>   
> +#define CXL_MAX_DC_REGION 8
> +#define CXL_DC_REGION_SRTLEN 8
> +
>   /**
>    * struct cxl_dev_state - The driver device state
>    *
> @@ -300,6 +312,8 @@ enum cxl_devtype {
>    * @dpa_res: Overall DPA resource tree for the device
>    * @pmem_res: Active Persistent memory capacity configuration
>    * @ram_res: Active Volatile memory capacity configuration
> + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> + *          region
>    * @component_reg_phys: register base of component registers
>    * @info: Cached DVSEC information about the device.
>    * @serial: PCIe Device Serial Number
> @@ -315,6 +329,7 @@ struct cxl_dev_state {
>   	struct resource dpa_res;
>   	struct resource pmem_res;
>   	struct resource ram_res;
> +	struct resource dc_res[CXL_MAX_DC_REGION];
>   	resource_size_t component_reg_phys;
>   	u64 serial;
>   	enum cxl_devtype type;
> @@ -334,9 +349,12 @@ struct cxl_dev_state {
>    *                (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
>    * @mbox_mutex: Mutex to synchronize mailbox access.
>    * @firmware_version: Firmware version for the memory device.
> + * @dcd_cmds: List of DCD commands implemented by memory device
>    * @enabled_cmds: Hardware commands found enabled in CEL.
>    * @exclusive_cmds: Commands that are kernel-internal only
> - * @total_bytes: sum of all possible capacities
> + * @total_capacity: Sum of static and dynamic capacities
> + * @total_static_capacity: Sum of RAM and PMEM capacities
> + * @total_dynamic_capacity: Complete DPA range occupied by DC regions
>    * @volatile_only_bytes: hard volatile capacity
>    * @persistent_only_bytes: hard persistent capacity
>    * @partition_align_bytes: alignment size for partition-able capacity
> @@ -344,6 +362,10 @@ struct cxl_dev_state {
>    * @active_persistent_bytes: sum of hard + soft persistent
>    * @next_volatile_bytes: volatile capacity change pending device reset
>    * @next_persistent_bytes: persistent capacity change pending device reset
> + * @nr_dc_region: number of DC regions implemented in the memory device
> + * @dc_region: array containing info about the DC regions
> + * @dc_event_log_size: The number of events the device can store in the
> + * Dynamic Capacity Event Log before it overflows
>    * @event: event log driver state
>    * @poison: poison driver state info
>    * @mbox_send: @dev specific transport for transmitting mailbox commands
> @@ -357,9 +379,13 @@ struct cxl_memdev_state {
>   	size_t lsa_size;
>   	struct mutex mbox_mutex; /* Protects device mailbox and firmware */
>   	char firmware_version[0x10];
> +	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
>   	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
>   	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> -	u64 total_bytes;
> +
> +	u64 total_capacity;
> +	u64 total_static_capacity;
> +	u64 total_dynamic_capacity;
>   	u64 volatile_only_bytes;
>   	u64 persistent_only_bytes;
>   	u64 partition_align_bytes;
> @@ -367,6 +393,20 @@ struct cxl_memdev_state {
>   	u64 active_persistent_bytes;
>   	u64 next_volatile_bytes;
>   	u64 next_persistent_bytes;
> +
> +	u8 nr_dc_region;
> +
> +	struct cxl_dc_region_info {
> +		u8 name[CXL_DC_REGION_SRTLEN];
> +		u64 base;
> +		u64 decode_len;
> +		u64 len;
> +		u64 blk_size;
> +		u32 dsmad_handle;
> +		u8 flags;
> +	} dc_region[CXL_MAX_DC_REGION];
> +
> +	size_t dc_event_log_size;
>   	struct cxl_event_state event;
>   	struct cxl_poison_state poison;
>   	int (*mbox_send)(struct cxl_memdev_state *mds,
> @@ -415,6 +455,10 @@ enum cxl_opcode {
>   	CXL_MBOX_OP_UNLOCK		= 0x4503,
>   	CXL_MBOX_OP_FREEZE_SECURITY	= 0x4504,
>   	CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE	= 0x4505,
> +	CXL_MBOX_OP_GET_DC_CONFIG	= 0x4800,
> +	CXL_MBOX_OP_GET_DC_EXTENT_LIST	= 0x4801,
> +	CXL_MBOX_OP_ADD_DC_RESPONSE	= 0x4802,
> +	CXL_MBOX_OP_RELEASE_DC		= 0x4803,
>   	CXL_MBOX_OP_MAX			= 0x10000
>   };
>   
> @@ -462,6 +506,7 @@ struct cxl_mbox_identify {
>   	__le16 inject_poison_limit;
>   	u8 poison_caps;
>   	u8 qos_telemetry_caps;
> +	__le16 dc_event_log_size;
>   } __packed;
>   
>   /*
> @@ -617,7 +662,27 @@ struct cxl_mbox_set_partition_info {
>   	u8 flags;
>   } __packed;
>   
> +struct cxl_mbox_get_dc_config {
> +	u8 region_count;
> +	u8 start_region_index;
> +} __packed;
> +
> +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> +struct cxl_mbox_dynamic_capacity {
> +	u8 avail_region_count;
> +	u8 rsvd[7];
> +	struct cxl_dc_region_config {
> +		__le64 region_base;
> +		__le64 region_decode_length;
> +		__le64 region_length;
> +		__le64 region_block_size;
> +		__le32 region_dsmad_handle;
> +		u8 flags;
> +		u8 rsvd[3];
> +	} __packed region[];
> +} __packed;
>   #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
> +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
>   
>   /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
>   struct cxl_mbox_set_timestamp_in {
> @@ -742,6 +807,7 @@ enum {
>   int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
>   			  struct cxl_mbox_cmd *cmd);
>   int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
>   int cxl_await_media_ready(struct cxl_dev_state *cxlds);
>   int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
>   int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 4e2845b7331a..ac1a41bc083d 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -742,6 +742,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>   	if (rc)
>   		return rc;
>   
> +	rc = cxl_dev_dynamic_capacity_identify(mds);
> +	if (rc)
> +		return rc;
> +
>   	rc = cxl_mem_create_range_info(mds);
>   	if (rc)
>   		return rc;
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support.
  2023-06-14 19:16 ` [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support ira.weiny
@ 2023-06-14 23:37   ` Dave Jiang
  2023-06-15 18:12     ` Ira Weiny
  2023-06-15  0:21   ` Alison Schofield
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 55+ messages in thread
From: Dave Jiang @ 2023-06-14 23:37 UTC (permalink / raw)
  To: ira.weiny, Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams,
	linux-cxl



On 6/14/23 12:16, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> CXL devices optionally support dynamic capacity. CXL Regions must be
> created to access this capacity.
> 
> Add sysfs entries to create dynamic capacity cxl regions. Provide a new
> Dynamic Capacity decoder mode which targets dynamic capacity on devices
> which are added to that region.
> 
> Below are the steps to create and delete dynamic capacity region0
> (example).
> 
>      region=$(cat /sys/bus/cxl/devices/decoder0.0/create_dc_region)
>      echo $region> /sys/bus/cxl/devices/decoder0.0/create_dc_region
>      echo 256 > /sys/bus/cxl/devices/$region/interleave_granularity
>      echo 1 > /sys/bus/cxl/devices/$region/interleave_ways
> 
>      echo "dc0" >/sys/bus/cxl/devices/decoder1.0/mode
>      echo 0x400000000 >/sys/bus/cxl/devices/decoder1.0/dpa_size
> 
>      echo 0x400000000 > /sys/bus/cxl/devices/$region/size
>      echo  "decoder1.0" > /sys/bus/cxl/devices/$region/target0
>      echo 1 > /sys/bus/cxl/devices/$region/commit
>      echo $region > /sys/bus/cxl/drivers/cxl_region/bind
> 
>      echo $region> /sys/bus/cxl/devices/decoder0.0/delete_region
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> 
> ---
> [iweiny: fixups]
> [iweiny: remove unused CXL_DC_REGION_MODE macro]
> [iweiny: Make dc_mode_to_region_index static]
> [iweiny: simplify <sysfs>/create_dc_region]
> [iweiny: introduce decoder_mode_is_dc]
> [djbw: fixups, no sign-off: preview only]
> ---
>   drivers/cxl/Kconfig       |  11 +++
>   drivers/cxl/core/core.h   |   7 ++
>   drivers/cxl/core/hdm.c    | 234 ++++++++++++++++++++++++++++++++++++++++++----
>   drivers/cxl/core/port.c   |  18 ++++
>   drivers/cxl/core/region.c | 135 ++++++++++++++++++++++++--
>   drivers/cxl/cxl.h         |  28 ++++++
>   drivers/dax/cxl.c         |   4 +
>   7 files changed, 409 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index ff4e78117b31..df034889d053 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -121,6 +121,17 @@ config CXL_REGION
>   
>   	  If unsure say 'y'
>   
> +config CXL_DCD
> +	bool "CXL: DCD Support"
> +	default CXL_BUS
> +	depends on CXL_REGION
> +	help
> +	  Enable the CXL core to provision CXL DCD regions.
> +	  CXL devices optionally support dynamic capacity and DCD region
> +	  maps the dynamic capacity regions DPA's into Host HPA ranges.
> +
> +	  If unsure say 'y'
> +
>   config CXL_REGION_INVALIDATION_TEST
>   	bool "CXL: Region Cache Management Bypass (TEST)"
>   	depends on CXL_REGION
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 27f0968449de..725700ab5973 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -9,6 +9,13 @@ extern const struct device_type cxl_nvdimm_type;
>   
>   extern struct attribute_group cxl_base_attribute_group;
>   
> +#ifdef CONFIG_CXL_DCD
> +extern struct device_attribute dev_attr_create_dc_region;
> +#define SET_CXL_DC_REGION_ATTR(x) (&dev_attr_##x.attr),
> +#else
> +#define SET_CXL_DC_REGION_ATTR(x)
> +#endif
> +
>   #ifdef CONFIG_CXL_REGION
>   extern struct device_attribute dev_attr_create_pmem_region;
>   extern struct device_attribute dev_attr_create_ram_region;
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 514d30131d92..29649b47d177 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -233,14 +233,23 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>   	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>   	struct resource *res = cxled->dpa_res;
>   	resource_size_t skip_start;
> +	resource_size_t skipped = cxled->skip;
>   
>   	lockdep_assert_held_write(&cxl_dpa_rwsem);
>   
>   	/* save @skip_start, before @res is released */
> -	skip_start = res->start - cxled->skip;
> +	skip_start = res->start - skipped;
>   	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
> -	if (cxled->skip)
> -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> +	if (cxled->skip != 0) {
> +		while (skipped != 0) {
> +			res = xa_load(&cxled->skip_res, skip_start);
> +			__release_region(&cxlds->dpa_res, skip_start,
> +							resource_size(res));
> +			xa_erase(&cxled->skip_res, skip_start);
> +			skip_start += resource_size(res);
> +			skipped -= resource_size(res);
> +			}
> +	}
>   	cxled->skip = 0;
>   	cxled->dpa_res = NULL;
>   	put_device(&cxled->cxld.dev);
> @@ -267,6 +276,19 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>   	__cxl_dpa_release(cxled);
>   }
>   
> +static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
> +{
> +	int index = 0;
> +
> +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> +		if (mode == i)
> +			return index;
> +		index++;
> +	}
> +
> +	return -EINVAL;
> +}
> +
>   static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>   			     resource_size_t base, resource_size_t len,
>   			     resource_size_t skipped)
> @@ -275,7 +297,11 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>   	struct cxl_port *port = cxled_to_port(cxled);
>   	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>   	struct device *dev = &port->dev;
> +	struct device *ed_dev = &cxled->cxld.dev;
> +	struct resource *dpa_res = &cxlds->dpa_res;
> +	resource_size_t skip_len = 0;
>   	struct resource *res;
> +	int rc, index;
>   
>   	lockdep_assert_held_write(&cxl_dpa_rwsem);
>   
> @@ -304,28 +330,119 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>   	}
>   
>   	if (skipped) {
> -		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> -				       dev_name(&cxled->cxld.dev), 0);
> -		if (!res) {
> -			dev_dbg(dev,
> -				"decoder%d.%d: failed to reserve skipped space\n",
> -				port->id, cxled->cxld.id);
> -			return -EBUSY;
> +		resource_size_t skip_base = base - skipped;
> +
> +		if (decoder_mode_is_dc(cxled->mode)) {

Maybe move this entire block to a helper function to reduce the size of 
the current function and reduce indent levels and improve readability?

> +			if (resource_size(&cxlds->ram_res) &&
> +					skip_base <= cxlds->ram_res.end) {
> +				skip_len = cxlds->ram_res.end - skip_base + 1;
> +				res = __request_region(dpa_res, skip_base,
> +						skip_len, dev_name(ed_dev), 0);
> +				if (!res)
> +					goto error;
> +
> +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> +								GFP_KERNEL);
> +				skip_base += skip_len;
> +			}
> +
> +			if (resource_size(&cxlds->ram_res) &&
> +					skip_base <= cxlds->pmem_res.end) {
> +				skip_len = cxlds->pmem_res.end - skip_base + 1;
> +				res = __request_region(dpa_res, skip_base,
> +						skip_len, dev_name(ed_dev), 0);
> +				if (!res)
> +					goto error;
> +
> +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> +								GFP_KERNEL);
> +				skip_base += skip_len;
> +			}
> +
> +			index = dc_mode_to_region_index(cxled->mode);
> +			for (int i = 0; i <= index; i++) {
> +				struct resource *dcr = &cxlds->dc_res[i];
> +
> +				if (skip_base < dcr->start) {
> +					skip_len = dcr->start - skip_base;
> +					res = __request_region(dpa_res,
> +							skip_base, skip_len,
> +							dev_name(ed_dev), 0);
> +					if (!res)
> +						goto error;
> +
> +					rc = xa_insert(&cxled->skip_res, skip_base,
> +							res, GFP_KERNEL);
> +					skip_base += skip_len;
> +				}
> +
> +				if (skip_base == base) {
> +					dev_dbg(dev, "skip done!\n");
> +					break;
> +				}
> +
> +				if (resource_size(dcr) &&
> +						skip_base <= dcr->end) {
> +					if (skip_base > base)
> +						dev_err(dev, "Skip error\n");
> +
> +					skip_len = dcr->end - skip_base + 1;
> +					res = __request_region(dpa_res, skip_base,
> +							skip_len,
> +							dev_name(ed_dev), 0);
> +					if (!res)
> +						goto error;
> +
> +					rc = xa_insert(&cxled->skip_res, skip_base,
> +							res, GFP_KERNEL);
> +					skip_base += skip_len;
> +				}
> +			}
> +		} else	{
> +			res = __request_region(dpa_res, base - skipped, skipped,
> +							dev_name(ed_dev), 0);
> +			if (!res)
> +				goto error;
> +
> +			rc = xa_insert(&cxled->skip_res, skip_base, res,
> +								GFP_KERNEL);
>   		}
>   	}
> -	res = __request_region(&cxlds->dpa_res, base, len,
> -			       dev_name(&cxled->cxld.dev), 0);
> +
> +	res = __request_region(dpa_res, base, len, dev_name(ed_dev), 0);
>   	if (!res) {
>   		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
> -			port->id, cxled->cxld.id);
> -		if (skipped)
> -			__release_region(&cxlds->dpa_res, base - skipped,
> -					 skipped);
> +				port->id, cxled->cxld.id);
> +		if (skipped) {
> +			resource_size_t skip_base = base - skipped;
> +
> +			while (skipped != 0) {
> +				if (skip_base > base)
> +					dev_err(dev, "Skip error\n");
> +
> +				res = xa_load(&cxled->skip_res, skip_base);
> +				__release_region(dpa_res, skip_base,
> +							resource_size(res));
> +				xa_erase(&cxled->skip_res, skip_base);
> +				skip_base += resource_size(res);
> +				skipped -= resource_size(res);
> +			}
> +		}
>   		return -EBUSY;
>   	}
>   	cxled->dpa_res = res;
>   	cxled->skip = skipped;
>   
> +	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
> +		int index = dc_mode_to_region_index(mode);
> +
> +		if (resource_contains(&cxlds->dc_res[index], res)) {
> +			cxled->mode = mode;
> +			dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id,
> +				cxled->cxld.id, cxled->dpa_res, cxled->mode);
> +			goto success > +		}
> +	}

This block should only happen if decoder_mode_is_dc() right? If that's 
the case, you might be able to refactor it so the 'goto success' isn't 
necessary.

>   	if (resource_contains(&cxlds->pmem_res, res))
>   		cxled->mode = CXL_DECODER_PMEM;
>   	else if (resource_contains(&cxlds->ram_res, res))
> @@ -336,9 +453,16 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>   		cxled->mode = CXL_DECODER_MIXED;
>   	}
>   
> +success:
>   	port->hdm_end++;
>   	get_device(&cxled->cxld.dev);
>   	return 0;
> +
> +error:
> +	dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space\n",
> +			port->id, cxled->cxld.id);
> +	return -EBUSY;
> +
>   }
>   
>   int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> @@ -429,6 +553,14 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>   	switch (mode) {
>   	case CXL_DECODER_RAM:
>   	case CXL_DECODER_PMEM:
> +	case CXL_DECODER_DC0:
> +	case CXL_DECODER_DC1:
> +	case CXL_DECODER_DC2:
> +	case CXL_DECODER_DC3:
> +	case CXL_DECODER_DC4:
> +	case CXL_DECODER_DC5:
> +	case CXL_DECODER_DC6:
> +	case CXL_DECODER_DC7:
>   		break;
>   	default:
>   		dev_dbg(dev, "unsupported mode: %d\n", mode);
> @@ -456,6 +588,16 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>   		goto out;
>   	}
>   
> +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> +		int index = dc_mode_to_region_index(i);
> +
> +		if (mode == i && !resource_size(&cxlds->dc_res[index])) {
> +			dev_dbg(dev, "no available dynamic capacity\n");
> +			rc = -ENXIO;
> +			goto out;
> +		}
> +	}
> +
>   	cxled->mode = mode;
>   	rc = 0;
>   out:
> @@ -469,10 +611,12 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>   					 resource_size_t *skip_out)
>   {
>   	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> -	resource_size_t free_ram_start, free_pmem_start;
> +	resource_size_t free_ram_start, free_pmem_start, free_dc_start;
>   	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	struct device *dev = &cxled->cxld.dev;
>   	resource_size_t start, avail, skip;
>   	struct resource *p, *last;
> +	int index;
>   
>   	lockdep_assert_held(&cxl_dpa_rwsem);
>   
> @@ -490,6 +634,20 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>   	else
>   		free_pmem_start = cxlds->pmem_res.start;
>   
> +	/*
> +	 * One HDM Decoder per DC region to map memory with different
> +	 * DSMAS entry.
> +	 */
> +	index = dc_mode_to_region_index(cxled->mode);
> +	if (index >= 0) {
> +		if (cxlds->dc_res[index].child) {
> +			dev_err(dev, "Cannot allocated DPA from DC Region: %d\n",
> +					index);
> +			return -EINVAL;
> +		}
> +		free_dc_start = cxlds->dc_res[index].start;
> +	}
> +
>   	if (cxled->mode == CXL_DECODER_RAM) {
>   		start = free_ram_start;
>   		avail = cxlds->ram_res.end - start + 1;
> @@ -511,6 +669,29 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>   		else
>   			skip_end = start - 1;
>   		skip = skip_end - skip_start + 1;
> +	} else if (decoder_mode_is_dc(cxled->mode)) {
> +		resource_size_t skip_start, skip_end;
> +
> +		start = free_dc_start;
> +		avail = cxlds->dc_res[index].end - start + 1;
> +		if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
> +			skip_start = free_ram_start;
> +		else
> +			skip_start = free_pmem_start;
> +		/*
> +		 * If some dc region is already mapped, then that allocation
> +		 * already handled the RAM and PMEM skip.Check for DC region
> +		 * skip.
> +		 */
> +		for (int i = index - 1; i >= 0 ; i--) {
> +			if (cxlds->dc_res[i].child) {
> +				skip_start = cxlds->dc_res[i].child->end + 1;
> +				break;
> +			}
> +		}
> +
> +		skip_end = start - 1;
> +		skip = skip_end - skip_start + 1;
>   	} else {
>   		dev_dbg(cxled_dev(cxled), "mode not set\n");
>   		avail = 0;
> @@ -548,10 +729,25 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>   
>   	avail = cxl_dpa_freespace(cxled, &start, &skip);
>   
> +	dev_dbg(dev, "DPA Allocation start: %llx len: %llx Skip: %llx\n",
> +						start, size, skip);
>   	if (size > avail) {
> +		static const char * const names[] = {
> +			[CXL_DECODER_NONE] = "none",
> +			[CXL_DECODER_RAM] = "ram",
> +			[CXL_DECODER_PMEM] = "pmem",
> +			[CXL_DECODER_MIXED] = "mixed",
> +			[CXL_DECODER_DC0] = "dc0",
> +			[CXL_DECODER_DC1] = "dc1",
> +			[CXL_DECODER_DC2] = "dc2",
> +			[CXL_DECODER_DC3] = "dc3",
> +			[CXL_DECODER_DC4] = "dc4",
> +			[CXL_DECODER_DC5] = "dc5",
> +			[CXL_DECODER_DC6] = "dc6",
> +			[CXL_DECODER_DC7] = "dc7",
> +		};
>   		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
> -			cxled->mode == CXL_DECODER_RAM ? "ram" : "pmem",
> -			&avail);
> +			names[cxled->mode], &avail);
>   		rc = -ENOSPC;
>   		goto out;
>   	}
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 5e21b53362e6..a1a98aba24ed 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -195,6 +195,22 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
>   		mode = CXL_DECODER_PMEM;
>   	else if (sysfs_streq(buf, "ram"))
>   		mode = CXL_DECODER_RAM;
> +	else if (sysfs_streq(buf, "dc0"))
> +		mode = CXL_DECODER_DC0;
> +	else if (sysfs_streq(buf, "dc1"))
> +		mode = CXL_DECODER_DC1;
> +	else if (sysfs_streq(buf, "dc2"))
> +		mode = CXL_DECODER_DC2;
> +	else if (sysfs_streq(buf, "dc3"))
> +		mode = CXL_DECODER_DC3;
> +	else if (sysfs_streq(buf, "dc4"))
> +		mode = CXL_DECODER_DC4;
> +	else if (sysfs_streq(buf, "dc5"))
> +		mode = CXL_DECODER_DC5;
> +	else if (sysfs_streq(buf, "dc6"))
> +		mode = CXL_DECODER_DC6;
> +	else if (sysfs_streq(buf, "dc7"))
> +		mode = CXL_DECODER_DC7;
>   	else
>   		return -EINVAL;
>   
> @@ -296,6 +312,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
>   	&dev_attr_target_list.attr,
>   	SET_CXL_REGION_ATTR(create_pmem_region)
>   	SET_CXL_REGION_ATTR(create_ram_region)
> +	SET_CXL_DC_REGION_ATTR(create_dc_region)
>   	SET_CXL_REGION_ATTR(delete_region)
>   	NULL,
>   };
> @@ -1691,6 +1708,7 @@ struct cxl_endpoint_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
>   		return ERR_PTR(-ENOMEM);
>   
>   	cxled->pos = -1;
> +	xa_init(&cxled->skip_res);
>   	cxld = &cxled->cxld;
>   	rc = cxl_decoder_init(port, cxld);
>   	if (rc)	 {
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 543c4499379e..144232c8305e 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1733,7 +1733,7 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>   	lockdep_assert_held_write(&cxl_region_rwsem);
>   	lockdep_assert_held_read(&cxl_dpa_rwsem);
>   
> -	if (cxled->mode != cxlr->mode) {
> +	if (decoder_mode_is_dc(cxlr->mode) && !decoder_mode_is_dc(cxled->mode)) {
>   		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
>   			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
>   		return -EINVAL;
> @@ -2211,6 +2211,14 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>   	switch (mode) {
>   	case CXL_DECODER_RAM:
>   	case CXL_DECODER_PMEM:
> +	case CXL_DECODER_DC0:
> +	case CXL_DECODER_DC1:
> +	case CXL_DECODER_DC2:
> +	case CXL_DECODER_DC3:
> +	case CXL_DECODER_DC4:
> +	case CXL_DECODER_DC5:
> +	case CXL_DECODER_DC6:
> +	case CXL_DECODER_DC7:
>   		break;
>   	default:
>   		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> @@ -2321,6 +2329,43 @@ static ssize_t create_ram_region_store(struct device *dev,
>   }
>   DEVICE_ATTR_RW(create_ram_region);
>   
> +static ssize_t store_dcN_region(struct cxl_root_decoder *cxlrd,
> +				const char *buf, enum cxl_decoder_mode mode,
> +				size_t len)
> +{
> +	struct cxl_region *cxlr;
> +	int rc, id;
> +
> +	rc = sscanf(buf, "region%d\n", &id);
> +	if (rc != 1)
> +		return -EINVAL;
> +
> +	cxlr = __create_region(cxlrd, id, mode, CXL_DECODER_HOSTMEM);
> +	if (IS_ERR(cxlr))
> +		return PTR_ERR(cxlr);
> +
> +	return len;
> +}
> +
> +static ssize_t create_dc_region_show(struct device *dev,
> +				     struct device_attribute *attr, char *buf)
> +{
> +	return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static ssize_t create_dc_region_store(struct device *dev,
> +				      struct device_attribute *attr,
> +				      const char *buf, size_t len)
> +{
> +	/*
> +	 * All DC regions use decoder mode DC0 as the region does not need the
> +	 * index information
> +	 */
> +	return store_dcN_region(to_cxl_root_decoder(dev), buf,
> +				CXL_DECODER_DC0, len);
> +}
> +DEVICE_ATTR_RW(create_dc_region);
> +
>   static ssize_t region_show(struct device *dev, struct device_attribute *attr,
>   			   char *buf)
>   {
> @@ -2799,6 +2844,61 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
>   	return rc;
>   }
>   
> +static void cxl_dc_region_release(void *data)
> +{
> +	struct cxl_region *cxlr = data;
> +	struct cxl_dc_region *cxlr_dc = cxlr->cxlr_dc;
> +
> +	xa_destroy(&cxlr_dc->dax_dev_list);
> +	kfree(cxlr_dc);
> +}
> +
> +static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
> +{
> +	struct cxl_dc_region *cxlr_dc;
> +	struct cxl_dax_region *cxlr_dax;
> +	struct device *dev;
> +	int rc = 0;
> +
> +	cxlr_dax = cxl_dax_region_alloc(cxlr);
> +	if (IS_ERR(cxlr_dax))
> +		return PTR_ERR(cxlr_dax);
> +
> +	cxlr_dc = kzalloc(sizeof(*cxlr_dc), GFP_KERNEL);
> +	if (!cxlr_dc) {
> +		rc = -ENOMEM;
> +		goto err;
> +	}
> +
> +	dev = &cxlr_dax->dev;
> +	rc = dev_set_name(dev, "dax_region%d", cxlr->id);
> +	if (rc)
> +		goto err;
> +
> +	rc = device_add(dev);
> +	if (rc)
> +		goto err;
> +
> +	dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
> +		dev_name(dev));
> +
> +	rc = devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
> +					cxlr_dax);
> +	if (rc)
> +		goto err;
> +
> +	cxlr_dc->cxlr_dax = cxlr_dax;
> +	xa_init(&cxlr_dc->dax_dev_list);
> +	cxlr->cxlr_dc = cxlr_dc;
> +	rc = devm_add_action_or_reset(&cxlr->dev, cxl_dc_region_release, cxlr);
> +	if (!rc)
> +		return 0;
> +err:
> +	put_device(dev);
> +	kfree(cxlr_dc);
> +	return rc;
> +}
> +
>   static int match_decoder_by_range(struct device *dev, void *data)
>   {
>   	struct range *r1, *r2 = data;
> @@ -3140,6 +3240,19 @@ static int is_system_ram(struct resource *res, void *arg)
>   	return 1;
>   }
>   
> +/*
> + * The region can not be manged by CXL if any portion of
> + * it is already online as 'System RAM'
> + */
> +static bool region_is_system_ram(struct cxl_region *cxlr,
> +				 struct cxl_region_params *p)
> +{
> +	return (walk_iomem_res_desc(IORES_DESC_NONE,
> +				    IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> +				    p->res->start, p->res->end, cxlr,
> +				    is_system_ram) > 0);
> +}
> +
>   static int cxl_region_probe(struct device *dev)
>   {
>   	struct cxl_region *cxlr = to_cxl_region(dev);
> @@ -3174,14 +3287,7 @@ static int cxl_region_probe(struct device *dev)
>   	case CXL_DECODER_PMEM:
>   		return devm_cxl_add_pmem_region(cxlr);
>   	case CXL_DECODER_RAM:
> -		/*
> -		 * The region can not be manged by CXL if any portion of
> -		 * it is already online as 'System RAM'
> -		 */
> -		if (walk_iomem_res_desc(IORES_DESC_NONE,
> -					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> -					p->res->start, p->res->end, cxlr,
> -					is_system_ram) > 0)
> +		if (region_is_system_ram(cxlr, p))

Maybe split this change out as a prep patch before the current patch.

>   			return 0;
>   
>   		/*
> @@ -3193,6 +3299,17 @@ static int cxl_region_probe(struct device *dev)
>   
>   		/* HDM-H routes to device-dax */
>   		return devm_cxl_add_dax_region(cxlr);
> +	case CXL_DECODER_DC0:
> +	case CXL_DECODER_DC1:
> +	case CXL_DECODER_DC2:
> +	case CXL_DECODER_DC3:
> +	case CXL_DECODER_DC4:
> +	case CXL_DECODER_DC5:
> +	case CXL_DECODER_DC6:
> +	case CXL_DECODER_DC7:
> +		if (region_is_system_ram(cxlr, p))
> +			return 0;
> +		return devm_cxl_add_dc_region(cxlr);
>   	default:
>   		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
>   			cxlr->mode);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 8400af85d99f..7ac1237938b7 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -335,6 +335,14 @@ enum cxl_decoder_mode {
>   	CXL_DECODER_NONE,
>   	CXL_DECODER_RAM,
>   	CXL_DECODER_PMEM,
> +	CXL_DECODER_DC0,
> +	CXL_DECODER_DC1,
> +	CXL_DECODER_DC2,
> +	CXL_DECODER_DC3,
> +	CXL_DECODER_DC4,
> +	CXL_DECODER_DC5,
> +	CXL_DECODER_DC6,
> +	CXL_DECODER_DC7,
>   	CXL_DECODER_MIXED,
>   	CXL_DECODER_DEAD,
>   };
> @@ -345,6 +353,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>   		[CXL_DECODER_NONE] = "none",
>   		[CXL_DECODER_RAM] = "ram",
>   		[CXL_DECODER_PMEM] = "pmem",
> +		[CXL_DECODER_DC0] = "dc0",
> +		[CXL_DECODER_DC1] = "dc1",
> +		[CXL_DECODER_DC2] = "dc2",
> +		[CXL_DECODER_DC3] = "dc3",
> +		[CXL_DECODER_DC4] = "dc4",
> +		[CXL_DECODER_DC5] = "dc5",
> +		[CXL_DECODER_DC6] = "dc6",
> +		[CXL_DECODER_DC7] = "dc7",
>   		[CXL_DECODER_MIXED] = "mixed",
>   	};
>   
> @@ -353,6 +369,11 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>   	return "mixed";
>   }
>   
> +static inline bool decoder_mode_is_dc(enum cxl_decoder_mode mode)
> +{
> +	return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
> +}
> +
>   /*
>    * Track whether this decoder is reserved for region autodiscovery, or
>    * free for userspace provisioning.
> @@ -375,6 +396,7 @@ struct cxl_endpoint_decoder {
>   	struct cxl_decoder cxld;
>   	struct resource *dpa_res;
>   	resource_size_t skip;
> +	struct xarray skip_res;
>   	enum cxl_decoder_mode mode;
>   	enum cxl_decoder_state state;
>   	int pos;
> @@ -475,6 +497,11 @@ struct cxl_region_params {
>    */
>   #define CXL_REGION_F_AUTO 1
>   
> +struct cxl_dc_region {
> +	struct xarray dax_dev_list;
> +	struct cxl_dax_region *cxlr_dax;
> +};
> +
>   /**
>    * struct cxl_region - CXL region
>    * @dev: This region's device
> @@ -493,6 +520,7 @@ struct cxl_region {
>   	enum cxl_decoder_type type;
>   	struct cxl_nvdimm_bridge *cxl_nvb;
>   	struct cxl_pmem_region *cxlr_pmem;
> +	struct cxl_dc_region *cxlr_dc;
>   	unsigned long flags;
>   	struct cxl_region_params params;
>   };
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index ccdf8de85bd5..eb5eb81bfbd7 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -23,11 +23,15 @@ static int cxl_dax_region_probe(struct device *dev)
>   	if (!dax_region)
>   		return -ENOMEM;
>   
> +	if (decoder_mode_is_dc(cxlr->mode))
> +		return 0;
> +
>   	data = (struct dev_dax_data) {
>   		.dax_region = dax_region,
>   		.id = -1,
>   		.size = range_len(&cxlr_dax->hpa_range),
>   	};
> +

Stray blank line?

>   	dev_dax = devm_create_dev_dax(&data);
>   	if (IS_ERR(dev_dax))
>   		return PTR_ERR(dev_dax);
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device
  2023-06-14 19:16 ` [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device ira.weiny
  2023-06-14 22:53   ` Dave Jiang
@ 2023-06-14 23:49   ` Alison Schofield
  2023-06-15 22:46     ` Ira Weiny
  2023-06-15 18:30   ` Fan Ni
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 55+ messages in thread
From: Alison Schofield @ 2023-06-14 23:49 UTC (permalink / raw)
  To: ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl

On Wed, Jun 14, 2023 at 12:16:28PM -0700, Ira Weiny wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Read the Dynamic capacity configuration and store dynamic capacity region
> information in the device state which driver will use to map into the HDM
> ranges.
> 
> Implement Get Dynamic Capacity Configuration (opcode 4800h) mailbox
> command as specified in CXL 3.0 spec section 8.2.9.8.9.1.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> 
> ---
> [iweiny: ensure all mds->dc_region's are named]
> ---
>  drivers/cxl/core/mbox.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/cxl/cxlmem.h    |  70 +++++++++++++++++-
>  drivers/cxl/pci.c       |   4 +
>  3 files changed, 256 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 3ca0bf12c55f..c5b696737c87 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -111,6 +111,37 @@ static u8 security_command_sets[] = {
>  	0x46, /* Security Passthrough */
>  };
>  
> +static bool cxl_is_dcd_command(u16 opcode)
> +{
> +#define CXL_MBOX_OP_DCD_CMDS 0x48
> +
> +	if ((opcode >> 8) == CXL_MBOX_OP_DCD_CMDS)
> +		return true;
> +
> +	return false;
> +}
> +
> +static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
> +					u16 opcode)
> +{
> +	switch (opcode) {
> +	case CXL_MBOX_OP_GET_DC_CONFIG:
> +		set_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
> +		set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_ADD_DC_RESPONSE:
> +		set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_RELEASE_DC:
> +		set_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
> +		break;
> +	default:
> +		break;
> +	}
> +}
> +
>  static bool cxl_is_security_command(u16 opcode)
>  {
>  	int i;
> @@ -666,6 +697,7 @@ static int cxl_xfer_log(struct cxl_memdev_state *mds, uuid_t *uuid,
>  static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>  {
>  	struct cxl_cel_entry *cel_entry;
> +	struct cxl_mem_command *cmd;
>  	const int cel_entries = size / sizeof(*cel_entry);
>  	struct device *dev = mds->cxlds.dev;
>  	int i;
> @@ -674,11 +706,12 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>  
>  	for (i = 0; i < cel_entries; i++) {
>  		u16 opcode = le16_to_cpu(cel_entry[i].opcode);
> -		struct cxl_mem_command *cmd = cxl_mem_find_command(opcode);
> +		cmd = cxl_mem_find_command(opcode);

Is the move of the 'cmd' define related to this patch?
Checkpatch warns on it: WARNING: Missing a blank line after declarations

>  
> -		if (!cmd && !cxl_is_poison_command(opcode)) {
> -			dev_dbg(dev,
> -				"Opcode 0x%04x unsupported by driver\n", opcode);
> +		if (!cmd && !cxl_is_poison_command(opcode) &&
> +		    !cxl_is_dcd_command(opcode)) {
> +			dev_dbg(dev, "Opcode 0x%04x unsupported by driver\n",
> +				opcode);
>  			continue;
>  		}
>  
> @@ -688,6 +721,9 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>  		if (cxl_is_poison_command(opcode))
>  			cxl_set_poison_cmd_enabled(&mds->poison, opcode);
>  
> +		if (cxl_is_dcd_command(opcode))
> +			cxl_set_dcd_cmd_enabled(mds, opcode);
> +
>  		dev_dbg(dev, "Opcode 0x%04x enabled\n", opcode);
>  	}
>  }
> @@ -1059,7 +1095,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
>  	if (rc < 0)
>  		return rc;
>  
> -	mds->total_bytes =
> +	mds->total_static_capacity =
>  		le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
>  	mds->volatile_only_bytes =
>  		le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
> @@ -1077,10 +1113,137 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
>  		mds->poison.max_errors = min_t(u32, val, CXL_POISON_LIST_MAX);
>  	}
>  
> +	mds->dc_event_log_size = le16_to_cpu(id.dc_event_log_size);
> +
>  	return 0;
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dev_state_identify, CXL);
>  
> +/**
> + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> + * information from the device.
> + * @mds: The memory device state
> + * Return: 0 if identify was executed successfully.
> + *
> + * This will dispatch the get_dynamic_capacity command to the device
> + * and on success populate structures to be exported to sysfs.
> + */
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> +{
> +	struct cxl_dev_state *cxlds = &mds->cxlds;
> +	struct device *dev = cxlds->dev;
> +	struct cxl_mbox_dynamic_capacity *dc;
> +	struct cxl_mbox_get_dc_config get_dc;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	u64 next_dc_region_start;
> +	int rc, i;
> +
> +	for (i = 0; i < CXL_MAX_DC_REGION; i++)
> +		sprintf(mds->dc_region[i].name, "dc%d", i);
> +
> +	/* Check GET_DC_CONFIG is supported by device */
> +	if (!test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds)) {
> +		dev_dbg(dev, "unsupported cmd: get_dynamic_capacity_config\n");
> +		return 0;
> +	}
> +
> +	dc = kvmalloc(mds->payload_size, GFP_KERNEL);
> +	if (!dc)
> +		return -ENOMEM;
> +
> +	get_dc = (struct cxl_mbox_get_dc_config) {
> +		.region_count = CXL_MAX_DC_REGION,
> +		.start_region_index = 0,
> +	};
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> +		.payload_in = &get_dc,
> +		.size_in = sizeof(get_dc),
> +		.size_out = mds->payload_size,
> +		.payload_out = dc,
> +		.min_out = 1,
> +	};
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	if (rc < 0)
> +		goto dc_error;
> +
> +	mds->nr_dc_region = dc->avail_region_count;
> +
> +	if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> +		dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> +			mds->nr_dc_region);
> +		rc = -EINVAL;
> +		goto dc_error;
> +	}
> +
> +	for (i = 0; i < mds->nr_dc_region; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> +		dcr->base = le64_to_cpu(dc->region[i].region_base);
> +		dcr->decode_len =
> +			le64_to_cpu(dc->region[i].region_decode_length);
> +		dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
> +		dcr->len = le64_to_cpu(dc->region[i].region_length);
> +		dcr->blk_size = le64_to_cpu(dc->region[i].region_block_size);
> +
> +		/* Check regions are in increasing DPA order */
> +		if ((i + 1) < mds->nr_dc_region) {
> +			next_dc_region_start =
> +				le64_to_cpu(dc->region[i + 1].region_base);
> +			if ((dcr->base > next_dc_region_start) ||
> +			    ((dcr->base + dcr->decode_len) > next_dc_region_start)) {
> +				dev_err(dev,
> +					"DPA ordering violation for DC region %d and %d\n",
> +					i, i + 1);
> +				rc = -EINVAL;
> +				goto dc_error;
> +			}
> +		}
> +
> +		/* Check the region is 256 MB aligned */
> +		if (!IS_ALIGNED(dcr->base, SZ_256M)) {
> +			dev_err(dev, "DC region %d not aligned to 256MB\n", i);
> +			rc = -EINVAL;
> +			goto dc_error;
> +		}
> +
> +		/* Check Region base and length are aligned to block size */
> +		if (!IS_ALIGNED(dcr->base, dcr->blk_size) ||
> +		    !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> +			dev_err(dev, "DC region %d not aligned to %#llx\n", i,
> +				dcr->blk_size);
> +			rc = -EINVAL;
> +			goto dc_error;
> +		}
> +
> +		dcr->dsmad_handle =
> +			le32_to_cpu(dc->region[i].region_dsmad_handle);
> +		dcr->flags = dc->region[i].flags;
> +		sprintf(dcr->name, "dc%d", i);
> +
> +		dev_dbg(dev,
> +			"DC region %s DPA: %#llx LEN: %#llx BLKSZ: %#llx\n",
> +			dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
> +	}
> +
> +	/*
> +	 * Calculate entire DPA range of all configured regions which will be mapped by
> +	 * one or more HDM decoders
> +	 */

Comment is needlessly going >80 chars.


> +	mds->total_dynamic_capacity =
> +		mds->dc_region[mds->nr_dc_region - 1].base +
> +		mds->dc_region[mds->nr_dc_region - 1].decode_len -
> +		mds->dc_region[0].base;
> +	dev_dbg(dev, "Total dynamic capacity: %#llx\n",
> +		mds->total_dynamic_capacity);
> +
> +dc_error:
> +	kvfree(dc);
> +	return rc;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> +
>  static int add_dpa_res(struct device *dev, struct resource *parent,
>  		       struct resource *res, resource_size_t start,
>  		       resource_size_t size, const char *type)
> @@ -1112,6 +1275,11 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>  	struct cxl_dev_state *cxlds = &mds->cxlds;
>  	struct device *dev = cxlds->dev;
>  	int rc;
> +	size_t untenanted_mem =
> +		mds->dc_region[0].base - mds->total_static_capacity;

Perhaps:
	size_t untenanted_mem;  (and put that in reverse x-tree order)

	untenanted_mem = mds->dc_region[0].base - mds->total_static_capacity;

> +
> +	mds->total_capacity = mds->total_static_capacity +
> +			untenanted_mem + mds->total_dynamic_capacity;
>  

Also, looking at this first patch with the long names, wondering if
there is an opportunity to (re-)define these fields in fewers chars.
Do we have to describe with 'total'? Is there a partial?

I guess I'll get to the defines further down...


>  	if (!cxlds->media_ready) {
>  		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> @@ -1121,13 +1289,23 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>  	}
>  
>  	cxlds->dpa_res =
> -		(struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
> +		(struct resource)DEFINE_RES_MEM(0, mds->total_capacity);
> +
> +	for (int i = 0; i < CXL_MAX_DC_REGION; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
> +				 dcr->base, dcr->decode_len, dcr->name);
> +		if (rc)
> +			return rc;
> +	}
>  
>  	if (mds->partition_align_bytes == 0) {
>  		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
>  				 mds->volatile_only_bytes, "ram");
>  		if (rc)
>  			return rc;
> +
>  		return add_dpa_res(dev, &cxlds->dpa_res, &cxlds->pmem_res,
>  				   mds->volatile_only_bytes,
>  				   mds->persistent_only_bytes, "pmem");
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 89e560ea14c0..9c0b2fa72bdd 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -239,6 +239,15 @@ struct cxl_event_state {
>  	struct mutex log_lock;
>  };
>  
> +/* Device enabled DCD commands */
> +enum dcd_cmd_enabled_bits {
> +	CXL_DCD_ENABLED_GET_CONFIG,
> +	CXL_DCD_ENABLED_GET_EXTENT_LIST,
> +	CXL_DCD_ENABLED_ADD_RESPONSE,
> +	CXL_DCD_ENABLED_RELEASE,
> +	CXL_DCD_ENABLED_MAX
> +};
> +
>  /* Device enabled poison commands */
>  enum poison_cmd_enabled_bits {
>  	CXL_POISON_ENABLED_LIST,
> @@ -284,6 +293,9 @@ enum cxl_devtype {
>  	CXL_DEVTYPE_CLASSMEM,
>  };
>  
> +#define CXL_MAX_DC_REGION 8
> +#define CXL_DC_REGION_SRTLEN 8
> +
>  /**
>   * struct cxl_dev_state - The driver device state
>   *
> @@ -300,6 +312,8 @@ enum cxl_devtype {
>   * @dpa_res: Overall DPA resource tree for the device
>   * @pmem_res: Active Persistent memory capacity configuration
>   * @ram_res: Active Volatile memory capacity configuration
> + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> + *          region
>   * @component_reg_phys: register base of component registers
>   * @info: Cached DVSEC information about the device.
>   * @serial: PCIe Device Serial Number
> @@ -315,6 +329,7 @@ struct cxl_dev_state {
>  	struct resource dpa_res;
>  	struct resource pmem_res;
>  	struct resource ram_res;
> +	struct resource dc_res[CXL_MAX_DC_REGION];
>  	resource_size_t component_reg_phys;
>  	u64 serial;
>  	enum cxl_devtype type;
> @@ -334,9 +349,12 @@ struct cxl_dev_state {
>   *                (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
>   * @mbox_mutex: Mutex to synchronize mailbox access.
>   * @firmware_version: Firmware version for the memory device.
> + * @dcd_cmds: List of DCD commands implemented by memory device
>   * @enabled_cmds: Hardware commands found enabled in CEL.
>   * @exclusive_cmds: Commands that are kernel-internal only
> - * @total_bytes: sum of all possible capacities
> + * @total_capacity: Sum of static and dynamic capacities
> + * @total_static_capacity: Sum of RAM and PMEM capacities
> + * @total_dynamic_capacity: Complete DPA range occupied by DC regions
>   * @volatile_only_bytes: hard volatile capacity
>   * @persistent_only_bytes: hard persistent capacity
>   * @partition_align_bytes: alignment size for partition-able capacity
> @@ -344,6 +362,10 @@ struct cxl_dev_state {
>   * @active_persistent_bytes: sum of hard + soft persistent
>   * @next_volatile_bytes: volatile capacity change pending device reset
>   * @next_persistent_bytes: persistent capacity change pending device reset
> + * @nr_dc_region: number of DC regions implemented in the memory device
> + * @dc_region: array containing info about the DC regions
> + * @dc_event_log_size: The number of events the device can store in the
> + * Dynamic Capacity Event Log before it overflows
>   * @event: event log driver state
>   * @poison: poison driver state info
>   * @mbox_send: @dev specific transport for transmitting mailbox commands
> @@ -357,9 +379,13 @@ struct cxl_memdev_state {
>  	size_t lsa_size;
>  	struct mutex mbox_mutex; /* Protects device mailbox and firmware */
>  	char firmware_version[0x10];
> +	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
>  	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
>  	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> -	u64 total_bytes;
> +
> +	u64 total_capacity;
> +	u64 total_static_capacity;
> +	u64 total_dynamic_capacity;

maybe cap, static_cap, dynamic_cap

(because I think I had a hand in defining the long names that
follow and deeply regret it ;))

>  	u64 volatile_only_bytes;
>  	u64 persistent_only_bytes;
>  	u64 partition_align_bytes;
> @@ -367,6 +393,20 @@ struct cxl_memdev_state {
>  	u64 active_persistent_bytes;
>  	u64 next_volatile_bytes;
>  	u64 next_persistent_bytes;
> +
> +	u8 nr_dc_region;
> +
> +	struct cxl_dc_region_info {
> +		u8 name[CXL_DC_REGION_SRTLEN];
> +		u64 base;
> +		u64 decode_len;
> +		u64 len;
> +		u64 blk_size;
> +		u32 dsmad_handle;
> +		u8 flags;
> +	} dc_region[CXL_MAX_DC_REGION];
> +
> +	size_t dc_event_log_size;
>  	struct cxl_event_state event;
>  	struct cxl_poison_state poison;
>  	int (*mbox_send)(struct cxl_memdev_state *mds,
> @@ -415,6 +455,10 @@ enum cxl_opcode {
>  	CXL_MBOX_OP_UNLOCK		= 0x4503,
>  	CXL_MBOX_OP_FREEZE_SECURITY	= 0x4504,
>  	CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE	= 0x4505,
> +	CXL_MBOX_OP_GET_DC_CONFIG	= 0x4800,
> +	CXL_MBOX_OP_GET_DC_EXTENT_LIST	= 0x4801,
> +	CXL_MBOX_OP_ADD_DC_RESPONSE	= 0x4802,
> +	CXL_MBOX_OP_RELEASE_DC		= 0x4803,
>  	CXL_MBOX_OP_MAX			= 0x10000
>  };
>  
> @@ -462,6 +506,7 @@ struct cxl_mbox_identify {
>  	__le16 inject_poison_limit;
>  	u8 poison_caps;
>  	u8 qos_telemetry_caps;
> +	__le16 dc_event_log_size;
>  } __packed;
>  
>  /*
> @@ -617,7 +662,27 @@ struct cxl_mbox_set_partition_info {
>  	u8 flags;
>  } __packed;
>  
> +struct cxl_mbox_get_dc_config {
> +	u8 region_count;
> +	u8 start_region_index;
> +} __packed;
> +
> +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> +struct cxl_mbox_dynamic_capacity {
> +	u8 avail_region_count;
> +	u8 rsvd[7];
> +	struct cxl_dc_region_config {
> +		__le64 region_base;
> +		__le64 region_decode_length;
> +		__le64 region_length;
> +		__le64 region_block_size;
> +		__le32 region_dsmad_handle;
> +		u8 flags;
> +		u8 rsvd[3];
> +	} __packed region[];
> +} __packed;
>  #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)

This ^ goes with the cxl_mbox_set_partition_info above.
Please don't split.

> +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
>  
>  /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
>  struct cxl_mbox_set_timestamp_in {
> @@ -742,6 +807,7 @@ enum {
>  int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
>  			  struct cxl_mbox_cmd *cmd);
>  int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
>  int cxl_await_media_ready(struct cxl_dev_state *cxlds);
>  int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
>  int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 4e2845b7331a..ac1a41bc083d 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -742,6 +742,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	if (rc)
>  		return rc;
>  
> +	rc = cxl_dev_dynamic_capacity_identify(mds);
> +	if (rc)
> +		return rc;
> +
>  	rc = cxl_mem_create_range_info(mds);
>  	if (rc)
>  		return rc;
> 
> -- 
> 2.40.0
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support.
  2023-06-14 19:16 ` [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support ira.weiny
  2023-06-14 23:37   ` Dave Jiang
@ 2023-06-15  0:21   ` Alison Schofield
  2023-06-16  2:06     ` Ira Weiny
  2023-06-16 16:51   ` Alison Schofield
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 55+ messages in thread
From: Alison Schofield @ 2023-06-15  0:21 UTC (permalink / raw)
  To: ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl

On Wed, Jun 14, 2023 at 12:16:29PM -0700, Ira Weiny wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> CXL devices optionally support dynamic capacity. CXL Regions must be
> created to access this capacity.
> 
> Add sysfs entries to create dynamic capacity cxl regions. Provide a new
> Dynamic Capacity decoder mode which targets dynamic capacity on devices
> which are added to that region.

This is a lot in one patch, especially where it weaves in and out of
existing code. I'm wondering if this can be introduced in smaller
pieces (patches). An introductory patch explaining the DC DPA 
allocations might be a useful chunk to pull forward. 

Alison

> 
> Below are the steps to create and delete dynamic capacity region0
> (example).
> 
>     region=$(cat /sys/bus/cxl/devices/decoder0.0/create_dc_region)
>     echo $region> /sys/bus/cxl/devices/decoder0.0/create_dc_region
>     echo 256 > /sys/bus/cxl/devices/$region/interleave_granularity
>     echo 1 > /sys/bus/cxl/devices/$region/interleave_ways
> 
>     echo "dc0" >/sys/bus/cxl/devices/decoder1.0/mode
>     echo 0x400000000 >/sys/bus/cxl/devices/decoder1.0/dpa_size
> 
>     echo 0x400000000 > /sys/bus/cxl/devices/$region/size
>     echo  "decoder1.0" > /sys/bus/cxl/devices/$region/target0
>     echo 1 > /sys/bus/cxl/devices/$region/commit
>     echo $region > /sys/bus/cxl/drivers/cxl_region/bind
> 
>     echo $region> /sys/bus/cxl/devices/decoder0.0/delete_region
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> 
> ---
> [iweiny: fixups]
> [iweiny: remove unused CXL_DC_REGION_MODE macro]
> [iweiny: Make dc_mode_to_region_index static]
> [iweiny: simplify <sysfs>/create_dc_region]
> [iweiny: introduce decoder_mode_is_dc]
> [djbw: fixups, no sign-off: preview only]
> ---
>  drivers/cxl/Kconfig       |  11 +++
>  drivers/cxl/core/core.h   |   7 ++
>  drivers/cxl/core/hdm.c    | 234 ++++++++++++++++++++++++++++++++++++++++++----
>  drivers/cxl/core/port.c   |  18 ++++
>  drivers/cxl/core/region.c | 135 ++++++++++++++++++++++++--
>  drivers/cxl/cxl.h         |  28 ++++++
>  drivers/dax/cxl.c         |   4 +
>  7 files changed, 409 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index ff4e78117b31..df034889d053 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -121,6 +121,17 @@ config CXL_REGION
>  
>  	  If unsure say 'y'
>  
> +config CXL_DCD
> +	bool "CXL: DCD Support"
> +	default CXL_BUS
> +	depends on CXL_REGION
> +	help
> +	  Enable the CXL core to provision CXL DCD regions.
> +	  CXL devices optionally support dynamic capacity and DCD region
> +	  maps the dynamic capacity regions DPA's into Host HPA ranges.
> +
> +	  If unsure say 'y'
> +
>  config CXL_REGION_INVALIDATION_TEST
>  	bool "CXL: Region Cache Management Bypass (TEST)"
>  	depends on CXL_REGION
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 27f0968449de..725700ab5973 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -9,6 +9,13 @@ extern const struct device_type cxl_nvdimm_type;
>  
>  extern struct attribute_group cxl_base_attribute_group;
>  
> +#ifdef CONFIG_CXL_DCD
> +extern struct device_attribute dev_attr_create_dc_region;
> +#define SET_CXL_DC_REGION_ATTR(x) (&dev_attr_##x.attr),
> +#else
> +#define SET_CXL_DC_REGION_ATTR(x)
> +#endif
> +
>  #ifdef CONFIG_CXL_REGION
>  extern struct device_attribute dev_attr_create_pmem_region;
>  extern struct device_attribute dev_attr_create_ram_region;
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 514d30131d92..29649b47d177 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -233,14 +233,23 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct resource *res = cxled->dpa_res;
>  	resource_size_t skip_start;
> +	resource_size_t skipped = cxled->skip;
>  
>  	lockdep_assert_held_write(&cxl_dpa_rwsem);
>  
>  	/* save @skip_start, before @res is released */
> -	skip_start = res->start - cxled->skip;
> +	skip_start = res->start - skipped;
>  	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
> -	if (cxled->skip)
> -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> +	if (cxled->skip != 0) {
> +		while (skipped != 0) {
> +			res = xa_load(&cxled->skip_res, skip_start);
> +			__release_region(&cxlds->dpa_res, skip_start,
> +							resource_size(res));
> +			xa_erase(&cxled->skip_res, skip_start);
> +			skip_start += resource_size(res);
> +			skipped -= resource_size(res);
> +			}
> +	}
>  	cxled->skip = 0;
>  	cxled->dpa_res = NULL;
>  	put_device(&cxled->cxld.dev);
> @@ -267,6 +276,19 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	__cxl_dpa_release(cxled);
>  }
>  
> +static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
> +{
> +	int index = 0;
> +
> +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> +		if (mode == i)
> +			return index;
> +		index++;
> +	}
> +
> +	return -EINVAL;
> +}
> +
>  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  			     resource_size_t base, resource_size_t len,
>  			     resource_size_t skipped)
> @@ -275,7 +297,11 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	struct cxl_port *port = cxled_to_port(cxled);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct device *dev = &port->dev;
> +	struct device *ed_dev = &cxled->cxld.dev;
> +	struct resource *dpa_res = &cxlds->dpa_res;
> +	resource_size_t skip_len = 0;
>  	struct resource *res;
> +	int rc, index;
>  
>  	lockdep_assert_held_write(&cxl_dpa_rwsem);
>  
> @@ -304,28 +330,119 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	}
>  
>  	if (skipped) {
> -		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> -				       dev_name(&cxled->cxld.dev), 0);
> -		if (!res) {
> -			dev_dbg(dev,
> -				"decoder%d.%d: failed to reserve skipped space\n",
> -				port->id, cxled->cxld.id);
> -			return -EBUSY;
> +		resource_size_t skip_base = base - skipped;
> +
> +		if (decoder_mode_is_dc(cxled->mode)) {
> +			if (resource_size(&cxlds->ram_res) &&
> +					skip_base <= cxlds->ram_res.end) {
> +				skip_len = cxlds->ram_res.end - skip_base + 1;
> +				res = __request_region(dpa_res, skip_base,
> +						skip_len, dev_name(ed_dev), 0);
> +				if (!res)
> +					goto error;
> +
> +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> +								GFP_KERNEL);
> +				skip_base += skip_len;
> +			}
> +
> +			if (resource_size(&cxlds->ram_res) &&
> +					skip_base <= cxlds->pmem_res.end) {
> +				skip_len = cxlds->pmem_res.end - skip_base + 1;
> +				res = __request_region(dpa_res, skip_base,
> +						skip_len, dev_name(ed_dev), 0);
> +				if (!res)
> +					goto error;
> +
> +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> +								GFP_KERNEL);
> +				skip_base += skip_len;
> +			}
> +
> +			index = dc_mode_to_region_index(cxled->mode);
> +			for (int i = 0; i <= index; i++) {
> +				struct resource *dcr = &cxlds->dc_res[i];
> +
> +				if (skip_base < dcr->start) {
> +					skip_len = dcr->start - skip_base;
> +					res = __request_region(dpa_res,
> +							skip_base, skip_len,
> +							dev_name(ed_dev), 0);
> +					if (!res)
> +						goto error;
> +
> +					rc = xa_insert(&cxled->skip_res, skip_base,
> +							res, GFP_KERNEL);
> +					skip_base += skip_len;
> +				}
> +
> +				if (skip_base == base) {
> +					dev_dbg(dev, "skip done!\n");
> +					break;
> +				}
> +
> +				if (resource_size(dcr) &&
> +						skip_base <= dcr->end) {
> +					if (skip_base > base)
> +						dev_err(dev, "Skip error\n");
> +
> +					skip_len = dcr->end - skip_base + 1;
> +					res = __request_region(dpa_res, skip_base,
> +							skip_len,
> +							dev_name(ed_dev), 0);
> +					if (!res)
> +						goto error;
> +
> +					rc = xa_insert(&cxled->skip_res, skip_base,
> +							res, GFP_KERNEL);
> +					skip_base += skip_len;
> +				}
> +			}
> +		} else	{
> +			res = __request_region(dpa_res, base - skipped, skipped,
> +							dev_name(ed_dev), 0);
> +			if (!res)
> +				goto error;
> +
> +			rc = xa_insert(&cxled->skip_res, skip_base, res,
> +								GFP_KERNEL);
>  		}
>  	}
> -	res = __request_region(&cxlds->dpa_res, base, len,
> -			       dev_name(&cxled->cxld.dev), 0);
> +
> +	res = __request_region(dpa_res, base, len, dev_name(ed_dev), 0);
>  	if (!res) {
>  		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
> -			port->id, cxled->cxld.id);
> -		if (skipped)
> -			__release_region(&cxlds->dpa_res, base - skipped,
> -					 skipped);
> +				port->id, cxled->cxld.id);
> +		if (skipped) {
> +			resource_size_t skip_base = base - skipped;
> +
> +			while (skipped != 0) {
> +				if (skip_base > base)
> +					dev_err(dev, "Skip error\n");
> +
> +				res = xa_load(&cxled->skip_res, skip_base);
> +				__release_region(dpa_res, skip_base,
> +							resource_size(res));
> +				xa_erase(&cxled->skip_res, skip_base);
> +				skip_base += resource_size(res);
> +				skipped -= resource_size(res);
> +			}
> +		}
>  		return -EBUSY;
>  	}
>  	cxled->dpa_res = res;
>  	cxled->skip = skipped;
>  
> +	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
> +		int index = dc_mode_to_region_index(mode);
> +
> +		if (resource_contains(&cxlds->dc_res[index], res)) {
> +			cxled->mode = mode;
> +			dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id,
> +				cxled->cxld.id, cxled->dpa_res, cxled->mode);
> +			goto success;
> +		}
> +	}
>  	if (resource_contains(&cxlds->pmem_res, res))
>  		cxled->mode = CXL_DECODER_PMEM;
>  	else if (resource_contains(&cxlds->ram_res, res))
> @@ -336,9 +453,16 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  		cxled->mode = CXL_DECODER_MIXED;
>  	}
>  
> +success:
>  	port->hdm_end++;
>  	get_device(&cxled->cxld.dev);
>  	return 0;
> +
> +error:
> +	dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space\n",
> +			port->id, cxled->cxld.id);
> +	return -EBUSY;
> +
>  }
>  
>  int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> @@ -429,6 +553,14 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  	switch (mode) {
>  	case CXL_DECODER_RAM:
>  	case CXL_DECODER_PMEM:
> +	case CXL_DECODER_DC0:
> +	case CXL_DECODER_DC1:
> +	case CXL_DECODER_DC2:
> +	case CXL_DECODER_DC3:
> +	case CXL_DECODER_DC4:
> +	case CXL_DECODER_DC5:
> +	case CXL_DECODER_DC6:
> +	case CXL_DECODER_DC7:
>  		break;
>  	default:
>  		dev_dbg(dev, "unsupported mode: %d\n", mode);
> @@ -456,6 +588,16 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  		goto out;
>  	}
>  
> +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> +		int index = dc_mode_to_region_index(i);
> +
> +		if (mode == i && !resource_size(&cxlds->dc_res[index])) {
> +			dev_dbg(dev, "no available dynamic capacity\n");
> +			rc = -ENXIO;
> +			goto out;
> +		}
> +	}
> +
>  	cxled->mode = mode;
>  	rc = 0;
>  out:
> @@ -469,10 +611,12 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>  					 resource_size_t *skip_out)
>  {
>  	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> -	resource_size_t free_ram_start, free_pmem_start;
> +	resource_size_t free_ram_start, free_pmem_start, free_dc_start;
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	struct device *dev = &cxled->cxld.dev;
>  	resource_size_t start, avail, skip;
>  	struct resource *p, *last;
> +	int index;
>  
>  	lockdep_assert_held(&cxl_dpa_rwsem);
>  
> @@ -490,6 +634,20 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>  	else
>  		free_pmem_start = cxlds->pmem_res.start;
>  
> +	/*
> +	 * One HDM Decoder per DC region to map memory with different
> +	 * DSMAS entry.
> +	 */
> +	index = dc_mode_to_region_index(cxled->mode);
> +	if (index >= 0) {
> +		if (cxlds->dc_res[index].child) {
> +			dev_err(dev, "Cannot allocated DPA from DC Region: %d\n",
> +					index);
> +			return -EINVAL;
> +		}
> +		free_dc_start = cxlds->dc_res[index].start;
> +	}
> +
>  	if (cxled->mode == CXL_DECODER_RAM) {
>  		start = free_ram_start;
>  		avail = cxlds->ram_res.end - start + 1;
> @@ -511,6 +669,29 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>  		else
>  			skip_end = start - 1;
>  		skip = skip_end - skip_start + 1;
> +	} else if (decoder_mode_is_dc(cxled->mode)) {
> +		resource_size_t skip_start, skip_end;
> +
> +		start = free_dc_start;
> +		avail = cxlds->dc_res[index].end - start + 1;
> +		if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
> +			skip_start = free_ram_start;
> +		else
> +			skip_start = free_pmem_start;
> +		/*
> +		 * If some dc region is already mapped, then that allocation
> +		 * already handled the RAM and PMEM skip.Check for DC region
> +		 * skip.
> +		 */
> +		for (int i = index - 1; i >= 0 ; i--) {
> +			if (cxlds->dc_res[i].child) {
> +				skip_start = cxlds->dc_res[i].child->end + 1;
> +				break;
> +			}
> +		}
> +
> +		skip_end = start - 1;
> +		skip = skip_end - skip_start + 1;
>  	} else {
>  		dev_dbg(cxled_dev(cxled), "mode not set\n");
>  		avail = 0;
> @@ -548,10 +729,25 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>  
>  	avail = cxl_dpa_freespace(cxled, &start, &skip);
>  
> +	dev_dbg(dev, "DPA Allocation start: %llx len: %llx Skip: %llx\n",
> +						start, size, skip);
>  	if (size > avail) {
> +		static const char * const names[] = {
> +			[CXL_DECODER_NONE] = "none",
> +			[CXL_DECODER_RAM] = "ram",
> +			[CXL_DECODER_PMEM] = "pmem",
> +			[CXL_DECODER_MIXED] = "mixed",
> +			[CXL_DECODER_DC0] = "dc0",
> +			[CXL_DECODER_DC1] = "dc1",
> +			[CXL_DECODER_DC2] = "dc2",
> +			[CXL_DECODER_DC3] = "dc3",
> +			[CXL_DECODER_DC4] = "dc4",
> +			[CXL_DECODER_DC5] = "dc5",
> +			[CXL_DECODER_DC6] = "dc6",
> +			[CXL_DECODER_DC7] = "dc7",
> +		};
>  		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
> -			cxled->mode == CXL_DECODER_RAM ? "ram" : "pmem",
> -			&avail);
> +			names[cxled->mode], &avail);
>  		rc = -ENOSPC;
>  		goto out;
>  	}
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 5e21b53362e6..a1a98aba24ed 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -195,6 +195,22 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
>  		mode = CXL_DECODER_PMEM;
>  	else if (sysfs_streq(buf, "ram"))
>  		mode = CXL_DECODER_RAM;
> +	else if (sysfs_streq(buf, "dc0"))
> +		mode = CXL_DECODER_DC0;
> +	else if (sysfs_streq(buf, "dc1"))
> +		mode = CXL_DECODER_DC1;
> +	else if (sysfs_streq(buf, "dc2"))
> +		mode = CXL_DECODER_DC2;
> +	else if (sysfs_streq(buf, "dc3"))
> +		mode = CXL_DECODER_DC3;
> +	else if (sysfs_streq(buf, "dc4"))
> +		mode = CXL_DECODER_DC4;
> +	else if (sysfs_streq(buf, "dc5"))
> +		mode = CXL_DECODER_DC5;
> +	else if (sysfs_streq(buf, "dc6"))
> +		mode = CXL_DECODER_DC6;
> +	else if (sysfs_streq(buf, "dc7"))
> +		mode = CXL_DECODER_DC7;
>  	else
>  		return -EINVAL;
>  
> @@ -296,6 +312,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
>  	&dev_attr_target_list.attr,
>  	SET_CXL_REGION_ATTR(create_pmem_region)
>  	SET_CXL_REGION_ATTR(create_ram_region)
> +	SET_CXL_DC_REGION_ATTR(create_dc_region)
>  	SET_CXL_REGION_ATTR(delete_region)
>  	NULL,
>  };
> @@ -1691,6 +1708,7 @@ struct cxl_endpoint_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
>  		return ERR_PTR(-ENOMEM);
>  
>  	cxled->pos = -1;
> +	xa_init(&cxled->skip_res);
>  	cxld = &cxled->cxld;
>  	rc = cxl_decoder_init(port, cxld);
>  	if (rc)	 {
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 543c4499379e..144232c8305e 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1733,7 +1733,7 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  	lockdep_assert_held_write(&cxl_region_rwsem);
>  	lockdep_assert_held_read(&cxl_dpa_rwsem);
>  
> -	if (cxled->mode != cxlr->mode) {
> +	if (decoder_mode_is_dc(cxlr->mode) && !decoder_mode_is_dc(cxled->mode)) {
>  		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
>  			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
>  		return -EINVAL;
> @@ -2211,6 +2211,14 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>  	switch (mode) {
>  	case CXL_DECODER_RAM:
>  	case CXL_DECODER_PMEM:
> +	case CXL_DECODER_DC0:
> +	case CXL_DECODER_DC1:
> +	case CXL_DECODER_DC2:
> +	case CXL_DECODER_DC3:
> +	case CXL_DECODER_DC4:
> +	case CXL_DECODER_DC5:
> +	case CXL_DECODER_DC6:
> +	case CXL_DECODER_DC7:
>  		break;
>  	default:
>  		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> @@ -2321,6 +2329,43 @@ static ssize_t create_ram_region_store(struct device *dev,
>  }
>  DEVICE_ATTR_RW(create_ram_region);
>  
> +static ssize_t store_dcN_region(struct cxl_root_decoder *cxlrd,
> +				const char *buf, enum cxl_decoder_mode mode,
> +				size_t len)
> +{
> +	struct cxl_region *cxlr;
> +	int rc, id;
> +
> +	rc = sscanf(buf, "region%d\n", &id);
> +	if (rc != 1)
> +		return -EINVAL;
> +
> +	cxlr = __create_region(cxlrd, id, mode, CXL_DECODER_HOSTMEM);
> +	if (IS_ERR(cxlr))
> +		return PTR_ERR(cxlr);
> +
> +	return len;
> +}
> +
> +static ssize_t create_dc_region_show(struct device *dev,
> +				     struct device_attribute *attr, char *buf)
> +{
> +	return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static ssize_t create_dc_region_store(struct device *dev,
> +				      struct device_attribute *attr,
> +				      const char *buf, size_t len)
> +{
> +	/*
> +	 * All DC regions use decoder mode DC0 as the region does not need the
> +	 * index information
> +	 */
> +	return store_dcN_region(to_cxl_root_decoder(dev), buf,
> +				CXL_DECODER_DC0, len);
> +}
> +DEVICE_ATTR_RW(create_dc_region);
> +
>  static ssize_t region_show(struct device *dev, struct device_attribute *attr,
>  			   char *buf)
>  {
> @@ -2799,6 +2844,61 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
>  	return rc;
>  }
>  
> +static void cxl_dc_region_release(void *data)
> +{
> +	struct cxl_region *cxlr = data;
> +	struct cxl_dc_region *cxlr_dc = cxlr->cxlr_dc;
> +
> +	xa_destroy(&cxlr_dc->dax_dev_list);
> +	kfree(cxlr_dc);
> +}
> +
> +static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
> +{
> +	struct cxl_dc_region *cxlr_dc;
> +	struct cxl_dax_region *cxlr_dax;
> +	struct device *dev;
> +	int rc = 0;
> +
> +	cxlr_dax = cxl_dax_region_alloc(cxlr);
> +	if (IS_ERR(cxlr_dax))
> +		return PTR_ERR(cxlr_dax);
> +
> +	cxlr_dc = kzalloc(sizeof(*cxlr_dc), GFP_KERNEL);
> +	if (!cxlr_dc) {
> +		rc = -ENOMEM;
> +		goto err;
> +	}
> +
> +	dev = &cxlr_dax->dev;
> +	rc = dev_set_name(dev, "dax_region%d", cxlr->id);
> +	if (rc)
> +		goto err;
> +
> +	rc = device_add(dev);
> +	if (rc)
> +		goto err;
> +
> +	dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
> +		dev_name(dev));
> +
> +	rc = devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
> +					cxlr_dax);
> +	if (rc)
> +		goto err;
> +
> +	cxlr_dc->cxlr_dax = cxlr_dax;
> +	xa_init(&cxlr_dc->dax_dev_list);
> +	cxlr->cxlr_dc = cxlr_dc;
> +	rc = devm_add_action_or_reset(&cxlr->dev, cxl_dc_region_release, cxlr);
> +	if (!rc)
> +		return 0;
> +err:
> +	put_device(dev);
> +	kfree(cxlr_dc);
> +	return rc;
> +}
> +
>  static int match_decoder_by_range(struct device *dev, void *data)
>  {
>  	struct range *r1, *r2 = data;
> @@ -3140,6 +3240,19 @@ static int is_system_ram(struct resource *res, void *arg)
>  	return 1;
>  }
>  
> +/*
> + * The region can not be manged by CXL if any portion of
> + * it is already online as 'System RAM'
> + */
> +static bool region_is_system_ram(struct cxl_region *cxlr,
> +				 struct cxl_region_params *p)
> +{
> +	return (walk_iomem_res_desc(IORES_DESC_NONE,
> +				    IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> +				    p->res->start, p->res->end, cxlr,
> +				    is_system_ram) > 0);
> +}
> +
>  static int cxl_region_probe(struct device *dev)
>  {
>  	struct cxl_region *cxlr = to_cxl_region(dev);
> @@ -3174,14 +3287,7 @@ static int cxl_region_probe(struct device *dev)
>  	case CXL_DECODER_PMEM:
>  		return devm_cxl_add_pmem_region(cxlr);
>  	case CXL_DECODER_RAM:
> -		/*
> -		 * The region can not be manged by CXL if any portion of
> -		 * it is already online as 'System RAM'
> -		 */
> -		if (walk_iomem_res_desc(IORES_DESC_NONE,
> -					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> -					p->res->start, p->res->end, cxlr,
> -					is_system_ram) > 0)
> +		if (region_is_system_ram(cxlr, p))
>  			return 0;
>  
>  		/*
> @@ -3193,6 +3299,17 @@ static int cxl_region_probe(struct device *dev)
>  
>  		/* HDM-H routes to device-dax */
>  		return devm_cxl_add_dax_region(cxlr);
> +	case CXL_DECODER_DC0:
> +	case CXL_DECODER_DC1:
> +	case CXL_DECODER_DC2:
> +	case CXL_DECODER_DC3:
> +	case CXL_DECODER_DC4:
> +	case CXL_DECODER_DC5:
> +	case CXL_DECODER_DC6:
> +	case CXL_DECODER_DC7:
> +		if (region_is_system_ram(cxlr, p))
> +			return 0;
> +		return devm_cxl_add_dc_region(cxlr);
>  	default:
>  		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
>  			cxlr->mode);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 8400af85d99f..7ac1237938b7 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -335,6 +335,14 @@ enum cxl_decoder_mode {
>  	CXL_DECODER_NONE,
>  	CXL_DECODER_RAM,
>  	CXL_DECODER_PMEM,
> +	CXL_DECODER_DC0,
> +	CXL_DECODER_DC1,
> +	CXL_DECODER_DC2,
> +	CXL_DECODER_DC3,
> +	CXL_DECODER_DC4,
> +	CXL_DECODER_DC5,
> +	CXL_DECODER_DC6,
> +	CXL_DECODER_DC7,
>  	CXL_DECODER_MIXED,
>  	CXL_DECODER_DEAD,
>  };
> @@ -345,6 +353,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  		[CXL_DECODER_NONE] = "none",
>  		[CXL_DECODER_RAM] = "ram",
>  		[CXL_DECODER_PMEM] = "pmem",
> +		[CXL_DECODER_DC0] = "dc0",
> +		[CXL_DECODER_DC1] = "dc1",
> +		[CXL_DECODER_DC2] = "dc2",
> +		[CXL_DECODER_DC3] = "dc3",
> +		[CXL_DECODER_DC4] = "dc4",
> +		[CXL_DECODER_DC5] = "dc5",
> +		[CXL_DECODER_DC6] = "dc6",
> +		[CXL_DECODER_DC7] = "dc7",
>  		[CXL_DECODER_MIXED] = "mixed",
>  	};
>  
> @@ -353,6 +369,11 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  	return "mixed";
>  }
>  
> +static inline bool decoder_mode_is_dc(enum cxl_decoder_mode mode)
> +{
> +	return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
> +}
> +
>  /*
>   * Track whether this decoder is reserved for region autodiscovery, or
>   * free for userspace provisioning.
> @@ -375,6 +396,7 @@ struct cxl_endpoint_decoder {
>  	struct cxl_decoder cxld;
>  	struct resource *dpa_res;
>  	resource_size_t skip;
> +	struct xarray skip_res;
>  	enum cxl_decoder_mode mode;
>  	enum cxl_decoder_state state;
>  	int pos;
> @@ -475,6 +497,11 @@ struct cxl_region_params {
>   */
>  #define CXL_REGION_F_AUTO 1
>  
> +struct cxl_dc_region {
> +	struct xarray dax_dev_list;
> +	struct cxl_dax_region *cxlr_dax;
> +};
> +
>  /**
>   * struct cxl_region - CXL region
>   * @dev: This region's device
> @@ -493,6 +520,7 @@ struct cxl_region {
>  	enum cxl_decoder_type type;
>  	struct cxl_nvdimm_bridge *cxl_nvb;
>  	struct cxl_pmem_region *cxlr_pmem;
> +	struct cxl_dc_region *cxlr_dc;
>  	unsigned long flags;
>  	struct cxl_region_params params;
>  };
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index ccdf8de85bd5..eb5eb81bfbd7 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -23,11 +23,15 @@ static int cxl_dax_region_probe(struct device *dev)
>  	if (!dax_region)
>  		return -ENOMEM;
>  
> +	if (decoder_mode_is_dc(cxlr->mode))
> +		return 0;
> +
>  	data = (struct dev_dax_data) {
>  		.dax_region = dax_region,
>  		.id = -1,
>  		.size = range_len(&cxlr_dax->hpa_range),
>  	};
> +
>  	dev_dax = devm_create_dev_dax(&data);
>  	if (IS_ERR(dev_dax))
>  		return PTR_ERR(dev_dax);
> 
> -- 
> 2.40.0
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/5] cxl/mem : Expose dynamic capacity configuration to userspace
  2023-06-14 19:16 ` [PATCH 3/5] cxl/mem : Expose dynamic capacity configuration to userspace ira.weiny
@ 2023-06-15  0:40   ` Alison Schofield
  2023-06-16  2:47     ` Ira Weiny
  2023-06-15 15:41   ` Dave Jiang
  1 sibling, 1 reply; 55+ messages in thread
From: Alison Schofield @ 2023-06-15  0:40 UTC (permalink / raw)
  To: ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl

On Wed, Jun 14, 2023 at 12:16:30PM -0700, Ira Weiny wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Exposing driver cached dynamic capacity configuration through sysfs
> attributes.User will create one or more dynamic capacity
> cxl regions based on this information and map the dynamic capacity of
> the device into HDM ranges using one or more HDM decoders.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> 
> ---
> [iweiny: fixups]
> [djbw: fixups, no sign-off: preview only]
> ---
>  drivers/cxl/core/memdev.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 72 insertions(+)

Add the documentation of these new attributes in this patch.
Documentation/ABI/testing/sysfs-bus-cxl

A bit of my ignorance here, but when I keep seeing the word
'regions' below, it makes me wonder whether these attributes
are in the right place?

> 
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index 5d1ba7a72567..beeb5fa3a0aa 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -99,6 +99,20 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
>  static struct device_attribute dev_attr_pmem_size =
>  	__ATTR(size, 0444, pmem_size_show, NULL);
>  
> +static ssize_t dc_regions_count_show(struct device *dev, struct device_attribute *attr,
> +		char *buf)
> +{
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +	int len = 0;
> +
> +	len = sysfs_emit(buf, "0x%x\n", mds->nr_dc_region);

Prefer using this notation: %#llx
grep for the sysfs_emit's to see customary usage.

> +	return len;
> +}
> +
> +struct device_attribute dev_attr_dc_regions_count =
> +	__ATTR(dc_regions_count, 0444, dc_regions_count_show, NULL);
> +
>  static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
>  			   char *buf)
>  {
> @@ -362,6 +376,57 @@ static struct attribute *cxl_memdev_ram_attributes[] = {
>  	NULL,
>  };
>  
> +static ssize_t show_size_regionN(struct cxl_memdev *cxlmd, char *buf, int pos)
> +{
> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +
> +	return sysfs_emit(buf, "0x%llx\n", mds->dc_region[pos].decode_len);
> +}
> +
> +#define SIZE_ATTR_RO(n)                                              \
> +static ssize_t dc##n##_size_show(                                       \
> +	struct device *dev, struct device_attribute *attr, char *buf)  \
> +{                                                                      \
> +	return show_size_regionN(to_cxl_memdev(dev), buf, (n));             \
> +}                                                                      \
> +static DEVICE_ATTR_RO(dc##n##_size)
> +SIZE_ATTR_RO(0);
> +SIZE_ATTR_RO(1);
> +SIZE_ATTR_RO(2);
> +SIZE_ATTR_RO(3);
> +SIZE_ATTR_RO(4);
> +SIZE_ATTR_RO(5);
> +SIZE_ATTR_RO(6);
> +SIZE_ATTR_RO(7);
> +
> +static struct attribute *cxl_memdev_dc_attributes[] = {
> +	&dev_attr_dc0_size.attr,
> +	&dev_attr_dc1_size.attr,
> +	&dev_attr_dc2_size.attr,
> +	&dev_attr_dc3_size.attr,
> +	&dev_attr_dc4_size.attr,
> +	&dev_attr_dc5_size.attr,
> +	&dev_attr_dc6_size.attr,
> +	&dev_attr_dc7_size.attr,
> +	&dev_attr_dc_regions_count.attr,
> +	NULL,
> +};
> +
> +static umode_t cxl_dc_visible(struct kobject *kobj, struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +
> +	if (a == &dev_attr_dc_regions_count.attr)
> +		return a->mode;
> +
> +	if (n < mds->nr_dc_region)
> +		return a->mode;
> +
> +	return 0;
> +}
> +
>  static umode_t cxl_memdev_visible(struct kobject *kobj, struct attribute *a,
>  				  int n)
>  {
> @@ -385,10 +450,17 @@ static struct attribute_group cxl_memdev_pmem_attribute_group = {
>  	.attrs = cxl_memdev_pmem_attributes,
>  };
>  
> +static struct attribute_group cxl_memdev_dc_attribute_group = {
> +	.name = "dc",
> +	.attrs = cxl_memdev_dc_attributes,
> +	.is_visible = cxl_dc_visible,
> +};
> +
>  static const struct attribute_group *cxl_memdev_attribute_groups[] = {
>  	&cxl_memdev_attribute_group,
>  	&cxl_memdev_ram_attribute_group,
>  	&cxl_memdev_pmem_attribute_group,
> +	&cxl_memdev_dc_attribute_group,
>  	NULL,
>  };
>  
> 
> -- 
> 2.40.0
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD)
  2023-06-14 19:16 [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD) ira.weiny
                   ` (4 preceding siblings ...)
  2023-06-14 19:16 ` [PATCH 5/5] cxl/mem: Trace Dynamic capacity Event Record ira.weiny
@ 2023-06-15  0:56 ` Alison Schofield
  2023-06-16  2:57   ` Ira Weiny
  2023-06-15 14:51 ` Ira Weiny
  2023-06-29 15:30 ` Ira Weiny
  7 siblings, 1 reply; 55+ messages in thread
From: Alison Schofield @ 2023-06-15  0:56 UTC (permalink / raw)
  To: ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl

On Wed, Jun 14, 2023 at 12:16:27PM -0700, Ira Weiny wrote:

Is there a repo you can share?
If not, how about a recipe for applying these to cxl/next?
(Not trying to run, just want to load and view)

Thanks!

> I'm submitting these on behalf of Navneet.  There was a round of
> internal discussion which left a few questions but we want to get the
> public discussion going.  A first public preview was posted by Dan.[1]
> 
> The series has been rebased on the type-2 work posted from Dan.[2]  As
> discussed in the community call, not all of that series is required for
> these patches.  This will get rebased on the subset of those patches he
> is targeting for 6.5.  The series was tested using Fan Ni's Qemu DCD
> series.[3]
> 
> [cover letter]
> 
> A Dynamic Capacity Device (DCD) (CXL 3.0 spec 9.13.3) is a CXL memory
> device that implements dynamic capacity.  Dynamic capacity feature
> allows memory capacity to change dynamically, without the need for
> resetting the device.
> 
> Provide initial patches to enable DCD on non interleaving regions.
> Details:
> 
> - Get the dynamic capacity region information from cxl device and add
>   the advertised DC memory to driver managed resources
> - Get the device dynamic capacity extent list from the device and
>   maintain it in the host and add the preallocated memory to the host
> - Dynamic capacity region support
> - DCD region provisioning via Dax
> - Dynamic capacity event records
>         a. Add capacity Events
> 	b. Release capacity events
> 	c. Add the memory to the host dc region
> 	d. Release the memory from the host dc region
> - Trace Dynamic Capacity events
> - Send add capacity response to device
> - Send release dynamic capacity to device
> 
> Cc: Navneet Singh <navneet.singh@intel.com>
> Cc: Fan Ni <fan.ni@samsung.com>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: linux-cxl@vger.kernel.org
> 
> [1] https://lore.kernel.org/all/64326437c1496_934b2949f@dwillia2-mobl3.amr.corp.intel.com.notmuch/
> [2] https://lore.kernel.org/all/168592149709.1948938.8663425987110396027.stgit@dwillia2-xfh.jf.intel.com/
> [3] https://lore.kernel.org/all/6483946e8152f_f1132294a2@iweiny-mobl.notmuch/
> 
> ---
> Navneet Singh (5):
>       cxl/mem : Read Dynamic capacity configuration from the device
>       cxl/region: Add dynamic capacity cxl region support.
>       cxl/mem : Expose dynamic capacity configuration to userspace
>       cxl/mem: Add support to handle DCD add and release capacity events.
>       cxl/mem: Trace Dynamic capacity Event Record
> 
>  drivers/cxl/Kconfig       |  11 +
>  drivers/cxl/core/core.h   |   7 +
>  drivers/cxl/core/hdm.c    | 234 ++++++++++++++++++--
>  drivers/cxl/core/mbox.c   | 540 +++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/cxl/core/memdev.c |  72 +++++++
>  drivers/cxl/core/port.c   |  18 ++
>  drivers/cxl/core/region.c | 337 ++++++++++++++++++++++++++++-
>  drivers/cxl/core/trace.h  |  68 +++++-
>  drivers/cxl/cxl.h         |  32 ++-
>  drivers/cxl/cxlmem.h      | 146 ++++++++++++-
>  drivers/cxl/pci.c         |  14 +-
>  drivers/dax/bus.c         |  11 +-
>  drivers/dax/bus.h         |   5 +-
>  drivers/dax/cxl.c         |   4 +
>  14 files changed, 1453 insertions(+), 46 deletions(-)
> ---
> base-commit: 034a16d0165be3e092d60685be7b1b05e6f3059b
> change-id: 20230604-dcd-type2-upstream-0cd15f6216fd
> 
> Best regards,
> -- 
> Ira Weiny <ira.weiny@intel.com>
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/5] cxl/mem: Add support to handle DCD add and release capacity events.
  2023-06-14 19:16 ` [PATCH 4/5] cxl/mem: Add support to handle DCD add and release capacity events ira.weiny
@ 2023-06-15  2:19   ` Alison Schofield
  2023-06-16  4:11     ` Ira Weiny
  2023-06-15 16:58   ` Dave Jiang
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 55+ messages in thread
From: Alison Schofield @ 2023-06-15  2:19 UTC (permalink / raw)
  To: ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl

On Wed, Jun 14, 2023 at 12:16:31PM -0700, Ira Weiny wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> A dynamic capacity device utilizes events to signal the host about the
> changes to the allocation of DC blocks. The device communicates the
> state of these blocks of dynamic capacity through an extent list that
> describes the starting DPA and length of all blocks the host can access.
> 
> Based on the dynamic capacity add or release event type,
> dynamic memory represented by the extents are either added
> or removed as devdax device.

Nice commit msg, please align second paragraph w first.

> 
> Process the dynamic capacity add and release events.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> 
> ---
> [iweiny: Remove invalid comment]
> ---
>  drivers/cxl/core/mbox.c   | 345 +++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/cxl/core/region.c | 214 +++++++++++++++++++++++++++-
>  drivers/cxl/core/trace.h  |   3 +-
>  drivers/cxl/cxl.h         |   4 +-
>  drivers/cxl/cxlmem.h      |  76 ++++++++++
>  drivers/cxl/pci.c         |  10 +-
>  drivers/dax/bus.c         |  11 +-
>  drivers/dax/bus.h         |   5 +-
>  8 files changed, 652 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index c5b696737c87..db9295216de5 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -767,6 +767,14 @@ static const uuid_t log_uuid[] = {
>  	[VENDOR_DEBUG_UUID] = DEFINE_CXL_VENDOR_DEBUG_UUID,
>  };
>  
> +/* See CXL 3.0 8.2.9.2.1.5 */
> +enum dc_event {
> +	ADD_CAPACITY,
> +	RELEASE_CAPACITY,
> +	FORCED_CAPACITY_RELEASE,
> +	REGION_CONFIGURATION_UPDATED,
> +};
> +
>  /**
>   * cxl_enumerate_cmds() - Enumerate commands for a device.
>   * @mds: The driver data for the operation
> @@ -852,6 +860,14 @@ static const uuid_t mem_mod_event_uuid =
>  	UUID_INIT(0xfe927475, 0xdd59, 0x4339,
>  		  0xa5, 0x86, 0x79, 0xba, 0xb1, 0x13, 0xb7, 0x74);
>  
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.0 section 8.2.9.2.1.3; Table 8-45
> + */
> +static const uuid_t dc_event_uuid =
> +	UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c,
> +		0x2f, 0x95, 0x26, 0x8e, 0x10, 0x1a, 0x2a);
> +
>  static void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>  				   enum cxl_event_log_type type,
>  				   struct cxl_event_record_raw *record)
> @@ -945,6 +961,188 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
>  	return rc;
>  }
>  
> +static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
> +				struct cxl_mbox_dc_response *res,
> +				int extent_cnt, int opcode)
> +{
> +	struct cxl_mbox_cmd mbox_cmd;
> +	int rc, size;
> +
> +	size = struct_size(res, extent_list, extent_cnt);
> +	res->extent_list_size = cpu_to_le32(extent_cnt);
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = opcode,
> +		.size_in = size,
> +		.payload_in = res,
> +	};
> +
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +
> +	return rc;
> +
> +}
> +
> +static int cxl_prepare_ext_list(struct cxl_mbox_dc_response **res,
> +					int *n, struct range *extent)
> +{
> +	struct cxl_mbox_dc_response *dc_res;
> +	unsigned int size;
> +
> +	if (!extent)
> +		size = struct_size(dc_res, extent_list, 0);
> +	else
> +		size = struct_size(dc_res, extent_list, *n + 1);
> +
> +	dc_res = krealloc(*res, size, GFP_KERNEL);
> +	if (!dc_res)
> +		return -ENOMEM;
> +
> +	if (extent) {
> +		dc_res->extent_list[*n].dpa_start = cpu_to_le64(extent->start);
> +		memset(dc_res->extent_list[*n].reserved, 0, 8);
> +		dc_res->extent_list[*n].length = 
> +				cpu_to_le64(range_len(extent));

Unnecessary return. I think that fits in 80 columns.

> +		(*n)++;
> +	}
> +
> +	*res = dc_res;
> +	return 0;
> +}
> +/**
> + * cxl_handle_dcd_event_records() - Read DCD event records.
> + * @mds: The memory device state
> + *
> + * Returns 0 if enumerate completed successfully.
> + *
> + * CXL devices can generate DCD events to add or remove extents in the list.
> + */

That's a kernel doc comment, so maybe can be clearer.
It's called 'handle', so 'Read DCD event records' seems like a mismatch.
Probably needs more explaining.


> +static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
> +					struct cxl_event_record_raw *rec)
> +{
> +	struct cxl_mbox_dc_response *dc_res = NULL;
> +	struct device *dev = mds->cxlds.dev;
> +	uuid_t *id = &rec->hdr.id;
> +	struct dcd_event_dyn_cap *record =
> +			(struct dcd_event_dyn_cap *)rec;
> +	int extent_cnt = 0, rc = 0;
> +	struct cxl_dc_extent_data *extent;
> +	struct range alloc_range, rel_range;
> +	resource_size_t dpa, size;
> +

Please reverse x-tree. And if things like that *record can't fit within
80 columns and in reverse x-tree order, then assign it afterwards.


> +	if (!uuid_equal(id, &dc_event_uuid))
> +		return -EINVAL;
> +
> +	switch (record->data.event_type) {

Maybe a local for record->data.extent that is used repeatedly below,
or,
Perhaps pull the length and dpa local defines you made down in the
RELEASE_CAPACITY up here and share them with ADD_CAPACITY. That'll
reduce the le65_to_cpu noise. Add similar for shared_extn_seq.


> +	case ADD_CAPACITY:
> +		extent = devm_kzalloc(dev, sizeof(*extent), GFP_ATOMIC);
> +		if (!extent)
> +			return -ENOMEM;
> +
> +		extent->dpa_start = le64_to_cpu(record->data.extent.start_dpa);
> +		extent->length = le64_to_cpu(record->data.extent.length);
> +		memcpy(extent->tag, record->data.extent.tag,
> +				sizeof(record->data.extent.tag));
> +		extent->shared_extent_seq =
> +			le16_to_cpu(record->data.extent.shared_extn_seq);
> +		dev_dbg(dev, "Add DC extent DPA:0x%llx LEN:%llx\n",
> +					extent->dpa_start, extent->length);
> +		alloc_range = (struct range) {
> +			.start = extent->dpa_start,
> +			.end = extent->dpa_start + extent->length - 1,
> +		};
> +
> +		rc = cxl_add_dc_extent(mds, &alloc_range);
> +		if (rc < 0) {

How about 
		if (rc >=)
			goto insert;

Then you can remove this level of indent.

> +			dev_dbg(dev, "unconsumed DC extent DPA:0x%llx LEN:%llx\n",
> +					extent->dpa_start, extent->length);
> +			rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, NULL);
> +			if (rc < 0) {
> +				dev_err(dev, "Couldn't create extent list %d\n",
> +									rc);
> +				devm_kfree(dev, extent);
> +				return rc;
> +			}
> +
> +			rc = cxl_send_dc_cap_response(mds, dc_res,
> +					extent_cnt, CXL_MBOX_OP_ADD_DC_RESPONSE);
> +			if (rc < 0) {
> +				devm_kfree(dev, extent);
> +				goto out;
> +			}
> +
> +			kfree(dc_res);
> +			devm_kfree(dev, extent);
> +
> +			return 0;
> +		}

insert:

> +
> +		rc = xa_insert(&mds->dc_extent_list, extent->dpa_start, extent,
> +				GFP_KERNEL);
> +		if (rc < 0)
> +			goto out;
> +
> +		mds->num_dc_extents++;
> +		rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, &alloc_range);
> +		if (rc < 0) {
> +			dev_err(dev, "Couldn't create extent list %d\n", rc);
> +			return rc;
> +		}
> +
> +		rc = cxl_send_dc_cap_response(mds, dc_res, extent_cnt,
> +					      CXL_MBOX_OP_ADD_DC_RESPONSE);
> +		if (rc < 0)
> +			goto out;
> +
> +		break;
> +
> +	case RELEASE_CAPACITY:
> +		dpa = le64_to_cpu(record->data.extent.start_dpa);
> +		size = le64_to_cpu(record->data.extent.length);

^^ do these sooner and share

> +		dev_dbg(dev, "Release DC extents DPA:0x%llx LEN:%llx\n",
> +				dpa, size);
> +		extent = xa_load(&mds->dc_extent_list, dpa);
> +		if (!extent) {
> +			dev_err(dev, "No extent found with DPA:0x%llx\n", dpa);
> +			return -EINVAL;
> +		}
> +
> +		rel_range = (struct range) {
> +			.start = dpa,
> +			.end = dpa + size - 1,
> +		};
> +
> +		rc = cxl_release_dc_extent(mds, &rel_range);
> +		if (rc < 0) {
> +			dev_dbg(dev, "withhold DC extent DPA:0x%llx LEN:%llx\n",
> +									dpa, size);
> +			return 0;
> +		}
> +
> +		xa_erase(&mds->dc_extent_list, dpa);
> +		devm_kfree(dev, extent);
> +		mds->num_dc_extents--;
> +		rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, &rel_range);
> +		if (rc < 0) {
> +			dev_err(dev, "Couldn't create extent list %d\n", rc);
> +			return rc;
> +		}
> +
> +		rc = cxl_send_dc_cap_response(mds, dc_res, extent_cnt,
> +					      CXL_MBOX_OP_RELEASE_DC);
> +		if (rc < 0)
> +			goto out;
> +
> +		break;
> +
> +	default:
> +		return -EINVAL;
> +	}
> +out:

The out seems needless. Replace all 'goto out''s  with 'break'

I'm also a bit concerned about all the direct returns above.
Can this be the single exit point?  kfree of a NULL ptr is OK.
Maybe a bit more logic here to do that devm_free is all that
is needed.


> +	kfree(dc_res);
> +	return rc;
> +}
> +
>  static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
>  				    enum cxl_event_log_type type)
>  {
> @@ -982,9 +1180,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
>  		if (!nr_rec)
>  			break;
>  
> -		for (i = 0; i < nr_rec; i++)
> +		for (i = 0; i < nr_rec; i++) {
>  			cxl_event_trace_record(cxlmd, type,
>  					       &payload->records[i]);
> +			if (type == CXL_EVENT_TYPE_DCD) {
> +				rc = cxl_handle_dcd_event_records(mds,
> +						&payload->records[i]);
> +				if (rc)
> +					dev_err_ratelimited(dev,
> +						"dcd event failed: %d\n", rc);
> +			}


Reduce indent option:

			if (type != CXL_EVENT_TYPE_DCD)
				continue;

			rc = cxl_handle_dcd_event_records(mds,
							  &payload->records[i]);			if (rc)
				dev_err_ratelimited(dev,
						    "dcd event failed: %d\n", rc);

I don't know where 'cxl_handle_dcd_event_records() was introduce,
but I'm wondering now if it can have a short name.

> +		}
>  
>  		if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
>  			trace_cxl_overflow(cxlmd, type, payload);
> @@ -1024,6 +1230,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
>  		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_WARN);
>  	if (status & CXLDEV_EVENT_STATUS_INFO)
>  		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_INFO);
> +	if (status & CXLDEV_EVENT_STATUS_DCD)
> +		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_mem_get_event_records, CXL);
>  
> @@ -1244,6 +1452,140 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
>  
> +int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> +			      unsigned int *extent_gen_num)
> +{
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_mbox_dc_extents *dc_extents;
> +	struct cxl_mbox_get_dc_extent get_dc_extent;
> +	unsigned int total_extent_cnt;

Seems 'count' would probably suffice here.

> +	struct cxl_mbox_cmd mbox_cmd;
> +	int rc;

Above - reverse x-tree please.

> +
> +	/* Check GET_DC_EXTENT_LIST is supported by device */
> +	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
> +		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
> +		return 0;
> +	}
> +
> +	dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);
> +	if (!dc_extents)
> +		return -ENOMEM;
> +
> +	get_dc_extent = (struct cxl_mbox_get_dc_extent) {
> +		.extent_cnt = 0,
> +		.start_extent_index = 0,
> +	};
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> +		.payload_in = &get_dc_extent,
> +		.size_in = sizeof(get_dc_extent),
> +		.size_out = mds->payload_size,
> +		.payload_out = dc_extents,
> +		.min_out = 1,
> +	};
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	if (rc < 0)
> +		goto out;
> +
> +	total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> +	*extent_gen_num = le32_to_cpu(dc_extents->extent_list_num);
> +	dev_dbg(dev, "Total extent count :%d Extent list Generation Num: %d\n",
> +			total_extent_cnt, *extent_gen_num);
> +out:
> +
> +	kvfree(dc_extents);
> +	if (rc < 0)
> +		return rc;
> +
> +	return total_extent_cnt;
> +
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_get_dc_extent_cnt, CXL);
> +
> +int cxl_dev_get_dc_extents(struct cxl_memdev_state *mds,
> +			   unsigned int index, unsigned int cnt)
> +{
> +	/* See CXL 3.0 Table 125 dynamic capacity config  Output Payload */
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_mbox_dc_extents *dc_extents;
> +	struct cxl_mbox_get_dc_extent get_dc_extent;
> +	unsigned int extent_gen_num, available_extents, total_extent_cnt;
> +	int rc;
> +	struct cxl_dc_extent_data *extent;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	struct range alloc_range;
> +

Reverse x-tree please.

> +	/* Check GET_DC_EXTENT_LIST is supported by device */
> +	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
> +		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
> +		return 0;
> +	}

Can we even get this far if this cmd is not supported by the device?
Is there a sooner place to test those bits?  Is this sysfs request?
(sorry not completely following here).

> +
> +	dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);
> +	if (!dc_extents)
> +		return -ENOMEM;
> +	get_dc_extent = (struct cxl_mbox_get_dc_extent) {
> +		.extent_cnt = cnt,
> +		.start_extent_index = index,
> +	};
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> +		.payload_in = &get_dc_extent,
> +		.size_in = sizeof(get_dc_extent),
> +		.size_out = mds->payload_size,
> +		.payload_out = dc_extents,
> +		.min_out = 1,
> +	};
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	if (rc < 0)
> +		goto out;
> +
> +	available_extents = le32_to_cpu(dc_extents->ret_extent_cnt);
> +	total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> +	extent_gen_num = le32_to_cpu(dc_extents->extent_list_num);
> +	dev_dbg(dev, "No Total extent count :%d Extent list Generation Num:%d\n",
> +			total_extent_cnt, extent_gen_num);
> +
> +
> +	for (int i = 0; i < available_extents ; i++) {
> +		extent = devm_kzalloc(dev, sizeof(*extent), GFP_KERNEL);
> +		if (!extent) {
> +			rc = -ENOMEM;
> +			goto out;
> +		}
> +		extent->dpa_start = le64_to_cpu(dc_extents->extent[i].start_dpa);
> +		extent->length = le64_to_cpu(dc_extents->extent[i].length);
> +		memcpy(extent->tag, dc_extents->extent[i].tag,
> +					sizeof(dc_extents->extent[i].tag));
> +		extent->shared_extent_seq =
> +				le16_to_cpu(dc_extents->extent[i].shared_extn_seq);
> +		dev_dbg(dev, "dynamic capacity extent[%d] DPA:0x%llx LEN:%llx\n",
> +				i, extent->dpa_start, extent->length);
> +
> +		alloc_range = (struct range){
> +			.start = extent->dpa_start,
> +			.end = extent->dpa_start + extent->length - 1,
> +		};
> +
> +		rc = cxl_add_dc_extent(mds, &alloc_range);
> +		if (rc < 0)
> +			goto out;
> +		rc = xa_insert(&mds->dc_extent_list, extent->dpa_start, extent,
> +				GFP_KERNEL);
> +	}
> +
> +out:
> +	kvfree(dc_extents);
> +	if (rc < 0)
> +		return rc;
> +
> +	return available_extents;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_get_dc_extents, CXL);
> +
>  static int add_dpa_res(struct device *dev, struct resource *parent,
>  		       struct resource *res, resource_size_t start,
>  		       resource_size_t size, const char *type)
> @@ -1452,6 +1794,7 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
>  	mutex_init(&mds->event.log_lock);
>  	mds->cxlds.dev = dev;
>  	mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
> +	xa_init(&mds->dc_extent_list);
>  
>  	return mds;
>  }
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 144232c8305e..ba45c1c3b0a9 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1,6 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  /* Copyright(c) 2022 Intel Corporation. All rights reserved. */
>  #include <linux/memregion.h>
> +#include <linux/interrupt.h>
>  #include <linux/genalloc.h>
>  #include <linux/device.h>
>  #include <linux/module.h>
> @@ -11,6 +12,8 @@
>  #include <cxlmem.h>
>  #include <cxl.h>
>  #include "core.h"
> +#include "../../dax/bus.h"
> +#include "../../dax/dax-private.h"
>  
>  /**
>   * DOC: cxl core region
> @@ -166,6 +169,38 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
>  	return 0;
>  }
>  
> +static int cxl_region_manage_dc(struct cxl_region *cxlr)
> +{
> +	struct cxl_region_params *p = &cxlr->params;
> +	unsigned int extent_gen_num;
> +	int i, rc;
> +
> +	/* Designed for Non Interleaving flow with the assumption one
> +	 * cxl_region will map the complete device DC region's DPA range
> +	 */
> +	for (i = 0; i < p->nr_targets; i++) {
> +		struct cxl_endpoint_decoder *cxled = p->targets[i];
> +		struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +		struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +
> +		rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
> +		if (rc < 0)
> +			goto err;
> +		else if (rc > 1) {
> +			rc = cxl_dev_get_dc_extents(mds, rc, 0);
> +			if (rc < 0)
> +				goto err;
> +			mds->num_dc_extents = rc;
> +			mds->dc_extents_index = rc - 1;
> +		}

Brackets required around both arms of that if/else if statement. 
(checkpatch should be telling you that)

How about flipping that and doing the (rc > 1) work first.
then the else if, goto err.

> +		mds->dc_list_gen_num = extent_gen_num;
> +		dev_dbg(mds->cxlds.dev, "No of preallocated extents :%d\n", rc);
> +	}
> +	return 0;
> +err:
> +	return rc;
> +}
> +
>  static int commit_decoder(struct cxl_decoder *cxld)
>  {
>  	struct cxl_switch_decoder *cxlsd = NULL;
> @@ -2865,11 +2900,14 @@ static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
>  		return PTR_ERR(cxlr_dax);
>  
>  	cxlr_dc = kzalloc(sizeof(*cxlr_dc), GFP_KERNEL);
> -	if (!cxlr_dc) {
> -		rc = -ENOMEM;
> -		goto err;
> -	}
> +	if (!cxlr_dc)
> +		return -ENOMEM;
>  
> +	rc = request_module("dax_cxl");
> +	if (rc) {
> +		dev_err(dev, "failed to load dax-ctl module\n");
> +		goto load_err;
> +	}
>  	dev = &cxlr_dax->dev;
>  	rc = dev_set_name(dev, "dax_region%d", cxlr->id);
>  	if (rc)
> @@ -2891,10 +2929,24 @@ static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
>  	xa_init(&cxlr_dc->dax_dev_list);
>  	cxlr->cxlr_dc = cxlr_dc;
>  	rc = devm_add_action_or_reset(&cxlr->dev, cxl_dc_region_release, cxlr);
> -	if (!rc)
> -		return 0;
> +	if (rc)
> +		goto err;
> +
> +	if (!dev->driver) {
> +		dev_err(dev, "%s Driver not attached\n", dev_name(dev));
> +		rc = -ENXIO;
> +		goto err;
> +	}
> +
> +	rc = cxl_region_manage_dc(cxlr);
> +	if (rc)
> +		goto err;
> +
> +	return 0;
> +
>  err:
>  	put_device(dev);
> +load_err:
>  	kfree(cxlr_dc);
>  	return rc;
>  }
> @@ -3076,6 +3128,156 @@ struct cxl_region *cxl_create_region(struct cxl_root_decoder *cxlrd,
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_create_region, CXL);
>  
> +static int match_ep_decoder_by_range(struct device *dev, void *data)
> +{
> +	struct cxl_endpoint_decoder *cxled;
> +	struct range *dpa_range = data;
> +
> +	if (!is_endpoint_decoder(dev))
> +		return 0;
> +
> +	cxled = to_cxl_endpoint_decoder(dev);
> +	if (!cxled->cxld.region)
> +		return 0;
> +
> +	if (cxled->dpa_res->start <= dpa_range->start &&
> +				cxled->dpa_res->end >= dpa_range->end)
> +		return 1;
> +
> +	return 0;
> +}
> +
> +int cxl_release_dc_extent(struct cxl_memdev_state *mds,
> +			  struct range *rel_range)
> +{
> +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct cxl_dc_region *cxlr_dc;
> +	struct dax_region *dax_region;
> +	resource_size_t dpa_offset;
> +	struct cxl_region *cxlr;
> +	struct range hpa_range;
> +	struct dev_dax *dev_dax;
> +	resource_size_t hpa;
> +	struct device *dev;
> +	int ranges, rc = 0;
> +
> +	/*
> +	 * Find the cxl endpoind decoder with which has the extent dpa range and
> +	 * get the cxl_region, dax_region refrences.
> +	 */
> +	dev = device_find_child(&cxlmd->endpoint->dev, rel_range,
> +				match_ep_decoder_by_range);
> +	if (!dev) {
> +		dev_err(mds->cxlds.dev, "%pr not mapped\n", rel_range);
> +		return PTR_ERR(dev);
> +	}
> +
> +	cxled = to_cxl_endpoint_decoder(dev);
> +	hpa_range = cxled->cxld.hpa_range;
> +	cxlr = cxled->cxld.region;
> +	cxlr_dc = cxlr->cxlr_dc;
> +
> +	/* DPA to HPA translation */
> +	if (cxled->cxld.interleave_ways == 1) {
> +		dpa_offset = rel_range->start - cxled->dpa_res->start;
> +		hpa = hpa_range.start + dpa_offset;
> +	} else {
> +		dev_err(mds->cxlds.dev, "Interleaving DC not supported\n");
> +		return -EINVAL;
> +	}
> +
> +	dev_dax = xa_load(&cxlr_dc->dax_dev_list, hpa);
> +	if (!dev_dax)
> +		return -EINVAL;
> +
> +	dax_region = dev_dax->region;
> +	ranges = dev_dax->nr_range;
> +
> +	while (ranges) {
> +		int i = ranges - 1;
> +		struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
> +
> +		devm_release_action(dax_region->dev, unregister_dax_mapping,
> +								&mapping->dev);
> +		ranges--;
> +	}
> +
> +	dev_dbg(mds->cxlds.dev, "removing devdax device:%s\n",
> +						dev_name(&dev_dax->dev));
> +	devm_release_action(dax_region->dev, unregister_dev_dax,
> +							&dev_dax->dev);
> +	xa_erase(&cxlr_dc->dax_dev_list, hpa);
> +
> +	return rc;
> +}
> +
> +int cxl_add_dc_extent(struct cxl_memdev_state *mds, struct range *alloc_range)
> +{
> +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct cxl_dax_region *cxlr_dax;
> +	struct cxl_dc_region *cxlr_dc;
> +	struct dax_region *dax_region;
> +	resource_size_t dpa_offset;
> +	struct dev_dax_data data;
> +	struct dev_dax *dev_dax;
> +	struct cxl_region *cxlr;
> +	struct range hpa_range;
> +	resource_size_t hpa;
> +	struct device *dev;
> +	int rc;
> +
> +	/*
> +	 * Find the cxl endpoind decoder with which has the extent dpa range and
> +	 * get the cxl_region, dax_region refrences.
> +	 */
> +	dev = device_find_child(&cxlmd->endpoint->dev, alloc_range,
> +				match_ep_decoder_by_range);
> +	if (!dev) {
> +		dev_err(mds->cxlds.dev, "%pr not mapped\n",	alloc_range);
> +		return PTR_ERR(dev);
> +	}
> +
> +	cxled = to_cxl_endpoint_decoder(dev);
> +	hpa_range = cxled->cxld.hpa_range;
> +	cxlr = cxled->cxld.region;
> +	cxlr_dc = cxlr->cxlr_dc;
> +	cxlr_dax = cxlr_dc->cxlr_dax;
> +	dax_region = dev_get_drvdata(&cxlr_dax->dev);
> +
> +	/* DPA to HPA translation */
> +	if (cxled->cxld.interleave_ways == 1) {
> +		dpa_offset = alloc_range->start - cxled->dpa_res->start;
> +		hpa = hpa_range.start + dpa_offset;
> +	} else {
> +		dev_err(mds->cxlds.dev, "Interleaving DC not supported\n");
> +		return -EINVAL;
> +	}

Hey, I'm running out of steam here, but lastly between these last
2 funcs, seems some duplicate code. Is there maybe an opportunity
for a common func that can 'add' or 'release' a dc extent?



The end.
> +
> +	data = (struct dev_dax_data) {
> +		.dax_region = dax_region,
> +		.id = -1,
> +		.size = 0,
> +	};
> +
> +	dev_dax = devm_create_dev_dax(&data);
> +	if (IS_ERR(dev_dax))
> +		return PTR_ERR(dev_dax);
> +
> +	if (IS_ALIGNED(range_len(alloc_range), max_t(unsigned long,
> +				dev_dax->align, memremap_compat_align()))) {
> +		rc = alloc_dev_dax_range(dev_dax, hpa,
> +					range_len(alloc_range));
> +		if (rc)
> +			return rc;
> +	}
> +
> +	rc = xa_insert(&cxlr_dc->dax_dev_list, hpa, dev_dax, GFP_KERNEL);
> +
> +	return rc;
> +}
> +
>  /* Establish an empty region covering the given HPA range */
>  static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
>  					   struct cxl_endpoint_decoder *cxled)
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index a0b5819bc70b..e11651255780 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -122,7 +122,8 @@ TRACE_EVENT(cxl_aer_correctable_error,
>  		{ CXL_EVENT_TYPE_INFO, "Informational" },	\
>  		{ CXL_EVENT_TYPE_WARN, "Warning" },		\
>  		{ CXL_EVENT_TYPE_FAIL, "Failure" },		\
> -		{ CXL_EVENT_TYPE_FATAL, "Fatal" })
> +		{ CXL_EVENT_TYPE_FATAL, "Fatal" },		\
> +		{ CXL_EVENT_TYPE_DCD, "DCD" })
>  
>  TRACE_EVENT(cxl_overflow,
>  
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 7ac1237938b7..60c436b7ebb1 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -163,11 +163,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
>  #define CXLDEV_EVENT_STATUS_WARN		BIT(1)
>  #define CXLDEV_EVENT_STATUS_FAIL		BIT(2)
>  #define CXLDEV_EVENT_STATUS_FATAL		BIT(3)
> +#define CXLDEV_EVENT_STATUS_DCD			BIT(4)
>  
>  #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO |	\
>  				 CXLDEV_EVENT_STATUS_WARN |	\
>  				 CXLDEV_EVENT_STATUS_FAIL |	\
> -				 CXLDEV_EVENT_STATUS_FATAL)
> +				 CXLDEV_EVENT_STATUS_FATAL|	\
> +				 CXLDEV_EVENT_STATUS_DCD)
>  
>  /* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
>  #define CXLDEV_EVENT_INT_MODE_MASK	GENMASK(1, 0)
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 9c0b2fa72bdd..0440b5c04ef6 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -5,6 +5,7 @@
>  #include <uapi/linux/cxl_mem.h>
>  #include <linux/cdev.h>
>  #include <linux/uuid.h>
> +#include <linux/xarray.h>
>  #include "cxl.h"
>  
>  /* CXL 2.0 8.2.8.5.1.1 Memory Device Status Register */
> @@ -226,6 +227,7 @@ struct cxl_event_interrupt_policy {
>  	u8 warn_settings;
>  	u8 failure_settings;
>  	u8 fatal_settings;
> +	u8 dyncap_settings;
>  } __packed;
>  
>  /**
> @@ -296,6 +298,13 @@ enum cxl_devtype {
>  #define CXL_MAX_DC_REGION 8
>  #define CXL_DC_REGION_SRTLEN 8
>  
> +struct cxl_dc_extent_data {
> +	u64 dpa_start;
> +	u64 length;
> +	u8 tag[16];
> +	u16 shared_extent_seq;
> +};
> +
>  /**
>   * struct cxl_dev_state - The driver device state
>   *
> @@ -406,6 +415,11 @@ struct cxl_memdev_state {
>  		u8 flags;
>  	} dc_region[CXL_MAX_DC_REGION];
>  
> +	u32 dc_list_gen_num;
> +	u32 dc_extents_index;
> +	struct xarray dc_extent_list;
> +	u32 num_dc_extents;
> +
>  	size_t dc_event_log_size;
>  	struct cxl_event_state event;
>  	struct cxl_poison_state poison;
> @@ -470,6 +484,17 @@ enum cxl_opcode {
>  	UUID_INIT(0xe1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19,     \
>  		  0x40, 0x3d, 0x86)
>  
> +
> +struct cxl_mbox_dc_response {
> +	__le32 extent_list_size;
> +	u8 reserved[4];
> +	struct updated_extent_list {
> +		__le64 dpa_start;
> +		__le64 length;
> +		u8 reserved[8];
> +	} __packed extent_list[];
> +} __packed;
> +
>  struct cxl_mbox_get_supported_logs {
>  	__le16 entries;
>  	u8 rsvd[6];
> @@ -555,6 +580,7 @@ enum cxl_event_log_type {
>  	CXL_EVENT_TYPE_WARN,
>  	CXL_EVENT_TYPE_FAIL,
>  	CXL_EVENT_TYPE_FATAL,
> +	CXL_EVENT_TYPE_DCD,
>  	CXL_EVENT_TYPE_MAX
>  };
>  
> @@ -639,6 +665,35 @@ struct cxl_event_mem_module {
>  	u8 reserved[0x3d];
>  } __packed;
>  
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.0 section 8.2.9.2.1.5; Table 8-47
> + */
> +
> +#define CXL_EVENT_DC_TAG_SIZE	0x10
> +struct cxl_dc_extent {
> +	__le64 start_dpa;
> +	__le64 length;
> +	u8 tag[CXL_EVENT_DC_TAG_SIZE];
> +	__le16 shared_extn_seq;
> +	u8 reserved[6];
> +} __packed;
> +
> +struct dcd_record_data {
> +	u8 event_type;
> +	u8 reserved;
> +	__le16 host_id;
> +	u8 region_index;
> +	u8 reserved1[3];
> +	struct cxl_dc_extent extent;
> +	u8 reserved2[32];
> +} __packed;
> +
> +struct dcd_event_dyn_cap {
> +	struct cxl_event_record_hdr hdr;
> +	struct dcd_record_data data;
> +} __packed;
> +
>  struct cxl_mbox_get_partition_info {
>  	__le64 active_volatile_cap;
>  	__le64 active_persistent_cap;
> @@ -684,6 +739,19 @@ struct cxl_mbox_dynamic_capacity {
>  #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
>  #define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
>  
> +struct cxl_mbox_get_dc_extent {
> +	__le32 extent_cnt;
> +	__le32 start_extent_index;
> +} __packed;
> +
> +struct cxl_mbox_dc_extents {
> +	__le32 ret_extent_cnt;
> +	__le32 total_extent_cnt;
> +	__le32 extent_list_num;
> +	u8 rsvd[4];
> +	struct cxl_dc_extent extent[];
> +}  __packed;
> +
>  /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
>  struct cxl_mbox_set_timestamp_in {
>  	__le64 timestamp;
> @@ -826,6 +894,14 @@ int cxl_trigger_poison_list(struct cxl_memdev *cxlmd);
>  int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa);
>  int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa);
>  
> +/* FIXME why not have these be static in mbox.c? */
> +int cxl_add_dc_extent(struct cxl_memdev_state *mds, struct range *alloc_range);
> +int cxl_release_dc_extent(struct cxl_memdev_state *mds, struct range *rel_range);
> +int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> +			      unsigned int *extent_gen_num);
> +int cxl_dev_get_dc_extents(struct cxl_memdev_state *mds, unsigned int cnt,
> +			   unsigned int index);
> +
>  #ifdef CONFIG_CXL_SUSPEND
>  void cxl_mem_active_inc(void);
>  void cxl_mem_active_dec(void);
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index ac1a41bc083d..558ffbcb9b34 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -522,8 +522,8 @@ static int cxl_event_req_irq(struct cxl_dev_state *cxlds, u8 setting)
>  		return irq;
>  
>  	return devm_request_threaded_irq(dev, irq, NULL, cxl_event_thread,
> -					 IRQF_SHARED | IRQF_ONESHOT, NULL,
> -					 dev_id);
> +					IRQF_SHARED | IRQF_ONESHOT, NULL,
> +					dev_id);
>  }
>  
>  static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
> @@ -555,6 +555,7 @@ static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
>  		.warn_settings = CXL_INT_MSI_MSIX,
>  		.failure_settings = CXL_INT_MSI_MSIX,
>  		.fatal_settings = CXL_INT_MSI_MSIX,
> +		.dyncap_settings = CXL_INT_MSI_MSIX,
>  	};
>  
>  	mbox_cmd = (struct cxl_mbox_cmd) {
> @@ -608,6 +609,11 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds)
>  		return rc;
>  	}
>  
> +	rc = cxl_event_req_irq(cxlds, policy.dyncap_settings);
> +	if (rc) {
> +		dev_err(cxlds->dev, "Failed to get interrupt for event dc log\n");
> +		return rc;
> +	}
>  	return 0;
>  }
>  
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 227800053309..b2b27033f589 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -434,7 +434,7 @@ static void free_dev_dax_ranges(struct dev_dax *dev_dax)
>  		trim_dev_dax_range(dev_dax);
>  }
>  
> -static void unregister_dev_dax(void *dev)
> +void unregister_dev_dax(void *dev)
>  {
>  	struct dev_dax *dev_dax = to_dev_dax(dev);
>  
> @@ -445,6 +445,7 @@ static void unregister_dev_dax(void *dev)
>  	free_dev_dax_ranges(dev_dax);
>  	put_device(dev);
>  }
> +EXPORT_SYMBOL_GPL(unregister_dev_dax);
>  
>  /* a return value >= 0 indicates this invocation invalidated the id */
>  static int __free_dev_dax_id(struct dev_dax *dev_dax)
> @@ -641,7 +642,7 @@ static void dax_mapping_release(struct device *dev)
>  	kfree(mapping);
>  }
>  
> -static void unregister_dax_mapping(void *data)
> +void unregister_dax_mapping(void *data)
>  {
>  	struct device *dev = data;
>  	struct dax_mapping *mapping = to_dax_mapping(dev);
> @@ -658,7 +659,7 @@ static void unregister_dax_mapping(void *data)
>  	device_del(dev);
>  	put_device(dev);
>  }
> -
> +EXPORT_SYMBOL_GPL(unregister_dax_mapping);
>  static struct dev_dax_range *get_dax_range(struct device *dev)
>  {
>  	struct dax_mapping *mapping = to_dax_mapping(dev);
> @@ -793,7 +794,7 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
>  	return 0;
>  }
>  
> -static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
> +int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
>  		resource_size_t size)
>  {
>  	struct dax_region *dax_region = dev_dax->region;
> @@ -853,6 +854,8 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
>  
>  	return rc;
>  }
> +EXPORT_SYMBOL_GPL(alloc_dev_dax_range);
> +
>  
>  static int adjust_dev_dax_range(struct dev_dax *dev_dax, struct resource *res, resource_size_t size)
>  {
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index 8cd79ab34292..aa8418c7aead 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -47,8 +47,11 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
>  	__dax_driver_register(driver, THIS_MODULE, KBUILD_MODNAME)
>  void dax_driver_unregister(struct dax_device_driver *dax_drv);
>  void kill_dev_dax(struct dev_dax *dev_dax);
> +void unregister_dev_dax(void *dev);
> +void unregister_dax_mapping(void *data);
>  bool static_dev_dax(struct dev_dax *dev_dax);
> -
> +int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
> +					resource_size_t size);
>  /*
>   * While run_dax() is potentially a generic operation that could be
>   * defined in include/linux/dax.h we don't want to grow any users
> 
> -- 
> 2.40.0
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD)
  2023-06-14 19:16 [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD) ira.weiny
                   ` (5 preceding siblings ...)
  2023-06-15  0:56 ` [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD) Alison Schofield
@ 2023-06-15 14:51 ` Ira Weiny
  2023-06-22 15:07   ` Jonathan Cameron
  2023-06-29 15:30 ` Ira Weiny
  7 siblings, 1 reply; 55+ messages in thread
From: Ira Weiny @ 2023-06-15 14:51 UTC (permalink / raw)
  To: ira.weiny, Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams,
	linux-cxl

ira.weiny@ wrote:
> I'm submitting these on behalf of Navneet.  There was a round of
> internal discussion which left a few questions but we want to get the
> public discussion going.  A first public preview was posted by Dan.[1]

Apologies for not being clear and marking these appropriately.  I
intended these to be RFC to get the discussion moving forward.  I somewhat
rushed the submission.  Depending on where the comments in this submission
go I'll try and make a better determination if the next submission is RFC
or can be a proper V1.  (Although b4 will mark them v2...  I'll have to
deal with that.)

Ira

> 
> The series has been rebased on the type-2 work posted from Dan.[2]  As
> discussed in the community call, not all of that series is required for
> these patches.  This will get rebased on the subset of those patches he
> is targeting for 6.5.  The series was tested using Fan Ni's Qemu DCD
> series.[3]
> 
> [cover letter]
> 
> A Dynamic Capacity Device (DCD) (CXL 3.0 spec 9.13.3) is a CXL memory
> device that implements dynamic capacity.  Dynamic capacity feature
> allows memory capacity to change dynamically, without the need for
> resetting the device.
> 
> Provide initial patches to enable DCD on non interleaving regions.
> Details:
> 
> - Get the dynamic capacity region information from cxl device and add
>   the advertised DC memory to driver managed resources
> - Get the device dynamic capacity extent list from the device and
>   maintain it in the host and add the preallocated memory to the host
> - Dynamic capacity region support
> - DCD region provisioning via Dax
> - Dynamic capacity event records
>         a. Add capacity Events
> 	b. Release capacity events
> 	c. Add the memory to the host dc region
> 	d. Release the memory from the host dc region
> - Trace Dynamic Capacity events
> - Send add capacity response to device
> - Send release dynamic capacity to device
> 
> Cc: Navneet Singh <navneet.singh@intel.com>
> Cc: Fan Ni <fan.ni@samsung.com>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: linux-cxl@vger.kernel.org
> 
> [1] https://lore.kernel.org/all/64326437c1496_934b2949f@dwillia2-mobl3.amr.corp.intel.com.notmuch/
> [2] https://lore.kernel.org/all/168592149709.1948938.8663425987110396027.stgit@dwillia2-xfh.jf.intel.com/
> [3] https://lore.kernel.org/all/6483946e8152f_f1132294a2@iweiny-mobl.notmuch/
> 
> ---
> Navneet Singh (5):
>       cxl/mem : Read Dynamic capacity configuration from the device
>       cxl/region: Add dynamic capacity cxl region support.
>       cxl/mem : Expose dynamic capacity configuration to userspace
>       cxl/mem: Add support to handle DCD add and release capacity events.
>       cxl/mem: Trace Dynamic capacity Event Record
> 
>  drivers/cxl/Kconfig       |  11 +
>  drivers/cxl/core/core.h   |   7 +
>  drivers/cxl/core/hdm.c    | 234 ++++++++++++++++++--
>  drivers/cxl/core/mbox.c   | 540 +++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/cxl/core/memdev.c |  72 +++++++
>  drivers/cxl/core/port.c   |  18 ++
>  drivers/cxl/core/region.c | 337 ++++++++++++++++++++++++++++-
>  drivers/cxl/core/trace.h  |  68 +++++-
>  drivers/cxl/cxl.h         |  32 ++-
>  drivers/cxl/cxlmem.h      | 146 ++++++++++++-
>  drivers/cxl/pci.c         |  14 +-
>  drivers/dax/bus.c         |  11 +-
>  drivers/dax/bus.h         |   5 +-
>  drivers/dax/cxl.c         |   4 +
>  14 files changed, 1453 insertions(+), 46 deletions(-)
> ---
> base-commit: 034a16d0165be3e092d60685be7b1b05e6f3059b
> change-id: 20230604-dcd-type2-upstream-0cd15f6216fd
> 
> Best regards,
> -- 
> Ira Weiny <ira.weiny@intel.com>
> 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device
  2023-06-14 22:53   ` Dave Jiang
@ 2023-06-15 15:04     ` Ira Weiny
  0 siblings, 0 replies; 55+ messages in thread
From: Ira Weiny @ 2023-06-15 15:04 UTC (permalink / raw)
  To: Dave Jiang, ira.weiny, Navneet Singh, Fan Ni, Jonathan Cameron,
	Dan Williams, linux-cxl

Dave Jiang wrote:
> 
> 
> On 6/14/23 12:16, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > Read the Dynamic capacity configuration and store dynamic capacity region
> > information in the device state which driver will use to map into the HDM
> > ranges.
> > 
> > Implement Get Dynamic Capacity Configuration (opcode 4800h) mailbox
> > command as specified in CXL 3.0 spec section 8.2.9.8.9.1.
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > 
> > ---
> > [iweiny: ensure all mds->dc_region's are named]
> > ---
> >   drivers/cxl/core/mbox.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++--
> >   drivers/cxl/cxlmem.h    |  70 +++++++++++++++++-
> >   drivers/cxl/pci.c       |   4 +
> >   3 files changed, 256 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 3ca0bf12c55f..c5b696737c87 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -111,6 +111,37 @@ static u8 security_command_sets[] = {
> >   	0x46, /* Security Passthrough */
> >   };
> >   
> > +static bool cxl_is_dcd_command(u16 opcode)
> > +{
> > +#define CXL_MBOX_OP_DCD_CMDS 0x48
> 
> Move this to cxlmem.h?

This is only ever used in mbox.c.  So I left the scope restricted for now.
If modules need it in the future we can lift it.  Also
cxl_is_poison_command() is also this way.

> 
> > +
> > +	if ((opcode >> 8) == CXL_MBOX_OP_DCD_CMDS)
> > +		return true;
> > +
> > +	return false;
> 
> I think you simplify by:
> 
> return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;

Good catch.  I'll clean it up.

[snip]

> > @@ -1121,13 +1289,23 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> >   	}
> >   
> >   	cxlds->dpa_res =
> > -		(struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
> > +		(struct resource)DEFINE_RES_MEM(0, mds->total_capacity);
> > +
> > +	for (int i = 0; i < CXL_MAX_DC_REGION; i++) {
> > +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> > +
> > +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
> > +				 dcr->base, dcr->decode_len, dcr->name);
> > +		if (rc)
> > +			return rc;
> > +	}
> >   
> >   	if (mds->partition_align_bytes == 0) {
> >   		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
> >   				 mds->volatile_only_bytes, "ram");
> >   		if (rc)
> >   			return rc;
> > +
> 
> Stray blank line?

Ah yep!  Fixed!

Thanks for looking!
Ira

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/5] cxl/mem : Expose dynamic capacity configuration to userspace
  2023-06-14 19:16 ` [PATCH 3/5] cxl/mem : Expose dynamic capacity configuration to userspace ira.weiny
  2023-06-15  0:40   ` Alison Schofield
@ 2023-06-15 15:41   ` Dave Jiang
  1 sibling, 0 replies; 55+ messages in thread
From: Dave Jiang @ 2023-06-15 15:41 UTC (permalink / raw)
  To: ira.weiny, Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams,
	linux-cxl



On 6/14/23 12:16, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Exposing driver cached dynamic capacity configuration through sysfs
> attributes.User will create one or more dynamic capacity

Space after '.'

> cxl regions based on this information and map the dynamic capacity of
> the device into HDM ranges using one or more HDM decoders.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> 
> ---
> [iweiny: fixups]
> [djbw: fixups, no sign-off: preview only]
> ---
>   drivers/cxl/core/memdev.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 72 insertions(+)

Missing sysfs documentation?

> 
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index 5d1ba7a72567..beeb5fa3a0aa 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -99,6 +99,20 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
>   static struct device_attribute dev_attr_pmem_size =
>   	__ATTR(size, 0444, pmem_size_show, NULL);
>   
> +static ssize_t dc_regions_count_show(struct device *dev, struct device_attribute *attr,
> +		char *buf)
> +{
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +	int len = 0;
> +
> +	len = sysfs_emit(buf, "0x%x\n", mds->nr_dc_region);
> +	return len;
> +}

Just directly return sysfs_emit(). Also, emit region count as decimal 
instead of hex?

> +
> +struct device_attribute dev_attr_dc_regions_count =
> +	__ATTR(dc_regions_count, 0444, dc_regions_count_show, NULL);
> +
>   static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
>   			   char *buf)
>   {
> @@ -362,6 +376,57 @@ static struct attribute *cxl_memdev_ram_attributes[] = {
>   	NULL,
>   };
>   
> +static ssize_t show_size_regionN(struct cxl_memdev *cxlmd, char *buf, int pos)
> +{
> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +
> +	return sysfs_emit(buf, "0x%llx\n", mds->dc_region[pos].decode_len);
> +}

Should size be in decimal format?

DJ

> +
> +#define SIZE_ATTR_RO(n)                                              \
> +static ssize_t dc##n##_size_show(                                       \
> +	struct device *dev, struct device_attribute *attr, char *buf)  \
> +{                                                                      \
> +	return show_size_regionN(to_cxl_memdev(dev), buf, (n));             \
> +}                                                                      \
> +static DEVICE_ATTR_RO(dc##n##_size)
> +SIZE_ATTR_RO(0);
> +SIZE_ATTR_RO(1);
> +SIZE_ATTR_RO(2);
> +SIZE_ATTR_RO(3);
> +SIZE_ATTR_RO(4);
> +SIZE_ATTR_RO(5);
> +SIZE_ATTR_RO(6);
> +SIZE_ATTR_RO(7);
> +
> +static struct attribute *cxl_memdev_dc_attributes[] = {
> +	&dev_attr_dc0_size.attr,
> +	&dev_attr_dc1_size.attr,
> +	&dev_attr_dc2_size.attr,
> +	&dev_attr_dc3_size.attr,
> +	&dev_attr_dc4_size.attr,
> +	&dev_attr_dc5_size.attr,
> +	&dev_attr_dc6_size.attr,
> +	&dev_attr_dc7_size.attr,
> +	&dev_attr_dc_regions_count.attr,
> +	NULL,
> +};
> +
> +static umode_t cxl_dc_visible(struct kobject *kobj, struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +
> +	if (a == &dev_attr_dc_regions_count.attr)
> +		return a->mode;
> +
> +	if (n < mds->nr_dc_region)
> +		return a->mode;
> +
> +	return 0;
> +}
> +
>   static umode_t cxl_memdev_visible(struct kobject *kobj, struct attribute *a,
>   				  int n)
>   {
> @@ -385,10 +450,17 @@ static struct attribute_group cxl_memdev_pmem_attribute_group = {
>   	.attrs = cxl_memdev_pmem_attributes,
>   };
>   
> +static struct attribute_group cxl_memdev_dc_attribute_group = {
> +	.name = "dc",
> +	.attrs = cxl_memdev_dc_attributes,
> +	.is_visible = cxl_dc_visible,
> +};
> +
>   static const struct attribute_group *cxl_memdev_attribute_groups[] = {
>   	&cxl_memdev_attribute_group,
>   	&cxl_memdev_ram_attribute_group,
>   	&cxl_memdev_pmem_attribute_group,
> +	&cxl_memdev_dc_attribute_group,
>   	NULL,
>   };
>   
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/5] cxl/mem: Add support to handle DCD add and release capacity events.
  2023-06-14 19:16 ` [PATCH 4/5] cxl/mem: Add support to handle DCD add and release capacity events ira.weiny
  2023-06-15  2:19   ` Alison Schofield
@ 2023-06-15 16:58   ` Dave Jiang
  2023-06-22 17:01   ` Jonathan Cameron
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 55+ messages in thread
From: Dave Jiang @ 2023-06-15 16:58 UTC (permalink / raw)
  To: ira.weiny, Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams,
	linux-cxl



On 6/14/23 12:16, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> A dynamic capacity device utilizes events to signal the host about the
> changes to the allocation of DC blocks. The device communicates the
> state of these blocks of dynamic capacity through an extent list that
> describes the starting DPA and length of all blocks the host can access.
> 
> Based on the dynamic capacity add or release event type,
> dynamic memory represented by the extents are either added
> or removed as devdax device.
> 
> Process the dynamic capacity add and release events.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> 
> ---
> [iweiny: Remove invalid comment]
> ---
>   drivers/cxl/core/mbox.c   | 345 +++++++++++++++++++++++++++++++++++++++++++++-
>   drivers/cxl/core/region.c | 214 +++++++++++++++++++++++++++-
>   drivers/cxl/core/trace.h  |   3 +-
>   drivers/cxl/cxl.h         |   4 +-
>   drivers/cxl/cxlmem.h      |  76 ++++++++++
>   drivers/cxl/pci.c         |  10 +-
>   drivers/dax/bus.c         |  11 +-
>   drivers/dax/bus.h         |   5 +-
>   8 files changed, 652 insertions(+), 16 deletions(-)

Rather large patch. Can this be broken into the add and release events?


> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index c5b696737c87..db9295216de5 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -767,6 +767,14 @@ static const uuid_t log_uuid[] = {
>   	[VENDOR_DEBUG_UUID] = DEFINE_CXL_VENDOR_DEBUG_UUID,
>   };
>   
> +/* See CXL 3.0 8.2.9.2.1.5 */
> +enum dc_event {
> +	ADD_CAPACITY,
> +	RELEASE_CAPACITY,
> +	FORCED_CAPACITY_RELEASE,
> +	REGION_CONFIGURATION_UPDATED,
> +};
> +
>   /**
>    * cxl_enumerate_cmds() - Enumerate commands for a device.
>    * @mds: The driver data for the operation
> @@ -852,6 +860,14 @@ static const uuid_t mem_mod_event_uuid =
>   	UUID_INIT(0xfe927475, 0xdd59, 0x4339,
>   		  0xa5, 0x86, 0x79, 0xba, 0xb1, 0x13, 0xb7, 0x74);
>   
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.0 section 8.2.9.2.1.3; Table 8-45
> + */
> +static const uuid_t dc_event_uuid =
> +	UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c,
> +		0x2f, 0x95, 0x26, 0x8e, 0x10, 0x1a, 0x2a);
> +
>   static void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>   				   enum cxl_event_log_type type,
>   				   struct cxl_event_record_raw *record)
> @@ -945,6 +961,188 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
>   	return rc;
>   }
>   
> +static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
> +				struct cxl_mbox_dc_response *res,
> +				int extent_cnt, int opcode)
> +{
> +	struct cxl_mbox_cmd mbox_cmd;
> +	int rc, size;
> +
> +	size = struct_size(res, extent_list, extent_cnt);
> +	res->extent_list_size = cpu_to_le32(extent_cnt);
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = opcode,
> +		.size_in = size,
> +		.payload_in = res,
> +	};
> +
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +
> +	return rc;
> +
> +}

Return cxl_internal_send_cmd() directly

> +
> +static int cxl_prepare_ext_list(struct cxl_mbox_dc_response **res,
'res' always triggers "resource" in my head. Maybe either spell it out 
or 'resp'?

> +					int *n, struct range *extent)

Maybe a parameter more descriptive than 'n'?


> +{
> +	struct cxl_mbox_dc_response *dc_res;

Same comment as 'res'

> +	unsigned int size;
> +
> +	if (!extent)
> +		size = struct_size(dc_res, extent_list, 0);
> +	else
> +		size = struct_size(dc_res, extent_list, *n + 1);
> +
> +	dc_res = krealloc(*res, size, GFP_KERNEL);
> +	if (!dc_res)
> +		return -ENOMEM;
> +
> +	if (extent) {
> +		dc_res->extent_list[*n].dpa_start = cpu_to_le64(extent->start);
> +		memset(dc_res->extent_list[*n].reserved, 0, 8);

SZ_8 perhaps?

> +		dc_res->extent_list[*n].length =
> +				cpu_to_le64(range_len(extent));
> +		(*n)++;
> +	}
> +
> +	*res = dc_res;
> +	return 0;
> +}
> +/**
> + * cxl_handle_dcd_event_records() - Read DCD event records.
> + * @mds: The memory device state
> + *
> + * Returns 0 if enumerate completed successfully.

Please also add "errno on failure."

> + *
> + * CXL devices can generate DCD events to add or remove extents in the list.
> + */
> +static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
> +					struct cxl_event_record_raw *rec)
> +{
> +	struct cxl_mbox_dc_response *dc_res = NULL;
> +	struct device *dev = mds->cxlds.dev;
> +	uuid_t *id = &rec->hdr.id;
> +	struct dcd_event_dyn_cap *record =
> +			(struct dcd_event_dyn_cap *)rec;
> +	int extent_cnt = 0, rc = 0;
> +	struct cxl_dc_extent_data *extent;
> +	struct range alloc_range, rel_range;
> +	resource_size_t dpa, size;
> +
> +	if (!uuid_equal(id, &dc_event_uuid))
> +		return -EINVAL;
> +
> +	switch (record->data.event_type) {
> +	case ADD_CAPACITY:
> +		extent = devm_kzalloc(dev, sizeof(*extent), GFP_ATOMIC);
> +		if (!extent)
> +			return -ENOMEM;
> +
> +		extent->dpa_start = le64_to_cpu(record->data.extent.start_dpa);
> +		extent->length = le64_to_cpu(record->data.extent.length);
> +		memcpy(extent->tag, record->data.extent.tag,
> +				sizeof(record->data.extent.tag));

A general comment, the alignment of second line in this patch is 
inconsistent

> +		extent->shared_extent_seq =
> +			le16_to_cpu(record->data.extent.shared_extn_seq);
> +		dev_dbg(dev, "Add DC extent DPA:0x%llx LEN:%llx\n",
> +					extent->dpa_start, extent->length);
> +		alloc_range = (struct range) {
> +			.start = extent->dpa_start,
> +			.end = extent->dpa_start + extent->length - 1,
> +		};
> +
> +		rc = cxl_add_dc_extent(mds, &alloc_range);
> +		if (rc < 0) {
> +			dev_dbg(dev, "unconsumed DC extent DPA:0x%llx LEN:%llx\n",
> +					extent->dpa_start, extent->length);
> +			rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, NULL);
> +			if (rc < 0) {
> +				dev_err(dev, "Couldn't create extent list %d\n",
> +									rc);
> +				devm_kfree(dev, extent);
> +				return rc;
> +			}
> +
> +			rc = cxl_send_dc_cap_response(mds, dc_res,
> +					extent_cnt, CXL_MBOX_OP_ADD_DC_RESPONSE);
> +			if (rc < 0) {
> +				devm_kfree(dev, extent);
> +				goto out;

Please be consistent in direct return vs single path error out. Mixing 
makes it more difficult to read.

> +			}
> +
> +			kfree(dc_res);
> +			devm_kfree(dev, extent);
> +
> +			return 0;
> +		}
> +
> +		rc = xa_insert(&mds->dc_extent_list, extent->dpa_start, extent,
> +				GFP_KERNEL);
> +		if (rc < 0)
> +			goto out;
> +
> +		mds->num_dc_extents++;
> +		rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, &alloc_range);
> +		if (rc < 0) {
> +			dev_err(dev, "Couldn't create extent list %d\n", rc);
> +			return rc;
> +		}
> +
> +		rc = cxl_send_dc_cap_response(mds, dc_res, extent_cnt,
> +					      CXL_MBOX_OP_ADD_DC_RESPONSE);
> +		if (rc < 0)
> +			goto out;
> +
> +		break;
> +
> +	case RELEASE_CAPACITY:
> +		dpa = le64_to_cpu(record->data.extent.start_dpa);
> +		size = le64_to_cpu(record->data.extent.length);
> +		dev_dbg(dev, "Release DC extents DPA:0x%llx LEN:%llx\n",
> +				dpa, size);
> +		extent = xa_load(&mds->dc_extent_list, dpa);
> +		if (!extent) {
> +			dev_err(dev, "No extent found with DPA:0x%llx\n", dpa);
> +			return -EINVAL;
> +		}
> +
> +		rel_range = (struct range) {
> +			.start = dpa,
> +			.end = dpa + size - 1,
> +		};
> +
> +		rc = cxl_release_dc_extent(mds, &rel_range);
> +		if (rc < 0) {
> +			dev_dbg(dev, "withhold DC extent DPA:0x%llx LEN:%llx\n",
> +									dpa, size);
> +			return 0;
> +		}
> +
> +		xa_erase(&mds->dc_extent_list, dpa);
> +		devm_kfree(dev, extent);
> +		mds->num_dc_extents--;
> +		rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, &rel_range);
> +		if (rc < 0) {
> +			dev_err(dev, "Couldn't create extent list %d\n", rc);
> +			return rc;
> +		}
> +
> +		rc = cxl_send_dc_cap_response(mds, dc_res, extent_cnt,
> +					      CXL_MBOX_OP_RELEASE_DC);
> +		if (rc < 0)
> +			goto out;
> +
> +		break;
> +
> +	default:
> +		return -EINVAL;
> +	}
> +out:
> +	kfree(dc_res);
> +	return rc;
> +}
> +
>   static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
>   				    enum cxl_event_log_type type)
>   {
> @@ -982,9 +1180,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
>   		if (!nr_rec)
>   			break;
>   
> -		for (i = 0; i < nr_rec; i++)
> +		for (i = 0; i < nr_rec; i++) {
>   			cxl_event_trace_record(cxlmd, type,
>   					       &payload->records[i]);
> +			if (type == CXL_EVENT_TYPE_DCD) {
> +				rc = cxl_handle_dcd_event_records(mds,
> +						&payload->records[i]);
> +				if (rc)
> +					dev_err_ratelimited(dev,
> +						"dcd event failed: %d\n", rc);
> +			}
> +		}
>   
>   		if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
>   			trace_cxl_overflow(cxlmd, type, payload);
> @@ -1024,6 +1230,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
>   		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_WARN);
>   	if (status & CXLDEV_EVENT_STATUS_INFO)
>   		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_INFO);
> +	if (status & CXLDEV_EVENT_STATUS_DCD)
> +		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
>   }
>   EXPORT_SYMBOL_NS_GPL(cxl_mem_get_event_records, CXL);
>   
> @@ -1244,6 +1452,140 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
>   }
>   EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
>   
> +int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> +			      unsigned int *extent_gen_num)
> +{
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_mbox_dc_extents *dc_extents;
> +	struct cxl_mbox_get_dc_extent get_dc_extent;
> +	unsigned int total_extent_cnt;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	int rc;
> +
> +	/* Check GET_DC_EXTENT_LIST is supported by device */
> +	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
> +		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
> +		return 0;
> +	}
> +
> +	dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);
> +	if (!dc_extents)
> +		return -ENOMEM;
> +
> +	get_dc_extent = (struct cxl_mbox_get_dc_extent) {
> +		.extent_cnt = 0,
> +		.start_extent_index = 0,
> +	};
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> +		.payload_in = &get_dc_extent,
> +		.size_in = sizeof(get_dc_extent),
> +		.size_out = mds->payload_size,
> +		.payload_out = dc_extents,
> +		.min_out = 1,
> +	};
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	if (rc < 0)
> +		goto out;
> +
> +	total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> +	*extent_gen_num = le32_to_cpu(dc_extents->extent_list_num);
> +	dev_dbg(dev, "Total extent count :%d Extent list Generation Num: %d\n",
> +			total_extent_cnt, *extent_gen_num);
> +out:
> +
> +	kvfree(dc_extents);
> +	if (rc < 0)
> +		return rc;
> +
> +	return total_extent_cnt;
> +
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_get_dc_extent_cnt, CXL);
> +
> +int cxl_dev_get_dc_extents(struct cxl_memdev_state *mds,
> +			   unsigned int index, unsigned int cnt)
> +{
> +	/* See CXL 3.0 Table 125 dynamic capacity config  Output Payload */
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_mbox_dc_extents *dc_extents;
> +	struct cxl_mbox_get_dc_extent get_dc_extent;
> +	unsigned int extent_gen_num, available_extents, total_extent_cnt;
> +	int rc;
> +	struct cxl_dc_extent_data *extent;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	struct range alloc_range;
> +
> +	/* Check GET_DC_EXTENT_LIST is supported by device */
> +	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
> +		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
> +		return 0;
> +	}
> +
> +	dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);
> +	if (!dc_extents)
> +		return -ENOMEM;
> +	get_dc_extent = (struct cxl_mbox_get_dc_extent) {
> +		.extent_cnt = cnt,
> +		.start_extent_index = index,
> +	};
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> +		.payload_in = &get_dc_extent,
> +		.size_in = sizeof(get_dc_extent),
> +		.size_out = mds->payload_size,
> +		.payload_out = dc_extents,
> +		.min_out = 1,
> +	};
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	if (rc < 0)
> +		goto out;
> +
> +	available_extents = le32_to_cpu(dc_extents->ret_extent_cnt);
> +	total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> +	extent_gen_num = le32_to_cpu(dc_extents->extent_list_num);
> +	dev_dbg(dev, "No Total extent count :%d Extent list Generation Num:%d\n",
> +			total_extent_cnt, extent_gen_num);
> +
> +
> +	for (int i = 0; i < available_extents ; i++) {
> +		extent = devm_kzalloc(dev, sizeof(*extent), GFP_KERNEL);
> +		if (!extent) {
> +			rc = -ENOMEM;
> +			goto out;
> +		}
> +		extent->dpa_start = le64_to_cpu(dc_extents->extent[i].start_dpa);
> +		extent->length = le64_to_cpu(dc_extents->extent[i].length);
> +		memcpy(extent->tag, dc_extents->extent[i].tag,
> +					sizeof(dc_extents->extent[i].tag));
> +		extent->shared_extent_seq =
> +				le16_to_cpu(dc_extents->extent[i].shared_extn_seq);
> +		dev_dbg(dev, "dynamic capacity extent[%d] DPA:0x%llx LEN:%llx\n",
> +				i, extent->dpa_start, extent->length);
> +
> +		alloc_range = (struct range){
> +			.start = extent->dpa_start,
> +			.end = extent->dpa_start + extent->length - 1,
> +		};
> +
> +		rc = cxl_add_dc_extent(mds, &alloc_range);
> +		if (rc < 0)
> +			goto out;
> +		rc = xa_insert(&mds->dc_extent_list, extent->dpa_start, extent,
> +				GFP_KERNEL);
> +	}
> +
> +out:
> +	kvfree(dc_extents);
> +	if (rc < 0)
> +		return rc;
> +
> +	return available_extents;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_get_dc_extents, CXL);
> +
>   static int add_dpa_res(struct device *dev, struct resource *parent,
>   		       struct resource *res, resource_size_t start,
>   		       resource_size_t size, const char *type)
> @@ -1452,6 +1794,7 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
>   	mutex_init(&mds->event.log_lock);
>   	mds->cxlds.dev = dev;
>   	mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
> +	xa_init(&mds->dc_extent_list);
>   
>   	return mds;
>   }
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 144232c8305e..ba45c1c3b0a9 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1,6 +1,7 @@
>   // SPDX-License-Identifier: GPL-2.0-only
>   /* Copyright(c) 2022 Intel Corporation. All rights reserved. */
>   #include <linux/memregion.h>
> +#include <linux/interrupt.h>
>   #include <linux/genalloc.h>
>   #include <linux/device.h>
>   #include <linux/module.h>
> @@ -11,6 +12,8 @@
>   #include <cxlmem.h>
>   #include <cxl.h>
>   #include "core.h"
> +#include "../../dax/bus.h"
> +#include "../../dax/dax-private.h"
>   
>   /**
>    * DOC: cxl core region
> @@ -166,6 +169,38 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
>   	return 0;
>   }
>   
> +static int cxl_region_manage_dc(struct cxl_region *cxlr)
> +{
> +	struct cxl_region_params *p = &cxlr->params;
> +	unsigned int extent_gen_num;
> +	int i, rc;
> +
> +	/* Designed for Non Interleaving flow with the assumption one

comment need to go to next line
/*
  * comment
  */

> +	 * cxl_region will map the complete device DC region's DPA range
> +	 */
> +	for (i = 0; i < p->nr_targets; i++) {
> +		struct cxl_endpoint_decoder *cxled = p->targets[i];
> +		struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +		struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +
> +		rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
> +		if (rc < 0)
Need {} since else if does
> +			goto err;

I would just do return rc here. No need for the goto. and then else if() 
below would become if()

> +		else if (rc > 1) {
> +			rc = cxl_dev_get_dc_extents(mds, rc, 0);
> +			if (rc < 0)
> +				goto err;
> +			mds->num_dc_extents = rc;
> +			mds->dc_extents_index = rc - 1;
> +		}
> +		mds->dc_list_gen_num = extent_gen_num;
> +		dev_dbg(mds->cxlds.dev, "No of preallocated extents :%d\n", rc);
> +	}
> +	return 0;
> +err:
> +	return rc;
> +}
> +
>   static int commit_decoder(struct cxl_decoder *cxld)
>   {
>   	struct cxl_switch_decoder *cxlsd = NULL;
> @@ -2865,11 +2900,14 @@ static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
>   		return PTR_ERR(cxlr_dax);
>   
>   	cxlr_dc = kzalloc(sizeof(*cxlr_dc), GFP_KERNEL);
> -	if (!cxlr_dc) {
> -		rc = -ENOMEM;
> -		goto err;
> -	}
> +	if (!cxlr_dc)
> +		return -ENOMEM;
>   
> +	rc = request_module("dax_cxl");
> +	if (rc) {
> +		dev_err(dev, "failed to load dax-ctl module\n");
> +		goto load_err;
> +	}
>   	dev = &cxlr_dax->dev;
>   	rc = dev_set_name(dev, "dax_region%d", cxlr->id);
>   	if (rc)
> @@ -2891,10 +2929,24 @@ static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
>   	xa_init(&cxlr_dc->dax_dev_list);
>   	cxlr->cxlr_dc = cxlr_dc;
>   	rc = devm_add_action_or_reset(&cxlr->dev, cxl_dc_region_release, cxlr);
> -	if (!rc)
> -		return 0;
> +	if (rc)
> +		goto err;
> +
> +	if (!dev->driver) {
> +		dev_err(dev, "%s Driver not attached\n", dev_name(dev));
> +		rc = -ENXIO;
> +		goto err;
> +	}
> +
> +	rc = cxl_region_manage_dc(cxlr);
> +	if (rc)
> +		goto err;
> +
> +	return 0;
> +
>   err:
>   	put_device(dev);
> +load_err:
>   	kfree(cxlr_dc);
>   	return rc;
>   }
> @@ -3076,6 +3128,156 @@ struct cxl_region *cxl_create_region(struct cxl_root_decoder *cxlrd,
>   }
>   EXPORT_SYMBOL_NS_GPL(cxl_create_region, CXL);
>   
> +static int match_ep_decoder_by_range(struct device *dev, void *data)
> +{
> +	struct cxl_endpoint_decoder *cxled;
> +	struct range *dpa_range = data;
> +
> +	if (!is_endpoint_decoder(dev))
> +		return 0;
> +
> +	cxled = to_cxl_endpoint_decoder(dev);
> +	if (!cxled->cxld.region)
> +		return 0;
> +
> +	if (cxled->dpa_res->start <= dpa_range->start &&
> +				cxled->dpa_res->end >= dpa_range->end)
> +		return 1;
> +
> +	return 0;

Return the compare directly

> +}
> +
> +int cxl_release_dc_extent(struct cxl_memdev_state *mds,
> +			  struct range *rel_range)
> +{
> +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct cxl_dc_region *cxlr_dc;
> +	struct dax_region *dax_region;
> +	resource_size_t dpa_offset;
> +	struct cxl_region *cxlr;
> +	struct range hpa_range;
> +	struct dev_dax *dev_dax;
> +	resource_size_t hpa;
> +	struct device *dev;
> +	int ranges, rc = 0;
> +
> +	/*
> +	 * Find the cxl endpoind decoder with which has the extent dpa range and
> +	 * get the cxl_region, dax_region refrences.
> +	 */
> +	dev = device_find_child(&cxlmd->endpoint->dev, rel_range,
> +				match_ep_decoder_by_range);
> +	if (!dev) {
> +		dev_err(mds->cxlds.dev, "%pr not mapped\n", rel_range);
> +		return PTR_ERR(dev);

dev would be NULL and PTR_ERR() won't work. Maybe 'return -ENODEV'? 
device_find_child() doesn't return ERRPTR it seems.

> +	}
> +
> +	cxled = to_cxl_endpoint_decoder(dev);
> +	hpa_range = cxled->cxld.hpa_range;
> +	cxlr = cxled->cxld.region;
> +	cxlr_dc = cxlr->cxlr_dc;
> +
> +	/* DPA to HPA translation */
> +	if (cxled->cxld.interleave_ways == 1) {
> +		dpa_offset = rel_range->start - cxled->dpa_res->start;
> +		hpa = hpa_range.start + dpa_offset;
> +	} else {
> +		dev_err(mds->cxlds.dev, "Interleaving DC not supported\n");
> +		return -EINVAL;
> +	}

Check != 1 first and return. Then you don't need else.

> +
> +	dev_dax = xa_load(&cxlr_dc->dax_dev_list, hpa);
> +	if (!dev_dax)
> +		return -EINVAL;
> +
> +	dax_region = dev_dax->region;
> +	ranges = dev_dax->nr_range;
> +
> +	while (ranges) {
> +		int i = ranges - 1;
> +		struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
> +
> +		devm_release_action(dax_region->dev, unregister_dax_mapping,
> +								&mapping->dev);
> +		ranges--;
> +	}
> +
> +	dev_dbg(mds->cxlds.dev, "removing devdax device:%s\n",
> +						dev_name(&dev_dax->dev));
> +	devm_release_action(dax_region->dev, unregister_dev_dax,
> +							&dev_dax->dev);
> +	xa_erase(&cxlr_dc->dax_dev_list, hpa);
> +
> +	return rc;
> +}
> +
> +int cxl_add_dc_extent(struct cxl_memdev_state *mds, struct range *alloc_range)
> +{
> +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct cxl_dax_region *cxlr_dax;
> +	struct cxl_dc_region *cxlr_dc;
> +	struct dax_region *dax_region;
> +	resource_size_t dpa_offset;
> +	struct dev_dax_data data;
> +	struct dev_dax *dev_dax;
> +	struct cxl_region *cxlr;
> +	struct range hpa_range;
> +	resource_size_t hpa;
> +	struct device *dev;
> +	int rc;
> +
> +	/*
> +	 * Find the cxl endpoind decoder with which has the extent dpa range and
> +	 * get the cxl_region, dax_region refrences.
> +	 */
> +	dev = device_find_child(&cxlmd->endpoint->dev, alloc_range,
> +				match_ep_decoder_by_range);
> +	if (!dev) {
> +		dev_err(mds->cxlds.dev, "%pr not mapped\n",	alloc_range);

A lot of extra spaces after ','

> +		return PTR_ERR(dev);

Same comment as earlier.

> +	}
> +
> +	cxled = to_cxl_endpoint_decoder(dev);
> +	hpa_range = cxled->cxld.hpa_range;
> +	cxlr = cxled->cxld.region;
> +	cxlr_dc = cxlr->cxlr_dc;
> +	cxlr_dax = cxlr_dc->cxlr_dax;
> +	dax_region = dev_get_drvdata(&cxlr_dax->dev);
> +
> +	/* DPA to HPA translation */
> +	if (cxled->cxld.interleave_ways == 1) {
> +		dpa_offset = alloc_range->start - cxled->dpa_res->start;
> +		hpa = hpa_range.start + dpa_offset;
> +	} else {
> +		dev_err(mds->cxlds.dev, "Interleaving DC not supported\n");
> +		return -EINVAL;
> +	}

Check == 1 first and return. No need for else.


> +
> +	data = (struct dev_dax_data) {
> +		.dax_region = dax_region,
> +		.id = -1,

magic id number

> +		.size = 0,
> +	};
> +
> +	dev_dax = devm_create_dev_dax(&data);
> +	if (IS_ERR(dev_dax))
> +		return PTR_ERR(dev_dax);
> +
> +	if (IS_ALIGNED(range_len(alloc_range), max_t(unsigned long,
> +				dev_dax->align, memremap_compat_align()))) {
> +		rc = alloc_dev_dax_range(dev_dax, hpa,
> +					range_len(alloc_range));
> +		if (rc)
> +			return rc;
> +	}
> +
> +	rc = xa_insert(&cxlr_dc->dax_dev_list, hpa, dev_dax, GFP_KERNEL);
> +
> +	return rc;
> +}
> +
>   /* Establish an empty region covering the given HPA range */
>   static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
>   					   struct cxl_endpoint_decoder *cxled)
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index a0b5819bc70b..e11651255780 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -122,7 +122,8 @@ TRACE_EVENT(cxl_aer_correctable_error,
>   		{ CXL_EVENT_TYPE_INFO, "Informational" },	\
>   		{ CXL_EVENT_TYPE_WARN, "Warning" },		\
>   		{ CXL_EVENT_TYPE_FAIL, "Failure" },		\
> -		{ CXL_EVENT_TYPE_FATAL, "Fatal" })
> +		{ CXL_EVENT_TYPE_FATAL, "Fatal" },		\
> +		{ CXL_EVENT_TYPE_DCD, "DCD" })
>   
>   TRACE_EVENT(cxl_overflow,
>   
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 7ac1237938b7..60c436b7ebb1 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -163,11 +163,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
>   #define CXLDEV_EVENT_STATUS_WARN		BIT(1)
>   #define CXLDEV_EVENT_STATUS_FAIL		BIT(2)
>   #define CXLDEV_EVENT_STATUS_FATAL		BIT(3)
> +#define CXLDEV_EVENT_STATUS_DCD			BIT(4)
>   
>   #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO |	\
>   				 CXLDEV_EVENT_STATUS_WARN |	\
>   				 CXLDEV_EVENT_STATUS_FAIL |	\
> -				 CXLDEV_EVENT_STATUS_FATAL)
> +				 CXLDEV_EVENT_STATUS_FATAL|	\
> +				 CXLDEV_EVENT_STATUS_DCD)
>   
>   /* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
>   #define CXLDEV_EVENT_INT_MODE_MASK	GENMASK(1, 0)
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 9c0b2fa72bdd..0440b5c04ef6 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -5,6 +5,7 @@
>   #include <uapi/linux/cxl_mem.h>
>   #include <linux/cdev.h>
>   #include <linux/uuid.h>
> +#include <linux/xarray.h>
>   #include "cxl.h"
>   
>   /* CXL 2.0 8.2.8.5.1.1 Memory Device Status Register */
> @@ -226,6 +227,7 @@ struct cxl_event_interrupt_policy {
>   	u8 warn_settings;
>   	u8 failure_settings;
>   	u8 fatal_settings;
> +	u8 dyncap_settings;
>   } __packed;
>   
>   /**
> @@ -296,6 +298,13 @@ enum cxl_devtype {
>   #define CXL_MAX_DC_REGION 8
>   #define CXL_DC_REGION_SRTLEN 8
>   
> +struct cxl_dc_extent_data {
> +	u64 dpa_start;
> +	u64 length;
> +	u8 tag[16];
> +	u16 shared_extent_seq;
> +};
> +
>   /**
>    * struct cxl_dev_state - The driver device state
>    *
> @@ -406,6 +415,11 @@ struct cxl_memdev_state {
>   		u8 flags;
>   	} dc_region[CXL_MAX_DC_REGION];
>   
> +	u32 dc_list_gen_num;
> +	u32 dc_extents_index;
> +	struct xarray dc_extent_list;
> +	u32 num_dc_extents;
> +
>   	size_t dc_event_log_size;
>   	struct cxl_event_state event;
>   	struct cxl_poison_state poison;
> @@ -470,6 +484,17 @@ enum cxl_opcode {
>   	UUID_INIT(0xe1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19,     \
>   		  0x40, 0x3d, 0x86)
>   
> +
> +struct cxl_mbox_dc_response {
> +	__le32 extent_list_size;
> +	u8 reserved[4];
> +	struct updated_extent_list {
> +		__le64 dpa_start;
> +		__le64 length;
> +		u8 reserved[8];
> +	} __packed extent_list[];
> +} __packed;
> +
>   struct cxl_mbox_get_supported_logs {
>   	__le16 entries;
>   	u8 rsvd[6];
> @@ -555,6 +580,7 @@ enum cxl_event_log_type {
>   	CXL_EVENT_TYPE_WARN,
>   	CXL_EVENT_TYPE_FAIL,
>   	CXL_EVENT_TYPE_FATAL,
> +	CXL_EVENT_TYPE_DCD,
>   	CXL_EVENT_TYPE_MAX
>   };
>   
> @@ -639,6 +665,35 @@ struct cxl_event_mem_module {
>   	u8 reserved[0x3d];
>   } __packed;
>   
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.0 section 8.2.9.2.1.5; Table 8-47
> + */
> +
> +#define CXL_EVENT_DC_TAG_SIZE	0x10
> +struct cxl_dc_extent {
> +	__le64 start_dpa;
> +	__le64 length;
> +	u8 tag[CXL_EVENT_DC_TAG_SIZE];
> +	__le16 shared_extn_seq;
> +	u8 reserved[6];
> +} __packed;
> +
> +struct dcd_record_data {
> +	u8 event_type;
> +	u8 reserved;
> +	__le16 host_id;
> +	u8 region_index;
> +	u8 reserved1[3];
> +	struct cxl_dc_extent extent;
> +	u8 reserved2[32];
> +} __packed;
> +
> +struct dcd_event_dyn_cap {
> +	struct cxl_event_record_hdr hdr;
> +	struct dcd_record_data data;
> +} __packed;
> +
>   struct cxl_mbox_get_partition_info {
>   	__le64 active_volatile_cap;
>   	__le64 active_persistent_cap;
> @@ -684,6 +739,19 @@ struct cxl_mbox_dynamic_capacity {
>   #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
>   #define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
>   
> +struct cxl_mbox_get_dc_extent {
> +	__le32 extent_cnt;
> +	__le32 start_extent_index;
> +} __packed;
> +
> +struct cxl_mbox_dc_extents {
> +	__le32 ret_extent_cnt;
> +	__le32 total_extent_cnt;
> +	__le32 extent_list_num;
> +	u8 rsvd[4];
> +	struct cxl_dc_extent extent[];
> +}  __packed;
> +
>   /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
>   struct cxl_mbox_set_timestamp_in {
>   	__le64 timestamp;
> @@ -826,6 +894,14 @@ int cxl_trigger_poison_list(struct cxl_memdev *cxlmd);
>   int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa);
>   int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa);
>   
> +/* FIXME why not have these be static in mbox.c? */
> +int cxl_add_dc_extent(struct cxl_memdev_state *mds, struct range *alloc_range);
> +int cxl_release_dc_extent(struct cxl_memdev_state *mds, struct range *rel_range);
> +int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> +			      unsigned int *extent_gen_num);
> +int cxl_dev_get_dc_extents(struct cxl_memdev_state *mds, unsigned int cnt,
> +			   unsigned int index);
> +
>   #ifdef CONFIG_CXL_SUSPEND
>   void cxl_mem_active_inc(void);
>   void cxl_mem_active_dec(void);
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index ac1a41bc083d..558ffbcb9b34 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -522,8 +522,8 @@ static int cxl_event_req_irq(struct cxl_dev_state *cxlds, u8 setting)
>   		return irq;
>   
>   	return devm_request_threaded_irq(dev, irq, NULL, cxl_event_thread,
> -					 IRQF_SHARED | IRQF_ONESHOT, NULL,
> -					 dev_id);
> +					IRQF_SHARED | IRQF_ONESHOT, NULL,
> +					dev_id);

Stray edit?

>   }
>   
>   static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
> @@ -555,6 +555,7 @@ static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
>   		.warn_settings = CXL_INT_MSI_MSIX,
>   		.failure_settings = CXL_INT_MSI_MSIX,
>   		.fatal_settings = CXL_INT_MSI_MSIX,
> +		.dyncap_settings = CXL_INT_MSI_MSIX,
>   	};
>   
>   	mbox_cmd = (struct cxl_mbox_cmd) {
> @@ -608,6 +609,11 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds)
>   		return rc;
>   	}
>   
> +	rc = cxl_event_req_irq(cxlds, policy.dyncap_settings);
> +	if (rc) {
> +		dev_err(cxlds->dev, "Failed to get interrupt for event dc log\n");
> +		return rc;
> +	}
>   	return 0;
>   }
>   
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 227800053309..b2b27033f589 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -434,7 +434,7 @@ static void free_dev_dax_ranges(struct dev_dax *dev_dax)
>   		trim_dev_dax_range(dev_dax);
>   }
>   
> -static void unregister_dev_dax(void *dev)
> +void unregister_dev_dax(void *dev)
>   {
>   	struct dev_dax *dev_dax = to_dev_dax(dev);
>   
> @@ -445,6 +445,7 @@ static void unregister_dev_dax(void *dev)
>   	free_dev_dax_ranges(dev_dax);
>   	put_device(dev);
>   }
> +EXPORT_SYMBOL_GPL(unregister_dev_dax);
>   
>   /* a return value >= 0 indicates this invocation invalidated the id */
>   static int __free_dev_dax_id(struct dev_dax *dev_dax)
> @@ -641,7 +642,7 @@ static void dax_mapping_release(struct device *dev)
>   	kfree(mapping);
>   }
>   
> -static void unregister_dax_mapping(void *data)
> +void unregister_dax_mapping(void *data)
>   {
>   	struct device *dev = data;
>   	struct dax_mapping *mapping = to_dax_mapping(dev);
> @@ -658,7 +659,7 @@ static void unregister_dax_mapping(void *data)
>   	device_del(dev);
>   	put_device(dev);
>   }
> -
> +EXPORT_SYMBOL_GPL(unregister_dax_mapping);
>   static struct dev_dax_range *get_dax_range(struct device *dev)
>   {
>   	struct dax_mapping *mapping = to_dax_mapping(dev);
> @@ -793,7 +794,7 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
>   	return 0;
>   }
>   
> -static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
> +int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
>   		resource_size_t size)
>   {
>   	struct dax_region *dax_region = dev_dax->region;
> @@ -853,6 +854,8 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
>   
>   	return rc;
>   }
> +EXPORT_SYMBOL_GPL(alloc_dev_dax_range);
> +
>   
>   static int adjust_dev_dax_range(struct dev_dax *dev_dax, struct resource *res, resource_size_t size)
>   {

Split the dax bus changes out as a prep patch?

> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index 8cd79ab34292..aa8418c7aead 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -47,8 +47,11 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
>   	__dax_driver_register(driver, THIS_MODULE, KBUILD_MODNAME)
>   void dax_driver_unregister(struct dax_device_driver *dax_drv);
>   void kill_dev_dax(struct dev_dax *dev_dax);
> +void unregister_dev_dax(void *dev);
> +void unregister_dax_mapping(void *data);
>   bool static_dev_dax(struct dev_dax *dev_dax);
> -
> +int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
> +					resource_size_t size);
>   /*
>    * While run_dax() is potentially a generic operation that could be
>    * defined in include/linux/dax.h we don't want to grow any users
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 5/5] cxl/mem: Trace Dynamic capacity Event Record
  2023-06-14 19:16 ` [PATCH 5/5] cxl/mem: Trace Dynamic capacity Event Record ira.weiny
@ 2023-06-15 17:08   ` Dave Jiang
  0 siblings, 0 replies; 55+ messages in thread
From: Dave Jiang @ 2023-06-15 17:08 UTC (permalink / raw)
  To: ira.weiny, Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams,
	linux-cxl



On 6/14/23 12:16, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> CXL rev 3.0 section 8.2.9.2.1.5 defines the Dynamic Capacity Event Record
> Determine if the event read is a Dynamic capacity event record and
> if so trace the record for the debug purpose.
> 
> Add DC trace points to the trace log.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>

LGTM

> 
> ---
> [iweiny: fixups]
> [djbw: no sign-off: preview only]
> ---
>   drivers/cxl/core/mbox.c  |  5 ++++
>   drivers/cxl/core/trace.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 70 insertions(+)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index db9295216de5..802dacd09772 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -888,6 +888,11 @@ static void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>   				(struct cxl_event_mem_module *)record;
>   
>   		trace_cxl_memory_module(cxlmd, type, rec);
> +	} else if (uuid_equal(id, &dc_event_uuid)) {
> +		struct dcd_event_dyn_cap *rec =
> +				(struct dcd_event_dyn_cap *)record;
> +
> +		trace_cxl_dynamic_capacity(cxlmd, type, rec);
>   	} else {
>   		/* For unknown record types print just the header */
>   		trace_cxl_generic_event(cxlmd, type, record);
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index e11651255780..468c2c8b4347 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -704,6 +704,71 @@ TRACE_EVENT(cxl_poison,
>   	)
>   );
>   
> +/*
> + * DYNAMIC CAPACITY Event Record - DER
> + *
> + * CXL rev 3.0 section 8.2.9.2.1.5 Table 8-47
> + */
> +
> +#define CXL_DC_ADD_CAPACITY			0x00
> +#define CXL_DC_REL_CAPACITY			0x01
> +#define CXL_DC_FORCED_REL_CAPACITY		0x02
> +#define CXL_DC_REG_CONF_UPDATED			0x03
> +#define show_dc_evt_type(type)	__print_symbolic(type,		\
> +	{ CXL_DC_ADD_CAPACITY,	"Add capacity"},		\
> +	{ CXL_DC_REL_CAPACITY,	"Release capacity"},		\
> +	{ CXL_DC_FORCED_REL_CAPACITY,	"Forced capacity release"},	\
> +	{ CXL_DC_REG_CONF_UPDATED,	"Region Configuration Updated"	} \
> +)
> +
> +TRACE_EVENT(cxl_dynamic_capacity,
> +
> +	TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
> +		 struct dcd_event_dyn_cap  *rec),
> +
> +	TP_ARGS(cxlmd, log, rec),
> +
> +	TP_STRUCT__entry(
> +		CXL_EVT_TP_entry
> +
> +		/* Dynamic capacity Event */
> +		__field(u8, event_type)
> +		__field(u16, hostid)
> +		__field(u8, region_id)
> +		__field(u64, dpa_start)
> +		__field(u64, length)
> +		__array(u8, tag, CXL_EVENT_DC_TAG_SIZE)
> +		__field(u16, sh_extent_seq)
> +	),
> +
> +	TP_fast_assign(
> +		CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
> +
> +		/* Dynamic_capacity Event */
> +		__entry->event_type = rec->data.event_type;
> +
> +		/* DCD event record data */
> +		__entry->hostid = le16_to_cpu(rec->data.host_id);
> +		__entry->region_id = rec->data.region_index;
> +		__entry->dpa_start = le64_to_cpu(rec->data.extent.start_dpa);
> +		__entry->length = le64_to_cpu(rec->data.extent.length);
> +		memcpy(__entry->tag, &rec->data.extent.tag, CXL_EVENT_DC_TAG_SIZE);
> +		__entry->sh_extent_seq = le16_to_cpu(rec->data.extent.shared_extn_seq);
> +	),
> +
> +	CXL_EVT_TP_printk("event_type='%s' host_id='%d' region_id='%d' " \
> +		"starting_dpa=%llx length=%llx tag=%s " \
> +		"shared_extent_sequence=%d",
> +		show_dc_evt_type(__entry->event_type),
> +		__entry->hostid,
> +		__entry->region_id,
> +		__entry->dpa_start,
> +		__entry->length,
> +		__print_hex(__entry->tag, CXL_EVENT_DC_TAG_SIZE),
> +		__entry->sh_extent_seq
> +	)
> +);
> +
>   #endif /* _CXL_EVENTS_H */
>   
>   #define TRACE_INCLUDE_FILE trace
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support.
  2023-06-14 23:37   ` Dave Jiang
@ 2023-06-15 18:12     ` Ira Weiny
  2023-06-15 18:28       ` Dave Jiang
  2023-06-15 18:56       ` Navneet Singh
  0 siblings, 2 replies; 55+ messages in thread
From: Ira Weiny @ 2023-06-15 18:12 UTC (permalink / raw)
  To: Dave Jiang, ira.weiny, Navneet Singh, Fan Ni, Jonathan Cameron,
	Dan Williams, linux-cxl

Dave Jiang wrote:
> 
> 
> On 6/14/23 12:16, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > CXL devices optionally support dynamic capacity. CXL Regions must be
> > created to access this capacity.
> > 
> > Add sysfs entries to create dynamic capacity cxl regions. Provide a new
> > Dynamic Capacity decoder mode which targets dynamic capacity on devices
> > which are added to that region.
> > 
> > Below are the steps to create and delete dynamic capacity region0
> > (example).
> > 
> >      region=$(cat /sys/bus/cxl/devices/decoder0.0/create_dc_region)
> >      echo $region> /sys/bus/cxl/devices/decoder0.0/create_dc_region
> >      echo 256 > /sys/bus/cxl/devices/$region/interleave_granularity
> >      echo 1 > /sys/bus/cxl/devices/$region/interleave_ways
> > 
> >      echo "dc0" >/sys/bus/cxl/devices/decoder1.0/mode
> >      echo 0x400000000 >/sys/bus/cxl/devices/decoder1.0/dpa_size
> > 
> >      echo 0x400000000 > /sys/bus/cxl/devices/$region/size
> >      echo  "decoder1.0" > /sys/bus/cxl/devices/$region/target0
> >      echo 1 > /sys/bus/cxl/devices/$region/commit
> >      echo $region > /sys/bus/cxl/drivers/cxl_region/bind
> > 
> >      echo $region> /sys/bus/cxl/devices/decoder0.0/delete_region
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > 
> > ---
> > [iweiny: fixups]
> > [iweiny: remove unused CXL_DC_REGION_MODE macro]
> > [iweiny: Make dc_mode_to_region_index static]
> > [iweiny: simplify <sysfs>/create_dc_region]
> > [iweiny: introduce decoder_mode_is_dc]
> > [djbw: fixups, no sign-off: preview only]
> > ---
> >   drivers/cxl/Kconfig       |  11 +++
> >   drivers/cxl/core/core.h   |   7 ++
> >   drivers/cxl/core/hdm.c    | 234 ++++++++++++++++++++++++++++++++++++++++++----
> >   drivers/cxl/core/port.c   |  18 ++++
> >   drivers/cxl/core/region.c | 135 ++++++++++++++++++++++++--
> >   drivers/cxl/cxl.h         |  28 ++++++
> >   drivers/dax/cxl.c         |   4 +
> >   7 files changed, 409 insertions(+), 28 deletions(-)
> > 
> > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > index ff4e78117b31..df034889d053 100644
> > --- a/drivers/cxl/Kconfig
> > +++ b/drivers/cxl/Kconfig
> > @@ -121,6 +121,17 @@ config CXL_REGION
> >   
> >   	  If unsure say 'y'
> >   
> > +config CXL_DCD
> > +	bool "CXL: DCD Support"
> > +	default CXL_BUS
> > +	depends on CXL_REGION
> > +	help
> > +	  Enable the CXL core to provision CXL DCD regions.
> > +	  CXL devices optionally support dynamic capacity and DCD region
> > +	  maps the dynamic capacity regions DPA's into Host HPA ranges.
> > +
> > +	  If unsure say 'y'
> > +
> >   config CXL_REGION_INVALIDATION_TEST
> >   	bool "CXL: Region Cache Management Bypass (TEST)"
> >   	depends on CXL_REGION
> > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> > index 27f0968449de..725700ab5973 100644
> > --- a/drivers/cxl/core/core.h
> > +++ b/drivers/cxl/core/core.h
> > @@ -9,6 +9,13 @@ extern const struct device_type cxl_nvdimm_type;
> >   
> >   extern struct attribute_group cxl_base_attribute_group;
> >   
> > +#ifdef CONFIG_CXL_DCD
> > +extern struct device_attribute dev_attr_create_dc_region;
> > +#define SET_CXL_DC_REGION_ATTR(x) (&dev_attr_##x.attr),
> > +#else
> > +#define SET_CXL_DC_REGION_ATTR(x)
> > +#endif
> > +
> >   #ifdef CONFIG_CXL_REGION
> >   extern struct device_attribute dev_attr_create_pmem_region;
> >   extern struct device_attribute dev_attr_create_ram_region;
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index 514d30131d92..29649b47d177 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -233,14 +233,23 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> >   	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> >   	struct resource *res = cxled->dpa_res;
> >   	resource_size_t skip_start;
> > +	resource_size_t skipped = cxled->skip;
> >   
> >   	lockdep_assert_held_write(&cxl_dpa_rwsem);
> >   
> >   	/* save @skip_start, before @res is released */
> > -	skip_start = res->start - cxled->skip;
> > +	skip_start = res->start - skipped;
> >   	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
> > -	if (cxled->skip)
> > -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> > +	if (cxled->skip != 0) {
> > +		while (skipped != 0) {
> > +			res = xa_load(&cxled->skip_res, skip_start);
> > +			__release_region(&cxlds->dpa_res, skip_start,
> > +							resource_size(res));
> > +			xa_erase(&cxled->skip_res, skip_start);
> > +			skip_start += resource_size(res);
> > +			skipped -= resource_size(res);
> > +			}
> > +	}
> >   	cxled->skip = 0;
> >   	cxled->dpa_res = NULL;
> >   	put_device(&cxled->cxld.dev);
> > @@ -267,6 +276,19 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> >   	__cxl_dpa_release(cxled);
> >   }
> >   
> > +static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
> > +{
> > +	int index = 0;
> > +
> > +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> > +		if (mode == i)
> > +			return index;
> > +		index++;
> > +	}
> > +
> > +	return -EINVAL;
> > +}
> > +
> >   static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> >   			     resource_size_t base, resource_size_t len,
> >   			     resource_size_t skipped)
> > @@ -275,7 +297,11 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> >   	struct cxl_port *port = cxled_to_port(cxled);
> >   	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> >   	struct device *dev = &port->dev;
> > +	struct device *ed_dev = &cxled->cxld.dev;
> > +	struct resource *dpa_res = &cxlds->dpa_res;
> > +	resource_size_t skip_len = 0;
> >   	struct resource *res;
> > +	int rc, index;
> >   
> >   	lockdep_assert_held_write(&cxl_dpa_rwsem);
> >   
> > @@ -304,28 +330,119 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> >   	}
> >   
> >   	if (skipped) {
> > -		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> > -				       dev_name(&cxled->cxld.dev), 0);
> > -		if (!res) {
> > -			dev_dbg(dev,
> > -				"decoder%d.%d: failed to reserve skipped space\n",
> > -				port->id, cxled->cxld.id);
> > -			return -EBUSY;
> > +		resource_size_t skip_base = base - skipped;
> > +
> > +		if (decoder_mode_is_dc(cxled->mode)) {
> 
> Maybe move this entire block to a helper function to reduce the size of 
> the current function and reduce indent levels and improve readability?

:-/

I'll work on breaking it out more.  The logic here is getting kind of
crazy.

> 
> > +			if (resource_size(&cxlds->ram_res) &&
> > +					skip_base <= cxlds->ram_res.end) {
> > +				skip_len = cxlds->ram_res.end - skip_base + 1;
> > +				res = __request_region(dpa_res, skip_base,
> > +						skip_len, dev_name(ed_dev), 0);
> > +				if (!res)
> > +					goto error;
> > +
> > +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> > +								GFP_KERNEL);
> > +				skip_base += skip_len;
> > +			}
> > +
> > +			if (resource_size(&cxlds->ram_res) &&
                                                  ^^^^^^^
						  pmem_res?

> > +					skip_base <= cxlds->pmem_res.end) {

The 2 if statements here are almost exactly the same.  To the point I
wonder if there is a bug.

Navneet,

Why does the code check ram_res the second time but go on to use pmem_res
in the block?

> > +				skip_len = cxlds->pmem_res.end - skip_base + 1;
> > +				res = __request_region(dpa_res, skip_base,
> > +						skip_len, dev_name(ed_dev), 0);
> > +				if (!res)
> > +					goto error;
> > +
> > +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> > +								GFP_KERNEL);
> > +				skip_base += skip_len;
> > +			}
> > +
> > +			index = dc_mode_to_region_index(cxled->mode);
> > +			for (int i = 0; i <= index; i++) {
> > +				struct resource *dcr = &cxlds->dc_res[i];
> > +
> > +				if (skip_base < dcr->start) {
> > +					skip_len = dcr->start - skip_base;
> > +					res = __request_region(dpa_res,
> > +							skip_base, skip_len,
> > +							dev_name(ed_dev), 0);
> > +					if (!res)
> > +						goto error;
> > +
> > +					rc = xa_insert(&cxled->skip_res, skip_base,
> > +							res, GFP_KERNEL);
> > +					skip_base += skip_len;
> > +				}
> > +
> > +				if (skip_base == base) {
> > +					dev_dbg(dev, "skip done!\n");
> > +					break;
> > +				}
> > +
> > +				if (resource_size(dcr) &&
> > +						skip_base <= dcr->end) {
> > +					if (skip_base > base)
> > +						dev_err(dev, "Skip error\n");
> > +
> > +					skip_len = dcr->end - skip_base + 1;
> > +					res = __request_region(dpa_res, skip_base,
> > +							skip_len,
> > +							dev_name(ed_dev), 0);
> > +					if (!res)
> > +						goto error;
> > +
> > +					rc = xa_insert(&cxled->skip_res, skip_base,
> > +							res, GFP_KERNEL);
> > +					skip_base += skip_len;
> > +				}
> > +			}
> > +		} else	{
> > +			res = __request_region(dpa_res, base - skipped, skipped,
> > +							dev_name(ed_dev), 0);
> > +			if (!res)
> > +				goto error;
> > +
> > +			rc = xa_insert(&cxled->skip_res, skip_base, res,
> > +								GFP_KERNEL);
> >   		}
> >   	}
> > -	res = __request_region(&cxlds->dpa_res, base, len,
> > -			       dev_name(&cxled->cxld.dev), 0);
> > +
> > +	res = __request_region(dpa_res, base, len, dev_name(ed_dev), 0);
> >   	if (!res) {
> >   		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
> > -			port->id, cxled->cxld.id);
> > -		if (skipped)
> > -			__release_region(&cxlds->dpa_res, base - skipped,
> > -					 skipped);
> > +				port->id, cxled->cxld.id);
> > +		if (skipped) {
> > +			resource_size_t skip_base = base - skipped;
> > +
> > +			while (skipped != 0) {
> > +				if (skip_base > base)
> > +					dev_err(dev, "Skip error\n");
> > +
> > +				res = xa_load(&cxled->skip_res, skip_base);
> > +				__release_region(dpa_res, skip_base,
> > +							resource_size(res));
> > +				xa_erase(&cxled->skip_res, skip_base);
> > +				skip_base += resource_size(res);
> > +				skipped -= resource_size(res);
> > +			}
> > +		}
> >   		return -EBUSY;
> >   	}
> >   	cxled->dpa_res = res;
> >   	cxled->skip = skipped;
> >   
> > +	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
> > +		int index = dc_mode_to_region_index(mode);
> > +
> > +		if (resource_contains(&cxlds->dc_res[index], res)) {
> > +			cxled->mode = mode;
> > +			dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id,
> > +				cxled->cxld.id, cxled->dpa_res, cxled->mode);
> > +			goto success > +		}
> > +	}
> 
> This block should only happen if decoder_mode_is_dc() right? If that's 
> the case, you might be able to refactor it so the 'goto success' isn't 
> necessary.

I'll check.  I looked through this code a couple of times in my review
before posting because I'm not 100% sure I want to see 8 different modes
DC decoders and regions.

I think the 'mode' should be 'DC' with an index in the endpoint decoder to
map DC region that decoder is mapping.  But that change was much bigger to
Navneets code and I wanted to see how others felt about having DC0 - DC7
modes.  My compromise was creating decoder_mode_is_dc().

> 
> >   	if (resource_contains(&cxlds->pmem_res, res))
> >   		cxled->mode = CXL_DECODER_PMEM;
> >   	else if (resource_contains(&cxlds->ram_res, res))
> > @@ -336,9 +453,16 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> >   		cxled->mode = CXL_DECODER_MIXED;
> >   	}
> >   
> > +success:
> >   	port->hdm_end++;
> >   	get_device(&cxled->cxld.dev);
> >   	return 0;
> > +
> > +error:
> > +	dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space\n",
> > +			port->id, cxled->cxld.id);
> > +	return -EBUSY;
> > +
> >   }
> >   
> >   int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> > @@ -429,6 +553,14 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
> >   	switch (mode) {
> >   	case CXL_DECODER_RAM:
> >   	case CXL_DECODER_PMEM:
> > +	case CXL_DECODER_DC0:
> > +	case CXL_DECODER_DC1:
> > +	case CXL_DECODER_DC2:
> > +	case CXL_DECODER_DC3:
> > +	case CXL_DECODER_DC4:
> > +	case CXL_DECODER_DC5:
> > +	case CXL_DECODER_DC6:
> > +	case CXL_DECODER_DC7:

For example this seems very hacky...

[snip]

> >   
> > +/*
> > + * The region can not be manged by CXL if any portion of
> > + * it is already online as 'System RAM'
> > + */
> > +static bool region_is_system_ram(struct cxl_region *cxlr,
> > +				 struct cxl_region_params *p)
> > +{
> > +	return (walk_iomem_res_desc(IORES_DESC_NONE,
> > +				    IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> > +				    p->res->start, p->res->end, cxlr,
> > +				    is_system_ram) > 0);
> > +}
> > +
> >   static int cxl_region_probe(struct device *dev)
> >   {
> >   	struct cxl_region *cxlr = to_cxl_region(dev);
> > @@ -3174,14 +3287,7 @@ static int cxl_region_probe(struct device *dev)
> >   	case CXL_DECODER_PMEM:
> >   		return devm_cxl_add_pmem_region(cxlr);
> >   	case CXL_DECODER_RAM:
> > -		/*
> > -		 * The region can not be manged by CXL if any portion of
> > -		 * it is already online as 'System RAM'
> > -		 */
> > -		if (walk_iomem_res_desc(IORES_DESC_NONE,
> > -					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> > -					p->res->start, p->res->end, cxlr,
> > -					is_system_ram) > 0)
> > +		if (region_is_system_ram(cxlr, p))
> 
> Maybe split this change out as a prep patch before the current patch.

That seems reasonable.  But the patch is not so large and the
justification for creating a helper is that we need this same check for DC
regions.  So it seemed ok to leave it like this.  Let me see about
splitting it out.

[snip]

> > diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> > index ccdf8de85bd5..eb5eb81bfbd7 100644
> > --- a/drivers/dax/cxl.c
> > +++ b/drivers/dax/cxl.c
> > @@ -23,11 +23,15 @@ static int cxl_dax_region_probe(struct device *dev)
> >   	if (!dax_region)
> >   		return -ENOMEM;
> >   
> > +	if (decoder_mode_is_dc(cxlr->mode))
> > +		return 0;
> > +
> >   	data = (struct dev_dax_data) {
> >   		.dax_region = dax_region,
> >   		.id = -1,
> >   		.size = range_len(&cxlr_dax->hpa_range),
> >   	};
> > +
> 
> Stray blank line?

Opps!  Fixed!

Ira

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support.
  2023-06-15 18:12     ` Ira Weiny
@ 2023-06-15 18:28       ` Dave Jiang
  2023-06-16  3:52         ` Navneet Singh
  2023-06-15 18:56       ` Navneet Singh
  1 sibling, 1 reply; 55+ messages in thread
From: Dave Jiang @ 2023-06-15 18:28 UTC (permalink / raw)
  To: Ira Weiny, Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams,
	linux-cxl



On 6/15/23 11:12, Ira Weiny wrote:
> Dave Jiang wrote:
>>
>>
>> On 6/14/23 12:16, ira.weiny@intel.com wrote:
>>> From: Navneet Singh <navneet.singh@intel.com>
>>>
>>> CXL devices optionally support dynamic capacity. CXL Regions must be
>>> created to access this capacity.
>>>
>>> Add sysfs entries to create dynamic capacity cxl regions. Provide a new
>>> Dynamic Capacity decoder mode which targets dynamic capacity on devices
>>> which are added to that region.
>>>
>>> Below are the steps to create and delete dynamic capacity region0
>>> (example).
>>>
>>>       region=$(cat /sys/bus/cxl/devices/decoder0.0/create_dc_region)
>>>       echo $region> /sys/bus/cxl/devices/decoder0.0/create_dc_region
>>>       echo 256 > /sys/bus/cxl/devices/$region/interleave_granularity
>>>       echo 1 > /sys/bus/cxl/devices/$region/interleave_ways
>>>
>>>       echo "dc0" >/sys/bus/cxl/devices/decoder1.0/mode
>>>       echo 0x400000000 >/sys/bus/cxl/devices/decoder1.0/dpa_size
>>>
>>>       echo 0x400000000 > /sys/bus/cxl/devices/$region/size
>>>       echo  "decoder1.0" > /sys/bus/cxl/devices/$region/target0
>>>       echo 1 > /sys/bus/cxl/devices/$region/commit
>>>       echo $region > /sys/bus/cxl/drivers/cxl_region/bind
>>>
>>>       echo $region> /sys/bus/cxl/devices/decoder0.0/delete_region
>>>
>>> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
>>>
>>> ---
>>> [iweiny: fixups]
>>> [iweiny: remove unused CXL_DC_REGION_MODE macro]
>>> [iweiny: Make dc_mode_to_region_index static]
>>> [iweiny: simplify <sysfs>/create_dc_region]
>>> [iweiny: introduce decoder_mode_is_dc]
>>> [djbw: fixups, no sign-off: preview only]
>>> ---
>>>    drivers/cxl/Kconfig       |  11 +++
>>>    drivers/cxl/core/core.h   |   7 ++
>>>    drivers/cxl/core/hdm.c    | 234 ++++++++++++++++++++++++++++++++++++++++++----
>>>    drivers/cxl/core/port.c   |  18 ++++
>>>    drivers/cxl/core/region.c | 135 ++++++++++++++++++++++++--
>>>    drivers/cxl/cxl.h         |  28 ++++++
>>>    drivers/dax/cxl.c         |   4 +
>>>    7 files changed, 409 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
>>> index ff4e78117b31..df034889d053 100644
>>> --- a/drivers/cxl/Kconfig
>>> +++ b/drivers/cxl/Kconfig
>>> @@ -121,6 +121,17 @@ config CXL_REGION
>>>    
>>>    	  If unsure say 'y'
>>>    
>>> +config CXL_DCD
>>> +	bool "CXL: DCD Support"
>>> +	default CXL_BUS
>>> +	depends on CXL_REGION
>>> +	help
>>> +	  Enable the CXL core to provision CXL DCD regions.
>>> +	  CXL devices optionally support dynamic capacity and DCD region
>>> +	  maps the dynamic capacity regions DPA's into Host HPA ranges.
>>> +
>>> +	  If unsure say 'y'
>>> +
>>>    config CXL_REGION_INVALIDATION_TEST
>>>    	bool "CXL: Region Cache Management Bypass (TEST)"
>>>    	depends on CXL_REGION
>>> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
>>> index 27f0968449de..725700ab5973 100644
>>> --- a/drivers/cxl/core/core.h
>>> +++ b/drivers/cxl/core/core.h
>>> @@ -9,6 +9,13 @@ extern const struct device_type cxl_nvdimm_type;
>>>    
>>>    extern struct attribute_group cxl_base_attribute_group;
>>>    
>>> +#ifdef CONFIG_CXL_DCD
>>> +extern struct device_attribute dev_attr_create_dc_region;
>>> +#define SET_CXL_DC_REGION_ATTR(x) (&dev_attr_##x.attr),
>>> +#else
>>> +#define SET_CXL_DC_REGION_ATTR(x)
>>> +#endif
>>> +
>>>    #ifdef CONFIG_CXL_REGION
>>>    extern struct device_attribute dev_attr_create_pmem_region;
>>>    extern struct device_attribute dev_attr_create_ram_region;
>>> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
>>> index 514d30131d92..29649b47d177 100644
>>> --- a/drivers/cxl/core/hdm.c
>>> +++ b/drivers/cxl/core/hdm.c
>>> @@ -233,14 +233,23 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>>>    	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>>>    	struct resource *res = cxled->dpa_res;
>>>    	resource_size_t skip_start;
>>> +	resource_size_t skipped = cxled->skip;
>>>    
>>>    	lockdep_assert_held_write(&cxl_dpa_rwsem);
>>>    
>>>    	/* save @skip_start, before @res is released */
>>> -	skip_start = res->start - cxled->skip;
>>> +	skip_start = res->start - skipped;
>>>    	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
>>> -	if (cxled->skip)
>>> -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
>>> +	if (cxled->skip != 0) {
>>> +		while (skipped != 0) {
>>> +			res = xa_load(&cxled->skip_res, skip_start);
>>> +			__release_region(&cxlds->dpa_res, skip_start,
>>> +							resource_size(res));
>>> +			xa_erase(&cxled->skip_res, skip_start);
>>> +			skip_start += resource_size(res);
>>> +			skipped -= resource_size(res);
>>> +			}
>>> +	}
>>>    	cxled->skip = 0;
>>>    	cxled->dpa_res = NULL;
>>>    	put_device(&cxled->cxld.dev);
>>> @@ -267,6 +276,19 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>>>    	__cxl_dpa_release(cxled);
>>>    }
>>>    
>>> +static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
>>> +{
>>> +	int index = 0;
>>> +
>>> +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
>>> +		if (mode == i)
>>> +			return index;
>>> +		index++;
>>> +	}
>>> +
>>> +	return -EINVAL;
>>> +}
>>> +
>>>    static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>>>    			     resource_size_t base, resource_size_t len,
>>>    			     resource_size_t skipped)
>>> @@ -275,7 +297,11 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>>>    	struct cxl_port *port = cxled_to_port(cxled);
>>>    	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>>>    	struct device *dev = &port->dev;
>>> +	struct device *ed_dev = &cxled->cxld.dev;
>>> +	struct resource *dpa_res = &cxlds->dpa_res;
>>> +	resource_size_t skip_len = 0;
>>>    	struct resource *res;
>>> +	int rc, index;
>>>    
>>>    	lockdep_assert_held_write(&cxl_dpa_rwsem);
>>>    
>>> @@ -304,28 +330,119 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>>>    	}
>>>    
>>>    	if (skipped) {
>>> -		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
>>> -				       dev_name(&cxled->cxld.dev), 0);
>>> -		if (!res) {
>>> -			dev_dbg(dev,
>>> -				"decoder%d.%d: failed to reserve skipped space\n",
>>> -				port->id, cxled->cxld.id);
>>> -			return -EBUSY;
>>> +		resource_size_t skip_base = base - skipped;
>>> +
>>> +		if (decoder_mode_is_dc(cxled->mode)) {
>>
>> Maybe move this entire block to a helper function to reduce the size of
>> the current function and reduce indent levels and improve readability?
> 
> :-/
> 
> I'll work on breaking it out more.  The logic here is getting kind of
> crazy.
> 
>>
>>> +			if (resource_size(&cxlds->ram_res) &&
>>> +					skip_base <= cxlds->ram_res.end) {
>>> +				skip_len = cxlds->ram_res.end - skip_base + 1;
>>> +				res = __request_region(dpa_res, skip_base,
>>> +						skip_len, dev_name(ed_dev), 0);
>>> +				if (!res)
>>> +					goto error;
>>> +
>>> +				rc = xa_insert(&cxled->skip_res, skip_base, res,
>>> +								GFP_KERNEL);
>>> +				skip_base += skip_len;
>>> +			}
>>> +
>>> +			if (resource_size(&cxlds->ram_res) &&
>                                                    ^^^^^^^
> 						  pmem_res?
> 
>>> +					skip_base <= cxlds->pmem_res.end) {
> 
> The 2 if statements here are almost exactly the same.  To the point I
> wonder if there is a bug.
> 
> Navneet,
> 
> Why does the code check ram_res the second time but go on to use pmem_res
> in the block?
> 
>>> +				skip_len = cxlds->pmem_res.end - skip_base + 1;
>>> +				res = __request_region(dpa_res, skip_base,
>>> +						skip_len, dev_name(ed_dev), 0);
>>> +				if (!res)
>>> +					goto error;
>>> +
>>> +				rc = xa_insert(&cxled->skip_res, skip_base, res,
>>> +								GFP_KERNEL);
>>> +				skip_base += skip_len;
>>> +			}
>>> +
>>> +			index = dc_mode_to_region_index(cxled->mode);
>>> +			for (int i = 0; i <= index; i++) {
>>> +				struct resource *dcr = &cxlds->dc_res[i];
>>> +
>>> +				if (skip_base < dcr->start) {
>>> +					skip_len = dcr->start - skip_base;
>>> +					res = __request_region(dpa_res,
>>> +							skip_base, skip_len,
>>> +							dev_name(ed_dev), 0);
>>> +					if (!res)
>>> +						goto error;
>>> +
>>> +					rc = xa_insert(&cxled->skip_res, skip_base,
>>> +							res, GFP_KERNEL);
>>> +					skip_base += skip_len;
>>> +				}
>>> +
>>> +				if (skip_base == base) {
>>> +					dev_dbg(dev, "skip done!\n");
>>> +					break;
>>> +				}
>>> +
>>> +				if (resource_size(dcr) &&
>>> +						skip_base <= dcr->end) {
>>> +					if (skip_base > base)
>>> +						dev_err(dev, "Skip error\n");
>>> +
>>> +					skip_len = dcr->end - skip_base + 1;
>>> +					res = __request_region(dpa_res, skip_base,
>>> +							skip_len,
>>> +							dev_name(ed_dev), 0);
>>> +					if (!res)
>>> +						goto error;
>>> +
>>> +					rc = xa_insert(&cxled->skip_res, skip_base,
>>> +							res, GFP_KERNEL);
>>> +					skip_base += skip_len;
>>> +				}
>>> +			}
>>> +		} else	{
>>> +			res = __request_region(dpa_res, base - skipped, skipped,
>>> +							dev_name(ed_dev), 0);
>>> +			if (!res)
>>> +				goto error;
>>> +
>>> +			rc = xa_insert(&cxled->skip_res, skip_base, res,
>>> +								GFP_KERNEL);
>>>    		}
>>>    	}
>>> -	res = __request_region(&cxlds->dpa_res, base, len,
>>> -			       dev_name(&cxled->cxld.dev), 0);
>>> +
>>> +	res = __request_region(dpa_res, base, len, dev_name(ed_dev), 0);
>>>    	if (!res) {
>>>    		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
>>> -			port->id, cxled->cxld.id);
>>> -		if (skipped)
>>> -			__release_region(&cxlds->dpa_res, base - skipped,
>>> -					 skipped);
>>> +				port->id, cxled->cxld.id);
>>> +		if (skipped) {
>>> +			resource_size_t skip_base = base - skipped;
>>> +
>>> +			while (skipped != 0) {
>>> +				if (skip_base > base)
>>> +					dev_err(dev, "Skip error\n");
>>> +
>>> +				res = xa_load(&cxled->skip_res, skip_base);
>>> +				__release_region(dpa_res, skip_base,
>>> +							resource_size(res));
>>> +				xa_erase(&cxled->skip_res, skip_base);
>>> +				skip_base += resource_size(res);
>>> +				skipped -= resource_size(res);
>>> +			}
>>> +		}
>>>    		return -EBUSY;
>>>    	}
>>>    	cxled->dpa_res = res;
>>>    	cxled->skip = skipped;
>>>    
>>> +	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
>>> +		int index = dc_mode_to_region_index(mode);
>>> +
>>> +		if (resource_contains(&cxlds->dc_res[index], res)) {
>>> +			cxled->mode = mode;
>>> +			dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id,
>>> +				cxled->cxld.id, cxled->dpa_res, cxled->mode);
>>> +			goto success > +		}
>>> +	}
>>
>> This block should only happen if decoder_mode_is_dc() right? If that's
>> the case, you might be able to refactor it so the 'goto success' isn't
>> necessary.
> 
> I'll check.  I looked through this code a couple of times in my review
> before posting because I'm not 100% sure I want to see 8 different modes
> DC decoders and regions.
> 
> I think the 'mode' should be 'DC' with an index in the endpoint decoder to
> map DC region that decoder is mapping.  But that change was much bigger to
> Navneets code and I wanted to see how others felt about having DC0 - DC7
> modes.  My compromise was creating decoder_mode_is_dc().
> 
>>
>>>    	if (resource_contains(&cxlds->pmem_res, res))
>>>    		cxled->mode = CXL_DECODER_PMEM;
>>>    	else if (resource_contains(&cxlds->ram_res, res))
>>> @@ -336,9 +453,16 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>>>    		cxled->mode = CXL_DECODER_MIXED;
>>>    	}
>>>    
>>> +success:
>>>    	port->hdm_end++;
>>>    	get_device(&cxled->cxld.dev);
>>>    	return 0;
>>> +
>>> +error:
>>> +	dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space\n",
>>> +			port->id, cxled->cxld.id);
>>> +	return -EBUSY;
>>> +
>>>    }
>>>    
>>>    int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>>> @@ -429,6 +553,14 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>>>    	switch (mode) {
>>>    	case CXL_DECODER_RAM:
>>>    	case CXL_DECODER_PMEM:
>>> +	case CXL_DECODER_DC0:
>>> +	case CXL_DECODER_DC1:
>>> +	case CXL_DECODER_DC2:
>>> +	case CXL_DECODER_DC3:
>>> +	case CXL_DECODER_DC4:
>>> +	case CXL_DECODER_DC5:
>>> +	case CXL_DECODER_DC6:
>>> +	case CXL_DECODER_DC7:
> 
> For example this seems very hacky...

Not sure if it helps, but you can always do:
case CXL_DECODER_DC0 ... CXL_DECODER_DC7:

DJ

> 
> [snip]
> 
>>>    
>>> +/*
>>> + * The region can not be manged by CXL if any portion of
>>> + * it is already online as 'System RAM'
>>> + */
>>> +static bool region_is_system_ram(struct cxl_region *cxlr,
>>> +				 struct cxl_region_params *p)
>>> +{
>>> +	return (walk_iomem_res_desc(IORES_DESC_NONE,
>>> +				    IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
>>> +				    p->res->start, p->res->end, cxlr,
>>> +				    is_system_ram) > 0);
>>> +}
>>> +
>>>    static int cxl_region_probe(struct device *dev)
>>>    {
>>>    	struct cxl_region *cxlr = to_cxl_region(dev);
>>> @@ -3174,14 +3287,7 @@ static int cxl_region_probe(struct device *dev)
>>>    	case CXL_DECODER_PMEM:
>>>    		return devm_cxl_add_pmem_region(cxlr);
>>>    	case CXL_DECODER_RAM:
>>> -		/*
>>> -		 * The region can not be manged by CXL if any portion of
>>> -		 * it is already online as 'System RAM'
>>> -		 */
>>> -		if (walk_iomem_res_desc(IORES_DESC_NONE,
>>> -					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
>>> -					p->res->start, p->res->end, cxlr,
>>> -					is_system_ram) > 0)
>>> +		if (region_is_system_ram(cxlr, p))
>>
>> Maybe split this change out as a prep patch before the current patch.
> 
> That seems reasonable.  But the patch is not so large and the
> justification for creating a helper is that we need this same check for DC
> regions.  So it seemed ok to leave it like this.  Let me see about
> splitting it out.
> 
> [snip]
> 
>>> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
>>> index ccdf8de85bd5..eb5eb81bfbd7 100644
>>> --- a/drivers/dax/cxl.c
>>> +++ b/drivers/dax/cxl.c
>>> @@ -23,11 +23,15 @@ static int cxl_dax_region_probe(struct device *dev)
>>>    	if (!dax_region)
>>>    		return -ENOMEM;
>>>    
>>> +	if (decoder_mode_is_dc(cxlr->mode))
>>> +		return 0;
>>> +
>>>    	data = (struct dev_dax_data) {
>>>    		.dax_region = dax_region,
>>>    		.id = -1,
>>>    		.size = range_len(&cxlr_dax->hpa_range),
>>>    	};
>>> +
>>
>> Stray blank line?
> 
> Opps!  Fixed!
> 
> Ira

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device
  2023-06-14 19:16 ` [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device ira.weiny
  2023-06-14 22:53   ` Dave Jiang
  2023-06-14 23:49   ` Alison Schofield
@ 2023-06-15 18:30   ` Fan Ni
  2023-06-15 19:17     ` Navneet Singh
  2023-06-15 21:41   ` Fan Ni
  2023-06-22 15:58   ` Jonathan Cameron
  4 siblings, 1 reply; 55+ messages in thread
From: Fan Ni @ 2023-06-15 18:30 UTC (permalink / raw)
  To: ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl,
	a.manzanares, dave, nmtadam.samsung

The 06/14/2023 12:16, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Read the Dynamic capacity configuration and store dynamic capacity region
> information in the device state which driver will use to map into the HDM
> ranges.
> 
> Implement Get Dynamic Capacity Configuration (opcode 4800h) mailbox
> command as specified in CXL 3.0 spec section 8.2.9.8.9.1.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> 

See the comments below how total_dynamic_capacity is collected.


> ---
> [iweiny: ensure all mds->dc_region's are named]
> ---
>  drivers/cxl/core/mbox.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/cxl/cxlmem.h    |  70 +++++++++++++++++-
>  drivers/cxl/pci.c       |   4 +
>  3 files changed, 256 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 3ca0bf12c55f..c5b696737c87 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -111,6 +111,37 @@ static u8 security_command_sets[] = {
>  	0x46, /* Security Passthrough */
>  };
>  
> +static bool cxl_is_dcd_command(u16 opcode)
> +{
> +#define CXL_MBOX_OP_DCD_CMDS 0x48
> +
> +	if ((opcode >> 8) == CXL_MBOX_OP_DCD_CMDS)
> +		return true;
> +
> +	return false;
> +}
> +
> +static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
> +					u16 opcode)
> +{
> +	switch (opcode) {
> +	case CXL_MBOX_OP_GET_DC_CONFIG:
> +		set_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
> +		set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_ADD_DC_RESPONSE:
> +		set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_RELEASE_DC:
> +		set_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
> +		break;
> +	default:
> +		break;
> +	}
> +}
> +
>  static bool cxl_is_security_command(u16 opcode)
>  {
>  	int i;
> @@ -666,6 +697,7 @@ static int cxl_xfer_log(struct cxl_memdev_state *mds, uuid_t *uuid,
>  static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>  {
>  	struct cxl_cel_entry *cel_entry;
> +	struct cxl_mem_command *cmd;
>  	const int cel_entries = size / sizeof(*cel_entry);
>  	struct device *dev = mds->cxlds.dev;
>  	int i;
> @@ -674,11 +706,12 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>  
>  	for (i = 0; i < cel_entries; i++) {
>  		u16 opcode = le16_to_cpu(cel_entry[i].opcode);
> -		struct cxl_mem_command *cmd = cxl_mem_find_command(opcode);
> +		cmd = cxl_mem_find_command(opcode);
>  
> -		if (!cmd && !cxl_is_poison_command(opcode)) {
> -			dev_dbg(dev,
> -				"Opcode 0x%04x unsupported by driver\n", opcode);
> +		if (!cmd && !cxl_is_poison_command(opcode) &&
> +		    !cxl_is_dcd_command(opcode)) {
> +			dev_dbg(dev, "Opcode 0x%04x unsupported by driver\n",
> +				opcode);
>  			continue;
>  		}
>  
> @@ -688,6 +721,9 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>  		if (cxl_is_poison_command(opcode))
>  			cxl_set_poison_cmd_enabled(&mds->poison, opcode);
>  
> +		if (cxl_is_dcd_command(opcode))
> +			cxl_set_dcd_cmd_enabled(mds, opcode);
> +
>  		dev_dbg(dev, "Opcode 0x%04x enabled\n", opcode);
>  	}
>  }
> @@ -1059,7 +1095,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
>  	if (rc < 0)
>  		return rc;
>  
> -	mds->total_bytes =
> +	mds->total_static_capacity =
>  		le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
>  	mds->volatile_only_bytes =
>  		le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
> @@ -1077,10 +1113,137 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
>  		mds->poison.max_errors = min_t(u32, val, CXL_POISON_LIST_MAX);
>  	}
>  
> +	mds->dc_event_log_size = le16_to_cpu(id.dc_event_log_size);
> +
>  	return 0;
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dev_state_identify, CXL);
>  
> +/**
> + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> + * information from the device.
> + * @mds: The memory device state
> + * Return: 0 if identify was executed successfully.
> + *
> + * This will dispatch the get_dynamic_capacity command to the device
> + * and on success populate structures to be exported to sysfs.
> + */
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> +{
> +	struct cxl_dev_state *cxlds = &mds->cxlds;
> +	struct device *dev = cxlds->dev;
> +	struct cxl_mbox_dynamic_capacity *dc;
> +	struct cxl_mbox_get_dc_config get_dc;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	u64 next_dc_region_start;
> +	int rc, i;
> +
> +	for (i = 0; i < CXL_MAX_DC_REGION; i++)
> +		sprintf(mds->dc_region[i].name, "dc%d", i);
> +
> +	/* Check GET_DC_CONFIG is supported by device */
> +	if (!test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds)) {
> +		dev_dbg(dev, "unsupported cmd: get_dynamic_capacity_config\n");
> +		return 0;
> +	}
> +
> +	dc = kvmalloc(mds->payload_size, GFP_KERNEL);
> +	if (!dc)
> +		return -ENOMEM;
> +
> +	get_dc = (struct cxl_mbox_get_dc_config) {
> +		.region_count = CXL_MAX_DC_REGION,
> +		.start_region_index = 0,
> +	};
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> +		.payload_in = &get_dc,
> +		.size_in = sizeof(get_dc),
> +		.size_out = mds->payload_size,
> +		.payload_out = dc,
> +		.min_out = 1,
> +	};
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	if (rc < 0)
> +		goto dc_error;
> +
> +	mds->nr_dc_region = dc->avail_region_count;
> +
> +	if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> +		dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> +			mds->nr_dc_region);
> +		rc = -EINVAL;
> +		goto dc_error;
> +	}
> +
> +	for (i = 0; i < mds->nr_dc_region; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> +		dcr->base = le64_to_cpu(dc->region[i].region_base);
> +		dcr->decode_len =
> +			le64_to_cpu(dc->region[i].region_decode_length);
> +		dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
> +		dcr->len = le64_to_cpu(dc->region[i].region_length);
> +		dcr->blk_size = le64_to_cpu(dc->region[i].region_block_size);
> +
> +		/* Check regions are in increasing DPA order */
> +		if ((i + 1) < mds->nr_dc_region) {
> +			next_dc_region_start =
> +				le64_to_cpu(dc->region[i + 1].region_base);
> +			if ((dcr->base > next_dc_region_start) ||
> +			    ((dcr->base + dcr->decode_len) > next_dc_region_start)) {
> +				dev_err(dev,
> +					"DPA ordering violation for DC region %d and %d\n",
> +					i, i + 1);
> +				rc = -EINVAL;
> +				goto dc_error;
> +			}
> +		}
> +
> +		/* Check the region is 256 MB aligned */
> +		if (!IS_ALIGNED(dcr->base, SZ_256M)) {
> +			dev_err(dev, "DC region %d not aligned to 256MB\n", i);
> +			rc = -EINVAL;
> +			goto dc_error;
> +		}
> +
> +		/* Check Region base and length are aligned to block size */
> +		if (!IS_ALIGNED(dcr->base, dcr->blk_size) ||
> +		    !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> +			dev_err(dev, "DC region %d not aligned to %#llx\n", i,
> +				dcr->blk_size);
> +			rc = -EINVAL;
> +			goto dc_error;
> +		}
> +
> +		dcr->dsmad_handle =
> +			le32_to_cpu(dc->region[i].region_dsmad_handle);
> +		dcr->flags = dc->region[i].flags;
> +		sprintf(dcr->name, "dc%d", i);
> +
> +		dev_dbg(dev,
> +			"DC region %s DPA: %#llx LEN: %#llx BLKSZ: %#llx\n",
> +			dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
> +	}
> +
> +	/*
> +	 * Calculate entire DPA range of all configured regions which will be mapped by
> +	 * one or more HDM decoders
> +	 */
> +	mds->total_dynamic_capacity =
> +		mds->dc_region[mds->nr_dc_region - 1].base +
> +		mds->dc_region[mds->nr_dc_region - 1].decode_len -
> +		mds->dc_region[0].base;

The above code assumes there is no DPA address gap between two adjacent dc
regions. Not sure whether it will always be true or not, I cannot find any
specific statement in the spec. 

Fan

> +	dev_dbg(dev, "Total dynamic capacity: %#llx\n",
> +		mds->total_dynamic_capacity);
> +
> +dc_error:
> +	kvfree(dc);
> +	return rc;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> +
>  static int add_dpa_res(struct device *dev, struct resource *parent,
>  		       struct resource *res, resource_size_t start,
>  		       resource_size_t size, const char *type)
> @@ -1112,6 +1275,11 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>  	struct cxl_dev_state *cxlds = &mds->cxlds;
>  	struct device *dev = cxlds->dev;
>  	int rc;
> +	size_t untenanted_mem =
> +		mds->dc_region[0].base - mds->total_static_capacity;
> +
> +	mds->total_capacity = mds->total_static_capacity +
> +			untenanted_mem + mds->total_dynamic_capacity;
>  
>  	if (!cxlds->media_ready) {
>  		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> @@ -1121,13 +1289,23 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>  	}
>  
>  	cxlds->dpa_res =
> -		(struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
> +		(struct resource)DEFINE_RES_MEM(0, mds->total_capacity);
> +
> +	for (int i = 0; i < CXL_MAX_DC_REGION; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
> +				 dcr->base, dcr->decode_len, dcr->name);
> +		if (rc)
> +			return rc;
> +	}
>  
>  	if (mds->partition_align_bytes == 0) {
>  		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
>  				 mds->volatile_only_bytes, "ram");
>  		if (rc)
>  			return rc;
> +
>  		return add_dpa_res(dev, &cxlds->dpa_res, &cxlds->pmem_res,
>  				   mds->volatile_only_bytes,
>  				   mds->persistent_only_bytes, "pmem");
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 89e560ea14c0..9c0b2fa72bdd 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -239,6 +239,15 @@ struct cxl_event_state {
>  	struct mutex log_lock;
>  };
>  
> +/* Device enabled DCD commands */
> +enum dcd_cmd_enabled_bits {
> +	CXL_DCD_ENABLED_GET_CONFIG,
> +	CXL_DCD_ENABLED_GET_EXTENT_LIST,
> +	CXL_DCD_ENABLED_ADD_RESPONSE,
> +	CXL_DCD_ENABLED_RELEASE,
> +	CXL_DCD_ENABLED_MAX
> +};
> +
>  /* Device enabled poison commands */
>  enum poison_cmd_enabled_bits {
>  	CXL_POISON_ENABLED_LIST,
> @@ -284,6 +293,9 @@ enum cxl_devtype {
>  	CXL_DEVTYPE_CLASSMEM,
>  };
>  
> +#define CXL_MAX_DC_REGION 8
> +#define CXL_DC_REGION_SRTLEN 8
> +
>  /**
>   * struct cxl_dev_state - The driver device state
>   *
> @@ -300,6 +312,8 @@ enum cxl_devtype {
>   * @dpa_res: Overall DPA resource tree for the device
>   * @pmem_res: Active Persistent memory capacity configuration
>   * @ram_res: Active Volatile memory capacity configuration
> + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> + *          region
>   * @component_reg_phys: register base of component registers
>   * @info: Cached DVSEC information about the device.
>   * @serial: PCIe Device Serial Number
> @@ -315,6 +329,7 @@ struct cxl_dev_state {
>  	struct resource dpa_res;
>  	struct resource pmem_res;
>  	struct resource ram_res;
> +	struct resource dc_res[CXL_MAX_DC_REGION];
>  	resource_size_t component_reg_phys;
>  	u64 serial;
>  	enum cxl_devtype type;
> @@ -334,9 +349,12 @@ struct cxl_dev_state {
>   *                (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
>   * @mbox_mutex: Mutex to synchronize mailbox access.
>   * @firmware_version: Firmware version for the memory device.
> + * @dcd_cmds: List of DCD commands implemented by memory device
>   * @enabled_cmds: Hardware commands found enabled in CEL.
>   * @exclusive_cmds: Commands that are kernel-internal only
> - * @total_bytes: sum of all possible capacities
> + * @total_capacity: Sum of static and dynamic capacities
> + * @total_static_capacity: Sum of RAM and PMEM capacities
> + * @total_dynamic_capacity: Complete DPA range occupied by DC regions
>   * @volatile_only_bytes: hard volatile capacity
>   * @persistent_only_bytes: hard persistent capacity
>   * @partition_align_bytes: alignment size for partition-able capacity
> @@ -344,6 +362,10 @@ struct cxl_dev_state {
>   * @active_persistent_bytes: sum of hard + soft persistent
>   * @next_volatile_bytes: volatile capacity change pending device reset
>   * @next_persistent_bytes: persistent capacity change pending device reset
> + * @nr_dc_region: number of DC regions implemented in the memory device
> + * @dc_region: array containing info about the DC regions
> + * @dc_event_log_size: The number of events the device can store in the
> + * Dynamic Capacity Event Log before it overflows
>   * @event: event log driver state
>   * @poison: poison driver state info
>   * @mbox_send: @dev specific transport for transmitting mailbox commands
> @@ -357,9 +379,13 @@ struct cxl_memdev_state {
>  	size_t lsa_size;
>  	struct mutex mbox_mutex; /* Protects device mailbox and firmware */
>  	char firmware_version[0x10];
> +	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
>  	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
>  	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> -	u64 total_bytes;
> +
> +	u64 total_capacity;
> +	u64 total_static_capacity;
> +	u64 total_dynamic_capacity;
>  	u64 volatile_only_bytes;
>  	u64 persistent_only_bytes;
>  	u64 partition_align_bytes;
> @@ -367,6 +393,20 @@ struct cxl_memdev_state {
>  	u64 active_persistent_bytes;
>  	u64 next_volatile_bytes;
>  	u64 next_persistent_bytes;
> +
> +	u8 nr_dc_region;
> +
> +	struct cxl_dc_region_info {
> +		u8 name[CXL_DC_REGION_SRTLEN];
> +		u64 base;
> +		u64 decode_len;
> +		u64 len;
> +		u64 blk_size;
> +		u32 dsmad_handle;
> +		u8 flags;
> +	} dc_region[CXL_MAX_DC_REGION];
> +
> +	size_t dc_event_log_size;
>  	struct cxl_event_state event;
>  	struct cxl_poison_state poison;
>  	int (*mbox_send)(struct cxl_memdev_state *mds,
> @@ -415,6 +455,10 @@ enum cxl_opcode {
>  	CXL_MBOX_OP_UNLOCK		= 0x4503,
>  	CXL_MBOX_OP_FREEZE_SECURITY	= 0x4504,
>  	CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE	= 0x4505,
> +	CXL_MBOX_OP_GET_DC_CONFIG	= 0x4800,
> +	CXL_MBOX_OP_GET_DC_EXTENT_LIST	= 0x4801,
> +	CXL_MBOX_OP_ADD_DC_RESPONSE	= 0x4802,
> +	CXL_MBOX_OP_RELEASE_DC		= 0x4803,
>  	CXL_MBOX_OP_MAX			= 0x10000
>  };
>  
> @@ -462,6 +506,7 @@ struct cxl_mbox_identify {
>  	__le16 inject_poison_limit;
>  	u8 poison_caps;
>  	u8 qos_telemetry_caps;
> +	__le16 dc_event_log_size;
>  } __packed;
>  
>  /*
> @@ -617,7 +662,27 @@ struct cxl_mbox_set_partition_info {
>  	u8 flags;
>  } __packed;
>  
> +struct cxl_mbox_get_dc_config {
> +	u8 region_count;
> +	u8 start_region_index;
> +} __packed;
> +
> +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> +struct cxl_mbox_dynamic_capacity {
> +	u8 avail_region_count;
> +	u8 rsvd[7];
> +	struct cxl_dc_region_config {
> +		__le64 region_base;
> +		__le64 region_decode_length;
> +		__le64 region_length;
> +		__le64 region_block_size;
> +		__le32 region_dsmad_handle;
> +		u8 flags;
> +		u8 rsvd[3];
> +	} __packed region[];
> +} __packed;
>  #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
> +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
>  
>  /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
>  struct cxl_mbox_set_timestamp_in {
> @@ -742,6 +807,7 @@ enum {
>  int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
>  			  struct cxl_mbox_cmd *cmd);
>  int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
>  int cxl_await_media_ready(struct cxl_dev_state *cxlds);
>  int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
>  int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 4e2845b7331a..ac1a41bc083d 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -742,6 +742,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	if (rc)
>  		return rc;
>  
> +	rc = cxl_dev_dynamic_capacity_identify(mds);
> +	if (rc)
> +		return rc;
> +
>  	rc = cxl_mem_create_range_info(mds);
>  	if (rc)
>  		return rc;
> 
> -- 
> 2.40.0
> 

-- 
Fan Ni <nifan@outlook.com>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support.
  2023-06-15 18:12     ` Ira Weiny
  2023-06-15 18:28       ` Dave Jiang
@ 2023-06-15 18:56       ` Navneet Singh
  1 sibling, 0 replies; 55+ messages in thread
From: Navneet Singh @ 2023-06-15 18:56 UTC (permalink / raw)
  To: Ira Weiny; +Cc: Dave Jiang, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl

On Thu, Jun 15, 2023 at 11:12:29AM -0700, Ira Weiny wrote:
> Dave Jiang wrote:
> > 
> > 
> > On 6/14/23 12:16, ira.weiny@intel.com wrote:
> > > From: Navneet Singh <navneet.singh@intel.com>
> > > 
> > > CXL devices optionally support dynamic capacity. CXL Regions must be
> > > created to access this capacity.
> > > 
> > > Add sysfs entries to create dynamic capacity cxl regions. Provide a new
> > > Dynamic Capacity decoder mode which targets dynamic capacity on devices
> > > which are added to that region.
> > > 
> > > Below are the steps to create and delete dynamic capacity region0
> > > (example).
> > > 
> > >      region=$(cat /sys/bus/cxl/devices/decoder0.0/create_dc_region)
> > >      echo $region> /sys/bus/cxl/devices/decoder0.0/create_dc_region
> > >      echo 256 > /sys/bus/cxl/devices/$region/interleave_granularity
> > >      echo 1 > /sys/bus/cxl/devices/$region/interleave_ways
> > > 
> > >      echo "dc0" >/sys/bus/cxl/devices/decoder1.0/mode
> > >      echo 0x400000000 >/sys/bus/cxl/devices/decoder1.0/dpa_size
> > > 
> > >      echo 0x400000000 > /sys/bus/cxl/devices/$region/size
> > >      echo  "decoder1.0" > /sys/bus/cxl/devices/$region/target0
> > >      echo 1 > /sys/bus/cxl/devices/$region/commit
> > >      echo $region > /sys/bus/cxl/drivers/cxl_region/bind
> > > 
> > >      echo $region> /sys/bus/cxl/devices/decoder0.0/delete_region
> > > 
> > > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > > 
> > > ---
> > > [iweiny: fixups]
> > > [iweiny: remove unused CXL_DC_REGION_MODE macro]
> > > [iweiny: Make dc_mode_to_region_index static]
> > > [iweiny: simplify <sysfs>/create_dc_region]
> > > [iweiny: introduce decoder_mode_is_dc]
> > > [djbw: fixups, no sign-off: preview only]
> > > ---
> > >   drivers/cxl/Kconfig       |  11 +++
> > >   drivers/cxl/core/core.h   |   7 ++
> > >   drivers/cxl/core/hdm.c    | 234 ++++++++++++++++++++++++++++++++++++++++++----
> > >   drivers/cxl/core/port.c   |  18 ++++
> > >   drivers/cxl/core/region.c | 135 ++++++++++++++++++++++++--
> > >   drivers/cxl/cxl.h         |  28 ++++++
> > >   drivers/dax/cxl.c         |   4 +
> > >   7 files changed, 409 insertions(+), 28 deletions(-)
> > > 
> > > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > > index ff4e78117b31..df034889d053 100644
> > > --- a/drivers/cxl/Kconfig
> > > +++ b/drivers/cxl/Kconfig
> > > @@ -121,6 +121,17 @@ config CXL_REGION
> > >   
> > >   	  If unsure say 'y'
> > >   
> > > +config CXL_DCD
> > > +	bool "CXL: DCD Support"
> > > +	default CXL_BUS
> > > +	depends on CXL_REGION
> > > +	help
> > > +	  Enable the CXL core to provision CXL DCD regions.
> > > +	  CXL devices optionally support dynamic capacity and DCD region
> > > +	  maps the dynamic capacity regions DPA's into Host HPA ranges.
> > > +
> > > +	  If unsure say 'y'
> > > +
> > >   config CXL_REGION_INVALIDATION_TEST
> > >   	bool "CXL: Region Cache Management Bypass (TEST)"
> > >   	depends on CXL_REGION
> > > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> > > index 27f0968449de..725700ab5973 100644
> > > --- a/drivers/cxl/core/core.h
> > > +++ b/drivers/cxl/core/core.h
> > > @@ -9,6 +9,13 @@ extern const struct device_type cxl_nvdimm_type;
> > >   
> > >   extern struct attribute_group cxl_base_attribute_group;
> > >   
> > > +#ifdef CONFIG_CXL_DCD
> > > +extern struct device_attribute dev_attr_create_dc_region;
> > > +#define SET_CXL_DC_REGION_ATTR(x) (&dev_attr_##x.attr),
> > > +#else
> > > +#define SET_CXL_DC_REGION_ATTR(x)
> > > +#endif
> > > +
> > >   #ifdef CONFIG_CXL_REGION
> > >   extern struct device_attribute dev_attr_create_pmem_region;
> > >   extern struct device_attribute dev_attr_create_ram_region;
> > > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > > index 514d30131d92..29649b47d177 100644
> > > --- a/drivers/cxl/core/hdm.c
> > > +++ b/drivers/cxl/core/hdm.c
> > > @@ -233,14 +233,23 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> > >   	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > >   	struct resource *res = cxled->dpa_res;
> > >   	resource_size_t skip_start;
> > > +	resource_size_t skipped = cxled->skip;
> > >   
> > >   	lockdep_assert_held_write(&cxl_dpa_rwsem);
> > >   
> > >   	/* save @skip_start, before @res is released */
> > > -	skip_start = res->start - cxled->skip;
> > > +	skip_start = res->start - skipped;
> > >   	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
> > > -	if (cxled->skip)
> > > -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> > > +	if (cxled->skip != 0) {
> > > +		while (skipped != 0) {
> > > +			res = xa_load(&cxled->skip_res, skip_start);
> > > +			__release_region(&cxlds->dpa_res, skip_start,
> > > +							resource_size(res));
> > > +			xa_erase(&cxled->skip_res, skip_start);
> > > +			skip_start += resource_size(res);
> > > +			skipped -= resource_size(res);
> > > +			}
> > > +	}
> > >   	cxled->skip = 0;
> > >   	cxled->dpa_res = NULL;
> > >   	put_device(&cxled->cxld.dev);
> > > @@ -267,6 +276,19 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> > >   	__cxl_dpa_release(cxled);
> > >   }
> > >   
> > > +static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
> > > +{
> > > +	int index = 0;
> > > +
> > > +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> > > +		if (mode == i)
> > > +			return index;
> > > +		index++;
> > > +	}
> > > +
> > > +	return -EINVAL;
> > > +}
> > > +
> > >   static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> > >   			     resource_size_t base, resource_size_t len,
> > >   			     resource_size_t skipped)
> > > @@ -275,7 +297,11 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> > >   	struct cxl_port *port = cxled_to_port(cxled);
> > >   	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > >   	struct device *dev = &port->dev;
> > > +	struct device *ed_dev = &cxled->cxld.dev;
> > > +	struct resource *dpa_res = &cxlds->dpa_res;
> > > +	resource_size_t skip_len = 0;
> > >   	struct resource *res;
> > > +	int rc, index;
> > >   
> > >   	lockdep_assert_held_write(&cxl_dpa_rwsem);
> > >   
> > > @@ -304,28 +330,119 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> > >   	}
> > >   
> > >   	if (skipped) {
> > > -		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> > > -				       dev_name(&cxled->cxld.dev), 0);
> > > -		if (!res) {
> > > -			dev_dbg(dev,
> > > -				"decoder%d.%d: failed to reserve skipped space\n",
> > > -				port->id, cxled->cxld.id);
> > > -			return -EBUSY;
> > > +		resource_size_t skip_base = base - skipped;
> > > +
> > > +		if (decoder_mode_is_dc(cxled->mode)) {
> > 
> > Maybe move this entire block to a helper function to reduce the size of 
> > the current function and reduce indent levels and improve readability?
> 
> :-/
> 
> I'll work on breaking it out more.  The logic here is getting kind of
> crazy.
> 
> > 
> > > +			if (resource_size(&cxlds->ram_res) &&
> > > +					skip_base <= cxlds->ram_res.end) {
> > > +				skip_len = cxlds->ram_res.end - skip_base + 1;
> > > +				res = __request_region(dpa_res, skip_base,
> > > +						skip_len, dev_name(ed_dev), 0);
> > > +				if (!res)
> > > +					goto error;
> > > +
> > > +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> > > +								GFP_KERNEL);
> > > +				skip_base += skip_len;
> > > +			}
> > > +
> > > +			if (resource_size(&cxlds->ram_res) &&
>                                                   ^^^^^^^
> 						  pmem_res?
> 
> > > +					skip_base <= cxlds->pmem_res.end) {
> 
> The 2 if statements here are almost exactly the same.  To the point I
> wonder if there is a bug.
> 
> Navneet,
> 
> Why does the code check ram_res the second time but go on to use pmem_res
> in the block?

Navneet- Thanks for pointing out it should be pmem_res instead of
ram_res. I will fix it.
> 
> > > +				skip_len = cxlds->pmem_res.end - skip_base + 1;
> > > +				res = __request_region(dpa_res, skip_base,
> > > +						skip_len, dev_name(ed_dev), 0);
> > > +				if (!res)
> > > +					goto error;
> > > +
> > > +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> > > +								GFP_KERNEL);
> > > +				skip_base += skip_len;
> > > +			}
> > > +
> > > +			index = dc_mode_to_region_index(cxled->mode);
> > > +			for (int i = 0; i <= index; i++) {
> > > +				struct resource *dcr = &cxlds->dc_res[i];
> > > +
> > > +				if (skip_base < dcr->start) {
> > > +					skip_len = dcr->start - skip_base;
> > > +					res = __request_region(dpa_res,
> > > +							skip_base, skip_len,
> > > +							dev_name(ed_dev), 0);
> > > +					if (!res)
> > > +						goto error;
> > > +
> > > +					rc = xa_insert(&cxled->skip_res, skip_base,
> > > +							res, GFP_KERNEL);
> > > +					skip_base += skip_len;
> > > +				}
> > > +
> > > +				if (skip_base == base) {
> > > +					dev_dbg(dev, "skip done!\n");
> > > +					break;
> > > +				}
> > > +
> > > +				if (resource_size(dcr) &&
> > > +						skip_base <= dcr->end) {
> > > +					if (skip_base > base)
> > > +						dev_err(dev, "Skip error\n");
> > > +
> > > +					skip_len = dcr->end - skip_base + 1;
> > > +					res = __request_region(dpa_res, skip_base,
> > > +							skip_len,
> > > +							dev_name(ed_dev), 0);
> > > +					if (!res)
> > > +						goto error;
> > > +
> > > +					rc = xa_insert(&cxled->skip_res, skip_base,
> > > +							res, GFP_KERNEL);
> > > +					skip_base += skip_len;
> > > +				}
> > > +			}
> > > +		} else	{
> > > +			res = __request_region(dpa_res, base - skipped, skipped,
> > > +							dev_name(ed_dev), 0);
> > > +			if (!res)
> > > +				goto error;
> > > +
> > > +			rc = xa_insert(&cxled->skip_res, skip_base, res,
> > > +								GFP_KERNEL);
> > >   		}
> > >   	}
> > > -	res = __request_region(&cxlds->dpa_res, base, len,
> > > -			       dev_name(&cxled->cxld.dev), 0);
> > > +
> > > +	res = __request_region(dpa_res, base, len, dev_name(ed_dev), 0);
> > >   	if (!res) {
> > >   		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
> > > -			port->id, cxled->cxld.id);
> > > -		if (skipped)
> > > -			__release_region(&cxlds->dpa_res, base - skipped,
> > > -					 skipped);
> > > +				port->id, cxled->cxld.id);
> > > +		if (skipped) {
> > > +			resource_size_t skip_base = base - skipped;
> > > +
> > > +			while (skipped != 0) {
> > > +				if (skip_base > base)
> > > +					dev_err(dev, "Skip error\n");
> > > +
> > > +				res = xa_load(&cxled->skip_res, skip_base);
> > > +				__release_region(dpa_res, skip_base,
> > > +							resource_size(res));
> > > +				xa_erase(&cxled->skip_res, skip_base);
> > > +				skip_base += resource_size(res);
> > > +				skipped -= resource_size(res);
> > > +			}
> > > +		}
> > >   		return -EBUSY;
> > >   	}
> > >   	cxled->dpa_res = res;
> > >   	cxled->skip = skipped;
> > >   
> > > +	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
> > > +		int index = dc_mode_to_region_index(mode);
> > > +
> > > +		if (resource_contains(&cxlds->dc_res[index], res)) {
> > > +			cxled->mode = mode;
> > > +			dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id,
> > > +				cxled->cxld.id, cxled->dpa_res, cxled->mode);
> > > +			goto success > +		}
> > > +	}
> > 
> > This block should only happen if decoder_mode_is_dc() right? If that's 
> > the case, you might be able to refactor it so the 'goto success' isn't 
> > necessary.
> 
> I'll check.  I looked through this code a couple of times in my review
> before posting because I'm not 100% sure I want to see 8 different modes
> DC decoders and regions.
> 
> I think the 'mode' should be 'DC' with an index in the endpoint decoder to
> map DC region that decoder is mapping.  But that change was much bigger to
> Navneets code and I wanted to see how others felt about having DC0 - DC7
> modes.  My compromise was creating decoder_mode_is_dc().
> 
> > 
> > >   	if (resource_contains(&cxlds->pmem_res, res))
> > >   		cxled->mode = CXL_DECODER_PMEM;
> > >   	else if (resource_contains(&cxlds->ram_res, res))
> > > @@ -336,9 +453,16 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> > >   		cxled->mode = CXL_DECODER_MIXED;
> > >   	}
> > >   
> > > +success:
> > >   	port->hdm_end++;
> > >   	get_device(&cxled->cxld.dev);
> > >   	return 0;
> > > +
> > > +error:
> > > +	dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space\n",
> > > +			port->id, cxled->cxld.id);
> > > +	return -EBUSY;
> > > +
> > >   }
> > >   
> > >   int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> > > @@ -429,6 +553,14 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
> > >   	switch (mode) {
> > >   	case CXL_DECODER_RAM:
> > >   	case CXL_DECODER_PMEM:
> > > +	case CXL_DECODER_DC0:
> > > +	case CXL_DECODER_DC1:
> > > +	case CXL_DECODER_DC2:
> > > +	case CXL_DECODER_DC3:
> > > +	case CXL_DECODER_DC4:
> > > +	case CXL_DECODER_DC5:
> > > +	case CXL_DECODER_DC6:
> > > +	case CXL_DECODER_DC7:
> 
> For example this seems very hacky...
> 
> [snip]
> 
> > >   
> > > +/*
> > > + * The region can not be manged by CXL if any portion of
> > > + * it is already online as 'System RAM'
> > > + */
> > > +static bool region_is_system_ram(struct cxl_region *cxlr,
> > > +				 struct cxl_region_params *p)
> > > +{
> > > +	return (walk_iomem_res_desc(IORES_DESC_NONE,
> > > +				    IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> > > +				    p->res->start, p->res->end, cxlr,
> > > +				    is_system_ram) > 0);
> > > +}
> > > +
> > >   static int cxl_region_probe(struct device *dev)
> > >   {
> > >   	struct cxl_region *cxlr = to_cxl_region(dev);
> > > @@ -3174,14 +3287,7 @@ static int cxl_region_probe(struct device *dev)
> > >   	case CXL_DECODER_PMEM:
> > >   		return devm_cxl_add_pmem_region(cxlr);
> > >   	case CXL_DECODER_RAM:
> > > -		/*
> > > -		 * The region can not be manged by CXL if any portion of
> > > -		 * it is already online as 'System RAM'
> > > -		 */
> > > -		if (walk_iomem_res_desc(IORES_DESC_NONE,
> > > -					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> > > -					p->res->start, p->res->end, cxlr,
> > > -					is_system_ram) > 0)
> > > +		if (region_is_system_ram(cxlr, p))
> > 
> > Maybe split this change out as a prep patch before the current patch.
> 
> That seems reasonable.  But the patch is not so large and the
> justification for creating a helper is that we need this same check for DC
> regions.  So it seemed ok to leave it like this.  Let me see about
> splitting it out.
> 
> [snip]
> 
> > > diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> > > index ccdf8de85bd5..eb5eb81bfbd7 100644
> > > --- a/drivers/dax/cxl.c
> > > +++ b/drivers/dax/cxl.c
> > > @@ -23,11 +23,15 @@ static int cxl_dax_region_probe(struct device *dev)
> > >   	if (!dax_region)
> > >   		return -ENOMEM;
> > >   
> > > +	if (decoder_mode_is_dc(cxlr->mode))
> > > +		return 0;
> > > +
> > >   	data = (struct dev_dax_data) {
> > >   		.dax_region = dax_region,
> > >   		.id = -1,
> > >   		.size = range_len(&cxlr_dax->hpa_range),
> > >   	};
> > > +
> > 
> > Stray blank line?
> 
> Opps!  Fixed!
> 
> Ira

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device
  2023-06-15 18:30   ` Fan Ni
@ 2023-06-15 19:17     ` Navneet Singh
  0 siblings, 0 replies; 55+ messages in thread
From: Navneet Singh @ 2023-06-15 19:17 UTC (permalink / raw)
  To: Fan Ni
  Cc: ira.weiny, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl,
	a.manzanares, dave, nmtadam.samsung

On Thu, Jun 15, 2023 at 11:30:08AM -0700, Fan Ni wrote:
> The 06/14/2023 12:16, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > Read the Dynamic capacity configuration and store dynamic capacity region
> > information in the device state which driver will use to map into the HDM
> > ranges.
> > 
> > Implement Get Dynamic Capacity Configuration (opcode 4800h) mailbox
> > command as specified in CXL 3.0 spec section 8.2.9.8.9.1.
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > 
> 
> See the comments below how total_dynamic_capacity is collected.
> 
> 
> > ---
> > [iweiny: ensure all mds->dc_region's are named]
> > ---
> >  drivers/cxl/core/mbox.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++--
> >  drivers/cxl/cxlmem.h    |  70 +++++++++++++++++-
> >  drivers/cxl/pci.c       |   4 +
> >  3 files changed, 256 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 3ca0bf12c55f..c5b696737c87 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -111,6 +111,37 @@ static u8 security_command_sets[] = {
> >  	0x46, /* Security Passthrough */
> >  };
> >  
> > +static bool cxl_is_dcd_command(u16 opcode)
> > +{
> > +#define CXL_MBOX_OP_DCD_CMDS 0x48
> > +
> > +	if ((opcode >> 8) == CXL_MBOX_OP_DCD_CMDS)
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> > +static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
> > +					u16 opcode)
> > +{
> > +	switch (opcode) {
> > +	case CXL_MBOX_OP_GET_DC_CONFIG:
> > +		set_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> > +		break;
> > +	case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
> > +		set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds);
> > +		break;
> > +	case CXL_MBOX_OP_ADD_DC_RESPONSE:
> > +		set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds);
> > +		break;
> > +	case CXL_MBOX_OP_RELEASE_DC:
> > +		set_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
> > +		break;
> > +	default:
> > +		break;
> > +	}
> > +}
> > +
> >  static bool cxl_is_security_command(u16 opcode)
> >  {
> >  	int i;
> > @@ -666,6 +697,7 @@ static int cxl_xfer_log(struct cxl_memdev_state *mds, uuid_t *uuid,
> >  static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
> >  {
> >  	struct cxl_cel_entry *cel_entry;
> > +	struct cxl_mem_command *cmd;
> >  	const int cel_entries = size / sizeof(*cel_entry);
> >  	struct device *dev = mds->cxlds.dev;
> >  	int i;
> > @@ -674,11 +706,12 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
> >  
> >  	for (i = 0; i < cel_entries; i++) {
> >  		u16 opcode = le16_to_cpu(cel_entry[i].opcode);
> > -		struct cxl_mem_command *cmd = cxl_mem_find_command(opcode);
> > +		cmd = cxl_mem_find_command(opcode);
> >  
> > -		if (!cmd && !cxl_is_poison_command(opcode)) {
> > -			dev_dbg(dev,
> > -				"Opcode 0x%04x unsupported by driver\n", opcode);
> > +		if (!cmd && !cxl_is_poison_command(opcode) &&
> > +		    !cxl_is_dcd_command(opcode)) {
> > +			dev_dbg(dev, "Opcode 0x%04x unsupported by driver\n",
> > +				opcode);
> >  			continue;
> >  		}
> >  
> > @@ -688,6 +721,9 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
> >  		if (cxl_is_poison_command(opcode))
> >  			cxl_set_poison_cmd_enabled(&mds->poison, opcode);
> >  
> > +		if (cxl_is_dcd_command(opcode))
> > +			cxl_set_dcd_cmd_enabled(mds, opcode);
> > +
> >  		dev_dbg(dev, "Opcode 0x%04x enabled\n", opcode);
> >  	}
> >  }
> > @@ -1059,7 +1095,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
> >  	if (rc < 0)
> >  		return rc;
> >  
> > -	mds->total_bytes =
> > +	mds->total_static_capacity =
> >  		le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
> >  	mds->volatile_only_bytes =
> >  		le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
> > @@ -1077,10 +1113,137 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
> >  		mds->poison.max_errors = min_t(u32, val, CXL_POISON_LIST_MAX);
> >  	}
> >  
> > +	mds->dc_event_log_size = le16_to_cpu(id.dc_event_log_size);
> > +
> >  	return 0;
> >  }
> >  EXPORT_SYMBOL_NS_GPL(cxl_dev_state_identify, CXL);
> >  
> > +/**
> > + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> > + * information from the device.
> > + * @mds: The memory device state
> > + * Return: 0 if identify was executed successfully.
> > + *
> > + * This will dispatch the get_dynamic_capacity command to the device
> > + * and on success populate structures to be exported to sysfs.
> > + */
> > +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> > +{
> > +	struct cxl_dev_state *cxlds = &mds->cxlds;
> > +	struct device *dev = cxlds->dev;
> > +	struct cxl_mbox_dynamic_capacity *dc;
> > +	struct cxl_mbox_get_dc_config get_dc;
> > +	struct cxl_mbox_cmd mbox_cmd;
> > +	u64 next_dc_region_start;
> > +	int rc, i;
> > +
> > +	for (i = 0; i < CXL_MAX_DC_REGION; i++)
> > +		sprintf(mds->dc_region[i].name, "dc%d", i);
> > +
> > +	/* Check GET_DC_CONFIG is supported by device */
> > +	if (!test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds)) {
> > +		dev_dbg(dev, "unsupported cmd: get_dynamic_capacity_config\n");
> > +		return 0;
> > +	}
> > +
> > +	dc = kvmalloc(mds->payload_size, GFP_KERNEL);
> > +	if (!dc)
> > +		return -ENOMEM;
> > +
> > +	get_dc = (struct cxl_mbox_get_dc_config) {
> > +		.region_count = CXL_MAX_DC_REGION,
> > +		.start_region_index = 0,
> > +	};
> > +
> > +	mbox_cmd = (struct cxl_mbox_cmd) {
> > +		.opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> > +		.payload_in = &get_dc,
> > +		.size_in = sizeof(get_dc),
> > +		.size_out = mds->payload_size,
> > +		.payload_out = dc,
> > +		.min_out = 1,
> > +	};
> > +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> > +	if (rc < 0)
> > +		goto dc_error;
> > +
> > +	mds->nr_dc_region = dc->avail_region_count;
> > +
> > +	if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> > +		dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> > +			mds->nr_dc_region);
> > +		rc = -EINVAL;
> > +		goto dc_error;
> > +	}
> > +
> > +	for (i = 0; i < mds->nr_dc_region; i++) {
> > +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> > +
> > +		dcr->base = le64_to_cpu(dc->region[i].region_base);
> > +		dcr->decode_len =
> > +			le64_to_cpu(dc->region[i].region_decode_length);
> > +		dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
> > +		dcr->len = le64_to_cpu(dc->region[i].region_length);
> > +		dcr->blk_size = le64_to_cpu(dc->region[i].region_block_size);
> > +
> > +		/* Check regions are in increasing DPA order */
> > +		if ((i + 1) < mds->nr_dc_region) {
> > +			next_dc_region_start =
> > +				le64_to_cpu(dc->region[i + 1].region_base);
> > +			if ((dcr->base > next_dc_region_start) ||
> > +			    ((dcr->base + dcr->decode_len) > next_dc_region_start)) {
> > +				dev_err(dev,
> > +					"DPA ordering violation for DC region %d and %d\n",
> > +					i, i + 1);
> > +				rc = -EINVAL;
> > +				goto dc_error;
> > +			}
> > +		}
> > +
> > +		/* Check the region is 256 MB aligned */
> > +		if (!IS_ALIGNED(dcr->base, SZ_256M)) {
> > +			dev_err(dev, "DC region %d not aligned to 256MB\n", i);
> > +			rc = -EINVAL;
> > +			goto dc_error;
> > +		}
> > +
> > +		/* Check Region base and length are aligned to block size */
> > +		if (!IS_ALIGNED(dcr->base, dcr->blk_size) ||
> > +		    !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> > +			dev_err(dev, "DC region %d not aligned to %#llx\n", i,
> > +				dcr->blk_size);
> > +			rc = -EINVAL;
> > +			goto dc_error;
> > +		}
> > +
> > +		dcr->dsmad_handle =
> > +			le32_to_cpu(dc->region[i].region_dsmad_handle);
> > +		dcr->flags = dc->region[i].flags;
> > +		sprintf(dcr->name, "dc%d", i);
> > +
> > +		dev_dbg(dev,
> > +			"DC region %s DPA: %#llx LEN: %#llx BLKSZ: %#llx\n",
> > +			dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
> > +	}
> > +
> > +	/*
> > +	 * Calculate entire DPA range of all configured regions which will be mapped by
> > +	 * one or more HDM decoders
> > +	 */
> > +	mds->total_dynamic_capacity =
> > +		mds->dc_region[mds->nr_dc_region - 1].base +
> > +		mds->dc_region[mds->nr_dc_region - 1].decode_len -
> > +		mds->dc_region[0].base;
> 
> The above code assumes there is no DPA address gap between two adjacent dc
> regions. Not sure whether it will always be true or not, I cannot find any
> specific statement in the spec. 
> 
> Fan
> 
Navneet - Total dynamic capacity name gives a perception that its a sum of all DC regions but its
a DPA range which contains all the dc regions. Each DC region is added
as child resource to it. While allocating the DPA gaps between the
regions are taken care to skip. Maybe a better name would help here to
avoid confusion.

> > +	dev_dbg(dev, "Total dynamic capacity: %#llx\n",
> > +		mds->total_dynamic_capacity);
> > +
> > +dc_error:
> > +	kvfree(dc);
> > +	return rc;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> > +
> >  static int add_dpa_res(struct device *dev, struct resource *parent,
> >  		       struct resource *res, resource_size_t start,
> >  		       resource_size_t size, const char *type)
> > @@ -1112,6 +1275,11 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> >  	struct cxl_dev_state *cxlds = &mds->cxlds;
> >  	struct device *dev = cxlds->dev;
> >  	int rc;
> > +	size_t untenanted_mem =
> > +		mds->dc_region[0].base - mds->total_static_capacity;
> > +
> > +	mds->total_capacity = mds->total_static_capacity +
> > +			untenanted_mem + mds->total_dynamic_capacity;
> >  
> >  	if (!cxlds->media_ready) {
> >  		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> > @@ -1121,13 +1289,23 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> >  	}
> >  
> >  	cxlds->dpa_res =
> > -		(struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
> > +		(struct resource)DEFINE_RES_MEM(0, mds->total_capacity);
> > +
> > +	for (int i = 0; i < CXL_MAX_DC_REGION; i++) {
> > +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> > +
> > +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
> > +				 dcr->base, dcr->decode_len, dcr->name);
> > +		if (rc)
> > +			return rc;
> > +	}
> >  
> >  	if (mds->partition_align_bytes == 0) {
> >  		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
> >  				 mds->volatile_only_bytes, "ram");
> >  		if (rc)
> >  			return rc;
> > +
> >  		return add_dpa_res(dev, &cxlds->dpa_res, &cxlds->pmem_res,
> >  				   mds->volatile_only_bytes,
> >  				   mds->persistent_only_bytes, "pmem");
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index 89e560ea14c0..9c0b2fa72bdd 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
> > @@ -239,6 +239,15 @@ struct cxl_event_state {
> >  	struct mutex log_lock;
> >  };
> >  
> > +/* Device enabled DCD commands */
> > +enum dcd_cmd_enabled_bits {
> > +	CXL_DCD_ENABLED_GET_CONFIG,
> > +	CXL_DCD_ENABLED_GET_EXTENT_LIST,
> > +	CXL_DCD_ENABLED_ADD_RESPONSE,
> > +	CXL_DCD_ENABLED_RELEASE,
> > +	CXL_DCD_ENABLED_MAX
> > +};
> > +
> >  /* Device enabled poison commands */
> >  enum poison_cmd_enabled_bits {
> >  	CXL_POISON_ENABLED_LIST,
> > @@ -284,6 +293,9 @@ enum cxl_devtype {
> >  	CXL_DEVTYPE_CLASSMEM,
> >  };
> >  
> > +#define CXL_MAX_DC_REGION 8
> > +#define CXL_DC_REGION_SRTLEN 8
> > +
> >  /**
> >   * struct cxl_dev_state - The driver device state
> >   *
> > @@ -300,6 +312,8 @@ enum cxl_devtype {
> >   * @dpa_res: Overall DPA resource tree for the device
> >   * @pmem_res: Active Persistent memory capacity configuration
> >   * @ram_res: Active Volatile memory capacity configuration
> > + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> > + *          region
> >   * @component_reg_phys: register base of component registers
> >   * @info: Cached DVSEC information about the device.
> >   * @serial: PCIe Device Serial Number
> > @@ -315,6 +329,7 @@ struct cxl_dev_state {
> >  	struct resource dpa_res;
> >  	struct resource pmem_res;
> >  	struct resource ram_res;
> > +	struct resource dc_res[CXL_MAX_DC_REGION];
> >  	resource_size_t component_reg_phys;
> >  	u64 serial;
> >  	enum cxl_devtype type;
> > @@ -334,9 +349,12 @@ struct cxl_dev_state {
> >   *                (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
> >   * @mbox_mutex: Mutex to synchronize mailbox access.
> >   * @firmware_version: Firmware version for the memory device.
> > + * @dcd_cmds: List of DCD commands implemented by memory device
> >   * @enabled_cmds: Hardware commands found enabled in CEL.
> >   * @exclusive_cmds: Commands that are kernel-internal only
> > - * @total_bytes: sum of all possible capacities
> > + * @total_capacity: Sum of static and dynamic capacities
> > + * @total_static_capacity: Sum of RAM and PMEM capacities
> > + * @total_dynamic_capacity: Complete DPA range occupied by DC regions
> >   * @volatile_only_bytes: hard volatile capacity
> >   * @persistent_only_bytes: hard persistent capacity
> >   * @partition_align_bytes: alignment size for partition-able capacity
> > @@ -344,6 +362,10 @@ struct cxl_dev_state {
> >   * @active_persistent_bytes: sum of hard + soft persistent
> >   * @next_volatile_bytes: volatile capacity change pending device reset
> >   * @next_persistent_bytes: persistent capacity change pending device reset
> > + * @nr_dc_region: number of DC regions implemented in the memory device
> > + * @dc_region: array containing info about the DC regions
> > + * @dc_event_log_size: The number of events the device can store in the
> > + * Dynamic Capacity Event Log before it overflows
> >   * @event: event log driver state
> >   * @poison: poison driver state info
> >   * @mbox_send: @dev specific transport for transmitting mailbox commands
> > @@ -357,9 +379,13 @@ struct cxl_memdev_state {
> >  	size_t lsa_size;
> >  	struct mutex mbox_mutex; /* Protects device mailbox and firmware */
> >  	char firmware_version[0x10];
> > +	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
> >  	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
> >  	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> > -	u64 total_bytes;
> > +
> > +	u64 total_capacity;
> > +	u64 total_static_capacity;
> > +	u64 total_dynamic_capacity;
> >  	u64 volatile_only_bytes;
> >  	u64 persistent_only_bytes;
> >  	u64 partition_align_bytes;
> > @@ -367,6 +393,20 @@ struct cxl_memdev_state {
> >  	u64 active_persistent_bytes;
> >  	u64 next_volatile_bytes;
> >  	u64 next_persistent_bytes;
> > +
> > +	u8 nr_dc_region;
> > +
> > +	struct cxl_dc_region_info {
> > +		u8 name[CXL_DC_REGION_SRTLEN];
> > +		u64 base;
> > +		u64 decode_len;
> > +		u64 len;
> > +		u64 blk_size;
> > +		u32 dsmad_handle;
> > +		u8 flags;
> > +	} dc_region[CXL_MAX_DC_REGION];
> > +
> > +	size_t dc_event_log_size;
> >  	struct cxl_event_state event;
> >  	struct cxl_poison_state poison;
> >  	int (*mbox_send)(struct cxl_memdev_state *mds,
> > @@ -415,6 +455,10 @@ enum cxl_opcode {
> >  	CXL_MBOX_OP_UNLOCK		= 0x4503,
> >  	CXL_MBOX_OP_FREEZE_SECURITY	= 0x4504,
> >  	CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE	= 0x4505,
> > +	CXL_MBOX_OP_GET_DC_CONFIG	= 0x4800,
> > +	CXL_MBOX_OP_GET_DC_EXTENT_LIST	= 0x4801,
> > +	CXL_MBOX_OP_ADD_DC_RESPONSE	= 0x4802,
> > +	CXL_MBOX_OP_RELEASE_DC		= 0x4803,
> >  	CXL_MBOX_OP_MAX			= 0x10000
> >  };
> >  
> > @@ -462,6 +506,7 @@ struct cxl_mbox_identify {
> >  	__le16 inject_poison_limit;
> >  	u8 poison_caps;
> >  	u8 qos_telemetry_caps;
> > +	__le16 dc_event_log_size;
> >  } __packed;
> >  
> >  /*
> > @@ -617,7 +662,27 @@ struct cxl_mbox_set_partition_info {
> >  	u8 flags;
> >  } __packed;
> >  
> > +struct cxl_mbox_get_dc_config {
> > +	u8 region_count;
> > +	u8 start_region_index;
> > +} __packed;
> > +
> > +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> > +struct cxl_mbox_dynamic_capacity {
> > +	u8 avail_region_count;
> > +	u8 rsvd[7];
> > +	struct cxl_dc_region_config {
> > +		__le64 region_base;
> > +		__le64 region_decode_length;
> > +		__le64 region_length;
> > +		__le64 region_block_size;
> > +		__le32 region_dsmad_handle;
> > +		u8 flags;
> > +		u8 rsvd[3];
> > +	} __packed region[];
> > +} __packed;
> >  #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
> > +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
> >  
> >  /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
> >  struct cxl_mbox_set_timestamp_in {
> > @@ -742,6 +807,7 @@ enum {
> >  int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
> >  			  struct cxl_mbox_cmd *cmd);
> >  int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> > +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
> >  int cxl_await_media_ready(struct cxl_dev_state *cxlds);
> >  int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> >  int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index 4e2845b7331a..ac1a41bc083d 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -742,6 +742,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >  	if (rc)
> >  		return rc;
> >  
> > +	rc = cxl_dev_dynamic_capacity_identify(mds);
> > +	if (rc)
> > +		return rc;
> > +
> >  	rc = cxl_mem_create_range_info(mds);
> >  	if (rc)
> >  		return rc;
> > 
> > -- 
> > 2.40.0
> > 
> 
> -- 
> Fan Ni <nifan@outlook.com>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device
  2023-06-14 19:16 ` [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device ira.weiny
                     ` (2 preceding siblings ...)
  2023-06-15 18:30   ` Fan Ni
@ 2023-06-15 21:41   ` Fan Ni
  2023-06-22 15:58   ` Jonathan Cameron
  4 siblings, 0 replies; 55+ messages in thread
From: Fan Ni @ 2023-06-15 21:41 UTC (permalink / raw)
  To: ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl

The 06/14/2023 12:16, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Read the Dynamic capacity configuration and store dynamic capacity region
> information in the device state which driver will use to map into the HDM
> ranges.
> 
> Implement Get Dynamic Capacity Configuration (opcode 4800h) mailbox
> command as specified in CXL 3.0 spec section 8.2.9.8.9.1.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> 
> ---
> [iweiny: ensure all mds->dc_region's are named]
> ---
>  drivers/cxl/core/mbox.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/cxl/cxlmem.h    |  70 +++++++++++++++++-
>  drivers/cxl/pci.c       |   4 +
>  3 files changed, 256 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 3ca0bf12c55f..c5b696737c87 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -111,6 +111,37 @@ static u8 security_command_sets[] = {
>  	0x46, /* Security Passthrough */
>  };
>  
> +static bool cxl_is_dcd_command(u16 opcode)
> +{
> +#define CXL_MBOX_OP_DCD_CMDS 0x48
> +
> +	if ((opcode >> 8) == CXL_MBOX_OP_DCD_CMDS)
> +		return true;
> +
> +	return false;
> +}
> +
> +static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
> +					u16 opcode)
> +{
> +	switch (opcode) {
> +	case CXL_MBOX_OP_GET_DC_CONFIG:
> +		set_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
> +		set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_ADD_DC_RESPONSE:
> +		set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_RELEASE_DC:
> +		set_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
> +		break;
> +	default:
> +		break;
> +	}
> +}
> +
>  static bool cxl_is_security_command(u16 opcode)
>  {
>  	int i;
> @@ -666,6 +697,7 @@ static int cxl_xfer_log(struct cxl_memdev_state *mds, uuid_t *uuid,
>  static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>  {
>  	struct cxl_cel_entry *cel_entry;
> +	struct cxl_mem_command *cmd;
>  	const int cel_entries = size / sizeof(*cel_entry);
>  	struct device *dev = mds->cxlds.dev;
>  	int i;
> @@ -674,11 +706,12 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>  
>  	for (i = 0; i < cel_entries; i++) {
>  		u16 opcode = le16_to_cpu(cel_entry[i].opcode);
> -		struct cxl_mem_command *cmd = cxl_mem_find_command(opcode);
> +		cmd = cxl_mem_find_command(opcode);
>  
> -		if (!cmd && !cxl_is_poison_command(opcode)) {
> -			dev_dbg(dev,
> -				"Opcode 0x%04x unsupported by driver\n", opcode);
> +		if (!cmd && !cxl_is_poison_command(opcode) &&
> +		    !cxl_is_dcd_command(opcode)) {
> +			dev_dbg(dev, "Opcode 0x%04x unsupported by driver\n",
> +				opcode);
>  			continue;
>  		}
>  
> @@ -688,6 +721,9 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>  		if (cxl_is_poison_command(opcode))
>  			cxl_set_poison_cmd_enabled(&mds->poison, opcode);
>  
> +		if (cxl_is_dcd_command(opcode))
> +			cxl_set_dcd_cmd_enabled(mds, opcode);
> +
>  		dev_dbg(dev, "Opcode 0x%04x enabled\n", opcode);
>  	}
>  }
> @@ -1059,7 +1095,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
>  	if (rc < 0)
>  		return rc;
>  
> -	mds->total_bytes =
> +	mds->total_static_capacity =
>  		le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
>  	mds->volatile_only_bytes =
>  		le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
> @@ -1077,10 +1113,137 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
>  		mds->poison.max_errors = min_t(u32, val, CXL_POISON_LIST_MAX);
>  	}
>  
> +	mds->dc_event_log_size = le16_to_cpu(id.dc_event_log_size);
> +
>  	return 0;
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dev_state_identify, CXL);
>  
> +/**
> + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> + * information from the device.
> + * @mds: The memory device state
> + * Return: 0 if identify was executed successfully.
> + *
> + * This will dispatch the get_dynamic_capacity command to the device
> + * and on success populate structures to be exported to sysfs.
> + */
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> +{
> +	struct cxl_dev_state *cxlds = &mds->cxlds;
> +	struct device *dev = cxlds->dev;
> +	struct cxl_mbox_dynamic_capacity *dc;
> +	struct cxl_mbox_get_dc_config get_dc;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	u64 next_dc_region_start;
> +	int rc, i;
> +
> +	for (i = 0; i < CXL_MAX_DC_REGION; i++)
> +		sprintf(mds->dc_region[i].name, "dc%d", i);
> +
> +	/* Check GET_DC_CONFIG is supported by device */
> +	if (!test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds)) {
> +		dev_dbg(dev, "unsupported cmd: get_dynamic_capacity_config\n");
> +		return 0;
> +	}
> +
> +	dc = kvmalloc(mds->payload_size, GFP_KERNEL);
> +	if (!dc)
> +		return -ENOMEM;
> +
> +	get_dc = (struct cxl_mbox_get_dc_config) {
> +		.region_count = CXL_MAX_DC_REGION,
> +		.start_region_index = 0,
> +	};
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> +		.payload_in = &get_dc,
> +		.size_in = sizeof(get_dc),
> +		.size_out = mds->payload_size,
> +		.payload_out = dc,
> +		.min_out = 1,
> +	};
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	if (rc < 0)
> +		goto dc_error;
> +
> +	mds->nr_dc_region = dc->avail_region_count;
> +
> +	if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> +		dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> +			mds->nr_dc_region);
> +		rc = -EINVAL;
> +		goto dc_error;
> +	}
> +
> +	for (i = 0; i < mds->nr_dc_region; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> +		dcr->base = le64_to_cpu(dc->region[i].region_base);
> +		dcr->decode_len =
> +			le64_to_cpu(dc->region[i].region_decode_length);
> +		dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
> +		dcr->len = le64_to_cpu(dc->region[i].region_length);
> +		dcr->blk_size = le64_to_cpu(dc->region[i].region_block_size);
> +
> +		/* Check regions are in increasing DPA order */
> +		if ((i + 1) < mds->nr_dc_region) {
> +			next_dc_region_start =
> +				le64_to_cpu(dc->region[i + 1].region_base);
> +			if ((dcr->base > next_dc_region_start) ||
> +			    ((dcr->base + dcr->decode_len) > next_dc_region_start)) {
> +				dev_err(dev,
> +					"DPA ordering violation for DC region %d and %d\n",
> +					i, i + 1);
> +				rc = -EINVAL;
> +				goto dc_error;
> +			}
> +		}
> +
> +		/* Check the region is 256 MB aligned */
> +		if (!IS_ALIGNED(dcr->base, SZ_256M)) {
> +			dev_err(dev, "DC region %d not aligned to 256MB\n", i);
> +			rc = -EINVAL;
> +			goto dc_error;
> +		}
> +
> +		/* Check Region base and length are aligned to block size */
> +		if (!IS_ALIGNED(dcr->base, dcr->blk_size) ||
> +		    !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> +			dev_err(dev, "DC region %d not aligned to %#llx\n", i,
> +				dcr->blk_size);
> +			rc = -EINVAL;
> +			goto dc_error;
> +		}
> +
> +		dcr->dsmad_handle =
> +			le32_to_cpu(dc->region[i].region_dsmad_handle);
> +		dcr->flags = dc->region[i].flags;
> +		sprintf(dcr->name, "dc%d", i);
> +
> +		dev_dbg(dev,
> +			"DC region %s DPA: %#llx LEN: %#llx BLKSZ: %#llx\n",
> +			dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
> +	}
> +
> +	/*
> +	 * Calculate entire DPA range of all configured regions which will be mapped by
> +	 * one or more HDM decoders
> +	 */
> +	mds->total_dynamic_capacity =
> +		mds->dc_region[mds->nr_dc_region - 1].base +
> +		mds->dc_region[mds->nr_dc_region - 1].decode_len -
> +		mds->dc_region[0].base;
> +	dev_dbg(dev, "Total dynamic capacity: %#llx\n",
> +		mds->total_dynamic_capacity);
> +
> +dc_error:
> +	kvfree(dc);
> +	return rc;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> +
>  static int add_dpa_res(struct device *dev, struct resource *parent,
>  		       struct resource *res, resource_size_t start,
>  		       resource_size_t size, const char *type)
> @@ -1112,6 +1275,11 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>  	struct cxl_dev_state *cxlds = &mds->cxlds;
>  	struct device *dev = cxlds->dev;
>  	int rc;
> +	size_t untenanted_mem =
> +		mds->dc_region[0].base - mds->total_static_capacity;
> +
> +	mds->total_capacity = mds->total_static_capacity +
> +			untenanted_mem + mds->total_dynamic_capacity;
>  
>  	if (!cxlds->media_ready) {
>  		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> @@ -1121,13 +1289,23 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>  	}
>  
>  	cxlds->dpa_res =
> -		(struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
> +		(struct resource)DEFINE_RES_MEM(0, mds->total_capacity);
> +
> +	for (int i = 0; i < CXL_MAX_DC_REGION; i++) {

should it be i < mds->nr_dc_region???

> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
> +				 dcr->base, dcr->decode_len, dcr->name);
> +		if (rc)
> +			return rc;
> +	}
>  
>  	if (mds->partition_align_bytes == 0) {
>  		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
>  				 mds->volatile_only_bytes, "ram");
>  		if (rc)
>  			return rc;
> +
>  		return add_dpa_res(dev, &cxlds->dpa_res, &cxlds->pmem_res,
>  				   mds->volatile_only_bytes,
>  				   mds->persistent_only_bytes, "pmem");
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 89e560ea14c0..9c0b2fa72bdd 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -239,6 +239,15 @@ struct cxl_event_state {
>  	struct mutex log_lock;
>  };
>  
> +/* Device enabled DCD commands */
> +enum dcd_cmd_enabled_bits {
> +	CXL_DCD_ENABLED_GET_CONFIG,
> +	CXL_DCD_ENABLED_GET_EXTENT_LIST,
> +	CXL_DCD_ENABLED_ADD_RESPONSE,
> +	CXL_DCD_ENABLED_RELEASE,
> +	CXL_DCD_ENABLED_MAX
> +};
> +
>  /* Device enabled poison commands */
>  enum poison_cmd_enabled_bits {
>  	CXL_POISON_ENABLED_LIST,
> @@ -284,6 +293,9 @@ enum cxl_devtype {
>  	CXL_DEVTYPE_CLASSMEM,
>  };
>  
> +#define CXL_MAX_DC_REGION 8
> +#define CXL_DC_REGION_SRTLEN 8
> +
>  /**
>   * struct cxl_dev_state - The driver device state
>   *
> @@ -300,6 +312,8 @@ enum cxl_devtype {
>   * @dpa_res: Overall DPA resource tree for the device
>   * @pmem_res: Active Persistent memory capacity configuration
>   * @ram_res: Active Volatile memory capacity configuration
> + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> + *          region
>   * @component_reg_phys: register base of component registers
>   * @info: Cached DVSEC information about the device.
>   * @serial: PCIe Device Serial Number
> @@ -315,6 +329,7 @@ struct cxl_dev_state {
>  	struct resource dpa_res;
>  	struct resource pmem_res;
>  	struct resource ram_res;
> +	struct resource dc_res[CXL_MAX_DC_REGION];
>  	resource_size_t component_reg_phys;
>  	u64 serial;
>  	enum cxl_devtype type;
> @@ -334,9 +349,12 @@ struct cxl_dev_state {
>   *                (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
>   * @mbox_mutex: Mutex to synchronize mailbox access.
>   * @firmware_version: Firmware version for the memory device.
> + * @dcd_cmds: List of DCD commands implemented by memory device
>   * @enabled_cmds: Hardware commands found enabled in CEL.
>   * @exclusive_cmds: Commands that are kernel-internal only
> - * @total_bytes: sum of all possible capacities
> + * @total_capacity: Sum of static and dynamic capacities
> + * @total_static_capacity: Sum of RAM and PMEM capacities
> + * @total_dynamic_capacity: Complete DPA range occupied by DC regions
>   * @volatile_only_bytes: hard volatile capacity
>   * @persistent_only_bytes: hard persistent capacity
>   * @partition_align_bytes: alignment size for partition-able capacity
> @@ -344,6 +362,10 @@ struct cxl_dev_state {
>   * @active_persistent_bytes: sum of hard + soft persistent
>   * @next_volatile_bytes: volatile capacity change pending device reset
>   * @next_persistent_bytes: persistent capacity change pending device reset
> + * @nr_dc_region: number of DC regions implemented in the memory device
> + * @dc_region: array containing info about the DC regions
> + * @dc_event_log_size: The number of events the device can store in the
> + * Dynamic Capacity Event Log before it overflows
>   * @event: event log driver state
>   * @poison: poison driver state info
>   * @mbox_send: @dev specific transport for transmitting mailbox commands
> @@ -357,9 +379,13 @@ struct cxl_memdev_state {
>  	size_t lsa_size;
>  	struct mutex mbox_mutex; /* Protects device mailbox and firmware */
>  	char firmware_version[0x10];
> +	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
>  	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
>  	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> -	u64 total_bytes;
> +

Remove blank line ?

> +	u64 total_capacity;
> +	u64 total_static_capacity;
> +	u64 total_dynamic_capacity;
>  	u64 volatile_only_bytes;
>  	u64 persistent_only_bytes;
>  	u64 partition_align_bytes;
> @@ -367,6 +393,20 @@ struct cxl_memdev_state {
>  	u64 active_persistent_bytes;
>  	u64 next_volatile_bytes;
>  	u64 next_persistent_bytes;
> +
> +	u8 nr_dc_region;
> +

Remove blank line?

Fan
> +	struct cxl_dc_region_info {
> +		u8 name[CXL_DC_REGION_SRTLEN];
> +		u64 base;
> +		u64 decode_len;
> +		u64 len;
> +		u64 blk_size;
> +		u32 dsmad_handle;
> +		u8 flags;
> +	} dc_region[CXL_MAX_DC_REGION];
> +
> +	size_t dc_event_log_size;
>  	struct cxl_event_state event;
>  	struct cxl_poison_state poison;
>  	int (*mbox_send)(struct cxl_memdev_state *mds,
> @@ -415,6 +455,10 @@ enum cxl_opcode {
>  	CXL_MBOX_OP_UNLOCK		= 0x4503,
>  	CXL_MBOX_OP_FREEZE_SECURITY	= 0x4504,
>  	CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE	= 0x4505,
> +	CXL_MBOX_OP_GET_DC_CONFIG	= 0x4800,
> +	CXL_MBOX_OP_GET_DC_EXTENT_LIST	= 0x4801,
> +	CXL_MBOX_OP_ADD_DC_RESPONSE	= 0x4802,
> +	CXL_MBOX_OP_RELEASE_DC		= 0x4803,
>  	CXL_MBOX_OP_MAX			= 0x10000
>  };
>  
> @@ -462,6 +506,7 @@ struct cxl_mbox_identify {
>  	__le16 inject_poison_limit;
>  	u8 poison_caps;
>  	u8 qos_telemetry_caps;
> +	__le16 dc_event_log_size;
>  } __packed;
>  
>  /*
> @@ -617,7 +662,27 @@ struct cxl_mbox_set_partition_info {
>  	u8 flags;
>  } __packed;
>  
> +struct cxl_mbox_get_dc_config {
> +	u8 region_count;
> +	u8 start_region_index;
> +} __packed;
> +
> +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> +struct cxl_mbox_dynamic_capacity {
> +	u8 avail_region_count;
> +	u8 rsvd[7];
> +	struct cxl_dc_region_config {
> +		__le64 region_base;
> +		__le64 region_decode_length;
> +		__le64 region_length;
> +		__le64 region_block_size;
> +		__le32 region_dsmad_handle;
> +		u8 flags;
> +		u8 rsvd[3];
> +	} __packed region[];
> +} __packed;
>  #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
> +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
>  
>  /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
>  struct cxl_mbox_set_timestamp_in {
> @@ -742,6 +807,7 @@ enum {
>  int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
>  			  struct cxl_mbox_cmd *cmd);
>  int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
>  int cxl_await_media_ready(struct cxl_dev_state *cxlds);
>  int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
>  int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 4e2845b7331a..ac1a41bc083d 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -742,6 +742,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	if (rc)
>  		return rc;
>  
> +	rc = cxl_dev_dynamic_capacity_identify(mds);
> +	if (rc)
> +		return rc;
> +
>  	rc = cxl_mem_create_range_info(mds);
>  	if (rc)
>  		return rc;
> 
> -- 
> 2.40.0
> 

-- 
Fan Ni <nifan@outlook.com>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device
  2023-06-14 23:49   ` Alison Schofield
@ 2023-06-15 22:46     ` Ira Weiny
  0 siblings, 0 replies; 55+ messages in thread
From: Ira Weiny @ 2023-06-15 22:46 UTC (permalink / raw)
  To: Alison Schofield, ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl

Alison Schofield wrote:
> On Wed, Jun 14, 2023 at 12:16:28PM -0700, Ira Weiny wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > Read the Dynamic capacity configuration and store dynamic capacity region
> > information in the device state which driver will use to map into the HDM
> > ranges.
> > 
> > Implement Get Dynamic Capacity Configuration (opcode 4800h) mailbox
> > command as specified in CXL 3.0 spec section 8.2.9.8.9.1.
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > 
> > ---
> > [iweiny: ensure all mds->dc_region's are named]
> > ---

[snip]

> > @@ -666,6 +697,7 @@ static int cxl_xfer_log(struct cxl_memdev_state *mds, uuid_t *uuid,
> >  static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
> >  {
> >  	struct cxl_cel_entry *cel_entry;
> > +	struct cxl_mem_command *cmd;
> >  	const int cel_entries = size / sizeof(*cel_entry);
> >  	struct device *dev = mds->cxlds.dev;
> >  	int i;
> > @@ -674,11 +706,12 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
> >  
> >  	for (i = 0; i < cel_entries; i++) {
> >  		u16 opcode = le16_to_cpu(cel_entry[i].opcode);
> > -		struct cxl_mem_command *cmd = cxl_mem_find_command(opcode);
> > +		cmd = cxl_mem_find_command(opcode);
> 
> Is the move of the 'cmd' define related to this patch?
> Checkpatch warns on it: WARNING: Missing a blank line after declarations

That seems unneeded.  Perhaps left over from a previous version.  I've
moved it back.

> 
> >  
> > -		if (!cmd && !cxl_is_poison_command(opcode)) {
> > -			dev_dbg(dev,
> > -				"Opcode 0x%04x unsupported by driver\n", opcode);
> > +		if (!cmd && !cxl_is_poison_command(opcode) &&
> > +		    !cxl_is_dcd_command(opcode)) {
> > +			dev_dbg(dev, "Opcode 0x%04x unsupported by driver\n",
> > +				opcode);
> >  			continue;
> >  		}
> >  

[snip]

> > +
> > +	/*
> > +	 * Calculate entire DPA range of all configured regions which will be mapped by
> > +	 * one or more HDM decoders
> > +	 */
> 
> Comment is needlessly going >80 chars.

My checkpatch script is set to 100 lines due to the recent change.  But
this line length is unneeded here.  Thanks for noticing.

Fixed.

> 
> 
> > +	mds->total_dynamic_capacity =
> > +		mds->dc_region[mds->nr_dc_region - 1].base +
> > +		mds->dc_region[mds->nr_dc_region - 1].decode_len -
> > +		mds->dc_region[0].base;
> > +	dev_dbg(dev, "Total dynamic capacity: %#llx\n",
> > +		mds->total_dynamic_capacity);
> > +
> > +dc_error:
> > +	kvfree(dc);
> > +	return rc;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> > +
> >  static int add_dpa_res(struct device *dev, struct resource *parent,
> >  		       struct resource *res, resource_size_t start,
> >  		       resource_size_t size, const char *type)
> > @@ -1112,6 +1275,11 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> >  	struct cxl_dev_state *cxlds = &mds->cxlds;
> >  	struct device *dev = cxlds->dev;
> >  	int rc;
> > +	size_t untenanted_mem =
> > +		mds->dc_region[0].base - mds->total_static_capacity;
> 
> Perhaps:
> 	size_t untenanted_mem;  (and put that in reverse x-tree order)
> 
> 	untenanted_mem = mds->dc_region[0].base - mds->total_static_capacity;

That looks good, fixed.

> 
> > +
> > +	mds->total_capacity = mds->total_static_capacity +
> > +			untenanted_mem + mds->total_dynamic_capacity;
> >  
> 
> Also, looking at this first patch with the long names, wondering if
> there is an opportunity to (re-)define these fields in fewers chars.
> Do we have to describe with 'total'? Is there a partial?
> 
> I guess I'll get to the defines further down...
> 
> 

[snip]

> > @@ -334,9 +349,12 @@ struct cxl_dev_state {
> >   *                (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
> >   * @mbox_mutex: Mutex to synchronize mailbox access.
> >   * @firmware_version: Firmware version for the memory device.
> > + * @dcd_cmds: List of DCD commands implemented by memory device
> >   * @enabled_cmds: Hardware commands found enabled in CEL.
> >   * @exclusive_cmds: Commands that are kernel-internal only
> > - * @total_bytes: sum of all possible capacities
> > + * @total_capacity: Sum of static and dynamic capacities
> > + * @total_static_capacity: Sum of RAM and PMEM capacities
> > + * @total_dynamic_capacity: Complete DPA range occupied by DC regions
> >   * @volatile_only_bytes: hard volatile capacity
> >   * @persistent_only_bytes: hard persistent capacity
> >   * @partition_align_bytes: alignment size for partition-able capacity
> > @@ -344,6 +362,10 @@ struct cxl_dev_state {
> >   * @active_persistent_bytes: sum of hard + soft persistent
> >   * @next_volatile_bytes: volatile capacity change pending device reset
> >   * @next_persistent_bytes: persistent capacity change pending device reset
> > + * @nr_dc_region: number of DC regions implemented in the memory device
> > + * @dc_region: array containing info about the DC regions
> > + * @dc_event_log_size: The number of events the device can store in the
> > + * Dynamic Capacity Event Log before it overflows
> >   * @event: event log driver state
> >   * @poison: poison driver state info
> >   * @mbox_send: @dev specific transport for transmitting mailbox commands
> > @@ -357,9 +379,13 @@ struct cxl_memdev_state {
> >  	size_t lsa_size;
> >  	struct mutex mbox_mutex; /* Protects device mailbox and firmware */
> >  	char firmware_version[0x10];
> > +	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
> >  	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
> >  	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> > -	u64 total_bytes;
> > +
> > +	u64 total_capacity;
> > +	u64 total_static_capacity;
> > +	u64 total_dynamic_capacity;
> 
> maybe cap, static_cap, dynamic_cap

Since these are new it is probably better to use shorter names.  Also we
have good kdocs for each above.

> 
> (because I think I had a hand in defining the long names that
> follow and deeply regret it ;))

Well no one made a comment to correct them.  So remember:  "There is no
crying in CXL!"

[snip]

> > +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> > +struct cxl_mbox_dynamic_capacity {
> > +	u8 avail_region_count;
> > +	u8 rsvd[7];
> > +	struct cxl_dc_region_config {
> > +		__le64 region_base;
> > +		__le64 region_decode_length;
> > +		__le64 region_length;
> > +		__le64 region_block_size;
> > +		__le32 region_dsmad_handle;
> > +		u8 flags;
> > +		u8 rsvd[3];
> > +	} __packed region[];
> > +} __packed;
> >  #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
> 
> This ^ goes with the cxl_mbox_set_partition_info above.
> Please don't split.

Oh yea that is bad.  Fixed.

Thanks for looking,
Ira

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support.
  2023-06-15  0:21   ` Alison Schofield
@ 2023-06-16  2:06     ` Ira Weiny
  2023-06-16 15:56       ` Alison Schofield
  0 siblings, 1 reply; 55+ messages in thread
From: Ira Weiny @ 2023-06-16  2:06 UTC (permalink / raw)
  To: Alison Schofield, ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl

Alison Schofield wrote:
> On Wed, Jun 14, 2023 at 12:16:29PM -0700, Ira Weiny wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > CXL devices optionally support dynamic capacity. CXL Regions must be
> > created to access this capacity.
> > 
> > Add sysfs entries to create dynamic capacity cxl regions. Provide a new
> > Dynamic Capacity decoder mode which targets dynamic capacity on devices
> > which are added to that region.
> 
> This is a lot in one patch, especially where it weaves in and out of
> existing code. I'm wondering if this can be introduced in smaller
> pieces (patches). An introductory patch explaining the DC DPA 
> allocations might be a useful chunk to pull forward. 

The patch is < 800 lines long.  And would be closer to 700 lines if there
were not 8 different 'modes' for the various DC regions.

It is also very self contained in that it implements the region creation
for DC DPAs fully.  And I know that Dan prefers patches larger if they are
all part of the same functionality.

Dan?

Ira

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/5] cxl/mem : Expose dynamic capacity configuration to userspace
  2023-06-15  0:40   ` Alison Schofield
@ 2023-06-16  2:47     ` Ira Weiny
  2023-06-16 15:58       ` Dave Jiang
  0 siblings, 1 reply; 55+ messages in thread
From: Ira Weiny @ 2023-06-16  2:47 UTC (permalink / raw)
  To: Alison Schofield, ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl

Alison Schofield wrote:
> On Wed, Jun 14, 2023 at 12:16:30PM -0700, Ira Weiny wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > Exposing driver cached dynamic capacity configuration through sysfs
> > attributes.User will create one or more dynamic capacity
> > cxl regions based on this information and map the dynamic capacity of
> > the device into HDM ranges using one or more HDM decoders.
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > 
> > ---
> > [iweiny: fixups]
> > [djbw: fixups, no sign-off: preview only]
> > ---
> >  drivers/cxl/core/memdev.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 72 insertions(+)
> 
> Add the documentation of these new attributes in this patch.
> Documentation/ABI/testing/sysfs-bus-cxl

Good point.  And the region creation patch needs some updating for the
sysfs documentation as well...

Thanks!  I'll work on those.

Writing the documentation it seems like 'dc_region_count' should just be
'region_count'.  Because the 'dc' is redundant with the directory.
However, dcY_size has a redundant 'dc' but Y_size (ie 0_size) seems
odd.[*]

Thoughts on the 'dc' prefix for these?

[*] example listing with 2 DC regions supported.

$ ll mem1/dc/
total 0
-r--r--r-- 1 root root 4096 Jun 15 19:26 dc0_size
-r--r--r-- 1 root root 4096 Jun 15 19:26 dc1_size
-r--r--r-- 1 root root 4096 Jun 15 19:26 dc_regions_count

> 
> A bit of my ignorance here, but when I keep seeing the word
> 'regions' below, it makes me wonder whether these attributes
> are in the right place?

There is a difference between 'DC region' and CXL 'Linux' region.  It has
taken me some time to get used to the terminology.  So I think this is
correct.

> 
> > 
> > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> > index 5d1ba7a72567..beeb5fa3a0aa 100644
> > --- a/drivers/cxl/core/memdev.c
> > +++ b/drivers/cxl/core/memdev.c
> > @@ -99,6 +99,20 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
> >  static struct device_attribute dev_attr_pmem_size =
> >  	__ATTR(size, 0444, pmem_size_show, NULL);
> >  
> > +static ssize_t dc_regions_count_show(struct device *dev, struct device_attribute *attr,
> > +		char *buf)
> > +{
> > +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> > +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> > +	int len = 0;
> > +
> > +	len = sysfs_emit(buf, "0x%x\n", mds->nr_dc_region);
> 
> Prefer using this notation: %#llx
> grep for the sysfs_emit's to see customary usage.

oh.  I did see this oddity when I was testing and forgot to change this.

However, I think %#llx needs to be used in show_size_regionN() and this
needs to be %d.  This is just a count of the number of DC regions
supported by the device.  I don't think that needs to be in hex.  Changed
to %d.

> 
> > +	return len;
> > +}
> > +
> > +struct device_attribute dev_attr_dc_regions_count =
> > +	__ATTR(dc_regions_count, 0444, dc_regions_count_show, NULL);
> > +
> >  static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
> >  			   char *buf)
> >  {
> > @@ -362,6 +376,57 @@ static struct attribute *cxl_memdev_ram_attributes[] = {
> >  	NULL,
> >  };
> >  
> > +static ssize_t show_size_regionN(struct cxl_memdev *cxlmd, char *buf, int pos)
> > +{
> > +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> > +
> > +	return sysfs_emit(buf, "0x%llx\n", mds->dc_region[pos].decode_len);

... changed this one to %#llx.

Ira

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD)
  2023-06-15  0:56 ` [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD) Alison Schofield
@ 2023-06-16  2:57   ` Ira Weiny
  0 siblings, 0 replies; 55+ messages in thread
From: Ira Weiny @ 2023-06-16  2:57 UTC (permalink / raw)
  To: Alison Schofield, ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl

Alison Schofield wrote:
> On Wed, Jun 14, 2023 at 12:16:27PM -0700, Ira Weiny wrote:
> 
> Is there a repo you can share?

:-/  I did not push this version anywhere.  I can recreate and push if you
like.  V2 will be based on Dan's new 12 patch clean up series.

https://lore.kernel.org/all/168679257511.3436160.9707734364766526576.stgit@dwillia2-xfh.jf.intel.com/

> If not, how about a recipe for applying these to cxl/next?

Sure.  Starting from v6.5-rc5

b4 shazam 168592149709.1948938.8663425987110396027.stgit@dwillia2-xfh.jf.intel.com
b4 shazam 20230604-dcd-type2-upstream-v1-0-71b6341bae54@intel.com

Will get you this branch.

> (Not trying to run, just want to load and view)
> 
> Thanks!

Sure.

Ira

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support.
  2023-06-15 18:28       ` Dave Jiang
@ 2023-06-16  3:52         ` Navneet Singh
  0 siblings, 0 replies; 55+ messages in thread
From: Navneet Singh @ 2023-06-16  3:52 UTC (permalink / raw)
  To: Dave Jiang; +Cc: Ira Weiny, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl

On Thu, Jun 15, 2023 at 11:28:26AM -0700, Dave Jiang wrote:
> 
> 
> On 6/15/23 11:12, Ira Weiny wrote:
> > Dave Jiang wrote:
> > > 
> > > 
> > > On 6/14/23 12:16, ira.weiny@intel.com wrote:
> > > > From: Navneet Singh <navneet.singh@intel.com>
> > > > 
> > > > CXL devices optionally support dynamic capacity. CXL Regions must be
> > > > created to access this capacity.
> > > > 
> > > > Add sysfs entries to create dynamic capacity cxl regions. Provide a new
> > > > Dynamic Capacity decoder mode which targets dynamic capacity on devices
> > > > which are added to that region.
> > > > 
> > > > Below are the steps to create and delete dynamic capacity region0
> > > > (example).
> > > > 
> > > >       region=$(cat /sys/bus/cxl/devices/decoder0.0/create_dc_region)
> > > >       echo $region> /sys/bus/cxl/devices/decoder0.0/create_dc_region
> > > >       echo 256 > /sys/bus/cxl/devices/$region/interleave_granularity
> > > >       echo 1 > /sys/bus/cxl/devices/$region/interleave_ways
> > > > 
> > > >       echo "dc0" >/sys/bus/cxl/devices/decoder1.0/mode
> > > >       echo 0x400000000 >/sys/bus/cxl/devices/decoder1.0/dpa_size
> > > > 
> > > >       echo 0x400000000 > /sys/bus/cxl/devices/$region/size
> > > >       echo  "decoder1.0" > /sys/bus/cxl/devices/$region/target0
> > > >       echo 1 > /sys/bus/cxl/devices/$region/commit
> > > >       echo $region > /sys/bus/cxl/drivers/cxl_region/bind
> > > > 
> > > >       echo $region> /sys/bus/cxl/devices/decoder0.0/delete_region
> > > > 
> > > > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > > > 
> > > > ---
> > > > [iweiny: fixups]
> > > > [iweiny: remove unused CXL_DC_REGION_MODE macro]
> > > > [iweiny: Make dc_mode_to_region_index static]
> > > > [iweiny: simplify <sysfs>/create_dc_region]
> > > > [iweiny: introduce decoder_mode_is_dc]
> > > > [djbw: fixups, no sign-off: preview only]
> > > > ---
> > > >    drivers/cxl/Kconfig       |  11 +++
> > > >    drivers/cxl/core/core.h   |   7 ++
> > > >    drivers/cxl/core/hdm.c    | 234 ++++++++++++++++++++++++++++++++++++++++++----
> > > >    drivers/cxl/core/port.c   |  18 ++++
> > > >    drivers/cxl/core/region.c | 135 ++++++++++++++++++++++++--
> > > >    drivers/cxl/cxl.h         |  28 ++++++
> > > >    drivers/dax/cxl.c         |   4 +
> > > >    7 files changed, 409 insertions(+), 28 deletions(-)
> > > > 
> > > > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > > > index ff4e78117b31..df034889d053 100644
> > > > --- a/drivers/cxl/Kconfig
> > > > +++ b/drivers/cxl/Kconfig
> > > > @@ -121,6 +121,17 @@ config CXL_REGION
> > > >    	  If unsure say 'y'
> > > > +config CXL_DCD
> > > > +	bool "CXL: DCD Support"
> > > > +	default CXL_BUS
> > > > +	depends on CXL_REGION
> > > > +	help
> > > > +	  Enable the CXL core to provision CXL DCD regions.
> > > > +	  CXL devices optionally support dynamic capacity and DCD region
> > > > +	  maps the dynamic capacity regions DPA's into Host HPA ranges.
> > > > +
> > > > +	  If unsure say 'y'
> > > > +
> > > >    config CXL_REGION_INVALIDATION_TEST
> > > >    	bool "CXL: Region Cache Management Bypass (TEST)"
> > > >    	depends on CXL_REGION
> > > > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> > > > index 27f0968449de..725700ab5973 100644
> > > > --- a/drivers/cxl/core/core.h
> > > > +++ b/drivers/cxl/core/core.h
> > > > @@ -9,6 +9,13 @@ extern const struct device_type cxl_nvdimm_type;
> > > >    extern struct attribute_group cxl_base_attribute_group;
> > > > +#ifdef CONFIG_CXL_DCD
> > > > +extern struct device_attribute dev_attr_create_dc_region;
> > > > +#define SET_CXL_DC_REGION_ATTR(x) (&dev_attr_##x.attr),
> > > > +#else
> > > > +#define SET_CXL_DC_REGION_ATTR(x)
> > > > +#endif
> > > > +
> > > >    #ifdef CONFIG_CXL_REGION
> > > >    extern struct device_attribute dev_attr_create_pmem_region;
> > > >    extern struct device_attribute dev_attr_create_ram_region;
> > > > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > > > index 514d30131d92..29649b47d177 100644
> > > > --- a/drivers/cxl/core/hdm.c
> > > > +++ b/drivers/cxl/core/hdm.c
> > > > @@ -233,14 +233,23 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> > > >    	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > > >    	struct resource *res = cxled->dpa_res;
> > > >    	resource_size_t skip_start;
> > > > +	resource_size_t skipped = cxled->skip;
> > > >    	lockdep_assert_held_write(&cxl_dpa_rwsem);
> > > >    	/* save @skip_start, before @res is released */
> > > > -	skip_start = res->start - cxled->skip;
> > > > +	skip_start = res->start - skipped;
> > > >    	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
> > > > -	if (cxled->skip)
> > > > -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> > > > +	if (cxled->skip != 0) {
> > > > +		while (skipped != 0) {
> > > > +			res = xa_load(&cxled->skip_res, skip_start);
> > > > +			__release_region(&cxlds->dpa_res, skip_start,
> > > > +							resource_size(res));
> > > > +			xa_erase(&cxled->skip_res, skip_start);
> > > > +			skip_start += resource_size(res);
> > > > +			skipped -= resource_size(res);
> > > > +			}
> > > > +	}
> > > >    	cxled->skip = 0;
> > > >    	cxled->dpa_res = NULL;
> > > >    	put_device(&cxled->cxld.dev);
> > > > @@ -267,6 +276,19 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> > > >    	__cxl_dpa_release(cxled);
> > > >    }
> > > > +static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
> > > > +{
> > > > +	int index = 0;
> > > > +
> > > > +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> > > > +		if (mode == i)
> > > > +			return index;
> > > > +		index++;
> > > > +	}
> > > > +
> > > > +	return -EINVAL;
> > > > +}
> > > > +
> > > >    static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> > > >    			     resource_size_t base, resource_size_t len,
> > > >    			     resource_size_t skipped)
> > > > @@ -275,7 +297,11 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> > > >    	struct cxl_port *port = cxled_to_port(cxled);
> > > >    	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > > >    	struct device *dev = &port->dev;
> > > > +	struct device *ed_dev = &cxled->cxld.dev;
> > > > +	struct resource *dpa_res = &cxlds->dpa_res;
> > > > +	resource_size_t skip_len = 0;
> > > >    	struct resource *res;
> > > > +	int rc, index;
> > > >    	lockdep_assert_held_write(&cxl_dpa_rwsem);
> > > > @@ -304,28 +330,119 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> > > >    	}
> > > >    	if (skipped) {
> > > > -		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> > > > -				       dev_name(&cxled->cxld.dev), 0);
> > > > -		if (!res) {
> > > > -			dev_dbg(dev,
> > > > -				"decoder%d.%d: failed to reserve skipped space\n",
> > > > -				port->id, cxled->cxld.id);
> > > > -			return -EBUSY;
> > > > +		resource_size_t skip_base = base - skipped;
> > > > +
> > > > +		if (decoder_mode_is_dc(cxled->mode)) {
> > > 
> > > Maybe move this entire block to a helper function to reduce the size of
> > > the current function and reduce indent levels and improve readability?
> > 
> > :-/
> > 
> > I'll work on breaking it out more.  The logic here is getting kind of
> > crazy.
Navneet - yeah, its like splitting the skip in ram, pmem, dc
regions and gaps between the regions.Helper can be done.
> > 
> > > 
> > > > +			if (resource_size(&cxlds->ram_res) &&
> > > > +					skip_base <= cxlds->ram_res.end) {
> > > > +				skip_len = cxlds->ram_res.end - skip_base + 1;
> > > > +				res = __request_region(dpa_res, skip_base,
> > > > +						skip_len, dev_name(ed_dev), 0);
> > > > +				if (!res)
> > > > +					goto error;
> > > > +
> > > > +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> > > > +								GFP_KERNEL);
> > > > +				skip_base += skip_len;
> > > > +			}
> > > > +
> > > > +			if (resource_size(&cxlds->ram_res) &&
> >                                                    ^^^^^^^
> > 						  pmem_res?
> > 
> > > > +					skip_base <= cxlds->pmem_res.end) {
> > 
> > The 2 if statements here are almost exactly the same.  To the point I
> > wonder if there is a bug.
> > 
> > Navneet,
> > 
> > Why does the code check ram_res the second time but go on to use pmem_res
> > in the block?
> > 
> > > > +				skip_len = cxlds->pmem_res.end - skip_base + 1;
> > > > +				res = __request_region(dpa_res, skip_base,
> > > > +						skip_len, dev_name(ed_dev), 0);
> > > > +				if (!res)
> > > > +					goto error;
> > > > +
> > > > +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> > > > +								GFP_KERNEL);
> > > > +				skip_base += skip_len;
> > > > +			}
> > > > +
> > > > +			index = dc_mode_to_region_index(cxled->mode);
> > > > +			for (int i = 0; i <= index; i++) {
> > > > +				struct resource *dcr = &cxlds->dc_res[i];
> > > > +
> > > > +				if (skip_base < dcr->start) {
> > > > +					skip_len = dcr->start - skip_base;
> > > > +					res = __request_region(dpa_res,
> > > > +							skip_base, skip_len,
> > > > +							dev_name(ed_dev), 0);
> > > > +					if (!res)
> > > > +						goto error;
> > > > +
> > > > +					rc = xa_insert(&cxled->skip_res, skip_base,
> > > > +							res, GFP_KERNEL);
> > > > +					skip_base += skip_len;
> > > > +				}
> > > > +
> > > > +				if (skip_base == base) {
> > > > +					dev_dbg(dev, "skip done!\n");
> > > > +					break;
> > > > +				}
> > > > +
> > > > +				if (resource_size(dcr) &&
> > > > +						skip_base <= dcr->end) {
> > > > +					if (skip_base > base)
> > > > +						dev_err(dev, "Skip error\n");
> > > > +
> > > > +					skip_len = dcr->end - skip_base + 1;
> > > > +					res = __request_region(dpa_res, skip_base,
> > > > +							skip_len,
> > > > +							dev_name(ed_dev), 0);
> > > > +					if (!res)
> > > > +						goto error;
> > > > +
> > > > +					rc = xa_insert(&cxled->skip_res, skip_base,
> > > > +							res, GFP_KERNEL);
> > > > +					skip_base += skip_len;
> > > > +				}
> > > > +			}
> > > > +		} else	{
> > > > +			res = __request_region(dpa_res, base - skipped, skipped,
> > > > +							dev_name(ed_dev), 0);
> > > > +			if (!res)
> > > > +				goto error;
> > > > +
> > > > +			rc = xa_insert(&cxled->skip_res, skip_base, res,
> > > > +								GFP_KERNEL);
> > > >    		}
> > > >    	}
> > > > -	res = __request_region(&cxlds->dpa_res, base, len,
> > > > -			       dev_name(&cxled->cxld.dev), 0);
> > > > +
> > > > +	res = __request_region(dpa_res, base, len, dev_name(ed_dev), 0);
> > > >    	if (!res) {
> > > >    		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
> > > > -			port->id, cxled->cxld.id);
> > > > -		if (skipped)
> > > > -			__release_region(&cxlds->dpa_res, base - skipped,
> > > > -					 skipped);
> > > > +				port->id, cxled->cxld.id);
> > > > +		if (skipped) {
> > > > +			resource_size_t skip_base = base - skipped;
> > > > +
> > > > +			while (skipped != 0) {
> > > > +				if (skip_base > base)
> > > > +					dev_err(dev, "Skip error\n");
> > > > +
> > > > +				res = xa_load(&cxled->skip_res, skip_base);
> > > > +				__release_region(dpa_res, skip_base,
> > > > +							resource_size(res));
> > > > +				xa_erase(&cxled->skip_res, skip_base);
> > > > +				skip_base += resource_size(res);
> > > > +				skipped -= resource_size(res);
> > > > +			}
> > > > +		}
> > > >    		return -EBUSY;
> > > >    	}
> > > >    	cxled->dpa_res = res;
> > > >    	cxled->skip = skipped;
> > > > +	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
> > > > +		int index = dc_mode_to_region_index(mode);
> > > > +
> > > > +		if (resource_contains(&cxlds->dc_res[index], res)) {
> > > > +			cxled->mode = mode;
> > > > +			dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id,
> > > > +				cxled->cxld.id, cxled->dpa_res, cxled->mode);
> > > > +			goto success > +		}
> > > > +	}
> > > 
> > > This block should only happen if decoder_mode_is_dc() right? If that's
> > > the case, you might be able to refactor it so the 'goto success' isn't
> > > necessary.
> > 
> > I'll check.  I looked through this code a couple of times in my review
> > before posting because I'm not 100% sure I want to see 8 different modes
> > DC decoders and regions.
> > 
> > I think the 'mode' should be 'DC' with an index in the endpoint decoder to
> > map DC region that decoder is mapping.  But that change was much bigger to
> > Navneets code and I wanted to see how others felt about having DC0 - DC7
> > modes.  My compromise was creating decoder_mode_is_dc().
> > 
Navneet - Discussed with Dan before splitting the modes in dc0-dc7.
Intent is to keep the Linux definition simple and enforce one decoder per DC region.
Each DC region will have its own DSMAS entry.The primary reason to 
have multiple DC regions is to have different performance properties.


> > > 
> > > >    	if (resource_contains(&cxlds->pmem_res, res))
> > > >    		cxled->mode = CXL_DECODER_PMEM;
> > > >    	else if (resource_contains(&cxlds->ram_res, res))
> > > > @@ -336,9 +453,16 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> > > >    		cxled->mode = CXL_DECODER_MIXED;
> > > >    	}
> > > > +success:
> > > >    	port->hdm_end++;
> > > >    	get_device(&cxled->cxld.dev);
> > > >    	return 0;
> > > > +
> > > > +error:
> > > > +	dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space\n",
> > > > +			port->id, cxled->cxld.id);
> > > > +	return -EBUSY;
> > > > +
> > > >    }
> > > >    int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> > > > @@ -429,6 +553,14 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
> > > >    	switch (mode) {
> > > >    	case CXL_DECODER_RAM:
> > > >    	case CXL_DECODER_PMEM:
> > > > +	case CXL_DECODER_DC0:
> > > > +	case CXL_DECODER_DC1:
> > > > +	case CXL_DECODER_DC2:
> > > > +	case CXL_DECODER_DC3:
> > > > +	case CXL_DECODER_DC4:
> > > > +	case CXL_DECODER_DC5:
> > > > +	case CXL_DECODER_DC6:
> > > > +	case CXL_DECODER_DC7:
> > 
> > For example this seems very hacky...
> 
> Not sure if it helps, but you can always do:
> case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
> 
> DJ
> 
Navneet - Definetly, thanks.
> > 
> > [snip]
> > 
> > > > +/*
> > > > + * The region can not be manged by CXL if any portion of
> > > > + * it is already online as 'System RAM'
> > > > + */
> > > > +static bool region_is_system_ram(struct cxl_region *cxlr,
> > > > +				 struct cxl_region_params *p)
> > > > +{
> > > > +	return (walk_iomem_res_desc(IORES_DESC_NONE,
> > > > +				    IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> > > > +				    p->res->start, p->res->end, cxlr,
> > > > +				    is_system_ram) > 0);
> > > > +}
> > > > +
> > > >    static int cxl_region_probe(struct device *dev)
> > > >    {
> > > >    	struct cxl_region *cxlr = to_cxl_region(dev);
> > > > @@ -3174,14 +3287,7 @@ static int cxl_region_probe(struct device *dev)
> > > >    	case CXL_DECODER_PMEM:
> > > >    		return devm_cxl_add_pmem_region(cxlr);
> > > >    	case CXL_DECODER_RAM:
> > > > -		/*
> > > > -		 * The region can not be manged by CXL if any portion of
> > > > -		 * it is already online as 'System RAM'
> > > > -		 */
> > > > -		if (walk_iomem_res_desc(IORES_DESC_NONE,
> > > > -					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> > > > -					p->res->start, p->res->end, cxlr,
> > > > -					is_system_ram) > 0)
> > > > +		if (region_is_system_ram(cxlr, p))
> > > 
> > > Maybe split this change out as a prep patch before the current patch.
> > 
> > That seems reasonable.  But the patch is not so large and the
> > justification for creating a helper is that we need this same check for DC
> > regions.  So it seemed ok to leave it like this.  Let me see about
> > splitting it out.
> > 
> > [snip]
> > 
> > > > diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> > > > index ccdf8de85bd5..eb5eb81bfbd7 100644
> > > > --- a/drivers/dax/cxl.c
> > > > +++ b/drivers/dax/cxl.c
> > > > @@ -23,11 +23,15 @@ static int cxl_dax_region_probe(struct device *dev)
> > > >    	if (!dax_region)
> > > >    		return -ENOMEM;
> > > > +	if (decoder_mode_is_dc(cxlr->mode))
> > > > +		return 0;
> > > > +
> > > >    	data = (struct dev_dax_data) {
> > > >    		.dax_region = dax_region,
> > > >    		.id = -1,
> > > >    		.size = range_len(&cxlr_dax->hpa_range),
> > > >    	};
> > > > +
> > > 
> > > Stray blank line?
> > 
> > Opps!  Fixed!
> > 
> > Ira

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/5] cxl/mem: Add support to handle DCD add and release capacity events.
  2023-06-15  2:19   ` Alison Schofield
@ 2023-06-16  4:11     ` Ira Weiny
  2023-06-27 18:20       ` Fan Ni
  0 siblings, 1 reply; 55+ messages in thread
From: Ira Weiny @ 2023-06-16  4:11 UTC (permalink / raw)
  To: Alison Schofield, ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl

Alison Schofield wrote:
> On Wed, Jun 14, 2023 at 12:16:31PM -0700, Ira Weiny wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > A dynamic capacity device utilizes events to signal the host about the
> > changes to the allocation of DC blocks. The device communicates the
> > state of these blocks of dynamic capacity through an extent list that
> > describes the starting DPA and length of all blocks the host can access.
> > 
> > Based on the dynamic capacity add or release event type,
> > dynamic memory represented by the extents are either added
> > or removed as devdax device.
> 
> Nice commit msg, please align second paragraph w first.

ok... fixed.  :-)

> 
> > 
> > Process the dynamic capacity add and release events.
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > 
> > ---
> > [iweiny: Remove invalid comment]
> > ---
> >  drivers/cxl/core/mbox.c   | 345 +++++++++++++++++++++++++++++++++++++++++++++-
> >  drivers/cxl/core/region.c | 214 +++++++++++++++++++++++++++-
> >  drivers/cxl/core/trace.h  |   3 +-
> >  drivers/cxl/cxl.h         |   4 +-
> >  drivers/cxl/cxlmem.h      |  76 ++++++++++
> >  drivers/cxl/pci.c         |  10 +-
> >  drivers/dax/bus.c         |  11 +-
> >  drivers/dax/bus.h         |   5 +-
> >  8 files changed, 652 insertions(+), 16 deletions(-)
> > 
> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index c5b696737c87..db9295216de5 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -767,6 +767,14 @@ static const uuid_t log_uuid[] = {
> >  	[VENDOR_DEBUG_UUID] = DEFINE_CXL_VENDOR_DEBUG_UUID,
> >  };
> >  
> > +/* See CXL 3.0 8.2.9.2.1.5 */
> > +enum dc_event {
> > +	ADD_CAPACITY,
> > +	RELEASE_CAPACITY,
> > +	FORCED_CAPACITY_RELEASE,
> > +	REGION_CONFIGURATION_UPDATED,
> > +};
> > +
> >  /**
> >   * cxl_enumerate_cmds() - Enumerate commands for a device.
> >   * @mds: The driver data for the operation
> > @@ -852,6 +860,14 @@ static const uuid_t mem_mod_event_uuid =
> >  	UUID_INIT(0xfe927475, 0xdd59, 0x4339,
> >  		  0xa5, 0x86, 0x79, 0xba, 0xb1, 0x13, 0xb7, 0x74);
> >  
> > +/*
> > + * Dynamic Capacity Event Record
> > + * CXL rev 3.0 section 8.2.9.2.1.3; Table 8-45
> > + */
> > +static const uuid_t dc_event_uuid =
> > +	UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c,
> > +		0x2f, 0x95, 0x26, 0x8e, 0x10, 0x1a, 0x2a);
> > +
> >  static void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> >  				   enum cxl_event_log_type type,
> >  				   struct cxl_event_record_raw *record)
> > @@ -945,6 +961,188 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> >  	return rc;
> >  }
> >  
> > +static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
> > +				struct cxl_mbox_dc_response *res,
> > +				int extent_cnt, int opcode)
> > +{
> > +	struct cxl_mbox_cmd mbox_cmd;
> > +	int rc, size;
> > +
> > +	size = struct_size(res, extent_list, extent_cnt);
> > +	res->extent_list_size = cpu_to_le32(extent_cnt);
> > +
> > +	mbox_cmd = (struct cxl_mbox_cmd) {
> > +		.opcode = opcode,
> > +		.size_in = size,
> > +		.payload_in = res,
> > +	};
> > +
> > +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> > +
> > +	return rc;
> > +
> > +}
> > +
> > +static int cxl_prepare_ext_list(struct cxl_mbox_dc_response **res,
> > +					int *n, struct range *extent)
> > +{
> > +	struct cxl_mbox_dc_response *dc_res;
> > +	unsigned int size;
> > +
> > +	if (!extent)
> > +		size = struct_size(dc_res, extent_list, 0);
> > +	else
> > +		size = struct_size(dc_res, extent_list, *n + 1);
> > +
> > +	dc_res = krealloc(*res, size, GFP_KERNEL);
> > +	if (!dc_res)
> > +		return -ENOMEM;
> > +
> > +	if (extent) {
> > +		dc_res->extent_list[*n].dpa_start = cpu_to_le64(extent->start);
> > +		memset(dc_res->extent_list[*n].reserved, 0, 8);
> > +		dc_res->extent_list[*n].length = 
> > +				cpu_to_le64(range_len(extent));
> 
> Unnecessary return. I think that fits in 80 columns.

exactly 80...  fixed.

> 
> > +		(*n)++;
> > +	}
> > +
> > +	*res = dc_res;
> > +	return 0;
> > +}
> > +/**
> > + * cxl_handle_dcd_event_records() - Read DCD event records.
> > + * @mds: The memory device state
> > + *
> > + * Returns 0 if enumerate completed successfully.
> > + *
> > + * CXL devices can generate DCD events to add or remove extents in the list.
> > + */
> 
> That's a kernel doc comment, so maybe can be clearer.

Or remove the kernel doc comment.

> It's called 'handle', so 'Read DCD event records' seems like a mismatch.

Yea.

> Probably needs more explaining.

Rather I would say less.  How about simply:

/* Returns 0 if the event was handled successfully. */

Or even nothing at all.  It is a static function and used 1 place.  Not
sure we even need that line.

> 
> 
> > +static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
> > +					struct cxl_event_record_raw *rec)
> > +{
> > +	struct cxl_mbox_dc_response *dc_res = NULL;
> > +	struct device *dev = mds->cxlds.dev;
> > +	uuid_t *id = &rec->hdr.id;
> > +	struct dcd_event_dyn_cap *record =
> > +			(struct dcd_event_dyn_cap *)rec;
> > +	int extent_cnt = 0, rc = 0;
> > +	struct cxl_dc_extent_data *extent;
> > +	struct range alloc_range, rel_range;
> > +	resource_size_t dpa, size;
> > +
> 
> Please reverse x-tree. And if things like that *record can't fit within
> 80 columns and in reverse x-tree order, then assign it afterwards.

Done.

> 
> 
> > +	if (!uuid_equal(id, &dc_event_uuid))
> > +		return -EINVAL;
> > +
> > +	switch (record->data.event_type) {
> 
> Maybe a local for record->data.extent that is used repeatedly below,
> or,
> Perhaps pull the length and dpa local defines you made down in the
> RELEASE_CAPACITY up here and share them with ADD_CAPACITY. That'll
> reduce the le65_to_cpu noise. Add similar for shared_extn_seq.

I'm thinking ADD_CAPACITY and RELEASE_CAPACITY need to be 2 separate
functions which make this function a simple uuid check and event_type
switch.

Having local variables for those become much cleaner then.

I think the handling of dc_res would be cleaner then too.

> 
> 
> > +	case ADD_CAPACITY:
> > +		extent = devm_kzalloc(dev, sizeof(*extent), GFP_ATOMIC);
> > +		if (!extent)
> > +			return -ENOMEM;
> > +
> > +		extent->dpa_start = le64_to_cpu(record->data.extent.start_dpa);
> > +		extent->length = le64_to_cpu(record->data.extent.length);
> > +		memcpy(extent->tag, record->data.extent.tag,
> > +				sizeof(record->data.extent.tag));
> > +		extent->shared_extent_seq =
> > +			le16_to_cpu(record->data.extent.shared_extn_seq);
> > +		dev_dbg(dev, "Add DC extent DPA:0x%llx LEN:%llx\n",
> > +					extent->dpa_start, extent->length);
> > +		alloc_range = (struct range) {
> > +			.start = extent->dpa_start,
> > +			.end = extent->dpa_start + extent->length - 1,
> > +		};
> > +
> > +		rc = cxl_add_dc_extent(mds, &alloc_range);
> > +		if (rc < 0) {
> 
> How about 
> 		if (rc >=)
> 			goto insert;
> 
> Then you can remove this level of indent.

I think if this is a separate function it will be better...

Also this entire indent block could be another sub function because AFAICS
(see below) it always returns out from this block (only via the 'out'
label in 1 case which seems redundant).

> 
> > +			dev_dbg(dev, "unconsumed DC extent DPA:0x%llx LEN:%llx\n",
> > +					extent->dpa_start, extent->length);
> > +			rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, NULL);
> > +			if (rc < 0) {
> > +				dev_err(dev, "Couldn't create extent list %d\n",
> > +									rc);
> > +				devm_kfree(dev, extent);
> > +				return rc;
> > +			}
> > +
> > +			rc = cxl_send_dc_cap_response(mds, dc_res,
> > +					extent_cnt, CXL_MBOX_OP_ADD_DC_RESPONSE);
> > +			if (rc < 0) {
> > +				devm_kfree(dev, extent);
> > +				goto out;

This if is not doing anything useful.  Because this statement ...

> > +			}
> > +
> > +			kfree(dc_res);
> > +			devm_kfree(dev, extent);

...  and the 'else' here end up being the same logic.  The 'out' label
flows through kfree(dc_res).  Is the intent that
cxl_send_dc_cap_response() has no failure consequences?

> > +
> > +			return 0;
> > +		}
> 
> insert:
> 
> > +
> > +		rc = xa_insert(&mds->dc_extent_list, extent->dpa_start, extent,
> > +				GFP_KERNEL);
> > +		if (rc < 0)
> > +			goto out;
> > +
> > +		mds->num_dc_extents++;
> > +		rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, &alloc_range);
> > +		if (rc < 0) {
> > +			dev_err(dev, "Couldn't create extent list %d\n", rc);
> > +			return rc;
> > +		}
> > +
> > +		rc = cxl_send_dc_cap_response(mds, dc_res, extent_cnt,
> > +					      CXL_MBOX_OP_ADD_DC_RESPONSE);
> > +		if (rc < 0)
> > +			goto out;
> > +
> > +		break;
> > +
> > +	case RELEASE_CAPACITY:
> > +		dpa = le64_to_cpu(record->data.extent.start_dpa);
> > +		size = le64_to_cpu(record->data.extent.length);
> 
> ^^ do these sooner and share

I think add/release should be their own functions.

> 
> > +		dev_dbg(dev, "Release DC extents DPA:0x%llx LEN:%llx\n",
> > +				dpa, size);
> > +		extent = xa_load(&mds->dc_extent_list, dpa);
> > +		if (!extent) {
> > +			dev_err(dev, "No extent found with DPA:0x%llx\n", dpa);
> > +			return -EINVAL;
> > +		}
> > +
> > +		rel_range = (struct range) {
> > +			.start = dpa,
> > +			.end = dpa + size - 1,
> > +		};
> > +
> > +		rc = cxl_release_dc_extent(mds, &rel_range);
> > +		if (rc < 0) {
> > +			dev_dbg(dev, "withhold DC extent DPA:0x%llx LEN:%llx\n",
> > +									dpa, size);
> > +			return 0;
> > +		}
> > +
> > +		xa_erase(&mds->dc_extent_list, dpa);
> > +		devm_kfree(dev, extent);
> > +		mds->num_dc_extents--;
> > +		rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, &rel_range);
> > +		if (rc < 0) {
> > +			dev_err(dev, "Couldn't create extent list %d\n", rc);
> > +			return rc;
> > +		}
> > +
> > +		rc = cxl_send_dc_cap_response(mds, dc_res, extent_cnt,
> > +					      CXL_MBOX_OP_RELEASE_DC);
> > +		if (rc < 0)
> > +			goto out;
> > +
> > +		break;
> > +
> > +	default:
> > +		return -EINVAL;
> > +	}
> > +out:
> 
> The out seems needless. Replace all 'goto out''s  with 'break'
> 
> I'm also a bit concerned about all the direct returns above.
> Can this be the single exit point?

I think so...

> kfree of a NULL ptr is OK.
> Maybe a bit more logic here to do that devm_free is all that
> is needed.

... but even more clean up so that the logic is:

handle_event()
{

	... do checks ...

	switch (type):
	case ADD...:
		rc = handle_add();
		break;
	case RELEASE...:
		rc = handle_release();
		break;
	default:
		rc = -EINVAL;
		break;
	}

	return rc;
}

> 
> 
> > +	kfree(dc_res);
> > +	return rc;
> > +}
> > +
> >  static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> >  				    enum cxl_event_log_type type)
> >  {
> > @@ -982,9 +1180,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> >  		if (!nr_rec)
> >  			break;
> >  
> > -		for (i = 0; i < nr_rec; i++)
> > +		for (i = 0; i < nr_rec; i++) {
> >  			cxl_event_trace_record(cxlmd, type,
> >  					       &payload->records[i]);
> > +			if (type == CXL_EVENT_TYPE_DCD) {
> > +				rc = cxl_handle_dcd_event_records(mds,
> > +						&payload->records[i]);
> > +				if (rc)
> > +					dev_err_ratelimited(dev,
> > +						"dcd event failed: %d\n", rc);
> > +			}
> 
> 
> Reduce indent option:
> 
> 			if (type != CXL_EVENT_TYPE_DCD)
> 				continue;
> 
> 			rc = cxl_handle_dcd_event_records(mds,
> 							  &payload->records[i]);			if (rc)
> 				dev_err_ratelimited(dev,
> 						    "dcd event failed: %d\n", rc);

Ah...  Ok.

Honestly I just made this change and I'm not keen on it.  I think it makes
the detail that the event was DCD obscured.

I'm also questioning the need to the error reporting here.  There seems to
be error messages in the critical parts of cxl_handle_dcd_event_records()
which would give a clue as to why the DCD failed.  (Other than some common
memory allocation issues.)  But also those errors are not rate limited.
So if we are concerned with a FM or other external entity causing events
which flood the logs it seems they all need to be debug or ratelimited.

> 
> I don't know where 'cxl_handle_dcd_event_records() was introduce,
> but I'm wondering now if it can have a short name.

Its the function above which needs all the rework.

> 
> > +		}
> >  
> >  		if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
> >  			trace_cxl_overflow(cxlmd, type, payload);
> > @@ -1024,6 +1230,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
> >  		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_WARN);
> >  	if (status & CXLDEV_EVENT_STATUS_INFO)
> >  		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_INFO);
> > +	if (status & CXLDEV_EVENT_STATUS_DCD)
> > +		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
> >  }
> >  EXPORT_SYMBOL_NS_GPL(cxl_mem_get_event_records, CXL);
> >  
> > @@ -1244,6 +1452,140 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> >  }
> >  EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> >  
> > +int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> > +			      unsigned int *extent_gen_num)
> > +{
> > +	struct device *dev = mds->cxlds.dev;
> > +	struct cxl_mbox_dc_extents *dc_extents;
> > +	struct cxl_mbox_get_dc_extent get_dc_extent;
> > +	unsigned int total_extent_cnt;
> 
> Seems 'count' would probably suffice here.

Done.

> 
> > +	struct cxl_mbox_cmd mbox_cmd;
> > +	int rc;
> 
> Above - reverse x-tree please.

Done.

> 
> > +
> > +	/* Check GET_DC_EXTENT_LIST is supported by device */
> > +	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
> > +		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
> > +		return 0;
> > +	}
> > +
> > +	dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);
> > +	if (!dc_extents)
> > +		return -ENOMEM;
> > +
> > +	get_dc_extent = (struct cxl_mbox_get_dc_extent) {
> > +		.extent_cnt = 0,
> > +		.start_extent_index = 0,
> > +	};
> > +
> > +	mbox_cmd = (struct cxl_mbox_cmd) {
> > +		.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> > +		.payload_in = &get_dc_extent,
> > +		.size_in = sizeof(get_dc_extent),
> > +		.size_out = mds->payload_size,
> > +		.payload_out = dc_extents,
> > +		.min_out = 1,
> > +	};
> > +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> > +	if (rc < 0)
> > +		goto out;
> > +
> > +	total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> > +	*extent_gen_num = le32_to_cpu(dc_extents->extent_list_num);
> > +	dev_dbg(dev, "Total extent count :%d Extent list Generation Num: %d\n",
> > +			total_extent_cnt, *extent_gen_num);
> > +out:
> > +
> > +	kvfree(dc_extents);
> > +	if (rc < 0)
> > +		return rc;
> > +
> > +	return total_extent_cnt;
> > +
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_dev_get_dc_extent_cnt, CXL);
> > +
> > +int cxl_dev_get_dc_extents(struct cxl_memdev_state *mds,
> > +			   unsigned int index, unsigned int cnt)
> > +{
> > +	/* See CXL 3.0 Table 125 dynamic capacity config  Output Payload */
> > +	struct device *dev = mds->cxlds.dev;
> > +	struct cxl_mbox_dc_extents *dc_extents;
> > +	struct cxl_mbox_get_dc_extent get_dc_extent;
> > +	unsigned int extent_gen_num, available_extents, total_extent_cnt;
> > +	int rc;
> > +	struct cxl_dc_extent_data *extent;
> > +	struct cxl_mbox_cmd mbox_cmd;
> > +	struct range alloc_range;
> > +
> 
> Reverse x-tree please.

Done.

> 
> > +	/* Check GET_DC_EXTENT_LIST is supported by device */
> > +	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
> > +		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
> > +		return 0;
> > +	}
> 
> Can we even get this far if this cmd is not supported by the device?
> Is there a sooner place to test those bits?  Is this sysfs request?
> (sorry not completely following here).
> 

I'll have to check.  Perhaps Navneet knows.

> > +
> > +	dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);
> > +	if (!dc_extents)
> > +		return -ENOMEM;
> > +	get_dc_extent = (struct cxl_mbox_get_dc_extent) {
> > +		.extent_cnt = cnt,
> > +		.start_extent_index = index,
> > +	};
> > +
> > +	mbox_cmd = (struct cxl_mbox_cmd) {
> > +		.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> > +		.payload_in = &get_dc_extent,
> > +		.size_in = sizeof(get_dc_extent),
> > +		.size_out = mds->payload_size,
> > +		.payload_out = dc_extents,
> > +		.min_out = 1,
> > +	};
> > +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> > +	if (rc < 0)
> > +		goto out;
> > +
> > +	available_extents = le32_to_cpu(dc_extents->ret_extent_cnt);
> > +	total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> > +	extent_gen_num = le32_to_cpu(dc_extents->extent_list_num);
> > +	dev_dbg(dev, "No Total extent count :%d Extent list Generation Num:%d\n",
> > +			total_extent_cnt, extent_gen_num);
> > +
> > +
> > +	for (int i = 0; i < available_extents ; i++) {
> > +		extent = devm_kzalloc(dev, sizeof(*extent), GFP_KERNEL);
> > +		if (!extent) {
> > +			rc = -ENOMEM;
> > +			goto out;
> > +		}
> > +		extent->dpa_start = le64_to_cpu(dc_extents->extent[i].start_dpa);
> > +		extent->length = le64_to_cpu(dc_extents->extent[i].length);
> > +		memcpy(extent->tag, dc_extents->extent[i].tag,
> > +					sizeof(dc_extents->extent[i].tag));
> > +		extent->shared_extent_seq =
> > +				le16_to_cpu(dc_extents->extent[i].shared_extn_seq);
> > +		dev_dbg(dev, "dynamic capacity extent[%d] DPA:0x%llx LEN:%llx\n",
> > +				i, extent->dpa_start, extent->length);
> > +
> > +		alloc_range = (struct range){
> > +			.start = extent->dpa_start,
> > +			.end = extent->dpa_start + extent->length - 1,
> > +		};
> > +
> > +		rc = cxl_add_dc_extent(mds, &alloc_range);
> > +		if (rc < 0)
> > +			goto out;
> > +		rc = xa_insert(&mds->dc_extent_list, extent->dpa_start, extent,
> > +				GFP_KERNEL);
> > +	}
> > +
> > +out:
> > +	kvfree(dc_extents);
> > +	if (rc < 0)
> > +		return rc;
> > +
> > +	return available_extents;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_dev_get_dc_extents, CXL);
> > +
> >  static int add_dpa_res(struct device *dev, struct resource *parent,
> >  		       struct resource *res, resource_size_t start,
> >  		       resource_size_t size, const char *type)
> > @@ -1452,6 +1794,7 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
> >  	mutex_init(&mds->event.log_lock);
> >  	mds->cxlds.dev = dev;
> >  	mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
> > +	xa_init(&mds->dc_extent_list);
> >  
> >  	return mds;
> >  }
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > index 144232c8305e..ba45c1c3b0a9 100644
> > --- a/drivers/cxl/core/region.c
> > +++ b/drivers/cxl/core/region.c
> > @@ -1,6 +1,7 @@
> >  // SPDX-License-Identifier: GPL-2.0-only
> >  /* Copyright(c) 2022 Intel Corporation. All rights reserved. */
> >  #include <linux/memregion.h>
> > +#include <linux/interrupt.h>
> >  #include <linux/genalloc.h>
> >  #include <linux/device.h>
> >  #include <linux/module.h>
> > @@ -11,6 +12,8 @@
> >  #include <cxlmem.h>
> >  #include <cxl.h>
> >  #include "core.h"
> > +#include "../../dax/bus.h"
> > +#include "../../dax/dax-private.h"
> >  
> >  /**
> >   * DOC: cxl core region
> > @@ -166,6 +169,38 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
> >  	return 0;
> >  }
> >  
> > +static int cxl_region_manage_dc(struct cxl_region *cxlr)
> > +{
> > +	struct cxl_region_params *p = &cxlr->params;
> > +	unsigned int extent_gen_num;
> > +	int i, rc;
> > +
> > +	/* Designed for Non Interleaving flow with the assumption one
> > +	 * cxl_region will map the complete device DC region's DPA range
> > +	 */
> > +	for (i = 0; i < p->nr_targets; i++) {
> > +		struct cxl_endpoint_decoder *cxled = p->targets[i];
> > +		struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > +		struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> > +
> > +		rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
> > +		if (rc < 0)
> > +			goto err;
> > +		else if (rc > 1) {
> > +			rc = cxl_dev_get_dc_extents(mds, rc, 0);
> > +			if (rc < 0)
> > +				goto err;
> > +			mds->num_dc_extents = rc;
> > +			mds->dc_extents_index = rc - 1;
> > +		}
> 
> Brackets required around both arms of that if/else if statement. 
> (checkpatch should be telling you that)
> 
> How about flipping that and doing the (rc > 1) work first.
> then the else if, goto err.

Actually the goto err handles it all.  Just get rid of the 'else'

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 57f8ec9ef07a..47f94dec47f4 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -186,7 +186,8 @@ static int cxl_region_manage_dc(struct cxl_region *cxlr)
                rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
                if (rc < 0)
                        goto err;
-               else if (rc > 1) {
+
+               if (rc > 1) {
                        rc = cxl_dev_get_dc_extents(mds, rc, 0);
                        if (rc < 0)
                                goto err;

> 
> > +		mds->dc_list_gen_num = extent_gen_num;
> > +		dev_dbg(mds->cxlds.dev, "No of preallocated extents :%d\n", rc);
> > +	}
> > +	return 0;
> > +err:
> > +	return rc;
> > +}
> > +
> >  static int commit_decoder(struct cxl_decoder *cxld)
> >  {
> >  	struct cxl_switch_decoder *cxlsd = NULL;
> > @@ -2865,11 +2900,14 @@ static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
> >  		return PTR_ERR(cxlr_dax);
> >  
> >  	cxlr_dc = kzalloc(sizeof(*cxlr_dc), GFP_KERNEL);
> > -	if (!cxlr_dc) {
> > -		rc = -ENOMEM;
> > -		goto err;
> > -	}
> > +	if (!cxlr_dc)
> > +		return -ENOMEM;
> >  
> > +	rc = request_module("dax_cxl");
> > +	if (rc) {
> > +		dev_err(dev, "failed to load dax-ctl module\n");
> > +		goto load_err;
> > +	}
> >  	dev = &cxlr_dax->dev;
> >  	rc = dev_set_name(dev, "dax_region%d", cxlr->id);
> >  	if (rc)
> > @@ -2891,10 +2929,24 @@ static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
> >  	xa_init(&cxlr_dc->dax_dev_list);
> >  	cxlr->cxlr_dc = cxlr_dc;
> >  	rc = devm_add_action_or_reset(&cxlr->dev, cxl_dc_region_release, cxlr);
> > -	if (!rc)
> > -		return 0;
> > +	if (rc)
> > +		goto err;
> > +
> > +	if (!dev->driver) {
> > +		dev_err(dev, "%s Driver not attached\n", dev_name(dev));
> > +		rc = -ENXIO;
> > +		goto err;
> > +	}
> > +
> > +	rc = cxl_region_manage_dc(cxlr);
> > +	if (rc)
> > +		goto err;
> > +
> > +	return 0;
> > +
> >  err:
> >  	put_device(dev);
> > +load_err:
> >  	kfree(cxlr_dc);
> >  	return rc;
> >  }
> > @@ -3076,6 +3128,156 @@ struct cxl_region *cxl_create_region(struct cxl_root_decoder *cxlrd,
> >  }
> >  EXPORT_SYMBOL_NS_GPL(cxl_create_region, CXL);
> >  
> > +static int match_ep_decoder_by_range(struct device *dev, void *data)
> > +{
> > +	struct cxl_endpoint_decoder *cxled;
> > +	struct range *dpa_range = data;
> > +
> > +	if (!is_endpoint_decoder(dev))
> > +		return 0;
> > +
> > +	cxled = to_cxl_endpoint_decoder(dev);
> > +	if (!cxled->cxld.region)
> > +		return 0;
> > +
> > +	if (cxled->dpa_res->start <= dpa_range->start &&
> > +				cxled->dpa_res->end >= dpa_range->end)
> > +		return 1;
> > +
> > +	return 0;
> > +}
> > +
> > +int cxl_release_dc_extent(struct cxl_memdev_state *mds,
> > +			  struct range *rel_range)
> > +{
> > +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> > +	struct cxl_endpoint_decoder *cxled;
> > +	struct cxl_dc_region *cxlr_dc;
> > +	struct dax_region *dax_region;
> > +	resource_size_t dpa_offset;
> > +	struct cxl_region *cxlr;
> > +	struct range hpa_range;
> > +	struct dev_dax *dev_dax;
> > +	resource_size_t hpa;
> > +	struct device *dev;
> > +	int ranges, rc = 0;
> > +
> > +	/*
> > +	 * Find the cxl endpoind decoder with which has the extent dpa range and
> > +	 * get the cxl_region, dax_region refrences.
> > +	 */
> > +	dev = device_find_child(&cxlmd->endpoint->dev, rel_range,
> > +				match_ep_decoder_by_range);
> > +	if (!dev) {
> > +		dev_err(mds->cxlds.dev, "%pr not mapped\n", rel_range);
> > +		return PTR_ERR(dev);
> > +	}
> > +
> > +	cxled = to_cxl_endpoint_decoder(dev);
> > +	hpa_range = cxled->cxld.hpa_range;
> > +	cxlr = cxled->cxld.region;
> > +	cxlr_dc = cxlr->cxlr_dc;
> > +
> > +	/* DPA to HPA translation */
> > +	if (cxled->cxld.interleave_ways == 1) {
> > +		dpa_offset = rel_range->start - cxled->dpa_res->start;
> > +		hpa = hpa_range.start + dpa_offset;
> > +	} else {
> > +		dev_err(mds->cxlds.dev, "Interleaving DC not supported\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	dev_dax = xa_load(&cxlr_dc->dax_dev_list, hpa);
> > +	if (!dev_dax)
> > +		return -EINVAL;
> > +
> > +	dax_region = dev_dax->region;
> > +	ranges = dev_dax->nr_range;
> > +
> > +	while (ranges) {
> > +		int i = ranges - 1;
> > +		struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
> > +
> > +		devm_release_action(dax_region->dev, unregister_dax_mapping,
> > +								&mapping->dev);
> > +		ranges--;
> > +	}
> > +
> > +	dev_dbg(mds->cxlds.dev, "removing devdax device:%s\n",
> > +						dev_name(&dev_dax->dev));
> > +	devm_release_action(dax_region->dev, unregister_dev_dax,
> > +							&dev_dax->dev);
> > +	xa_erase(&cxlr_dc->dax_dev_list, hpa);
> > +
> > +	return rc;
> > +}
> > +
> > +int cxl_add_dc_extent(struct cxl_memdev_state *mds, struct range *alloc_range)
> > +{
> > +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> > +	struct cxl_endpoint_decoder *cxled;
> > +	struct cxl_dax_region *cxlr_dax;
> > +	struct cxl_dc_region *cxlr_dc;
> > +	struct dax_region *dax_region;
> > +	resource_size_t dpa_offset;
> > +	struct dev_dax_data data;
> > +	struct dev_dax *dev_dax;
> > +	struct cxl_region *cxlr;
> > +	struct range hpa_range;
> > +	resource_size_t hpa;
> > +	struct device *dev;
> > +	int rc;
> > +
> > +	/*
> > +	 * Find the cxl endpoind decoder with which has the extent dpa range and
> > +	 * get the cxl_region, dax_region refrences.
> > +	 */
> > +	dev = device_find_child(&cxlmd->endpoint->dev, alloc_range,
> > +				match_ep_decoder_by_range);
> > +	if (!dev) {
> > +		dev_err(mds->cxlds.dev, "%pr not mapped\n",	alloc_range);
> > +		return PTR_ERR(dev);
> > +	}
> > +
> > +	cxled = to_cxl_endpoint_decoder(dev);
> > +	hpa_range = cxled->cxld.hpa_range;
> > +	cxlr = cxled->cxld.region;
> > +	cxlr_dc = cxlr->cxlr_dc;
> > +	cxlr_dax = cxlr_dc->cxlr_dax;
> > +	dax_region = dev_get_drvdata(&cxlr_dax->dev);
> > +
> > +	/* DPA to HPA translation */
> > +	if (cxled->cxld.interleave_ways == 1) {
> > +		dpa_offset = alloc_range->start - cxled->dpa_res->start;
> > +		hpa = hpa_range.start + dpa_offset;
> > +	} else {
> > +		dev_err(mds->cxlds.dev, "Interleaving DC not supported\n");
> > +		return -EINVAL;
> > +	}
> 
> Hey, I'm running out of steam here,

:-D

> but lastly between these last
> 2 funcs, seems some duplicate code. Is there maybe an opportunity
> for a common func that can 'add' or 'release' a dc extent?

Maybe.  I'm too tired to see how this intertwines with
cxl_handle_dcd_event_records() and cxl_dev_get_dc_extents().  But the returning
of the range is odd.  Might be ok I think.  But perhaps
cxl_handle_dcd_event_records() and cxl_dev_get_dc_extents() can issue the
device_find_child() or something?

> 
> 
> 
> The end.

Thanks for looking!
Ira

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support.
  2023-06-16  2:06     ` Ira Weiny
@ 2023-06-16 15:56       ` Alison Schofield
  0 siblings, 0 replies; 55+ messages in thread
From: Alison Schofield @ 2023-06-16 15:56 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl

On Thu, Jun 15, 2023 at 07:06:15PM -0700, Ira Weiny wrote:
> Alison Schofield wrote:
> > On Wed, Jun 14, 2023 at 12:16:29PM -0700, Ira Weiny wrote:
> > > From: Navneet Singh <navneet.singh@intel.com>
> > > 
> > > CXL devices optionally support dynamic capacity. CXL Regions must be
> > > created to access this capacity.
> > > 
> > > Add sysfs entries to create dynamic capacity cxl regions. Provide a new
> > > Dynamic Capacity decoder mode which targets dynamic capacity on devices
> > > which are added to that region.
> > 
> > This is a lot in one patch, especially where it weaves in and out of
> > existing code. I'm wondering if this can be introduced in smaller
> > pieces (patches). An introductory patch explaining the DC DPA 
> > allocations might be a useful chunk to pull forward. 
> 
> The patch is < 800 lines long.  And would be closer to 700 lines if there
> were not 8 different 'modes' for the various DC regions.
> 
> It is also very self contained in that it implements the region creation
> for DC DPAs fully.  And I know that Dan prefers patches larger if they are
> all part of the same functionality.
> 
> Dan?

Ira,
I found the patch difficult to review, and hope that it can be 
presented in a way that is easier to review. I don't know if
that results in separate patches.

It's hard for me to imagine that it wasn't conceived of in 
smaller chunks, that could be presented, but I don't know.

I'll go back and review the patch now, and point out where,
I found it difficult to follow.

Alison




> 
> Ira

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/5] cxl/mem : Expose dynamic capacity configuration to userspace
  2023-06-16  2:47     ` Ira Weiny
@ 2023-06-16 15:58       ` Dave Jiang
  2023-06-20 16:23         ` Ira Weiny
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Jiang @ 2023-06-16 15:58 UTC (permalink / raw)
  To: Ira Weiny, Alison Schofield
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl



On 6/15/23 19:47, Ira Weiny wrote:
> Alison Schofield wrote:
>> On Wed, Jun 14, 2023 at 12:16:30PM -0700, Ira Weiny wrote:
>>> From: Navneet Singh <navneet.singh@intel.com>
>>>
>>> Exposing driver cached dynamic capacity configuration through sysfs
>>> attributes.User will create one or more dynamic capacity
>>> cxl regions based on this information and map the dynamic capacity of
>>> the device into HDM ranges using one or more HDM decoders.
>>>
>>> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
>>>
>>> ---
>>> [iweiny: fixups]
>>> [djbw: fixups, no sign-off: preview only]
>>> ---
>>>   drivers/cxl/core/memdev.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 72 insertions(+)
>>
>> Add the documentation of these new attributes in this patch.
>> Documentation/ABI/testing/sysfs-bus-cxl
> 
> Good point.  And the region creation patch needs some updating for the
> sysfs documentation as well...
> 
> Thanks!  I'll work on those.
> 
> Writing the documentation it seems like 'dc_region_count' should just be
> 'region_count'.  Because the 'dc' is redundant with the directory.
> However, dcY_size has a redundant 'dc' but Y_size (ie 0_size) seems
> odd.[*]
> 
> Thoughts on the 'dc' prefix for these?
> 
> [*] example listing with 2 DC regions supported.
> 
> $ ll mem1/dc/
> total 0
> -r--r--r-- 1 root root 4096 Jun 15 19:26 dc0_size
> -r--r--r-- 1 root root 4096 Jun 15 19:26 dc1_size
> -r--r--r-- 1 root root 4096 Jun 15 19:26 dc_regions_count
> 
>>
>> A bit of my ignorance here, but when I keep seeing the word
>> 'regions' below, it makes me wonder whether these attributes
>> are in the right place?
> 
> There is a difference between 'DC region' and CXL 'Linux' region.  It has
> taken me some time to get used to the terminology.  So I think this is
> correct.

I think you answered your own question above here. If dc_region is 
different than CXL regions, then you'll have to keep the dc_ prefix to 
distinguish between the two. Unless you call it something different.

DJ

> 
>>
>>>
>>> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
>>> index 5d1ba7a72567..beeb5fa3a0aa 100644
>>> --- a/drivers/cxl/core/memdev.c
>>> +++ b/drivers/cxl/core/memdev.c
>>> @@ -99,6 +99,20 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
>>>   static struct device_attribute dev_attr_pmem_size =
>>>   	__ATTR(size, 0444, pmem_size_show, NULL);
>>>   
>>> +static ssize_t dc_regions_count_show(struct device *dev, struct device_attribute *attr,
>>> +		char *buf)
>>> +{
>>> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>>> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
>>> +	int len = 0;
>>> +
>>> +	len = sysfs_emit(buf, "0x%x\n", mds->nr_dc_region);
>>
>> Prefer using this notation: %#llx
>> grep for the sysfs_emit's to see customary usage.
> 
> oh.  I did see this oddity when I was testing and forgot to change this.
> 
> However, I think %#llx needs to be used in show_size_regionN() and this
> needs to be %d.  This is just a count of the number of DC regions
> supported by the device.  I don't think that needs to be in hex.  Changed
> to %d.
> 
>>
>>> +	return len;
>>> +}
>>> +
>>> +struct device_attribute dev_attr_dc_regions_count =
>>> +	__ATTR(dc_regions_count, 0444, dc_regions_count_show, NULL);
>>> +
>>>   static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
>>>   			   char *buf)
>>>   {
>>> @@ -362,6 +376,57 @@ static struct attribute *cxl_memdev_ram_attributes[] = {
>>>   	NULL,
>>>   };
>>>   
>>> +static ssize_t show_size_regionN(struct cxl_memdev *cxlmd, char *buf, int pos)
>>> +{
>>> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
>>> +
>>> +	return sysfs_emit(buf, "0x%llx\n", mds->dc_region[pos].decode_len);
> 
> ... changed this one to %#llx.
> 
> Ira

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support.
  2023-06-14 19:16 ` [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support ira.weiny
  2023-06-14 23:37   ` Dave Jiang
  2023-06-15  0:21   ` Alison Schofield
@ 2023-06-16 16:51   ` Alison Schofield
  2023-06-21  2:44     ` Ira Weiny
  2023-06-20 17:55   ` Fan Ni
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 55+ messages in thread
From: Alison Schofield @ 2023-06-16 16:51 UTC (permalink / raw)
  To: ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl

On Wed, Jun 14, 2023 at 12:16:29PM -0700, Ira Weiny wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> CXL devices optionally support dynamic capacity. CXL Regions must be
> created to access this capacity.
> 
> Add sysfs entries to create dynamic capacity cxl regions. Provide a new
> Dynamic Capacity decoder mode which targets dynamic capacity on devices
> which are added to that region.
> 
> Below are the steps to create and delete dynamic capacity region0
> (example).
> 
>     region=$(cat /sys/bus/cxl/devices/decoder0.0/create_dc_region)
>     echo $region> /sys/bus/cxl/devices/decoder0.0/create_dc_region
>     echo 256 > /sys/bus/cxl/devices/$region/interleave_granularity
>     echo 1 > /sys/bus/cxl/devices/$region/interleave_ways
> 
>     echo "dc0" >/sys/bus/cxl/devices/decoder1.0/mode
>     echo 0x400000000 >/sys/bus/cxl/devices/decoder1.0/dpa_size
> 
>     echo 0x400000000 > /sys/bus/cxl/devices/$region/size
>     echo  "decoder1.0" > /sys/bus/cxl/devices/$region/target0
>     echo 1 > /sys/bus/cxl/devices/$region/commit
>     echo $region > /sys/bus/cxl/drivers/cxl_region/bind
> 
>     echo $region> /sys/bus/cxl/devices/decoder0.0/delete_region
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>

Hi,
I took another pass at this and offered more feedback.
I do think that if the big part - the cxl_dpa_reserve()
was more 'chunkified' it would be easier to review for
actual functionality.

I'd also like to see the commit log be a bit more specific
in enumerated the things this patch intends to do.

Many of my comments are about style. Some checkpatch --strict
would call out and some are addressed in the kernel coding
style - Documentation/process/coding-style.rst

But really, my goal is that when this code merges, that as
I scroll through a file, say region.c, I see a consistent
coding style. I shouldn't be able to notice that oh, Dan
wrote that, and Ira that, and Navneet wrote that piece.

I think it's important because differences in style distract
from focusing on the functionality of the code.

(off my soap box now ;)

Alison


> 
> ---
> [iweiny: fixups]
> [iweiny: remove unused CXL_DC_REGION_MODE macro]
> [iweiny: Make dc_mode_to_region_index static]
> [iweiny: simplify <sysfs>/create_dc_region]
> [iweiny: introduce decoder_mode_is_dc]
> [djbw: fixups, no sign-off: preview only]
> ---
>  drivers/cxl/Kconfig       |  11 +++
>  drivers/cxl/core/core.h   |   7 ++
>  drivers/cxl/core/hdm.c    | 234 ++++++++++++++++++++++++++++++++++++++++++----
>  drivers/cxl/core/port.c   |  18 ++++
>  drivers/cxl/core/region.c | 135 ++++++++++++++++++++++++--
>  drivers/cxl/cxl.h         |  28 ++++++
>  drivers/dax/cxl.c         |   4 +
>  7 files changed, 409 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index ff4e78117b31..df034889d053 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -121,6 +121,17 @@ config CXL_REGION
>  
>  	  If unsure say 'y'
>  
> +config CXL_DCD
> +	bool "CXL: DCD Support"

"CXL DCD: Dynamic Capacity Device Support"
is more in line with others in this file, and expands the acronym onetime.

> +	default CXL_BUS
> +	depends on CXL_REGION
> +	help
> +	  Enable the CXL core to provision CXL DCD regions.
> +	  CXL devices optionally support dynamic capacity and DCD region
> +	  maps the dynamic capacity regions DPA's into Host HPA ranges.
> +
> +	  If unsure say 'y'
> +
>  config CXL_REGION_INVALIDATION_TEST
>  	bool "CXL: Region Cache Management Bypass (TEST)"
>  	depends on CXL_REGION
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 27f0968449de..725700ab5973 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -9,6 +9,13 @@ extern const struct device_type cxl_nvdimm_type;
>  
>  extern struct attribute_group cxl_base_attribute_group;
>  
> +#ifdef CONFIG_CXL_DCD
> +extern struct device_attribute dev_attr_create_dc_region;
> +#define SET_CXL_DC_REGION_ATTR(x) (&dev_attr_##x.attr),
> +#else
> +#define SET_CXL_DC_REGION_ATTR(x)
> +#endif
> +
>  #ifdef CONFIG_CXL_REGION
>  extern struct device_attribute dev_attr_create_pmem_region;
>  extern struct device_attribute dev_attr_create_ram_region;
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 514d30131d92..29649b47d177 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -233,14 +233,23 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct resource *res = cxled->dpa_res;
>  	resource_size_t skip_start;
> +	resource_size_t skipped = cxled->skip;

Reverse x-tree.

>  
>  	lockdep_assert_held_write(&cxl_dpa_rwsem);
>  
>  	/* save @skip_start, before @res is released */
> -	skip_start = res->start - cxled->skip;
> +	skip_start = res->start - skipped;

Why did the assignment of skip_start need to change here?

>  	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
> -	if (cxled->skip)
> -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> +	if (cxled->skip != 0) {
> +		while (skipped != 0) {
> +			res = xa_load(&cxled->skip_res, skip_start);
> +			__release_region(&cxlds->dpa_res, skip_start,
> +							resource_size(res));

The above appears poorlty aligned.

> +			xa_erase(&cxled->skip_res, skip_start);
> +			skip_start += resource_size(res);
> +			skipped -= resource_size(res);
> +			}

This bracket appears poorly aligned.

> +	}
>  	cxled->skip = 0;
>  	cxled->dpa_res = NULL;
>  	put_device(&cxled->cxld.dev);
> @@ -267,6 +276,19 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	__cxl_dpa_release(cxled);
>  }
>  
> +static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
> +{
> +	int index = 0;
> +
> +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> +		if (mode == i)
> +			return index;
> +		index++;
> +	}
> +
> +	return -EINVAL;
> +}
> +
>  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  			     resource_size_t base, resource_size_t len,
>  			     resource_size_t skipped)
> @@ -275,7 +297,11 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	struct cxl_port *port = cxled_to_port(cxled);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct device *dev = &port->dev;
> +	struct device *ed_dev = &cxled->cxld.dev;
> +	struct resource *dpa_res = &cxlds->dpa_res;
> +	resource_size_t skip_len = 0;
>  	struct resource *res;
> +	int rc, index;
>  

Above poorly aligned.

>  	lockdep_assert_held_write(&cxl_dpa_rwsem);
>  
> @@ -304,28 +330,119 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	}
>  
>  	if (skipped) {

This has excessive indentation, so started out with a monster
if skipped is begging for a refactoring.

I find it odd that the DCD case got inserted before the 'default'
or non-DCD case here.


> -		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> -				       dev_name(&cxled->cxld.dev), 0);
> -		if (!res) {
> -			dev_dbg(dev,
> -				"decoder%d.%d: failed to reserve skipped space\n",
> -				port->id, cxled->cxld.id);
> -			return -EBUSY;
> +		resource_size_t skip_base = base - skipped;
> +
> +		if (decoder_mode_is_dc(cxled->mode)) {

This may be cleaner to introduce as a separate function for
handling _mode_id_dc.

> +			if (resource_size(&cxlds->ram_res) &&
> +					skip_base <= cxlds->ram_res.end) {
> +				skip_len = cxlds->ram_res.end - skip_base + 1;
> +				res = __request_region(dpa_res, skip_base,
> +						skip_len, dev_name(ed_dev), 0);
> +				if (!res)
> +					goto error;
> +
> +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> +								GFP_KERNEL);
> +				skip_base += skip_len;
> +			}
> +
> +			if (resource_size(&cxlds->ram_res) &&
> +					skip_base <= cxlds->pmem_res.end) {
> +				skip_len = cxlds->pmem_res.end - skip_base + 1;
> +				res = __request_region(dpa_res, skip_base,
> +						skip_len, dev_name(ed_dev), 0);
> +				if (!res)
> +					goto error;
> +
> +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> +								GFP_KERNEL);
> +				skip_base += skip_len;
> +			}

The above 2 if (resource_size() cases have redundant code. 
Pull it out, refactor.

> +
> +			index = dc_mode_to_region_index(cxled->mode);
> +			for (int i = 0; i <= index; i++) {
> +				struct resource *dcr = &cxlds->dc_res[i];
> +
> +				if (skip_base < dcr->start) {
> +					skip_len = dcr->start - skip_base;
> +					res = __request_region(dpa_res,
> +							skip_base, skip_len,
> +							dev_name(ed_dev), 0);
> +					if (!res)
> +						goto error;
> +
> +					rc = xa_insert(&cxled->skip_res, skip_base,
> +							res, GFP_KERNEL);
> +					skip_base += skip_len;
> +				}
> +
> +				if (skip_base == base) {
> +					dev_dbg(dev, "skip done!\n");
> +					break;
> +				}
> +
> +				if (resource_size(dcr) &&
> +						skip_base <= dcr->end) {
> +					if (skip_base > base)
> +						dev_err(dev, "Skip error\n");
> +
> +					skip_len = dcr->end - skip_base + 1;
> +					res = __request_region(dpa_res, skip_base,
> +							skip_len,
> +							dev_name(ed_dev), 0);
> +					if (!res)
> +						goto error;
> +
> +					rc = xa_insert(&cxled->skip_res, skip_base,
> +							res, GFP_KERNEL);
> +					skip_base += skip_len;
> +				}
> +			}


And, below,we are back to the original code.
This would be more readable, reviewable if the DCD support was
added in separate function that are then called from here.

> +		} else	{
> +			res = __request_region(dpa_res, base - skipped, skipped,
> +							dev_name(ed_dev), 0);
> +			if (!res)
> +				goto error;
> +
> +			rc = xa_insert(&cxled->skip_res, skip_base, res,
> +								GFP_KERNEL);
>  		}
>  	}
> -	res = __request_region(&cxlds->dpa_res, base, len,
> -			       dev_name(&cxled->cxld.dev), 0);
> +
> +	res = __request_region(dpa_res, base, len, dev_name(ed_dev), 0);
>  	if (!res) {
>  		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
> -			port->id, cxled->cxld.id);

General comment - look over the dev_dbg() messages and consider placing
them after the code. I recall, others that were needlessly between lines
of code.


> -		if (skipped)
> -			__release_region(&cxlds->dpa_res, base - skipped,
> -					 skipped);
> +				port->id, cxled->cxld.id);
> +		if (skipped) {
> +			resource_size_t skip_base = base - skipped;
> +
> +			while (skipped != 0) {
> +				if (skip_base > base)
> +					dev_err(dev, "Skip error\n");
> +
> +				res = xa_load(&cxled->skip_res, skip_base);
> +				__release_region(dpa_res, skip_base,
> +							resource_size(res));
> +				xa_erase(&cxled->skip_res, skip_base);
> +				skip_base += resource_size(res);
> +				skipped -= resource_size(res);
> +			}
> +		}

		Can that debug message go here ?

>  		return -EBUSY;
>  	}
>  	cxled->dpa_res = res;
>  	cxled->skip = skipped;
>  
> +	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
> +		int index = dc_mode_to_region_index(mode);
> +
> +		if (resource_contains(&cxlds->dc_res[index], res)) {
> +			cxled->mode = mode;
> +			dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id,
> +				cxled->cxld.id, cxled->dpa_res, cxled->mode);

Can this move to ....


> +			goto success;
> +		}
> +	}
>  	if (resource_contains(&cxlds->pmem_res, res))
>  		cxled->mode = CXL_DECODER_PMEM;
>  	else if (resource_contains(&cxlds->ram_res, res))
> @@ -336,9 +453,16 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  		cxled->mode = CXL_DECODER_MIXED;
>  	}
>  
> +success:
>  	port->hdm_end++;
>  	get_device(&cxled->cxld.dev);

here...dev_dbg() success message. That pairs it nicely with the
error message below.

>  	return 0;
> +
> +error:
> +	dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space\n",
> +			port->id, cxled->cxld.id);
> +	return -EBUSY;
> +
>  }
>  
>  int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> @@ -429,6 +553,14 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  	switch (mode) {
>  	case CXL_DECODER_RAM:
>  	case CXL_DECODER_PMEM:
> +	case CXL_DECODER_DC0:
> +	case CXL_DECODER_DC1:
> +	case CXL_DECODER_DC2:
> +	case CXL_DECODER_DC3:
> +	case CXL_DECODER_DC4:
> +	case CXL_DECODER_DC5:
> +	case CXL_DECODER_DC6:
> +	case CXL_DECODER_DC7:
>  		break;
>  	default:
>  		dev_dbg(dev, "unsupported mode: %d\n", mode);
> @@ -456,6 +588,16 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  		goto out;
>  	}
>  
> +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> +		int index = dc_mode_to_region_index(i);
> +
> +		if (mode == i && !resource_size(&cxlds->dc_res[index])) {
> +			dev_dbg(dev, "no available dynamic capacity\n");

I see this one is following the pattern in the function :)


> +			rc = -ENXIO;
> +			goto out;
> +		}
> +	}
> +
>  	cxled->mode = mode;
>  	rc = 0;
>  out:
> @@ -469,10 +611,12 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,

Hmmm...I don't have cxl_dpa_freespace() in my cxl/next? Where's that?


>  					 resource_size_t *skip_out)
>  {
>  	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> -	resource_size_t free_ram_start, free_pmem_start;
> +	resource_size_t free_ram_start, free_pmem_start, free_dc_start;
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	struct device *dev = &cxled->cxld.dev;
>  	resource_size_t start, avail, skip;
>  	struct resource *p, *last;
> +	int index;

Why break the alignment above?

>  
>  	lockdep_assert_held(&cxl_dpa_rwsem);
>  
> @@ -490,6 +634,20 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>  	else
>  		free_pmem_start = cxlds->pmem_res.start;
>  
> +	/*
> +	 * One HDM Decoder per DC region to map memory with different
> +	 * DSMAS entry.
> +	 */

It seems this comment is missing a verb. Why not align?

> +	index = dc_mode_to_region_index(cxled->mode);
> +	if (index >= 0) {
> +		if (cxlds->dc_res[index].child) {
> +			dev_err(dev, "Cannot allocated DPA from DC Region: %d\n"

s/allocated/allocate

,
> +					index);
> +			return -EINVAL;
> +		}
> +		free_dc_start = cxlds->dc_res[index].start;
> +	}
> +
>  	if (cxled->mode == CXL_DECODER_RAM) {
>  		start = free_ram_start;
>  		avail = cxlds->ram_res.end - start + 1;
> @@ -511,6 +669,29 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>  		else
>  			skip_end = start - 1;
>  		skip = skip_end - skip_start + 1;
> +	} else if (decoder_mode_is_dc(cxled->mode)) {
> +		resource_size_t skip_start, skip_end;
> +
> +		start = free_dc_start;
> +		avail = cxlds->dc_res[index].end - start + 1;
> +		if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
> +			skip_start = free_ram_start;
> +		else
> +			skip_start = free_pmem_start;
> +		/*
> +		 * If some dc region is already mapped, then that allocation

maybe s/some/any ?

> +		 * already handled the RAM and PMEM skip.Check for DC region
> +		 * skip.
> +		 */
> +		for (int i = index - 1; i >= 0 ; i--) {
> +			if (cxlds->dc_res[i].child) {
> +				skip_start = cxlds->dc_res[i].child->end + 1;
> +				break;
> +			}
> +		}
> +
> +		skip_end = start - 1;
> +		skip = skip_end - skip_start + 1;
>  	} else {
>  		dev_dbg(cxled_dev(cxled), "mode not set\n");
>  		avail = 0;
> @@ -548,10 +729,25 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>  
>  	avail = cxl_dpa_freespace(cxled, &start, &skip);
>  
> +	dev_dbg(dev, "DPA Allocation start: %llx len: %llx Skip: %llx\n",
> +						start, size, skip);
>  	if (size > avail) {
> +		static const char * const names[] = {
> +			[CXL_DECODER_NONE] = "none",
> +			[CXL_DECODER_RAM] = "ram",
> +			[CXL_DECODER_PMEM] = "pmem",
> +			[CXL_DECODER_MIXED] = "mixed",
> +			[CXL_DECODER_DC0] = "dc0",
> +			[CXL_DECODER_DC1] = "dc1",
> +			[CXL_DECODER_DC2] = "dc2",
> +			[CXL_DECODER_DC3] = "dc3",
> +			[CXL_DECODER_DC4] = "dc4",
> +			[CXL_DECODER_DC5] = "dc5",
> +			[CXL_DECODER_DC6] = "dc6",
> +			[CXL_DECODER_DC7] = "dc7",
> +		};
>  		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
> -			cxled->mode == CXL_DECODER_RAM ? "ram" : "pmem",
> -			&avail);
> +			names[cxled->mode], &avail);
>  		rc = -ENOSPC;
>  		goto out;
>  	}
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 5e21b53362e6..a1a98aba24ed 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -195,6 +195,22 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
>  		mode = CXL_DECODER_PMEM;
>  	else if (sysfs_streq(buf, "ram"))
>  		mode = CXL_DECODER_RAM;
> +	else if (sysfs_streq(buf, "dc0"))
> +		mode = CXL_DECODER_DC0;
> +	else if (sysfs_streq(buf, "dc1"))
> +		mode = CXL_DECODER_DC1;
> +	else if (sysfs_streq(buf, "dc2"))
> +		mode = CXL_DECODER_DC2;
> +	else if (sysfs_streq(buf, "dc3"))
> +		mode = CXL_DECODER_DC3;
> +	else if (sysfs_streq(buf, "dc4"))
> +		mode = CXL_DECODER_DC4;
> +	else if (sysfs_streq(buf, "dc5"))
> +		mode = CXL_DECODER_DC5;
> +	else if (sysfs_streq(buf, "dc6"))
> +		mode = CXL_DECODER_DC6;
> +	else if (sysfs_streq(buf, "dc7"))
> +		mode = CXL_DECODER_DC7;
>  	else
>  		return -EINVAL;
>  
> @@ -296,6 +312,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
>  	&dev_attr_target_list.attr,
>  	SET_CXL_REGION_ATTR(create_pmem_region)
>  	SET_CXL_REGION_ATTR(create_ram_region)
> +	SET_CXL_DC_REGION_ATTR(create_dc_region)
>  	SET_CXL_REGION_ATTR(delete_region)
>  	NULL,
>  };
> @@ -1691,6 +1708,7 @@ struct cxl_endpoint_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
>  		return ERR_PTR(-ENOMEM);
>  
>  	cxled->pos = -1;
> +	xa_init(&cxled->skip_res);
>  	cxld = &cxled->cxld;
>  	rc = cxl_decoder_init(port, cxld);
>  	if (rc)	 {
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 543c4499379e..144232c8305e 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1733,7 +1733,7 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  	lockdep_assert_held_write(&cxl_region_rwsem);
>  	lockdep_assert_held_read(&cxl_dpa_rwsem);
>  
> -	if (cxled->mode != cxlr->mode) {
> +	if (decoder_mode_is_dc(cxlr->mode) && !decoder_mode_is_dc(cxled->mode)) {
>  		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
>  			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
>  		return -EINVAL;
> @@ -2211,6 +2211,14 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>  	switch (mode) {
>  	case CXL_DECODER_RAM:
>  	case CXL_DECODER_PMEM:
> +	case CXL_DECODER_DC0:
> +	case CXL_DECODER_DC1:
> +	case CXL_DECODER_DC2:
> +	case CXL_DECODER_DC3:
> +	case CXL_DECODER_DC4:
> +	case CXL_DECODER_DC5:
> +	case CXL_DECODER_DC6:
> +	case CXL_DECODER_DC7:
>  		break;
>  	default:
>  		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> @@ -2321,6 +2329,43 @@ static ssize_t create_ram_region_store(struct device *dev,
>  }
>  DEVICE_ATTR_RW(create_ram_region);
>  
> +static ssize_t store_dcN_region(struct cxl_root_decoder *cxlrd,
> +				const char *buf, enum cxl_decoder_mode mode,
> +				size_t len)
> +{
> +	struct cxl_region *cxlr;
> +	int rc, id;
> +
> +	rc = sscanf(buf, "region%d\n", &id);
> +	if (rc != 1)
> +		return -EINVAL;
> +
> +	cxlr = __create_region(cxlrd, id, mode, CXL_DECODER_HOSTMEM);
> +	if (IS_ERR(cxlr))
> +		return PTR_ERR(cxlr);
> +
> +	return len;
> +}
> +
> +static ssize_t create_dc_region_show(struct device *dev,
> +				     struct device_attribute *attr, char *buf)
> +{
> +	return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static ssize_t create_dc_region_store(struct device *dev,
> +				      struct device_attribute *attr,
> +				      const char *buf, size_t len)
> +{
> +	/*
> +	 * All DC regions use decoder mode DC0 as the region does not need the
> +	 * index information
> +	 */
> +	return store_dcN_region(to_cxl_root_decoder(dev), buf,
> +				CXL_DECODER_DC0, len);
> +}
> +DEVICE_ATTR_RW(create_dc_region);
> +
>  static ssize_t region_show(struct device *dev, struct device_attribute *attr,
>  			   char *buf)
>  {
> @@ -2799,6 +2844,61 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
>  	return rc;
>  }
>  
> +static void cxl_dc_region_release(void *data)
> +{
> +	struct cxl_region *cxlr = data;
> +	struct cxl_dc_region *cxlr_dc = cxlr->cxlr_dc;
> +
> +	xa_destroy(&cxlr_dc->dax_dev_list);
> +	kfree(cxlr_dc);
> +}
> +
> +static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
> +{
> +	struct cxl_dc_region *cxlr_dc;
> +	struct cxl_dax_region *cxlr_dax;
> +	struct device *dev;
> +	int rc = 0;
> +
> +	cxlr_dax = cxl_dax_region_alloc(cxlr);
> +	if (IS_ERR(cxlr_dax))
> +		return PTR_ERR(cxlr_dax);
> +
> +	cxlr_dc = kzalloc(sizeof(*cxlr_dc), GFP_KERNEL);
> +	if (!cxlr_dc) {
> +		rc = -ENOMEM;
> +		goto err;
> +	}
> +
> +	dev = &cxlr_dax->dev;
> +	rc = dev_set_name(dev, "dax_region%d", cxlr->id);
> +	if (rc)
> +		goto err;
> +
> +	rc = device_add(dev);
> +	if (rc)
> +		goto err;
> +
> +	dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
> +		dev_name(dev));
> +
> +	rc = devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
> +					cxlr_dax);
> +	if (rc)
> +		goto err;
> +
> +	cxlr_dc->cxlr_dax = cxlr_dax;
> +	xa_init(&cxlr_dc->dax_dev_list);
> +	cxlr->cxlr_dc = cxlr_dc;
> +	rc = devm_add_action_or_reset(&cxlr->dev, cxl_dc_region_release, cxlr);
> +	if (!rc)
> +		return 0;
> +err:
> +	put_device(dev);
> +	kfree(cxlr_dc);
> +	return rc;
> +}
> +
>  static int match_decoder_by_range(struct device *dev, void *data)
>  {
>  	struct range *r1, *r2 = data;
> @@ -3140,6 +3240,19 @@ static int is_system_ram(struct resource *res, void *arg)
>  	return 1;
>  }
>  
> +/*
> + * The region can not be manged by CXL if any portion of
> + * it is already online as 'System RAM'
> + */
> +static bool region_is_system_ram(struct cxl_region *cxlr,
> +				 struct cxl_region_params *p)
> +{
> +	return (walk_iomem_res_desc(IORES_DESC_NONE,
> +				    IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> +				    p->res->start, p->res->end, cxlr,
> +				    is_system_ram) > 0);
> +}
> +
>  static int cxl_region_probe(struct device *dev)
>  {
>  	struct cxl_region *cxlr = to_cxl_region(dev);
> @@ -3174,14 +3287,7 @@ static int cxl_region_probe(struct device *dev)
>  	case CXL_DECODER_PMEM:
>  		return devm_cxl_add_pmem_region(cxlr);
>  	case CXL_DECODER_RAM:
> -		/*
> -		 * The region can not be manged by CXL if any portion of
> -		 * it is already online as 'System RAM'
> -		 */
> -		if (walk_iomem_res_desc(IORES_DESC_NONE,
> -					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> -					p->res->start, p->res->end, cxlr,
> -					is_system_ram) > 0)
> +		if (region_is_system_ram(cxlr, p))
>  			return 0;
>  
>  		/*
> @@ -3193,6 +3299,17 @@ static int cxl_region_probe(struct device *dev)
>  
>  		/* HDM-H routes to device-dax */
>  		return devm_cxl_add_dax_region(cxlr);
> +	case CXL_DECODER_DC0:
> +	case CXL_DECODER_DC1:
> +	case CXL_DECODER_DC2:
> +	case CXL_DECODER_DC3:
> +	case CXL_DECODER_DC4:
> +	case CXL_DECODER_DC5:
> +	case CXL_DECODER_DC6:
> +	case CXL_DECODER_DC7:
> +		if (region_is_system_ram(cxlr, p))
> +			return 0;
> +		return devm_cxl_add_dc_region(cxlr);
>  	default:
>  		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
>  			cxlr->mode);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 8400af85d99f..7ac1237938b7 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -335,6 +335,14 @@ enum cxl_decoder_mode {
>  	CXL_DECODER_NONE,
>  	CXL_DECODER_RAM,
>  	CXL_DECODER_PMEM,
> +	CXL_DECODER_DC0,
> +	CXL_DECODER_DC1,
> +	CXL_DECODER_DC2,
> +	CXL_DECODER_DC3,
> +	CXL_DECODER_DC4,
> +	CXL_DECODER_DC5,
> +	CXL_DECODER_DC6,
> +	CXL_DECODER_DC7,
>  	CXL_DECODER_MIXED,
>  	CXL_DECODER_DEAD,
>  };
> @@ -345,6 +353,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  		[CXL_DECODER_NONE] = "none",
>  		[CXL_DECODER_RAM] = "ram",
>  		[CXL_DECODER_PMEM] = "pmem",
> +		[CXL_DECODER_DC0] = "dc0",
> +		[CXL_DECODER_DC1] = "dc1",
> +		[CXL_DECODER_DC2] = "dc2",
> +		[CXL_DECODER_DC3] = "dc3",
> +		[CXL_DECODER_DC4] = "dc4",
> +		[CXL_DECODER_DC5] = "dc5",
> +		[CXL_DECODER_DC6] = "dc6",
> +		[CXL_DECODER_DC7] = "dc7",
>  		[CXL_DECODER_MIXED] = "mixed",
>  	};
>  
> @@ -353,6 +369,11 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  	return "mixed";
>  }
>  
> +static inline bool decoder_mode_is_dc(enum cxl_decoder_mode mode)
> +{
> +	return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
> +}
> +
>  /*
>   * Track whether this decoder is reserved for region autodiscovery, or
>   * free for userspace provisioning.
> @@ -375,6 +396,7 @@ struct cxl_endpoint_decoder {
>  	struct cxl_decoder cxld;
>  	struct resource *dpa_res;
>  	resource_size_t skip;
> +	struct xarray skip_res;
>  	enum cxl_decoder_mode mode;
>  	enum cxl_decoder_state state;
>  	int pos;
> @@ -475,6 +497,11 @@ struct cxl_region_params {
>   */
>  #define CXL_REGION_F_AUTO 1
>  
> +struct cxl_dc_region {
> +	struct xarray dax_dev_list;
> +	struct cxl_dax_region *cxlr_dax;
> +};
> +
>  /**
>   * struct cxl_region - CXL region
>   * @dev: This region's device
> @@ -493,6 +520,7 @@ struct cxl_region {
>  	enum cxl_decoder_type type;
>  	struct cxl_nvdimm_bridge *cxl_nvb;
>  	struct cxl_pmem_region *cxlr_pmem;
> +	struct cxl_dc_region *cxlr_dc;
>  	unsigned long flags;
>  	struct cxl_region_params params;
>  };
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index ccdf8de85bd5..eb5eb81bfbd7 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -23,11 +23,15 @@ static int cxl_dax_region_probe(struct device *dev)
>  	if (!dax_region)
>  		return -ENOMEM;
>  
> +	if (decoder_mode_is_dc(cxlr->mode))
> +		return 0;
> +
>  	data = (struct dev_dax_data) {
>  		.dax_region = dax_region,
>  		.id = -1,
>  		.size = range_len(&cxlr_dax->hpa_range),
>  	};
> +
>  	dev_dax = devm_create_dev_dax(&data);
>  	if (IS_ERR(dev_dax))
>  		return PTR_ERR(dev_dax);
> 
> -- 
> 2.40.0
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/5] cxl/mem : Expose dynamic capacity configuration to userspace
  2023-06-16 15:58       ` Dave Jiang
@ 2023-06-20 16:23         ` Ira Weiny
  2023-06-20 16:48           ` Dave Jiang
  0 siblings, 1 reply; 55+ messages in thread
From: Ira Weiny @ 2023-06-20 16:23 UTC (permalink / raw)
  To: Dave Jiang, Ira Weiny, Alison Schofield
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl

Dave Jiang wrote:
> 
> 
> On 6/15/23 19:47, Ira Weiny wrote:
> > Alison Schofield wrote:
> >> On Wed, Jun 14, 2023 at 12:16:30PM -0700, Ira Weiny wrote:
> >>> From: Navneet Singh <navneet.singh@intel.com>
> >>>

[snip]

> > 
> > Writing the documentation it seems like 'dc_region_count' should just be
> > 'region_count'.  Because the 'dc' is redundant with the directory.
> > However, dcY_size has a redundant 'dc' but Y_size (ie 0_size) seems
> > odd.[*]
> > 
> > Thoughts on the 'dc' prefix for these?
> > 
> > [*] example listing with 2 DC regions supported.
> > 
> > $ ll mem1/dc/
> > total 0
> > -r--r--r-- 1 root root 4096 Jun 15 19:26 dc0_size
> > -r--r--r-- 1 root root 4096 Jun 15 19:26 dc1_size
> > -r--r--r-- 1 root root 4096 Jun 15 19:26 dc_regions_count
> > 
> >>
> >> A bit of my ignorance here, but when I keep seeing the word
> >> 'regions' below, it makes me wonder whether these attributes
> >> are in the right place?
> > 
> > There is a difference between 'DC region' and CXL 'Linux' region.  It has
> > taken me some time to get used to the terminology.  So I think this is
> > correct.
> 
> I think you answered your own question above here. If dc_region is 
> different than CXL regions, then you'll have to keep the dc_ prefix to 
> distinguish between the two. Unless you call it something different.

But it sits in the 'dc' directory.

memX/dc/dc_regions_count
memX/dc/dc0_size
...

So it feels like the dc is redundant.  But it is probably ok as it is.

Ira

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/5] cxl/mem : Expose dynamic capacity configuration to userspace
  2023-06-20 16:23         ` Ira Weiny
@ 2023-06-20 16:48           ` Dave Jiang
  0 siblings, 0 replies; 55+ messages in thread
From: Dave Jiang @ 2023-06-20 16:48 UTC (permalink / raw)
  To: Ira Weiny, Alison Schofield
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl



On 6/20/23 09:23, Ira Weiny wrote:
> Dave Jiang wrote:
>>
>>
>> On 6/15/23 19:47, Ira Weiny wrote:
>>> Alison Schofield wrote:
>>>> On Wed, Jun 14, 2023 at 12:16:30PM -0700, Ira Weiny wrote:
>>>>> From: Navneet Singh <navneet.singh@intel.com>
>>>>>
> 
> [snip]
> 
>>>
>>> Writing the documentation it seems like 'dc_region_count' should just be
>>> 'region_count'.  Because the 'dc' is redundant with the directory.
>>> However, dcY_size has a redundant 'dc' but Y_size (ie 0_size) seems
>>> odd.[*]
>>>
>>> Thoughts on the 'dc' prefix for these?
>>>
>>> [*] example listing with 2 DC regions supported.
>>>
>>> $ ll mem1/dc/
>>> total 0
>>> -r--r--r-- 1 root root 4096 Jun 15 19:26 dc0_size
>>> -r--r--r-- 1 root root 4096 Jun 15 19:26 dc1_size
>>> -r--r--r-- 1 root root 4096 Jun 15 19:26 dc_regions_count
>>>
>>>>
>>>> A bit of my ignorance here, but when I keep seeing the word
>>>> 'regions' below, it makes me wonder whether these attributes
>>>> are in the right place?
>>>
>>> There is a difference between 'DC region' and CXL 'Linux' region.  It has
>>> taken me some time to get used to the terminology.  So I think this is
>>> correct.
>>
>> I think you answered your own question above here. If dc_region is
>> different than CXL regions, then you'll have to keep the dc_ prefix to
>> distinguish between the two. Unless you call it something different.
> 
> But it sits in the 'dc' directory.
> 
> memX/dc/dc_regions_count
> memX/dc/dc0_size
> ...
> 
> So it feels like the dc is redundant.  But it is probably ok as it is.

Ah I see what you mean. Yeah maybe dropping the dc would make it look nicer.

> 
> Ira

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support.
  2023-06-14 19:16 ` [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support ira.weiny
                     ` (2 preceding siblings ...)
  2023-06-16 16:51   ` Alison Schofield
@ 2023-06-20 17:55   ` Fan Ni
  2023-06-20 20:33     ` Ira Weiny
  2023-06-21  3:13     ` Navneet Singh
  2023-06-21 17:20   ` Fan Ni
                     ` (2 subsequent siblings)
  6 siblings, 2 replies; 55+ messages in thread
From: Fan Ni @ 2023-06-20 17:55 UTC (permalink / raw)
  To: ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl,
	a.manzanares, dave, nmtadam.samsung, nifan

The 06/14/2023 12:16, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> CXL devices optionally support dynamic capacity. CXL Regions must be
> created to access this capacity.
> 
> Add sysfs entries to create dynamic capacity cxl regions. Provide a new
> Dynamic Capacity decoder mode which targets dynamic capacity on devices
> which are added to that region.
> 
> Below are the steps to create and delete dynamic capacity region0
> (example).
> 
>     region=$(cat /sys/bus/cxl/devices/decoder0.0/create_dc_region)
>     echo $region> /sys/bus/cxl/devices/decoder0.0/create_dc_region
>     echo 256 > /sys/bus/cxl/devices/$region/interleave_granularity
>     echo 1 > /sys/bus/cxl/devices/$region/interleave_ways
> 
>     echo "dc0" >/sys/bus/cxl/devices/decoder1.0/mode
>     echo 0x400000000 >/sys/bus/cxl/devices/decoder1.0/dpa_size
> 
>     echo 0x400000000 > /sys/bus/cxl/devices/$region/size
>     echo  "decoder1.0" > /sys/bus/cxl/devices/$region/target0
>     echo 1 > /sys/bus/cxl/devices/$region/commit
>     echo $region > /sys/bus/cxl/drivers/cxl_region/bind
> 
>     echo $region> /sys/bus/cxl/devices/decoder0.0/delete_region
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> 
> ---
> [iweiny: fixups]
> [iweiny: remove unused CXL_DC_REGION_MODE macro]
> [iweiny: Make dc_mode_to_region_index static]
> [iweiny: simplify <sysfs>/create_dc_region]
> [iweiny: introduce decoder_mode_is_dc]
> [djbw: fixups, no sign-off: preview only]
> ---
>  drivers/cxl/Kconfig       |  11 +++
>  drivers/cxl/core/core.h   |   7 ++
>  drivers/cxl/core/hdm.c    | 234 ++++++++++++++++++++++++++++++++++++++++++----
>  drivers/cxl/core/port.c   |  18 ++++
>  drivers/cxl/core/region.c | 135 ++++++++++++++++++++++++--
>  drivers/cxl/cxl.h         |  28 ++++++
>  drivers/dax/cxl.c         |   4 +
>  7 files changed, 409 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index ff4e78117b31..df034889d053 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -121,6 +121,17 @@ config CXL_REGION
>  
>  	  If unsure say 'y'
>  
> +config CXL_DCD
> +	bool "CXL: DCD Support"
> +	default CXL_BUS
> +	depends on CXL_REGION
> +	help
> +	  Enable the CXL core to provision CXL DCD regions.
> +	  CXL devices optionally support dynamic capacity and DCD region
> +	  maps the dynamic capacity regions DPA's into Host HPA ranges.
> +
> +	  If unsure say 'y'
> +
>  config CXL_REGION_INVALIDATION_TEST
>  	bool "CXL: Region Cache Management Bypass (TEST)"
>  	depends on CXL_REGION
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 27f0968449de..725700ab5973 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -9,6 +9,13 @@ extern const struct device_type cxl_nvdimm_type;
>  
>  extern struct attribute_group cxl_base_attribute_group;
>  
> +#ifdef CONFIG_CXL_DCD
> +extern struct device_attribute dev_attr_create_dc_region;
> +#define SET_CXL_DC_REGION_ATTR(x) (&dev_attr_##x.attr),
> +#else
> +#define SET_CXL_DC_REGION_ATTR(x)
> +#endif
> +
>  #ifdef CONFIG_CXL_REGION
>  extern struct device_attribute dev_attr_create_pmem_region;
>  extern struct device_attribute dev_attr_create_ram_region;
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 514d30131d92..29649b47d177 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -233,14 +233,23 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct resource *res = cxled->dpa_res;
>  	resource_size_t skip_start;
> +	resource_size_t skipped = cxled->skip;
>  
>  	lockdep_assert_held_write(&cxl_dpa_rwsem);
>  
>  	/* save @skip_start, before @res is released */
> -	skip_start = res->start - cxled->skip;
> +	skip_start = res->start - skipped;
>  	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
> -	if (cxled->skip)
> -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> +	if (cxled->skip != 0) {
> +		while (skipped != 0) {
> +			res = xa_load(&cxled->skip_res, skip_start);
> +			__release_region(&cxlds->dpa_res, skip_start,
> +							resource_size(res));
> +			xa_erase(&cxled->skip_res, skip_start);
> +			skip_start += resource_size(res);
> +			skipped -= resource_size(res);
> +			}
> +	}
>  	cxled->skip = 0;
>  	cxled->dpa_res = NULL;
>  	put_device(&cxled->cxld.dev);
> @@ -267,6 +276,19 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	__cxl_dpa_release(cxled);
>  }
>  
> +static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
> +{
> +	int index = 0;
> +
> +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> +		if (mode == i)
> +			return index;
> +		index++;
> +	}
> +
> +	return -EINVAL;
> +}
> +
>  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  			     resource_size_t base, resource_size_t len,
>  			     resource_size_t skipped)
> @@ -275,7 +297,11 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	struct cxl_port *port = cxled_to_port(cxled);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct device *dev = &port->dev;
> +	struct device *ed_dev = &cxled->cxld.dev;
> +	struct resource *dpa_res = &cxlds->dpa_res;
> +	resource_size_t skip_len = 0;
>  	struct resource *res;
> +	int rc, index;
>  
>  	lockdep_assert_held_write(&cxl_dpa_rwsem);
>  
> @@ -304,28 +330,119 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	}
>  
>  	if (skipped) {
> -		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> -				       dev_name(&cxled->cxld.dev), 0);
> -		if (!res) {
> -			dev_dbg(dev,
> -				"decoder%d.%d: failed to reserve skipped space\n",
> -				port->id, cxled->cxld.id);
> -			return -EBUSY;
> +		resource_size_t skip_base = base - skipped;
> +
> +		if (decoder_mode_is_dc(cxled->mode)) {
> +			if (resource_size(&cxlds->ram_res) &&
> +					skip_base <= cxlds->ram_res.end) {
> +				skip_len = cxlds->ram_res.end - skip_base + 1;
> +				res = __request_region(dpa_res, skip_base,
> +						skip_len, dev_name(ed_dev), 0);
> +				if (!res)
> +					goto error;
> +
> +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> +								GFP_KERNEL);
> +				skip_base += skip_len;
> +			}
> +
> +			if (resource_size(&cxlds->ram_res) &&
Should it be cxlds->pmem_res here?

Fan
> +					skip_base <= cxlds->pmem_res.end) {
> +				skip_len = cxlds->pmem_res.end - skip_base + 1;
> +				res = __request_region(dpa_res, skip_base,
> +						skip_len, dev_name(ed_dev), 0);
> +				if (!res)
> +					goto error;
> +
> +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> +								GFP_KERNEL);
> +				skip_base += skip_len;
> +			}
> +
> +			index = dc_mode_to_region_index(cxled->mode);
> +			for (int i = 0; i <= index; i++) {
> +				struct resource *dcr = &cxlds->dc_res[i];
> +
> +				if (skip_base < dcr->start) {
> +					skip_len = dcr->start - skip_base;
> +					res = __request_region(dpa_res,
> +							skip_base, skip_len,
> +							dev_name(ed_dev), 0);
> +					if (!res)
> +						goto error;
> +
> +					rc = xa_insert(&cxled->skip_res, skip_base,
> +							res, GFP_KERNEL);
> +					skip_base += skip_len;
> +				}
> +
> +				if (skip_base == base) {
> +					dev_dbg(dev, "skip done!\n");
> +					break;
> +				}
> +
> +				if (resource_size(dcr) &&
> +						skip_base <= dcr->end) {
> +					if (skip_base > base)
> +						dev_err(dev, "Skip error\n");
> +
> +					skip_len = dcr->end - skip_base + 1;
> +					res = __request_region(dpa_res, skip_base,
> +							skip_len,
> +							dev_name(ed_dev), 0);
> +					if (!res)
> +						goto error;
> +
> +					rc = xa_insert(&cxled->skip_res, skip_base,
> +							res, GFP_KERNEL);
> +					skip_base += skip_len;
> +				}
> +			}
> +		} else	{
> +			res = __request_region(dpa_res, base - skipped, skipped,
> +							dev_name(ed_dev), 0);
> +			if (!res)
> +				goto error;
> +
> +			rc = xa_insert(&cxled->skip_res, skip_base, res,
> +								GFP_KERNEL);
>  		}
>  	}
> -	res = __request_region(&cxlds->dpa_res, base, len,
> -			       dev_name(&cxled->cxld.dev), 0);
> +
> +	res = __request_region(dpa_res, base, len, dev_name(ed_dev), 0);
>  	if (!res) {
>  		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
> -			port->id, cxled->cxld.id);
> -		if (skipped)
> -			__release_region(&cxlds->dpa_res, base - skipped,
> -					 skipped);
> +				port->id, cxled->cxld.id);
> +		if (skipped) {
> +			resource_size_t skip_base = base - skipped;
> +
> +			while (skipped != 0) {
> +				if (skip_base > base)
> +					dev_err(dev, "Skip error\n");
> +
> +				res = xa_load(&cxled->skip_res, skip_base);
> +				__release_region(dpa_res, skip_base,
> +							resource_size(res));
> +				xa_erase(&cxled->skip_res, skip_base);
> +				skip_base += resource_size(res);
> +				skipped -= resource_size(res);
> +			}
> +		}
>  		return -EBUSY;
>  	}
>  	cxled->dpa_res = res;
>  	cxled->skip = skipped;
>  
> +	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
> +		int index = dc_mode_to_region_index(mode);
> +
> +		if (resource_contains(&cxlds->dc_res[index], res)) {
> +			cxled->mode = mode;
> +			dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id,
> +				cxled->cxld.id, cxled->dpa_res, cxled->mode);
> +			goto success;
> +		}
> +	}
>  	if (resource_contains(&cxlds->pmem_res, res))
>  		cxled->mode = CXL_DECODER_PMEM;
>  	else if (resource_contains(&cxlds->ram_res, res))
> @@ -336,9 +453,16 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  		cxled->mode = CXL_DECODER_MIXED;
>  	}
>  
> +success:
>  	port->hdm_end++;
>  	get_device(&cxled->cxld.dev);
>  	return 0;
> +
> +error:
> +	dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space\n",
> +			port->id, cxled->cxld.id);
> +	return -EBUSY;
> +
>  }
>  
>  int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> @@ -429,6 +553,14 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  	switch (mode) {
>  	case CXL_DECODER_RAM:
>  	case CXL_DECODER_PMEM:
> +	case CXL_DECODER_DC0:
> +	case CXL_DECODER_DC1:
> +	case CXL_DECODER_DC2:
> +	case CXL_DECODER_DC3:
> +	case CXL_DECODER_DC4:
> +	case CXL_DECODER_DC5:
> +	case CXL_DECODER_DC6:
> +	case CXL_DECODER_DC7:
>  		break;
>  	default:
>  		dev_dbg(dev, "unsupported mode: %d\n", mode);
> @@ -456,6 +588,16 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  		goto out;
>  	}
>  
> +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> +		int index = dc_mode_to_region_index(i);
> +
> +		if (mode == i && !resource_size(&cxlds->dc_res[index])) {
> +			dev_dbg(dev, "no available dynamic capacity\n");
> +			rc = -ENXIO;
> +			goto out;
> +		}
> +	}
> +
>  	cxled->mode = mode;
>  	rc = 0;
>  out:
> @@ -469,10 +611,12 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>  					 resource_size_t *skip_out)
>  {
>  	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> -	resource_size_t free_ram_start, free_pmem_start;
> +	resource_size_t free_ram_start, free_pmem_start, free_dc_start;
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	struct device *dev = &cxled->cxld.dev;
>  	resource_size_t start, avail, skip;
>  	struct resource *p, *last;
> +	int index;
>  
>  	lockdep_assert_held(&cxl_dpa_rwsem);
>  
> @@ -490,6 +634,20 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>  	else
>  		free_pmem_start = cxlds->pmem_res.start;
>  
> +	/*
> +	 * One HDM Decoder per DC region to map memory with different
> +	 * DSMAS entry.
> +	 */
> +	index = dc_mode_to_region_index(cxled->mode);
> +	if (index >= 0) {
> +		if (cxlds->dc_res[index].child) {
> +			dev_err(dev, "Cannot allocated DPA from DC Region: %d\n",
> +					index);
> +			return -EINVAL;
> +		}
> +		free_dc_start = cxlds->dc_res[index].start;
> +	}
> +
>  	if (cxled->mode == CXL_DECODER_RAM) {
>  		start = free_ram_start;
>  		avail = cxlds->ram_res.end - start + 1;
> @@ -511,6 +669,29 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>  		else
>  			skip_end = start - 1;
>  		skip = skip_end - skip_start + 1;
> +	} else if (decoder_mode_is_dc(cxled->mode)) {
> +		resource_size_t skip_start, skip_end;
> +
> +		start = free_dc_start;
> +		avail = cxlds->dc_res[index].end - start + 1;
> +		if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
> +			skip_start = free_ram_start;
> +		else
> +			skip_start = free_pmem_start;
> +		/*
> +		 * If some dc region is already mapped, then that allocation
> +		 * already handled the RAM and PMEM skip.Check for DC region
> +		 * skip.
> +		 */
> +		for (int i = index - 1; i >= 0 ; i--) {
> +			if (cxlds->dc_res[i].child) {
> +				skip_start = cxlds->dc_res[i].child->end + 1;
> +				break;
> +			}
> +		}
> +
> +		skip_end = start - 1;
> +		skip = skip_end - skip_start + 1;
>  	} else {
>  		dev_dbg(cxled_dev(cxled), "mode not set\n");
>  		avail = 0;
> @@ -548,10 +729,25 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>  
>  	avail = cxl_dpa_freespace(cxled, &start, &skip);
>  
> +	dev_dbg(dev, "DPA Allocation start: %llx len: %llx Skip: %llx\n",
> +						start, size, skip);
>  	if (size > avail) {
> +		static const char * const names[] = {
> +			[CXL_DECODER_NONE] = "none",
> +			[CXL_DECODER_RAM] = "ram",
> +			[CXL_DECODER_PMEM] = "pmem",
> +			[CXL_DECODER_MIXED] = "mixed",
> +			[CXL_DECODER_DC0] = "dc0",
> +			[CXL_DECODER_DC1] = "dc1",
> +			[CXL_DECODER_DC2] = "dc2",
> +			[CXL_DECODER_DC3] = "dc3",
> +			[CXL_DECODER_DC4] = "dc4",
> +			[CXL_DECODER_DC5] = "dc5",
> +			[CXL_DECODER_DC6] = "dc6",
> +			[CXL_DECODER_DC7] = "dc7",
> +		};
>  		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
> -			cxled->mode == CXL_DECODER_RAM ? "ram" : "pmem",
> -			&avail);
> +			names[cxled->mode], &avail);
>  		rc = -ENOSPC;
>  		goto out;
>  	}
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 5e21b53362e6..a1a98aba24ed 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -195,6 +195,22 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
>  		mode = CXL_DECODER_PMEM;
>  	else if (sysfs_streq(buf, "ram"))
>  		mode = CXL_DECODER_RAM;
> +	else if (sysfs_streq(buf, "dc0"))
> +		mode = CXL_DECODER_DC0;
> +	else if (sysfs_streq(buf, "dc1"))
> +		mode = CXL_DECODER_DC1;
> +	else if (sysfs_streq(buf, "dc2"))
> +		mode = CXL_DECODER_DC2;
> +	else if (sysfs_streq(buf, "dc3"))
> +		mode = CXL_DECODER_DC3;
> +	else if (sysfs_streq(buf, "dc4"))
> +		mode = CXL_DECODER_DC4;
> +	else if (sysfs_streq(buf, "dc5"))
> +		mode = CXL_DECODER_DC5;
> +	else if (sysfs_streq(buf, "dc6"))
> +		mode = CXL_DECODER_DC6;
> +	else if (sysfs_streq(buf, "dc7"))
> +		mode = CXL_DECODER_DC7;
>  	else
>  		return -EINVAL;
>  
> @@ -296,6 +312,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
>  	&dev_attr_target_list.attr,
>  	SET_CXL_REGION_ATTR(create_pmem_region)
>  	SET_CXL_REGION_ATTR(create_ram_region)
> +	SET_CXL_DC_REGION_ATTR(create_dc_region)
>  	SET_CXL_REGION_ATTR(delete_region)
>  	NULL,
>  };
> @@ -1691,6 +1708,7 @@ struct cxl_endpoint_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
>  		return ERR_PTR(-ENOMEM);
>  
>  	cxled->pos = -1;
> +	xa_init(&cxled->skip_res);
>  	cxld = &cxled->cxld;
>  	rc = cxl_decoder_init(port, cxld);
>  	if (rc)	 {
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 543c4499379e..144232c8305e 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1733,7 +1733,7 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  	lockdep_assert_held_write(&cxl_region_rwsem);
>  	lockdep_assert_held_read(&cxl_dpa_rwsem);
>  
> -	if (cxled->mode != cxlr->mode) {
> +	if (decoder_mode_is_dc(cxlr->mode) && !decoder_mode_is_dc(cxled->mode)) {
>  		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
>  			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
>  		return -EINVAL;
> @@ -2211,6 +2211,14 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>  	switch (mode) {
>  	case CXL_DECODER_RAM:
>  	case CXL_DECODER_PMEM:
> +	case CXL_DECODER_DC0:
> +	case CXL_DECODER_DC1:
> +	case CXL_DECODER_DC2:
> +	case CXL_DECODER_DC3:
> +	case CXL_DECODER_DC4:
> +	case CXL_DECODER_DC5:
> +	case CXL_DECODER_DC6:
> +	case CXL_DECODER_DC7:
>  		break;
>  	default:
>  		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> @@ -2321,6 +2329,43 @@ static ssize_t create_ram_region_store(struct device *dev,
>  }
>  DEVICE_ATTR_RW(create_ram_region);
>  
> +static ssize_t store_dcN_region(struct cxl_root_decoder *cxlrd,
> +				const char *buf, enum cxl_decoder_mode mode,
> +				size_t len)
> +{
> +	struct cxl_region *cxlr;
> +	int rc, id;
> +
> +	rc = sscanf(buf, "region%d\n", &id);
> +	if (rc != 1)
> +		return -EINVAL;
> +
> +	cxlr = __create_region(cxlrd, id, mode, CXL_DECODER_HOSTMEM);
> +	if (IS_ERR(cxlr))
> +		return PTR_ERR(cxlr);
> +
> +	return len;
> +}
> +
> +static ssize_t create_dc_region_show(struct device *dev,
> +				     struct device_attribute *attr, char *buf)
> +{
> +	return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static ssize_t create_dc_region_store(struct device *dev,
> +				      struct device_attribute *attr,
> +				      const char *buf, size_t len)
> +{
> +	/*
> +	 * All DC regions use decoder mode DC0 as the region does not need the
> +	 * index information
> +	 */
> +	return store_dcN_region(to_cxl_root_decoder(dev), buf,
> +				CXL_DECODER_DC0, len);
> +}
> +DEVICE_ATTR_RW(create_dc_region);
> +
>  static ssize_t region_show(struct device *dev, struct device_attribute *attr,
>  			   char *buf)
>  {
> @@ -2799,6 +2844,61 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
>  	return rc;
>  }
>  
> +static void cxl_dc_region_release(void *data)
> +{
> +	struct cxl_region *cxlr = data;
> +	struct cxl_dc_region *cxlr_dc = cxlr->cxlr_dc;
> +
> +	xa_destroy(&cxlr_dc->dax_dev_list);
> +	kfree(cxlr_dc);
> +}
> +
> +static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
> +{
> +	struct cxl_dc_region *cxlr_dc;
> +	struct cxl_dax_region *cxlr_dax;
> +	struct device *dev;
> +	int rc = 0;
> +
> +	cxlr_dax = cxl_dax_region_alloc(cxlr);
> +	if (IS_ERR(cxlr_dax))
> +		return PTR_ERR(cxlr_dax);
> +
> +	cxlr_dc = kzalloc(sizeof(*cxlr_dc), GFP_KERNEL);
> +	if (!cxlr_dc) {
> +		rc = -ENOMEM;
> +		goto err;
> +	}
> +
> +	dev = &cxlr_dax->dev;
> +	rc = dev_set_name(dev, "dax_region%d", cxlr->id);
> +	if (rc)
> +		goto err;
> +
> +	rc = device_add(dev);
> +	if (rc)
> +		goto err;
> +
> +	dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
> +		dev_name(dev));
> +
> +	rc = devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
> +					cxlr_dax);
> +	if (rc)
> +		goto err;
> +
> +	cxlr_dc->cxlr_dax = cxlr_dax;
> +	xa_init(&cxlr_dc->dax_dev_list);
> +	cxlr->cxlr_dc = cxlr_dc;
> +	rc = devm_add_action_or_reset(&cxlr->dev, cxl_dc_region_release, cxlr);
> +	if (!rc)
> +		return 0;
> +err:
> +	put_device(dev);
> +	kfree(cxlr_dc);
> +	return rc;
> +}
> +
>  static int match_decoder_by_range(struct device *dev, void *data)
>  {
>  	struct range *r1, *r2 = data;
> @@ -3140,6 +3240,19 @@ static int is_system_ram(struct resource *res, void *arg)
>  	return 1;
>  }
>  
> +/*
> + * The region can not be manged by CXL if any portion of
> + * it is already online as 'System RAM'
> + */
> +static bool region_is_system_ram(struct cxl_region *cxlr,
> +				 struct cxl_region_params *p)
> +{
> +	return (walk_iomem_res_desc(IORES_DESC_NONE,
> +				    IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> +				    p->res->start, p->res->end, cxlr,
> +				    is_system_ram) > 0);
> +}
> +
>  static int cxl_region_probe(struct device *dev)
>  {
>  	struct cxl_region *cxlr = to_cxl_region(dev);
> @@ -3174,14 +3287,7 @@ static int cxl_region_probe(struct device *dev)
>  	case CXL_DECODER_PMEM:
>  		return devm_cxl_add_pmem_region(cxlr);
>  	case CXL_DECODER_RAM:
> -		/*
> -		 * The region can not be manged by CXL if any portion of
> -		 * it is already online as 'System RAM'
> -		 */
> -		if (walk_iomem_res_desc(IORES_DESC_NONE,
> -					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> -					p->res->start, p->res->end, cxlr,
> -					is_system_ram) > 0)
> +		if (region_is_system_ram(cxlr, p))
>  			return 0;
>  
>  		/*
> @@ -3193,6 +3299,17 @@ static int cxl_region_probe(struct device *dev)
>  
>  		/* HDM-H routes to device-dax */
>  		return devm_cxl_add_dax_region(cxlr);
> +	case CXL_DECODER_DC0:
> +	case CXL_DECODER_DC1:
> +	case CXL_DECODER_DC2:
> +	case CXL_DECODER_DC3:
> +	case CXL_DECODER_DC4:
> +	case CXL_DECODER_DC5:
> +	case CXL_DECODER_DC6:
> +	case CXL_DECODER_DC7:
> +		if (region_is_system_ram(cxlr, p))
> +			return 0;
> +		return devm_cxl_add_dc_region(cxlr);
>  	default:
>  		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
>  			cxlr->mode);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 8400af85d99f..7ac1237938b7 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -335,6 +335,14 @@ enum cxl_decoder_mode {
>  	CXL_DECODER_NONE,
>  	CXL_DECODER_RAM,
>  	CXL_DECODER_PMEM,
> +	CXL_DECODER_DC0,
> +	CXL_DECODER_DC1,
> +	CXL_DECODER_DC2,
> +	CXL_DECODER_DC3,
> +	CXL_DECODER_DC4,
> +	CXL_DECODER_DC5,
> +	CXL_DECODER_DC6,
> +	CXL_DECODER_DC7,
>  	CXL_DECODER_MIXED,
>  	CXL_DECODER_DEAD,
>  };
> @@ -345,6 +353,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  		[CXL_DECODER_NONE] = "none",
>  		[CXL_DECODER_RAM] = "ram",
>  		[CXL_DECODER_PMEM] = "pmem",
> +		[CXL_DECODER_DC0] = "dc0",
> +		[CXL_DECODER_DC1] = "dc1",
> +		[CXL_DECODER_DC2] = "dc2",
> +		[CXL_DECODER_DC3] = "dc3",
> +		[CXL_DECODER_DC4] = "dc4",
> +		[CXL_DECODER_DC5] = "dc5",
> +		[CXL_DECODER_DC6] = "dc6",
> +		[CXL_DECODER_DC7] = "dc7",
>  		[CXL_DECODER_MIXED] = "mixed",
>  	};
>  
> @@ -353,6 +369,11 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  	return "mixed";
>  }
>  
> +static inline bool decoder_mode_is_dc(enum cxl_decoder_mode mode)
> +{
> +	return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
> +}
> +
>  /*
>   * Track whether this decoder is reserved for region autodiscovery, or
>   * free for userspace provisioning.
> @@ -375,6 +396,7 @@ struct cxl_endpoint_decoder {
>  	struct cxl_decoder cxld;
>  	struct resource *dpa_res;
>  	resource_size_t skip;
> +	struct xarray skip_res;
>  	enum cxl_decoder_mode mode;
>  	enum cxl_decoder_state state;
>  	int pos;
> @@ -475,6 +497,11 @@ struct cxl_region_params {
>   */
>  #define CXL_REGION_F_AUTO 1
>  
> +struct cxl_dc_region {
> +	struct xarray dax_dev_list;
> +	struct cxl_dax_region *cxlr_dax;
> +};
> +
>  /**
>   * struct cxl_region - CXL region
>   * @dev: This region's device
> @@ -493,6 +520,7 @@ struct cxl_region {
>  	enum cxl_decoder_type type;
>  	struct cxl_nvdimm_bridge *cxl_nvb;
>  	struct cxl_pmem_region *cxlr_pmem;
> +	struct cxl_dc_region *cxlr_dc;
>  	unsigned long flags;
>  	struct cxl_region_params params;
>  };
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index ccdf8de85bd5..eb5eb81bfbd7 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -23,11 +23,15 @@ static int cxl_dax_region_probe(struct device *dev)
>  	if (!dax_region)
>  		return -ENOMEM;
>  
> +	if (decoder_mode_is_dc(cxlr->mode))
> +		return 0;
> +
>  	data = (struct dev_dax_data) {
>  		.dax_region = dax_region,
>  		.id = -1,
>  		.size = range_len(&cxlr_dax->hpa_range),
>  	};
> +
>  	dev_dax = devm_create_dev_dax(&data);
>  	if (IS_ERR(dev_dax))
>  		return PTR_ERR(dev_dax);
> 
> -- 
> 2.40.0
> 

-- 
Fan Ni <nifan@outlook.com>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support.
  2023-06-20 17:55   ` Fan Ni
@ 2023-06-20 20:33     ` Ira Weiny
  2023-06-21  3:13     ` Navneet Singh
  1 sibling, 0 replies; 55+ messages in thread
From: Ira Weiny @ 2023-06-20 20:33 UTC (permalink / raw)
  To: Fan Ni, ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl,
	a.manzanares, dave, nmtadam.samsung, nifan

Fan Ni wrote:
> The 06/14/2023 12:16, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > CXL devices optionally support dynamic capacity. CXL Regions must be
> > created to access this capacity.
> > 
> > Add sysfs entries to create dynamic capacity cxl regions. Provide a new
> > Dynamic Capacity decoder mode which targets dynamic capacity on devices
> > which are added to that region.
> > 
> > Below are the steps to create and delete dynamic capacity region0
> > (example).
> > 
> >     region=$(cat /sys/bus/cxl/devices/decoder0.0/create_dc_region)
> >     echo $region> /sys/bus/cxl/devices/decoder0.0/create_dc_region
> >     echo 256 > /sys/bus/cxl/devices/$region/interleave_granularity
> >     echo 1 > /sys/bus/cxl/devices/$region/interleave_ways
> > 
> >     echo "dc0" >/sys/bus/cxl/devices/decoder1.0/mode
> >     echo 0x400000000 >/sys/bus/cxl/devices/decoder1.0/dpa_size
> > 
> >     echo 0x400000000 > /sys/bus/cxl/devices/$region/size
> >     echo  "decoder1.0" > /sys/bus/cxl/devices/$region/target0
> >     echo 1 > /sys/bus/cxl/devices/$region/commit
> >     echo $region > /sys/bus/cxl/drivers/cxl_region/bind
> > 
> >     echo $region> /sys/bus/cxl/devices/decoder0.0/delete_region
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > 

[snip]

> > @@ -304,28 +330,119 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> >  	}
> >  
> >  	if (skipped) {
> > -		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> > -				       dev_name(&cxled->cxld.dev), 0);
> > -		if (!res) {
> > -			dev_dbg(dev,
> > -				"decoder%d.%d: failed to reserve skipped space\n",
> > -				port->id, cxled->cxld.id);
> > -			return -EBUSY;
> > +		resource_size_t skip_base = base - skipped;
> > +
> > +		if (decoder_mode_is_dc(cxled->mode)) {
> > +			if (resource_size(&cxlds->ram_res) &&
> > +					skip_base <= cxlds->ram_res.end) {
> > +				skip_len = cxlds->ram_res.end - skip_base + 1;
> > +				res = __request_region(dpa_res, skip_base,
> > +						skip_len, dev_name(ed_dev), 0);
> > +				if (!res)
> > +					goto error;
> > +
> > +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> > +								GFP_KERNEL);
> > +				skip_base += skip_len;
> > +			}
> > +
> > +			if (resource_size(&cxlds->ram_res) &&
> Should it be cxlds->pmem_res here?

Yep.  I think I mentioned that in the thread somewhere...

yea here it is: https://lore.kernel.org/all/648b548db05f5_1c7ab42944a@iweiny-mobl.notmuch/

And Navneet agreed:  https://lore.kernel.org/all/ZIte4QozSm+n2zI3@fedora/

Thanks for looking,
Ira

> 
> Fan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support.
  2023-06-16 16:51   ` Alison Schofield
@ 2023-06-21  2:44     ` Ira Weiny
  0 siblings, 0 replies; 55+ messages in thread
From: Ira Weiny @ 2023-06-21  2:44 UTC (permalink / raw)
  To: Alison Schofield, ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl

Alison Schofield wrote:
> On Wed, Jun 14, 2023 at 12:16:29PM -0700, Ira Weiny wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > CXL devices optionally support dynamic capacity. CXL Regions must be
> > created to access this capacity.
> > 
> > Add sysfs entries to create dynamic capacity cxl regions. Provide a new
> > Dynamic Capacity decoder mode which targets dynamic capacity on devices
> > which are added to that region.
> > 
> > Below are the steps to create and delete dynamic capacity region0
> > (example).
> > 
> >     region=$(cat /sys/bus/cxl/devices/decoder0.0/create_dc_region)
> >     echo $region> /sys/bus/cxl/devices/decoder0.0/create_dc_region
> >     echo 256 > /sys/bus/cxl/devices/$region/interleave_granularity
> >     echo 1 > /sys/bus/cxl/devices/$region/interleave_ways
> > 
> >     echo "dc0" >/sys/bus/cxl/devices/decoder1.0/mode
> >     echo 0x400000000 >/sys/bus/cxl/devices/decoder1.0/dpa_size
> > 
> >     echo 0x400000000 > /sys/bus/cxl/devices/$region/size
> >     echo  "decoder1.0" > /sys/bus/cxl/devices/$region/target0
> >     echo 1 > /sys/bus/cxl/devices/$region/commit
> >     echo $region > /sys/bus/cxl/drivers/cxl_region/bind
> > 
> >     echo $region> /sys/bus/cxl/devices/decoder0.0/delete_region
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> 
> Hi,
> I took another pass at this and offered more feedback.
> I do think that if the big part - the cxl_dpa_reserve()
> was more 'chunkified' it would be easier to review for
> actual functionality.
> 
> I'd also like to see the commit log be a bit more specific
> in enumerated the things this patch intends to do.
> 
> Many of my comments are about style. Some checkpatch --strict
> would call out and some are addressed in the kernel coding
> style - Documentation/process/coding-style.rst

As I said before I did not run with --strict

I've done a quick run through with --strict and will ensure it is done
again after I refactor the code.

> 
> But really, my goal is that when this code merges, that as
> I scroll through a file, say region.c, I see a consistent
> coding style. I shouldn't be able to notice that oh, Dan
> wrote that, and Ira that, and Navneet wrote that piece.

I agree.

> 
> I think it's important because differences in style distract
> from focusing on the functionality of the code.
> 
> (off my soap box now ;)
> 
> Alison
> 
> 
> > 
> > ---
> > [iweiny: fixups]
> > [iweiny: remove unused CXL_DC_REGION_MODE macro]
> > [iweiny: Make dc_mode_to_region_index static]
> > [iweiny: simplify <sysfs>/create_dc_region]
> > [iweiny: introduce decoder_mode_is_dc]
> > [djbw: fixups, no sign-off: preview only]
> > ---
> >  drivers/cxl/Kconfig       |  11 +++
> >  drivers/cxl/core/core.h   |   7 ++
> >  drivers/cxl/core/hdm.c    | 234 ++++++++++++++++++++++++++++++++++++++++++----
> >  drivers/cxl/core/port.c   |  18 ++++
> >  drivers/cxl/core/region.c | 135 ++++++++++++++++++++++++--
> >  drivers/cxl/cxl.h         |  28 ++++++
> >  drivers/dax/cxl.c         |   4 +
> >  7 files changed, 409 insertions(+), 28 deletions(-)
> > 
> > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > index ff4e78117b31..df034889d053 100644
> > --- a/drivers/cxl/Kconfig
> > +++ b/drivers/cxl/Kconfig
> > @@ -121,6 +121,17 @@ config CXL_REGION
> >  
> >  	  If unsure say 'y'
> >  
> > +config CXL_DCD
> > +	bool "CXL: DCD Support"
> 
> "CXL DCD: Dynamic Capacity Device Support"
> is more in line with others in this file, and expands the acronym onetime.

done.

> 
> > +	default CXL_BUS
> > +	depends on CXL_REGION
> > +	help
> > +	  Enable the CXL core to provision CXL DCD regions.
> > +	  CXL devices optionally support dynamic capacity and DCD region
> > +	  maps the dynamic capacity regions DPA's into Host HPA ranges.
> > +
> > +	  If unsure say 'y'
> > +
> >  config CXL_REGION_INVALIDATION_TEST
> >  	bool "CXL: Region Cache Management Bypass (TEST)"
> >  	depends on CXL_REGION
> > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> > index 27f0968449de..725700ab5973 100644
> > --- a/drivers/cxl/core/core.h
> > +++ b/drivers/cxl/core/core.h
> > @@ -9,6 +9,13 @@ extern const struct device_type cxl_nvdimm_type;
> >  
> >  extern struct attribute_group cxl_base_attribute_group;
> >  
> > +#ifdef CONFIG_CXL_DCD
> > +extern struct device_attribute dev_attr_create_dc_region;
> > +#define SET_CXL_DC_REGION_ATTR(x) (&dev_attr_##x.attr),
> > +#else
> > +#define SET_CXL_DC_REGION_ATTR(x)
> > +#endif
> > +
> >  #ifdef CONFIG_CXL_REGION
> >  extern struct device_attribute dev_attr_create_pmem_region;
> >  extern struct device_attribute dev_attr_create_ram_region;
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index 514d30131d92..29649b47d177 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -233,14 +233,23 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> >  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> >  	struct resource *res = cxled->dpa_res;
> >  	resource_size_t skip_start;
> > +	resource_size_t skipped = cxled->skip;
> 
> Reverse x-tree.

Done.

> 
> >  
> >  	lockdep_assert_held_write(&cxl_dpa_rwsem);
> >  
> >  	/* save @skip_start, before @res is released */
> > -	skip_start = res->start - cxled->skip;
> > +	skip_start = res->start - skipped;
> 
> Why did the assignment of skip_start need to change here?

I believe this was done for consistency because skipped now represents
cxled->skip, however...

> 
> >  	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
> > -	if (cxled->skip)
> > -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> > +	if (cxled->skip != 0) {
> > +		while (skipped != 0) {

... what is more concerning is we now effectively have.

	if (skipped != 0) {
		while (skipped != 0) {
			...

:-(

> > +			res = xa_load(&cxled->skip_res, skip_start);
> > +			__release_region(&cxlds->dpa_res, skip_start,
> > +							resource_size(res));
> 
> The above appears poorlty aligned.

fixed.

> 
> > +			xa_erase(&cxled->skip_res, skip_start);
> > +			skip_start += resource_size(res);
> > +			skipped -= resource_size(res);
> > +			}
> 
> This bracket appears poorly aligned.

This is very poorly aligned.  I'll run --strict before sending V2.

> 
> > +	}
> >  	cxled->skip = 0;
> >  	cxled->dpa_res = NULL;
> >  	put_device(&cxled->cxld.dev);
> > @@ -267,6 +276,19 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> >  	__cxl_dpa_release(cxled);
> >  }
> >  
> > +static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
> > +{
> > +	int index = 0;
> > +
> > +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> > +		if (mode == i)
> > +			return index;
> > +		index++;
> > +	}
> > +
> > +	return -EINVAL;
> > +}
> > +
> >  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> >  			     resource_size_t base, resource_size_t len,
> >  			     resource_size_t skipped)
> > @@ -275,7 +297,11 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> >  	struct cxl_port *port = cxled_to_port(cxled);
> >  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> >  	struct device *dev = &port->dev;
> > +	struct device *ed_dev = &cxled->cxld.dev;
> > +	struct resource *dpa_res = &cxlds->dpa_res;
> > +	resource_size_t skip_len = 0;
> >  	struct resource *res;
> > +	int rc, index;
> >  
> 
> Above poorly aligned.

Do you mean reverse x-tree?

> 
> >  	lockdep_assert_held_write(&cxl_dpa_rwsem);
> >  
> > @@ -304,28 +330,119 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> >  	}
> >  
> >  	if (skipped) {
> 
> This has excessive indentation, so started out with a monster
> if skipped is begging for a refactoring.

Yea I agree.  Dave pointed this out as well.  What I have to be sure about
is the logic here.

> 
> I find it odd that the DCD case got inserted before the 'default'
> or non-DCD case here.

Yea I'm working with Navneet on this.

> 
> 
> > -		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> > -				       dev_name(&cxled->cxld.dev), 0);
> > -		if (!res) {
> > -			dev_dbg(dev,
> > -				"decoder%d.%d: failed to reserve skipped space\n",
> > -				port->id, cxled->cxld.id);
> > -			return -EBUSY;
> > +		resource_size_t skip_base = base - skipped;
> > +
> > +		if (decoder_mode_is_dc(cxled->mode)) {
> 
> This may be cleaner to introduce as a separate function for
> handling _mode_id_dc.

Yes I think all the DC in this function should be handled in it's own
function to clarify how that is handled.

I think it will make the diff/review easier as well because it will be
more clear how things are with and without DCD.

> 
> > +			if (resource_size(&cxlds->ram_res) &&
> > +					skip_base <= cxlds->ram_res.end) {
> > +				skip_len = cxlds->ram_res.end - skip_base + 1;
> > +				res = __request_region(dpa_res, skip_base,
> > +						skip_len, dev_name(ed_dev), 0);
> > +				if (!res)
> > +					goto error;
> > +
> > +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> > +								GFP_KERNEL);
> > +				skip_base += skip_len;
> > +			}
> > +
> > +			if (resource_size(&cxlds->ram_res) &&
> > +					skip_base <= cxlds->pmem_res.end) {
> > +				skip_len = cxlds->pmem_res.end - skip_base + 1;
> > +				res = __request_region(dpa_res, skip_base,
> > +						skip_len, dev_name(ed_dev), 0);
> > +				if (!res)
> > +					goto error;
> > +
> > +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> > +								GFP_KERNEL);
> > +				skip_base += skip_len;
> > +			}
> 
> The above 2 if (resource_size() cases have redundant code. 
> Pull it out, refactor.

Redundant except that the second ram_res needs to be pmem_res.  After
that change I'll have to evaluate how much is duplicated.  As I said above
I'm working with Navneet to see how this logic can be broken down.  It is
a big function now.

> 
> > +
> > +			index = dc_mode_to_region_index(cxled->mode);
> > +			for (int i = 0; i <= index; i++) {
> > +				struct resource *dcr = &cxlds->dc_res[i];
> > +
> > +				if (skip_base < dcr->start) {
> > +					skip_len = dcr->start - skip_base;
> > +					res = __request_region(dpa_res,
> > +							skip_base, skip_len,
> > +							dev_name(ed_dev), 0);
> > +					if (!res)
> > +						goto error;
> > +
> > +					rc = xa_insert(&cxled->skip_res, skip_base,
> > +							res, GFP_KERNEL);
> > +					skip_base += skip_len;
> > +				}
> > +
> > +				if (skip_base == base) {
> > +					dev_dbg(dev, "skip done!\n");
> > +					break;
> > +				}
> > +
> > +				if (resource_size(dcr) &&
> > +						skip_base <= dcr->end) {
> > +					if (skip_base > base)
> > +						dev_err(dev, "Skip error\n");
> > +
> > +					skip_len = dcr->end - skip_base + 1;
> > +					res = __request_region(dpa_res, skip_base,
> > +							skip_len,
> > +							dev_name(ed_dev), 0);
> > +					if (!res)
> > +						goto error;
> > +
> > +					rc = xa_insert(&cxled->skip_res, skip_base,
> > +							res, GFP_KERNEL);
> > +					skip_base += skip_len;
> > +				}
> > +			}
> 
> 
> And, below,we are back to the original code.
> This would be more readable, reviewable if the DCD support was
> added in separate function that are then called from here.

Yep!

> 
> > +		} else	{
> > +			res = __request_region(dpa_res, base - skipped, skipped,
> > +							dev_name(ed_dev), 0);
> > +			if (!res)
> > +				goto error;
> > +
> > +			rc = xa_insert(&cxled->skip_res, skip_base, res,
> > +								GFP_KERNEL);
> >  		}
> >  	}
> > -	res = __request_region(&cxlds->dpa_res, base, len,
> > -			       dev_name(&cxled->cxld.dev), 0);
> > +
> > +	res = __request_region(dpa_res, base, len, dev_name(ed_dev), 0);
> >  	if (!res) {
> >  		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
> > -			port->id, cxled->cxld.id);
> 
> General comment - look over the dev_dbg() messages and consider placing
> them after the code. I recall, others that were needlessly between lines
> of code.

At the end of the block?

> 
> 
> > -		if (skipped)
> > -			__release_region(&cxlds->dpa_res, base - skipped,
> > -					 skipped);
> > +				port->id, cxled->cxld.id);
> > +		if (skipped) {
> > +			resource_size_t skip_base = base - skipped;
> > +
> > +			while (skipped != 0) {
> > +				if (skip_base > base)
> > +					dev_err(dev, "Skip error\n");
> > +
> > +				res = xa_load(&cxled->skip_res, skip_base);
> > +				__release_region(dpa_res, skip_base,
> > +							resource_size(res));
> > +				xa_erase(&cxled->skip_res, skip_base);
> > +				skip_base += resource_size(res);
> > +				skipped -= resource_size(res);
> > +			}
> > +		}
> 
> 		Can that debug message go here ?

Not sure.  But we have another issue of:

if (skipped) {
	while (skipped) {
	...

which is redundant I think.  I'll have to see about skip_base.

> 
> >  		return -EBUSY;
> >  	}
> >  	cxled->dpa_res = res;
> >  	cxled->skip = skipped;
> >  
> > +	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
> > +		int index = dc_mode_to_region_index(mode);
> > +
> > +		if (resource_contains(&cxlds->dc_res[index], res)) {
> > +			cxled->mode = mode;
> > +			dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id,
> > +				cxled->cxld.id, cxled->dpa_res, cxled->mode);
> 
> Can this move to ....
> 
> 
> > +			goto success;
> > +		}
> > +	}
> >  	if (resource_contains(&cxlds->pmem_res, res))
> >  		cxled->mode = CXL_DECODER_PMEM;
> >  	else if (resource_contains(&cxlds->ram_res, res))
> > @@ -336,9 +453,16 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> >  		cxled->mode = CXL_DECODER_MIXED;
> >  	}
> >  
> > +success:
> >  	port->hdm_end++;
> >  	get_device(&cxled->cxld.dev);
> 
> here...dev_dbg() success message. That pairs it nicely with the
> error message below.

I think it can.  I think we have a case here where there was an attempt
not to change the initial behavior of the code even so much as adding a
debug message.  But I think the over all flow would be better with the
debug here.

> 
> >  	return 0;
> > +
> > +error:
> > +	dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space\n",
> > +			port->id, cxled->cxld.id);
> > +	return -EBUSY;
> > +
> >  }
> >  
> >  int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> > @@ -429,6 +553,14 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
> >  	switch (mode) {
> >  	case CXL_DECODER_RAM:
> >  	case CXL_DECODER_PMEM:
> > +	case CXL_DECODER_DC0:
> > +	case CXL_DECODER_DC1:
> > +	case CXL_DECODER_DC2:
> > +	case CXL_DECODER_DC3:
> > +	case CXL_DECODER_DC4:
> > +	case CXL_DECODER_DC5:
> > +	case CXL_DECODER_DC6:
> > +	case CXL_DECODER_DC7:
> >  		break;
> >  	default:
> >  		dev_dbg(dev, "unsupported mode: %d\n", mode);
> > @@ -456,6 +588,16 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
> >  		goto out;
> >  	}
> >  
> > +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> > +		int index = dc_mode_to_region_index(i);
> > +
> > +		if (mode == i && !resource_size(&cxlds->dc_res[index])) {
> > +			dev_dbg(dev, "no available dynamic capacity\n");
> 
> I see this one is following the pattern in the function :)
> 
> 
> > +			rc = -ENXIO;
> > +			goto out;
> > +		}
> > +	}
> > +
> >  	cxled->mode = mode;
> >  	rc = 0;
> >  out:
> > @@ -469,10 +611,12 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
> 
> Hmmm...I don't have cxl_dpa_freespace() in my cxl/next? Where's that?

That was in the patches from Dan which this series depends on.

https://lore.kernel.org/all/168592158743.1948938.7622563891193802610.stgit@dwillia2-xfh.jf.intel.com/

> 
> 
> >  					 resource_size_t *skip_out)
> >  {
> >  	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > -	resource_size_t free_ram_start, free_pmem_start;
> > +	resource_size_t free_ram_start, free_pmem_start, free_dc_start;
> >  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > +	struct device *dev = &cxled->cxld.dev;
> >  	resource_size_t start, avail, skip;
> >  	struct resource *p, *last;
> > +	int index;
> 
> Why break the alignment above?

What do you mean?

> 
> >  
> >  	lockdep_assert_held(&cxl_dpa_rwsem);
> >  
> > @@ -490,6 +634,20 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
> >  	else
> >  		free_pmem_start = cxlds->pmem_res.start;
> >  
> > +	/*
> > +	 * One HDM Decoder per DC region to map memory with different
> > +	 * DSMAS entry.
> > +	 */
> 
> It seems this comment is missing a verb. Why not align?

align?

with DSMAS on the end like this?

	/*
	 * One HDM Decoder per DC region to map memory with different DSMAS
	 * entry.
	 */

> > +	index = dc_mode_to_region_index(cxled->mode);
> > +	if (index >= 0) {
> > +		if (cxlds->dc_res[index].child) {
> > +			dev_err(dev, "Cannot allocated DPA from DC Region: %d\n"
> 
> s/allocated/allocate

Fixed.

> 
> ,
> > +					index);
> > +			return -EINVAL;
> > +		}
> > +		free_dc_start = cxlds->dc_res[index].start;
> > +	}
> > +
> >  	if (cxled->mode == CXL_DECODER_RAM) {
> >  		start = free_ram_start;
> >  		avail = cxlds->ram_res.end - start + 1;
> > @@ -511,6 +669,29 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
> >  		else
> >  			skip_end = start - 1;
> >  		skip = skip_end - skip_start + 1;
> > +	} else if (decoder_mode_is_dc(cxled->mode)) {
> > +		resource_size_t skip_start, skip_end;
> > +
> > +		start = free_dc_start;
> > +		avail = cxlds->dc_res[index].end - start + 1;
> > +		if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
> > +			skip_start = free_ram_start;
> > +		else
> > +			skip_start = free_pmem_start;
> > +		/*
> > +		 * If some dc region is already mapped, then that allocation
> 
> maybe s/some/any ?

Fixed.

Ira

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support.
  2023-06-20 17:55   ` Fan Ni
  2023-06-20 20:33     ` Ira Weiny
@ 2023-06-21  3:13     ` Navneet Singh
  1 sibling, 0 replies; 55+ messages in thread
From: Navneet Singh @ 2023-06-21  3:13 UTC (permalink / raw)
  To: Fan Ni
  Cc: ira.weiny, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl,
	a.manzanares, dave, nmtadam.samsung

On Tue, Jun 20, 2023 at 10:55:15AM -0700, Fan Ni wrote:
> The 06/14/2023 12:16, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > CXL devices optionally support dynamic capacity. CXL Regions must be
> > created to access this capacity.
> > 
> > Add sysfs entries to create dynamic capacity cxl regions. Provide a new
> > Dynamic Capacity decoder mode which targets dynamic capacity on devices
> > which are added to that region.
> > 
> > Below are the steps to create and delete dynamic capacity region0
> > (example).
> > 
> >     region=$(cat /sys/bus/cxl/devices/decoder0.0/create_dc_region)
> >     echo $region> /sys/bus/cxl/devices/decoder0.0/create_dc_region
> >     echo 256 > /sys/bus/cxl/devices/$region/interleave_granularity
> >     echo 1 > /sys/bus/cxl/devices/$region/interleave_ways
> > 
> >     echo "dc0" >/sys/bus/cxl/devices/decoder1.0/mode
> >     echo 0x400000000 >/sys/bus/cxl/devices/decoder1.0/dpa_size
> > 
> >     echo 0x400000000 > /sys/bus/cxl/devices/$region/size
> >     echo  "decoder1.0" > /sys/bus/cxl/devices/$region/target0
> >     echo 1 > /sys/bus/cxl/devices/$region/commit
> >     echo $region > /sys/bus/cxl/drivers/cxl_region/bind
> > 
> >     echo $region> /sys/bus/cxl/devices/decoder0.0/delete_region
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > 
> > ---
> > [iweiny: fixups]
> > [iweiny: remove unused CXL_DC_REGION_MODE macro]
> > [iweiny: Make dc_mode_to_region_index static]
> > [iweiny: simplify <sysfs>/create_dc_region]
> > [iweiny: introduce decoder_mode_is_dc]
> > [djbw: fixups, no sign-off: preview only]
> > ---
> >  drivers/cxl/Kconfig       |  11 +++
> >  drivers/cxl/core/core.h   |   7 ++
> >  drivers/cxl/core/hdm.c    | 234 ++++++++++++++++++++++++++++++++++++++++++----
> >  drivers/cxl/core/port.c   |  18 ++++
> >  drivers/cxl/core/region.c | 135 ++++++++++++++++++++++++--
> >  drivers/cxl/cxl.h         |  28 ++++++
> >  drivers/dax/cxl.c         |   4 +
> >  7 files changed, 409 insertions(+), 28 deletions(-)
> > 
> > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > index ff4e78117b31..df034889d053 100644
> > --- a/drivers/cxl/Kconfig
> > +++ b/drivers/cxl/Kconfig
> > @@ -121,6 +121,17 @@ config CXL_REGION
> >  
> >  	  If unsure say 'y'
> >  
> > +config CXL_DCD
> > +	bool "CXL: DCD Support"
> > +	default CXL_BUS
> > +	depends on CXL_REGION
> > +	help
> > +	  Enable the CXL core to provision CXL DCD regions.
> > +	  CXL devices optionally support dynamic capacity and DCD region
> > +	  maps the dynamic capacity regions DPA's into Host HPA ranges.
> > +
> > +	  If unsure say 'y'
> > +
> >  config CXL_REGION_INVALIDATION_TEST
> >  	bool "CXL: Region Cache Management Bypass (TEST)"
> >  	depends on CXL_REGION
> > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> > index 27f0968449de..725700ab5973 100644
> > --- a/drivers/cxl/core/core.h
> > +++ b/drivers/cxl/core/core.h
> > @@ -9,6 +9,13 @@ extern const struct device_type cxl_nvdimm_type;
> >  
> >  extern struct attribute_group cxl_base_attribute_group;
> >  
> > +#ifdef CONFIG_CXL_DCD
> > +extern struct device_attribute dev_attr_create_dc_region;
> > +#define SET_CXL_DC_REGION_ATTR(x) (&dev_attr_##x.attr),
> > +#else
> > +#define SET_CXL_DC_REGION_ATTR(x)
> > +#endif
> > +
> >  #ifdef CONFIG_CXL_REGION
> >  extern struct device_attribute dev_attr_create_pmem_region;
> >  extern struct device_attribute dev_attr_create_ram_region;
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index 514d30131d92..29649b47d177 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -233,14 +233,23 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> >  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> >  	struct resource *res = cxled->dpa_res;
> >  	resource_size_t skip_start;
> > +	resource_size_t skipped = cxled->skip;
> >  
> >  	lockdep_assert_held_write(&cxl_dpa_rwsem);
> >  
> >  	/* save @skip_start, before @res is released */
> > -	skip_start = res->start - cxled->skip;
> > +	skip_start = res->start - skipped;
> >  	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
> > -	if (cxled->skip)
> > -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> > +	if (cxled->skip != 0) {
> > +		while (skipped != 0) {
> > +			res = xa_load(&cxled->skip_res, skip_start);
> > +			__release_region(&cxlds->dpa_res, skip_start,
> > +							resource_size(res));
> > +			xa_erase(&cxled->skip_res, skip_start);
> > +			skip_start += resource_size(res);
> > +			skipped -= resource_size(res);
> > +			}
> > +	}
> >  	cxled->skip = 0;
> >  	cxled->dpa_res = NULL;
> >  	put_device(&cxled->cxld.dev);
> > @@ -267,6 +276,19 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> >  	__cxl_dpa_release(cxled);
> >  }
> >  
> > +static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
> > +{
> > +	int index = 0;
> > +
> > +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> > +		if (mode == i)
> > +			return index;
> > +		index++;
> > +	}
> > +
> > +	return -EINVAL;
> > +}
> > +
> >  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> >  			     resource_size_t base, resource_size_t len,
> >  			     resource_size_t skipped)
> > @@ -275,7 +297,11 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> >  	struct cxl_port *port = cxled_to_port(cxled);
> >  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> >  	struct device *dev = &port->dev;
> > +	struct device *ed_dev = &cxled->cxld.dev;
> > +	struct resource *dpa_res = &cxlds->dpa_res;
> > +	resource_size_t skip_len = 0;
> >  	struct resource *res;
> > +	int rc, index;
> >  
> >  	lockdep_assert_held_write(&cxl_dpa_rwsem);
> >  
> > @@ -304,28 +330,119 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> >  	}
> >  
> >  	if (skipped) {
> > -		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> > -				       dev_name(&cxled->cxld.dev), 0);
> > -		if (!res) {
> > -			dev_dbg(dev,
> > -				"decoder%d.%d: failed to reserve skipped space\n",
> > -				port->id, cxled->cxld.id);
> > -			return -EBUSY;
> > +		resource_size_t skip_base = base - skipped;
> > +
> > +		if (decoder_mode_is_dc(cxled->mode)) {
> > +			if (resource_size(&cxlds->ram_res) &&
> > +					skip_base <= cxlds->ram_res.end) {
> > +				skip_len = cxlds->ram_res.end - skip_base + 1;
> > +				res = __request_region(dpa_res, skip_base,
> > +						skip_len, dev_name(ed_dev), 0);
> > +				if (!res)
> > +					goto error;
> > +
> > +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> > +								GFP_KERNEL);
> > +				skip_base += skip_len;
> > +			}
> > +
> > +			if (resource_size(&cxlds->ram_res) &&
> Should it be cxlds->pmem_res here?
> 
> Fan
Navneet - Yes , This is already in the change list. 
> > +					skip_base <= cxlds->pmem_res.end) {
> > +				skip_len = cxlds->pmem_res.end - skip_base + 1;
> > +				res = __request_region(dpa_res, skip_base,
> > +						skip_len, dev_name(ed_dev), 0);
> > +				if (!res)
> > +					goto error;
> > +
> > +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> > +								GFP_KERNEL);
> > +				skip_base += skip_len;
> > +			}
> > +
> > +			index = dc_mode_to_region_index(cxled->mode);
> > +			for (int i = 0; i <= index; i++) {
> > +				struct resource *dcr = &cxlds->dc_res[i];
> > +
> > +				if (skip_base < dcr->start) {
> > +					skip_len = dcr->start - skip_base;
> > +					res = __request_region(dpa_res,
> > +							skip_base, skip_len,
> > +							dev_name(ed_dev), 0);
> > +					if (!res)
> > +						goto error;
> > +
> > +					rc = xa_insert(&cxled->skip_res, skip_base,
> > +							res, GFP_KERNEL);
> > +					skip_base += skip_len;
> > +				}
> > +
> > +				if (skip_base == base) {
> > +					dev_dbg(dev, "skip done!\n");
> > +					break;
> > +				}
> > +
> > +				if (resource_size(dcr) &&
> > +						skip_base <= dcr->end) {
> > +					if (skip_base > base)
> > +						dev_err(dev, "Skip error\n");
> > +
> > +					skip_len = dcr->end - skip_base + 1;
> > +					res = __request_region(dpa_res, skip_base,
> > +							skip_len,
> > +							dev_name(ed_dev), 0);
> > +					if (!res)
> > +						goto error;
> > +
> > +					rc = xa_insert(&cxled->skip_res, skip_base,
> > +							res, GFP_KERNEL);
> > +					skip_base += skip_len;
> > +				}
> > +			}
> > +		} else	{
> > +			res = __request_region(dpa_res, base - skipped, skipped,
> > +							dev_name(ed_dev), 0);
> > +			if (!res)
> > +				goto error;
> > +
> > +			rc = xa_insert(&cxled->skip_res, skip_base, res,
> > +								GFP_KERNEL);
> >  		}
> >  	}
> > -	res = __request_region(&cxlds->dpa_res, base, len,
> > -			       dev_name(&cxled->cxld.dev), 0);
> > +
> > +	res = __request_region(dpa_res, base, len, dev_name(ed_dev), 0);
> >  	if (!res) {
> >  		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
> > -			port->id, cxled->cxld.id);
> > -		if (skipped)
> > -			__release_region(&cxlds->dpa_res, base - skipped,
> > -					 skipped);
> > +				port->id, cxled->cxld.id);
> > +		if (skipped) {
> > +			resource_size_t skip_base = base - skipped;
> > +
> > +			while (skipped != 0) {
> > +				if (skip_base > base)
> > +					dev_err(dev, "Skip error\n");
> > +
> > +				res = xa_load(&cxled->skip_res, skip_base);
> > +				__release_region(dpa_res, skip_base,
> > +							resource_size(res));
> > +				xa_erase(&cxled->skip_res, skip_base);
> > +				skip_base += resource_size(res);
> > +				skipped -= resource_size(res);
> > +			}
> > +		}
> >  		return -EBUSY;
> >  	}
> >  	cxled->dpa_res = res;
> >  	cxled->skip = skipped;
> >  
> > +	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
> > +		int index = dc_mode_to_region_index(mode);
> > +
> > +		if (resource_contains(&cxlds->dc_res[index], res)) {
> > +			cxled->mode = mode;
> > +			dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id,
> > +				cxled->cxld.id, cxled->dpa_res, cxled->mode);
> > +			goto success;
> > +		}
> > +	}
> >  	if (resource_contains(&cxlds->pmem_res, res))
> >  		cxled->mode = CXL_DECODER_PMEM;
> >  	else if (resource_contains(&cxlds->ram_res, res))
> > @@ -336,9 +453,16 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> >  		cxled->mode = CXL_DECODER_MIXED;
> >  	}
> >  
> > +success:
> >  	port->hdm_end++;
> >  	get_device(&cxled->cxld.dev);
> >  	return 0;
> > +
> > +error:
> > +	dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space\n",
> > +			port->id, cxled->cxld.id);
> > +	return -EBUSY;
> > +
> >  }
> >  
> >  int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> > @@ -429,6 +553,14 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
> >  	switch (mode) {
> >  	case CXL_DECODER_RAM:
> >  	case CXL_DECODER_PMEM:
> > +	case CXL_DECODER_DC0:
> > +	case CXL_DECODER_DC1:
> > +	case CXL_DECODER_DC2:
> > +	case CXL_DECODER_DC3:
> > +	case CXL_DECODER_DC4:
> > +	case CXL_DECODER_DC5:
> > +	case CXL_DECODER_DC6:
> > +	case CXL_DECODER_DC7:
> >  		break;
> >  	default:
> >  		dev_dbg(dev, "unsupported mode: %d\n", mode);
> > @@ -456,6 +588,16 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
> >  		goto out;
> >  	}
> >  
> > +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> > +		int index = dc_mode_to_region_index(i);
> > +
> > +		if (mode == i && !resource_size(&cxlds->dc_res[index])) {
> > +			dev_dbg(dev, "no available dynamic capacity\n");
> > +			rc = -ENXIO;
> > +			goto out;
> > +		}
> > +	}
> > +
> >  	cxled->mode = mode;
> >  	rc = 0;
> >  out:
> > @@ -469,10 +611,12 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
> >  					 resource_size_t *skip_out)
> >  {
> >  	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > -	resource_size_t free_ram_start, free_pmem_start;
> > +	resource_size_t free_ram_start, free_pmem_start, free_dc_start;
> >  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > +	struct device *dev = &cxled->cxld.dev;
> >  	resource_size_t start, avail, skip;
> >  	struct resource *p, *last;
> > +	int index;
> >  
> >  	lockdep_assert_held(&cxl_dpa_rwsem);
> >  
> > @@ -490,6 +634,20 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
> >  	else
> >  		free_pmem_start = cxlds->pmem_res.start;
> >  
> > +	/*
> > +	 * One HDM Decoder per DC region to map memory with different
> > +	 * DSMAS entry.
> > +	 */
> > +	index = dc_mode_to_region_index(cxled->mode);
> > +	if (index >= 0) {
> > +		if (cxlds->dc_res[index].child) {
> > +			dev_err(dev, "Cannot allocated DPA from DC Region: %d\n",
> > +					index);
> > +			return -EINVAL;
> > +		}
> > +		free_dc_start = cxlds->dc_res[index].start;
> > +	}
> > +
> >  	if (cxled->mode == CXL_DECODER_RAM) {
> >  		start = free_ram_start;
> >  		avail = cxlds->ram_res.end - start + 1;
> > @@ -511,6 +669,29 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
> >  		else
> >  			skip_end = start - 1;
> >  		skip = skip_end - skip_start + 1;
> > +	} else if (decoder_mode_is_dc(cxled->mode)) {
> > +		resource_size_t skip_start, skip_end;
> > +
> > +		start = free_dc_start;
> > +		avail = cxlds->dc_res[index].end - start + 1;
> > +		if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
> > +			skip_start = free_ram_start;
> > +		else
> > +			skip_start = free_pmem_start;
> > +		/*
> > +		 * If some dc region is already mapped, then that allocation
> > +		 * already handled the RAM and PMEM skip.Check for DC region
> > +		 * skip.
> > +		 */
> > +		for (int i = index - 1; i >= 0 ; i--) {
> > +			if (cxlds->dc_res[i].child) {
> > +				skip_start = cxlds->dc_res[i].child->end + 1;
> > +				break;
> > +			}
> > +		}
> > +
> > +		skip_end = start - 1;
> > +		skip = skip_end - skip_start + 1;
> >  	} else {
> >  		dev_dbg(cxled_dev(cxled), "mode not set\n");
> >  		avail = 0;
> > @@ -548,10 +729,25 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
> >  
> >  	avail = cxl_dpa_freespace(cxled, &start, &skip);
> >  
> > +	dev_dbg(dev, "DPA Allocation start: %llx len: %llx Skip: %llx\n",
> > +						start, size, skip);
> >  	if (size > avail) {
> > +		static const char * const names[] = {
> > +			[CXL_DECODER_NONE] = "none",
> > +			[CXL_DECODER_RAM] = "ram",
> > +			[CXL_DECODER_PMEM] = "pmem",
> > +			[CXL_DECODER_MIXED] = "mixed",
> > +			[CXL_DECODER_DC0] = "dc0",
> > +			[CXL_DECODER_DC1] = "dc1",
> > +			[CXL_DECODER_DC2] = "dc2",
> > +			[CXL_DECODER_DC3] = "dc3",
> > +			[CXL_DECODER_DC4] = "dc4",
> > +			[CXL_DECODER_DC5] = "dc5",
> > +			[CXL_DECODER_DC6] = "dc6",
> > +			[CXL_DECODER_DC7] = "dc7",
> > +		};
> >  		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
> > -			cxled->mode == CXL_DECODER_RAM ? "ram" : "pmem",
> > -			&avail);
> > +			names[cxled->mode], &avail);
> >  		rc = -ENOSPC;
> >  		goto out;
> >  	}
> > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > index 5e21b53362e6..a1a98aba24ed 100644
> > --- a/drivers/cxl/core/port.c
> > +++ b/drivers/cxl/core/port.c
> > @@ -195,6 +195,22 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
> >  		mode = CXL_DECODER_PMEM;
> >  	else if (sysfs_streq(buf, "ram"))
> >  		mode = CXL_DECODER_RAM;
> > +	else if (sysfs_streq(buf, "dc0"))
> > +		mode = CXL_DECODER_DC0;
> > +	else if (sysfs_streq(buf, "dc1"))
> > +		mode = CXL_DECODER_DC1;
> > +	else if (sysfs_streq(buf, "dc2"))
> > +		mode = CXL_DECODER_DC2;
> > +	else if (sysfs_streq(buf, "dc3"))
> > +		mode = CXL_DECODER_DC3;
> > +	else if (sysfs_streq(buf, "dc4"))
> > +		mode = CXL_DECODER_DC4;
> > +	else if (sysfs_streq(buf, "dc5"))
> > +		mode = CXL_DECODER_DC5;
> > +	else if (sysfs_streq(buf, "dc6"))
> > +		mode = CXL_DECODER_DC6;
> > +	else if (sysfs_streq(buf, "dc7"))
> > +		mode = CXL_DECODER_DC7;
> >  	else
> >  		return -EINVAL;
> >  
> > @@ -296,6 +312,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
> >  	&dev_attr_target_list.attr,
> >  	SET_CXL_REGION_ATTR(create_pmem_region)
> >  	SET_CXL_REGION_ATTR(create_ram_region)
> > +	SET_CXL_DC_REGION_ATTR(create_dc_region)
> >  	SET_CXL_REGION_ATTR(delete_region)
> >  	NULL,
> >  };
> > @@ -1691,6 +1708,7 @@ struct cxl_endpoint_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
> >  		return ERR_PTR(-ENOMEM);
> >  
> >  	cxled->pos = -1;
> > +	xa_init(&cxled->skip_res);
> >  	cxld = &cxled->cxld;
> >  	rc = cxl_decoder_init(port, cxld);
> >  	if (rc)	 {
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > index 543c4499379e..144232c8305e 100644
> > --- a/drivers/cxl/core/region.c
> > +++ b/drivers/cxl/core/region.c
> > @@ -1733,7 +1733,7 @@ static int cxl_region_attach(struct cxl_region *cxlr,
> >  	lockdep_assert_held_write(&cxl_region_rwsem);
> >  	lockdep_assert_held_read(&cxl_dpa_rwsem);
> >  
> > -	if (cxled->mode != cxlr->mode) {
> > +	if (decoder_mode_is_dc(cxlr->mode) && !decoder_mode_is_dc(cxled->mode)) {
> >  		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
> >  			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
> >  		return -EINVAL;
> > @@ -2211,6 +2211,14 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
> >  	switch (mode) {
> >  	case CXL_DECODER_RAM:
> >  	case CXL_DECODER_PMEM:
> > +	case CXL_DECODER_DC0:
> > +	case CXL_DECODER_DC1:
> > +	case CXL_DECODER_DC2:
> > +	case CXL_DECODER_DC3:
> > +	case CXL_DECODER_DC4:
> > +	case CXL_DECODER_DC5:
> > +	case CXL_DECODER_DC6:
> > +	case CXL_DECODER_DC7:
> >  		break;
> >  	default:
> >  		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> > @@ -2321,6 +2329,43 @@ static ssize_t create_ram_region_store(struct device *dev,
> >  }
> >  DEVICE_ATTR_RW(create_ram_region);
> >  
> > +static ssize_t store_dcN_region(struct cxl_root_decoder *cxlrd,
> > +				const char *buf, enum cxl_decoder_mode mode,
> > +				size_t len)
> > +{
> > +	struct cxl_region *cxlr;
> > +	int rc, id;
> > +
> > +	rc = sscanf(buf, "region%d\n", &id);
> > +	if (rc != 1)
> > +		return -EINVAL;
> > +
> > +	cxlr = __create_region(cxlrd, id, mode, CXL_DECODER_HOSTMEM);
> > +	if (IS_ERR(cxlr))
> > +		return PTR_ERR(cxlr);
> > +
> > +	return len;
> > +}
> > +
> > +static ssize_t create_dc_region_show(struct device *dev,
> > +				     struct device_attribute *attr, char *buf)
> > +{
> > +	return __create_region_show(to_cxl_root_decoder(dev), buf);
> > +}
> > +
> > +static ssize_t create_dc_region_store(struct device *dev,
> > +				      struct device_attribute *attr,
> > +				      const char *buf, size_t len)
> > +{
> > +	/*
> > +	 * All DC regions use decoder mode DC0 as the region does not need the
> > +	 * index information
> > +	 */
> > +	return store_dcN_region(to_cxl_root_decoder(dev), buf,
> > +				CXL_DECODER_DC0, len);
> > +}
> > +DEVICE_ATTR_RW(create_dc_region);
> > +
> >  static ssize_t region_show(struct device *dev, struct device_attribute *attr,
> >  			   char *buf)
> >  {
> > @@ -2799,6 +2844,61 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> >  	return rc;
> >  }
> >  
> > +static void cxl_dc_region_release(void *data)
> > +{
> > +	struct cxl_region *cxlr = data;
> > +	struct cxl_dc_region *cxlr_dc = cxlr->cxlr_dc;
> > +
> > +	xa_destroy(&cxlr_dc->dax_dev_list);
> > +	kfree(cxlr_dc);
> > +}
> > +
> > +static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
> > +{
> > +	struct cxl_dc_region *cxlr_dc;
> > +	struct cxl_dax_region *cxlr_dax;
> > +	struct device *dev;
> > +	int rc = 0;
> > +
> > +	cxlr_dax = cxl_dax_region_alloc(cxlr);
> > +	if (IS_ERR(cxlr_dax))
> > +		return PTR_ERR(cxlr_dax);
> > +
> > +	cxlr_dc = kzalloc(sizeof(*cxlr_dc), GFP_KERNEL);
> > +	if (!cxlr_dc) {
> > +		rc = -ENOMEM;
> > +		goto err;
> > +	}
> > +
> > +	dev = &cxlr_dax->dev;
> > +	rc = dev_set_name(dev, "dax_region%d", cxlr->id);
> > +	if (rc)
> > +		goto err;
> > +
> > +	rc = device_add(dev);
> > +	if (rc)
> > +		goto err;
> > +
> > +	dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
> > +		dev_name(dev));
> > +
> > +	rc = devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
> > +					cxlr_dax);
> > +	if (rc)
> > +		goto err;
> > +
> > +	cxlr_dc->cxlr_dax = cxlr_dax;
> > +	xa_init(&cxlr_dc->dax_dev_list);
> > +	cxlr->cxlr_dc = cxlr_dc;
> > +	rc = devm_add_action_or_reset(&cxlr->dev, cxl_dc_region_release, cxlr);
> > +	if (!rc)
> > +		return 0;
> > +err:
> > +	put_device(dev);
> > +	kfree(cxlr_dc);
> > +	return rc;
> > +}
> > +
> >  static int match_decoder_by_range(struct device *dev, void *data)
> >  {
> >  	struct range *r1, *r2 = data;
> > @@ -3140,6 +3240,19 @@ static int is_system_ram(struct resource *res, void *arg)
> >  	return 1;
> >  }
> >  
> > +/*
> > + * The region can not be manged by CXL if any portion of
> > + * it is already online as 'System RAM'
> > + */
> > +static bool region_is_system_ram(struct cxl_region *cxlr,
> > +				 struct cxl_region_params *p)
> > +{
> > +	return (walk_iomem_res_desc(IORES_DESC_NONE,
> > +				    IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> > +				    p->res->start, p->res->end, cxlr,
> > +				    is_system_ram) > 0);
> > +}
> > +
> >  static int cxl_region_probe(struct device *dev)
> >  {
> >  	struct cxl_region *cxlr = to_cxl_region(dev);
> > @@ -3174,14 +3287,7 @@ static int cxl_region_probe(struct device *dev)
> >  	case CXL_DECODER_PMEM:
> >  		return devm_cxl_add_pmem_region(cxlr);
> >  	case CXL_DECODER_RAM:
> > -		/*
> > -		 * The region can not be manged by CXL if any portion of
> > -		 * it is already online as 'System RAM'
> > -		 */
> > -		if (walk_iomem_res_desc(IORES_DESC_NONE,
> > -					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> > -					p->res->start, p->res->end, cxlr,
> > -					is_system_ram) > 0)
> > +		if (region_is_system_ram(cxlr, p))
> >  			return 0;
> >  
> >  		/*
> > @@ -3193,6 +3299,17 @@ static int cxl_region_probe(struct device *dev)
> >  
> >  		/* HDM-H routes to device-dax */
> >  		return devm_cxl_add_dax_region(cxlr);
> > +	case CXL_DECODER_DC0:
> > +	case CXL_DECODER_DC1:
> > +	case CXL_DECODER_DC2:
> > +	case CXL_DECODER_DC3:
> > +	case CXL_DECODER_DC4:
> > +	case CXL_DECODER_DC5:
> > +	case CXL_DECODER_DC6:
> > +	case CXL_DECODER_DC7:
> > +		if (region_is_system_ram(cxlr, p))
> > +			return 0;
> > +		return devm_cxl_add_dc_region(cxlr);
> >  	default:
> >  		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
> >  			cxlr->mode);
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index 8400af85d99f..7ac1237938b7 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -335,6 +335,14 @@ enum cxl_decoder_mode {
> >  	CXL_DECODER_NONE,
> >  	CXL_DECODER_RAM,
> >  	CXL_DECODER_PMEM,
> > +	CXL_DECODER_DC0,
> > +	CXL_DECODER_DC1,
> > +	CXL_DECODER_DC2,
> > +	CXL_DECODER_DC3,
> > +	CXL_DECODER_DC4,
> > +	CXL_DECODER_DC5,
> > +	CXL_DECODER_DC6,
> > +	CXL_DECODER_DC7,
> >  	CXL_DECODER_MIXED,
> >  	CXL_DECODER_DEAD,
> >  };
> > @@ -345,6 +353,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
> >  		[CXL_DECODER_NONE] = "none",
> >  		[CXL_DECODER_RAM] = "ram",
> >  		[CXL_DECODER_PMEM] = "pmem",
> > +		[CXL_DECODER_DC0] = "dc0",
> > +		[CXL_DECODER_DC1] = "dc1",
> > +		[CXL_DECODER_DC2] = "dc2",
> > +		[CXL_DECODER_DC3] = "dc3",
> > +		[CXL_DECODER_DC4] = "dc4",
> > +		[CXL_DECODER_DC5] = "dc5",
> > +		[CXL_DECODER_DC6] = "dc6",
> > +		[CXL_DECODER_DC7] = "dc7",
> >  		[CXL_DECODER_MIXED] = "mixed",
> >  	};
> >  
> > @@ -353,6 +369,11 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
> >  	return "mixed";
> >  }
> >  
> > +static inline bool decoder_mode_is_dc(enum cxl_decoder_mode mode)
> > +{
> > +	return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
> > +}
> > +
> >  /*
> >   * Track whether this decoder is reserved for region autodiscovery, or
> >   * free for userspace provisioning.
> > @@ -375,6 +396,7 @@ struct cxl_endpoint_decoder {
> >  	struct cxl_decoder cxld;
> >  	struct resource *dpa_res;
> >  	resource_size_t skip;
> > +	struct xarray skip_res;
> >  	enum cxl_decoder_mode mode;
> >  	enum cxl_decoder_state state;
> >  	int pos;
> > @@ -475,6 +497,11 @@ struct cxl_region_params {
> >   */
> >  #define CXL_REGION_F_AUTO 1
> >  
> > +struct cxl_dc_region {
> > +	struct xarray dax_dev_list;
> > +	struct cxl_dax_region *cxlr_dax;
> > +};
> > +
> >  /**
> >   * struct cxl_region - CXL region
> >   * @dev: This region's device
> > @@ -493,6 +520,7 @@ struct cxl_region {
> >  	enum cxl_decoder_type type;
> >  	struct cxl_nvdimm_bridge *cxl_nvb;
> >  	struct cxl_pmem_region *cxlr_pmem;
> > +	struct cxl_dc_region *cxlr_dc;
> >  	unsigned long flags;
> >  	struct cxl_region_params params;
> >  };
> > diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> > index ccdf8de85bd5..eb5eb81bfbd7 100644
> > --- a/drivers/dax/cxl.c
> > +++ b/drivers/dax/cxl.c
> > @@ -23,11 +23,15 @@ static int cxl_dax_region_probe(struct device *dev)
> >  	if (!dax_region)
> >  		return -ENOMEM;
> >  
> > +	if (decoder_mode_is_dc(cxlr->mode))
> > +		return 0;
> > +
> >  	data = (struct dev_dax_data) {
> >  		.dax_region = dax_region,
> >  		.id = -1,
> >  		.size = range_len(&cxlr_dax->hpa_range),
> >  	};
> > +
> >  	dev_dax = devm_create_dev_dax(&data);
> >  	if (IS_ERR(dev_dax))
> >  		return PTR_ERR(dev_dax);
> > 
> > -- 
> > 2.40.0
> > 
> 
> -- 
> Fan Ni <nifan@outlook.com>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support.
  2023-06-14 19:16 ` [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support ira.weiny
                     ` (3 preceding siblings ...)
  2023-06-20 17:55   ` Fan Ni
@ 2023-06-21 17:20   ` Fan Ni
  2023-06-23 18:02     ` Ira Weiny
  2023-06-22 16:34   ` Jonathan Cameron
  2023-07-05 14:49   ` Davidlohr Bueso
  6 siblings, 1 reply; 55+ messages in thread
From: Fan Ni @ 2023-06-21 17:20 UTC (permalink / raw)
  To: ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl,
	a.manzanares, dave, nmtadam.samsung, nifan

The 06/14/2023 12:16, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> CXL devices optionally support dynamic capacity. CXL Regions must be
> created to access this capacity.
> 
> Add sysfs entries to create dynamic capacity cxl regions. Provide a new
> Dynamic Capacity decoder mode which targets dynamic capacity on devices
> which are added to that region.
> 
> Below are the steps to create and delete dynamic capacity region0
> (example).
> 
>     region=$(cat /sys/bus/cxl/devices/decoder0.0/create_dc_region)
>     echo $region> /sys/bus/cxl/devices/decoder0.0/create_dc_region
>     echo 256 > /sys/bus/cxl/devices/$region/interleave_granularity
>     echo 1 > /sys/bus/cxl/devices/$region/interleave_ways
> 
>     echo "dc0" >/sys/bus/cxl/devices/decoder1.0/mode
>     echo 0x400000000 >/sys/bus/cxl/devices/decoder1.0/dpa_size
> 
>     echo 0x400000000 > /sys/bus/cxl/devices/$region/size
>     echo  "decoder1.0" > /sys/bus/cxl/devices/$region/target0
>     echo 1 > /sys/bus/cxl/devices/$region/commit
>     echo $region > /sys/bus/cxl/drivers/cxl_region/bind
> 
>     echo $region> /sys/bus/cxl/devices/decoder0.0/delete_region
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> 
> ---
> [iweiny: fixups]
> [iweiny: remove unused CXL_DC_REGION_MODE macro]
> [iweiny: Make dc_mode_to_region_index static]
> [iweiny: simplify <sysfs>/create_dc_region]
> [iweiny: introduce decoder_mode_is_dc]
> [djbw: fixups, no sign-off: preview only]
> ---
>  drivers/cxl/Kconfig       |  11 +++
>  drivers/cxl/core/core.h   |   7 ++
>  drivers/cxl/core/hdm.c    | 234 ++++++++++++++++++++++++++++++++++++++++++----
>  drivers/cxl/core/port.c   |  18 ++++
>  drivers/cxl/core/region.c | 135 ++++++++++++++++++++++++--
>  drivers/cxl/cxl.h         |  28 ++++++
>  drivers/dax/cxl.c         |   4 +
>  7 files changed, 409 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index ff4e78117b31..df034889d053 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -121,6 +121,17 @@ config CXL_REGION
>  
>  	  If unsure say 'y'
>  
> +config CXL_DCD
> +	bool "CXL: DCD Support"
> +	default CXL_BUS
> +	depends on CXL_REGION
> +	help
> +	  Enable the CXL core to provision CXL DCD regions.
> +	  CXL devices optionally support dynamic capacity and DCD region
> +	  maps the dynamic capacity regions DPA's into Host HPA ranges.
> +
> +	  If unsure say 'y'
> +
>  config CXL_REGION_INVALIDATION_TEST
>  	bool "CXL: Region Cache Management Bypass (TEST)"
>  	depends on CXL_REGION
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 27f0968449de..725700ab5973 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -9,6 +9,13 @@ extern const struct device_type cxl_nvdimm_type;
>  
>  extern struct attribute_group cxl_base_attribute_group;
>  
> +#ifdef CONFIG_CXL_DCD
> +extern struct device_attribute dev_attr_create_dc_region;
> +#define SET_CXL_DC_REGION_ATTR(x) (&dev_attr_##x.attr),
> +#else
> +#define SET_CXL_DC_REGION_ATTR(x)
> +#endif
> +
>  #ifdef CONFIG_CXL_REGION
>  extern struct device_attribute dev_attr_create_pmem_region;
>  extern struct device_attribute dev_attr_create_ram_region;
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 514d30131d92..29649b47d177 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -233,14 +233,23 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct resource *res = cxled->dpa_res;
>  	resource_size_t skip_start;
> +	resource_size_t skipped = cxled->skip;
>  
>  	lockdep_assert_held_write(&cxl_dpa_rwsem);
>  
>  	/* save @skip_start, before @res is released */
> -	skip_start = res->start - cxled->skip;
> +	skip_start = res->start - skipped;
>  	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
> -	if (cxled->skip)
> -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> +	if (cxled->skip != 0) {
> +		while (skipped != 0) {
> +			res = xa_load(&cxled->skip_res, skip_start);
> +			__release_region(&cxlds->dpa_res, skip_start,
> +							resource_size(res));
> +			xa_erase(&cxled->skip_res, skip_start);
> +			skip_start += resource_size(res);
> +			skipped -= resource_size(res);
> +			}
> +	}
>  	cxled->skip = 0;
>  	cxled->dpa_res = NULL;
>  	put_device(&cxled->cxld.dev);
> @@ -267,6 +276,19 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	__cxl_dpa_release(cxled);
>  }
>  
> +static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
> +{
> +	int index = 0;
> +
> +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> +		if (mode == i)
> +			return index;
> +		index++;
> +	}
> +
> +	return -EINVAL;
> +}
> +
>  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  			     resource_size_t base, resource_size_t len,
>  			     resource_size_t skipped)
> @@ -275,7 +297,11 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	struct cxl_port *port = cxled_to_port(cxled);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct device *dev = &port->dev;
> +	struct device *ed_dev = &cxled->cxld.dev;
> +	struct resource *dpa_res = &cxlds->dpa_res;
> +	resource_size_t skip_len = 0;
>  	struct resource *res;
> +	int rc, index;
>  
>  	lockdep_assert_held_write(&cxl_dpa_rwsem);
>  
> @@ -304,28 +330,119 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	}
>  
>  	if (skipped) {
> -		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> -				       dev_name(&cxled->cxld.dev), 0);
> -		if (!res) {
> -			dev_dbg(dev,
> -				"decoder%d.%d: failed to reserve skipped space\n",
> -				port->id, cxled->cxld.id);
> -			return -EBUSY;
> +		resource_size_t skip_base = base - skipped;
> +
> +		if (decoder_mode_is_dc(cxled->mode)) {
> +			if (resource_size(&cxlds->ram_res) &&
> +					skip_base <= cxlds->ram_res.end) {
> +				skip_len = cxlds->ram_res.end - skip_base + 1;
> +				res = __request_region(dpa_res, skip_base,
> +						skip_len, dev_name(ed_dev), 0);
> +				if (!res)
> +					goto error;
> +
> +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> +								GFP_KERNEL);
> +				skip_base += skip_len;
> +			}
> +
> +			if (resource_size(&cxlds->ram_res) &&
> +					skip_base <= cxlds->pmem_res.end) {
> +				skip_len = cxlds->pmem_res.end - skip_base + 1;
> +				res = __request_region(dpa_res, skip_base,
> +						skip_len, dev_name(ed_dev), 0);
> +				if (!res)
> +					goto error;
> +
> +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> +								GFP_KERNEL);
> +				skip_base += skip_len;
> +			}
> +
> +			index = dc_mode_to_region_index(cxled->mode);
> +			for (int i = 0; i <= index; i++) {
> +				struct resource *dcr = &cxlds->dc_res[i];
> +
> +				if (skip_base < dcr->start) {
> +					skip_len = dcr->start - skip_base;
> +					res = __request_region(dpa_res,
> +							skip_base, skip_len,
> +							dev_name(ed_dev), 0);
> +					if (!res)
> +						goto error;
> +
> +					rc = xa_insert(&cxled->skip_res, skip_base,
> +							res, GFP_KERNEL);
> +					skip_base += skip_len;
> +				}
> +
> +				if (skip_base == base) {
> +					dev_dbg(dev, "skip done!\n");
> +					break;
> +				}
> +
> +				if (resource_size(dcr) &&
> +						skip_base <= dcr->end) {
> +					if (skip_base > base)
> +						dev_err(dev, "Skip error\n");
> +
> +					skip_len = dcr->end - skip_base + 1;
> +					res = __request_region(dpa_res, skip_base,
> +							skip_len,
> +							dev_name(ed_dev), 0);
> +					if (!res)
> +						goto error;
> +
> +					rc = xa_insert(&cxled->skip_res, skip_base,
> +							res, GFP_KERNEL);
> +					skip_base += skip_len;
> +				}
> +			}
> +		} else	{
> +			res = __request_region(dpa_res, base - skipped, skipped,
> +							dev_name(ed_dev), 0);
> +			if (!res)
> +				goto error;
> +
> +			rc = xa_insert(&cxled->skip_res, skip_base, res,
> +								GFP_KERNEL);
>  		}
>  	}
> -	res = __request_region(&cxlds->dpa_res, base, len,
> -			       dev_name(&cxled->cxld.dev), 0);
> +
> +	res = __request_region(dpa_res, base, len, dev_name(ed_dev), 0);
>  	if (!res) {
>  		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
> -			port->id, cxled->cxld.id);
> -		if (skipped)
> -			__release_region(&cxlds->dpa_res, base - skipped,
> -					 skipped);
> +				port->id, cxled->cxld.id);
> +		if (skipped) {
> +			resource_size_t skip_base = base - skipped;
> +
> +			while (skipped != 0) {
> +				if (skip_base > base)
> +					dev_err(dev, "Skip error\n");
> +
> +				res = xa_load(&cxled->skip_res, skip_base);
> +				__release_region(dpa_res, skip_base,
> +							resource_size(res));
> +				xa_erase(&cxled->skip_res, skip_base);
> +				skip_base += resource_size(res);
> +				skipped -= resource_size(res);
> +			}
> +		}
>  		return -EBUSY;
>  	}
>  	cxled->dpa_res = res;
>  	cxled->skip = skipped;
>  
> +	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
> +		int index = dc_mode_to_region_index(mode);
> +
> +		if (resource_contains(&cxlds->dc_res[index], res)) {
> +			cxled->mode = mode;
> +			dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id,
> +				cxled->cxld.id, cxled->dpa_res, cxled->mode);
> +			goto success;
> +		}
> +	}
>  	if (resource_contains(&cxlds->pmem_res, res))
>  		cxled->mode = CXL_DECODER_PMEM;
>  	else if (resource_contains(&cxlds->ram_res, res))
> @@ -336,9 +453,16 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  		cxled->mode = CXL_DECODER_MIXED;
>  	}
>  
> +success:
>  	port->hdm_end++;
>  	get_device(&cxled->cxld.dev);
>  	return 0;
> +
> +error:
> +	dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space\n",
> +			port->id, cxled->cxld.id);
> +	return -EBUSY;
> +
>  }
>  
>  int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> @@ -429,6 +553,14 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  	switch (mode) {
>  	case CXL_DECODER_RAM:
>  	case CXL_DECODER_PMEM:
> +	case CXL_DECODER_DC0:
> +	case CXL_DECODER_DC1:
> +	case CXL_DECODER_DC2:
> +	case CXL_DECODER_DC3:
> +	case CXL_DECODER_DC4:
> +	case CXL_DECODER_DC5:
> +	case CXL_DECODER_DC6:
> +	case CXL_DECODER_DC7:
>  		break;
>  	default:
>  		dev_dbg(dev, "unsupported mode: %d\n", mode);
> @@ -456,6 +588,16 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  		goto out;
>  	}
>  
> +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> +		int index = dc_mode_to_region_index(i);
> +
> +		if (mode == i && !resource_size(&cxlds->dc_res[index])) {
> +			dev_dbg(dev, "no available dynamic capacity\n");
> +			rc = -ENXIO;
> +			goto out;
> +		}
> +	}
> +
>  	cxled->mode = mode;
>  	rc = 0;
>  out:
> @@ -469,10 +611,12 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>  					 resource_size_t *skip_out)
>  {
>  	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> -	resource_size_t free_ram_start, free_pmem_start;
> +	resource_size_t free_ram_start, free_pmem_start, free_dc_start;
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	struct device *dev = &cxled->cxld.dev;
>  	resource_size_t start, avail, skip;
>  	struct resource *p, *last;
> +	int index;
>  
>  	lockdep_assert_held(&cxl_dpa_rwsem);
>  
> @@ -490,6 +634,20 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>  	else
>  		free_pmem_start = cxlds->pmem_res.start;
>  
> +	/*
> +	 * One HDM Decoder per DC region to map memory with different
> +	 * DSMAS entry.
> +	 */
> +	index = dc_mode_to_region_index(cxled->mode);
> +	if (index >= 0) {
> +		if (cxlds->dc_res[index].child) {
> +			dev_err(dev, "Cannot allocated DPA from DC Region: %d\n",
> +					index);
> +			return -EINVAL;
> +		}
> +		free_dc_start = cxlds->dc_res[index].start;
> +	}
> +
>  	if (cxled->mode == CXL_DECODER_RAM) {
>  		start = free_ram_start;
>  		avail = cxlds->ram_res.end - start + 1;
> @@ -511,6 +669,29 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>  		else
>  			skip_end = start - 1;
>  		skip = skip_end - skip_start + 1;
> +	} else if (decoder_mode_is_dc(cxled->mode)) {
> +		resource_size_t skip_start, skip_end;
> +
> +		start = free_dc_start;
> +		avail = cxlds->dc_res[index].end - start + 1;
> +		if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
> +			skip_start = free_ram_start;
> +		else
> +			skip_start = free_pmem_start;
> +		/*
> +		 * If some dc region is already mapped, then that allocation
> +		 * already handled the RAM and PMEM skip.Check for DC region
> +		 * skip.
> +		 */
> +		for (int i = index - 1; i >= 0 ; i--) {
> +			if (cxlds->dc_res[i].child) {
> +				skip_start = cxlds->dc_res[i].child->end + 1;
> +				break;
> +			}
> +		}
> +
> +		skip_end = start - 1;
> +		skip = skip_end - skip_start + 1;
>  	} else {
>  		dev_dbg(cxled_dev(cxled), "mode not set\n");
>  		avail = 0;
> @@ -548,10 +729,25 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>  
>  	avail = cxl_dpa_freespace(cxled, &start, &skip);
>  
> +	dev_dbg(dev, "DPA Allocation start: %llx len: %llx Skip: %llx\n",
> +						start, size, skip);
>  	if (size > avail) {
> +		static const char * const names[] = {
> +			[CXL_DECODER_NONE] = "none",
> +			[CXL_DECODER_RAM] = "ram",
> +			[CXL_DECODER_PMEM] = "pmem",
> +			[CXL_DECODER_MIXED] = "mixed",
> +			[CXL_DECODER_DC0] = "dc0",
> +			[CXL_DECODER_DC1] = "dc1",
> +			[CXL_DECODER_DC2] = "dc2",
> +			[CXL_DECODER_DC3] = "dc3",
> +			[CXL_DECODER_DC4] = "dc4",
> +			[CXL_DECODER_DC5] = "dc5",
> +			[CXL_DECODER_DC6] = "dc6",
> +			[CXL_DECODER_DC7] = "dc7",
> +		};
>  		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
> -			cxled->mode == CXL_DECODER_RAM ? "ram" : "pmem",
> -			&avail);
> +			names[cxled->mode], &avail);
>  		rc = -ENOSPC;
>  		goto out;
>  	}
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 5e21b53362e6..a1a98aba24ed 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -195,6 +195,22 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
>  		mode = CXL_DECODER_PMEM;
>  	else if (sysfs_streq(buf, "ram"))
>  		mode = CXL_DECODER_RAM;
> +	else if (sysfs_streq(buf, "dc0"))
> +		mode = CXL_DECODER_DC0;
> +	else if (sysfs_streq(buf, "dc1"))
> +		mode = CXL_DECODER_DC1;
> +	else if (sysfs_streq(buf, "dc2"))
> +		mode = CXL_DECODER_DC2;
> +	else if (sysfs_streq(buf, "dc3"))
> +		mode = CXL_DECODER_DC3;
> +	else if (sysfs_streq(buf, "dc4"))
> +		mode = CXL_DECODER_DC4;
> +	else if (sysfs_streq(buf, "dc5"))
> +		mode = CXL_DECODER_DC5;
> +	else if (sysfs_streq(buf, "dc6"))
> +		mode = CXL_DECODER_DC6;
> +	else if (sysfs_streq(buf, "dc7"))
> +		mode = CXL_DECODER_DC7;
>  	else
>  		return -EINVAL;
>  
> @@ -296,6 +312,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
>  	&dev_attr_target_list.attr,
>  	SET_CXL_REGION_ATTR(create_pmem_region)
>  	SET_CXL_REGION_ATTR(create_ram_region)
> +	SET_CXL_DC_REGION_ATTR(create_dc_region)
>  	SET_CXL_REGION_ATTR(delete_region)
>  	NULL,
>  };
> @@ -1691,6 +1708,7 @@ struct cxl_endpoint_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
>  		return ERR_PTR(-ENOMEM);
>  
>  	cxled->pos = -1;
> +	xa_init(&cxled->skip_res);
>  	cxld = &cxled->cxld;
>  	rc = cxl_decoder_init(port, cxld);
>  	if (rc)	 {
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 543c4499379e..144232c8305e 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1733,7 +1733,7 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  	lockdep_assert_held_write(&cxl_region_rwsem);
>  	lockdep_assert_held_read(&cxl_dpa_rwsem);
>  
> -	if (cxled->mode != cxlr->mode) {
> +	if (decoder_mode_is_dc(cxlr->mode) && !decoder_mode_is_dc(cxled->mode)) {
For mode other than dc, no check will be performed, is that what we
want?


>  		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
>  			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
>  		return -EINVAL;
> @@ -2211,6 +2211,14 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>  	switch (mode) {
>  	case CXL_DECODER_RAM:
>  	case CXL_DECODER_PMEM:
> +	case CXL_DECODER_DC0:
> +	case CXL_DECODER_DC1:
> +	case CXL_DECODER_DC2:
> +	case CXL_DECODER_DC3:
> +	case CXL_DECODER_DC4:
> +	case CXL_DECODER_DC5:
> +	case CXL_DECODER_DC6:
> +	case CXL_DECODER_DC7:
>  		break;
>  	default:
>  		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> @@ -2321,6 +2329,43 @@ static ssize_t create_ram_region_store(struct device *dev,
>  }
>  DEVICE_ATTR_RW(create_ram_region);
>  
> +static ssize_t store_dcN_region(struct cxl_root_decoder *cxlrd,
> +				const char *buf, enum cxl_decoder_mode mode,
> +				size_t len)
> +{
> +	struct cxl_region *cxlr;
> +	int rc, id;
> +
> +	rc = sscanf(buf, "region%d\n", &id);
> +	if (rc != 1)
> +		return -EINVAL;
> +
> +	cxlr = __create_region(cxlrd, id, mode, CXL_DECODER_HOSTMEM);
> +	if (IS_ERR(cxlr))
> +		return PTR_ERR(cxlr);
> +
> +	return len;
> +}
> +
> +static ssize_t create_dc_region_show(struct device *dev,
> +				     struct device_attribute *attr, char *buf)
> +{
> +	return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static ssize_t create_dc_region_store(struct device *dev,
> +				      struct device_attribute *attr,
> +				      const char *buf, size_t len)
> +{
> +	/*
> +	 * All DC regions use decoder mode DC0 as the region does not need the
> +	 * index information
> +	 */
> +	return store_dcN_region(to_cxl_root_decoder(dev), buf,
> +				CXL_DECODER_DC0, len);
If all DC regions use DC0, what will CXL_DECODER_DC1~7 be used for?

Fan
> +}
> +DEVICE_ATTR_RW(create_dc_region);
> +
>  static ssize_t region_show(struct device *dev, struct device_attribute *attr,
>  			   char *buf)
>  {
> @@ -2799,6 +2844,61 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
>  	return rc;
>  }
>  
> +static void cxl_dc_region_release(void *data)
> +{
> +	struct cxl_region *cxlr = data;
> +	struct cxl_dc_region *cxlr_dc = cxlr->cxlr_dc;
> +
> +	xa_destroy(&cxlr_dc->dax_dev_list);
> +	kfree(cxlr_dc);
> +}
> +
> +static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
> +{
> +	struct cxl_dc_region *cxlr_dc;
> +	struct cxl_dax_region *cxlr_dax;
> +	struct device *dev;
> +	int rc = 0;
> +
> +	cxlr_dax = cxl_dax_region_alloc(cxlr);
> +	if (IS_ERR(cxlr_dax))
> +		return PTR_ERR(cxlr_dax);
> +
> +	cxlr_dc = kzalloc(sizeof(*cxlr_dc), GFP_KERNEL);
> +	if (!cxlr_dc) {
> +		rc = -ENOMEM;
> +		goto err;
> +	}
> +
> +	dev = &cxlr_dax->dev;
> +	rc = dev_set_name(dev, "dax_region%d", cxlr->id);
> +	if (rc)
> +		goto err;
> +
> +	rc = device_add(dev);
> +	if (rc)
> +		goto err;
> +
> +	dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
> +		dev_name(dev));
> +
> +	rc = devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
> +					cxlr_dax);
> +	if (rc)
> +		goto err;
> +
> +	cxlr_dc->cxlr_dax = cxlr_dax;
> +	xa_init(&cxlr_dc->dax_dev_list);
> +	cxlr->cxlr_dc = cxlr_dc;
> +	rc = devm_add_action_or_reset(&cxlr->dev, cxl_dc_region_release, cxlr);
> +	if (!rc)
> +		return 0;
> +err:
> +	put_device(dev);
> +	kfree(cxlr_dc);
> +	return rc;
> +}
> +
>  static int match_decoder_by_range(struct device *dev, void *data)
>  {
>  	struct range *r1, *r2 = data;
> @@ -3140,6 +3240,19 @@ static int is_system_ram(struct resource *res, void *arg)
>  	return 1;
>  }
>  
> +/*
> + * The region can not be manged by CXL if any portion of
> + * it is already online as 'System RAM'
> + */
> +static bool region_is_system_ram(struct cxl_region *cxlr,
> +				 struct cxl_region_params *p)
> +{
> +	return (walk_iomem_res_desc(IORES_DESC_NONE,
> +				    IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> +				    p->res->start, p->res->end, cxlr,
> +				    is_system_ram) > 0);
> +}
> +
>  static int cxl_region_probe(struct device *dev)
>  {
>  	struct cxl_region *cxlr = to_cxl_region(dev);
> @@ -3174,14 +3287,7 @@ static int cxl_region_probe(struct device *dev)
>  	case CXL_DECODER_PMEM:
>  		return devm_cxl_add_pmem_region(cxlr);
>  	case CXL_DECODER_RAM:
> -		/*
> -		 * The region can not be manged by CXL if any portion of
> -		 * it is already online as 'System RAM'
> -		 */
> -		if (walk_iomem_res_desc(IORES_DESC_NONE,
> -					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> -					p->res->start, p->res->end, cxlr,
> -					is_system_ram) > 0)
> +		if (region_is_system_ram(cxlr, p))
>  			return 0;
>  
>  		/*
> @@ -3193,6 +3299,17 @@ static int cxl_region_probe(struct device *dev)
>  
>  		/* HDM-H routes to device-dax */
>  		return devm_cxl_add_dax_region(cxlr);
> +	case CXL_DECODER_DC0:
> +	case CXL_DECODER_DC1:
> +	case CXL_DECODER_DC2:
> +	case CXL_DECODER_DC3:
> +	case CXL_DECODER_DC4:
> +	case CXL_DECODER_DC5:
> +	case CXL_DECODER_DC6:
> +	case CXL_DECODER_DC7:
> +		if (region_is_system_ram(cxlr, p))
> +			return 0;
> +		return devm_cxl_add_dc_region(cxlr);
>  	default:
>  		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
>  			cxlr->mode);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 8400af85d99f..7ac1237938b7 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -335,6 +335,14 @@ enum cxl_decoder_mode {
>  	CXL_DECODER_NONE,
>  	CXL_DECODER_RAM,
>  	CXL_DECODER_PMEM,
> +	CXL_DECODER_DC0,
> +	CXL_DECODER_DC1,
> +	CXL_DECODER_DC2,
> +	CXL_DECODER_DC3,
> +	CXL_DECODER_DC4,
> +	CXL_DECODER_DC5,
> +	CXL_DECODER_DC6,
> +	CXL_DECODER_DC7,
>  	CXL_DECODER_MIXED,
>  	CXL_DECODER_DEAD,
>  };
> @@ -345,6 +353,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  		[CXL_DECODER_NONE] = "none",
>  		[CXL_DECODER_RAM] = "ram",
>  		[CXL_DECODER_PMEM] = "pmem",
> +		[CXL_DECODER_DC0] = "dc0",
> +		[CXL_DECODER_DC1] = "dc1",
> +		[CXL_DECODER_DC2] = "dc2",
> +		[CXL_DECODER_DC3] = "dc3",
> +		[CXL_DECODER_DC4] = "dc4",
> +		[CXL_DECODER_DC5] = "dc5",
> +		[CXL_DECODER_DC6] = "dc6",
> +		[CXL_DECODER_DC7] = "dc7",
>  		[CXL_DECODER_MIXED] = "mixed",
>  	};
>  
> @@ -353,6 +369,11 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  	return "mixed";
>  }
>  
> +static inline bool decoder_mode_is_dc(enum cxl_decoder_mode mode)
> +{
> +	return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
> +}
> +
>  /*
>   * Track whether this decoder is reserved for region autodiscovery, or
>   * free for userspace provisioning.
> @@ -375,6 +396,7 @@ struct cxl_endpoint_decoder {
>  	struct cxl_decoder cxld;
>  	struct resource *dpa_res;
>  	resource_size_t skip;
> +	struct xarray skip_res;
>  	enum cxl_decoder_mode mode;
>  	enum cxl_decoder_state state;
>  	int pos;
> @@ -475,6 +497,11 @@ struct cxl_region_params {
>   */
>  #define CXL_REGION_F_AUTO 1
>  
> +struct cxl_dc_region {
> +	struct xarray dax_dev_list;
> +	struct cxl_dax_region *cxlr_dax;
> +};
> +
>  /**
>   * struct cxl_region - CXL region
>   * @dev: This region's device
> @@ -493,6 +520,7 @@ struct cxl_region {
>  	enum cxl_decoder_type type;
>  	struct cxl_nvdimm_bridge *cxl_nvb;
>  	struct cxl_pmem_region *cxlr_pmem;
> +	struct cxl_dc_region *cxlr_dc;
>  	unsigned long flags;
>  	struct cxl_region_params params;
>  };
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index ccdf8de85bd5..eb5eb81bfbd7 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -23,11 +23,15 @@ static int cxl_dax_region_probe(struct device *dev)
>  	if (!dax_region)
>  		return -ENOMEM;
>  
> +	if (decoder_mode_is_dc(cxlr->mode))
> +		return 0;
> +
>  	data = (struct dev_dax_data) {
>  		.dax_region = dax_region,
>  		.id = -1,
>  		.size = range_len(&cxlr_dax->hpa_range),
>  	};
> +
>  	dev_dax = devm_create_dev_dax(&data);
>  	if (IS_ERR(dev_dax))
>  		return PTR_ERR(dev_dax);
> 
> -- 
> 2.40.0
> 

-- 
Fan Ni <nifan@outlook.com>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD)
  2023-06-15 14:51 ` Ira Weiny
@ 2023-06-22 15:07   ` Jonathan Cameron
  2023-06-22 16:37     ` Jonathan Cameron
  2023-06-27 14:59     ` Ira Weiny
  0 siblings, 2 replies; 55+ messages in thread
From: Jonathan Cameron @ 2023-06-22 15:07 UTC (permalink / raw)
  To: Ira Weiny; +Cc: Navneet Singh, Fan Ni, Dan Williams, linux-cxl

On Thu, 15 Jun 2023 07:51:16 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> ira.weiny@ wrote:
> > I'm submitting these on behalf of Navneet.  There was a round of
> > internal discussion which left a few questions but we want to get the
> > public discussion going.  A first public preview was posted by Dan.[1]  
> 
> Apologies for not being clear and marking these appropriately.  I
> intended these to be RFC to get the discussion moving forward.  I somewhat
> rushed the submission.  Depending on where the comments in this submission
> go I'll try and make a better determination if the next submission is RFC
> or can be a proper V1.  (Although b4 will mark them v2...  I'll have to
> deal with that.)

Make sure your SoB is added after Navneet to reflect that you are
handling the posting to the mailing list even if you feel changes are insufficient
to merit a Co-developed-by tag. (no idea who is doing what :)

Jonathan


> 
> Ira
> 
> > 
> > The series has been rebased on the type-2 work posted from Dan.[2]  As
> > discussed in the community call, not all of that series is required for
> > these patches.  This will get rebased on the subset of those patches he
> > is targeting for 6.5.  The series was tested using Fan Ni's Qemu DCD
> > series.[3]
> > 
> > [cover letter]
> > 
> > A Dynamic Capacity Device (DCD) (CXL 3.0 spec 9.13.3) is a CXL memory
> > device that implements dynamic capacity.  Dynamic capacity feature
> > allows memory capacity to change dynamically, without the need for
> > resetting the device.
> > 
> > Provide initial patches to enable DCD on non interleaving regions.
> > Details:
> > 
> > - Get the dynamic capacity region information from cxl device and add
> >   the advertised DC memory to driver managed resources
> > - Get the device dynamic capacity extent list from the device and
> >   maintain it in the host and add the preallocated memory to the host
> > - Dynamic capacity region support
> > - DCD region provisioning via Dax
> > - Dynamic capacity event records
> >         a. Add capacity Events
> > 	b. Release capacity events
> > 	c. Add the memory to the host dc region
> > 	d. Release the memory from the host dc region
> > - Trace Dynamic Capacity events
> > - Send add capacity response to device
> > - Send release dynamic capacity to device
> > 
> > Cc: Navneet Singh <navneet.singh@intel.com>
> > Cc: Fan Ni <fan.ni@samsung.com>
> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Cc: Ira Weiny <ira.weiny@intel.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: linux-cxl@vger.kernel.org
> > 
> > [1] https://lore.kernel.org/all/64326437c1496_934b2949f@dwillia2-mobl3.amr.corp.intel.com.notmuch/
> > [2] https://lore.kernel.org/all/168592149709.1948938.8663425987110396027.stgit@dwillia2-xfh.jf.intel.com/
> > [3] https://lore.kernel.org/all/6483946e8152f_f1132294a2@iweiny-mobl.notmuch/
> > 
> > ---
> > Navneet Singh (5):
> >       cxl/mem : Read Dynamic capacity configuration from the device
> >       cxl/region: Add dynamic capacity cxl region support.
> >       cxl/mem : Expose dynamic capacity configuration to userspace
> >       cxl/mem: Add support to handle DCD add and release capacity events.
> >       cxl/mem: Trace Dynamic capacity Event Record
> > 
> >  drivers/cxl/Kconfig       |  11 +
> >  drivers/cxl/core/core.h   |   7 +
> >  drivers/cxl/core/hdm.c    | 234 ++++++++++++++++++--
> >  drivers/cxl/core/mbox.c   | 540 +++++++++++++++++++++++++++++++++++++++++++++-
> >  drivers/cxl/core/memdev.c |  72 +++++++
> >  drivers/cxl/core/port.c   |  18 ++
> >  drivers/cxl/core/region.c | 337 ++++++++++++++++++++++++++++-
> >  drivers/cxl/core/trace.h  |  68 +++++-
> >  drivers/cxl/cxl.h         |  32 ++-
> >  drivers/cxl/cxlmem.h      | 146 ++++++++++++-
> >  drivers/cxl/pci.c         |  14 +-
> >  drivers/dax/bus.c         |  11 +-
> >  drivers/dax/bus.h         |   5 +-
> >  drivers/dax/cxl.c         |   4 +
> >  14 files changed, 1453 insertions(+), 46 deletions(-)
> > ---
> > base-commit: 034a16d0165be3e092d60685be7b1b05e6f3059b
> > change-id: 20230604-dcd-type2-upstream-0cd15f6216fd
> > 
> > Best regards,
> > -- 
> > Ira Weiny <ira.weiny@intel.com>
> >   
> 
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device
  2023-06-14 19:16 ` [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device ira.weiny
                     ` (3 preceding siblings ...)
  2023-06-15 21:41   ` Fan Ni
@ 2023-06-22 15:58   ` Jonathan Cameron
  2023-06-24 13:08     ` Ira Weiny
  4 siblings, 1 reply; 55+ messages in thread
From: Jonathan Cameron @ 2023-06-22 15:58 UTC (permalink / raw)
  To: ira.weiny; +Cc: Navneet Singh, Fan Ni, Dan Williams, linux-cxl

On Wed, 14 Jun 2023 12:16:28 -0700
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> Read the Dynamic capacity configuration and store dynamic capacity region
> information in the device state which driver will use to map into the HDM
> ranges.
> 
> Implement Get Dynamic Capacity Configuration (opcode 4800h) mailbox
> command as specified in CXL 3.0 spec section 8.2.9.8.9.1.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>

Hi Ira / Navneet,

I'll probably overlap with comments of others (good to see so much review!)
so feel free to ignore duplication.

Comments inline,

Jonathan



> +/**
> + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> + * information from the device.
> + * @mds: The memory device state
> + * Return: 0 if identify was executed successfully.
> + *
> + * This will dispatch the get_dynamic_capacity command to the device
> + * and on success populate structures to be exported to sysfs.
> + */
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> +{
> +	struct cxl_dev_state *cxlds = &mds->cxlds;
> +	struct device *dev = cxlds->dev;
> +	struct cxl_mbox_dynamic_capacity *dc;

Calling it dc is confusing.  I'd make it clear this is the mailbox
response. config_resp or dc_config_res.

> +	struct cxl_mbox_get_dc_config get_dc;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	u64 next_dc_region_start;
> +	int rc, i;
> +
> +	for (i = 0; i < CXL_MAX_DC_REGION; i++)
> +		sprintf(mds->dc_region[i].name, "dc%d", i);
> +
> +	/* Check GET_DC_CONFIG is supported by device */
> +	if (!test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds)) {
> +		dev_dbg(dev, "unsupported cmd: get_dynamic_capacity_config\n");
> +		return 0;
> +	}
> +
> +	dc = kvmalloc(mds->payload_size, GFP_KERNEL);
> +	if (!dc)
> +		return -ENOMEM;

Response to CXL_MBOX_OP_GET_DC_CONFIG has a known maximum
size. Can we provide that instead of potentially much larger?

8 + 0x28 * 8 I think so 328 bytes. Use struct_size()


But fun corner.... Mailbox is allowed to be smaller than that (256 bytes min
I think) so need to handle multiple reads with different start regions.
Which reminds me that we need to add support for running out of space
in the mailbox to qemu... So far we've just made sure everything fitted :)


> +
> +	get_dc = (struct cxl_mbox_get_dc_config) {
> +		.region_count = CXL_MAX_DC_REGION,
> +		.start_region_index = 0,
> +	};
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> +		.payload_in = &get_dc,
> +		.size_in = sizeof(get_dc),
> +		.size_out = mds->payload_size,
> +		.payload_out = dc,
> +		.min_out = 1,
> +	};
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	if (rc < 0)
> +		goto dc_error;
The error label is a bit too generic.  Why dc_error?
"error" conveys just as small amount of info.  I'd got for goto free_resp;

> +
> +	mds->nr_dc_region = dc->avail_region_count;
> +
> +	if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> +		dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> +			mds->nr_dc_region);
> +		rc = -EINVAL;
> +		goto dc_error;
> +	}
> +
> +	for (i = 0; i < mds->nr_dc_region; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> +		dcr->base = le64_to_cpu(dc->region[i].region_base);
> +		dcr->decode_len =
> +			le64_to_cpu(dc->region[i].region_decode_length);
> +		dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
> +		dcr->len = le64_to_cpu(dc->region[i].region_length);
> +		dcr->blk_size = le64_to_cpu(dc->region[i].region_block_size);
> +
> +		/* Check regions are in increasing DPA order */
> +		if ((i + 1) < mds->nr_dc_region) {

Feels a bit odd to look at entries we haven't seen yet.  Maybe flip this around
to check the ones we have looked at?  So don't start until 2nd region and then check
it's start against mds->dc_region[0] etc?
Or factor out this loop contents in general and just pass in the single
value needed for checking this. Biggest advantage being direct returns in
that function as allocation and free will be in caller.


> +			next_dc_region_start =
> +				le64_to_cpu(dc->region[i + 1].region_base);
> +			if ((dcr->base > next_dc_region_start) ||
> +			    ((dcr->base + dcr->decode_len) > next_dc_region_start)) {

Unless you have a negative decode length the second condition includes the first.
So just check that.

> +				dev_err(dev,
> +					"DPA ordering violation for DC region %d and %d\n",
> +					i, i + 1);
> +				rc = -EINVAL;
> +				goto dc_error;
> +			}
> +		}
> +
> +		/* Check the region is 256 MB aligned */
> +		if (!IS_ALIGNED(dcr->base, SZ_256M)) {

That's an oddity. I wonder why those lower bits where defined as reserved...
Anyhow code is right if paranoid ;)

> +			dev_err(dev, "DC region %d not aligned to 256MB\n", i);
> +			rc = -EINVAL;
> +			goto dc_error;
> +		}
> +
> +		/* Check Region base and length are aligned to block size */
> +		if (!IS_ALIGNED(dcr->base, dcr->blk_size) ||
> +		    !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> +			dev_err(dev, "DC region %d not aligned to %#llx\n", i,
> +				dcr->blk_size);
> +			rc = -EINVAL;
> +			goto dc_error;
> +		}
> +
> +		dcr->dsmad_handle =
> +			le32_to_cpu(dc->region[i].region_dsmad_handle);
> +		dcr->flags = dc->region[i].flags;

I'd just grab these at same time as all the other fields above.
A pattern where you fill values in only after checking would be fine, or one
where you fill them in all in one place. The mixture of the two is less clear
than either consistent approach.

> +		sprintf(dcr->name, "dc%d", i);
> +
> +		dev_dbg(dev,
> +			"DC region %s DPA: %#llx LEN: %#llx BLKSZ: %#llx\n",
> +			dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
> +	}
> +
> +	/*
> +	 * Calculate entire DPA range of all configured regions which will be mapped by
> +	 * one or more HDM decoders
> +	 */
> +	mds->total_dynamic_capacity =
> +		mds->dc_region[mds->nr_dc_region - 1].base +
> +		mds->dc_region[mds->nr_dc_region - 1].decode_len -
> +		mds->dc_region[0].base;
> +	dev_dbg(dev, "Total dynamic capacity: %#llx\n",
> +		mds->total_dynamic_capacity);
> +
> +dc_error:
> +	kvfree(dc);
> +	return rc;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> +



> @@ -1121,13 +1289,23 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>  	}
>  
>  	cxlds->dpa_res =
> -		(struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
> +		(struct resource)DEFINE_RES_MEM(0, mds->total_capacity);
> +
> +	for (int i = 0; i < CXL_MAX_DC_REGION; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
> +				 dcr->base, dcr->decode_len, dcr->name);
> +		if (rc)
> +			return rc;
> +	}
>  
>  	if (mds->partition_align_bytes == 0) {
>  		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
>  				 mds->volatile_only_bytes, "ram");
>  		if (rc)
>  			return rc;
> +

Scrub for this stuff before posting v2. Just noise that slows down review
a little.  If it is worth doing, do it in a separate patch.

>  		return add_dpa_res(dev, &cxlds->dpa_res, &cxlds->pmem_res,
>  				   mds->volatile_only_bytes,
>  				   mds->persistent_only_bytes, "pmem");
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 89e560ea14c0..9c0b2fa72bdd 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
>
>  
> +#define CXL_MAX_DC_REGION 8
> +#define CXL_DC_REGION_SRTLEN 8

SRT? 

> +
>  /**
>   * struct cxl_dev_state - The driver device state
>   *
> @@ -300,6 +312,8 @@ enum cxl_devtype {
>   * @dpa_res: Overall DPA resource tree for the device
>   * @pmem_res: Active Persistent memory capacity configuration
>   * @ram_res: Active Volatile memory capacity configuration
> + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> + *          region
>   * @component_reg_phys: register base of component registers
>   * @info: Cached DVSEC information about the device.
>   * @serial: PCIe Device Serial Number
> @@ -315,6 +329,7 @@ struct cxl_dev_state {
>  	struct resource dpa_res;
>  	struct resource pmem_res;
>  	struct resource ram_res;
> +	struct resource dc_res[CXL_MAX_DC_REGION];
>  	resource_size_t component_reg_phys;
>  	u64 serial;
>  	enum cxl_devtype type;

...

> @@ -357,9 +379,13 @@ struct cxl_memdev_state {
>  	size_t lsa_size;
>  	struct mutex mbox_mutex; /* Protects device mailbox and firmware */
>  	char firmware_version[0x10];
> +	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
>  	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
>  	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> -	u64 total_bytes;
> +
> +	u64 total_capacity;
> +	u64 total_static_capacity;
> +	u64 total_dynamic_capacity;
>  	u64 volatile_only_bytes;
>  	u64 persistent_only_bytes;
>  	u64 partition_align_bytes;
> @@ -367,6 +393,20 @@ struct cxl_memdev_state {
>  	u64 active_persistent_bytes;
>  	u64 next_volatile_bytes;
>  	u64 next_persistent_bytes;
> +
> +	u8 nr_dc_region;
> +
> +	struct cxl_dc_region_info {
> +		u8 name[CXL_DC_REGION_SRTLEN];

char? SRT?  Also isn't it a bit big? Looks like max 4 chars to me.
Put it next to flags and we can save some space.

> +		u64 base;
> +		u64 decode_len;
> +		u64 len;
> +		u64 blk_size;
> +		u32 dsmad_handle;
> +		u8 flags;
> +	} dc_region[CXL_MAX_DC_REGION];
> +
> +	size_t dc_event_log_size;

>  /*
> @@ -617,7 +662,27 @@ struct cxl_mbox_set_partition_info {
>  	u8 flags;
>  } __packed;
>  
> +struct cxl_mbox_get_dc_config {
> +	u8 region_count;
> +	u8 start_region_index;
> +} __packed;
> +
> +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> +struct cxl_mbox_dynamic_capacity {
> +	u8 avail_region_count;
> +	u8 rsvd[7];
> +	struct cxl_dc_region_config {
> +		__le64 region_base;
> +		__le64 region_decode_length;
> +		__le64 region_length;
> +		__le64 region_block_size;
> +		__le32 region_dsmad_handle;
> +		u8 flags;
> +		u8 rsvd[3];
> +	} __packed region[];
> +} __packed;
>  #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)

This looks to have merged oddly with existing changes. I'd
move the define into the structure definition so ti's clear which
flag it reflects and avoids this sort of interleaving in future.

> +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
>  



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support.
  2023-06-14 19:16 ` [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support ira.weiny
                     ` (4 preceding siblings ...)
  2023-06-21 17:20   ` Fan Ni
@ 2023-06-22 16:34   ` Jonathan Cameron
  2023-07-05 14:49   ` Davidlohr Bueso
  6 siblings, 0 replies; 55+ messages in thread
From: Jonathan Cameron @ 2023-06-22 16:34 UTC (permalink / raw)
  To: ira.weiny; +Cc: Navneet Singh, Fan Ni, Dan Williams, linux-cxl

On Wed, 14 Jun 2023 12:16:29 -0700
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> CXL devices optionally support dynamic capacity. CXL Regions must be
> created to access this capacity.
> 
> Add sysfs entries to create dynamic capacity cxl regions. Provide a new
> Dynamic Capacity decoder mode which targets dynamic capacity on devices
> which are added to that region.
> 
> Below are the steps to create and delete dynamic capacity region0
> (example).
> 
>     region=$(cat /sys/bus/cxl/devices/decoder0.0/create_dc_region)
>     echo $region> /sys/bus/cxl/devices/decoder0.0/create_dc_region
>     echo 256 > /sys/bus/cxl/devices/$region/interleave_granularity
>     echo 1 > /sys/bus/cxl/devices/$region/interleave_ways
> 
>     echo "dc0" >/sys/bus/cxl/devices/decoder1.0/mode
>     echo 0x400000000 >/sys/bus/cxl/devices/decoder1.0/dpa_size
> 
>     echo 0x400000000 > /sys/bus/cxl/devices/$region/size
>     echo  "decoder1.0" > /sys/bus/cxl/devices/$region/target0
>     echo 1 > /sys/bus/cxl/devices/$region/commit
>     echo $region > /sys/bus/cxl/drivers/cxl_region/bind
> 
>     echo $region> /sys/bus/cxl/devices/decoder0.0/delete_region
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> 

I'd like some additional info here on why the skip stuff needs to be
so complicated.  I think it's juts a way of tracking the skip value
needed in the HDM decoders and that's just one value. So why can't
we just have one resource reservation for the skip?  Is it related to them needing
to be nested in some way?

Jonathan



>  #ifdef CONFIG_CXL_REGION
>  extern struct device_attribute dev_attr_create_pmem_region;
>  extern struct device_attribute dev_attr_create_ram_region;
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 514d30131d92..29649b47d177 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -233,14 +233,23 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct resource *res = cxled->dpa_res;
>  	resource_size_t skip_start;
> +	resource_size_t skipped = cxled->skip;
>  
>  	lockdep_assert_held_write(&cxl_dpa_rwsem);
>  
>  	/* save @skip_start, before @res is released */
> -	skip_start = res->start - cxled->skip;
> +	skip_start = res->start - skipped;
>  	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
> -	if (cxled->skip)
> -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> +	if (cxled->skip != 0) {
> +		while (skipped != 0) {
> +			res = xa_load(&cxled->skip_res, skip_start);
> +			__release_region(&cxlds->dpa_res, skip_start,
> +							resource_size(res));
> +			xa_erase(&cxled->skip_res, skip_start);
> +			skip_start += resource_size(res);
> +			skipped -= resource_size(res);
> +			}
} indented too far..

> +	}
>  	cxled->skip = 0;
>  	cxled->dpa_res = NULL;
>  	put_device(&cxled->cxld.dev);
> @@ -267,6 +276,19 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	__cxl_dpa_release(cxled);
>  }
>  
> +static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
> +{
> +	int index = 0;
> +
> +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> +		if (mode == i)
> +			return index;
> +		index++;

Might as well increment index in the loop as well.
i++, index++;  Though given you are looping over a bunch of enum
entries and relying on them being in a row...

	if (mode < CXL_DECODER_DC0 || i > CXL_DECODER_DC7)
		return -EINVAL;
	return mode - CXL_DECODER0;


> +	}
> +
> +	return -EINVAL;
> +}
> +
>  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  			     resource_size_t base, resource_size_t len,
>  			     resource_size_t skipped)
> @@ -275,7 +297,11 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	struct cxl_port *port = cxled_to_port(cxled);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct device *dev = &port->dev;
> +	struct device *ed_dev = &cxled->cxld.dev;
> +	struct resource *dpa_res = &cxlds->dpa_res;
> +	resource_size_t skip_len = 0;
>  	struct resource *res;
> +	int rc, index;
>  
>  	lockdep_assert_held_write(&cxl_dpa_rwsem);
>  
> @@ -304,28 +330,119 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	}
>  
>  	if (skipped) {
> -		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> -				       dev_name(&cxled->cxld.dev), 0);
> -		if (!res) {
> -			dev_dbg(dev,
> -				"decoder%d.%d: failed to reserve skipped space\n",
> -				port->id, cxled->cxld.id);
> -			return -EBUSY;
> +		resource_size_t skip_base = base - skipped;
> +
> +		if (decoder_mode_is_dc(cxled->mode)) {
> +			if (resource_size(&cxlds->ram_res) &&
> +					skip_base <= cxlds->ram_res.end) {

Fix alignment to after (

> +				skip_len = cxlds->ram_res.end - skip_base + 1;
> +				res = __request_region(dpa_res, skip_base,
> +						skip_len, dev_name(ed_dev), 0);


Does it make sense to have all these potential skip regions in a row?
Why not just add one potentially including ram, pmem, and some of the
dc regions and remove one below?

I may be missing some subtlety here though.

> +				if (!res)
> +					goto error;
> +
> +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> +								GFP_KERNEL);
> +				skip_base += skip_len;
> +			}
> +
> +			if (resource_size(&cxlds->ram_res) &&
> +					skip_base <= cxlds->pmem_res.end) {
> +				skip_len = cxlds->pmem_res.end - skip_base + 1;
> +				res = __request_region(dpa_res, skip_base,
> +						skip_len, dev_name(ed_dev), 0);
> +				if (!res)
> +					goto error;
> +
> +				rc = xa_insert(&cxled->skip_res, skip_base, res,
> +								GFP_KERNEL);
> +				skip_base += skip_len;
> +			}
> +
> +			index = dc_mode_to_region_index(cxled->mode);
> +			for (int i = 0; i <= index; i++) {
> +				struct resource *dcr = &cxlds->dc_res[i];
> +
> +				if (skip_base < dcr->start) {
> +					skip_len = dcr->start - skip_base;
> +					res = __request_region(dpa_res,
> +							skip_base, skip_len,
> +							dev_name(ed_dev), 0);
> +					if (!res)
> +						goto error;
> +
> +					rc = xa_insert(&cxled->skip_res, skip_base,
> +							res, GFP_KERNEL);
> +					skip_base += skip_len;
> +				}
> +
> +				if (skip_base == base) {
> +					dev_dbg(dev, "skip done!\n");
> +					break;
> +				}
> +
> +				if (resource_size(dcr) &&
> +						skip_base <= dcr->end) {
> +					if (skip_base > base)
> +						dev_err(dev, "Skip error\n");
> +
> +					skip_len = dcr->end - skip_base + 1;
> +					res = __request_region(dpa_res, skip_base,
> +							skip_len,
> +							dev_name(ed_dev), 0);
> +					if (!res)
> +						goto error;
> +
> +					rc = xa_insert(&cxled->skip_res, skip_base,
> +							res, GFP_KERNEL);
> +					skip_base += skip_len;
> +				}
> +			}
> +		} else	{
> +			res = __request_region(dpa_res, base - skipped, skipped,
> +							dev_name(ed_dev), 0);
> +			if (!res)
> +				goto error;
> +
> +			rc = xa_insert(&cxled->skip_res, skip_base, res,
> +								GFP_KERNEL);
Can we have a precursor patch introducing the xarray for skip res?
Might make that bit easy to understand even if it start with few entries.

Also, is rc checked?


>  		}
>  	}
> -	res = __request_region(&cxlds->dpa_res, base, len,
> -			       dev_name(&cxled->cxld.dev), 0);
> +
> +	res = __request_region(dpa_res, base, len, dev_name(ed_dev), 0);
>  	if (!res) {
>  		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
> -			port->id, cxled->cxld.id);
> -		if (skipped)
> -			__release_region(&cxlds->dpa_res, base - skipped,
> -					 skipped);
> +				port->id, cxled->cxld.id);

Odd indent of line above that is making this noisier than it needs to be.

> +		if (skipped) {

I'd invert at cost of two places you exit.
		if (!skipped)
			return -EBUSY;

		skip_base = base - skipped;
		...

> +			resource_size_t skip_base = base - skipped;
> +
> +			while (skipped != 0) {
> +				if (skip_base > base)
> +					dev_err(dev, "Skip error\n");
> +
> +				res = xa_load(&cxled->skip_res, skip_base);
> +				__release_region(dpa_res, skip_base,
> +							resource_size(res));
> +				xa_erase(&cxled->skip_res, skip_base);
> +				skip_base += resource_size(res);
> +				skipped -= resource_size(res);
> +			}
> +		}
>  		return -EBUSY;
>  	}
>  	cxled->dpa_res = res;
>  	cxled->skip = skipped;
>  
> +	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
> +		int index = dc_mode_to_region_index(mode);
> +
> +		if (resource_contains(&cxlds->dc_res[index], res)) {
> +			cxled->mode = mode;
> +			dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id,
> +				cxled->cxld.id, cxled->dpa_res, cxled->mode);
> +			goto success;
> +		}
> +	}
>  	if (resource_contains(&cxlds->pmem_res, res))
>  		cxled->mode = CXL_DECODER_PMEM;
>  	else if (resource_contains(&cxlds->ram_res, res))
> @@ -336,9 +453,16 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  		cxled->mode = CXL_DECODER_MIXED;
>  	}
>  
> +success:
>  	port->hdm_end++;
>  	get_device(&cxled->cxld.dev);
>  	return 0;
> +
> +error:

Unless other stuff is coming here, drag the debug print up to callers, make
it more specific and return directly.  Makes for an easier flow to read.


> +	dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space\n",
> +			port->id, cxled->cxld.id);
> +	return -EBUSY;
> +
>  }


> @@ -469,10 +611,12 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>  					 resource_size_t *skip_out)
>  {
>  	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> -	resource_size_t free_ram_start, free_pmem_start;
> +	resource_size_t free_ram_start, free_pmem_start, free_dc_start;
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	struct device *dev = &cxled->cxld.dev;

Pull this out as a precursor.  Also note Dan used cxled_dev() in the patch adding cxl_dpa_freespace.
Probably best bet is just push this change into Dan's patch on basis it'll make the history neater.


>  	resource_size_t start, avail, skip;
>  	struct resource *p, *last;
> +	int index;
>  
>  	lockdep_assert_held(&cxl_dpa_rwsem);
>  
> @@ -490,6 +634,20 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>  	else
>  		free_pmem_start = cxlds->pmem_res.start;
>  
> +	/*
> +	 * One HDM Decoder per DC region to map memory with different
> +	 * DSMAS entry.
> +	 */

Push all the dc stuff into one place?  Perhaps that becomes impossible
in later patches...

> +	index = dc_mode_to_region_index(cxled->mode);
> +	if (index >= 0) {
> +		if (cxlds->dc_res[index].child) {
> +			dev_err(dev, "Cannot allocated DPA from DC Region: %d\n",
> +					index);
> +			return -EINVAL;
> +		}
> +		free_dc_start = cxlds->dc_res[index].start;
> +	}
> +
>  	if (cxled->mode == CXL_DECODER_RAM) {
>  		start = free_ram_start;
>  		avail = cxlds->ram_res.end - start + 1;
> @@ -511,6 +669,29 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>  		else
>  			skip_end = start - 1;
>  		skip = skip_end - skip_start + 1;
> +	} else if (decoder_mode_is_dc(cxled->mode)) {
> +		resource_size_t skip_start, skip_end;
> +
> +		start = free_dc_start;
> +		avail = cxlds->dc_res[index].end - start + 1;
> +		if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
> +			skip_start = free_ram_start;
> +		else
> +			skip_start = free_pmem_start;
> +		/*
> +		 * If some dc region is already mapped, then that allocation
> +		 * already handled the RAM and PMEM skip.Check for DC region
> +		 * skip.
> +		 */
> +		for (int i = index - 1; i >= 0 ; i--) {
> +			if (cxlds->dc_res[i].child) {
> +				skip_start = cxlds->dc_res[i].child->end + 1;
> +				break;
> +			}
> +		}
> +
> +		skip_end = start - 1;
> +		skip = skip_end - skip_start + 1;
>  	} else {
>  		dev_dbg(cxled_dev(cxled), "mode not set\n");
>  		avail = 0;
> @@ -548,10 +729,25 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>  
>  	avail = cxl_dpa_freespace(cxled, &start, &skip);
>  
> +	dev_dbg(dev, "DPA Allocation start: %llx len: %llx Skip: %llx\n",
> +						start, size, skip);
>  	if (size > avail) {
> +		static const char * const names[] = {
> +			[CXL_DECODER_NONE] = "none",
> +			[CXL_DECODER_RAM] = "ram",
> +			[CXL_DECODER_PMEM] = "pmem",
> +			[CXL_DECODER_MIXED] = "mixed",
> +			[CXL_DECODER_DC0] = "dc0",
> +			[CXL_DECODER_DC1] = "dc1",
> +			[CXL_DECODER_DC2] = "dc2",
> +			[CXL_DECODER_DC3] = "dc3",
> +			[CXL_DECODER_DC4] = "dc4",
> +			[CXL_DECODER_DC5] = "dc5",
> +			[CXL_DECODER_DC6] = "dc6",
> +			[CXL_DECODER_DC7] = "dc7",

Hmm. 8 is on the boundary for being better to just do this programaticaly.
I guess it's fine though and is nice and easy to follow.

> +		};
>  		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
> -			cxled->mode == CXL_DECODER_RAM ? "ram" : "pmem",
> -			&avail);
> +			names[cxled->mode], &avail);
>  		rc = -ENOSPC;
>  		goto out;
>  	}


> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 543c4499379e..144232c8305e 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1733,7 +1733,7 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  	lockdep_assert_held_write(&cxl_region_rwsem);
>  	lockdep_assert_held_read(&cxl_dpa_rwsem);
>  
> -	if (cxled->mode != cxlr->mode) {
> +	if (decoder_mode_is_dc(cxlr->mode) && !decoder_mode_is_dc(cxled->mode)) {
>  		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
>  			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
>  		return -EINVAL;
> @@ -2211,6 +2211,14 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>  	switch (mode) {
>  	case CXL_DECODER_RAM:
>  	case CXL_DECODER_PMEM:
> +	case CXL_DECODER_DC0:
> +	case CXL_DECODER_DC1:
> +	case CXL_DECODER_DC2:
> +	case CXL_DECODER_DC3:
> +	case CXL_DECODER_DC4:
> +	case CXL_DECODER_DC5:
> +	case CXL_DECODER_DC6:
> +	case CXL_DECODER_DC7:
>  		break;
>  	default:
>  		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> @@ -2321,6 +2329,43 @@ static ssize_t create_ram_region_store(struct device *dev,
>  }
>  DEVICE_ATTR_RW(create_ram_region);
>  
> +static ssize_t store_dcN_region(struct cxl_root_decoder *cxlrd,
> +				const char *buf, enum cxl_decoder_mode mode,
> +				size_t len)
> +{
> +	struct cxl_region *cxlr;
> +	int rc, id;
> +
> +	rc = sscanf(buf, "region%d\n", &id);
> +	if (rc != 1)
> +		return -EINVAL;
> +
> +	cxlr = __create_region(cxlrd, id, mode, CXL_DECODER_HOSTMEM);
> +	if (IS_ERR(cxlr))
> +		return PTR_ERR(cxlr);
> +
> +	return len;
> +}
> +
> +static ssize_t create_dc_region_show(struct device *dev,
> +				     struct device_attribute *attr, char *buf)
> +{
> +	return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static ssize_t create_dc_region_store(struct device *dev,
> +				      struct device_attribute *attr,
> +				      const char *buf, size_t len)
> +{
> +	/*
> +	 * All DC regions use decoder mode DC0 as the region does not need the
> +	 * index information
> +	 */
> +	return store_dcN_region(to_cxl_root_decoder(dev), buf,
> +				CXL_DECODER_DC0, len);
> +}
> +DEVICE_ATTR_RW(create_dc_region);
> +
>  static ssize_t region_show(struct device *dev, struct device_attribute *attr,
>  			   char *buf)
>  {
> @@ -2799,6 +2844,61 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
>  	return rc;
>  }
>  
> +static void cxl_dc_region_release(void *data)
> +{
> +	struct cxl_region *cxlr = data;
> +	struct cxl_dc_region *cxlr_dc = cxlr->cxlr_dc;
> +
> +	xa_destroy(&cxlr_dc->dax_dev_list);
> +	kfree(cxlr_dc);
> +}
> +
> +static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
> +{
> +	struct cxl_dc_region *cxlr_dc;
> +	struct cxl_dax_region *cxlr_dax;
> +	struct device *dev;
> +	int rc = 0;
> +
> +	cxlr_dax = cxl_dax_region_alloc(cxlr);
> +	if (IS_ERR(cxlr_dax))
> +		return PTR_ERR(cxlr_dax);
> +
> +	cxlr_dc = kzalloc(sizeof(*cxlr_dc), GFP_KERNEL);
> +	if (!cxlr_dc) {
> +		rc = -ENOMEM;
> +		goto err;
> +	}
> +
> +	dev = &cxlr_dax->dev;
> +	rc = dev_set_name(dev, "dax_region%d", cxlr->id);
> +	if (rc)
> +		goto err;
> +
> +	rc = device_add(dev);
> +	if (rc)
> +		goto err;
> +
> +	dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
> +		dev_name(dev));
> +
> +	rc = devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
> +					cxlr_dax);
> +	if (rc)
> +		goto err;
> +
> +	cxlr_dc->cxlr_dax = cxlr_dax;
> +	xa_init(&cxlr_dc->dax_dev_list);
> +	cxlr->cxlr_dc = cxlr_dc;
> +	rc = devm_add_action_or_reset(&cxlr->dev, cxl_dc_region_release, cxlr);
> +	if (!rc)
> +		return 0;
> +err:
> +	put_device(dev);
> +	kfree(cxlr_dc);
> +	return rc;
> +}
> +
>  static int match_decoder_by_range(struct device *dev, void *data)
>  {
>  	struct range *r1, *r2 = data;
> @@ -3140,6 +3240,19 @@ static int is_system_ram(struct resource *res, void *arg)
>  	return 1;
>  }
>  
> +/*
> + * The region can not be manged by CXL if any portion of
> + * it is already online as 'System RAM'
> + */
> +static bool region_is_system_ram(struct cxl_region *cxlr,
> +				 struct cxl_region_params *p)
> +{
> +	return (walk_iomem_res_desc(IORES_DESC_NONE,
> +				    IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> +				    p->res->start, p->res->end, cxlr,
> +				    is_system_ram) > 0);
> +}
> +
>  static int cxl_region_probe(struct device *dev)
>  {
>  	struct cxl_region *cxlr = to_cxl_region(dev);
> @@ -3174,14 +3287,7 @@ static int cxl_region_probe(struct device *dev)
>  	case CXL_DECODER_PMEM:
>  		return devm_cxl_add_pmem_region(cxlr);
>  	case CXL_DECODER_RAM:
> -		/*
> -		 * The region can not be manged by CXL if any portion of
> -		 * it is already online as 'System RAM'
> -		 */
> -		if (walk_iomem_res_desc(IORES_DESC_NONE,
> -					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> -					p->res->start, p->res->end, cxlr,
> -					is_system_ram) > 0)
> +		if (region_is_system_ram(cxlr, p))
>  			return 0;
>  
>  		/*
> @@ -3193,6 +3299,17 @@ static int cxl_region_probe(struct device *dev)
>  
>  		/* HDM-H routes to device-dax */
>  		return devm_cxl_add_dax_region(cxlr);
> +	case CXL_DECODER_DC0:
> +	case CXL_DECODER_DC1:
> +	case CXL_DECODER_DC2:
> +	case CXL_DECODER_DC3:
> +	case CXL_DECODER_DC4:
> +	case CXL_DECODER_DC5:
> +	case CXL_DECODER_DC6:
> +	case CXL_DECODER_DC7:
> +		if (region_is_system_ram(cxlr, p))
> +			return 0;
> +		return devm_cxl_add_dc_region(cxlr);
>  	default:
>  		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
>  			cxlr->mode);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 8400af85d99f..7ac1237938b7 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -335,6 +335,14 @@ enum cxl_decoder_mode {
>  	CXL_DECODER_NONE,
>  	CXL_DECODER_RAM,
>  	CXL_DECODER_PMEM,
> +	CXL_DECODER_DC0,
> +	CXL_DECODER_DC1,
> +	CXL_DECODER_DC2,
> +	CXL_DECODER_DC3,
> +	CXL_DECODER_DC4,
> +	CXL_DECODER_DC5,
> +	CXL_DECODER_DC6,
> +	CXL_DECODER_DC7,
>  	CXL_DECODER_MIXED,
>  	CXL_DECODER_DEAD,
>  };
> @@ -345,6 +353,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  		[CXL_DECODER_NONE] = "none",
>  		[CXL_DECODER_RAM] = "ram",
>  		[CXL_DECODER_PMEM] = "pmem",
> +		[CXL_DECODER_DC0] = "dc0",
> +		[CXL_DECODER_DC1] = "dc1",
> +		[CXL_DECODER_DC2] = "dc2",
> +		[CXL_DECODER_DC3] = "dc3",
> +		[CXL_DECODER_DC4] = "dc4",
> +		[CXL_DECODER_DC5] = "dc5",
> +		[CXL_DECODER_DC6] = "dc6",
> +		[CXL_DECODER_DC7] = "dc7",
>  		[CXL_DECODER_MIXED] = "mixed",
>  	};
>  
> @@ -353,6 +369,11 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  	return "mixed";
>  }
>  
> +static inline bool decoder_mode_is_dc(enum cxl_decoder_mode mode)
> +{
> +	return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
> +}
> +
>  /*
>   * Track whether this decoder is reserved for region autodiscovery, or
>   * free for userspace provisioning.
> @@ -375,6 +396,7 @@ struct cxl_endpoint_decoder {
>  	struct cxl_decoder cxld;
>  	struct resource *dpa_res;
>  	resource_size_t skip;
> +	struct xarray skip_res;
>  	enum cxl_decoder_mode mode;
>  	enum cxl_decoder_state state;
>  	int pos;
> @@ -475,6 +497,11 @@ struct cxl_region_params {
>   */

> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index ccdf8de85bd5..eb5eb81bfbd7 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -23,11 +23,15 @@ static int cxl_dax_region_probe(struct device *dev)
>  	if (!dax_region)
>  		return -ENOMEM;
>  
> +	if (decoder_mode_is_dc(cxlr->mode))
Comment for this would be good to let people know why (even if
it goes away in the future).

> +		return 0;
> +
>  	data = (struct dev_dax_data) {
>  		.dax_region = dax_region,
>  		.id = -1,
>  		.size = range_len(&cxlr_dax->hpa_range),
>  	};
> +

*grumble*

>  	dev_dax = devm_create_dev_dax(&data);
>  	if (IS_ERR(dev_dax))
>  		return PTR_ERR(dev_dax);
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD)
  2023-06-22 15:07   ` Jonathan Cameron
@ 2023-06-22 16:37     ` Jonathan Cameron
  2023-06-27 14:59     ` Ira Weiny
  1 sibling, 0 replies; 55+ messages in thread
From: Jonathan Cameron @ 2023-06-22 16:37 UTC (permalink / raw)
  To: Ira Weiny; +Cc: Navneet Singh, Fan Ni, Dan Williams, linux-cxl

On Thu, 22 Jun 2023 16:07:36 +0100
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Thu, 15 Jun 2023 07:51:16 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > ira.weiny@ wrote:  
> > > I'm submitting these on behalf of Navneet.  There was a round of
> > > internal discussion which left a few questions but we want to get the
> > > public discussion going.  A first public preview was posted by Dan.[1]    
> > 
> > Apologies for not being clear and marking these appropriately.  I
> > intended these to be RFC to get the discussion moving forward.  I somewhat
> > rushed the submission.  Depending on where the comments in this submission
> > go I'll try and make a better determination if the next submission is RFC
> > or can be a proper V1.  (Although b4 will mark them v2...  I'll have to
> > deal with that.)  
> 
> Make sure your SoB is added after Navneet to reflect that you are
> handling the posting to the mailing list even if you feel changes are insufficient
> to merit a Co-developed-by tag. (no idea who is doing what :)
> 
Ah. I see there are comments on no sign off. Hmm. If it's legally fine
and you post it as an RFC only I don't see a strong argument for not
keeping that chain intact.

Jonathan

> Jonathan
> 
> 
> > 
> > Ira
> >   
> > > 
> > > The series has been rebased on the type-2 work posted from Dan.[2]  As
> > > discussed in the community call, not all of that series is required for
> > > these patches.  This will get rebased on the subset of those patches he
> > > is targeting for 6.5.  The series was tested using Fan Ni's Qemu DCD
> > > series.[3]
> > > 
> > > [cover letter]
> > > 
> > > A Dynamic Capacity Device (DCD) (CXL 3.0 spec 9.13.3) is a CXL memory
> > > device that implements dynamic capacity.  Dynamic capacity feature
> > > allows memory capacity to change dynamically, without the need for
> > > resetting the device.
> > > 
> > > Provide initial patches to enable DCD on non interleaving regions.
> > > Details:
> > > 
> > > - Get the dynamic capacity region information from cxl device and add
> > >   the advertised DC memory to driver managed resources
> > > - Get the device dynamic capacity extent list from the device and
> > >   maintain it in the host and add the preallocated memory to the host
> > > - Dynamic capacity region support
> > > - DCD region provisioning via Dax
> > > - Dynamic capacity event records
> > >         a. Add capacity Events
> > > 	b. Release capacity events
> > > 	c. Add the memory to the host dc region
> > > 	d. Release the memory from the host dc region
> > > - Trace Dynamic Capacity events
> > > - Send add capacity response to device
> > > - Send release dynamic capacity to device
> > > 
> > > Cc: Navneet Singh <navneet.singh@intel.com>
> > > Cc: Fan Ni <fan.ni@samsung.com>
> > > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > > Cc: Ira Weiny <ira.weiny@intel.com>
> > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > Cc: linux-cxl@vger.kernel.org
> > > 
> > > [1] https://lore.kernel.org/all/64326437c1496_934b2949f@dwillia2-mobl3.amr.corp.intel.com.notmuch/
> > > [2] https://lore.kernel.org/all/168592149709.1948938.8663425987110396027.stgit@dwillia2-xfh.jf.intel.com/
> > > [3] https://lore.kernel.org/all/6483946e8152f_f1132294a2@iweiny-mobl.notmuch/
> > > 
> > > ---
> > > Navneet Singh (5):
> > >       cxl/mem : Read Dynamic capacity configuration from the device
> > >       cxl/region: Add dynamic capacity cxl region support.
> > >       cxl/mem : Expose dynamic capacity configuration to userspace
> > >       cxl/mem: Add support to handle DCD add and release capacity events.
> > >       cxl/mem: Trace Dynamic capacity Event Record
> > > 
> > >  drivers/cxl/Kconfig       |  11 +
> > >  drivers/cxl/core/core.h   |   7 +
> > >  drivers/cxl/core/hdm.c    | 234 ++++++++++++++++++--
> > >  drivers/cxl/core/mbox.c   | 540 +++++++++++++++++++++++++++++++++++++++++++++-
> > >  drivers/cxl/core/memdev.c |  72 +++++++
> > >  drivers/cxl/core/port.c   |  18 ++
> > >  drivers/cxl/core/region.c | 337 ++++++++++++++++++++++++++++-
> > >  drivers/cxl/core/trace.h  |  68 +++++-
> > >  drivers/cxl/cxl.h         |  32 ++-
> > >  drivers/cxl/cxlmem.h      | 146 ++++++++++++-
> > >  drivers/cxl/pci.c         |  14 +-
> > >  drivers/dax/bus.c         |  11 +-
> > >  drivers/dax/bus.h         |   5 +-
> > >  drivers/dax/cxl.c         |   4 +
> > >  14 files changed, 1453 insertions(+), 46 deletions(-)
> > > ---
> > > base-commit: 034a16d0165be3e092d60685be7b1b05e6f3059b
> > > change-id: 20230604-dcd-type2-upstream-0cd15f6216fd
> > > 
> > > Best regards,
> > > -- 
> > > Ira Weiny <ira.weiny@intel.com>
> > >     
> > 
> >   
> 
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/5] cxl/mem: Add support to handle DCD add and release capacity events.
  2023-06-14 19:16 ` [PATCH 4/5] cxl/mem: Add support to handle DCD add and release capacity events ira.weiny
  2023-06-15  2:19   ` Alison Schofield
  2023-06-15 16:58   ` Dave Jiang
@ 2023-06-22 17:01   ` Jonathan Cameron
  2023-06-29 15:19     ` Ira Weiny
  2023-06-27 18:17   ` Fan Ni
  2023-07-13 12:55   ` Jørgen Hansen
  4 siblings, 1 reply; 55+ messages in thread
From: Jonathan Cameron @ 2023-06-22 17:01 UTC (permalink / raw)
  To: ira.weiny; +Cc: Navneet Singh, Fan Ni, Dan Williams, linux-cxl

On Wed, 14 Jun 2023 12:16:31 -0700
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> A dynamic capacity device utilizes events to signal the host about the
> changes to the allocation of DC blocks. The device communicates the
> state of these blocks of dynamic capacity through an extent list that
> describes the starting DPA and length of all blocks the host can access.
> 
> Based on the dynamic capacity add or release event type,
> dynamic memory represented by the extents are either added
> or removed as devdax device.
> 
> Process the dynamic capacity add and release events.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> 
Hi,

I ran out of time today and will be traveling next few weeks (may have
review time, may not) so sending what I have on basis it might be useful.

Jonathan

> +static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
> +				struct cxl_mbox_dc_response *res,
> +				int extent_cnt, int opcode)
> +{
> +	struct cxl_mbox_cmd mbox_cmd;
> +	int rc, size;
> +
> +	size = struct_size(res, extent_list, extent_cnt);
> +	res->extent_list_size = cpu_to_le32(extent_cnt);
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = opcode,
> +		.size_in = size,
> +		.payload_in = res,
> +	};
> +
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +
> +	return rc;
return cxl_..

> +
> +}
> +
> +static int cxl_prepare_ext_list(struct cxl_mbox_dc_response **res,
> +					int *n, struct range *extent)
> +{
> +	struct cxl_mbox_dc_response *dc_res;
> +	unsigned int size;
> +
> +	if (!extent)
> +		size = struct_size(dc_res, extent_list, 0);
> +	else
> +		size = struct_size(dc_res, extent_list, *n + 1);
> +
> +	dc_res = krealloc(*res, size, GFP_KERNEL);
> +	if (!dc_res)
> +		return -ENOMEM;
> +
> +	if (extent) {
> +		dc_res->extent_list[*n].dpa_start = cpu_to_le64(extent->start);
> +		memset(dc_res->extent_list[*n].reserved, 0, 8);
> +		dc_res->extent_list[*n].length =
> +				cpu_to_le64(range_len(extent));
> +		(*n)++;
> +	}
> +
> +	*res = dc_res;
> +	return 0;
> +}
blank line.

> +/**
> + * cxl_handle_dcd_event_records() - Read DCD event records.
> + * @mds: The memory device state

>  
> +int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> +			      unsigned int *extent_gen_num)
> +{
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_mbox_dc_extents *dc_extents;
> +	struct cxl_mbox_get_dc_extent get_dc_extent;
> +	unsigned int total_extent_cnt;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	int rc;
> +
> +	/* Check GET_DC_EXTENT_LIST is supported by device */
> +	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
> +		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
> +		return 0;
> +	}
> +
> +	dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);

Put it on the stack - length is fixed and small if requesting 0
extents


> +	if (!dc_extents)
> +		return -ENOMEM;
> +
> +	get_dc_extent = (struct cxl_mbox_get_dc_extent) {
> +		.extent_cnt = 0,
> +		.start_extent_index = 0,
> +	};
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> +		.payload_in = &get_dc_extent,
> +		.size_in = sizeof(get_dc_extent),
> +		.size_out = mds->payload_size,
> +		.payload_out = dc_extents,
> +		.min_out = 1,
> +	};
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	if (rc < 0)
> +		goto out;
> +
> +	total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> +	*extent_gen_num = le32_to_cpu(dc_extents->extent_list_num);
> +	dev_dbg(dev, "Total extent count :%d Extent list Generation Num: %d\n",
> +			total_extent_cnt, *extent_gen_num);
> +out:
> +
> +	kvfree(dc_extents);
> +	if (rc < 0)
> +		return rc;
> +
> +	return total_extent_cnt;
> +
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_get_dc_extent_cnt, CXL);



> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 144232c8305e..ba45c1c3b0a9 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1,6 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  /* Copyright(c) 2022 Intel Corporation. All rights reserved. */
>  #include <linux/memregion.h>
> +#include <linux/interrupt.h>
>  #include <linux/genalloc.h>
>  #include <linux/device.h>
>  #include <linux/module.h>
> @@ -11,6 +12,8 @@
>  #include <cxlmem.h>
>  #include <cxl.h>
>  #include "core.h"
> +#include "../../dax/bus.h"
> +#include "../../dax/dax-private.h"
>  
>  /**
>   * DOC: cxl core region
> @@ -166,6 +169,38 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
>  	return 0;
>  }
>  
> +static int cxl_region_manage_dc(struct cxl_region *cxlr)
> +{
> +	struct cxl_region_params *p = &cxlr->params;
> +	unsigned int extent_gen_num;
> +	int i, rc;
> +
> +	/* Designed for Non Interleaving flow with the assumption one
> +	 * cxl_region will map the complete device DC region's DPA range
> +	 */
> +	for (i = 0; i < p->nr_targets; i++) {
> +		struct cxl_endpoint_decoder *cxled = p->targets[i];
> +		struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +		struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +
> +		rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
> +		if (rc < 0)
> +			goto err;
> +		else if (rc > 1) {
> +			rc = cxl_dev_get_dc_extents(mds, rc, 0);
> +			if (rc < 0)
> +				goto err;
> +			mds->num_dc_extents = rc;
> +			mds->dc_extents_index = rc - 1;
> +		}
> +		mds->dc_list_gen_num = extent_gen_num;
> +		dev_dbg(mds->cxlds.dev, "No of preallocated extents :%d\n", rc);
> +	}
> +	return 0;
> +err:
> +	return rc;

Direct returns easier to review.  

> +}
> +
>  static int commit_decoder(struct cxl_decoder *cxld)
>  {
>  	struct cxl_switch_decoder *cxlsd = NULL;
> @@ -2865,11 +2900,14 @@ static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
>  		return PTR_ERR(cxlr_dax);
>  
>  	cxlr_dc = kzalloc(sizeof(*cxlr_dc), GFP_KERNEL);
> -	if (!cxlr_dc) {
> -		rc = -ENOMEM;
> -		goto err;
> -	}
> +	if (!cxlr_dc)
> +		return -ENOMEM;

Curious.  Looks like a bug from earlier.


>  
> +	rc = request_module("dax_cxl");
> +	if (rc) {
> +		dev_err(dev, "failed to load dax-ctl module\n");
> +		goto load_err;
> +	}
>  	dev = &cxlr_dax->dev;
>  	rc = dev_set_name(dev, "dax_region%d", cxlr->id);
>  	if (rc)
> @@ -2891,10 +2929,24 @@ static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
>  	xa_init(&cxlr_dc->dax_dev_list);
>  	cxlr->cxlr_dc = cxlr_dc;
>  	rc = devm_add_action_or_reset(&cxlr->dev, cxl_dc_region_release, cxlr);
> -	if (!rc)
> -		return 0;
> +	if (rc)
> +		goto err;
> +
> +	if (!dev->driver) {
> +		dev_err(dev, "%s Driver not attached\n", dev_name(dev));
> +		rc = -ENXIO;
> +		goto err;
> +	}
> +
> +	rc = cxl_region_manage_dc(cxlr);
> +	if (rc)
> +		goto err;
> +
> +	return 0;
> +
>  err:
>  	put_device(dev);
> +load_err:
>  	kfree(cxlr_dc);

I've lost track, but seems unlikely we now need to free this in all paths and didn't before. Doesn't
the cxl_dc_region_Release deal with it?

>  	return rc;
>  }
> @@ -3076,6 +3128,156 @@ struct cxl_region *cxl_create_region(struct cxl_root_decoder *cxlrd,
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_create_region, CXL);



> +
> +int cxl_release_dc_extent(struct cxl_memdev_state *mds,
> +			  struct range *rel_range)
> +{
> +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct cxl_dc_region *cxlr_dc;
> +	struct dax_region *dax_region;
> +	resource_size_t dpa_offset;
> +	struct cxl_region *cxlr;
> +	struct range hpa_range;
> +	struct dev_dax *dev_dax;
> +	resource_size_t hpa;
> +	struct device *dev;
> +	int ranges, rc = 0;






> +int cxl_add_dc_extent(struct cxl_memdev_state *mds, struct range *alloc_range)
> +{
...

> +	/*
> +	 * Find the cxl endpoind decoder with which has the extent dpa range and
> +	 * get the cxl_region, dax_region refrences.
> +	 */
> +	dev = device_find_child(&cxlmd->endpoint->dev, alloc_range,
> +				match_ep_decoder_by_range);
> +	if (!dev) {
> +		dev_err(mds->cxlds.dev, "%pr not mapped\n",	alloc_range);

Odd spacing. (Tab?)

> +		return PTR_ERR(dev);
> +	}
> +

...

> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 9c0b2fa72bdd..0440b5c04ef6 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h


>  /**
> @@ -296,6 +298,13 @@ enum cxl_devtype {
>  #define CXL_MAX_DC_REGION 8
>  #define CXL_DC_REGION_SRTLEN 8
>  
> +struct cxl_dc_extent_data {
> +	u64 dpa_start;
> +	u64 length;
> +	u8 tag[16];

Define for this length probably makes sense. It's non obvious.

> +	u16 shared_extent_seq;
> +};

> +
> +struct cxl_mbox_dc_response {
> +	__le32 extent_list_size;
> +	u8 reserved[4];
> +	struct updated_extent_list {
> +		__le64 dpa_start;
> +		__le64 length;
> +		u8 reserved[8];
> +	} __packed extent_list[];

Going to need this in multiple places (e.g. release) so factor out.


> +} __packed;
> +



> @@ -826,6 +894,14 @@ int cxl_trigger_poison_list(struct cxl_memdev *cxlmd);
>  int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa);
>  int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa);
>  
> +/* FIXME why not have these be static in mbox.c? */

:)

> +int cxl_add_dc_extent(struct cxl_memdev_state *mds, struct range *alloc_range);
> +int cxl_release_dc_extent(struct cxl_memdev_state *mds, struct range *rel_range);
> +int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> +			      unsigned int *extent_gen_num);
> +int cxl_dev_get_dc_extents(struct cxl_memdev_state *mds, unsigned int cnt,
> +			   unsigned int index);
> +
>  #ifdef CONFIG_CXL_SUSPEND
>  void cxl_mem_active_inc(void);
>  void cxl_mem_active_dec(void);
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index ac1a41bc083d..558ffbcb9b34 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -522,8 +522,8 @@ static int cxl_event_req_irq(struct cxl_dev_state *cxlds, u8 setting)
>  		return irq;
>  
>  	return devm_request_threaded_irq(dev, irq, NULL, cxl_event_thread,
> -					 IRQF_SHARED | IRQF_ONESHOT, NULL,
> -					 dev_id);
> +					IRQF_SHARED | IRQF_ONESHOT, NULL,
> +					dev_id);

No comment. :)

>  }
>  
>  static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
> @@ -555,6 +555,7 @@ static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
>  		.warn_settings = CXL_INT_MSI_MSIX,
>  		.failure_settings = CXL_INT_MSI_MSIX,
>  		.fatal_settings = CXL_INT_MSI_MSIX,
> +		.dyncap_settings = CXL_INT_MSI_MSIX,
>  	};
>  
>  	mbox_cmd = (struct cxl_mbox_cmd) {
> @@ -608,6 +609,11 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds)
>  		return rc;
>  	}
>  
> +	rc = cxl_event_req_irq(cxlds, policy.dyncap_settings);
> +	if (rc) {
> +		dev_err(cxlds->dev, "Failed to get interrupt for event dc log\n");
> +		return rc;
> +	}

Blank line to maintain existing style.

>  	return 0;
>  }
>  
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 227800053309..b2b27033f589 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -434,7 +434,7 @@ static void free_dev_dax_ranges(struct dev_dax *dev_dax)

...

> +EXPORT_SYMBOL_GPL(alloc_dev_dax_range);
> +
Single blank line seems to be style in this fiel.
>  
>  static int adjust_dev_dax_range(struct dev_dax *dev_dax, struct resource *res, resource_size_t size)
>  {
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index 8cd79ab34292..aa8418c7aead 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -47,8 +47,11 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
>  	__dax_driver_register(driver, THIS_MODULE, KBUILD_MODNAME)
>  void dax_driver_unregister(struct dax_device_driver *dax_drv);
>  void kill_dev_dax(struct dev_dax *dev_dax);
> +void unregister_dev_dax(void *dev);
> +void unregister_dax_mapping(void *data);
>  bool static_dev_dax(struct dev_dax *dev_dax);
> -
> +int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
> +					resource_size_t size);

Keep a blank line here..

>  /*
>   * While run_dax() is potentially a generic operation that could be
>   * defined in include/linux/dax.h we don't want to grow any users
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support.
  2023-06-21 17:20   ` Fan Ni
@ 2023-06-23 18:02     ` Ira Weiny
  0 siblings, 0 replies; 55+ messages in thread
From: Ira Weiny @ 2023-06-23 18:02 UTC (permalink / raw)
  To: Fan Ni, ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl,
	a.manzanares, dave, nmtadam.samsung, nifan

Fan Ni wrote:
> The 06/14/2023 12:16, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 

[snip]

> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > index 543c4499379e..144232c8305e 100644
> > --- a/drivers/cxl/core/region.c
> > +++ b/drivers/cxl/core/region.c
> > @@ -1733,7 +1733,7 @@ static int cxl_region_attach(struct cxl_region *cxlr,
> >  	lockdep_assert_held_write(&cxl_region_rwsem);
> >  	lockdep_assert_held_read(&cxl_dpa_rwsem);
> >  
> > -	if (cxled->mode != cxlr->mode) {
> > +	if (decoder_mode_is_dc(cxlr->mode) && !decoder_mode_is_dc(cxled->mode)) {
> For mode other than dc, no check will be performed, is that what we
> want?
> 

:-/  Yes, looks like I may have screwed up the logic here thanks.  But this
code is changing because after this thread Navneet and I decided to introduce a
new cxl_region_mode enum which should clarify this check.

[snip]

> > +
> > +static ssize_t create_dc_region_store(struct device *dev,
> > +				      struct device_attribute *attr,
> > +				      const char *buf, size_t len)
> > +{
> > +	/*
> > +	 * All DC regions use decoder mode DC0 as the region does not need the
> > +	 * index information
> > +	 */
> > +	return store_dcN_region(to_cxl_root_decoder(dev), buf,
> > +				CXL_DECODER_DC0, len);
> If all DC regions use DC0, what will CXL_DECODER_DC1~7 be used for?

Before sending the patches it did not set well with me that the mode for cxl
region was not longer 1:1 with endpoint decoder mode.  I basically hacked in
the idea that DC0 decoder mode would represent DC region mode.  But this is
really hacky.  So this is why we have introduced cxl_region_mode which
represents ram, pmem, or DC in v2.  I'm still squashing in all the changes and
clean ups and should post something soon.

Ira

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device
  2023-06-22 15:58   ` Jonathan Cameron
@ 2023-06-24 13:08     ` Ira Weiny
  2023-07-03  2:29       ` Jonathan Cameron
  0 siblings, 1 reply; 55+ messages in thread
From: Ira Weiny @ 2023-06-24 13:08 UTC (permalink / raw)
  To: Jonathan Cameron, ira.weiny
  Cc: Navneet Singh, Fan Ni, Dan Williams, linux-cxl

Jonathan Cameron wrote:
> On Wed, 14 Jun 2023 12:16:28 -0700
> ira.weiny@intel.com wrote:
> 
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > Read the Dynamic capacity configuration and store dynamic capacity region
> > information in the device state which driver will use to map into the HDM
> > ranges.
> > 
> > Implement Get Dynamic Capacity Configuration (opcode 4800h) mailbox
> > command as specified in CXL 3.0 spec section 8.2.9.8.9.1.
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> 
> Hi Ira / Navneet,
> 
> I'll probably overlap with comments of others (good to see so much review!)

Indeed!  Thanks!

> so feel free to ignore duplication.
> 
> Comments inline,
> 
> Jonathan
> 
> 
> 
> > +/**
> > + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> > + * information from the device.
> > + * @mds: The memory device state
> > + * Return: 0 if identify was executed successfully.
> > + *
> > + * This will dispatch the get_dynamic_capacity command to the device
> > + * and on success populate structures to be exported to sysfs.
> > + */
> > +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> > +{
> > +	struct cxl_dev_state *cxlds = &mds->cxlds;
> > +	struct device *dev = cxlds->dev;
> > +	struct cxl_mbox_dynamic_capacity *dc;
> 
> Calling it dc is confusing.  I'd make it clear this is the mailbox
> response. config_resp or dc_config_res.

How about dc_resp?

> 
> > +	struct cxl_mbox_get_dc_config get_dc;
> > +	struct cxl_mbox_cmd mbox_cmd;
> > +	u64 next_dc_region_start;
> > +	int rc, i;
> > +
> > +	for (i = 0; i < CXL_MAX_DC_REGION; i++)
> > +		sprintf(mds->dc_region[i].name, "dc%d", i);
> > +
> > +	/* Check GET_DC_CONFIG is supported by device */
> > +	if (!test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds)) {
> > +		dev_dbg(dev, "unsupported cmd: get_dynamic_capacity_config\n");
> > +		return 0;
> > +	}
> > +
> > +	dc = kvmalloc(mds->payload_size, GFP_KERNEL);
> > +	if (!dc)
> > +		return -ENOMEM;
> 
> Response to CXL_MBOX_OP_GET_DC_CONFIG has a known maximum
> size. Can we provide that instead of potentially much larger?
> 
> 8 + 0x28 * 8 I think so 328 bytes. Use struct_size()

Actually yea and just putting that on the stack might also be better.

> 
> 
> But fun corner.... Mailbox is allowed to be smaller than that (256 bytes min
> I think) so need to handle multiple reads with different start regions.

Oh bother.  :-/

What are the chances a device is going to only support 256B and DC?  I think
you are correct though.  I'll add a loop to handle this possibility.

Anyway I've adjusted the algorithm...  Hopefully it will just loop 1 time.

> Which reminds me that we need to add support for running out of space
> in the mailbox to qemu... So far we've just made sure everything fitted :)

Might be nice to test stuff.

> 
> 
> > +
> > +	get_dc = (struct cxl_mbox_get_dc_config) {
> > +		.region_count = CXL_MAX_DC_REGION,
> > +		.start_region_index = 0,
> > +	};
> > +
> > +	mbox_cmd = (struct cxl_mbox_cmd) {
> > +		.opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> > +		.payload_in = &get_dc,
> > +		.size_in = sizeof(get_dc),
> > +		.size_out = mds->payload_size,
> > +		.payload_out = dc,
> > +		.min_out = 1,
> > +	};
> > +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> > +	if (rc < 0)
> > +		goto dc_error;
> The error label is a bit too generic.  Why dc_error?
> "error" conveys just as small amount of info.  I'd got for goto free_resp;

Sure.  But if we make dc on the stack then we get rid of this entirely.  I
prefer that.

> 
> > +
> > +	mds->nr_dc_region = dc->avail_region_count;
> > +
> > +	if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> > +		dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> > +			mds->nr_dc_region);
> > +		rc = -EINVAL;
> > +		goto dc_error;
> > +	}
> > +
> > +	for (i = 0; i < mds->nr_dc_region; i++) {
> > +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> > +
> > +		dcr->base = le64_to_cpu(dc->region[i].region_base);
> > +		dcr->decode_len =
> > +			le64_to_cpu(dc->region[i].region_decode_length);
> > +		dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
> > +		dcr->len = le64_to_cpu(dc->region[i].region_length);
> > +		dcr->blk_size = le64_to_cpu(dc->region[i].region_block_size);
> > +
> > +		/* Check regions are in increasing DPA order */
> > +		if ((i + 1) < mds->nr_dc_region) {
> 
> Feels a bit odd to look at entries we haven't seen yet.  Maybe flip this around
> to check the ones we have looked at?

Totally agree.  I already did that.

> So don't start until 2nd region and then check
> it's start against mds->dc_region[0] etc?

Yep!  It makes the check easier...

> Or factor out this loop contents in general and just pass in the single
> value needed for checking this. Biggest advantage being direct returns in
> that function as allocation and free will be in caller.

Did that too.  I did not like how big cxl_dev_dynamic_capacity_identify() was
getting, especially with possibility of having to loop for a 2nd time.

> 
> 
> > +			next_dc_region_start =
> > +				le64_to_cpu(dc->region[i + 1].region_base);
> > +			if ((dcr->base > next_dc_region_start) ||
> > +			    ((dcr->base + dcr->decode_len) > next_dc_region_start)) {
> 
> Unless you have a negative decode length the second condition includes the first.
> So just check that.

... exactly!

> 
> > +				dev_err(dev,
> > +					"DPA ordering violation for DC region %d and %d\n",
> > +					i, i + 1);
> > +				rc = -EINVAL;
> > +				goto dc_error;
> > +			}
> > +		}
> > +
> > +		/* Check the region is 256 MB aligned */
> > +		if (!IS_ALIGNED(dcr->base, SZ_256M)) {
> 
> That's an oddity. I wonder why those lower bits where defined as reserved...
> Anyhow code is right if paranoid ;)

:shrug:

> 
> > +			dev_err(dev, "DC region %d not aligned to 256MB\n", i);
> > +			rc = -EINVAL;
> > +			goto dc_error;
> > +		}
> > +
> > +		/* Check Region base and length are aligned to block size */
> > +		if (!IS_ALIGNED(dcr->base, dcr->blk_size) ||
> > +		    !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> > +			dev_err(dev, "DC region %d not aligned to %#llx\n", i,
> > +				dcr->blk_size);
> > +			rc = -EINVAL;
> > +			goto dc_error;
> > +		}
> > +
> > +		dcr->dsmad_handle =
> > +			le32_to_cpu(dc->region[i].region_dsmad_handle);
> > +		dcr->flags = dc->region[i].flags;
> 
> I'd just grab these at same time as all the other fields above.
> A pattern where you fill values in only after checking would be fine, or one
> where you fill them in all in one place. The mixture of the two is less clear
> than either consistent approach.

Ok, yea this did seem odd but I kind of ignored it.  Done.

> 
> > +		sprintf(dcr->name, "dc%d", i);

I may take this out too now that we always set the name regardless of if the
region is available.

Although...  I wonder if setting the name to something like '<nil>' by default
would be beneficial in some way?  :-/

> > +
> > +		dev_dbg(dev,
> > +			"DC region %s DPA: %#llx LEN: %#llx BLKSZ: %#llx\n",
> > +			dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
> > +	}
> > +
> > +	/*
> > +	 * Calculate entire DPA range of all configured regions which will be mapped by
> > +	 * one or more HDM decoders
> > +	 */
> > +	mds->total_dynamic_capacity =
> > +		mds->dc_region[mds->nr_dc_region - 1].base +
> > +		mds->dc_region[mds->nr_dc_region - 1].decode_len -
> > +		mds->dc_region[0].base;
> > +	dev_dbg(dev, "Total dynamic capacity: %#llx\n",
> > +		mds->total_dynamic_capacity);
> > +
> > +dc_error:
> > +	kvfree(dc);
> > +	return rc;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> > +
> 
> 
> 
> > @@ -1121,13 +1289,23 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> >  	}
> >  
> >  	cxlds->dpa_res =
> > -		(struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
> > +		(struct resource)DEFINE_RES_MEM(0, mds->total_capacity);
> > +
> > +	for (int i = 0; i < CXL_MAX_DC_REGION; i++) {
> > +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> > +
> > +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
> > +				 dcr->base, dcr->decode_len, dcr->name);
> > +		if (rc)
> > +			return rc;
> > +	}
> >  
> >  	if (mds->partition_align_bytes == 0) {
> >  		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
> >  				 mds->volatile_only_bytes, "ram");
> >  		if (rc)
> >  			return rc;
> > +
> 
> Scrub for this stuff before posting v2. Just noise that slows down review
> a little.  If it is worth doing, do it in a separate patch.

I believe Alison or Dave caught that already.

> 
> >  		return add_dpa_res(dev, &cxlds->dpa_res, &cxlds->pmem_res,
> >  				   mds->volatile_only_bytes,
> >  				   mds->persistent_only_bytes, "pmem");
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index 89e560ea14c0..9c0b2fa72bdd 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
> >
> >  
> > +#define CXL_MAX_DC_REGION 8
> > +#define CXL_DC_REGION_SRTLEN 8
> 
> SRT? 

LOL oh yea 'STR'...  :-D

> 
> > +
> >  /**
> >   * struct cxl_dev_state - The driver device state
> >   *
> > @@ -300,6 +312,8 @@ enum cxl_devtype {
> >   * @dpa_res: Overall DPA resource tree for the device
> >   * @pmem_res: Active Persistent memory capacity configuration
> >   * @ram_res: Active Volatile memory capacity configuration
> > + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> > + *          region
> >   * @component_reg_phys: register base of component registers
> >   * @info: Cached DVSEC information about the device.
> >   * @serial: PCIe Device Serial Number
> > @@ -315,6 +329,7 @@ struct cxl_dev_state {
> >  	struct resource dpa_res;
> >  	struct resource pmem_res;
> >  	struct resource ram_res;
> > +	struct resource dc_res[CXL_MAX_DC_REGION];
> >  	resource_size_t component_reg_phys;
> >  	u64 serial;
> >  	enum cxl_devtype type;
> 
> ...
> 
> > @@ -357,9 +379,13 @@ struct cxl_memdev_state {
> >  	size_t lsa_size;
> >  	struct mutex mbox_mutex; /* Protects device mailbox and firmware */
> >  	char firmware_version[0x10];
> > +	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
> >  	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
> >  	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> > -	u64 total_bytes;
> > +
> > +	u64 total_capacity;
> > +	u64 total_static_capacity;
> > +	u64 total_dynamic_capacity;
> >  	u64 volatile_only_bytes;
> >  	u64 persistent_only_bytes;
> >  	u64 partition_align_bytes;
> > @@ -367,6 +393,20 @@ struct cxl_memdev_state {
> >  	u64 active_persistent_bytes;
> >  	u64 next_volatile_bytes;
> >  	u64 next_persistent_bytes;
> > +
> > +	u8 nr_dc_region;
> > +
> > +	struct cxl_dc_region_info {
> > +		u8 name[CXL_DC_REGION_SRTLEN];
> 
> char? SRT?  Also isn't it a bit big? Looks like max 4 chars to me.
> Put it next to flags and we can save some space.

Well if I go forward with the idea of having them named something like '<nil>'
this would need to be longer than 4.

> 
> > +		u64 base;
> > +		u64 decode_len;
> > +		u64 len;
> > +		u64 blk_size;
> > +		u32 dsmad_handle;
> > +		u8 flags;
> > +	} dc_region[CXL_MAX_DC_REGION];
> > +
> > +	size_t dc_event_log_size;
> 
> >  /*
> > @@ -617,7 +662,27 @@ struct cxl_mbox_set_partition_info {
> >  	u8 flags;
> >  } __packed;
> >  
> > +struct cxl_mbox_get_dc_config {
> > +	u8 region_count;
> > +	u8 start_region_index;
> > +} __packed;
> > +
> > +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> > +struct cxl_mbox_dynamic_capacity {
> > +	u8 avail_region_count;
> > +	u8 rsvd[7];
> > +	struct cxl_dc_region_config {
> > +		__le64 region_base;
> > +		__le64 region_decode_length;
> > +		__le64 region_length;
> > +		__le64 region_block_size;
> > +		__le32 region_dsmad_handle;
> > +		u8 flags;
> > +		u8 rsvd[3];
> > +	} __packed region[];
> > +} __packed;
> >  #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
> 
> This looks to have merged oddly with existing changes. I'd
> move the define into the structure definition so ti's clear which
> flag it reflects and avoids this sort of interleaving in future.

That has not been done anywhere in this file.  I think in this case it was just
a mistake to separate the partition define from the partition structure.  We
already fixed that based on Alisons feedback.

What I did do is move CXL_DC_REGION_STRLEN next to cxl_dc_region_info and made
it 7 chars to compact the structure a bit.  I think having that define next to
the structure helps to show why the odd length.

While we are at it I'm changing the sprintf's to snprintf's.  We are not in a
fast path and I'm paranoid now.

Ira

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD)
  2023-06-22 15:07   ` Jonathan Cameron
  2023-06-22 16:37     ` Jonathan Cameron
@ 2023-06-27 14:59     ` Ira Weiny
  1 sibling, 0 replies; 55+ messages in thread
From: Ira Weiny @ 2023-06-27 14:59 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Navneet Singh, Fan Ni, Dan Williams, linux-cxl

Jonathan Cameron wrote:
> On Thu, 15 Jun 2023 07:51:16 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > ira.weiny@ wrote:
> > > I'm submitting these on behalf of Navneet.  There was a round of
> > > internal discussion which left a few questions but we want to get the
> > > public discussion going.  A first public preview was posted by Dan.[1]  
> > 
> > Apologies for not being clear and marking these appropriately.  I
> > intended these to be RFC to get the discussion moving forward.  I somewhat
> > rushed the submission.  Depending on where the comments in this submission
> > go I'll try and make a better determination if the next submission is RFC
> > or can be a proper V1.  (Although b4 will mark them v2...  I'll have to
> > deal with that.)
> 
> Make sure your SoB is added after Navneet to reflect that you are
> handling the posting to the mailing list even if you feel changes are insufficient
> to merit a Co-developed-by tag. (no idea who is doing what :)

Apologies.  I've never really posted for someone else whom I've not
developed the patches for.

V2 will have my tag on it.

Ira

> 
> Jonathan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/5] cxl/mem: Add support to handle DCD add and release capacity events.
  2023-06-14 19:16 ` [PATCH 4/5] cxl/mem: Add support to handle DCD add and release capacity events ira.weiny
                     ` (2 preceding siblings ...)
  2023-06-22 17:01   ` Jonathan Cameron
@ 2023-06-27 18:17   ` Fan Ni
  2023-07-13 12:55   ` Jørgen Hansen
  4 siblings, 0 replies; 55+ messages in thread
From: Fan Ni @ 2023-06-27 18:17 UTC (permalink / raw)
  To: ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl,
	a.manzanares, dave, nmtadam.samsung, nifan

The 06/14/2023 12:16, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> A dynamic capacity device utilizes events to signal the host about the
> changes to the allocation of DC blocks. The device communicates the
> state of these blocks of dynamic capacity through an extent list that
> describes the starting DPA and length of all blocks the host can access.
> 
> Based on the dynamic capacity add or release event type,
> dynamic memory represented by the extents are either added
> or removed as devdax device.
> 
> Process the dynamic capacity add and release events.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> 
> ---
> [iweiny: Remove invalid comment]
> ---
>  drivers/cxl/core/mbox.c   | 345 +++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/cxl/core/region.c | 214 +++++++++++++++++++++++++++-
>  drivers/cxl/core/trace.h  |   3 +-
>  drivers/cxl/cxl.h         |   4 +-
>  drivers/cxl/cxlmem.h      |  76 ++++++++++
>  drivers/cxl/pci.c         |  10 +-
>  drivers/dax/bus.c         |  11 +-
>  drivers/dax/bus.h         |   5 +-
>  8 files changed, 652 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index c5b696737c87..db9295216de5 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -767,6 +767,14 @@ static const uuid_t log_uuid[] = {
>  	[VENDOR_DEBUG_UUID] = DEFINE_CXL_VENDOR_DEBUG_UUID,
>  };
>  
> +/* See CXL 3.0 8.2.9.2.1.5 */
> +enum dc_event {
> +	ADD_CAPACITY,
> +	RELEASE_CAPACITY,
> +	FORCED_CAPACITY_RELEASE,
> +	REGION_CONFIGURATION_UPDATED,
> +};
> +
>  /**
>   * cxl_enumerate_cmds() - Enumerate commands for a device.
>   * @mds: The driver data for the operation
> @@ -852,6 +860,14 @@ static const uuid_t mem_mod_event_uuid =
>  	UUID_INIT(0xfe927475, 0xdd59, 0x4339,
>  		  0xa5, 0x86, 0x79, 0xba, 0xb1, 0x13, 0xb7, 0x74);
>  
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.0 section 8.2.9.2.1.3; Table 8-45
> + */
> +static const uuid_t dc_event_uuid =
> +	UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c,
> +		0x2f, 0x95, 0x26, 0x8e, 0x10, 0x1a, 0x2a);
> +
>  static void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>  				   enum cxl_event_log_type type,
>  				   struct cxl_event_record_raw *record)
> @@ -945,6 +961,188 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
>  	return rc;
>  }
>  
> +static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
> +				struct cxl_mbox_dc_response *res,
> +				int extent_cnt, int opcode)
> +{
> +	struct cxl_mbox_cmd mbox_cmd;
> +	int rc, size;
> +
> +	size = struct_size(res, extent_list, extent_cnt);
> +	res->extent_list_size = cpu_to_le32(extent_cnt);
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = opcode,
> +		.size_in = size,
> +		.payload_in = res,
> +	};
> +
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +
> +	return rc;
> +
unwanted blank line.
> +}
> +
> +static int cxl_prepare_ext_list(struct cxl_mbox_dc_response **res,
> +					int *n, struct range *extent)
> +{
> +	struct cxl_mbox_dc_response *dc_res;
> +	unsigned int size;
> +
> +	if (!extent)
> +		size = struct_size(dc_res, extent_list, 0);
> +	else
> +		size = struct_size(dc_res, extent_list, *n + 1);
> +
> +	dc_res = krealloc(*res, size, GFP_KERNEL);
> +	if (!dc_res)
> +		return -ENOMEM;
> +
> +	if (extent) {
> +		dc_res->extent_list[*n].dpa_start = cpu_to_le64(extent->start);
> +		memset(dc_res->extent_list[*n].reserved, 0, 8);
> +		dc_res->extent_list[*n].length =
> +				cpu_to_le64(range_len(extent));
> +		(*n)++;
> +	}
> +
> +	*res = dc_res;
> +	return 0;
> +}
As mentioned by existing comments, need a blank line here.
> +/**
> + * cxl_handle_dcd_event_records() - Read DCD event records.
> + * @mds: The memory device state
> + *
> + * Returns 0 if enumerate completed successfully.
> + *
> + * CXL devices can generate DCD events to add or remove extents in the list.
> + */
> +static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
> +					struct cxl_event_record_raw *rec)
> +{
> +	struct cxl_mbox_dc_response *dc_res = NULL;
> +	struct device *dev = mds->cxlds.dev;
> +	uuid_t *id = &rec->hdr.id;
> +	struct dcd_event_dyn_cap *record =
> +			(struct dcd_event_dyn_cap *)rec;
> +	int extent_cnt = 0, rc = 0;
> +	struct cxl_dc_extent_data *extent;
> +	struct range alloc_range, rel_range;
> +	resource_size_t dpa, size;
> +
> +	if (!uuid_equal(id, &dc_event_uuid))
> +		return -EINVAL;
> +
> +	switch (record->data.event_type) {
> +	case ADD_CAPACITY:
> +		extent = devm_kzalloc(dev, sizeof(*extent), GFP_ATOMIC);
> +		if (!extent)
> +			return -ENOMEM;
> +
> +		extent->dpa_start = le64_to_cpu(record->data.extent.start_dpa);
> +		extent->length = le64_to_cpu(record->data.extent.length);
> +		memcpy(extent->tag, record->data.extent.tag,
> +				sizeof(record->data.extent.tag));
> +		extent->shared_extent_seq =
> +			le16_to_cpu(record->data.extent.shared_extn_seq);
> +		dev_dbg(dev, "Add DC extent DPA:0x%llx LEN:%llx\n",
> +					extent->dpa_start, extent->length);
> +		alloc_range = (struct range) {
> +			.start = extent->dpa_start,
> +			.end = extent->dpa_start + extent->length - 1,
> +		};
> +
> +		rc = cxl_add_dc_extent(mds, &alloc_range);
> +		if (rc < 0) {
> +			dev_dbg(dev, "unconsumed DC extent DPA:0x%llx LEN:%llx\n",
> +					extent->dpa_start, extent->length);
> +			rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, NULL);
> +			if (rc < 0) {
> +				dev_err(dev, "Couldn't create extent list %d\n",
> +									rc);
> +				devm_kfree(dev, extent);
> +				return rc;
> +			}
> +
> +			rc = cxl_send_dc_cap_response(mds, dc_res,
> +					extent_cnt, CXL_MBOX_OP_ADD_DC_RESPONSE);
> +			if (rc < 0) {
> +				devm_kfree(dev, extent);
> +				goto out;
> +			}
> +
> +			kfree(dc_res);
> +			devm_kfree(dev, extent);
> +
> +			return 0;
> +		}
> +
> +		rc = xa_insert(&mds->dc_extent_list, extent->dpa_start, extent,
> +				GFP_KERNEL);
> +		if (rc < 0)
> +			goto out;
> +
> +		mds->num_dc_extents++;
> +		rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, &alloc_range);
> +		if (rc < 0) {
> +			dev_err(dev, "Couldn't create extent list %d\n", rc);
> +			return rc;
> +		}
> +
> +		rc = cxl_send_dc_cap_response(mds, dc_res, extent_cnt,
> +					      CXL_MBOX_OP_ADD_DC_RESPONSE);
> +		if (rc < 0)
> +			goto out;
> +
> +		break;

As mentioned by Iry already in one of his reply, I also think it is better to
have a helper function for add/release capacity cases, like
cxl_handle_add_dc_capacity and cxl_handle_release_dc_capacity.

> +
> +	case RELEASE_CAPACITY:
> +		dpa = le64_to_cpu(record->data.extent.start_dpa);
> +		size = le64_to_cpu(record->data.extent.length);
> +		dev_dbg(dev, "Release DC extents DPA:0x%llx LEN:%llx\n",
> +				dpa, size);
> +		extent = xa_load(&mds->dc_extent_list, dpa);
> +		if (!extent) {
> +			dev_err(dev, "No extent found with DPA:0x%llx\n", dpa);
> +			return -EINVAL;
> +		}
> +
> +		rel_range = (struct range) {
> +			.start = dpa,
> +			.end = dpa + size - 1,
> +		};
> +
> +		rc = cxl_release_dc_extent(mds, &rel_range);
> +		if (rc < 0) {
> +			dev_dbg(dev, "withhold DC extent DPA:0x%llx LEN:%llx\n",
> +									dpa, size);
> +			return 0;
> +		}
> +
> +		xa_erase(&mds->dc_extent_list, dpa);
> +		devm_kfree(dev, extent);
> +		mds->num_dc_extents--;
> +		rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, &rel_range);
> +		if (rc < 0) {
> +			dev_err(dev, "Couldn't create extent list %d\n", rc);
> +			return rc;
> +		}
> +
> +		rc = cxl_send_dc_cap_response(mds, dc_res, extent_cnt,
> +					      CXL_MBOX_OP_RELEASE_DC);
> +		if (rc < 0)
> +			goto out;
> +
> +		break;
> +
> +	default:
> +		return -EINVAL;
> +	}
> +out:
> +	kfree(dc_res);
> +	return rc;
> +}
> +
>  static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
>  				    enum cxl_event_log_type type)
>  {
> @@ -982,9 +1180,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
>  		if (!nr_rec)
>  			break;
>  
> -		for (i = 0; i < nr_rec; i++)
> +		for (i = 0; i < nr_rec; i++) {
>  			cxl_event_trace_record(cxlmd, type,
>  					       &payload->records[i]);
format issue, we have some spaces here after tab.
> +			if (type == CXL_EVENT_TYPE_DCD) {
> +				rc = cxl_handle_dcd_event_records(mds,
> +						&payload->records[i]);
> +				if (rc)
> +					dev_err_ratelimited(dev,
> +						"dcd event failed: %d\n", rc);
> +			}
> +		}
>  
>  		if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
>  			trace_cxl_overflow(cxlmd, type, payload);
> @@ -1024,6 +1230,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
>  		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_WARN);
>  	if (status & CXLDEV_EVENT_STATUS_INFO)
>  		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_INFO);
> +	if (status & CXLDEV_EVENT_STATUS_DCD)
> +		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_mem_get_event_records, CXL);
>  
> @@ -1244,6 +1452,140 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
>  
> +int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> +			      unsigned int *extent_gen_num)
> +{
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_mbox_dc_extents *dc_extents;
> +	struct cxl_mbox_get_dc_extent get_dc_extent;
> +	unsigned int total_extent_cnt;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	int rc;
> +
> +	/* Check GET_DC_EXTENT_LIST is supported by device */
> +	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
> +		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
> +		return 0;
> +	}
> +
> +	dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);
> +	if (!dc_extents)
> +		return -ENOMEM;
> +
> +	get_dc_extent = (struct cxl_mbox_get_dc_extent) {
> +		.extent_cnt = 0,
> +		.start_extent_index = 0,
> +	};
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> +		.payload_in = &get_dc_extent,
> +		.size_in = sizeof(get_dc_extent),
> +		.size_out = mds->payload_size,
> +		.payload_out = dc_extents,
> +		.min_out = 1,
> +	};
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	if (rc < 0)
> +		goto out;
> +
> +	total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> +	*extent_gen_num = le32_to_cpu(dc_extents->extent_list_num);
> +	dev_dbg(dev, "Total extent count :%d Extent list Generation Num: %d\n",
> +			total_extent_cnt, *extent_gen_num);
> +out:
> +
> +	kvfree(dc_extents);
> +	if (rc < 0)
> +		return rc;
> +
> +	return total_extent_cnt;
> +
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_get_dc_extent_cnt, CXL);
> +
> +int cxl_dev_get_dc_extents(struct cxl_memdev_state *mds,
> +			   unsigned int index, unsigned int cnt)
> +{
> +	/* See CXL 3.0 Table 125 dynamic capacity config  Output Payload */
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_mbox_dc_extents *dc_extents;
> +	struct cxl_mbox_get_dc_extent get_dc_extent;
> +	unsigned int extent_gen_num, available_extents, total_extent_cnt;
> +	int rc;
> +	struct cxl_dc_extent_data *extent;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	struct range alloc_range;
> +
> +	/* Check GET_DC_EXTENT_LIST is supported by device */
> +	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
> +		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
> +		return 0;
> +	}
> +
> +	dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);
> +	if (!dc_extents)
> +		return -ENOMEM;
> +	get_dc_extent = (struct cxl_mbox_get_dc_extent) {
> +		.extent_cnt = cnt,
> +		.start_extent_index = index,
> +	};
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> +		.payload_in = &get_dc_extent,
> +		.size_in = sizeof(get_dc_extent),
> +		.size_out = mds->payload_size,
> +		.payload_out = dc_extents,
> +		.min_out = 1,
> +	};
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	if (rc < 0)
> +		goto out;
> +
> +	available_extents = le32_to_cpu(dc_extents->ret_extent_cnt);
> +	total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> +	extent_gen_num = le32_to_cpu(dc_extents->extent_list_num);
> +	dev_dbg(dev, "No Total extent count :%d Extent list Generation Num:%d\n",
> +			total_extent_cnt, extent_gen_num);
> +
> +
> +	for (int i = 0; i < available_extents ; i++) {
> +		extent = devm_kzalloc(dev, sizeof(*extent), GFP_KERNEL);
> +		if (!extent) {
> +			rc = -ENOMEM;
> +			goto out;
> +		}
> +		extent->dpa_start = le64_to_cpu(dc_extents->extent[i].start_dpa);
> +		extent->length = le64_to_cpu(dc_extents->extent[i].length);
> +		memcpy(extent->tag, dc_extents->extent[i].tag,
> +					sizeof(dc_extents->extent[i].tag));
> +		extent->shared_extent_seq =
> +				le16_to_cpu(dc_extents->extent[i].shared_extn_seq);
> +		dev_dbg(dev, "dynamic capacity extent[%d] DPA:0x%llx LEN:%llx\n",
> +				i, extent->dpa_start, extent->length);
> +
> +		alloc_range = (struct range){
> +			.start = extent->dpa_start,
> +			.end = extent->dpa_start + extent->length - 1,
> +		};
> +
> +		rc = cxl_add_dc_extent(mds, &alloc_range);
> +		if (rc < 0)
> +			goto out;
> +		rc = xa_insert(&mds->dc_extent_list, extent->dpa_start, extent,
> +				GFP_KERNEL);
> +	}
> +
> +out:
> +	kvfree(dc_extents);
> +	if (rc < 0)
> +		return rc;
> +
> +	return available_extents;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_get_dc_extents, CXL);
> +
>  static int add_dpa_res(struct device *dev, struct resource *parent,
>  		       struct resource *res, resource_size_t start,
>  		       resource_size_t size, const char *type)
> @@ -1452,6 +1794,7 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
>  	mutex_init(&mds->event.log_lock);
>  	mds->cxlds.dev = dev;
>  	mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
> +	xa_init(&mds->dc_extent_list);
>  
>  	return mds;
>  }
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 144232c8305e..ba45c1c3b0a9 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1,6 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  /* Copyright(c) 2022 Intel Corporation. All rights reserved. */
>  #include <linux/memregion.h>
> +#include <linux/interrupt.h>
>  #include <linux/genalloc.h>
>  #include <linux/device.h>
>  #include <linux/module.h>
> @@ -11,6 +12,8 @@
>  #include <cxlmem.h>
>  #include <cxl.h>
>  #include "core.h"
> +#include "../../dax/bus.h"
> +#include "../../dax/dax-private.h"
>  
>  /**
>   * DOC: cxl core region
> @@ -166,6 +169,38 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
>  	return 0;
>  }
>  
> +static int cxl_region_manage_dc(struct cxl_region *cxlr)
> +{
> +	struct cxl_region_params *p = &cxlr->params;
> +	unsigned int extent_gen_num;
> +	int i, rc;
> +
> +	/* Designed for Non Interleaving flow with the assumption one
> +	 * cxl_region will map the complete device DC region's DPA range
> +	 */
> +	for (i = 0; i < p->nr_targets; i++) {
> +		struct cxl_endpoint_decoder *cxled = p->targets[i];
> +		struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +		struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +
> +		rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
> +		if (rc < 0)
> +			goto err;
> +		else if (rc > 1) {
> +			rc = cxl_dev_get_dc_extents(mds, rc, 0);
> +			if (rc < 0)
> +				goto err;
> +			mds->num_dc_extents = rc;
> +			mds->dc_extents_index = rc - 1;
> +		}
> +		mds->dc_list_gen_num = extent_gen_num;
> +		dev_dbg(mds->cxlds.dev, "No of preallocated extents :%d\n", rc);
> +	}
> +	return 0;
> +err:
> +	return rc;
> +}
> +
>  static int commit_decoder(struct cxl_decoder *cxld)
>  {
>  	struct cxl_switch_decoder *cxlsd = NULL;
> @@ -2865,11 +2900,14 @@ static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
>  		return PTR_ERR(cxlr_dax);
>  
>  	cxlr_dc = kzalloc(sizeof(*cxlr_dc), GFP_KERNEL);
> -	if (!cxlr_dc) {
> -		rc = -ENOMEM;
> -		goto err;
> -	}
> +	if (!cxlr_dc)
> +		return -ENOMEM;
>  
> +	rc = request_module("dax_cxl");
> +	if (rc) {
> +		dev_err(dev, "failed to load dax-ctl module\n");
> +		goto load_err;
> +	}
>  	dev = &cxlr_dax->dev;
>  	rc = dev_set_name(dev, "dax_region%d", cxlr->id);
>  	if (rc)
> @@ -2891,10 +2929,24 @@ static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
>  	xa_init(&cxlr_dc->dax_dev_list);
>  	cxlr->cxlr_dc = cxlr_dc;
>  	rc = devm_add_action_or_reset(&cxlr->dev, cxl_dc_region_release, cxlr);
> -	if (!rc)
> -		return 0;
> +	if (rc)
> +		goto err;
> +
> +	if (!dev->driver) {
> +		dev_err(dev, "%s Driver not attached\n", dev_name(dev));
> +		rc = -ENXIO;
> +		goto err;
> +	}
> +
> +	rc = cxl_region_manage_dc(cxlr);
> +	if (rc)
> +		goto err;
> +
> +	return 0;
> +
>  err:
>  	put_device(dev);
> +load_err:
>  	kfree(cxlr_dc);
>  	return rc;
>  }
> @@ -3076,6 +3128,156 @@ struct cxl_region *cxl_create_region(struct cxl_root_decoder *cxlrd,
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_create_region, CXL);
>  
> +static int match_ep_decoder_by_range(struct device *dev, void *data)
> +{
> +	struct cxl_endpoint_decoder *cxled;
> +	struct range *dpa_range = data;
> +
> +	if (!is_endpoint_decoder(dev))
> +		return 0;
> +
> +	cxled = to_cxl_endpoint_decoder(dev);
> +	if (!cxled->cxld.region)
> +		return 0;
> +
> +	if (cxled->dpa_res->start <= dpa_range->start &&
> +				cxled->dpa_res->end >= dpa_range->end)
> +		return 1;
> +
> +	return 0;
> +}
> +
> +int cxl_release_dc_extent(struct cxl_memdev_state *mds,
> +			  struct range *rel_range)
> +{
> +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct cxl_dc_region *cxlr_dc;
> +	struct dax_region *dax_region;
> +	resource_size_t dpa_offset;
> +	struct cxl_region *cxlr;
> +	struct range hpa_range;
> +	struct dev_dax *dev_dax;
> +	resource_size_t hpa;
> +	struct device *dev;
> +	int ranges, rc = 0;
> +
> +	/*
> +	 * Find the cxl endpoind decoder with which has the extent dpa range and

s/endpoind/endpoint/

> +	 * get the cxl_region, dax_region refrences.
> +	 */
> +	dev = device_find_child(&cxlmd->endpoint->dev, rel_range,
> +				match_ep_decoder_by_range);
> +	if (!dev) {
> +		dev_err(mds->cxlds.dev, "%pr not mapped\n", rel_range);
> +		return PTR_ERR(dev);
> +	}
> +
> +	cxled = to_cxl_endpoint_decoder(dev);
> +	hpa_range = cxled->cxld.hpa_range;
> +	cxlr = cxled->cxld.region;
> +	cxlr_dc = cxlr->cxlr_dc;
> +
> +	/* DPA to HPA translation */
> +	if (cxled->cxld.interleave_ways == 1) {
> +		dpa_offset = rel_range->start - cxled->dpa_res->start;
> +		hpa = hpa_range.start + dpa_offset;
> +	} else {
> +		dev_err(mds->cxlds.dev, "Interleaving DC not supported\n");
> +		return -EINVAL;
> +	}
> +
> +	dev_dax = xa_load(&cxlr_dc->dax_dev_list, hpa);
> +	if (!dev_dax)
> +		return -EINVAL;
> +
> +	dax_region = dev_dax->region;
> +	ranges = dev_dax->nr_range;
> +
> +	while (ranges) {
> +		int i = ranges - 1;
> +		struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
> +
> +		devm_release_action(dax_region->dev, unregister_dax_mapping,
> +								&mapping->dev);
> +		ranges--;
> +	}
> +
> +	dev_dbg(mds->cxlds.dev, "removing devdax device:%s\n",
> +						dev_name(&dev_dax->dev));
> +	devm_release_action(dax_region->dev, unregister_dev_dax,
> +							&dev_dax->dev);
> +	xa_erase(&cxlr_dc->dax_dev_list, hpa);
> +
> +	return rc;
> +}
> +
> +int cxl_add_dc_extent(struct cxl_memdev_state *mds, struct range *alloc_range)
> +{
> +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct cxl_dax_region *cxlr_dax;
> +	struct cxl_dc_region *cxlr_dc;
> +	struct dax_region *dax_region;
> +	resource_size_t dpa_offset;
> +	struct dev_dax_data data;
> +	struct dev_dax *dev_dax;
> +	struct cxl_region *cxlr;
> +	struct range hpa_range;
> +	resource_size_t hpa;
> +	struct device *dev;
> +	int rc;
> +
> +	/*
> +	 * Find the cxl endpoind decoder with which has the extent dpa range and
> +	 * get the cxl_region, dax_region refrences.
> +	 */
> +	dev = device_find_child(&cxlmd->endpoint->dev, alloc_range,
> +				match_ep_decoder_by_range);
> +	if (!dev) {
> +		dev_err(mds->cxlds.dev, "%pr not mapped\n",	alloc_range);
> +		return PTR_ERR(dev);
> +	}
> +
> +	cxled = to_cxl_endpoint_decoder(dev);
> +	hpa_range = cxled->cxld.hpa_range;
> +	cxlr = cxled->cxld.region;
> +	cxlr_dc = cxlr->cxlr_dc;
> +	cxlr_dax = cxlr_dc->cxlr_dax;
> +	dax_region = dev_get_drvdata(&cxlr_dax->dev);
> +
> +	/* DPA to HPA translation */
> +	if (cxled->cxld.interleave_ways == 1) {
> +		dpa_offset = alloc_range->start - cxled->dpa_res->start;
> +		hpa = hpa_range.start + dpa_offset;
> +	} else {
> +		dev_err(mds->cxlds.dev, "Interleaving DC not supported\n");
> +		return -EINVAL;
> +	}
> +
> +	data = (struct dev_dax_data) {
> +		.dax_region = dax_region,
> +		.id = -1,
> +		.size = 0,
> +	};
> +
> +	dev_dax = devm_create_dev_dax(&data);
> +	if (IS_ERR(dev_dax))
> +		return PTR_ERR(dev_dax);
> +
> +	if (IS_ALIGNED(range_len(alloc_range), max_t(unsigned long,
> +				dev_dax->align, memremap_compat_align()))) {
> +		rc = alloc_dev_dax_range(dev_dax, hpa,
> +					range_len(alloc_range));
> +		if (rc)
> +			return rc;
> +	}
> +
> +	rc = xa_insert(&cxlr_dc->dax_dev_list, hpa, dev_dax, GFP_KERNEL);
> +
> +	return rc;
> +}
> +
>  /* Establish an empty region covering the given HPA range */
>  static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
>  					   struct cxl_endpoint_decoder *cxled)
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index a0b5819bc70b..e11651255780 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -122,7 +122,8 @@ TRACE_EVENT(cxl_aer_correctable_error,
>  		{ CXL_EVENT_TYPE_INFO, "Informational" },	\
>  		{ CXL_EVENT_TYPE_WARN, "Warning" },		\
>  		{ CXL_EVENT_TYPE_FAIL, "Failure" },		\
> -		{ CXL_EVENT_TYPE_FATAL, "Fatal" })
> +		{ CXL_EVENT_TYPE_FATAL, "Fatal" },		\
> +		{ CXL_EVENT_TYPE_DCD, "DCD" })
>  
>  TRACE_EVENT(cxl_overflow,
>  
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 7ac1237938b7..60c436b7ebb1 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -163,11 +163,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
>  #define CXLDEV_EVENT_STATUS_WARN		BIT(1)
>  #define CXLDEV_EVENT_STATUS_FAIL		BIT(2)
>  #define CXLDEV_EVENT_STATUS_FATAL		BIT(3)
> +#define CXLDEV_EVENT_STATUS_DCD			BIT(4)
>  
>  #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO |	\
>  				 CXLDEV_EVENT_STATUS_WARN |	\
>  				 CXLDEV_EVENT_STATUS_FAIL |	\
> -				 CXLDEV_EVENT_STATUS_FATAL)
> +				 CXLDEV_EVENT_STATUS_FATAL|	\
> +				 CXLDEV_EVENT_STATUS_DCD)
>  
>  /* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
>  #define CXLDEV_EVENT_INT_MODE_MASK	GENMASK(1, 0)
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 9c0b2fa72bdd..0440b5c04ef6 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -5,6 +5,7 @@
>  #include <uapi/linux/cxl_mem.h>
>  #include <linux/cdev.h>
>  #include <linux/uuid.h>
> +#include <linux/xarray.h>
>  #include "cxl.h"
>  
>  /* CXL 2.0 8.2.8.5.1.1 Memory Device Status Register */
> @@ -226,6 +227,7 @@ struct cxl_event_interrupt_policy {
>  	u8 warn_settings;
>  	u8 failure_settings;
>  	u8 fatal_settings;
> +	u8 dyncap_settings;
>  } __packed;
>  
>  /**
> @@ -296,6 +298,13 @@ enum cxl_devtype {
>  #define CXL_MAX_DC_REGION 8
>  #define CXL_DC_REGION_SRTLEN 8
>  
> +struct cxl_dc_extent_data {
> +	u64 dpa_start;
> +	u64 length;
> +	u8 tag[16];
> +	u16 shared_extent_seq;
> +};
> +
>  /**
>   * struct cxl_dev_state - The driver device state
>   *
> @@ -406,6 +415,11 @@ struct cxl_memdev_state {
>  		u8 flags;
>  	} dc_region[CXL_MAX_DC_REGION];
>  
> +	u32 dc_list_gen_num;
> +	u32 dc_extents_index;
> +	struct xarray dc_extent_list;
> +	u32 num_dc_extents;
> +
>  	size_t dc_event_log_size;
>  	struct cxl_event_state event;
>  	struct cxl_poison_state poison;
> @@ -470,6 +484,17 @@ enum cxl_opcode {
>  	UUID_INIT(0xe1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19,     \
>  		  0x40, 0x3d, 0x86)
>  
> +
> +struct cxl_mbox_dc_response {
> +	__le32 extent_list_size;
> +	u8 reserved[4];
> +	struct updated_extent_list {
> +		__le64 dpa_start;
> +		__le64 length;
> +		u8 reserved[8];
> +	} __packed extent_list[];
> +} __packed;
> +
>  struct cxl_mbox_get_supported_logs {
>  	__le16 entries;
>  	u8 rsvd[6];
> @@ -555,6 +580,7 @@ enum cxl_event_log_type {
>  	CXL_EVENT_TYPE_WARN,
>  	CXL_EVENT_TYPE_FAIL,
>  	CXL_EVENT_TYPE_FATAL,
> +	CXL_EVENT_TYPE_DCD,
>  	CXL_EVENT_TYPE_MAX
>  };
>  
> @@ -639,6 +665,35 @@ struct cxl_event_mem_module {
>  	u8 reserved[0x3d];
>  } __packed;
>  
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.0 section 8.2.9.2.1.5; Table 8-47
> + */
> +
> +#define CXL_EVENT_DC_TAG_SIZE	0x10
> +struct cxl_dc_extent {
> +	__le64 start_dpa;
> +	__le64 length;
> +	u8 tag[CXL_EVENT_DC_TAG_SIZE];
> +	__le16 shared_extn_seq;
> +	u8 reserved[6];
> +} __packed;
> +
> +struct dcd_record_data {
> +	u8 event_type;
> +	u8 reserved;
> +	__le16 host_id;
> +	u8 region_index;
> +	u8 reserved1[3];
> +	struct cxl_dc_extent extent;
> +	u8 reserved2[32];
> +} __packed;
> +
> +struct dcd_event_dyn_cap {
> +	struct cxl_event_record_hdr hdr;
> +	struct dcd_record_data data;
> +} __packed;
> +
>  struct cxl_mbox_get_partition_info {
>  	__le64 active_volatile_cap;
>  	__le64 active_persistent_cap;
> @@ -684,6 +739,19 @@ struct cxl_mbox_dynamic_capacity {
>  #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
>  #define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
>  
> +struct cxl_mbox_get_dc_extent {
> +	__le32 extent_cnt;
> +	__le32 start_extent_index;
> +} __packed;
> +
> +struct cxl_mbox_dc_extents {
> +	__le32 ret_extent_cnt;
> +	__le32 total_extent_cnt;
> +	__le32 extent_list_num;
> +	u8 rsvd[4];
> +	struct cxl_dc_extent extent[];
> +}  __packed;
> +
>  /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
>  struct cxl_mbox_set_timestamp_in {
>  	__le64 timestamp;
> @@ -826,6 +894,14 @@ int cxl_trigger_poison_list(struct cxl_memdev *cxlmd);
>  int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa);
>  int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa);
>  
> +/* FIXME why not have these be static in mbox.c? */
> +int cxl_add_dc_extent(struct cxl_memdev_state *mds, struct range *alloc_range);
> +int cxl_release_dc_extent(struct cxl_memdev_state *mds, struct range *rel_range);
> +int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> +			      unsigned int *extent_gen_num);
> +int cxl_dev_get_dc_extents(struct cxl_memdev_state *mds, unsigned int cnt,
> +			   unsigned int index);
> +
>  #ifdef CONFIG_CXL_SUSPEND
>  void cxl_mem_active_inc(void);
>  void cxl_mem_active_dec(void);
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index ac1a41bc083d..558ffbcb9b34 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -522,8 +522,8 @@ static int cxl_event_req_irq(struct cxl_dev_state *cxlds, u8 setting)
>  		return irq;
>  
>  	return devm_request_threaded_irq(dev, irq, NULL, cxl_event_thread,
> -					 IRQF_SHARED | IRQF_ONESHOT, NULL,
> -					 dev_id);
> +					IRQF_SHARED | IRQF_ONESHOT, NULL,
> +					dev_id);
>  }
>  
>  static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
> @@ -555,6 +555,7 @@ static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
>  		.warn_settings = CXL_INT_MSI_MSIX,
>  		.failure_settings = CXL_INT_MSI_MSIX,
>  		.fatal_settings = CXL_INT_MSI_MSIX,
> +		.dyncap_settings = CXL_INT_MSI_MSIX,
>  	};
>  
>  	mbox_cmd = (struct cxl_mbox_cmd) {
> @@ -608,6 +609,11 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds)
>  		return rc;
>  	}
>  
> +	rc = cxl_event_req_irq(cxlds, policy.dyncap_settings);
> +	if (rc) {
> +		dev_err(cxlds->dev, "Failed to get interrupt for event dc log\n");
> +		return rc;
> +	}
>  	return 0;
>  }
>  
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 227800053309..b2b27033f589 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -434,7 +434,7 @@ static void free_dev_dax_ranges(struct dev_dax *dev_dax)
>  		trim_dev_dax_range(dev_dax);
>  }
>  
> -static void unregister_dev_dax(void *dev)
> +void unregister_dev_dax(void *dev)
>  {
>  	struct dev_dax *dev_dax = to_dev_dax(dev);
>  
> @@ -445,6 +445,7 @@ static void unregister_dev_dax(void *dev)
>  	free_dev_dax_ranges(dev_dax);
>  	put_device(dev);
>  }
> +EXPORT_SYMBOL_GPL(unregister_dev_dax);
>  
>  /* a return value >= 0 indicates this invocation invalidated the id */
>  static int __free_dev_dax_id(struct dev_dax *dev_dax)
> @@ -641,7 +642,7 @@ static void dax_mapping_release(struct device *dev)
>  	kfree(mapping);
>  }
>  
> -static void unregister_dax_mapping(void *data)
> +void unregister_dax_mapping(void *data)
>  {
>  	struct device *dev = data;
>  	struct dax_mapping *mapping = to_dax_mapping(dev);
> @@ -658,7 +659,7 @@ static void unregister_dax_mapping(void *data)
>  	device_del(dev);
>  	put_device(dev);
>  }
> -
> +EXPORT_SYMBOL_GPL(unregister_dax_mapping);
>  static struct dev_dax_range *get_dax_range(struct device *dev)
>  {
>  	struct dax_mapping *mapping = to_dax_mapping(dev);
> @@ -793,7 +794,7 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
>  	return 0;
>  }
>  
> -static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
> +int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
>  		resource_size_t size)
>  {
>  	struct dax_region *dax_region = dev_dax->region;
> @@ -853,6 +854,8 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
>  
>  	return rc;
>  }
> +EXPORT_SYMBOL_GPL(alloc_dev_dax_range);
> +
>  
>  static int adjust_dev_dax_range(struct dev_dax *dev_dax, struct resource *res, resource_size_t size)
>  {
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index 8cd79ab34292..aa8418c7aead 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -47,8 +47,11 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
>  	__dax_driver_register(driver, THIS_MODULE, KBUILD_MODNAME)
>  void dax_driver_unregister(struct dax_device_driver *dax_drv);
>  void kill_dev_dax(struct dev_dax *dev_dax);
> +void unregister_dev_dax(void *dev);
> +void unregister_dax_mapping(void *data);
>  bool static_dev_dax(struct dev_dax *dev_dax);
> -
> +int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
> +					resource_size_t size);
>  /*
>   * While run_dax() is potentially a generic operation that could be
>   * defined in include/linux/dax.h we don't want to grow any users
> 
> -- 
> 2.40.0
> 

-- 
Fan Ni <nifan@outlook.com>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/5] cxl/mem: Add support to handle DCD add and release capacity events.
  2023-06-16  4:11     ` Ira Weiny
@ 2023-06-27 18:20       ` Fan Ni
  0 siblings, 0 replies; 55+ messages in thread
From: Fan Ni @ 2023-06-27 18:20 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Alison Schofield, Navneet Singh, Fan Ni, Jonathan Cameron,
	Dan Williams, linux-cxl, a.manzanares, dave, nmtadam.samsung,
	nifan

The 06/15/2023 21:11, Ira Weiny wrote:
> Alison Schofield wrote:
> > On Wed, Jun 14, 2023 at 12:16:31PM -0700, Ira Weiny wrote:
> > > From: Navneet Singh <navneet.singh@intel.com>
> > > 
> > > A dynamic capacity device utilizes events to signal the host about the
> > > changes to the allocation of DC blocks. The device communicates the
> > > state of these blocks of dynamic capacity through an extent list that
> > > describes the starting DPA and length of all blocks the host can access.
> > > 
> > > Based on the dynamic capacity add or release event type,
> > > dynamic memory represented by the extents are either added
> > > or removed as devdax device.
> > 
> > Nice commit msg, please align second paragraph w first.
> 
> ok... fixed.  :-)
> 
> > 
> > > 
> > > Process the dynamic capacity add and release events.
> > > 
> > > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > > 
> > > ---
> > > [iweiny: Remove invalid comment]
> > > ---
> > >  drivers/cxl/core/mbox.c   | 345 +++++++++++++++++++++++++++++++++++++++++++++-
> > >  drivers/cxl/core/region.c | 214 +++++++++++++++++++++++++++-
> > >  drivers/cxl/core/trace.h  |   3 +-
> > >  drivers/cxl/cxl.h         |   4 +-
> > >  drivers/cxl/cxlmem.h      |  76 ++++++++++
> > >  drivers/cxl/pci.c         |  10 +-
> > >  drivers/dax/bus.c         |  11 +-
> > >  drivers/dax/bus.h         |   5 +-
> > >  8 files changed, 652 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > > index c5b696737c87..db9295216de5 100644
> > > --- a/drivers/cxl/core/mbox.c
> > > +++ b/drivers/cxl/core/mbox.c
> > > @@ -767,6 +767,14 @@ static const uuid_t log_uuid[] = {
> > >  	[VENDOR_DEBUG_UUID] = DEFINE_CXL_VENDOR_DEBUG_UUID,
> > >  };
> > >  
> > > +/* See CXL 3.0 8.2.9.2.1.5 */
> > > +enum dc_event {
> > > +	ADD_CAPACITY,
> > > +	RELEASE_CAPACITY,
> > > +	FORCED_CAPACITY_RELEASE,
> > > +	REGION_CONFIGURATION_UPDATED,
> > > +};
> > > +
> > >  /**
> > >   * cxl_enumerate_cmds() - Enumerate commands for a device.
> > >   * @mds: The driver data for the operation
> > > @@ -852,6 +860,14 @@ static const uuid_t mem_mod_event_uuid =
> > >  	UUID_INIT(0xfe927475, 0xdd59, 0x4339,
> > >  		  0xa5, 0x86, 0x79, 0xba, 0xb1, 0x13, 0xb7, 0x74);
> > >  
> > > +/*
> > > + * Dynamic Capacity Event Record
> > > + * CXL rev 3.0 section 8.2.9.2.1.3; Table 8-45
> > > + */
> > > +static const uuid_t dc_event_uuid =
> > > +	UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c,
> > > +		0x2f, 0x95, 0x26, 0x8e, 0x10, 0x1a, 0x2a);
> > > +
> > >  static void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> > >  				   enum cxl_event_log_type type,
> > >  				   struct cxl_event_record_raw *record)
> > > @@ -945,6 +961,188 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> > >  	return rc;
> > >  }
> > >  
> > > +static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
> > > +				struct cxl_mbox_dc_response *res,
> > > +				int extent_cnt, int opcode)
> > > +{
> > > +	struct cxl_mbox_cmd mbox_cmd;
> > > +	int rc, size;
> > > +
> > > +	size = struct_size(res, extent_list, extent_cnt);
> > > +	res->extent_list_size = cpu_to_le32(extent_cnt);
> > > +
> > > +	mbox_cmd = (struct cxl_mbox_cmd) {
> > > +		.opcode = opcode,
> > > +		.size_in = size,
> > > +		.payload_in = res,
> > > +	};
> > > +
> > > +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> > > +
> > > +	return rc;
> > > +
> > > +}
> > > +
> > > +static int cxl_prepare_ext_list(struct cxl_mbox_dc_response **res,
> > > +					int *n, struct range *extent)
> > > +{
> > > +	struct cxl_mbox_dc_response *dc_res;
> > > +	unsigned int size;
> > > +
> > > +	if (!extent)
> > > +		size = struct_size(dc_res, extent_list, 0);
> > > +	else
> > > +		size = struct_size(dc_res, extent_list, *n + 1);
> > > +
> > > +	dc_res = krealloc(*res, size, GFP_KERNEL);
> > > +	if (!dc_res)
> > > +		return -ENOMEM;
> > > +
> > > +	if (extent) {
> > > +		dc_res->extent_list[*n].dpa_start = cpu_to_le64(extent->start);
> > > +		memset(dc_res->extent_list[*n].reserved, 0, 8);
> > > +		dc_res->extent_list[*n].length = 
> > > +				cpu_to_le64(range_len(extent));
> > 
> > Unnecessary return. I think that fits in 80 columns.
> 
> exactly 80...  fixed.
> 
> > 
> > > +		(*n)++;
> > > +	}
> > > +
> > > +	*res = dc_res;
> > > +	return 0;
> > > +}
> > > +/**
> > > + * cxl_handle_dcd_event_records() - Read DCD event records.
> > > + * @mds: The memory device state
> > > + *
> > > + * Returns 0 if enumerate completed successfully.
> > > + *
> > > + * CXL devices can generate DCD events to add or remove extents in the list.
> > > + */
> > 
> > That's a kernel doc comment, so maybe can be clearer.
> 
> Or remove the kernel doc comment.
> 
> > It's called 'handle', so 'Read DCD event records' seems like a mismatch.
> 
> Yea.
> 
> > Probably needs more explaining.
> 
> Rather I would say less.  How about simply:
> 
> /* Returns 0 if the event was handled successfully. */
> 
> Or even nothing at all.  It is a static function and used 1 place.  Not
> sure we even need that line.
> 
> > 
> > 
> > > +static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
> > > +					struct cxl_event_record_raw *rec)
> > > +{
> > > +	struct cxl_mbox_dc_response *dc_res = NULL;
> > > +	struct device *dev = mds->cxlds.dev;
> > > +	uuid_t *id = &rec->hdr.id;
> > > +	struct dcd_event_dyn_cap *record =
> > > +			(struct dcd_event_dyn_cap *)rec;
> > > +	int extent_cnt = 0, rc = 0;
> > > +	struct cxl_dc_extent_data *extent;
> > > +	struct range alloc_range, rel_range;
> > > +	resource_size_t dpa, size;
> > > +
> > 
> > Please reverse x-tree. And if things like that *record can't fit within
> > 80 columns and in reverse x-tree order, then assign it afterwards.
> 
> Done.
> 
> > 
> > 
> > > +	if (!uuid_equal(id, &dc_event_uuid))
> > > +		return -EINVAL;
> > > +
> > > +	switch (record->data.event_type) {
> > 
> > Maybe a local for record->data.extent that is used repeatedly below,
> > or,
> > Perhaps pull the length and dpa local defines you made down in the
> > RELEASE_CAPACITY up here and share them with ADD_CAPACITY. That'll
> > reduce the le65_to_cpu noise. Add similar for shared_extn_seq.
> 
> I'm thinking ADD_CAPACITY and RELEASE_CAPACITY need to be 2 separate
> functions which make this function a simple uuid check and event_type
> switch.
> 
> Having local variables for those become much cleaner then.
> 
> I think the handling of dc_res would be cleaner then too.
> 
> > 
> > 
> > > +	case ADD_CAPACITY:
> > > +		extent = devm_kzalloc(dev, sizeof(*extent), GFP_ATOMIC);
> > > +		if (!extent)
> > > +			return -ENOMEM;
> > > +
> > > +		extent->dpa_start = le64_to_cpu(record->data.extent.start_dpa);
> > > +		extent->length = le64_to_cpu(record->data.extent.length);
> > > +		memcpy(extent->tag, record->data.extent.tag,
> > > +				sizeof(record->data.extent.tag));
> > > +		extent->shared_extent_seq =
> > > +			le16_to_cpu(record->data.extent.shared_extn_seq);
> > > +		dev_dbg(dev, "Add DC extent DPA:0x%llx LEN:%llx\n",
> > > +					extent->dpa_start, extent->length);
> > > +		alloc_range = (struct range) {
> > > +			.start = extent->dpa_start,
> > > +			.end = extent->dpa_start + extent->length - 1,
> > > +		};
> > > +
> > > +		rc = cxl_add_dc_extent(mds, &alloc_range);
> > > +		if (rc < 0) {
> > 
> > How about 
> > 		if (rc >=)
> > 			goto insert;
> > 
> > Then you can remove this level of indent.
> 
> I think if this is a separate function it will be better...
> 
> Also this entire indent block could be another sub function because AFAICS
> (see below) it always returns out from this block (only via the 'out'
> label in 1 case which seems redundant).
> 
> > 
> > > +			dev_dbg(dev, "unconsumed DC extent DPA:0x%llx LEN:%llx\n",
> > > +					extent->dpa_start, extent->length);
> > > +			rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, NULL);
> > > +			if (rc < 0) {
> > > +				dev_err(dev, "Couldn't create extent list %d\n",
> > > +									rc);
> > > +				devm_kfree(dev, extent);
> > > +				return rc;
> > > +			}
> > > +
> > > +			rc = cxl_send_dc_cap_response(mds, dc_res,
> > > +					extent_cnt, CXL_MBOX_OP_ADD_DC_RESPONSE);
> > > +			if (rc < 0) {
> > > +				devm_kfree(dev, extent);
> > > +				goto out;
> 
> This if is not doing anything useful.  Because this statement ...
> 
> > > +			}
> > > +
> > > +			kfree(dc_res);
> > > +			devm_kfree(dev, extent);
> 
> ...  and the 'else' here end up being the same logic.  The 'out' label
> flows through kfree(dc_res).  Is the intent that
> cxl_send_dc_cap_response() has no failure consequences?
> 
> > > +
> > > +			return 0;
> > > +		}
> > 
> > insert:
> > 
> > > +
> > > +		rc = xa_insert(&mds->dc_extent_list, extent->dpa_start, extent,
> > > +				GFP_KERNEL);
> > > +		if (rc < 0)
> > > +			goto out;
> > > +
> > > +		mds->num_dc_extents++;
> > > +		rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, &alloc_range);
> > > +		if (rc < 0) {
> > > +			dev_err(dev, "Couldn't create extent list %d\n", rc);
> > > +			return rc;
> > > +		}
> > > +
> > > +		rc = cxl_send_dc_cap_response(mds, dc_res, extent_cnt,
> > > +					      CXL_MBOX_OP_ADD_DC_RESPONSE);
> > > +		if (rc < 0)
> > > +			goto out;
> > > +
> > > +		break;
> > > +
> > > +	case RELEASE_CAPACITY:
> > > +		dpa = le64_to_cpu(record->data.extent.start_dpa);
> > > +		size = le64_to_cpu(record->data.extent.length);
> > 
> > ^^ do these sooner and share
> 
> I think add/release should be their own functions.
> 
> > 
> > > +		dev_dbg(dev, "Release DC extents DPA:0x%llx LEN:%llx\n",
> > > +				dpa, size);
> > > +		extent = xa_load(&mds->dc_extent_list, dpa);
> > > +		if (!extent) {
> > > +			dev_err(dev, "No extent found with DPA:0x%llx\n", dpa);
> > > +			return -EINVAL;
> > > +		}
> > > +
> > > +		rel_range = (struct range) {
> > > +			.start = dpa,
> > > +			.end = dpa + size - 1,
> > > +		};
> > > +
> > > +		rc = cxl_release_dc_extent(mds, &rel_range);
> > > +		if (rc < 0) {
> > > +			dev_dbg(dev, "withhold DC extent DPA:0x%llx LEN:%llx\n",
> > > +									dpa, size);
> > > +			return 0;
> > > +		}
> > > +
> > > +		xa_erase(&mds->dc_extent_list, dpa);
> > > +		devm_kfree(dev, extent);
> > > +		mds->num_dc_extents--;
> > > +		rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, &rel_range);
> > > +		if (rc < 0) {
> > > +			dev_err(dev, "Couldn't create extent list %d\n", rc);
> > > +			return rc;
> > > +		}
> > > +
> > > +		rc = cxl_send_dc_cap_response(mds, dc_res, extent_cnt,
> > > +					      CXL_MBOX_OP_RELEASE_DC);
> > > +		if (rc < 0)
> > > +			goto out;
> > > +
> > > +		break;
> > > +
> > > +	default:
> > > +		return -EINVAL;
> > > +	}
> > > +out:
> > 
> > The out seems needless. Replace all 'goto out''s  with 'break'
> > 
> > I'm also a bit concerned about all the direct returns above.
> > Can this be the single exit point?
> 
> I think so...
> 
> > kfree of a NULL ptr is OK.
> > Maybe a bit more logic here to do that devm_free is all that
> > is needed.
> 
> ... but even more clean up so that the logic is:
> 
> handle_event()
> {
> 
> 	... do checks ...
> 
> 	switch (type):
> 	case ADD...:
> 		rc = handle_add();
> 		break;
> 	case RELEASE...:
> 		rc = handle_release();
> 		break;
> 	default:
> 		rc = -EINVAL;
> 		break;
> 	}
> 
> 	return rc;
> }

Second it. 

Fan
> 
> > 
> > 
> > > +	kfree(dc_res);
> > > +	return rc;
> > > +}
> > > +
> > >  static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> > >  				    enum cxl_event_log_type type)
> > >  {
> > > @@ -982,9 +1180,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> > >  		if (!nr_rec)
> > >  			break;
> > >  
> > > -		for (i = 0; i < nr_rec; i++)
> > > +		for (i = 0; i < nr_rec; i++) {
> > >  			cxl_event_trace_record(cxlmd, type,
> > >  					       &payload->records[i]);
> > > +			if (type == CXL_EVENT_TYPE_DCD) {
> > > +				rc = cxl_handle_dcd_event_records(mds,
> > > +						&payload->records[i]);
> > > +				if (rc)
> > > +					dev_err_ratelimited(dev,
> > > +						"dcd event failed: %d\n", rc);
> > > +			}
> > 
> > 
> > Reduce indent option:
> > 
> > 			if (type != CXL_EVENT_TYPE_DCD)
> > 				continue;
> > 
> > 			rc = cxl_handle_dcd_event_records(mds,
> > 							  &payload->records[i]);			if (rc)
> > 				dev_err_ratelimited(dev,
> > 						    "dcd event failed: %d\n", rc);
> 
> Ah...  Ok.
> 
> Honestly I just made this change and I'm not keen on it.  I think it makes
> the detail that the event was DCD obscured.
> 
> I'm also questioning the need to the error reporting here.  There seems to
> be error messages in the critical parts of cxl_handle_dcd_event_records()
> which would give a clue as to why the DCD failed.  (Other than some common
> memory allocation issues.)  But also those errors are not rate limited.
> So if we are concerned with a FM or other external entity causing events
> which flood the logs it seems they all need to be debug or ratelimited.
> 
> > 
> > I don't know where 'cxl_handle_dcd_event_records() was introduce,
> > but I'm wondering now if it can have a short name.
> 
> Its the function above which needs all the rework.
> 
> > 
> > > +		}
> > >  
> > >  		if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
> > >  			trace_cxl_overflow(cxlmd, type, payload);
> > > @@ -1024,6 +1230,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
> > >  		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_WARN);
> > >  	if (status & CXLDEV_EVENT_STATUS_INFO)
> > >  		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_INFO);
> > > +	if (status & CXLDEV_EVENT_STATUS_DCD)
> > > +		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
> > >  }
> > >  EXPORT_SYMBOL_NS_GPL(cxl_mem_get_event_records, CXL);
> > >  
> > > @@ -1244,6 +1452,140 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> > >  }
> > >  EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> > >  
> > > +int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> > > +			      unsigned int *extent_gen_num)
> > > +{
> > > +	struct device *dev = mds->cxlds.dev;
> > > +	struct cxl_mbox_dc_extents *dc_extents;
> > > +	struct cxl_mbox_get_dc_extent get_dc_extent;
> > > +	unsigned int total_extent_cnt;
> > 
> > Seems 'count' would probably suffice here.
> 
> Done.
> 
> > 
> > > +	struct cxl_mbox_cmd mbox_cmd;
> > > +	int rc;
> > 
> > Above - reverse x-tree please.
> 
> Done.
> 
> > 
> > > +
> > > +	/* Check GET_DC_EXTENT_LIST is supported by device */
> > > +	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
> > > +		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
> > > +		return 0;
> > > +	}
> > > +
> > > +	dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);
> > > +	if (!dc_extents)
> > > +		return -ENOMEM;
> > > +
> > > +	get_dc_extent = (struct cxl_mbox_get_dc_extent) {
> > > +		.extent_cnt = 0,
> > > +		.start_extent_index = 0,
> > > +	};
> > > +
> > > +	mbox_cmd = (struct cxl_mbox_cmd) {
> > > +		.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> > > +		.payload_in = &get_dc_extent,
> > > +		.size_in = sizeof(get_dc_extent),
> > > +		.size_out = mds->payload_size,
> > > +		.payload_out = dc_extents,
> > > +		.min_out = 1,
> > > +	};
> > > +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> > > +	if (rc < 0)
> > > +		goto out;
> > > +
> > > +	total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> > > +	*extent_gen_num = le32_to_cpu(dc_extents->extent_list_num);
> > > +	dev_dbg(dev, "Total extent count :%d Extent list Generation Num: %d\n",
> > > +			total_extent_cnt, *extent_gen_num);
> > > +out:
> > > +
> > > +	kvfree(dc_extents);
> > > +	if (rc < 0)
> > > +		return rc;
> > > +
> > > +	return total_extent_cnt;
> > > +
> > > +}
> > > +EXPORT_SYMBOL_NS_GPL(cxl_dev_get_dc_extent_cnt, CXL);
> > > +
> > > +int cxl_dev_get_dc_extents(struct cxl_memdev_state *mds,
> > > +			   unsigned int index, unsigned int cnt)
> > > +{
> > > +	/* See CXL 3.0 Table 125 dynamic capacity config  Output Payload */
> > > +	struct device *dev = mds->cxlds.dev;
> > > +	struct cxl_mbox_dc_extents *dc_extents;
> > > +	struct cxl_mbox_get_dc_extent get_dc_extent;
> > > +	unsigned int extent_gen_num, available_extents, total_extent_cnt;
> > > +	int rc;
> > > +	struct cxl_dc_extent_data *extent;
> > > +	struct cxl_mbox_cmd mbox_cmd;
> > > +	struct range alloc_range;
> > > +
> > 
> > Reverse x-tree please.
> 
> Done.
> 
> > 
> > > +	/* Check GET_DC_EXTENT_LIST is supported by device */
> > > +	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
> > > +		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
> > > +		return 0;
> > > +	}
> > 
> > Can we even get this far if this cmd is not supported by the device?
> > Is there a sooner place to test those bits?  Is this sysfs request?
> > (sorry not completely following here).
> > 
> 
> I'll have to check.  Perhaps Navneet knows.
> 
> > > +
> > > +	dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);
> > > +	if (!dc_extents)
> > > +		return -ENOMEM;
> > > +	get_dc_extent = (struct cxl_mbox_get_dc_extent) {
> > > +		.extent_cnt = cnt,
> > > +		.start_extent_index = index,
> > > +	};
> > > +
> > > +	mbox_cmd = (struct cxl_mbox_cmd) {
> > > +		.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> > > +		.payload_in = &get_dc_extent,
> > > +		.size_in = sizeof(get_dc_extent),
> > > +		.size_out = mds->payload_size,
> > > +		.payload_out = dc_extents,
> > > +		.min_out = 1,
> > > +	};
> > > +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> > > +	if (rc < 0)
> > > +		goto out;
> > > +
> > > +	available_extents = le32_to_cpu(dc_extents->ret_extent_cnt);
> > > +	total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> > > +	extent_gen_num = le32_to_cpu(dc_extents->extent_list_num);
> > > +	dev_dbg(dev, "No Total extent count :%d Extent list Generation Num:%d\n",
> > > +			total_extent_cnt, extent_gen_num);
> > > +
> > > +
> > > +	for (int i = 0; i < available_extents ; i++) {
> > > +		extent = devm_kzalloc(dev, sizeof(*extent), GFP_KERNEL);
> > > +		if (!extent) {
> > > +			rc = -ENOMEM;
> > > +			goto out;
> > > +		}
> > > +		extent->dpa_start = le64_to_cpu(dc_extents->extent[i].start_dpa);
> > > +		extent->length = le64_to_cpu(dc_extents->extent[i].length);
> > > +		memcpy(extent->tag, dc_extents->extent[i].tag,
> > > +					sizeof(dc_extents->extent[i].tag));
> > > +		extent->shared_extent_seq =
> > > +				le16_to_cpu(dc_extents->extent[i].shared_extn_seq);
> > > +		dev_dbg(dev, "dynamic capacity extent[%d] DPA:0x%llx LEN:%llx\n",
> > > +				i, extent->dpa_start, extent->length);
> > > +
> > > +		alloc_range = (struct range){
> > > +			.start = extent->dpa_start,
> > > +			.end = extent->dpa_start + extent->length - 1,
> > > +		};
> > > +
> > > +		rc = cxl_add_dc_extent(mds, &alloc_range);
> > > +		if (rc < 0)
> > > +			goto out;
> > > +		rc = xa_insert(&mds->dc_extent_list, extent->dpa_start, extent,
> > > +				GFP_KERNEL);
> > > +	}
> > > +
> > > +out:
> > > +	kvfree(dc_extents);
> > > +	if (rc < 0)
> > > +		return rc;
> > > +
> > > +	return available_extents;
> > > +}
> > > +EXPORT_SYMBOL_NS_GPL(cxl_dev_get_dc_extents, CXL);
> > > +
> > >  static int add_dpa_res(struct device *dev, struct resource *parent,
> > >  		       struct resource *res, resource_size_t start,
> > >  		       resource_size_t size, const char *type)
> > > @@ -1452,6 +1794,7 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
> > >  	mutex_init(&mds->event.log_lock);
> > >  	mds->cxlds.dev = dev;
> > >  	mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
> > > +	xa_init(&mds->dc_extent_list);
> > >  
> > >  	return mds;
> > >  }
> > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > > index 144232c8305e..ba45c1c3b0a9 100644
> > > --- a/drivers/cxl/core/region.c
> > > +++ b/drivers/cxl/core/region.c
> > > @@ -1,6 +1,7 @@
> > >  // SPDX-License-Identifier: GPL-2.0-only
> > >  /* Copyright(c) 2022 Intel Corporation. All rights reserved. */
> > >  #include <linux/memregion.h>
> > > +#include <linux/interrupt.h>
> > >  #include <linux/genalloc.h>
> > >  #include <linux/device.h>
> > >  #include <linux/module.h>
> > > @@ -11,6 +12,8 @@
> > >  #include <cxlmem.h>
> > >  #include <cxl.h>
> > >  #include "core.h"
> > > +#include "../../dax/bus.h"
> > > +#include "../../dax/dax-private.h"
> > >  
> > >  /**
> > >   * DOC: cxl core region
> > > @@ -166,6 +169,38 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
> > >  	return 0;
> > >  }
> > >  
> > > +static int cxl_region_manage_dc(struct cxl_region *cxlr)
> > > +{
> > > +	struct cxl_region_params *p = &cxlr->params;
> > > +	unsigned int extent_gen_num;
> > > +	int i, rc;
> > > +
> > > +	/* Designed for Non Interleaving flow with the assumption one
> > > +	 * cxl_region will map the complete device DC region's DPA range
> > > +	 */
> > > +	for (i = 0; i < p->nr_targets; i++) {
> > > +		struct cxl_endpoint_decoder *cxled = p->targets[i];
> > > +		struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > > +		struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> > > +
> > > +		rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
> > > +		if (rc < 0)
> > > +			goto err;
> > > +		else if (rc > 1) {
> > > +			rc = cxl_dev_get_dc_extents(mds, rc, 0);
> > > +			if (rc < 0)
> > > +				goto err;
> > > +			mds->num_dc_extents = rc;
> > > +			mds->dc_extents_index = rc - 1;
> > > +		}
> > 
> > Brackets required around both arms of that if/else if statement. 
> > (checkpatch should be telling you that)
> > 
> > How about flipping that and doing the (rc > 1) work first.
> > then the else if, goto err.
> 
> Actually the goto err handles it all.  Just get rid of the 'else'
> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 57f8ec9ef07a..47f94dec47f4 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -186,7 +186,8 @@ static int cxl_region_manage_dc(struct cxl_region *cxlr)
>                 rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
>                 if (rc < 0)
>                         goto err;
> -               else if (rc > 1) {
> +
> +               if (rc > 1) {
>                         rc = cxl_dev_get_dc_extents(mds, rc, 0);
>                         if (rc < 0)
>                                 goto err;
> 
> > 
> > > +		mds->dc_list_gen_num = extent_gen_num;
> > > +		dev_dbg(mds->cxlds.dev, "No of preallocated extents :%d\n", rc);
> > > +	}
> > > +	return 0;
> > > +err:
> > > +	return rc;
> > > +}
> > > +
> > >  static int commit_decoder(struct cxl_decoder *cxld)
> > >  {
> > >  	struct cxl_switch_decoder *cxlsd = NULL;
> > > @@ -2865,11 +2900,14 @@ static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
> > >  		return PTR_ERR(cxlr_dax);
> > >  
> > >  	cxlr_dc = kzalloc(sizeof(*cxlr_dc), GFP_KERNEL);
> > > -	if (!cxlr_dc) {
> > > -		rc = -ENOMEM;
> > > -		goto err;
> > > -	}
> > > +	if (!cxlr_dc)
> > > +		return -ENOMEM;
> > >  
> > > +	rc = request_module("dax_cxl");
> > > +	if (rc) {
> > > +		dev_err(dev, "failed to load dax-ctl module\n");
> > > +		goto load_err;
> > > +	}
> > >  	dev = &cxlr_dax->dev;
> > >  	rc = dev_set_name(dev, "dax_region%d", cxlr->id);
> > >  	if (rc)
> > > @@ -2891,10 +2929,24 @@ static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
> > >  	xa_init(&cxlr_dc->dax_dev_list);
> > >  	cxlr->cxlr_dc = cxlr_dc;
> > >  	rc = devm_add_action_or_reset(&cxlr->dev, cxl_dc_region_release, cxlr);
> > > -	if (!rc)
> > > -		return 0;
> > > +	if (rc)
> > > +		goto err;
> > > +
> > > +	if (!dev->driver) {
> > > +		dev_err(dev, "%s Driver not attached\n", dev_name(dev));
> > > +		rc = -ENXIO;
> > > +		goto err;
> > > +	}
> > > +
> > > +	rc = cxl_region_manage_dc(cxlr);
> > > +	if (rc)
> > > +		goto err;
> > > +
> > > +	return 0;
> > > +
> > >  err:
> > >  	put_device(dev);
> > > +load_err:
> > >  	kfree(cxlr_dc);
> > >  	return rc;
> > >  }
> > > @@ -3076,6 +3128,156 @@ struct cxl_region *cxl_create_region(struct cxl_root_decoder *cxlrd,
> > >  }
> > >  EXPORT_SYMBOL_NS_GPL(cxl_create_region, CXL);
> > >  
> > > +static int match_ep_decoder_by_range(struct device *dev, void *data)
> > > +{
> > > +	struct cxl_endpoint_decoder *cxled;
> > > +	struct range *dpa_range = data;
> > > +
> > > +	if (!is_endpoint_decoder(dev))
> > > +		return 0;
> > > +
> > > +	cxled = to_cxl_endpoint_decoder(dev);
> > > +	if (!cxled->cxld.region)
> > > +		return 0;
> > > +
> > > +	if (cxled->dpa_res->start <= dpa_range->start &&
> > > +				cxled->dpa_res->end >= dpa_range->end)
> > > +		return 1;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +int cxl_release_dc_extent(struct cxl_memdev_state *mds,
> > > +			  struct range *rel_range)
> > > +{
> > > +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> > > +	struct cxl_endpoint_decoder *cxled;
> > > +	struct cxl_dc_region *cxlr_dc;
> > > +	struct dax_region *dax_region;
> > > +	resource_size_t dpa_offset;
> > > +	struct cxl_region *cxlr;
> > > +	struct range hpa_range;
> > > +	struct dev_dax *dev_dax;
> > > +	resource_size_t hpa;
> > > +	struct device *dev;
> > > +	int ranges, rc = 0;
> > > +
> > > +	/*
> > > +	 * Find the cxl endpoind decoder with which has the extent dpa range and
> > > +	 * get the cxl_region, dax_region refrences.
> > > +	 */
> > > +	dev = device_find_child(&cxlmd->endpoint->dev, rel_range,
> > > +				match_ep_decoder_by_range);
> > > +	if (!dev) {
> > > +		dev_err(mds->cxlds.dev, "%pr not mapped\n", rel_range);
> > > +		return PTR_ERR(dev);
> > > +	}
> > > +
> > > +	cxled = to_cxl_endpoint_decoder(dev);
> > > +	hpa_range = cxled->cxld.hpa_range;
> > > +	cxlr = cxled->cxld.region;
> > > +	cxlr_dc = cxlr->cxlr_dc;
> > > +
> > > +	/* DPA to HPA translation */
> > > +	if (cxled->cxld.interleave_ways == 1) {
> > > +		dpa_offset = rel_range->start - cxled->dpa_res->start;
> > > +		hpa = hpa_range.start + dpa_offset;
> > > +	} else {
> > > +		dev_err(mds->cxlds.dev, "Interleaving DC not supported\n");
> > > +		return -EINVAL;
> > > +	}
> > > +
> > > +	dev_dax = xa_load(&cxlr_dc->dax_dev_list, hpa);
> > > +	if (!dev_dax)
> > > +		return -EINVAL;
> > > +
> > > +	dax_region = dev_dax->region;
> > > +	ranges = dev_dax->nr_range;
> > > +
> > > +	while (ranges) {
> > > +		int i = ranges - 1;
> > > +		struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
> > > +
> > > +		devm_release_action(dax_region->dev, unregister_dax_mapping,
> > > +								&mapping->dev);
> > > +		ranges--;
> > > +	}
> > > +
> > > +	dev_dbg(mds->cxlds.dev, "removing devdax device:%s\n",
> > > +						dev_name(&dev_dax->dev));
> > > +	devm_release_action(dax_region->dev, unregister_dev_dax,
> > > +							&dev_dax->dev);
> > > +	xa_erase(&cxlr_dc->dax_dev_list, hpa);
> > > +
> > > +	return rc;
> > > +}
> > > +
> > > +int cxl_add_dc_extent(struct cxl_memdev_state *mds, struct range *alloc_range)
> > > +{
> > > +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> > > +	struct cxl_endpoint_decoder *cxled;
> > > +	struct cxl_dax_region *cxlr_dax;
> > > +	struct cxl_dc_region *cxlr_dc;
> > > +	struct dax_region *dax_region;
> > > +	resource_size_t dpa_offset;
> > > +	struct dev_dax_data data;
> > > +	struct dev_dax *dev_dax;
> > > +	struct cxl_region *cxlr;
> > > +	struct range hpa_range;
> > > +	resource_size_t hpa;
> > > +	struct device *dev;
> > > +	int rc;
> > > +
> > > +	/*
> > > +	 * Find the cxl endpoind decoder with which has the extent dpa range and
> > > +	 * get the cxl_region, dax_region refrences.
> > > +	 */
> > > +	dev = device_find_child(&cxlmd->endpoint->dev, alloc_range,
> > > +				match_ep_decoder_by_range);
> > > +	if (!dev) {
> > > +		dev_err(mds->cxlds.dev, "%pr not mapped\n",	alloc_range);
> > > +		return PTR_ERR(dev);
> > > +	}
> > > +
> > > +	cxled = to_cxl_endpoint_decoder(dev);
> > > +	hpa_range = cxled->cxld.hpa_range;
> > > +	cxlr = cxled->cxld.region;
> > > +	cxlr_dc = cxlr->cxlr_dc;
> > > +	cxlr_dax = cxlr_dc->cxlr_dax;
> > > +	dax_region = dev_get_drvdata(&cxlr_dax->dev);
> > > +
> > > +	/* DPA to HPA translation */
> > > +	if (cxled->cxld.interleave_ways == 1) {
> > > +		dpa_offset = alloc_range->start - cxled->dpa_res->start;
> > > +		hpa = hpa_range.start + dpa_offset;
> > > +	} else {
> > > +		dev_err(mds->cxlds.dev, "Interleaving DC not supported\n");
> > > +		return -EINVAL;
> > > +	}
> > 
> > Hey, I'm running out of steam here,
> 
> :-D
> 
> > but lastly between these last
> > 2 funcs, seems some duplicate code. Is there maybe an opportunity
> > for a common func that can 'add' or 'release' a dc extent?
> 
> Maybe.  I'm too tired to see how this intertwines with
> cxl_handle_dcd_event_records() and cxl_dev_get_dc_extents().  But the returning
> of the range is odd.  Might be ok I think.  But perhaps
> cxl_handle_dcd_event_records() and cxl_dev_get_dc_extents() can issue the
> device_find_child() or something?
> 
> > 
> > 
> > 
> > The end.
> 
> Thanks for looking!
> Ira

-- 
Fan Ni <nifan@outlook.com>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/5] cxl/mem: Add support to handle DCD add and release capacity events.
  2023-06-22 17:01   ` Jonathan Cameron
@ 2023-06-29 15:19     ` Ira Weiny
  0 siblings, 0 replies; 55+ messages in thread
From: Ira Weiny @ 2023-06-29 15:19 UTC (permalink / raw)
  To: Jonathan Cameron, ira.weiny
  Cc: Navneet Singh, Fan Ni, Dan Williams, linux-cxl

Jonathan Cameron wrote:
> On Wed, 14 Jun 2023 12:16:31 -0700
> ira.weiny@intel.com wrote:
> 
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > A dynamic capacity device utilizes events to signal the host about the
> > changes to the allocation of DC blocks. The device communicates the
> > state of these blocks of dynamic capacity through an extent list that
> > describes the starting DPA and length of all blocks the host can access.
> > 
> > Based on the dynamic capacity add or release event type,
> > dynamic memory represented by the extents are either added
> > or removed as devdax device.
> > 
> > Process the dynamic capacity add and release events.
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > 
> Hi,
> 
> I ran out of time today and will be traveling next few weeks (may have
> review time, may not) so sending what I have on basis it might be useful.
> 
> Jonathan
> 
> > +static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
> > +				struct cxl_mbox_dc_response *res,
> > +				int extent_cnt, int opcode)
> > +{
> > +	struct cxl_mbox_cmd mbox_cmd;
> > +	int rc, size;
> > +
> > +	size = struct_size(res, extent_list, extent_cnt);
> > +	res->extent_list_size = cpu_to_le32(extent_cnt);
> > +
> > +	mbox_cmd = (struct cxl_mbox_cmd) {
> > +		.opcode = opcode,
> > +		.size_in = size,
> > +		.payload_in = res,
> > +	};
> > +
> > +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> > +
> > +	return rc;
> return cxl_..

Fixed.

> 
> > +
> > +}
> > +
> > +static int cxl_prepare_ext_list(struct cxl_mbox_dc_response **res,
> > +					int *n, struct range *extent)
> > +{
> > +	struct cxl_mbox_dc_response *dc_res;
> > +	unsigned int size;
> > +
> > +	if (!extent)
> > +		size = struct_size(dc_res, extent_list, 0);
> > +	else
> > +		size = struct_size(dc_res, extent_list, *n + 1);
> > +
> > +	dc_res = krealloc(*res, size, GFP_KERNEL);
> > +	if (!dc_res)
> > +		return -ENOMEM;
> > +
> > +	if (extent) {
> > +		dc_res->extent_list[*n].dpa_start = cpu_to_le64(extent->start);
> > +		memset(dc_res->extent_list[*n].reserved, 0, 8);
> > +		dc_res->extent_list[*n].length =
> > +				cpu_to_le64(range_len(extent));
> > +		(*n)++;
> > +	}
> > +
> > +	*res = dc_res;
> > +	return 0;
> > +}
> blank line.

Already done.

> 
> > +/**
> > + * cxl_handle_dcd_event_records() - Read DCD event records.
> > + * @mds: The memory device state
> 
> >  
> > +int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> > +			      unsigned int *extent_gen_num)
> > +{
> > +	struct device *dev = mds->cxlds.dev;
> > +	struct cxl_mbox_dc_extents *dc_extents;
> > +	struct cxl_mbox_get_dc_extent get_dc_extent;
> > +	unsigned int total_extent_cnt;
> > +	struct cxl_mbox_cmd mbox_cmd;
> > +	int rc;
> > +
> > +	/* Check GET_DC_EXTENT_LIST is supported by device */
> > +	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
> > +		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
> > +		return 0;
> > +	}
> > +
> > +	dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);
> 
> Put it on the stack - length is fixed and small if requesting 0
> extents
> 

Ah yea good idea.

> 
> > +	if (!dc_extents)
> > +		return -ENOMEM;
> > +
> > +	get_dc_extent = (struct cxl_mbox_get_dc_extent) {
> > +		.extent_cnt = 0,
> > +		.start_extent_index = 0,
> > +	};
> > +
> > +	mbox_cmd = (struct cxl_mbox_cmd) {
> > +		.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> > +		.payload_in = &get_dc_extent,
> > +		.size_in = sizeof(get_dc_extent),
> > +		.size_out = mds->payload_size,
> > +		.payload_out = dc_extents,
> > +		.min_out = 1,
> > +	};
> > +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> > +	if (rc < 0)
> > +		goto out;
> > +
> > +	total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> > +	*extent_gen_num = le32_to_cpu(dc_extents->extent_list_num);
> > +	dev_dbg(dev, "Total extent count :%d Extent list Generation Num: %d\n",
> > +			total_extent_cnt, *extent_gen_num);
> > +out:
> > +
> > +	kvfree(dc_extents);
> > +	if (rc < 0)
> > +		return rc;
> > +
> > +	return total_extent_cnt;
> > +
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_dev_get_dc_extent_cnt, CXL);
> 
> 
> 
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > index 144232c8305e..ba45c1c3b0a9 100644
> > --- a/drivers/cxl/core/region.c
> > +++ b/drivers/cxl/core/region.c
> > @@ -1,6 +1,7 @@
> >  // SPDX-License-Identifier: GPL-2.0-only
> >  /* Copyright(c) 2022 Intel Corporation. All rights reserved. */
> >  #include <linux/memregion.h>
> > +#include <linux/interrupt.h>
> >  #include <linux/genalloc.h>
> >  #include <linux/device.h>
> >  #include <linux/module.h>
> > @@ -11,6 +12,8 @@
> >  #include <cxlmem.h>
> >  #include <cxl.h>
> >  #include "core.h"
> > +#include "../../dax/bus.h"
> > +#include "../../dax/dax-private.h"
> >  
> >  /**
> >   * DOC: cxl core region
> > @@ -166,6 +169,38 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
> >  	return 0;
> >  }
> >  
> > +static int cxl_region_manage_dc(struct cxl_region *cxlr)
> > +{
> > +	struct cxl_region_params *p = &cxlr->params;
> > +	unsigned int extent_gen_num;
> > +	int i, rc;
> > +
> > +	/* Designed for Non Interleaving flow with the assumption one
> > +	 * cxl_region will map the complete device DC region's DPA range
> > +	 */
> > +	for (i = 0; i < p->nr_targets; i++) {
> > +		struct cxl_endpoint_decoder *cxled = p->targets[i];
> > +		struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > +		struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> > +
> > +		rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
> > +		if (rc < 0)
> > +			goto err;
> > +		else if (rc > 1) {
> > +			rc = cxl_dev_get_dc_extents(mds, rc, 0);
> > +			if (rc < 0)
> > +				goto err;
> > +			mds->num_dc_extents = rc;
> > +			mds->dc_extents_index = rc - 1;
> > +		}
> > +		mds->dc_list_gen_num = extent_gen_num;
> > +		dev_dbg(mds->cxlds.dev, "No of preallocated extents :%d\n", rc);
> > +	}
> > +	return 0;
> > +err:
> > +	return rc;
> 
> Direct returns easier to review.  

Done.

> 
> > +}
> > +
> >  static int commit_decoder(struct cxl_decoder *cxld)
> >  {
> >  	struct cxl_switch_decoder *cxlsd = NULL;
> > @@ -2865,11 +2900,14 @@ static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
> >  		return PTR_ERR(cxlr_dax);
> >  
> >  	cxlr_dc = kzalloc(sizeof(*cxlr_dc), GFP_KERNEL);
> > -	if (!cxlr_dc) {
> > -		rc = -ENOMEM;
> > -		goto err;
> > -	}
> > +	if (!cxlr_dc)
> > +		return -ENOMEM;
> 
> Curious.  Looks like a bug from earlier.

Actually no.  This is just a bug in this patch.  The put_device() in the
'err' path is still required.

Digging through devm_cxl_add_dc_region() in the previous patch there is
quite a bit of clean up which can be done which I think will make this
next section of code much cleaner.  I'll update this patch after cleaning
up the previous one.

> 
> 
> >  
> > +	rc = request_module("dax_cxl");
> > +	if (rc) {
> > +		dev_err(dev, "failed to load dax-ctl module\n");
> > +		goto load_err;
> > +	}
> >  	dev = &cxlr_dax->dev;
> >  	rc = dev_set_name(dev, "dax_region%d", cxlr->id);
> >  	if (rc)
> > @@ -2891,10 +2929,24 @@ static int devm_cxl_add_dc_region(struct cxl_region *cxlr)
> >  	xa_init(&cxlr_dc->dax_dev_list);
> >  	cxlr->cxlr_dc = cxlr_dc;
> >  	rc = devm_add_action_or_reset(&cxlr->dev, cxl_dc_region_release, cxlr);
> > -	if (!rc)
> > -		return 0;
> > +	if (rc)
> > +		goto err;
> > +
> > +	if (!dev->driver) {
> > +		dev_err(dev, "%s Driver not attached\n", dev_name(dev));
> > +		rc = -ENXIO;
> > +		goto err;
> > +	}
> > +
> > +	rc = cxl_region_manage_dc(cxlr);
> > +	if (rc)
> > +		goto err;
> > +
> > +	return 0;
> > +
> >  err:
> >  	put_device(dev);
> > +load_err:
> >  	kfree(cxlr_dc);
> 
> I've lost track, but seems unlikely we now need to free this in all paths and didn't before. Doesn't
> the cxl_dc_region_Release deal with it?

Yes it does.  I've realized that a large section ~70% of
devm_cxl_add_dc_region() is doing _exactly_ the same thing as
devm_cxl_add_dax_region().  The only think that devm_cxl_add_dc_region()
needs is the cxl_dax_region pointer back to track it.  So I've created a
lead in patch to factor out devm_cxl_add_dax_region() so that it can be
reused in devm_cxl_add_dc_region().  After that this code is much more
straight forward.

> 
> >  	return rc;
> >  }
> > @@ -3076,6 +3128,156 @@ struct cxl_region *cxl_create_region(struct cxl_root_decoder *cxlrd,
> >  }
> >  EXPORT_SYMBOL_NS_GPL(cxl_create_region, CXL);
> 
> 
> 
> > +
> > +int cxl_release_dc_extent(struct cxl_memdev_state *mds,
> > +			  struct range *rel_range)
> > +{
> > +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> > +	struct cxl_endpoint_decoder *cxled;
> > +	struct cxl_dc_region *cxlr_dc;
> > +	struct dax_region *dax_region;
> > +	resource_size_t dpa_offset;
> > +	struct cxl_region *cxlr;
> > +	struct range hpa_range;
> > +	struct dev_dax *dev_dax;
> > +	resource_size_t hpa;
> > +	struct device *dev;
> > +	int ranges, rc = 0;
> 
> 
> 
> 
> 
> 
> > +int cxl_add_dc_extent(struct cxl_memdev_state *mds, struct range *alloc_range)
> > +{
> ...
> 
> > +	/*
> > +	 * Find the cxl endpoind decoder with which has the extent dpa range and
> > +	 * get the cxl_region, dax_region refrences.
> > +	 */
> > +	dev = device_find_child(&cxlmd->endpoint->dev, alloc_range,
> > +				match_ep_decoder_by_range);
> > +	if (!dev) {
> > +		dev_err(mds->cxlds.dev, "%pr not mapped\n",	alloc_range);
> 
> Odd spacing. (Tab?)

Changed already.

> 
> > +		return PTR_ERR(dev);
> > +	}
> > +
> 
> ...
> 
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index 9c0b2fa72bdd..0440b5c04ef6 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
> 
> 
> >  /**
> > @@ -296,6 +298,13 @@ enum cxl_devtype {
> >  #define CXL_MAX_DC_REGION 8
> >  #define CXL_DC_REGION_SRTLEN 8
> >  
> > +struct cxl_dc_extent_data {
> > +	u64 dpa_start;
> > +	u64 length;
> > +	u8 tag[16];
> 
> Define for this length probably makes sense. It's non obvious.

Done.

> 
> > +	u16 shared_extent_seq;
> > +};
> 
> > +
> > +struct cxl_mbox_dc_response {
> > +	__le32 extent_list_size;
> > +	u8 reserved[4];
> > +	struct updated_extent_list {
> > +		__le64 dpa_start;
> > +		__le64 length;
> > +		u8 reserved[8];
> > +	} __packed extent_list[];
> 
> Going to need this in multiple places (e.g. release) so factor out.

I'm having trouble identifying how this is getting used in multiple
places.  But I'm also unsure of how this is working against the spec.

> 
> 
> > +} __packed;
> > +
> 
> 
> 
> > @@ -826,6 +894,14 @@ int cxl_trigger_poison_list(struct cxl_memdev *cxlmd);
> >  int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa);
> >  int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa);
> >  
> > +/* FIXME why not have these be static in mbox.c? */
> 
> :)
> 
> > +int cxl_add_dc_extent(struct cxl_memdev_state *mds, struct range *alloc_range);
> > +int cxl_release_dc_extent(struct cxl_memdev_state *mds, struct range *rel_range);
> > +int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> > +			      unsigned int *extent_gen_num);
> > +int cxl_dev_get_dc_extents(struct cxl_memdev_state *mds, unsigned int cnt,
> > +			   unsigned int index);
> > +
> >  #ifdef CONFIG_CXL_SUSPEND
> >  void cxl_mem_active_inc(void);
> >  void cxl_mem_active_dec(void);
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index ac1a41bc083d..558ffbcb9b34 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -522,8 +522,8 @@ static int cxl_event_req_irq(struct cxl_dev_state *cxlds, u8 setting)
> >  		return irq;
> >  
> >  	return devm_request_threaded_irq(dev, irq, NULL, cxl_event_thread,
> > -					 IRQF_SHARED | IRQF_ONESHOT, NULL,
> > -					 dev_id);
> > +					IRQF_SHARED | IRQF_ONESHOT, NULL,
> > +					dev_id);
> 
> No comment. :)

Fixed.

> 
> >  }
> >  
> >  static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
> > @@ -555,6 +555,7 @@ static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
> >  		.warn_settings = CXL_INT_MSI_MSIX,
> >  		.failure_settings = CXL_INT_MSI_MSIX,
> >  		.fatal_settings = CXL_INT_MSI_MSIX,
> > +		.dyncap_settings = CXL_INT_MSI_MSIX,
> >  	};
> >  
> >  	mbox_cmd = (struct cxl_mbox_cmd) {
> > @@ -608,6 +609,11 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds)
> >  		return rc;
> >  	}
> >  
> > +	rc = cxl_event_req_irq(cxlds, policy.dyncap_settings);
> > +	if (rc) {
> > +		dev_err(cxlds->dev, "Failed to get interrupt for event dc log\n");
> > +		return rc;
> > +	}
> 
> Blank line to maintain existing style.

Done.

> 
> >  	return 0;
> >  }
> >  
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index 227800053309..b2b27033f589 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> > @@ -434,7 +434,7 @@ static void free_dev_dax_ranges(struct dev_dax *dev_dax)
> 
> ...
> 
> > +EXPORT_SYMBOL_GPL(alloc_dev_dax_range);
> > +
> Single blank line seems to be style in this fiel.

Done.

> >  
> >  static int adjust_dev_dax_range(struct dev_dax *dev_dax, struct resource *res, resource_size_t size)
> >  {
> > diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> > index 8cd79ab34292..aa8418c7aead 100644
> > --- a/drivers/dax/bus.h
> > +++ b/drivers/dax/bus.h
> > @@ -47,8 +47,11 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
> >  	__dax_driver_register(driver, THIS_MODULE, KBUILD_MODNAME)
> >  void dax_driver_unregister(struct dax_device_driver *dax_drv);
> >  void kill_dev_dax(struct dev_dax *dev_dax);
> > +void unregister_dev_dax(void *dev);
> > +void unregister_dax_mapping(void *data);
> >  bool static_dev_dax(struct dev_dax *dev_dax);
> > -
> > +int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
> > +					resource_size_t size);
> 
> Keep a blank line here..

Done.

Thanks for the review.  Based on this review, the previous patch review,
and more issues I've found along the way; I'm thinking you should hold off
on this series.

I'm working toward a very new V2; with my SoB line and all the rearch I
think is needed.

Thanks for looking!
Ira

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD)
  2023-06-14 19:16 [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD) ira.weiny
                   ` (6 preceding siblings ...)
  2023-06-15 14:51 ` Ira Weiny
@ 2023-06-29 15:30 ` Ira Weiny
  7 siblings, 0 replies; 55+ messages in thread
From: Ira Weiny @ 2023-06-29 15:30 UTC (permalink / raw)
  To: ira.weiny, Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams,
	linux-cxl

ira.weiny@ wrote:
> I'm submitting these on behalf of Navneet.  There was a round of
> internal discussion which left a few questions but we want to get the
> public discussion going.  A first public preview was posted by Dan.[1]

There has been a lot of review on this series so far.  Thank you!  At this
point a number of issues have been pointed out which is requiring some
extensive and careful rework of the series.

In the interest of saving folks time.  Any further review should focus on
the ABI or other big architectural issues.

The next version should be much improved, mostly due to the feedback we
have gotten thus far.  :-D

Thanks,
Ira

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device
  2023-06-24 13:08     ` Ira Weiny
@ 2023-07-03  2:29       ` Jonathan Cameron
  0 siblings, 0 replies; 55+ messages in thread
From: Jonathan Cameron @ 2023-07-03  2:29 UTC (permalink / raw)
  To: Ira Weiny; +Cc: Navneet Singh, Fan Ni, Dan Williams, linux-cxl

#> > > +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> > > +{
> > > +	struct cxl_dev_state *cxlds = &mds->cxlds;
> > > +	struct device *dev = cxlds->dev;
> > > +	struct cxl_mbox_dynamic_capacity *dc;  
> > 
> > Calling it dc is confusing.  I'd make it clear this is the mailbox
> > response. config_resp or dc_config_res.  
> 
> How about dc_resp?

That works for me as well.

...

> > 
> > 
> > But fun corner.... Mailbox is allowed to be smaller than that (256 bytes min
> > I think) so need to handle multiple reads with different start regions.  
> 
> Oh bother.  :-/
> 
> What are the chances a device is going to only support 256B and DC?  I think
> you are correct though.  I'll add a loop to handle this possibility.
> 
> Anyway I've adjusted the algorithm...  Hopefully it will just loop 1 time.
> 
> > Which reminds me that we need to add support for running out of space
> > in the mailbox to qemu... So far we've just made sure everything fitted :)  
> 
> Might be nice to test stuff.
>

Hmm. I'll try not to forget about it this time, unlike the previous 10 times
I've thought we should fix that :)
 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support.
  2023-06-14 19:16 ` [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support ira.weiny
                     ` (5 preceding siblings ...)
  2023-06-22 16:34   ` Jonathan Cameron
@ 2023-07-05 14:49   ` Davidlohr Bueso
  6 siblings, 0 replies; 55+ messages in thread
From: Davidlohr Bueso @ 2023-07-05 14:49 UTC (permalink / raw)
  To: ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl

On Wed, 14 Jun 2023, ira.weiny@intel.com wrote:

>+config CXL_DCD
>+	bool "CXL: DCD Support"
>+	default CXL_BUS
>+	depends on CXL_REGION
>+	help
>+	  Enable the CXL core to provision CXL DCD regions.
>+	  CXL devices optionally support dynamic capacity and DCD region
>+	  maps the dynamic capacity regions DPA's into Host HPA ranges.
>+
>+	  If unsure say 'y'

Does this really merit another Kconfig option? What are the usecases for
this ever to be shipped as disabled?

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/5] cxl/mem: Add support to handle DCD add and release capacity events.
  2023-06-14 19:16 ` [PATCH 4/5] cxl/mem: Add support to handle DCD add and release capacity events ira.weiny
                     ` (3 preceding siblings ...)
  2023-06-27 18:17   ` Fan Ni
@ 2023-07-13 12:55   ` Jørgen Hansen
  4 siblings, 0 replies; 55+ messages in thread
From: Jørgen Hansen @ 2023-07-13 12:55 UTC (permalink / raw)
  To: ira.weiny
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Dan Williams, linux-cxl



> On 14 Jun 2023, at 21.16, ira.weiny@intel.com wrote:
> 
> From: Navneet Singh <navneet.singh@intel.com>
> 
> A dynamic capacity device utilizes events to signal the host about the
> changes to the allocation of DC blocks. The device communicates the
> state of these blocks of dynamic capacity through an extent list that
> describes the starting DPA and length of all blocks the host can access.
> 
> Based on the dynamic capacity add or release event type,
> dynamic memory represented by the extents are either added
> or removed as devdax device.
> 
> Process the dynamic capacity add and release events.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> 
> ---
> [iweiny: Remove invalid comment]
> ---
> drivers/cxl/core/mbox.c   | 345 +++++++++++++++++++++++++++++++++++++++++++++-
> drivers/cxl/core/region.c | 214 +++++++++++++++++++++++++++-
> drivers/cxl/core/trace.h  |   3 +-
> drivers/cxl/cxl.h         |   4 +-
> drivers/cxl/cxlmem.h      |  76 ++++++++++
> drivers/cxl/pci.c         |  10 +-
> drivers/dax/bus.c         |  11 +-
> drivers/dax/bus.h         |   5 +-
> 8 files changed, 652 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index c5b696737c87..db9295216de5 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> 
> @@ -1244,6 +1452,140 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> 
> +int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> +                             unsigned int *extent_gen_num)
> +{
> +       struct device *dev = mds->cxlds.dev;
> +       struct cxl_mbox_dc_extents *dc_extents;
> +       struct cxl_mbox_get_dc_extent get_dc_extent;
> +       unsigned int total_extent_cnt;
> +       struct cxl_mbox_cmd mbox_cmd;
> +       int rc;
> +
> +       /* Check GET_DC_EXTENT_LIST is supported by device */
> +       if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
> +               dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
> +               return 0;
> +       }
> +
> +       dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);
> +       if (!dc_extents)
> +               return -ENOMEM;
> +
> +       get_dc_extent = (struct cxl_mbox_get_dc_extent) {
> +               .extent_cnt = 0,
> +               .start_extent_index = 0,
> +       };
> +
> +       mbox_cmd = (struct cxl_mbox_cmd) {
> +               .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> +               .payload_in = &get_dc_extent,
> +               .size_in = sizeof(get_dc_extent),
> +               .size_out = mds->payload_size,
> +               .payload_out = dc_extents,
> +               .min_out = 1,
> +       };
> +       rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +       if (rc < 0)
> +               goto out;
> +
> +       total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> +       *extent_gen_num = le32_to_cpu(dc_extents->extent_list_num);
> +       dev_dbg(dev, "Total extent count :%d Extent list Generation Num: %d\n",
> +                       total_extent_cnt, *extent_gen_num);
> +out:
> +
> +       kvfree(dc_extents);
> +       if (rc < 0)
> +               return rc;
> +
> +       return total_extent_cnt;
> +
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_get_dc_extent_cnt, CXL);
> +
> +int cxl_dev_get_dc_extents(struct cxl_memdev_state *mds,
> +                          unsigned int index, unsigned int cnt)
> +{
> +       /* See CXL 3.0 Table 125 dynamic capacity config  Output Payload */
> +       struct device *dev = mds->cxlds.dev;
> +       struct cxl_mbox_dc_extents *dc_extents;
> +       struct cxl_mbox_get_dc_extent get_dc_extent;
> +       unsigned int extent_gen_num, available_extents, total_extent_cnt;
> +       int rc;
> +       struct cxl_dc_extent_data *extent;
> +       struct cxl_mbox_cmd mbox_cmd;
> +       struct range alloc_range;
> +
> +       /* Check GET_DC_EXTENT_LIST is supported by device */
> +       if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
> +               dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
> +               return 0;
> +       }
> +
> +       dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);
> +       if (!dc_extents)
> +               return -ENOMEM;
> +       get_dc_extent = (struct cxl_mbox_get_dc_extent) {
> +               .extent_cnt = cnt,
> +               .start_extent_index = index,
> +       };
> +
> +       mbox_cmd = (struct cxl_mbox_cmd) {
> +               .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> +               .payload_in = &get_dc_extent,
> +               .size_in = sizeof(get_dc_extent),
> +               .size_out = mds->payload_size,
> +               .payload_out = dc_extents,
> +               .min_out = 1,
> +       };
> +       rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +       if (rc < 0)
> +               goto out;
> +
> +       available_extents = le32_to_cpu(dc_extents->ret_extent_cnt);
> +       total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> +       extent_gen_num = le32_to_cpu(dc_extents->extent_list_num);
> +       dev_dbg(dev, "No Total extent count :%d Extent list Generation Num:%d\n",
> +                       total_extent_cnt, extent_gen_num);
> +
> +
> +       for (int i = 0; i < available_extents ; i++) {
> +               extent = devm_kzalloc(dev, sizeof(*extent), GFP_KERNEL);
> +               if (!extent) {
> +                       rc = -ENOMEM;
> +                       goto out;
> +               }
> +               extent->dpa_start = le64_to_cpu(dc_extents->extent[i].start_dpa);
> +               extent->length = le64_to_cpu(dc_extents->extent[i].length);
> +               memcpy(extent->tag, dc_extents->extent[i].tag,
> +                                       sizeof(dc_extents->extent[i].tag));
> +               extent->shared_extent_seq =
> +                               le16_to_cpu(dc_extents->extent[i].shared_extn_seq);
> +               dev_dbg(dev, "dynamic capacity extent[%d] DPA:0x%llx LEN:%llx\n",
> +                               i, extent->dpa_start, extent->length);
> +
> +               alloc_range = (struct range){
> +                       .start = extent->dpa_start,
> +                       .end = extent->dpa_start + extent->length - 1,
> +               };
> +
> +               rc = cxl_add_dc_extent(mds, &alloc_range);
> +               if (rc < 0)
> +                       goto out;
> +               rc = xa_insert(&mds->dc_extent_list, extent->dpa_start, extent,
> +                               GFP_KERNEL);
> +       }
> +
> +out:
> +       kvfree(dc_extents);
> +       if (rc < 0)
> +               return rc;
> +
> +       return available_extents;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_get_dc_extents, CXL);
> +
> static int add_dpa_res(struct device *dev, struct resource *parent,
>                       struct resource *res, resource_size_t start,
>                       resource_size_t size, const char *type)
> @@ -1452,6 +1794,7 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
>        mutex_init(&mds->event.log_lock);
>        mds->cxlds.dev = dev;
>        mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
> +       xa_init(&mds->dc_extent_list);
> 
>        return mds;
> }
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 144232c8305e..ba45c1c3b0a9 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1,6 +1,7 @@
> // SPDX-License-Identifier: GPL-2.0-only
> /* Copyright(c) 2022 Intel Corporation. All rights reserved. */
> #include <linux/memregion.h>
> +#include <linux/interrupt.h>
> #include <linux/genalloc.h>
> #include <linux/device.h>
> #include <linux/module.h>
> @@ -11,6 +12,8 @@
> #include <cxlmem.h>
> #include <cxl.h>
> #include "core.h"
> +#include "../../dax/bus.h"
> +#include "../../dax/dax-private.h"
> 
> /**
>  * DOC: cxl core region
> @@ -166,6 +169,38 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
>        return 0;
> }
> 
> +static int cxl_region_manage_dc(struct cxl_region *cxlr)
> +{
> +       struct cxl_region_params *p = &cxlr->params;
> +       unsigned int extent_gen_num;
> +       int i, rc;
> +
> +       /* Designed for Non Interleaving flow with the assumption one
> +        * cxl_region will map the complete device DC region's DPA range
> +        */
> +       for (i = 0; i < p->nr_targets; i++) {
> +               struct cxl_endpoint_decoder *cxled = p->targets[i];
> +               struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +               struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +
> +               rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
> +               if (rc < 0)
> +                       goto err;
> +               else if (rc > 1) {
> +                       rc = cxl_dev_get_dc_extents(mds, rc, 0);

Hi,

when playing around with DCD, I noticed the following mismatch for cxl_dev_get_dc_extents. In the function
implementation above cxl_dev_get_dc_extents takes index as 2nd parameter and cnt as 3rd, but here they
are swapped so count (rc) is supplied as 2nd parameter and index (0) as 3rd, so this should be:
                      rc = cxl_dev_get_dc_extents(mds, 0, rc);

The prototype in cxlmem.h needs to be updated as well - see further down.

> +                       if (rc < 0)
> +                               goto err;
> +                       mds->num_dc_extents = rc;
> +                       mds->dc_extents_index = rc - 1;
> +               }
> +               mds->dc_list_gen_num = extent_gen_num;
> +               dev_dbg(mds->cxlds.dev, "No of preallocated extents :%d\n", rc);
> +       }
> +       return 0;
> +err:
> +       return rc;
> +}
> +


> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 9c0b2fa72bdd..0440b5c04ef6 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h

> @@ -826,6 +894,14 @@ int cxl_trigger_poison_list(struct cxl_memdev *cxlmd);
> int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa);
> int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa);
> 
> +/* FIXME why not have these be static in mbox.c? */
> +int cxl_add_dc_extent(struct cxl_memdev_state *mds, struct range *alloc_range);
> +int cxl_release_dc_extent(struct cxl_memdev_state *mds, struct range *rel_range);
> +int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> +                             unsigned int *extent_gen_num);
> +int cxl_dev_get_dc_extents(struct cxl_memdev_state *mds, unsigned int cnt,
> +                          unsigned int index);

The 2nd and 3rd parameter for cxl_dev_get_dc_extents should be swapped here as well to match
the actual function.

Thanks,
Jorgen


^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2023-07-13 12:55 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-14 19:16 [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD) ira.weiny
2023-06-14 19:16 ` [PATCH 1/5] cxl/mem : Read Dynamic capacity configuration from the device ira.weiny
2023-06-14 22:53   ` Dave Jiang
2023-06-15 15:04     ` Ira Weiny
2023-06-14 23:49   ` Alison Schofield
2023-06-15 22:46     ` Ira Weiny
2023-06-15 18:30   ` Fan Ni
2023-06-15 19:17     ` Navneet Singh
2023-06-15 21:41   ` Fan Ni
2023-06-22 15:58   ` Jonathan Cameron
2023-06-24 13:08     ` Ira Weiny
2023-07-03  2:29       ` Jonathan Cameron
2023-06-14 19:16 ` [PATCH 2/5] cxl/region: Add dynamic capacity cxl region support ira.weiny
2023-06-14 23:37   ` Dave Jiang
2023-06-15 18:12     ` Ira Weiny
2023-06-15 18:28       ` Dave Jiang
2023-06-16  3:52         ` Navneet Singh
2023-06-15 18:56       ` Navneet Singh
2023-06-15  0:21   ` Alison Schofield
2023-06-16  2:06     ` Ira Weiny
2023-06-16 15:56       ` Alison Schofield
2023-06-16 16:51   ` Alison Schofield
2023-06-21  2:44     ` Ira Weiny
2023-06-20 17:55   ` Fan Ni
2023-06-20 20:33     ` Ira Weiny
2023-06-21  3:13     ` Navneet Singh
2023-06-21 17:20   ` Fan Ni
2023-06-23 18:02     ` Ira Weiny
2023-06-22 16:34   ` Jonathan Cameron
2023-07-05 14:49   ` Davidlohr Bueso
2023-06-14 19:16 ` [PATCH 3/5] cxl/mem : Expose dynamic capacity configuration to userspace ira.weiny
2023-06-15  0:40   ` Alison Schofield
2023-06-16  2:47     ` Ira Weiny
2023-06-16 15:58       ` Dave Jiang
2023-06-20 16:23         ` Ira Weiny
2023-06-20 16:48           ` Dave Jiang
2023-06-15 15:41   ` Dave Jiang
2023-06-14 19:16 ` [PATCH 4/5] cxl/mem: Add support to handle DCD add and release capacity events ira.weiny
2023-06-15  2:19   ` Alison Schofield
2023-06-16  4:11     ` Ira Weiny
2023-06-27 18:20       ` Fan Ni
2023-06-15 16:58   ` Dave Jiang
2023-06-22 17:01   ` Jonathan Cameron
2023-06-29 15:19     ` Ira Weiny
2023-06-27 18:17   ` Fan Ni
2023-07-13 12:55   ` Jørgen Hansen
2023-06-14 19:16 ` [PATCH 5/5] cxl/mem: Trace Dynamic capacity Event Record ira.weiny
2023-06-15 17:08   ` Dave Jiang
2023-06-15  0:56 ` [PATCH 0/5] cxl/dcd: Add support for Dynamic Capacity Devices (DCD) Alison Schofield
2023-06-16  2:57   ` Ira Weiny
2023-06-15 14:51 ` Ira Weiny
2023-06-22 15:07   ` Jonathan Cameron
2023-06-22 16:37     ` Jonathan Cameron
2023-06-27 14:59     ` Ira Weiny
2023-06-29 15:30 ` Ira Weiny

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.