* [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD)
@ 2023-08-29  5:20 Ira Weiny
  2023-08-29  5:20 ` [PATCH RFC v2 01/18] cxl/hdm: Debug, use decoder name function Ira Weiny
                   ` (18 more replies)
  0 siblings, 19 replies; 97+ messages in thread
From: Ira Weiny @ 2023-08-29  5:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

A Dynamic Capacity Device (DCD) (CXL 3.0 spec 9.13.3) is a CXL memory
device that implements dynamic capacity.  The dynamic capacity feature
allows memory capacity to change dynamically, without requiring a device
reset.

Even though this is marked v2 by b4, this is effectively a whole new
series for DCD support.  Quite a bit of the core support was completed
by Navneet in [4].  However, the architecture through the CXL region,
DAX region, and DAX Device layers is completely different.  Particular
attention was paid to:

	1) managing skip resources in the hardware device
	2) ensuring the host OS only sent a release memory mailbox
	   response when all DAX devices are done using an extent
	3) allowing dax devices to span extents
	4) allowing dax devices to use parts of extents
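
Points 2 and 4 above can be sketched roughly as follows.  This is a
hypothetical model, not the kernel API; all names (`demo_extent`,
`demo_extent_put()`, etc.) are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: each hardware extent carries a use count of the
 * DAX devices mapped over (parts of) it, and the release-capacity
 * mailbox response is only issued once the extent is unused. */
struct demo_extent {
	int refs;              /* DAX devices currently using this extent */
	bool release_pending;  /* device has asked for this capacity back */
	bool response_sent;    /* release response issued to the device */
};

static void demo_extent_get(struct demo_extent *ext)
{
	ext->refs++;
}

/* Drop one user; answer a pending release only when nobody uses it. */
static void demo_extent_put(struct demo_extent *ext)
{
	if (--ext->refs == 0 && ext->release_pending)
		ext->response_sent = true;
}

/* Device requested the capacity back; defer the response while in use. */
static void demo_extent_release(struct demo_extent *ext)
{
	ext->release_pending = true;
	if (ext->refs == 0)
		ext->response_sent = true;
}
```

The key property is that a release request received while DAX devices
still hold references is answered only after the last reference drops.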

I could say all of the review comments from v1 are addressed but frankly
the series has changed so much that I can't guarantee anything.

The series continues to be based on the type-2 work posted by Dan.[2]
However, my branch with that work is a bit dated.  Therefore I have
posted this series on github.[5]

Testing was sped up with cxl-test and ndctl DCD support.  A preview of
that work is on github.[6]  In addition, Fan Ni's QEMU DCD series was
used for part of the testing.[3]

The major parts of this series are:

- Get the dynamic capacity (DC) region information from cxl device
- Configure device DC regions reported by hardware
- Enhance CXL and DAX regions for DC
	a. maintain separation between the hardware extents and the CXL
	   region extents to provide for the addition of interleaving in
	   the future.
- Get and maintain the hardware extent lists for each device via an
  initial extent list and DC event records
	a. Add capacity events
	b. Add capacity responses
	c. Release capacity events
	d. Release capacity responses
- Notify region layers of extent changes
- Allow for DAX devices to be created on extents which are surfaced
- Maintain references on extents which are in use
	a. Send a release capacity response only when DAX devices are no
	   longer using the memory
- Allow DAX region extent labels to change to allow for flexibility in
  DAX device creation in the future (further enhancements are required
  to ndctl for this)
- Trace Dynamic Capacity events
- Add cxl-test infrastructure to allow for faster unit testing
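
As a rough illustration of the DPA accounting used when configuring the
DC regions (patch 3): the device's total DPA span covers the static
(ram + pmem) capacity, any untenanted gap, and then the DC regions.
This is a sketch with invented names and numbers, not the kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Rough model of the DPA layout built in cxl_mem_create_range_info():
 * static capacity first, an optional untenanted gap, then the DC
 * regions in increasing DPA order. */
struct demo_dc_region {
	uint64_t base;        /* DPA base of the region */
	uint64_t decode_len;  /* decoded length of the region */
};

static uint64_t demo_total_bytes(uint64_t static_cap,
				 const struct demo_dc_region *dcr, int nr)
{
	uint64_t untenanted = dcr[0].base - static_cap;
	uint64_t dynamic = dcr[nr - 1].base + dcr[nr - 1].decode_len -
			   dcr[0].base;

	return static_cap + untenanted + dynamic;
}
```

Note the total collapses to the end of the last DC region
(last base + last decode length), since the terms telescope.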

To: Dan Williams <dan.j.williams@intel.com>
Cc: Navneet Singh <navneet.singh@intel.com>
Cc: Fan Ni <fan.ni@samsung.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: linux-cxl@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

[1] https://lore.kernel.org/all/64326437c1496_934b2949f@dwillia2-mobl3.amr.corp.intel.com.notmuch/
[2] https://lore.kernel.org/all/168592149709.1948938.8663425987110396027.stgit@dwillia2-xfh.jf.intel.com/
[3] https://lore.kernel.org/all/6483946e8152f_f1132294a2@iweiny-mobl.notmuch/
[4] https://lore.kernel.org/r/20230604-dcd-type2-upstream-v1-0-71b6341bae54@intel.com
[5] https://github.com/weiny2/linux-kernel/commits/dcd-v2-2023-08-28
[6] https://github.com/weiny2/ndctl/tree/dcd-region2

---
Changes in v2:
- iweiny: Complete rework of the entire series
- Link to v1: https://lore.kernel.org/r/20230604-dcd-type2-upstream-v1-0-71b6341bae54@intel.com

---
Ira Weiny (15):
      cxl/hdm: Debug, use decoder name function
      cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
      cxl/region: Add Dynamic Capacity decoder and region modes
      cxl/port: Add Dynamic Capacity mode support to endpoint decoders
      cxl/port: Add Dynamic Capacity size support to endpoint decoders
      cxl/region: Add Dynamic Capacity CXL region support
      cxl/mem: Read extents on memory device discovery
      cxl/mem: Handle DCD add and release capacity events.
      cxl/region: Expose DC extents on region driver load
      cxl/region: Notify regions of DC changes
      dax/bus: Factor out dev dax resize logic
      dax/region: Support DAX device creation on dynamic DAX regions
      tools/testing/cxl: Make event logs dynamic
      tools/testing/cxl: Add DC Regions to mock mem data
      tools/testing/cxl: Add Dynamic Capacity events

Navneet Singh (3):
      cxl/mem: Read Dynamic capacity configuration from the device
      cxl/mem: Expose device dynamic capacity configuration
      cxl/mem: Trace Dynamic capacity Event Record

 Documentation/ABI/testing/sysfs-bus-cxl |  56 ++-
 drivers/cxl/core/core.h                 |   1 +
 drivers/cxl/core/hdm.c                  | 215 ++++++++-
 drivers/cxl/core/mbox.c                 | 646 +++++++++++++++++++++++++-
 drivers/cxl/core/memdev.c               |  77 ++++
 drivers/cxl/core/port.c                 |  19 +
 drivers/cxl/core/region.c               | 418 +++++++++++++++--
 drivers/cxl/core/trace.h                |  65 +++
 drivers/cxl/cxl.h                       |  99 +++-
 drivers/cxl/cxlmem.h                    | 138 +++++-
 drivers/cxl/mem.c                       |  50 ++
 drivers/cxl/pci.c                       |   8 +
 drivers/dax/Makefile                    |   1 +
 drivers/dax/bus.c                       | 263 ++++++++---
 drivers/dax/bus.h                       |   1 +
 drivers/dax/cxl.c                       | 213 ++++++++-
 drivers/dax/dax-private.h               |  61 +++
 drivers/dax/extent.c                    | 133 ++++++
 tools/testing/cxl/test/mem.c            | 782 +++++++++++++++++++++++++++-----
 19 files changed, 3005 insertions(+), 241 deletions(-)
---
base-commit: c76cce37fb6f3796e8e146677ba98d3cca30a488
change-id: 20230604-dcd-type2-upstream-0cd15f6216fd

Best regards,
-- 
Ira Weiny <ira.weiny@intel.com>



* [PATCH RFC v2 01/18] cxl/hdm: Debug, use decoder name function
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
@ 2023-08-29  5:20 ` Ira Weiny
  2023-08-29 14:03   ` Jonathan Cameron
  2023-08-30 20:32   ` Dave Jiang
  2023-08-29  5:20 ` [PATCH RFC v2 02/18] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (17 subsequent siblings)
  18 siblings, 2 replies; 97+ messages in thread
From: Ira Weiny @ 2023-08-29  5:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

The decoder mode enum now has a name conversion function defined.

Use that instead of open coding the name strings.

Suggested-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for v2:
[iweiny: new patch, split out]
---
 drivers/cxl/core/hdm.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index b01a77b67511..a254f79dd4e8 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -550,8 +550,7 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
 
 	if (size > avail) {
 		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
-			cxled->mode == CXL_DECODER_RAM ? "ram" : "pmem",
-			&avail);
+			cxl_decoder_mode_name(cxled->mode), &avail);
 		rc = -ENOSPC;
 		goto out;
 	}

-- 
2.41.0



* [PATCH RFC v2 02/18] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
  2023-08-29  5:20 ` [PATCH RFC v2 01/18] cxl/hdm: Debug, use decoder name function Ira Weiny
@ 2023-08-29  5:20 ` Ira Weiny
  2023-08-29 14:07   ` Jonathan Cameron
                     ` (3 more replies)
  2023-08-29  5:20 ` [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device ira.weiny
                   ` (16 subsequent siblings)
  18 siblings, 4 replies; 97+ messages in thread
From: Ira Weiny @ 2023-08-29  5:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

Per the CXL 3.0 specification, software must check the Command Effects
Log (CEL) to know if a device supports DC.  If the device does support
DC, the specifics of the DC regions (0-7) are read through the mailbox.

Flag DC Device (DCD) commands in a device if they are supported.
Subsequent patches will key off these bits to configure a DCD.
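
The command-set test added below relies on CXL mailbox opcodes encoding
their command set in the high byte; the DCD command set is 0x48, so
opcodes 0x4800-0x48FF all belong to it.  A minimal sketch (names here
are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/* CXL mailbox opcodes group commands by command set in the high byte;
 * 0x48 is the Dynamic Capacity command set. */
#define DEMO_DCD_CMD_SET 0x48

static int demo_is_dcd_opcode(uint16_t opcode)
{
	return (opcode >> 8) == DEMO_DCD_CMD_SET;
}
```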

Co-developed-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for v2
[iweiny: new patch]
---
 drivers/cxl/core/mbox.c | 38 +++++++++++++++++++++++++++++++++++---
 drivers/cxl/cxlmem.h    | 15 +++++++++++++++
 2 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index f052d5f174ee..554ec97a7c39 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -111,6 +111,34 @@ static u8 security_command_sets[] = {
 	0x46, /* Security Passthrough */
 };
 
+static bool cxl_is_dcd_command(u16 opcode)
+{
+#define CXL_MBOX_OP_DCD_CMDS 0x48
+
+	return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
+}
+
+static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
+					u16 opcode)
+{
+	switch (opcode) {
+	case CXL_MBOX_OP_GET_DC_CONFIG:
+		set_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
+		break;
+	case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
+		set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds);
+		break;
+	case CXL_MBOX_OP_ADD_DC_RESPONSE:
+		set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds);
+		break;
+	case CXL_MBOX_OP_RELEASE_DC:
+		set_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
+		break;
+	default:
+		break;
+	}
+}
+
 static bool cxl_is_security_command(u16 opcode)
 {
 	int i;
@@ -677,9 +705,10 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
 		u16 opcode = le16_to_cpu(cel_entry[i].opcode);
 		struct cxl_mem_command *cmd = cxl_mem_find_command(opcode);
 
-		if (!cmd && !cxl_is_poison_command(opcode)) {
-			dev_dbg(dev,
-				"Opcode 0x%04x unsupported by driver\n", opcode);
+		if (!cmd && !cxl_is_poison_command(opcode) &&
+		    !cxl_is_dcd_command(opcode)) {
+			dev_dbg(dev, "Opcode 0x%04x unsupported by driver\n",
+				opcode);
 			continue;
 		}
 
@@ -689,6 +718,9 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
 		if (cxl_is_poison_command(opcode))
 			cxl_set_poison_cmd_enabled(&mds->poison, opcode);
 
+		if (cxl_is_dcd_command(opcode))
+			cxl_set_dcd_cmd_enabled(mds, opcode);
+
 		dev_dbg(dev, "Opcode 0x%04x enabled\n", opcode);
 	}
 }
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index adfba72445fc..5f2e65204bf9 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -247,6 +247,15 @@ struct cxl_event_state {
 	struct mutex log_lock;
 };
 
+/* Device enabled DCD commands */
+enum dcd_cmd_enabled_bits {
+	CXL_DCD_ENABLED_GET_CONFIG,
+	CXL_DCD_ENABLED_GET_EXTENT_LIST,
+	CXL_DCD_ENABLED_ADD_RESPONSE,
+	CXL_DCD_ENABLED_RELEASE,
+	CXL_DCD_ENABLED_MAX
+};
+
 /* Device enabled poison commands */
 enum poison_cmd_enabled_bits {
 	CXL_POISON_ENABLED_LIST,
@@ -436,6 +445,7 @@ struct cxl_dev_state {
  *                (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
  * @mbox_mutex: Mutex to synchronize mailbox access.
  * @firmware_version: Firmware version for the memory device.
+ * @dcd_cmds: List of DCD commands implemented by memory device
  * @enabled_cmds: Hardware commands found enabled in CEL.
  * @exclusive_cmds: Commands that are kernel-internal only
  * @total_bytes: sum of all possible capacities
@@ -460,6 +470,7 @@ struct cxl_memdev_state {
 	size_t lsa_size;
 	struct mutex mbox_mutex; /* Protects device mailbox and firmware */
 	char firmware_version[0x10];
+	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
 	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
 	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
 	u64 total_bytes;
@@ -525,6 +536,10 @@ enum cxl_opcode {
 	CXL_MBOX_OP_UNLOCK		= 0x4503,
 	CXL_MBOX_OP_FREEZE_SECURITY	= 0x4504,
 	CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE	= 0x4505,
+	CXL_MBOX_OP_GET_DC_CONFIG	= 0x4800,
+	CXL_MBOX_OP_GET_DC_EXTENT_LIST	= 0x4801,
+	CXL_MBOX_OP_ADD_DC_RESPONSE	= 0x4802,
+	CXL_MBOX_OP_RELEASE_DC		= 0x4803,
 	CXL_MBOX_OP_MAX			= 0x10000
 };
 

-- 
2.41.0



* [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
  2023-08-29  5:20 ` [PATCH RFC v2 01/18] cxl/hdm: Debug, use decoder name function Ira Weiny
  2023-08-29  5:20 ` [PATCH RFC v2 02/18] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD) Ira Weiny
@ 2023-08-29  5:20 ` ira.weiny
  2023-08-29 14:37   ` Jonathan Cameron
                     ` (4 more replies)
  2023-08-29  5:20 ` [PATCH RFC v2 04/18] cxl/region: Add Dynamic Capacity decoder and region modes Ira Weiny
                   ` (15 subsequent siblings)
  18 siblings, 5 replies; 97+ messages in thread
From: ira.weiny @ 2023-08-29  5:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

From: Navneet Singh <navneet.singh@intel.com>

Devices can optionally support Dynamic Capacity (DC).  These devices
are known as Dynamic Capacity Devices (DCDs).

Implement the DC (opcode 48XXh) mailbox commands as specified in CXL 3.0
section 8.2.9.8.9.  Read the DC configuration and store the DC region
information in the device state.
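
The per-region sanity checks applied while saving each reported DC
region can be sketched as follows.  This is an illustrative model with
invented names, assuming (as the kernel's IS_ALIGNED() bitmask test
also does) a power-of-two block size:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SZ_256M (256ULL << 20)
/* Power-of-two alignment check, same trick as the kernel's IS_ALIGNED(). */
#define DEMO_IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

/* A region is acceptable if its DPA base is 256MB aligned and both its
 * base and length are multiples of the region block size. */
static int demo_dc_region_valid(uint64_t base, uint64_t len, uint64_t blk)
{
	if (!DEMO_IS_ALIGNED(base, DEMO_SZ_256M))
		return 0;
	return DEMO_IS_ALIGNED(base, blk) && DEMO_IS_ALIGNED(len, blk);
}
```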

Co-developed-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for v2
[iweiny: Rebased to latest master type2 work]
[jonathan: s/dc/dc_resp/]
[iweiny: Clean up commit message]
[iweiny: Clean kernel docs]
[djiang: Fix up cxl_is_dcd_command]
[djiang: extra blank line]
[alison: s/total_capacity/cap/ etc...]
[alison: keep partition flag with partition structures]
[alison: reformat untenanted_mem declaration]
[alison: move 'cmd' definition back]
[alison: fix comment line length]
[alison: reverse x-tree]
[jonathan: fix and adjust CXL_DC_REGION_STRLEN]
[Jonathan/iweiny: Factor out storing each DC region read from the device]
[Jonathan: place all dcr initializers together]
[Jonathan/iweiny: flip around the region DPA order check]
[jonathan: Account for short read of mailbox command]
[iweiny: use snprintf for region name]
[iweiny: use '<nil>' for missing region names]
[iweiny: factor out struct cxl_dc_region_info]
[iweiny: Split out reading CEL]
---
 drivers/cxl/core/mbox.c   | 179 +++++++++++++++++++++++++++++++++++++++++++++-
 drivers/cxl/core/region.c |  75 +++++++++++++------
 drivers/cxl/cxl.h         |  27 ++++++-
 drivers/cxl/cxlmem.h      |  55 +++++++++++++-
 drivers/cxl/pci.c         |   4 ++
 5 files changed, 314 insertions(+), 26 deletions(-)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 554ec97a7c39..d769814f80e2 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1096,7 +1096,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
 	if (rc < 0)
 		return rc;
 
-	mds->total_bytes =
+	mds->static_cap =
 		le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
 	mds->volatile_only_bytes =
 		le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
@@ -1114,6 +1114,8 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
 		mds->poison.max_errors = min_t(u32, val, CXL_POISON_LIST_MAX);
 	}
 
+	mds->dc_event_log_size = le16_to_cpu(id.dc_event_log_size);
+
 	return 0;
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dev_state_identify, CXL);
@@ -1178,6 +1180,165 @@ int cxl_mem_sanitize(struct cxl_memdev_state *mds, u16 cmd)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_mem_sanitize, CXL);
 
+static int cxl_dc_save_region_info(struct cxl_memdev_state *mds, int index,
+				   struct cxl_dc_region_config *region_config)
+{
+	struct cxl_dc_region_info *dcr = &mds->dc_region[index];
+	struct device *dev = mds->cxlds.dev;
+
+	dcr->base = le64_to_cpu(region_config->region_base);
+	dcr->decode_len = le64_to_cpu(region_config->region_decode_length);
+	dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
+	dcr->len = le64_to_cpu(region_config->region_length);
+	dcr->blk_size = le64_to_cpu(region_config->region_block_size);
+	dcr->dsmad_handle = le32_to_cpu(region_config->region_dsmad_handle);
+	dcr->flags = region_config->flags;
+	snprintf(dcr->name, CXL_DC_REGION_STRLEN, "dc%d", index);
+
+	/* Check regions are in increasing DPA order */
+	if (index > 0) {
+		struct cxl_dc_region_info *prev_dcr = &mds->dc_region[index - 1];
+
+		if ((prev_dcr->base + prev_dcr->decode_len) > dcr->base) {
+			dev_err(dev,
+				"DPA ordering violation for DC region %d and %d\n",
+				index - 1, index);
+			return -EINVAL;
+		}
+	}
+
+	/* Check the region is 256 MB aligned */
+	if (!IS_ALIGNED(dcr->base, SZ_256M)) {
+		dev_err(dev, "DC region %d not aligned to 256MB: %#llx\n",
+			index, dcr->base);
+		return -EINVAL;
+	}
+
+	/* Check Region base and length are aligned to block size */
+	if (!IS_ALIGNED(dcr->base, dcr->blk_size) ||
+	    !IS_ALIGNED(dcr->len, dcr->blk_size)) {
+		dev_err(dev, "DC region %d not aligned to %#llx\n", index,
+			dcr->blk_size);
+		return -EINVAL;
+	}
+
+	dev_dbg(dev,
+		"DC region %s DPA: %#llx LEN: %#llx BLKSZ: %#llx\n",
+		dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
+
+	return 0;
+}
+
+/* Returns the number of regions in dc_resp or -ERRNO */
+static int cxl_get_dc_id(struct cxl_memdev_state *mds, u8 start_region,
+			 struct cxl_mbox_dynamic_capacity *dc_resp,
+			 size_t dc_resp_size)
+{
+	struct cxl_mbox_get_dc_config get_dc = (struct cxl_mbox_get_dc_config) {
+		.region_count = CXL_MAX_DC_REGION,
+		.start_region_index = start_region,
+	};
+	struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
+		.opcode = CXL_MBOX_OP_GET_DC_CONFIG,
+		.payload_in = &get_dc,
+		.size_in = sizeof(get_dc),
+		.size_out = dc_resp_size,
+		.payload_out = dc_resp,
+		.min_out = 1,
+	};
+	struct device *dev = mds->cxlds.dev;
+	int rc;
+
+	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
+	if (rc < 0)
+		return rc;
+
+	rc = dc_resp->avail_region_count - start_region;
+
+	/*
+	 * The number of regions in the payload may have been truncated due to
+	 * payload_size limits; if so adjust the count in this query.
+	 */
+	if (mbox_cmd.size_out < sizeof(*dc_resp))
+		rc = CXL_REGIONS_RETURNED(mbox_cmd.size_out);
+
+	dev_dbg(dev, "Read %d/%d DC regions\n", rc, dc_resp->avail_region_count);
+
+	return rc;
+}
+
+/**
+ * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
+ *					 information from the device.
+ * @mds: The memory device state
+ *
+ * This will dispatch the get_dynamic_capacity command to the device
+ * and on success populate structures to be exported to sysfs.
+ *
+ * Return: 0 if identify was executed successfully, -ERRNO on error.
+ */
+int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
+{
+	struct cxl_mbox_dynamic_capacity *dc_resp;
+	struct device *dev = mds->cxlds.dev;
+	size_t dc_resp_size = mds->payload_size;
+	u8 start_region;
+	int i, rc = 0;
+
+	for (i = 0; i < CXL_MAX_DC_REGION; i++)
+		snprintf(mds->dc_region[i].name, CXL_DC_REGION_STRLEN, "<nil>");
+
+	/* Check GET_DC_CONFIG is supported by device */
+	if (!test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds)) {
+		dev_dbg(dev, "unsupported cmd: get_dynamic_capacity_config\n");
+		return 0;
+	}
+
+	dc_resp = kvmalloc(dc_resp_size, GFP_KERNEL);
+	if (!dc_resp)
+		return -ENOMEM;
+
+	start_region = 0;
+	do {
+		int j;
+
+		rc = cxl_get_dc_id(mds, start_region, dc_resp, dc_resp_size);
+		if (rc < 0)
+			goto free_resp;
+
+		mds->nr_dc_region += rc;
+
+		if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
+			dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
+				mds->nr_dc_region);
+			rc = -EINVAL;
+			goto free_resp;
+		}
+
+		for (i = start_region, j = 0; i < mds->nr_dc_region; i++, j++) {
+			rc = cxl_dc_save_region_info(mds, i, &dc_resp->region[j]);
+			if (rc)
+				goto free_resp;
+		}
+
+		start_region = mds->nr_dc_region;
+
+	} while (mds->nr_dc_region < dc_resp->avail_region_count);
+
+	mds->dynamic_cap =
+		mds->dc_region[mds->nr_dc_region - 1].base +
+		mds->dc_region[mds->nr_dc_region - 1].decode_len -
+		mds->dc_region[0].base;
+	dev_dbg(dev, "Total dynamic capacity: %#llx\n", mds->dynamic_cap);
+
+free_resp:
+	kvfree(dc_resp);
+	if (rc)
+		dev_err(dev, "Failed to get DC info: %d\n", rc);
+	return rc;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
+
 static int add_dpa_res(struct device *dev, struct resource *parent,
 		       struct resource *res, resource_size_t start,
 		       resource_size_t size, const char *type)
@@ -1208,8 +1369,12 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
 {
 	struct cxl_dev_state *cxlds = &mds->cxlds;
 	struct device *dev = cxlds->dev;
+	size_t untenanted_mem;
 	int rc;
 
+	untenanted_mem = mds->dc_region[0].base - mds->static_cap;
+	mds->total_bytes = mds->static_cap + untenanted_mem + mds->dynamic_cap;
+
 	if (!cxlds->media_ready) {
 		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
 		cxlds->ram_res = DEFINE_RES_MEM(0, 0);
@@ -1217,8 +1382,16 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
 		return 0;
 	}
 
-	cxlds->dpa_res =
-		(struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
+	cxlds->dpa_res = (struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
+
+	for (int i = 0; i < mds->nr_dc_region; i++) {
+		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
+
+		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
+				 dcr->base, dcr->decode_len, dcr->name);
+		if (rc)
+			return rc;
+	}
 
 	if (mds->partition_align_bytes == 0) {
 		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 252bc8e1f103..75041903b72c 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -46,7 +46,7 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
 	rc = down_read_interruptible(&cxl_region_rwsem);
 	if (rc)
 		return rc;
-	if (cxlr->mode != CXL_DECODER_PMEM)
+	if (cxlr->mode != CXL_REGION_PMEM)
 		rc = sysfs_emit(buf, "\n");
 	else
 		rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
@@ -359,7 +359,7 @@ static umode_t cxl_region_visible(struct kobject *kobj, struct attribute *a,
 	 * Support tooling that expects to find a 'uuid' attribute for all
 	 * regions regardless of mode.
 	 */
-	if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_DECODER_PMEM)
+	if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_REGION_PMEM)
 		return 0444;
 	return a->mode;
 }
@@ -537,7 +537,7 @@ static ssize_t mode_show(struct device *dev, struct device_attribute *attr,
 {
 	struct cxl_region *cxlr = to_cxl_region(dev);
 
-	return sysfs_emit(buf, "%s\n", cxl_decoder_mode_name(cxlr->mode));
+	return sysfs_emit(buf, "%s\n", cxl_region_mode_name(cxlr->mode));
 }
 static DEVICE_ATTR_RO(mode);
 
@@ -563,7 +563,7 @@ static int alloc_hpa(struct cxl_region *cxlr, resource_size_t size)
 
 	/* ways, granularity and uuid (if PMEM) need to be set before HPA */
 	if (!p->interleave_ways || !p->interleave_granularity ||
-	    (cxlr->mode == CXL_DECODER_PMEM && uuid_is_null(&p->uuid)))
+	    (cxlr->mode == CXL_REGION_PMEM && uuid_is_null(&p->uuid)))
 		return -ENXIO;
 
 	div_u64_rem(size, SZ_256M * p->interleave_ways, &remainder);
@@ -1765,6 +1765,17 @@ static int cxl_region_sort_targets(struct cxl_region *cxlr)
 	return rc;
 }
 
+static bool cxl_modes_compatible(enum cxl_region_mode rmode,
+				 enum cxl_decoder_mode dmode)
+{
+	if (rmode == CXL_REGION_RAM && dmode == CXL_DECODER_RAM)
+		return true;
+	if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
+		return true;
+
+	return false;
+}
+
 static int cxl_region_attach(struct cxl_region *cxlr,
 			     struct cxl_endpoint_decoder *cxled, int pos)
 {
@@ -1778,9 +1789,11 @@ static int cxl_region_attach(struct cxl_region *cxlr,
 	lockdep_assert_held_write(&cxl_region_rwsem);
 	lockdep_assert_held_read(&cxl_dpa_rwsem);
 
-	if (cxled->mode != cxlr->mode) {
-		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
-			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
+	if (!cxl_modes_compatible(cxlr->mode, cxled->mode)) {
+		dev_dbg(&cxlr->dev, "%s region mode: %s mismatch decoder: %s\n",
+			dev_name(&cxled->cxld.dev),
+			cxl_region_mode_name(cxlr->mode),
+			cxl_decoder_mode_name(cxled->mode));
 		return -EINVAL;
 	}
 
@@ -2234,7 +2247,7 @@ static struct cxl_region *cxl_region_alloc(struct cxl_root_decoder *cxlrd, int i
  * devm_cxl_add_region - Adds a region to a decoder
  * @cxlrd: root decoder
  * @id: memregion id to create, or memregion_free() on failure
- * @mode: mode for the endpoint decoders of this region
+ * @mode: mode of this region
  * @type: select whether this is an expander or accelerator (type-2 or type-3)
  *
  * This is the second step of region initialization. Regions exist within an
@@ -2245,7 +2258,7 @@ static struct cxl_region *cxl_region_alloc(struct cxl_root_decoder *cxlrd, int i
  */
 static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
 					      int id,
-					      enum cxl_decoder_mode mode,
+					      enum cxl_region_mode mode,
 					      enum cxl_decoder_type type)
 {
 	struct cxl_port *port = to_cxl_port(cxlrd->cxlsd.cxld.dev.parent);
@@ -2254,11 +2267,12 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
 	int rc;
 
 	switch (mode) {
-	case CXL_DECODER_RAM:
-	case CXL_DECODER_PMEM:
+	case CXL_REGION_RAM:
+	case CXL_REGION_PMEM:
 		break;
 	default:
-		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
+		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
+			cxl_region_mode_name(mode));
 		return ERR_PTR(-EINVAL);
 	}
 
@@ -2308,7 +2322,7 @@ static ssize_t create_ram_region_show(struct device *dev,
 }
 
 static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
-					  int id, enum cxl_decoder_mode mode,
+					  int id, enum cxl_region_mode mode,
 					  enum cxl_decoder_type type)
 {
 	int rc;
@@ -2337,7 +2351,7 @@ static ssize_t create_pmem_region_store(struct device *dev,
 	if (rc != 1)
 		return -EINVAL;
 
-	cxlr = __create_region(cxlrd, id, CXL_DECODER_PMEM,
+	cxlr = __create_region(cxlrd, id, CXL_REGION_PMEM,
 			       CXL_DECODER_HOSTONLYMEM);
 	if (IS_ERR(cxlr))
 		return PTR_ERR(cxlr);
@@ -2358,7 +2372,7 @@ static ssize_t create_ram_region_store(struct device *dev,
 	if (rc != 1)
 		return -EINVAL;
 
-	cxlr = __create_region(cxlrd, id, CXL_DECODER_RAM,
+	cxlr = __create_region(cxlrd, id, CXL_REGION_RAM,
 			       CXL_DECODER_HOSTONLYMEM);
 	if (IS_ERR(cxlr))
 		return PTR_ERR(cxlr);
@@ -2886,10 +2900,31 @@ static void construct_region_end(void)
 	up_write(&cxl_region_rwsem);
 }
 
+static enum cxl_region_mode
+cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
+{
+	switch (mode) {
+	case CXL_DECODER_NONE:
+		return CXL_REGION_NONE;
+	case CXL_DECODER_RAM:
+		return CXL_REGION_RAM;
+	case CXL_DECODER_PMEM:
+		return CXL_REGION_PMEM;
+	case CXL_DECODER_DEAD:
+		return CXL_REGION_DEAD;
+	case CXL_DECODER_MIXED:
+	default:
+		return CXL_REGION_MIXED;
+	}
+
+	return CXL_REGION_MIXED;
+}
+
 static struct cxl_region *
 construct_region_begin(struct cxl_root_decoder *cxlrd,
 		       struct cxl_endpoint_decoder *cxled)
 {
+	enum cxl_region_mode mode = cxl_decoder_to_region_mode(cxled->mode);
 	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
 	struct cxl_region_params *p;
 	struct cxl_region *cxlr;
@@ -2897,7 +2932,7 @@ construct_region_begin(struct cxl_root_decoder *cxlrd,
 
 	do {
 		cxlr = __create_region(cxlrd, atomic_read(&cxlrd->region_id),
-				       cxled->mode, cxled->cxld.target_type);
+				       mode, cxled->cxld.target_type);
 	} while (IS_ERR(cxlr) && PTR_ERR(cxlr) == -EBUSY);
 
 	if (IS_ERR(cxlr)) {
@@ -3200,9 +3235,9 @@ static int cxl_region_probe(struct device *dev)
 		return rc;
 
 	switch (cxlr->mode) {
-	case CXL_DECODER_PMEM:
+	case CXL_REGION_PMEM:
 		return devm_cxl_add_pmem_region(cxlr);
-	case CXL_DECODER_RAM:
+	case CXL_REGION_RAM:
 		/*
 		 * The region can not be manged by CXL if any portion of
 		 * it is already online as 'System RAM'
@@ -3223,8 +3258,8 @@ static int cxl_region_probe(struct device *dev)
 		/* HDM-H routes to device-dax */
 		return devm_cxl_add_dax_region(cxlr);
 	default:
-		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
-			cxlr->mode);
+		dev_dbg(&cxlr->dev, "unsupported region mode: %s\n",
+			cxl_region_mode_name(cxlr->mode));
 		return -ENXIO;
 	}
 }
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index cd4a9ffdacc7..ed282dcd5cf5 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -374,6 +374,28 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
 	return "mixed";
 }
 
+enum cxl_region_mode {
+	CXL_REGION_NONE,
+	CXL_REGION_RAM,
+	CXL_REGION_PMEM,
+	CXL_REGION_MIXED,
+	CXL_REGION_DEAD,
+};
+
+static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
+{
+	static const char * const names[] = {
+		[CXL_REGION_NONE] = "none",
+		[CXL_REGION_RAM] = "ram",
+		[CXL_REGION_PMEM] = "pmem",
+		[CXL_REGION_MIXED] = "mixed",
+	};
+
+	if (mode >= CXL_REGION_NONE && mode <= CXL_REGION_MIXED)
+		return names[mode];
+	return "mixed";
+}
+
 /*
  * Track whether this decoder is reserved for region autodiscovery, or
  * free for userspace provisioning.
@@ -502,7 +524,8 @@ struct cxl_region_params {
  * struct cxl_region - CXL region
  * @dev: This region's device
  * @id: This region's id. Id is globally unique across all regions
- * @mode: Endpoint decoder allocation / access mode
+ * @mode: Region mode which defines which endpoint decoder mode the region is
+ *        compatible with
  * @type: Endpoint decoder target type
  * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
  * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
@@ -512,7 +535,7 @@ struct cxl_region_params {
 struct cxl_region {
 	struct device dev;
 	int id;
-	enum cxl_decoder_mode mode;
+	enum cxl_region_mode mode;
 	enum cxl_decoder_type type;
 	struct cxl_nvdimm_bridge *cxl_nvb;
 	struct cxl_pmem_region *cxlr_pmem;
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 5f2e65204bf9..8c8f47b397ab 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -396,6 +396,7 @@ enum cxl_devtype {
 	CXL_DEVTYPE_CLASSMEM,
 };
 
+#define CXL_MAX_DC_REGION 8
 /**
  * struct cxl_dev_state - The driver device state
  *
@@ -412,6 +413,8 @@ enum cxl_devtype {
  * @dpa_res: Overall DPA resource tree for the device
  * @pmem_res: Active Persistent memory capacity configuration
  * @ram_res: Active Volatile memory capacity configuration
+ * @dc_res: Active Dynamic Capacity memory configuration for each possible
+ *          region
  * @component_reg_phys: register base of component registers
  * @serial: PCIe Device Serial Number
  * @type: Generic Memory Class device or Vendor Specific Memory device
@@ -426,11 +429,23 @@ struct cxl_dev_state {
 	struct resource dpa_res;
 	struct resource pmem_res;
 	struct resource ram_res;
+	struct resource dc_res[CXL_MAX_DC_REGION];
 	resource_size_t component_reg_phys;
 	u64 serial;
 	enum cxl_devtype type;
 };
 
+#define CXL_DC_REGION_STRLEN 7
+struct cxl_dc_region_info {
+	u64 base;
+	u64 decode_len;
+	u64 len;
+	u64 blk_size;
+	u32 dsmad_handle;
+	u8 flags;
+	u8 name[CXL_DC_REGION_STRLEN];
+};
+
 /**
  * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
  *
@@ -449,6 +464,8 @@ struct cxl_dev_state {
  * @enabled_cmds: Hardware commands found enabled in CEL.
  * @exclusive_cmds: Commands that are kernel-internal only
  * @total_bytes: sum of all possible capacities
+ * @static_cap: Sum of RAM and PMEM capacities
+ * @dynamic_cap: Complete DPA range occupied by DC regions
  * @volatile_only_bytes: hard volatile capacity
  * @persistent_only_bytes: hard persistent capacity
  * @partition_align_bytes: alignment size for partition-able capacity
@@ -456,6 +473,10 @@ struct cxl_dev_state {
  * @active_persistent_bytes: sum of hard + soft persistent
  * @next_volatile_bytes: volatile capacity change pending device reset
  * @next_persistent_bytes: persistent capacity change pending device reset
+ * @nr_dc_region: number of DC regions implemented in the memory device
+ * @dc_region: array containing info about the DC regions
+ * @dc_event_log_size: The number of events the device can store in the
+ * Dynamic Capacity Event Log before it overflows
  * @event: event log driver state
  * @poison: poison driver state info
  * @fw: firmware upload / activation state
@@ -473,7 +494,10 @@ struct cxl_memdev_state {
 	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
 	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
 	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
+
 	u64 total_bytes;
+	u64 static_cap;
+	u64 dynamic_cap;
 	u64 volatile_only_bytes;
 	u64 persistent_only_bytes;
 	u64 partition_align_bytes;
@@ -481,6 +505,11 @@ struct cxl_memdev_state {
 	u64 active_persistent_bytes;
 	u64 next_volatile_bytes;
 	u64 next_persistent_bytes;
+
+	u8 nr_dc_region;
+	struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
+	size_t dc_event_log_size;
+
 	struct cxl_event_state event;
 	struct cxl_poison_state poison;
 	struct cxl_security_state security;
@@ -587,6 +616,7 @@ struct cxl_mbox_identify {
 	__le16 inject_poison_limit;
 	u8 poison_caps;
 	u8 qos_telemetry_caps;
+	__le16 dc_event_log_size;
 } __packed;
 
 /*
@@ -741,9 +771,31 @@ struct cxl_mbox_set_partition_info {
 	__le64 volatile_capacity;
 	u8 flags;
 } __packed;
-
 #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
 
+struct cxl_mbox_get_dc_config {
+	u8 region_count;
+	u8 start_region_index;
+} __packed;
+
+/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
+struct cxl_mbox_dynamic_capacity {
+	u8 avail_region_count;
+	u8 rsvd[7];
+	struct cxl_dc_region_config {
+		__le64 region_base;
+		__le64 region_decode_length;
+		__le64 region_length;
+		__le64 region_block_size;
+		__le32 region_dsmad_handle;
+		u8 flags;
+		u8 rsvd[3];
+	} __packed region[];
+} __packed;
+#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
+#define CXL_REGIONS_RETURNED(size_out) \
+	((size_out - 8) / sizeof(struct cxl_dc_region_config))
+
 /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
 struct cxl_mbox_set_timestamp_in {
 	__le64 timestamp;
@@ -867,6 +919,7 @@ enum {
 int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
 			  struct cxl_mbox_cmd *cmd);
 int cxl_dev_state_identify(struct cxl_memdev_state *mds);
+int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
 int cxl_await_media_ready(struct cxl_dev_state *cxlds);
 int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
 int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 5242dbf0044d..a9b110ff1176 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -879,6 +879,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (rc)
 		return rc;
 
+	rc = cxl_dev_dynamic_capacity_identify(mds);
+	if (rc)
+		return rc;
+
 	rc = cxl_mem_create_range_info(mds);
 	if (rc)
 		return rc;

-- 
2.41.0


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH RFC v2 04/18] cxl/region: Add Dynamic Capacity decoder and region modes
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (2 preceding siblings ...)
  2023-08-29  5:20 ` [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device ira.weiny
@ 2023-08-29  5:20 ` Ira Weiny
  2023-08-29 14:39   ` Jonathan Cameron
                     ` (2 more replies)
  2023-08-29  5:20 ` [PATCH RFC v2 05/18] cxl/port: Add Dynamic Capacity mode support to endpoint decoders Ira Weiny
                   ` (14 subsequent siblings)
  18 siblings, 3 replies; 97+ messages in thread
From: Ira Weiny @ 2023-08-29  5:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

Both regions and decoders will need a new mode to reflect the new type
of partition they target on a device.  A region's mode reflects the
generic dynamic capacity type, which may map any of the device's
Dynamic Capacity (DC) Regions.  A decoder's mode reflects one specific
DC Region.

Define the new modes for use in subsequent patches, along with their
associated helper functions.

Co-developed-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for v2:
[iweiny: split out from: Add dynamic capacity cxl region support.]
---
 drivers/cxl/core/region.c |  4 ++++
 drivers/cxl/cxl.h         | 23 +++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 75041903b72c..69af1354bc5b 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1772,6 +1772,8 @@ static bool cxl_modes_compatible(enum cxl_region_mode rmode,
 		return true;
 	if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
 		return true;
+	if (rmode == CXL_REGION_DC && cxl_decoder_mode_is_dc(dmode))
+		return true;
 
 	return false;
 }
@@ -2912,6 +2914,8 @@ cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
 		return CXL_REGION_PMEM;
 	case CXL_DECODER_DEAD:
 		return CXL_REGION_DEAD;
+	case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
+		return CXL_REGION_DC;
 	case CXL_DECODER_MIXED:
 	default:
 		return CXL_REGION_MIXED;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index ed282dcd5cf5..d41f3f14fbe3 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -356,6 +356,14 @@ enum cxl_decoder_mode {
 	CXL_DECODER_NONE,
 	CXL_DECODER_RAM,
 	CXL_DECODER_PMEM,
+	CXL_DECODER_DC0,
+	CXL_DECODER_DC1,
+	CXL_DECODER_DC2,
+	CXL_DECODER_DC3,
+	CXL_DECODER_DC4,
+	CXL_DECODER_DC5,
+	CXL_DECODER_DC6,
+	CXL_DECODER_DC7,
 	CXL_DECODER_MIXED,
 	CXL_DECODER_DEAD,
 };
@@ -366,6 +374,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
 		[CXL_DECODER_NONE] = "none",
 		[CXL_DECODER_RAM] = "ram",
 		[CXL_DECODER_PMEM] = "pmem",
+		[CXL_DECODER_DC0] = "dc0",
+		[CXL_DECODER_DC1] = "dc1",
+		[CXL_DECODER_DC2] = "dc2",
+		[CXL_DECODER_DC3] = "dc3",
+		[CXL_DECODER_DC4] = "dc4",
+		[CXL_DECODER_DC5] = "dc5",
+		[CXL_DECODER_DC6] = "dc6",
+		[CXL_DECODER_DC7] = "dc7",
 		[CXL_DECODER_MIXED] = "mixed",
 	};
 
@@ -374,10 +390,16 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
 	return "mixed";
 }
 
+static inline bool cxl_decoder_mode_is_dc(enum cxl_decoder_mode mode)
+{
+	return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
+}
+
 enum cxl_region_mode {
 	CXL_REGION_NONE,
 	CXL_REGION_RAM,
 	CXL_REGION_PMEM,
+	CXL_REGION_DC,
 	CXL_REGION_MIXED,
 	CXL_REGION_DEAD,
 };
@@ -388,6 +410,7 @@ static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
 		[CXL_REGION_NONE] = "none",
 		[CXL_REGION_RAM] = "ram",
 		[CXL_REGION_PMEM] = "pmem",
+		[CXL_REGION_DC] = "dc",
 		[CXL_REGION_MIXED] = "mixed",
 	};
 

-- 
2.41.0


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH RFC v2 05/18] cxl/port: Add Dynamic Capacity mode support to endpoint decoders
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (3 preceding siblings ...)
  2023-08-29  5:20 ` [PATCH RFC v2 04/18] cxl/region: Add Dynamic Capacity decoder and region modes Ira Weiny
@ 2023-08-29  5:20 ` Ira Weiny
  2023-08-29 14:49   ` Jonathan Cameron
  2023-08-31 17:25   ` Fan Ni
  2023-08-29  5:20 ` [PATCH RFC v2 06/18] cxl/port: Add Dynamic Capacity size " Ira Weiny
                   ` (13 subsequent siblings)
  18 siblings, 2 replies; 97+ messages in thread
From: Ira Weiny @ 2023-08-29  5:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

Endpoint decoders used to map Dynamic Capacity must be configured to
point to the correct Dynamic Capacity (DC) Region.  The decoder mode
currently represents the partition the decoder points to, such as ram
or pmem.

Expand the mode to include DC Regions.

Co-developed-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for v2:
[iweiny: split from region creation patch]
---
 Documentation/ABI/testing/sysfs-bus-cxl | 19 ++++++++++---------
 drivers/cxl/core/hdm.c                  | 24 ++++++++++++++++++++++++
 drivers/cxl/core/port.c                 | 16 ++++++++++++++++
 3 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 6350dd82b9a9..2268ffcdb604 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -257,22 +257,23 @@ Description:
 
 What:		/sys/bus/cxl/devices/decoderX.Y/mode
 Date:		May, 2022
-KernelVersion:	v6.0
+KernelVersion:	v6.0, v6.6 (dcY)
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
 		translates from a host physical address range, to a device local
 		address range. Device-local address ranges are further split
-		into a 'ram' (volatile memory) range and 'pmem' (persistent
-		memory) range. The 'mode' attribute emits one of 'ram', 'pmem',
-		'mixed', or 'none'. The 'mixed' indication is for error cases
-		when a decoder straddles the volatile/persistent partition
-		boundary, and 'none' indicates the decoder is not actively
-		decoding, or no DPA allocation policy has been set.
+		into a 'ram' (volatile memory) range, 'pmem' (persistent
+		memory) range, or Dynamic Capacity (DC) range. The 'mode'
+		attribute emits one of 'ram', 'pmem', 'dcY', 'mixed', or
+		'none'. The 'mixed' indication is for error cases when a
+		decoder straddles the volatile/persistent partition boundary,
+		and 'none' indicates the decoder is not actively decoding, or
+		no DPA allocation policy has been set.
 
 		'mode' can be written, when the decoder is in the 'disabled'
-		state, with either 'ram' or 'pmem' to set the boundaries for the
-		next allocation.
+		state, with 'ram', 'pmem', or 'dcY' to set the boundaries for
+		the next allocation.
 
 
 What:		/sys/bus/cxl/devices/decoderX.Y/dpa_resource
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index a254f79dd4e8..3f4af1f5fac8 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -267,6 +267,19 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
 	__cxl_dpa_release(cxled);
 }
 
+static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
+{
+	int index = 0;
+
+	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
+		if (mode == i)
+			return index;
+		index++;
+	}
+
+	return -EINVAL;
+}
+
 static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 			     resource_size_t base, resource_size_t len,
 			     resource_size_t skipped)
@@ -429,6 +442,7 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
 	switch (mode) {
 	case CXL_DECODER_RAM:
 	case CXL_DECODER_PMEM:
+	case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
 		break;
 	default:
 		dev_dbg(dev, "unsupported mode: %d\n", mode);
@@ -456,6 +470,16 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
 		goto out;
 	}
 
+	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
+		int index = dc_mode_to_region_index(i);
+
+		if (mode == i && !resource_size(&cxlds->dc_res[index])) {
+			dev_dbg(dev, "no available dynamic capacity\n");
+			rc = -ENXIO;
+			goto out;
+		}
+	}
+
 	cxled->mode = mode;
 	rc = 0;
 out:
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index f58cf01f8d2c..ce4a66865db3 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -197,6 +197,22 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
 		mode = CXL_DECODER_PMEM;
 	else if (sysfs_streq(buf, "ram"))
 		mode = CXL_DECODER_RAM;
+	else if (sysfs_streq(buf, "dc0"))
+		mode = CXL_DECODER_DC0;
+	else if (sysfs_streq(buf, "dc1"))
+		mode = CXL_DECODER_DC1;
+	else if (sysfs_streq(buf, "dc2"))
+		mode = CXL_DECODER_DC2;
+	else if (sysfs_streq(buf, "dc3"))
+		mode = CXL_DECODER_DC3;
+	else if (sysfs_streq(buf, "dc4"))
+		mode = CXL_DECODER_DC4;
+	else if (sysfs_streq(buf, "dc5"))
+		mode = CXL_DECODER_DC5;
+	else if (sysfs_streq(buf, "dc6"))
+		mode = CXL_DECODER_DC6;
+	else if (sysfs_streq(buf, "dc7"))
+		mode = CXL_DECODER_DC7;
 	else
 		return -EINVAL;
 

-- 
2.41.0


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH RFC v2 06/18] cxl/port: Add Dynamic Capacity size support to endpoint decoders
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (4 preceding siblings ...)
  2023-08-29  5:20 ` [PATCH RFC v2 05/18] cxl/port: Add Dynamic Capacity mode support to endpoint decoders Ira Weiny
@ 2023-08-29  5:20 ` Ira Weiny
  2023-08-29 15:09   ` Jonathan Cameron
  2023-08-29  5:20 ` [PATCH RFC v2 07/18] cxl/mem: Expose device dynamic capacity configuration ira.weiny
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 97+ messages in thread
From: Ira Weiny @ 2023-08-29  5:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

To support Dynamic Capacity Devices (DCD), endpoint decoders will need
to map DC Regions (partitions).  Part of this is assigning the size of
the DC Region DPA to the decoder, in addition to any skip value from
the end of the previous decoder.  This must be done within a contiguous
DPA space.  Two complications arise with Dynamic Capacity regions which
did not exist with RAM and PMEM partitions.  First, gaps in the DPA
space can exist between and around the DC Regions.  Second, the Linux
resource tree does not allow a single resource to be requested across
existing nodes within a tree.

For clarity, below is an example of a 60GB device with 10GB of RAM,
10GB of PMEM, and 10GB for each of 2 DC Regions.  The desired CXL
mapping is 5GB of RAM, 5GB of PMEM, and all 10GB of DC1.

     DPA RANGE
     (dpa_res)
0GB        10GB       20GB       30GB       40GB       50GB       60GB
|----------|----------|----------|----------|----------|----------|

RAM         PMEM                  DC0                   DC1
 (ram_res)  (pmem_res)            (dc_res[0])           (dc_res[1])
|----------|----------|   <gap>  |----------|   <gap>  |----------|

 RAM        PMEM                                        DC1
|XXXXX|----|XXXXX|----|----------|----------|----------|XXXXXXXXXX|
0GB   5GB  10GB  15GB 20GB       30GB       40GB       50GB       60GB

The previous skip resource between RAM and PMEM was always a child of
the RAM resource and fit nicely (see X below).  Because of this
simplicity, the skip resource reference was not stored in any CXL
state.  On release, the skip range could be recalculated from the
values stored in the endpoint decoder.

Now, when DC1 is mapped, 4 skip resources must be created as children:
one of the PMEM resource (A), two of the parent DPA resource (B, D),
and one of the DC0 resource (C).

0GB        10GB       20GB       30GB       40GB       50GB       60GB
|----------|----------|----------|----------|----------|----------|
                           |                     |
|----------|----------|    |     |----------|    |     |----------|
        |          |       |          |          |
       (X)        (A)     (B)        (C)        (D)
	v          v       v          v          v
|XXXXX|----|XXXXX|----|----------|----------|----------|XXXXXXXXXX|
       skip       skip  skip        skip      skip

Expand the calculation of DPA freespace and enhance the logic to
support mapping/unmapping DC DPA space.  To track the potentially
multiple skip resources, an xarray is attached to the endpoint decoder.
The existing algorithm is consolidated with the new one so that a
single skip resource is stored in the same way as multiple skip
resources.

Co-developed-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
An alternative using reserve_region_with_split() was considered.  The
advantage of that would be keeping all the resource information stored
solely in the resource tree rather than having separate references to
it.  However, it would best be implemented with a call such as
release_split_region() [name TBD?] which could find all the leaf
resources in the range and release them.  Furthermore, it is not clear
if reserve_region_with_split() is really intended for anything outside
of init code.  In the end, this algorithm seems straightforward enough.

Changes for v2:
[iweiny: write commit message]
[iweiny: remove unneeded changes]
[iweiny: split from region creation patch]
[iweiny: Alter skip algorithm to use 'anonymous regions']
[iweiny: enhance debug messages]
[iweiny: consolidate skip resource creation]
[iweiny: ensure xa_destroy() is called]
[iweiny: consolidate region requests further]
[iweiny: ensure resource is released on xa_insert]
---
 drivers/cxl/core/hdm.c  | 188 +++++++++++++++++++++++++++++++++++++++++++-----
 drivers/cxl/core/port.c |   2 +
 drivers/cxl/cxl.h       |   2 +
 3 files changed, 176 insertions(+), 16 deletions(-)

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 3f4af1f5fac8..3cd048677816 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -222,6 +222,25 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, CXL);
 
+static void cxl_skip_release(struct cxl_endpoint_decoder *cxled)
+{
+	struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
+	struct cxl_port *port = cxled_to_port(cxled);
+	struct device *dev = &port->dev;
+	unsigned long index;
+	void *entry;
+
+	xa_for_each(&cxled->skip_res, index, entry) {
+		struct resource *res = entry;
+
+		dev_dbg(dev, "decoder%d.%d: releasing skipped space; %pr\n",
+			port->id, cxled->cxld.id, res);
+		__release_region(&cxlds->dpa_res, res->start,
+				 resource_size(res));
+		xa_erase(&cxled->skip_res, index);
+	}
+}
+
 /*
  * Must be called in a context that synchronizes against this decoder's
  * port ->remove() callback (like an endpoint decoder sysfs attribute)
@@ -232,15 +251,11 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
 	struct cxl_port *port = cxled_to_port(cxled);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	struct resource *res = cxled->dpa_res;
-	resource_size_t skip_start;
 
 	lockdep_assert_held_write(&cxl_dpa_rwsem);
 
-	/* save @skip_start, before @res is released */
-	skip_start = res->start - cxled->skip;
 	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
-	if (cxled->skip)
-		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
+	cxl_skip_release(cxled);
 	cxled->skip = 0;
 	cxled->dpa_res = NULL;
 	put_device(&cxled->cxld.dev);
@@ -280,6 +295,98 @@ static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
 	return -EINVAL;
 }
 
+static int cxl_request_skip(struct cxl_endpoint_decoder *cxled,
+			    resource_size_t skip_base, resource_size_t skip_len)
+{
+	struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
+	const char *name = dev_name(&cxled->cxld.dev);
+	struct cxl_port *port = cxled_to_port(cxled);
+	struct resource *dpa_res = &cxlds->dpa_res;
+	struct device *dev = &port->dev;
+	struct resource *res;
+	int rc;
+
+	res = __request_region(dpa_res, skip_base, skip_len, name, 0);
+	if (!res)
+		return -EBUSY;
+
+	rc = xa_insert(&cxled->skip_res, skip_base, res, GFP_KERNEL);
+	if (rc) {
+		__release_region(dpa_res, skip_base, skip_len);
+		return rc;
+	}
+
+	dev_dbg(dev, "decoder%d.%d: skipped space; %pr\n",
+		port->id, cxled->cxld.id, res);
+	return 0;
+}
+
+static int cxl_reserve_dpa_skip(struct cxl_endpoint_decoder *cxled,
+				resource_size_t base, resource_size_t skipped)
+{
+	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
+	struct cxl_port *port = cxled_to_port(cxled);
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+	resource_size_t skip_base = base - skipped;
+	resource_size_t size, skip_len = 0;
+	struct device *dev = &port->dev;
+	int rc, index;
+
+	size = resource_size(&cxlds->ram_res);
+	if (size && skip_base <= cxlds->ram_res.end) {
+		skip_len = cxlds->ram_res.end - skip_base + 1;
+		rc = cxl_request_skip(cxled, skip_base, skip_len);
+		if (rc)
+			return rc;
+		skip_base += skip_len;
+	}
+
+	if (skip_base == base) {
+		dev_dbg(dev, "skip done!\n");
+		return 0;
+	}
+
+	size = resource_size(&cxlds->pmem_res);
+	if (size && skip_base <= cxlds->pmem_res.end) {
+		skip_len = cxlds->pmem_res.end - skip_base + 1;
+		rc = cxl_request_skip(cxled, skip_base, skip_len);
+		if (rc)
+			return rc;
+		skip_base += skip_len;
+	}
+
+	index = dc_mode_to_region_index(cxled->mode);
+	for (int i = 0; i <= index; i++) {
+		struct resource *dcr = &cxlds->dc_res[i];
+
+		if (skip_base < dcr->start) {
+			skip_len = dcr->start - skip_base;
+			rc = cxl_request_skip(cxled, skip_base, skip_len);
+			if (rc)
+				return rc;
+			skip_base += skip_len;
+		}
+
+		if (skip_base == base) {
+			dev_dbg(dev, "skip done!\n");
+			break;
+		}
+
+		if (resource_size(dcr) && skip_base <= dcr->end) {
+			if (skip_base > base)
+				dev_err(dev, "Skip error\n");
+
+			skip_len = dcr->end - skip_base + 1;
+			rc = cxl_request_skip(cxled, skip_base, skip_len);
+			if (rc)
+				return rc;
+			skip_base += skip_len;
+		}
+	}
+
+	return 0;
+}
+
 static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 			     resource_size_t base, resource_size_t len,
 			     resource_size_t skipped)
@@ -317,13 +424,12 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 	}
 
 	if (skipped) {
-		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
-				       dev_name(&cxled->cxld.dev), 0);
-		if (!res) {
-			dev_dbg(dev,
-				"decoder%d.%d: failed to reserve skipped space\n",
-				port->id, cxled->cxld.id);
-			return -EBUSY;
+		int rc = cxl_reserve_dpa_skip(cxled, base, skipped);
+
+		if (rc) {
+			dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space; %#llx - %#llx\n",
+				port->id, cxled->cxld.id, base, skipped);
+			return rc;
 		}
 	}
 	res = __request_region(&cxlds->dpa_res, base, len,
@@ -331,14 +437,20 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 	if (!res) {
 		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
 			port->id, cxled->cxld.id);
-		if (skipped)
-			__release_region(&cxlds->dpa_res, base - skipped,
-					 skipped);
+		cxl_skip_release(cxled);
 		return -EBUSY;
 	}
 	cxled->dpa_res = res;
 	cxled->skip = skipped;
 
+	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
+		int index = dc_mode_to_region_index(mode);
+
+		if (resource_contains(&cxlds->dc_res[index], res)) {
+			cxled->mode = mode;
+			goto success;
+		}
+	}
 	if (resource_contains(&cxlds->pmem_res, res))
 		cxled->mode = CXL_DECODER_PMEM;
 	else if (resource_contains(&cxlds->ram_res, res))
@@ -349,6 +461,9 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 		cxled->mode = CXL_DECODER_MIXED;
 	}
 
+success:
+	dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id, cxled->cxld.id,
+		cxled->dpa_res, cxled->mode);
 	port->hdm_end++;
 	get_device(&cxled->cxld.dev);
 	return 0;
@@ -492,11 +607,13 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
 					 resource_size_t *start_out,
 					 resource_size_t *skip_out)
 {
+	resource_size_t free_ram_start, free_pmem_start, free_dc_start;
 	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
-	resource_size_t free_ram_start, free_pmem_start;
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+	struct device *dev = &cxled->cxld.dev;
 	resource_size_t start, avail, skip;
 	struct resource *p, *last;
+	int index;
 
 	lockdep_assert_held(&cxl_dpa_rwsem);
 
@@ -514,6 +631,20 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
 	else
 		free_pmem_start = cxlds->pmem_res.start;
 
+	/*
+	 * Limit each decoder to a single DC region to map memory with
+	 * different DSMAS entry.
+	 */
+	index = dc_mode_to_region_index(cxled->mode);
+	if (index >= 0) {
+		if (cxlds->dc_res[index].child) {
+			dev_err(dev, "Cannot allocate DPA from DC Region: %d\n",
+				index);
+			return -EINVAL;
+		}
+		free_dc_start = cxlds->dc_res[index].start;
+	}
+
 	if (cxled->mode == CXL_DECODER_RAM) {
 		start = free_ram_start;
 		avail = cxlds->ram_res.end - start + 1;
@@ -535,6 +666,29 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
 		else
 			skip_end = start - 1;
 		skip = skip_end - skip_start + 1;
+	} else if (cxl_decoder_mode_is_dc(cxled->mode)) {
+		resource_size_t skip_start, skip_end;
+
+		start = free_dc_start;
+		avail = cxlds->dc_res[index].end - start + 1;
+		if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
+			skip_start = free_ram_start;
+		else
+			skip_start = free_pmem_start;
+		/*
+		 * If any dc region is already mapped, then that allocation
+		 * already handled the RAM and PMEM skip.  Check for DC region
+		 * skip.
+		 */
+		for (int i = index - 1; i >= 0 ; i--) {
+			if (cxlds->dc_res[i].child) {
+				skip_start = cxlds->dc_res[i].child->end + 1;
+				break;
+			}
+		}
+
+		skip_end = start - 1;
+		skip = skip_end - skip_start + 1;
 	} else {
 		dev_dbg(cxled_dev(cxled), "mode not set\n");
 		avail = 0;
@@ -572,6 +726,8 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
 
 	avail = cxl_dpa_freespace(cxled, &start, &skip);
 
+	dev_dbg(dev, "DPA Allocation start: %llx len: %llx Skip: %llx\n",
+		start, size, skip);
 	if (size > avail) {
 		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
 			cxl_decoder_mode_name(cxled->mode), &avail);
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index ce4a66865db3..a5db710a63bc 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -413,6 +413,7 @@ static void cxl_endpoint_decoder_release(struct device *dev)
 	struct cxl_endpoint_decoder *cxled = to_cxl_endpoint_decoder(dev);
 
 	__cxl_decoder_release(&cxled->cxld);
+	xa_destroy(&cxled->skip_res);
 	kfree(cxled);
 }
 
@@ -1769,6 +1770,7 @@ struct cxl_endpoint_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
 		return ERR_PTR(-ENOMEM);
 
 	cxled->pos = -1;
+	xa_init(&cxled->skip_res);
 	cxld = &cxled->cxld;
 	rc = cxl_decoder_init(port, cxld);
 	if (rc)	 {
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index d41f3f14fbe3..0a225b0c20bf 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -433,6 +433,7 @@ enum cxl_decoder_state {
  * @cxld: base cxl_decoder_object
  * @dpa_res: actively claimed DPA span of this decoder
  * @skip: offset into @dpa_res where @cxld.hpa_range maps
+ * @skip_res: xarray of skipped resources from the end of the previous decoder
  * @mode: which memory type / access-mode-partition this decoder targets
  * @state: autodiscovery state
  * @pos: interleave position in @cxld.region
@@ -441,6 +442,7 @@ struct cxl_endpoint_decoder {
 	struct cxl_decoder cxld;
 	struct resource *dpa_res;
 	resource_size_t skip;
+	struct xarray skip_res;
 	enum cxl_decoder_mode mode;
 	enum cxl_decoder_state state;
 	int pos;

-- 
2.41.0


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH RFC v2 07/18] cxl/mem: Expose device dynamic capacity configuration
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (5 preceding siblings ...)
  2023-08-29  5:20 ` [PATCH RFC v2 06/18] cxl/port: Add Dynamic Capacity size " Ira Weiny
@ 2023-08-29  5:20 ` ira.weiny
  2023-08-29 15:14   ` Jonathan Cameron
  2023-08-30 22:46   ` Dave Jiang
  2023-08-29  5:20 ` [PATCH RFC v2 08/18] cxl/region: Add Dynamic Capacity CXL region support Ira Weiny
                   ` (11 subsequent siblings)
  18 siblings, 2 replies; 97+ messages in thread
From: ira.weiny @ 2023-08-29  5:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

From: Navneet Singh <navneet.singh@intel.com>

To properly configure CXL regions on Dynamic Capacity Devices (DCD),
user space will need to know the details of the DC Regions available on
a device.

Expose the device dynamic capacity configuration through sysfs
attributes.

Co-developed-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for v2:
[iweiny: Rebased on latest master/type2 work]
[iweiny: add documentation for sysfs entries]
[iweiny: s/dc_regions_count/region_count/]
[iweiny: s/dcY_size/regionY_size/]
[alison: change size format to %#llx]
[iweiny: change count format to %d]
[iweiny: Formatting updates]
[iweiny: Fix crash when device is not a mem device: found with cxl-test]
---
 Documentation/ABI/testing/sysfs-bus-cxl | 17 ++++++++
 drivers/cxl/core/memdev.c               | 77 +++++++++++++++++++++++++++++++++
 2 files changed, 94 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 2268ffcdb604..aa65dc5b4e13 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -37,6 +37,23 @@ Description:
 		identically named field in the Identify Memory Device Output
 		Payload in the CXL-2.0 specification.
 
+What:		/sys/bus/cxl/devices/memX/dc/region_count
+Date:		July, 2023
+KernelVersion:	v6.6
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		(RO) Number of Dynamic Capacity (DC) regions supported on the
+		device.  May be 0 if the device does not support Dynamic
+		Capacity.
+
+What:		/sys/bus/cxl/devices/memX/dc/regionY_size
+Date:		July, 2023
+KernelVersion:	v6.6
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		(RO) Size of the Dynamic Capacity (DC) region Y.  Only
+		available on devices which support DC and only for those
+		region indexes supported by the device.
 
 What:		/sys/bus/cxl/devices/memX/serial
 Date:		January, 2022
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 492486707fd0..397262e0ebd2 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -101,6 +101,20 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
 static struct device_attribute dev_attr_pmem_size =
 	__ATTR(size, 0444, pmem_size_show, NULL);
 
+static ssize_t region_count_show(struct device *dev, struct device_attribute *attr,
+				 char *buf)
+{
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+	int len = 0;
+
+	len = sysfs_emit(buf, "%d\n", mds->nr_dc_region);
+	return len;
+}
+
+struct device_attribute dev_attr_region_count =
+	__ATTR(region_count, 0444, region_count_show, NULL);
+
 static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
 			   char *buf)
 {
@@ -454,6 +468,62 @@ static struct attribute *cxl_memdev_security_attributes[] = {
 	NULL,
 };
 
+static ssize_t show_size_regionN(struct cxl_memdev *cxlmd, char *buf, int pos)
+{
+	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+
+	return sysfs_emit(buf, "%#llx\n", mds->dc_region[pos].decode_len);
+}
+
+#define REGION_SIZE_ATTR_RO(n)						\
+static ssize_t region##n##_size_show(struct device *dev,		\
+				     struct device_attribute *attr,	\
+				     char *buf)				\
+{									\
+	return show_size_regionN(to_cxl_memdev(dev), buf, (n));		\
+}									\
+static DEVICE_ATTR_RO(region##n##_size)
+REGION_SIZE_ATTR_RO(0);
+REGION_SIZE_ATTR_RO(1);
+REGION_SIZE_ATTR_RO(2);
+REGION_SIZE_ATTR_RO(3);
+REGION_SIZE_ATTR_RO(4);
+REGION_SIZE_ATTR_RO(5);
+REGION_SIZE_ATTR_RO(6);
+REGION_SIZE_ATTR_RO(7);
+
+static struct attribute *cxl_memdev_dc_attributes[] = {
+	&dev_attr_region0_size.attr,
+	&dev_attr_region1_size.attr,
+	&dev_attr_region2_size.attr,
+	&dev_attr_region3_size.attr,
+	&dev_attr_region4_size.attr,
+	&dev_attr_region5_size.attr,
+	&dev_attr_region6_size.attr,
+	&dev_attr_region7_size.attr,
+	&dev_attr_region_count.attr,
+	NULL,
+};
+
+static umode_t cxl_dc_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+
+	/* Not a memory device */
+	if (!mds)
+		return 0;
+
+	if (a == &dev_attr_region_count.attr)
+		return a->mode;
+
+	if (n < mds->nr_dc_region)
+		return a->mode;
+
+	return 0;
+}
+
 static umode_t cxl_memdev_visible(struct kobject *kobj, struct attribute *a,
 				  int n)
 {
@@ -482,11 +552,18 @@ static struct attribute_group cxl_memdev_security_attribute_group = {
 	.attrs = cxl_memdev_security_attributes,
 };
 
+static struct attribute_group cxl_memdev_dc_attribute_group = {
+	.name = "dc",
+	.attrs = cxl_memdev_dc_attributes,
+	.is_visible = cxl_dc_visible,
+};
+
 static const struct attribute_group *cxl_memdev_attribute_groups[] = {
 	&cxl_memdev_attribute_group,
 	&cxl_memdev_ram_attribute_group,
 	&cxl_memdev_pmem_attribute_group,
 	&cxl_memdev_security_attribute_group,
+	&cxl_memdev_dc_attribute_group,
 	NULL,
 };
 

-- 
2.41.0


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH RFC v2 08/18] cxl/region: Add Dynamic Capacity CXL region support
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (6 preceding siblings ...)
  2023-08-29  5:20 ` [PATCH RFC v2 07/18] cxl/mem: Expose device dynamic capacity configuration ira.weiny
@ 2023-08-29  5:20 ` Ira Weiny
  2023-08-29 15:19   ` Jonathan Cameron
                     ` (2 more replies)
  2023-08-29  5:21 ` [PATCH RFC v2 09/18] cxl/mem: Read extents on memory device discovery Ira Weiny
                   ` (10 subsequent siblings)
  18 siblings, 3 replies; 97+ messages in thread
From: Ira Weiny @ 2023-08-29  5:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

CXL devices optionally support dynamic capacity.  CXL Regions must be
configured correctly to access this capacity.  Similar to ram and pmem
partitions, DC Regions represent different partitions of the DPA space.

Interleaving is deferred due to the complexity of managing extents on
multiple devices at the same time.  However, nothing fundamentally
prevents interleave support from being added later; for now, an early
check rejects interleaved configurations.

To maintain backwards compatibility with older software, CXL regions
need a default DAX device to hold the reference for the region until it
is deleted.

Add create_dc_region sysfs entry to create DC regions.  Share the logic
of devm_cxl_add_dax_region() and region_is_system_ram().  Special case
DC capable CXL regions to create a 0 sized seed DAX device until others
can be created on dynamic space later.

Flag dax_regions to indicate 0 capacity available until dax_region
extents are supported by the region.
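
The "dynamic regions report 0 available capacity" rule can be sketched
in user-space C (hypothetical names, not the kernel code): a region
flagged dynamic advertises no available size until extent support lands,
while a conventional region reports total minus allocated.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the dax bus ioresource flag bits added by this series. */
#define DAX_STATIC      (1u << 0)
#define DAX_KMEM        (1u << 1)
#define DAX_DYNAMIC_CAP (1u << 2)	/* new: dynamic capacity region */

/*
 * Available capacity for a region: normally total minus allocated, but
 * a dynamic-capacity region advertises 0 until extents are supported.
 */
static uint64_t region_avail_size(unsigned int flags, uint64_t total,
				  uint64_t allocated)
{
	if (flags & DAX_DYNAMIC_CAP)
		return 0;
	return total - allocated;
}
```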

Co-developed-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
changes for v2:
[iweiny: flag empty dax regions]
[iweiny: Split out anything not directly related to creating a DC CXL
	 region]
[iweiny: Separate out dev dax stuff]
[iweiny/navneet: create 0 sized DAX device by default]
[iweiny: use new DC region mode]
---
 Documentation/ABI/testing/sysfs-bus-cxl | 20 +++++-----
 drivers/cxl/core/core.h                 |  1 +
 drivers/cxl/core/port.c                 |  1 +
 drivers/cxl/core/region.c               | 71 ++++++++++++++++++++++++++++-----
 drivers/dax/bus.c                       |  8 ++++
 drivers/dax/bus.h                       |  1 +
 drivers/dax/cxl.c                       | 15 ++++++-
 7 files changed, 96 insertions(+), 21 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index aa65dc5b4e13..a0562938ecac 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -351,20 +351,20 @@ Description:
 		interleave_granularity).
 
 
-What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
+What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dc}_region
 Date:		May, 2022, January, 2023
-KernelVersion:	v6.0 (pmem), v6.3 (ram)
+KernelVersion:	v6.0 (pmem), v6.3 (ram), v6.6 (dc)
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RW) Write a string in the form 'regionZ' to start the process
-		of defining a new persistent, or volatile memory region
-		(interleave-set) within the decode range bounded by root decoder
-		'decoderX.Y'. The value written must match the current value
-		returned from reading this attribute. An atomic compare exchange
-		operation is done on write to assign the requested id to a
-		region and allocate the region-id for the next creation attempt.
-		EBUSY is returned if the region name written does not match the
-		current cached value.
+		of defining a new persistent, volatile, or Dynamic Capacity
+		(DC) memory region (interleave-set) within the decode range
+		bounded by root decoder 'decoderX.Y'. The value written must
+		match the current value returned from reading this attribute.
+		An atomic compare exchange operation is done on write to assign
+		the requested id to a region and allocate the region-id for the
+		next creation attempt.  EBUSY is returned if the region name
+		written does not match the current cached value.
 
 
 What:		/sys/bus/cxl/devices/decoderX.Y/delete_region
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 45e7e044cf4a..cf3cf01cb95d 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -13,6 +13,7 @@ extern struct attribute_group cxl_base_attribute_group;
 #ifdef CONFIG_CXL_REGION
 extern struct device_attribute dev_attr_create_pmem_region;
 extern struct device_attribute dev_attr_create_ram_region;
+extern struct device_attribute dev_attr_create_dc_region;
 extern struct device_attribute dev_attr_delete_region;
 extern struct device_attribute dev_attr_region;
 extern const struct device_type cxl_pmem_region_type;
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index a5db710a63bc..608901bb7d91 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -314,6 +314,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
 	&dev_attr_target_list.attr,
 	SET_CXL_REGION_ATTR(create_pmem_region)
 	SET_CXL_REGION_ATTR(create_ram_region)
+	SET_CXL_REGION_ATTR(create_dc_region)
 	SET_CXL_REGION_ATTR(delete_region)
 	NULL,
 };
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 69af1354bc5b..fc8dee469244 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2271,6 +2271,7 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
 	switch (mode) {
 	case CXL_REGION_RAM:
 	case CXL_REGION_PMEM:
+	case CXL_REGION_DC:
 		break;
 	default:
 		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
@@ -2383,6 +2384,33 @@ static ssize_t create_ram_region_store(struct device *dev,
 }
 DEVICE_ATTR_RW(create_ram_region);
 
+static ssize_t create_dc_region_show(struct device *dev,
+				     struct device_attribute *attr, char *buf)
+{
+	return __create_region_show(to_cxl_root_decoder(dev), buf);
+}
+
+static ssize_t create_dc_region_store(struct device *dev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t len)
+{
+	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
+	struct cxl_region *cxlr;
+	int rc, id;
+
+	rc = sscanf(buf, "region%d\n", &id);
+	if (rc != 1)
+		return -EINVAL;
+
+	cxlr = __create_region(cxlrd, id, CXL_REGION_DC,
+			       CXL_DECODER_HOSTONLYMEM);
+	if (IS_ERR(cxlr))
+		return PTR_ERR(cxlr);
+
+	return len;
+}
+DEVICE_ATTR_RW(create_dc_region);
+
 static ssize_t region_show(struct device *dev, struct device_attribute *attr,
 			   char *buf)
 {
@@ -2834,7 +2862,7 @@ static void cxlr_dax_unregister(void *_cxlr_dax)
 	device_unregister(&cxlr_dax->dev);
 }
 
-static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
+static int __devm_cxl_add_dax_region(struct cxl_region *cxlr)
 {
 	struct cxl_dax_region *cxlr_dax;
 	struct device *dev;
@@ -2863,6 +2891,21 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
 	return rc;
 }
 
+static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
+{
+	return __devm_cxl_add_dax_region(cxlr);
+}
+
+static int devm_cxl_add_dc_dax_region(struct cxl_region *cxlr)
+{
+	if (cxlr->params.interleave_ways != 1) {
+		dev_err(&cxlr->dev, "Interleaving DC not supported\n");
+		return -EINVAL;
+	}
+
+	return __devm_cxl_add_dax_region(cxlr);
+}
+
 static int match_decoder_by_range(struct device *dev, void *data)
 {
 	struct range *r1, *r2 = data;
@@ -3203,6 +3246,19 @@ static int is_system_ram(struct resource *res, void *arg)
 	return 1;
 }
 
+/*
+ * The region cannot be managed by CXL if any portion of
+ * it is already online as 'System RAM'
+ */
+static bool region_is_system_ram(struct cxl_region *cxlr,
+				 struct cxl_region_params *p)
+{
+	return (walk_iomem_res_desc(IORES_DESC_NONE,
+				    IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+				    p->res->start, p->res->end, cxlr,
+				    is_system_ram) > 0);
+}
+
 static int cxl_region_probe(struct device *dev)
 {
 	struct cxl_region *cxlr = to_cxl_region(dev);
@@ -3242,14 +3298,7 @@ static int cxl_region_probe(struct device *dev)
 	case CXL_REGION_PMEM:
 		return devm_cxl_add_pmem_region(cxlr);
 	case CXL_REGION_RAM:
-		/*
-		 * The region can not be manged by CXL if any portion of
-		 * it is already online as 'System RAM'
-		 */
-		if (walk_iomem_res_desc(IORES_DESC_NONE,
-					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
-					p->res->start, p->res->end, cxlr,
-					is_system_ram) > 0)
+		if (region_is_system_ram(cxlr, p))
 			return 0;
 
 		/*
@@ -3261,6 +3310,10 @@ static int cxl_region_probe(struct device *dev)
 
 		/* HDM-H routes to device-dax */
 		return devm_cxl_add_dax_region(cxlr);
+	case CXL_REGION_DC:
+		if (region_is_system_ram(cxlr, p))
+			return 0;
+		return devm_cxl_add_dc_dax_region(cxlr);
 	default:
 		dev_dbg(&cxlr->dev, "unsupported region mode: %s\n",
 			cxl_region_mode_name(cxlr->mode));
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 0ee96e6fc426..b76e49813a39 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -169,6 +169,11 @@ static bool is_static(struct dax_region *dax_region)
 	return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
 }
 
+static bool is_dynamic(struct dax_region *dax_region)
+{
+	return (dax_region->res.flags & IORESOURCE_DAX_DYNAMIC_CAP) != 0;
+}
+
 bool static_dev_dax(struct dev_dax *dev_dax)
 {
 	return is_static(dev_dax->region);
@@ -285,6 +290,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
 
 	device_lock_assert(dax_region->dev);
 
+	if (is_dynamic(dax_region))
+		return 0;
+
 	for_each_dax_region_resource(dax_region, res)
 		size -= resource_size(res);
 	return size;
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 1ccd23360124..74d8fe4a5532 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -13,6 +13,7 @@ struct dax_region;
 /* dax bus specific ioresource flags */
 #define IORESOURCE_DAX_STATIC BIT(0)
 #define IORESOURCE_DAX_KMEM BIT(1)
+#define IORESOURCE_DAX_DYNAMIC_CAP BIT(2)
 
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 		struct range *range, int target_node, unsigned int align,
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 8bc9d04034d6..147c8c69782b 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -13,19 +13,30 @@ static int cxl_dax_region_probe(struct device *dev)
 	struct cxl_region *cxlr = cxlr_dax->cxlr;
 	struct dax_region *dax_region;
 	struct dev_dax_data data;
+	resource_size_t dev_size;
+	unsigned long flags;
 
 	if (nid == NUMA_NO_NODE)
 		nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
 
+	dev_size = range_len(&cxlr_dax->hpa_range);
+
+	flags = IORESOURCE_DAX_KMEM;
+	if (cxlr->mode == CXL_REGION_DC) {
+		/* Add empty seed dax device */
+		dev_size = 0;
+		flags |= IORESOURCE_DAX_DYNAMIC_CAP;
+	}
+
 	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
-				      PMD_SIZE, IORESOURCE_DAX_KMEM);
+				      PMD_SIZE, flags);
 	if (!dax_region)
 		return -ENOMEM;
 
 	data = (struct dev_dax_data) {
 		.dax_region = dax_region,
 		.id = -1,
-		.size = range_len(&cxlr_dax->hpa_range),
+		.size = dev_size,
 	};
 
 	return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));

-- 
2.41.0


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH RFC v2 09/18] cxl/mem: Read extents on memory device discovery
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (7 preceding siblings ...)
  2023-08-29  5:20 ` [PATCH RFC v2 08/18] cxl/region: Add Dynamic Capacity CXL region support Ira Weiny
@ 2023-08-29  5:21 ` Ira Weiny
  2023-08-29 15:26   ` Jonathan Cameron
  2023-08-29  5:21 ` [PATCH RFC v2 10/18] cxl/mem: Handle DCD add and release capacity events Ira Weiny
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 97+ messages in thread
From: Ira Weiny @ 2023-08-29  5:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

When a Dynamic Capacity Device (DCD) is realized, some extents may
already be available within the DC regions.  This can happen if the host
has accepted extents and been rebooted, or any other time the host
driver software has fallen out of sync with the device hardware.

Read the available extents during probe and store them for later
use.
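
The read-back loop pages through the device's extent list and restarts
if the list's generation number changes mid-read.  A self-contained
sketch of that retry scheme (simulated device, hypothetical names):

```c
#include <assert.h>

#define MAX_PER_CALL 2		/* simulated mailbox payload limit */

struct dev_state {
	int gen;		/* extent list generation number */
	int count;		/* total extents on the device */
	int extents[16];	/* stand-in extent records */
};

/* Simulated GET_DC_EXTENT_LIST: returns up to MAX_PER_CALL records. */
static int get_extents(const struct dev_state *dev, int start, int *out,
		       int *total, int *gen)
{
	int n = 0;

	for (int i = start; i < dev->count && n < MAX_PER_CALL; i++)
		out[n++] = dev->extents[i];
	*total = dev->count;
	*gen = dev->gen;
	return n;
}

/* Page through the list; restart from scratch if the generation moves. */
static int read_all_extents(const struct dev_state *dev, int exp_gen,
			    int exp_cnt, int *out)
{
	int retry = 3, nread;

reset:
	nread = 0;
	while (nread < exp_cnt) {
		int total, gen;
		int n = get_extents(dev, nread, out + nread, &total, &gen);

		if (gen != exp_gen || total != exp_cnt) {
			if (retry--)
				goto reset;
			return -1;	/* list kept changing: give up */
		}
		nread += n;
	}
	return nread;
}
```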

Co-developed-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Change for v2:
[iweiny: new patch]
---
 drivers/cxl/core/mbox.c | 195 ++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/cxl/cxlmem.h    |  36 +++++++++
 drivers/cxl/pci.c       |   4 +
 3 files changed, 235 insertions(+)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index d769814f80e2..9b08c40ef484 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -824,6 +824,37 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
 
+static int cxl_store_dc_extent(struct cxl_memdev_state *mds,
+			       struct cxl_dc_extent *dc_extent)
+{
+	struct device *dev = mds->cxlds.dev;
+	struct cxl_dc_extent_data *extent;
+	int rc;
+
+	extent = kzalloc(sizeof(*extent), GFP_KERNEL);
+	if (!extent)
+		return -ENOMEM;
+
+	extent->dpa_start = le64_to_cpu(dc_extent->start_dpa);
+	extent->length = le64_to_cpu(dc_extent->length);
+	memcpy(extent->tag, dc_extent->tag, sizeof(extent->tag));
+	extent->shared_extent_seq = le16_to_cpu(dc_extent->shared_extn_seq);
+
+	dev_dbg(dev, "dynamic capacity extent DPA:0x%llx LEN:%llx\n",
+		extent->dpa_start, extent->length);
+
+	rc = xa_insert(&mds->dc_extent_list, extent->dpa_start, extent,
+			 GFP_KERNEL);
+	if (rc) {
+		if (rc == -EBUSY)
+			dev_warn_once(dev, "Duplicate extent DPA:%llx LEN:%llx\n",
+				      extent->dpa_start, extent->length);
+		kfree(extent);
+	}
+
+	return rc;
+}
+
 /*
  * General Media Event Record
  * CXL rev 3.0 Section 8.2.9.2.1.1; Table 8-43
@@ -1339,6 +1370,149 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
 
+static int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
+				     unsigned int *extent_gen_num)
+{
+	struct cxl_mbox_get_dc_extent get_dc_extent;
+	struct cxl_mbox_dc_extents dc_extents;
+	struct device *dev = mds->cxlds.dev;
+	struct cxl_mbox_cmd mbox_cmd;
+	unsigned int count;
+	int rc;
+
+	/* Check GET_DC_EXTENT_LIST is supported by device */
+	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
+		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
+		return 0;
+	}
+
+	get_dc_extent = (struct cxl_mbox_get_dc_extent) {
+		.extent_cnt = cpu_to_le32(0),
+		.start_extent_index = cpu_to_le32(0),
+	};
+
+	mbox_cmd = (struct cxl_mbox_cmd) {
+		.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
+		.payload_in = &get_dc_extent,
+		.size_in = sizeof(get_dc_extent),
+		.size_out = mds->payload_size,
+		.payload_out = &dc_extents,
+		.min_out = 1,
+	};
+
+	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
+	if (rc < 0)
+		return rc;
+
+	count = le32_to_cpu(dc_extents.total_extent_cnt);
+	*extent_gen_num = le32_to_cpu(dc_extents.extent_list_num);
+
+	return count;
+}
+
+static int cxl_dev_get_dc_extents(struct cxl_memdev_state *mds,
+				  unsigned int start_gen_num,
+				  unsigned int exp_cnt)
+{
+	struct cxl_mbox_dc_extents *dc_extents;
+	unsigned int start_index, total_read;
+	struct device *dev = mds->cxlds.dev;
+	struct cxl_mbox_cmd mbox_cmd;
+	int retry = 3;
+	int rc;
+
+	/* Check GET_DC_EXTENT_LIST is supported by device */
+	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
+		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
+		return 0;
+	}
+
+	dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);
+	if (!dc_extents)
+		return -ENOMEM;
+
+reset:
+	total_read = 0;
+	start_index = 0;
+	do {
+		unsigned int nr_ext, total_extent_cnt, gen_num;
+		struct cxl_mbox_get_dc_extent get_dc_extent;
+
+		get_dc_extent = (struct cxl_mbox_get_dc_extent) {
+			.extent_cnt = cpu_to_le32(exp_cnt - start_index),
+			.start_extent_index = cpu_to_le32(start_index),
+		};
+
+		mbox_cmd = (struct cxl_mbox_cmd) {
+			.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
+			.payload_in = &get_dc_extent,
+			.size_in = sizeof(get_dc_extent),
+			.size_out = mds->payload_size,
+			.payload_out = dc_extents,
+			.min_out = 1,
+		};
+
+		rc = cxl_internal_send_cmd(mds, &mbox_cmd);
+		if (rc < 0)
+			goto out;
+
+		nr_ext = le32_to_cpu(dc_extents->ret_extent_cnt);
+		total_read += nr_ext;
+		total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
+		gen_num = le32_to_cpu(dc_extents->extent_list_num);
+
+		dev_dbg(dev, "Get extent list count:%d generation Num:%d\n",
+			total_extent_cnt, gen_num);
+
+		if (gen_num != start_gen_num || exp_cnt != total_extent_cnt) {
+			dev_err(dev, "Extent list changed while reading; %u != %u : %u != %u\n",
+				gen_num, start_gen_num, exp_cnt, total_extent_cnt);
+			if (retry--)
+				goto reset;
+			rc = -EIO;
+			goto out;
+		}
+
+		for (int i = 0; i < nr_ext ; i++) {
+			dev_dbg(dev, "Storing extent %d/%d\n",
+				start_index + i, exp_cnt);
+			rc = cxl_store_dc_extent(mds, &dc_extents->extent[i]);
+			if (rc)
+				goto out;
+		}
+
+		start_index += nr_ext;
+	} while (exp_cnt > total_read);
+
+out:
+	kvfree(dc_extents);
+	return rc;
+}
+
+/**
+ * cxl_dev_get_dynamic_capacity_extents() - Reads the dynamic capacity
+ *					 extent list.
+ * @mds: The memory device state
+ *
+ * This will dispatch the get_dynamic_capacity_extent_list command to the device
+ * and on success add the extents to the host managed extent list.
+ *
+ * Return: 0 if command was executed successfully, -ERRNO on error.
+ */
+int cxl_dev_get_dynamic_capacity_extents(struct cxl_memdev_state *mds)
+{
+	unsigned int extent_gen_num;
+	int rc;
+
+	rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
+	if (rc <= 0) /* 0 == no records found */
+		return rc;
+	dev_dbg(mds->cxlds.dev, "Extent count: %d Generation Num: %d\n",
+		rc, extent_gen_num);
+
+	return cxl_dev_get_dc_extents(mds, extent_gen_num, rc);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dev_get_dynamic_capacity_extents, CXL);
+
 static int add_dpa_res(struct device *dev, struct resource *parent,
 		       struct resource *res, resource_size_t start,
 		       resource_size_t size, const char *type)
@@ -1530,9 +1704,23 @@ int cxl_poison_state_init(struct cxl_memdev_state *mds)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_poison_state_init, CXL);
 
+static void cxl_destroy_mds(void *_mds)
+{
+	struct cxl_memdev_state *mds = _mds;
+	struct cxl_dc_extent_data *extent;
+	unsigned long index;
+
+	xa_for_each(&mds->dc_extent_list, index, extent) {
+		xa_erase(&mds->dc_extent_list, index);
+		kfree(extent);
+	}
+	xa_destroy(&mds->dc_extent_list);
+}
+
 struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
 {
 	struct cxl_memdev_state *mds;
+	int rc;
 
 	mds = devm_kzalloc(dev, sizeof(*mds), GFP_KERNEL);
 	if (!mds) {
@@ -1544,6 +1732,13 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
 	mutex_init(&mds->event.log_lock);
 	mds->cxlds.dev = dev;
 	mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
+	xa_init(&mds->dc_extent_list);
+
+	rc = devm_add_action_or_reset(dev, cxl_destroy_mds, mds);
+	if (rc) {
+		dev_err(dev, "Failed to set up memdev state; %d\n", rc);
+		return ERR_PTR(rc);
+	}
 
 	return mds;
 }
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 8c8f47b397ab..ad690600c1b9 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -6,6 +6,7 @@
 #include <linux/cdev.h>
 #include <linux/uuid.h>
 #include <linux/rcuwait.h>
+#include <linux/xarray.h>
 #include "cxl.h"
 
 /* CXL 2.0 8.2.8.5.1.1 Memory Device Status Register */
@@ -509,6 +510,7 @@ struct cxl_memdev_state {
 	u8 nr_dc_region;
 	struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
 	size_t dc_event_log_size;
+	struct xarray dc_extent_list;
 
 	struct cxl_event_state event;
 	struct cxl_poison_state poison;
@@ -749,6 +751,26 @@ struct cxl_event_mem_module {
 	u8 reserved[0x3d];
 } __packed;
 
+#define CXL_DC_EXTENT_TAG_LEN 0x10
+struct cxl_dc_extent_data {
+	u64 dpa_start;
+	u64 length;
+	u8 tag[CXL_DC_EXTENT_TAG_LEN];
+	u16 shared_extent_seq;
+};
+
+/*
+ * Dynamic Capacity Event Record
+ * CXL rev 3.0 section 8.2.9.2.1.5; Table 8-47
+ */
+struct cxl_dc_extent {
+	__le64 start_dpa;
+	__le64 length;
+	u8 tag[CXL_DC_EXTENT_TAG_LEN];
+	__le16 shared_extn_seq;
+	u8 reserved[6];
+} __packed;
+
 struct cxl_mbox_get_partition_info {
 	__le64 active_volatile_cap;
 	__le64 active_persistent_cap;
@@ -796,6 +818,19 @@ struct cxl_mbox_dynamic_capacity {
 #define CXL_REGIONS_RETURNED(size_out) \
 	((size_out - 8) / sizeof(struct cxl_dc_region_config))
 
+struct cxl_mbox_get_dc_extent {
+	__le32 extent_cnt;
+	__le32 start_extent_index;
+} __packed;
+
+struct cxl_mbox_dc_extents {
+	__le32 ret_extent_cnt;
+	__le32 total_extent_cnt;
+	__le32 extent_list_num;
+	u8 rsvd[4];
+	struct cxl_dc_extent extent[];
+}  __packed;
+
 /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
 struct cxl_mbox_set_timestamp_in {
 	__le64 timestamp;
@@ -920,6 +955,7 @@ int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
 			  struct cxl_mbox_cmd *cmd);
 int cxl_dev_state_identify(struct cxl_memdev_state *mds);
 int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
+int cxl_dev_get_dynamic_capacity_extents(struct cxl_memdev_state *mds);
 int cxl_await_media_ready(struct cxl_dev_state *cxlds);
 int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
 int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index a9b110ff1176..10c1a583113c 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -930,6 +930,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (rc)
 		dev_dbg(&pdev->dev, "No RAS reporting unmasked\n");
 
+	rc = cxl_dev_get_dynamic_capacity_extents(mds);
+	if (rc)
+		return rc;
+
 	pci_save_state(pdev);
 
 	return rc;

-- 
2.41.0


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH RFC v2 10/18] cxl/mem: Handle DCD add and release capacity events.
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (8 preceding siblings ...)
  2023-08-29  5:21 ` [PATCH RFC v2 09/18] cxl/mem: Read extents on memory device discovery Ira Weiny
@ 2023-08-29  5:21 ` Ira Weiny
  2023-08-29 15:59   ` Jonathan Cameron
  2023-08-31 17:28   ` Dave Jiang
  2023-08-29  5:21 ` [PATCH RFC v2 11/18] cxl/region: Expose DC extents on region driver load Ira Weiny
                   ` (8 subsequent siblings)
  18 siblings, 2 replies; 97+ messages in thread
From: Ira Weiny @ 2023-08-29  5:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

A Dynamic Capacity Device (DCD) utilizes events to signal the host about
the changes to the allocation of Dynamic Capacity (DC) extents. The
device communicates the state of DC extents through an extent list that
describes the starting DPA, length, and meta data of the blocks the host
can access.

Process the dynamic capacity add and release events.  The addition or
removal of extents can occur at any time.  Adding memory asynchronously
is straightforward.  However, the host is under no obligation to
respond to a release event until it is done with the memory.  Introduce
extent krefs to handle this delayed release.

In the case of a forced removal, access to the memory will fail and may
cause a crash.  However, the extent tracking object is preserved so that
the region can safely tear down, as long as the memory is not accessed.
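
The deferred-release scheme can be sketched with a plain reference count
in user-space C (hypothetical names): the release response to the device
is only "sent" when the last user of the extent drops its reference.

```c
#include <assert.h>
#include <stdlib.h>

struct extent {
	int refs;		/* stands in for the kernel kref */
	int *released;		/* set when the release response is sent */
};

static struct extent *extent_get_new(int *released_flag)
{
	struct extent *e = malloc(sizeof(*e));

	if (!e)
		return NULL;
	e->refs = 1;		/* initial reference held by the driver */
	e->released = released_flag;
	return e;
}

static void extent_ref(struct extent *e)
{
	e->refs++;
}

/* Drop a reference; the last put sends the release response and frees. */
static void extent_put(struct extent *e)
{
	if (--e->refs == 0) {
		*e->released = 1;	/* i.e. the RELEASE_DC mailbox cmd */
		free(e);
	}
}
```

Usage: a DAX device using the extent takes an extra reference, so a
device-initiated release only reaches the hardware once the DAX side is
done with the memory.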

Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
changes for v2:
[iweiny: Totally new version of the patch]
[iweiny: use kref to track when to release an extent]
[iweiny: rebased to latest master/type2 work]
[iweiny: use a kref to track if extents are being referenced]
[alison: align commit message paragraphs]
[alison: remove unnecessary return]
[iweiny: Adjust for the new __devm_cxl_add_dax_region()]
[navneet: Fix debug prints in adding/releasing extent]
[alison: deal with odd if/else logic]
[alison: reverse x-tree]
[alison: reverse x-tree]
[alison: s/total_extent_cnt/count/]
[alison: make handle event reverse x-tree]
[alison: cleanup/shorten/remove handle event comment]
[iweiny/Alison: refactor cxl_handle_dcd_event_records function]
[iweiny: keep cxl_dc_extent_data local to mbox.c]
[jonathan: eliminate 'rc']
[iweiny: use proper type for mailbox size]
[jonathan: put dc_extents on the stack]
[jonathan: use direct returns instead of goto]
[iweiny: Clean up comment]
[Jonathan: define CXL_DC_EXTENT_TAG_LEN]
[Jonathan: remove extraneous changes]
[Jonathan: fix blank line issues]
---
 drivers/cxl/core/mbox.c | 186 +++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/cxl/cxl.h       |   9 +++
 drivers/cxl/cxlmem.h    |  30 ++++++++
 3 files changed, 224 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 9b08c40ef484..8474a28b16ca 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -839,6 +839,8 @@ static int cxl_store_dc_extent(struct cxl_memdev_state *mds,
 	extent->length = le64_to_cpu(dc_extent->length);
 	memcpy(extent->tag, dc_extent->tag, sizeof(extent->tag));
 	extent->shared_extent_seq = le16_to_cpu(dc_extent->shared_extn_seq);
+	kref_init(&extent->region_ref);
+	extent->mds = mds;
 
 	dev_dbg(dev, "dynamic capacity extent DPA:0x%llx LEN:%llx\n",
 		extent->dpa_start, extent->length);
@@ -879,6 +881,14 @@ static const uuid_t mem_mod_event_uuid =
 	UUID_INIT(0xfe927475, 0xdd59, 0x4339,
 		  0xa5, 0x86, 0x79, 0xba, 0xb1, 0x13, 0xb7, 0x74);
 
+/*
+ * Dynamic Capacity Event Record
+ * CXL rev 3.0 section 8.2.9.2.1.3; Table 8-45
+ */
+static const uuid_t dc_event_uuid =
+	UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c,
+		  0x2f, 0x95, 0x26, 0x8e, 0x10, 0x1a, 0x2a);
+
 static void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 				   enum cxl_event_log_type type,
 				   struct cxl_event_record_raw *record)
@@ -973,6 +983,171 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
 	return rc;
 }
 
+static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
+				struct cxl_mbox_dc_response *res,
+				int extent_cnt, int opcode)
+{
+	struct cxl_mbox_cmd mbox_cmd;
+	size_t size;
+
+	size = struct_size(res, extent_list, extent_cnt);
+	res->extent_list_size = cpu_to_le32(extent_cnt);
+
+	mbox_cmd = (struct cxl_mbox_cmd) {
+		.opcode = opcode,
+		.size_in = size,
+		.payload_in = res,
+	};
+
+	return cxl_internal_send_cmd(mds, &mbox_cmd);
+}
+
+static int cxl_prepare_ext_list(struct cxl_mbox_dc_response **res,
+				int *n, struct range *extent)
+{
+	struct cxl_mbox_dc_response *dc_res;
+	unsigned int size;
+
+	if (!extent)
+		size = struct_size(dc_res, extent_list, 0);
+	else
+		size = struct_size(dc_res, extent_list, *n + 1);
+
+	dc_res = krealloc(*res, size, GFP_KERNEL);
+	if (!dc_res)
+		return -ENOMEM;
+
+	if (extent) {
+		dc_res->extent_list[*n].dpa_start = cpu_to_le64(extent->start);
+		memset(dc_res->extent_list[*n].reserved, 0, 8);
+		dc_res->extent_list[*n].length = cpu_to_le64(range_len(extent));
+		(*n)++;
+	}
+
+	*res = dc_res;
+	return 0;
+}
+
+static void dc_extent_release(struct kref *kref)
+{
+	struct cxl_dc_extent_data *extent = container_of(kref,
+						struct cxl_dc_extent_data,
+						region_ref);
+	struct cxl_memdev_state *mds = extent->mds;
+	struct cxl_mbox_dc_response *dc_res = NULL;
+	struct range rel_range = (struct range) {
+		.start = extent->dpa_start,
+		.end = extent->dpa_start + extent->length - 1,
+	};
+	struct device *dev = mds->cxlds.dev;
+	int extent_cnt = 0, rc;
+
+	rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, &rel_range);
+	if (rc < 0) {
+		dev_err(dev, "Failed to create release response %d\n", rc);
+		goto free_extent;
+	}
+	rc = cxl_send_dc_cap_response(mds, dc_res, extent_cnt,
+				      CXL_MBOX_OP_RELEASE_DC);
+	kfree(dc_res);
+
+free_extent:
+	kfree(extent);
+}
+
+void cxl_dc_extent_put(struct cxl_dc_extent_data *extent)
+{
+	kref_put(&extent->region_ref, dc_extent_release);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dc_extent_put, CXL);
+
+static int cxl_handle_dcd_release_event(struct cxl_memdev_state *mds,
+					struct cxl_dc_extent *rel_extent)
+{
+	struct device *dev = mds->cxlds.dev;
+	struct cxl_dc_extent_data *extent;
+	resource_size_t dpa, size;
+
+	dpa = le64_to_cpu(rel_extent->start_dpa);
+	size = le64_to_cpu(rel_extent->length);
+	dev_dbg(dev, "Release DC extent DPA:0x%llx LEN:%llx\n",
+		dpa, size);
+
+	extent = xa_erase(&mds->dc_extent_list, dpa);
+	if (!extent) {
+		dev_err(dev, "No extent found with DPA:0x%llx\n", dpa);
+		return -EINVAL;
+	}
+	cxl_dc_extent_put(extent);
+	return 0;
+}
+
+static int cxl_handle_dcd_add_event(struct cxl_memdev_state *mds,
+				    struct cxl_dc_extent *add_extent)
+{
+	struct cxl_mbox_dc_response *dc_res = NULL;
+	struct range alloc_range, *resp_range;
+	struct device *dev = mds->cxlds.dev;
+	int extent_cnt = 0;
+	int rc;
+
+	dev_dbg(dev, "Add DC extent DPA:0x%llx LEN:%llx\n",
+		le64_to_cpu(add_extent->start_dpa),
+		le64_to_cpu(add_extent->length));
+
+	alloc_range = (struct range){
+		.start = le64_to_cpu(add_extent->start_dpa),
+		.end = le64_to_cpu(add_extent->start_dpa) +
+			le64_to_cpu(add_extent->length) - 1,
+	};
+	resp_range = &alloc_range;
+
+	rc = cxl_store_dc_extent(mds, add_extent);
+	if (rc) {
+		dev_dbg(dev, "unconsumed DC extent DPA:0x%llx LEN:%llx\n",
+			le64_to_cpu(add_extent->start_dpa),
+			le64_to_cpu(add_extent->length));
+		resp_range = NULL;
+	}
+
+	rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, resp_range);
+	if (rc < 0) {
+		dev_err(dev, "Couldn't create extent list %d\n", rc);
+		return rc;
+	}
+
+	rc = cxl_send_dc_cap_response(mds, dc_res, extent_cnt,
+				      CXL_MBOX_OP_ADD_DC_RESPONSE);
+	kfree(dc_res);
+	return rc;
+}
+
+/* Returns 0 if the event was handled successfully. */
+static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
+					struct cxl_event_record_raw *rec)
+{
+	struct dcd_event_dyn_cap *record = (struct dcd_event_dyn_cap *)rec;
+	uuid_t *id = &rec->hdr.id;
+	int rc;
+
+	if (!uuid_equal(id, &dc_event_uuid))
+		return -EINVAL;
+
+	switch (record->data.event_type) {
+	case DCD_ADD_CAPACITY:
+		rc = cxl_handle_dcd_add_event(mds, &record->data.extent);
+		break;
+	case DCD_RELEASE_CAPACITY:
+	case DCD_FORCED_CAPACITY_RELEASE:
+		rc = cxl_handle_dcd_release_event(mds, &record->data.extent);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return rc;
+}
+
 static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
 				    enum cxl_event_log_type type)
 {
@@ -1016,6 +1191,13 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
 				le16_to_cpu(payload->records[i].hdr.handle));
 			cxl_event_trace_record(cxlmd, type,
 					       &payload->records[i]);
+			if (type == CXL_EVENT_TYPE_DCD) {
+				rc = cxl_handle_dcd_event_records(mds,
+								  &payload->records[i]);
+				if (rc)
+					dev_err_ratelimited(dev, "dcd event failed: %d\n",
+							    rc);
+			}
 		}
 
 		if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
@@ -1056,6 +1238,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
 		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_WARN);
 	if (status & CXLDEV_EVENT_STATUS_INFO)
 		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_INFO);
+	if (status & CXLDEV_EVENT_STATUS_DCD)
+		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_mem_get_event_records, CXL);
 
@@ -1712,7 +1896,7 @@ static void cxl_destroy_mds(void *_mds)
 
 	xa_for_each(&mds->dc_extent_list, index, extent) {
 		xa_erase(&mds->dc_extent_list, index);
-		kfree(extent);
+		cxl_dc_extent_put(extent);
 	}
 	xa_destroy(&mds->dc_extent_list);
 }
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 0a225b0c20bf..81ca76ae1d02 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -163,6 +163,7 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
 #define CXLDEV_EVENT_STATUS_WARN		BIT(1)
 #define CXLDEV_EVENT_STATUS_FAIL		BIT(2)
 #define CXLDEV_EVENT_STATUS_FATAL		BIT(3)
+#define CXLDEV_EVENT_STATUS_DCD			BIT(4)
 
 #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO |	\
 				 CXLDEV_EVENT_STATUS_WARN |	\
@@ -601,6 +602,14 @@ struct cxl_pmem_region {
 	struct cxl_pmem_region_mapping mapping[];
 };
 
+/* See CXL 3.0 8.2.9.2.1.5 */
+enum dc_event {
+	DCD_ADD_CAPACITY,
+	DCD_RELEASE_CAPACITY,
+	DCD_FORCED_CAPACITY_RELEASE,
+	DCD_REGION_CONFIGURATION_UPDATED,
+};
+
 struct cxl_dax_region {
 	struct device dev;
 	struct cxl_region *cxlr;
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index ad690600c1b9..118392229174 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -582,6 +582,16 @@ enum cxl_opcode {
 	UUID_INIT(0xe1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19,     \
 		  0x40, 0x3d, 0x86)
 
+struct cxl_mbox_dc_response {
+	__le32 extent_list_size;
+	u8 reserved[4];
+	struct updated_extent_list {
+		__le64 dpa_start;
+		__le64 length;
+		u8 reserved[8];
+	} __packed extent_list[];
+} __packed;
+
 struct cxl_mbox_get_supported_logs {
 	__le16 entries;
 	u8 rsvd[6];
@@ -667,6 +677,7 @@ enum cxl_event_log_type {
 	CXL_EVENT_TYPE_WARN,
 	CXL_EVENT_TYPE_FAIL,
 	CXL_EVENT_TYPE_FATAL,
+	CXL_EVENT_TYPE_DCD,
 	CXL_EVENT_TYPE_MAX
 };
 
@@ -757,6 +768,8 @@ struct cxl_dc_extent_data {
 	u64 length;
 	u8 tag[CXL_DC_EXTENT_TAG_LEN];
 	u16 shared_extent_seq;
+	struct cxl_memdev_state *mds;
+	struct kref region_ref;
 };
 
 /*
@@ -771,6 +784,21 @@ struct cxl_dc_extent {
 	u8 reserved[6];
 } __packed;
 
+struct dcd_record_data {
+	u8 event_type;
+	u8 reserved;
+	__le16 host_id;
+	u8 region_index;
+	u8 reserved1[3];
+	struct cxl_dc_extent extent;
+	u8 reserved2[32];
+} __packed;
+
+struct dcd_event_dyn_cap {
+	struct cxl_event_record_hdr hdr;
+	struct dcd_record_data data;
+} __packed;
+
 struct cxl_mbox_get_partition_info {
 	__le64 active_volatile_cap;
 	__le64 active_persistent_cap;
@@ -974,6 +1002,8 @@ int cxl_trigger_poison_list(struct cxl_memdev *cxlmd);
 int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa);
 int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa);
 
+void cxl_dc_extent_put(struct cxl_dc_extent_data *extent);
+
 #ifdef CONFIG_CXL_SUSPEND
 void cxl_mem_active_inc(void);
 void cxl_mem_active_dec(void);

-- 
2.41.0


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH RFC v2 11/18] cxl/region: Expose DC extents on region driver load
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (9 preceding siblings ...)
  2023-08-29  5:21 ` [PATCH RFC v2 10/18] cxl/mem: Handle DCD add and release capacity events Ira Weiny
@ 2023-08-29  5:21 ` Ira Weiny
  2023-08-29 16:20   ` Jonathan Cameron
  2023-08-31 18:38   ` Dave Jiang
  2023-08-29  5:21 ` [PATCH RFC v2 12/18] cxl/region: Notify regions of DC changes Ira Weiny
                   ` (7 subsequent siblings)
  18 siblings, 2 replies; 97+ messages in thread
From: Ira Weiny @ 2023-08-29  5:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

Ultimately user space must associate Dynamic Capacity (DC) extents with
DAX devices.  Remember also that DCD extents may have been accepted
prior to regions being created and must have references held until
all higher-level regions and DAX devices are done with the memory.

On CXL region driver load, scan existing device extents and create CXL
DAX region extents as needed.

Create abstractions for the extents to be used in the DAX region.  This
includes a generic interface for taking proper references on the lower
level CXL region extents.

Also maintain separate objects for the DAX region extent device vs the
DAX region extent.  The DAX region extent device has a shorter life span
which accommodates the removal of an extent while a DAX device is still
using it.  In that case the extent continues to exist while the ability
to create new DAX devices on it is revoked.

NOTE: Without interleaving, the device, CXL region, and DAX region
extents have a 1:1:1 relationship.  Future support for interleaving will
maintain a 1:N relationship between CXL region extents and the hardware
extents.

While the ability to create DAX devices on an extent exists, expose the
necessary details of DAX region extents by creating a device with the
following sysfs entries:

/sys/bus/cxl/devices/dax_regionX/extentY
/sys/bus/cxl/devices/dax_regionX/extentY/length
/sys/bus/cxl/devices/dax_regionX/extentY/label

The label is a rough analog of the DC extent tag, so the tag is used to
initially populate the label.  However, the label is made writable so
that it can be adjusted later when forming a DAX device.

Co-developed-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from v1
[iweiny: move dax_region_extents to dax layer]
[iweiny: adjust for kreference of extents]
[iweiny: adjust naming to cxl_dr_extent]
[iweiny: Remove region_extent xarray; use child devices instead]
[iweiny: ensure dax region devices are destroyed on region destruction]
[iweiny: use xa_insert]
[iweiny: hpa_offset is a dr_extent parameter not an extent parameter]
[iweiny: Add dc_region_extents when the region driver is loaded]
---
 drivers/cxl/core/mbox.c   |  12 ++++
 drivers/cxl/core/region.c | 179 ++++++++++++++++++++++++++++++++++++++++++++--
 drivers/cxl/cxl.h         |  16 +++++
 drivers/cxl/cxlmem.h      |   2 +
 drivers/dax/Makefile      |   1 +
 drivers/dax/cxl.c         | 101 ++++++++++++++++++++++++--
 drivers/dax/dax-private.h |  53 ++++++++++++++
 drivers/dax/extent.c      | 119 ++++++++++++++++++++++++++++++
 8 files changed, 473 insertions(+), 10 deletions(-)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 8474a28b16ca..5472ab1d0370 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1055,6 +1055,18 @@ static void dc_extent_release(struct kref *kref)
 	kfree(extent);
 }
 
+int __must_check cxl_dc_extent_get_not_zero(struct cxl_dc_extent_data *extent)
+{
+	return kref_get_unless_zero(&extent->region_ref);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dc_extent_get_not_zero, CXL);
+
+void cxl_dc_extent_get(struct cxl_dc_extent_data *extent)
+{
+	kref_get(&extent->region_ref);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dc_extent_get, CXL);
+
 void cxl_dc_extent_put(struct cxl_dc_extent_data *extent)
 {
 	kref_put(&extent->region_ref, dc_extent_release);
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index fc8dee469244..0aeea50550f6 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1547,6 +1547,122 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
 	return 0;
 }
 
+static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
+				struct cxl_dc_extent_data *extent)
+{
+	struct range dpa_range = (struct range){
+		.start = extent->dpa_start,
+		.end = extent->dpa_start + extent->length - 1,
+	};
+	struct device *dev = &cxled->cxld.dev;
+
+	dev_dbg(dev, "Checking extent DPA:%llx LEN:%llx\n",
+		extent->dpa_start, extent->length);
+
+	if (!cxled->cxld.region || !cxled->dpa_res)
+		return false;
+
+	dev_dbg(dev, "Cxled start:%llx end:%llx\n",
+		cxled->dpa_res->start, cxled->dpa_res->end);
+	return (cxled->dpa_res->start <= dpa_range.start &&
+		dpa_range.end <= cxled->dpa_res->end);
+}
+
+static int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
+				 struct cxl_dc_extent_data *extent)
+{
+	struct cxl_dr_extent *cxl_dr_ext;
+	struct cxl_dax_region *cxlr_dax;
+	resource_size_t dpa_offset, hpa;
+	struct range *ed_hpa_range;
+	struct device *dev;
+	int rc;
+
+	cxlr_dax = cxled->cxld.region->cxlr_dax;
+	dev = &cxlr_dax->dev;
+	dev_dbg(dev, "Adding DC extent DPA:%llx LEN:%llx\n",
+		extent->dpa_start, extent->length);
+
+	/*
+	 * Interleave ways == 1 means this corresponds to a 1:1 mapping between
+	 * device extents and DAX region extents.  Future implementations
+	 * should hold DC region extents here until the full dax region extent
+	 * can be realized.
+	 */
+	if (cxlr_dax->cxlr->params.interleave_ways != 1) {
+		dev_err(dev, "Interleaving DC not supported\n");
+		return -EINVAL;
+	}
+
+	cxl_dr_ext = kzalloc(sizeof(*cxl_dr_ext), GFP_KERNEL);
+	if (!cxl_dr_ext)
+		return -ENOMEM;
+
+	cxl_dr_ext->extent = extent;
+	kref_init(&cxl_dr_ext->region_ref);
+
+	/*
+	 * Without interleave...
+	 * HPA offset == DPA offset
+	 * ... but do the math anyway
+	 */
+	dpa_offset = extent->dpa_start - cxled->dpa_res->start;
+	ed_hpa_range = &cxled->cxld.hpa_range;
+	hpa = ed_hpa_range->start + dpa_offset;
+	cxl_dr_ext->hpa_offset = hpa - cxlr_dax->hpa_range.start;
+
+	/* Without interleave carry length and label through */
+	cxl_dr_ext->hpa_length = extent->length;
+	snprintf(cxl_dr_ext->label, CXL_EXTENT_LABEL_LEN, "%s",
+		 extent->tag);
+
+	dev_dbg(dev, "Inserting at HPA:%llx\n", cxl_dr_ext->hpa_offset);
+	rc = xa_insert(&cxlr_dax->extents, cxl_dr_ext->hpa_offset, cxl_dr_ext,
+		       GFP_KERNEL);
+	if (rc) {
+		dev_err(dev, "Failed to insert extent %d\n", rc);
+		kfree(cxl_dr_ext);
+		return rc;
+	}
+	/* Put in cxl_dr_release() */
+	cxl_dc_extent_get(cxl_dr_ext->extent);
+	return 0;
+}
+
+static int cxl_ed_add_extents(struct cxl_endpoint_decoder *cxled)
+{
+	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+	struct cxl_memdev_state *mds = container_of(cxlds,
+						    struct cxl_memdev_state,
+						    cxlds);
+	struct device *dev = &cxled->cxld.dev;
+	struct cxl_dc_extent_data *extent;
+	unsigned long index;
+
+	dev_dbg(dev, "Searching for DC extents\n");
+	xa_for_each(&mds->dc_extent_list, index, extent) {
+		/*
+		 * get not zero is important because this races with the
+		 * memory device, which could be removing the extent at the
+		 * same time.
+		 */
+		if (cxl_dc_extent_get_not_zero(extent)) {
+			int rc = 0;
+
+			if (cxl_dc_extent_in_ed(cxled, extent)) {
+				dev_dbg(dev, "Found extent DPA:%llx LEN:%llx\n",
+					extent->dpa_start, extent->length);
+				rc = cxl_ed_add_one_extent(cxled, extent);
+			}
+			cxl_dc_extent_put(extent);
+			if (rc)
+				return rc;
+		}
+	}
+	return 0;
+}
+
 static int cxl_region_attach_position(struct cxl_region *cxlr,
 				      struct cxl_root_decoder *cxlrd,
 				      struct cxl_endpoint_decoder *cxled,
@@ -2702,10 +2818,44 @@ static struct cxl_pmem_region *cxl_pmem_region_alloc(struct cxl_region *cxlr)
 	return cxlr_pmem;
 }
 
+int __must_check cxl_dr_extent_get_not_zero(struct cxl_dr_extent *cxl_dr_ext)
+{
+	return kref_get_unless_zero(&cxl_dr_ext->region_ref);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dr_extent_get_not_zero, CXL);
+
+void cxl_dr_extent_get(struct cxl_dr_extent *cxl_dr_ext)
+{
+	return kref_get(&cxl_dr_ext->region_ref);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dr_extent_get, CXL);
+
+static void cxl_dr_release(struct kref *kref)
+{
+	struct cxl_dr_extent *cxl_dr_ext = container_of(kref,
+						struct cxl_dr_extent,
+						region_ref);
+
+	cxl_dc_extent_put(cxl_dr_ext->extent);
+	kfree(cxl_dr_ext);
+}
+
+void cxl_dr_extent_put(struct cxl_dr_extent *cxl_dr_ext)
+{
+	kref_put(&cxl_dr_ext->region_ref, cxl_dr_release);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dr_extent_put, CXL);
+
 static void cxl_dax_region_release(struct device *dev)
 {
 	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
+	struct cxl_dr_extent *cxl_dr_ext;
+	unsigned long index;
 
+	xa_for_each(&cxlr_dax->extents, index, cxl_dr_ext) {
+		xa_erase(&cxlr_dax->extents, index);
+		cxl_dr_extent_put(cxl_dr_ext);
+	}
 	kfree(cxlr_dax);
 }
 
@@ -2756,6 +2906,7 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
 
 	cxlr_dax->hpa_range.start = p->res->start;
 	cxlr_dax->hpa_range.end = p->res->end;
+	xa_init(&cxlr_dax->extents);
 
 	dev = &cxlr_dax->dev;
 	cxlr_dax->cxlr = cxlr;
@@ -2862,7 +3013,17 @@ static void cxlr_dax_unregister(void *_cxlr_dax)
 	device_unregister(&cxlr_dax->dev);
 }
 
-static int __devm_cxl_add_dax_region(struct cxl_region *cxlr)
+static int cxl_region_add_dc_extents(struct cxl_region *cxlr)
+{
+	for (int i = 0; i < cxlr->params.nr_targets; i++) {
+		int rc = cxl_ed_add_extents(cxlr->params.targets[i]);
+		if (rc)
+			return rc;
+	}
+	return 0;
+}
+
+static int __devm_cxl_add_dax_region(struct cxl_region *cxlr, bool is_dc)
 {
 	struct cxl_dax_region *cxlr_dax;
 	struct device *dev;
@@ -2877,6 +3038,17 @@ static int __devm_cxl_add_dax_region(struct cxl_region *cxlr)
 	if (rc)
 		goto err;
 
+	cxlr->cxlr_dax = cxlr_dax;
+	if (is_dc) {
+		/*
+		 * Process device extents prior to surfacing the device to
+		 * ensure the cxl_dax_region driver has access to prior extents
+		 */
+		rc = cxl_region_add_dc_extents(cxlr);
+		if (rc)
+			goto err;
+	}
+
 	rc = device_add(dev);
 	if (rc)
 		goto err;
@@ -2893,7 +3065,7 @@ static int __devm_cxl_add_dax_region(struct cxl_region *cxlr)
 
 static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
 {
-	return __devm_cxl_add_dax_region(cxlr);
+	return __devm_cxl_add_dax_region(cxlr, false);
 }
 
 static int devm_cxl_add_dc_dax_region(struct cxl_region *cxlr)
@@ -2902,8 +3074,7 @@ static int devm_cxl_add_dc_dax_region(struct cxl_region *cxlr)
 		dev_err(&cxlr->dev, "Interleaving DC not supported\n");
 		return -EINVAL;
 	}
-
-	return __devm_cxl_add_dax_region(cxlr);
+	return __devm_cxl_add_dax_region(cxlr, true);
 }
 
 static int match_decoder_by_range(struct device *dev, void *data)
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 81ca76ae1d02..177b892ac53f 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -555,6 +555,7 @@ struct cxl_region_params {
  * @type: Endpoint decoder target type
  * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
  * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
+ * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge
  * @flags: Region state flags
  * @params: active + config params for the region
  */
@@ -565,6 +566,7 @@ struct cxl_region {
 	enum cxl_decoder_type type;
 	struct cxl_nvdimm_bridge *cxl_nvb;
 	struct cxl_pmem_region *cxlr_pmem;
+	struct cxl_dax_region *cxlr_dax;
 	unsigned long flags;
 	struct cxl_region_params params;
 };
@@ -614,8 +616,22 @@ struct cxl_dax_region {
 	struct device dev;
 	struct cxl_region *cxlr;
 	struct range hpa_range;
+	struct xarray extents;
 };
 
+/* Interleave will manage multiple cxl_dc_extent_data objects */
+#define CXL_EXTENT_LABEL_LEN 64
+struct cxl_dr_extent {
+	struct kref region_ref;
+	u64 hpa_offset;
+	u64 hpa_length;
+	char label[CXL_EXTENT_LABEL_LEN];
+	struct cxl_dc_extent_data *extent;
+};
+int cxl_dr_extent_get_not_zero(struct cxl_dr_extent *cxl_dr_ext);
+void cxl_dr_extent_get(struct cxl_dr_extent *cxl_dr_ext);
+void cxl_dr_extent_put(struct cxl_dr_extent *cxl_dr_ext);
+
 /**
  * struct cxl_port - logical collection of upstream port devices and
  *		     downstream port devices to construct a CXL memory
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 118392229174..8ca81fd067c2 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -1002,6 +1002,8 @@ int cxl_trigger_poison_list(struct cxl_memdev *cxlmd);
 int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa);
 int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa);
 
+int cxl_dc_extent_get_not_zero(struct cxl_dc_extent_data *extent);
+void cxl_dc_extent_get(struct cxl_dc_extent_data *extent);
 void cxl_dc_extent_put(struct cxl_dc_extent_data *extent);
 
 #ifdef CONFIG_CXL_SUSPEND
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
index 5ed5c39857c8..38cd3c4c0898 100644
--- a/drivers/dax/Makefile
+++ b/drivers/dax/Makefile
@@ -7,6 +7,7 @@ obj-$(CONFIG_DEV_DAX_CXL) += dax_cxl.o
 
 dax-y := super.o
 dax-y += bus.o
+dax-y += extent.o
 device_dax-y := device.o
 dax_pmem-y := pmem.o
 dax_cxl-y := cxl.o
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 147c8c69782b..057b00b1d914 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -5,6 +5,87 @@
 
 #include "../cxl/cxl.h"
 #include "bus.h"
+#include "dax-private.h"
+
+static void dax_reg_ext_get(struct dax_region_extent *dr_extent)
+{
+	kref_get(&dr_extent->ref);
+}
+
+static void dr_release(struct kref *kref)
+{
+	struct dax_region_extent *dr_extent;
+	struct cxl_dr_extent *cxl_dr_ext;
+
+	dr_extent = container_of(kref, struct dax_region_extent, ref);
+	cxl_dr_ext = dr_extent->private_data;
+	cxl_dr_extent_put(cxl_dr_ext);
+	kfree(dr_extent);
+}
+
+static void dax_reg_ext_put(struct dax_region_extent *dr_extent)
+{
+	kref_put(&dr_extent->ref, dr_release);
+}
+
+static int cxl_dax_region_create_extent(struct dax_region *dax_region,
+					struct cxl_dr_extent *cxl_dr_ext)
+{
+	struct dax_region_extent *dr_extent;
+	int rc;
+
+	dr_extent = kzalloc(sizeof(*dr_extent), GFP_KERNEL);
+	if (!dr_extent)
+		return -ENOMEM;
+
+	dr_extent->private_data = cxl_dr_ext;
+	dr_extent->get = dax_reg_ext_get;
+	dr_extent->put = dax_reg_ext_put;
+
+	/* device manages the dr_extent on success */
+	kref_init(&dr_extent->ref);
+
+	rc = dax_region_ext_create_dev(dax_region, dr_extent,
+				       cxl_dr_ext->hpa_offset,
+				       cxl_dr_ext->hpa_length,
+				       cxl_dr_ext->label);
+	if (rc) {
+		kfree(dr_extent);
+		return rc;
+	}
+
+	/* extent accepted */
+	cxl_dr_extent_get(cxl_dr_ext);
+	return 0;
+}
+
+static int cxl_dax_region_create_extents(struct cxl_dax_region *cxlr_dax)
+{
+	struct cxl_dr_extent *cxl_dr_ext;
+	unsigned long index;
+
+	dev_dbg(&cxlr_dax->dev, "Adding extents\n");
+	xa_for_each(&cxlr_dax->extents, index, cxl_dr_ext) {
+		/*
+		 * get not zero is important because this races with the
+		 * region driver, which itself races with the memory device
+		 * that could be removing the extent at the same time.
+		 */
+		if (cxl_dr_extent_get_not_zero(cxl_dr_ext)) {
+			struct dax_region *dax_region;
+			int rc;
+
+			dax_region = dev_get_drvdata(&cxlr_dax->dev);
+			dev_dbg(&cxlr_dax->dev, "Found OFF:%llx LEN:%llx\n",
+				cxl_dr_ext->hpa_offset, cxl_dr_ext->hpa_length);
+			rc = cxl_dax_region_create_extent(dax_region, cxl_dr_ext);
+			cxl_dr_extent_put(cxl_dr_ext);
+			if (rc)
+				return rc;
+		}
+	}
+	return 0;
+}
 
 static int cxl_dax_region_probe(struct device *dev)
 {
@@ -19,20 +100,28 @@ static int cxl_dax_region_probe(struct device *dev)
 	if (nid == NUMA_NO_NODE)
 		nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
 
-	dev_size = range_len(&cxlr_dax->hpa_range);
-
 	flags = IORESOURCE_DAX_KMEM;
-	if (cxlr->mode == CXL_REGION_DC) {
-		/* Add empty seed dax device */
-		dev_size = 0;
+	if (cxlr->mode == CXL_REGION_DC)
 		flags |= IORESOURCE_DAX_DYNAMIC_CAP;
-	}
 
 	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
 				      PMD_SIZE, flags);
 	if (!dax_region)
 		return -ENOMEM;
 
+	dev_size = range_len(&cxlr_dax->hpa_range);
+	if (cxlr->mode == CXL_REGION_DC) {
+		int rc;
+
+		/* NOTE: Depends on dax_region being set in driver data */
+		rc = cxl_dax_region_create_extents(cxlr_dax);
+		if (rc)
+			return rc;
+
+		/* Add empty seed dax device */
+		dev_size = 0;
+	}
+
 	data = (struct dev_dax_data) {
 		.dax_region = dax_region,
 		.id = -1,
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 27cf2daaaa79..4dab52496c3f 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -5,6 +5,7 @@
 #ifndef __DAX_PRIVATE_H__
 #define __DAX_PRIVATE_H__
 
+#include <linux/pgtable.h>
 #include <linux/device.h>
 #include <linux/cdev.h>
 #include <linux/idr.h>
@@ -40,6 +41,58 @@ struct dax_region {
 	struct device *youngest;
 };
 
+/*
+ * struct dax_region_extent - extent data defined by the low level region
+ * driver.
+ * @private_data: lower level region driver data
+ * @ref: track number of dax devices which are using this extent
+ * @get: get reference to low level data
+ * @put: put reference to low level data
+ */
+struct dax_region_extent {
+	void *private_data;
+	struct kref ref;
+	void (*get)(struct dax_region_extent *dr_extent);
+	void (*put)(struct dax_region_extent *dr_extent);
+};
+
+static inline void dr_extent_get(struct dax_region_extent *dr_extent)
+{
+	if (dr_extent->get)
+		dr_extent->get(dr_extent);
+}
+
+static inline void dr_extent_put(struct dax_region_extent *dr_extent)
+{
+	if (dr_extent->put)
+		dr_extent->put(dr_extent);
+}
+
+#define DAX_EXTENT_LABEL_LEN 64
+/**
+ * struct dax_reg_ext_dev - Device object to expose extent information
+ * @dev: device representing this extent
+ * @dr_extent: reference back to private extent data
+ * @offset: offset of this extent
+ * @length: size of this extent
+ * @label: identifier to group extents
+ */
+struct dax_reg_ext_dev {
+	struct device dev;
+	struct dax_region_extent *dr_extent;
+	resource_size_t offset;
+	resource_size_t length;
+	char label[DAX_EXTENT_LABEL_LEN];
+};
+
+int dax_region_ext_create_dev(struct dax_region *dax_region,
+			      struct dax_region_extent *dr_extent,
+			      resource_size_t offset,
+			      resource_size_t length,
+			      const char *label);
+#define to_dr_ext_dev(dev)	\
+	container_of(dev, struct dax_reg_ext_dev, dev)
+
 struct dax_mapping {
 	struct device dev;
 	int range_id;
diff --git a/drivers/dax/extent.c b/drivers/dax/extent.c
new file mode 100644
index 000000000000..2075ccfb21cb
--- /dev/null
+++ b/drivers/dax/extent.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2023 Intel Corporation. All rights reserved. */
+
+#include <linux/device.h>
+#include <linux/slab.h>
+#include "dax-private.h"
+
+static ssize_t length_show(struct device *dev, struct device_attribute *attr,
+			 char *buf)
+{
+	struct dax_reg_ext_dev *dr_reg_ext_dev = to_dr_ext_dev(dev);
+
+	return sysfs_emit(buf, "%#llx\n", dr_reg_ext_dev->length);
+}
+static DEVICE_ATTR_RO(length);
+
+static ssize_t label_show(struct device *dev, struct device_attribute *attr,
+			  char *buf)
+{
+	struct dax_reg_ext_dev *dr_reg_ext_dev = to_dr_ext_dev(dev);
+
+	return sysfs_emit(buf, "%s\n", dr_reg_ext_dev->label);
+}
+
+static ssize_t label_store(struct device *dev, struct device_attribute *attr,
+			   const char *buf, size_t len)
+{
+	struct dax_reg_ext_dev *dr_reg_ext_dev = to_dr_ext_dev(dev);
+
+	snprintf(dr_reg_ext_dev->label, DAX_EXTENT_LABEL_LEN, "%s", buf);
+	return len;
+}
+static DEVICE_ATTR_RW(label);
+
+static struct attribute *dr_extent_attrs[] = {
+	&dev_attr_length.attr,
+	&dev_attr_label.attr,
+	NULL,
+};
+
+static const struct attribute_group dr_extent_attribute_group = {
+	.attrs = dr_extent_attrs,
+};
+
+static void dr_extent_release(struct device *dev)
+{
+	struct dax_reg_ext_dev *dr_reg_ext_dev = to_dr_ext_dev(dev);
+
+	kfree(dr_reg_ext_dev);
+}
+
+static const struct attribute_group *dr_extent_attribute_groups[] = {
+	&dr_extent_attribute_group,
+	NULL,
+};
+
+const struct device_type dr_extent_type = {
+	.name = "extent",
+	.release = dr_extent_release,
+	.groups = dr_extent_attribute_groups,
+};
+
+static void unregister_dr_extent(void *ext)
+{
+	struct dax_reg_ext_dev *dr_reg_ext_dev = ext;
+	struct dax_region_extent *dr_extent;
+
+	dr_extent = dr_reg_ext_dev->dr_extent;
+	dev_dbg(&dr_reg_ext_dev->dev, "Unregister DAX region ext OFF:%llx L:%s\n",
+		dr_reg_ext_dev->offset, dr_reg_ext_dev->label);
+	dr_extent_put(dr_extent);
+	device_unregister(&dr_reg_ext_dev->dev);
+}
+
+int dax_region_ext_create_dev(struct dax_region *dax_region,
+			      struct dax_region_extent *dr_extent,
+			      resource_size_t offset,
+			      resource_size_t length,
+			      const char *label)
+{
+	struct dax_reg_ext_dev *dr_reg_ext_dev;
+	struct device *dev;
+	int rc;
+
+	dr_reg_ext_dev = kzalloc(sizeof(*dr_reg_ext_dev), GFP_KERNEL);
+	if (!dr_reg_ext_dev)
+		return -ENOMEM;
+
+	dr_reg_ext_dev->dr_extent = dr_extent;
+	dr_reg_ext_dev->offset = offset;
+	dr_reg_ext_dev->length = length;
+	snprintf(dr_reg_ext_dev->label, DAX_EXTENT_LABEL_LEN, "%s", label);
+
+	dev = &dr_reg_ext_dev->dev;
+	device_initialize(dev);
+	dev->id = offset / PMD_SIZE;
+	device_set_pm_not_required(dev);
+	dev->parent = dax_region->dev;
+	dev->type = &dr_extent_type;
+	rc = dev_set_name(dev, "extent%d", dev->id);
+	if (rc)
+		goto err;
+
+	rc = device_add(dev);
+	if (rc)
+		goto err;
+
+	dev_dbg(dev, "DAX region extent OFF:%llx LEN:%llx\n",
+		dr_reg_ext_dev->offset, dr_reg_ext_dev->length);
+	return devm_add_action_or_reset(dax_region->dev, unregister_dr_extent,
+					dr_reg_ext_dev);
+
+err:
+	dev_err(dev, "Failed to initialize DAX extent dev OFF:%llx LEN:%llx\n",
+		dr_reg_ext_dev->offset, dr_reg_ext_dev->length);
+	put_device(dev);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(dax_region_ext_create_dev);

-- 
2.41.0



* [PATCH RFC v2 12/18] cxl/region: Notify regions of DC changes
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (10 preceding siblings ...)
  2023-08-29  5:21 ` [PATCH RFC v2 11/18] cxl/region: Expose DC extents on region driver load Ira Weiny
@ 2023-08-29  5:21 ` Ira Weiny
  2023-08-29 16:40   ` Jonathan Cameron
  2023-09-18 13:56   ` Jørgen Hansen
  2023-08-29  5:21 ` [PATCH RFC v2 13/18] dax/bus: Factor out dev dax resize logic Ira Weiny
                   ` (6 subsequent siblings)
  18 siblings, 2 replies; 97+ messages in thread
From: Ira Weiny @ 2023-08-29  5:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

For users to make effective use of dynamic capacity, they need to know
when that capacity becomes available.  Thus, when Dynamic Capacity (DC)
extents are added or removed by a DC device, the affected regions need
to be notified.  Ultimately the DAX region uses the memory associated
with DC extents.  However, remember that CXL DAX regions maintain any
interleave details between devices.

When a DCD event occurs, iterate all CXL endpoint decoders and notify
the regions which contain the endpoints affected by the event.  In turn,
notify the DAX regions of the changes to the DAX region extents.

For now interleave is handled by creating simple 1:1 mappings between
the CXL DAX region and DAX region layers.  Future implementations will
need to resolve when to actually surface a DAX region extent and pass
the notification along.

Remember that adding capacity is safe because there is no chance of the
memory being in use.  Also, at this point releasing capacity is
straightforward because DAX devices do not yet have references to the
extents.  Future patches will handle that complication.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from v1:
[iweiny: Rewrite]
---
 drivers/cxl/core/mbox.c   |  39 +++++++++++++--
 drivers/cxl/core/region.c | 123 +++++++++++++++++++++++++++++++++++++++++-----
 drivers/cxl/cxl.h         |  22 +++++++++
 drivers/cxl/mem.c         |  50 +++++++++++++++++++
 drivers/dax/cxl.c         |  99 ++++++++++++++++++++++++++++++-------
 drivers/dax/dax-private.h |   3 ++
 drivers/dax/extent.c      |  14 ++++++
 7 files changed, 317 insertions(+), 33 deletions(-)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 5472ab1d0370..9d9c13e13ecf 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -824,6 +824,35 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
 
+static int cxl_notify_dc_extent(struct cxl_memdev_state *mds,
+				enum dc_event event,
+				struct cxl_dc_extent_data *extent)
+{
+	struct cxl_drv_nd nd = (struct cxl_drv_nd) {
+		.event = event,
+		.extent = extent
+	};
+	struct device *dev;
+	int rc = 0;
+
+	dev = &mds->cxlds.cxlmd->dev;
+	dev_dbg(dev, "Trying notify: type %d DPA:%llx LEN:%llx\n",
+		event, extent->dpa_start, extent->length);
+
+	device_lock(dev);
+	if (dev->driver) {
+		struct cxl_driver *mem_drv = to_cxl_drv(dev->driver);
+
+		if (mem_drv->notify) {
+			dev_dbg(dev, "Notify: type %d DPA:%llx LEN:%llx\n",
+				event, extent->dpa_start, extent->length);
+			rc = mem_drv->notify(dev, &nd);
+		}
+	}
+	device_unlock(dev);
+	return rc;
+}
+
 static int cxl_store_dc_extent(struct cxl_memdev_state *mds,
 			       struct cxl_dc_extent *dc_extent)
 {
@@ -852,9 +881,10 @@ static int cxl_store_dc_extent(struct cxl_memdev_state *mds,
 			dev_warn_once(dev, "Duplicate extent DPA:%llx LEN:%llx\n",
 				      extent->dpa_start, extent->length);
 		kfree(extent);
+		return rc;
 	}
 
-	return rc;
+	return cxl_notify_dc_extent(mds, DCD_ADD_CAPACITY, extent);
 }
 
 /*
@@ -1074,7 +1104,8 @@ void cxl_dc_extent_put(struct cxl_dc_extent_data *extent)
 EXPORT_SYMBOL_NS_GPL(cxl_dc_extent_put, CXL);
 
 static int cxl_handle_dcd_release_event(struct cxl_memdev_state *mds,
-					struct cxl_dc_extent *rel_extent)
+					struct cxl_dc_extent *rel_extent,
+					enum dc_event event)
 {
 	struct device *dev = mds->cxlds.dev;
 	struct cxl_dc_extent_data *extent;
@@ -1090,6 +1121,7 @@ static int cxl_handle_dcd_release_event(struct cxl_memdev_state *mds,
 		dev_err(dev, "No extent found with DPA:0x%llx\n", dpa);
 		return -EINVAL;
 	}
+	cxl_notify_dc_extent(mds, event, extent);
 	cxl_dc_extent_put(extent);
 	return 0;
 }
@@ -1151,7 +1183,8 @@ static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
 		break;
 	case DCD_RELEASE_CAPACITY:
         case DCD_FORCED_CAPACITY_RELEASE:
-		rc = cxl_handle_dcd_release_event(mds, &record->data.extent);
+		rc = cxl_handle_dcd_release_event(mds, &record->data.extent,
+						  record->data.event_type);
 		break;
 	default:
 		return -EINVAL;
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 0aeea50550f6..a0c1f2793dd7 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1547,8 +1547,8 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
 	return 0;
 }
 
-static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
-				struct cxl_dc_extent_data *extent)
+bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
+			 struct cxl_dc_extent_data *extent)
 {
 	struct range dpa_range = (struct range){
 		.start = extent->dpa_start,
@@ -1567,14 +1567,66 @@ static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
 	return (cxled->dpa_res->start <= dpa_range.start &&
 		dpa_range.end <= cxled->dpa_res->end);
 }
+EXPORT_SYMBOL_NS_GPL(cxl_dc_extent_in_ed, CXL);
+
+static int cxl_region_notify_extent(struct cxl_endpoint_decoder *cxled,
+				    enum dc_event event,
+				    struct cxl_dr_extent *cxl_dr_ext)
+{
+	struct cxl_dax_region *cxlr_dax;
+	struct device *dev;
+	int rc = 0;
+
+	cxlr_dax = cxled->cxld.region->cxlr_dax;
+	dev = &cxlr_dax->dev;
+	dev_dbg(dev, "Trying notify: type %d HPA:%llx LEN:%llx\n",
+		event, cxl_dr_ext->hpa_offset, cxl_dr_ext->hpa_length);
+
+	device_lock(dev);
+	if (dev->driver) {
+		struct cxl_driver *reg_drv = to_cxl_drv(dev->driver);
+		struct cxl_drv_nd nd = (struct cxl_drv_nd) {
+			.event = event,
+			.cxl_dr_ext = cxl_dr_ext,
+		};
+
+		if (reg_drv->notify) {
+			dev_dbg(dev, "Notify: type %d HPA:%llx LEN:%llx\n",
+				event, cxl_dr_ext->hpa_offset,
+				cxl_dr_ext->hpa_length);
+			rc = reg_drv->notify(dev, &nd);
+		}
+	}
+	device_unlock(dev);
+	return rc;
+}
+
+static resource_size_t
+cxl_dc_extent_to_hpa_offset(struct cxl_endpoint_decoder *cxled,
+			    struct cxl_dc_extent_data *extent)
+{
+	struct cxl_dax_region *cxlr_dax;
+	resource_size_t dpa_offset, hpa;
+	struct range *ed_hpa_range;
+
+	cxlr_dax = cxled->cxld.region->cxlr_dax;
+
+	/*
+	 * Without interleave...
+	 * HPA offset == DPA offset
+	 * ... but do the math anyway
+	 */
+	dpa_offset = extent->dpa_start - cxled->dpa_res->start;
+	ed_hpa_range = &cxled->cxld.hpa_range;
+	hpa = ed_hpa_range->start + dpa_offset;
+	return hpa - cxlr_dax->hpa_range.start;
+}
 
 static int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
 				 struct cxl_dc_extent_data *extent)
 {
 	struct cxl_dr_extent *cxl_dr_ext;
 	struct cxl_dax_region *cxlr_dax;
-	resource_size_t dpa_offset, hpa;
-	struct range *ed_hpa_range;
 	struct device *dev;
 	int rc;
 
@@ -1601,15 +1653,7 @@ static int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
 	cxl_dr_ext->extent = extent;
 	kref_init(&cxl_dr_ext->region_ref);
 
-	/*
-	 * Without interleave...
-	 * HPA offset == DPA offset
-	 * ... but do the math anyway
-	 */
-	dpa_offset = extent->dpa_start - cxled->dpa_res->start;
-	ed_hpa_range = &cxled->cxld.hpa_range;
-	hpa = ed_hpa_range->start + dpa_offset;
-	cxl_dr_ext->hpa_offset = hpa - cxlr_dax->hpa_range.start;
+	cxl_dr_ext->hpa_offset = cxl_dc_extent_to_hpa_offset(cxled, extent);
 
 	/* Without interleave carry length and label through */
 	cxl_dr_ext->hpa_length = extent->length;
@@ -1626,6 +1670,7 @@ static int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
 	}
 	/* Put in cxl_dr_release() */
 	cxl_dc_extent_get(cxl_dr_ext->extent);
+	cxl_region_notify_extent(cxled, DCD_ADD_CAPACITY, cxl_dr_ext);
 	return 0;
 }
 
@@ -1663,6 +1708,58 @@ static int cxl_ed_add_extents(struct cxl_endpoint_decoder *cxled)
 	return 0;
 }
 
+static int cxl_ed_rm_dc_extent(struct cxl_endpoint_decoder *cxled,
+			       enum dc_event event,
+			       struct cxl_dc_extent_data *extent)
+{
+	struct cxl_region *cxlr = cxled->cxld.region;
+	struct cxl_dax_region *cxlr_dax = cxlr->cxlr_dax;
+	struct cxl_dr_extent *cxl_dr_ext;
+	resource_size_t hpa_offset;
+
+	hpa_offset = cxl_dc_extent_to_hpa_offset(cxled, extent);
+
+	/*
+	 * NOTE on Interleaving: There is no need to 'break up' the cxl_dr_ext.
+	 * If one of the extents comprising it is gone it should be removed
+	 * from the region to prevent future use.  Later code may save other
+	 * extents for future processing.  But for now the correlation is 1:1:1
+	 * so just erase the extent.
+	 */
+	cxl_dr_ext = xa_erase(&cxlr_dax->extents, hpa_offset);
+
+	dev_dbg(&cxlr_dax->dev, "Remove DAX region ext HPA:%llx\n",
+		cxl_dr_ext->hpa_offset);
+	cxl_region_notify_extent(cxled, event, cxl_dr_ext);
+	cxl_dr_extent_put(cxl_dr_ext);
+	return 0;
+}
+
+int cxl_ed_notify_extent(struct cxl_endpoint_decoder *cxled,
+			 struct cxl_drv_nd *nd)
+{
+	int rc = 0;
+
+	switch (nd->event) {
+	case DCD_ADD_CAPACITY:
+		if (cxl_dc_extent_get_not_zero(nd->extent)) {
+			rc = cxl_ed_add_one_extent(cxled, nd->extent);
+			if (rc)
+				cxl_dc_extent_put(nd->extent);
+		}
+		break;
+	case DCD_RELEASE_CAPACITY:
+	case DCD_FORCED_CAPACITY_RELEASE:
+		rc = cxl_ed_rm_dc_extent(cxled, nd->event, nd->extent);
+		break;
+	default:
+		dev_err(&cxled->cxld.dev, "Unknown DC event %d\n", nd->event);
+		break;
+	}
+	return rc;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_ed_notify_extent, CXL);
+
 static int cxl_region_attach_position(struct cxl_region *cxlr,
 				      struct cxl_root_decoder *cxlrd,
 				      struct cxl_endpoint_decoder *cxled,
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 177b892ac53f..2c73a30980b6 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -838,10 +838,18 @@ bool is_cxl_region(struct device *dev);
 
 extern struct bus_type cxl_bus_type;
 
+/* Driver Notifier Data */
+struct cxl_drv_nd {
+	enum dc_event event;
+	struct cxl_dc_extent_data *extent;
+	struct cxl_dr_extent *cxl_dr_ext;
+};
+
 struct cxl_driver {
 	const char *name;
 	int (*probe)(struct device *dev);
 	void (*remove)(struct device *dev);
+	int (*notify)(struct device *dev, struct cxl_drv_nd *nd);
 	struct device_driver drv;
 	int id;
 };
@@ -887,6 +895,10 @@ struct cxl_pmem_region *to_cxl_pmem_region(struct device *dev);
 int cxl_add_to_region(struct cxl_port *root,
 		      struct cxl_endpoint_decoder *cxled);
 struct cxl_dax_region *to_cxl_dax_region(struct device *dev);
+bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
+			 struct cxl_dc_extent_data *extent);
+int cxl_ed_notify_extent(struct cxl_endpoint_decoder *cxled,
+			 struct cxl_drv_nd *nd);
 #else
 static inline bool is_cxl_pmem_region(struct device *dev)
 {
@@ -905,6 +917,16 @@ static inline struct cxl_dax_region *to_cxl_dax_region(struct device *dev)
 {
 	return NULL;
 }
+static inline bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
+				       struct cxl_dc_extent_data *extent)
+{
+	return false;
+}
+static inline int cxl_ed_notify_extent(struct cxl_endpoint_decoder *cxled,
+				       struct cxl_drv_nd *nd)
+{
+	return 0;
+}
 #endif
 
 /*
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 80cffa40e91a..d3c4c9c87392 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -104,6 +104,55 @@ static int cxl_debugfs_poison_clear(void *data, u64 dpa)
 DEFINE_DEBUGFS_ATTRIBUTE(cxl_poison_clear_fops, NULL,
 			 cxl_debugfs_poison_clear, "%llx\n");
 
+static int match_ep_decoder_by_range(struct device *dev, void *data)
+{
+	struct cxl_dc_extent_data *extent = data;
+	struct cxl_endpoint_decoder *cxled;
+
+	if (!is_endpoint_decoder(dev))
+		return 0;
+	cxled = to_cxl_endpoint_decoder(dev);
+	return cxl_dc_extent_in_ed(cxled, extent);
+}
+
+static struct cxl_endpoint_decoder *cxl_find_ed(struct cxl_memdev_state *mds,
+						struct cxl_dc_extent_data *extent)
+{
+	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
+	struct cxl_port *endpoint = cxlmd->endpoint;
+	struct device *dev;
+
+	dev = device_find_child(&endpoint->dev, extent,
+				match_ep_decoder_by_range);
+	if (!dev) {
+		dev_dbg(mds->cxlds.dev, "Extent DPA:%llx LEN:%llx not mapped\n",
+			extent->dpa_start, extent->length);
+		return NULL;
+	}
+
+	return to_cxl_endpoint_decoder(dev);
+}
+
+static int cxl_mem_notify(struct device *dev, struct cxl_drv_nd *nd)
+{
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+	struct cxl_endpoint_decoder *cxled;
+	struct cxl_dc_extent_data *extent;
+	int rc = 0;
+
+	extent = nd->extent;
+	dev_dbg(dev, "notify DC action %d DPA:%llx LEN:%llx\n",
+		nd->event, extent->dpa_start, extent->length);
+
+	cxled = cxl_find_ed(mds, extent);
+	if (!cxled)
+		return 0;
+	rc = cxl_ed_notify_extent(cxled, nd);
+	put_device(&cxled->cxld.dev);
+	return rc;
+}
+
 static int cxl_mem_probe(struct device *dev)
 {
 	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
@@ -247,6 +296,7 @@ __ATTRIBUTE_GROUPS(cxl_mem);
 static struct cxl_driver cxl_mem_driver = {
 	.name = "cxl_mem",
 	.probe = cxl_mem_probe,
+	.notify = cxl_mem_notify,
 	.id = CXL_DEVICE_MEMORY_EXPANDER,
 	.drv = {
 		.dev_groups = cxl_mem_groups,
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 057b00b1d914..44cbd28668f1 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -59,6 +59,29 @@ static int cxl_dax_region_create_extent(struct dax_region *dax_region,
 	return 0;
 }
 
+static int cxl_dax_region_add_extent(struct cxl_dax_region *cxlr_dax,
+				     struct cxl_dr_extent *cxl_dr_ext)
+{
+	/*
+	 * get not zero is important because this is racing with the
+	 * region driver which is racing with the memory device which
+	 * could be removing the extent at the same time.
+	 */
+	if (cxl_dr_extent_get_not_zero(cxl_dr_ext)) {
+		struct dax_region *dax_region;
+		int rc;
+
+		dax_region = dev_get_drvdata(&cxlr_dax->dev);
+		dev_dbg(&cxlr_dax->dev, "Creating HPA:%llx LEN:%llx\n",
+			cxl_dr_ext->hpa_offset, cxl_dr_ext->hpa_length);
+		rc = cxl_dax_region_create_extent(dax_region, cxl_dr_ext);
+		cxl_dr_extent_put(cxl_dr_ext);
+		if (rc)
+			return rc;
+	}
+	return 0;
+}
+
 static int cxl_dax_region_create_extents(struct cxl_dax_region *cxlr_dax)
 {
 	struct cxl_dr_extent *cxl_dr_ext;
@@ -66,27 +89,68 @@ static int cxl_dax_region_create_extents(struct cxl_dax_region *cxlr_dax)
 
 	dev_dbg(&cxlr_dax->dev, "Adding extents\n");
 	xa_for_each(&cxlr_dax->extents, index, cxl_dr_ext) {
-		/*
-		 * get not zero is important because this is racing with the
-		 * region driver which is racing with the memory device which
-		 * could be removing the extent at the same time.
-		 */
-		if (cxl_dr_extent_get_not_zero(cxl_dr_ext)) {
-			struct dax_region *dax_region;
-			int rc;
-
-			dax_region = dev_get_drvdata(&cxlr_dax->dev);
-			dev_dbg(&cxlr_dax->dev, "Found OFF:%llx LEN:%llx\n",
-				cxl_dr_ext->hpa_offset, cxl_dr_ext->hpa_length);
-			rc = cxl_dax_region_create_extent(dax_region, cxl_dr_ext);
-			cxl_dr_extent_put(cxl_dr_ext);
-			if (rc)
-				return rc;
-		}
+		int rc;
+
+		rc = cxl_dax_region_add_extent(cxlr_dax, cxl_dr_ext);
+		if (rc)
+			return rc;
 	}
 	return 0;
 }
 
+static int match_cxl_dr_extent(struct device *dev, void *data)
+{
+	struct dax_reg_ext_dev *dr_reg_ext_dev;
+	struct dax_region_extent *dr_extent;
+
+	if (!is_dr_ext_dev(dev))
+		return 0;
+
+	dr_reg_ext_dev = to_dr_ext_dev(dev);
+	dr_extent = dr_reg_ext_dev->dr_extent;
+	return data == dr_extent->private_data;
+}
+
+static int cxl_dax_region_rm_extent(struct cxl_dax_region *cxlr_dax,
+				    struct cxl_dr_extent *cxl_dr_ext)
+{
+	struct dax_reg_ext_dev *dr_reg_ext_dev;
+	struct dax_region *dax_region;
+	struct device *dev;
+
+	dev = device_find_child(&cxlr_dax->dev, cxl_dr_ext,
+				match_cxl_dr_extent);
+	if (!dev)
+		return -EINVAL;
+	dr_reg_ext_dev = to_dr_ext_dev(dev);
+	put_device(dev);
+	dax_region = dev_get_drvdata(&cxlr_dax->dev);
+	dax_region_ext_del_dev(dax_region, dr_reg_ext_dev);
+	return 0;
+}
+
+static int cxl_dax_region_notify(struct device *dev,
+				 struct cxl_drv_nd *nd)
+{
+	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
+	struct cxl_dr_extent *cxl_dr_ext = nd->cxl_dr_ext;
+	int rc = 0;
+
+	switch (nd->event) {
+	case DCD_ADD_CAPACITY:
+		rc = cxl_dax_region_add_extent(cxlr_dax, cxl_dr_ext);
+		break;
+	case DCD_RELEASE_CAPACITY:
+	case DCD_FORCED_CAPACITY_RELEASE:
+		rc = cxl_dax_region_rm_extent(cxlr_dax, cxl_dr_ext);
+		break;
+	default:
+		dev_err(&cxlr_dax->dev, "Unknown DC event %d\n", nd->event);
+		break;
+	}
+	return rc;
+}
+
 static int cxl_dax_region_probe(struct device *dev)
 {
 	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
@@ -134,6 +198,7 @@ static int cxl_dax_region_probe(struct device *dev)
 static struct cxl_driver cxl_dax_region_driver = {
 	.name = "cxl_dax_region",
 	.probe = cxl_dax_region_probe,
+	.notify = cxl_dax_region_notify,
 	.id = CXL_DEVICE_DAX_REGION,
 	.drv = {
 		.suppress_bind_attrs = true,
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 4dab52496c3f..250babd6e470 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -90,8 +90,11 @@ int dax_region_ext_create_dev(struct dax_region *dax_region,
 			      resource_size_t offset,
 			      resource_size_t length,
 			      const char *label);
+void dax_region_ext_del_dev(struct dax_region *dax_region,
+			    struct dax_reg_ext_dev *dr_reg_ext_dev);
 #define to_dr_ext_dev(dev)	\
 	container_of(dev, struct dax_reg_ext_dev, dev)
+bool is_dr_ext_dev(struct device *dev);
 
 struct dax_mapping {
 	struct device dev;
diff --git a/drivers/dax/extent.c b/drivers/dax/extent.c
index 2075ccfb21cb..dea6d408d2c8 100644
--- a/drivers/dax/extent.c
+++ b/drivers/dax/extent.c
@@ -60,6 +60,12 @@ const struct device_type dr_extent_type = {
 	.groups = dr_extent_attribute_groups,
 };
 
+bool is_dr_ext_dev(struct device *dev)
+{
+	return dev->type == &dr_extent_type;
+}
+EXPORT_SYMBOL_GPL(is_dr_ext_dev);
+
 static void unregister_dr_extent(void *ext)
 {
 	struct dax_reg_ext_dev *dr_reg_ext_dev = ext;
@@ -117,3 +123,11 @@ int dax_region_ext_create_dev(struct dax_region *dax_region,
 	return rc;
 }
 EXPORT_SYMBOL_GPL(dax_region_ext_create_dev);
+
+void dax_region_ext_del_dev(struct dax_region *dax_region,
+			    struct dax_reg_ext_dev *dr_reg_ext_dev)
+{
+	devm_remove_action(dax_region->dev, unregister_dr_extent, dr_reg_ext_dev);
+	unregister_dr_extent(dr_reg_ext_dev);
+}
+EXPORT_SYMBOL_GPL(dax_region_ext_del_dev);

-- 
2.41.0


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH RFC v2 13/18] dax/bus: Factor out dev dax resize logic
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (11 preceding siblings ...)
  2023-08-29  5:21 ` [PATCH RFC v2 12/18] cxl/region: Notify regions of DC changes Ira Weiny
@ 2023-08-29  5:21 ` Ira Weiny
  2023-08-30 11:27   ` Jonathan Cameron
  2023-08-31 21:48   ` Dave Jiang
  2023-08-29  5:21 ` [PATCH RFC v2 14/18] dax/region: Support DAX device creation on dynamic DAX regions Ira Weiny
                   ` (5 subsequent siblings)
  18 siblings, 2 replies; 97+ messages in thread
From: Ira Weiny @ 2023-08-29  5:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

Dynamic Capacity regions must limit dev dax resources to those areas
which have extents backing real memory.  Four alternatives were
considered to manage the intersection of region space and extents:

1) Create a single region resource child on region creation which
   reserves the entire region.  Then as extents are added punch holes in
   this reservation.  This requires new resource manipulation to punch
   the holes and still requires an additional iteration over the extent
   areas which may already have existing dev dax resources used.

2) Maintain an ordered xarray of extents which can be queried while
   processing the resize logic.  The issue is that existing region->res
   children may artificially limit the allocation size sent to
   alloc_dev_dax_range().  I.e., the resource children can't be directly
   used in the resize logic to find where space in the region is.

3) Maintain a separate resource tree with extents.  This option is the
   same as 2) but with a different data structure.  Most ideally we have
   some unified representation of the resource tree.

4) Create region resource children for each extent.  Manage the dax dev
   resize logic in the same way as before but use a region child
   (extent) resource as the parent to find space within each extent.

Option 4 can leverage the existing resize algorithm to find space within
the extents.

In preparation for this change, factor out the dev_dax_resize logic.
For static regions use dax_region->res as the parent to find space for
the dax ranges.  Future patches will use the same algorithm with
individual extent resources as the parent.
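The resize algorithm being factored out walks the parent resource's
ordered children looking for the first hole: space before the first
child, a gap between siblings, or space after the last child.  A
simplified userspace model of that search (structure and function names
are illustrative, not the kernel's struct resource API):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a resource with a sorted child list. */
struct sketch_res {
	unsigned long long start, end;	/* inclusive, like struct resource */
	struct sketch_res *sibling;	/* next child, sorted by start */
};

/*
 * Find the first hole in @parent, capped at @to_alloc.  Returns the
 * hole size (0 when the parent is full) and its start via @alloc_start.
 */
static unsigned long long find_gap(const struct sketch_res *parent,
				   const struct sketch_res *first,
				   unsigned long long to_alloc,
				   unsigned long long *alloc_start)
{
	const struct sketch_res *res;
	unsigned long long sz;

	if (!first) {			/* whole parent is free */
		*alloc_start = parent->start;
		sz = parent->end - parent->start + 1;
		return sz < to_alloc ? sz : to_alloc;
	}
	if (first->start > parent->start) {	/* space at the beginning */
		*alloc_start = parent->start;
		sz = first->start - parent->start;
		return sz < to_alloc ? sz : to_alloc;
	}
	for (res = first; res; res = res->sibling) {
		const struct sketch_res *next = res->sibling;

		sz = 0;
		if (next && next->start > res->end + 1)	/* gap in the middle */
			sz = next->start - (res->end + 1);
		else if (!next && res->end < parent->end) /* space at the end */
			sz = parent->end - res->end;
		if (sz) {
			*alloc_start = res->end + 1;
			return sz < to_alloc ? sz : to_alloc;
		}
	}
	return 0;	/* no space left */
}
```

With the parent passed in as a parameter, the same search works whether
the parent is the whole region resource (static case) or, later, an
individual extent resource.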

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/dax/bus.c | 128 +++++++++++++++++++++++++++++++++---------------------
 1 file changed, 79 insertions(+), 49 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index b76e49813a39..ea7ae82b4687 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -817,11 +817,10 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
 	return 0;
 }
 
-static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
-		resource_size_t size)
+static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
+			       u64 start, resource_size_t size)
 {
 	struct dax_region *dax_region = dev_dax->region;
-	struct resource *res = &dax_region->res;
 	struct device *dev = &dev_dax->dev;
 	struct dev_dax_range *ranges;
 	unsigned long pgoff = 0;
@@ -839,14 +838,14 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
 		return 0;
 	}
 
-	alloc = __request_region(res, start, size, dev_name(dev), 0);
+	alloc = __request_region(parent, start, size, dev_name(dev), 0);
 	if (!alloc)
 		return -ENOMEM;
 
 	ranges = krealloc(dev_dax->ranges, sizeof(*ranges)
 			* (dev_dax->nr_range + 1), GFP_KERNEL);
 	if (!ranges) {
-		__release_region(res, alloc->start, resource_size(alloc));
+		__release_region(parent, alloc->start, resource_size(alloc));
 		return -ENOMEM;
 	}
 
@@ -997,50 +996,45 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
 	return true;
 }
 
-static ssize_t dev_dax_resize(struct dax_region *dax_region,
-		struct dev_dax *dev_dax, resource_size_t size)
+/*
+ * dev_dax_resize_static - Expand the device into the unused portion of the
+ * region. This may involve adjusting the end of an existing resource, or
+ * allocating a new resource.
+ *
+ * @parent: parent resource to allocate this range in.
+ * @dev_dax: DAX device we are creating this range for
+ * @to_alloc: amount of space to alloc; must be <= space available in @parent
+ *
+ * Return the amount of space allocated or -ERRNO on failure
+ */
+static ssize_t dev_dax_resize_static(struct resource *parent,
+				     struct dev_dax *dev_dax,
+				     resource_size_t to_alloc)
 {
-	resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
-	resource_size_t dev_size = dev_dax_size(dev_dax);
-	struct resource *region_res = &dax_region->res;
-	struct device *dev = &dev_dax->dev;
 	struct resource *res, *first;
-	resource_size_t alloc = 0;
 	int rc;
 
-	if (dev->driver)
-		return -EBUSY;
-	if (size == dev_size)
-		return 0;
-	if (size > dev_size && size - dev_size > avail)
-		return -ENOSPC;
-	if (size < dev_size)
-		return dev_dax_shrink(dev_dax, size);
-
-	to_alloc = size - dev_size;
-	if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
-			"resize of %pa misaligned\n", &to_alloc))
-		return -ENXIO;
-
-	/*
-	 * Expand the device into the unused portion of the region. This
-	 * may involve adjusting the end of an existing resource, or
-	 * allocating a new resource.
-	 */
-retry:
-	first = region_res->child;
-	if (!first)
-		return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
+	first = parent->child;
+	if (!first) {
+		rc = alloc_dev_dax_range(parent, dev_dax,
+					   parent->start, to_alloc);
+		if (rc)
+			return rc;
+		return to_alloc;
+	}
 
-	rc = -ENOSPC;
 	for (res = first; res; res = res->sibling) {
 		struct resource *next = res->sibling;
+		resource_size_t alloc;
 
 		/* space at the beginning of the region */
-		if (res == first && res->start > dax_region->res.start) {
-			alloc = min(res->start - dax_region->res.start, to_alloc);
-			rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, alloc);
-			break;
+		if (res == first && res->start > parent->start) {
+			alloc = min(res->start - parent->start, to_alloc);
+			rc = alloc_dev_dax_range(parent, dev_dax,
+						 parent->start, alloc);
+			if (rc)
+				return rc;
+			return alloc;
 		}
 
 		alloc = 0;
@@ -1049,21 +1043,55 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
 			alloc = min(next->start - (res->end + 1), to_alloc);
 
 		/* space at the end of the region */
-		if (!alloc && !next && res->end < region_res->end)
-			alloc = min(region_res->end - res->end, to_alloc);
+		if (!alloc && !next && res->end < parent->end)
+			alloc = min(parent->end - res->end, to_alloc);
 
 		if (!alloc)
 			continue;
 
 		if (adjust_ok(dev_dax, res)) {
 			rc = adjust_dev_dax_range(dev_dax, res, resource_size(res) + alloc);
-			break;
+			if (rc)
+				return rc;
+			return alloc;
 		}
-		rc = alloc_dev_dax_range(dev_dax, res->end + 1, alloc);
-		break;
+		rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc);
+		if (rc)
+			return rc;
+		return alloc;
 	}
-	if (rc)
-		return rc;
+
+	/* available was already calculated and should never be an issue */
+	dev_WARN_ONCE(&dev_dax->dev, 1, "space not found?");
+	return 0;
+}
+
+static ssize_t dev_dax_resize(struct dax_region *dax_region,
+		struct dev_dax *dev_dax, resource_size_t size)
+{
+	resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
+	resource_size_t dev_size = dev_dax_size(dev_dax);
+	struct device *dev = &dev_dax->dev;
+	resource_size_t alloc = 0;
+
+	if (dev->driver)
+		return -EBUSY;
+	if (size == dev_size)
+		return 0;
+	if (size > dev_size && size - dev_size > avail)
+		return -ENOSPC;
+	if (size < dev_size)
+		return dev_dax_shrink(dev_dax, size);
+
+	to_alloc = size - dev_size;
+	if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
+			"resize of %pa misaligned\n", &to_alloc))
+		return -ENXIO;
+
+retry:
+	alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
+	if (alloc <= 0)
+		return alloc;
 	to_alloc -= alloc;
 	if (to_alloc)
 		goto retry;
@@ -1154,7 +1182,8 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
 
 	to_alloc = range_len(&r);
 	if (alloc_is_aligned(dev_dax, to_alloc))
-		rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
+		rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
+					 to_alloc);
 	device_unlock(dev);
 	device_unlock(dax_region->dev);
 
@@ -1371,7 +1400,8 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 	device_initialize(dev);
 	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
 
-	rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size);
+	rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
+				 data->size);
 	if (rc)
 		goto err_range;
 

-- 
2.41.0


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH RFC v2 14/18] dax/region: Support DAX device creation on dynamic DAX regions
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (12 preceding siblings ...)
  2023-08-29  5:21 ` [PATCH RFC v2 13/18] dax/bus: Factor out dev dax resize logic Ira Weiny
@ 2023-08-29  5:21 ` Ira Weiny
  2023-08-30 11:50   ` Jonathan Cameron
  2023-08-29  5:21 ` [PATCH RFC v2 15/18] cxl/mem: Trace Dynamic capacity Event Record ira.weiny
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 97+ messages in thread
From: Ira Weiny @ 2023-08-29  5:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

Dynamic Capacity (DC) DAX regions maintain a list of extents which
defines the memory of the region that is available.

Now that DAX region extents are fully realized, support DAX device
creation on dynamic regions by adjusting the allocation algorithms to
account for the extents.  Remember also that references must be held on
the extents until the DAX devices are done with the memory.

Redefine the region available size to include only extent space.  Reuse
the size allocation algorithm by defining sub-resources for each extent
and limiting range allocation to those extents which have space.  Do not
support direct mapping of DAX devices on dynamic devices.

Enhance DAX device range objects to hold references on the extents until
the DAX device is destroyed.

NOTE: At this time all extents within a region are created equally.
However, labels are associated with extents which can be used with
future DAX device labels to group which extents are used.
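The redefined available size amounts to: for each extent, the extent's
size minus its current allocations, summed over all extents.  A minimal
userspace sketch of that accounting (structure names are illustrative,
not the driver's):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for an extent resource and its child allocations. */
struct child {
	unsigned long long size;
	struct child *next;
};

struct extent_model {
	unsigned long long size;	/* total extent size */
	struct child *children;		/* current allocations */
	struct extent_model *next;	/* next extent in the region */
};

/* Space left in one extent: its size minus all current allocations. */
static unsigned long long extent_avail(const struct extent_model *e)
{
	unsigned long long avail = e->size;
	const struct child *c;

	for (c = e->children; c; c = c->next)
		avail -= c->size;
	return avail;
}

/* Dynamic-region available size: sum of per-extent free space. */
static unsigned long long region_avail(const struct extent_model *list)
{
	unsigned long long avail = 0;
	const struct extent_model *e;

	for (e = list; e; e = e->next)
		avail += extent_avail(e);
	return avail;
}
```

This is why region space outside any extent never counts as available:
only extent-backed ranges contribute to the sum.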

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/dax/bus.c         | 157 +++++++++++++++++++++++++++++++++++++++-------
 drivers/dax/cxl.c         |  44 +++++++++++++
 drivers/dax/dax-private.h |   5 ++
 3 files changed, 182 insertions(+), 24 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index ea7ae82b4687..a9ea6a706702 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -280,6 +280,36 @@ static ssize_t region_align_show(struct device *dev,
 static struct device_attribute dev_attr_region_align =
 		__ATTR(align, 0400, region_align_show, NULL);
 
+#define for_each_extent_resource(extent, res) \
+	for (res = (extent)->child; res; res = res->sibling)
+
+static unsigned long long
+dr_extent_avail_size(struct dax_region_extent *dr_extent)
+{
+	unsigned long long rc;
+	struct resource *res;
+
+	rc = resource_size(dr_extent->res);
+	for_each_extent_resource(dr_extent->res, res)
+		rc -= resource_size(res);
+	return rc;
+}
+
+static int dax_region_add_dynamic_size(struct device *dev, void *data)
+{
+	unsigned long long *size = data, ext_size;
+	struct dax_reg_ext_dev *dr_reg_ext_dev;
+
+	if (!is_dr_ext_dev(dev))
+		return 0;
+
+	dr_reg_ext_dev = to_dr_ext_dev(dev);
+	ext_size = dr_extent_avail_size(dr_reg_ext_dev->dr_extent);
+	dev_dbg(dev, "size %llx\n", ext_size);
+	*size += ext_size;
+	return 0;
+}
+
 #define for_each_dax_region_resource(dax_region, res) \
 	for (res = (dax_region)->res.child; res; res = res->sibling)
 
@@ -290,8 +320,12 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
 
 	device_lock_assert(dax_region->dev);
 
-	if (is_dynamic(dax_region))
-		return 0;
+	if (is_dynamic(dax_region)) {
+		size = 0;
+		device_for_each_child(dax_region->dev, &size,
+				      dax_region_add_dynamic_size);
+		return size;
+	}
 
 	for_each_dax_region_resource(dax_region, res)
 		size -= resource_size(res);
@@ -421,15 +455,24 @@ EXPORT_SYMBOL_GPL(kill_dev_dax);
 static void trim_dev_dax_range(struct dev_dax *dev_dax)
 {
 	int i = dev_dax->nr_range - 1;
-	struct range *range = &dev_dax->ranges[i].range;
+	struct dev_dax_range *dev_range = &dev_dax->ranges[i];
+	struct range *range = &dev_range->range;
 	struct dax_region *dax_region = dev_dax->region;
+	struct resource *res = &dax_region->res;
 
 	device_lock_assert(dax_region->dev);
 	dev_dbg(&dev_dax->dev, "delete range[%d]: %#llx:%#llx\n", i,
 		(unsigned long long)range->start,
 		(unsigned long long)range->end);
 
-	__release_region(&dax_region->res, range->start, range_len(range));
+	if (dev_range->dr_extent)
+		res = dev_range->dr_extent->res;
+
+	__release_region(res, range->start, range_len(range));
+
+	if (dev_range->dr_extent)
+		dr_extent_put(dev_range->dr_extent);
+
 	if (--dev_dax->nr_range == 0) {
 		kfree(dev_dax->ranges);
 		dev_dax->ranges = NULL;
@@ -818,7 +861,8 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
 }
 
 static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
-			       u64 start, resource_size_t size)
+			       u64 start, resource_size_t size,
+			       struct dax_region_extent *dr_extent)
 {
 	struct dax_region *dax_region = dev_dax->region;
 	struct device *dev = &dev_dax->dev;
@@ -852,12 +896,15 @@ static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
 	for (i = 0; i < dev_dax->nr_range; i++)
 		pgoff += PHYS_PFN(range_len(&ranges[i].range));
 	dev_dax->ranges = ranges;
+	if (dr_extent)
+		dr_extent_get(dr_extent);
 	ranges[dev_dax->nr_range++] = (struct dev_dax_range) {
 		.pgoff = pgoff,
 		.range = {
 			.start = alloc->start,
 			.end = alloc->end,
 		},
+		.dr_extent = dr_extent,
 	};
 
 	dev_dbg(dev, "alloc range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
@@ -938,7 +985,8 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
 	int i;
 
 	for (i = dev_dax->nr_range - 1; i >= 0; i--) {
-		struct range *range = &dev_dax->ranges[i].range;
+		struct dev_dax_range *dev_range = &dev_dax->ranges[i];
+		struct range *range = &dev_range->range;
 		struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
 		struct resource *adjust = NULL, *res;
 		resource_size_t shrink;
@@ -954,12 +1002,16 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
 			continue;
 		}
 
-		for_each_dax_region_resource(dax_region, res)
-			if (strcmp(res->name, dev_name(dev)) == 0
-					&& res->start == range->start) {
-				adjust = res;
-				break;
-			}
+		if (dev_range->dr_extent) {
+			adjust = dev_range->dr_extent->res;
+		} else {
+			for_each_dax_region_resource(dax_region, res)
+				if (strcmp(res->name, dev_name(dev)) == 0
+						&& res->start == range->start) {
+					adjust = res;
+					break;
+				}
+		}
 
 		if (dev_WARN_ONCE(dev, !adjust || i != dev_dax->nr_range - 1,
 					"failed to find matching resource\n"))
@@ -973,12 +1025,15 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
 /*
  * Only allow adjustments that preserve the relative pgoff of existing
  * allocations. I.e. the dev_dax->ranges array is ordered by increasing pgoff.
+ * Disallow adjustments on dynamic regions as they can come from all over.
  */
 static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
 {
 	struct dev_dax_range *last;
 	int i;
 
+	if (is_dynamic(dev_dax->region))
+		return false;
 	if (dev_dax->nr_range == 0)
 		return false;
 	if (strcmp(res->name, dev_name(&dev_dax->dev)) != 0)
@@ -997,19 +1052,21 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
 }
 
 /*
- * dev_dax_resize_static - Expand the device into the unused portion of the
- * region. This may involve adjusting the end of an existing resource, or
- * allocating a new resource.
+ * __dev_dax_resize - Expand the device into the unused portion of the region.
+ * This may involve adjusting the end of an existing resource, or allocating a
+ * new resource.
  *
  * @parent: parent resource to allocate this range in.
  * @dev_dax: DAX device we are creating this range for
  * @to_alloc: amount of space to alloc; must be <= space available in @parent
+ * @dr_extent: if dynamic; the extent containing parent
  *
  * Return the amount of space allocated or -ERRNO on failure
  */
-static ssize_t dev_dax_resize_static(struct resource *parent,
-				     struct dev_dax *dev_dax,
-				     resource_size_t to_alloc)
+static ssize_t __dev_dax_resize(struct resource *parent,
+				struct dev_dax *dev_dax,
+				resource_size_t to_alloc,
+				struct dax_region_extent *dr_extent)
 {
 	struct resource *res, *first;
 	int rc;
@@ -1017,7 +1074,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
 	first = parent->child;
 	if (!first) {
 		rc = alloc_dev_dax_range(parent, dev_dax,
-					   parent->start, to_alloc);
+					   parent->start, to_alloc,
+					   dr_extent);
 		if (rc)
 			return rc;
 		return to_alloc;
@@ -1031,7 +1089,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
 		if (res == first && res->start > parent->start) {
 			alloc = min(res->start - parent->start, to_alloc);
 			rc = alloc_dev_dax_range(parent, dev_dax,
-						 parent->start, alloc);
+						 parent->start, alloc,
+						 dr_extent);
 			if (rc)
 				return rc;
 			return alloc;
@@ -1055,7 +1114,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
 				return rc;
 			return alloc;
 		}
-		rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc);
+		rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc,
+					 dr_extent);
 		if (rc)
 			return rc;
 		return alloc;
@@ -1066,6 +1126,47 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
 	return 0;
 }
 
+static ssize_t dev_dax_resize_static(struct dax_region *dax_region,
+				     struct dev_dax *dev_dax,
+				     resource_size_t to_alloc)
+{
+	return __dev_dax_resize(&dax_region->res, dev_dax, to_alloc, NULL);
+}
+
+static int dax_region_find_space(struct device *dev, void *data)
+{
+	struct dax_reg_ext_dev *dr_reg_ext_dev;
+
+	if (!is_dr_ext_dev(dev))
+		return 0;
+
+	dr_reg_ext_dev = to_dr_ext_dev(dev);
+	return dr_extent_avail_size(dr_reg_ext_dev->dr_extent);
+}
+
+static ssize_t dev_dax_resize_dynamic(struct dax_region *dax_region,
+				      struct dev_dax *dev_dax,
+				      resource_size_t to_alloc)
+{
+	struct dax_reg_ext_dev *dr_reg_ext_dev;
+	struct dax_region_extent *dr_extent;
+	resource_size_t alloc;
+	resource_size_t extent_max;
+	struct device *dev;
+
+	dev = device_find_child(dax_region->dev, NULL, dax_region_find_space);
+	if (dev_WARN_ONCE(dax_region->dev, !dev, "Space should be available!"))
+		return -ENOSPC;
+	dr_reg_ext_dev = to_dr_ext_dev(dev);
+	dr_extent = dr_reg_ext_dev->dr_extent;
+	extent_max = dr_extent_avail_size(dr_extent);
+	to_alloc = min(extent_max, to_alloc);
+	alloc = __dev_dax_resize(dr_extent->res, dev_dax, to_alloc, dr_extent);
+	put_device(dev);
+
+	return alloc;
+}
+
 static ssize_t dev_dax_resize(struct dax_region *dax_region,
 		struct dev_dax *dev_dax, resource_size_t size)
 {
@@ -1089,7 +1190,10 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
 		return -ENXIO;
 
 retry:
-	alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
+	if (is_dynamic(dax_region))
+		alloc = dev_dax_resize_dynamic(dax_region, dev_dax, to_alloc);
+	else
+		alloc = dev_dax_resize_static(dax_region, dev_dax, to_alloc);
 	if (alloc <= 0)
 		return alloc;
 	to_alloc -= alloc;
@@ -1168,6 +1272,9 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
 	struct range r;
 	ssize_t rc;
 
+	if (is_dynamic(dax_region))
+		return -EINVAL;
+
 	rc = range_parse(buf, len, &r);
 	if (rc)
 		return rc;
@@ -1183,7 +1290,7 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
 	to_alloc = range_len(&r);
 	if (alloc_is_aligned(dev_dax, to_alloc))
 		rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
-					 to_alloc);
+					 to_alloc, NULL);
 	device_unlock(dev);
 	device_unlock(dax_region->dev);
 
@@ -1400,8 +1507,10 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 	device_initialize(dev);
 	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
 
+	dev_WARN_ONCE(parent, is_dynamic(dax_region) && data->size,
+		      "Dynamic DAX devices are created initially with 0 size");
 	rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
-				 data->size);
+				 data->size, NULL);
 	if (rc)
 		goto err_range;
 
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 44cbd28668f1..6394a3531e25 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -12,6 +12,17 @@ static void dax_reg_ext_get(struct dax_region_extent *dr_extent)
 	kref_get(&dr_extent->ref);
 }
 
+
+static void dax_region_rm_resource(struct dax_region_extent *dr_extent)
+{
+	struct dax_region *dax_region = dr_extent->region;
+	struct resource *res = dr_extent->res;
+
+	dev_dbg(dax_region->dev, "Extent release resource %pR\n",
+		dr_extent->res);
+	__release_region(&dax_region->res, res->start, resource_size(res));
+}
+
 static void dr_release(struct kref *kref)
 {
 	struct dax_region_extent *dr_extent;
@@ -19,6 +30,7 @@ static void dr_release(struct kref *kref)
 
 	dr_extent = container_of(kref, struct dax_region_extent, ref);
 	cxl_dr_ext = dr_extent->private_data;
+	dax_region_rm_resource(dr_extent);
 	cxl_dr_extent_put(cxl_dr_ext);
 	kfree(dr_extent);
 }
@@ -28,6 +40,29 @@ static void dax_reg_ext_put(struct dax_region_extent *dr_extent)
 	kref_put(&dr_extent->ref, dr_release);
 }
 
+static int dax_region_add_resource(struct dax_region *dax_region,
+				   struct dax_region_extent *dr_extent,
+				   resource_size_t offset,
+				   resource_size_t length)
+{
+	resource_size_t start = dax_region->res.start + offset;
+	struct resource *ext_res;
+
+	dev_dbg(dax_region->dev, "DAX region resource %pR\n", &dax_region->res);
+	ext_res = __request_region(&dax_region->res, start, length, "extent", 0);
+	if (!ext_res) {
+		dev_err(dax_region->dev, "Failed to add extent s:%llx l:%llx\n",
+			start, length);
+		return -ENOSPC;
+	}
+
+	dr_extent->region = dax_region;
+	dr_extent->res = ext_res;
+	dev_dbg(dax_region->dev, "Extent add resource %pR\n", ext_res);
+
+	return 0;
+}
+
 static int cxl_dax_region_create_extent(struct dax_region *dax_region,
 					struct cxl_dr_extent *cxl_dr_ext)
 {
@@ -45,11 +80,20 @@ static int cxl_dax_region_create_extent(struct dax_region *dax_region,
 	/* device manages the dr_extent on success */
 	kref_init(&dr_extent->ref);
 
+	rc = dax_region_add_resource(dax_region, dr_extent,
+				     cxl_dr_ext->hpa_offset,
+				     cxl_dr_ext->hpa_length);
+	if (rc) {
+		kfree(dr_extent);
+		return rc;
+	}
+
 	rc = dax_region_ext_create_dev(dax_region, dr_extent,
 				       cxl_dr_ext->hpa_offset,
 				       cxl_dr_ext->hpa_length,
 				       cxl_dr_ext->label);
 	if (rc) {
+		dax_region_rm_resource(dr_extent);
 		kfree(dr_extent);
 		return rc;
 	}
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 250babd6e470..ad73b53aa802 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -44,12 +44,16 @@ struct dax_region {
 /*
  * struct dax_region_extent - extent data defined by the low level region
  * driver.
+ * @region: cache of dax_region
+ * @res: cache of resource tree for this extent
  * @private_data: lower level region driver data
  * @ref: track number of dax devices which are using this extent
  * @get: get reference to low level data
  * @put: put reference to low level data
  */
 struct dax_region_extent {
+	struct dax_region *region;
+	struct resource *res;
 	void *private_data;
 	struct kref ref;
 	void (*get)(struct dax_region_extent *dr_extent);
@@ -131,6 +135,7 @@ struct dev_dax {
 		unsigned long pgoff;
 		struct range range;
 		struct dax_mapping *mapping;
+		struct dax_region_extent *dr_extent;
 	} *ranges;
 };
 

-- 
2.41.0


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH RFC v2 15/18] cxl/mem: Trace Dynamic capacity Event Record
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (13 preceding siblings ...)
  2023-08-29  5:21 ` [PATCH RFC v2 14/18] dax/region: Support DAX device creation on dynamic DAX regions Ira Weiny
@ 2023-08-29  5:21 ` ira.weiny
  2023-08-29 16:46   ` Jonathan Cameron
  2023-08-29  5:21 ` [PATCH RFC v2 16/18] tools/testing/cxl: Make event logs dynamic Ira Weiny
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 97+ messages in thread
From: ira.weiny @ 2023-08-29  5:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

From: Navneet Singh <navneet.singh@intel.com>

CXL rev 3.0 section 8.2.9.2.1.5 defines the Dynamic Capacity Event Record.
Determine if an event read from the device is a Dynamic Capacity event
record and, if so, trace the record for debug purposes.

Add DC trace points to the trace log.
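For reference, the symbolic decode performed by show_dc_evt_type() in the
trace event below can be expressed as a plain C helper.  The event-type
values and strings are taken directly from the #defines in this patch; the
helper name dc_evt_type_name() is illustrative only, not part of the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Event types from the Dynamic Capacity Event Record, CXL 3.0 Table 8-47 */
#define CXL_DC_ADD_CAPACITY		0x00
#define CXL_DC_REL_CAPACITY		0x01
#define CXL_DC_FORCED_REL_CAPACITY	0x02
#define CXL_DC_REG_CONF_UPDATED		0x03

/* Plain-C equivalent of the trace event's show_dc_evt_type() decode */
static const char *dc_evt_type_name(uint8_t type)
{
	switch (type) {
	case CXL_DC_ADD_CAPACITY:
		return "Add capacity";
	case CXL_DC_REL_CAPACITY:
		return "Release capacity";
	case CXL_DC_FORCED_REL_CAPACITY:
		return "Forced capacity release";
	case CXL_DC_REG_CONF_UPDATED:
		return "Region Configuration Updated";
	default:
		return "Unknown";
	}
}
```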

Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
[iweiny: fixups]
---
 drivers/cxl/core/mbox.c  |  5 ++++
 drivers/cxl/core/trace.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 9d9c13e13ecf..9462c34aa1dc 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -939,6 +939,11 @@ static void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 				(struct cxl_event_mem_module *)record;
 
 		trace_cxl_memory_module(cxlmd, type, rec);
+	} else if (uuid_equal(id, &dc_event_uuid)) {
+		struct dcd_event_dyn_cap *rec =
+				(struct dcd_event_dyn_cap *)record;
+
+		trace_cxl_dynamic_capacity(cxlmd, type, rec);
 	} else {
 		/* For unknown record types print just the header */
 		trace_cxl_generic_event(cxlmd, type, record);
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index a0b5819bc70b..1899c5cc96b9 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -703,6 +703,71 @@ TRACE_EVENT(cxl_poison,
 	)
 );
 
+/*
+ * DYNAMIC CAPACITY Event Record - DER
+ *
+ * CXL rev 3.0 section 8.2.9.2.1.5 Table 8-47
+ */
+
+#define CXL_DC_ADD_CAPACITY			0x00
+#define CXL_DC_REL_CAPACITY			0x01
+#define CXL_DC_FORCED_REL_CAPACITY		0x02
+#define CXL_DC_REG_CONF_UPDATED			0x03
+#define show_dc_evt_type(type)	__print_symbolic(type,		\
+	{ CXL_DC_ADD_CAPACITY,	"Add capacity"},		\
+	{ CXL_DC_REL_CAPACITY,	"Release capacity"},		\
+	{ CXL_DC_FORCED_REL_CAPACITY,	"Forced capacity release"},	\
+	{ CXL_DC_REG_CONF_UPDATED,	"Region Configuration Updated"	} \
+)
+
+TRACE_EVENT(cxl_dynamic_capacity,
+
+	TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
+		 struct dcd_event_dyn_cap  *rec),
+
+	TP_ARGS(cxlmd, log, rec),
+
+	TP_STRUCT__entry(
+		CXL_EVT_TP_entry
+
+		/* Dynamic capacity Event */
+		__field(u8, event_type)
+		__field(u16, hostid)
+		__field(u8, region_id)
+		__field(u64, dpa_start)
+		__field(u64, length)
+		__array(u8, tag, CXL_DC_EXTENT_TAG_LEN)
+		__field(u16, sh_extent_seq)
+	),
+
+	TP_fast_assign(
+		CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
+
+		/* Dynamic_capacity Event */
+		__entry->event_type = rec->data.event_type;
+
+		/* DCD event record data */
+		__entry->hostid = le16_to_cpu(rec->data.host_id);
+		__entry->region_id = rec->data.region_index;
+		__entry->dpa_start = le64_to_cpu(rec->data.extent.start_dpa);
+		__entry->length = le64_to_cpu(rec->data.extent.length);
+		memcpy(__entry->tag, &rec->data.extent.tag, CXL_DC_EXTENT_TAG_LEN);
+		__entry->sh_extent_seq = le16_to_cpu(rec->data.extent.shared_extn_seq);
+	),
+
+	CXL_EVT_TP_printk("event_type='%s' host_id='%d' region_id='%d' " \
+		"starting_dpa=%llx length=%llx tag=%s " \
+		"shared_extent_sequence=%d",
+		show_dc_evt_type(__entry->event_type),
+		__entry->hostid,
+		__entry->region_id,
+		__entry->dpa_start,
+		__entry->length,
+		__print_hex(__entry->tag, CXL_DC_EXTENT_TAG_LEN),
+		__entry->sh_extent_seq
+	)
+);
+
 #endif /* _CXL_EVENTS_H */
 
 #define TRACE_INCLUDE_FILE trace

-- 
2.41.0


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH RFC v2 16/18] tools/testing/cxl: Make event logs dynamic
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (14 preceding siblings ...)
  2023-08-29  5:21 ` [PATCH RFC v2 15/18] cxl/mem: Trace Dynamic capacity Event Record ira.weiny
@ 2023-08-29  5:21 ` Ira Weiny
  2023-08-30 12:11   ` Jonathan Cameron
  2023-08-29  5:21 ` [PATCH RFC v2 17/18] tools/testing/cxl: Add DC Regions to mock mem data Ira Weiny
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 97+ messages in thread
From: Ira Weiny @ 2023-08-29  5:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

The test event logs were created as static arrays as an easy way to mock
events.  Dynamic Capacity Device (DCD) test support requires events be
created dynamically when extents are created/destroyed.

Modify the event log storage to be dynamically allocated.  Thus they can
accommodate the dynamic events required by DCD.  Reuse the static event
data to create the dynamic events in the new logs without inventing
complex event injection through the test sysfs.  Simplify the processing
of the logs by using the event log array index as the handle.  Add a lock
to manage the concurrency that comes with DCD extent testing.
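The 1-based circular handle arithmetic used by the new logs can be
sketched in isolation; this mirrors event_inc_handle() from the patch
(handle 0 is reserved as invalid by the CXL spec, so the wrap skips it):

```c
#include <assert.h>

#define CXL_TEST_EVENT_CNT_MAX 17

/* Advance a 1-based handle, wrapping around the log and skipping 0,
 * since a CXL event record handle of 0 is invalid. */
static void event_inc_handle(unsigned short *handle)
{
	*handle = (*handle + 1) % CXL_TEST_EVENT_CNT_MAX;
	if (!*handle)
		*handle = *handle + 1;
}
```

With CXL_TEST_EVENT_CNT_MAX of 17, valid handles cycle through 1..16 and
then back to 1, which is why the events[] array carries one extra slot.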

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 tools/testing/cxl/test/mem.c | 276 ++++++++++++++++++++++++++-----------------
 1 file changed, 170 insertions(+), 106 deletions(-)

diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
index 51be202fabd0..6a036c8d215d 100644
--- a/tools/testing/cxl/test/mem.c
+++ b/tools/testing/cxl/test/mem.c
@@ -118,18 +118,27 @@ static struct {
 
 #define PASS_TRY_LIMIT 3
 
-#define CXL_TEST_EVENT_CNT_MAX 15
+#define CXL_TEST_EVENT_CNT_MAX 17
 
 /* Set a number of events to return at a time for simulation.  */
 #define CXL_TEST_EVENT_CNT 3
 
+/*
+ * @next_handle: next handle (index) to be stored to
+ * @cur_handle: current handle (index) to be returned to the user on get_event
+ * @nr_events: total events in this log
+ * @nr_overflow: number of events added past the log size
+ * @lock: protect these state variables
+ * @events: array of pending events to be returned.
+ */
 struct mock_event_log {
-	u16 clear_idx;
-	u16 cur_idx;
+	u16 next_handle;
+	u16 cur_handle;
 	u16 nr_events;
 	u16 nr_overflow;
-	u16 overflow_reset;
-	struct cxl_event_record_raw *events[CXL_TEST_EVENT_CNT_MAX];
+	rwlock_t lock;
+	/* 1 extra slot to accommodate that handles can't be 0 */
+	struct cxl_event_record_raw *events[CXL_TEST_EVENT_CNT_MAX + 1];
 };
 
 struct mock_event_store {
@@ -163,64 +172,76 @@ static struct mock_event_log *event_find_log(struct device *dev, int log_type)
 	return &mdata->mes.mock_logs[log_type];
 }
 
-static struct cxl_event_record_raw *event_get_current(struct mock_event_log *log)
-{
-	return log->events[log->cur_idx];
-}
-
-static void event_reset_log(struct mock_event_log *log)
-{
-	log->cur_idx = 0;
-	log->clear_idx = 0;
-	log->nr_overflow = log->overflow_reset;
-}
-
-/* Handle can never be 0 use 1 based indexing for handle */
-static u16 event_get_clear_handle(struct mock_event_log *log)
-{
-	return log->clear_idx + 1;
-}
-
 /* Handle can never be 0 use 1 based indexing for handle */
-static __le16 event_get_cur_event_handle(struct mock_event_log *log)
+static void event_inc_handle(u16 *handle)
 {
-	u16 cur_handle = log->cur_idx + 1;
-
-	return cpu_to_le16(cur_handle);
-}
-
-static bool event_log_empty(struct mock_event_log *log)
-{
-	return log->cur_idx == log->nr_events;
+	*handle = (*handle + 1) % CXL_TEST_EVENT_CNT_MAX;
+	if (!*handle)
+		*handle = *handle + 1;
 }
 
+/* Add the event or free it on 'overflow' */
 static void mes_add_event(struct mock_event_store *mes,
 			  enum cxl_event_log_type log_type,
 			  struct cxl_event_record_raw *event)
 {
+	struct device *dev = mes->mds->cxlds.dev;
 	struct mock_event_log *log;
+	u16 handle;
 
 	if (WARN_ON(log_type >= CXL_EVENT_TYPE_MAX))
 		return;
 
 	log = &mes->mock_logs[log_type];
 
-	if ((log->nr_events + 1) > CXL_TEST_EVENT_CNT_MAX) {
+	write_lock(&log->lock);
+
+	handle = log->next_handle;
+	if ((handle + 1) == log->cur_handle) {
 		log->nr_overflow++;
-		log->overflow_reset = log->nr_overflow;
-		return;
+		dev_dbg(dev, "Overflowing %d\n", log_type);
+		devm_kfree(dev, event);
+		goto unlock;
 	}
 
-	log->events[log->nr_events] = event;
+	dev_dbg(dev, "Log %d; handle %u\n", log_type, handle);
+	event->hdr.handle = cpu_to_le16(handle);
+	log->events[handle] = event;
+	event_inc_handle(&log->next_handle);
 	log->nr_events++;
+
+unlock:
+	write_unlock(&log->lock);
+}
+
+static void mes_del_event(struct device *dev,
+			  struct mock_event_log *log,
+			  u16 handle)
+{
+	struct cxl_event_record_raw *cur;
+
+	lockdep_assert(lockdep_is_held(&log->lock));
+
+	dev_dbg(dev, "Clearing event %u; cur %u\n", handle, log->cur_handle);
+	cur = log->events[handle];
+	if (!cur) {
+		dev_err(dev, "Mock event index %u empty? nr_events %u",
+			handle, log->nr_events);
+		return;
+	}
+	log->events[handle] = NULL;
+
+	event_inc_handle(&log->cur_handle);
+	log->nr_events--;
+	devm_kfree(dev, cur);
 }
 
 static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
 {
 	struct cxl_get_event_payload *pl;
 	struct mock_event_log *log;
-	u16 nr_overflow;
 	u8 log_type;
+	u16 handle;
 	int i;
 
 	if (cmd->size_in != sizeof(log_type))
@@ -233,30 +254,38 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
 	if (log_type >= CXL_EVENT_TYPE_MAX)
 		return -EINVAL;
 
-	memset(cmd->payload_out, 0, cmd->size_out);
-
 	log = event_find_log(dev, log_type);
-	if (!log || event_log_empty(log))
+	if (!log)
 		return 0;
 
+	memset(cmd->payload_out, 0, cmd->size_out);
 	pl = cmd->payload_out;
 
-	for (i = 0; i < CXL_TEST_EVENT_CNT && !event_log_empty(log); i++) {
-		memcpy(&pl->records[i], event_get_current(log),
-		       sizeof(pl->records[i]));
-		pl->records[i].hdr.handle = event_get_cur_event_handle(log);
-		log->cur_idx++;
+	read_lock(&log->lock);
+
+	handle = log->cur_handle;
+	dev_dbg(dev, "Get log %d handle %u next %u\n",
+		log_type, handle, log->next_handle);
+	for (i = 0;
+	     i < CXL_TEST_EVENT_CNT && handle != log->next_handle;
+	     i++, event_inc_handle(&handle)) {
+		struct cxl_event_record_raw *cur;
+
+		cur = log->events[handle];
+		dev_dbg(dev, "Sending event log %d handle %d idx %u\n",
+			log_type, le16_to_cpu(cur->hdr.handle), handle);
+		memcpy(&pl->records[i], cur, sizeof(pl->records[i]));
 	}
 
 	pl->record_count = cpu_to_le16(i);
-	if (!event_log_empty(log))
+	if (log->nr_events > i)
 		pl->flags |= CXL_GET_EVENT_FLAG_MORE_RECORDS;
 
 	if (log->nr_overflow) {
 		u64 ns;
 
 		pl->flags |= CXL_GET_EVENT_FLAG_OVERFLOW;
-		pl->overflow_err_count = cpu_to_le16(nr_overflow);
+		pl->overflow_err_count = cpu_to_le16(log->nr_overflow);
 		ns = ktime_get_real_ns();
 		ns -= 5000000000; /* 5s ago */
 		pl->first_overflow_timestamp = cpu_to_le64(ns);
@@ -265,16 +294,17 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
 		pl->last_overflow_timestamp = cpu_to_le64(ns);
 	}
 
+	read_unlock(&log->lock);
 	return 0;
 }
 
 static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
 {
 	struct cxl_mbox_clear_event_payload *pl = cmd->payload_in;
-	struct mock_event_log *log;
 	u8 log_type = pl->event_log;
+	struct mock_event_log *log;
+	int nr, rc = 0;
 	u16 handle;
-	int nr;
 
 	if (log_type >= CXL_EVENT_TYPE_MAX)
 		return -EINVAL;
@@ -283,24 +313,23 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
 	if (!log)
 		return 0; /* No mock data in this log */
 
-	/*
-	 * This check is technically not invalid per the specification AFAICS.
-	 * (The host could 'guess' handles and clear them in order).
-	 * However, this is not good behavior for the host so test it.
-	 */
-	if (log->clear_idx + pl->nr_recs > log->cur_idx) {
-		dev_err(dev,
-			"Attempting to clear more events than returned!\n");
-		return -EINVAL;
-	}
+	write_lock(&log->lock);
 
 	/* Check handle order prior to clearing events */
-	for (nr = 0, handle = event_get_clear_handle(log);
-	     nr < pl->nr_recs;
-	     nr++, handle++) {
+	handle = log->cur_handle;
+	for (nr = 0;
+	     nr < pl->nr_recs && handle != log->next_handle;
+	     nr++, event_inc_handle(&handle)) {
+
+		dev_dbg(dev, "Checking clear of %d handle %u plhandle %u\n",
+			log_type, handle,
+			le16_to_cpu(pl->handles[nr]));
+
 		if (handle != le16_to_cpu(pl->handles[nr])) {
-			dev_err(dev, "Clearing events out of order\n");
-			return -EINVAL;
+			dev_err(dev, "Clearing events out of order %u %u\n",
+				handle, le16_to_cpu(pl->handles[nr]));
+			rc = -EINVAL;
+			goto unlock;
 		}
 	}
 
@@ -308,25 +337,12 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
 		log->nr_overflow = 0;
 
 	/* Clear events */
-	log->clear_idx += pl->nr_recs;
-	return 0;
-}
+	for (nr = 0; nr < pl->nr_recs; nr++)
+		mes_del_event(dev, log, le16_to_cpu(pl->handles[nr]));
 
-static void cxl_mock_event_trigger(struct device *dev)
-{
-	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
-	struct mock_event_store *mes = &mdata->mes;
-	int i;
-
-	for (i = CXL_EVENT_TYPE_INFO; i < CXL_EVENT_TYPE_MAX; i++) {
-		struct mock_event_log *log;
-
-		log = event_find_log(dev, i);
-		if (log)
-			event_reset_log(log);
-	}
-
-	cxl_mem_get_event_records(mes->mds, mes->ev_status);
+unlock:
+	write_unlock(&log->lock);
+	return rc;
 }
 
 struct cxl_event_record_raw maint_needed = {
@@ -429,8 +445,29 @@ static int mock_set_timestamp(struct cxl_dev_state *cxlds,
 	return 0;
 }
 
-static void cxl_mock_add_event_logs(struct mock_event_store *mes)
+/* Create a dynamically allocated event out of a statically defined event. */
+static void add_event_from_static(struct mock_event_store *mes,
+				  enum cxl_event_log_type log_type,
+				  struct cxl_event_record_raw *raw)
+{
+	struct device *dev = mes->mds->cxlds.dev;
+	struct cxl_event_record_raw *rec;
+
+	rec = devm_kzalloc(dev, sizeof(*rec), GFP_KERNEL);
+	if (!rec) {
+		dev_err(dev, "Failed to alloc event for log\n");
+		return;
+	}
+
+	memcpy(rec, raw, sizeof(*rec));
+	mes_add_event(mes, log_type, rec);
+}
+
+static void cxl_mock_add_event_logs(struct cxl_mockmem_data *mdata)
 {
+	struct mock_event_store *mes = &mdata->mes;
+	struct device *dev = mes->mds->cxlds.dev;
+
 	put_unaligned_le16(CXL_GMER_VALID_CHANNEL | CXL_GMER_VALID_RANK,
 			   &gen_media.validity_flags);
 
@@ -438,43 +475,60 @@ static void cxl_mock_add_event_logs(struct mock_event_store *mes)
 			   CXL_DER_VALID_BANK | CXL_DER_VALID_COLUMN,
 			   &dram.validity_flags);
 
-	mes_add_event(mes, CXL_EVENT_TYPE_INFO, &maint_needed);
-	mes_add_event(mes, CXL_EVENT_TYPE_INFO,
+	dev_dbg(dev, "Generating fake event logs %d\n",
+		CXL_EVENT_TYPE_INFO);
+	add_event_from_static(mes, CXL_EVENT_TYPE_INFO, &maint_needed);
+	add_event_from_static(mes, CXL_EVENT_TYPE_INFO,
 		      (struct cxl_event_record_raw *)&gen_media);
-	mes_add_event(mes, CXL_EVENT_TYPE_INFO,
+	add_event_from_static(mes, CXL_EVENT_TYPE_INFO,
 		      (struct cxl_event_record_raw *)&mem_module);
 	mes->ev_status |= CXLDEV_EVENT_STATUS_INFO;
 
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &maint_needed);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+	dev_dbg(dev, "Generating fake event logs %d\n",
+		CXL_EVENT_TYPE_FAIL);
+	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &maint_needed);
+	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL,
+		      (struct cxl_event_record_raw *)&mem_module);
+	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL,
 		      (struct cxl_event_record_raw *)&dram);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL,
 		      (struct cxl_event_record_raw *)&gen_media);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL,
 		      (struct cxl_event_record_raw *)&mem_module);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL,
 		      (struct cxl_event_record_raw *)&dram);
 	/* Overflow this log */
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
 	mes->ev_status |= CXLDEV_EVENT_STATUS_FAIL;
 
-	mes_add_event(mes, CXL_EVENT_TYPE_FATAL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FATAL,
+	dev_dbg(dev, "Generating fake event logs %d\n",
+		CXL_EVENT_TYPE_FATAL);
+	add_event_from_static(mes, CXL_EVENT_TYPE_FATAL, &hardware_replace);
+	add_event_from_static(mes, CXL_EVENT_TYPE_FATAL,
 		      (struct cxl_event_record_raw *)&dram);
 	mes->ev_status |= CXLDEV_EVENT_STATUS_FATAL;
 }
 
+static void cxl_mock_event_trigger(struct device *dev)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	struct mock_event_store *mes = &mdata->mes;
+
+	cxl_mock_add_event_logs(mdata);
+	cxl_mem_get_event_records(mes->mds, mes->ev_status);
+}
+
 static int mock_gsl(struct cxl_mbox_cmd *cmd)
 {
 	if (cmd->size_out < sizeof(mock_gsl_payload))
@@ -1391,6 +1445,14 @@ static ssize_t event_trigger_store(struct device *dev,
 }
 static DEVICE_ATTR_WO(event_trigger);
 
+static void init_event_log(struct mock_event_log *log)
+{
+	rwlock_init(&log->lock);
+	/* Handle can never be 0 use 1 based indexing for handle */
+	log->cur_handle = 1;
+	log->next_handle = 1;
+}
+
 static int __cxl_mock_mem_probe(struct platform_device *pdev)
 {
 	struct device *dev = &pdev->dev;
@@ -1458,7 +1520,9 @@ static int __cxl_mock_mem_probe(struct platform_device *pdev)
 		return rc;
 
 	mdata->mes.mds = mds;
-	cxl_mock_add_event_logs(&mdata->mes);
+	for (int i = 0; i < CXL_EVENT_TYPE_MAX; i++)
+		init_event_log(&mdata->mes.mock_logs[i]);
+	cxl_mock_add_event_logs(mdata);
 
 	cxlmd = devm_cxl_add_memdev(cxlds);
 	if (IS_ERR(cxlmd))

-- 
2.41.0


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH RFC v2 17/18] tools/testing/cxl: Add DC Regions to mock mem data
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (15 preceding siblings ...)
  2023-08-29  5:21 ` [PATCH RFC v2 16/18] tools/testing/cxl: Make event logs dynamic Ira Weiny
@ 2023-08-29  5:21 ` Ira Weiny
  2023-08-30 12:20   ` Jonathan Cameron
  2023-08-31 23:19   ` Dave Jiang
  2023-08-29  5:21 ` [PATCH RFC v2 18/18] tools/testing/cxl: Add Dynamic Capacity events Ira Weiny
  2023-09-07 21:01 ` [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Fan Ni
  18 siblings, 2 replies; 97+ messages in thread
From: Ira Weiny @ 2023-08-29  5:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

To test DC regions, the mock memory device needs to store information
about the regions and manage fake extent data.

Define mock_dc_region information within the mock memory data.  Add
sysfs entries on the mock device to inject and delete extents.

The inject format is <start>:<length>:<tag>
The delete format is <start>
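A userspace-style sketch of parsing the inject format is below; the
parse_extent() helper and its fixed-size tag buffer are illustrative
assumptions, not the mock driver's actual sysfs store implementation:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative parser for the "<start>:<length>:<tag>" inject format.
 * sscanf()'s %llx accepts hex values with or without a "0x" prefix.
 * Requires at least <start> and <length>; the tag may be empty. */
static int parse_extent(const char *buf, unsigned long long *start,
			unsigned long long *length, char *tag, size_t tag_len)
{
	char t[64] = "";

	if (sscanf(buf, "%llx:%llx:%63s", start, length, t) < 2)
		return -1;
	strncpy(tag, t, tag_len - 1);
	tag[tag_len - 1] = '\0';
	return 0;
}
```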

Add DC mailbox commands to the CEL and implement those commands.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 tools/testing/cxl/test/mem.c | 449 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 449 insertions(+)

diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
index 6a036c8d215d..d6041a2145c5 100644
--- a/tools/testing/cxl/test/mem.c
+++ b/tools/testing/cxl/test/mem.c
@@ -18,6 +18,7 @@
 #define FW_SLOTS 3
 #define DEV_SIZE SZ_2G
 #define EFFECT(x) (1U << x)
+#define BASE_DYNAMIC_CAP_DPA DEV_SIZE
 
 #define MOCK_INJECT_DEV_MAX 8
 #define MOCK_INJECT_TEST_MAX 128
@@ -89,6 +90,22 @@ static struct cxl_cel_entry mock_cel[] = {
 		.effect = cpu_to_le16(EFFECT(CONF_CHANGE_COLD_RESET) |
 				      EFFECT(CONF_CHANGE_IMMEDIATE)),
 	},
+	{
+		.opcode = cpu_to_le16(CXL_MBOX_OP_GET_DC_CONFIG),
+		.effect = CXL_CMD_EFFECT_NONE,
+	},
+	{
+		.opcode = cpu_to_le16(CXL_MBOX_OP_GET_DC_EXTENT_LIST),
+		.effect = CXL_CMD_EFFECT_NONE,
+	},
+	{
+		.opcode = cpu_to_le16(CXL_MBOX_OP_ADD_DC_RESPONSE),
+		.effect = cpu_to_le16(EFFECT(CONF_CHANGE_IMMEDIATE)),
+	},
+	{
+		.opcode = cpu_to_le16(CXL_MBOX_OP_RELEASE_DC),
+		.effect = cpu_to_le16(EFFECT(CONF_CHANGE_IMMEDIATE)),
+	},
 };
 
 /* See CXL 2.0 Table 181 Get Health Info Output Payload */
@@ -147,6 +164,7 @@ struct mock_event_store {
 	u32 ev_status;
 };
 
+#define NUM_MOCK_DC_REGIONS 2
 struct cxl_mockmem_data {
 	void *lsa;
 	void *fw;
@@ -161,6 +179,10 @@ struct cxl_mockmem_data {
 	struct mock_event_store mes;
 	u8 event_buf[SZ_4K];
 	u64 timestamp;
+	struct cxl_dc_region_config dc_regions[NUM_MOCK_DC_REGIONS];
+	u32 dc_ext_generation;
+	struct xarray dc_extents;
+	struct xarray dc_accepted_exts;
 };
 
 static struct mock_event_log *event_find_log(struct device *dev, int log_type)
@@ -529,6 +551,98 @@ static void cxl_mock_event_trigger(struct device *dev)
 	cxl_mem_get_event_records(mes->mds, mes->ev_status);
 }
 
+static int devm_add_extent(struct device *dev, u64 start, u64 length,
+			   const char *tag)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	struct cxl_dc_extent_data *extent;
+
+	extent = devm_kzalloc(dev, sizeof(*extent), GFP_KERNEL);
+	if (!extent) {
+		dev_dbg(dev, "Failed to allocate extent\n");
+		return -ENOMEM;
+	}
+	extent->dpa_start = start;
+	extent->length = length;
+	memcpy(extent->tag, tag, min(sizeof(extent->tag), strlen(tag)));
+
+	if (xa_insert(&mdata->dc_extents, start, extent, GFP_KERNEL)) {
+		devm_kfree(dev, extent);
+		dev_err(dev, "Failed xarray insert %llx\n", start);
+		return -EINVAL;
+	}
+	mdata->dc_ext_generation++;
+
+	return 0;
+}
+
+static int dc_accept_extent(struct device *dev, u64 start)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+
+	dev_dbg(dev, "Accepting extent 0x%llx\n", start);
+	return xa_insert(&mdata->dc_accepted_exts, start, (void *)start,
+			 GFP_KERNEL);
+}
+
+static void release_dc_ext(void *md)
+{
+	struct cxl_mockmem_data *mdata = md;
+
+	xa_destroy(&mdata->dc_extents);
+	xa_destroy(&mdata->dc_accepted_exts);
+}
+
+static int cxl_mock_dc_region_setup(struct device *dev)
+{
+#define DUMMY_EXT_OFFSET SZ_256M
+#define DUMMY_EXT_LENGTH SZ_256M
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	u64 base_dpa = BASE_DYNAMIC_CAP_DPA;
+	u32 dsmad_handle = 0xFADE;
+	u64 decode_length = SZ_2G;
+	u64 block_size = SZ_512;
+	/* For testing make this smaller than decode length */
+	u64 length = SZ_1G;
+	int rc;
+
+	xa_init(&mdata->dc_extents);
+	xa_init(&mdata->dc_accepted_exts);
+
+	rc = devm_add_action_or_reset(dev, release_dc_ext, mdata);
+	if (rc)
+		return rc;
+
+	for (int i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
+		struct cxl_dc_region_config *conf = &mdata->dc_regions[i];
+
+		dev_dbg(dev, "Creating DC region DC%d DPA:%llx LEN:%llx\n",
+			i, base_dpa, length);
+
+		conf->region_base = cpu_to_le64(base_dpa);
+		conf->region_decode_length = cpu_to_le64(decode_length /
+						CXL_CAPACITY_MULTIPLIER);
+		conf->region_length = cpu_to_le64(length);
+		conf->region_block_size = cpu_to_le64(block_size);
+		conf->region_dsmad_handle = cpu_to_le32(dsmad_handle);
+		dsmad_handle++;
+
+		/* Pretend we have some previous accepted extents */
+		rc = devm_add_extent(dev, base_dpa + DUMMY_EXT_OFFSET,
+				     DUMMY_EXT_LENGTH, "CXL-TEST");
+		if (rc)
+			return rc;
+
+		rc = dc_accept_extent(dev, base_dpa + DUMMY_EXT_OFFSET);
+		if (rc)
+			return rc;
+
+		base_dpa += decode_length;
+	}
+
+	return 0;
+}
+
 static int mock_gsl(struct cxl_mbox_cmd *cmd)
 {
 	if (cmd->size_out < sizeof(mock_gsl_payload))
@@ -1315,6 +1429,148 @@ static int mock_activate_fw(struct cxl_mockmem_data *mdata,
 	return -EINVAL;
 }
 
+static int mock_get_dc_config(struct device *dev,
+			      struct cxl_mbox_cmd *cmd)
+{
+	struct cxl_mbox_get_dc_config *dc_config = cmd->payload_in;
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	u8 region_requested, region_start_idx, region_ret_cnt;
+	struct cxl_mbox_dynamic_capacity *resp;
+
+	region_requested = dc_config->region_count;
+	if (NUM_MOCK_DC_REGIONS < region_requested)
+		region_requested = NUM_MOCK_DC_REGIONS;
+
+	if (cmd->size_out < struct_size(resp, region, region_requested))
+		return -EINVAL;
+
+	memset(cmd->payload_out, 0, cmd->size_out);
+	resp = cmd->payload_out;
+
+	region_start_idx = dc_config->start_region_index;
+	region_ret_cnt = 0;
+	for (int i = region_start_idx; i < NUM_MOCK_DC_REGIONS; i++) {
+		if (region_ret_cnt >= region_requested)
+			break;
+		memcpy(&resp->region[region_ret_cnt],
+		       &mdata->dc_regions[i],
+		       sizeof(resp->region[region_ret_cnt]));
+		region_ret_cnt++;
+	}
+	resp->avail_region_count = region_ret_cnt;
+
+	dev_dbg(dev, "Returning %d dc regions\n", region_ret_cnt);
+	return 0;
+}
+
+static int mock_get_dc_extent_list(struct device *dev,
+				   struct cxl_mbox_cmd *cmd)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	struct cxl_mbox_get_dc_extent *get = cmd->payload_in;
+	struct cxl_mbox_dc_extents *resp = cmd->payload_out;
+	u32 total_avail = 0, total_ret = 0;
+	struct cxl_dc_extent_data *ext;
+	u32 ext_count, start_idx;
+	u32 ext_idx = 0;
+	unsigned long i;
+
+	ext_count = le32_to_cpu(get->extent_cnt);
+	start_idx = le32_to_cpu(get->start_extent_index);
+
+	memset(resp, 0, sizeof(*resp));
+
+	/*
+	 * Total available needs to be calculated and returned regardless of
+	 * how many can actually be returned.
+	 */
+	xa_for_each(&mdata->dc_extents, i, ext)
+		total_avail++;
+
+	if (start_idx > total_avail)
+		return -EINVAL;
+
+	xa_for_each(&mdata->dc_extents, i, ext) {
+		if (total_ret >= ext_count)
+			break;
+
+		/* Skip extents prior to the requested starting index */
+		if (ext_idx++ < start_idx)
+			continue;
+
+		resp->extent[total_ret].start_dpa = cpu_to_le64(ext->dpa_start);
+		resp->extent[total_ret].length = cpu_to_le64(ext->length);
+		memcpy(&resp->extent[total_ret].tag, ext->tag,
+		       sizeof(resp->extent[total_ret].tag));
+		resp->extent[total_ret].shared_extn_seq =
+				cpu_to_le16(ext->shared_extent_seq);
+		total_ret++;
+	}
+
+	resp->ret_extent_cnt = cpu_to_le32(total_ret);
+	resp->total_extent_cnt = cpu_to_le32(total_avail);
+	resp->extent_list_num = cpu_to_le32(mdata->dc_ext_generation);
+
+	dev_dbg(dev, "Returning %d extents of %d total\n",
+		total_ret, total_avail);
+
+	return 0;
+}
+
+static int mock_add_dc_response(struct device *dev,
+				struct cxl_mbox_cmd *cmd)
+{
+	struct cxl_mbox_dc_response *req = cmd->payload_in;
+	u32 list_size = le32_to_cpu(req->extent_list_size);
+
+	for (int i = 0; i < list_size; i++) {
+		u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
+		int rc;
+
+		dev_dbg(dev, "Extent 0x%llx accepted by HOST\n", start);
+		rc = dc_accept_extent(dev, start);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
+
+static int dc_delete_extent(struct device *dev, unsigned long long start)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	void *ext;
+
+	dev_dbg(dev, "Deleting extent at %llx\n", start);
+
+	ext = xa_erase(&mdata->dc_extents, start);
+	if (!ext) {
+		dev_err(dev, "No extent found at %llx\n", start);
+		return -EINVAL;
+	}
+	devm_kfree(dev, ext);
+	mdata->dc_ext_generation++;
+
+	return 0;
+}
+
+static int mock_dc_release(struct device *dev,
+			   struct cxl_mbox_cmd *cmd)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	struct cxl_mbox_dc_response *req = cmd->payload_in;
+	u32 list_size = le32_to_cpu(req->extent_list_size);
+
+	for (int i = 0; i < list_size; i++) {
+		u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
+
+		dev_dbg(dev, "Extent 0x%llx released by HOST\n", start);
+		xa_erase(&mdata->dc_accepted_exts, start);
+	}
+
+	return 0;
+}
+
 static int cxl_mock_mbox_send(struct cxl_memdev_state *mds,
 			      struct cxl_mbox_cmd *cmd)
 {
@@ -1399,6 +1655,18 @@ static int cxl_mock_mbox_send(struct cxl_memdev_state *mds,
 	case CXL_MBOX_OP_ACTIVATE_FW:
 		rc = mock_activate_fw(mdata, cmd);
 		break;
+	case CXL_MBOX_OP_GET_DC_CONFIG:
+		rc = mock_get_dc_config(dev, cmd);
+		break;
+	case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
+		rc = mock_get_dc_extent_list(dev, cmd);
+		break;
+	case CXL_MBOX_OP_ADD_DC_RESPONSE:
+		rc = mock_add_dc_response(dev, cmd);
+		break;
+	case CXL_MBOX_OP_RELEASE_DC:
+		rc = mock_dc_release(dev, cmd);
+		break;
 	default:
 		break;
 	}
@@ -1467,6 +1735,10 @@ static int __cxl_mock_mem_probe(struct platform_device *pdev)
 		return -ENOMEM;
 	dev_set_drvdata(dev, mdata);
 
+	rc = cxl_mock_dc_region_setup(dev);
+	if (rc)
+		return rc;
+
 	mdata->lsa = vmalloc(LSA_SIZE);
 	if (!mdata->lsa)
 		return -ENOMEM;
@@ -1515,6 +1787,10 @@ static int __cxl_mock_mem_probe(struct platform_device *pdev)
 	if (rc)
 		return rc;
 
+	rc = cxl_dev_dynamic_capacity_identify(mds);
+	if (rc)
+		return rc;
+
 	rc = cxl_mem_create_range_info(mds);
 	if (rc)
 		return rc;
@@ -1528,6 +1804,10 @@ static int __cxl_mock_mem_probe(struct platform_device *pdev)
 	if (IS_ERR(cxlmd))
 		return PTR_ERR(cxlmd);
 
+	rc = cxl_dev_get_dynamic_capacity_extents(mds);
+	if (rc)
+		return rc;
+
 	rc = cxl_memdev_setup_fw_upload(mds);
 	if (rc)
 		return rc;
@@ -1669,10 +1949,179 @@ static ssize_t fw_buf_checksum_show(struct device *dev,
 
 static DEVICE_ATTR_RO(fw_buf_checksum);
 
+/* Returns true if the proposed extent is valid */
+static bool new_extent_valid(struct device *dev, size_t new_start,
+			     size_t new_len)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	struct cxl_dc_extent_data *extent;
+	size_t new_end;
+	unsigned long i;
+
+	if (!new_len)
+		return false;
+
+	new_end = new_start + new_len;
+
+	dev_dbg(dev, "New extent %zx-%zx\n", new_start, new_end);
+
+	/* Overlap with any existing extent? */
+	xa_for_each(&mdata->dc_extents, i, extent) {
+		size_t ext_end = extent->dpa_start + extent->length;
+
+		if (new_start < ext_end && extent->dpa_start < new_end) {
+			dev_err(dev, "Extent overlap: %llx-%zx vs %zx-%zx\n",
+				extent->dpa_start, ext_end, new_start, new_end);
+			return false;
+		}
+	}
+
+	/* Ensure it is in a region and is valid for that region's block size */
+	for (int i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
+		struct cxl_dc_region_config *dc_region = &mdata->dc_regions[i];
+		size_t reg_start, reg_end;
+
+		reg_start = le64_to_cpu(dc_region->region_base);
+		reg_end = le64_to_cpu(dc_region->region_length);
+		reg_end += reg_start;
+
+		dev_dbg(dev, "Region %d: %zx-%zx\n", i, reg_start, reg_end);
+
+		if (new_start >= reg_start && new_end <= reg_end) {
+			u64 block_size = le64_to_cpu(dc_region->region_block_size);
+
+			if (new_start % block_size || new_len % block_size) {
+				dev_err(dev, "Extent not aligned to block size: start %zx; len %zx; block_size 0x%llx\n",
+					new_start, new_len, block_size);
+				return false;
+			}
+
+			dev_dbg(dev, "Extent in region %d\n", i);
+			return true;
+		}
+	}
+
+	return false;
+}
+
+/*
+ * Format <start>:<length>:<tag>
+ *
+ * start and length must each be a multiple of the configured region block
+ * size.  Tag can be any string up to 16 bytes.
+ *
+ * Extents must not overlap any existing extent.
+ */
+static ssize_t dc_inject_extent_store(struct device *dev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t count)
+{
+	char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
+	unsigned long long start, length;
+	char *len_str, *tag_str;
+	size_t buf_len = count;
+	int rc;
+
+	if (!start_str)
+		return -ENOMEM;
+
+	len_str = strnchr(start_str, buf_len, ':');
+	if (!len_str) {
+		dev_err(dev, "Extent failed to find len_str: %s\n", start_str);
+		return -EINVAL;
+	}
+
+	*len_str = '\0';
+	len_str += 1;
+	buf_len -= strlen(start_str);
+
+	tag_str = strnchr(len_str, buf_len, ':');
+	if (!tag_str) {
+		dev_err(dev, "Extent failed to find tag_str: %s\n", len_str);
+		return -EINVAL;
+	}
+	*tag_str = '\0';
+	tag_str += 1;
+
+	if (kstrtoull(start_str, 0, &start)) {
+		dev_err(dev, "Extent failed to parse start: %s\n", start_str);
+		return -EINVAL;
+	}
+	if (kstrtoull(len_str, 0, &length)) {
+		dev_err(dev, "Extent failed to parse length: %s\n", len_str);
+		return -EINVAL;
+	}
+
+	if (!new_extent_valid(dev, start, length))
+		return -EINVAL;
+
+	rc = devm_add_extent(dev, start, length, tag_str);
+	if (rc)
+		return rc;
+
+	return count;
+}
+static DEVICE_ATTR_WO(dc_inject_extent);
+
+static ssize_t dc_del_extent_store(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t count)
+{
+	unsigned long long start;
+	int rc;
+
+	if (kstrtoull(buf, 0, &start)) {
+		dev_err(dev, "Extent failed to parse start value\n");
+		return -EINVAL;
+	}
+
+	rc = dc_delete_extent(dev, start);
+	if (rc)
+		return rc;
+
+	return count;
+}
+static DEVICE_ATTR_WO(dc_del_extent);
+
+static ssize_t dc_force_del_extent_store(struct device *dev,
+					 struct device_attribute *attr,
+					 const char *buf, size_t count)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	unsigned long long start;
+	void *ext;
+	int rc;
+
+	if (kstrtoull(buf, 0, &start)) {
+		dev_err(dev, "Extent failed to parse start value\n");
+		return -EINVAL;
+	}
+
+	ext = xa_erase(&mdata->dc_accepted_exts, start);
+	if (ext)
+		dev_dbg(dev, "Forcing remove of accepted extent: %llx\n",
+			start);
+
+	dev_dbg(dev, "Forcing delete of extent at %llx\n", start);
+	rc = dc_delete_extent(dev, start);
+	if (rc)
+		return rc;
+
+	return count;
+}
+static DEVICE_ATTR_WO(dc_force_del_extent);
+
 static struct attribute *cxl_mock_mem_attrs[] = {
 	&dev_attr_security_lock.attr,
 	&dev_attr_event_trigger.attr,
 	&dev_attr_fw_buf_checksum.attr,
+	&dev_attr_dc_inject_extent.attr,
+	&dev_attr_dc_del_extent.attr,
+	&dev_attr_dc_force_del_extent.attr,
 	NULL
 };
 ATTRIBUTE_GROUPS(cxl_mock_mem);

-- 
2.41.0


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH RFC v2 18/18] tools/testing/cxl: Add Dynamic Capacity events
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (16 preceding siblings ...)
  2023-08-29  5:21 ` [PATCH RFC v2 17/18] tools/testing/cxl: Add DC Regions to mock mem data Ira Weiny
@ 2023-08-29  5:21 ` Ira Weiny
  2023-08-30 12:23   ` Jonathan Cameron
  2023-08-31 23:20   ` Dave Jiang
  2023-09-07 21:01 ` [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Fan Ni
  18 siblings, 2 replies; 97+ messages in thread
From: Ira Weiny @ 2023-08-29  5:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linux-cxl,
	linux-kernel

OS software needs to be alerted when new extents arrive on a Dynamic
Capacity Device (DCD).  On test DCDs extents are added through sysfs.

Add events on DCD extent injection.  Call the event irq callback
directly to simulate an interrupt and process the test extents.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 tools/testing/cxl/test/mem.c | 57 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
index d6041a2145c5..20364fee9df9 100644
--- a/tools/testing/cxl/test/mem.c
+++ b/tools/testing/cxl/test/mem.c
@@ -2008,6 +2008,41 @@ static bool new_extent_valid(struct device *dev, size_t new_start,
 	return false;
 }
 
+static struct dcd_event_dyn_cap dcd_event_rec_template = {
+	.hdr = {
+		.id = UUID_INIT(0xca95afa7, 0xf183, 0x4018,
+				0x8c, 0x2f, 0x95, 0x26, 0x8e, 0x10, 0x1a, 0x2a),
+		.length = sizeof(struct dcd_event_dyn_cap),
+	},
+};
+
+static int send_dc_event(struct mock_event_store *mes, enum dc_event type,
+			 u64 start, u64 length, const char *tag_str)
+{
+	struct device *dev = mes->mds->cxlds.dev;
+	struct dcd_event_dyn_cap *dcd_event_rec;
+
+	dcd_event_rec = devm_kzalloc(dev, sizeof(*dcd_event_rec), GFP_KERNEL);
+	if (!dcd_event_rec)
+		return -ENOMEM;
+
+	memcpy(dcd_event_rec, &dcd_event_rec_template, sizeof(*dcd_event_rec));
+	dcd_event_rec->data.event_type = type;
+	dcd_event_rec->data.extent.start_dpa = cpu_to_le64(start);
+	dcd_event_rec->data.extent.length = cpu_to_le64(length);
+	memcpy(dcd_event_rec->data.extent.tag, tag_str,
+	       min(sizeof(dcd_event_rec->data.extent.tag),
+		   strlen(tag_str)));
+
+	mes_add_event(mes, CXL_EVENT_TYPE_DCD,
+		      (struct cxl_event_record_raw *)dcd_event_rec);
+
+	/* Fake the irq */
+	cxl_mem_get_event_records(mes->mds, CXLDEV_EVENT_STATUS_DCD);
+
+	return 0;
+}
+
 /*
  * Format <start>:<length>:<tag>
  *
@@ -2021,6 +2056,7 @@ static ssize_t dc_inject_extent_store(struct device *dev,
 				      const char *buf, size_t count)
 {
 	char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
 	unsigned long long start, length;
 	char *len_str, *tag_str;
 	size_t buf_len = count;
@@ -2063,6 +2099,13 @@ static ssize_t dc_inject_extent_store(struct device *dev,
 	if (rc)
 		return rc;
 
+	rc = send_dc_event(&mdata->mes, DCD_ADD_CAPACITY, start, length,
+			   tag_str);
+	if (rc) {
+		dev_err(dev, "Failed to add event %d\n", rc);
+		return rc;
+	}
+
 	return count;
 }
 static DEVICE_ATTR_WO(dc_inject_extent);
@@ -2071,6 +2114,7 @@ static ssize_t dc_del_extent_store(struct device *dev,
 				   struct device_attribute *attr,
 				   const char *buf, size_t count)
 {
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
 	unsigned long long start;
 	int rc;
 
@@ -2083,6 +2127,12 @@ static ssize_t dc_del_extent_store(struct device *dev,
 	if (rc)
 		return rc;
 
+	rc = send_dc_event(&mdata->mes, DCD_RELEASE_CAPACITY, start, 0, "");
+	if (rc) {
+		dev_err(dev, "Failed to add event %d\n", rc);
+		return rc;
+	}
+
 	return count;
 }
 static DEVICE_ATTR_WO(dc_del_extent);
@@ -2111,6 +2161,13 @@ static ssize_t dc_force_del_extent_store(struct device *dev,
 	if (rc)
 		return rc;
 
+	rc = send_dc_event(&mdata->mes, DCD_FORCED_CAPACITY_RELEASE,
+			      start, 0, "");
+	if (rc) {
+		dev_err(dev, "Failed to add event %d\n", rc);
+		return rc;
+	}
+
 	return count;
 }
 static DEVICE_ATTR_WO(dc_force_del_extent);

-- 
2.41.0


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 01/18] cxl/hdm: Debug, use decoder name function
  2023-08-29  5:20 ` [PATCH RFC v2 01/18] cxl/hdm: Debug, use decoder name function Ira Weiny
@ 2023-08-29 14:03   ` Jonathan Cameron
  2023-08-29 21:48     ` Fan Ni
  2023-09-03  2:55     ` Ira Weiny
  2023-08-30 20:32   ` Dave Jiang
  1 sibling, 2 replies; 97+ messages in thread
From: Jonathan Cameron @ 2023-08-29 14:03 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:20:52 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> The decoder enum has a name conversion function defined now.
> 
> Use that instead of open coding.
> 
> Suggested-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 

Perhaps pull this one out so it can go upstream before the rest are ready,
or could be picked up from here.

Whilst we probably won't see the other decoder modes in here, there
is no reason why anyone reading the code should have to figure that out.
As such it is much better to use the more generic function.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---
> Changes for v2:
> [iweiny: new patch, split out]
> ---
>  drivers/cxl/core/hdm.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index b01a77b67511..a254f79dd4e8 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -550,8 +550,7 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>  
>  	if (size > avail) {
>  		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
> -			cxled->mode == CXL_DECODER_RAM ? "ram" : "pmem",
> -			&avail);
> +			cxl_decoder_mode_name(cxled->mode), &avail);
>  		rc = -ENOSPC;
>  		goto out;
>  	}
> 


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 02/18] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
  2023-08-29  5:20 ` [PATCH RFC v2 02/18] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD) Ira Weiny
@ 2023-08-29 14:07   ` Jonathan Cameron
  2023-09-03  3:38     ` Ira Weiny
  2023-08-29 21:49   ` Fan Ni
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 97+ messages in thread
From: Jonathan Cameron @ 2023-08-29 14:07 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:20:53 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> Per the CXL 3.0 specification software must check the Command Effects
> Log (CEL) to know if a device supports DC.  If the device does support
> DC the specifics of the DC Regions (0-7) are read through the mailbox.
> 
> Flag DC Device (DCD) commands in a device if they are supported.
> Subsequent patches will key off these bits to configure a DCD.
> 
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 

Trivial unrelated change seems to have sneaked in. Other than that
this looks good to me.

So with that tidied up.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


Thanks,

Jonathan

> +
>  static bool cxl_is_security_command(u16 opcode)
>  {
>  	int i;
> @@ -677,9 +705,10 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>  		u16 opcode = le16_to_cpu(cel_entry[i].opcode);
>  		struct cxl_mem_command *cmd = cxl_mem_find_command(opcode);
>  
> -		if (!cmd && !cxl_is_poison_command(opcode)) {
> -			dev_dbg(dev,
> -				"Opcode 0x%04x unsupported by driver\n", opcode);
> +		if (!cmd && !cxl_is_poison_command(opcode) &&
> +		    !cxl_is_dcd_command(opcode)) {
> +			dev_dbg(dev, "Opcode 0x%04x unsupported by driver\n",
> +				opcode);

Clang format has been playing?
Better to leave this alone and save reviewers wondering what the change
in the dev_dbg() was.

>  			continue;
>  		}


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device
  2023-08-29  5:20 ` [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device ira.weiny
@ 2023-08-29 14:37   ` Jonathan Cameron
  2023-09-03 23:36     ` Ira Weiny
  2023-08-30 21:01   ` Dave Jiang
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 97+ messages in thread
From: Jonathan Cameron @ 2023-08-29 14:37 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:20:54 -0700
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> Devices can optionally support Dynamic Capacity (DC).  These devices are
> known as Dynamic Capacity Devices (DCD).
> 
> Implement the DC (opcode 48XXh) mailbox commands as specified in CXL 3.0
> section 8.2.9.8.9.  Read the DC configuration and store the DC region
> information in the device state.
> 
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
Hi.

A few minor things inline.  Otherwise, I wonder if it's worth separating
the mode of the region from that of the endpoint decoder in a precursor patch.
That's a large part of this one and not really related to the mbox command stuff.

Jonathan


...

> +
> +/* Returns the number of regions in dc_resp or -ERRNO */
> +static int cxl_get_dc_id(struct cxl_memdev_state *mds, u8 start_region,
> +			 struct cxl_mbox_dynamic_capacity *dc_resp,
> +			 size_t dc_resp_size)
> +{
> +	struct cxl_mbox_get_dc_config get_dc = (struct cxl_mbox_get_dc_config) {
> +		.region_count = CXL_MAX_DC_REGION,
> +		.start_region_index = start_region,
> +	};
> +	struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> +		.payload_in = &get_dc,
> +		.size_in = sizeof(get_dc),
> +		.size_out = dc_resp_size,
> +		.payload_out = dc_resp,
> +		.min_out = 1,
> +	};
> +	struct device *dev = mds->cxlds.dev;
> +	int rc;
> +
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	if (rc < 0)
> +		return rc;
> +
> +	rc = dc_resp->avail_region_count - start_region;
> +
> +	/*
> +	 * The number of regions in the payload may have been truncated due to
> +	 * payload_size limits; if so adjust the count in this query.

Not adjusting the query.  "if so adjust the returned count to match."

> +	 */
> +	if (mbox_cmd.size_out < sizeof(*dc_resp))
> +		rc = CXL_REGIONS_RETURNED(mbox_cmd.size_out);
> +
> +	dev_dbg(dev, "Read %d/%d DC regions\n", rc, dc_resp->avail_region_count);
> +
> +	return rc;
> +}
> +
> +/**
> + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> + *					 information from the device.
> + * @mds: The memory device state
> + *
> + * This will dispatch the get_dynamic_capacity command to the device
> + * and on success populate structures to be exported to sysfs.

I'd skip the 'exported to sysfs' as I'd guess this will have other uses
(maybe) in the longer term.

and on success populate state structures for later use.

> + *
> + * Return: 0 if identify was executed successfully, -ERRNO on error.
> + */
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> +{
> +	struct cxl_mbox_dynamic_capacity *dc_resp;
> +	struct device *dev = mds->cxlds.dev;
> +	size_t dc_resp_size = mds->payload_size;
> +	u8 start_region;
> +	int i, rc = 0;
> +
> +	for (i = 0; i < CXL_MAX_DC_REGION; i++)
> +		snprintf(mds->dc_region[i].name, CXL_DC_REGION_STRLEN, "<nil>");
> +
> +	/* Check GET_DC_CONFIG is supported by device */
> +	if (!test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds)) {
> +		dev_dbg(dev, "unsupported cmd: get_dynamic_capacity_config\n");
> +		return 0;
> +	}
> +
> +	dc_resp = kvmalloc(dc_resp_size, GFP_KERNEL);                         
> +	if (!dc_resp)                                                                
> +		return -ENOMEM;                                                 
> +
> +	start_region = 0;
> +	do {
> +		int j;
> +
> +		rc = cxl_get_dc_id(mds, start_region, dc_resp, dc_resp_size);

I'd spell out identify.
Initially I thought this was getting an index.


> +		if (rc < 0)
> +			goto free_resp;
> +
> +		mds->nr_dc_region += rc;
> +
> +		if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> +			dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> +				mds->nr_dc_region);
> +			rc = -EINVAL;
> +			goto free_resp;
> +		}
> +
> +		for (i = start_region, j = 0; i < mds->nr_dc_region; i++, j++) {
> +			rc = cxl_dc_save_region_info(mds, i, &dc_resp->region[j]);
> +			if (rc)
> +				goto free_resp;
> +		}
> +
> +		start_region = mds->nr_dc_region;
> +
> +	} while (mds->nr_dc_region < dc_resp->avail_region_count);
> +
> +	mds->dynamic_cap =
> +		mds->dc_region[mds->nr_dc_region - 1].base +
> +		mds->dc_region[mds->nr_dc_region - 1].decode_len -
> +		mds->dc_region[0].base;
> +	dev_dbg(dev, "Total dynamic capacity: %#llx\n", mds->dynamic_cap);
> +
> +free_resp:
> +	kfree(dc_resp);

Maybe a first use for __free in cxl?

See include/linux/cleanup.h
Would enable returns rather than goto and label.



> +	if (rc)
> +		dev_err(dev, "Failed to get DC info: %d\n", rc);

I'd prefer to see more specific debug in the few paths that don't already
print it above.

> +	return rc;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> +
>  static int add_dpa_res(struct device *dev, struct resource *parent,
>  		       struct resource *res, resource_size_t start,
>  		       resource_size_t size, const char *type)
> @@ -1208,8 +1369,12 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>  {
>  	struct cxl_dev_state *cxlds = &mds->cxlds;
>  	struct device *dev = cxlds->dev;
> +	size_t untenanted_mem;
>  	int rc;
>  
> +	untenanted_mem = mds->dc_region[0].base - mds->static_cap;
> +	mds->total_bytes = mds->static_cap + untenanted_mem + mds->dynamic_cap;
> +
>  	if (!cxlds->media_ready) {
>  		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
>  		cxlds->ram_res = DEFINE_RES_MEM(0, 0);
> @@ -1217,8 +1382,16 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>  		return 0;
>  	}
>  
> -	cxlds->dpa_res =
> -		(struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
> +	cxlds->dpa_res = (struct resource)DEFINE_RES_MEM(0, mds->total_bytes);

Beat back that auto-formatter! Or just run it once and fix everything before
doing anything new.

> +
> +	for (int i = 0; i < mds->nr_dc_region; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
> +				 dcr->base, dcr->decode_len, dcr->name);
> +		if (rc)
> +			return rc;
> +	}
>  
>  	if (mds->partition_align_bytes == 0) {
>  		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 252bc8e1f103..75041903b72c 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -46,7 +46,7 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
>  	rc = down_read_interruptible(&cxl_region_rwsem);
>  	if (rc)
>  		return rc;
> -	if (cxlr->mode != CXL_DECODER_PMEM)
> +	if (cxlr->mode != CXL_REGION_PMEM)
>  		rc = sysfs_emit(buf, "\n");
>  	else
>  		rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
> @@ -359,7 +359,7 @@ static umode_t cxl_region_visible(struct kobject *kobj, struct attribute *a,
>  	 * Support tooling that expects to find a 'uuid' attribute for all
>  	 * regions regardless of mode.
>  	 */
> -	if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_DECODER_PMEM)
> +	if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_REGION_PMEM)
>  		return 0444;
>  	return a->mode;
>  }
> @@ -537,7 +537,7 @@ static ssize_t mode_show(struct device *dev, struct device_attribute *attr,
>  {
>  	struct cxl_region *cxlr = to_cxl_region(dev);
>  
> -	return sysfs_emit(buf, "%s\n", cxl_decoder_mode_name(cxlr->mode));
> +	return sysfs_emit(buf, "%s\n", cxl_region_mode_name(cxlr->mode));
>  }
>  static DEVICE_ATTR_RO(mode);
>  
> @@ -563,7 +563,7 @@ static int alloc_hpa(struct cxl_region *cxlr, resource_size_t size)
>  
>  	/* ways, granularity and uuid (if PMEM) need to be set before HPA */
>  	if (!p->interleave_ways || !p->interleave_granularity ||
> -	    (cxlr->mode == CXL_DECODER_PMEM && uuid_is_null(&p->uuid)))
> +	    (cxlr->mode == CXL_REGION_PMEM && uuid_is_null(&p->uuid)))
>  		return -ENXIO;
>  
>  	div_u64_rem(size, SZ_256M * p->interleave_ways, &remainder);
> @@ -1765,6 +1765,17 @@ static int cxl_region_sort_targets(struct cxl_region *cxlr)
>  	return rc;
>  }
>  
> +static bool cxl_modes_compatible(enum cxl_region_mode rmode,
> +				 enum cxl_decoder_mode dmode)
> +{
> +	if (rmode == CXL_REGION_RAM && dmode == CXL_DECODER_RAM)
> +		return true;
> +	if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
> +		return true;
> +
> +	return false;
> +}
> +
>  static int cxl_region_attach(struct cxl_region *cxlr,
>  			     struct cxl_endpoint_decoder *cxled, int pos)
>  {
> @@ -1778,9 +1789,11 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  	lockdep_assert_held_write(&cxl_region_rwsem);
>  	lockdep_assert_held_read(&cxl_dpa_rwsem);
>  
> -	if (cxled->mode != cxlr->mode) {
> -		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
> -			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
> +	if (!cxl_modes_compatible(cxlr->mode, cxled->mode)) {
> +		dev_dbg(&cxlr->dev, "%s region mode: %s mismatch decoder: %s\n",
> +			dev_name(&cxled->cxld.dev),
> +			cxl_region_mode_name(cxlr->mode),
> +			cxl_decoder_mode_name(cxled->mode));
>  		return -EINVAL;
>  	}
>  
> @@ -2234,7 +2247,7 @@ static struct cxl_region *cxl_region_alloc(struct cxl_root_decoder *cxlrd, int i
>   * devm_cxl_add_region - Adds a region to a decoder
>   * @cxlrd: root decoder
>   * @id: memregion id to create, or memregion_free() on failure
> - * @mode: mode for the endpoint decoders of this region
> + * @mode: mode of this region
>   * @type: select whether this is an expander or accelerator (type-2 or type-3)
>   *
>   * This is the second step of region initialization. Regions exist within an
> @@ -2245,7 +2258,7 @@ static struct cxl_region *cxl_region_alloc(struct cxl_root_decoder *cxlrd, int i
>   */
>  static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>  					      int id,
> -					      enum cxl_decoder_mode mode,
> +					      enum cxl_region_mode mode,
>  					      enum cxl_decoder_type type)
>  {
>  	struct cxl_port *port = to_cxl_port(cxlrd->cxlsd.cxld.dev.parent);
> @@ -2254,11 +2267,12 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>  	int rc;
>  
>  	switch (mode) {
> -	case CXL_DECODER_RAM:
> -	case CXL_DECODER_PMEM:
> +	case CXL_REGION_RAM:
> +	case CXL_REGION_PMEM:
>  		break;
>  	default:
> -		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);

Arguably should have been moved to the cxl_decoder_mode_name() in patch 1
before being changed to cxl_region_mode_name() when the two are separated in this
patch.  You could just add a note to patch 1 to say 'other instances will be
covered by refactors shortly'. 

> +		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
> +			cxl_region_mode_name(mode));
>  		return ERR_PTR(-EINVAL);
>  	}
>  
> @@ -2308,7 +2322,7 @@ static ssize_t create_ram_region_show(struct device *dev,
>  }
>  
>  static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
> -					  int id, enum cxl_decoder_mode mode,
> +					  int id, enum cxl_region_mode mode,
>  					  enum cxl_decoder_type type)
>  {
>  	int rc;
> @@ -2337,7 +2351,7 @@ static ssize_t create_pmem_region_store(struct device *dev,
>  	if (rc != 1)
>  		return -EINVAL;
>  
> -	cxlr = __create_region(cxlrd, id, CXL_DECODER_PMEM,
> +	cxlr = __create_region(cxlrd, id, CXL_REGION_PMEM,
>  			       CXL_DECODER_HOSTONLYMEM);
>  	if (IS_ERR(cxlr))
>  		return PTR_ERR(cxlr);
> @@ -2358,7 +2372,7 @@ static ssize_t create_ram_region_store(struct device *dev,
>  	if (rc != 1)
>  		return -EINVAL;
>  
> -	cxlr = __create_region(cxlrd, id, CXL_DECODER_RAM,
> +	cxlr = __create_region(cxlrd, id, CXL_REGION_RAM,
>  			       CXL_DECODER_HOSTONLYMEM);
>  	if (IS_ERR(cxlr))
>  		return PTR_ERR(cxlr);
> @@ -2886,10 +2900,31 @@ static void construct_region_end(void)
>  	up_write(&cxl_region_rwsem);
>  }
>  
> +static enum cxl_region_mode
> +cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
> +{
> +	switch (mode) {
> +	case CXL_DECODER_NONE:
> +		return CXL_REGION_NONE;
> +	case CXL_DECODER_RAM:
> +		return CXL_REGION_RAM;
> +	case CXL_DECODER_PMEM:
> +		return CXL_REGION_PMEM;
> +	case CXL_DECODER_DEAD:
> +		return CXL_REGION_DEAD;
> +	case CXL_DECODER_MIXED:
> +	default:
> +		return CXL_REGION_MIXED;
> +	}
> +
> +	return CXL_REGION_MIXED;
> +}
> +
>  static struct cxl_region *
>  construct_region_begin(struct cxl_root_decoder *cxlrd,
>  		       struct cxl_endpoint_decoder *cxled)
>  {
> +	enum cxl_region_mode mode = cxl_decoder_to_region_mode(cxled->mode);
>  	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
>  	struct cxl_region_params *p;
>  	struct cxl_region *cxlr;
> @@ -2897,7 +2932,7 @@ construct_region_begin(struct cxl_root_decoder *cxlrd,
>  
>  	do {
>  		cxlr = __create_region(cxlrd, atomic_read(&cxlrd->region_id),
> -				       cxled->mode, cxled->cxld.target_type);
> +				       mode, cxled->cxld.target_type);
>  	} while (IS_ERR(cxlr) && PTR_ERR(cxlr) == -EBUSY);
>  
>  	if (IS_ERR(cxlr)) {
> @@ -3200,9 +3235,9 @@ static int cxl_region_probe(struct device *dev)
>  		return rc;
>  
>  	switch (cxlr->mode) {
> -	case CXL_DECODER_PMEM:
> +	case CXL_REGION_PMEM:
>  		return devm_cxl_add_pmem_region(cxlr);
> -	case CXL_DECODER_RAM:
> +	case CXL_REGION_RAM:
>  		/*
>  		 * The region can not be manged by CXL if any portion of
>  		 * it is already online as 'System RAM'
> @@ -3223,8 +3258,8 @@ static int cxl_region_probe(struct device *dev)
>  		/* HDM-H routes to device-dax */
>  		return devm_cxl_add_dax_region(cxlr);
>  	default:
> -		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
> -			cxlr->mode);
> +		dev_dbg(&cxlr->dev, "unsupported region mode: %s\n",
> +			cxl_region_mode_name(cxlr->mode));
>  		return -ENXIO;
>  	}
>  }
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index cd4a9ffdacc7..ed282dcd5cf5 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -374,6 +374,28 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  	return "mixed";
>  }
>  
> +enum cxl_region_mode {
> +	CXL_REGION_NONE,
> +	CXL_REGION_RAM,
> +	CXL_REGION_PMEM,
> +	CXL_REGION_MIXED,
> +	CXL_REGION_DEAD,
> +};

It feels to me like you could have yanked the introduction and use of cxl_region_mode
out as a trivial precursor patch with a note saying the separation will be needed
shortly and why it will be needed.

> +
> +static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
> +{
> +	static const char * const names[] = {
> +		[CXL_REGION_NONE] = "none",
> +		[CXL_REGION_RAM] = "ram",
> +		[CXL_REGION_PMEM] = "pmem",
> +		[CXL_REGION_MIXED] = "mixed",
> +	};
> +
> +	if (mode >= CXL_REGION_NONE && mode <= CXL_REGION_MIXED)
> +		return names[mode];
> +	return "mixed";
> +}
> +
>  /*
>   * Track whether this decoder is reserved for region autodiscovery, or
>   * free for userspace provisioning.
> @@ -502,7 +524,8 @@ struct cxl_region_params {
>   * struct cxl_region - CXL region
>   * @dev: This region's device
>   * @id: This region's id. Id is globally unique across all regions
> - * @mode: Endpoint decoder allocation / access mode
> + * @mode: Region mode which defines which endpoint decoder mode the region is
> + *        compatible with
>   * @type: Endpoint decoder target type
>   * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
>   * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
> @@ -512,7 +535,7 @@ struct cxl_region_params {
>  struct cxl_region {
>  	struct device dev;
>  	int id;
> -	enum cxl_decoder_mode mode;
> +	enum cxl_region_mode mode;
>  	enum cxl_decoder_type type;
>  	struct cxl_nvdimm_bridge *cxl_nvb;
>  	struct cxl_pmem_region *cxlr_pmem;
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 5f2e65204bf9..8c8f47b397ab 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -396,6 +396,7 @@ enum cxl_devtype {
>  	CXL_DEVTYPE_CLASSMEM,
>  };
>  
> +#define CXL_MAX_DC_REGION 8
>  /**
>   * struct cxl_dev_state - The driver device state
>   *
> @@ -412,6 +413,8 @@ enum cxl_devtype {
>   * @dpa_res: Overall DPA resource tree for the device
>   * @pmem_res: Active Persistent memory capacity configuration
>   * @ram_res: Active Volatile memory capacity configuration
> + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> + *          region
>   * @component_reg_phys: register base of component registers
>   * @serial: PCIe Device Serial Number
>   * @type: Generic Memory Class device or Vendor Specific Memory device
> @@ -426,11 +429,23 @@ struct cxl_dev_state {
>  	struct resource dpa_res;
>  	struct resource pmem_res;
>  	struct resource ram_res;
> +	struct resource dc_res[CXL_MAX_DC_REGION];
>  	resource_size_t component_reg_phys;
>  	u64 serial;
>  	enum cxl_devtype type;
>  };
>  
> +#define CXL_DC_REGION_STRLEN 7
> +struct cxl_dc_region_info {
> +	u64 base;
> +	u64 decode_len;
> +	u64 len;
> +	u64 blk_size;
> +	u32 dsmad_handle;
> +	u8 flags;
> +	u8 name[CXL_DC_REGION_STRLEN];
> +};
> +
>  /**
>   * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
>   *
> @@ -449,6 +464,8 @@ struct cxl_dev_state {
>   * @enabled_cmds: Hardware commands found enabled in CEL.
>   * @exclusive_cmds: Commands that are kernel-internal only
>   * @total_bytes: sum of all possible capacities
> + * @static_cap: Sum of RAM and PMEM capacities

Sum of static RAM and PMEM capacities

Dynamic cap may well be RAM or PMEM!

> + * @dynamic_cap: Complete DPA range occupied by DC regions
>   * @volatile_only_bytes: hard volatile capacity
>   * @persistent_only_bytes: hard persistent capacity
>   * @partition_align_bytes: alignment size for partition-able capacity
> @@ -456,6 +473,10 @@ struct cxl_dev_state {
>   * @active_persistent_bytes: sum of hard + soft persistent
>   * @next_volatile_bytes: volatile capacity change pending device reset
>   * @next_persistent_bytes: persistent capacity change pending device reset
> + * @nr_dc_region: number of DC regions implemented in the memory device
> + * @dc_region: array containing info about the DC regions
> + * @dc_event_log_size: The number of events the device can store in the
> + * Dynamic Capacity Event Log before it overflows
>   * @event: event log driver state
>   * @poison: poison driver state info
>   * @fw: firmware upload / activation state
> @@ -473,7 +494,10 @@ struct cxl_memdev_state {
>  	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
>  	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
>  	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> +
>  	u64 total_bytes;
> +	u64 static_cap;
> +	u64 dynamic_cap;
>  	u64 volatile_only_bytes;
>  	u64 persistent_only_bytes;
>  	u64 partition_align_bytes;
> @@ -481,6 +505,11 @@ struct cxl_memdev_state {
>  	u64 active_persistent_bytes;
>  	u64 next_volatile_bytes;
>  	u64 next_persistent_bytes;
> +
> +	u8 nr_dc_region;
> +	struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
> +	size_t dc_event_log_size;
> +
>  	struct cxl_event_state event;
>  	struct cxl_poison_state poison;
>  	struct cxl_security_state security;
> @@ -587,6 +616,7 @@ struct cxl_mbox_identify {
>  	__le16 inject_poison_limit;
>  	u8 poison_caps;
>  	u8 qos_telemetry_caps;
> +	__le16 dc_event_log_size;
>  } __packed;
>  
>  /*
> @@ -741,9 +771,31 @@ struct cxl_mbox_set_partition_info {
>  	__le64 volatile_capacity;
>  	u8 flags;
>  } __packed;
> -

Stray whitespace change?

>  #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
>  
> +struct cxl_mbox_get_dc_config {
> +	u8 region_count;
> +	u8 start_region_index;
> +} __packed;
> +
> +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> +struct cxl_mbox_dynamic_capacity {

Can we rename to make it more clear which payload this is?

> +	u8 avail_region_count;
> +	u8 rsvd[7];
> +	struct cxl_dc_region_config {
> +		__le64 region_base;
> +		__le64 region_decode_length;
> +		__le64 region_length;
> +		__le64 region_block_size;
> +		__le32 region_dsmad_handle;
> +		u8 flags;
> +		u8 rsvd[3];
> +	} __packed region[];
> +} __packed;
> +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
> +#define CXL_REGIONS_RETURNED(size_out) \
> +	((size_out - 8) / sizeof(struct cxl_dc_region_config))
> +
>  /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
>  struct cxl_mbox_set_timestamp_in {
>  	__le64 timestamp;
> @@ -867,6 +919,7 @@ enum {
>  int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
>  			  struct cxl_mbox_cmd *cmd);
>  int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
>  int cxl_await_media_ready(struct cxl_dev_state *cxlds);
>  int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
>  int cxl_mem_create_range_info(struct cxl_memdev_state *mds);

ta


* Re: [PATCH RFC v2 04/18] cxl/region: Add Dynamic Capacity decoder and region modes
  2023-08-29  5:20 ` [PATCH RFC v2 04/18] cxl/region: Add Dynamic Capacity decoder and region modes Ira Weiny
@ 2023-08-29 14:39   ` Jonathan Cameron
  2023-08-30 21:13   ` Dave Jiang
  2023-08-31 17:00   ` Fan Ni
  2 siblings, 0 replies; 97+ messages in thread
From: Jonathan Cameron @ 2023-08-29 14:39 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:20:55 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> Both regions and decoders will need a new mode to reflect the new type
> of partition they are targeting on a device.  Regions reflect a dynamic
> capacity type which may point to different Dynamic Capacity (DC)
> Regions.  Decoder mode reflects a specific DC Region.
> 
> Define the new modes to use in subsequent patches and the helper
> functions associated with them.
> 
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Looks fine, though I'll be interested to see how it is used in later patches
as DC region does feel somewhat separate from the other types.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> ---
> Changes for v2:
> [iweiny: split out from: Add dynamic capacity cxl region support.]
> ---
>  drivers/cxl/core/region.c |  4 ++++
>  drivers/cxl/cxl.h         | 23 +++++++++++++++++++++++
>  2 files changed, 27 insertions(+)
> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 75041903b72c..69af1354bc5b 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1772,6 +1772,8 @@ static bool cxl_modes_compatible(enum cxl_region_mode rmode,
>  		return true;
>  	if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
>  		return true;
> +	if (rmode == CXL_REGION_DC && cxl_decoder_mode_is_dc(dmode))
> +		return true;
>  
>  	return false;
>  }
> @@ -2912,6 +2914,8 @@ cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
>  		return CXL_REGION_PMEM;
>  	case CXL_DECODER_DEAD:
>  		return CXL_REGION_DEAD;
> +	case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
> +		return CXL_REGION_DC;
>  	case CXL_DECODER_MIXED:
>  	default:
>  		return CXL_REGION_MIXED;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index ed282dcd5cf5..d41f3f14fbe3 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -356,6 +356,14 @@ enum cxl_decoder_mode {
>  	CXL_DECODER_NONE,
>  	CXL_DECODER_RAM,
>  	CXL_DECODER_PMEM,
> +	CXL_DECODER_DC0,
> +	CXL_DECODER_DC1,
> +	CXL_DECODER_DC2,
> +	CXL_DECODER_DC3,
> +	CXL_DECODER_DC4,
> +	CXL_DECODER_DC5,
> +	CXL_DECODER_DC6,
> +	CXL_DECODER_DC7,
>  	CXL_DECODER_MIXED,
>  	CXL_DECODER_DEAD,
>  };
> @@ -366,6 +374,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  		[CXL_DECODER_NONE] = "none",
>  		[CXL_DECODER_RAM] = "ram",
>  		[CXL_DECODER_PMEM] = "pmem",
> +		[CXL_DECODER_DC0] = "dc0",
> +		[CXL_DECODER_DC1] = "dc1",
> +		[CXL_DECODER_DC2] = "dc2",
> +		[CXL_DECODER_DC3] = "dc3",
> +		[CXL_DECODER_DC4] = "dc4",
> +		[CXL_DECODER_DC5] = "dc5",
> +		[CXL_DECODER_DC6] = "dc6",
> +		[CXL_DECODER_DC7] = "dc7",
>  		[CXL_DECODER_MIXED] = "mixed",
>  	};
>  
> @@ -374,10 +390,16 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  	return "mixed";
>  }
>  
> +static inline bool cxl_decoder_mode_is_dc(enum cxl_decoder_mode mode)
> +{
> +	return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
> +}
> +
>  enum cxl_region_mode {
>  	CXL_REGION_NONE,
>  	CXL_REGION_RAM,
>  	CXL_REGION_PMEM,
> +	CXL_REGION_DC,
>  	CXL_REGION_MIXED,
>  	CXL_REGION_DEAD,
>  };
> @@ -388,6 +410,7 @@ static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
>  		[CXL_REGION_NONE] = "none",
>  		[CXL_REGION_RAM] = "ram",
>  		[CXL_REGION_PMEM] = "pmem",
> +		[CXL_REGION_DC] = "dc",
>  		[CXL_REGION_MIXED] = "mixed",
>  	};
>  
> 



* Re: [PATCH RFC v2 05/18] cxl/port: Add Dynamic Capacity mode support to endpoint decoders
  2023-08-29  5:20 ` [PATCH RFC v2 05/18] cxl/port: Add Dynamic Capacity mode support to endpoint decoders Ira Weiny
@ 2023-08-29 14:49   ` Jonathan Cameron
  2023-09-05  0:05     ` Ira Weiny
  2023-08-31 17:25   ` Fan Ni
  1 sibling, 1 reply; 97+ messages in thread
From: Jonathan Cameron @ 2023-08-29 14:49 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:20:56 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> Endpoint decoders used to map Dynamic Capacity must be configured to
> point to the correct Dynamic Capacity (DC) Region.  The decoder mode
> currently represents the partition the decoder points to such as ram or
> pmem.
> 
> Expand the mode to include DC Regions.
> 
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

I'm reading this in a linear fashion for now (and ideally that should
always make sense) so I don't currently see the reason for the loops
in here. If they are needed for a future patch, add something to the
description to indicate that.

> 
> ---
> Changes for v2:
> [iweiny: split from region creation patch]
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl | 19 ++++++++++---------
>  drivers/cxl/core/hdm.c                  | 24 ++++++++++++++++++++++++
>  drivers/cxl/core/port.c                 | 16 ++++++++++++++++
>  3 files changed, 50 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 6350dd82b9a9..2268ffcdb604 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -257,22 +257,23 @@ Description:
>  
>  What:		/sys/bus/cxl/devices/decoderX.Y/mode
>  Date:		May, 2022
> -KernelVersion:	v6.0
> +KernelVersion:	v6.0, v6.6 (dcY)
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
>  		translates from a host physical address range, to a device local
>  		address range. Device-local address ranges are further split
> -		into a 'ram' (volatile memory) range and 'pmem' (persistent
> -		memory) range. The 'mode' attribute emits one of 'ram', 'pmem',
> -		'mixed', or 'none'. The 'mixed' indication is for error cases
> -		when a decoder straddles the volatile/persistent partition
> -		boundary, and 'none' indicates the decoder is not actively
> -		decoding, or no DPA allocation policy has been set.
> +		into a 'ram' (volatile memory) range, 'pmem' (persistent
> +		memory) range, or Dynamic Capacity (DC) range. The 'mode'
> +		attribute emits one of 'ram', 'pmem', 'dcY', 'mixed', or
> +		'none'. The 'mixed' indication is for error cases when a
> +		decoder straddles the volatile/persistent partition boundary,
> +		and 'none' indicates the decoder is not actively decoding, or
> +		no DPA allocation policy has been set.
>  
>  		'mode' can be written, when the decoder is in the 'disabled'
> -		state, with either 'ram' or 'pmem' to set the boundaries for the
> -		next allocation.
> +		state, with 'ram', 'pmem', or 'dcY' to set the boundaries for
> +		the next allocation.
>  
>  
>  What:		/sys/bus/cxl/devices/decoderX.Y/dpa_resource
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index a254f79dd4e8..3f4af1f5fac8 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -267,6 +267,19 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	__cxl_dpa_release(cxled);
>  }
>  
> +static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
> +{
> +	int index = 0;
> +
> +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
As you are relying on them being in order and adjacent for the loop, why is

	if (mode < CXL_DECODER_DC0 || mode > CXL_DECODER_DC7)
		return -EINVAL;

	return mode - CXL_DECODER_DC0;

not sufficient?

> +		if (mode == i)
> +			return index;
> +		index++;
> +	}
> +
> +	return -EINVAL;
> +}
> +
>  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  			     resource_size_t base, resource_size_t len,
>  			     resource_size_t skipped)
> @@ -429,6 +442,7 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  	switch (mode) {
>  	case CXL_DECODER_RAM:
>  	case CXL_DECODER_PMEM:
> +	case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
>  		break;
>  	default:
>  		dev_dbg(dev, "unsupported mode: %d\n", mode);
> @@ -456,6 +470,16 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  		goto out;
>  	}
>  
> +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> +		int index = dc_mode_to_region_index(i);
> +
> +		if (mode == i && !resource_size(&cxlds->dc_res[index])) {

Not obvious why we have the loop in this patch - perhaps it makes sense later.
If this is to enable later changes, then good to say that in the patch description.
Otherwise, something like:

	int index;

	rc = dc_mode_to_region_index(mode);
	if (rc < 0)
		goto out;

	index = rc;
	if (!resource_size(&cxlds->dc_res[index])) {
	....

> +			dev_dbg(dev, "no available dynamic capacity\n");
> +			rc = -ENXIO;
> +			goto out;
> +		}
> +	}
> +
>  	cxled->mode = mode;
>  	rc = 0;
>  out:
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index f58cf01f8d2c..ce4a66865db3 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -197,6 +197,22 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
>  		mode = CXL_DECODER_PMEM;
>  	else if (sysfs_streq(buf, "ram"))
>  		mode = CXL_DECODER_RAM;
> +	else if (sysfs_streq(buf, "dc0"))
> +		mode = CXL_DECODER_DC0;
> +	else if (sysfs_streq(buf, "dc1"))
> +		mode = CXL_DECODER_DC1;
> +	else if (sysfs_streq(buf, "dc2"))
> +		mode = CXL_DECODER_DC2;
> +	else if (sysfs_streq(buf, "dc3"))
> +		mode = CXL_DECODER_DC3;
> +	else if (sysfs_streq(buf, "dc4"))
> +		mode = CXL_DECODER_DC4;
> +	else if (sysfs_streq(buf, "dc5"))
> +		mode = CXL_DECODER_DC5;
> +	else if (sysfs_streq(buf, "dc6"))
> +		mode = CXL_DECODER_DC6;
> +	else if (sysfs_streq(buf, "dc7"))
> +		mode = CXL_DECODER_DC7;
>  	else
>  		return -EINVAL;
>  
> 



* Re: [PATCH RFC v2 06/18] cxl/port: Add Dynamic Capacity size support to endpoint decoders
  2023-08-29  5:20 ` [PATCH RFC v2 06/18] cxl/port: Add Dynamic Capacity size " Ira Weiny
@ 2023-08-29 15:09   ` Jonathan Cameron
  2023-09-05  4:32     ` Ira Weiny
  0 siblings, 1 reply; 97+ messages in thread
From: Jonathan Cameron @ 2023-08-29 15:09 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:20:57 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> To support Dynamic Capacity Devices (DCD) endpoint decoders will need to
> map DC Regions (partitions).  Part of this is assigning the size of the
> DC Region DPA to the decoder in addition to any skip value from the
> previous decoder which exists.  This must be done within a continuous
> DPA space.  Two complications arise with Dynamic Capacity regions which
> did not exist with Ram and PMEM partitions.  First, gaps in the DPA
> space can exist between and around the DC Regions.  Second, the Linux
> resource tree does not allow a resource to be marked across existing
> nodes within a tree.
> 
> For clarity, below is an example of an 60GB device with 10GB of RAM,
> 10GB of PMEM and 10GB for each of 2 DC Regions.  The desired CXL mapping
> is 5GB of RAM, 5GB of PMEM, and all 10GB of DC1.
> 
>      DPA RANGE
>      (dpa_res)
> 0GB        10GB       20GB       30GB       40GB       50GB       60GB
> |----------|----------|----------|----------|----------|----------|
> 
> RAM         PMEM                  DC0                   DC1
>  (ram_res)  (pmem_res)            (dc_res[0])           (dc_res[1])
> |----------|----------|   <gap>  |----------|   <gap>  |----------|
> 
>  RAM        PMEM                                        DC1
> |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXXXXXXX|
> 0GB   5GB  10GB  15GB 20GB       30GB       40GB       50GB       60GB
> 
> The previous skip resource between RAM and PMEM was always a child of
> the RAM resource and fit nicely (see X below).  Because of this
> simplicity this skip resource reference was not stored in any CXL state.
> On release the skip range could be calculated based on the endpoint
> decoders stored values.
> 
> Now when DC1 is being mapped 4 skip resources must be created as
> children.  One of the PMEM resource (A), two of the parent DPA resource
> (B,D), and one more child of the DC0 resource (C).
> 
> 0GB        10GB       20GB       30GB       40GB       50GB       60GB
> |----------|----------|----------|----------|----------|----------|
>                            |                     |
> |----------|----------|    |     |----------|    |     |----------|
>         |          |       |          |          |
>        (X)        (A)     (B)        (C)        (D)
> 	v          v       v          v          v
> |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXXXXXXX|
>        skip       skip  skip        skip      skip
> 
> Expand the calculation of DPA freespace and enhance the logic to support
> mapping/unmapping DC DPA space.  To track the potential of multiple skip
> resources an xarray is attached to the endpoint decoder.  The existing
> algorithm is consolidated with the new one to store a single skip
> resource in the same way as multiple skip resources.
> 
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Various minor things noticed inline.

Jonathan

> 
> ---
> An alternative of using reserve_region_with_split() was considered.
> The advantage of that would be keeping all the resource information
> stored solely in the resource tree rather than having separate
> references to them.  However, it would best be implemented with a call
> such as release_split_region() [name TBD?] which could find all the leaf
> resources in the range and release them.  Furthermore, it is not clear
> if reserve_region_with_split() is really intended for anything outside
> of init code.  In the end this algorithm seems straight forward enough.
> 
> Changes for v2:
> [iweiny: write commit message]
> [iweiny: remove unneeded changes]
> [iweiny: split from region creation patch]
> [iweiny: Alter skip algorithm to use 'anonymous regions']
> [iweiny: enhance debug messages]
> [iweiny: consolidate skip resource creation]
> [iweiny: ensure xa_destroy() is called]
> [iweiny: consolidate region requests further]
> [iweiny: ensure resource is released on xa_insert]
> ---
>  drivers/cxl/core/hdm.c  | 188 +++++++++++++++++++++++++++++++++++++++++++-----
>  drivers/cxl/core/port.c |   2 +
>  drivers/cxl/cxl.h       |   2 +
>  3 files changed, 176 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 3f4af1f5fac8..3cd048677816 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c


> +
> +static int cxl_reserve_dpa_skip(struct cxl_endpoint_decoder *cxled,
> +				resource_size_t base, resource_size_t skipped)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct cxl_port *port = cxled_to_port(cxled);
> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	resource_size_t skip_base = base - skipped;
> +	resource_size_t size, skip_len = 0;
> +	struct device *dev = &port->dev;
> +	int rc, index;
> +
> +	size = resource_size(&cxlds->ram_res);
> +	if (size && skip_base <= cxlds->ram_res.end) {

This size is only used in this if statement; I'd just put it inline.
> +		skip_len = cxlds->ram_res.end - skip_base + 1;
> +		rc = cxl_request_skip(cxled, skip_base, skip_len);
> +		if (rc)
> +			return rc;
> +		skip_base += skip_len;
> +	}
> +
> +	if (skip_base == base) {
> +		dev_dbg(dev, "skip done!\n");

Not sure that dbg is much help, given there are other places below where the skip is also done...

> +		return 0;
> +	}
> +
> +	size = resource_size(&cxlds->pmem_res);
> +	if (size && skip_base <= cxlds->pmem_res.end) {

size is only used in this if statement. I'd just put
the resource_size() bit inline.

> +		skip_len = cxlds->pmem_res.end - skip_base + 1;
> +		rc = cxl_request_skip(cxled, skip_base, skip_len);
> +		if (rc)
> +			return rc;
> +		skip_base += skip_len;
> +	}
> +
> +	index = dc_mode_to_region_index(cxled->mode);
> +	for (int i = 0; i <= index; i++) {
> +		struct resource *dcr = &cxlds->dc_res[i];
> +
> +		if (skip_base < dcr->start) {
> +			skip_len = dcr->start - skip_base;
> +			rc = cxl_request_skip(cxled, skip_base, skip_len);
> +			if (rc)
> +				return rc;
> +			skip_base += skip_len;
> +		}
> +
> +		if (skip_base == base) {
> +			dev_dbg(dev, "skip done!\n");

As above - perhaps some more info?

> +			break;
> +		}
> +
> +		if (resource_size(dcr) && skip_base <= dcr->end) {
> +			if (skip_base > base)
> +				dev_err(dev, "Skip error\n");

Not return?  If there is a reason to carry on, I'd like a comment to say what it is.

> +
> +			skip_len = dcr->end - skip_base + 1;
> +			rc = cxl_request_skip(cxled, skip_base, skip_len);
> +			if (rc)
> +				return rc;
> +			skip_base += skip_len;
> +		}
> +	}
> +
> +	return 0;
> +}
> +


> @@ -492,11 +607,13 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>  					 resource_size_t *start_out,
>  					 resource_size_t *skip_out)
>  {
> +	resource_size_t free_ram_start, free_pmem_start, free_dc_start;
>  	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> -	resource_size_t free_ram_start, free_pmem_start;
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	struct device *dev = &cxled->cxld.dev;

There is one existing (I think) call to dev_dbg(cxled_dev(cxled), ...)
in this function.  The two should be consistent: either use that form here
as well, or convert that one case to use dev.

>  	resource_size_t start, avail, skip;
>  	struct resource *p, *last;
> +	int index;
>  
>  	lockdep_assert_held(&cxl_dpa_rwsem);
>  
> @@ -514,6 +631,20 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>  	else
>  		free_pmem_start = cxlds->pmem_res.start;
>  
> +	/*
> +	 * Limit each decoder to a single DC region to map memory with
> +	 * different DSMAS entry.
> +	 */
> +	index = dc_mode_to_region_index(cxled->mode);
> +	if (index >= 0) {
> +		if (cxlds->dc_res[index].child) {
> +			dev_err(dev, "Cannot allocate DPA from DC Region: %d\n",
> +				index);
> +			return -EINVAL;
> +		}
> +		free_dc_start = cxlds->dc_res[index].start;
> +	}
> +
>  	if (cxled->mode == CXL_DECODER_RAM) {
>  		start = free_ram_start;
>  		avail = cxlds->ram_res.end - start + 1;
> @@ -535,6 +666,29 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
>  		else
>  			skip_end = start - 1;
>  		skip = skip_end - skip_start + 1;
> +	} else if (cxl_decoder_mode_is_dc(cxled->mode)) {
> +		resource_size_t skip_start, skip_end;
> +
> +		start = free_dc_start;
> +		avail = cxlds->dc_res[index].end - start + 1;
> +		if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)

Previous patch used !resource_size().
I prefer comparing with 0 like you have here, but whichever is chosen, things
should be consistent.

...



* Re: [PATCH RFC v2 07/18] cxl/mem: Expose device dynamic capacity configuration
  2023-08-29  5:20 ` [PATCH RFC v2 07/18] cxl/mem: Expose device dynamic capacity configuration ira.weiny
@ 2023-08-29 15:14   ` Jonathan Cameron
  2023-09-05 17:55     ` Fan Ni
  2023-09-05 20:45     ` Ira Weiny
  2023-08-30 22:46   ` Dave Jiang
  1 sibling, 2 replies; 97+ messages in thread
From: Jonathan Cameron @ 2023-08-29 15:14 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:20:58 -0700
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> To properly configure CXL regions on Dynamic Capacity Devices (DCD),
> user space will need to know the details of the DC Regions available on
> a device.
> 
> Expose driver dynamic capacity configuration through sysfs
> attributes.
> 
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
One trivial comment inline.  I wondered a bit if it would
be better to not present dc at all on devices that don't support
dynamic capacity, but for now there isn't an elegant way to do that
(some discussions and patches are flying around however so maybe this
 will be resolved before this series merges giving us that elegant
 option).

With commented code tidied up
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


> ---
> Changes for v2:
> [iweiny: Rebased on latest master/type2 work]
> [iweiny: add documentation for sysfs entries]
> [iweiny: s/dc_regions_count/region_count/]
> [iweiny: s/dcY_size/regionY_size/]
> [alison: change size format to %#llx]
> [iweiny: change count format to %d]
> [iweiny: Formatting updates]
> [iweiny: Fix crash when device is not a mem device: found with cxl-test]
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl | 17 ++++++++
>  drivers/cxl/core/memdev.c               | 77 +++++++++++++++++++++++++++++++++
>  2 files changed, 94 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 2268ffcdb604..aa65dc5b4e13 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -37,6 +37,23 @@ Description:
>  		identically named field in the Identify Memory Device Output
>  		Payload in the CXL-2.0 specification.
>  
> +What:		/sys/bus/cxl/devices/memX/dc/region_count
> +Date:		July, 2023
> +KernelVersion:	v6.6
> +Contact:	linux-cxl@vger.kernel.org
> +Description:
> +		(RO) Number of Dynamic Capacity (DC) regions supported on the
> +		device.  May be 0 if the device does not support Dynamic
> +		Capacity.
> +
> +What:		/sys/bus/cxl/devices/memX/dc/regionY_size
> +Date:		July, 2023
> +KernelVersion:	v6.6
> +Contact:	linux-cxl@vger.kernel.org
> +Description:
> +		(RO) Size of the Dynamic Capacity (DC) region Y.  Only
> +		available on devices which support DC and only for those
> +		region indexes supported by the device.
>  
>  What:		/sys/bus/cxl/devices/memX/serial
>  Date:		January, 2022
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index 492486707fd0..397262e0ebd2 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -101,6 +101,20 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
>  static struct device_attribute dev_attr_pmem_size =
>  	__ATTR(size, 0444, pmem_size_show, NULL);
>  
> +static ssize_t region_count_show(struct device *dev, struct device_attribute *attr,
> +				 char *buf)
> +{
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +	int len = 0;
> +
> +	len = sysfs_emit(buf, "%d\n", mds->nr_dc_region);
> +	return len;

This can simply be:

	return sysfs_emit(buf, "%d\n", mds->nr_dc_region);
> +}
> +
> +struct device_attribute dev_attr_region_count =
> +	__ATTR(region_count, 0444, region_count_show, NULL);
> +
>  static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
>  			   char *buf)
>  {
> @@ -454,6 +468,62 @@ static struct attribute *cxl_memdev_security_attributes[] = {
>  	NULL,
>  };
>  
> +static ssize_t show_size_regionN(struct cxl_memdev *cxlmd, char *buf, int pos)
> +{
> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +
> +	return sysfs_emit(buf, "%#llx\n", mds->dc_region[pos].decode_len);
> +}
> +
> +#define REGION_SIZE_ATTR_RO(n)						\
> +static ssize_t region##n##_size_show(struct device *dev,		\
> +				     struct device_attribute *attr,	\
> +				     char *buf)				\
> +{									\
> +	return show_size_regionN(to_cxl_memdev(dev), buf, (n));		\
> +}									\
> +static DEVICE_ATTR_RO(region##n##_size)
> +REGION_SIZE_ATTR_RO(0);
> +REGION_SIZE_ATTR_RO(1);
> +REGION_SIZE_ATTR_RO(2);
> +REGION_SIZE_ATTR_RO(3);
> +REGION_SIZE_ATTR_RO(4);
> +REGION_SIZE_ATTR_RO(5);
> +REGION_SIZE_ATTR_RO(6);
> +REGION_SIZE_ATTR_RO(7);
> +
> +static struct attribute *cxl_memdev_dc_attributes[] = {
> +	&dev_attr_region0_size.attr,
> +	&dev_attr_region1_size.attr,
> +	&dev_attr_region2_size.attr,
> +	&dev_attr_region3_size.attr,
> +	&dev_attr_region4_size.attr,
> +	&dev_attr_region5_size.attr,
> +	&dev_attr_region6_size.attr,
> +	&dev_attr_region7_size.attr,
> +	&dev_attr_region_count.attr,
> +	NULL,
> +};
> +
> +static umode_t cxl_dc_visible(struct kobject *kobj, struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +
> +	/* Not a memory device */
> +	if (!mds)
> +		return 0;
> +
> +	if (a == &dev_attr_region_count.attr)
> +		return a->mode;
> +
> +	if (n < mds->nr_dc_region)
> +		return a->mode;
> +
> +	return 0;
> +}
> +
>  static umode_t cxl_memdev_visible(struct kobject *kobj, struct attribute *a,
>  				  int n)
>  {
> @@ -482,11 +552,18 @@ static struct attribute_group cxl_memdev_security_attribute_group = {
>  	.attrs = cxl_memdev_security_attributes,
>  };
>  
> +static struct attribute_group cxl_memdev_dc_attribute_group = {
> +	.name = "dc",
> +	.attrs = cxl_memdev_dc_attributes,
> +	.is_visible = cxl_dc_visible,
> +};
> +
>  static const struct attribute_group *cxl_memdev_attribute_groups[] = {
>  	&cxl_memdev_attribute_group,
>  	&cxl_memdev_ram_attribute_group,
>  	&cxl_memdev_pmem_attribute_group,
>  	&cxl_memdev_security_attribute_group,
> +	&cxl_memdev_dc_attribute_group,
>  	NULL,
>  };
>  
> 


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 08/18] cxl/region: Add Dynamic Capacity CXL region support
  2023-08-29  5:20 ` [PATCH RFC v2 08/18] cxl/region: Add Dynamic Capacity CXL region support Ira Weiny
@ 2023-08-29 15:19   ` Jonathan Cameron
  2023-08-30 23:27   ` Dave Jiang
  2023-09-05 21:09   ` Fan Ni
  2 siblings, 0 replies; 97+ messages in thread
From: Jonathan Cameron @ 2023-08-29 15:19 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:20:59 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> CXL devices optionally support dynamic capacity.  CXL Regions must be
> configured correctly to access this capacity.  Similar to ram and pmem
> partitions, DC Regions represent different partitions of the DPA space.
> 
> Interleaving is deferred due to the complexity of managing extents on
> multiple devices at the same time.  However, there is nothing which
> directly prevents interleave support at this time.  The check allows
> for early rejection.
> 
> To maintain backwards compatibility with older software, CXL regions
> need a default DAX device to hold the reference for the region until it
> is deleted.
> 
> Add create_dc_region sysfs entry to create DC regions.  Share the logic
> of devm_cxl_add_dax_region() and region_is_system_ram().  Special case
> DC capable CXL regions to create a 0 sized seed DAX device until others
> can be created on dynamic space later.
> 
> Flag dax_regions to indicate 0 capacity available until dax_region
> extents are supported by the region.
> 
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
LGTM
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>




* Re: [PATCH RFC v2 09/18] cxl/mem: Read extents on memory device discovery
  2023-08-29  5:21 ` [PATCH RFC v2 09/18] cxl/mem: Read extents on memory device discovery Ira Weiny
@ 2023-08-29 15:26   ` Jonathan Cameron
  2023-08-30  0:16     ` Ira Weiny
  2023-09-05 21:41     ` Ira Weiny
  0 siblings, 2 replies; 97+ messages in thread
From: Jonathan Cameron @ 2023-08-29 15:26 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:21:00 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> When a Dynamic Capacity Device (DCD) is realized, some extents may
> already be available within the DC Regions.  This can happen if the host
> has accepted extents and been rebooted or any other time the host driver
> software has become out of sync with the device hardware.
> 
> Read the available extents during probe and store them for later
> use.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
A few minor comments inline.

Thanks,

Jonathan

> ---
> Change for v2:
> [iweiny: new patch]
> ---
>  drivers/cxl/core/mbox.c | 195 ++++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/cxlmem.h    |  36 +++++++++
>  drivers/cxl/pci.c       |   4 +
>  3 files changed, 235 insertions(+)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index d769814f80e2..9b08c40ef484 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -824,6 +824,37 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)

...

> +static int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> +				     unsigned int *extent_gen_num)
> +{
> +	struct cxl_mbox_get_dc_extent get_dc_extent;
> +	struct cxl_mbox_dc_extents dc_extents;
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	unsigned int count;
> +	int rc;
> +
> +	/* Check GET_DC_EXTENT_LIST is supported by device */
> +	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
> +		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
> +		return 0;
> +	}
> +
> +	get_dc_extent = (struct cxl_mbox_get_dc_extent) {
> +		.extent_cnt = cpu_to_le32(0),
> +		.start_extent_index = cpu_to_le32(0),
> +	};
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> +		.payload_in = &get_dc_extent,
> +		.size_in = sizeof(get_dc_extent),
> +		.size_out = mds->payload_size,

If all you are after is the count, then size_out can be a lot smaller than that
I think as we know it can't return any extents.

> +		.payload_out = &dc_extents,
> +		.min_out = 1,
> +	};
> +
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	if (rc < 0)
> +		return rc;
> +
> +	count = le32_to_cpu(dc_extents.total_extent_cnt);
> +	*extent_gen_num = le32_to_cpu(dc_extents.extent_list_num);
> +
> +	return count;
> +}
> +
> +static int cxl_dev_get_dc_extents(struct cxl_memdev_state *mds,
> +				  unsigned int start_gen_num,
> +				  unsigned int exp_cnt)
> +{
> +	struct cxl_mbox_dc_extents *dc_extents;
> +	unsigned int start_index, total_read;
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	int retry = 3;

Why 3?

> +	int rc;
> +
> +	/* Check GET_DC_EXTENT_LIST is supported by device */
> +	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
> +		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
> +		return 0;
> +	}
> +
> +	dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);

Maybe __free magic would simplify this enough to be useful.

> +	if (!dc_extents)
> +		return -ENOMEM;
> +
> +reset:
> +	total_read = 0;
> +	start_index = 0;
> +	do {
> +		unsigned int nr_ext, total_extent_cnt, gen_num;
> +		struct cxl_mbox_get_dc_extent get_dc_extent;
> +
> +		get_dc_extent = (struct cxl_mbox_get_dc_extent) {
> +			.extent_cnt = exp_cnt - start_index,
> +			.start_extent_index = start_index,
> +		};
> +		
> +		mbox_cmd = (struct cxl_mbox_cmd) {
> +			.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> +			.payload_in = &get_dc_extent,
> +			.size_in = sizeof(get_dc_extent),
> +			.size_out = mds->payload_size,
> +			.payload_out = dc_extents,
> +			.min_out = 1,
> +		};
> +		
> +		rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +		if (rc < 0)
> +			goto out;
> +		
> +		nr_ext = le32_to_cpu(dc_extents->ret_extent_cnt);
> +		total_read += nr_ext;
> +		total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> +		gen_num = le32_to_cpu(dc_extents->extent_list_num);
> +
> +		dev_dbg(dev, "Get extent list count:%d generation Num:%d\n",
> +			total_extent_cnt, gen_num);
> +
> +		if (gen_num != start_gen_num || exp_cnt != total_extent_cnt) {
> +			dev_err(dev, "Extent list changed while reading; %u != %u : %u != %u\n",
> +				gen_num, start_gen_num, exp_cnt, total_extent_cnt);
> +			if (retry--)
> +				goto reset;
> +			return -EIO;
> +		}
> +		
> +		for (int i = 0; i < nr_ext ; i++) {
> +			dev_dbg(dev, "Storing extent %d/%d\n",
> +				start_index + i, exp_cnt);
> +			rc = cxl_store_dc_extent(mds, &dc_extents->extent[i]);
> +			if (rc)
> +				goto out;
> +		}
> +
> +		start_index += nr_ext;
> +	} while (exp_cnt > total_read);
> +
> +out:
> +	kvfree(dc_extents);
> +	return rc;
> +}




* Re: [PATCH RFC v2 10/18] cxl/mem: Handle DCD add and release capacity events.
  2023-08-29  5:21 ` [PATCH RFC v2 10/18] cxl/mem: Handle DCD add and release capacity events Ira Weiny
@ 2023-08-29 15:59   ` Jonathan Cameron
  2023-09-05 23:49     ` Ira Weiny
  2023-08-31 17:28   ` Dave Jiang
  1 sibling, 1 reply; 97+ messages in thread
From: Jonathan Cameron @ 2023-08-29 15:59 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:21:01 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> A Dynamic Capacity Device (DCD) utilizes events to signal the host about
> the changes to the allocation of Dynamic Capacity (DC) extents. The
> device communicates the state of DC extents through an extent list that
> describes the starting DPA, length, and meta data of the blocks the host
> can access.
> 
> Process the dynamic capacity add and release events.  The addition or
> removal of extents can occur at any time.  Adding asynchronous memory is
> straightforward.  Also remember the host is under no obligation to
> respond to a release event until it is done with the memory.  Introduce
> extent krefs to handle the delay of extent release.
> 
> In the case of a force removal, access to the memory will fail and may
> cause a crash.  However, the extent tracking object is preserved for the
> region to safely tear down as long as the memory is not accessed.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Minor stuff inline.


> +static int cxl_prepare_ext_list(struct cxl_mbox_dc_response **res,
> +				int *n, struct range *extent)
> +{
> +	struct cxl_mbox_dc_response *dc_res;
> +	unsigned int size;
> +
> +	if (!extent)
> +		size = struct_size(dc_res, extent_list, 0);

This is confusing: if you did have *n > 0 I'd expect this to just
not extend the list rather than shorten it.  I guess that never
happens, but locally it looks odd.

Maybe just handle that case in a separate function as it doesn't
share much code with the case where there is an extent and I would
assume we always know at the caller which one we want.


> +	else
> +		size = struct_size(dc_res, extent_list, *n + 1);

Might be clearer with a local variable for the number of extents.

extents_count = *n;

if (extent)
	extents_count++;

size = struct_size(dc_res, extent_list, extents_count);

Though I'm not sure that really helps.  Maybe this will just need
to be a little confusing :)

> +
> +	dc_res = krealloc(*res, size, GFP_KERNEL);
> +	if (!dc_res)
> +		return -ENOMEM;
> +
> +	if (extent) {
> +		dc_res->extent_list[*n].dpa_start = cpu_to_le64(extent->start);
> +		memset(dc_res->extent_list[*n].reserved, 0, 8);
> +		dc_res->extent_list[*n].length = cpu_to_le64(range_len(extent));
> +		(*n)++;
> +	}
> +
> +	*res = dc_res;
> +	return 0;
> +}

> +
> +/* Returns 0 if the event was handled successfully. */
> +static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
> +					struct cxl_event_record_raw *rec)
> +{
> +	struct dcd_event_dyn_cap *record = (struct dcd_event_dyn_cap *)rec;
> +	uuid_t *id = &rec->hdr.id;
> +	int rc;
> +
> +	if (!uuid_equal(id, &dc_event_uuid))
> +		return -EINVAL;
> +
> +	switch (record->data.event_type) {
> +	case DCD_ADD_CAPACITY:
> +		rc = cxl_handle_dcd_add_event(mds, &record->data.extent);
> +		break;

I guess it might not be consistent with local style...
		return cxl_handle_dcd_add_event()  etc

> +	case DCD_RELEASE_CAPACITY:
> +        case DCD_FORCED_CAPACITY_RELEASE:
> +		rc = cxl_handle_dcd_release_event(mds, &record->data.extent);
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	return rc;
> +}
> +




* Re: [PATCH RFC v2 11/18] cxl/region: Expose DC extents on region driver load
  2023-08-29  5:21 ` [PATCH RFC v2 11/18] cxl/region: Expose DC extents on region driver load Ira Weiny
@ 2023-08-29 16:20   ` Jonathan Cameron
  2023-09-06  3:36     ` Ira Weiny
  2023-08-31 18:38   ` Dave Jiang
  1 sibling, 1 reply; 97+ messages in thread
From: Jonathan Cameron @ 2023-08-29 16:20 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:21:02 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> Ultimately user space must associate Dynamic Capacity (DC) extents with
> DAX devices.  Remember also that DCD extents may have been accepted
> prior to regions being created and must have references held until
> all higher level regions and DAX devices are done with the memory.
> 
> On CXL region driver load scan existing device extents and create CXL
> DAX region extents as needed.
> 
> Create abstractions for the extents to be used in DAX region.  This
> includes a generic interface to take proper references on the lower
> level CXL region extents.
> 
> Also maintain separate objects for the DAX region extent device vs the
> DAX region extent.  The DAX region extent device has a shorter life span
> which corresponds to the removal of an extent while a DAX device is
> still using it.  In this case an extent continues to exist whilst the
> ability to create new DAX devices on that extent is prevented.
> 
> NOTE: Without interleaving, the device, CXL region, and DAX region
> extents have a 1:1:1 relationship.  Future support for interleaving will
> maintain a 1:N relationship between CXL region extents and the hardware
> extents.
> 
> While the ability to create DAX devices on an extent exists, expose the
> necessary details of DAX region extents by creating a device with the
> following sysfs entries.
> 
> /sys/bus/cxl/devices/dax_regionX/extentY
> /sys/bus/cxl/devices/dax_regionX/extentY/length
> /sys/bus/cxl/devices/dax_regionX/extentY/label
> 
> Label is a rough analogy to the DC extent tag.  As such the DC extent
> tag is used to initially populate the label.  However, the label is made
> writeable so that it can be adjusted in the future when forming a DAX
> device.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 

Trivial stuff inline.



> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index 27cf2daaaa79..4dab52496c3f 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -5,6 +5,7 @@
>  #ifndef __DAX_PRIVATE_H__
>  #define __DAX_PRIVATE_H__
>  
> +#include <linux/pgtable.h>
>  #include <linux/device.h>
>  #include <linux/cdev.h>
>  #include <linux/idr.h>
> @@ -40,6 +41,58 @@ struct dax_region {
>  	struct device *youngest;
>  };
>  
> +/*
/**

as it's valid kernel doc so no disadvantage really.

> + * struct dax_region_extent - extent data defined by the low level region
> + * driver.
> + * @private_data: lower level region driver data
> + * @ref: track number of dax devices which are using this extent
> + * @get: get reference to low level data
> + * @put: put reference to low level data

I'd like to understand when these are optional - perhaps comment on that?

> + */
> +struct dax_region_extent {
> +	void *private_data;
> +	struct kref ref;
> +	void (*get)(struct dax_region_extent *dr_extent);
> +	void (*put)(struct dax_region_extent *dr_extent);
> +};
> +
> +static inline void dr_extent_get(struct dax_region_extent *dr_extent)
> +{
> +	if (dr_extent->get)
> +		dr_extent->get(dr_extent);
> +}
> +
> +static inline void dr_extent_put(struct dax_region_extent *dr_extent)
> +{
> +	if (dr_extent->put)
> +		dr_extent->put(dr_extent);
> +}
> +
> +#define DAX_EXTENT_LABEL_LEN 64

blank line here.

> +/**
> + * struct dax_reg_ext_dev - Device object to expose extent information
> + * @dev: device representing this extent
> + * @dr_extent: reference back to private extent data
> + * @offset: offset of this extent
> + * @length: size of this extent
> + * @label: identifier to group extents
> + */
> +struct dax_reg_ext_dev {
> +	struct device dev;
> +	struct dax_region_extent *dr_extent;
> +	resource_size_t offset;
> +	resource_size_t length;
> +	char label[DAX_EXTENT_LABEL_LEN];
> +};



* Re: [PATCH RFC v2 12/18] cxl/region: Notify regions of DC changes
  2023-08-29  5:21 ` [PATCH RFC v2 12/18] cxl/region: Notify regions of DC changes Ira Weiny
@ 2023-08-29 16:40   ` Jonathan Cameron
  2023-09-06  4:00     ` Ira Weiny
  2023-09-18 13:56   ` Jørgen Hansen
  1 sibling, 1 reply; 97+ messages in thread
From: Jonathan Cameron @ 2023-08-29 16:40 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:21:03 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> In order for a user to use dynamic capacity effectively they need to
> know when dynamic capacity is available.  Thus when Dynamic Capacity
> (DC) extents are added or removed by a DC device the regions affected
> need to be notified.  Ultimately the DAX region uses the memory
> associated with DC extents.  However, remember that CXL DAX regions
> maintain any interleave details between devices.
> 
> When a DCD event occurs, iterate all CXL endpoint decoders and notify
> regions which contain the endpoints affected by the event.  In turn
> notify the DAX regions of the changes to the DAX region extents.
> 
> For now interleave is handled by creating simple 1:1 mappings between
> the CXL DAX region and DAX region layers.  Future implementations will
> need to resolve when to actually surface a DAX region extent and pass
> the notification along.
> 
> Remember that adding capacity is safe because there is no chance of the
> memory being in use.  Also remember at this point releasing capacity is
> straight forward because DAX devices do not yet have references to the
> extents.  Future patches will handle that complication.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 

A few trivial comments on this.  There's a lot here, so I'll take a
closer look at some point after doing a light pass over the rest of the series.





> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 80cffa40e91a..d3c4c9c87392 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -104,6 +104,55 @@ static int cxl_debugfs_poison_clear(void *data, u64 dpa)
>  DEFINE_DEBUGFS_ATTRIBUTE(cxl_poison_clear_fops, NULL,
>  			 cxl_debugfs_poison_clear, "%llx\n");
>  
> +static int match_ep_decoder_by_range(struct device *dev, void *data)
> +{
> +	struct cxl_dc_extent_data *extent = data;
> +	struct cxl_endpoint_decoder *cxled;
> +
> +	if (!is_endpoint_decoder(dev))
> +		return 0;

blank line

> +	cxled = to_cxl_endpoint_decoder(dev);
> +	return cxl_dc_extent_in_ed(cxled, extent);
> +}
> +
> +static struct cxl_endpoint_decoder *cxl_find_ed(struct cxl_memdev_state *mds,
> +						struct cxl_dc_extent_data *extent)
> +{
> +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> +	struct cxl_port *endpoint = cxlmd->endpoint;
> +	struct device *dev;
> +
> +	dev = device_find_child(&endpoint->dev, extent,
> +				match_ep_decoder_by_range);
> +	if (!dev) {
> +		dev_dbg(mds->cxlds.dev, "Extent DPA:%llx LEN:%llx not mapped\n",
> +			extent->dpa_start, extent->length);
> +		return NULL;
> +	}
> +
> +	return to_cxl_endpoint_decoder(dev);
> +}
> +
> +static int cxl_mem_notify(struct device *dev, struct cxl_drv_nd *nd)
> +{
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +	struct cxl_endpoint_decoder *cxled;
> +	struct cxl_dc_extent_data *extent;
> +	int rc = 0;
> +
> +	extent = nd->extent;
> +	dev_dbg(dev, "notify DC action %d DPA:%llx LEN:%llx\n",
> +		nd->event, extent->dpa_start, extent->length);
> +
> +	cxled = cxl_find_ed(mds, extent);
> +	if (!cxled)
> +		return 0;
Blank line.

> +	rc = cxl_ed_notify_extent(cxled, nd);
> +	put_device(&cxled->cxld.dev);
> +	return rc;
> +}
> +
>  static int cxl_mem_probe(struct device *dev)
>  {
>  	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> @@ -247,6 +296,7 @@ __ATTRIBUTE_GROUPS(cxl_mem);
>  static struct cxl_driver cxl_mem_driver = {
>  	.name = "cxl_mem",
>  	.probe = cxl_mem_probe,
> +	.notify = cxl_mem_notify,
>  	.id = CXL_DEVICE_MEMORY_EXPANDER,
>  	.drv = {
>  		.dev_groups = cxl_mem_groups,
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 057b00b1d914..44cbd28668f1 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -59,6 +59,29 @@ static int cxl_dax_region_create_extent(struct dax_region *dax_region,
>  	return 0;
>  }
>  
> +static int cxl_dax_region_add_extent(struct cxl_dax_region *cxlr_dax,
> +				     struct cxl_dr_extent *cxl_dr_ext)
> +{

Why not have this helper in the earlier patch that introduced the code
this is factoring out?  That would reduce churn in the set without much
hurting the readability of that patch.

> +	/*
> +	 * get not zero is important because this is racing with the
> +	 * region driver which is racing with the memory device which
> +	 * could be removing the extent at the same time.
> +	 */
> +	if (cxl_dr_extent_get_not_zero(cxl_dr_ext)) {
> +		struct dax_region *dax_region;
> +		int rc;
> +
> +		dax_region = dev_get_drvdata(&cxlr_dax->dev);
> +		dev_dbg(&cxlr_dax->dev, "Creating HPA:%llx LEN:%llx\n",
> +			cxl_dr_ext->hpa_offset, cxl_dr_ext->hpa_length);
> +		rc = cxl_dax_region_create_extent(dax_region, cxl_dr_ext);
> +		cxl_dr_extent_put(cxl_dr_ext);
> +		if (rc)
> +			return rc;
> +	}
> +	return 0;
Perhaps flip logic
	if (!cxl_dr_extent_get_not_zero())
		return 0;

etc to reduce the code indent.
> +}
> +
>  static int cxl_dax_region_create_extents(struct cxl_dax_region *cxlr_dax)
>  {
>  	struct cxl_dr_extent *cxl_dr_ext;
> @@ -66,27 +89,68 @@ static int cxl_dax_region_create_extents(struct cxl_dax_region *cxlr_dax)
>  
>  	dev_dbg(&cxlr_dax->dev, "Adding extents\n");
>  	xa_for_each(&cxlr_dax->extents, index, cxl_dr_ext) {
> -		/*
> -		 * get not zero is important because this is racing with the
> -		 * region driver which is racing with the memory device which
> -		 * could be removing the extent at the same time.
> -		 */
> -		if (cxl_dr_extent_get_not_zero(cxl_dr_ext)) {
> -			struct dax_region *dax_region;
> -			int rc;
> -
> -			dax_region = dev_get_drvdata(&cxlr_dax->dev);
> -			dev_dbg(&cxlr_dax->dev, "Found OFF:%llx LEN:%llx\n",
> -				cxl_dr_ext->hpa_offset, cxl_dr_ext->hpa_length);
> -			rc = cxl_dax_region_create_extent(dax_region, cxl_dr_ext);
> -			cxl_dr_extent_put(cxl_dr_ext);
> -			if (rc)
> -				return rc;
> -		}
> +		int rc;
> +
> +		rc = cxl_dax_region_add_extent(cxlr_dax, cxl_dr_ext);
> +		if (rc)
> +			return rc;
>  	}
>  	return 0;
>  }
>  
> +static int match_cxl_dr_extent(struct device *dev, void *data)
> +{
> +	struct dax_reg_ext_dev *dr_reg_ext_dev;
> +	struct dax_region_extent *dr_extent;
> +
> +	if (!is_dr_ext_dev(dev))
> +		return 0;
> +
> +	dr_reg_ext_dev = to_dr_ext_dev(dev);
> +	dr_extent = dr_reg_ext_dev->dr_extent;
> +	return data == dr_extent->private_data;
> +}
> +
> +static int cxl_dax_region_rm_extent(struct cxl_dax_region *cxlr_dax,
> +				    struct cxl_dr_extent *cxl_dr_ext)
> +{
> +	struct dax_reg_ext_dev *dr_reg_ext_dev;
> +	struct dax_region *dax_region;
> +	struct device *dev;
> +
> +	dev = device_find_child(&cxlr_dax->dev, cxl_dr_ext,
> +				match_cxl_dr_extent);
> +	if (!dev)
> +		return -EINVAL;

blank line.

> +	dr_reg_ext_dev = to_dr_ext_dev(dev);
> +	put_device(dev);
> +	dax_region = dev_get_drvdata(&cxlr_dax->dev);
> +	dax_region_ext_del_dev(dax_region, dr_reg_ext_dev);
blank line

> +	return 0;
> +}
> +
> +static int cxl_dax_region_notify(struct device *dev,
> +				 struct cxl_drv_nd *nd)
> +{
> +	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
> +	struct cxl_dr_extent *cxl_dr_ext = nd->cxl_dr_ext;
> +	int rc = 0;
> +
> +	switch (nd->event) {
> +	case DCD_ADD_CAPACITY:
> +		rc = cxl_dax_region_add_extent(cxlr_dax, cxl_dr_ext);
> +		break;

Early returns in here will perhaps make this more readable and definitely
make it more compact.

> +	case DCD_RELEASE_CAPACITY:
> +	case DCD_FORCED_CAPACITY_RELEASE:
> +		rc = cxl_dax_region_rm_extent(cxlr_dax, cxl_dr_ext);
> +		break;
> +	default:
> +		dev_err(&cxlr_dax->dev, "Unknown DC event %d\n", nd->event);
> +		break;
> +	}
> +	return rc;
> +}
> +
>  static int cxl_dax_region_probe(struct device *dev)
>  {
>  	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
> @@ -134,6 +198,7 @@ static int cxl_dax_region_probe(struct device *dev)
>  static struct cxl_driver cxl_dax_region_driver = {
>  	.name = "cxl_dax_region",
>  	.probe = cxl_dax_region_probe,
> +	.notify = cxl_dax_region_notify,
>  	.id = CXL_DEVICE_DAX_REGION,
>  	.drv = {
>  		.suppress_bind_attrs = true,


* Re: [PATCH RFC v2 15/18] cxl/mem: Trace Dynamic capacity Event Record
  2023-08-29  5:21 ` [PATCH RFC v2 15/18] cxl/mem: Trace Dynamic capacity Event Record ira.weiny
@ 2023-08-29 16:46   ` Jonathan Cameron
  2023-09-06  4:07     ` Ira Weiny
  0 siblings, 1 reply; 97+ messages in thread
From: Jonathan Cameron @ 2023-08-29 16:46 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:21:06 -0700
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> CXL rev 3.0 section 8.2.9.2.1.5 defines the Dynamic Capacity Event
> Record.  Determine if the event read is a Dynamic Capacity event record
> and, if so, trace the record for debugging purposes.
> 
> Add DC trace points to the trace log.

Probably should say why these might be useful...


> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> [iweiny: fixups]
> ---
>  drivers/cxl/core/mbox.c  |  5 ++++
>  drivers/cxl/core/trace.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 70 insertions(+)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 9d9c13e13ecf..9462c34aa1dc 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -939,6 +939,11 @@ static void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>  				(struct cxl_event_mem_module *)record;
>  
>  		trace_cxl_memory_module(cxlmd, type, rec);
> +	} else if (uuid_equal(id, &dc_event_uuid)) {
> +		struct dcd_event_dyn_cap *rec =
> +				(struct dcd_event_dyn_cap *)record;
> +
> +		trace_cxl_dynamic_capacity(cxlmd, type, rec);
>  	} else {
>  		/* For unknown record types print just the header */
>  		trace_cxl_generic_event(cxlmd, type, record);
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index a0b5819bc70b..1899c5cc96b9 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -703,6 +703,71 @@ TRACE_EVENT(cxl_poison,
>  	)
>  );
>  
> +/*
> + * DYNAMIC CAPACITY Event Record - DER
> + *
> + * CXL rev 3.0 section 8.2.9.2.1.5 Table 8-47
> + */
> +
> +#define CXL_DC_ADD_CAPACITY			0x00
> +#define CXL_DC_REL_CAPACITY			0x01
> +#define CXL_DC_FORCED_REL_CAPACITY		0x02
> +#define CXL_DC_REG_CONF_UPDATED			0x03
> +#define show_dc_evt_type(type)	__print_symbolic(type,		\
> +	{ CXL_DC_ADD_CAPACITY,	"Add capacity"},		\
> +	{ CXL_DC_REL_CAPACITY,	"Release capacity"},		\
> +	{ CXL_DC_FORCED_REL_CAPACITY,	"Forced capacity release"},	\
> +	{ CXL_DC_REG_CONF_UPDATED,	"Region Configuration Updated"	} \
> +)
> +
> +TRACE_EVENT(cxl_dynamic_capacity,
> +
> +	TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
> +		 struct dcd_event_dyn_cap  *rec),
> +
> +	TP_ARGS(cxlmd, log, rec),
> +
> +	TP_STRUCT__entry(
> +		CXL_EVT_TP_entry
> +
> +		/* Dynamic capacity Event */
> +		__field(u8, event_type)
> +		__field(u16, hostid)
> +		__field(u8, region_id)
> +		__field(u64, dpa_start)
> +		__field(u64, length)
> +		__array(u8, tag, CXL_DC_EXTENT_TAG_LEN)
> +		__field(u16, sh_extent_seq)
> +	),
> +
> +	TP_fast_assign(
> +		CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
> +
> +		/* Dynamic_capacity Event */
> +		__entry->event_type = rec->data.event_type;
> +
> +		/* DCD event record data */
> +		__entry->hostid = le16_to_cpu(rec->data.host_id);
> +		__entry->region_id = rec->data.region_index;
> +		__entry->dpa_start = le64_to_cpu(rec->data.extent.start_dpa);
> +		__entry->length = le64_to_cpu(rec->data.extent.length);
> +		memcpy(__entry->tag, &rec->data.extent.tag, CXL_DC_EXTENT_TAG_LEN);
> +		__entry->sh_extent_seq = le16_to_cpu(rec->data.extent.shared_extn_seq);
> +	),
> +
> +	CXL_EVT_TP_printk("event_type='%s' host_id='%d' region_id='%d' " \
> +		"starting_dpa=%llx length=%llx tag=%s " \
> +		"shared_extent_sequence=%d",
> +		show_dc_evt_type(__entry->event_type),
> +		__entry->hostid,
> +		__entry->region_id,
> +		__entry->dpa_start,
> +		__entry->length,
> +		__print_hex(__entry->tag, CXL_DC_EXTENT_TAG_LEN),
> +		__entry->sh_extent_seq
> +	)
> +);
> +
>  #endif /* _CXL_EVENTS_H */
>  
>  #define TRACE_INCLUDE_FILE trace
> 


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 01/18] cxl/hdm: Debug, use decoder name function
  2023-08-29 14:03   ` Jonathan Cameron
@ 2023-08-29 21:48     ` Fan Ni
  2023-09-03  2:55     ` Ira Weiny
  1 sibling, 0 replies; 97+ messages in thread
From: Fan Ni @ 2023-08-29 21:48 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Ira Weiny, Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, linux-cxl,
	linux-kernel

On Tue, Aug 29, 2023 at 03:03:20PM +0100, Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:20:52 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
>
> > The decoder enum has a name conversion function defined now.
> >
> > Use that instead of open coding.
> >
> > Suggested-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> >
>
> Perhaps pull this one out so it can go upstream before the rest are ready,
> or could be picked up from here.
>
> Whilst we probably won't see the other decoder modes in here, there
> is no reason why anyone reading the code should have to figure that out.
> As such much better to use the more generic function.
>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>

Agreed. This patch can be pulled out and picked up before the rest.


Reviewed-by: Fan Ni <fan.ni@samsung.com>


> > ---
> > Changes for v2:
> > [iweiny: new patch, split out]
> > ---
> >  drivers/cxl/core/hdm.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index b01a77b67511..a254f79dd4e8 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -550,8 +550,7 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
> >
> >  	if (size > avail) {
> >  		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
> > -			cxled->mode == CXL_DECODER_RAM ? "ram" : "pmem",
> > -			&avail);
> > +			cxl_decoder_mode_name(cxled->mode), &avail);
> >  		rc = -ENOSPC;
> >  		goto out;
> >  	}
> >
>


* Re: [PATCH RFC v2 02/18] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
  2023-08-29  5:20 ` [PATCH RFC v2 02/18] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD) Ira Weiny
  2023-08-29 14:07   ` Jonathan Cameron
@ 2023-08-29 21:49   ` Fan Ni
  2023-08-30 20:33   ` Dave Jiang
  2023-10-24 16:16   ` Jonathan Cameron
  3 siblings, 0 replies; 97+ messages in thread
From: Fan Ni @ 2023-08-29 21:49 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Jonathan Cameron,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma,
	linux-cxl, linux-kernel

On Mon, Aug 28, 2023 at 10:20:53PM -0700, Ira Weiny wrote:
> Per the CXL 3.0 specification software must check the Command Effects
> Log (CEL) to know if a device supports DC.  If the device does support
> DC the specifics of the DC Regions (0-7) are read through the mailbox.
>
> Flag DC Device (DCD) commands in a device if they are supported.
> Subsequent patches will key off these bits to configure a DCD.
>
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>

Reviewed-by: Fan Ni <fan.ni@samsung.com>

> ---
> Changes for v2
> [iweiny: new patch]
> ---
>  drivers/cxl/core/mbox.c | 38 +++++++++++++++++++++++++++++++++++---
>  drivers/cxl/cxlmem.h    | 15 +++++++++++++++
>  2 files changed, 50 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index f052d5f174ee..554ec97a7c39 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -111,6 +111,34 @@ static u8 security_command_sets[] = {
>  	0x46, /* Security Passthrough */
>  };
>
> +static bool cxl_is_dcd_command(u16 opcode)
> +{
> +#define CXL_MBOX_OP_DCD_CMDS 0x48
> +
> +	return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
> +}
> +
> +static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
> +					u16 opcode)
> +{
> +	switch (opcode) {
> +	case CXL_MBOX_OP_GET_DC_CONFIG:
> +		set_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
> +		set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_ADD_DC_RESPONSE:
> +		set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_RELEASE_DC:
> +		set_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
> +		break;
> +	default:
> +		break;
> +	}
> +}
> +
>  static bool cxl_is_security_command(u16 opcode)
>  {
>  	int i;
> @@ -677,9 +705,10 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>  		u16 opcode = le16_to_cpu(cel_entry[i].opcode);
>  		struct cxl_mem_command *cmd = cxl_mem_find_command(opcode);
>
> -		if (!cmd && !cxl_is_poison_command(opcode)) {
> -			dev_dbg(dev,
> -				"Opcode 0x%04x unsupported by driver\n", opcode);
> +		if (!cmd && !cxl_is_poison_command(opcode) &&
> +		    !cxl_is_dcd_command(opcode)) {
> +			dev_dbg(dev, "Opcode 0x%04x unsupported by driver\n",
> +				opcode);
>  			continue;
>  		}
>
> @@ -689,6 +718,9 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>  		if (cxl_is_poison_command(opcode))
>  			cxl_set_poison_cmd_enabled(&mds->poison, opcode);
>
> +		if (cxl_is_dcd_command(opcode))
> +			cxl_set_dcd_cmd_enabled(mds, opcode);
> +
>  		dev_dbg(dev, "Opcode 0x%04x enabled\n", opcode);
>  	}
>  }
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index adfba72445fc..5f2e65204bf9 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -247,6 +247,15 @@ struct cxl_event_state {
>  	struct mutex log_lock;
>  };
>
> +/* Device enabled DCD commands */
> +enum dcd_cmd_enabled_bits {
> +	CXL_DCD_ENABLED_GET_CONFIG,
> +	CXL_DCD_ENABLED_GET_EXTENT_LIST,
> +	CXL_DCD_ENABLED_ADD_RESPONSE,
> +	CXL_DCD_ENABLED_RELEASE,
> +	CXL_DCD_ENABLED_MAX
> +};
> +
>  /* Device enabled poison commands */
>  enum poison_cmd_enabled_bits {
>  	CXL_POISON_ENABLED_LIST,
> @@ -436,6 +445,7 @@ struct cxl_dev_state {
>   *                (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
>   * @mbox_mutex: Mutex to synchronize mailbox access.
>   * @firmware_version: Firmware version for the memory device.
> + * @dcd_cmds: List of DCD commands implemented by memory device
>   * @enabled_cmds: Hardware commands found enabled in CEL.
>   * @exclusive_cmds: Commands that are kernel-internal only
>   * @total_bytes: sum of all possible capacities
> @@ -460,6 +470,7 @@ struct cxl_memdev_state {
>  	size_t lsa_size;
>  	struct mutex mbox_mutex; /* Protects device mailbox and firmware */
>  	char firmware_version[0x10];
> +	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
>  	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
>  	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
>  	u64 total_bytes;
> @@ -525,6 +536,10 @@ enum cxl_opcode {
>  	CXL_MBOX_OP_UNLOCK		= 0x4503,
>  	CXL_MBOX_OP_FREEZE_SECURITY	= 0x4504,
>  	CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE	= 0x4505,
> +	CXL_MBOX_OP_GET_DC_CONFIG	= 0x4800,
> +	CXL_MBOX_OP_GET_DC_EXTENT_LIST	= 0x4801,
> +	CXL_MBOX_OP_ADD_DC_RESPONSE	= 0x4802,
> +	CXL_MBOX_OP_RELEASE_DC		= 0x4803,
>  	CXL_MBOX_OP_MAX			= 0x10000
>  };
>
>
> --
> 2.41.0
>


* Re: [PATCH RFC v2 09/18] cxl/mem: Read extents on memory device discovery
  2023-08-29 15:26   ` Jonathan Cameron
@ 2023-08-30  0:16     ` Ira Weiny
  2023-09-05 21:41     ` Ira Weiny
  1 sibling, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-08-30  0:16 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:21:00 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 

[snip]

I'll go through each review but I had to respond to this one...

> 
> > +		.payload_out = &dc_extents,
> > +		.min_out = 1,
> > +	};
> > +
> > +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> > +	if (rc < 0)
> > +		return rc;
> > +
> > +	count = le32_to_cpu(dc_extents.total_extent_cnt);
> > +	*extent_gen_num = le32_to_cpu(dc_extents.extent_list_num);
> > +
> > +	return count;
> > +}
> > +
> > +static int cxl_dev_get_dc_extents(struct cxl_memdev_state *mds,
> > +				  unsigned int start_gen_num,
> > +				  unsigned int exp_cnt)
> > +{
> > +	struct cxl_mbox_dc_extents *dc_extents;
> > +	unsigned int start_index, total_read;
> > +	struct device *dev = mds->cxlds.dev;
> > +	struct cxl_mbox_cmd mbox_cmd;
> > +	int retry = 3;
> 
> Why 3?
> 

Then shalt thou count to 3, no more, no less...
4 shalt thou not count...
5 is right out...

;-)

Seriously, it seemed like a decent number to try.  I would hope that the
extents are not changing much as the host is booting or the device drivers
are loading.  But since the generation number is there I figured it was
fine to try again.

However, it's been a while since I focused on this patch and as I look at
it now I realize that retrying is going to be a problem anyway.  Some of
the old extents from the previous generation may have been stored, and the
new list is likely to contain the same extents, which would result in
errors later.

I think it is best to remove the retry and just throw an error.

Thanks for catching,
Ira


* Re: [PATCH RFC v2 13/18] dax/bus: Factor out dev dax resize logic
  2023-08-29  5:21 ` [PATCH RFC v2 13/18] dax/bus: Factor out dev dax resize logic Ira Weiny
@ 2023-08-30 11:27   ` Jonathan Cameron
  2023-09-06  4:12     ` Ira Weiny
  2023-08-31 21:48   ` Dave Jiang
  1 sibling, 1 reply; 97+ messages in thread
From: Jonathan Cameron @ 2023-08-30 11:27 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:21:04 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> Dynamic Capacity regions must limit dev dax resources to those areas
> which have extents backing real memory.  Four alternatives were
> considered to manage the intersection of region space and extents:
> 
> 1) Create a single region resource child on region creation which
>    reserves the entire region.  Then as extents are added punch holes in
>    this reservation.  This requires new resource manipulation to punch
>    the holes and still requires an additional iteration over the extent
>    areas which may already have existing dev dax resources used.
> 
> 2) Maintain an ordered xarray of extents which can be queried while
>    processing the resize logic.  The issue is that existing region->res
>    children may artificially limit the allocation size sent to
>    alloc_dev_dax_range().  IE the resource children can't be directly
>    used in the resize logic to find where space in the region is.
> 
> 3) Maintain a separate resource tree with extents.  This option is the
>    same as 2) but with a different data structure.  Most ideally we have
>    some unified representation of the resource tree.
> 
> 4) Create region resource children for each extent.  Manage the dax dev
>    resize logic in the same way as before but use a region child
>    (extent) resource as the parents to find space within each extent.
> 
> Option 4 can leverage the existing resize algorithm to find space within
> the extents.
> 
> In preparation for this change, factor out the dev_dax_resize logic.
> For static regions use dax_region->res as the parent to find space for
> the dax ranges.  Future patches will use the same algorithm with
> individual extent resources as the parent.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Hi Ira,

Some trivial comments on comments, but in general this indeed seems to be doing what you
say and factoring out the static allocation part.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


> ---
>  drivers/dax/bus.c | 128 +++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 79 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index b76e49813a39..ea7ae82b4687 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -817,11 +817,10 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
>  	return 0;
>  }
>  

> -static ssize_t dev_dax_resize(struct dax_region *dax_region,
> -		struct dev_dax *dev_dax, resource_size_t size)
> +/*

/**

Suitable builds will then check this doc matches the function etc
even if this is never included into any of the docs build.

> + * dev_dax_resize_static - Expand the device into the unused portion of the
> + * region. This may involve adjusting the end of an existing resource, or
> + * allocating a new resource.
> + *
> + * @parent: parent resource to allocate this range in.
> + * @dev_dax: DAX device we are creating this range for

Trivial: Doesn't seem to be consistent on . or not

> + * @to_alloc: amount of space to alloc; must be <= space available in @parent
> + *
> + * Return the amount of space allocated or -ERRNO on failure
> + */
> +static ssize_t dev_dax_resize_static(struct resource *parent,
> +				     struct dev_dax *dev_dax,
> +				     resource_size_t to_alloc)



* Re: [PATCH RFC v2 14/18] dax/region: Support DAX device creation on dynamic DAX regions
  2023-08-29  5:21 ` [PATCH RFC v2 14/18] dax/region: Support DAX device creation on dynamic DAX regions Ira Weiny
@ 2023-08-30 11:50   ` Jonathan Cameron
  2023-09-06  4:35     ` Ira Weiny
  0 siblings, 1 reply; 97+ messages in thread
From: Jonathan Cameron @ 2023-08-30 11:50 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:21:05 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> Dynamic Capacity (DC) DAX regions have a list of extents which define
> the memory of the region which is available.
> 
> Now that DAX region extents are fully realized support DAX device
> creation on dynamic regions by adjusting the allocation algorithms
> to account for the extents.  Remember also references must be held on
> the extents until the DAX devices are done with the memory.
> 
> Redefine the region available size to include only extent space.  Reuse
> the size allocation algorithm by defining sub-resources for each extent
> and limiting range allocation to those extents which have space.  Do not
> support direct mapping of DAX devices on dynamic devices.
> 
> Enhance DAX device range objects to hold references on the extents until
> the DAX device is destroyed.
> 
> NOTE: At this time all extents within a region are created equally.
> However, labels are associated with extents which can be used with
> future DAX device labels to group which extents are used.

This sounds like a bad place to start to me, as we are enabling something
that is probably 'wrong' in the long term as opposed to just not enabling it
until we have appropriate support.
I'd argue it is better to just reject any extents with different labels for now.

As this is an RFC meh ;)

Whilst this looks fine to me, I'm rather out of my depth wrt to the DAX
side of things so take that with a pinch of salt.

Jonathan


> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  drivers/dax/bus.c         | 157 +++++++++++++++++++++++++++++++++++++++-------
>  drivers/dax/cxl.c         |  44 +++++++++++++
>  drivers/dax/dax-private.h |   5 ++
>  3 files changed, 182 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index ea7ae82b4687..a9ea6a706702 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c

...


> @@ -1183,7 +1290,7 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
>  	to_alloc = range_len(&r);
>  	if (alloc_is_aligned(dev_dax, to_alloc))
>  		rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
> -					 to_alloc);
> +					 to_alloc, NULL);
>  	device_unlock(dev);
>  	device_unlock(dax_region->dev);
>  
> @@ -1400,8 +1507,10 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
>  	device_initialize(dev);
>  	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
>  
> +	dev_WARN_ONCE(parent, is_dynamic(dax_region) && data->size,
> +		      "Dynamic DAX devices are created initially with 0 size");

dev_info() maybe more appropriate?   Is this common enough that we need the
_ONCE?


>  	rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
> -				 data->size);
> +				 data->size, NULL);
>  	if (rc)
>  		goto err_range;
>  
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 44cbd28668f1..6394a3531e25 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
...


>  static int cxl_dax_region_create_extent(struct dax_region *dax_region,
>  					struct cxl_dr_extent *cxl_dr_ext)
>  {
> @@ -45,11 +80,20 @@ static int cxl_dax_region_create_extent(struct dax_region *dax_region,
>  	/* device manages the dr_extent on success */
>  	kref_init(&dr_extent->ref);
>  
> +	rc = dax_region_add_resource(dax_region, dr_extent,
> +				     cxl_dr_ext->hpa_offset,
> +				     cxl_dr_ext->hpa_length);
> +	if (rc) {
> +		kfree(dr_extent);

goto for these and single unwinding block?

> +		return rc;
> +	}
> +
>  	rc = dax_region_ext_create_dev(dax_region, dr_extent,
>  				       cxl_dr_ext->hpa_offset,
>  				       cxl_dr_ext->hpa_length,
>  				       cxl_dr_ext->label);
>  	if (rc) {
> +		dax_region_rm_resource(dr_extent);
>  		kfree(dr_extent);
as above.

>  		return rc;
>  	}
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index 250babd6e470..ad73b53aa802 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -44,12 +44,16 @@ struct dax_region {
>  /*
>   * struct dax_region_extent - extent data defined by the low level region
>   * driver.
> + * @region: cache of dax_region
> + * @res: cache of resource tree for this extent
>   * @private_data: lower level region driver data

Not sure 'lower level' is well defined here. Is "region driver data"
not enough?

>   * @ref: track number of dax devices which are using this extent
>   * @get: get reference to low level data
>   * @put: put reference to low level data
>   */
>  struct dax_region_extent {
> +	struct dax_region *region;
> +	struct resource *res;
>  	void *private_data;
>  	struct kref ref;
>  	void (*get)(struct dax_region_extent *dr_extent);
> @@ -131,6 +135,7 @@ struct dev_dax {
>  		unsigned long pgoff;
>  		struct range range;
>  		struct dax_mapping *mapping;
> +		struct dax_region_extent *dr_extent;

Huh. Seems that ranges is in the kernel doc but not the
bits that make that up.  Maybe good to add the docs
whilst here?

>  	} *ranges;
>  };
>  
> 



* Re: [PATCH RFC v2 16/18] tools/testing/cxl: Make event logs dynamic
  2023-08-29  5:21 ` [PATCH RFC v2 16/18] tools/testing/cxl: Make event logs dynamic Ira Weiny
@ 2023-08-30 12:11   ` Jonathan Cameron
  2023-09-06 21:15     ` Ira Weiny
  0 siblings, 1 reply; 97+ messages in thread
From: Jonathan Cameron @ 2023-08-30 12:11 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:21:07 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> The test event logs were created as static arrays as an easy way to mock
> events.  Dynamic Capacity Device (DCD) test support requires events be
> created dynamically when extents are created/destroyed.
> 
> Modify the event log storage to be dynamically allocated.  Thus they can
> accommodate the dynamic events required by DCD.  Reuse the static event
> data to create the dynamic events in the new logs without inventing
> complex event injection through the test sysfs.  Simplify the processing
> of the logs by using the event log array index as the handle.  Add a
> lock to manage concurrency to come with DCD extent testing.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Diff did a horrible job on readability of this patch.

Ah well. Comments superficial only.

Jonathan

> ---
>  tools/testing/cxl/test/mem.c | 276 ++++++++++++++++++++++++++-----------------
>  1 file changed, 170 insertions(+), 106 deletions(-)
> 
> diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
> index 51be202fabd0..6a036c8d215d 100644
> --- a/tools/testing/cxl/test/mem.c
> +++ b/tools/testing/cxl/test/mem.c
> @@ -118,18 +118,27 @@ static struct {
>  
>  #define PASS_TRY_LIMIT 3
>  
> -#define CXL_TEST_EVENT_CNT_MAX 15
> +#define CXL_TEST_EVENT_CNT_MAX 17
>  
>  /* Set a number of events to return at a time for simulation.  */
>  #define CXL_TEST_EVENT_CNT 3
>  
> +/*
> + * @next_handle: next handle (index) to be stored to
> + * @cur_handle: current handle (index) to be returned to the user on get_event
> + * @nr_events: total events in this log
> + * @nr_overflow: number of events added past the log size
> + * @lock: protect these state variables
> + * @events: array of pending events to be returned.
> + */
>  struct mock_event_log {
> -	u16 clear_idx;
> -	u16 cur_idx;
> +	u16 next_handle;
> +	u16 cur_handle;
>  	u16 nr_events;
>  	u16 nr_overflow;
> -	u16 overflow_reset;
> -	struct cxl_event_record_raw *events[CXL_TEST_EVENT_CNT_MAX];
> +	rwlock_t lock;
> +	/* 1 extra slot to accommodate that handles can't be 0 */
> +	struct cxl_event_record_raw *events[CXL_TEST_EVENT_CNT_MAX+1];

Spaces around +

>  };
>  

...


>  
> -static void cxl_mock_add_event_logs(struct mock_event_store *mes)
> +/* Create a dynamically allocated event out of a statically defined event. */
> +static void add_event_from_static(struct mock_event_store *mes,
> +				  enum cxl_event_log_type log_type,
> +				  struct cxl_event_record_raw *raw)
> +{
> +	struct device *dev = mes->mds->cxlds.dev;
> +	struct cxl_event_record_raw *rec;
> +
> +	rec = devm_kzalloc(dev, sizeof(*rec), GFP_KERNEL);
> +	if (!rec) {
> +		dev_err(dev, "Failed to alloc event for log\n");
> +		return;
> +	}
> +
> +	memcpy(rec, raw, sizeof(*rec));

devm_kmemdup()?


> +	mes_add_event(mes, log_type, rec);
> +}
> +
> +static void cxl_mock_add_event_logs(struct cxl_mockmem_data *mdata)
>  {
> +	struct mock_event_store *mes = &mdata->mes;
> +	struct device *dev = mes->mds->cxlds.dev;
> +
>  	put_unaligned_le16(CXL_GMER_VALID_CHANNEL | CXL_GMER_VALID_RANK,
>  			   &gen_media.validity_flags);
>  
> @@ -438,43 +475,60 @@ static void cxl_mock_add_event_logs(struct mock_event_store *mes)
>  			   CXL_DER_VALID_BANK | CXL_DER_VALID_COLUMN,
>  			   &dram.validity_flags);
>  
> -	mes_add_event(mes, CXL_EVENT_TYPE_INFO, &maint_needed);
> -	mes_add_event(mes, CXL_EVENT_TYPE_INFO,
> +	dev_dbg(dev, "Generating fake event logs %d\n",
> +		CXL_EVENT_TYPE_INFO);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_INFO, &maint_needed);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_INFO,
>  		      (struct cxl_event_record_raw *)&gen_media);
> -	mes_add_event(mes, CXL_EVENT_TYPE_INFO,
> +	add_event_from_static(mes, CXL_EVENT_TYPE_INFO,
>  		      (struct cxl_event_record_raw *)&mem_module);
>  	mes->ev_status |= CXLDEV_EVENT_STATUS_INFO;
>  
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &maint_needed);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> +	dev_dbg(dev, "Generating fake event logs %d\n",
> +		CXL_EVENT_TYPE_FAIL);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &maint_needed);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL,
> +		      (struct cxl_event_record_raw *)&mem_module);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL,
>  		      (struct cxl_event_record_raw *)&dram);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL,
>  		      (struct cxl_event_record_raw *)&gen_media);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL,
>  		      (struct cxl_event_record_raw *)&mem_module);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL,
>  		      (struct cxl_event_record_raw *)&dram);
>  	/* Overflow this log */
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
>  	mes->ev_status |= CXLDEV_EVENT_STATUS_FAIL;
>  
> -	mes_add_event(mes, CXL_EVENT_TYPE_FATAL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FATAL,
> +	dev_dbg(dev, "Generating fake event logs %d\n",
> +		CXL_EVENT_TYPE_FATAL);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FATAL, &hardware_replace);
> +	add_event_from_static(mes, CXL_EVENT_TYPE_FATAL,
>  		      (struct cxl_event_record_raw *)&dram);
>  	mes->ev_status |= CXLDEV_EVENT_STATUS_FATAL;
>  }
>  
> +static void cxl_mock_event_trigger(struct device *dev)
> +{
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +	struct mock_event_store *mes = &mdata->mes;
> +
> +	cxl_mock_add_event_logs(mdata);
> +	cxl_mem_get_event_records(mes->mds, mes->ev_status);
> +}



* Re: [PATCH RFC v2 17/18] tools/testing/cxl: Add DC Regions to mock mem data
  2023-08-29  5:21 ` [PATCH RFC v2 17/18] tools/testing/cxl: Add DC Regions to mock mem data Ira Weiny
@ 2023-08-30 12:20   ` Jonathan Cameron
  2023-09-06 21:18     ` Ira Weiny
  2023-08-31 23:19   ` Dave Jiang
  1 sibling, 1 reply; 97+ messages in thread
From: Jonathan Cameron @ 2023-08-30 12:20 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:21:08 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> To test DC regions the mock memory devices will need to store
> information about the regions and manage fake extent data.
> 
> Define mock_dc_region information within the mock memory data.  Add
> sysfs entries on the mock device to inject and delete extents.
> 
> The inject format is <start>:<length>:<tag>
> The delete format is <start>
> 
> Add DC mailbox commands to the CEL and implement those commands.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Looks fine to me.  Totally trivial comment inline.

FWIW
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


> +
>  static int mock_gsl(struct cxl_mbox_cmd *cmd)
>  {
>  	if (cmd->size_out < sizeof(mock_gsl_payload))
> @@ -1315,6 +1429,148 @@ static int mock_activate_fw(struct cxl_mockmem_data *mdata,
>  	return -EINVAL;
>  }
>  

Bit inconsistent on whether there are one or two blank lines between functions.

> +static int mock_get_dc_config(struct device *dev,
> +			      struct cxl_mbox_cmd *cmd)
> +{
> +	struct cxl_mbox_get_dc_config *dc_config = cmd->payload_in;
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +	u8 region_requested, region_start_idx, region_ret_cnt;
> +	struct cxl_mbox_dynamic_capacity *resp;
> +
> +	region_requested = dc_config->region_count;
> +	if (NUM_MOCK_DC_REGIONS < region_requested)
> +		region_requested = NUM_MOCK_DC_REGIONS;
> +
> +	if (cmd->size_out < struct_size(resp, region, region_requested))
> +		return -EINVAL;
> +
> +	memset(cmd->payload_out, 0, cmd->size_out);
> +	resp = cmd->payload_out;
> +
> +	region_start_idx = dc_config->start_region_index;
> +	region_ret_cnt = 0;
> +	for (int i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
> +		if (i >= region_start_idx) {
> +			memcpy(&resp->region[region_ret_cnt],
> +				&mdata->dc_regions[i],
> +				sizeof(resp->region[region_ret_cnt]));
> +			region_ret_cnt++;
> +		}
> +	}
> +	resp->avail_region_count = region_ret_cnt;
> +
> +	dev_dbg(dev, "Returning %d dc regions\n", region_ret_cnt);
> +	return 0;
> +}




* Re: [PATCH RFC v2 18/18] tools/testing/cxl: Add Dynamic Capacity events
  2023-08-29  5:21 ` [PATCH RFC v2 18/18] tools/testing/cxl: Add Dynamic Capacity events Ira Weiny
@ 2023-08-30 12:23   ` Jonathan Cameron
  2023-09-06 21:39     ` Ira Weiny
  2023-08-31 23:20   ` Dave Jiang
  1 sibling, 1 reply; 97+ messages in thread
From: Jonathan Cameron @ 2023-08-30 12:23 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:21:09 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> OS software needs to be alerted when new extents arrive on a Dynamic
> Capacity Device (DCD).  On test DCDs extents are added through sysfs.
> 
> Add events on DCD extent injection.  Directly call the event irq
> callback to simulate irqs to process the test extents.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Trivial comments inline.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---
>  tools/testing/cxl/test/mem.c | 57 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 57 insertions(+)
> 
> diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
> index d6041a2145c5..20364fee9df9 100644
> --- a/tools/testing/cxl/test/mem.c
> +++ b/tools/testing/cxl/test/mem.c
> @@ -2008,6 +2008,41 @@ static bool new_extent_valid(struct device *dev, size_t new_start,
>  	return false;
>  }
>  
> +struct dcd_event_dyn_cap dcd_event_rec_template = {
> +	.hdr = {
> +		.id = UUID_INIT(0xca95afa7, 0xf183, 0x4018,
> +				0x8c, 0x2f, 0x95, 0x26, 0x8e, 0x10, 0x1a, 0x2a),
> +		.length = sizeof(struct dcd_event_dyn_cap),
> +	},
> +};
> +
> +static int send_dc_event(struct mock_event_store *mes, enum dc_event type,
> +			 u64 start, u64 length, const char *tag_str)

Arguably it's not sending the event, but rather adding it to the event log and
flicking the irq. So maybe naming needs some thought?

> +{
> +	struct device *dev = mes->mds->cxlds.dev;
> +	struct dcd_event_dyn_cap *dcd_event_rec;
> +
> +	dcd_event_rec = devm_kzalloc(dev, sizeof(*dcd_event_rec), GFP_KERNEL);
> +	if (!dcd_event_rec)
> +		return -ENOMEM;
> +
> +	memcpy(dcd_event_rec, &dcd_event_rec_template, sizeof(*dcd_event_rec));

devm_kmemdup?

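Perhaps something along these lines (untested sketch, just to illustrate the
suggestion; same field names as the patch):

```c
	/* devm_kmemdup() allocates and copies the template in one step,
	 * replacing the devm_kzalloc() + memcpy() pair above (untested).
	 */
	dcd_event_rec = devm_kmemdup(dev, &dcd_event_rec_template,
				     sizeof(*dcd_event_rec), GFP_KERNEL);
	if (!dcd_event_rec)
		return -ENOMEM;
```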
> +	dcd_event_rec->data.event_type = type;
> +	dcd_event_rec->data.extent.start_dpa = cpu_to_le64(start);
> +	dcd_event_rec->data.extent.length = cpu_to_le64(length);
> +	memcpy(dcd_event_rec->data.extent.tag, tag_str,
> +	       min(sizeof(dcd_event_rec->data.extent.tag),
> +		   strlen(tag_str)));
> +
> +	mes_add_event(mes, CXL_EVENT_TYPE_DCD,
> +		      (struct cxl_event_record_raw *)dcd_event_rec);
> +
> +	/* Fake the irq */
> +	cxl_mem_get_event_records(mes->mds, CXLDEV_EVENT_STATUS_DCD);
> +
> +	return 0;
> +}
> +




* Re: [PATCH RFC v2 01/18] cxl/hdm: Debug, use decoder name function
  2023-08-29  5:20 ` [PATCH RFC v2 01/18] cxl/hdm: Debug, use decoder name function Ira Weiny
  2023-08-29 14:03   ` Jonathan Cameron
@ 2023-08-30 20:32   ` Dave Jiang
  1 sibling, 0 replies; 97+ messages in thread
From: Dave Jiang @ 2023-08-30 20:32 UTC (permalink / raw)
  To: Ira Weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel



On 8/28/23 22:20, Ira Weiny wrote:
> The decoder enum has a name conversion function defined now.
> 
> Use that instead of open coding.
> 
> Suggested-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

As others said, send this upstream outside of the series.

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 
> ---
> Changes for v2:
> [iweiny: new patch, split out]
> ---
>   drivers/cxl/core/hdm.c | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index b01a77b67511..a254f79dd4e8 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -550,8 +550,7 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>   
>   	if (size > avail) {
>   		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
> -			cxled->mode == CXL_DECODER_RAM ? "ram" : "pmem",
> -			&avail);
> +			cxl_decoder_mode_name(cxled->mode), &avail);
>   		rc = -ENOSPC;
>   		goto out;
>   	}
> 


* Re: [PATCH RFC v2 02/18] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
  2023-08-29  5:20 ` [PATCH RFC v2 02/18] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD) Ira Weiny
  2023-08-29 14:07   ` Jonathan Cameron
  2023-08-29 21:49   ` Fan Ni
@ 2023-08-30 20:33   ` Dave Jiang
  2023-10-24 16:16   ` Jonathan Cameron
  3 siblings, 0 replies; 97+ messages in thread
From: Dave Jiang @ 2023-08-30 20:33 UTC (permalink / raw)
  To: Ira Weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel



On 8/28/23 22:20, Ira Weiny wrote:
> Per the CXL 3.0 specification, software must check the Command Effects
> Log (CEL) to know if a device supports DC.  If the device does support
> DC the specifics of the DC Regions (0-7) are read through the mailbox.
> 
> Flag DC Device (DCD) commands in a device if they are supported.
> Subsequent patches will key off these bits to configure a DCD.
> 
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 
> ---
> Changes for v2
> [iweiny: new patch]
> ---
>   drivers/cxl/core/mbox.c | 38 +++++++++++++++++++++++++++++++++++---
>   drivers/cxl/cxlmem.h    | 15 +++++++++++++++
>   2 files changed, 50 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index f052d5f174ee..554ec97a7c39 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -111,6 +111,34 @@ static u8 security_command_sets[] = {
>   	0x46, /* Security Passthrough */
>   };
>   
> +static bool cxl_is_dcd_command(u16 opcode)
> +{
> +#define CXL_MBOX_OP_DCD_CMDS 0x48
> +
> +	return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
> +}
> +
> +static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
> +					u16 opcode)
> +{
> +	switch (opcode) {
> +	case CXL_MBOX_OP_GET_DC_CONFIG:
> +		set_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
> +		set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_ADD_DC_RESPONSE:
> +		set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_RELEASE_DC:
> +		set_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
> +		break;
> +	default:
> +		break;
> +	}
> +}
> +
>   static bool cxl_is_security_command(u16 opcode)
>   {
>   	int i;
> @@ -677,9 +705,10 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>   		u16 opcode = le16_to_cpu(cel_entry[i].opcode);
>   		struct cxl_mem_command *cmd = cxl_mem_find_command(opcode);
>   
> -		if (!cmd && !cxl_is_poison_command(opcode)) {
> -			dev_dbg(dev,
> -				"Opcode 0x%04x unsupported by driver\n", opcode);
> +		if (!cmd && !cxl_is_poison_command(opcode) &&
> +		    !cxl_is_dcd_command(opcode)) {
> +			dev_dbg(dev, "Opcode 0x%04x unsupported by driver\n",
> +				opcode);
>   			continue;
>   		}
>   
> @@ -689,6 +718,9 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>   		if (cxl_is_poison_command(opcode))
>   			cxl_set_poison_cmd_enabled(&mds->poison, opcode);
>   
> +		if (cxl_is_dcd_command(opcode))
> +			cxl_set_dcd_cmd_enabled(mds, opcode);
> +
>   		dev_dbg(dev, "Opcode 0x%04x enabled\n", opcode);
>   	}
>   }
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index adfba72445fc..5f2e65204bf9 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -247,6 +247,15 @@ struct cxl_event_state {
>   	struct mutex log_lock;
>   };
>   
> +/* Device enabled DCD commands */
> +enum dcd_cmd_enabled_bits {
> +	CXL_DCD_ENABLED_GET_CONFIG,
> +	CXL_DCD_ENABLED_GET_EXTENT_LIST,
> +	CXL_DCD_ENABLED_ADD_RESPONSE,
> +	CXL_DCD_ENABLED_RELEASE,
> +	CXL_DCD_ENABLED_MAX
> +};
> +
>   /* Device enabled poison commands */
>   enum poison_cmd_enabled_bits {
>   	CXL_POISON_ENABLED_LIST,
> @@ -436,6 +445,7 @@ struct cxl_dev_state {
>    *                (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
>    * @mbox_mutex: Mutex to synchronize mailbox access.
>    * @firmware_version: Firmware version for the memory device.
> + * @dcd_cmds: List of DCD commands implemented by memory device
>    * @enabled_cmds: Hardware commands found enabled in CEL.
>    * @exclusive_cmds: Commands that are kernel-internal only
>    * @total_bytes: sum of all possible capacities
> @@ -460,6 +470,7 @@ struct cxl_memdev_state {
>   	size_t lsa_size;
>   	struct mutex mbox_mutex; /* Protects device mailbox and firmware */
>   	char firmware_version[0x10];
> +	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
>   	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
>   	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
>   	u64 total_bytes;
> @@ -525,6 +536,10 @@ enum cxl_opcode {
>   	CXL_MBOX_OP_UNLOCK		= 0x4503,
>   	CXL_MBOX_OP_FREEZE_SECURITY	= 0x4504,
>   	CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE	= 0x4505,
> +	CXL_MBOX_OP_GET_DC_CONFIG	= 0x4800,
> +	CXL_MBOX_OP_GET_DC_EXTENT_LIST	= 0x4801,
> +	CXL_MBOX_OP_ADD_DC_RESPONSE	= 0x4802,
> +	CXL_MBOX_OP_RELEASE_DC		= 0x4803,
>   	CXL_MBOX_OP_MAX			= 0x10000
>   };
>   
> 


* Re: [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device
  2023-08-29  5:20 ` [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device ira.weiny
  2023-08-29 14:37   ` Jonathan Cameron
@ 2023-08-30 21:01   ` Dave Jiang
  2023-09-05  0:14     ` Ira Weiny
  2023-09-08 20:23     ` Ira Weiny
  2023-08-30 21:44   ` Fan Ni
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 97+ messages in thread
From: Dave Jiang @ 2023-08-30 21:01 UTC (permalink / raw)
  To: ira.weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel



On 8/28/23 22:20, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Devices can optionally support Dynamic Capacity (DC).  These devices are
> known as Dynamic Capacity Devices (DCD).
> 
> Implement the DC (opcode 48XXh) mailbox commands as specified in CXL 3.0
> section 8.2.9.8.9.  Read the DC configuration and store the DC region
> information in the device state.
> 
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Uncapitalize Dynamic in subject

Also, maybe split out the REGION vs DECODER as a prep patch.

DJ

> 
> ---
> Changes for v2
> [iweiny: Rebased to latest master type2 work]
> [jonathan: s/dc/dc_resp/]
> [iweiny: Clean up commit message]
> [iweiny: Clean kernel docs]
> [djiang: Fix up cxl_is_dcd_command]
> [djiang: extra blank line]
> [alison: s/total_capacity/cap/ etc...]
> [alison: keep partition flag with partition structures]
> [alison: reformat untenanted_mem declaration]
> [alison: move 'cmd' definition back]
> [alison: fix comment line length]
> [alison: reverse x-tree]
> [jonathan: fix and adjust CXL_DC_REGION_STRLEN]
> [Jonathan/iweiny: Factor out storing each DC region read from the device]
> [Jonathan: place all dcr initializers together]
> [Jonathan/iweiny: flip around the region DPA order check]
> [jonathan: Account for short read of mailbox command]
> [iweiny: use snprintf for region name]
> [iweiny: use '<nil>' for missing region names]
> [iweiny: factor out struct cxl_dc_region_info]
> [iweiny: Split out reading CEL]
> ---
>   drivers/cxl/core/mbox.c   | 179 +++++++++++++++++++++++++++++++++++++++++++++-
>   drivers/cxl/core/region.c |  75 +++++++++++++------
>   drivers/cxl/cxl.h         |  27 ++++++-
>   drivers/cxl/cxlmem.h      |  55 +++++++++++++-
>   drivers/cxl/pci.c         |   4 ++
>   5 files changed, 314 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 554ec97a7c39..d769814f80e2 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1096,7 +1096,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
>   	if (rc < 0)
>   		return rc;
>   
> -	mds->total_bytes =
> +	mds->static_cap =
>   		le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
>   	mds->volatile_only_bytes =
>   		le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
> @@ -1114,6 +1114,8 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
>   		mds->poison.max_errors = min_t(u32, val, CXL_POISON_LIST_MAX);
>   	}
>   
> +	mds->dc_event_log_size = le16_to_cpu(id.dc_event_log_size);
> +
>   	return 0;
>   }
>   EXPORT_SYMBOL_NS_GPL(cxl_dev_state_identify, CXL);
> @@ -1178,6 +1180,165 @@ int cxl_mem_sanitize(struct cxl_memdev_state *mds, u16 cmd)
>   }
>   EXPORT_SYMBOL_NS_GPL(cxl_mem_sanitize, CXL);
>   
> +static int cxl_dc_save_region_info(struct cxl_memdev_state *mds, int index,
> +				   struct cxl_dc_region_config *region_config)
> +{
> +	struct cxl_dc_region_info *dcr = &mds->dc_region[index];
> +	struct device *dev = mds->cxlds.dev;
> +
> +	dcr->base = le64_to_cpu(region_config->region_base);
> +	dcr->decode_len = le64_to_cpu(region_config->region_decode_length);
> +	dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
> +	dcr->len = le64_to_cpu(region_config->region_length);
> +	dcr->blk_size = le64_to_cpu(region_config->region_block_size);
> +	dcr->dsmad_handle = le32_to_cpu(region_config->region_dsmad_handle);
> +	dcr->flags = region_config->flags;
> +	snprintf(dcr->name, CXL_DC_REGION_STRLEN, "dc%d", index);
> +
> +	/* Check regions are in increasing DPA order */
> +	if (index > 0) {
> +		struct cxl_dc_region_info *prev_dcr = &mds->dc_region[index - 1];
> +
> +		if ((prev_dcr->base + prev_dcr->decode_len) > dcr->base) {
> +			dev_err(dev,
> +				"DPA ordering violation for DC region %d and %d\n",
> +				index - 1, index);
> +			return -EINVAL;
> +		}
> +	}
> +
> +	/* Check the region is 256 MB aligned */
> +	if (!IS_ALIGNED(dcr->base, SZ_256M)) {
> +		dev_err(dev, "DC region %d not aligned to 256MB: %#llx\n",
> +			index, dcr->base);
> +		return -EINVAL;
> +	}
> +
> +	/* Check Region base and length are aligned to block size */
> +	if (!IS_ALIGNED(dcr->base, dcr->blk_size) ||
> +	    !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> +		dev_err(dev, "DC region %d not aligned to %#llx\n", index,
> +			dcr->blk_size);
> +		return -EINVAL;
> +	}
> +
> +	dev_dbg(dev,
> +		"DC region %s DPA: %#llx LEN: %#llx BLKSZ: %#llx\n",
> +		dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
> +
> +	return 0;
> +}
> +
> +/* Returns the number of regions in dc_resp or -ERRNO */
> +static int cxl_get_dc_id(struct cxl_memdev_state *mds, u8 start_region,
> +			 struct cxl_mbox_dynamic_capacity *dc_resp,
> +			 size_t dc_resp_size)
> +{
> +	struct cxl_mbox_get_dc_config get_dc = (struct cxl_mbox_get_dc_config) {
> +		.region_count = CXL_MAX_DC_REGION,
> +		.start_region_index = start_region,
> +	};
> +	struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> +		.payload_in = &get_dc,
> +		.size_in = sizeof(get_dc),
> +		.size_out = dc_resp_size,
> +		.payload_out = dc_resp,
> +		.min_out = 1,
> +	};
> +	struct device *dev = mds->cxlds.dev;
> +	int rc;
> +
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	if (rc < 0)
> +		return rc;
> +
> +	rc = dc_resp->avail_region_count - start_region;
> +
> +	/*
> +	 * The number of regions in the payload may have been truncated due to
> +	 * payload_size limits; if so adjust the count in this query.
> +	 */
> +	if (mbox_cmd.size_out < sizeof(*dc_resp))
> +		rc = CXL_REGIONS_RETURNED(mbox_cmd.size_out);
> +
> +	dev_dbg(dev, "Read %d/%d DC regions\n", rc, dc_resp->avail_region_count);
> +
> +	return rc;
> +}
> +
> +/**
> + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> + *					 information from the device.
> + * @mds: The memory device state
> + *
> + * This will dispatch the get_dynamic_capacity command to the device
> + * and on success populate structures to be exported to sysfs.
> + *
> + * Return: 0 if identify was executed successfully, -ERRNO on error.
> + */
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> +{
> +	struct cxl_mbox_dynamic_capacity *dc_resp;
> +	struct device *dev = mds->cxlds.dev;
> +	size_t dc_resp_size = mds->payload_size;
> +	u8 start_region;
> +	int i, rc = 0;
> +
> +	for (i = 0; i < CXL_MAX_DC_REGION; i++)
> +		snprintf(mds->dc_region[i].name, CXL_DC_REGION_STRLEN, "<nil>");
> +
> +	/* Check GET_DC_CONFIG is supported by device */
> +	if (!test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds)) {
> +		dev_dbg(dev, "unsupported cmd: get_dynamic_capacity_config\n");
> +		return 0;
> +	}
> +
> +	dc_resp = kvmalloc(dc_resp_size, GFP_KERNEL);
> +	if (!dc_resp)
> +		return -ENOMEM;
> +
> +	start_region = 0;
> +	do {
> +		int j;
> +
> +		rc = cxl_get_dc_id(mds, start_region, dc_resp, dc_resp_size);
> +		if (rc < 0)
> +			goto free_resp;
> +
> +		mds->nr_dc_region += rc;
> +
> +		if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> +			dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> +				mds->nr_dc_region);
> +			rc = -EINVAL;
> +			goto free_resp;
> +		}
> +
> +		for (i = start_region, j = 0; i < mds->nr_dc_region; i++, j++) {
> +			rc = cxl_dc_save_region_info(mds, i, &dc_resp->region[j]);
> +			if (rc)
> +				goto free_resp;
> +		}
> +
> +		start_region = mds->nr_dc_region;
> +
> +	} while (mds->nr_dc_region < dc_resp->avail_region_count);
> +
> +	mds->dynamic_cap =
> +		mds->dc_region[mds->nr_dc_region - 1].base +
> +		mds->dc_region[mds->nr_dc_region - 1].decode_len -
> +		mds->dc_region[0].base;
> +	dev_dbg(dev, "Total dynamic capacity: %#llx\n", mds->dynamic_cap);
> +
> +free_resp:
> +	kfree(dc_resp);
> +	if (rc)
> +		dev_err(dev, "Failed to get DC info: %d\n", rc);
> +	return rc;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> +
>   static int add_dpa_res(struct device *dev, struct resource *parent,
>   		       struct resource *res, resource_size_t start,
>   		       resource_size_t size, const char *type)
> @@ -1208,8 +1369,12 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>   {
>   	struct cxl_dev_state *cxlds = &mds->cxlds;
>   	struct device *dev = cxlds->dev;
> +	size_t untenanted_mem;
>   	int rc;
>   
> +	untenanted_mem = mds->dc_region[0].base - mds->static_cap;
> +	mds->total_bytes = mds->static_cap + untenanted_mem + mds->dynamic_cap;
> +
>   	if (!cxlds->media_ready) {
>   		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
>   		cxlds->ram_res = DEFINE_RES_MEM(0, 0);
> @@ -1217,8 +1382,16 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>   		return 0;
>   	}
>   
> -	cxlds->dpa_res =
> -		(struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
> +	cxlds->dpa_res = (struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
> +
> +	for (int i = 0; i < mds->nr_dc_region; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
> +				 dcr->base, dcr->decode_len, dcr->name);
> +		if (rc)
> +			return rc;
> +	}
>   
>   	if (mds->partition_align_bytes == 0) {
>   		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 252bc8e1f103..75041903b72c 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -46,7 +46,7 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
>   	rc = down_read_interruptible(&cxl_region_rwsem);
>   	if (rc)
>   		return rc;
> -	if (cxlr->mode != CXL_DECODER_PMEM)
> +	if (cxlr->mode != CXL_REGION_PMEM)
>   		rc = sysfs_emit(buf, "\n");
>   	else
>   		rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
> @@ -359,7 +359,7 @@ static umode_t cxl_region_visible(struct kobject *kobj, struct attribute *a,
>   	 * Support tooling that expects to find a 'uuid' attribute for all
>   	 * regions regardless of mode.
>   	 */
> -	if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_DECODER_PMEM)
> +	if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_REGION_PMEM)
>   		return 0444;
>   	return a->mode;
>   }
> @@ -537,7 +537,7 @@ static ssize_t mode_show(struct device *dev, struct device_attribute *attr,
>   {
>   	struct cxl_region *cxlr = to_cxl_region(dev);
>   
> -	return sysfs_emit(buf, "%s\n", cxl_decoder_mode_name(cxlr->mode));
> +	return sysfs_emit(buf, "%s\n", cxl_region_mode_name(cxlr->mode));
>   }
>   static DEVICE_ATTR_RO(mode);
>   
> @@ -563,7 +563,7 @@ static int alloc_hpa(struct cxl_region *cxlr, resource_size_t size)
>   
>   	/* ways, granularity and uuid (if PMEM) need to be set before HPA */
>   	if (!p->interleave_ways || !p->interleave_granularity ||
> -	    (cxlr->mode == CXL_DECODER_PMEM && uuid_is_null(&p->uuid)))
> +	    (cxlr->mode == CXL_REGION_PMEM && uuid_is_null(&p->uuid)))
>   		return -ENXIO;
>   
>   	div_u64_rem(size, SZ_256M * p->interleave_ways, &remainder);
> @@ -1765,6 +1765,17 @@ static int cxl_region_sort_targets(struct cxl_region *cxlr)
>   	return rc;
>   }
>   
> +static bool cxl_modes_compatible(enum cxl_region_mode rmode,
> +				 enum cxl_decoder_mode dmode)
> +{
> +	if (rmode == CXL_REGION_RAM && dmode == CXL_DECODER_RAM)
> +		return true;
> +	if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
> +		return true;
> +
> +	return false;
> +}
> +
>   static int cxl_region_attach(struct cxl_region *cxlr,
>   			     struct cxl_endpoint_decoder *cxled, int pos)
>   {
> @@ -1778,9 +1789,11 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>   	lockdep_assert_held_write(&cxl_region_rwsem);
>   	lockdep_assert_held_read(&cxl_dpa_rwsem);
>   
> -	if (cxled->mode != cxlr->mode) {
> -		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
> -			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
> +	if (!cxl_modes_compatible(cxlr->mode, cxled->mode)) {
> +		dev_dbg(&cxlr->dev, "%s region mode: %s mismatch decoder: %s\n",
> +			dev_name(&cxled->cxld.dev),
> +			cxl_region_mode_name(cxlr->mode),
> +			cxl_decoder_mode_name(cxled->mode));
>   		return -EINVAL;
>   	}
>   
> @@ -2234,7 +2247,7 @@ static struct cxl_region *cxl_region_alloc(struct cxl_root_decoder *cxlrd, int i
>    * devm_cxl_add_region - Adds a region to a decoder
>    * @cxlrd: root decoder
>    * @id: memregion id to create, or memregion_free() on failure
> - * @mode: mode for the endpoint decoders of this region
> + * @mode: mode of this region
>    * @type: select whether this is an expander or accelerator (type-2 or type-3)
>    *
>    * This is the second step of region initialization. Regions exist within an
> @@ -2245,7 +2258,7 @@ static struct cxl_region *cxl_region_alloc(struct cxl_root_decoder *cxlrd, int i
>    */
>   static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>   					      int id,
> -					      enum cxl_decoder_mode mode,
> +					      enum cxl_region_mode mode,
>   					      enum cxl_decoder_type type)
>   {
>   	struct cxl_port *port = to_cxl_port(cxlrd->cxlsd.cxld.dev.parent);
> @@ -2254,11 +2267,12 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>   	int rc;
>   
>   	switch (mode) {
> -	case CXL_DECODER_RAM:
> -	case CXL_DECODER_PMEM:
> +	case CXL_REGION_RAM:
> +	case CXL_REGION_PMEM:
>   		break;
>   	default:
> -		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> +		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
> +			cxl_region_mode_name(mode));
>   		return ERR_PTR(-EINVAL);
>   	}
>   
> @@ -2308,7 +2322,7 @@ static ssize_t create_ram_region_show(struct device *dev,
>   }
>   
>   static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
> -					  int id, enum cxl_decoder_mode mode,
> +					  int id, enum cxl_region_mode mode,
>   					  enum cxl_decoder_type type)
>   {
>   	int rc;
> @@ -2337,7 +2351,7 @@ static ssize_t create_pmem_region_store(struct device *dev,
>   	if (rc != 1)
>   		return -EINVAL;
>   
> -	cxlr = __create_region(cxlrd, id, CXL_DECODER_PMEM,
> +	cxlr = __create_region(cxlrd, id, CXL_REGION_PMEM,
>   			       CXL_DECODER_HOSTONLYMEM);
>   	if (IS_ERR(cxlr))
>   		return PTR_ERR(cxlr);
> @@ -2358,7 +2372,7 @@ static ssize_t create_ram_region_store(struct device *dev,
>   	if (rc != 1)
>   		return -EINVAL;
>   
> -	cxlr = __create_region(cxlrd, id, CXL_DECODER_RAM,
> +	cxlr = __create_region(cxlrd, id, CXL_REGION_RAM,
>   			       CXL_DECODER_HOSTONLYMEM);
>   	if (IS_ERR(cxlr))
>   		return PTR_ERR(cxlr);
> @@ -2886,10 +2900,31 @@ static void construct_region_end(void)
>   	up_write(&cxl_region_rwsem);
>   }
>   
> +static enum cxl_region_mode
> +cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
> +{
> +	switch (mode) {
> +	case CXL_DECODER_NONE:
> +		return CXL_REGION_NONE;
> +	case CXL_DECODER_RAM:
> +		return CXL_REGION_RAM;
> +	case CXL_DECODER_PMEM:
> +		return CXL_REGION_PMEM;
> +	case CXL_DECODER_DEAD:
> +		return CXL_REGION_DEAD;
> +	case CXL_DECODER_MIXED:
> +	default:
> +		return CXL_REGION_MIXED;
> +	}
> +
> +	return CXL_REGION_MIXED;
> +}
> +
>   static struct cxl_region *
>   construct_region_begin(struct cxl_root_decoder *cxlrd,
>   		       struct cxl_endpoint_decoder *cxled)
>   {
> +	enum cxl_region_mode mode = cxl_decoder_to_region_mode(cxled->mode);
>   	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
>   	struct cxl_region_params *p;
>   	struct cxl_region *cxlr;
> @@ -2897,7 +2932,7 @@ construct_region_begin(struct cxl_root_decoder *cxlrd,
>   
>   	do {
>   		cxlr = __create_region(cxlrd, atomic_read(&cxlrd->region_id),
> -				       cxled->mode, cxled->cxld.target_type);
> +				       mode, cxled->cxld.target_type);
>   	} while (IS_ERR(cxlr) && PTR_ERR(cxlr) == -EBUSY);
>   
>   	if (IS_ERR(cxlr)) {
> @@ -3200,9 +3235,9 @@ static int cxl_region_probe(struct device *dev)
>   		return rc;
>   
>   	switch (cxlr->mode) {
> -	case CXL_DECODER_PMEM:
> +	case CXL_REGION_PMEM:
>   		return devm_cxl_add_pmem_region(cxlr);
> -	case CXL_DECODER_RAM:
> +	case CXL_REGION_RAM:
>   		/*
>   		 * The region cannot be managed by CXL if any portion of
>   		 * it is already online as 'System RAM'
> @@ -3223,8 +3258,8 @@ static int cxl_region_probe(struct device *dev)
>   		/* HDM-H routes to device-dax */
>   		return devm_cxl_add_dax_region(cxlr);
>   	default:
> -		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
> -			cxlr->mode);
> +		dev_dbg(&cxlr->dev, "unsupported region mode: %s\n",
> +			cxl_region_mode_name(cxlr->mode));
>   		return -ENXIO;
>   	}
>   }
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index cd4a9ffdacc7..ed282dcd5cf5 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -374,6 +374,28 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>   	return "mixed";
>   }
>   
> +enum cxl_region_mode {
> +	CXL_REGION_NONE,
> +	CXL_REGION_RAM,
> +	CXL_REGION_PMEM,
> +	CXL_REGION_MIXED,
> +	CXL_REGION_DEAD,
> +};
> +
> +static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
> +{
> +	static const char * const names[] = {
> +		[CXL_REGION_NONE] = "none",
> +		[CXL_REGION_RAM] = "ram",
> +		[CXL_REGION_PMEM] = "pmem",
> +		[CXL_REGION_MIXED] = "mixed",
> +	};
> +
> +	if (mode >= CXL_REGION_NONE && mode <= CXL_REGION_MIXED)
> +		return names[mode];
> +	return "mixed";
> +}
> +
>   /*
>    * Track whether this decoder is reserved for region autodiscovery, or
>    * free for userspace provisioning.
> @@ -502,7 +524,8 @@ struct cxl_region_params {
>    * struct cxl_region - CXL region
>    * @dev: This region's device
>    * @id: This region's id. Id is globally unique across all regions
> - * @mode: Endpoint decoder allocation / access mode
> + * @mode: Region mode which defines which endpoint decoder mode the region is
> + *        compatible with
>    * @type: Endpoint decoder target type
>    * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
>    * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
> @@ -512,7 +535,7 @@ struct cxl_region_params {
>   struct cxl_region {
>   	struct device dev;
>   	int id;
> -	enum cxl_decoder_mode mode;
> +	enum cxl_region_mode mode;
>   	enum cxl_decoder_type type;
>   	struct cxl_nvdimm_bridge *cxl_nvb;
>   	struct cxl_pmem_region *cxlr_pmem;
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 5f2e65204bf9..8c8f47b397ab 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -396,6 +396,7 @@ enum cxl_devtype {
>   	CXL_DEVTYPE_CLASSMEM,
>   };
>   
> +#define CXL_MAX_DC_REGION 8
>   /**
>    * struct cxl_dev_state - The driver device state
>    *
> @@ -412,6 +413,8 @@ enum cxl_devtype {
>    * @dpa_res: Overall DPA resource tree for the device
>    * @pmem_res: Active Persistent memory capacity configuration
>    * @ram_res: Active Volatile memory capacity configuration
> + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> + *          region
>    * @component_reg_phys: register base of component registers
>    * @serial: PCIe Device Serial Number
>    * @type: Generic Memory Class device or Vendor Specific Memory device
> @@ -426,11 +429,23 @@ struct cxl_dev_state {
>   	struct resource dpa_res;
>   	struct resource pmem_res;
>   	struct resource ram_res;
> +	struct resource dc_res[CXL_MAX_DC_REGION];
>   	resource_size_t component_reg_phys;
>   	u64 serial;
>   	enum cxl_devtype type;
>   };
>   
> +#define CXL_DC_REGION_STRLEN 7
> +struct cxl_dc_region_info {
> +	u64 base;
> +	u64 decode_len;
> +	u64 len;
> +	u64 blk_size;
> +	u32 dsmad_handle;
> +	u8 flags;
> +	u8 name[CXL_DC_REGION_STRLEN];
> +};
> +
>   /**
>    * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
>    *
> @@ -449,6 +464,8 @@ struct cxl_dev_state {
>    * @enabled_cmds: Hardware commands found enabled in CEL.
>    * @exclusive_cmds: Commands that are kernel-internal only
>    * @total_bytes: sum of all possible capacities
> + * @static_cap: Sum of RAM and PMEM capacities
> + * @dynamic_cap: Complete DPA range occupied by DC regions
>    * @volatile_only_bytes: hard volatile capacity
>    * @persistent_only_bytes: hard persistent capacity
>    * @partition_align_bytes: alignment size for partition-able capacity
> @@ -456,6 +473,10 @@ struct cxl_dev_state {
>    * @active_persistent_bytes: sum of hard + soft persistent
>    * @next_volatile_bytes: volatile capacity change pending device reset
>    * @next_persistent_bytes: persistent capacity change pending device reset
> + * @nr_dc_region: number of DC regions implemented in the memory device
> + * @dc_region: array containing info about the DC regions
> + * @dc_event_log_size: The number of events the device can store in the
> + * Dynamic Capacity Event Log before it overflows
>    * @event: event log driver state
>    * @poison: poison driver state info
>    * @fw: firmware upload / activation state
> @@ -473,7 +494,10 @@ struct cxl_memdev_state {
>   	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
>   	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
>   	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> +
>   	u64 total_bytes;
> +	u64 static_cap;
> +	u64 dynamic_cap;
>   	u64 volatile_only_bytes;
>   	u64 persistent_only_bytes;
>   	u64 partition_align_bytes;
> @@ -481,6 +505,11 @@ struct cxl_memdev_state {
>   	u64 active_persistent_bytes;
>   	u64 next_volatile_bytes;
>   	u64 next_persistent_bytes;
> +
> +	u8 nr_dc_region;
> +	struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
> +	size_t dc_event_log_size;
> +
>   	struct cxl_event_state event;
>   	struct cxl_poison_state poison;
>   	struct cxl_security_state security;
> @@ -587,6 +616,7 @@ struct cxl_mbox_identify {
>   	__le16 inject_poison_limit;
>   	u8 poison_caps;
>   	u8 qos_telemetry_caps;
> +	__le16 dc_event_log_size;
>   } __packed;
>   
>   /*
> @@ -741,9 +771,31 @@ struct cxl_mbox_set_partition_info {
>   	__le64 volatile_capacity;
>   	u8 flags;
>   } __packed;
> -
>   #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
>   
> +struct cxl_mbox_get_dc_config {
> +	u8 region_count;
> +	u8 start_region_index;
> +} __packed;
> +
> +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> +struct cxl_mbox_dynamic_capacity {
> +	u8 avail_region_count;
> +	u8 rsvd[7];
> +	struct cxl_dc_region_config {
> +		__le64 region_base;
> +		__le64 region_decode_length;
> +		__le64 region_length;
> +		__le64 region_block_size;
> +		__le32 region_dsmad_handle;
> +		u8 flags;
> +		u8 rsvd[3];
> +	} __packed region[];
> +} __packed;
> +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
> +#define CXL_REGIONS_RETURNED(size_out) \
> +	((size_out - 8) / sizeof(struct cxl_dc_region_config))
> +
>   /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
>   struct cxl_mbox_set_timestamp_in {
>   	__le64 timestamp;
> @@ -867,6 +919,7 @@ enum {
>   int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
>   			  struct cxl_mbox_cmd *cmd);
>   int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
>   int cxl_await_media_ready(struct cxl_dev_state *cxlds);
>   int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
>   int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 5242dbf0044d..a9b110ff1176 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -879,6 +879,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>   	if (rc)
>   		return rc;
>   
> +	rc = cxl_dev_dynamic_capacity_identify(mds);
> +	if (rc)
> +		return rc;
> +
>   	rc = cxl_mem_create_range_info(mds);
>   	if (rc)
>   		return rc;
> 

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 04/18] cxl/region: Add Dynamic Capacity decoder and region modes
  2023-08-29  5:20 ` [PATCH RFC v2 04/18] cxl/region: Add Dynamic Capacity decoder and region modes Ira Weiny
  2023-08-29 14:39   ` Jonathan Cameron
@ 2023-08-30 21:13   ` Dave Jiang
  2023-08-31 17:00   ` Fan Ni
  2 siblings, 0 replies; 97+ messages in thread
From: Dave Jiang @ 2023-08-30 21:13 UTC (permalink / raw)
  To: Ira Weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel



On 8/28/23 22:20, Ira Weiny wrote:
> Both regions and decoders will need a new mode to reflect the new type
> of partition they are targeting on a device.  Regions reflect a dynamic
> capacity type which may point to different Dynamic Capacity (DC)
> Regions.  Decoder mode reflects a specific DC Region.
> 
> Define the new modes to use in subsequent patches and the helper
> functions associated with them.
> 
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
> Changes for v2:
> [iweiny: split out from: Add dynamic capacity cxl region support.]
> ---
>   drivers/cxl/core/region.c |  4 ++++
>   drivers/cxl/cxl.h         | 23 +++++++++++++++++++++++
>   2 files changed, 27 insertions(+)
> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 75041903b72c..69af1354bc5b 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1772,6 +1772,8 @@ static bool cxl_modes_compatible(enum cxl_region_mode rmode,
>   		return true;
>   	if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
>   		return true;
> +	if (rmode == CXL_REGION_DC && cxl_decoder_mode_is_dc(dmode))
> +		return true;
>   
>   	return false;
>   }
> @@ -2912,6 +2914,8 @@ cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
>   		return CXL_REGION_PMEM;
>   	case CXL_DECODER_DEAD:
>   		return CXL_REGION_DEAD;
> +	case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
> +		return CXL_REGION_DC;
>   	case CXL_DECODER_MIXED:
>   	default:
>   		return CXL_REGION_MIXED;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index ed282dcd5cf5..d41f3f14fbe3 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -356,6 +356,14 @@ enum cxl_decoder_mode {
>   	CXL_DECODER_NONE,
>   	CXL_DECODER_RAM,
>   	CXL_DECODER_PMEM,
> +	CXL_DECODER_DC0,
> +	CXL_DECODER_DC1,
> +	CXL_DECODER_DC2,
> +	CXL_DECODER_DC3,
> +	CXL_DECODER_DC4,
> +	CXL_DECODER_DC5,
> +	CXL_DECODER_DC6,
> +	CXL_DECODER_DC7,
>   	CXL_DECODER_MIXED,
>   	CXL_DECODER_DEAD,
>   };
> @@ -366,6 +374,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>   		[CXL_DECODER_NONE] = "none",
>   		[CXL_DECODER_RAM] = "ram",
>   		[CXL_DECODER_PMEM] = "pmem",
> +		[CXL_DECODER_DC0] = "dc0",
> +		[CXL_DECODER_DC1] = "dc1",
> +		[CXL_DECODER_DC2] = "dc2",
> +		[CXL_DECODER_DC3] = "dc3",
> +		[CXL_DECODER_DC4] = "dc4",
> +		[CXL_DECODER_DC5] = "dc5",
> +		[CXL_DECODER_DC6] = "dc6",
> +		[CXL_DECODER_DC7] = "dc7",
>   		[CXL_DECODER_MIXED] = "mixed",
>   	};
>   
> @@ -374,10 +390,16 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>   	return "mixed";
>   }
>   
> +static inline bool cxl_decoder_mode_is_dc(enum cxl_decoder_mode mode)
> +{
> +	return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
> +}
> +
>   enum cxl_region_mode {
>   	CXL_REGION_NONE,
>   	CXL_REGION_RAM,
>   	CXL_REGION_PMEM,
> +	CXL_REGION_DC,
>   	CXL_REGION_MIXED,
>   	CXL_REGION_DEAD,
>   };
> @@ -388,6 +410,7 @@ static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
>   		[CXL_REGION_NONE] = "none",
>   		[CXL_REGION_RAM] = "ram",
>   		[CXL_REGION_PMEM] = "pmem",
> +		[CXL_REGION_DC] = "dc",
>   		[CXL_REGION_MIXED] = "mixed",
>   	};
>   
> 

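[Editorial note: the decoder/region mode split in the patch above reduces to a
range check plus a mode cross-walk. A minimal standalone sketch of that logic,
with illustrative enum values rather than the kernel's actual definitions:]

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative copies of the enums added by the patch (values are
 * assumptions for this sketch, not the kernel's). */
enum decoder_mode {
	DECODER_NONE, DECODER_RAM, DECODER_PMEM,
	DECODER_DC0, DECODER_DC1, DECODER_DC2, DECODER_DC3,
	DECODER_DC4, DECODER_DC5, DECODER_DC6, DECODER_DC7,
	DECODER_MIXED, DECODER_DEAD,
};

enum region_mode {
	REGION_NONE, REGION_RAM, REGION_PMEM, REGION_DC, REGION_MIXED,
};

/* Mirrors cxl_decoder_mode_is_dc(): any of the eight DC decoder modes */
static bool decoder_mode_is_dc(enum decoder_mode m)
{
	return m >= DECODER_DC0 && m <= DECODER_DC7;
}

/* Mirrors cxl_modes_compatible(): one region DC mode matches any
 * decoder DC mode; ram/pmem must match exactly. */
static bool modes_compatible(enum region_mode r, enum decoder_mode d)
{
	if (r == REGION_RAM && d == DECODER_RAM)
		return true;
	if (r == REGION_PMEM && d == DECODER_PMEM)
		return true;
	if (r == REGION_DC && decoder_mode_is_dc(d))
		return true;
	return false;
}
```

This is why the patch adds a single CXL_REGION_DC region mode rather than
eight: a region only needs to know it is dynamic-capacity, while each
endpoint decoder still tracks which specific DC region (dc0..dc7) it targets.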

* Re: [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device
  2023-08-29  5:20 ` [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device ira.weiny
  2023-08-29 14:37   ` Jonathan Cameron
  2023-08-30 21:01   ` Dave Jiang
@ 2023-08-30 21:44   ` Fan Ni
  2023-09-08 22:52     ` Ira Weiny
  2023-09-07 15:46   ` Alison Schofield
  2023-09-08 12:46   ` Jørgen Hansen
  4 siblings, 1 reply; 97+ messages in thread
From: Fan Ni @ 2023-08-30 21:44 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Jonathan Cameron,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma,
	linux-cxl, linux-kernel

On Mon, Aug 28, 2023 at 10:20:54PM -0700, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
>
> Devices can optionally support Dynamic Capacity (DC).  These devices are
> known as Dynamic Capacity Devices (DCD).
>
> Implement the DC (opcode 48XXh) mailbox commands as specified in CXL 3.0
> section 8.2.9.8.9.  Read the DC configuration and store the DC region
> information in the device state.
>
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> ---
> Changes for v2
> [iweiny: Rebased to latest master type2 work]
> [jonathan: s/dc/dc_resp/]
> [iweiny: Clean up commit message]
> [iweiny: Clean kernel docs]
> [djiang: Fix up cxl_is_dcd_command]
> [djiang: extra blank line]
> [alison: s/total_capacity/cap/ etc...]
> [alison: keep partition flag with partition structures]
> [alison: reformat untenanted_mem declaration]
> [alison: move 'cmd' definition back]
> [alison: fix comment line length]
> [alison: reverse x-tree]
> [jonathan: fix and adjust CXL_DC_REGION_STRLEN]
> [Jonathan/iweiny: Factor out storing each DC region read from the device]
> [Jonathan: place all dcr initializers together]
> [Jonathan/iweiny: flip around the region DPA order check]
> [jonathan: Account for short read of mailbox command]
> [iweiny: use snprintf for region name]
> [iweiny: use '<nil>' for missing region names]
> [iweiny: factor out struct cxl_dc_region_info]
> [iweiny: Split out reading CEL]
> ---
>  drivers/cxl/core/mbox.c   | 179 +++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/cxl/core/region.c |  75 +++++++++++++------
>  drivers/cxl/cxl.h         |  27 ++++++-
>  drivers/cxl/cxlmem.h      |  55 +++++++++++++-
>  drivers/cxl/pci.c         |   4 ++
>  5 files changed, 314 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 554ec97a7c39..d769814f80e2 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1096,7 +1096,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
>  	if (rc < 0)
>  		return rc;
>
> -	mds->total_bytes =
> +	mds->static_cap =
>  		le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
>  	mds->volatile_only_bytes =
>  		le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
> @@ -1114,6 +1114,8 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
>  		mds->poison.max_errors = min_t(u32, val, CXL_POISON_LIST_MAX);
>  	}
>
> +	mds->dc_event_log_size = le16_to_cpu(id.dc_event_log_size);
> +
>  	return 0;
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dev_state_identify, CXL);
> @@ -1178,6 +1180,165 @@ int cxl_mem_sanitize(struct cxl_memdev_state *mds, u16 cmd)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_mem_sanitize, CXL);
>
> +static int cxl_dc_save_region_info(struct cxl_memdev_state *mds, int index,
> +				   struct cxl_dc_region_config *region_config)
> +{
> +	struct cxl_dc_region_info *dcr = &mds->dc_region[index];
> +	struct device *dev = mds->cxlds.dev;
> +
> +	dcr->base = le64_to_cpu(region_config->region_base);
> +	dcr->decode_len = le64_to_cpu(region_config->region_decode_length);
> +	dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
> +	dcr->len = le64_to_cpu(region_config->region_length);
> +	dcr->blk_size = le64_to_cpu(region_config->region_block_size);
> +	dcr->dsmad_handle = le32_to_cpu(region_config->region_dsmad_handle);
> +	dcr->flags = region_config->flags;
> +	snprintf(dcr->name, CXL_DC_REGION_STRLEN, "dc%d", index);
> +
> +	/* Check regions are in increasing DPA order */
> +	if (index > 0) {
> +		struct cxl_dc_region_info *prev_dcr = &mds->dc_region[index - 1];
> +
> +		if ((prev_dcr->base + prev_dcr->decode_len) > dcr->base) {
> +			dev_err(dev,
> +				"DPA ordering violation for DC region %d and %d\n",
> +				index - 1, index);
> +			return -EINVAL;
> +		}
> +	}
> +
> +	/* Check the region is 256 MB aligned */
> +	if (!IS_ALIGNED(dcr->base, SZ_256M)) {
> +		dev_err(dev, "DC region %d not aligned to 256MB: %#llx\n",
> +			index, dcr->base);
> +		return -EINVAL;
> +	}
> +
> +	/* Check Region base and length are aligned to block size */
> +	if (!IS_ALIGNED(dcr->base, dcr->blk_size) ||
> +	    !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> +		dev_err(dev, "DC region %d not aligned to %#llx\n", index,
> +			dcr->blk_size);
> +		return -EINVAL;
> +	}

Based on the CXL 3.0 spec, Table 8-126, we may need some extra checks
here:
1. region len <= decode_len
2. region block size should be a power of 2 and a multiple of 40h.

Also, the spec mentions that if region len or block size is 0, DC will
not be available; we may need to handle that as well.

Fan
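[Editorial note: the extra checks Fan suggests could be sketched as below.
Field and function names are hypothetical, loosely mirroring the patch's
struct cxl_dc_region_info; this is not the eventual kernel implementation.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the relevant cxl_dc_region_info fields */
struct dc_region_info {
	uint64_t decode_len;
	uint64_t len;
	uint64_t blk_size;
};

static bool dc_blk_size_valid(uint64_t blk_size)
{
	/* Per CXL 3.0 Table 8-126: block size must be a power of 2 and
	 * a multiple of 40h (64 bytes); 0 means DC is not available. */
	if (blk_size == 0)
		return false;
	if (blk_size & (blk_size - 1))	/* not a power of 2 */
		return false;
	return (blk_size % 0x40) == 0;
}

static bool dc_region_valid(const struct dc_region_info *dcr)
{
	/* Region length must not exceed the decode length, and a zero
	 * region length means no dynamic capacity is available. */
	if (dcr->len == 0 || dcr->len > dcr->decode_len)
		return false;
	return dc_blk_size_valid(dcr->blk_size);
}
```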

> +
> +	dev_dbg(dev,
> +		"DC region %s DPA: %#llx LEN: %#llx BLKSZ: %#llx\n",
> +		dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
> +
> +	return 0;
> +}
> +
> +/* Returns the number of regions in dc_resp or -ERRNO */
> +static int cxl_get_dc_id(struct cxl_memdev_state *mds, u8 start_region,
> +			 struct cxl_mbox_dynamic_capacity *dc_resp,
> +			 size_t dc_resp_size)
> +{
> +	struct cxl_mbox_get_dc_config get_dc = (struct cxl_mbox_get_dc_config) {
> +		.region_count = CXL_MAX_DC_REGION,
> +		.start_region_index = start_region,
> +	};
> +	struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> +		.payload_in = &get_dc,
> +		.size_in = sizeof(get_dc),
> +		.size_out = dc_resp_size,
> +		.payload_out = dc_resp,
> +		.min_out = 1,
> +	};
> +	struct device *dev = mds->cxlds.dev;
> +	int rc;
> +
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	if (rc < 0)
> +		return rc;
> +
> +	rc = dc_resp->avail_region_count - start_region;
> +
> +	/*
> +	 * The number of regions in the payload may have been truncated due to
> +	 * payload_size limits; if so adjust the count in this query.
> +	 */
> +	if (mbox_cmd.size_out < sizeof(*dc_resp))
> +		rc = CXL_REGIONS_RETURNED(mbox_cmd.size_out);
> +
> +	dev_dbg(dev, "Read %d/%d DC regions\n", rc, dc_resp->avail_region_count);
> +
> +	return rc;
> +}
> +
> +/**
> + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> + *					 information from the device.
> + * @mds: The memory device state
> + *
> + * This will dispatch the get_dynamic_capacity command to the device
> + * and on success populate structures to be exported to sysfs.
> + *
> + * Return: 0 if identify was executed successfully, -ERRNO on error.
> + */
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> +{
> +	struct cxl_mbox_dynamic_capacity *dc_resp;
> +	struct device *dev = mds->cxlds.dev;
> +	size_t dc_resp_size = mds->payload_size;
> +	u8 start_region;
> +	int i, rc = 0;
> +
> +	for (i = 0; i < CXL_MAX_DC_REGION; i++)
> +		snprintf(mds->dc_region[i].name, CXL_DC_REGION_STRLEN, "<nil>");
> +
> +	/* Check GET_DC_CONFIG is supported by device */
> +	if (!test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds)) {
> +		dev_dbg(dev, "unsupported cmd: get_dynamic_capacity_config\n");
> +		return 0;
> +	}
> +
> +	dc_resp = kvmalloc(dc_resp_size, GFP_KERNEL);
> +	if (!dc_resp)
> +		return -ENOMEM;
> +
> +	start_region = 0;
> +	do {
> +		int j;
> +
> +		rc = cxl_get_dc_id(mds, start_region, dc_resp, dc_resp_size);
> +		if (rc < 0)
> +			goto free_resp;
> +
> +		mds->nr_dc_region += rc;
> +
> +		if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> +			dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> +				mds->nr_dc_region);
> +			rc = -EINVAL;
> +			goto free_resp;
> +		}
> +
> +		for (i = start_region, j = 0; i < mds->nr_dc_region; i++, j++) {
> +			rc = cxl_dc_save_region_info(mds, i, &dc_resp->region[j]);
> +			if (rc)
> +				goto free_resp;
> +		}
> +
> +		start_region = mds->nr_dc_region;
> +
> +	} while (mds->nr_dc_region < dc_resp->avail_region_count);
> +
> +	mds->dynamic_cap =
> +		mds->dc_region[mds->nr_dc_region - 1].base +
> +		mds->dc_region[mds->nr_dc_region - 1].decode_len -
> +		mds->dc_region[0].base;
> +	dev_dbg(dev, "Total dynamic capacity: %#llx\n", mds->dynamic_cap);
> +
> +free_resp:
> +	kfree(dc_resp);
> +	if (rc)
> +		dev_err(dev, "Failed to get DC info: %d\n", rc);
> +	return rc;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> +
>  static int add_dpa_res(struct device *dev, struct resource *parent,
>  		       struct resource *res, resource_size_t start,
>  		       resource_size_t size, const char *type)
> @@ -1208,8 +1369,12 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>  {
>  	struct cxl_dev_state *cxlds = &mds->cxlds;
>  	struct device *dev = cxlds->dev;
> +	size_t untenanted_mem;
>  	int rc;
>
> +	untenanted_mem = mds->dc_region[0].base - mds->static_cap;
> +	mds->total_bytes = mds->static_cap + untenanted_mem + mds->dynamic_cap;
> +
>  	if (!cxlds->media_ready) {
>  		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
>  		cxlds->ram_res = DEFINE_RES_MEM(0, 0);
> @@ -1217,8 +1382,16 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>  		return 0;
>  	}
>
> -	cxlds->dpa_res =
> -		(struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
> +	cxlds->dpa_res = (struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
> +
> +	for (int i = 0; i < mds->nr_dc_region; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
> +				 dcr->base, dcr->decode_len, dcr->name);
> +		if (rc)
> +			return rc;
> +	}
>
>  	if (mds->partition_align_bytes == 0) {
>  		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 252bc8e1f103..75041903b72c 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -46,7 +46,7 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
>  	rc = down_read_interruptible(&cxl_region_rwsem);
>  	if (rc)
>  		return rc;
> -	if (cxlr->mode != CXL_DECODER_PMEM)
> +	if (cxlr->mode != CXL_REGION_PMEM)
>  		rc = sysfs_emit(buf, "\n");
>  	else
>  		rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
> @@ -359,7 +359,7 @@ static umode_t cxl_region_visible(struct kobject *kobj, struct attribute *a,
>  	 * Support tooling that expects to find a 'uuid' attribute for all
>  	 * regions regardless of mode.
>  	 */
> -	if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_DECODER_PMEM)
> +	if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_REGION_PMEM)
>  		return 0444;
>  	return a->mode;
>  }
> @@ -537,7 +537,7 @@ static ssize_t mode_show(struct device *dev, struct device_attribute *attr,
>  {
>  	struct cxl_region *cxlr = to_cxl_region(dev);
>
> -	return sysfs_emit(buf, "%s\n", cxl_decoder_mode_name(cxlr->mode));
> +	return sysfs_emit(buf, "%s\n", cxl_region_mode_name(cxlr->mode));
>  }
>  static DEVICE_ATTR_RO(mode);
>
> @@ -563,7 +563,7 @@ static int alloc_hpa(struct cxl_region *cxlr, resource_size_t size)
>
>  	/* ways, granularity and uuid (if PMEM) need to be set before HPA */
>  	if (!p->interleave_ways || !p->interleave_granularity ||
> -	    (cxlr->mode == CXL_DECODER_PMEM && uuid_is_null(&p->uuid)))
> +	    (cxlr->mode == CXL_REGION_PMEM && uuid_is_null(&p->uuid)))
>  		return -ENXIO;
>
>  	div_u64_rem(size, SZ_256M * p->interleave_ways, &remainder);
> @@ -1765,6 +1765,17 @@ static int cxl_region_sort_targets(struct cxl_region *cxlr)
>  	return rc;
>  }
>
> +static bool cxl_modes_compatible(enum cxl_region_mode rmode,
> +				 enum cxl_decoder_mode dmode)
> +{
> +	if (rmode == CXL_REGION_RAM && dmode == CXL_DECODER_RAM)
> +		return true;
> +	if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
> +		return true;
> +
> +	return false;
> +}
> +
>  static int cxl_region_attach(struct cxl_region *cxlr,
>  			     struct cxl_endpoint_decoder *cxled, int pos)
>  {
> @@ -1778,9 +1789,11 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  	lockdep_assert_held_write(&cxl_region_rwsem);
>  	lockdep_assert_held_read(&cxl_dpa_rwsem);
>
> -	if (cxled->mode != cxlr->mode) {
> -		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
> -			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
> +	if (!cxl_modes_compatible(cxlr->mode, cxled->mode)) {
> +		dev_dbg(&cxlr->dev, "%s region mode: %s mismatch decoder: %s\n",
> +			dev_name(&cxled->cxld.dev),
> +			cxl_region_mode_name(cxlr->mode),
> +			cxl_decoder_mode_name(cxled->mode));
>  		return -EINVAL;
>  	}
>
> @@ -2234,7 +2247,7 @@ static struct cxl_region *cxl_region_alloc(struct cxl_root_decoder *cxlrd, int i
>   * devm_cxl_add_region - Adds a region to a decoder
>   * @cxlrd: root decoder
>   * @id: memregion id to create, or memregion_free() on failure
> - * @mode: mode for the endpoint decoders of this region
> + * @mode: mode of this region
>   * @type: select whether this is an expander or accelerator (type-2 or type-3)
>   *
>   * This is the second step of region initialization. Regions exist within an
> @@ -2245,7 +2258,7 @@ static struct cxl_region *cxl_region_alloc(struct cxl_root_decoder *cxlrd, int i
>   */
>  static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>  					      int id,
> -					      enum cxl_decoder_mode mode,
> +					      enum cxl_region_mode mode,
>  					      enum cxl_decoder_type type)
>  {
>  	struct cxl_port *port = to_cxl_port(cxlrd->cxlsd.cxld.dev.parent);
> @@ -2254,11 +2267,12 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>  	int rc;
>
>  	switch (mode) {
> -	case CXL_DECODER_RAM:
> -	case CXL_DECODER_PMEM:
> +	case CXL_REGION_RAM:
> +	case CXL_REGION_PMEM:
>  		break;
>  	default:
> -		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> +		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
> +			cxl_region_mode_name(mode));
>  		return ERR_PTR(-EINVAL);
>  	}
>
> @@ -2308,7 +2322,7 @@ static ssize_t create_ram_region_show(struct device *dev,
>  }
>
>  static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
> -					  int id, enum cxl_decoder_mode mode,
> +					  int id, enum cxl_region_mode mode,
>  					  enum cxl_decoder_type type)
>  {
>  	int rc;
> @@ -2337,7 +2351,7 @@ static ssize_t create_pmem_region_store(struct device *dev,
>  	if (rc != 1)
>  		return -EINVAL;
>
> -	cxlr = __create_region(cxlrd, id, CXL_DECODER_PMEM,
> +	cxlr = __create_region(cxlrd, id, CXL_REGION_PMEM,
>  			       CXL_DECODER_HOSTONLYMEM);
>  	if (IS_ERR(cxlr))
>  		return PTR_ERR(cxlr);
> @@ -2358,7 +2372,7 @@ static ssize_t create_ram_region_store(struct device *dev,
>  	if (rc != 1)
>  		return -EINVAL;
>
> -	cxlr = __create_region(cxlrd, id, CXL_DECODER_RAM,
> +	cxlr = __create_region(cxlrd, id, CXL_REGION_RAM,
>  			       CXL_DECODER_HOSTONLYMEM);
>  	if (IS_ERR(cxlr))
>  		return PTR_ERR(cxlr);
> @@ -2886,10 +2900,31 @@ static void construct_region_end(void)
>  	up_write(&cxl_region_rwsem);
>  }
>
> +static enum cxl_region_mode
> +cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
> +{
> +	switch (mode) {
> +	case CXL_DECODER_NONE:
> +		return CXL_REGION_NONE;
> +	case CXL_DECODER_RAM:
> +		return CXL_REGION_RAM;
> +	case CXL_DECODER_PMEM:
> +		return CXL_REGION_PMEM;
> +	case CXL_DECODER_DEAD:
> +		return CXL_REGION_DEAD;
> +	case CXL_DECODER_MIXED:
> +	default:
> +		return CXL_REGION_MIXED;
> +	}
> +
> +	return CXL_REGION_MIXED;
> +}
> +
>  static struct cxl_region *
>  construct_region_begin(struct cxl_root_decoder *cxlrd,
>  		       struct cxl_endpoint_decoder *cxled)
>  {
> +	enum cxl_region_mode mode = cxl_decoder_to_region_mode(cxled->mode);
>  	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
>  	struct cxl_region_params *p;
>  	struct cxl_region *cxlr;
> @@ -2897,7 +2932,7 @@ construct_region_begin(struct cxl_root_decoder *cxlrd,
>
>  	do {
>  		cxlr = __create_region(cxlrd, atomic_read(&cxlrd->region_id),
> -				       cxled->mode, cxled->cxld.target_type);
> +				       mode, cxled->cxld.target_type);
>  	} while (IS_ERR(cxlr) && PTR_ERR(cxlr) == -EBUSY);
>
>  	if (IS_ERR(cxlr)) {
> @@ -3200,9 +3235,9 @@ static int cxl_region_probe(struct device *dev)
>  		return rc;
>
>  	switch (cxlr->mode) {
> -	case CXL_DECODER_PMEM:
> +	case CXL_REGION_PMEM:
>  		return devm_cxl_add_pmem_region(cxlr);
> -	case CXL_DECODER_RAM:
> +	case CXL_REGION_RAM:
>  		/*
>  		 * The region cannot be managed by CXL if any portion of
>  		 * it is already online as 'System RAM'
> @@ -3223,8 +3258,8 @@ static int cxl_region_probe(struct device *dev)
>  		/* HDM-H routes to device-dax */
>  		return devm_cxl_add_dax_region(cxlr);
>  	default:
> -		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
> -			cxlr->mode);
> +		dev_dbg(&cxlr->dev, "unsupported region mode: %s\n",
> +			cxl_region_mode_name(cxlr->mode));
>  		return -ENXIO;
>  	}
>  }
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index cd4a9ffdacc7..ed282dcd5cf5 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -374,6 +374,28 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  	return "mixed";
>  }
>
> +enum cxl_region_mode {
> +	CXL_REGION_NONE,
> +	CXL_REGION_RAM,
> +	CXL_REGION_PMEM,
> +	CXL_REGION_MIXED,
> +	CXL_REGION_DEAD,
> +};
> +
> +static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
> +{
> +	static const char * const names[] = {
> +		[CXL_REGION_NONE] = "none",
> +		[CXL_REGION_RAM] = "ram",
> +		[CXL_REGION_PMEM] = "pmem",
> +		[CXL_REGION_MIXED] = "mixed",
> +	};
> +
> +	if (mode >= CXL_REGION_NONE && mode <= CXL_REGION_MIXED)
> +		return names[mode];
> +	return "mixed";
> +}
> +
>  /*
>   * Track whether this decoder is reserved for region autodiscovery, or
>   * free for userspace provisioning.
> @@ -502,7 +524,8 @@ struct cxl_region_params {
>   * struct cxl_region - CXL region
>   * @dev: This region's device
>   * @id: This region's id. Id is globally unique across all regions
> - * @mode: Endpoint decoder allocation / access mode
> + * @mode: Region mode which defines which endpoint decoder mode the region is
> + *        compatible with
>   * @type: Endpoint decoder target type
>   * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
>   * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
> @@ -512,7 +535,7 @@ struct cxl_region_params {
>  struct cxl_region {
>  	struct device dev;
>  	int id;
> -	enum cxl_decoder_mode mode;
> +	enum cxl_region_mode mode;
>  	enum cxl_decoder_type type;
>  	struct cxl_nvdimm_bridge *cxl_nvb;
>  	struct cxl_pmem_region *cxlr_pmem;
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 5f2e65204bf9..8c8f47b397ab 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -396,6 +396,7 @@ enum cxl_devtype {
>  	CXL_DEVTYPE_CLASSMEM,
>  };
>
> +#define CXL_MAX_DC_REGION 8
>  /**
>   * struct cxl_dev_state - The driver device state
>   *
> @@ -412,6 +413,8 @@ enum cxl_devtype {
>   * @dpa_res: Overall DPA resource tree for the device
>   * @pmem_res: Active Persistent memory capacity configuration
>   * @ram_res: Active Volatile memory capacity configuration
> + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> + *          region
>   * @component_reg_phys: register base of component registers
>   * @serial: PCIe Device Serial Number
>   * @type: Generic Memory Class device or Vendor Specific Memory device
> @@ -426,11 +429,23 @@ struct cxl_dev_state {
>  	struct resource dpa_res;
>  	struct resource pmem_res;
>  	struct resource ram_res;
> +	struct resource dc_res[CXL_MAX_DC_REGION];
>  	resource_size_t component_reg_phys;
>  	u64 serial;
>  	enum cxl_devtype type;
>  };
>
> +#define CXL_DC_REGION_STRLEN 7
> +struct cxl_dc_region_info {
> +	u64 base;
> +	u64 decode_len;
> +	u64 len;
> +	u64 blk_size;
> +	u32 dsmad_handle;
> +	u8 flags;
> +	u8 name[CXL_DC_REGION_STRLEN];
> +};
> +
>  /**
>   * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
>   *
> @@ -449,6 +464,8 @@ struct cxl_dev_state {
>   * @enabled_cmds: Hardware commands found enabled in CEL.
>   * @exclusive_cmds: Commands that are kernel-internal only
>   * @total_bytes: sum of all possible capacities
> + * @static_cap: Sum of RAM and PMEM capacities
> + * @dynamic_cap: Complete DPA range occupied by DC regions
>   * @volatile_only_bytes: hard volatile capacity
>   * @persistent_only_bytes: hard persistent capacity
>   * @partition_align_bytes: alignment size for partition-able capacity
> @@ -456,6 +473,10 @@ struct cxl_dev_state {
>   * @active_persistent_bytes: sum of hard + soft persistent
>   * @next_volatile_bytes: volatile capacity change pending device reset
>   * @next_persistent_bytes: persistent capacity change pending device reset
> + * @nr_dc_region: number of DC regions implemented in the memory device
> + * @dc_region: array containing info about the DC regions
> + * @dc_event_log_size: The number of events the device can store in the
> + * Dynamic Capacity Event Log before it overflows
>   * @event: event log driver state
>   * @poison: poison driver state info
>   * @fw: firmware upload / activation state
> @@ -473,7 +494,10 @@ struct cxl_memdev_state {
>  	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
>  	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
>  	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> +
>  	u64 total_bytes;
> +	u64 static_cap;
> +	u64 dynamic_cap;
>  	u64 volatile_only_bytes;
>  	u64 persistent_only_bytes;
>  	u64 partition_align_bytes;
> @@ -481,6 +505,11 @@ struct cxl_memdev_state {
>  	u64 active_persistent_bytes;
>  	u64 next_volatile_bytes;
>  	u64 next_persistent_bytes;
> +
> +	u8 nr_dc_region;
> +	struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
> +	size_t dc_event_log_size;
> +
>  	struct cxl_event_state event;
>  	struct cxl_poison_state poison;
>  	struct cxl_security_state security;
> @@ -587,6 +616,7 @@ struct cxl_mbox_identify {
>  	__le16 inject_poison_limit;
>  	u8 poison_caps;
>  	u8 qos_telemetry_caps;
> +	__le16 dc_event_log_size;
>  } __packed;
>
>  /*
> @@ -741,9 +771,31 @@ struct cxl_mbox_set_partition_info {
>  	__le64 volatile_capacity;
>  	u8 flags;
>  } __packed;
> -
>  #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
>
> +struct cxl_mbox_get_dc_config {
> +	u8 region_count;
> +	u8 start_region_index;
> +} __packed;
> +
> +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> +struct cxl_mbox_dynamic_capacity {
> +	u8 avail_region_count;
> +	u8 rsvd[7];
> +	struct cxl_dc_region_config {
> +		__le64 region_base;
> +		__le64 region_decode_length;
> +		__le64 region_length;
> +		__le64 region_block_size;
> +		__le32 region_dsmad_handle;
> +		u8 flags;
> +		u8 rsvd[3];
> +	} __packed region[];
> +} __packed;
> +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
> +#define CXL_REGIONS_RETURNED(size_out) \
> +	((size_out - 8) / sizeof(struct cxl_dc_region_config))
> +
>  /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
>  struct cxl_mbox_set_timestamp_in {
>  	__le64 timestamp;
> @@ -867,6 +919,7 @@ enum {
>  int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
>  			  struct cxl_mbox_cmd *cmd);
>  int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
>  int cxl_await_media_ready(struct cxl_dev_state *cxlds);
>  int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
>  int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 5242dbf0044d..a9b110ff1176 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -879,6 +879,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	if (rc)
>  		return rc;
>
> +	rc = cxl_dev_dynamic_capacity_identify(mds);
> +	if (rc)
> +		return rc;
> +
>  	rc = cxl_mem_create_range_info(mds);
>  	if (rc)
>  		return rc;
>
> --
> 2.41.0
>

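[Editorial note: the loop in cxl_dev_dynamic_capacity_identify() above handles
a mailbox payload too small to return every region at once by re-issuing Get
Dynamic Capacity Config with an advancing start index. A userspace sketch of
that pagination pattern, with a hypothetical get_dc_config() standing in for
the mailbox round trip:]

```c
#include <assert.h>
#include <stdint.h>

#define MAX_DC_REGION 8

struct dc_resp {
	uint8_t avail_region_count;	/* total regions on the device */
	uint8_t returned;		/* regions present in this reply */
	uint64_t region_base[MAX_DC_REGION];
};

/* Hypothetical device model: 5 regions total, at most 2 per reply,
 * emulating a payload-size-limited mailbox. */
static void get_dc_config(uint8_t start, struct dc_resp *resp)
{
	resp->avail_region_count = 5;
	resp->returned = (5 - start > 2) ? 2 : (uint8_t)(5 - start);
	for (int j = 0; j < resp->returned; j++)
		resp->region_base[j] = (uint64_t)(start + j) * 0x10000000ULL;
}

/* Mirrors the patch's do/while: accumulate regions, restarting each
 * query at the count read so far, until the device-reported total is
 * reached. Returns the number of regions read. */
static int read_all_regions(uint64_t bases[MAX_DC_REGION])
{
	struct dc_resp resp;
	uint8_t nr = 0;

	do {
		get_dc_config(nr, &resp);
		for (int j = 0; j < resp.returned; j++)
			bases[nr + j] = resp.region_base[j];
		nr += resp.returned;
	} while (nr < resp.avail_region_count);

	return nr;
}
```

In the kernel version the "returned" count is inferred from the short
mailbox read via the CXL_REGIONS_RETURNED() macro rather than carried
explicitly in the payload.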

* Re: [PATCH RFC v2 07/18] cxl/mem: Expose device dynamic capacity configuration
  2023-08-29  5:20 ` [PATCH RFC v2 07/18] cxl/mem: Expose device dynamic capacity configuration ira.weiny
  2023-08-29 15:14   ` Jonathan Cameron
@ 2023-08-30 22:46   ` Dave Jiang
  2023-09-08 23:22     ` Ira Weiny
  1 sibling, 1 reply; 97+ messages in thread
From: Dave Jiang @ 2023-08-30 22:46 UTC (permalink / raw)
  To: ira.weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel



On 8/28/23 22:20, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> To properly configure CXL regions on Dynamic Capacity Devices (DCD),
> user space will need to know the details of the DC Regions available on
> a device.
> 
> Expose the device's dynamic capacity configuration through sysfs
> attributes.
> 
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes for v2:
> [iweiny: Rebased on latest master/type2 work]
> [iweiny: add documentation for sysfs entries]
> [iweiny: s/dc_regions_count/region_count/]
> [iweiny: s/dcY_size/regionY_size/]
> [alison: change size format to %#llx]
> [iweiny: change count format to %d]
> [iweiny: Formatting updates]
> [iweiny: Fix crash when device is not a mem device: found with cxl-test]
> ---
>   Documentation/ABI/testing/sysfs-bus-cxl | 17 ++++++++
>   drivers/cxl/core/memdev.c               | 77 +++++++++++++++++++++++++++++++++
>   2 files changed, 94 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 2268ffcdb604..aa65dc5b4e13 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -37,6 +37,23 @@ Description:
>   		identically named field in the Identify Memory Device Output
>   		Payload in the CXL-2.0 specification.
>   
> +What:		/sys/bus/cxl/devices/memX/dc/region_count
> +Date:		July, 2023
> +KernelVersion:	v6.6
> +Contact:	linux-cxl@vger.kernel.org
> +Description:
> +		(RO) Number of Dynamic Capacity (DC) regions supported on the
> +		device.  May be 0 if the device does not support Dynamic
> +		Capacity.
> +
> +What:		/sys/bus/cxl/devices/memX/dc/regionY_size
> +Date:		July, 2023
> +KernelVersion:	v6.6
> +Contact:	linux-cxl@vger.kernel.org
> +Description:
> +		(RO) Size of the Dynamic Capacity (DC) region Y.  Only
> +		available on devices which support DC and only for those
> +		region indexes supported by the device.
>   
>   What:		/sys/bus/cxl/devices/memX/serial
>   Date:		January, 2022
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index 492486707fd0..397262e0ebd2 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -101,6 +101,20 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
>   static struct device_attribute dev_attr_pmem_size =
>   	__ATTR(size, 0444, pmem_size_show, NULL);
>   
> +static ssize_t region_count_show(struct device *dev, struct device_attribute *attr,
> +				 char *buf)
> +{
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +	int len = 0;
> +
> +	len = sysfs_emit(buf, "%d\n", mds->nr_dc_region);
> +	return len;
> +}
> +
> +struct device_attribute dev_attr_region_count =
> +	__ATTR(region_count, 0444, region_count_show, NULL);
> +
>   static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
>   			   char *buf)
>   {
> @@ -454,6 +468,62 @@ static struct attribute *cxl_memdev_security_attributes[] = {
>   	NULL,
>   };
>   
> +static ssize_t show_size_regionN(struct cxl_memdev *cxlmd, char *buf, int pos)
> +{
> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +
> +	return sysfs_emit(buf, "%#llx\n", mds->dc_region[pos].decode_len);
> +}
> +
> +#define REGION_SIZE_ATTR_RO(n)						\
> +static ssize_t region##n##_size_show(struct device *dev,		\
> +				     struct device_attribute *attr,	\
> +				     char *buf)				\
> +{									\
> +	return show_size_regionN(to_cxl_memdev(dev), buf, (n));		\
> +}									\
> +static DEVICE_ATTR_RO(region##n##_size)
> +REGION_SIZE_ATTR_RO(0);
> +REGION_SIZE_ATTR_RO(1);
> +REGION_SIZE_ATTR_RO(2);
> +REGION_SIZE_ATTR_RO(3);
> +REGION_SIZE_ATTR_RO(4);
> +REGION_SIZE_ATTR_RO(5);
> +REGION_SIZE_ATTR_RO(6);
> +REGION_SIZE_ATTR_RO(7);
> +
> +static struct attribute *cxl_memdev_dc_attributes[] = {
> +	&dev_attr_region0_size.attr,
> +	&dev_attr_region1_size.attr,
> +	&dev_attr_region2_size.attr,
> +	&dev_attr_region3_size.attr,
> +	&dev_attr_region4_size.attr,
> +	&dev_attr_region5_size.attr,
> +	&dev_attr_region6_size.attr,
> +	&dev_attr_region7_size.attr,
> +	&dev_attr_region_count.attr,
> +	NULL,
> +};
> +
> +static umode_t cxl_dc_visible(struct kobject *kobj, struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +
> +	/* Not a memory device */
> +	if (!mds)
> +		return 0;
> +
> +	if (a == &dev_attr_region_count.attr)
> +		return a->mode;
> +
> +	if (n < mds->nr_dc_region)
> +		return a->mode;

I would add a comment explaining what 'n' is and why it is checked
against nr_dc_region, to make the visibility logic obvious.

DJ

> +
> +	return 0;
> +}
> +
>   static umode_t cxl_memdev_visible(struct kobject *kobj, struct attribute *a,
>   				  int n)
>   {
> @@ -482,11 +552,18 @@ static struct attribute_group cxl_memdev_security_attribute_group = {
>   	.attrs = cxl_memdev_security_attributes,
>   };
>   
> +static struct attribute_group cxl_memdev_dc_attribute_group = {
> +	.name = "dc",
> +	.attrs = cxl_memdev_dc_attributes,
> +	.is_visible = cxl_dc_visible,
> +};
> +
>   static const struct attribute_group *cxl_memdev_attribute_groups[] = {
>   	&cxl_memdev_attribute_group,
>   	&cxl_memdev_ram_attribute_group,
>   	&cxl_memdev_pmem_attribute_group,
>   	&cxl_memdev_security_attribute_group,
> +	&cxl_memdev_dc_attribute_group,
>   	NULL,
>   };
>   
> 

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 08/18] cxl/region: Add Dynamic Capacity CXL region support
  2023-08-29  5:20 ` [PATCH RFC v2 08/18] cxl/region: Add Dynamic Capacity CXL region support Ira Weiny
  2023-08-29 15:19   ` Jonathan Cameron
@ 2023-08-30 23:27   ` Dave Jiang
  2023-09-06  4:36     ` Ira Weiny
  2023-09-05 21:09   ` Fan Ni
  2 siblings, 1 reply; 97+ messages in thread
From: Dave Jiang @ 2023-08-30 23:27 UTC (permalink / raw)
  To: Ira Weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel



On 8/28/23 22:20, Ira Weiny wrote:
> CXL devices optionally support dynamic capacity.  CXL Regions must be
> configured correctly to access this capacity.  Similar to ram and pmem
> partitions, DC Regions represent different partitions of the DPA space.
> 
> Interleaving is deferred due to the complexity of managing extents on
> multiple devices at the same time.  However, there is nothing which
> directly prevents interleave support at this time.  The check allows
> for early rejection.
> 
> To maintain backwards compatibility with older software, CXL regions
> need a default DAX device to hold the reference for the region until it
> is deleted.
> 
> Add create_dc_region sysfs entry to create DC regions.  Share the logic
> of devm_cxl_add_dax_region() and region_is_system_ram().  Special case
> DC capable CXL regions to create a 0 sized seed DAX device until others
> can be created on dynamic space later.
> 
> Flag dax_regions to indicate 0 capacity available until dax_region
> extents are supported by the region.
> 
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

You should probably update the kernel version to v6.7. Otherwise
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> 
> ---
> changes for v2:
> [iweiny: flag empty dax regions]
> [iweiny: Split out anything not directly related to creating a DC CXL
> 	 region]
> [iweiny: Separate out dev dax stuff]
> [iweiny/navneet: create 0 sized DAX device by default]
> [iweiny: use new DC region mode]
> ---
>   Documentation/ABI/testing/sysfs-bus-cxl | 20 +++++-----
>   drivers/cxl/core/core.h                 |  1 +
>   drivers/cxl/core/port.c                 |  1 +
>   drivers/cxl/core/region.c               | 71 ++++++++++++++++++++++++++++-----
>   drivers/dax/bus.c                       |  8 ++++
>   drivers/dax/bus.h                       |  1 +
>   drivers/dax/cxl.c                       | 15 ++++++-
>   7 files changed, 96 insertions(+), 21 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index aa65dc5b4e13..a0562938ecac 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -351,20 +351,20 @@ Description:
>   		interleave_granularity).
>   
>   
> -What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
> +What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dc}_region
>   Date:		May, 2022, January, 2023
> -KernelVersion:	v6.0 (pmem), v6.3 (ram)
> +KernelVersion:	v6.0 (pmem), v6.3 (ram), v6.6 (dc)
>   Contact:	linux-cxl@vger.kernel.org
>   Description:
>   		(RW) Write a string in the form 'regionZ' to start the process
> -		of defining a new persistent, or volatile memory region
> -		(interleave-set) within the decode range bounded by root decoder
> -		'decoderX.Y'. The value written must match the current value
> -		returned from reading this attribute. An atomic compare exchange
> -		operation is done on write to assign the requested id to a
> -		region and allocate the region-id for the next creation attempt.
> -		EBUSY is returned if the region name written does not match the
> -		current cached value.
> +		of defining a new persistent, volatile, or Dynamic Capacity
> +		(DC) memory region (interleave-set) within the decode range
> +		bounded by root decoder 'decoderX.Y'. The value written must
> +		match the current value returned from reading this attribute.
> +		An atomic compare exchange operation is done on write to assign
> +		the requested id to a region and allocate the region-id for the
> +		next creation attempt.  EBUSY is returned if the region name
> +		written does not match the current cached value.
>   
>   
>   What:		/sys/bus/cxl/devices/decoderX.Y/delete_region
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 45e7e044cf4a..cf3cf01cb95d 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -13,6 +13,7 @@ extern struct attribute_group cxl_base_attribute_group;
>   #ifdef CONFIG_CXL_REGION
>   extern struct device_attribute dev_attr_create_pmem_region;
>   extern struct device_attribute dev_attr_create_ram_region;
> +extern struct device_attribute dev_attr_create_dc_region;
>   extern struct device_attribute dev_attr_delete_region;
>   extern struct device_attribute dev_attr_region;
>   extern const struct device_type cxl_pmem_region_type;
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index a5db710a63bc..608901bb7d91 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -314,6 +314,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
>   	&dev_attr_target_list.attr,
>   	SET_CXL_REGION_ATTR(create_pmem_region)
>   	SET_CXL_REGION_ATTR(create_ram_region)
> +	SET_CXL_REGION_ATTR(create_dc_region)
>   	SET_CXL_REGION_ATTR(delete_region)
>   	NULL,
>   };
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 69af1354bc5b..fc8dee469244 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -2271,6 +2271,7 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>   	switch (mode) {
>   	case CXL_REGION_RAM:
>   	case CXL_REGION_PMEM:
> +	case CXL_REGION_DC:
>   		break;
>   	default:
>   		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
> @@ -2383,6 +2384,33 @@ static ssize_t create_ram_region_store(struct device *dev,
>   }
>   DEVICE_ATTR_RW(create_ram_region);
>   
> +static ssize_t create_dc_region_show(struct device *dev,
> +				     struct device_attribute *attr, char *buf)
> +{
> +	return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static ssize_t create_dc_region_store(struct device *dev,
> +				      struct device_attribute *attr,
> +				      const char *buf, size_t len)
> +{
> +	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
> +	struct cxl_region *cxlr;
> +	int rc, id;
> +
> +	rc = sscanf(buf, "region%d\n", &id);
> +	if (rc != 1)
> +		return -EINVAL;
> +
> +	cxlr = __create_region(cxlrd, id, CXL_REGION_DC,
> +			       CXL_DECODER_HOSTONLYMEM);
> +	if (IS_ERR(cxlr))
> +		return PTR_ERR(cxlr);
> +
> +	return len;
> +}
> +DEVICE_ATTR_RW(create_dc_region);
> +
>   static ssize_t region_show(struct device *dev, struct device_attribute *attr,
>   			   char *buf)
>   {
> @@ -2834,7 +2862,7 @@ static void cxlr_dax_unregister(void *_cxlr_dax)
>   	device_unregister(&cxlr_dax->dev);
>   }
>   
> -static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> +static int __devm_cxl_add_dax_region(struct cxl_region *cxlr)
>   {
>   	struct cxl_dax_region *cxlr_dax;
>   	struct device *dev;
> @@ -2863,6 +2891,21 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
>   	return rc;
>   }
>   
> +static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> +{
> +	return __devm_cxl_add_dax_region(cxlr);
> +}
> +
> +static int devm_cxl_add_dc_dax_region(struct cxl_region *cxlr)
> +{
> +	if (cxlr->params.interleave_ways != 1) {
> +		dev_err(&cxlr->dev, "Interleaving DC not supported\n");
> +		return -EINVAL;
> +	}
> +
> +	return __devm_cxl_add_dax_region(cxlr);
> +}
> +
>   static int match_decoder_by_range(struct device *dev, void *data)
>   {
>   	struct range *r1, *r2 = data;
> @@ -3203,6 +3246,19 @@ static int is_system_ram(struct resource *res, void *arg)
>   	return 1;
>   }
>   
> +/*
> + * The region can not be managed by CXL if any portion of
> + * it is already online as 'System RAM'
> + */
> +static bool region_is_system_ram(struct cxl_region *cxlr,
> +				 struct cxl_region_params *p)
> +{
> +	return (walk_iomem_res_desc(IORES_DESC_NONE,
> +				    IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> +				    p->res->start, p->res->end, cxlr,
> +				    is_system_ram) > 0);
> +}
> +
>   static int cxl_region_probe(struct device *dev)
>   {
>   	struct cxl_region *cxlr = to_cxl_region(dev);
> @@ -3242,14 +3298,7 @@ static int cxl_region_probe(struct device *dev)
>   	case CXL_REGION_PMEM:
>   		return devm_cxl_add_pmem_region(cxlr);
>   	case CXL_REGION_RAM:
> -		/*
> -		 * The region can not be manged by CXL if any portion of
> -		 * it is already online as 'System RAM'
> -		 */
> -		if (walk_iomem_res_desc(IORES_DESC_NONE,
> -					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> -					p->res->start, p->res->end, cxlr,
> -					is_system_ram) > 0)
> +		if (region_is_system_ram(cxlr, p))
>   			return 0;
>   
>   		/*
> @@ -3261,6 +3310,10 @@ static int cxl_region_probe(struct device *dev)
>   
>   		/* HDM-H routes to device-dax */
>   		return devm_cxl_add_dax_region(cxlr);
> +	case CXL_REGION_DC:
> +		if (region_is_system_ram(cxlr, p))
> +			return 0;
> +		return devm_cxl_add_dc_dax_region(cxlr);
>   	default:
>   		dev_dbg(&cxlr->dev, "unsupported region mode: %s\n",
>   			cxl_region_mode_name(cxlr->mode));
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 0ee96e6fc426..b76e49813a39 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -169,6 +169,11 @@ static bool is_static(struct dax_region *dax_region)
>   	return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
>   }
>   
> +static bool is_dynamic(struct dax_region *dax_region)
> +{
> +	return (dax_region->res.flags & IORESOURCE_DAX_DYNAMIC_CAP) != 0;
> +}
> +
>   bool static_dev_dax(struct dev_dax *dev_dax)
>   {
>   	return is_static(dev_dax->region);
> @@ -285,6 +290,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
>   
>   	device_lock_assert(dax_region->dev);
>   
> +	if (is_dynamic(dax_region))
> +		return 0;
> +
>   	for_each_dax_region_resource(dax_region, res)
>   		size -= resource_size(res);
>   	return size;
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index 1ccd23360124..74d8fe4a5532 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -13,6 +13,7 @@ struct dax_region;
>   /* dax bus specific ioresource flags */
>   #define IORESOURCE_DAX_STATIC BIT(0)
>   #define IORESOURCE_DAX_KMEM BIT(1)
> +#define IORESOURCE_DAX_DYNAMIC_CAP BIT(2)
>   
>   struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>   		struct range *range, int target_node, unsigned int align,
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 8bc9d04034d6..147c8c69782b 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -13,19 +13,30 @@ static int cxl_dax_region_probe(struct device *dev)
>   	struct cxl_region *cxlr = cxlr_dax->cxlr;
>   	struct dax_region *dax_region;
>   	struct dev_dax_data data;
> +	resource_size_t dev_size;
> +	unsigned long flags;
>   
>   	if (nid == NUMA_NO_NODE)
>   		nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
>   
> +	dev_size = range_len(&cxlr_dax->hpa_range);
> +
> +	flags = IORESOURCE_DAX_KMEM;
> +	if (cxlr->mode == CXL_REGION_DC) {
> +		/* Add empty seed dax device */
> +		dev_size = 0;
> +		flags |= IORESOURCE_DAX_DYNAMIC_CAP;
> +	}
> +
>   	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> -				      PMD_SIZE, IORESOURCE_DAX_KMEM);
> +				      PMD_SIZE, flags);
>   	if (!dax_region)
>   		return -ENOMEM;
>   
>   	data = (struct dev_dax_data) {
>   		.dax_region = dax_region,
>   		.id = -1,
> -		.size = range_len(&cxlr_dax->hpa_range),
> +		.size = dev_size,
>   	};
>   
>   	return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));
> 

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 04/18] cxl/region: Add Dynamic Capacity decoder and region modes
  2023-08-29  5:20 ` [PATCH RFC v2 04/18] cxl/region: Add Dynamic Capacity decoder and region modes Ira Weiny
  2023-08-29 14:39   ` Jonathan Cameron
  2023-08-30 21:13   ` Dave Jiang
@ 2023-08-31 17:00   ` Fan Ni
  2 siblings, 0 replies; 97+ messages in thread
From: Fan Ni @ 2023-08-31 17:00 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Jonathan Cameron,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma,
	linux-cxl, linux-kernel

On Mon, Aug 28, 2023 at 10:20:55PM -0700, Ira Weiny wrote:
> Both regions and decoders will need a new mode to reflect the new type
> of partition they are targeting on a device.  Regions reflect a dynamic
> capacity type which may point to different Dynamic Capacity (DC)
> Regions.  Decoder mode reflects a specific DC Region.
>
> Define the new modes to use in subsequent patches and the helper
> functions associated with them.
>
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>

Reviewed-by: Fan Ni <fan.ni@samsung.com>

> ---
> Changes for v2:
> [iweiny: split out from: Add dynamic capacity cxl region support.]
> ---
>  drivers/cxl/core/region.c |  4 ++++
>  drivers/cxl/cxl.h         | 23 +++++++++++++++++++++++
>  2 files changed, 27 insertions(+)
>
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 75041903b72c..69af1354bc5b 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1772,6 +1772,8 @@ static bool cxl_modes_compatible(enum cxl_region_mode rmode,
>  		return true;
>  	if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
>  		return true;
> +	if (rmode == CXL_REGION_DC && cxl_decoder_mode_is_dc(dmode))
> +		return true;
>
>  	return false;
>  }
> @@ -2912,6 +2914,8 @@ cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
>  		return CXL_REGION_PMEM;
>  	case CXL_DECODER_DEAD:
>  		return CXL_REGION_DEAD;
> +	case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
> +		return CXL_REGION_DC;
>  	case CXL_DECODER_MIXED:
>  	default:
>  		return CXL_REGION_MIXED;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index ed282dcd5cf5..d41f3f14fbe3 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -356,6 +356,14 @@ enum cxl_decoder_mode {
>  	CXL_DECODER_NONE,
>  	CXL_DECODER_RAM,
>  	CXL_DECODER_PMEM,
> +	CXL_DECODER_DC0,
> +	CXL_DECODER_DC1,
> +	CXL_DECODER_DC2,
> +	CXL_DECODER_DC3,
> +	CXL_DECODER_DC4,
> +	CXL_DECODER_DC5,
> +	CXL_DECODER_DC6,
> +	CXL_DECODER_DC7,
>  	CXL_DECODER_MIXED,
>  	CXL_DECODER_DEAD,
>  };
> @@ -366,6 +374,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  		[CXL_DECODER_NONE] = "none",
>  		[CXL_DECODER_RAM] = "ram",
>  		[CXL_DECODER_PMEM] = "pmem",
> +		[CXL_DECODER_DC0] = "dc0",
> +		[CXL_DECODER_DC1] = "dc1",
> +		[CXL_DECODER_DC2] = "dc2",
> +		[CXL_DECODER_DC3] = "dc3",
> +		[CXL_DECODER_DC4] = "dc4",
> +		[CXL_DECODER_DC5] = "dc5",
> +		[CXL_DECODER_DC6] = "dc6",
> +		[CXL_DECODER_DC7] = "dc7",
>  		[CXL_DECODER_MIXED] = "mixed",
>  	};
>
> @@ -374,10 +390,16 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  	return "mixed";
>  }
>
> +static inline bool cxl_decoder_mode_is_dc(enum cxl_decoder_mode mode)
> +{
> +	return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
> +}
> +
>  enum cxl_region_mode {
>  	CXL_REGION_NONE,
>  	CXL_REGION_RAM,
>  	CXL_REGION_PMEM,
> +	CXL_REGION_DC,
>  	CXL_REGION_MIXED,
>  	CXL_REGION_DEAD,
>  };
> @@ -388,6 +410,7 @@ static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
>  		[CXL_REGION_NONE] = "none",
>  		[CXL_REGION_RAM] = "ram",
>  		[CXL_REGION_PMEM] = "pmem",
> +		[CXL_REGION_DC] = "dc",
>  		[CXL_REGION_MIXED] = "mixed",
>  	};
>
>
> --
> 2.41.0
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 05/18] cxl/port: Add Dynamic Capacity mode support to endpoint decoders
  2023-08-29  5:20 ` [PATCH RFC v2 05/18] cxl/port: Add Dynamic Capacity mode support to endpoint decoders Ira Weiny
  2023-08-29 14:49   ` Jonathan Cameron
@ 2023-08-31 17:25   ` Fan Ni
  2023-09-08 23:26     ` Ira Weiny
  1 sibling, 1 reply; 97+ messages in thread
From: Fan Ni @ 2023-08-31 17:25 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Jonathan Cameron,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma,
	linux-cxl, linux-kernel

On Mon, Aug 28, 2023 at 10:20:56PM -0700, Ira Weiny wrote:
> Endpoint decoders used to map Dynamic Capacity must be configured to
> point to the correct Dynamic Capacity (DC) Region.  The decoder mode
> currently represents the partition the decoder points to such as ram or
> pmem.
>
> Expand the mode to include DC Regions.
>
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>

I have the same question as Jonathan about how dc_mode_to_region_index
is implemented and used.

Nice to see the code split out; it is easier to review now.

Fan

> ---
> Changes for v2:
> [iweiny: split from region creation patch]
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl | 19 ++++++++++---------
>  drivers/cxl/core/hdm.c                  | 24 ++++++++++++++++++++++++
>  drivers/cxl/core/port.c                 | 16 ++++++++++++++++
>  3 files changed, 50 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 6350dd82b9a9..2268ffcdb604 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -257,22 +257,23 @@ Description:
>
>  What:		/sys/bus/cxl/devices/decoderX.Y/mode
>  Date:		May, 2022
> -KernelVersion:	v6.0
> +KernelVersion:	v6.0, v6.6 (dcY)
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
>  		translates from a host physical address range, to a device local
>  		address range. Device-local address ranges are further split
> -		into a 'ram' (volatile memory) range and 'pmem' (persistent
> -		memory) range. The 'mode' attribute emits one of 'ram', 'pmem',
> -		'mixed', or 'none'. The 'mixed' indication is for error cases
> -		when a decoder straddles the volatile/persistent partition
> -		boundary, and 'none' indicates the decoder is not actively
> -		decoding, or no DPA allocation policy has been set.
> +		into a 'ram' (volatile memory) range, 'pmem' (persistent
> +		memory) range, or Dynamic Capacity (DC) range. The 'mode'
> +		attribute emits one of 'ram', 'pmem', 'dcY', 'mixed', or
> +		'none'. The 'mixed' indication is for error cases when a
> +		decoder straddles the volatile/persistent partition boundary,
> +		and 'none' indicates the decoder is not actively decoding, or
> +		no DPA allocation policy has been set.
>
>  		'mode' can be written, when the decoder is in the 'disabled'
> -		state, with either 'ram' or 'pmem' to set the boundaries for the
> -		next allocation.
> +		state, with 'ram', 'pmem', or 'dcY' to set the boundaries for
> +		the next allocation.
>
>
>  What:		/sys/bus/cxl/devices/decoderX.Y/dpa_resource
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index a254f79dd4e8..3f4af1f5fac8 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -267,6 +267,19 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	__cxl_dpa_release(cxled);
>  }
>
> +static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
> +{
> +	int index = 0;
> +
> +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> +		if (mode == i)
> +			return index;
> +		index++;
> +	}
> +
> +	return -EINVAL;
> +}
> +
>  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  			     resource_size_t base, resource_size_t len,
>  			     resource_size_t skipped)
> @@ -429,6 +442,7 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  	switch (mode) {
>  	case CXL_DECODER_RAM:
>  	case CXL_DECODER_PMEM:
> +	case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
>  		break;
>  	default:
>  		dev_dbg(dev, "unsupported mode: %d\n", mode);
> @@ -456,6 +470,16 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  		goto out;
>  	}
>
> +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> +		int index = dc_mode_to_region_index(i);
> +
> +		if (mode == i && !resource_size(&cxlds->dc_res[index])) {
> +			dev_dbg(dev, "no available dynamic capacity\n");
> +			rc = -ENXIO;
> +			goto out;
> +		}
> +	}
> +
>  	cxled->mode = mode;
>  	rc = 0;
>  out:
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index f58cf01f8d2c..ce4a66865db3 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -197,6 +197,22 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
>  		mode = CXL_DECODER_PMEM;
>  	else if (sysfs_streq(buf, "ram"))
>  		mode = CXL_DECODER_RAM;
> +	else if (sysfs_streq(buf, "dc0"))
> +		mode = CXL_DECODER_DC0;
> +	else if (sysfs_streq(buf, "dc1"))
> +		mode = CXL_DECODER_DC1;
> +	else if (sysfs_streq(buf, "dc2"))
> +		mode = CXL_DECODER_DC2;
> +	else if (sysfs_streq(buf, "dc3"))
> +		mode = CXL_DECODER_DC3;
> +	else if (sysfs_streq(buf, "dc4"))
> +		mode = CXL_DECODER_DC4;
> +	else if (sysfs_streq(buf, "dc5"))
> +		mode = CXL_DECODER_DC5;
> +	else if (sysfs_streq(buf, "dc6"))
> +		mode = CXL_DECODER_DC6;
> +	else if (sysfs_streq(buf, "dc7"))
> +		mode = CXL_DECODER_DC7;
>  	else
>  		return -EINVAL;
>
>
> --
> 2.41.0
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 10/18] cxl/mem: Handle DCD add and release capacity events.
  2023-08-29  5:21 ` [PATCH RFC v2 10/18] cxl/mem: Handle DCD add and release capacity events Ira Weiny
  2023-08-29 15:59   ` Jonathan Cameron
@ 2023-08-31 17:28   ` Dave Jiang
  2023-09-08 15:35     ` Ira Weiny
  1 sibling, 1 reply; 97+ messages in thread
From: Dave Jiang @ 2023-08-31 17:28 UTC (permalink / raw)
  To: Ira Weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel



On 8/28/23 22:21, Ira Weiny wrote:
> A Dynamic Capacity Device (DCD) utilizes events to signal the host about
> the changes to the allocation of Dynamic Capacity (DC) extents. The
> device communicates the state of DC extents through an extent list that
> describes the starting DPA, length, and meta data of the blocks the host
> can access.
> 
> Process the dynamic capacity add and release events.  The addition or
> removal of extents can occur at any time.  Adding memory
> asynchronously is straightforward.  However, the host is under no
> obligation to respond to a release event until it is done with the
> memory.  Introduce extent krefs to handle delaying extent release.
> 
> In the case of a force removal, access to the memory will fail and may
> cause a crash.  However, the extent tracking object is preserved for the
> region to safely tear down as long as the memory is not accessed.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> changes for v2:
> [iweiny: Totally new version of the patch]
> [iweiny: use kref to track when to release an extent]
> [iweiny: rebased to latest master/type2 work]
> [iweiny: use a kref to track if extents are being referenced]
> [alison: align commit message paragraphs]
> [alison: remove unnecessary return]
> [iweiny: Adjust for the new __devm_cxl_add_dax_region()]
> [navneet: Fix debug prints in adding/releasing extent]
> [alison: deal with odd if/else logic]
> [alison: reverse x-tree]
> [alison: reverse x-tree]
> [alison: s/total_extent_cnt/count/]
> [alison: make handle event reverse x-tree]
> [alison: cleanup/shorten/remove handle event comment]
> [iweiny/Alison: refactor cxl_handle_dcd_event_records function]
> [iweiny: keep cxl_dc_extent_data local to mbox.c]
> [jonathan: eliminate 'rc']
> [iweiny: use proper type for mailbox size]
> [jonathan: put dc_extents on the stack]
> [jonathan: use direct returns instead of goto]
> [iweiny: Clean up comment]
> [Jonathan: define CXL_DC_EXTENT_TAG_LEN]
> [Jonathan: remove extraneous changes]
> [Jonathan: fix blank line issues]
> ---
>   drivers/cxl/core/mbox.c | 186 +++++++++++++++++++++++++++++++++++++++++++++++-
>   drivers/cxl/cxl.h       |   9 +++
>   drivers/cxl/cxlmem.h    |  30 ++++++++
>   3 files changed, 224 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 9b08c40ef484..8474a28b16ca 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -839,6 +839,8 @@ static int cxl_store_dc_extent(struct cxl_memdev_state *mds,
>   	extent->length = le64_to_cpu(dc_extent->length);
>   	memcpy(extent->tag, dc_extent->tag, sizeof(extent->tag));
>   	extent->shared_extent_seq = le16_to_cpu(dc_extent->shared_extn_seq);
> +	kref_init(&extent->region_ref);
> +	extent->mds = mds;
>   
>   	dev_dbg(dev, "dynamic capacity extent DPA:0x%llx LEN:%llx\n",
>   		extent->dpa_start, extent->length);
> @@ -879,6 +881,14 @@ static const uuid_t mem_mod_event_uuid =
>   	UUID_INIT(0xfe927475, 0xdd59, 0x4339,
>   		  0xa5, 0x86, 0x79, 0xba, 0xb1, 0x13, 0xb7, 0x74);
>   
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.0 section 8.2.9.2.1.3; Table 8-45
> + */
> +static const uuid_t dc_event_uuid =
> +	UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c,
> +		  0x2f, 0x95, 0x26, 0x8e, 0x10, 0x1a, 0x2a);
> +
>   static void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>   				   enum cxl_event_log_type type,
>   				   struct cxl_event_record_raw *record)
> @@ -973,6 +983,171 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
>   	return rc;
>   }
>   
> +static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
> +				struct cxl_mbox_dc_response *res,
> +				int extent_cnt, int opcode)
> +{
> +	struct cxl_mbox_cmd mbox_cmd;
> +	size_t size;
> +
> +	size = struct_size(res, extent_list, extent_cnt);
> +	res->extent_list_size = cpu_to_le32(extent_cnt);
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = opcode,
> +		.size_in = size,
> +		.payload_in = res,
> +	};
> +
> +	return cxl_internal_send_cmd(mds, &mbox_cmd);
> +}
> +
> +static int cxl_prepare_ext_list(struct cxl_mbox_dc_response **res,
> +				int *n, struct range *extent)
> +{
> +	struct cxl_mbox_dc_response *dc_res;
> +	unsigned int size;
> +
> +	if (!extent)
> +		size = struct_size(dc_res, extent_list, 0);
> +	else
> +		size = struct_size(dc_res, extent_list, *n + 1);
> +
> +	dc_res = krealloc(*res, size, GFP_KERNEL);
> +	if (!dc_res)
> +		return -ENOMEM;
> +
> +	if (extent) {
> +		dc_res->extent_list[*n].dpa_start = cpu_to_le64(extent->start);
> +		memset(dc_res->extent_list[*n].reserved, 0, 8);
> +		dc_res->extent_list[*n].length = cpu_to_le64(range_len(extent));
> +		(*n)++;
> +	}
> +
> +	*res = dc_res;
> +	return 0;
> +}
> +
> +static void dc_extent_release(struct kref *kref)
> +{
> +	struct cxl_dc_extent_data *extent = container_of(kref,
> +						struct cxl_dc_extent_data,
> +						region_ref);
> +	struct cxl_memdev_state *mds = extent->mds;
> +	struct cxl_mbox_dc_response *dc_res = NULL;
> +	struct range rel_range = (struct range) {
> +		.start = extent->dpa_start,
> +		.end = extent->dpa_start + extent->length - 1,
> +	};
> +	struct device *dev = mds->cxlds.dev;
> +	int extent_cnt = 0, rc;
> +
> +	rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, &rel_range);
> +	if (rc < 0) {
> +		dev_err(dev, "Failed to create release response %d\n", rc);
> +		goto free_extent;
> +	}
> +	rc = cxl_send_dc_cap_response(mds, dc_res, extent_cnt,
> +				      CXL_MBOX_OP_RELEASE_DC);
> +	kfree(dc_res);
> +
> +free_extent:
> +	kfree(extent);
> +}
> +
> +void cxl_dc_extent_put(struct cxl_dc_extent_data *extent)
> +{
> +	kref_put(&extent->region_ref, dc_extent_release);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dc_extent_put, CXL);
> +
> +static int cxl_handle_dcd_release_event(struct cxl_memdev_state *mds,
> +					struct cxl_dc_extent *rel_extent)
> +{
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_dc_extent_data *extent;
> +	resource_size_t dpa, size;
> +
> +	dpa = le64_to_cpu(rel_extent->start_dpa);
> +	size = le64_to_cpu(rel_extent->length);
> +	dev_dbg(dev, "Release DC extent DPA:0x%llx LEN:%llx\n",
> +		dpa, size);
> +
> +	extent = xa_erase(&mds->dc_extent_list, dpa);
> +	if (!extent) {
> +		dev_err(dev, "No extent found with DPA:0x%llx\n", dpa);
> +		return -EINVAL;
> +	}
> +	cxl_dc_extent_put(extent);
> +	return 0;
> +}
> +
> +static int cxl_handle_dcd_add_event(struct cxl_memdev_state *mds,
> +				    struct cxl_dc_extent *add_extent)
> +{
> +	struct cxl_mbox_dc_response *dc_res = NULL;
> +	struct range alloc_range, *resp_range;
> +	struct device *dev = mds->cxlds.dev;
> +	int extent_cnt = 0;
> +	int rc;
> +
> +	dev_dbg(dev, "Add DC extent DPA:0x%llx LEN:%llx\n",
> +		le64_to_cpu(add_extent->start_dpa),
> +		le64_to_cpu(add_extent->length));
> +
> +	alloc_range = (struct range){
> +		.start = le64_to_cpu(add_extent->start_dpa),
> +		.end = le64_to_cpu(add_extent->start_dpa) +
> +			le64_to_cpu(add_extent->length) - 1,
> +	};
> +	resp_range = &alloc_range;
> +
> +	rc = cxl_store_dc_extent(mds, add_extent);
> +	if (rc) {
> +		dev_dbg(dev, "unconsumed DC extent DPA:0x%llx LEN:%llx\n",
> +			le64_to_cpu(add_extent->start_dpa),
> +			le64_to_cpu(add_extent->length));
> +		resp_range = NULL;
> +	}
> +
> +	rc = cxl_prepare_ext_list(&dc_res, &extent_cnt, resp_range);
> +	if (rc < 0) {
> +		dev_err(dev, "Couldn't create extent list %d\n", rc);
> +		return rc;
> +	}
> +
> +	rc = cxl_send_dc_cap_response(mds, dc_res, extent_cnt,
> +				      CXL_MBOX_OP_ADD_DC_RESPONSE);
> +	kfree(dc_res);
> +	return rc;
> +}
> +
> +/* Returns 0 if the event was handled successfully. */
Is this comment necessary?

> +static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
> +					struct cxl_event_record_raw *rec)
> +{
> +	struct dcd_event_dyn_cap *record = (struct dcd_event_dyn_cap *)rec;
> +	uuid_t *id = &rec->hdr.id;
> +	int rc;
> +
> +	if (!uuid_equal(id, &dc_event_uuid))
> +		return -EINVAL;
> +
> +	switch (record->data.event_type) {
> +	case DCD_ADD_CAPACITY:
> +		rc = cxl_handle_dcd_add_event(mds, &record->data.extent);

Just return?
> +		break;
> +	case DCD_RELEASE_CAPACITY:
> +        case DCD_FORCED_CAPACITY_RELEASE:

Extra 2 spaces of indentation?

> +		rc = cxl_handle_dcd_release_event(mds, &record->data.extent);

Same here about return.

DJ

> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	return rc;
> +}
> +
>   static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
>   				    enum cxl_event_log_type type)
>   {
> @@ -1016,6 +1191,13 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
>   				le16_to_cpu(payload->records[i].hdr.handle));
>   			cxl_event_trace_record(cxlmd, type,
>   					       &payload->records[i]);
> +			if (type == CXL_EVENT_TYPE_DCD) {
> +				rc = cxl_handle_dcd_event_records(mds,
> +								  &payload->records[i]);
> +				if (rc)
> +					dev_err_ratelimited(dev, "dcd event failed: %d\n",
> +							    rc);
> +			}
>   		}
>   
>   		if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
> @@ -1056,6 +1238,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
>   		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_WARN);
>   	if (status & CXLDEV_EVENT_STATUS_INFO)
>   		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_INFO);
> +	if (status & CXLDEV_EVENT_STATUS_DCD)
> +		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
>   }
>   EXPORT_SYMBOL_NS_GPL(cxl_mem_get_event_records, CXL);
>   
> @@ -1712,7 +1896,7 @@ static void cxl_destroy_mds(void *_mds)
>   
>   	xa_for_each(&mds->dc_extent_list, index, extent) {
>   		xa_erase(&mds->dc_extent_list, index);
> -		kfree(extent);
> +		cxl_dc_extent_put(extent);
>   	}
>   	xa_destroy(&mds->dc_extent_list);
>   }
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 0a225b0c20bf..81ca76ae1d02 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -163,6 +163,7 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
>   #define CXLDEV_EVENT_STATUS_WARN		BIT(1)
>   #define CXLDEV_EVENT_STATUS_FAIL		BIT(2)
>   #define CXLDEV_EVENT_STATUS_FATAL		BIT(3)
> +#define CXLDEV_EVENT_STATUS_DCD                 BIT(4)
>   
>   #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO |	\
>   				 CXLDEV_EVENT_STATUS_WARN |	\
> @@ -601,6 +602,14 @@ struct cxl_pmem_region {
>   	struct cxl_pmem_region_mapping mapping[];
>   };
>   
> +/* See CXL 3.0 8.2.9.2.1.5 */
> +enum dc_event {
> +        DCD_ADD_CAPACITY,
> +        DCD_RELEASE_CAPACITY,
> +        DCD_FORCED_CAPACITY_RELEASE,
> +        DCD_REGION_CONFIGURATION_UPDATED,
> +};
> +
>   struct cxl_dax_region {
>   	struct device dev;
>   	struct cxl_region *cxlr;
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index ad690600c1b9..118392229174 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -582,6 +582,16 @@ enum cxl_opcode {
>   	UUID_INIT(0xe1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19,     \
>   		  0x40, 0x3d, 0x86)
>   
> +struct cxl_mbox_dc_response {
> +	__le32 extent_list_size;
> +	u8 reserved[4];
> +	struct updated_extent_list {
> +		__le64 dpa_start;
> +		__le64 length;
> +		u8 reserved[8];
> +	} __packed extent_list[];
> +} __packed;
> +
>   struct cxl_mbox_get_supported_logs {
>   	__le16 entries;
>   	u8 rsvd[6];
> @@ -667,6 +677,7 @@ enum cxl_event_log_type {
>   	CXL_EVENT_TYPE_WARN,
>   	CXL_EVENT_TYPE_FAIL,
>   	CXL_EVENT_TYPE_FATAL,
> +	CXL_EVENT_TYPE_DCD,
>   	CXL_EVENT_TYPE_MAX
>   };
>   
> @@ -757,6 +768,8 @@ struct cxl_dc_extent_data {
>   	u64 length;
>   	u8 tag[CXL_DC_EXTENT_TAG_LEN];
>   	u16 shared_extent_seq;
> +	struct cxl_memdev_state *mds;
> +	struct kref region_ref;
>   };
>   
>   /*
> @@ -771,6 +784,21 @@ struct cxl_dc_extent {
>   	u8 reserved[6];
>   } __packed;
>   
> +struct dcd_record_data {
> +	u8 event_type;
> +	u8 reserved;
> +	__le16 host_id;
> +	u8 region_index;
> +	u8 reserved1[3];
> +	struct cxl_dc_extent extent;
> +	u8 reserved2[32];
> +} __packed;
> +
> +struct dcd_event_dyn_cap {
> +	struct cxl_event_record_hdr hdr;
> +	struct dcd_record_data data;
> +} __packed;
> +
>   struct cxl_mbox_get_partition_info {
>   	__le64 active_volatile_cap;
>   	__le64 active_persistent_cap;
> @@ -974,6 +1002,8 @@ int cxl_trigger_poison_list(struct cxl_memdev *cxlmd);
>   int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa);
>   int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa);
>   
> +void cxl_dc_extent_put(struct cxl_dc_extent_data *extent);
> +
>   #ifdef CONFIG_CXL_SUSPEND
>   void cxl_mem_active_inc(void);
>   void cxl_mem_active_dec(void);
> 

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 11/18] cxl/region: Expose DC extents on region driver load
  2023-08-29  5:21 ` [PATCH RFC v2 11/18] cxl/region: Expose DC extents on region driver load Ira Weiny
  2023-08-29 16:20   ` Jonathan Cameron
@ 2023-08-31 18:38   ` Dave Jiang
  2023-09-08 23:57     ` Ira Weiny
  1 sibling, 1 reply; 97+ messages in thread
From: Dave Jiang @ 2023-08-31 18:38 UTC (permalink / raw)
  To: Ira Weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel



On 8/28/23 22:21, Ira Weiny wrote:
> Ultimately user space must associate Dynamic Capacity (DC) extents with
> DAX devices.  Remember also that DCD extents may have been accepted
> previous to regions being created and must have references held until
> all higher level regions and DAX devices are done with the memory.
> 
> On CXL region driver load scan existing device extents and create CXL
> DAX region extents as needed.
> 
> Create abstractions for the extents to be used in DAX region.  This
> includes a generic interface to take proper references on the lower
> level CXL region extents.
> 
> Also maintain separate objects for the DAX region extent device vs the
> DAX region extent.  The DAX region extent device has a shorter life span
> which corresponds to the removal of an extent while a DAX device is
> still using it.  In this case an extent continues to exist whilst the
> ability to create new DAX devices on that extent is prevented.
> 
> NOTE: Without interleaving; the device, CXL region, and DAX region
> extents have a 1:1:1 relationship.  Future support for interleaving will
> maintain a 1:N relationship between CXL region extents and the hardware
> extents.
> 
> While the ability to create DAX devices on an extent exists; expose the
> necessary details of DAX region extents by creating a device with the
> following sysfs entries.
> 
> /sys/bus/cxl/devices/dax_regionX/extentY
> /sys/bus/cxl/devices/dax_regionX/extentY/length
> /sys/bus/cxl/devices/dax_regionX/extentY/label
> 
> Label is a rough analogy to the DC extent tag.  As such the DC extent
> tag is used to initially populate the label.  However, the label is made
> writeable so that it can be adjusted in the future when forming a DAX
> device.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes from v1
> [iweiny: move dax_region_extents to dax layer]
> [iweiny: adjust for kreference of extents]
> [iweiny: adjust naming to cxl_dr_extent]
> [iweiny: Remove region_extent xarray; use child devices instead]
> [iweiny: ensure dax region devices are destroyed on region destruction]
> [iweiny: use xa_insert]
> [iweiny: hpa_offset is a dr_extent parameter not an extent parameter]
> [iweiny: Add dc_region_extents when the region driver is loaded]
> ---
>   drivers/cxl/core/mbox.c   |  12 ++++
>   drivers/cxl/core/region.c | 179 ++++++++++++++++++++++++++++++++++++++++++++--
>   drivers/cxl/cxl.h         |  16 +++++
>   drivers/cxl/cxlmem.h      |   2 +
>   drivers/dax/Makefile      |   1 +
>   drivers/dax/cxl.c         | 101 ++++++++++++++++++++++++--
>   drivers/dax/dax-private.h |  53 ++++++++++++++
>   drivers/dax/extent.c      | 119 ++++++++++++++++++++++++++++++
>   8 files changed, 473 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 8474a28b16ca..5472ab1d0370 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1055,6 +1055,18 @@ static void dc_extent_release(struct kref *kref)
>   	kfree(extent);
>   }
>   
> +int __must_check cxl_dc_extent_get_not_zero(struct cxl_dc_extent_data *extent)
> +{
> +	return kref_get_unless_zero(&extent->region_ref);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dc_extent_get_not_zero, CXL);
> +
> +void cxl_dc_extent_get(struct cxl_dc_extent_data *extent)
> +{
> +	kref_get(&extent->region_ref);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dc_extent_get, CXL);
> +
>   void cxl_dc_extent_put(struct cxl_dc_extent_data *extent)
>   {
>   	kref_put(&extent->region_ref, dc_extent_release);
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index fc8dee469244..0aeea50550f6 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1547,6 +1547,122 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
>   	return 0;
>   }
>   
> +static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
> +				struct cxl_dc_extent_data *extent)
> +{
> +	struct range dpa_range = (struct range){
> +		.start = extent->dpa_start,
> +		.end = extent->dpa_start + extent->length - 1,
> +	};
> +	struct device *dev = &cxled->cxld.dev;
> +
> +	dev_dbg(dev, "Checking extent DPA:%llx LEN:%llx\n",
> +		extent->dpa_start, extent->length);
> +
> +	if (!cxled->cxld.region || !cxled->dpa_res)
> +		return false;
> +
> +	dev_dbg(dev, "Cxled start:%llx end:%llx\n",
> +		cxled->dpa_res->start, cxled->dpa_res->end);

Just use %pr?
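i.e. something like (untested; %pr prints a struct resource's range and flags in one go):

```c
dev_dbg(dev, "Cxled res: %pr\n", cxled->dpa_res);
```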

> +	return (cxled->dpa_res->start <= dpa_range.start &&
> +		dpa_range.end <= cxled->dpa_res->end);

It may be easier to read for some if you have (dpa_range.start >= 
cxled->dpa_res->start && ...) instead.

> +}
> +
> +static int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> +				 struct cxl_dc_extent_data *extent)
> +{
> +	struct cxl_dr_extent *cxl_dr_ext;
> +	struct cxl_dax_region *cxlr_dax;
> +	resource_size_t dpa_offset, hpa;
> +	struct range *ed_hpa_range;
> +	struct device *dev;
> +	int rc;
> +
> +	cxlr_dax = cxled->cxld.region->cxlr_dax;
> +	dev = &cxlr_dax->dev;
> +	dev_dbg(dev, "Adding DC extent DPA:%llx LEN:%llx\n",
> +		extent->dpa_start, extent->length);
> +
> +	/*
> +	 * Interleave ways == 1 means this corresponds to a 1:1 mapping between
> +	 * device extents and DAX region extents.  Future implementations
> +	 * should hold DC region extents here until the full dax region extent
> +	 * can be realized.
> +	 */
> +	if (cxlr_dax->cxlr->params.interleave_ways != 1) {
> +		dev_err(dev, "Interleaving DC not supported\n");
> +		return -EINVAL;
> +	}
> +
> +	cxl_dr_ext = kzalloc(sizeof(*cxl_dr_ext), GFP_KERNEL);
> +	if (!cxl_dr_ext)
> +		return -ENOMEM;
> +
> +	cxl_dr_ext->extent = extent;
> +	kref_init(&cxl_dr_ext->region_ref);
> +
> +	/*
> +	 * Without interleave...
> +	 * HPA offset == DPA offset
> +	 * ... but do the math anyway
> +	 */
> +	dpa_offset = extent->dpa_start - cxled->dpa_res->start;
> +	ed_hpa_range = &cxled->cxld.hpa_range;
> +	hpa = ed_hpa_range->start + dpa_offset;
> +	cxl_dr_ext->hpa_offset = hpa - cxlr_dax->hpa_range.start;
> +
> +	/* Without interleave carry length and label through */
> +	cxl_dr_ext->hpa_length = extent->length;
> +	snprintf(cxl_dr_ext->label, CXL_EXTENT_LABEL_LEN, "%s",
> +		 extent->tag);
> +
> +	dev_dbg(dev, "Inserting at HPA:%llx\n", cxl_dr_ext->hpa_offset);
> +	rc = xa_insert(&cxlr_dax->extents, cxl_dr_ext->hpa_offset, cxl_dr_ext,
> +		       GFP_KERNEL);
> +	if (rc) {
> +		dev_err(dev, "Failed to insert extent %d\n", rc);
> +		kfree(cxl_dr_ext);
> +		return rc;
> +	}
> +	/* Put in cxl_dr_release() */
> +	cxl_dc_extent_get(cxl_dr_ext->extent);
> +	return 0;
> +}
> +
> +static int cxl_ed_add_extents(struct cxl_endpoint_decoder *cxled)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	struct cxl_memdev_state *mds = container_of(cxlds,
> +						    struct cxl_memdev_state,
> +						    cxlds);
> +	struct device *dev = &cxled->cxld.dev;
> +	struct cxl_dc_extent_data *extent;
> +	unsigned long index;
> +
> +	dev_dbg(dev, "Searching for DC extents\n");
> +	xa_for_each(&mds->dc_extent_list, index, extent) {
> +		/*
> +		 * get not zero is important because this is racing with the
> +		 * memory device which could be removing the extent at the same
> +		 * time.
> +		 */
> +		if (cxl_dc_extent_get_not_zero(extent)) {
> +			int rc = 0;
> +
> +			if (cxl_dc_extent_in_ed(cxled, extent)) {
> +				dev_dbg(dev, "Found extent DPA:%llx LEN:%llx\n",
> +					extent->dpa_start, extent->length);
> +				rc = cxl_ed_add_one_extent(cxled, extent);
> +			}
> +			cxl_dc_extent_put(extent);
> +			if (rc)
> +				return rc;
> +		}
> +	}
> +	return 0;
> +}
> +
>   static int cxl_region_attach_position(struct cxl_region *cxlr,
>   				      struct cxl_root_decoder *cxlrd,
>   				      struct cxl_endpoint_decoder *cxled,
> @@ -2702,10 +2818,44 @@ static struct cxl_pmem_region *cxl_pmem_region_alloc(struct cxl_region *cxlr)
>   	return cxlr_pmem;
>   }
>   
> +int __must_check cxl_dr_extent_get_not_zero(struct cxl_dr_extent *cxl_dr_ext)
> +{
> +	return kref_get_unless_zero(&cxl_dr_ext->region_ref);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dr_extent_get_not_zero, CXL);
> +
> +void cxl_dr_extent_get(struct cxl_dr_extent *cxl_dr_ext)
> +{
> +	return kref_get(&cxl_dr_ext->region_ref);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dr_extent_get, CXL);
> +
> +static void cxl_dr_release(struct kref *kref)
> +{
> +	struct cxl_dr_extent *cxl_dr_ext = container_of(kref,
> +						struct cxl_dr_extent,
> +						region_ref);
> +
> +	cxl_dc_extent_put(cxl_dr_ext->extent);
> +	kfree(cxl_dr_ext);
> +}
> +
> +void cxl_dr_extent_put(struct cxl_dr_extent *cxl_dr_ext)
> +{
> +	kref_put(&cxl_dr_ext->region_ref, cxl_dr_release);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dr_extent_put, CXL);
> +
>   static void cxl_dax_region_release(struct device *dev)
>   {
>   	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
> +	struct cxl_dr_extent *cxl_dr_ext;
> +	unsigned long index;
>   
> +	xa_for_each(&cxlr_dax->extents, index, cxl_dr_ext) {
> +		xa_erase(&cxlr_dax->extents, index);
> +		cxl_dr_extent_put(cxl_dr_ext);
> +	}
>   	kfree(cxlr_dax);
>   }
>   
> @@ -2756,6 +2906,7 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
>   
>   	cxlr_dax->hpa_range.start = p->res->start;
>   	cxlr_dax->hpa_range.end = p->res->end;
> +	xa_init(&cxlr_dax->extents);
>   
>   	dev = &cxlr_dax->dev;
>   	cxlr_dax->cxlr = cxlr;
> @@ -2862,7 +3013,17 @@ static void cxlr_dax_unregister(void *_cxlr_dax)
>   	device_unregister(&cxlr_dax->dev);
>   }
>   
> -static int __devm_cxl_add_dax_region(struct cxl_region *cxlr)
> +static int cxl_region_add_dc_extents(struct cxl_region *cxlr)
> +{
> +	for (int i = 0; i < cxlr->params.nr_targets; i++) {
> +		int rc = cxl_ed_add_extents(cxlr->params.targets[i]);
> +		if (rc)
> +			return rc;
> +	}
> +	return 0;
> +}
> +
> +static int __devm_cxl_add_dax_region(struct cxl_region *cxlr, bool is_dc)
>   {
>   	struct cxl_dax_region *cxlr_dax;
>   	struct device *dev;
> @@ -2877,6 +3038,17 @@ static int __devm_cxl_add_dax_region(struct cxl_region *cxlr)
>   	if (rc)
>   		goto err;
>   
> +	cxlr->cxlr_dax = cxlr_dax;
> +	if (is_dc) {
> +		/*
> +		 * Process device extents prior to surfacing the device to
> +		 * ensure the cxl_dax_region driver has access to prior extents
> +		 */
> +		rc = cxl_region_add_dc_extents(cxlr);
> +		if (rc)
> +			goto err;
> +	}
> +
>   	rc = device_add(dev);
>   	if (rc)
>   		goto err;
> @@ -2893,7 +3065,7 @@ static int __devm_cxl_add_dax_region(struct cxl_region *cxlr)
>   
>   static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
>   {
> -	return __devm_cxl_add_dax_region(cxlr);
> +	return __devm_cxl_add_dax_region(cxlr, false);
>   }
>   
>   static int devm_cxl_add_dc_dax_region(struct cxl_region *cxlr)
> @@ -2902,8 +3074,7 @@ static int devm_cxl_add_dc_dax_region(struct cxl_region *cxlr)
>   		dev_err(&cxlr->dev, "Interleaving DC not supported\n");
>   		return -EINVAL;
>   	}
> -
> -	return __devm_cxl_add_dax_region(cxlr);
> +	return __devm_cxl_add_dax_region(cxlr, true);
>   }
>   
>   static int match_decoder_by_range(struct device *dev, void *data)
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 81ca76ae1d02..177b892ac53f 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -555,6 +555,7 @@ struct cxl_region_params {
>    * @type: Endpoint decoder target type
>    * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
>    * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
> + * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge
>    * @flags: Region state flags
>    * @params: active + config params for the region
>    */
> @@ -565,6 +566,7 @@ struct cxl_region {
>   	enum cxl_decoder_type type;
>   	struct cxl_nvdimm_bridge *cxl_nvb;
>   	struct cxl_pmem_region *cxlr_pmem;
> +	struct cxl_dax_region *cxlr_dax;
>   	unsigned long flags;
>   	struct cxl_region_params params;
>   };
> @@ -614,8 +616,22 @@ struct cxl_dax_region {
>   	struct device dev;
>   	struct cxl_region *cxlr;
>   	struct range hpa_range;
> +	struct xarray extents;
>   };
>   
> +/* Interleave will manage multiple cxl_dc_extent_data objects */
> +#define CXL_EXTENT_LABEL_LEN 64
> +struct cxl_dr_extent {
> +	struct kref region_ref;
> +	u64 hpa_offset;
> +	u64 hpa_length;
> +	char label[CXL_EXTENT_LABEL_LEN];
> +	struct cxl_dc_extent_data *extent;
> +};
> +int cxl_dr_extent_get_not_zero(struct cxl_dr_extent *cxl_dr_ext);
> +void cxl_dr_extent_get(struct cxl_dr_extent *cxl_dr_ext);
> +void cxl_dr_extent_put(struct cxl_dr_extent *cxl_dr_ext);
> +
>   /**
>    * struct cxl_port - logical collection of upstream port devices and
>    *		     downstream port devices to construct a CXL memory
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 118392229174..8ca81fd067c2 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -1002,6 +1002,8 @@ int cxl_trigger_poison_list(struct cxl_memdev *cxlmd);
>   int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa);
>   int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa);
>   
> +int cxl_dc_extent_get_not_zero(struct cxl_dc_extent_data *extent);
> +void cxl_dc_extent_get(struct cxl_dc_extent_data *extent);
>   void cxl_dc_extent_put(struct cxl_dc_extent_data *extent);
>   
>   #ifdef CONFIG_CXL_SUSPEND
> diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
> index 5ed5c39857c8..38cd3c4c0898 100644
> --- a/drivers/dax/Makefile
> +++ b/drivers/dax/Makefile
> @@ -7,6 +7,7 @@ obj-$(CONFIG_DEV_DAX_CXL) += dax_cxl.o
>   
>   dax-y := super.o
>   dax-y += bus.o
> +dax-y += extent.o
>   device_dax-y := device.o
>   dax_pmem-y := pmem.o
>   dax_cxl-y := cxl.o
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 147c8c69782b..057b00b1d914 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -5,6 +5,87 @@
>   
>   #include "../cxl/cxl.h"
>   #include "bus.h"
> +#include "dax-private.h"
> +
> +static void dax_reg_ext_get(struct dax_region_extent *dr_extent)
> +{
> +	kref_get(&dr_extent->ref);
> +}
> +
> +static void dr_release(struct kref *kref)
> +{
> +	struct dax_region_extent *dr_extent;
> +	struct cxl_dr_extent *cxl_dr_ext;
> +
> +	dr_extent = container_of(kref, struct dax_region_extent, ref);
> +	cxl_dr_ext = dr_extent->private_data;
> +	cxl_dr_extent_put(cxl_dr_ext);
> +	kfree(dr_extent);
> +}
> +
> +static void dax_reg_ext_put(struct dax_region_extent *dr_extent)
> +{
> +	kref_put(&dr_extent->ref, dr_release);
> +}
> +
> +static int cxl_dax_region_create_extent(struct dax_region *dax_region,
> +					struct cxl_dr_extent *cxl_dr_ext)
> +{
> +	struct dax_region_extent *dr_extent;
> +	int rc;
> +
> +	dr_extent = kzalloc(sizeof(*dr_extent), GFP_KERNEL);
> +	if (!dr_extent)
> +		return -ENOMEM;
> +
> +	dr_extent->private_data = cxl_dr_ext;
> +	dr_extent->get = dax_reg_ext_get;
> +	dr_extent->put = dax_reg_ext_put;
> +
> +	/* device manages the dr_extent on success */
> +	kref_init(&dr_extent->ref);
> +
> +	rc = dax_region_ext_create_dev(dax_region, dr_extent,
> +				       cxl_dr_ext->hpa_offset,
> +				       cxl_dr_ext->hpa_length,
> +				       cxl_dr_ext->label);
> +	if (rc) {
> +		kfree(dr_extent);
> +		return rc;
> +	}
> +
> +	/* extent accepted */
> +	cxl_dr_extent_get(cxl_dr_ext);
> +	return 0;
> +}
> +
> +static int cxl_dax_region_create_extents(struct cxl_dax_region *cxlr_dax)
> +{
> +	struct cxl_dr_extent *cxl_dr_ext;
> +	unsigned long index;
> +
> +	dev_dbg(&cxlr_dax->dev, "Adding extents\n");
> +	xa_for_each(&cxlr_dax->extents, index, cxl_dr_ext) {
> +		/*
> +		 * get not zero is important because this is racing with the
> +		 * region driver which is racing with the memory device which
> +		 * could be removing the extent at the same time.
> +		 */
> +		if (cxl_dr_extent_get_not_zero(cxl_dr_ext)) {
> +			struct dax_region *dax_region;
> +			int rc;
> +
> +			dax_region = dev_get_drvdata(&cxlr_dax->dev);
> +			dev_dbg(&cxlr_dax->dev, "Found OFF:%llx LEN:%llx\n",
> +				cxl_dr_ext->hpa_offset, cxl_dr_ext->hpa_length);
> +			rc = cxl_dax_region_create_extent(dax_region, cxl_dr_ext);
> +			cxl_dr_extent_put(cxl_dr_ext);
> +			if (rc)
> +				return rc;
> +		}
> +	}
> +	return 0;
> +}
>   
>   static int cxl_dax_region_probe(struct device *dev)
>   {
> @@ -19,20 +100,28 @@ static int cxl_dax_region_probe(struct device *dev)
>   	if (nid == NUMA_NO_NODE)
>   		nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
>   
> -	dev_size = range_len(&cxlr_dax->hpa_range);
> -
>   	flags = IORESOURCE_DAX_KMEM;
> -	if (cxlr->mode == CXL_REGION_DC) {
> -		/* Add empty seed dax device */
> -		dev_size = 0;
> +	if (cxlr->mode == CXL_REGION_DC)
>   		flags |= IORESOURCE_DAX_DYNAMIC_CAP;
> -	}
>   
>   	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
>   				      PMD_SIZE, flags);
>   	if (!dax_region)
>   		return -ENOMEM;
>   
> +	dev_size = range_len(&cxlr_dax->hpa_range);
> +	if (cxlr->mode == CXL_REGION_DC) {
> +		int rc;
> +
> +		/* NOTE: Depends on dax_region being set in driver data */
> +		rc = cxl_dax_region_create_extents(cxlr_dax);
> +		if (rc)
> +			return rc;
> +
> +		/* Add empty seed dax device */
> +		dev_size = 0;
> +	}
> +
>   	data = (struct dev_dax_data) {
>   		.dax_region = dax_region,
>   		.id = -1,
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index 27cf2daaaa79..4dab52496c3f 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -5,6 +5,7 @@
>   #ifndef __DAX_PRIVATE_H__
>   #define __DAX_PRIVATE_H__
>   
> +#include <linux/pgtable.h>
>   #include <linux/device.h>
>   #include <linux/cdev.h>
>   #include <linux/idr.h>
> @@ -40,6 +41,58 @@ struct dax_region {
>   	struct device *youngest;
>   };
>   
> +/*
> + * struct dax_region_extent - extent data defined by the low level region
> + * driver.
> + * @private_data: lower level region driver data
> + * @ref: track number of dax devices which are using this extent
> + * @get: get reference to low level data
> + * @put: put reference to low level data
> + */
> +struct dax_region_extent {
> +	void *private_data;
> +	struct kref ref;
> +	void (*get)(struct dax_region_extent *dr_extent);
> +	void (*put)(struct dax_region_extent *dr_extent);
> +};
> +
> +static inline void dr_extent_get(struct dax_region_extent *dr_extent)
> +{
> +	if (dr_extent->get)
> +		dr_extent->get(dr_extent);
> +}
> +
> +static inline void dr_extent_put(struct dax_region_extent *dr_extent)
> +{
> +	if (dr_extent->put)
> +		dr_extent->put(dr_extent);
> +}
> +
> +#define DAX_EXTENT_LABEL_LEN 64
> +/**
> + * struct dax_reg_ext_dev - Device object to expose extent information
> + * @dev: device representing this extent
> + * @dr_extent: reference back to private extent data
> + * @offset: offset of this extent
> + * @length: size of this extent
> + * @label: identifier to group extents
> + */
> +struct dax_reg_ext_dev {
> +	struct device dev;
> +	struct dax_region_extent *dr_extent;
> +	resource_size_t offset;
> +	resource_size_t length;
> +	char label[DAX_EXTENT_LABEL_LEN];
> +};
> +
> +int dax_region_ext_create_dev(struct dax_region *dax_region,
> +			      struct dax_region_extent *dr_extent,
> +			      resource_size_t offset,
> +			      resource_size_t length,
> +			      const char *label);
> +#define to_dr_ext_dev(dev)	\
> +	container_of(dev, struct dax_reg_ext_dev, dev)
> +
>   struct dax_mapping {
>   	struct device dev;
>   	int range_id;


This is a rather large patch. Can the code below be broken out into a 
separate patch?

> diff --git a/drivers/dax/extent.c b/drivers/dax/extent.c
> new file mode 100644
> index 000000000000..2075ccfb21cb
> --- /dev/null
> +++ b/drivers/dax/extent.c
> @@ -0,0 +1,119 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2023 Intel Corporation. All rights reserved. */
> +
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include "dax-private.h"
> +
> +static ssize_t length_show(struct device *dev, struct device_attribute *attr,
> +			 char *buf)
> +{
> +	struct dax_reg_ext_dev *dr_reg_ext_dev = to_dr_ext_dev(dev);
> +
> +	return sysfs_emit(buf, "%#llx\n", dr_reg_ext_dev->length);
> +}
> +static DEVICE_ATTR_RO(length);
> +
> +static ssize_t label_show(struct device *dev, struct device_attribute *attr,
> +			  char *buf)
> +{
> +	struct dax_reg_ext_dev *dr_reg_ext_dev = to_dr_ext_dev(dev);
> +
> +	return sysfs_emit(buf, "%s\n", dr_reg_ext_dev->label);
> +}
> +
> +static ssize_t label_store(struct device *dev, struct device_attribute *attr,
> +			   const char *buf, size_t len)
> +{
> +	struct dax_reg_ext_dev *dr_reg_ext_dev = to_dr_ext_dev(dev);
> +
> +	snprintf(dr_reg_ext_dev->label, DAX_EXTENT_LABEL_LEN, "%s", buf);
> +	return len;
> +}
> +static DEVICE_ATTR_RW(label);
> +
> +static struct attribute *dr_extent_attrs[] = {
> +	&dev_attr_length.attr,
> +	&dev_attr_label.attr,
> +	NULL,
> +};
> +
> +static const struct attribute_group dr_extent_attribute_group = {
> +	.attrs = dr_extent_attrs,
> +};
> +
> +static void dr_extent_release(struct device *dev)
> +{
> +	struct dax_reg_ext_dev *dr_reg_ext_dev = to_dr_ext_dev(dev);
> +
> +	kfree(dr_reg_ext_dev);
> +}
> +
> +static const struct attribute_group *dr_extent_attribute_groups[] = {
> +	&dr_extent_attribute_group,
> +	NULL,
> +};
> +
> +const struct device_type dr_extent_type = {
> +	.name = "extent",
> +	.release = dr_extent_release,
> +	.groups = dr_extent_attribute_groups,
> +};
> +
> +static void unregister_dr_extent(void *ext)
> +{
> +	struct dax_reg_ext_dev *dr_reg_ext_dev = ext;
> +	struct dax_region_extent *dr_extent;
> +
> +	dr_extent = dr_reg_ext_dev->dr_extent;
> +	dev_dbg(&dr_reg_ext_dev->dev, "Unregister DAX region ext OFF:%llx L:%s\n",
> +		dr_reg_ext_dev->offset, dr_reg_ext_dev->label);
> +	dr_extent_put(dr_extent);
> +	device_unregister(&dr_reg_ext_dev->dev);
> +}
> +
> +int dax_region_ext_create_dev(struct dax_region *dax_region,
> +			      struct dax_region_extent *dr_extent,
> +			      resource_size_t offset,
> +			      resource_size_t length,
> +			      const char *label)
> +{
> +	struct dax_reg_ext_dev *dr_reg_ext_dev;
> +	struct device *dev;
> +	int rc;
> +
> +	dr_reg_ext_dev = kzalloc(sizeof(*dr_reg_ext_dev), GFP_KERNEL);
> +	if (!dr_reg_ext_dev)
> +		return -ENOMEM;
> +
> +	dr_reg_ext_dev->dr_extent = dr_extent;
> +	dr_reg_ext_dev->offset = offset;
> +	dr_reg_ext_dev->length = length;
> +	snprintf(dr_reg_ext_dev->label, DAX_EXTENT_LABEL_LEN, "%s", label);
> +
> +	dev = &dr_reg_ext_dev->dev;
> +	device_initialize(dev);
> +	dev->id = offset / PMD_SIZE ;
> +	device_set_pm_not_required(dev);
> +	dev->parent = dax_region->dev;
> +	dev->type = &dr_extent_type;
> +	rc = dev_set_name(dev, "extent%d", dev->id);
> +	if (rc)
> +		goto err;
> +
> +	rc = device_add(dev);
> +	if (rc)
> +		goto err;
> +
> +	dev_dbg(dev, "DAX region extent OFF:%llx LEN:%llx\n",
> +		dr_reg_ext_dev->offset, dr_reg_ext_dev->length);
> +	return devm_add_action_or_reset(dax_region->dev, unregister_dr_extent,
> +					dr_reg_ext_dev);
> +
> +err:
> +	dev_err(dev, "Failed to initialize DAX extent dev OFF:%llx LEN:%llx\n",
> +		dr_reg_ext_dev->offset, dr_reg_ext_dev->length);
> +	put_device(dev);
> +	return rc;
> +}
> +EXPORT_SYMBOL_GPL(dax_region_ext_create_dev);
> 


* Re: [PATCH RFC v2 13/18] dax/bus: Factor out dev dax resize logic
  2023-08-29  5:21 ` [PATCH RFC v2 13/18] dax/bus: Factor out dev dax resize logic Ira Weiny
  2023-08-30 11:27   ` Jonathan Cameron
@ 2023-08-31 21:48   ` Dave Jiang
  1 sibling, 0 replies; 97+ messages in thread
From: Dave Jiang @ 2023-08-31 21:48 UTC (permalink / raw)
  To: Ira Weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel



On 8/28/23 22:21, Ira Weiny wrote:
> Dynamic Capacity regions must limit dev dax resources to those areas
> which have extents backing real memory.  Four alternatives were
> considered to manage the intersection of region space and extents:
> 
> 1) Create a single region resource child on region creation which
>     reserves the entire region.  Then as extents are added punch holes in
>     this reservation.  This requires new resource manipulation to punch
>     the holes and still requires an additional iteration over the extent
>     areas which may already have existing dev dax resources used.
> 
> 2) Maintain an ordered xarray of extents which can be queried while
>     processing the resize logic.  The issue is that existing region->res
>     children may artificially limit the allocation size sent to
>     alloc_dev_dax_range().  IE the resource children can't be directly
>     used in the resize logic to find where space in the region is.
> 
> 3) Maintain a separate resource tree with extents.  This option is the
>     same as 2) but with a different data structure.  Most ideally we have
>     some unified representation of the resource tree.
> 
> 4) Create region resource children for each extent.  Manage the dax dev
>     resize logic in the same way as before but use a region child
>     (extent) resource as the parents to find space within each extent.
> 
> Option 4 can leverage the existing resize algorithm to find space within
> the extents.
> 
> In preparation for this change, factor out the dev_dax_resize logic.
> For static regions use dax_region->res as the parent to find space for
> the dax ranges.  Future patches will use the same algorithm with
> individual extent resources as the parent.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>   drivers/dax/bus.c | 128 +++++++++++++++++++++++++++++++++---------------------
>   1 file changed, 79 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index b76e49813a39..ea7ae82b4687 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -817,11 +817,10 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
>   	return 0;
>   }
>   
> -static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
> -		resource_size_t size)
> +static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
> +			       u64 start, resource_size_t size)
>   {
>   	struct dax_region *dax_region = dev_dax->region;
> -	struct resource *res = &dax_region->res;
>   	struct device *dev = &dev_dax->dev;
>   	struct dev_dax_range *ranges;
>   	unsigned long pgoff = 0;
> @@ -839,14 +838,14 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
>   		return 0;
>   	}
>   
> -	alloc = __request_region(res, start, size, dev_name(dev), 0);
> +	alloc = __request_region(parent, start, size, dev_name(dev), 0);
>   	if (!alloc)
>   		return -ENOMEM;
>   
>   	ranges = krealloc(dev_dax->ranges, sizeof(*ranges)
>   			* (dev_dax->nr_range + 1), GFP_KERNEL);
>   	if (!ranges) {
> -		__release_region(res, alloc->start, resource_size(alloc));
> +		__release_region(parent, alloc->start, resource_size(alloc));
>   		return -ENOMEM;
>   	}
>   
> @@ -997,50 +996,45 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
>   	return true;
>   }
>   
> -static ssize_t dev_dax_resize(struct dax_region *dax_region,
> -		struct dev_dax *dev_dax, resource_size_t size)
> +/*
> + * dev_dax_resize_static - Expand the device into the unused portion of the
> + * region. This may involve adjusting the end of an existing resource, or
> + * allocating a new resource.
> + *
> + * @parent: parent resource to allocate this range in.
> + * @dev_dax: DAX device we are creating this range for
> + * @to_alloc: amount of space to alloc; must be <= space available in @parent
> + *
> + * Return the amount of space allocated or -ERRNO on failure
> + */
> +static ssize_t dev_dax_resize_static(struct resource *parent,
> +				     struct dev_dax *dev_dax,
> +				     resource_size_t to_alloc)
>   {
> -	resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
> -	resource_size_t dev_size = dev_dax_size(dev_dax);
> -	struct resource *region_res = &dax_region->res;
> -	struct device *dev = &dev_dax->dev;
>   	struct resource *res, *first;
> -	resource_size_t alloc = 0;
>   	int rc;
>   
> -	if (dev->driver)
> -		return -EBUSY;
> -	if (size == dev_size)
> -		return 0;
> -	if (size > dev_size && size - dev_size > avail)
> -		return -ENOSPC;
> -	if (size < dev_size)
> -		return dev_dax_shrink(dev_dax, size);
> -
> -	to_alloc = size - dev_size;
> -	if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
> -			"resize of %pa misaligned\n", &to_alloc))
> -		return -ENXIO;
> -
> -	/*
> -	 * Expand the device into the unused portion of the region. This
> -	 * may involve adjusting the end of an existing resource, or
> -	 * allocating a new resource.
> -	 */
> -retry:
> -	first = region_res->child;
> -	if (!first)
> -		return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
> +	first = parent->child;
> +	if (!first) {
> +		rc = alloc_dev_dax_range(parent, dev_dax,
> +					   parent->start, to_alloc);
> +		if (rc)
> +			return rc;
> +		return to_alloc;
> +	}
>   
> -	rc = -ENOSPC;
>   	for (res = first; res; res = res->sibling) {
>   		struct resource *next = res->sibling;
> +		resource_size_t alloc;
>   
>   		/* space at the beginning of the region */
> -		if (res == first && res->start > dax_region->res.start) {
> -			alloc = min(res->start - dax_region->res.start, to_alloc);
> -			rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, alloc);
> -			break;
> +		if (res == first && res->start > parent->start) {
> +			alloc = min(res->start - parent->start, to_alloc);
> +			rc = alloc_dev_dax_range(parent, dev_dax,
> +						 parent->start, alloc);
> +			if (rc)
> +				return rc;
> +			return alloc;
>   		}
>   
>   		alloc = 0;
> @@ -1049,21 +1043,55 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
>   			alloc = min(next->start - (res->end + 1), to_alloc);
>   
>   		/* space at the end of the region */
> -		if (!alloc && !next && res->end < region_res->end)
> -			alloc = min(region_res->end - res->end, to_alloc);
> +		if (!alloc && !next && res->end < parent->end)
> +			alloc = min(parent->end - res->end, to_alloc);
>   
>   		if (!alloc)
>   			continue;
>   
>   		if (adjust_ok(dev_dax, res)) {
>   			rc = adjust_dev_dax_range(dev_dax, res, resource_size(res) + alloc);
> -			break;
> +			if (rc)
> +				return rc;
> +			return alloc;
>   		}
> -		rc = alloc_dev_dax_range(dev_dax, res->end + 1, alloc);
> -		break;
> +		rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc);
> +		if (rc)
> +			return rc;
> +		return alloc;
>   	}
> -	if (rc)
> -		return rc;
> +
> +	/* available was already calculated and should never be an issue */
> +	dev_WARN_ONCE(&dev_dax->dev, 1, "space not found?");
> +	return 0;
> +}
> +
> +static ssize_t dev_dax_resize(struct dax_region *dax_region,
> +		struct dev_dax *dev_dax, resource_size_t size)
> +{
> +	resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
> +	resource_size_t dev_size = dev_dax_size(dev_dax);
> +	struct device *dev = &dev_dax->dev;
> +	ssize_t alloc = 0;
> +
> +	if (dev->driver)
> +		return -EBUSY;
> +	if (size == dev_size)
> +		return 0;
> +	if (size > dev_size && size - dev_size > avail)
> +		return -ENOSPC;
> +	if (size < dev_size)
> +		return dev_dax_shrink(dev_dax, size);
> +
> +	to_alloc = size - dev_size;
> +	if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
> +			"resize of %pa misaligned\n", &to_alloc))
> +		return -ENXIO;
> +
> +retry:
> +	alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
> +	if (alloc <= 0)
> +		return alloc;
>   	to_alloc -= alloc;
>   	if (to_alloc)
>   		goto retry;
> @@ -1154,7 +1182,8 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
>   
>   	to_alloc = range_len(&r);
>   	if (alloc_is_aligned(dev_dax, to_alloc))
> -		rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
> +		rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
> +					 to_alloc);
>   	device_unlock(dev);
>   	device_unlock(dax_region->dev);
>   
> @@ -1371,7 +1400,8 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
>   	device_initialize(dev);
>   	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
>   
> -	rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size);
> +	rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
> +				 data->size);
>   	if (rc)
>   		goto err_range;
>   
> 

^ permalink raw reply	[flat|nested] 97+ messages in thread
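[Editorial note: the resize refactor above reuses the existing first-fit gap search, now parameterized by a parent resource. A stand-alone sketch of that search over a sorted list of non-overlapping child ranges follows; this is a simplified model with invented names, not the kernel's `struct resource` tree or the actual `dev_dax_resize_static()` code.]

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive [start, end] ranges, sorted, non-overlapping. */
struct range { unsigned long long start, end; };

/*
 * Find the first free gap inside parent [p_start, p_end] not covered by
 * the child ranges.  Returns the usable length (clamped to `want`) and
 * stores the gap start, or returns 0 if the parent is fully occupied.
 * Mirrors the "space at the beginning / between children / at the end"
 * cases in the patch above.
 */
static unsigned long long find_gap(unsigned long long p_start,
				   unsigned long long p_end,
				   const struct range *kids, size_t n,
				   unsigned long long want,
				   unsigned long long *gap_start)
{
	unsigned long long cursor = p_start;

	for (size_t i = 0; i < n; i++) {
		if (kids[i].start > cursor) {	/* hole before this child */
			unsigned long long len = kids[i].start - cursor;

			*gap_start = cursor;
			return len < want ? len : want;
		}
		cursor = kids[i].end + 1;
	}
	if (cursor <= p_end) {			/* space at the end */
		unsigned long long len = p_end - cursor + 1;

		*gap_start = cursor;
		return len < want ? len : want;
	}
	return 0;
}
```

A caller allocating `want` bytes would, as in the patch's retry loop, repeat the search after each successful allocation until the request is satisfied or no gap remains.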

* Re: [PATCH RFC v2 17/18] tools/testing/cxl: Add DC Regions to mock mem data
  2023-08-29  5:21 ` [PATCH RFC v2 17/18] tools/testing/cxl: Add DC Regions to mock mem data Ira Weiny
  2023-08-30 12:20   ` Jonathan Cameron
@ 2023-08-31 23:19   ` Dave Jiang
  1 sibling, 0 replies; 97+ messages in thread
From: Dave Jiang @ 2023-08-31 23:19 UTC (permalink / raw)
  To: Ira Weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel



On 8/28/23 22:21, Ira Weiny wrote:
> To test DC regions the mock memory devices will need to store
> information about the regions and manage fake extent data.
> 
> Define mock_dc_region information within the mock memory data.  Add
> sysfs entries on the mock device to inject and delete extents.
> 
> The inject format is <start>:<length>:<tag>
> The delete format is <start>
> 
> Add DC mailbox commands to the CEL and implement those commands.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>   tools/testing/cxl/test/mem.c | 449 +++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 449 insertions(+)
> 
> diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
> index 6a036c8d215d..d6041a2145c5 100644
> --- a/tools/testing/cxl/test/mem.c
> +++ b/tools/testing/cxl/test/mem.c
> @@ -18,6 +18,7 @@
>   #define FW_SLOTS 3
>   #define DEV_SIZE SZ_2G
>   #define EFFECT(x) (1U << x)
> +#define BASE_DYNAMIC_CAP_DPA DEV_SIZE
>   
>   #define MOCK_INJECT_DEV_MAX 8
>   #define MOCK_INJECT_TEST_MAX 128
> @@ -89,6 +90,22 @@ static struct cxl_cel_entry mock_cel[] = {
>   		.effect = cpu_to_le16(EFFECT(CONF_CHANGE_COLD_RESET) |
>   				      EFFECT(CONF_CHANGE_IMMEDIATE)),
>   	},
> +	{
> +		.opcode = cpu_to_le16(CXL_MBOX_OP_GET_DC_CONFIG),
> +		.effect = CXL_CMD_EFFECT_NONE,
> +	},
> +	{
> +		.opcode = cpu_to_le16(CXL_MBOX_OP_GET_DC_EXTENT_LIST),
> +		.effect = CXL_CMD_EFFECT_NONE,
> +	},
> +	{
> +		.opcode = cpu_to_le16(CXL_MBOX_OP_ADD_DC_RESPONSE),
> +		.effect = cpu_to_le16(EFFECT(CONF_CHANGE_IMMEDIATE)),
> +	},
> +	{
> +		.opcode = cpu_to_le16(CXL_MBOX_OP_RELEASE_DC),
> +		.effect = cpu_to_le16(EFFECT(CONF_CHANGE_IMMEDIATE)),
> +	},
>   };
>   
>   /* See CXL 2.0 Table 181 Get Health Info Output Payload */
> @@ -147,6 +164,7 @@ struct mock_event_store {
>   	u32 ev_status;
>   };
>   
> +#define NUM_MOCK_DC_REGIONS 2
>   struct cxl_mockmem_data {
>   	void *lsa;
>   	void *fw;
> @@ -161,6 +179,10 @@ struct cxl_mockmem_data {
>   	struct mock_event_store mes;
>   	u8 event_buf[SZ_4K];
>   	u64 timestamp;
> +	struct cxl_dc_region_config dc_regions[NUM_MOCK_DC_REGIONS];
> +	u32 dc_ext_generation;
> +	struct xarray dc_extents;
> +	struct xarray dc_accepted_exts;
>   };
>   
>   static struct mock_event_log *event_find_log(struct device *dev, int log_type)
> @@ -529,6 +551,98 @@ static void cxl_mock_event_trigger(struct device *dev)
>   	cxl_mem_get_event_records(mes->mds, mes->ev_status);
>   }
>   
> +static int devm_add_extent(struct device *dev, u64 start, u64 length,
> +			   const char *tag)
> +{
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +	struct cxl_dc_extent_data *extent;
> +
> +	extent = devm_kzalloc(dev, sizeof(*extent), GFP_KERNEL);
> +	if (!extent) {
> +		dev_dbg(dev, "Failed to allocate extent\n");
> +		return -ENOMEM;
> +	}
> +	extent->dpa_start = start;
> +	extent->length = length;
> +	memcpy(extent->tag, tag, min(sizeof(extent->tag), strlen(tag)));
> +
> +	if (xa_insert(&mdata->dc_extents, start, extent, GFP_KERNEL)) {
> +		devm_kfree(dev, extent);
> +		dev_err(dev, "Failed xarray insert %llx\n", start);
> +		return -EINVAL;
> +	}
> +	mdata->dc_ext_generation++;
> +
> +	return 0;
> +}
> +
> +static int dc_accept_extent(struct device *dev, u64 start)
> +{
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +
> +	dev_dbg(dev, "Accepting extent 0x%llx\n", start);
> +	return xa_insert(&mdata->dc_accepted_exts, start, (void *)start,
> +			 GFP_KERNEL);
> +}
> +
> +static void release_dc_ext(void *md)
> +{
> +	struct cxl_mockmem_data *mdata = md;
> +
> +	xa_destroy(&mdata->dc_extents);
> +	xa_destroy(&mdata->dc_accepted_exts);
> +}
> +
> +static int cxl_mock_dc_region_setup(struct device *dev)
> +{
> +#define DUMMY_EXT_OFFSET SZ_256M
> +#define DUMMY_EXT_LENGTH SZ_256M
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +	u64 base_dpa = BASE_DYNAMIC_CAP_DPA;
> +	u32 dsmad_handle = 0xFADE;
> +	u64 decode_length = SZ_2G;
> +	u64 block_size = SZ_512;
> +	/* For testing make this smaller than decode length */
> +	u64 length = SZ_1G;
> +	int rc;
> +
> +	xa_init(&mdata->dc_extents);
> +	xa_init(&mdata->dc_accepted_exts);
> +
> +	rc = devm_add_action_or_reset(dev, release_dc_ext, mdata);
> +	if (rc)
> +		return rc;
> +
> +	for (int i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
> +		struct cxl_dc_region_config *conf = &mdata->dc_regions[i];
> +
> +		dev_dbg(dev, "Creating DC region DC%d DPA:%llx LEN:%llx\n",
> +			i, base_dpa, length);
> +
> +		conf->region_base = cpu_to_le64(base_dpa);
> +		conf->region_decode_length = cpu_to_le64(decode_length /
> +						CXL_CAPACITY_MULTIPLIER);
> +		conf->region_length = cpu_to_le64(length);
> +		conf->region_block_size = cpu_to_le64(block_size);
> +		conf->region_dsmad_handle = cpu_to_le32(dsmad_handle);
> +		dsmad_handle++;
> +
> +		/* Pretend we have some previous accepted extents */
> +		rc = devm_add_extent(dev, base_dpa + DUMMY_EXT_OFFSET,
> +				     DUMMY_EXT_LENGTH, "CXL-TEST");
> +		if (rc)
> +			return rc;
> +
> +		rc = dc_accept_extent(dev, base_dpa + DUMMY_EXT_OFFSET);
> +		if (rc)
> +			return rc;
> +
> +		base_dpa += decode_length;
> +	}
> +
> +	return 0;
> +}
> +
>   static int mock_gsl(struct cxl_mbox_cmd *cmd)
>   {
>   	if (cmd->size_out < sizeof(mock_gsl_payload))
> @@ -1315,6 +1429,148 @@ static int mock_activate_fw(struct cxl_mockmem_data *mdata,
>   	return -EINVAL;
>   }
>   
> +static int mock_get_dc_config(struct device *dev,
> +			      struct cxl_mbox_cmd *cmd)
> +{
> +	struct cxl_mbox_get_dc_config *dc_config = cmd->payload_in;
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +	u8 region_requested, region_start_idx, region_ret_cnt;
> +	struct cxl_mbox_dynamic_capacity *resp;
> +
> +	region_requested = dc_config->region_count;
> +	if (NUM_MOCK_DC_REGIONS < region_requested)
> +		region_requested = NUM_MOCK_DC_REGIONS;
> +
> +	if (cmd->size_out < struct_size(resp, region, region_requested))
> +		return -EINVAL;
> +
> +	memset(cmd->payload_out, 0, cmd->size_out);
> +	resp = cmd->payload_out;
> +
> +	region_start_idx = dc_config->start_region_index;
> +	region_ret_cnt = 0;
> +	for (int i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
> +		if (i >= region_start_idx) {
> +			memcpy(&resp->region[region_ret_cnt],
> +				&mdata->dc_regions[i],
> +				sizeof(resp->region[region_ret_cnt]));
> +			region_ret_cnt++;
> +		}
> +	}
> +	resp->avail_region_count = region_ret_cnt;
> +
> +	dev_dbg(dev, "Returning %d dc regions\n", region_ret_cnt);
> +	return 0;
> +}
> +
> +static int mock_get_dc_extent_list(struct device *dev,
> +				   struct cxl_mbox_cmd *cmd)
> +{
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +	struct cxl_mbox_get_dc_extent *get = cmd->payload_in;
> +	struct cxl_mbox_dc_extents *resp = cmd->payload_out;
> +	u32 total_avail = 0, total_ret = 0;
> +	struct cxl_dc_extent_data *ext;
> +	u32 ext_count, start_idx, ext_idx = 0;
> +	unsigned long i;
> +
> +	ext_count = le32_to_cpu(get->extent_cnt);
> +	start_idx = le32_to_cpu(get->start_extent_index);
> +
> +	memset(resp, 0, sizeof(*resp));
> +
> +	/*
> +	 * Total available needs to be calculated and returned regardless of
> +	 * how many can actually be returned.
> +	 */
> +	xa_for_each(&mdata->dc_extents, i, ext)
> +		total_avail++;
> +
> +	if (start_idx > total_avail)
> +		return -EINVAL;
> +
> +	xa_for_each(&mdata->dc_extents, i, ext) {
> +		if (total_ret >= ext_count)
> +			break;
> +
> +		/* Skip extents prior to the requested start index */
> +		if (ext_idx++ >= start_idx) {
> +			resp->extent[total_ret].start_dpa =
> +						cpu_to_le64(ext->dpa_start);
> +			resp->extent[total_ret].length =
> +						cpu_to_le64(ext->length);
> +			memcpy(&resp->extent[total_ret].tag, ext->tag,
> +					sizeof(resp->extent[total_ret].tag));
> +			resp->extent[total_ret].shared_extn_seq =
> +					cpu_to_le16(ext->shared_extent_seq);
> +			total_ret++;
> +		}
> +	}
> +
> +	resp->ret_extent_cnt = cpu_to_le32(total_ret);
> +	resp->total_extent_cnt = cpu_to_le32(total_avail);
> +	resp->extent_list_num = cpu_to_le32(mdata->dc_ext_generation);
> +
> +	dev_dbg(dev, "Returning %d extents of %d total\n",
> +		total_ret, total_avail);
> +
> +	return 0;
> +}
> +
> +static int mock_add_dc_response(struct device *dev,
> +				struct cxl_mbox_cmd *cmd)
> +{
> +	struct cxl_mbox_dc_response *req = cmd->payload_in;
> +	u32 list_size = le32_to_cpu(req->extent_list_size);
> +
> +	for (int i = 0; i < list_size; i++) {
> +		u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
> +		int rc;
> +
> +		dev_dbg(dev, "Extent 0x%llx accepted by HOST\n", start);
> +		rc = dc_accept_extent(dev, start);
> +		if (rc)
> +			return rc;
> +	}
> +
> +	return 0;
> +}
> +
> +static int dc_delete_extent(struct device *dev, unsigned long long start)
> +{
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +	void *ext;
> +
> +	dev_dbg(dev, "Deleting extent at %llx\n", start);
> +
> +	ext = xa_erase(&mdata->dc_extents, start);
> +	if (!ext) {
> +		dev_err(dev, "No extent found at %llx\n", start);
> +		return -EINVAL;
> +	}
> +	devm_kfree(dev, ext);
> +	mdata->dc_ext_generation++;
> +
> +	return 0;
> +}
> +
> +static int mock_dc_release(struct device *dev,
> +			   struct cxl_mbox_cmd *cmd)
> +{
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +	struct cxl_mbox_dc_response *req = cmd->payload_in;
> +	u32 list_size = le32_to_cpu(req->extent_list_size);
> +
> +	for (int i = 0; i < list_size; i++) {
> +		u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
> +
> +		dev_dbg(dev, "Extent 0x%llx released by HOST\n", start);
> +		xa_erase(&mdata->dc_accepted_exts, start);
> +	}
> +
> +	return 0;
> +}
> +
>   static int cxl_mock_mbox_send(struct cxl_memdev_state *mds,
>   			      struct cxl_mbox_cmd *cmd)
>   {
> @@ -1399,6 +1655,18 @@ static int cxl_mock_mbox_send(struct cxl_memdev_state *mds,
>   	case CXL_MBOX_OP_ACTIVATE_FW:
>   		rc = mock_activate_fw(mdata, cmd);
>   		break;
> +	case CXL_MBOX_OP_GET_DC_CONFIG:
> +		rc = mock_get_dc_config(dev, cmd);
> +		break;
> +	case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
> +		rc = mock_get_dc_extent_list(dev, cmd);
> +		break;
> +	case CXL_MBOX_OP_ADD_DC_RESPONSE:
> +		rc = mock_add_dc_response(dev, cmd);
> +		break;
> +	case CXL_MBOX_OP_RELEASE_DC:
> +		rc = mock_dc_release(dev, cmd);
> +		break;
>   	default:
>   		break;
>   	}
> @@ -1467,6 +1735,10 @@ static int __cxl_mock_mem_probe(struct platform_device *pdev)
>   		return -ENOMEM;
>   	dev_set_drvdata(dev, mdata);
>   
> +	rc = cxl_mock_dc_region_setup(dev);
> +	if (rc)
> +		return rc;
> +
>   	mdata->lsa = vmalloc(LSA_SIZE);
>   	if (!mdata->lsa)
>   		return -ENOMEM;
> @@ -1515,6 +1787,10 @@ static int __cxl_mock_mem_probe(struct platform_device *pdev)
>   	if (rc)
>   		return rc;
>   
> +	rc = cxl_dev_dynamic_capacity_identify(mds);
> +	if (rc)
> +		return rc;
> +
>   	rc = cxl_mem_create_range_info(mds);
>   	if (rc)
>   		return rc;
> @@ -1528,6 +1804,10 @@ static int __cxl_mock_mem_probe(struct platform_device *pdev)
>   	if (IS_ERR(cxlmd))
>   		return PTR_ERR(cxlmd);
>   
> +	rc = cxl_dev_get_dynamic_capacity_extents(mds);
> +	if (rc)
> +		return rc;
> +
>   	rc = cxl_memdev_setup_fw_upload(mds);
>   	if (rc)
>   		return rc;
> @@ -1669,10 +1949,179 @@ static ssize_t fw_buf_checksum_show(struct device *dev,
>   
>   static DEVICE_ATTR_RO(fw_buf_checksum);
>   
> +/* Returns whether the proposed extent is valid */
> +static bool new_extent_valid(struct device *dev, size_t new_start,
> +			     size_t new_len)
> +{
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +	struct cxl_dc_extent_data *extent;
> +	size_t new_end, i;
> +
> +	if (!new_len)
> +		return false;
> +
> +	new_end = new_start + new_len;
> +
> +	dev_dbg(dev, "New extent %zx-%zx\n", new_start, new_end);
> +
> +	/* Overlap with other extent? */
> +	xa_for_each(&mdata->dc_extents, i, extent) {
> +		size_t ext_end = extent->dpa_start + extent->length;
> +
> +		if (extent->dpa_start <= new_start && new_start < ext_end) {
> +			dev_err(dev, "Extent overlap: Start %llu ?<= %zx ?<= %zx\n",
> +				extent->dpa_start, new_start, ext_end);
> +			return false;
> +		}
> +		if (extent->dpa_start <= new_end && new_end < ext_end) {
> +			dev_err(dev, "Extent overlap: End %llx ?<= %zx ?<= %zx\n",
> +				extent->dpa_start, new_end, ext_end);
> +			return false;
> +		}
> +	}
> +
> +	/* Ensure it is in a region and is valid for that region's block size */
> +	for (int i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
> +		struct cxl_dc_region_config *dc_region = &mdata->dc_regions[i];
> +		size_t reg_start, reg_end;
> +
> +		reg_start = le64_to_cpu(dc_region->region_base);
> +		reg_end = le64_to_cpu(dc_region->region_length);
> +		reg_end += reg_start;
> +
> +		dev_dbg(dev, "Region %d: %zx-%zx\n", i, reg_start, reg_end);
> +
> +		if (new_start >= reg_start && new_end < reg_end) {
> +			u64 block_size = le64_to_cpu(dc_region->region_block_size);
> +
> +			if (new_start % block_size || new_len % block_size) {
> +				dev_err(dev, "Extent not aligned to block size: start %zx; len %zx; block_size 0x%llx\n",
> +					new_start, new_len, block_size);
> +				return false;
> +			}
> +
> +			dev_dbg(dev, "Extent in region %d\n", i);
> +			return true;
> +		}
> +	}
> +
> +	return false;
> +}
> +
> +/*
> + * Format <start>:<length>:<tag>
> + *
> + * start and length must be a multiple of the configured region block size.
> + * Tag can be any string up to 16 bytes.
> + *
> + * Extents must be exclusive of other extents
> + */
> +static ssize_t dc_inject_extent_store(struct device *dev,
> +				      struct device_attribute *attr,
> +				      const char *buf, size_t count)
> +{
> +	char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
> +	unsigned long long start, length;
> +	char *len_str, *tag_str;
> +	size_t buf_len = count;
> +	int rc;
> +
> +	if (!start_str)
> +		return -ENOMEM;
> +
> +	len_str = strnchr(start_str, buf_len, ':');
> +	if (!len_str) {
> +		dev_err(dev, "Extent failed to find len_str: %s\n", start_str);
> +		return -EINVAL;
> +	}
> +
> +	*len_str = '\0';
> +	len_str += 1;
> +	buf_len -= strlen(start_str);
> +
> +	tag_str = strnchr(len_str, buf_len, ':');
> +	if (!tag_str) {
> +		dev_err(dev, "Extent failed to find tag_str: %s\n", len_str);
> +		return -EINVAL;
> +	}
> +	*tag_str = '\0';
> +	tag_str += 1;
> +
> +	if (kstrtoull(start_str, 0, &start)) {
> +		dev_err(dev, "Extent failed to parse start: %s\n", start_str);
> +		return -EINVAL;
> +	}
> +	if (kstrtoull(len_str, 0, &length)) {
> +		dev_err(dev, "Extent failed to parse length: %s\n", len_str);
> +		return -EINVAL;
> +	}
> +
> +	if (!new_extent_valid(dev, start, length))
> +		return -EINVAL;
> +
> +	rc = devm_add_extent(dev, start, length, tag_str);
> +	if (rc)
> +		return rc;
> +
> +	return count;
> +}
> +static DEVICE_ATTR_WO(dc_inject_extent);
> +
> +static ssize_t dc_del_extent_store(struct device *dev,
> +				   struct device_attribute *attr,
> +				   const char *buf, size_t count)
> +{
> +	unsigned long long start;
> +	int rc;
> +
> +	if (kstrtoull(buf, 0, &start)) {
> +		dev_err(dev, "Extent failed to parse start value\n");
> +		return -EINVAL;
> +	}
> +
> +	rc = dc_delete_extent(dev, start);
> +	if (rc)
> +		return rc;
> +
> +	return count;
> +}
> +static DEVICE_ATTR_WO(dc_del_extent);
> +
> +static ssize_t dc_force_del_extent_store(struct device *dev,
> +					 struct device_attribute *attr,
> +					 const char *buf, size_t count)
> +{
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +	unsigned long long start;
> +	void *ext;
> +	int rc;
> +
> +	if (kstrtoull(buf, 0, &start)) {
> +		dev_err(dev, "Extent failed to parse start value\n");
> +		return -EINVAL;
> +	}
> +
> +	ext = xa_erase(&mdata->dc_accepted_exts, start);
> +	if (ext)
> +		dev_dbg(dev, "Forcing remove of accepted extent: %llx\n",
> +			start);
> +
> +	dev_dbg(dev, "Forcing delete of extent at %llx\n", start);
> +	rc = dc_delete_extent(dev, start);
> +	if (rc)
> +		return rc;
> +
> +	return count;
> +}
> +static DEVICE_ATTR_WO(dc_force_del_extent);
> +
>   static struct attribute *cxl_mock_mem_attrs[] = {
>   	&dev_attr_security_lock.attr,
>   	&dev_attr_event_trigger.attr,
>   	&dev_attr_fw_buf_checksum.attr,
> +	&dev_attr_dc_inject_extent.attr,
> +	&dev_attr_dc_del_extent.attr,
> +	&dev_attr_dc_force_del_extent.attr,
>   	NULL
>   };
>   ATTRIBUTE_GROUPS(cxl_mock_mem);
> 

^ permalink raw reply	[flat|nested] 97+ messages in thread
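[Editorial note: the `dc_inject_extent_store()` handler in the patch above parses the sysfs string `<start>:<length>:<tag>`. A hypothetical userspace equivalent of that parse follows; the struct and function names are mine, and the kernel code uses `strnchr()`/`kstrtoull()` rather than the libc calls shown here.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct extent { unsigned long long start, length; char tag[16]; };

/* Parse "<start>:<length>:<tag>"; returns 0 on success, -1 if malformed. */
static int parse_extent(const char *buf, struct extent *ext)
{
	char tmp[128];
	char *len_str, *tag_str, *end;
	size_t tag_len;

	if (strlen(buf) >= sizeof(tmp))
		return -1;
	strcpy(tmp, buf);

	len_str = strchr(tmp, ':');	/* split off <start> */
	if (!len_str)
		return -1;
	*len_str++ = '\0';

	tag_str = strchr(len_str, ':');	/* split off <length> */
	if (!tag_str)
		return -1;
	*tag_str++ = '\0';

	ext->start = strtoull(tmp, &end, 0);	/* base 0: accepts 0x... */
	if (*end)
		return -1;
	ext->length = strtoull(len_str, &end, 0);
	if (*end)
		return -1;

	/* Bounded copy: the tag field is not NUL-terminated when full */
	memset(ext->tag, 0, sizeof(ext->tag));
	tag_len = strlen(tag_str);
	if (tag_len > sizeof(ext->tag))
		tag_len = sizeof(ext->tag);
	memcpy(ext->tag, tag_str, tag_len);
	return 0;
}
```

In the kernel, the parsed start and length would additionally be validated against the region's block size before the extent is injected.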

* Re: [PATCH RFC v2 18/18] tools/testing/cxl: Add Dynamic Capacity events
  2023-08-29  5:21 ` [PATCH RFC v2 18/18] tools/testing/cxl: Add Dynamic Capacity events Ira Weiny
  2023-08-30 12:23   ` Jonathan Cameron
@ 2023-08-31 23:20   ` Dave Jiang
  1 sibling, 0 replies; 97+ messages in thread
From: Dave Jiang @ 2023-08-31 23:20 UTC (permalink / raw)
  To: Ira Weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel



On 8/28/23 22:21, Ira Weiny wrote:
> OS software needs to be alerted when new extents arrive on a Dynamic
> Capacity Device (DCD).  On test DCDs extents are added through sysfs.
> 
> Add events on DCD extent injection.  Directly call the event irq
> callback to simulate irqs to process the test extents.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>   tools/testing/cxl/test/mem.c | 57 ++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 57 insertions(+)
> 
> diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
> index d6041a2145c5..20364fee9df9 100644
> --- a/tools/testing/cxl/test/mem.c
> +++ b/tools/testing/cxl/test/mem.c
> @@ -2008,6 +2008,41 @@ static bool new_extent_valid(struct device *dev, size_t new_start,
>   	return false;
>   }
>   
> +struct dcd_event_dyn_cap dcd_event_rec_template = {
> +	.hdr = {
> +		.id = UUID_INIT(0xca95afa7, 0xf183, 0x4018,
> +				0x8c, 0x2f, 0x95, 0x26, 0x8e, 0x10, 0x1a, 0x2a),
> +		.length = sizeof(struct dcd_event_dyn_cap),
> +	},
> +};
> +
> +static int send_dc_event(struct mock_event_store *mes, enum dc_event type,
> +			 u64 start, u64 length, const char *tag_str)
> +{
> +	struct device *dev = mes->mds->cxlds.dev;
> +	struct dcd_event_dyn_cap *dcd_event_rec;
> +
> +	dcd_event_rec = devm_kzalloc(dev, sizeof(*dcd_event_rec), GFP_KERNEL);
> +	if (!dcd_event_rec)
> +		return -ENOMEM;
> +
> +	memcpy(dcd_event_rec, &dcd_event_rec_template, sizeof(*dcd_event_rec));
> +	dcd_event_rec->data.event_type = type;
> +	dcd_event_rec->data.extent.start_dpa = cpu_to_le64(start);
> +	dcd_event_rec->data.extent.length = cpu_to_le64(length);
> +	memcpy(dcd_event_rec->data.extent.tag, tag_str,
> +	       min(sizeof(dcd_event_rec->data.extent.tag),
> +		   strlen(tag_str)));
> +
> +	mes_add_event(mes, CXL_EVENT_TYPE_DCD,
> +		      (struct cxl_event_record_raw *)dcd_event_rec);
> +
> +	/* Fake the irq */
> +	cxl_mem_get_event_records(mes->mds, CXLDEV_EVENT_STATUS_DCD);
> +
> +	return 0;
> +}
> +
>   /*
>    * Format <start>:<length>:<tag>
>    *
> @@ -2021,6 +2056,7 @@ static ssize_t dc_inject_extent_store(struct device *dev,
>   				      const char *buf, size_t count)
>   {
>   	char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
>   	unsigned long long start, length;
>   	char *len_str, *tag_str;
>   	size_t buf_len = count;
> @@ -2063,6 +2099,13 @@ static ssize_t dc_inject_extent_store(struct device *dev,
>   	if (rc)
>   		return rc;
>   
> +	rc = send_dc_event(&mdata->mes, DCD_ADD_CAPACITY, start, length,
> +			   tag_str);
> +	if (rc) {
> +		dev_err(dev, "Failed to add event %d\n", rc);
> +		return rc;
> +	}
> +
>   	return count;
>   }
>   static DEVICE_ATTR_WO(dc_inject_extent);
> @@ -2071,6 +2114,7 @@ static ssize_t dc_del_extent_store(struct device *dev,
>   				   struct device_attribute *attr,
>   				   const char *buf, size_t count)
>   {
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
>   	unsigned long long start;
>   	int rc;
>   
> @@ -2083,6 +2127,12 @@ static ssize_t dc_del_extent_store(struct device *dev,
>   	if (rc)
>   		return rc;
>   
> +	rc = send_dc_event(&mdata->mes, DCD_RELEASE_CAPACITY, start, 0, "");
> +	if (rc) {
> +		dev_err(dev, "Failed to add event %d\n", rc);
> +		return rc;
> +	}
> +
>   	return count;
>   }
>   static DEVICE_ATTR_WO(dc_del_extent);
> @@ -2111,6 +2161,13 @@ static ssize_t dc_force_del_extent_store(struct device *dev,
>   	if (rc)
>   		return rc;
>   
> +	rc = send_dc_event(&mdata->mes, DCD_FORCED_CAPACITY_RELEASE,
> +			      start, 0, "");
> +	if (rc) {
> +		dev_err(dev, "Failed to add event %d\n", rc);
> +		return rc;
> +	}
> +
>   	return count;
>   }
>   static DEVICE_ATTR_WO(dc_force_del_extent);
> 

^ permalink raw reply	[flat|nested] 97+ messages in thread
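[Editorial note: the event-injection path above copies the user-supplied tag into a fixed 16-byte record field with `memcpy(dst, src, min(sizeof(dst), strlen(src)))`. A minimal sketch of that bounded-copy pattern follows; the names are illustrative, not the kernel's, and the key point is that the field carries no terminating NUL when the tag fills it.]

```c
#include <assert.h>
#include <string.h>

#define TAG_SIZE 16

/* Copy at most TAG_SIZE bytes of src into a fixed, zero-filled field.
 * Returns the number of bytes actually copied. */
static size_t copy_tag(char dst[TAG_SIZE], const char *src)
{
	size_t n = strlen(src);

	if (n > TAG_SIZE)
		n = TAG_SIZE;		/* truncate, do not overrun */
	memset(dst, 0, TAG_SIZE);	/* short tags are zero-padded */
	memcpy(dst, src, n);
	return n;
}
```

Consumers of such a record must therefore treat the tag as a fixed-width byte field, not a C string.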

* Re: [PATCH RFC v2 01/18] cxl/hdm: Debug, use decoder name function
  2023-08-29 14:03   ` Jonathan Cameron
  2023-08-29 21:48     ` Fan Ni
@ 2023-09-03  2:55     ` Ira Weiny
  1 sibling, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-03  2:55 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:20:52 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > The decoder enum has a name conversion function defined now.
> > 
> > Use that instead of open coding.
> > 
> > Suggested-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> 
> Perhaps pull this one out so it can go upstream before the rest are ready,
> or could be picked up from here.

Good idea, sent separately.

https://lore.kernel.org/all/20230902-use-decoder-name-v1-1-06374ed7a400@intel.com/

Ira

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 02/18] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
  2023-08-29 14:07   ` Jonathan Cameron
@ 2023-09-03  3:38     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-03  3:38 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:20:53 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > Per the CXL 3.0 specification software must check the Command Effects
> > Log (CEL) to know if a device supports DC.  If the device does support
> > DC the specifics of the DC Regions (0-7) are read through the mailbox.
> > 
> > Flag DC Device (DCD) commands in a device if they are supported.
> > Subsequent patches will key off these bits to configure a DCD.
> > 
> > Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> 
> Trivial unrelated change seems to have sneaked in. Other than that
> this looks good to me.
> 
> So with that tidied up.
> 
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> 
> Thanks,
> 
> Jonathan
> 
> > +
> >  static bool cxl_is_security_command(u16 opcode)
> >  {
> >  	int i;
> > @@ -677,9 +705,10 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
> >  		u16 opcode = le16_to_cpu(cel_entry[i].opcode);
> >  		struct cxl_mem_command *cmd = cxl_mem_find_command(opcode);
> >  
> > -		if (!cmd && !cxl_is_poison_command(opcode)) {
> > -			dev_dbg(dev,
> > -				"Opcode 0x%04x unsupported by driver\n", opcode);
> > +		if (!cmd && !cxl_is_poison_command(opcode) &&
> > +		    !cxl_is_dcd_command(opcode)) {
> > +			dev_dbg(dev, "Opcode 0x%04x unsupported by driver\n",
> > +				opcode);
> 
> Clang format has been playing?
> Better to leave this alone and save reviewers wondering what the change
> in the dev_dbg() was.

Fixed.  Thanks for the review,
Ira

> 
> >  			continue;
> >  		}
> 



^ permalink raw reply	[flat|nested] 97+ messages in thread
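[Editorial note: as the patch under review describes, DCD support is discovered by walking the Command Effects Log (CEL) and flagging the optional 48XXh opcodes. A simplified model of that walk follows; the opcode values match CXL 3.0 section 8.2.9.8.9, but the flag-bit layout and function name here are illustrative, not the driver's.]

```c
#include <assert.h>
#include <stdint.h>

#define OP_GET_DC_CONFIG      0x4800
#define OP_GET_DC_EXTENT_LIST 0x4801
#define OP_ADD_DC_RESPONSE    0x4802
#define OP_RELEASE_DC         0x4803

/* Walk an array of CEL opcodes and return a bitmask of the DCD
 * commands the device advertises (bit 0 = config ... bit 3 = release). */
static uint32_t scan_cel_for_dcd(const uint16_t *opcodes, int n)
{
	uint32_t flags = 0;

	for (int i = 0; i < n; i++) {
		switch (opcodes[i]) {
		case OP_GET_DC_CONFIG:      flags |= 1u << 0; break;
		case OP_GET_DC_EXTENT_LIST: flags |= 1u << 1; break;
		case OP_ADD_DC_RESPONSE:    flags |= 1u << 2; break;
		case OP_RELEASE_DC:         flags |= 1u << 3; break;
		}
	}
	return flags;
}
```

Later patches in the series key off such per-command flags before issuing any DC mailbox command, so a device that omits an opcode from its CEL is simply never asked for it.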

* Re: [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device
  2023-08-29 14:37   ` Jonathan Cameron
@ 2023-09-03 23:36     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-03 23:36 UTC (permalink / raw)
  To: Jonathan Cameron, ira.weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:20:54 -0700
> ira.weiny@intel.com wrote:
> 
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > Devices can optionally support Dynamic Capacity (DC).  These devices are
> > known as Dynamic Capacity Devices (DCD).
> > 
> > Implement the DC (opcode 48XXh) mailbox commands as specified in CXL 3.0
> > section 8.2.9.8.9.  Read the DC configuration and store the DC region
> > information in the device state.
> > 
> > Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> Hi.
> 
> A few minor things inline.  Otherwise, I wonder if it's worth separating
> the mode of the region from that of the endpoint decoder in a precursor patch.
> That's a large part of this one and not really related to the mbox command stuff.

I've taken some time looking through my backup branches because I thought
this was a separate patch.  I'm feeling like this was a rebase error where
some of the next patch got merged here accidentally.  I agree it seems a
good idea to have it separate but I can't confirm at this point if it was
originally.

Split done.

[snip]

> > +
> > +	rc = dc_resp->avail_region_count - start_region;
> > +
> > +	/*
> > +	 * The number of regions in the payload may have been truncated due to
> > +	 * payload_size limits; if so adjust the count in this query.
> 
> Not adjusting the query.  "if so adjust the returned count to match."

Yep done!

> 
> > +	 */
> > +	if (mbox_cmd.size_out < sizeof(*dc_resp))
> > +		rc = CXL_REGIONS_RETURNED(mbox_cmd.size_out);
> > +
> > +	dev_dbg(dev, "Read %d/%d DC regions\n", rc, dc_resp->avail_region_count);
> > +
> > +	return rc;
> > +}
> > +
> > +/**
> > + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> > + *					 information from the device.
> > + * @mds: The memory device state
> > + *
> > + * This will dispatch the get_dynamic_capacity command to the device
> > + * and on success populate structures to be exported to sysfs.
> 
> I'd skip the 'exported to sysfs' as I'd guess this will have other uses
> (maybe) in the longer term.
> 
> and on success populate state structures for later use.

Yea that was poorly worded.  Changed to:

	Read Dynamic Capacity information from the device and populate the
	state structures for later use.

> 
> > + *
> > + * Return: 0 if identify was executed successfully, -ERRNO on error.
> > + */
> > +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> > +{
> > +	struct cxl_mbox_dynamic_capacity *dc_resp;
> > +	struct device *dev = mds->cxlds.dev;
> > +	size_t dc_resp_size = mds->payload_size;
> > +	u8 start_region;
> > +	int i, rc = 0;
> > +
> > +	for (i = 0; i < CXL_MAX_DC_REGION; i++)
> > +		snprintf(mds->dc_region[i].name, CXL_DC_REGION_STRLEN, "<nil>");
> > +
> > +	/* Check GET_DC_CONFIG is supported by device */
> > +	if (!test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds)) {
> > +		dev_dbg(dev, "unsupported cmd: get_dynamic_capacity_config\n");
> > +		return 0;
> > +	}
> > +
> > +	dc_resp = kvmalloc(dc_resp_size, GFP_KERNEL);                         
> > +	if (!dc_resp)                                                                
> > +		return -ENOMEM;                                                 
> > +
> > +	start_region = 0;
> > +	do {
> > +		int j;
> > +
> > +		rc = cxl_get_dc_id(mds, start_region, dc_resp, dc_resp_size);
> 
> I'd spell out identify.
> Initially I thought this was getting an index.

Actually this is getting the DC configuration.  So I'm changing it to:

cxl_get_dc_config()

> 
> 
> > +		if (rc < 0)
> > +			goto free_resp;
> > +
> > +		mds->nr_dc_region += rc;
> > +
> > +		if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> > +			dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> > +				mds->nr_dc_region);
> > +			rc = -EINVAL;
> > +			goto free_resp;
> > +		}
> > +
> > +		for (i = start_region, j = 0; i < mds->nr_dc_region; i++, j++) {
> > +			rc = cxl_dc_save_region_info(mds, i, &dc_resp->region[j]);
> > +			if (rc)
> > +				goto free_resp;
> > +		}
> > +
> > +		start_region = mds->nr_dc_region;
> > +
> > +	} while (mds->nr_dc_region < dc_resp->avail_region_count);
> > +
> > +	mds->dynamic_cap =
> > +		mds->dc_region[mds->nr_dc_region - 1].base +
> > +		mds->dc_region[mds->nr_dc_region - 1].decode_len -
> > +		mds->dc_region[0].base;
> > +	dev_dbg(dev, "Total dynamic capacity: %#llx\n", mds->dynamic_cap);
> > +
> > +free_resp:
> > +	kfree(dc_resp);
> 
> Maybe a first use for __free in cxl?
> 
> See include/linux/cleanup.h
> Would enable returns rather than goto and label.
> 

Good idea.  Done.

> 
> 
> > +	if (rc)
> > +		dev_err(dev, "Failed to get DC info: %d\n", rc);
> 
> I'd prefer to see more specific debug in the few paths that don't already
> print it above.

With the use of __free it kind of went the same way.

Done.

> 
> > +	return rc;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> > +
> >  static int add_dpa_res(struct device *dev, struct resource *parent,
> >  		       struct resource *res, resource_size_t start,
> >  		       resource_size_t size, const char *type)
> > @@ -1208,8 +1369,12 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> >  {
> >  	struct cxl_dev_state *cxlds = &mds->cxlds;
> >  	struct device *dev = cxlds->dev;
> > +	size_t untenanted_mem;
> >  	int rc;
> >  
> > +	untenanted_mem = mds->dc_region[0].base - mds->static_cap;
> > +	mds->total_bytes = mds->static_cap + untenanted_mem + mds->dynamic_cap;
> > +
> >  	if (!cxlds->media_ready) {
> >  		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> >  		cxlds->ram_res = DEFINE_RES_MEM(0, 0);
> > @@ -1217,8 +1382,16 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> >  		return 0;
> >  	}
> >  
> > -	cxlds->dpa_res =
> > -		(struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
> > +	cxlds->dpa_res = (struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
> 
> Beat back that auto-formater! Or just run it once and fix everything before
> doing anything new.

Will do.

[snip]

> >  
> > @@ -2234,7 +2247,7 @@ static struct cxl_region *cxl_region_alloc(struct cxl_root_decoder *cxlrd, int i
> >   * devm_cxl_add_region - Adds a region to a decoder
> >   * @cxlrd: root decoder
> >   * @id: memregion id to create, or memregion_free() on failure
> > - * @mode: mode for the endpoint decoders of this region
> > + * @mode: mode of this region
> >   * @type: select whether this is an expander or accelerator (type-2 or type-3)
> >   *
> >   * This is the second step of region initialization. Regions exist within an
> > @@ -2245,7 +2258,7 @@ static struct cxl_region *cxl_region_alloc(struct cxl_root_decoder *cxlrd, int i
> >   */
> >  static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
> >  					      int id,
> > -					      enum cxl_decoder_mode mode,
> > +					      enum cxl_region_mode mode,
> >  					      enum cxl_decoder_type type)
> >  {
> >  	struct cxl_port *port = to_cxl_port(cxlrd->cxlsd.cxld.dev.parent);
> > @@ -2254,11 +2267,12 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
> >  	int rc;
> >  
> >  	switch (mode) {
> > -	case CXL_DECODER_RAM:
> > -	case CXL_DECODER_PMEM:
> > +	case CXL_REGION_RAM:
> > +	case CXL_REGION_PMEM:
> >  		break;
> >  	default:
> > -		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> 
> Arguably should have been moved to the cxl_decoder_mode_name() in patch 1
> before being changed to cxl_region_mode_name() when the two are separated in this
> patch.  You could just add a note to patch 1 to say 'other instances will be
> covered by refactors shortly'. 

Ah well I've already split that out and sent it.  I was hoping little
things like that could land quickly and we could get to the larger patches
in this series.  For now I'm going to leave it (But split out as part of
the region mode patch).

[snip]

> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index cd4a9ffdacc7..ed282dcd5cf5 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -374,6 +374,28 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
> >  	return "mixed";
> >  }
> >  
> > +enum cxl_region_mode {
> > +	CXL_REGION_NONE,
> > +	CXL_REGION_RAM,
> > +	CXL_REGION_PMEM,
> > +	CXL_REGION_MIXED,
> > +	CXL_REGION_DEAD,
> > +};
> 
> It feels to me like you could have yanked the introduction and use of cxl_region_mode
> out as a trivial precursor patch with a note saying the separation will be needed
> shortly and why it will be needed.

Yep done.  Like I said I think I had this split out at some point ...
It's immaterial now.

[snip]

> >  
> > +#define CXL_DC_REGION_STRLEN 7
> > +struct cxl_dc_region_info {
> > +	u64 base;
> > +	u64 decode_len;
> > +	u64 len;
> > +	u64 blk_size;
> > +	u32 dsmad_handle;
> > +	u8 flags;
> > +	u8 name[CXL_DC_REGION_STRLEN];
> > +};
> > +
> >  /**
> >   * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
> >   *
> > @@ -449,6 +464,8 @@ struct cxl_dev_state {
> >   * @enabled_cmds: Hardware commands found enabled in CEL.
> >   * @exclusive_cmds: Commands that are kernel-internal only
> >   * @total_bytes: sum of all possible capacities
> > + * @static_cap: Sum of RAM and PMEM capacities
> 
> Sum of static RAM and PMEM capacities
> 
> Dynamic cap may well be RAM or PMEM!

Indeed!  Done.

[snip]

> >  
> >  /*
> > @@ -741,9 +771,31 @@ struct cxl_mbox_set_partition_info {
> >  	__le64 volatile_capacity;
> >  	u8 flags;
> >  } __packed;
> > -
> 
> ?

I just missed it when self reviewing.  Fixed.

> 
> >  #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
> >  
> > +struct cxl_mbox_get_dc_config {
> > +	u8 region_count;
> > +	u8 start_region_index;
> > +} __packed;
> > +
> > +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> > +struct cxl_mbox_dynamic_capacity {
> 
> Can we rename to make it more clear which payload this is?

Sure.

> 
> > +	u8 avail_region_count;
> > +	u8 rsvd[7];
> > +	struct cxl_dc_region_config {
> > +		__le64 region_base;
> > +		__le64 region_decode_length;
> > +		__le64 region_length;
> > +		__le64 region_block_size;
> > +		__le32 region_dsmad_handle;
> > +		u8 flags;
> > +		u8 rsvd[3];
> > +	} __packed region[];
> > +} __packed;
> > +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
> > +#define CXL_REGIONS_RETURNED(size_out) \
> > +	((size_out - 8) / sizeof(struct cxl_dc_region_config))
> > +
> >  /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
> >  struct cxl_mbox_set_timestamp_in {
> >  	__le64 timestamp;
> > @@ -867,6 +919,7 @@ enum {
> >  int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
> >  			  struct cxl_mbox_cmd *cmd);
> >  int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> > +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
> >  int cxl_await_media_ready(struct cxl_dev_state *cxlds);
> >  int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> >  int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> 
> ta

ta?

Ira


* Re: [PATCH RFC v2 05/18] cxl/port: Add Dynamic Capacity mode support to endpoint decoders
  2023-08-29 14:49   ` Jonathan Cameron
@ 2023-09-05  0:05     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-05  0:05 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:20:56 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > Endpoint decoders used to map Dynamic Capacity must be configured to
> > point to the correct Dynamic Capacity (DC) Region.  The decoder mode
> > currently represents the partition the decoder points to such as ram or
> > pmem.
> > 
> > Expand the mode to include DC Regions.
> > 
> > Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> I'm reading this in a linear fashion for now (and ideally that should
> always make sense) so I don't currently see the reason for the loops
> in here. If they are needed for a future patch, add something to the
> description to indicate that.
> 
> > 
> > ---
> > Changes for v2:
> > [iweiny: split from region creation patch]
> > ---
> >  Documentation/ABI/testing/sysfs-bus-cxl | 19 ++++++++++---------
> >  drivers/cxl/core/hdm.c                  | 24 ++++++++++++++++++++++++
> >  drivers/cxl/core/port.c                 | 16 ++++++++++++++++
> >  3 files changed, 50 insertions(+), 9 deletions(-)
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > index 6350dd82b9a9..2268ffcdb604 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > @@ -257,22 +257,23 @@ Description:
> >  
> >  What:		/sys/bus/cxl/devices/decoderX.Y/mode
> >  Date:		May, 2022
> > -KernelVersion:	v6.0
> > +KernelVersion:	v6.0, v6.6 (dcY)
> >  Contact:	linux-cxl@vger.kernel.org
> >  Description:
> >  		(RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
> >  		translates from a host physical address range, to a device local
> >  		address range. Device-local address ranges are further split
> > -		into a 'ram' (volatile memory) range and 'pmem' (persistent
> > -		memory) range. The 'mode' attribute emits one of 'ram', 'pmem',
> > -		'mixed', or 'none'. The 'mixed' indication is for error cases
> > -		when a decoder straddles the volatile/persistent partition
> > -		boundary, and 'none' indicates the decoder is not actively
> > -		decoding, or no DPA allocation policy has been set.
> > +		into a 'ram' (volatile memory) range, 'pmem' (persistent
> > +		memory) range, or Dynamic Capacity (DC) range. The 'mode'
> > +		attribute emits one of 'ram', 'pmem', 'dcY', 'mixed', or
> > +		'none'. The 'mixed' indication is for error cases when a
> > +		decoder straddles the volatile/persistent partition boundary,
> > +		and 'none' indicates the decoder is not actively decoding, or
> > +		no DPA allocation policy has been set.
> >  
> >  		'mode' can be written, when the decoder is in the 'disabled'
> > -		state, with either 'ram' or 'pmem' to set the boundaries for the
> > -		next allocation.
> > +		state, with 'ram', 'pmem', or 'dcY' to set the boundaries for
> > +		the next allocation.
> >  
> >  
> >  What:		/sys/bus/cxl/devices/decoderX.Y/dpa_resource
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index a254f79dd4e8..3f4af1f5fac8 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -267,6 +267,19 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> >  	__cxl_dpa_release(cxled);
> >  }
> >  
> > +static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
> > +{
> > +	int index = 0;
> > +
> > +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> As you are relying on them being in order and adjacent for the loop, why is
> 
> 	if (mode < CXL_DECODER_DC0 || mode > CXL_DECODER_DC7)
> 		return -EINVAL;
> 
> 	return mode - CXL_DECODER_DC0;
> 
> Not sufficient?

That would work yes.  There is no future need for a loop.  It was just
implemented this way early on and I did not really think about it too
much.

Done.

> 
> > +		if (mode == i)
> > +			return index;
> > +		index++;
> > +	}
> > +
> > +	return -EINVAL;
> > +}
> > +
> >  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> >  			     resource_size_t base, resource_size_t len,
> >  			     resource_size_t skipped)
> > @@ -429,6 +442,7 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
> >  	switch (mode) {
> >  	case CXL_DECODER_RAM:
> >  	case CXL_DECODER_PMEM:
> > +	case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
> >  		break;
> >  	default:
> >  		dev_dbg(dev, "unsupported mode: %d\n", mode);
> > @@ -456,6 +470,16 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
> >  		goto out;
> >  	}
> >  
> > +	for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
> > +		int index = dc_mode_to_region_index(i);
> > +
> > +		if (mode == i && !resource_size(&cxlds->dc_res[index])) {
> 
> Not obvious why we have the loop in this patch - perhaps it makes sense later.

I think it was just walking through the DC regions like the previous code
was walking through the PMEM/RAM 'regions'.

> If this is to enable later changes, then good to say that in the patch description.

... nope...

> otherwise, something like.
> 
> 	int index;
> 	
> 	rc = dc_mode_to_region_index(i);
> 	if (rc < 0)
> 		goto out;
> 
> 	index = rc;
> 	if (!resource_size(&cxlds->dc_res[index]) {
> 	....
> 		

Yea...  but that won't exactly work.  Something like this:

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index cf5d656c271b..f250d1566682 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -463,10 +463,12 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
                goto out;
        }

-       for (int i = CXL_DECODER_DC0; i <= CXL_DECODER_DC7; i++) {
-               int index = dc_mode_to_region_index(i);
+       if (cxl_decoder_mode_is_dc(mode)) {
+               rc = dc_mode_to_region_index(mode);
+               if (rc < 0)
+                       goto out;

-               if (mode == i && !resource_size(&cxlds->dc_res[index])) {
+               if (!resource_size(&cxlds->dc_res[rc])) {
                        dev_dbg(dev, "no available dynamic capacity\n");
                        rc = -ENXIO;
                        goto out;

But looking at the function I think there could be a clean up patch before
this.  I don't see the need to check the mode twice.

...  Yes I think that looks cleaner.

Thanks for the review!
Ira


* Re: [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device
  2023-08-30 21:01   ` Dave Jiang
@ 2023-09-05  0:14     ` Ira Weiny
  2023-09-08 20:23     ` Ira Weiny
  1 sibling, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-05  0:14 UTC (permalink / raw)
  To: Dave Jiang, ira.weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Dave Jiang wrote:
> 
> 
> On 8/28/23 22:20, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > Devices can optionally support Dynamic Capacity (DC).  These devices are
> > known as Dynamic Capacity Devices (DCD).
> > 
> > Implement the DC (opcode 48XXh) mailbox commands as specified in CXL 3.0
> > section 8.2.9.8.9.  Read the DC configuration and store the DC region
> > information in the device state.
> > 
> > Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> Uncapitalize Dynamic in subject

Fair enough.

> 
> Also, maybe split out the REGION vs DECODER as a prep patch.

Done per Jonathan.

Thanks for the review.
Ira


* Re: [PATCH RFC v2 06/18] cxl/port: Add Dynamic Capacity size support to endpoint decoders
  2023-08-29 15:09   ` Jonathan Cameron
@ 2023-09-05  4:32     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-05  4:32 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:20:57 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 

[snip]

> > 
> > Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> Various minor things noticed inline.

Thanks!

[snip]

> 
> > +
> > +static int cxl_reserve_dpa_skip(struct cxl_endpoint_decoder *cxled,
> > +				resource_size_t base, resource_size_t skipped)
> > +{
> > +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > +	struct cxl_port *port = cxled_to_port(cxled);
> > +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > +	resource_size_t skip_base = base - skipped;
> > +	resource_size_t size, skip_len = 0;
> > +	struct device *dev = &port->dev;
> > +	int rc, index;
> > +
> > +	size = resource_size(&cxlds->ram_res);
> > +	if (size && skip_base <= cxlds->ram_res.end) {
> 
> This size only used in this if statement I'd just put it inline.

And in the pmem case...

>  
> > +		skip_len = cxlds->ram_res.end - skip_base + 1;
> > +		rc = cxl_request_skip(cxled, skip_base, skip_len);
> > +		if (rc)
> > +			return rc;
> > +		skip_base += skip_len;
> > +	}
> > +
> > +	if (skip_base == base) {
> > +		dev_dbg(dev, "skip done!\n");
> 
> Not sure that dbg is much help as other places below where skip also done...

Ok.

> 
> > +		return 0;
> > +	}
> > +
> > +	size = resource_size(&cxlds->pmem_res);
> > +	if (size && skip_base <= cxlds->pmem_res.end) {
> 
> size only used in this if statement. I'd just put
> the resource_size() bit inline.

Ah ok.  I think the line length was the issue here.

I'm ok taking the variable out.

> 
> > +		skip_len = cxlds->pmem_res.end - skip_base + 1;
> > +		rc = cxl_request_skip(cxled, skip_base, skip_len);
> > +		if (rc)
> > +			return rc;
> > +		skip_base += skip_len;
> > +	}
> > +
> > +	index = dc_mode_to_region_index(cxled->mode);
> > +	for (int i = 0; i <= index; i++) {
> > +		struct resource *dcr = &cxlds->dc_res[i];
> > +
> > +		if (skip_base < dcr->start) {
> > +			skip_len = dcr->start - skip_base;
> > +			rc = cxl_request_skip(cxled, skip_base, skip_len);
> > +			if (rc)
> > +				return rc;
> > +			skip_base += skip_len;
> > +		}
> > +
> > +		if (skip_base == base) {
> > +			dev_dbg(dev, "skip done!\n");
> 
> As above - perhaps some more info?

Sure.

> 
> > +			break;
> > +		}
> > +
> > +		if (resource_size(dcr) && skip_base <= dcr->end) {
> > +			if (skip_base > base)
> > +				dev_err(dev, "Skip error\n");
> 
> Not return ?  If there is a reason to carry on, I'd like a comment to say what it is.

Looks like a bug I missed.  thanks!

> 
> > +
> > +			skip_len = dcr->end - skip_base + 1;
> > +			rc = cxl_request_skip(cxled, skip_base, skip_len);
> > +			if (rc)
> > +				return rc;
> > +			skip_base += skip_len;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> 
> 
> > @@ -492,11 +607,13 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
> >  					 resource_size_t *start_out,
> >  					 resource_size_t *skip_out)
> >  {
> > +	resource_size_t free_ram_start, free_pmem_start, free_dc_start;
> >  	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > -	resource_size_t free_ram_start, free_pmem_start;
> >  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > +	struct device *dev = &cxled->cxld.dev;
> 
> There is one existing (I think) call to dev_dbg(cxled_dev(cxled) ...
> in this function.  So both should use that here, and should convert that one
> case to using dev.

I think the type 2 stuff is lower priority than this series.  The main reason I
had this series based on that work was due to the split of the memdev state
from the device state.  Because that patch has landed I've rebased this
series on master in hopes of it landing in 6.7 without the type 2
dependency.

As such this code got moved to __cxl_dpa_reserve().

> 
> >  	resource_size_t start, avail, skip;
> >  	struct resource *p, *last;
> > +	int index;
> >  
> >  	lockdep_assert_held(&cxl_dpa_rwsem);
> >  
> > @@ -514,6 +631,20 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
> >  	else
> >  		free_pmem_start = cxlds->pmem_res.start;
> >  
> > +	/*
> > +	 * Limit each decoder to a single DC region to map memory with
> > +	 * different DSMAS entry.
> > +	 */
> > +	index = dc_mode_to_region_index(cxled->mode);
> > +	if (index >= 0) {
> > +		if (cxlds->dc_res[index].child) {
> > +			dev_err(dev, "Cannot allocate DPA from DC Region: %d\n",
> > +				index);
> > +			return -EINVAL;
> > +		}
> > +		free_dc_start = cxlds->dc_res[index].start;
> > +	}
> > +
> >  	if (cxled->mode == CXL_DECODER_RAM) {
> >  		start = free_ram_start;
> >  		avail = cxlds->ram_res.end - start + 1;
> > @@ -535,6 +666,29 @@ static resource_size_t cxl_dpa_freespace(struct cxl_endpoint_decoder *cxled,
> >  		else
> >  			skip_end = start - 1;
> >  		skip = skip_end - skip_start + 1;
> > +	} else if (cxl_decoder_mode_is_dc(cxled->mode)) {
> > +		resource_size_t skip_start, skip_end;
> > +
> > +		start = free_dc_start;
> > +		avail = cxlds->dc_res[index].end - start + 1;
> > +		if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
> 
> Previous patch used !resource_size()
> I prefer compare with 0 like you have here, but which ever is chosen, things should
> be consistent.
> 
> ...
> 

Yea good point.  I audited the series for this and made the change.

Ira


* Re: [PATCH RFC v2 07/18] cxl/mem: Expose device dynamic capacity configuration
  2023-08-29 15:14   ` Jonathan Cameron
@ 2023-09-05 17:55     ` Fan Ni
  2023-09-05 20:45     ` Ira Weiny
  1 sibling, 0 replies; 97+ messages in thread
From: Fan Ni @ 2023-09-05 17:55 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: ira.weiny, Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, linux-cxl,
	linux-kernel

On Tue, Aug 29, 2023 at 04:14:49PM +0100, Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:20:58 -0700
> ira.weiny@intel.com wrote:
>
> > From: Navneet Singh <navneet.singh@intel.com>
> >
> > To properly configure CXL regions on Dynamic Capacity Devices (DCD),
> > user space will need to know the details of the DC Regions available on
> > a device.
> >
> > Expose driver dynamic capacity configuration through sysfs
> > attributes.
> >
> > Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> >
> One trivial comment inline.  I wondered a bit if it would
> be better to not present dc at all on devices that don't support
> dynamic capacity, but for now there isn't an elegant way to do that
> (some discussions and patches are flying around however so maybe this
>  will be resolved before this series merges giving us that elegant
>  option).
>
> With commented code tidied up
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>

Agreed. It makes more sense to not show dc at all.
Other than that, looks good to me.

Reviewed-by: Fan Ni <fan.ni@samsung.com>

>
> > ---
> > Changes for v2:
> > [iweiny: Rebased on latest master/type2 work]
> > [iweiny: add documentation for sysfs entries]
> > [iweiny: s/dc_regions_count/region_count/]
> > [iweiny: s/dcY_size/regionY_size/]
> > [alison: change size format to %#llx]
> > [iweiny: change count format to %d]
> > [iweiny: Formatting updates]
> > [iweiny: Fix crash when device is not a mem device: found with cxl-test]
> > ---
> >  Documentation/ABI/testing/sysfs-bus-cxl | 17 ++++++++
> >  drivers/cxl/core/memdev.c               | 77 +++++++++++++++++++++++++++++++++
> >  2 files changed, 94 insertions(+)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > index 2268ffcdb604..aa65dc5b4e13 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > @@ -37,6 +37,23 @@ Description:
> >  		identically named field in the Identify Memory Device Output
> >  		Payload in the CXL-2.0 specification.
> >
> > +What:		/sys/bus/cxl/devices/memX/dc/region_count
> > +Date:		July, 2023
> > +KernelVersion:	v6.6
> > +Contact:	linux-cxl@vger.kernel.org
> > +Description:
> > +		(RO) Number of Dynamic Capacity (DC) regions supported on the
> > +		device.  May be 0 if the device does not support Dynamic
> > +		Capacity.
> > +
> > +What:		/sys/bus/cxl/devices/memX/dc/regionY_size
> > +Date:		July, 2023
> > +KernelVersion:	v6.6
> > +Contact:	linux-cxl@vger.kernel.org
> > +Description:
> > +		(RO) Size of the Dynamic Capacity (DC) region Y.  Only
> > +		available on devices which support DC and only for those
> > +		region indexes supported by the device.
> >
> >  What:		/sys/bus/cxl/devices/memX/serial
> >  Date:		January, 2022
> > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> > index 492486707fd0..397262e0ebd2 100644
> > --- a/drivers/cxl/core/memdev.c
> > +++ b/drivers/cxl/core/memdev.c
> > @@ -101,6 +101,20 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
> >  static struct device_attribute dev_attr_pmem_size =
> >  	__ATTR(size, 0444, pmem_size_show, NULL);
> >
> > +static ssize_t region_count_show(struct device *dev, struct device_attribute *attr,
> > +				 char *buf)
> > +{
> > +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> > +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> > +	int len = 0;
> > +
> > +	len = sysfs_emit(buf, "%d\n", mds->nr_dc_region);
> > +	return len;
>
> return sysfs_emit(buf, "...);
>
> > +}
> > +
> > +struct device_attribute dev_attr_region_count =
> > +	__ATTR(region_count, 0444, region_count_show, NULL);
> > +
> >  static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
> >  			   char *buf)
> >  {
> > @@ -454,6 +468,62 @@ static struct attribute *cxl_memdev_security_attributes[] = {
> >  	NULL,
> >  };
> >
> > +static ssize_t show_size_regionN(struct cxl_memdev *cxlmd, char *buf, int pos)
> > +{
> > +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> > +
> > +	return sysfs_emit(buf, "%#llx\n", mds->dc_region[pos].decode_len);
> > +}
> > +
> > +#define REGION_SIZE_ATTR_RO(n)						\
> > +static ssize_t region##n##_size_show(struct device *dev,		\
> > +				     struct device_attribute *attr,	\
> > +				     char *buf)				\
> > +{									\
> > +	return show_size_regionN(to_cxl_memdev(dev), buf, (n));		\
> > +}									\
> > +static DEVICE_ATTR_RO(region##n##_size)
> > +REGION_SIZE_ATTR_RO(0);
> > +REGION_SIZE_ATTR_RO(1);
> > +REGION_SIZE_ATTR_RO(2);
> > +REGION_SIZE_ATTR_RO(3);
> > +REGION_SIZE_ATTR_RO(4);
> > +REGION_SIZE_ATTR_RO(5);
> > +REGION_SIZE_ATTR_RO(6);
> > +REGION_SIZE_ATTR_RO(7);
> > +
> > +static struct attribute *cxl_memdev_dc_attributes[] = {
> > +	&dev_attr_region0_size.attr,
> > +	&dev_attr_region1_size.attr,
> > +	&dev_attr_region2_size.attr,
> > +	&dev_attr_region3_size.attr,
> > +	&dev_attr_region4_size.attr,
> > +	&dev_attr_region5_size.attr,
> > +	&dev_attr_region6_size.attr,
> > +	&dev_attr_region7_size.attr,
> > +	&dev_attr_region_count.attr,
> > +	NULL,
> > +};
> > +
> > +static umode_t cxl_dc_visible(struct kobject *kobj, struct attribute *a, int n)
> > +{
> > +	struct device *dev = kobj_to_dev(kobj);
> > +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> > +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> > +
> > +	/* Not a memory device */
> > +	if (!mds)
> > +		return 0;
> > +
> > +	if (a == &dev_attr_region_count.attr)
> > +		return a->mode;
> > +
> > +	if (n < mds->nr_dc_region)
> > +		return a->mode;
> > +
> > +	return 0;
> > +}
> > +
> >  static umode_t cxl_memdev_visible(struct kobject *kobj, struct attribute *a,
> >  				  int n)
> >  {
> > @@ -482,11 +552,18 @@ static struct attribute_group cxl_memdev_security_attribute_group = {
> >  	.attrs = cxl_memdev_security_attributes,
> >  };
> >
> > +static struct attribute_group cxl_memdev_dc_attribute_group = {
> > +	.name = "dc",
> > +	.attrs = cxl_memdev_dc_attributes,
> > +	.is_visible = cxl_dc_visible,
> > +};
> > +
> >  static const struct attribute_group *cxl_memdev_attribute_groups[] = {
> >  	&cxl_memdev_attribute_group,
> >  	&cxl_memdev_ram_attribute_group,
> >  	&cxl_memdev_pmem_attribute_group,
> >  	&cxl_memdev_security_attribute_group,
> > +	&cxl_memdev_dc_attribute_group,
> >  	NULL,
> >  };
> >
> >
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 07/18] cxl/mem: Expose device dynamic capacity configuration
  2023-08-29 15:14   ` Jonathan Cameron
  2023-09-05 17:55     ` Fan Ni
@ 2023-09-05 20:45     ` Ira Weiny
  1 sibling, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-05 20:45 UTC (permalink / raw)
  To: Jonathan Cameron, ira.weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:20:58 -0700
> ira.weiny@intel.com wrote:
> 
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > To properly configure CXL regions on Dynamic Capacity Devices (DCD),
> > user space will need to know the details of the DC Regions available on
> > a device.
> > 
> > Expose driver dynamic capacity configuration through sysfs
> > attributes.
> > 
> > Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> One trivial comment inline.  I wondered a bit if it would
> be better to not present dc at all on devices that don't support
> dynamic capacity, but for now there isn't an elegant way to do that
> (some discussions and patches are flying around however so maybe this
>  will be resolved before this series merges giving us that elegant
>  option).

For now I will keep this, yes.  But if there is a better way then sure.

> 
> With commented code tidied up
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 

Thanks.

> > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> > index 492486707fd0..397262e0ebd2 100644
> > --- a/drivers/cxl/core/memdev.c
> > +++ b/drivers/cxl/core/memdev.c
> > @@ -101,6 +101,20 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
> >  static struct device_attribute dev_attr_pmem_size =
> >  	__ATTR(size, 0444, pmem_size_show, NULL);
> >  
> > +static ssize_t region_count_show(struct device *dev, struct device_attribute *attr,
> > +				 char *buf)
> > +{
> > +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> > +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> > +	int len = 0;
> > +
> > +	len = sysfs_emit(buf, "%d\n", mds->nr_dc_region);
> > +	return len;
> 
> return sysfs_emit(buf, "...);
> 	

Done thanks!
Ira


* Re: [PATCH RFC v2 08/18] cxl/region: Add Dynamic Capacity CXL region support
  2023-08-29  5:20 ` [PATCH RFC v2 08/18] cxl/region: Add Dynamic Capacity CXL region support Ira Weiny
  2023-08-29 15:19   ` Jonathan Cameron
  2023-08-30 23:27   ` Dave Jiang
@ 2023-09-05 21:09   ` Fan Ni
  2 siblings, 0 replies; 97+ messages in thread
From: Fan Ni @ 2023-09-05 21:09 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Jonathan Cameron,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma,
	linux-cxl, linux-kernel

On Mon, Aug 28, 2023 at 10:20:59PM -0700, Ira Weiny wrote:
> CXL devices optionally support dynamic capacity.  CXL Regions must be
> configured correctly to access this capacity.  Similar to ram and pmem
> partitions, DC Regions represent different partitions of the DPA space.
>
> Interleaving is deferred due to the complexity of managing extents on
> multiple devices at the same time.  However, there is nothing which
> directly prevents interleave support at this time.  The check allows
> for early rejection.
>
> To maintain backwards compatibility with older software, CXL regions
> need a default DAX device to hold the reference for the region until it
> is deleted.
>
> Add create_dc_region sysfs entry to create DC regions.  Share the logic
> of devm_cxl_add_dax_region() and region_is_system_ram().  Special case
> DC capable CXL regions to create a 0 sized seed DAX device until others
> can be created on dynamic space later.
>
> Flag dax_regions to indicate 0 capacity available until dax_region
> extents are supported by the region.
>
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>

Reviewed-by: Fan Ni <fan.ni@samsung.com>

> ---
> changes for v2:
> [iweiny: flag empty dax regions]
> [iweiny: Split out anything not directly related to creating a DC CXL
> 	 region]
> [iweiny: Separate out dev dax stuff]
> [iweiny/navneet: create 0 sized DAX device by default]
> [iweiny: use new DC region mode]
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl | 20 +++++-----
>  drivers/cxl/core/core.h                 |  1 +
>  drivers/cxl/core/port.c                 |  1 +
>  drivers/cxl/core/region.c               | 71 ++++++++++++++++++++++++++++-----
>  drivers/dax/bus.c                       |  8 ++++
>  drivers/dax/bus.h                       |  1 +
>  drivers/dax/cxl.c                       | 15 ++++++-
>  7 files changed, 96 insertions(+), 21 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index aa65dc5b4e13..a0562938ecac 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -351,20 +351,20 @@ Description:
>  		interleave_granularity).
>
>
> -What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
> +What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dc}_region
>  Date:		May, 2022, January, 2023
> -KernelVersion:	v6.0 (pmem), v6.3 (ram)
> +KernelVersion:	v6.0 (pmem), v6.3 (ram), v6.6 (dc)
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) Write a string in the form 'regionZ' to start the process
> -		of defining a new persistent, or volatile memory region
> -		(interleave-set) within the decode range bounded by root decoder
> -		'decoderX.Y'. The value written must match the current value
> -		returned from reading this attribute. An atomic compare exchange
> -		operation is done on write to assign the requested id to a
> -		region and allocate the region-id for the next creation attempt.
> -		EBUSY is returned if the region name written does not match the
> -		current cached value.
> +		of defining a new persistent, volatile, or Dynamic Capacity
> +		(DC) memory region (interleave-set) within the decode range
> +		bounded by root decoder 'decoderX.Y'. The value written must
> +		match the current value returned from reading this attribute.
> +		An atomic compare exchange operation is done on write to assign
> +		the requested id to a region and allocate the region-id for the
> +		next creation attempt.  EBUSY is returned if the region name
> +		written does not match the current cached value.
>
>
>  What:		/sys/bus/cxl/devices/decoderX.Y/delete_region
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 45e7e044cf4a..cf3cf01cb95d 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -13,6 +13,7 @@ extern struct attribute_group cxl_base_attribute_group;
>  #ifdef CONFIG_CXL_REGION
>  extern struct device_attribute dev_attr_create_pmem_region;
>  extern struct device_attribute dev_attr_create_ram_region;
> +extern struct device_attribute dev_attr_create_dc_region;
>  extern struct device_attribute dev_attr_delete_region;
>  extern struct device_attribute dev_attr_region;
>  extern const struct device_type cxl_pmem_region_type;
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index a5db710a63bc..608901bb7d91 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -314,6 +314,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
>  	&dev_attr_target_list.attr,
>  	SET_CXL_REGION_ATTR(create_pmem_region)
>  	SET_CXL_REGION_ATTR(create_ram_region)
> +	SET_CXL_REGION_ATTR(create_dc_region)
>  	SET_CXL_REGION_ATTR(delete_region)
>  	NULL,
>  };
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 69af1354bc5b..fc8dee469244 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -2271,6 +2271,7 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>  	switch (mode) {
>  	case CXL_REGION_RAM:
>  	case CXL_REGION_PMEM:
> +	case CXL_REGION_DC:
>  		break;
>  	default:
>  		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
> @@ -2383,6 +2384,33 @@ static ssize_t create_ram_region_store(struct device *dev,
>  }
>  DEVICE_ATTR_RW(create_ram_region);
>
> +static ssize_t create_dc_region_show(struct device *dev,
> +				     struct device_attribute *attr, char *buf)
> +{
> +	return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static ssize_t create_dc_region_store(struct device *dev,
> +				      struct device_attribute *attr,
> +				      const char *buf, size_t len)
> +{
> +	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
> +	struct cxl_region *cxlr;
> +	int rc, id;
> +
> +	rc = sscanf(buf, "region%d\n", &id);
> +	if (rc != 1)
> +		return -EINVAL;
> +
> +	cxlr = __create_region(cxlrd, id, CXL_REGION_DC,
> +			       CXL_DECODER_HOSTONLYMEM);
> +	if (IS_ERR(cxlr))
> +		return PTR_ERR(cxlr);
> +
> +	return len;
> +}
> +DEVICE_ATTR_RW(create_dc_region);
> +
>  static ssize_t region_show(struct device *dev, struct device_attribute *attr,
>  			   char *buf)
>  {
> @@ -2834,7 +2862,7 @@ static void cxlr_dax_unregister(void *_cxlr_dax)
>  	device_unregister(&cxlr_dax->dev);
>  }
>
> -static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> +static int __devm_cxl_add_dax_region(struct cxl_region *cxlr)
>  {
>  	struct cxl_dax_region *cxlr_dax;
>  	struct device *dev;
> @@ -2863,6 +2891,21 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
>  	return rc;
>  }
>
> +static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> +{
> +	return __devm_cxl_add_dax_region(cxlr);
> +}
> +
> +static int devm_cxl_add_dc_dax_region(struct cxl_region *cxlr)
> +{
> +	if (cxlr->params.interleave_ways != 1) {
> +		dev_err(&cxlr->dev, "Interleaving DC not supported\n");
> +		return -EINVAL;
> +	}
> +
> +	return __devm_cxl_add_dax_region(cxlr);
> +}
> +
>  static int match_decoder_by_range(struct device *dev, void *data)
>  {
>  	struct range *r1, *r2 = data;
> @@ -3203,6 +3246,19 @@ static int is_system_ram(struct resource *res, void *arg)
>  	return 1;
>  }
>
> +/*
> > + * The region can not be managed by CXL if any portion of
> + * it is already online as 'System RAM'
> + */
> +static bool region_is_system_ram(struct cxl_region *cxlr,
> +				 struct cxl_region_params *p)
> +{
> +	return (walk_iomem_res_desc(IORES_DESC_NONE,
> +				    IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> +				    p->res->start, p->res->end, cxlr,
> +				    is_system_ram) > 0);
> +}
> +
>  static int cxl_region_probe(struct device *dev)
>  {
>  	struct cxl_region *cxlr = to_cxl_region(dev);
> @@ -3242,14 +3298,7 @@ static int cxl_region_probe(struct device *dev)
>  	case CXL_REGION_PMEM:
>  		return devm_cxl_add_pmem_region(cxlr);
>  	case CXL_REGION_RAM:
> -		/*
> -		 * The region can not be manged by CXL if any portion of
> -		 * it is already online as 'System RAM'
> -		 */
> -		if (walk_iomem_res_desc(IORES_DESC_NONE,
> -					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> -					p->res->start, p->res->end, cxlr,
> -					is_system_ram) > 0)
> +		if (region_is_system_ram(cxlr, p))
>  			return 0;
>
>  		/*
> @@ -3261,6 +3310,10 @@ static int cxl_region_probe(struct device *dev)
>
>  		/* HDM-H routes to device-dax */
>  		return devm_cxl_add_dax_region(cxlr);
> +	case CXL_REGION_DC:
> +		if (region_is_system_ram(cxlr, p))
> +			return 0;
> +		return devm_cxl_add_dc_dax_region(cxlr);
>  	default:
>  		dev_dbg(&cxlr->dev, "unsupported region mode: %s\n",
>  			cxl_region_mode_name(cxlr->mode));
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 0ee96e6fc426..b76e49813a39 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -169,6 +169,11 @@ static bool is_static(struct dax_region *dax_region)
>  	return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
>  }
>
> +static bool is_dynamic(struct dax_region *dax_region)
> +{
> +	return (dax_region->res.flags & IORESOURCE_DAX_DYNAMIC_CAP) != 0;
> +}
> +
>  bool static_dev_dax(struct dev_dax *dev_dax)
>  {
>  	return is_static(dev_dax->region);
> @@ -285,6 +290,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
>
>  	device_lock_assert(dax_region->dev);
>
> +	if (is_dynamic(dax_region))
> +		return 0;
> +
>  	for_each_dax_region_resource(dax_region, res)
>  		size -= resource_size(res);
>  	return size;
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index 1ccd23360124..74d8fe4a5532 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -13,6 +13,7 @@ struct dax_region;
>  /* dax bus specific ioresource flags */
>  #define IORESOURCE_DAX_STATIC BIT(0)
>  #define IORESOURCE_DAX_KMEM BIT(1)
> +#define IORESOURCE_DAX_DYNAMIC_CAP BIT(2)
>
>  struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>  		struct range *range, int target_node, unsigned int align,
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 8bc9d04034d6..147c8c69782b 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -13,19 +13,30 @@ static int cxl_dax_region_probe(struct device *dev)
>  	struct cxl_region *cxlr = cxlr_dax->cxlr;
>  	struct dax_region *dax_region;
>  	struct dev_dax_data data;
> +	resource_size_t dev_size;
> +	unsigned long flags;
>
>  	if (nid == NUMA_NO_NODE)
>  		nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
>
> +	dev_size = range_len(&cxlr_dax->hpa_range);
> +
> +	flags = IORESOURCE_DAX_KMEM;
> +	if (cxlr->mode == CXL_REGION_DC) {
> +		/* Add empty seed dax device */
> +		dev_size = 0;
> +		flags |= IORESOURCE_DAX_DYNAMIC_CAP;
> +	}
> +
>  	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> -				      PMD_SIZE, IORESOURCE_DAX_KMEM);
> +				      PMD_SIZE, flags);
>  	if (!dax_region)
>  		return -ENOMEM;
>
>  	data = (struct dev_dax_data) {
>  		.dax_region = dax_region,
>  		.id = -1,
> -		.size = range_len(&cxlr_dax->hpa_range),
> +		.size = dev_size,
>  	};
>
>  	return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));
>
> --
> 2.41.0
>


* Re: [PATCH RFC v2 09/18] cxl/mem: Read extents on memory device discovery
  2023-08-29 15:26   ` Jonathan Cameron
  2023-08-30  0:16     ` Ira Weiny
@ 2023-09-05 21:41     ` Ira Weiny
  1 sibling, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-05 21:41 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:21:00 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > When a Dynamic Capacity Device (DCD) is realized some extents may
> > already be available within the DC Regions.  This can happen if the host
> > has accepted extents and been rebooted or any other time the host driver
> > software has become out of sync with the device hardware.
> > 
> > Read the available extents during probe and store them for later
> > use.
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> A few minor comments inline.
> 
> Thanks,
> 
> Jonathan
> 

[snip]

> 
> > +static int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> > +				     unsigned int *extent_gen_num)
> > +{
> > +	struct cxl_mbox_get_dc_extent get_dc_extent;
> > +	struct cxl_mbox_dc_extents dc_extents;
> > +	struct device *dev = mds->cxlds.dev;
> > +	struct cxl_mbox_cmd mbox_cmd;
> > +	unsigned int count;
> > +	int rc;
> > +
> > +	/* Check GET_DC_EXTENT_LIST is supported by device */
> > +	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
> > +		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
> > +		return 0;
> > +	}
> > +
> > +	get_dc_extent = (struct cxl_mbox_get_dc_extent) {
> > +		.extent_cnt = cpu_to_le32(0),
> > +		.start_extent_index = cpu_to_le32(0),
> > +	};
> > +
> > +	mbox_cmd = (struct cxl_mbox_cmd) {
> > +		.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> > +		.payload_in = &get_dc_extent,
> > +		.size_in = sizeof(get_dc_extent),
> > +		.size_out = mds->payload_size,
> 
> If all you are after is the count, then size_out can be a lot smaller than that
> I think as we know it can't return any extents.

Done.

> 
> > +		.payload_out = &dc_extents,
> > +		.min_out = 1,
> > +	};
> > +
> > +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> > +	if (rc < 0)
> > +		return rc;
> > +
> > +	count = le32_to_cpu(dc_extents.total_extent_cnt);
> > +	*extent_gen_num = le32_to_cpu(dc_extents.extent_list_num);
> > +
> > +	return count;
> > +}
> > +
> > +static int cxl_dev_get_dc_extents(struct cxl_memdev_state *mds,
> > +				  unsigned int start_gen_num,
> > +				  unsigned int exp_cnt)
> > +{
> > +	struct cxl_mbox_dc_extents *dc_extents;
> > +	unsigned int start_index, total_read;
> > +	struct device *dev = mds->cxlds.dev;
> > +	struct cxl_mbox_cmd mbox_cmd;
> > +	int retry = 3;
> 
> Why 3?

Removed.

> 
> > +	int rc;
> > +
> > +	/* Check GET_DC_EXTENT_LIST is supported by device */
> > +	if (!test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds)) {
> > +		dev_dbg(dev, "unsupported cmd : get dyn cap extent list\n");
> > +		return 0;
> > +	}
> > +
> > +	dc_extents = kvmalloc(mds->payload_size, GFP_KERNEL);
> 
> Maybe __free magic would simplify this enough to be useful.

Yes.  I'd not wrapped my head around the __free magic until you mentioned it
in the other patch.  It is pretty easy to use.  But I'm worried because it
seems 'too easy'...  ;-)

I'll convert this one too.  So far the other one seems good.  So dare I
say "I know what I'm doing now"...  :-D
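[For readers unfamiliar with it: the kernel's __free() helper is built on the
compiler's cleanup attribute.  A minimal userspace sketch (a hypothetical
reimplementation, not the kernel's actual cleanup.h) of why the early-return
leak goes away:]

```c
#include <stdlib.h>
#include <string.h>

/*
 * Sketch of the pattern behind the kernel's __free() helper: the compiler
 * calls the named function on the variable when it goes out of scope, so
 * every return path frees the buffer without an explicit kvfree()/goto out.
 */
#define DEFINE_FREE(name, type, free_expr) \
	static void __free_##name(void *p) { type _T = *(type *)p; free_expr; }
#define __free(name) __attribute__((cleanup(__free_##name)))

DEFINE_FREE(kvfree, void *, free(_T))

/* Stand-in for the extent read: every return path releases the buffer. */
static int read_extents(int fail_early)
{
	void *buf __free(kvfree) = malloc(4096);

	if (!buf)
		return -12;	/* -ENOMEM */
	if (fail_early)
		return -5;	/* -EIO: buf is still freed automatically */
	memset(buf, 0, 4096);
	return 0;		/* success: buf freed here too */
}
```
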

> 
> > +	if (!dc_extents)
> > +		return -ENOMEM;
> > +
> > +reset:
> > +	total_read = 0;
> > +	start_index = 0;
> > +	do {
> > +		unsigned int nr_ext, total_extent_cnt, gen_num;
> > +		struct cxl_mbox_get_dc_extent get_dc_extent;
> > +
> > +		get_dc_extent = (struct cxl_mbox_get_dc_extent) {
> > +			.extent_cnt = exp_cnt - start_index,
> > +			.start_extent_index = start_index,
> > +		};
> > +		
> > +		mbox_cmd = (struct cxl_mbox_cmd) {
> > +			.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> > +			.payload_in = &get_dc_extent,
> > +			.size_in = sizeof(get_dc_extent),
> > +			.size_out = mds->payload_size,
> > +			.payload_out = dc_extents,
> > +			.min_out = 1,
> > +		};
> > +		
> > +		rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> > +		if (rc < 0)
> > +			goto out;
> > +		
> > +		nr_ext = le32_to_cpu(dc_extents->ret_extent_cnt);
> > +		total_read += nr_ext;
> > +		total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> > +		gen_num = le32_to_cpu(dc_extents->extent_list_num);
> > +
> > +		dev_dbg(dev, "Get extent list count:%d generation Num:%d\n",
> > +			total_extent_cnt, gen_num);
> > +
> > +		if (gen_num != start_gen_num || exp_cnt != total_extent_cnt) {
> > +			dev_err(dev, "Extent list changed while reading; %u != %u : %u != %u\n",
> > +				gen_num, start_gen_num, exp_cnt, total_extent_cnt);
> > +			if (retry--)
> > +				goto reset;
> > +			return -EIO;

And this was a bug too :-(  ...  Fixed with the __free() magic.
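[The restart-on-generation-change loop quoted above can be sketched in
userspace with a simulated device (hypothetical names, not the driver code):
re-read the whole list whenever the reported generation number changes
mid-walk, and give up after a bounded number of retries.]

```c
struct snapshot { unsigned gen; unsigned total; };

static unsigned device_gen = 1;	/* simulated device generation number */
static unsigned queries;

/* Simulated mailbox query: the extent list changes once, mid-walk. */
static struct snapshot query(void)
{
	if (++queries == 2)
		device_gen++;
	return (struct snapshot){ .gen = device_gen, .total = 4 };
}

static int read_list(void)
{
	struct snapshot start = query();

	for (int retry = 3; retry > 0; retry--) {
		struct snapshot now = query();

		if (now.gen == start.gen && now.total == start.total)
			return (int)now.total;	/* list stable: walk done */
		start = now;			/* changed: restart from scratch */
	}
	return -5;				/* -EIO: list never settled */
}
```
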

Thanks for the review,
Ira

> > +		}
> > +		
> > +		for (int i = 0; i < nr_ext ; i++) {
> > +			dev_dbg(dev, "Storing extent %d/%d\n",
> > +				start_index + i, exp_cnt);
> > +			rc = cxl_store_dc_extent(mds, &dc_extents->extent[i]);
> > +			if (rc)
> > +				goto out;
> > +		}
> > +
> > +		start_index += nr_ext;
> > +	} while (exp_cnt > total_read);
> > +
> > +out:
> > +	kvfree(dc_extents);
> > +	return rc;
> > +}
> 
> 




* Re: [PATCH RFC v2 10/18] cxl/mem: Handle DCD add and release capacity events.
  2023-08-29 15:59   ` Jonathan Cameron
@ 2023-09-05 23:49     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-05 23:49 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:21:01 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > A Dynamic Capacity Device (DCD) utilizes events to signal the host about
> > the changes to the allocation of Dynamic Capacity (DC) extents. The
> > device communicates the state of DC extents through an extent list that
> > describes the starting DPA, length, and meta data of the blocks the host
> > can access.
> > 
> > Process the dynamic capacity add and release events.  The addition or
> > removal of extents can occur at any time.  Adding asynchronous memory is
> > straight forward.  Also remember the host is under no obligation to
> > respond to a release event until it is done with the memory.  Introduce
> > extent kref's to handle the delay of extent release.
> > 
> > In the case of a force removal, access to the memory will fail and may
> > cause a crash.  However, the extent tracking object is preserved for the
> > region to safely tear down as long as the memory is not accessed.
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Minor stuff inline.
> 
> 
> > +static int cxl_prepare_ext_list(struct cxl_mbox_dc_response **res,
> > +				int *n, struct range *extent)
> > +{
> > +	struct cxl_mbox_dc_response *dc_res;
> > +	unsigned int size;
> > +
> > +	if (!extent)
> > +		size = struct_size(dc_res, extent_list, 0);
> 
> This is confusing as if you did have *n > 0 I'd kind of expect
> this to just not extend the list rather than shortening it.
> Now I guess that never happens, but locally it looks odd.
> 
> Maybe just handle that case in a separate function as it doesn't
> share much code with the case where there is an extent and I would
> assume we always know at the caller which one we want.

Yea I forget why I left this alone.  I did not care for it during internal
review and I think I got so busy with the other code that this just got
left behind.

Frankly this is a candidate for the __free() magic as well.  But in a
helper function which handles sending the response...

This needs some refactoring for sure...  :-/

> 
> 
> > +	else
> > +		size = struct_size(dc_res, extent_list, *n + 1);
> 
> Might be clearer with a local variable for the number of extents.
> 
> extents_count = *n;
> 
> if (extent)
> 	extents_count++;
> 
> size = struct_size(dc_res, extent_list, extents_count);
> 
> Though I'm not sure that really helps.  Maybe this will just need
> to be a little confusing :)

Actually no.  IIRC the original idea was to have a running response data
structure realloc'ed as events were processed from the log and then to
send out a final large response...  But in my refactoring I did not do
that.  The refactoring processes each event (extent) before going on to
the next event.  I suppose this may be an issue later if large numbers
of extents are added to the logs rapidly and the processing is not fast
enough and the logs overflow.

But I don't think the complexity is warranted at this time.  Especially
because under that condition the size of the response needs to be
contained within mds->payload_size.  So there is quite a bit more
complexity there that I don't think was accounted for initially.

I think cxl_send_dc_cap_response() should handle this allocation (using
__free() magic) and then do the send all in 1 function.

I'll refactor and see how it goes.
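[The krealloc()/struct_size() growth pattern under discussion — one response
header plus a flexible array of extents, regrown one entry at a time — can be
sketched in userspace like this (hypothetical types, not the driver's):]

```c
#include <stdint.h>
#include <stdlib.h>

struct extent { uint64_t dpa_start, length; };
struct dc_response {
	uint32_t extent_count;
	struct extent extent_list[];	/* flexible array, as in the patch */
};

/* Append one extent, growing the response in place; NULL on allocation
 * failure.  realloc(NULL, size) allocates the initial header, mirroring
 * krealloc() on a NULL pointer. */
static struct dc_response *add_extent(struct dc_response *res,
				      uint64_t start, uint64_t len)
{
	size_t n = res ? res->extent_count : 0;
	size_t size = sizeof(*res) + (n + 1) * sizeof(struct extent);
	struct dc_response *grown = realloc(res, size);

	if (!grown)
		return NULL;
	if (!res)
		grown->extent_count = 0;
	grown->extent_list[n].dpa_start = start;
	grown->extent_list[n].length = len;
	grown->extent_count = n + 1;
	return grown;
}
```

A real implementation would also have to cap the grown size at the mailbox
payload size, as noted above.
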

> 
> > +
> > +	dc_res = krealloc(*res, size, GFP_KERNEL);
> > +	if (!dc_res)
> > +		return -ENOMEM;
> > +
> > +	if (extent) {
> > +		dc_res->extent_list[*n].dpa_start = cpu_to_le64(extent->start);
> > +		memset(dc_res->extent_list[*n].reserved, 0, 8);
> > +		dc_res->extent_list[*n].length = cpu_to_le64(range_len(extent));
> > +		(*n)++;
> > +	}
> > +
> > +	*res = dc_res;
> > +	return 0;
> > +}
> 
> > +
> > +/* Returns 0 if the event was handled successfully. */
> > +static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
> > +					struct cxl_event_record_raw *rec)
> > +{
> > +	struct dcd_event_dyn_cap *record = (struct dcd_event_dyn_cap *)rec;
> > +	uuid_t *id = &rec->hdr.id;
> > +	int rc;
> > +
> > +	if (!uuid_equal(id, &dc_event_uuid))
> > +		return -EINVAL;
> > +
> > +	switch (record->data.event_type) {
> > +	case DCD_ADD_CAPACITY:
> > +		rc = cxl_handle_dcd_add_event(mds, &record->data.extent);
> > +		break;
> 
> I guess it might not be consistent with local style...
> 		return cxl_handle_dcd_add_event()  etc

Sure.  That is cleaner.  Done.

Ira

> 
> > +	case DCD_RELEASE_CAPACITY:
> > +        case DCD_FORCED_CAPACITY_RELEASE:
> > +		rc = cxl_handle_dcd_release_event(mds, &record->data.extent);
> > +		break;
> > +	default:
> > +		return -EINVAL;
> > +	}
> > +
> > +	return rc;
> > +}
> > +
> 
> 




* Re: [PATCH RFC v2 11/18] cxl/region: Expose DC extents on region driver load
  2023-08-29 16:20   ` Jonathan Cameron
@ 2023-09-06  3:36     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-06  3:36 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:21:02 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > Ultimately user space must associate Dynamic Capacity (DC) extents with
> > DAX devices.  Remember also that DCD extents may have been accepted
> > previous to regions being created and must have references held until
> > all higher level regions and DAX devices are done with the memory.
> > 
> > On CXL region driver load scan existing device extents and create CXL
> > DAX region extents as needed.
> > 
> > Create abstractions for the extents to be used in DAX region.  This
> > includes a generic interface to take proper references on the lower
> > level CXL region extents.
> > 
> > Also maintain separate objects for the DAX region extent device vs the
> > DAX region extent.  The DAX region extent device has a shorter life span
> > which corresponds to the removal of an extent while a DAX device is
> > still using it.  In this case an extent continues to exist whilst the
> > ability to create new DAX devices on that extent is prevented.
> > 
> > NOTE: Without interleaving; the device, CXL region, and DAX region
> > extents have a 1:1:1 relationship.  Future support for interleaving will
> > maintain a 1:N relationship between CXL region extents and the hardware
> > extents.
> > 
> > While the ability to create DAX devices on an extent exists; expose the
> > necessary details of DAX region extents by creating a device with the
> > following sysfs entries.
> > 
> > /sys/bus/cxl/devices/dax_regionX/extentY
> > /sys/bus/cxl/devices/dax_regionX/extentY/length
> > /sys/bus/cxl/devices/dax_regionX/extentY/label
> > 
> > Label is a rough analogy to the DC extent tag.  As such the DC extent
> > tag is used to initially populate the label.  However, the label is made
> > writeable so that it can be adjusted in the future when forming a DAX
> > device.
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> 
> Trivial stuff inline.
> 
> 
> 
> > diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> > index 27cf2daaaa79..4dab52496c3f 100644
> > --- a/drivers/dax/dax-private.h
> > +++ b/drivers/dax/dax-private.h
> > @@ -5,6 +5,7 @@
> >  #ifndef __DAX_PRIVATE_H__
> >  #define __DAX_PRIVATE_H__
> >  
> > +#include <linux/pgtable.h>
> >  #include <linux/device.h>
> >  #include <linux/cdev.h>
> >  #include <linux/idr.h>
> > @@ -40,6 +41,58 @@ struct dax_region {
> >  	struct device *youngest;
> >  };
> >  
> > +/*
> /**
> 
> as it's valid kernel doc so no disadvantage really.

Sure. Done.

> 
> > + * struct dax_region_extent - extent data defined by the low level region
> > + * driver.
> > + * @private_data: lower level region driver data
> > + * @ref: track number of dax devices which are using this extent
> > + * @get: get reference to low level data
> > + * @put: put reference to low level data
> 
> I'd like to understand when these are optional - perhaps comment on that?

They are not optional in this implementation.  I got a bit carried away
abstracting the dax_region away from the lower levels, thinking that some
other implementation might not need these.

I will still keep the helpers below though.
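[The get/put indirection being kept can be sketched in userspace (hypothetical
names, plain counter instead of a kref): the generic layer takes and drops
references through callbacks without knowing the lower-level type, and the
wrappers tolerate a lower level that supplies none.]

```c
#include <stddef.h>

struct region_extent {
	int refs;				/* stands in for struct kref */
	void (*get)(struct region_extent *ext);
	void (*put)(struct region_extent *ext);
};

/* Lower-level (e.g. CXL) reference callbacks. */
static void lower_get(struct region_extent *ext) { ext->refs++; }
static void lower_put(struct region_extent *ext) { ext->refs--; }

/* Generic helpers: no-ops when the lower level provides no callbacks. */
static void extent_get(struct region_extent *ext)
{
	if (ext->get)
		ext->get(ext);
}

static void extent_put(struct region_extent *ext)
{
	if (ext->put)
		ext->put(ext);
}
```
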

> 
> > + */
> > +struct dax_region_extent {
> > +	void *private_data;
> > +	struct kref ref;
> > +	void (*get)(struct dax_region_extent *dr_extent);
> > +	void (*put)(struct dax_region_extent *dr_extent);
> > +};
> > +
> > +static inline void dr_extent_get(struct dax_region_extent *dr_extent)
> > +{
> > +	if (dr_extent->get)
> > +		dr_extent->get(dr_extent);
> > +}
> > +
> > +static inline void dr_extent_put(struct dax_region_extent *dr_extent)
> > +{
> > +	if (dr_extent->put)
> > +		dr_extent->put(dr_extent);
> > +}
> > +
> > +#define DAX_EXTENT_LABEL_LEN 64
> 
> blank line here.

Sure.  Done

Ira


* Re: [PATCH RFC v2 12/18] cxl/region: Notify regions of DC changes
  2023-08-29 16:40   ` Jonathan Cameron
@ 2023-09-06  4:00     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-06  4:00 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:21:03 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > In order for a user to use dynamic capacity effectively they need to
> > know when dynamic capacity is available.  Thus when Dynamic Capacity
> > (DC) extents are added or removed by a DC device the regions affected
> > need to be notified.  Ultimately the DAX region uses the memory
> > associated with DC extents.  However, remember that CXL DAX regions
> > maintain any interleave details between devices.
> > 
> > When a DCD event occurs, iterate all CXL endpoint decoders and notify
> > regions which contain the endpoints affected by the event.  In turn
> > notify the DAX regions of the changes to the DAX region extents.
> > 
> > For now interleave is handled by creating simple 1:1 mappings between
> > the CXL DAX region and DAX region layers.  Future implementations will
> > need to resolve when to actually surface a DAX region extent and pass
> > the notification along.
> > 
> > Remember that adding capacity is safe because there is no chance of the
> > memory being in use.  Also remember at this point releasing capacity is
> > straight forward because DAX devices do not yet have references to the
> > extents.  Future patches will handle that complication.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> 
> A few trivial comments on this.  Lot here so I'll take a closer look
> at some point after doing a light pass over the rest of the series.
> 

I agree this is a lot.  I thought about splitting the notification into two
patches: one to notify from the device to the CXL region, then a separate
patch from the CXL region to the DAX region.  In the end I channeled Dan and
kept it all together because, without interleaving, there is not much for the
CXL region to do but pass the notification up.  So that split patch would
have been kind of awkward.

> 
> 
> 
> > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > index 80cffa40e91a..d3c4c9c87392 100644
> > --- a/drivers/cxl/mem.c
> > +++ b/drivers/cxl/mem.c
> > @@ -104,6 +104,55 @@ static int cxl_debugfs_poison_clear(void *data, u64 dpa)
> >  DEFINE_DEBUGFS_ATTRIBUTE(cxl_poison_clear_fops, NULL,
> >  			 cxl_debugfs_poison_clear, "%llx\n");
> >  
> > +static int match_ep_decoder_by_range(struct device *dev, void *data)
> > +{
> > +	struct cxl_dc_extent_data *extent = data;
> > +	struct cxl_endpoint_decoder *cxled;
> > +
> > +	if (!is_endpoint_decoder(dev))
> > +		return 0;
> 
> blank line

Done.

> 
> > +	cxled = to_cxl_endpoint_decoder(dev);
> > +	return cxl_dc_extent_in_ed(cxled, extent);
> > +}
> > +
> > +static struct cxl_endpoint_decoder *cxl_find_ed(struct cxl_memdev_state *mds,
> > +						struct cxl_dc_extent_data *extent)
> > +{
> > +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> > +	struct cxl_port *endpoint = cxlmd->endpoint;
> > +	struct device *dev;
> > +
> > +	dev = device_find_child(&endpoint->dev, extent,
> > +				match_ep_decoder_by_range);
> > +	if (!dev) {
> > +		dev_dbg(mds->cxlds.dev, "Extent DPA:%llx LEN:%llx not mapped\n",
> > +			extent->dpa_start, extent->length);
> > +		return NULL;
> > +	}
> > +
> > +	return to_cxl_endpoint_decoder(dev);
> > +}
> > +
> > +static int cxl_mem_notify(struct device *dev, struct cxl_drv_nd *nd)
> > +{
> > +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> > +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> > +	struct cxl_endpoint_decoder *cxled;
> > +	struct cxl_dc_extent_data *extent;
> > +	int rc = 0;
> > +
> > +	extent = nd->extent;
> > +	dev_dbg(dev, "notify DC action %d DPA:%llx LEN:%llx\n",
> > +		nd->event, extent->dpa_start, extent->length);
> > +
> > +	cxled = cxl_find_ed(mds, extent);
> > +	if (!cxled)
> > +		return 0;
> Blank line.

Done.

> 
> > +	rc = cxl_ed_notify_extent(cxled, nd);
> > +	put_device(&cxled->cxld.dev);
> > +	return rc;
> > +}
> > +
> >  static int cxl_mem_probe(struct device *dev)
> >  {
> >  	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> > @@ -247,6 +296,7 @@ __ATTRIBUTE_GROUPS(cxl_mem);
> >  static struct cxl_driver cxl_mem_driver = {
> >  	.name = "cxl_mem",
> >  	.probe = cxl_mem_probe,
> > +	.notify = cxl_mem_notify,
> >  	.id = CXL_DEVICE_MEMORY_EXPANDER,
> >  	.drv = {
> >  		.dev_groups = cxl_mem_groups,
> > diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> > index 057b00b1d914..44cbd28668f1 100644
> > --- a/drivers/dax/cxl.c
> > +++ b/drivers/dax/cxl.c
> > @@ -59,6 +59,29 @@ static int cxl_dax_region_create_extent(struct dax_region *dax_region,
> >  	return 0;
> >  }
> >  
> > +static int cxl_dax_region_add_extent(struct cxl_dax_region *cxlr_dax,
> > +				     struct cxl_dr_extent *cxl_dr_ext)
> > +{
> 
> Why not have this helper in the earlier patch that introduced the code
> this is factoring out?  Will reduce churn in the set whilst not much hurting
> readability of that patch.

Because this logic appeared in only one place in that patch.  Here the same
logic is used twice, so it got factored out.

I see where you are coming from because this is a straight-up copy of the
code.  I'll go ahead and do that.

> 
> > +	/*
> > +	 * get not zero is important because this is racing with the
> > +	 * region driver which is racing with the memory device which
> > +	 * could be removing the extent at the same time.
> > +	 */
> > +	if (cxl_dr_extent_get_not_zero(cxl_dr_ext)) {
> > +		struct dax_region *dax_region;
> > +		int rc;
> > +
> > +		dax_region = dev_get_drvdata(&cxlr_dax->dev);
> > +		dev_dbg(&cxlr_dax->dev, "Creating HPA:%llx LEN:%llx\n",
> > +			cxl_dr_ext->hpa_offset, cxl_dr_ext->hpa_length);
> > +		rc = cxl_dax_region_create_extent(dax_region, cxl_dr_ext);
> > +		cxl_dr_extent_put(cxl_dr_ext);
> > +		if (rc)
> > +			return rc;
> > +	}
> > +	return 0;
> Perhaps flip logic
> 	if (!cxl_dr_extent_get_not_zero())
> 		return 0;
> 
> etc to reduce the code indent.

That is OK.  In this case I do kind of like the indent, but I'll change it
to your way because I think it is slightly better.

Done in the previous patch.
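For reference, the guard-clause shape Jonathan suggests can be sketched in a
small self-contained userspace program (struct and function names here are
hypothetical stand-ins for the kref-based `cxl_dr_extent_get_not_zero()` /
`cxl_dr_extent_put()` helpers, not the in-tree code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kref-counted extent. */
struct extent { int refcount; };

static bool extent_get_not_zero(struct extent *ext)
{
	if (ext->refcount == 0)
		return false;
	ext->refcount++;
	return true;
}

static void extent_put(struct extent *ext)
{
	ext->refcount--;
}

/*
 * Guard-clause form: bail out early when the extent has already raced
 * away, which keeps the real work at a single indent level.
 */
static int add_extent(struct extent *ext, int *created)
{
	if (!extent_get_not_zero(ext))
		return 0;	/* extent went away; nothing to do */

	(*created)++;		/* stand-in for the create_extent() work */
	extent_put(ext);
	return 0;
}
```

The early return makes the racy "already gone" case read as a precondition
check rather than wrapping the whole body in a conditional.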


> > +}
> > +
> >  static int cxl_dax_region_create_extents(struct cxl_dax_region *cxlr_dax)
> >  {
> >  	struct cxl_dr_extent *cxl_dr_ext;
> > @@ -66,27 +89,68 @@ static int cxl_dax_region_create_extents(struct cxl_dax_region *cxlr_dax)
> >  
> >  	dev_dbg(&cxlr_dax->dev, "Adding extents\n");
> >  	xa_for_each(&cxlr_dax->extents, index, cxl_dr_ext) {
> > -		/*
> > -		 * get not zero is important because this is racing with the
> > -		 * region driver which is racing with the memory device which
> > -		 * could be removing the extent at the same time.
> > -		 */
> > -		if (cxl_dr_extent_get_not_zero(cxl_dr_ext)) {
> > -			struct dax_region *dax_region;
> > -			int rc;
> > -
> > -			dax_region = dev_get_drvdata(&cxlr_dax->dev);
> > -			dev_dbg(&cxlr_dax->dev, "Found OFF:%llx LEN:%llx\n",
> > -				cxl_dr_ext->hpa_offset, cxl_dr_ext->hpa_length);
> > -			rc = cxl_dax_region_create_extent(dax_region, cxl_dr_ext);
> > -			cxl_dr_extent_put(cxl_dr_ext);
> > -			if (rc)
> > -				return rc;
> > -		}
> > +		int rc;
> > +
> > +		rc = cxl_dax_region_add_extent(cxlr_dax, cxl_dr_ext);
> > +		if (rc)
> > +			return rc;
> >  	}
> >  	return 0;
> >  }
> >  
> > +static int match_cxl_dr_extent(struct device *dev, void *data)
> > +{
> > +	struct dax_reg_ext_dev *dr_reg_ext_dev;
> > +	struct dax_region_extent *dr_extent;
> > +
> > +	if (!is_dr_ext_dev(dev))
> > +		return 0;
> > +
> > +	dr_reg_ext_dev = to_dr_ext_dev(dev);
> > +	dr_extent = dr_reg_ext_dev->dr_extent;
> > +	return data == dr_extent->private_data;
> > +}
> > +
> > +static int cxl_dax_region_rm_extent(struct cxl_dax_region *cxlr_dax,
> > +				    struct cxl_dr_extent *cxl_dr_ext)
> > +{
> > +	struct dax_reg_ext_dev *dr_reg_ext_dev;
> > +	struct dax_region *dax_region;
> > +	struct device *dev;
> > +
> > +	dev = device_find_child(&cxlr_dax->dev, cxl_dr_ext,
> > +				match_cxl_dr_extent);
> > +	if (!dev)
> > +		return -EINVAL;
> 
> blank line.

Done.

> 
> > +	dr_reg_ext_dev = to_dr_ext_dev(dev);
> > +	put_device(dev);
> > +	dax_region = dev_get_drvdata(&cxlr_dax->dev);
> > +	dax_region_ext_del_dev(dax_region, dr_reg_ext_dev);
> blank line

Done.

> 
> > +	return 0;
> > +}
> > +
> > +static int cxl_dax_region_notify(struct device *dev,
> > +				 struct cxl_drv_nd *nd)
> > +{
> > +	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
> > +	struct cxl_dr_extent *cxl_dr_ext = nd->cxl_dr_ext;
> > +	int rc = 0;
> > +
> > +	switch (nd->event) {
> > +	case DCD_ADD_CAPACITY:
> > +		rc = cxl_dax_region_add_extent(cxlr_dax, cxl_dr_ext);
> > +		break;
> 
> Early returns in here will perhaps make this more readable and definitely
> make it more compact.

Yep. done.
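The early-return form of the notify switch can be sketched like this (event
and handler names are stand-ins; the patch above only shows
`DCD_ADD_CAPACITY`):

```c
#include <assert.h>

/* Hypothetical event types for the sketch. */
enum dc_event {
	DCD_ADD_CAPACITY,
	DCD_RELEASE_CAPACITY,
};

static int add_extent(void)    { return 1; }
static int remove_extent(void) { return 2; }

/*
 * Returning directly from each case drops the rc variable and the
 * break statements, while the compiler can still warn about a
 * missed enum case.
 */
static int region_notify(enum dc_event event)
{
	switch (event) {
	case DCD_ADD_CAPACITY:
		return add_extent();
	case DCD_RELEASE_CAPACITY:
		return remove_extent();
	}
	return -1;	/* unknown event */
}
```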

Thanks again for the review!
Ira


* Re: [PATCH RFC v2 15/18] cxl/mem: Trace Dynamic capacity Event Record
  2023-08-29 16:46   ` Jonathan Cameron
@ 2023-09-06  4:07     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-06  4:07 UTC (permalink / raw)
  To: Jonathan Cameron, ira.weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:21:06 -0700
> ira.weiny@intel.com wrote:
> 
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > CXL rev 3.0 section 8.2.9.2.1.5 defines the Dynamic Capacity Event Record
> > Determine if the event read is a Dynamic capacity event record and
> > if so trace the record for the debug purpose.
> > 
> > Add DC trace points to the trace log.
> 
> Probably should say why these might be useful...
> 

It's kind of hidden.

	"... for the debug purpose."

I suppose this could be used to react to new extents coming online to
create new dax devices in the future.  But that should really be done
through udev when dax extent devices surface, not via these events.

I'll clarify the commit message.

Thanks,
Ira


* Re: [PATCH RFC v2 13/18] dax/bus: Factor out dev dax resize logic
  2023-08-30 11:27   ` Jonathan Cameron
@ 2023-09-06  4:12     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-06  4:12 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:21:04 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > Dynamic Capacity regions must limit dev dax resources to those areas
> > which have extents backing real memory.  Four alternatives were
> > considered to manage the intersection of region space and extents:
> > 
> > 1) Create a single region resource child on region creation which
> >    reserves the entire region.  Then as extents are added punch holes in
> >    this reservation.  This requires new resource manipulation to punch
> >    the holes and still requires an additional iteration over the extent
> >    areas which may already have existing dev dax resources used.
> > 
> > 2) Maintain an ordered xarray of extents which can be queried while
> >    processing the resize logic.  The issue is that existing region->res
> >    children may artificially limit the allocation size sent to
> >    alloc_dev_dax_range().  IE the resource children can't be directly
> >    used in the resize logic to find where space in the region is.
> > 
> > 3) Maintain a separate resource tree with extents.  This option is the
> >    same as 2) but with a different data structure.  Most ideally we have
> >    some unified representation of the resource tree.
> > 
> > 4) Create region resource children for each extent.  Manage the dax dev
> >    resize logic in the same way as before but use a region child
> >    (extent) resource as the parents to find space within each extent.
> > 
> > Option 4 can leverage the existing resize algorithm to find space within
> > the extents.
> > 
> > In preparation for this change, factor out the dev_dax_resize logic.
> > For static regions use dax_region->res as the parent to find space for
> > the dax ranges.  Future patches will use the same algorithm with
> > individual extent resources as the parent.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Hi Ira,
> 
> Some trivial comments on comments, but in general this indeed seems to be doing what you
> say and factoring out the static allocation part.
> 
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Thanks!

> 
> 
> > ---
> >  drivers/dax/bus.c | 128 +++++++++++++++++++++++++++++++++---------------------
> >  1 file changed, 79 insertions(+), 49 deletions(-)
> > 
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index b76e49813a39..ea7ae82b4687 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> > @@ -817,11 +817,10 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
> >  	return 0;
> >  }
> >  
> 
> > -static ssize_t dev_dax_resize(struct dax_region *dax_region,
> > -		struct dev_dax *dev_dax, resource_size_t size)
> > +/*
> 
> /**
> 
> Suitable builds will then check this doc matches the function etc
> even if this is never included into any of the docs build.

Done.

> 
> > + * dev_dax_resize_static - Expand the device into the unused portion of the
> > + * region. This may involve adjusting the end of an existing resource, or
> > + * allocating a new resource.
> > + *
> > + * @parent: parent resource to allocate this range in.
> > + * @dev_dax: DAX device we are creating this range for
> 
> Trivial: Doesn't seem to be consistent on . or not

That is because my brain has a real consistency issue on this...  ;-)

'.' removed.

Thanks again,
Ira


* Re: [PATCH RFC v2 14/18] dax/region: Support DAX device creation on dynamic DAX regions
  2023-08-30 11:50   ` Jonathan Cameron
@ 2023-09-06  4:35     ` Ira Weiny
  2023-09-12 16:49       ` Jonathan Cameron
  0 siblings, 1 reply; 97+ messages in thread
From: Ira Weiny @ 2023-09-06  4:35 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:21:05 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > Dynamic Capacity (DC) DAX regions have a list of extents which define
> > the memory of the region which is available.
> > 
> > Now that DAX region extents are fully realized support DAX device
> > creation on dynamic regions by adjusting the allocation algorithms
> > to account for the extents.  Remember also references must be held on
> > the extents until the DAX devices are done with the memory.
> > 
> > Redefine the region available size to include only extent space.  Reuse
> > the size allocation algorithm by defining sub-resources for each extent
> > and limiting range allocation to those extents which have space.  Do not
> > support direct mapping of DAX devices on dynamic devices.
> > 
> > Enhance DAX device range objects to hold references on the extents until
> > the DAX device is destroyed.
> > 
> > NOTE: At this time all extents within a region are created equally.
> > However, labels are associated with extents which can be used with
> > future DAX device labels to group which extents are used.
> 
> This sound like a bad place to start to me as we are enabling something
> that is probably 'wrong' in the long term as opposed to just not enabling it
> until we have appropriate support.

I disagree.  I don't think the kernel should be trying to process tags at
the lower level.

> I'd argue better to just reject any extents with different labels for now.

Again I disagree.  This is less restrictive.  The idea is that labels can
be changed such that user space can ultimately decide which extents
should be used for which devices.  I have some work on that already.
(Basically it becomes quite easy to assign a label to a dax device and
have the extent search use only dax extents which match that label.)
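A sketch of what that label-filtered extent search could look like (all
names here are hypothetical; this is a userspace illustration of the idea,
not the proposed kernel code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical extent with a label and some free capacity. */
struct extent { const char *label; size_t free; };

/*
 * Pick the first extent that matches the dax device's label and has
 * enough room.  A NULL device label means "use any extent".
 */
static struct extent *find_extent(struct extent *exts, size_t n,
				  const char *dev_label, size_t need)
{
	for (size_t i = 0; i < n; i++) {
		if (dev_label && strcmp(exts[i].label, dev_label) != 0)
			continue;
		if (exts[i].free >= need)
			return &exts[i];
	}
	return NULL;
}
```

The point is that the tag/label policy lives in the search, so user space
can steer placement simply by relabeling, rather than the kernel trying to
interpret tags at a lower level.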

> 
> As this is an RFC meh ;)

Sure!  ;-)

> 
> Whilst this looks fine to me, I'm rather out of my depth wrt to the DAX
> side of things so take that with a pinch of salt.

NP

> 
> Jonathan
> 
> 
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > ---
> >  drivers/dax/bus.c         | 157 +++++++++++++++++++++++++++++++++++++++-------
> >  drivers/dax/cxl.c         |  44 +++++++++++++
> >  drivers/dax/dax-private.h |   5 ++
> >  3 files changed, 182 insertions(+), 24 deletions(-)
> > 
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index ea7ae82b4687..a9ea6a706702 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> 
> ...
> 
> 
> > @@ -1183,7 +1290,7 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
> >  	to_alloc = range_len(&r);
> >  	if (alloc_is_aligned(dev_dax, to_alloc))
> >  		rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
> > -					 to_alloc);
> > +					 to_alloc, NULL);
> >  	device_unlock(dev);
> >  	device_unlock(dax_region->dev);
> >  
> > @@ -1400,8 +1507,10 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
> >  	device_initialize(dev);
> >  	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
> >  
> > +	dev_WARN_ONCE(parent, is_dynamic(dax_region) && data->size,
> > +		      "Dynamic DAX devices are created initially with 0 size");
> 
> dev_info() maybe more appropriate?

Unless I'm mistaken, this could happen from userspace, but only if
something in the code changes later.  Because the dax layer is trying to
support non-dynamic regions (for which 'dynamic' may be a bad name), I was
worried that creation with a size might slip through...

> Is this common enough that we need the
> _ONCE?

The _ONCE is because it could end up spamming the log later if something
got coded up wrong.

> 
> 
> >  	rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
> > -				 data->size);
> > +				 data->size, NULL);
> >  	if (rc)
> >  		goto err_range;
> >  
> > diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> > index 44cbd28668f1..6394a3531e25 100644
> > --- a/drivers/dax/cxl.c
> > +++ b/drivers/dax/cxl.c
> ...
> 
> 
> >  static int cxl_dax_region_create_extent(struct dax_region *dax_region,
> >  					struct cxl_dr_extent *cxl_dr_ext)
> >  {
> > @@ -45,11 +80,20 @@ static int cxl_dax_region_create_extent(struct dax_region *dax_region,
> >  	/* device manages the dr_extent on success */
> >  	kref_init(&dr_extent->ref);
> >  
> > +	rc = dax_region_add_resource(dax_region, dr_extent,
> > +				     cxl_dr_ext->hpa_offset,
> > +				     cxl_dr_ext->hpa_length);
> > +	if (rc) {
> > +		kfree(dr_extent);
> 
> goto for these and single unwinding block?

Yea.  Done.

> 
> > +		return rc;
> > +	}
> > +
> >  	rc = dax_region_ext_create_dev(dax_region, dr_extent,
> >  				       cxl_dr_ext->hpa_offset,
> >  				       cxl_dr_ext->hpa_length,
> >  				       cxl_dr_ext->label);
> >  	if (rc) {
> > +		dax_region_rm_resource(dr_extent);
> >  		kfree(dr_extent);
> as above.

Done.

> 
> >  		return rc;
> >  	}
> > diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> > index 250babd6e470..ad73b53aa802 100644
> > --- a/drivers/dax/dax-private.h
> > +++ b/drivers/dax/dax-private.h
> > @@ -44,12 +44,16 @@ struct dax_region {
> >  /*
> >   * struct dax_region_extent - extent data defined by the low level region
> >   * driver.
> > + * @region: cache of dax_region
> > + * @res: cache of resource tree for this extent
> >   * @private_data: lower level region driver data
> 
> Not sure 'lower level' is well defined here. Is "region driver data"
> not enough?

For me it was not.  I'll have to sleep on it.  Technically there is no
dax_region 'driver' but only a dax_region device.

> 
> >   * @ref: track number of dax devices which are using this extent
> >   * @get: get reference to low level data
> >   * @put: put reference to low level data
> >   */
> >  struct dax_region_extent {
> > +	struct dax_region *region;
> > +	struct resource *res;
> >  	void *private_data;
> >  	struct kref ref;
> >  	void (*get)(struct dax_region_extent *dr_extent);
> > @@ -131,6 +135,7 @@ struct dev_dax {
> >  		unsigned long pgoff;
> >  		struct range range;
> >  		struct dax_mapping *mapping;
> > +		struct dax_region_extent *dr_extent;
> 
> Huh. Seems that ranges is in the kernel doc but not the
> bits that make that up.  Maybe good to add the docs
> whilst here?

Oh, sure.  It took me a couple of reads of this sentence.

I'm going to think on this too.

Ira

> 
> >  	} *ranges;
> >  };
> >  
> > 
> 




* Re: [PATCH RFC v2 08/18] cxl/region: Add Dynamic Capacity CXL region support
  2023-08-30 23:27   ` Dave Jiang
@ 2023-09-06  4:36     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-06  4:36 UTC (permalink / raw)
  To: Dave Jiang, Ira Weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Dave Jiang wrote:
> 
> 
> On 8/28/23 22:20, Ira Weiny wrote:
> > CXL devices optionally support dynamic capacity.  CXL Regions must be
> > configured correctly to access this capacity.  Similar to ram and pmem
> > partitions, DC Regions represent different partitions of the DPA space.
> > 
> > Interleaving is deferred due to the complexity of managing extents on
> > multiple devices at the same time.  However, there is nothing which
> > directly prevents interleave support at this time.  The check allows
> > for early rejection.
> > 
> > To maintain backwards compatibility with older software, CXL regions
> > need a default DAX device to hold the reference for the region until it
> > is deleted.
> > 
> > Add create_dc_region sysfs entry to create DC regions.  Share the logic
> > of devm_cxl_add_dax_region() and region_is_system_ram().  Special case
> > DC capable CXL regions to create a 0 sized seed DAX device until others
> > can be created on dynamic space later.
> > 
> > Flag dax_regions to indicate 0 capacity available until dax_region
> > extents are supported by the region.
> > 
> > Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> You probably should update kernel version to v6.7. Otherwise

Done.

> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 


* Re: [PATCH RFC v2 16/18] tools/testing/cxl: Make event logs dynamic
  2023-08-30 12:11   ` Jonathan Cameron
@ 2023-09-06 21:15     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-06 21:15 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:21:07 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > The test event logs were created as static arrays as an easy way to mock
> > events.  Dynamic Capacity Device (DCD) test support requires events be
> > created dynamically when extents are created/destroyed.
> > 
> > Modify the event log storage to be dynamically allocated.  Thus they can
> > accommodate the dynamic events required by DCD.  Reuse the static event
> > data to create the dynamic events in the new logs without inventing
> > complex event injection through the test sysfs.  Simplify the processing
> > of the logs by using the event log array index as the handle.  Add a
> > lock to manage concurrency to come with DCD extent testing.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Diff did a horrible job on readability of this patch.

Yeah, apologies.  I'm not sure if b4 can use --patience, but I tried
patience by hand and it did not do any better.

> 
> Ah well. Comments superficial only.
> 
> Jonathan
> 
> > ---
> >  tools/testing/cxl/test/mem.c | 276 ++++++++++++++++++++++++++-----------------
> >  1 file changed, 170 insertions(+), 106 deletions(-)
> > 
> > diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
> > index 51be202fabd0..6a036c8d215d 100644
> > --- a/tools/testing/cxl/test/mem.c
> > +++ b/tools/testing/cxl/test/mem.c
> > @@ -118,18 +118,27 @@ static struct {
> >  
> >  #define PASS_TRY_LIMIT 3
> >  
> > -#define CXL_TEST_EVENT_CNT_MAX 15
> > +#define CXL_TEST_EVENT_CNT_MAX 17
> >  
> >  /* Set a number of events to return at a time for simulation.  */
> >  #define CXL_TEST_EVENT_CNT 3
> >  
> > +/*
> > + * @next_handle: next handle (index) to be stored to
> > + * @cur_handle: current handle (index) to be returned to the user on get_event
> > + * @nr_events: total events in this log
> > + * @nr_overflow: number of events added past the log size
> > + * @lock: protect these state variables
> > + * @events: array of pending events to be returned.
> > + */
> >  struct mock_event_log {
> > -	u16 clear_idx;
> > -	u16 cur_idx;
> > +	u16 next_handle;
> > +	u16 cur_handle;
> >  	u16 nr_events;
> >  	u16 nr_overflow;
> > -	u16 overflow_reset;
> > -	struct cxl_event_record_raw *events[CXL_TEST_EVENT_CNT_MAX];
> > +	rwlock_t lock;
> > +	/* 1 extra slot to accommodate that handles can't be 0 */
> > +	struct cxl_event_record_raw *events[CXL_TEST_EVENT_CNT_MAX+1];
> 
> Spaces around +

Done.

> 
> >  };
> >  
> 
> ...
> 
> 
> >  
> > -static void cxl_mock_add_event_logs(struct mock_event_store *mes)
> > +/* Create a dynamically allocated event out of a statically defined event. */
> > +static void add_event_from_static(struct mock_event_store *mes,
> > +				  enum cxl_event_log_type log_type,
> > +				  struct cxl_event_record_raw *raw)
> > +{
> > +	struct device *dev = mes->mds->cxlds.dev;
> > +	struct cxl_event_record_raw *rec;
> > +
> > +	rec = devm_kzalloc(dev, sizeof(*rec), GFP_KERNEL);
> > +	if (!rec) {
> > +		dev_err(dev, "Failed to alloc event for log\n");
> > +		return;
> > +	}
> > +
> > +	memcpy(rec, raw, sizeof(*rec));
> 
> devm_kmemdup()?

Yea!  Thanks!
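For anyone unfamiliar, devm_kmemdup() collapses the allocate-then-memcpy
pair into a single call.  The same shape in plain C (a userspace stand-in
for illustration, without the devres lifetime management):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for kmemdup(): allocate and copy in one step. */
static void *memdup(const void *src, size_t len)
{
	void *p = malloc(len);

	if (p)
		memcpy(p, src, len);
	return p;
}
```

In the kernel the devm_ variant additionally ties the allocation's lifetime
to the device, so no explicit free is needed on the success path.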

Ira


* Re: [PATCH RFC v2 17/18] tools/testing/cxl: Add DC Regions to mock mem data
  2023-08-30 12:20   ` Jonathan Cameron
@ 2023-09-06 21:18     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-06 21:18 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:21:08 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > To test DC regions the mock memory devices will need to store
> > information about the regions and manage fake extent data.
> > 
> > Define mock_dc_region information within the mock memory data.  Add
> > sysfs entries on the mock device to inject and delete extents.
> > 
> > The inject format is <start>:<length>:<tag>
> > The delete format is <start>
> > 
> > Add DC mailbox commands to the CEL and implement those commands.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> Looks fine to me.  Totally trivial comment inline.
> 
> FWIW
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> 
> > +
> >  static int mock_gsl(struct cxl_mbox_cmd *cmd)
> >  {
> >  	if (cmd->size_out < sizeof(mock_gsl_payload))
> > @@ -1315,6 +1429,148 @@ static int mock_activate_fw(struct cxl_mockmem_data *mdata,
> >  	return -EINVAL;
> >  }
> >  
> 
> Bit inconsistent on whether there are one or two blank lines between functions.

I missed this one in my internal review.  There should be no
inconsistency...  always one blank line, unless I messed up!  :-D

Thanks,
Ira


* Re: [PATCH RFC v2 18/18] tools/testing/cxl: Add Dynamic Capacity events
  2023-08-30 12:23   ` Jonathan Cameron
@ 2023-09-06 21:39     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-06 21:39 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Jonathan Cameron wrote:
> On Mon, 28 Aug 2023 22:21:09 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > OS software needs to be alerted when new extents arrive on a Dynamic
> > Capacity Device (DCD).  On test DCDs extents are added through sysfs.
> > 
> > Add events on DCD extent injection.  Directly call the event irq
> > callback to simulate irqs to process the test extents.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Trivial comments inline.
> 
> Reviewed-by: Jonathan.Cameron@huawei.com>
> 
> > ---
> >  tools/testing/cxl/test/mem.c | 57 ++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 57 insertions(+)
> > 
> > diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
> > index d6041a2145c5..20364fee9df9 100644
> > --- a/tools/testing/cxl/test/mem.c
> > +++ b/tools/testing/cxl/test/mem.c
> > @@ -2008,6 +2008,41 @@ static bool new_extent_valid(struct device *dev, size_t new_start,
> >  	return false;
> >  }
> >  
> > +struct dcd_event_dyn_cap dcd_event_rec_template = {
> > +	.hdr = {
> > +		.id = UUID_INIT(0xca95afa7, 0xf183, 0x4018,
> > +				0x8c, 0x2f, 0x95, 0x26, 0x8e, 0x10, 0x1a, 0x2a),
> > +		.length = sizeof(struct dcd_event_dyn_cap),
> > +	},
> > +};
> > +
> > +static int send_dc_event(struct mock_event_store *mes, enum dc_event type,
> > +			 u64 start, u64 length, const char *tag_str)
> 
> Arguably it's not sending the event, but rather adding it to the event log and
> flicking the irq. So maybe naming needs some thought?

I spent all my naming energy on what to call extents at each layer...  ;-)

Yea I'll rename to add_dc_event() or something like that.

> 
> > +{
> > +	struct device *dev = mes->mds->cxlds.dev;
> > +	struct dcd_event_dyn_cap *dcd_event_rec;
> > +
> > +	dcd_event_rec = devm_kzalloc(dev, sizeof(*dcd_event_rec), GFP_KERNEL);
> > +	if (!dcd_event_rec)
> > +		return -ENOMEM;
> > +
> > +	memcpy(dcd_event_rec, &dcd_event_rec_template, sizeof(*dcd_event_rec));
> 
> devm_kmemdup?

Yep would work well.

Thanks again for all the review,
Ira


* Re: [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device
  2023-08-29  5:20 ` [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device ira.weiny
                     ` (2 preceding siblings ...)
  2023-08-30 21:44   ` Fan Ni
@ 2023-09-07 15:46   ` Alison Schofield
  2023-09-12  1:18     ` Ira Weiny
  2023-09-08 12:46   ` Jørgen Hansen
  4 siblings, 1 reply; 97+ messages in thread
From: Alison Schofield @ 2023-09-07 15:46 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Jonathan Cameron,
	Davidlohr Bueso, Dave Jiang, Vishal Verma, linux-cxl,
	linux-kernel

On Mon, Aug 28, 2023 at 10:20:54PM -0700, Ira Weiny wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Devices can optionally support Dynamic Capacity (DC).  These devices are
> known as Dynamic Capacity Devices (DCD).

snip

> 
> +static enum cxl_region_mode
> +cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
> +{
> +	switch (mode) {
> +	case CXL_DECODER_NONE:
> +		return CXL_REGION_NONE;
> +	case CXL_DECODER_RAM:
> +		return CXL_REGION_RAM;
> +	case CXL_DECODER_PMEM:
> +		return CXL_REGION_PMEM;
> +	case CXL_DECODER_DEAD:
> +		return CXL_REGION_DEAD;
> +	case CXL_DECODER_MIXED:
> +	default:
> +		return CXL_REGION_MIXED;
> +	}
> +
> +	return CXL_REGION_MIXED;

Can the paths to return _MIXED be simplified here?


> +}
> +
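For what it's worth, the two _MIXED paths collapse if the default case
absorbs CXL_DECODER_MIXED, dropping the unreachable trailing return.  A
compilable sketch (enum values re-declared locally so it stands alone; not
the in-tree definitions):

```c
#include <assert.h>

/* Local re-declarations mirroring the values quoted in the patch. */
enum cxl_decoder_mode {
	CXL_DECODER_NONE,
	CXL_DECODER_RAM,
	CXL_DECODER_PMEM,
	CXL_DECODER_MIXED,
	CXL_DECODER_DEAD,
};

enum cxl_region_mode {
	CXL_REGION_NONE,
	CXL_REGION_RAM,
	CXL_REGION_PMEM,
	CXL_REGION_MIXED,
	CXL_REGION_DEAD,
};

/*
 * Letting default cover CXL_DECODER_MIXED (and any future value)
 * leaves a single return per path and nothing unreachable.
 */
static enum cxl_region_mode
cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
{
	switch (mode) {
	case CXL_DECODER_NONE:
		return CXL_REGION_NONE;
	case CXL_DECODER_RAM:
		return CXL_REGION_RAM;
	case CXL_DECODER_PMEM:
		return CXL_REGION_PMEM;
	case CXL_DECODER_DEAD:
		return CXL_REGION_DEAD;
	default:
		return CXL_REGION_MIXED;
	}
}
```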
snip

> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index cd4a9ffdacc7..ed282dcd5cf5 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -374,6 +374,28 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  	return "mixed";
>  }
>  
> +enum cxl_region_mode {
> +	CXL_REGION_NONE,
> +	CXL_REGION_RAM,
> +	CXL_REGION_PMEM,
> +	CXL_REGION_MIXED,
> +	CXL_REGION_DEAD,
> +};

I'm concerned about _DEAD.
At first I was going to say name these as CXL_REGION_MODE_*, but it's
pretty obvious that these are mode words...except for DEAD. Is that 
an actual mode or is it some type of status? I don't think I see it
used yet.

> +
> +static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
> +{
> +	static const char * const names[] = {
> +		[CXL_REGION_NONE] = "none",
> +		[CXL_REGION_RAM] = "ram",
> +		[CXL_REGION_PMEM] = "pmem",
> +		[CXL_REGION_MIXED] = "mixed",
> +	};
> +
> +	if (mode >= CXL_REGION_NONE && mode <= CXL_REGION_MIXED)
> +		return names[mode];
> +	return "mixed";
> +}

snip

> +
>  /**
>   * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
>   *
> @@ -449,6 +464,8 @@ struct cxl_dev_state {
>   * @enabled_cmds: Hardware commands found enabled in CEL.
>   * @exclusive_cmds: Commands that are kernel-internal only
>   * @total_bytes: sum of all possible capacities
> + * @static_cap: Sum of RAM and PMEM capacities
> + * @dynamic_cap: Complete DPA range occupied by DC regions

Wondering about renaming RAM and PMEM caps as 'static'.
They are changeable via set partition commands.


>   * @volatile_only_bytes: hard volatile capacity
>   * @persistent_only_bytes: hard persistent capacity
>   * @partition_align_bytes: alignment size for partition-able capacity
> @@ -456,6 +473,10 @@ struct cxl_dev_state {
>   * @active_persistent_bytes: sum of hard + soft persistent
>   * @next_volatile_bytes: volatile capacity change pending device reset
>   * @next_persistent_bytes: persistent capacity change pending device reset
> + * @nr_dc_region: number of DC regions implemented in the memory device
> + * @dc_region: array containing info about the DC regions
> + * @dc_event_log_size: The number of events the device can store in the
> + * Dynamic Capacity Event Log before it overflows
>   * @event: event log driver state
>   * @poison: poison driver state info
>   * @fw: firmware upload / activation state
> @@ -473,7 +494,10 @@ struct cxl_memdev_state {
>  	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
>  	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
>  	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> +
>  	u64 total_bytes;
> +	u64 static_cap;
> +	u64 dynamic_cap;
>  	u64 volatile_only_bytes;
>  	u64 persistent_only_bytes;
>  	u64 partition_align_bytes;
> @@ -481,6 +505,11 @@ struct cxl_memdev_state {
>  	u64 active_persistent_bytes;
>  	u64 next_volatile_bytes;
>  	u64 next_persistent_bytes;
> +
> +	u8 nr_dc_region;
> +	struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
> +	size_t dc_event_log_size;
> +
>  	struct cxl_event_state event;
>  	struct cxl_poison_state poison;
>  	struct cxl_security_state security;
> @@ -587,6 +616,7 @@ struct cxl_mbox_identify {
>  	__le16 inject_poison_limit;
>  	u8 poison_caps;
>  	u8 qos_telemetry_caps;
> +	__le16 dc_event_log_size;
>  } __packed;
>  

snip

>  /*

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD)
  2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (17 preceding siblings ...)
  2023-08-29  5:21 ` [PATCH RFC v2 18/18] tools/testing/cxl: Add Dynamic Capacity events Ira Weiny
@ 2023-09-07 21:01 ` Fan Ni
  2023-09-12  1:44   ` Ira Weiny
  18 siblings, 1 reply; 97+ messages in thread
From: Fan Ni @ 2023-09-07 21:01 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Jonathan Cameron,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma,
	linux-cxl, linux-kernel, a.manzanares, nmtadam.samsung, nifan

On Mon, Aug 28, 2023 at 10:20:51PM -0700, Ira Weiny wrote:
> A Dynamic Capacity Device (DCD) (CXL 3.0 spec 9.13.3) is a CXL memory
> device that implements dynamic capacity.  Dynamic capacity feature
> allows memory capacity to change dynamically, without the need for
> resetting the device.
>
> Even though this is marked v2 by b4, this is effectively a whole new
> series for DCD support.  Quite a bit of the core support was completed
> by Navneet in [4].  However, the architecture through the CXL region,
> DAX region, and DAX Device layers is completely different.  Particular
> attention was paid to:
>
> 	1) managing skip resources in the hardware device
> 	2) ensuring the host OS only sent a release memory mailbox
> 	   response when all DAX devices are done using an extent
> 	3) allowing dax devices to span extents
> 	4) allowing dax devices to use parts of extents
>
> I could say all of the review comments from v1 are addressed but frankly
> the series has changed so much that I can't guarantee anything.
>
> The series continues to be based on the type-2 work posted from Dan.[2]
> However, my branch with that work is a bit dated.  Therefore I have
> posted this series on github here.[5]
>
> Testing was sped up with cxl-test and ndctl dcd support.  A preview of
> that work is on github.[6]  In addition Fan Ni's Qemu DCD series was
> used part of the time.[3]
>
> The major parts of this series are:
>
> - Get the dynamic capacity (DC) region information from cxl device
> - Configure device DC regions reported by hardware
> - Enhance CXL and DAX regions for DC
> 	a. maintain separation between the hardware extents and the CXL
> 	   region extents to provide for the addition of interleaving in
> 	   the future.
> - Get and maintain the hardware extent lists for each device via an
>   initial extent list and DC event records
>         a. Add capacity events
> 	b. Add capacity response
> 	c. Release capacity events
> 	d. Release capacity response
> - Notify region layers of extent changes
> - Allow for DAX devices to be created on extents which are surfaced
> - Maintain references on extents which are in use
> 	a. Send Release capacity Response only when DAX devices are not
> 	   using memory
> - Allow DAX region extent labels to change to allow for flexibility in
>   DAX device creation in the future (further enhancements are required
>   to ndctl for this)
> - Trace Dynamic Capacity events
> - Add cxl-test infrastructure to allow for faster unit testing
>
> To: Dan Williams <dan.j.williams@intel.com>
> Cc: Navneet Singh <navneet.singh@intel.com>
> Cc: Fan Ni <fan.ni@samsung.com>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Alison Schofield <alison.schofield@intel.com>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: linux-cxl@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
>
> [1] https://lore.kernel.org/all/64326437c1496_934b2949f@dwillia2-mobl3.amr.corp.intel.com.notmuch/
> [2] https://lore.kernel.org/all/168592149709.1948938.8663425987110396027.stgit@dwillia2-xfh.jf.intel.com/
> [3] https://lore.kernel.org/all/6483946e8152f_f1132294a2@iweiny-mobl.notmuch/
> [4] https://lore.kernel.org/r/20230604-dcd-type2-upstream-v1-0-71b6341bae54@intel.com
> [5] https://github.com/weiny2/linux-kernel/commits/dcd-v2-2023-08-28
> [6] https://github.com/weiny2/ndctl/tree/dcd-region2
>

Hi Ira,

I tried to test the patch series with the qemu dcd patches, however, I
hit some issues, and would like to check the following with you.

1. After we create a region for DC, a dax device shows up under /dev
before any extents are added. Is that what we want? If I remember
correctly, the dax device used to show up only after a DC extent was
added.


2. Add/release extent does not work correctly for me: the code path is
never called. I made the following changes to get it working.
---
 drivers/cxl/cxl.h    | 3 ++-
 drivers/cxl/cxlmem.h | 1 +
 drivers/cxl/pci.c    | 7 +++++++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 2c73a30980b6..0d132c1739ce 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -168,7 +168,8 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
 #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO |	\
 				 CXLDEV_EVENT_STATUS_WARN |	\
 				 CXLDEV_EVENT_STATUS_FAIL |	\
-				 CXLDEV_EVENT_STATUS_FATAL)
+				 CXLDEV_EVENT_STATUS_FATAL |	\
+				 CXLDEV_EVENT_STATUS_DCD)

 /* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
 #define CXLDEV_EVENT_INT_MODE_MASK	GENMASK(1, 0)
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 8ca81fd067c2..ae9dcb291c75 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -235,6 +235,7 @@ struct cxl_event_interrupt_policy {
 	u8 warn_settings;
 	u8 failure_settings;
 	u8 fatal_settings;
+	u8 dyncap_settings;
 } __packed;

 /**
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 10c1a583113c..e30fe0304514 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -686,6 +686,7 @@ static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
 		.warn_settings = CXL_INT_MSI_MSIX,
 		.failure_settings = CXL_INT_MSI_MSIX,
 		.fatal_settings = CXL_INT_MSI_MSIX,
+		.dyncap_settings = CXL_INT_MSI_MSIX,
 	};

 	mbox_cmd = (struct cxl_mbox_cmd) {
@@ -739,6 +740,12 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds)
 		return rc;
 	}

+	rc = cxl_event_req_irq(cxlds, policy.dyncap_settings);
+	if (rc) {
+		dev_err(cxlds->dev, "Failed to get interrupt for event dyncap log\n");
+		return rc;
+	}
+
 	return 0;
 }

--

3. With the changes made in 2, the code for add/release DC extents can be
called; however, the system behaviour differs from before. Previously,
after a DC extent was added, it would show up in lsmem output and be
listed as offline. Now nothing shows up. Is that expected? What should
we do to make the capacity usable as system RAM?

Please let me know if I missed something or did something wrong. Thanks.

Fan



> ---
> Changes in v2:
> - iweiny: Complete rework of the entire series
> - Link to v1: https://lore.kernel.org/r/20230604-dcd-type2-upstream-v1-0-71b6341bae54@intel.com
>
> ---
> Ira Weiny (15):
>       cxl/hdm: Debug, use decoder name function
>       cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
>       cxl/region: Add Dynamic Capacity decoder and region modes
>       cxl/port: Add Dynamic Capacity mode support to endpoint decoders
>       cxl/port: Add Dynamic Capacity size support to endpoint decoders
>       cxl/region: Add Dynamic Capacity CXL region support
>       cxl/mem: Read extents on memory device discovery
>       cxl/mem: Handle DCD add and release capacity events.
>       cxl/region: Expose DC extents on region driver load
>       cxl/region: Notify regions of DC changes
>       dax/bus: Factor out dev dax resize logic
>       dax/region: Support DAX device creation on dynamic DAX regions
>       tools/testing/cxl: Make event logs dynamic
>       tools/testing/cxl: Add DC Regions to mock mem data
>       tools/testing/cxl: Add Dynamic Capacity events
>
> Navneet Singh (3):
>       cxl/mem: Read Dynamic capacity configuration from the device
>       cxl/mem: Expose device dynamic capacity configuration
>       cxl/mem: Trace Dynamic capacity Event Record
>
>  Documentation/ABI/testing/sysfs-bus-cxl |  56 ++-
>  drivers/cxl/core/core.h                 |   1 +
>  drivers/cxl/core/hdm.c                  | 215 ++++++++-
>  drivers/cxl/core/mbox.c                 | 646 +++++++++++++++++++++++++-
>  drivers/cxl/core/memdev.c               |  77 ++++
>  drivers/cxl/core/port.c                 |  19 +
>  drivers/cxl/core/region.c               | 418 +++++++++++++++--
>  drivers/cxl/core/trace.h                |  65 +++
>  drivers/cxl/cxl.h                       |  99 +++-
>  drivers/cxl/cxlmem.h                    | 138 +++++-
>  drivers/cxl/mem.c                       |  50 ++
>  drivers/cxl/pci.c                       |   8 +
>  drivers/dax/Makefile                    |   1 +
>  drivers/dax/bus.c                       | 263 ++++++++---
>  drivers/dax/bus.h                       |   1 +
>  drivers/dax/cxl.c                       | 213 ++++++++-
>  drivers/dax/dax-private.h               |  61 +++
>  drivers/dax/extent.c                    | 133 ++++++
>  tools/testing/cxl/test/mem.c            | 782 +++++++++++++++++++++++++++-----
>  19 files changed, 3005 insertions(+), 241 deletions(-)
> ---
> base-commit: c76cce37fb6f3796e8e146677ba98d3cca30a488
> change-id: 20230604-dcd-type2-upstream-0cd15f6216fd
>
> Best regards,
> --
> Ira Weiny <ira.weiny@intel.com>
>

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device
  2023-08-29  5:20 ` [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device ira.weiny
                     ` (3 preceding siblings ...)
  2023-09-07 15:46   ` Alison Schofield
@ 2023-09-08 12:46   ` Jørgen Hansen
  2023-09-11 20:26     ` Ira Weiny
  4 siblings, 1 reply; 97+ messages in thread
From: Jørgen Hansen @ 2023-09-08 12:46 UTC (permalink / raw)
  To: ira.weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, linux-cxl,
	linux-kernel

On 8/29/23 07:20, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Devices can optionally support Dynamic Capacity (DC).  These devices are
> known as Dynamic Capacity Devices (DCD).
> 
> Implement the DC (opcode 48XXh) mailbox commands as specified in CXL 3.0
> section 8.2.9.8.9.  Read the DC configuration and store the DC region
> information in the device state.
> 
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes for v2
> [iweiny: Rebased to latest master type2 work]
> [jonathan: s/dc/dc_resp/]
> [iweiny: Clean up commit message]
> [iweiny: Clean kernel docs]
> [djiang: Fix up cxl_is_dcd_command]
> [djiang: extra blank line]
> [alison: s/total_capacity/cap/ etc...]
> [alison: keep partition flag with partition structures]
> [alison: reformat untenanted_mem declaration]
> [alison: move 'cmd' definition back]
> [alison: fix comment line length]
> [alison: reverse x-tree]
> [jonathan: fix and adjust CXL_DC_REGION_STRLEN]
> [Jonathan/iweiny: Factor out storing each DC region read from the device]
> [Jonathan: place all dcr initializers together]
> [Jonathan/iweiny: flip around the region DPA order check]
> [jonathan: Account for short read of mailbox command]
> [iweiny: use snprintf for region name]
> [iweiny: use '<nil>' for missing region names]
> [iweiny: factor out struct cxl_dc_region_info]
> [iweiny: Split out reading CEL]
> ---
>   drivers/cxl/core/mbox.c   | 179 +++++++++++++++++++++++++++++++++++++++++++++-
>   drivers/cxl/core/region.c |  75 +++++++++++++------
>   drivers/cxl/cxl.h         |  27 ++++++-
>   drivers/cxl/cxlmem.h      |  55 +++++++++++++-
>   drivers/cxl/pci.c         |   4 ++
>   5 files changed, 314 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 554ec97a7c39..d769814f80e2 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1096,7 +1096,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
>          if (rc < 0)
>                  return rc;
> 
> -       mds->total_bytes =
> +       mds->static_cap =
>                  le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
>          mds->volatile_only_bytes =
>                  le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
> @@ -1114,6 +1114,8 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
>                  mds->poison.max_errors = min_t(u32, val, CXL_POISON_LIST_MAX);
>          }
> 
> +       mds->dc_event_log_size = le16_to_cpu(id.dc_event_log_size);
> +
>          return 0;
>   }
>   EXPORT_SYMBOL_NS_GPL(cxl_dev_state_identify, CXL);
> @@ -1178,6 +1180,165 @@ int cxl_mem_sanitize(struct cxl_memdev_state *mds, u16 cmd)
>   }
>   EXPORT_SYMBOL_NS_GPL(cxl_mem_sanitize, CXL);
> 
> +static int cxl_dc_save_region_info(struct cxl_memdev_state *mds, int index,
> +                                  struct cxl_dc_region_config *region_config)
> +{
> +       struct cxl_dc_region_info *dcr = &mds->dc_region[index];
> +       struct device *dev = mds->cxlds.dev;
> +
> +       dcr->base = le64_to_cpu(region_config->region_base);
> +       dcr->decode_len = le64_to_cpu(region_config->region_decode_length);
> +       dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
> +       dcr->len = le64_to_cpu(region_config->region_length);
> +       dcr->blk_size = le64_to_cpu(region_config->region_block_size);
> +       dcr->dsmad_handle = le32_to_cpu(region_config->region_dsmad_handle);
> +       dcr->flags = region_config->flags;
> +       snprintf(dcr->name, CXL_DC_REGION_STRLEN, "dc%d", index);
> +
> +       /* Check regions are in increasing DPA order */
> +       if (index > 0) {
> +               struct cxl_dc_region_info *prev_dcr = &mds->dc_region[index - 1];
> +
> +               if ((prev_dcr->base + prev_dcr->decode_len) > dcr->base) {
> +                       dev_err(dev,
> +                               "DPA ordering violation for DC region %d and %d\n",
> +                               index - 1, index);
> +                       return -EINVAL;
> +               }
> +       }
> +
> +       /* Check the region is 256 MB aligned */
> +       if (!IS_ALIGNED(dcr->base, SZ_256M)) {
> +               dev_err(dev, "DC region %d not aligned to 256MB: %#llx\n",
> +                       index, dcr->base);
> +               return -EINVAL;
> +       }
> +
> +       /* Check Region base and length are aligned to block size */
> +       if (!IS_ALIGNED(dcr->base, dcr->blk_size) ||
> +           !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> +               dev_err(dev, "DC region %d not aligned to %#llx\n", index,
> +                       dcr->blk_size);
> +               return -EINVAL;
> +       }
> +
> +       dev_dbg(dev,
> +               "DC region %s DPA: %#llx LEN: %#llx BLKSZ: %#llx\n",
> +               dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
> +
> +       return 0;
> +}
> +
> +/* Returns the number of regions in dc_resp or -ERRNO */
> +static int cxl_get_dc_id(struct cxl_memdev_state *mds, u8 start_region,
> +                        struct cxl_mbox_dynamic_capacity *dc_resp,
> +                        size_t dc_resp_size)
> +{
> +       struct cxl_mbox_get_dc_config get_dc = (struct cxl_mbox_get_dc_config) {
> +               .region_count = CXL_MAX_DC_REGION,
> +               .start_region_index = start_region,
> +       };
> +       struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> +               .opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> +               .payload_in = &get_dc,
> +               .size_in = sizeof(get_dc),
> +               .size_out = dc_resp_size,
> +               .payload_out = dc_resp,
> +               .min_out = 1,
> +       };
> +       struct device *dev = mds->cxlds.dev;
> +       int rc;
> +
> +       rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +       if (rc < 0)
> +               return rc;
> +
> +       rc = dc_resp->avail_region_count - start_region;
> +
> +       /*
> +        * The number of regions in the payload may have been truncated due to
> +        * payload_size limits; if so adjust the count in this query.
> +        */
> +       if (mbox_cmd.size_out < sizeof(*dc_resp))
> +               rc = CXL_REGIONS_RETURNED(mbox_cmd.size_out);
> +
> +       dev_dbg(dev, "Read %d/%d DC regions\n", rc, dc_resp->avail_region_count);
> +
> +       return rc;
> +}
> +
> +/**
> + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> + *                                      information from the device.
> + * @mds: The memory device state
> + *
> + * This will dispatch the get_dynamic_capacity command to the device
> + * and on success populate structures to be exported to sysfs.
> + *
> + * Return: 0 if identify was executed successfully, -ERRNO on error.
> + */
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> +{
> +       struct cxl_mbox_dynamic_capacity *dc_resp;
> +       struct device *dev = mds->cxlds.dev;
> +       size_t dc_resp_size = mds->payload_size;
> +       u8 start_region;
> +       int i, rc = 0;
> +
> +       for (i = 0; i < CXL_MAX_DC_REGION; i++)
> +               snprintf(mds->dc_region[i].name, CXL_DC_REGION_STRLEN, "<nil>");
> +
> +       /* Check GET_DC_CONFIG is supported by device */
> +       if (!test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds)) {
> +               dev_dbg(dev, "unsupported cmd: get_dynamic_capacity_config\n");
> +               return 0;
> +       }
> +
> +       dc_resp = kvmalloc(dc_resp_size, GFP_KERNEL);
> +       if (!dc_resp)
> +               return -ENOMEM;
> +
> +       start_region = 0;
> +       do {
> +               int j;
> +
> +               rc = cxl_get_dc_id(mds, start_region, dc_resp, dc_resp_size);
> +               if (rc < 0)
> +                       goto free_resp;
> +
> +               mds->nr_dc_region += rc;
> +
> +               if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> +                       dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> +                               mds->nr_dc_region);
> +                       rc = -EINVAL;
> +                       goto free_resp;
> +               }
> +
> +               for (i = start_region, j = 0; i < mds->nr_dc_region; i++, j++) {
> +                       rc = cxl_dc_save_region_info(mds, i, &dc_resp->region[j]);
> +                       if (rc)
> +                               goto free_resp;
> +               }
> +
> +               start_region = mds->nr_dc_region;
> +
> +       } while (mds->nr_dc_region < dc_resp->avail_region_count);
> +
> +       mds->dynamic_cap =
> +               mds->dc_region[mds->nr_dc_region - 1].base +
> +               mds->dc_region[mds->nr_dc_region - 1].decode_len -
> +               mds->dc_region[0].base;
> +       dev_dbg(dev, "Total dynamic capacity: %#llx\n", mds->dynamic_cap);
> +
> +free_resp:
> +       kfree(dc_resp);
> +       if (rc)
> +               dev_err(dev, "Failed to get DC info: %d\n", rc);
> +       return rc;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> +
>   static int add_dpa_res(struct device *dev, struct resource *parent,
>                         struct resource *res, resource_size_t start,
>                         resource_size_t size, const char *type)
> @@ -1208,8 +1369,12 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>   {
>          struct cxl_dev_state *cxlds = &mds->cxlds;
>          struct device *dev = cxlds->dev;
> +       size_t untenanted_mem;
>          int rc;
> 
> +       untenanted_mem = mds->dc_region[0].base - mds->static_cap;
> +       mds->total_bytes = mds->static_cap + untenanted_mem + mds->dynamic_cap;
> +
>          if (!cxlds->media_ready) {
>                  cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
>                  cxlds->ram_res = DEFINE_RES_MEM(0, 0);
> @@ -1217,8 +1382,16 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>                  return 0;
>          }
> 
> -       cxlds->dpa_res =
> -               (struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
> +       cxlds->dpa_res = (struct resource)DEFINE_RES_MEM(0, mds->total_bytes);
> +
> +       for (int i = 0; i < mds->nr_dc_region; i++) {
> +               struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> +               rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
> +                                dcr->base, dcr->decode_len, dcr->name);
> +               if (rc)
> +                       return rc;
> +       }
> 
>          if (mds->partition_align_bytes == 0) {
>                  rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 252bc8e1f103..75041903b72c 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -46,7 +46,7 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
>          rc = down_read_interruptible(&cxl_region_rwsem);
>          if (rc)
>                  return rc;
> -       if (cxlr->mode != CXL_DECODER_PMEM)
> +       if (cxlr->mode != CXL_REGION_PMEM)
>                  rc = sysfs_emit(buf, "\n");
>          else
>                  rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
> @@ -359,7 +359,7 @@ static umode_t cxl_region_visible(struct kobject *kobj, struct attribute *a,
>           * Support tooling that expects to find a 'uuid' attribute for all
>           * regions regardless of mode.
>           */
> -       if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_DECODER_PMEM)
> +       if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_REGION_PMEM)
>                  return 0444;
>          return a->mode;
>   }
> @@ -537,7 +537,7 @@ static ssize_t mode_show(struct device *dev, struct device_attribute *attr,
>   {
>          struct cxl_region *cxlr = to_cxl_region(dev);
> 
> -       return sysfs_emit(buf, "%s\n", cxl_decoder_mode_name(cxlr->mode));
> +       return sysfs_emit(buf, "%s\n", cxl_region_mode_name(cxlr->mode));
>   }
>   static DEVICE_ATTR_RO(mode);
> 
> @@ -563,7 +563,7 @@ static int alloc_hpa(struct cxl_region *cxlr, resource_size_t size)
> 
>          /* ways, granularity and uuid (if PMEM) need to be set before HPA */
>          if (!p->interleave_ways || !p->interleave_granularity ||
> -           (cxlr->mode == CXL_DECODER_PMEM && uuid_is_null(&p->uuid)))
> +           (cxlr->mode == CXL_REGION_PMEM && uuid_is_null(&p->uuid)))
>                  return -ENXIO;
> 
>          div_u64_rem(size, SZ_256M * p->interleave_ways, &remainder);
> @@ -1765,6 +1765,17 @@ static int cxl_region_sort_targets(struct cxl_region *cxlr)
>          return rc;
>   }
> 
> +static bool cxl_modes_compatible(enum cxl_region_mode rmode,
> +                                enum cxl_decoder_mode dmode)
> +{
> +       if (rmode == CXL_REGION_RAM && dmode == CXL_DECODER_RAM)
> +               return true;
> +       if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
> +               return true;
> +
> +       return false;
> +}
> +
>   static int cxl_region_attach(struct cxl_region *cxlr,
>                               struct cxl_endpoint_decoder *cxled, int pos)
>   {
> @@ -1778,9 +1789,11 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>          lockdep_assert_held_write(&cxl_region_rwsem);
>          lockdep_assert_held_read(&cxl_dpa_rwsem);
> 
> -       if (cxled->mode != cxlr->mode) {
> -               dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
> -                       dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
> +       if (!cxl_modes_compatible(cxlr->mode, cxled->mode)) {
> +               dev_dbg(&cxlr->dev, "%s region mode: %s mismatch decoder: %s\n",
> +                       dev_name(&cxled->cxld.dev),
> +                       cxl_region_mode_name(cxlr->mode),
> +                       cxl_decoder_mode_name(cxled->mode));
>                  return -EINVAL;
>          }
> 
> @@ -2234,7 +2247,7 @@ static struct cxl_region *cxl_region_alloc(struct cxl_root_decoder *cxlrd, int i
>    * devm_cxl_add_region - Adds a region to a decoder
>    * @cxlrd: root decoder
>    * @id: memregion id to create, or memregion_free() on failure
> - * @mode: mode for the endpoint decoders of this region
> + * @mode: mode of this region
>    * @type: select whether this is an expander or accelerator (type-2 or type-3)
>    *
>    * This is the second step of region initialization. Regions exist within an
> @@ -2245,7 +2258,7 @@ static struct cxl_region *cxl_region_alloc(struct cxl_root_decoder *cxlrd, int i
>    */
>   static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>                                                int id,
> -                                             enum cxl_decoder_mode mode,
> +                                             enum cxl_region_mode mode,
>                                                enum cxl_decoder_type type)
>   {
>          struct cxl_port *port = to_cxl_port(cxlrd->cxlsd.cxld.dev.parent);
> @@ -2254,11 +2267,12 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>          int rc;
> 
>          switch (mode) {
> -       case CXL_DECODER_RAM:
> -       case CXL_DECODER_PMEM:
> +       case CXL_REGION_RAM:
> +       case CXL_REGION_PMEM:
>                  break;
>          default:
> -               dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> +               dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
> +                       cxl_region_mode_name(mode));
>                  return ERR_PTR(-EINVAL);
>          }
> 
> @@ -2308,7 +2322,7 @@ static ssize_t create_ram_region_show(struct device *dev,
>   }
> 
>   static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
> -                                         int id, enum cxl_decoder_mode mode,
> +                                         int id, enum cxl_region_mode mode,
>                                            enum cxl_decoder_type type)
>   {
>          int rc;
> @@ -2337,7 +2351,7 @@ static ssize_t create_pmem_region_store(struct device *dev,
>          if (rc != 1)
>                  return -EINVAL;
> 
> -       cxlr = __create_region(cxlrd, id, CXL_DECODER_PMEM,
> +       cxlr = __create_region(cxlrd, id, CXL_REGION_PMEM,
>                                 CXL_DECODER_HOSTONLYMEM);
>          if (IS_ERR(cxlr))
>                  return PTR_ERR(cxlr);
> @@ -2358,7 +2372,7 @@ static ssize_t create_ram_region_store(struct device *dev,
>          if (rc != 1)
>                  return -EINVAL;
> 
> -       cxlr = __create_region(cxlrd, id, CXL_DECODER_RAM,
> +       cxlr = __create_region(cxlrd, id, CXL_REGION_RAM,
>                                 CXL_DECODER_HOSTONLYMEM);
>          if (IS_ERR(cxlr))
>                  return PTR_ERR(cxlr);
> @@ -2886,10 +2900,31 @@ static void construct_region_end(void)
>          up_write(&cxl_region_rwsem);
>   }
> 
> +static enum cxl_region_mode
> +cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
> +{
> +       switch (mode) {
> +       case CXL_DECODER_NONE:
> +               return CXL_REGION_NONE;
> +       case CXL_DECODER_RAM:
> +               return CXL_REGION_RAM;
> +       case CXL_DECODER_PMEM:
> +               return CXL_REGION_PMEM;
> +       case CXL_DECODER_DEAD:
> +               return CXL_REGION_DEAD;
> +       case CXL_DECODER_MIXED:
> +       default:
> +               return CXL_REGION_MIXED;
> +       }
> +
> +       return CXL_REGION_MIXED;
> +}
> +
>   static struct cxl_region *
>   construct_region_begin(struct cxl_root_decoder *cxlrd,
>                         struct cxl_endpoint_decoder *cxled)
>   {
> +       enum cxl_region_mode mode = cxl_decoder_to_region_mode(cxled->mode);
>          struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
>          struct cxl_region_params *p;
>          struct cxl_region *cxlr;
> @@ -2897,7 +2932,7 @@ construct_region_begin(struct cxl_root_decoder *cxlrd,
> 
>          do {
>                  cxlr = __create_region(cxlrd, atomic_read(&cxlrd->region_id),
> -                                      cxled->mode, cxled->cxld.target_type);
> +                                      mode, cxled->cxld.target_type);
>          } while (IS_ERR(cxlr) && PTR_ERR(cxlr) == -EBUSY);
> 
>          if (IS_ERR(cxlr)) {
> @@ -3200,9 +3235,9 @@ static int cxl_region_probe(struct device *dev)
>                  return rc;
> 
>          switch (cxlr->mode) {
> -       case CXL_DECODER_PMEM:
> +       case CXL_REGION_PMEM:
>                  return devm_cxl_add_pmem_region(cxlr);
> -       case CXL_DECODER_RAM:
> +       case CXL_REGION_RAM:
>                  /*
>                  * The region cannot be managed by CXL if any portion of
>                   * it is already online as 'System RAM'
> @@ -3223,8 +3258,8 @@ static int cxl_region_probe(struct device *dev)
>                  /* HDM-H routes to device-dax */
>                  return devm_cxl_add_dax_region(cxlr);
>          default:
> -               dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
> -                       cxlr->mode);
> +               dev_dbg(&cxlr->dev, "unsupported region mode: %s\n",
> +                       cxl_region_mode_name(cxlr->mode));
>                  return -ENXIO;
>          }
>   }
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index cd4a9ffdacc7..ed282dcd5cf5 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -374,6 +374,28 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>          return "mixed";
>   }
> 
> +enum cxl_region_mode {
> +       CXL_REGION_NONE,
> +       CXL_REGION_RAM,
> +       CXL_REGION_PMEM,
> +       CXL_REGION_MIXED,
> +       CXL_REGION_DEAD,
> +};
> +
> +static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
> +{
> +       static const char * const names[] = {
> +               [CXL_REGION_NONE] = "none",
> +               [CXL_REGION_RAM] = "ram",
> +               [CXL_REGION_PMEM] = "pmem",
> +               [CXL_REGION_MIXED] = "mixed",
> +       };
> +
> +       if (mode >= CXL_REGION_NONE && mode <= CXL_REGION_MIXED)
> +               return names[mode];
> +       return "mixed";
> +}
> +
>   /*
>    * Track whether this decoder is reserved for region autodiscovery, or
>    * free for userspace provisioning.
> @@ -502,7 +524,8 @@ struct cxl_region_params {
>    * struct cxl_region - CXL region
>    * @dev: This region's device
>    * @id: This region's id. Id is globally unique across all regions
> - * @mode: Endpoint decoder allocation / access mode
> + * @mode: Region mode which defines which endpoint decoder mode the region is
> + *        compatible with
>    * @type: Endpoint decoder target type
>    * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
>    * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
> @@ -512,7 +535,7 @@ struct cxl_region_params {
>   struct cxl_region {
>          struct device dev;
>          int id;
> -       enum cxl_decoder_mode mode;
> +       enum cxl_region_mode mode;
>          enum cxl_decoder_type type;
>          struct cxl_nvdimm_bridge *cxl_nvb;
>          struct cxl_pmem_region *cxlr_pmem;
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 5f2e65204bf9..8c8f47b397ab 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -396,6 +396,7 @@ enum cxl_devtype {
>          CXL_DEVTYPE_CLASSMEM,
>   };
> 
> +#define CXL_MAX_DC_REGION 8
>   /**
>    * struct cxl_dev_state - The driver device state
>    *
> @@ -412,6 +413,8 @@ enum cxl_devtype {
>    * @dpa_res: Overall DPA resource tree for the device
>    * @pmem_res: Active Persistent memory capacity configuration
>    * @ram_res: Active Volatile memory capacity configuration
> + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> + *          region
>    * @component_reg_phys: register base of component registers
>    * @serial: PCIe Device Serial Number
>    * @type: Generic Memory Class device or Vendor Specific Memory device
> @@ -426,11 +429,23 @@ struct cxl_dev_state {
>          struct resource dpa_res;
>          struct resource pmem_res;
>          struct resource ram_res;
> +       struct resource dc_res[CXL_MAX_DC_REGION];
>          resource_size_t component_reg_phys;
>          u64 serial;
>          enum cxl_devtype type;
>   };
> 
> +#define CXL_DC_REGION_STRLEN 7
> +struct cxl_dc_region_info {
> +       u64 base;
> +       u64 decode_len;
> +       u64 len;
> +       u64 blk_size;
> +       u32 dsmad_handle;
> +       u8 flags;
> +       u8 name[CXL_DC_REGION_STRLEN];
> +};
> +
>   /**
>    * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
>    *
> @@ -449,6 +464,8 @@ struct cxl_dev_state {
>    * @enabled_cmds: Hardware commands found enabled in CEL.
>    * @exclusive_cmds: Commands that are kernel-internal only
>    * @total_bytes: sum of all possible capacities
> + * @static_cap: Sum of RAM and PMEM capacities
> + * @dynamic_cap: Complete DPA range occupied by DC regions
>    * @volatile_only_bytes: hard volatile capacity
>    * @persistent_only_bytes: hard persistent capacity
>    * @partition_align_bytes: alignment size for partition-able capacity
> @@ -456,6 +473,10 @@ struct cxl_dev_state {
>    * @active_persistent_bytes: sum of hard + soft persistent
>    * @next_volatile_bytes: volatile capacity change pending device reset
>    * @next_persistent_bytes: persistent capacity change pending device reset
> + * @nr_dc_region: number of DC regions implemented in the memory device
> + * @dc_region: array containing info about the DC regions
> + * @dc_event_log_size: The number of events the device can store in the
> + * Dynamic Capacity Event Log before it overflows
>    * @event: event log driver state
>    * @poison: poison driver state info
>    * @fw: firmware upload / activation state
> @@ -473,7 +494,10 @@ struct cxl_memdev_state {
>          DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
>          DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
>          DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> +
>          u64 total_bytes;
> +       u64 static_cap;
> +       u64 dynamic_cap;
>          u64 volatile_only_bytes;
>          u64 persistent_only_bytes;
>          u64 partition_align_bytes;
> @@ -481,6 +505,11 @@ struct cxl_memdev_state {
>          u64 active_persistent_bytes;
>          u64 next_volatile_bytes;
>          u64 next_persistent_bytes;
> +
> +       u8 nr_dc_region;
> +       struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
> +       size_t dc_event_log_size;
> +
>          struct cxl_event_state event;
>          struct cxl_poison_state poison;
>          struct cxl_security_state security;
> @@ -587,6 +616,7 @@ struct cxl_mbox_identify {
>          __le16 inject_poison_limit;
>          u8 poison_caps;
>          u8 qos_telemetry_caps;
> +       __le16 dc_event_log_size;
>   } __packed;

Hi,

To handle backwards compatibility with CXL 2.0 devices, 
cxl_dev_state_identify() needs to handle both the CXL 2.0 and 3.0 
versions of struct cxl_mbox_identify. The spec says that newer code can 
use the payload size to detect the different versions, so something like 
the following:

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 9462c34aa1dc..0a6f038996aa 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1356,6 +1356,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
                 .opcode = CXL_MBOX_OP_IDENTIFY,
                 .size_out = sizeof(id),
                 .payload_out = &id,
+               .min_out = CXL_MBOX_IDENTIFY_MIN_LENGTH,
         };
         rc = cxl_internal_send_cmd(mds, &mbox_cmd);
         if (rc < 0)
@@ -1379,7 +1380,8 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
                mds->poison.max_errors = min_t(u32, val, CXL_POISON_LIST_MAX);
         }

-       mds->dc_event_log_size = le16_to_cpu(id.dc_event_log_size);
+       if (mbox_cmd.size_out >= CXL_MBOX_IDENTIFY_CXL3_LENGTH)
+               mds->dc_event_log_size = le16_to_cpu(id.dc_event_log_size);

         return 0;
  }
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index ae9dcb291c75..756e30db10d6 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -629,8 +629,11 @@ struct cxl_mbox_identify {
         __le16 inject_poison_limit;
         u8 poison_caps;
         u8 qos_telemetry_caps;
+       /* CXL 3.0 additions */
         __le16 dc_event_log_size;
  } __packed;
+#define CXL_MBOX_IDENTIFY_MIN_LENGTH    0x43
+#define CXL_MBOX_IDENTIFY_CXL3_LENGTH sizeof(struct cxl_mbox_identify)

  /*
   * Common Event Record Format

---

Something similar needs to be handled for cxl_event_get_int_policy with 
the addition of dyncap_settings to cxl_event_interrupt_policy, that Fan 
Ni mentions.

Thanks,
Jorgen

>   /*
> @@ -741,9 +771,31 @@ struct cxl_mbox_set_partition_info {
>          __le64 volatile_capacity;
>          u8 flags;
>   } __packed;
> -
>   #define  CXL_SET_PARTITION_IMMEDIATE_FLAG      BIT(0)
> 
> +struct cxl_mbox_get_dc_config {
> +       u8 region_count;
> +       u8 start_region_index;
> +} __packed;
> +
> +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> +struct cxl_mbox_dynamic_capacity {
> +       u8 avail_region_count;
> +       u8 rsvd[7];
> +       struct cxl_dc_region_config {
> +               __le64 region_base;
> +               __le64 region_decode_length;
> +               __le64 region_length;
> +               __le64 region_block_size;
> +               __le32 region_dsmad_handle;
> +               u8 flags;
> +               u8 rsvd[3];
> +       } __packed region[];
> +} __packed;
> +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
> +#define CXL_REGIONS_RETURNED(size_out) \
> +       ((size_out - 8) / sizeof(struct cxl_dc_region_config))
> +
>   /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
>   struct cxl_mbox_set_timestamp_in {
>          __le64 timestamp;
> @@ -867,6 +919,7 @@ enum {
>   int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
>                            struct cxl_mbox_cmd *cmd);
>   int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
>   int cxl_await_media_ready(struct cxl_dev_state *cxlds);
>   int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
>   int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 5242dbf0044d..a9b110ff1176 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -879,6 +879,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>          if (rc)
>                  return rc;
> 
> +       rc = cxl_dev_dynamic_capacity_identify(mds);
> +       if (rc)
> +               return rc;
> +
>          rc = cxl_mem_create_range_info(mds);
>          if (rc)
>                  return rc;
> 
> --
> 2.41.0
> 

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 10/18] cxl/mem: Handle DCD add and release capacity events.
  2023-08-31 17:28   ` Dave Jiang
@ 2023-09-08 15:35     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-08 15:35 UTC (permalink / raw)
  To: Dave Jiang, Ira Weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Dave Jiang wrote:
> 
> 
> On 8/28/23 22:21, Ira Weiny wrote:

[snip]

> > +
> > +/* Returns 0 if the event was handled successfully. */
> Is this comment necessary?

Not really, deleted.

> 
> > +static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
> > +					struct cxl_event_record_raw *rec)
> > +{
> > +	struct dcd_event_dyn_cap *record = (struct dcd_event_dyn_cap *)rec;
> > +	uuid_t *id = &rec->hdr.id;
> > +	int rc;
> > +
> > +	if (!uuid_equal(id, &dc_event_uuid))
> > +		return -EINVAL;
> > +
> > +	switch (record->data.event_type) {
> > +	case DCD_ADD_CAPACITY:
> > +		rc = cxl_handle_dcd_add_event(mds, &record->data.extent);
> 
> Just return?

Fixed per Jonathan's comments.

> > +		break;
> > +	case DCD_RELEASE_CAPACITY:
> > +        case DCD_FORCED_CAPACITY_RELEASE:
> 
> Extra 2 spaces of indentation?

This was a checkpatch issue.  Fixed.

> 
> > +		rc = cxl_handle_dcd_release_event(mds, &record->data.extent);
> 
> Same here about return.

Fixed per Jonathan's comments.

Thanks for the review!

Ira

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device
  2023-08-30 21:01   ` Dave Jiang
  2023-09-05  0:14     ` Ira Weiny
@ 2023-09-08 20:23     ` Ira Weiny
  1 sibling, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-08 20:23 UTC (permalink / raw)
  To: Dave Jiang, ira.weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Dave Jiang wrote:
> 
> 
> On 8/28/23 22:20, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > Devices can optionally support Dynamic Capacity (DC).  These devices are
> > known as Dynamic Capacity Devices (DCD).
> > 
> > Implement the DC (opcode 48XXh) mailbox commands as specified in CXL 3.0
> > section 8.2.9.8.9.  Read the DC configuration and store the DC region
> > information in the device state.
> > 
> > Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> Uncapitalize Dynamic in subject
> 
> Also, maybe split out the REGION vs DECODER as a prep patch.

Both done.

Thanks!
Ira

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device
  2023-08-30 21:44   ` Fan Ni
@ 2023-09-08 22:52     ` Ira Weiny
  2023-09-12 21:32       ` Fan Ni
  0 siblings, 1 reply; 97+ messages in thread
From: Ira Weiny @ 2023-09-08 22:52 UTC (permalink / raw)
  To: Fan Ni, ira.weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Jonathan Cameron,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma,
	linux-cxl, linux-kernel

Fan Ni wrote:
> On Mon, Aug 28, 2023 at 10:20:54PM -0700, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> >

[snip]

> >
> > +static int cxl_dc_save_region_info(struct cxl_memdev_state *mds, int index,
> > +				   struct cxl_dc_region_config *region_config)
> > +{
> > +	struct cxl_dc_region_info *dcr = &mds->dc_region[index];
> > +	struct device *dev = mds->cxlds.dev;
> > +
> > +	dcr->base = le64_to_cpu(region_config->region_base);
> > +	dcr->decode_len = le64_to_cpu(region_config->region_decode_length);
> > +	dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
> > +	dcr->len = le64_to_cpu(region_config->region_length);
> > +	dcr->blk_size = le64_to_cpu(region_config->region_block_size);
> > +	dcr->dsmad_handle = le32_to_cpu(region_config->region_dsmad_handle);
> > +	dcr->flags = region_config->flags;
> > +	snprintf(dcr->name, CXL_DC_REGION_STRLEN, "dc%d", index);
> > +
> > +	/* Check regions are in increasing DPA order */
> > +	if (index > 0) {
> > +		struct cxl_dc_region_info *prev_dcr = &mds->dc_region[index - 1];
> > +
> > +		if ((prev_dcr->base + prev_dcr->decode_len) > dcr->base) {
> > +			dev_err(dev,
> > +				"DPA ordering violation for DC region %d and %d\n",
> > +				index - 1, index);
> > +			return -EINVAL;
> > +		}
> > +	}
> > +
> > +	/* Check the region is 256 MB aligned */
> > +	if (!IS_ALIGNED(dcr->base, SZ_256M)) {
> > +		dev_err(dev, "DC region %d not aligned to 256MB: %#llx\n",
> > +			index, dcr->base);
> > +		return -EINVAL;
> > +	}
> > +
> > +	/* Check Region base and length are aligned to block size */
> > +	if (!IS_ALIGNED(dcr->base, dcr->blk_size) ||
> > +	    !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> > +		dev_err(dev, "DC region %d not aligned to %#llx\n", index,
> > +			dcr->blk_size);
> > +		return -EINVAL;
> > +	}
> 
> Based on on cxl 3.0 spec: Table 8-126, we may need some extra checks
> here:
> 1. region len <= decode_len
> 2. region block size should be power of 2 and a multiple of 40H.

Thanks for pointing these additional checks out!  I've added these.

> 
> Also, if region len or block size is 0, it mentions that DC will not be
> available, we may also need to handle that.

I've just added checks for 0 in region decode length, region length, and block size.

I don't think we need to handle this in any special way.  Any of these
checks will fail the device probe.  From my interpretation of the spec
reading these values as 0 would indicate an invalid device configuration.

That said I think the spec is a bit vague here.  On the one hand the
number of DC regions should reflect the number of valid regions.

Table 8-125 'Number of Available Regions':
	"This is the number of valid region configurations returned in
	this payload."

But it also says:
	"Each region may be unconfigured or configured with a different
	block size and capacity."

I don't believe that a 0 in the Region Decode Length, Region Length, or
Region Block Size is going to happen with the code structured the way it
is.  I believe these values are used if the host specifically requests the
configuration of a region not indicated by 'Number of Available Regions'
through the Starting Region Index in Table 8-163.  This code does not do
that.

Would you agree with this?

Thanks again,
Ira

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 07/18] cxl/mem: Expose device dynamic capacity configuration
  2023-08-30 22:46   ` Dave Jiang
@ 2023-09-08 23:22     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-08 23:22 UTC (permalink / raw)
  To: Dave Jiang, ira.weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Dave Jiang wrote:
> 
> 
> On 8/28/23 22:20, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 

[snip]

> > +
> > +static umode_t cxl_dc_visible(struct kobject *kobj, struct attribute *a, int n)
> > +{
> > +	struct device *dev = kobj_to_dev(kobj);
> > +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> > +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> > +
> > +	/* Not a memory device */
> > +	if (!mds)
> > +		return 0;
> > +
> > +	if (a == &dev_attr_region_count.attr)
> > +		return a->mode;
> > +
> > +	if (n < mds->nr_dc_region)
> > +		return a->mode;
> 
> I would add a comment on who you are checking against nr_dc_region to 
> make it obvious.

Sounds good.

Thanks!
Ira

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 05/18] cxl/port: Add Dynamic Capacity mode support to endpoint decoders
  2023-08-31 17:25   ` Fan Ni
@ 2023-09-08 23:26     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-08 23:26 UTC (permalink / raw)
  To: Fan Ni, Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Jonathan Cameron,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma,
	linux-cxl, linux-kernel

Fan Ni wrote:
> On Mon, Aug 28, 2023 at 10:20:56PM -0700, Ira Weiny wrote:
> > Endpoint decoders used to map Dynamic Capacity must be configured to
> > point to the correct Dynamic Capacity (DC) Region.  The decoder mode
> > currently represents the partition the decoder points to such as ram or
> > pmem.
> >
> > Expand the mode to include DC Regions.
> >
> > Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> >
> 
> I have the same question about how dc_mode_to_region_index is
> implemented and used as Jonathan.

I changed it per Jonathan's recommendation.  Was that satisfactory or were
you looking for more information?

> 
> Nice to see the code spit out, it is easier to review now.

Thanks,
Ira

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 11/18] cxl/region: Expose DC extents on region driver load
  2023-08-31 18:38   ` Dave Jiang
@ 2023-09-08 23:57     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-08 23:57 UTC (permalink / raw)
  To: Dave Jiang, Ira Weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Dave Jiang wrote:
> 
> 
> On 8/28/23 22:21, Ira Weiny wrote:
> >   

[snip]

> > +static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
> > +				struct cxl_dc_extent_data *extent)
> > +{
> > +	struct range dpa_range = (struct range){
> > +		.start = extent->dpa_start,
> > +		.end = extent->dpa_start + extent->length - 1,
> > +	};
> > +	struct device *dev = &cxled->cxld.dev;
> > +
> > +	dev_dbg(dev, "Checking extent DPA:%llx LEN:%llx\n",
> > +		extent->dpa_start, extent->length);
> > +
> > +	if (!cxled->cxld.region || !cxled->dpa_res)
> > +		return false;
> > +
> > +	dev_dbg(dev, "Cxled start:%llx end:%llx\n",
> > +		cxled->dpa_res->start, cxled->dpa_res->end);
> 
> Just use %pr?

Yep!

> 
> > +	return (cxled->dpa_res->start <= dpa_range.start &&
> > +		dpa_range.end <= cxled->dpa_res->end);
> 
> It may be easier to read for some if you have (dpa_range.start > cxled->dpa_res->start && ...) instead.

<sigh>  I think about checks like this visually.

Resource
	As                        Ae
	|-------------------------|
Check
	   Bs                   Be
	   |--------------------|

	As <= Bs && Be <= Ae

I know this is odd for some but I like seeing B 'inside' A.

If others feel strongly like you I can change it but I'm inclined to leave
it.

[snip]

> > +
> > +#define DAX_EXTENT_LABEL_LEN 64
> > +/**
> > + * struct dax_reg_ext_dev - Device object to expose extent information
> > + * @dev: device representing this extent
> > + * @dr_extent: reference back to private extent data
> > + * @offset: offset of this extent
> > + * @length: size of this extent
> > + * @label: identifier to group extents
> > + */
> > +struct dax_reg_ext_dev {
> > +	struct device dev;
> > +	struct dax_region_extent *dr_extent;
> > +	resource_size_t offset;
> > +	resource_size_t length;
> > +	char label[DAX_EXTENT_LABEL_LEN];
> > +};
> > +
> > +int dax_region_ext_create_dev(struct dax_region *dax_region,
> > +			      struct dax_region_extent *dr_extent,
> > +			      resource_size_t offset,
> > +			      resource_size_t length,
> > +			      const char *label);
> > +#define to_dr_ext_dev(dev)	\
> > +	container_of(dev, struct dax_reg_ext_dev, dev)
> > +
> >   struct dax_mapping {
> >   	struct device dev;
> >   	int range_id;
> 
> 
> This is a rather large patch. Can the code below be broken out to a 
> separate patch?

Possibly.  The issue was that the natural split was to implement extents
at the CXL region level.  Then implement the dax region extents.  But
without the 2nd patch the CXL region code does not do anything.  This is
because the CXL region driver load triggers this patch to do something.
It made more sense to have the code which triggers the extent processing
bundled with the extent processing.

To split it as you suggest would still be a very large patch with this new
extent file being pretty small.  So I just combined them.

Ira

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device
  2023-09-08 12:46   ` Jørgen Hansen
@ 2023-09-11 20:26     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-11 20:26 UTC (permalink / raw)
  To: Jørgen Hansen, ira.weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, linux-cxl,
	linux-kernel

Jørgen Hansen wrote:
> On 8/29/23 07:20, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 

[snip]

> >   /**
> >    * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
> >    *
> > @@ -449,6 +464,8 @@ struct cxl_dev_state {
> >    * @enabled_cmds: Hardware commands found enabled in CEL.
> >    * @exclusive_cmds: Commands that are kernel-internal only
> >    * @total_bytes: sum of all possible capacities
> > + * @static_cap: Sum of RAM and PMEM capacities
> > + * @dynamic_cap: Complete DPA range occupied by DC regions
> >    * @volatile_only_bytes: hard volatile capacity
> >    * @persistent_only_bytes: hard persistent capacity
> >    * @partition_align_bytes: alignment size for partition-able capacity
> > @@ -456,6 +473,10 @@ struct cxl_dev_state {
> >    * @active_persistent_bytes: sum of hard + soft persistent
> >    * @next_volatile_bytes: volatile capacity change pending device reset
> >    * @next_persistent_bytes: persistent capacity change pending device reset
> > + * @nr_dc_region: number of DC regions implemented in the memory device
> > + * @dc_region: array containing info about the DC regions
> > + * @dc_event_log_size: The number of events the device can store in the
> > + * Dynamic Capacity Event Log before it overflows
> >    * @event: event log driver state
> >    * @poison: poison driver state info
> >    * @fw: firmware upload / activation state
> > @@ -473,7 +494,10 @@ struct cxl_memdev_state {
> >          DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
> >          DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
> >          DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> > +
> >          u64 total_bytes;
> > +       u64 static_cap;
> > +       u64 dynamic_cap;
> >          u64 volatile_only_bytes;
> >          u64 persistent_only_bytes;
> >          u64 partition_align_bytes;
> > @@ -481,6 +505,11 @@ struct cxl_memdev_state {
> >          u64 active_persistent_bytes;
> >          u64 next_volatile_bytes;
> >          u64 next_persistent_bytes;
> > +
> > +       u8 nr_dc_region;
> > +       struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
> > +       size_t dc_event_log_size;
> > +
> >          struct cxl_event_state event;
> >          struct cxl_poison_state poison;
> >          struct cxl_security_state security;
> > @@ -587,6 +616,7 @@ struct cxl_mbox_identify {
> >          __le16 inject_poison_limit;
> >          u8 poison_caps;
> >          u8 qos_telemetry_caps;
> > +       __le16 dc_event_log_size;
> >   } __packed;
> 
> Hi,
> 
> To handle backwards compatibility with CXL 2.0 devices, 
> cxl_dev_state_identify() needs to handle both the CXL 2.0 and 3.0 
> versions of struct cxl_mbox_identify.
> The spec says that newer code can 
> use the payload size to detect the different versions, so something like 
> the following:

Software does not need to detect the different version.  The spec states
that the payload size or a zero value can be used.

	"... software written to the new definition can use the zero value
	                                                    ^^^^^^^^^^^^^^
	or the payload size to detect devices that do not support the new
	field."

A log size of 0 is valid and is indicative of no DC support.

That said the current code could interpret the log size as larger because
id is not correctly initialized.  So good catch.

However, dc_event_log_size is not used anywhere.  For this reason alone I
almost removed it from the code.  This complication gives me even more
reason to do so.

> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 9462c34aa1dc..0a6f038996aa 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1356,6 +1356,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
>                  .opcode = CXL_MBOX_OP_IDENTIFY,
>                  .size_out = sizeof(id),
>                  .payload_out = &id,
> +               .min_out = CXL_MBOX_IDENTIFY_MIN_LENGTH,
>          };
>          rc = cxl_internal_send_cmd(mds, &mbox_cmd);
>          if (rc < 0)
> @@ -1379,7 +1380,8 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
>                  mds->poison.max_errors = min_t(u32, val, CXL_POISON_LIST_MAX);
>          }
> 
> -       mds->dc_event_log_size = le16_to_cpu(id.dc_event_log_size);
> +       if (mbox_cmd.size_out >= CXL_MBOX_IDENTIFY_CXL3_LENGTH)
> +               mds->dc_event_log_size = le16_to_cpu(id.dc_event_log_size);
> 
>          return 0;
>   }
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index ae9dcb291c75..756e30db10d6 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -629,8 +629,11 @@ struct cxl_mbox_identify {
>          __le16 inject_poison_limit;
>          u8 poison_caps;
>          u8 qos_telemetry_caps;
> +       /* CXL 3.0 additions */
>          __le16 dc_event_log_size;
>   } __packed;
> +#define CXL_MBOX_IDENTIFY_MIN_LENGTH    0x43
> +#define CXL_MBOX_IDENTIFY_CXL3_LENGTH sizeof(struct cxl_mbox_identify)
> 
>   /*
>    * Common Event Record Format
> 
> ---
> 
> Something similar needs to be handled for cxl_event_get_int_policy with 
> the addition of dyncap_settings to cxl_event_interrupt_policy, that Fan 
> Ni mentions.

Yes this needs to be handled.  I've overlooked that entire part.  I think
it had something to do with the fact that the 3.0 errata was not published when
the first RFC was sent out and this version just continued with the broken
code.

Thanks for pointing this out and thanks for the review!
Ira

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device
  2023-09-07 15:46   ` Alison Schofield
@ 2023-09-12  1:18     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-12  1:18 UTC (permalink / raw)
  To: Alison Schofield, ira.weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Jonathan Cameron,
	Davidlohr Bueso, Dave Jiang, Vishal Verma, linux-cxl,
	linux-kernel

Alison Schofield wrote:
> On Mon, Aug 28, 2023 at 10:20:54PM -0700, Ira Weiny wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > Devices can optionally support Dynamic Capacity (DC).  These devices are
> > known as Dynamic Capacity Devices (DCD).
> 
> snip
> 
> > 
> > +static enum cxl_region_mode
> > +cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
> > +{
> > +	switch (mode) {
> > +	case CXL_DECODER_NONE:
> > +		return CXL_REGION_NONE;
> > +	case CXL_DECODER_RAM:
> > +		return CXL_REGION_RAM;
> > +	case CXL_DECODER_PMEM:
> > +		return CXL_REGION_PMEM;
> > +	case CXL_DECODER_DEAD:
> > +		return CXL_REGION_DEAD;
> > +	case CXL_DECODER_MIXED:
> > +	default:
> > +		return CXL_REGION_MIXED;
> > +	}
> > +
> > +	return CXL_REGION_MIXED;
> 
> Can the paths to return _MIXED be simplified here?

I suppose:

...
	case CXL_DECODER_MIXED:
	default:
		break;
	}
	
	return CXL_REGION_MIXED;
...

I don't think that makes things any better.

> 
> 
> > +}
> > +
> snip
> 
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index cd4a9ffdacc7..ed282dcd5cf5 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -374,6 +374,28 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
> >  	return "mixed";
> >  }
> >  
> > +enum cxl_region_mode {
> > +	CXL_REGION_NONE,
> > +	CXL_REGION_RAM,
> > +	CXL_REGION_PMEM,
> > +	CXL_REGION_MIXED,
> > +	CXL_REGION_DEAD,
> > +};
> 
> I'm concerned about _DEAD.
> At first I was going to say name these as CXL_REGION_MODE_*, but it's
> pretty obvious that these are mode words...except for DEAD. Is that 
> an actual mode or is it some type of status? I don't think I see it
> used yet.

My first reaction was to remove this.  But I had to go back and look.  It
took me a minute to trace this.

'Dead' is not used directly.  If a decoder happens to be dead
(CXL_DECODER_DEAD) then it will eventually fail the creation of a region
with CXL_REGION_DEAD as the mode.  CXL_REGION_MIXED fails the same way but
only because mixed mode is not yet supported.  Therefore, decoder mode
DEAD indicates something different and CXL_REGION_DEAD was added to convey
this when converting.

The alternative is to be more explicit and check decoder mode to be !DEAD
prior to trying to convert.  I think I like that but I'm going to sleep on
it.

> 
> > +
> > +static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
> > +{
> > +	static const char * const names[] = {
> > +		[CXL_REGION_NONE] = "none",
> > +		[CXL_REGION_RAM] = "ram",
> > +		[CXL_REGION_PMEM] = "pmem",
> > +		[CXL_REGION_MIXED] = "mixed",
> > +	};
> > +
> > +	if (mode >= CXL_REGION_NONE && mode <= CXL_REGION_MIXED)
> > +		return names[mode];
> > +	return "mixed";
> > +}
> 
> snip
> 
> > +
> >  /**
> >   * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
> >   *
> > @@ -449,6 +464,8 @@ struct cxl_dev_state {
> >   * @enabled_cmds: Hardware commands found enabled in CEL.
> >   * @exclusive_cmds: Commands that are kernel-internal only
> >   * @total_bytes: sum of all possible capacities
> > + * @static_cap: Sum of RAM and PMEM capacities
> > + * @dynamic_cap: Complete DPA range occupied by DC regions
> 
> Wondering about renaming RAM and PMEM caps as 'static'.
> They are changeable via set partition commands.

True but they are static compared to dynamic capacity.  I'm open to other
names but !dynamic is normally referred to as static.  :-/

Thanks for the review!
Ira

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD)
  2023-09-07 21:01 ` [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Fan Ni
@ 2023-09-12  1:44   ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-12  1:44 UTC (permalink / raw)
  To: Fan Ni, Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Jonathan Cameron,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma,
	linux-cxl, linux-kernel, a.manzanares, nmtadam.samsung, nifan

Fan Ni wrote:
> On Mon, Aug 28, 2023 at 10:20:51PM -0700, Ira Weiny wrote:

Sorry for the delay, I've been walking through the responses and just saw
this.

> 
> Hi Ira,
> 
> I tried to test the patch series with the qemu dcd patches, however, I
> hit some issues, and would like to check the following with you.
> 
> 1. After we create a region for DC before any extents are added, a dax
> device will show under /dev. Is that what we want?

Yes, see

cxl/region: Add Dynamic Capacity CXL region support

	"Special case DC capable CXL regions to create a 0 sized seed DAX
	device until others can be created on dynamic space later."

The seed device is required but is left empty.  It can be resized when
extents are added later.

> If I remember it
> correctly, the dax device used to show up after a dc extent is added.
> 
> 
> 2. add/release extent does not work correctly for me. The code path is
> not called, and I made the following changes to make it pass.

:-(

This is the problem with cxl_test...  I've just realized this after seeing
Jorgen's email regarding the interrupt configuration code.  I've added it
back in.  I'm not sure where it got lost along the way but it was
completely gone from this RFC v2.  Sorry about that.

> ---
>  drivers/cxl/cxl.h    | 3 ++-
>  drivers/cxl/cxlmem.h | 1 +
>  drivers/cxl/pci.c    | 7 +++++++
>  3 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 2c73a30980b6..0d132c1739ce 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -168,7 +168,8 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
>  #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO |	\
>  				 CXLDEV_EVENT_STATUS_WARN |	\
>  				 CXLDEV_EVENT_STATUS_FAIL |	\
> -				 CXLDEV_EVENT_STATUS_FATAL)
> +				 CXLDEV_EVENT_STATUS_FATAL| \
> +				 CXLDEV_EVENT_STATUS_DCD)
> 
>  /* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
>  #define CXLDEV_EVENT_INT_MODE_MASK	GENMASK(1, 0)
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 8ca81fd067c2..ae9dcb291c75 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -235,6 +235,7 @@ struct cxl_event_interrupt_policy {
>  	u8 warn_settings;
>  	u8 failure_settings;
>  	u8 fatal_settings;
> +	u8 dyncap_settings;
>  } __packed;
> 
>  /**
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 10c1a583113c..e30fe0304514 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -686,6 +686,7 @@ static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
>  		.warn_settings = CXL_INT_MSI_MSIX,
>  		.failure_settings = CXL_INT_MSI_MSIX,
>  		.fatal_settings = CXL_INT_MSI_MSIX,
> +		.dyncap_settings = CXL_INT_MSI_MSIX,
>  	};
> 
>  	mbox_cmd = (struct cxl_mbox_cmd) {
> @@ -739,6 +740,12 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds)
>  		return rc;
>  	}
> 
> +	rc = cxl_event_req_irq(cxlds, policy.dyncap_settings);
> +	if (rc) {
> +		dev_err(cxlds->dev, "Failed to get interrupt for event dyncap log\n");
> +		return rc;
> +	}
> +
>  	return 0;
>  }
> 
> --
> 
> 3. With changes made in 2, the code for add/release dc extent can be called,
> however, the system behaviour seems different from before. Previously, after a
> dc extent is added, it will show up with lsmem command and listed as offline.
> Now, nothing is showing. Is it expected? What should we do to make it usable
> as system ram?

Yes, this behavior was not correct before.  DAX device creation should be
flexible throughout the region, either within extents or across them.
Dave Jiang mentioned to me internally that it might help to add some
ASCII art documentation regarding how this works.  Generally, the dax
region's available size will increase when extents are added, and new dax
devices can be created to utilize that space.

Check out the dcd-test.sh in ndctl at this link for the commands to create
a dax device in the new architecture.

https://github.com/weiny2/ndctl/tree/dcd-region2

Hope this helps.

> 
> Please let me know if I miss something or did something wrong. Thanks.

You did not.  I thought the new dax code would explain this new dax device
operation.

Some new documentation is in order.

Ira

* Re: [PATCH RFC v2 14/18] dax/region: Support DAX device creation on dynamic DAX regions
  2023-09-06  4:35     ` Ira Weiny
@ 2023-09-12 16:49       ` Jonathan Cameron
  2023-09-12 22:08         ` Ira Weiny
  0 siblings, 1 reply; 97+ messages in thread
From: Jonathan Cameron @ 2023-09-12 16:49 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Tue, 5 Sep 2023 21:35:03 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> Jonathan Cameron wrote:
> > On Mon, 28 Aug 2023 22:21:05 -0700
> > Ira Weiny <ira.weiny@intel.com> wrote:
> >   
> > > Dynamic Capacity (DC) DAX regions have a list of extents which define
> > > the memory of the region which is available.
> > > 
> > > Now that DAX region extents are fully realized support DAX device
> > > creation on dynamic regions by adjusting the allocation algorithms
> > > to account for the extents.  Remember also references must be held on
> > > the extents until the DAX devices are done with the memory.
> > > 
> > > Redefine the region available size to include only extent space.  Reuse
> > > the size allocation algorithm by defining sub-resources for each extent
> > > and limiting range allocation to those extents which have space.  Do not
> > > support direct mapping of DAX devices on dynamic devices.
> > > 
> > > Enhance DAX device range objects to hold references on the extents until
> > > the DAX device is destroyed.
> > > 
> > > NOTE: At this time all extents within a region are created equally.
> > > However, labels are associated with extents which can be used with
> > > future DAX device labels to group which extents are used.  
> > 
> > This sounds like a bad place to start to me as we are enabling something
> > that is probably 'wrong' in the long term as opposed to just not enabling it
> > until we have appropriate support.  
> 
> I disagree.  I don't think the kernel should be trying to process tags at
> the lower level.
> 
> > I'd argue better to just reject any extents with different labels for now.  
> 
> Again I disagree.  This is less restrictive.  The idea is that labels can
> be changed such that user space can ultimately decide which extents
> should be used for which devices.  I have some work on that already.
> (Basically it becomes quite easy to assign a label to a dax device and
> have the extent search use only dax extents which match that label.)

That sounds good - but if someone expects that and uses it with an old
kernel I'm not sure if it is better to say 'we don't support it yet' or
do something different from a newer kernel.


> > > @@ -1400,8 +1507,10 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
> > >  	device_initialize(dev);
> > >  	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
> > >  
> > > +	dev_WARN_ONCE(parent, is_dynamic(dax_region) && data->size,
> > > +		      "Dynamic DAX devices are created initially with 0 size");  
> > 
> > dev_info() maybe more appropriate?  
> 
> Unless I'm mistaken this can happen from userspace but only if something
> in the code changes later.  Because the dax layer is trying to support
> non-dynamic regions (which dynamic may be a bad name), I was worried that
> the creation with a size might slip through...

Fair enough - if there's a strong chance userspace will control it at some point then
ONCE seems fine.

> 
> > Is this common enough that we need the
> > _ONCE?  
> 
> once is because it could end up spamming a log later if something got
> coded up wrong.

I'm not sure I care about bugs spamming the log.   Only things that
are userspace controlled or likely hardware failures etc.




* Re: [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device
  2023-09-08 22:52     ` Ira Weiny
@ 2023-09-12 21:32       ` Fan Ni
  0 siblings, 0 replies; 97+ messages in thread
From: Fan Ni @ 2023-09-12 21:32 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Fan Ni, Dan Williams, Navneet Singh, Jonathan Cameron,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma,
	linux-cxl, linux-kernel

On Fri, Sep 08, 2023 at 03:52:15PM -0700, Ira Weiny wrote:

> Fan Ni wrote:
> > On Mon, Aug 28, 2023 at 10:20:54PM -0700, ira.weiny@intel.com wrote:
> > > From: Navneet Singh <navneet.singh@intel.com>
> > >
> 
> [snip]
> 
> > >
> > > +static int cxl_dc_save_region_info(struct cxl_memdev_state *mds, int index,
> > > +				   struct cxl_dc_region_config *region_config)
> > > +{
> > > +	struct cxl_dc_region_info *dcr = &mds->dc_region[index];
> > > +	struct device *dev = mds->cxlds.dev;
> > > +
> > > +	dcr->base = le64_to_cpu(region_config->region_base);
> > > +	dcr->decode_len = le64_to_cpu(region_config->region_decode_length);
> > > +	dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
> > > +	dcr->len = le64_to_cpu(region_config->region_length);
> > > +	dcr->blk_size = le64_to_cpu(region_config->region_block_size);
> > > +	dcr->dsmad_handle = le32_to_cpu(region_config->region_dsmad_handle);
> > > +	dcr->flags = region_config->flags;
> > > +	snprintf(dcr->name, CXL_DC_REGION_STRLEN, "dc%d", index);
> > > +
> > > +	/* Check regions are in increasing DPA order */
> > > +	if (index > 0) {
> > > +		struct cxl_dc_region_info *prev_dcr = &mds->dc_region[index - 1];
> > > +
> > > +		if ((prev_dcr->base + prev_dcr->decode_len) > dcr->base) {
> > > +			dev_err(dev,
> > > +				"DPA ordering violation for DC region %d and %d\n",
> > > +				index - 1, index);
> > > +			return -EINVAL;
> > > +		}
> > > +	}
> > > +
> > > +	/* Check the region is 256 MB aligned */
> > > +	if (!IS_ALIGNED(dcr->base, SZ_256M)) {
> > > +		dev_err(dev, "DC region %d not aligned to 256MB: %#llx\n",
> > > +			index, dcr->base);
> > > +		return -EINVAL;
> > > +	}
> > > +
> > > +	/* Check Region base and length are aligned to block size */
> > > +	if (!IS_ALIGNED(dcr->base, dcr->blk_size) ||
> > > +	    !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> > > +		dev_err(dev, "DC region %d not aligned to %#llx\n", index,
> > > +			dcr->blk_size);
> > > +		return -EINVAL;
> > > +	}
> > 
> > Based on the CXL 3.0 spec, Table 8-126, we may need some extra checks
> > here:
> > 1. region len <= decode_len
> > 2. region block size should be power of 2 and a multiple of 40H.
> 
> Thanks for pointing these additional checks out!  I've added these.
> 
> > 
> > Also, if region len or block size is 0, it mentions that DC will not be
> > available, we may also need to handle that.
> 
> I've just added checks for 0 in the region decode length, region length,
> and block size.
> 
> I don't think we need to handle this in any special way.  Any of these
> checks will fail the device probe.  From my interpretation of the spec
> reading these values as 0 would indicate an invalid device configuration.
> 
> That said I think the spec is a bit vague here.  On the one hand the
> number of DC regions should reflect the number of valid regions.
> 
> Table 8-125 'Number of Available Regions':
> 	"This is the number of valid region configurations returned in
> 	this payload."
> 
> But it also says:
> 	"Each region may be unconfigured or configured with a different
> 	block size and capacity."
> 
> I don't believe that a 0 in the Region Decode Length, Region Length, or
> Region Block Size is going to happen with the code structured the way it
> is.  I believe these values are used if the host specifically requests the
> configuration of a region not indicated by 'Number of Available Regions'
> through the Starting Region Index in Table 8-163.  This code does not do
> that.
> 
> Would you agree with this?

Agreed.

Fan
> 
> Thanks again,
> Ira

* Re: [PATCH RFC v2 14/18] dax/region: Support DAX device creation on dynamic DAX regions
  2023-09-12 16:49       ` Jonathan Cameron
@ 2023-09-12 22:08         ` Ira Weiny
  2023-09-12 22:35           ` Dan Williams
  0 siblings, 1 reply; 97+ messages in thread
From: Ira Weiny @ 2023-09-12 22:08 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Jonathan Cameron wrote:
> On Tue, 5 Sep 2023 21:35:03 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > Jonathan Cameron wrote:
> > > On Mon, 28 Aug 2023 22:21:05 -0700
> > > Ira Weiny <ira.weiny@intel.com> wrote:
> > >   
> > > > Dynamic Capacity (DC) DAX regions have a list of extents which define
> > > > the memory of the region which is available.
> > > > 
> > > > Now that DAX region extents are fully realized support DAX device
> > > > creation on dynamic regions by adjusting the allocation algorithms
> > > > to account for the extents.  Remember also references must be held on
> > > > the extents until the DAX devices are done with the memory.
> > > > 
> > > > Redefine the region available size to include only extent space.  Reuse
> > > > the size allocation algorithm by defining sub-resources for each extent
> > > > and limiting range allocation to those extents which have space.  Do not
> > > > support direct mapping of DAX devices on dynamic devices.
> > > > 
> > > > Enhance DAX device range objects to hold references on the extents until
> > > > the DAX device is destroyed.
> > > > 
> > > > NOTE: At this time all extents within a region are created equally.
> > > > However, labels are associated with extents which can be used with
> > > > future DAX device labels to group which extents are used.  
> > > 
> > > This sounds like a bad place to start to me as we are enabling something
> > > that is probably 'wrong' in the long term as opposed to just not enabling it
> > > until we have appropriate support.  
> > 
> > I disagree.  I don't think the kernel should be trying to process tags at
> > the lower level.
> > 
> > > I'd argue better to just reject any extents with different labels for now.  
> > 
> > Again I disagree.  This is less restrictive.  The idea is that labels can
> > be changed such that user space can ultimately decide which extents
> > should be used for which devices.  I have some work on that already.
> > (Basically it becomes quite easy to assign a label to a dax device and
> > have the extent search use only dax extents which match that label.)
> 
> That sounds good - but if someone expects that and uses it with an old
> kernel I'm not sure if it is better to say 'we don't support it yet' or
> do something different from a newer kernel.

This does provide the 'we don't support that yet' behavior, in that dax
device creation can't be associated with a label yet.  So surfacing the
extents with the tag as a default label, and letting those labels change,
is informational at this point rather than functional.  Simple use cases
can use the label (from the tag) to detect that some extent with the wrong
tag got into the region, but they can't correct it without going through
the FM.

It is easy enough to remove the label sysfs and defer it until dax devices
gain label support, though.

> 
> 
> > > > @@ -1400,8 +1507,10 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
> > > >  	device_initialize(dev);
> > > >  	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
> > > >  
> > > > +	dev_WARN_ONCE(parent, is_dynamic(dax_region) && data->size,
> > > > +		      "Dynamic DAX devices are created initially with 0 size");  
> > > 
> > > dev_info() maybe more appropriate?  
> > 
> > Unless I'm mistaken this can happen from userspace but only if something
> > in the code changes later.  Because the dax layer is trying to support
> > non-dynamic regions (which dynamic may be a bad name), I was worried that
> > the creation with a size might slip through...
> 
> Fair enough - if there's a strong chance userspace will control it at some point then
> ONCE seems fine.
> 
> > 
> > > Is this common enough that we need the
> > > _ONCE?  
> > 
> > once is because it could end up spamming a log later if something got
> > coded up wrong.
> 
> I'm not sure I care about bugs spamming the log.   Only things that
> are userspace controlled or likely hardware failures etc.
> 

Understood.  Let me trace them again but I think these can be triggered by
user space.  If not I'll remove the ONCE.

Thanks again,
Ira

* Re: [PATCH RFC v2 14/18] dax/region: Support DAX device creation on dynamic DAX regions
  2023-09-12 22:08         ` Ira Weiny
@ 2023-09-12 22:35           ` Dan Williams
  2023-09-13 17:30             ` Ira Weiny
  0 siblings, 1 reply; 97+ messages in thread
From: Dan Williams @ 2023-09-12 22:35 UTC (permalink / raw)
  To: Ira Weiny, Jonathan Cameron
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Ira Weiny wrote:
> Jonathan Cameron wrote:
> > On Tue, 5 Sep 2023 21:35:03 -0700
> > Ira Weiny <ira.weiny@intel.com> wrote:
> > 
> > > Jonathan Cameron wrote:
> > > > On Mon, 28 Aug 2023 22:21:05 -0700
> > > > Ira Weiny <ira.weiny@intel.com> wrote:
> > > >   
> > > > > Dynamic Capacity (DC) DAX regions have a list of extents which define
> > > > > the memory of the region which is available.
> > > > > 
> > > > > Now that DAX region extents are fully realized support DAX device
> > > > > creation on dynamic regions by adjusting the allocation algorithms
> > > > > to account for the extents.  Remember also references must be held on
> > > > > the extents until the DAX devices are done with the memory.
> > > > > 
> > > > > Redefine the region available size to include only extent space.  Reuse
> > > > > the size allocation algorithm by defining sub-resources for each extent
> > > > > and limiting range allocation to those extents which have space.  Do not
> > > > > support direct mapping of DAX devices on dynamic devices.
> > > > > 
> > > > > Enhance DAX device range objects to hold references on the extents until
> > > > > the DAX device is destroyed.
> > > > > 
> > > > > NOTE: At this time all extents within a region are created equally.
> > > > > However, labels are associated with extents which can be used with
> > > > > future DAX device labels to group which extents are used.  
> > > > 
> > > > This sounds like a bad place to start to me as we are enabling something
> > > > that is probably 'wrong' in the long term as opposed to just not enabling it
> > > > until we have appropriate support.  
> > > 
> > > I disagree.  I don't think the kernel should be trying to process tags at
> > > the lower level.
> > > 
> > > > I'd argue better to just reject any extents with different labels for now.  
> > > 
> > > Again I disagree.  This is less restrictive.  The idea is that labels can
> > > be changed such that user space can ultimately decide which extents
> > > should be used for which devices.  I have some work on that already.
> > > (Basically it becomes quite easy to assign a label to a dax device and
> > > have the extent search use only dax extents which match that label.)
> > 
> > That sounds good - but if someone expects that and uses it with an old
> > kernel I'm not sure if it is better to say 'we don't support it yet' or
> > do something different from a newer kernel.
> 
> This does provide the 'we don't support that yet' in that dax device
> creation can't be associated with a label yet.  So surfacing the extents
> with the tag as a default label and letting those labels change is more
> informational at this point and not functional.  Simple use cases can use
> the label (from the tag) to detect that some extent with the wrong tag got
> in the region but can't correct it without going through the FM.
> 
> It is easy enough to remove the label sysfs and defer that until the dax
> device has a label and this support though.

Catching up on just this point (still need to go through the whole
thing).  A Sparse DAX region is one where the extents need not be
present at DAX region instantiation and may be added/removed later. The
device-dax allocation scheme just takes a size to do a "first-available"
search for free capacity in the region.

Given that one of the expected DCD use cases is to provide just in time
memory for specific jobs the "first-available" search for free capacity
in a Sparse DAX Region collides with the need to keep allocations
bounded by tag.

I agree with Jonathan that unless and until the allocation scheme is
updated to be tag aware then there is no reason for allocate by tag to
exist in the interface.

That said, the next question, "is DCD enabling considered a toy until
the ability to allocate by tag is present?" I think yes, to the point
where old daxctl binaries should be made fail to create device instances
by forcing a tag to be selected at allocation time for Sparse DAX
Regions.

The last question is whether *writable* tags are needed to allow for
repurposing memory allocated to a host without needing to round trip it
through the FM to get it re-tagged. While that is something the host and
orchestrator can figure out on their own, it looks like a nice to have
until the above questions are answered.

> > > > > @@ -1400,8 +1507,10 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
> > > > >  	device_initialize(dev);
> > > > >  	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
> > > > >  
> > > > > +	dev_WARN_ONCE(parent, is_dynamic(dax_region) && data->size,
> > > > > +		      "Dynamic DAX devices are created initially with 0 size");  
> > > > 
> > > > dev_info() maybe more appropriate?  
> > > 
> > > Unless I'm mistaken this can happen from userspace but only if something
> > > in the code changes later.  Because the dax layer is trying to support
> > > non-dynamic regions (which dynamic may be a bad name), I was worried that
> > > the creation with a size might slip through...
> > 
> > Fair enough - if there's a strong chance userspace will control it at some point then
> > ONCE seems fine.
> > 
> > > 
> > > > Is this common enough that we need the
> > > > _ONCE?  
> > > 
> > > once is because it could end up spamming a log later if something got
> > > coded up wrong.
> > 
> > I'm not sure I care about bugs spamming the log.   Only things that
> > are userspace controlled or likely hardware failures etc.
> > 
> 
> Understood.  Let me trace them again but I think these can be triggered by
> user space.  If not I'll remove the ONCE.

Unless this is an unequivocal kernel bug if it fires, and there is a
significant potential for active development to do the wrong thing,
don't leave a panic_on_warn land mine.

* Re: [PATCH RFC v2 14/18] dax/region: Support DAX device creation on dynamic DAX regions
  2023-09-12 22:35           ` Dan Williams
@ 2023-09-13 17:30             ` Ira Weiny
  2023-09-13 17:59               ` Dan Williams
  0 siblings, 1 reply; 97+ messages in thread
From: Ira Weiny @ 2023-09-13 17:30 UTC (permalink / raw)
  To: Dan Williams, Ira Weiny, Jonathan Cameron
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Dan Williams wrote:
> Ira Weiny wrote:
> > Jonathan Cameron wrote:
> > > On Tue, 5 Sep 2023 21:35:03 -0700
> > > Ira Weiny <ira.weiny@intel.com> wrote:
> > > 
> > > > Jonathan Cameron wrote:
> > > > > On Mon, 28 Aug 2023 22:21:05 -0700
> > > > > Ira Weiny <ira.weiny@intel.com> wrote:
> > > > >   
> > > > > > Dynamic Capacity (DC) DAX regions have a list of extents which define
> > > > > > the memory of the region which is available.
> > > > > > 
> > > > > > Now that DAX region extents are fully realized support DAX device
> > > > > > creation on dynamic regions by adjusting the allocation algorithms
> > > > > > to account for the extents.  Remember also references must be held on
> > > > > > the extents until the DAX devices are done with the memory.
> > > > > > 
> > > > > > Redefine the region available size to include only extent space.  Reuse
> > > > > > the size allocation algorithm by defining sub-resources for each extent
> > > > > > and limiting range allocation to those extents which have space.  Do not
> > > > > > support direct mapping of DAX devices on dynamic devices.
> > > > > > 
> > > > > > Enhance DAX device range objects to hold references on the extents until
> > > > > > the DAX device is destroyed.
> > > > > > 
> > > > > > NOTE: At this time all extents within a region are created equally.
> > > > > > However, labels are associated with extents which can be used with
> > > > > > future DAX device labels to group which extents are used.  
> > > > > 
> > > > > This sounds like a bad place to start to me as we are enabling something
> > > > > that is probably 'wrong' in the long term as opposed to just not enabling it
> > > > > until we have appropriate support.  
> > > > 
> > > > I disagree.  I don't think the kernel should be trying to process tags at
> > > > the lower level.
> > > > 
> > > > > I'd argue better to just reject any extents with different labels for now.  
> > > > 
> > > > Again I disagree.  This is less restrictive.  The idea is that labels can
> > > > be changed such that user space can ultimately decide which extents
> > > > should be used for which devices.  I have some work on that already.
> > > > (Basically it becomes quite easy to assign a label to a dax device and
> > > > have the extent search use only dax extents which match that label.)
> > > 
> > > That sounds good - but if someone expects that and uses it with an old
> > > kernel I'm not sure if it is better to say 'we don't support it yet' or
> > > do something different from a newer kernel.
> > 
> > This does provide the 'we don't support that yet' in that dax device
> > creation can't be associated with a label yet.  So surfacing the extents
> > with the tag as a default label and letting those labels change is more
> > informational at this point and not functional.  Simple use cases can use
> > the label (from the tag) to detect that some extent with the wrong tag got
> > in the region but can't correct it without going through the FM.
> > 
> > It is easy enough to remove the label sysfs and defer that until the dax
> > device has a label and this support though.
> 
> Catching up on just this point (still need to go through the whole
> thing).  A Sparse DAX region is one where the extents need not be
> present at DAX region instantiation and may be added/removed later. The
> device-dax allocation scheme just takes a size to do a "first-available"
> search for free capacity in the region.

Agreed.  And this is the way things work now.

Also, your use of 'Sparse DAX region' seems better than the word 'dynamic'
I have used until now.  I know that 'static' regions mean something else,
but I could not think of a better word.  I'll make adjustments to the
code/commit messages.

> 
> Given that one of the expected DCD use cases is to provide just in time
> memory for specific jobs the "first-available" search for free capacity
> in a Sparse DAX Region collides with the need to keep allocations
> bounded by tag.

How does it collide?

My attempt here is to leave dax devices 'unlabeled'.  As such they will use
space on a 'first-available' search regardless of extent labels.

Effectively I have defined 'no label' as being 'any label'.  I apologize
for this detail being implicit and not explicit.

My envisioned path would be that older daxctl would continue to work like
this because the kernel would not restrict unlabeled dax device creation.

Newer daxctl could use dax device labels to control the extents used.  But
only when dax device labeling is introduced in a future kernel.  Use of a
newer daxctl on an older DCD kernel could continue to work sans label.

In this way I envisioned a path where the policy is completely dictated by
user space restricted only by the software available.

> 
> I agree with Jonathan that unless and until the allocation scheme is
> updated to be tag aware then there is no reason for allocate by tag to
> exist in the interface.

I will agree that it was perhaps premature to introduce labels on the
extents.  However, I did so to give tags a place to be surfaced
informationally.

IMO we must have a plan forward or wait until that plan is fully formed
and implemented.  The size of this set is rather large.  Therefore, I was
hoping that a plan would be enough to move forward.

> 
> That said, the next question, "is DCD enabling considered a toy until
> the ability to allocate by tag is present?" I think yes, to the point
> where old daxctl binaries should be made fail to create device instances
> by forcing a tag to be selected at allocation time for Sparse DAX
> Regions.

Interesting.  I was not considering allocate by label to be a requirement
but rather an enhancement.  Labels IMO are a further refinement of the
memory space allocation.  I can see a very valid use case (not toy use
case) where all the DCD memory allocated to a node is dedicated to a
singular job and is done without tags or even ignoring tags.  Many HPC
sites run with singular jobs per host.

> 
> The last question is whether *writable* tags are needed to allow for
> repurposing memory allocated to a host without needing to round trip it
> through the FM to get it re-tagged. While that is something the host and
> orchestrator can figure out on their own, it looks like a nice to have
> until the above questions are answered.

Needed?  No.  Of course not.  As you said the orchestrator software can
keep iterating with the FM until it gets what it wants.  It was you who
had the idea of a writable labels and I agreed.

"Seemed like a good idea at the time..."  ;-)

As I have reviewed and rewritten this message I worry that writable labels
are a bad idea.  Interleaving will most likely depend on grouping extent
tags into the CXL/DAX extent.  With this in mind adjusting extents is
potentially going to require an FM interaction to get things set up
anyway.

	[Again re-reading my message I thought of another issue.  What
	happens if the user decides to change the label on an extent after
	some dax device was created with the old label?  That seems like an
	additional complication which is best left out by not allowing
	extent labels to be writable.]

I think writable labels are orthogonal to the kernel behavior though.
Allowing labels to change after the fact is a policy matter which is not
something the kernel needs to manage.

The kernel does need to manage how it allocates a dax device across the
extents available.  Assigning a dax label and allocating to the extents
matching that label is very straight forward.  The real issue is how to
deal with the 'no label' case.

As a path forward, I made a couple of assumptions.  First was the idea of
'no dax device label' == 'any extent label'.  Second, was that current dax
device creation was done as 'no dax device label'.

In this way I did not see a requirement to fully implement label
restriction on dax devices.  Labels are simply a nice to have thing to
group extents later.  Also, if you want dax devices created with specific
extents you have to assign them a label.  Otherwise they are allocated
'first-available' like they have been in the past.

I see a few ways forward.

One is to define 'no dax device label' as 'any extent label' as I have it
now.  IMO this provides the most backwards compatible dax device creation.
The ndctl region code additions are minimal and there are no daxctl
modifications required at all.

A second is to define 'no dax device label' as 'no extent label' and go
forward with this series but add a restriction on dax device creation to
only extents without a label.  This is still pretty compatible but if tags
are used then some extents would not be available without additional
daxctl modifications.

A third way forward is to fully implement label-enabled dax device
creation.  In this case I feel like the direction is to make 'no label' ==
'no label'.  This is not hard but will take a couple more weeks to write
the daxctl code and do all the testing.

It warrants mentioning that tags are an optional feature.  I feel like
there is momentum in the community to not use tags initially.  And so I
was targeting an initial implementation which really did not need tags at
all.  Perhaps I am wrong in that assumption?  Or perhaps I was
short-sighted (possibly because interleaving becomes more straightforward)?

To summarize I see the following fundamental questions.

	1) Do we require DCD support to require dax device label
	   management?
	2) What does 'no dax device label' mean?
		a) any extent label
		b) no extent label
	3) Should writable labels be allowed on extents?
		a) this is more flexible
		b) security issues?
		c) does it just confuse things with interleaving?
		d) nice to be able to change the tag name to something easier to read?
		e) other issues?
	4) How should the available size for labels be communicated to the
	   user?
	   	a) currently available size reflects an 'any extent label'
		   behavior when there is no label on the dax device.
		b) this becomes an issue if labelless dax devices are
		   restricted to labelless extents.

My current view is:
	1) No.  Current dax devices can be defined as 'no label'
	2) I'm not sure.  I can see both ways having benefits.
	3) No, I think the ROI is not worth it.
	4) The use of 'any extent label' in #2 means that available size
	   retains its meaning for no label dax devices.  Labeled dax
	   devices would require a future enhancement to size information.

> 
> > > > > > @@ -1400,8 +1507,10 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
> > > > > >  	device_initialize(dev);
> > > > > >  	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
> > > > > >  
> > > > > > +	dev_WARN_ONCE(parent, is_dynamic(dax_region) && data->size,
> > > > > > +		      "Dynamic DAX devices are created initially with 0 size");  
> > > > > 
> > > > > dev_info() maybe more appropriate?  
> > > > 
> > > > Unless I'm mistaken this can happen from userspace but only if something
> > > > in the code changes later.  Because the dax layer is trying to support
> > > > non-dynamic regions (which dynamic may be a bad name), I was worried that
> > > > the creation with a size might slip through...
> > > 
> > > Fair enough - if strong chance userspace will control it at some point then
> > > ONCE seems fine.
> > > 
> > > > 
> > > > > Is this common enough that we need the
> > > > > _ONCE?  
> > > > 
> > > > once is because it could end up spamming a log later if something got
> > > > coded up wrong.
> > > 
> > > I'm not sure I care about bugs spamming the log.   Only things that
> > > are userspace controlled or likely hardware failures etc.
> > > 
> > 
> > Understood.  Let me trace them again but I think these can be triggered by
> > user space.  If not I'll remove the ONCE.
> 
> Unless this is an unequivocal kernel bug if it fires, and there is a
> significant potential for active development to do the wrong thing,
> don't leave a panic_on_warn land mine.

Indeed.  I forgot about those panic_on_warn users.  I'll remove the warn
altogether.

Thanks,
Ira

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 14/18] dax/region: Support DAX device creation on dynamic DAX regions
  2023-09-13 17:30             ` Ira Weiny
@ 2023-09-13 17:59               ` Dan Williams
  2023-09-13 19:26                 ` Ira Weiny
  0 siblings, 1 reply; 97+ messages in thread
From: Dan Williams @ 2023-09-13 17:59 UTC (permalink / raw)
  To: Ira Weiny, Dan Williams, Jonathan Cameron
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Ira Weiny wrote:
[..]
> > 
> > Given that one of the expected DCD use cases is to provide just in time
> > memory for specific jobs the "first-available" search for free capacity
> > in a Sparse DAX Region collides with the need to keep allocations
> > bounded by tag.
> 
> How does it collide?
> 
> My attempt here is to leave dax devices 'unlabeled'.  As such they will use
> space on a 'first-available' search regardless of extent labels.
> 
> Effectively I have defined 'no label' as being 'any label'.  I apologize
> for this detail being implicit and not explicit.
> 
> My envisioned path would be that older daxctl would continue to work like
> this because the kernel would not restrict unlabeled dax device creation.
> 
> Newer daxctl could use dax device labels to control the extents used.  But
> only when dax device labeling is introduced in a future kernel.  Use of a
> newer daxctl on an older DCD kernel could continue to work sans label.
> 
> In this way I envisioned a path where the policy is completely dictated by
> user space restricted only by the software available.

Tags are a core concept in DCD. "Allocate by tag" does not feel like
something that can come later at least in terms of when the DCD ABI is
ready for upstream. So, yes, it can remain out of this patchset, but the
upstream merge of all of DCD would be gated on that facility arriving.

> > I agree with Jonathan that unless and until the allocation scheme is
> > updated to be tag aware then there is no reason for allocate by tag to
> > exist in the interface.
> 
> I will agree that it was perhaps premature to introduce labels on the
> extents.  However, I did so to give tags a space to be informationally
> surfaced.
> 
> IMO we must have a plan forward or wait until that plan is fully formed
> and implemented.  The size of this set is rather large.  Therefore, I was
> hoping that a plan would be enough to move forward.

Leave it out for now to focus on the core mechanisms and then we can
circle back to it.

> > That said, the next question, "is DCD enabling considered a toy until
> > the ability to allocate by tag is present?" I think yes, to the point
> > where old daxctl binaries should be made fail to create device instances
> > by forcing a tag to be selected at allocation time for Sparse DAX
> > Regions.
> 
> Interesting.  I was not considering allocate by label to be a requirement
> but rather an enhancement.  Labels IMO are a further refinement of the
> memory space allocation.  I can see a very valid use case (not toy use
> case) where all the DCD memory allocated to a node is dedicated to a
> singular job and is done without tags or even ignoring tags.  Many HPC
> sites run with singular jobs per host.

Is HPC going to use DCD? My impression is that HPC is statically
provisioned per node and that DCD is more targeted at Cloud use cases
where dynamic provisioning is common.

> > The last question is whether *writable* tags are needed to allow for
> > repurposing memory allocated to a host without needing to round trip it
> > through the FM to get it re-tagged. While that is something the host and
> > orchestrator can figure out on their own, it looks like a nice to have
> > until the above questions are answered.
> 
> Needed?  No.  Of course not.  As you said the orchestrator software can
> keep iterating with the FM until it gets what it wants.  It was you who
> had the idea of writable labels and I agreed.

Yeah, it was an idea for how to solve the problem of repurposing a tag
without needing to round trip with the FM.

> "Seemed like a good idea at the time..."  ;-)
> 
> As I have reviewed and rewritten this message I worry that writable labels
> are a bad idea.  Interleaving will most likely depend on grouping extent
> tags into the CXL/DAX extent.  With this in mind adjusting extents is
> potentially going to require an FM interaction to get things set up
> anyway.
> 
> 	[Again re-reading my message I thought of another issue.  What
> 	happens if the user decides to change the label on an extent after
> 	some dax device with the old label?  That seems like an additional
> 	complication which is best left out by not allowing extent labels
> 	to be writable.]

At least for this point extents can not be relabeled while allocated to
an instance.
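A minimal sketch of that constraint - the names here (struct dc_extent,
its refcount field, dc_extent_set_label()) are assumptions for
illustration, not the series' actual structures: relabeling is refused
while any dax device still holds a reference to the extent.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical extent bookkeeping, for illustration only */
struct dc_extent {
	char label[32];
	int refcount;	/* dax devices currently built on this extent */
};

/*
 * Refuse to rewrite the label while the extent is allocated to an
 * instance; the tag may only change when no consumer can observe it.
 */
static int dc_extent_set_label(struct dc_extent *ext, const char *new_label)
{
	if (ext->refcount > 0)
		return -EBUSY;
	strncpy(ext->label, new_label, sizeof(ext->label) - 1);
	ext->label[sizeof(ext->label) - 1] = '\0';
	return 0;
}
```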

[..]
> My current view is:
> 	1) No.  Current dax devices can be defined as 'no label'
> 	2) I'm not sure.  I can see both ways having benefits.
> 	3) No I think the ROI is not worth it.
> 	4) The use of 'any extent label' in #2 means that available size
> 	   retains it's meaning for no label dax devices.  Labeled dax
> 	   devices would require a future enhancement to size information.

If the ABI is going to change in the future I don't want every debug
session to start with "which version of daxctl were you using", or "do
your scripts comprehend Sparse DAX Regions?". This stance is motivated
by having seen the problems that the current ABI causes for people that want
to do things like mitigate the "noisy neighbor" phenomenon in memory
side caches. The allocation ABI is too simple and DCD seems to need
more.

The kernel enforced requirement for Sparse DAX Region aware tooling just
makes it easier on us to maintain. If it means waiting until we have
agreement on the allocation ABI I think that's a simple release valve.

The fundamental mechanisms can be reviewed in the meantime.


* Re: [PATCH RFC v2 14/18] dax/region: Support DAX device creation on dynamic DAX regions
  2023-09-13 17:59               ` Dan Williams
@ 2023-09-13 19:26                 ` Ira Weiny
  2023-09-14 10:32                   ` Jonathan Cameron
  0 siblings, 1 reply; 97+ messages in thread
From: Ira Weiny @ 2023-09-13 19:26 UTC (permalink / raw)
  To: Dan Williams, Ira Weiny, Jonathan Cameron
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

Dan Williams wrote:
> Ira Weiny wrote:
> [..]
> > > 
> > > Given that one of the expected DCD use cases is to provide just in time
> > > memory for specific jobs the "first-available" search for free capacity
> > > in a Sparse DAX Region collides with the need to keep allocations
> > > bounded by tag.
> > 
> > How does it collide?
> > 
> > My attempt here is to leave dax devices 'unlabeled'.  As such they will use
> > space on a 'first-available' search regardless of extent labels.
> > 
> > Effectively I have defined 'no label' as being 'any label'.  I apologize
> > for this detail being implicit and not explicit.
> > 
> > My envisioned path would be that older daxctl would continue to work like
> > this because the kernel would not restrict unlabeled dax device creation.
> > 
> > Newer daxctl could use dax device labels to control the extents used.  But
> > only when dax device labeling is introduced in a future kernel.  Use of a
> > newer daxctl on an older DCD kernel could continue to work sans label.
> > 
> > In this way I envisioned a path where the policy is completely dictated by
> > user space restricted only by the software available.
> 
> Tags are a core concept in DCD. "Allocate by tag" does not feel like
> something that can come later at least in terms of when the DCD ABI is
> ready for upstream. So, yes, it can remain out of this patchset, but the
> upstream merge of all of DCD would be gated on that facility arriving.

I don't see how this can be left out of this patchset.  Without dax device
support on DCD there is no functionality and this patchset does nothing.

> 
> > > I agree with Jonathan that unless and until the allocation scheme is
> > > updated to be tag aware then there is no reason for allocate by tag to
> > > exist in the interface.
> > 
> > I will agree that it was perhaps premature to introduce labels on the
> > extents.  However, I did so to give tags a space to be informationally
> > surfaced.
> > 
> > IMO we must have a plan forward or wait until that plan is fully formed
> > and implemented.  The size of this set is rather large.  Therefore, I was
> > hoping that a plan would be enough to move forward.
> 
> Leave it out for now to focus on the core mechanisms and then we can
       ^^^^
       it what?
> circle back to it.

Again, dax devices need to be created to fully test this, so I have to
create them in some way.  I'm going to assume you mean 'labelless' and
deal with
labels later.

> 
> > > That said, the next question, "is DCD enabling considered a toy until
> > > the ability to allocate by tag is present?" I think yes, to the point
> > > where old daxctl binaries should be made fail to create device instances
> > > by forcing a tag to be selected at allocation time for Sparse DAX
> > > Regions.
> > 
> > Interesting.  I was not considering allocate by label to be a requirement
> > but rather an enhancement.  Labels IMO are a further refinement of the
> > memory space allocation.  I can see a very valid use case (not toy use
> > case) where all the DCD memory allocated to a node is dedicated to a
> > singular job and is done without tags or even ignoring tags.  Many HPC
> > sites run with singular jobs per host.
> 
> Is HPC going to use DCD? My impression is that HPC is statically
> provisioned per node and that DCD is more targeted at Cloud use cases
> where dynamic provisioning is common.

I heard someone mention HPC in a call at some point.

> 
> > > The last question is whether *writable* tags are needed to allow for
> > > repurposing memory allocated to a host without needing to round trip it
> > > through the FM to get it re-tagged. While that is something the host and
> > > orchestrator can figure out on their own, it looks like a nice to have
> > > until the above questions are answered.
> > 
> > Needed?  No.  Of course not.  As you said the orchestrator software can
> > keep iterating with the FM until it gets what it wants.  It was you who
> > had the idea of writable labels and I agreed.
> 
> Yeah, it was an idea for how to solve the problem of repurposing tag
> without needing to round trip with the FM.
> 
> > "Seemed like a good idea at the time..."  ;-)
> > 
> > As I have reviewed and rewritten this message I worry that writable labels
> > are a bad idea.  Interleaving will most likely depend on grouping extent
> > tags into the CXL/DAX extent.  With this in mind adjusting extents is
> > potentially going to require an FM interaction to get things set up
> > anyway.
> > 
> > 	[Again re-reading my message I thought of another issue.  What
> > 	happens if the user decides to change the label on an extent after
> > 	some dax device with the old label?  That seems like an additional
> > 	complication which is best left out by not allowing extent labels
> > 	to be writable.]
> 
> At least for this point extents can not be relabeled while allocated to
> an instance.

Sure but is having writeable labels worth this extra complexity?

> 
> [..]
> > My current view is:
> > 	1) No.  Current dax devices can be defined as 'no label'
> > 	2) I'm not sure.  I can see both ways having benefits.
> > 	3) No I think the ROI is not worth it.
> > 	4) The use of 'any extent label' in #2 means that available size
> > 	   retains its meaning for no label dax devices.  Labeled dax
> > 	   devices would require a future enhancement to size information.
> 
> If the ABI is going to change in the future I don't want every debug
> session to start with "which version of daxctl were you using", or "do
> your scripts comprehend Sparse DAX Regions?".

Well then we are stuck.  Because at a minimum they will have to understand
Sparse DAX regions.  cxl create-region needs a new type to create such
regions.

I envisioned an ABI *extension*, not a change.  The current ABI supports dax
devices without a tag.  Even with DCD no tag is possible.  Unless you want
to restrict it, which it sounds like you do?

I'm ok with that but I know of at least 1 meeting where it was
emphatically mentioned that tags are _not_ required.  So I'd like some
community members to chime in here if requiring tags is ok.

>
> This stance is motivated
> by having seen the problems that the current ABI causes for people that want
> to do things like mitigate the "noisy neighbor" phenomenon in memory
> side caches.

Does a dax device need specific placement within the region?  That sounds
like control at the extent layer when the extent is mapped into the
region.

The mapping store interface does need to be resolved for DCD.  I could
envision the ability for user space to create extents...  Are you thinking
the same thing?

Conceptually from a top down approach _any_ dax region could be a sparse
dax region if I get what you are driving at?  Not just DCD?  In that case
extent creation is even more complicated in the DCD case.

> The allocation ABI is too simple and DCD seems to need
> more.

Are you advocating for an ABI which requires dax devices to be labeled?
It sounds like you don't want the current tool set to work on sparse dax
regions.  Is that correct?  I'm ok with that but I don't think a specific
check in the kernel is the proper way to do that.  Current dax devices are
unlabeled.  So I envisioned them being supported with the current ABI.

> 
> The kernel enforced requirement for Sparse DAX Region aware tooling just
> makes it easier on us to maintain. If it means waiting until we have
> agreement on the allocation ABI I think that's a simple release valve.

These statements imply to me that you have additional requirements for
this ABI beyond what DCD does.  I've tried to make the dax layer
DCD/CXL agnostic.  But beyond having the concept of region extents
which are labeled and matched to dax devices based on that label, what
other requirements on dax-to-region space allocations are there?

> 
> The fundamental mechanisms can be reviewed in the meantime.

Sure,
Ira


* Re: [PATCH RFC v2 14/18] dax/region: Support DAX device creation on dynamic DAX regions
  2023-09-13 19:26                 ` Ira Weiny
@ 2023-09-14 10:32                   ` Jonathan Cameron
  0 siblings, 0 replies; 97+ messages in thread
From: Jonathan Cameron @ 2023-09-14 10:32 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Wed, 13 Sep 2023 12:26:58 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> Dan Williams wrote:
> > Ira Weiny wrote:

Jumping in on randomly selected points :)

> > [..]  
> > > > 
> > > > Given that one of the expected DCD use cases is to provide just in time
> > > > memory for specific jobs the "first-available" search for free capacity
> > > > in a Sparse DAX Region collides with the need to keep allocations
> > > > bounded by tag.  
> > > 
> > > How does it collide?
> > > 
> > > My attempt here is to leave dax devices 'unlabeled'.  As such they will use
> > > space on a 'first-available' search regardless of extent labels.
> > > 
> > > Effectively I have defined 'no label' as being 'any label'.  I apologize
> > > for this detail being implicit and not explicit.
> > > 
> > > My envisioned path would be that older daxctl would continue to work like
> > > this because the kernel would not restrict unlabeled dax device creation.
> > > 
> > > Newer daxctl could use dax device labels to control the extents used.  But
> > > only when dax device labeling is introduced in a future kernel.  Use of a
> > > newer daxctl on an older DCD kernel could continue to work sans label.
> > > 
> > > In this way I envisioned a path where the policy is completely dictated by
> > > user space restricted only by the software available.  
> > 
> > Tags are a core concept in DCD. "Allocate by tag" does not feel like
> > something that can come later at least in terms of when the DCD ABI is
> > ready for upstream. So, yes, it can remain out of this patchset, but the
> > upstream merge of all of DCD would be gated on that facility arriving.  
> 
> I don't see how this can be left out of this patchset.  Without dax device
> support on DCD there is no functionality and this patchset does nothing.

Agreed - but I think one path you suggest is fine.

No label dax == no label DCD extents.

That one should be true forever (or until writeable tags are added) so
is safe and gets us going.
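That strict rule can be sketched as follows (hypothetical helper, not
from the patchset): an unlabeled dax device only draws from unlabeled
extents, so existing behavior stays stable when tag-aware allocation
arrives later.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical option (b) semantics: 'no dax device label' means
 * 'no extent label'.  Unlabeled devices never consume tagged extents,
 * leaving tagged capacity reserved for future tag-aware tooling.
 */
static bool dax_extent_eligible(const char *dev_label, const char *extent_label)
{
	bool dev_unlabeled = !dev_label || dev_label[0] == '\0';
	bool ext_unlabeled = !extent_label || extent_label[0] == '\0';

	if (dev_unlabeled)
		return ext_unlabeled;
	return !ext_unlabeled && strcmp(dev_label, extent_label) == 0;
}
```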

> 
> >   
> > > > I agree with Jonathan that unless and until the allocation scheme is
> > > > updated to be tag aware then there is no reason for allocate by tag to
> > > > exist in the interface.  
> > > 
> > > I will agree that it was perhaps premature to introduce labels on the
> > > extents.  However, I did so to give tags a space to be informationally
> > > surfaced.
> > > 
> > > IMO we must have a plan forward or wait until that plan is fully formed
> > > and implemented.  The size of this set is rather large.  Therefore, I was
> > > hoping that a plan would be enough to move forward.  
> > 
> > Leave it out for now to focus on the core mechanisms and then we can  
>        ^^^^
>        it what?
> > circle back to it.  
> 
> Again dax devices need to be created to fully test this so I have to create
> them in some way.  I'm going to assume you mean 'labelless' and deal with
> labels later.
> 
> >   
> > > > That said, the next question, "is DCD enabling considered a toy until
> > > > the ability to allocate by tag is present?" I think yes, to the point
> > > > where old daxctl binaries should be made fail to create device instances
> > > > by forcing a tag to be selected at allocation time for Sparse DAX
> > > > Regions.  
> > > 
> > > Interesting.  I was not considering allocate by label to be a requirement
> > > but rather an enhancement.  Labels IMO are a further refinement of the
> > > memory space allocation.  I can see a very valid use case (not toy use
> > > case) where all the DCD memory allocated to a node is dedicated to a
> > > singular job and is done without tags or even ignoring tags.  Many HPC
> > > sites run with singular jobs per host.  
> > 
> > Is HPC going to use DCD? My impression is that HPC is statically
> > provisioned per node and that DCD is more targeted at Cloud use cases
> > where dynamic provisioning is common.  
> 
> I heard someone mention HPC in a call at some point.

I'd not rule it out.  Some HPC systems run very mixed workloads in parallel
so would benefit from dynamic capacity - though maybe not with the same
rate of change as cloud workloads.

> 
> >   
> > > > The last question is whether *writable* tags are needed to allow for
> > > > repurposing memory allocated to a host without needing to round trip it
> > > > through the FM to get it re-tagged. While that is something the host and
> > > > orchestrator can figure out on their own, it looks like a nice to have
> > > > until the above questions are answered.  
> > > 
> > > Needed?  No.  Of course not.  As you said the orchestrator software can
> > > keep iterating with the FM until it gets what it wants.  It was you who
> > > had the idea of writable labels and I agreed.  
> > 
> > Yeah, it was an idea for how to solve the problem of repurposing tag
> > without needing to round trip with the FM.
> >   
> > > "Seemed like a good idea at the time..."  ;-)
> > > 
> > > As I have reviewed and rewritten this message I worry that writable labels
> > > are a bad idea.  Interleaving will most likely depend on grouping extent
> > > tags into the CXL/DAX extent.  With this in mind adjusting extents is
> > > potentially going to require an FM interaction to get things set up
> > > anyway.
> > > 
> > > 	[Again re-reading my message I thought of another issue.  What
> > > 	happens if the user decides to change the label on an extent after
> > > 	some dax device with the old label?  That seems like an additional
> > > 	complication which is best left out by not allowing extent labels
> > > 	to be writable.]  
> > 
> > At least for this point extents can not be relabeled while allocated to
> > an instance.  
> 
> Sure but is having writeable labels worth this extra complexity?

No. Or not yet anyway.


> 
> > 
> > [..]  
> > > My current view is:
> > > 	1) No.  Current dax devices can be defined as 'no label'
> > > 	2) I'm not sure.  I can see both ways having benefits.

> 2) What does 'no dax device label' mean?
> 		a) any extent label
> 		b) no extent label
(that bit got cropped)

Option b seems like something we can support forever.  Not sure that
works for option a.

> > > 	3) No I think the ROI is not worth it.
> > > 	4) The use of 'any extent label' in #2 means that available size
> > > 	   retains its meaning for no label dax devices.  Labeled dax
> > > 	   devices would require a future enhancement to size information.  
> > 
> > If the ABI is going to change in the future I don't want every debug
> > session to start with "which version of daxctl were you using", or "do
> > your scripts comprehend Sparse DAX Regions?".  
> 
> Well then we are stuck.  Because at a minimum they will have to understand
> Sparse DAX regions.  cxl create-region needs a new type to create such
> regions.
> 
> I envisioned an ABI *extension* not change.  The current ABI supports dax
> devices without a tag.  Even with DCD no tag is possible.  Unless you want
> to restrict it, which it sounds like you do?
> 
> I'm ok with that but I know of at least 1 meeting where it was
> emphatically mentioned that tags are _not_ required.  So I'd like some
> community members to chime in here if requiring tags is ok.

They are definitely not required and I don't think we want to make
it a Linux requirement that tags are needed.


> 
> >
> > This stance is motivated
> > by having seen the problems that the current ABI causes for people that want
> > to do things like mitigate the "noisy neighbor" phenomenon in memory
> > side caches.  
> 
> Does a dax device need specific placement within the region?  That sounds
> like control at the extent layer when the extent is mapped into the
> region.
> 
> The mapping store interface does need to be resolved for DCD.  I could
> envision the ability for user space to create extents...  Are you thinking
> the same thing?
> 
> Conceptually from a top down approach _any_ dax region could be a sparse
> dax region if I get what you are driving at?  Not just DCD?  In that case
> extent creation is even more complicated in the DCD case.

For now at least I'd push any clever noisy-neighbour mess onto the
Fabric Manager once we have tags.  Not sure the OS even has the
visibility to do this sort of fine-tuning.  We could provide it of
course, but that's a whole level of system description that we don't
have today.

> 
> > The allocation ABI is too simple and DCD seems to need
> > more.  
> 
> Are you advocating for an ABI which requires dax devices to be labeled?
> It sounds like you don't want the current tool set to work on sparse dax
> regions.  Is that correct?  I'm ok with that but I don't think a specific
> check in the kernel is the proper way to do that.  Current dax devices are
> unlabeled.  So I envisioned them being supported with the current ABI.
> 
> > 
> > The kernel enforced requirement for Sparse DAX Region aware tooling just
> > makes it easier on us to maintain. If it means waiting until we have
> > agreement on the allocation ABI I think that's a simple release valve.  
> 
> These statements imply to me you have additional requirements for this ABI
> beyond what DCD does.  I've tried to make the dax layer DCD/CXL agnostic.
> But beyond having the concept of region extents which are labeled and
> matched to dax devices based on that label; what other requirements on dax
> to region space allocations are there?
> 
> > 
> > The fundamental mechanisms can be reviewed in the meantime.  
> 
> Sure,
> Ira
> 



* Re: [PATCH RFC v2 12/18] cxl/region: Notify regions of DC changes
  2023-08-29  5:21 ` [PATCH RFC v2 12/18] cxl/region: Notify regions of DC changes Ira Weiny
  2023-08-29 16:40   ` Jonathan Cameron
@ 2023-09-18 13:56   ` Jørgen Hansen
  2023-09-18 17:45     ` Ira Weiny
  1 sibling, 1 reply; 97+ messages in thread
From: Jørgen Hansen @ 2023-09-18 13:56 UTC (permalink / raw)
  To: Ira Weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, linux-cxl,
	linux-kernel

On 8/29/23 07:21, Ira Weiny wrote:
> 
> In order for a user to use dynamic capacity effectively they need to
> know when dynamic capacity is available.  Thus when Dynamic Capacity
> (DC) extents are added or removed by a DC device the regions affected
> need to be notified.  Ultimately the DAX region uses the memory
> associated with DC extents.  However, remember that CXL DAX regions
> maintain any interleave details between devices.
> 
> When a DCD event occurs, iterate all CXL endpoint decoders and notify
> regions which contain the endpoints affected by the event.  In turn
> notify the DAX regions of the changes to the DAX region extents.
> 
> For now interleave is handled by creating simple 1:1 mappings between
> the CXL DAX region and DAX region layers.  Future implementations will
> need to resolve when to actually surface a DAX region extent and pass
> the notification along.
> 
> Remember that adding capacity is safe because there is no chance of the
> memory being in use.  Also remember at this point releasing capacity is
> straight forward because DAX devices do not yet have references to the
> extents.  Future patches will handle that complication.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes from v1:
> [iweiny: Rewrite]
> ---
>   drivers/cxl/core/mbox.c   |  39 +++++++++++++--
>   drivers/cxl/core/region.c | 123 +++++++++++++++++++++++++++++++++++++++++-----
>   drivers/cxl/cxl.h         |  22 +++++++++
>   drivers/cxl/mem.c         |  50 +++++++++++++++++++
>   drivers/dax/cxl.c         |  99 ++++++++++++++++++++++++++++++-------
>   drivers/dax/dax-private.h |   3 ++
>   drivers/dax/extent.c      |  14 ++++++
>   7 files changed, 317 insertions(+), 33 deletions(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 5472ab1d0370..9d9c13e13ecf 100644

[snip]

> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 0aeea50550f6..a0c1f2793dd7 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1547,8 +1547,8 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
>          return 0;
>   }
> 
> -static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
> -                               struct cxl_dc_extent_data *extent)
> +bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
> +                        struct cxl_dc_extent_data *extent)
>   {
>          struct range dpa_range = (struct range){
>                  .start = extent->dpa_start,
> @@ -1567,14 +1567,66 @@ static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
>          return (cxled->dpa_res->start <= dpa_range.start &&
>                  dpa_range.end <= cxled->dpa_res->end);
>   }
> +EXPORT_SYMBOL_NS_GPL(cxl_dc_extent_in_ed, CXL);
> +
> +static int cxl_region_notify_extent(struct cxl_endpoint_decoder *cxled,
> +                                   enum dc_event event,
> +                                   struct cxl_dr_extent *cxl_dr_ext)
> +{
> +       struct cxl_dax_region *cxlr_dax;
> +       struct device *dev;
> +       int rc = 0;
> +
> +       cxlr_dax = cxled->cxld.region->cxlr_dax;
> +       dev = &cxlr_dax->dev;
> +       dev_dbg(dev, "Trying notify: type %d HPA:%llx LEN:%llx\n",
> +               event, cxl_dr_ext->hpa_offset, cxl_dr_ext->hpa_length);
> +
> +       device_lock(dev);
> +       if (dev->driver) {
> +               struct cxl_driver *reg_drv = to_cxl_drv(dev->driver);
> +               struct cxl_drv_nd nd = (struct cxl_drv_nd) {
> +                       .event = event,
> +                       .cxl_dr_ext = cxl_dr_ext,
> +               };
> +
> +               if (reg_drv->notify) {
> +                       dev_dbg(dev, "Notify: type %d HPA:%llx LEN:%llx\n",
> +                               event, cxl_dr_ext->hpa_offset,
> +                               cxl_dr_ext->hpa_length);
> +                       rc = reg_drv->notify(dev, &nd);
> +               }
> +       }
> +       device_unlock(dev);
> +       return rc;
> +}
> +
> +static resource_size_t
> +cxl_dc_extent_to_hpa_offset(struct cxl_endpoint_decoder *cxled,
> +                           struct cxl_dc_extent_data *extent)
> +{
> +       struct cxl_dax_region *cxlr_dax;
> +       resource_size_t dpa_offset, hpa;
> +       struct range *ed_hpa_range;
> +
> +       cxlr_dax = cxled->cxld.region->cxlr_dax;
> +
> +       /*
> +        * Without interleave...
> +        * HPA offset == DPA offset
> +        * ... but do the math anyway
> +        */
> +       dpa_offset = extent->dpa_start - cxled->dpa_res->start;
> +       ed_hpa_range = &cxled->cxld.hpa_range;
> +       hpa = ed_hpa_range->start + dpa_offset;
> +       return hpa - cxlr_dax->hpa_range.start;
> +}
> 
>   static int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
>                                   struct cxl_dc_extent_data *extent)
>   {
>          struct cxl_dr_extent *cxl_dr_ext;
>          struct cxl_dax_region *cxlr_dax;
> -       resource_size_t dpa_offset, hpa;
> -       struct range *ed_hpa_range;
>          struct device *dev;
>          int rc;
> 
> @@ -1601,15 +1653,7 @@ static int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
>          cxl_dr_ext->extent = extent;
>          kref_init(&cxl_dr_ext->region_ref);
> 
> -       /*
> -        * Without interleave...
> -        * HPA offset == DPA offset
> -        * ... but do the math anyway
> -        */
> -       dpa_offset = extent->dpa_start - cxled->dpa_res->start;
> -       ed_hpa_range = &cxled->cxld.hpa_range;
> -       hpa = ed_hpa_range->start + dpa_offset;
> -       cxl_dr_ext->hpa_offset = hpa - cxlr_dax->hpa_range.start;
> +       cxl_dr_ext->hpa_offset = cxl_dc_extent_to_hpa_offset(cxled, extent);
> 
>          /* Without interleave carry length and label through */
>          cxl_dr_ext->hpa_length = extent->length;
> @@ -1626,6 +1670,7 @@ static int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
>          }
>          /* Put in cxl_dr_release() */
>          cxl_dc_extent_get(cxl_dr_ext->extent);
> +       cxl_region_notify_extent(cxled, DCD_ADD_CAPACITY, cxl_dr_ext);
>          return 0;
>   }
> 
> @@ -1663,6 +1708,58 @@ static int cxl_ed_add_extents(struct cxl_endpoint_decoder *cxled)
>          return 0;
>   }
> 
> +static int cxl_ed_rm_dc_extent(struct cxl_endpoint_decoder *cxled,
> +                              enum dc_event event,
> +                              struct cxl_dc_extent_data *extent)
> +{
> +       struct cxl_region *cxlr = cxled->cxld.region;
> +       struct cxl_dax_region *cxlr_dax = cxlr->cxlr_dax;
> +       struct cxl_dr_extent *cxl_dr_ext;
> +       resource_size_t hpa_offset;
> +
> +       hpa_offset = cxl_dc_extent_to_hpa_offset(cxled, extent);
> +
> +       /*
> +        * NOTE on Interleaving: There is no need to 'break up' the cxl_dr_ext.
> +        * If one of the extents comprising it is gone it should be removed
> +        * from the region to prevent future use.  Later code may save other
> +        * extents for future processing.  But for now the correlation is 1:1:1
> +        * so just erase the extent.
> +        */
> +       cxl_dr_ext = xa_erase(&cxlr_dax->extents, hpa_offset);
> +
> +       dev_dbg(&cxlr_dax->dev, "Remove DAX region ext HPA:%llx\n",
> +               cxl_dr_ext->hpa_offset);
> +       cxl_region_notify_extent(cxled, event, cxl_dr_ext);
> +       cxl_dr_extent_put(cxl_dr_ext);
> +       return 0;
> +}
> +
> +int cxl_ed_notify_extent(struct cxl_endpoint_decoder *cxled,
> +                        struct cxl_drv_nd *nd)
> +{
> +       int rc = 0;
> +
> +       switch (nd->event) {
> +       case DCD_ADD_CAPACITY:
> +               if (cxl_dc_extent_get_not_zero(nd->extent)) {
> +                       rc = cxl_ed_add_one_extent(cxled, nd->extent);
> +                       if (rc)
> +                               cxl_dc_extent_put(nd->extent);

Hi,
when adding and releasing DCD extents through the QMP interface of the
QEMU DCD emulation, I noticed that extents weren't handed back to the
device. It looks like there is a refcounting issue: the kref never
drops below 2 for the DC extents. Should we put the DC extent here only
on error, or always put it?  cxl_ed_add_one_extent() also grabs a
reference to the DC extent, and that one is put in cxl_dr_release(),
but I couldn't find a matching put for this get_not_zero().


> +               }
> +               break;
> +       case DCD_RELEASE_CAPACITY:
> +       case DCD_FORCED_CAPACITY_RELEASE:
> +               rc = cxl_ed_rm_dc_extent(cxled, nd->event, nd->extent);
> +               break;
> +       default:
> +               dev_err(&cxled->cxld.dev, "Unknown DC event %d\n", nd->event);
> +               break;
> +       }
> +       return rc;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_ed_notify_extent, CXL);
> +
>   static int cxl_region_attach_position(struct cxl_region *cxlr,
>                                        struct cxl_root_decoder *cxlrd,
>                                        struct cxl_endpoint_decoder *cxled,

[snip]

> 
> --
> 2.41.0
> 

Thanks,
Jorgen

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH RFC v2 12/18] cxl/region: Notify regions of DC changes
  2023-09-18 13:56   ` Jørgen Hansen
@ 2023-09-18 17:45     ` Ira Weiny
  0 siblings, 0 replies; 97+ messages in thread
From: Ira Weiny @ 2023-09-18 17:45 UTC (permalink / raw)
  To: Jørgen Hansen, Ira Weiny, Dan Williams
  Cc: Navneet Singh, Fan Ni, Jonathan Cameron, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, linux-cxl,
	linux-kernel

Jørgen Hansen wrote:
> On 8/29/23 07:21, Ira Weiny wrote:
> > 
> > In order for a user to use dynamic capacity effectively they need to
> > know when dynamic capacity is available.  Thus when Dynamic Capacity
> > (DC) extents are added or removed by a DC device the regions affected
> > need to be notified.  Ultimately the DAX region uses the memory
> > associated with DC extents.  However, remember that CXL DAX regions
> > maintain any interleave details between devices.
> > 
> > When a DCD event occurs, iterate all CXL endpoint decoders and notify
> > regions which contain the endpoints affected by the event.  In turn
> > notify the DAX regions of the changes to the DAX region extents.
> > 
> > For now interleave is handled by creating simple 1:1 mappings between
> > the CXL DAX region and DAX region layers.  Future implementations will
> > need to resolve when to actually surface a DAX region extent and pass
> > the notification along.
> > 
> > Remember that adding capacity is safe because there is no chance of the
> > memory being in use.  Also remember at this point releasing capacity is
> > straight forward because DAX devices do not yet have references to the
> > extents.  Future patches will handle that complication.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> > ---
> > Changes from v1:
> > [iweiny: Rewrite]
> > ---
> >   drivers/cxl/core/mbox.c   |  39 +++++++++++++--
> >   drivers/cxl/core/region.c | 123 +++++++++++++++++++++++++++++++++++++++++-----
> >   drivers/cxl/cxl.h         |  22 +++++++++
> >   drivers/cxl/mem.c         |  50 +++++++++++++++++++
> >   drivers/dax/cxl.c         |  99 ++++++++++++++++++++++++++++++-------
> >   drivers/dax/dax-private.h |   3 ++
> >   drivers/dax/extent.c      |  14 ++++++
> >   7 files changed, 317 insertions(+), 33 deletions(-)
> > 
> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 5472ab1d0370..9d9c13e13ecf 100644
> 
> [snip]
> 
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > index 0aeea50550f6..a0c1f2793dd7 100644
> > --- a/drivers/cxl/core/region.c
> > +++ b/drivers/cxl/core/region.c
> > @@ -1547,8 +1547,8 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
> >          return 0;
> >   }
> > 
> > -static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
> > -                               struct cxl_dc_extent_data *extent)
> > +bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
> > +                        struct cxl_dc_extent_data *extent)
> >   {
> >          struct range dpa_range = (struct range){
> >                  .start = extent->dpa_start,
> > @@ -1567,14 +1567,66 @@ static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
> >          return (cxled->dpa_res->start <= dpa_range.start &&
> >                  dpa_range.end <= cxled->dpa_res->end);
> >   }
> > +EXPORT_SYMBOL_NS_GPL(cxl_dc_extent_in_ed, CXL);
> > +
> > +static int cxl_region_notify_extent(struct cxl_endpoint_decoder *cxled,
> > +                                   enum dc_event event,
> > +                                   struct cxl_dr_extent *cxl_dr_ext)
> > +{
> > +       struct cxl_dax_region *cxlr_dax;
> > +       struct device *dev;
> > +       int rc = 0;
> > +
> > +       cxlr_dax = cxled->cxld.region->cxlr_dax;
> > +       dev = &cxlr_dax->dev;
> > +       dev_dbg(dev, "Trying notify: type %d HPA:%llx LEN:%llx\n",
> > +               event, cxl_dr_ext->hpa_offset, cxl_dr_ext->hpa_length);
> > +
> > +       device_lock(dev);
> > +       if (dev->driver) {
> > +               struct cxl_driver *reg_drv = to_cxl_drv(dev->driver);
> > +               struct cxl_drv_nd nd = (struct cxl_drv_nd) {
> > +                       .event = event,
> > +                       .cxl_dr_ext = cxl_dr_ext,
> > +               };
> > +
> > +               if (reg_drv->notify) {
> > +                       dev_dbg(dev, "Notify: type %d HPA:%llx LEN:%llx\n",
> > +                               event, cxl_dr_ext->hpa_offset,
> > +                               cxl_dr_ext->hpa_length);
> > +                       rc = reg_drv->notify(dev, &nd);
> > +               }
> > +       }
> > +       device_unlock(dev);
> > +       return rc;
> > +}
> > +
> > +static resource_size_t
> > +cxl_dc_extent_to_hpa_offset(struct cxl_endpoint_decoder *cxled,
> > +                           struct cxl_dc_extent_data *extent)
> > +{
> > +       struct cxl_dax_region *cxlr_dax;
> > +       resource_size_t dpa_offset, hpa;
> > +       struct range *ed_hpa_range;
> > +
> > +       cxlr_dax = cxled->cxld.region->cxlr_dax;
> > +
> > +       /*
> > +        * Without interleave...
> > +        * HPA offset == DPA offset
> > +        * ... but do the math anyway
> > +        */
> > +       dpa_offset = extent->dpa_start - cxled->dpa_res->start;
> > +       ed_hpa_range = &cxled->cxld.hpa_range;
> > +       hpa = ed_hpa_range->start + dpa_offset;
> > +       return hpa - cxlr_dax->hpa_range.start;
> > +}
> > 
> >   static int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> >                                   struct cxl_dc_extent_data *extent)
> >   {
> >          struct cxl_dr_extent *cxl_dr_ext;
> >          struct cxl_dax_region *cxlr_dax;
> > -       resource_size_t dpa_offset, hpa;
> > -       struct range *ed_hpa_range;
> >          struct device *dev;
> >          int rc;
> > 
> > @@ -1601,15 +1653,7 @@ static int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> >          cxl_dr_ext->extent = extent;
> >          kref_init(&cxl_dr_ext->region_ref);
> > 
> > -       /*
> > -        * Without interleave...
> > -        * HPA offset == DPA offset
> > -        * ... but do the math anyway
> > -        */
> > -       dpa_offset = extent->dpa_start - cxled->dpa_res->start;
> > -       ed_hpa_range = &cxled->cxld.hpa_range;
> > -       hpa = ed_hpa_range->start + dpa_offset;
> > -       cxl_dr_ext->hpa_offset = hpa - cxlr_dax->hpa_range.start;
> > +       cxl_dr_ext->hpa_offset = cxl_dc_extent_to_hpa_offset(cxled, extent);
> > 
> >          /* Without interleave carry length and label through */
> >          cxl_dr_ext->hpa_length = extent->length;
> > @@ -1626,6 +1670,7 @@ static int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> >          }
> >          /* Put in cxl_dr_release() */
> >          cxl_dc_extent_get(cxl_dr_ext->extent);
> > +       cxl_region_notify_extent(cxled, DCD_ADD_CAPACITY, cxl_dr_ext);
> >          return 0;
> >   }
> > 
> > @@ -1663,6 +1708,58 @@ static int cxl_ed_add_extents(struct cxl_endpoint_decoder *cxled)
> >          return 0;
> >   }
> > 
> > +static int cxl_ed_rm_dc_extent(struct cxl_endpoint_decoder *cxled,
> > +                              enum dc_event event,
> > +                              struct cxl_dc_extent_data *extent)
> > +{
> > +       struct cxl_region *cxlr = cxled->cxld.region;
> > +       struct cxl_dax_region *cxlr_dax = cxlr->cxlr_dax;
> > +       struct cxl_dr_extent *cxl_dr_ext;
> > +       resource_size_t hpa_offset;
> > +
> > +       hpa_offset = cxl_dc_extent_to_hpa_offset(cxled, extent);
> > +
> > +       /*
> > +        * NOTE on Interleaving: There is no need to 'break up' the cxl_dr_ext.
> > +        * If one of the extents comprising it is gone it should be removed
> > +        * from the region to prevent future use.  Later code may save other
> > +        * extents for future processing.  But for now the correlation is 1:1:1
> > +        * so just erase the extent.
> > +        */
> > +       cxl_dr_ext = xa_erase(&cxlr_dax->extents, hpa_offset);
> > +
> > +       dev_dbg(&cxlr_dax->dev, "Remove DAX region ext HPA:%llx\n",
> > +               cxl_dr_ext->hpa_offset);
> > +       cxl_region_notify_extent(cxled, event, cxl_dr_ext);
> > +       cxl_dr_extent_put(cxl_dr_ext);
> > +       return 0;
> > +}
> > +
> > +int cxl_ed_notify_extent(struct cxl_endpoint_decoder *cxled,
> > +                        struct cxl_drv_nd *nd)
> > +{
> > +       int rc = 0;
> > +
> > +       switch (nd->event) {
> > +       case DCD_ADD_CAPACITY:
> > +               if (cxl_dc_extent_get_not_zero(nd->extent)) {
> > +                       rc = cxl_ed_add_one_extent(cxled, nd->extent);
> > +                       if (rc)
> > +                               cxl_dc_extent_put(nd->extent);
> 
> Hi,
> when adding and releasing DCD extents through the QMP interface of the
> QEMU DCD emulation, I noticed that extents weren't handed back to the
> device. It looks like there is a refcounting issue: the kref never
> drops below 2 for the DC extents. Should we put the DC extent here only
> on error, or always put it?  cxl_ed_add_one_extent() also grabs a
> reference to the DC extent, and that one is put in cxl_dr_release(),
> but I couldn't find a matching put for this get_not_zero().

This is a bug I have fixed in the next version.

Yes, the put needs to happen regardless of the return value:

...
        case DCD_ADD_CAPACITY:
                if (cxl_dc_extent_get_not_zero(nd->extent)) {
                        rc = cxl_ed_add_one_extent(cxled, nd->extent);
                        cxl_dc_extent_put(nd->extent);
                }
...

Please let me know if that does not work.  And thanks for the testing,
Ira


* Re: [PATCH RFC v2 02/18] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
  2023-08-29  5:20 ` [PATCH RFC v2 02/18] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD) Ira Weiny
                     ` (2 preceding siblings ...)
  2023-08-30 20:33   ` Dave Jiang
@ 2023-10-24 16:16   ` Jonathan Cameron
  3 siblings, 0 replies; 97+ messages in thread
From: Jonathan Cameron @ 2023-10-24 16:16 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Navneet Singh, Fan Ni, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, linux-cxl, linux-kernel

On Mon, 28 Aug 2023 22:20:53 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> Per the CXL 3.0 specification software must check the Command Effects
> Log (CEL) to know if a device supports DC.  If the device does support
> DC the specifics of the DC Regions (0-7) are read through the mailbox.
> 
> Flag DC Device (DCD) commands in a device if they are supported.
> Subsequent patches will key off these bits to configure a DCD.
> 
> Co-developed-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes for v2
> [iweiny: new patch]
> ---
>  drivers/cxl/core/mbox.c | 38 +++++++++++++++++++++++++++++++++++---
>  drivers/cxl/cxlmem.h    | 15 +++++++++++++++
>  2 files changed, 50 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index f052d5f174ee..554ec97a7c39 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -111,6 +111,34 @@ static u8 security_command_sets[] = {
>  	0x46, /* Security Passthrough */
>  };
>  
Small note I noticed whilst rebasing this for some tests:
adding this in the middle of the security-command handling is
a bit odd. I'd move it up or down a few lines.

> +static bool cxl_is_dcd_command(u16 opcode)
> +{
> +#define CXL_MBOX_OP_DCD_CMDS 0x48
> +
> +	return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
> +}
> +
> +static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
> +					u16 opcode)
> +{
> +	switch (opcode) {
> +	case CXL_MBOX_OP_GET_DC_CONFIG:
> +		set_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
> +		set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_ADD_DC_RESPONSE:
> +		set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds);
> +		break;
> +	case CXL_MBOX_OP_RELEASE_DC:
> +		set_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
> +		break;
> +	default:
> +		break;
> +	}
> +}
> +
>  static bool cxl_is_security_command(u16 opcode)
>  {
>  	int i;
> @@ -677,9 +705,10 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>  		u16 opcode = le16_to_cpu(cel_entry[i].opcode);
>  		struct cxl_mem_command *cmd = cxl_mem_find_command(opcode);
>  
> -		if (!cmd && !cxl_is_poison_command(opcode)) {
> -			dev_dbg(dev,
> -				"Opcode 0x%04x unsupported by driver\n", opcode);
> +		if (!cmd && !cxl_is_poison_command(opcode) &&
> +		    !cxl_is_dcd_command(opcode)) {
> +			dev_dbg(dev, "Opcode 0x%04x unsupported by driver\n",
> +				opcode);
>  			continue;
>  		}
>  
> @@ -689,6 +718,9 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
>  		if (cxl_is_poison_command(opcode))
>  			cxl_set_poison_cmd_enabled(&mds->poison, opcode);
>  
> +		if (cxl_is_dcd_command(opcode))
> +			cxl_set_dcd_cmd_enabled(mds, opcode);
> +
>  		dev_dbg(dev, "Opcode 0x%04x enabled\n", opcode);
>  	}
>  }
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index adfba72445fc..5f2e65204bf9 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -247,6 +247,15 @@ struct cxl_event_state {
>  	struct mutex log_lock;
>  };
>  
> +/* Device enabled DCD commands */
> +enum dcd_cmd_enabled_bits {
> +	CXL_DCD_ENABLED_GET_CONFIG,
> +	CXL_DCD_ENABLED_GET_EXTENT_LIST,
> +	CXL_DCD_ENABLED_ADD_RESPONSE,
> +	CXL_DCD_ENABLED_RELEASE,
> +	CXL_DCD_ENABLED_MAX
> +};
> +
>  /* Device enabled poison commands */
>  enum poison_cmd_enabled_bits {
>  	CXL_POISON_ENABLED_LIST,
> @@ -436,6 +445,7 @@ struct cxl_dev_state {
>   *                (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
>   * @mbox_mutex: Mutex to synchronize mailbox access.
>   * @firmware_version: Firmware version for the memory device.
> + * @dcd_cmds: List of DCD commands implemented by memory device
>   * @enabled_cmds: Hardware commands found enabled in CEL.
>   * @exclusive_cmds: Commands that are kernel-internal only
>   * @total_bytes: sum of all possible capacities
> @@ -460,6 +470,7 @@ struct cxl_memdev_state {
>  	size_t lsa_size;
>  	struct mutex mbox_mutex; /* Protects device mailbox and firmware */
>  	char firmware_version[0x10];
> +	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
>  	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
>  	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
>  	u64 total_bytes;
> @@ -525,6 +536,10 @@ enum cxl_opcode {
>  	CXL_MBOX_OP_UNLOCK		= 0x4503,
>  	CXL_MBOX_OP_FREEZE_SECURITY	= 0x4504,
>  	CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE	= 0x4505,
> +	CXL_MBOX_OP_GET_DC_CONFIG	= 0x4800,
> +	CXL_MBOX_OP_GET_DC_EXTENT_LIST	= 0x4801,
> +	CXL_MBOX_OP_ADD_DC_RESPONSE	= 0x4802,
> +	CXL_MBOX_OP_RELEASE_DC		= 0x4803,
>  	CXL_MBOX_OP_MAX			= 0x10000
>  };
>  
> 



end of thread [newest: 2023-10-24 16:16 UTC]

Thread overview: 97+ messages
2023-08-29  5:20 [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
2023-08-29  5:20 ` [PATCH RFC v2 01/18] cxl/hdm: Debug, use decoder name function Ira Weiny
2023-08-29 14:03   ` Jonathan Cameron
2023-08-29 21:48     ` Fan Ni
2023-09-03  2:55     ` Ira Weiny
2023-08-30 20:32   ` Dave Jiang
2023-08-29  5:20 ` [PATCH RFC v2 02/18] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD) Ira Weiny
2023-08-29 14:07   ` Jonathan Cameron
2023-09-03  3:38     ` Ira Weiny
2023-08-29 21:49   ` Fan Ni
2023-08-30 20:33   ` Dave Jiang
2023-10-24 16:16   ` Jonathan Cameron
2023-08-29  5:20 ` [PATCH RFC v2 03/18] cxl/mem: Read Dynamic capacity configuration from the device ira.weiny
2023-08-29 14:37   ` Jonathan Cameron
2023-09-03 23:36     ` Ira Weiny
2023-08-30 21:01   ` Dave Jiang
2023-09-05  0:14     ` Ira Weiny
2023-09-08 20:23     ` Ira Weiny
2023-08-30 21:44   ` Fan Ni
2023-09-08 22:52     ` Ira Weiny
2023-09-12 21:32       ` Fan Ni
2023-09-07 15:46   ` Alison Schofield
2023-09-12  1:18     ` Ira Weiny
2023-09-08 12:46   ` Jørgen Hansen
2023-09-11 20:26     ` Ira Weiny
2023-08-29  5:20 ` [PATCH RFC v2 04/18] cxl/region: Add Dynamic Capacity decoder and region modes Ira Weiny
2023-08-29 14:39   ` Jonathan Cameron
2023-08-30 21:13   ` Dave Jiang
2023-08-31 17:00   ` Fan Ni
2023-08-29  5:20 ` [PATCH RFC v2 05/18] cxl/port: Add Dynamic Capacity mode support to endpoint decoders Ira Weiny
2023-08-29 14:49   ` Jonathan Cameron
2023-09-05  0:05     ` Ira Weiny
2023-08-31 17:25   ` Fan Ni
2023-09-08 23:26     ` Ira Weiny
2023-08-29  5:20 ` [PATCH RFC v2 06/18] cxl/port: Add Dynamic Capacity size " Ira Weiny
2023-08-29 15:09   ` Jonathan Cameron
2023-09-05  4:32     ` Ira Weiny
2023-08-29  5:20 ` [PATCH RFC v2 07/18] cxl/mem: Expose device dynamic capacity configuration ira.weiny
2023-08-29 15:14   ` Jonathan Cameron
2023-09-05 17:55     ` Fan Ni
2023-09-05 20:45     ` Ira Weiny
2023-08-30 22:46   ` Dave Jiang
2023-09-08 23:22     ` Ira Weiny
2023-08-29  5:20 ` [PATCH RFC v2 08/18] cxl/region: Add Dynamic Capacity CXL region support Ira Weiny
2023-08-29 15:19   ` Jonathan Cameron
2023-08-30 23:27   ` Dave Jiang
2023-09-06  4:36     ` Ira Weiny
2023-09-05 21:09   ` Fan Ni
2023-08-29  5:21 ` [PATCH RFC v2 09/18] cxl/mem: Read extents on memory device discovery Ira Weiny
2023-08-29 15:26   ` Jonathan Cameron
2023-08-30  0:16     ` Ira Weiny
2023-09-05 21:41     ` Ira Weiny
2023-08-29  5:21 ` [PATCH RFC v2 10/18] cxl/mem: Handle DCD add and release capacity events Ira Weiny
2023-08-29 15:59   ` Jonathan Cameron
2023-09-05 23:49     ` Ira Weiny
2023-08-31 17:28   ` Dave Jiang
2023-09-08 15:35     ` Ira Weiny
2023-08-29  5:21 ` [PATCH RFC v2 11/18] cxl/region: Expose DC extents on region driver load Ira Weiny
2023-08-29 16:20   ` Jonathan Cameron
2023-09-06  3:36     ` Ira Weiny
2023-08-31 18:38   ` Dave Jiang
2023-09-08 23:57     ` Ira Weiny
2023-08-29  5:21 ` [PATCH RFC v2 12/18] cxl/region: Notify regions of DC changes Ira Weiny
2023-08-29 16:40   ` Jonathan Cameron
2023-09-06  4:00     ` Ira Weiny
2023-09-18 13:56   ` Jørgen Hansen
2023-09-18 17:45     ` Ira Weiny
2023-08-29  5:21 ` [PATCH RFC v2 13/18] dax/bus: Factor out dev dax resize logic Ira Weiny
2023-08-30 11:27   ` Jonathan Cameron
2023-09-06  4:12     ` Ira Weiny
2023-08-31 21:48   ` Dave Jiang
2023-08-29  5:21 ` [PATCH RFC v2 14/18] dax/region: Support DAX device creation on dynamic DAX regions Ira Weiny
2023-08-30 11:50   ` Jonathan Cameron
2023-09-06  4:35     ` Ira Weiny
2023-09-12 16:49       ` Jonathan Cameron
2023-09-12 22:08         ` Ira Weiny
2023-09-12 22:35           ` Dan Williams
2023-09-13 17:30             ` Ira Weiny
2023-09-13 17:59               ` Dan Williams
2023-09-13 19:26                 ` Ira Weiny
2023-09-14 10:32                   ` Jonathan Cameron
2023-08-29  5:21 ` [PATCH RFC v2 15/18] cxl/mem: Trace Dynamic capacity Event Record ira.weiny
2023-08-29 16:46   ` Jonathan Cameron
2023-09-06  4:07     ` Ira Weiny
2023-08-29  5:21 ` [PATCH RFC v2 16/18] tools/testing/cxl: Make event logs dynamic Ira Weiny
2023-08-30 12:11   ` Jonathan Cameron
2023-09-06 21:15     ` Ira Weiny
2023-08-29  5:21 ` [PATCH RFC v2 17/18] tools/testing/cxl: Add DC Regions to mock mem data Ira Weiny
2023-08-30 12:20   ` Jonathan Cameron
2023-09-06 21:18     ` Ira Weiny
2023-08-31 23:19   ` Dave Jiang
2023-08-29  5:21 ` [PATCH RFC v2 18/18] tools/testing/cxl: Add Dynamic Capacity events Ira Weiny
2023-08-30 12:23   ` Jonathan Cameron
2023-09-06 21:39     ` Ira Weiny
2023-08-31 23:20   ` Dave Jiang
2023-09-07 21:01 ` [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD) Fan Ni
2023-09-12  1:44   ` Ira Weiny
