linux-cxl.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 00/14] CXL Region driver
@ 2022-01-28  0:26 Ben Widawsky
  2022-01-28  0:26 ` [PATCH v3 01/14] cxl/region: Add region creation ABI Ben Widawsky
                   ` (13 more replies)
  0 siblings, 14 replies; 70+ messages in thread
From: Ben Widawsky @ 2022-01-28  0:26 UTC (permalink / raw)
  To: linux-cxl
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

Major changes since v2:
- Clarify encoded region/granularity from raw values
- Rename "region" to cxlr everywhere
- Kconfig for the region driver
- Several small bug fixes

https://gitlab.com/bwidawsk/linux/-/tree/cxl_region-v3

Original commit message follows with minor updates for correctness.

---

This patch series introduces the CXL region driver as well as associated APIs in
CXL core. The region driver enables the creation of "regions" which is a concept
defined by the CXL 2.0 specification [1]. Region verification and programming
state are owned by the cxl_region driver (implemented in the cxl_region module).
It relies on cxl_mem to determine if devices are CXL routed, and cxl_port to
actually handle the programming of the HDM decoders. Much of the region driver
is an implementation of algorithms described in the CXL Type 3 Memory Device
Software Guide [2].

The region driver will be responsible for configuring regions found on
persistent capacities in the Label Storage Area (LSA), it will also enumerate
regions configured by BIOS, usually volatile capacities, and will allow for
dynamic region creation (which can then be stored in the LSA). It is the primary
consumer of the CXL Port [3] and CXL Mem drivers introduced previously [4].

The patches for the region driver could be squashed. They're broken out to aid
review and because that's the order they were implemented in. My preference is
to keep those as they are.

Some things are still missing and will be worked on while these are reviewed (in
priority order):
1. Volatile regions creation and enumeration (Have a plan)
2. multi-level switches
3. Decoder programming restrictions (No plan). The one know restriction I've
   missed is to disallow programming HDM decoders that aren't in incremental
   system physical address ranges.
4. CXL region teardown -> nd_region teardown
5. Stress testing

[1]: https://www.computeexpresslink.org/download-the-specification
[2]: https://cdrdv2.intel.com/v1/dl/getContent/643805?wapkw=CXL%20memory%20device%20sw%20guide
[3]: https://lore.kernel.org/linux-cxl/164298424635.3018233.9356036382052246767.stgit@dwillia2-desk3.amr.corp.intel.com/T/#u
[4]: https://lore.kernel.org/linux-cxl/164298429450.3018233.13269591903486669825.stgit@dwillia2-desk3.amr.corp.intel.com/T/#u

---

Ben Widawsky (14):
  cxl/region: Add region creation ABI
  cxl/region: Introduce concept of region configuration
  cxl/mem: Cache port created by the mem dev
  cxl/region: Introduce a cxl_region driver
  cxl/acpi: Handle address space allocation
  cxl/region: Address space allocation
  cxl/region: Implement XHB verification
  cxl/region: HB port config verification
  cxl/region: Add infrastructure for decoder programming
  cxl/region: Collect host bridge decoders
  cxl/region: Add support for single switch level
  cxl: Program decoders for regions
  cxl/pmem: Convert nvdimm bridge API to use dev
  cxl/region: Create an nd_region

 .clang-format                                 |   3 +
 Documentation/ABI/testing/sysfs-bus-cxl       |  64 ++
 .../driver-api/cxl/memory-devices.rst         |  14 +
 drivers/cxl/Kconfig                           |   5 +
 drivers/cxl/Makefile                          |   2 +
 drivers/cxl/acpi.c                            |  30 +
 drivers/cxl/core/Makefile                     |   1 +
 drivers/cxl/core/core.h                       |   4 +
 drivers/cxl/core/hdm.c                        | 209 +++++
 drivers/cxl/core/pmem.c                       |  28 +-
 drivers/cxl/core/port.c                       | 105 ++-
 drivers/cxl/core/region.c                     | 529 +++++++++++
 drivers/cxl/cxl.h                             |  76 +-
 drivers/cxl/cxlmem.h                          |   9 +
 drivers/cxl/mem.c                             |  35 +-
 drivers/cxl/pmem.c                            |   2 +-
 drivers/cxl/port.c                            |  62 +-
 drivers/cxl/region.c                          | 866 ++++++++++++++++++
 drivers/cxl/region.h                          |  47 +
 tools/testing/cxl/Kbuild                      |   1 +
 20 files changed, 2077 insertions(+), 15 deletions(-)
 create mode 100644 drivers/cxl/core/region.c
 create mode 100644 drivers/cxl/region.c
 create mode 100644 drivers/cxl/region.h


base-commit: e783362eb54cd99b2cac8b3a9aeac942e6f6ac07
prerequisite-patch-id: 90de8aefc2999f55c7534fefa971d95653c4220c
prerequisite-patch-id: 32a5b56d83bf3372b6ed4b40f621eafb33a7201b
prerequisite-patch-id: f827831bb7a23e0789d16d7b8979b165253c6301
prerequisite-patch-id: 08b8febd42d3ab508b618937473807e553589e36
prerequisite-patch-id: 18049f47c948582c1dc26348d9765c934eb82a75
prerequisite-patch-id: 8f66d52af297449fa007a0ba963c5239b153ef5b
prerequisite-patch-id: 3e2e86cbc2631b99c1b5c0179f35799d3df31f91
prerequisite-patch-id: b88becd4997320a34e918cdef1b620e6dea14917
prerequisite-patch-id: c61df81018f2a93b87d10965b418afa659d9d6d6
prerequisite-patch-id: 73b31df62e00bb7af7082e2ca4d40023a7962abd
prerequisite-patch-id: 207abfcd5028c41df8875ee795a8ab697cd7c688
prerequisite-patch-id: 26978f021b3b0f4a6734ef8c0100c724dc88742e
prerequisite-patch-id: bf229ca5aab5c5dffe69ba5b9380749a66cf20ba
prerequisite-patch-id: 20ebefe1acfdecf184d048cb605368e1863646c1
prerequisite-patch-id: f34c26e902dd868dc1c3ef8ba8246cc063cf991a
prerequisite-patch-id: bcc59db1c6528244b649ced35eab015699c410fa
prerequisite-patch-id: 2f9f6cfbd6b73a563498c6b6d721bbc169a0a414
prerequisite-patch-id: dc8fb216dc8ff4f813bfc689273d9c5f5124e789
prerequisite-patch-id: da83e8074d339426c886c481070366afb189b561
prerequisite-patch-id: 501fe71f19065ba9f31cabd86756fedda853c414
prerequisite-patch-id: ceeef31c2ca85a426d507563b886347d28acc322
prerequisite-patch-id: f876c09942ae5a3223a36329c23262a05b2669f4
prerequisite-patch-id: 44fa61c5569614c8d9df854cde6fedfc2bc78c12
prerequisite-patch-id: 04ad90e1bbb5646125c4633fbe5341f572bc9548
prerequisite-patch-id: f4dbf89d99917f50c30e1ee56bfeff8d8dd6b0f3
prerequisite-patch-id: 2d7c3aacefcb8133897e3256ed6f76952555c2f1
prerequisite-patch-id: 7454df4bdb07381f02717845eb3b17011a89ab18
prerequisite-patch-id: 52ec0dfd506bb6a3f8d11a914cfc7320193a6445
prerequisite-patch-id: 9de14fa54cfba412e09d7b41f392c0f6d55d6a01
prerequisite-patch-id: ae39a482c2067a1f04baee5ce9131901e6d359ec
prerequisite-patch-id: 446240d2ed24d9e55ac9edfc65b511495659464a
prerequisite-patch-id: ba6bf6450e47df5e95e2fb1780d9edd126bc0eb2
prerequisite-patch-id: 3c0865b6dd062e677ef8e160e14f823622eafb9f
prerequisite-patch-id: 4503f5507cbdeb0770b420b4c26d87be2b173813
prerequisite-patch-id: c5a8cbda77c95b052040770eca0dc5b99876dc66
prerequisite-patch-id: e064003a6c48131fac401d9a48d4d6204fea6123
prerequisite-patch-id: b4c7213971c981dd5ca0fda992643a7c61548fef
prerequisite-patch-id: 2bd09e27f8a8df144a8ad386822390c87ef46ec5
prerequisite-patch-id: 60b3fafbd3bfa225405a6762bdb6b89c044b0b86
prerequisite-patch-id: 620068ae417bf0784809107e0dae3ec9793632df
prerequisite-patch-id: c3415fe92e29cd4afc508f8caf31cb914be09261
prerequisite-patch-id: 4c01f305244036afa9aaa918c8215659327dd0f3
prerequisite-patch-id: 034aeb7e124c5a34785c963bf014aa5380f00a2e
prerequisite-patch-id: 26f18c2ca586e6d734cd319e0e7f24398b17217f
prerequisite-patch-id: ef97136efb8c077232fe39a0465389565803a7b7
prerequisite-patch-id: 6a63e03117287b748cfec00e2c16a41ed38f4f9a
-- 
2.35.0


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH v3 01/14] cxl/region: Add region creation ABI
  2022-01-28  0:26 [PATCH v3 00/14] CXL Region driver Ben Widawsky
@ 2022-01-28  0:26 ` Ben Widawsky
  2022-01-28 18:14   ` Dan Williams
                     ` (2 more replies)
  2022-01-28  0:26 ` [PATCH v3 02/14] cxl/region: Introduce concept of region configuration Ben Widawsky
                   ` (12 subsequent siblings)
  13 siblings, 3 replies; 70+ messages in thread
From: Ben Widawsky @ 2022-01-28  0:26 UTC (permalink / raw)
  To: linux-cxl
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

Regions are created as a child of the decoder that encompasses an
address space with constraints. Regions have a number of attributes that
must be configured before the region can be activated.

The ABI is not meant to be secure, but is meant to avoid accidental
races. As a result, a buggy process may create a region by name that was
allocated by a different process. However, multiple processes which are
trying not to race with each other shouldn't need special
synchronization to do so.

// Allocate a new region name
region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)

// Create a new region by name
echo $region > /sys/bus/cxl/devices/decoder0.0/create_region

// Region now exists in sysfs
stat -t /sys/bus/cxl/devices/decoder0.0/$region

// Delete the region, and name
echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>

---
Changes since v2:
- Rename 'region' variables to 'cxlr'
- Update ABI docs for possible actual upstream version
---
 Documentation/ABI/testing/sysfs-bus-cxl       |  24 ++
 .../driver-api/cxl/memory-devices.rst         |  11 +
 drivers/cxl/core/Makefile                     |   1 +
 drivers/cxl/core/core.h                       |   3 +
 drivers/cxl/core/port.c                       |  16 ++
 drivers/cxl/core/region.c                     | 208 ++++++++++++++++++
 drivers/cxl/cxl.h                             |   9 +
 drivers/cxl/region.h                          |  38 ++++
 tools/testing/cxl/Kbuild                      |   1 +
 9 files changed, 311 insertions(+)
 create mode 100644 drivers/cxl/core/region.c
 create mode 100644 drivers/cxl/region.h

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 7c2b846521f3..dcc728458936 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -163,3 +163,27 @@ Description:
 		memory (type-3). The 'target_type' attribute indicates the
 		current setting which may dynamically change based on what
 		memory regions are activated in this decode hierarchy.
+
+What:		/sys/bus/cxl/devices/decoderX.Y/create_region
+Date:		August, 2021
+KernelVersion:	v5.18
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		Creates a new CXL region. Writing a value of the form
+		"regionX.Y:Z" will create a new uninitialized region that will
+		be mapped by the CXL decoderX.Y. Reading from this node will
+		return a newly allocated region name. In order to create a
+		region (writing) you must use a value returned from reading the
+		node. Regions must be created for root decoders, and must
+		subsequently configured and bound to a region driver before they
+		can be used.
+
+What:		/sys/bus/cxl/devices/decoderX.Y/delete_region
+Date:		August, 2021
+KernelVersion:	v5.18
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		Deletes the named region. A region must be unbound from the
+		region driver before being deleted. The attributes expects a
+		region in the form "regionX.Y:Z". The region's name, allocated
+		by reading create_region, will also be released.
diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst
index db476bb170b6..66ddc58a21b1 100644
--- a/Documentation/driver-api/cxl/memory-devices.rst
+++ b/Documentation/driver-api/cxl/memory-devices.rst
@@ -362,6 +362,17 @@ CXL Core
 .. kernel-doc:: drivers/cxl/core/mbox.c
    :doc: cxl mbox
 
+CXL Regions
+-----------
+.. kernel-doc:: drivers/cxl/region.h
+   :identifiers:
+
+.. kernel-doc:: drivers/cxl/core/region.c
+   :doc: cxl core region
+
+.. kernel-doc:: drivers/cxl/core/region.c
+   :identifiers:
+
 External Interfaces
 ===================
 
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 6d37cd78b151..39ce8f2f2373 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -4,6 +4,7 @@ obj-$(CONFIG_CXL_BUS) += cxl_core.o
 ccflags-y += -I$(srctree)/drivers/cxl
 cxl_core-y := port.o
 cxl_core-y += pmem.o
+cxl_core-y += region.o
 cxl_core-y += regs.o
 cxl_core-y += memdev.o
 cxl_core-y += mbox.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index efbaa851929d..35fd08d560e2 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -10,6 +10,9 @@ extern const struct device_type cxl_memdev_type;
 
 extern struct attribute_group cxl_base_attribute_group;
 
+extern struct device_attribute dev_attr_create_region;
+extern struct device_attribute dev_attr_delete_region;
+
 struct cxl_send_command;
 struct cxl_mem_query_commands;
 int cxl_query_cmd(struct cxl_memdev *cxlmd,
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 631dec0fa79e..0826208b2bdf 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -215,6 +215,8 @@ static struct attribute_group cxl_decoder_base_attribute_group = {
 };
 
 static struct attribute *cxl_decoder_root_attrs[] = {
+	&dev_attr_create_region.attr,
+	&dev_attr_delete_region.attr,
 	&dev_attr_cap_pmem.attr,
 	&dev_attr_cap_ram.attr,
 	&dev_attr_cap_type2.attr,
@@ -267,11 +269,23 @@ static const struct attribute_group *cxl_decoder_endpoint_attribute_groups[] = {
 	NULL,
 };
 
+static int delete_region(struct device *dev, void *arg)
+{
+	struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
+
+	return cxl_delete_region(cxld, dev_name(dev));
+}
+
 static void cxl_decoder_release(struct device *dev)
 {
 	struct cxl_decoder *cxld = to_cxl_decoder(dev);
 	struct cxl_port *port = to_cxl_port(dev->parent);
 
+	device_for_each_child(&cxld->dev, cxld, delete_region);
+
+	dev_WARN_ONCE(dev, !ida_is_empty(&cxld->region_ida),
+		      "Lost track of a region");
+
 	ida_free(&port->decoder_ida, cxld->id);
 	kfree(cxld);
 }
@@ -1194,6 +1208,8 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
 	cxld->target_type = CXL_DECODER_EXPANDER;
 	cxld->platform_res = (struct resource)DEFINE_RES_MEM(0, 0);
 
+	ida_init(&cxld->region_ida);
+
 	return cxld;
 err:
 	kfree(cxld);
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
new file mode 100644
index 000000000000..1a448543db0d
--- /dev/null
+++ b/drivers/cxl/core/region.c
@@ -0,0 +1,208 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2021 Intel Corporation. All rights reserved. */
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/idr.h>
+#include <region.h>
+#include <cxl.h>
+#include "core.h"
+
+/**
+ * DOC: cxl core region
+ *
+ * Regions are managed through the Linux device model. Each region instance is a
+ * unique struct device. CXL core provides functionality to create, destroy, and
+ * configure regions. This is all implemented here. Binding a region
+ * (programming the hardware) is handled by a separate region driver.
+ */
+
+static void cxl_region_release(struct device *dev);
+
+static const struct device_type cxl_region_type = {
+	.name = "cxl_region",
+	.release = cxl_region_release,
+};
+
+static ssize_t create_region_show(struct device *dev,
+				  struct device_attribute *attr, char *buf)
+{
+	struct cxl_port *port = to_cxl_port(dev->parent);
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+	int rc;
+
+	if (dev_WARN_ONCE(dev, !is_root_decoder(dev),
+			  "Invalid decoder selected for region.")) {
+		return -ENODEV;
+	}
+
+	rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);
+	if (rc < 0) {
+		dev_err(&cxld->dev, "Couldn't get a new id\n");
+		return rc;
+	}
+
+	return sysfs_emit(buf, "region%d.%d:%d\n", port->id, cxld->id, rc);
+}
+
+static ssize_t create_region_store(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t len)
+{
+	struct cxl_port *port = to_cxl_port(dev->parent);
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+	int decoder_id, port_id, region_id;
+	struct cxl_region *cxlr;
+	ssize_t rc;
+
+	if (sscanf(buf, "region%d.%d:%d", &port_id, &decoder_id, &region_id) != 3)
+		return -EINVAL;
+
+	if (decoder_id != cxld->id)
+		return -EINVAL;
+
+	if (port_id != port->id)
+		return -EINVAL;
+
+	cxlr = cxl_alloc_region(cxld, region_id);
+	if (IS_ERR(cxlr))
+		return PTR_ERR(cxlr);
+
+	rc = cxl_add_region(cxld, cxlr);
+	if (rc) {
+		kfree(cxlr);
+		return rc;
+	}
+
+	return len;
+}
+DEVICE_ATTR_RW(create_region);
+
+static ssize_t delete_region_store(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t len)
+{
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+	int rc;
+
+	rc = cxl_delete_region(cxld, buf);
+	if (rc)
+		return rc;
+
+	return len;
+}
+DEVICE_ATTR_WO(delete_region);
+
+struct cxl_region *to_cxl_region(struct device *dev)
+{
+	if (dev_WARN_ONCE(dev, dev->type != &cxl_region_type,
+			  "not a cxl_region device\n"))
+		return NULL;
+
+	return container_of(dev, struct cxl_region, dev);
+}
+EXPORT_SYMBOL_GPL(to_cxl_region);
+
+static void cxl_region_release(struct device *dev)
+{
+	struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	ida_free(&cxld->region_ida, cxlr->id);
+	kfree(cxlr);
+}
+
+struct cxl_region *cxl_alloc_region(struct cxl_decoder *cxld, int id)
+{
+	struct cxl_region *cxlr;
+
+	cxlr = kzalloc(sizeof(*cxlr), GFP_KERNEL);
+	if (!cxlr)
+		return ERR_PTR(-ENOMEM);
+
+	cxlr->id = id;
+
+	return cxlr;
+}
+
+/**
+ * cxl_add_region - Adds a region to a decoder
+ * @cxld: Parent decoder.
+ * @cxlr: Region to be added to the decoder.
+ *
+ * This is the second step of region initialization. Regions exist within an
+ * address space which is mapped by a @cxld. That @cxld must be a root decoder,
+ * and it enforces constraints upon the region as it is configured.
+ *
+ * Return: 0 if the region was added to the @cxld, else returns negative error
+ * code. The region will be named "regionX.Y.Z" where X is the port, Y is the
+ * decoder id, and Z is the region number.
+ */
+int cxl_add_region(struct cxl_decoder *cxld, struct cxl_region *cxlr)
+{
+	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
+	struct device *dev = &cxlr->dev;
+	int rc;
+
+	device_initialize(dev);
+	dev->parent = &cxld->dev;
+	device_set_pm_not_required(dev);
+	dev->bus = &cxl_bus_type;
+	dev->type = &cxl_region_type;
+	rc = dev_set_name(dev, "region%d.%d:%d", port->id, cxld->id, cxlr->id);
+	if (rc)
+		goto err;
+
+	rc = device_add(dev);
+	if (rc)
+		goto err;
+
+	dev_dbg(dev, "Added to %s\n", dev_name(&cxld->dev));
+
+	return 0;
+
+err:
+	put_device(dev);
+	return rc;
+}
+
+static struct cxl_region *cxl_find_region_by_name(struct cxl_decoder *cxld,
+						  const char *name)
+{
+	struct device *region_dev;
+
+	region_dev = device_find_child_by_name(&cxld->dev, name);
+	if (!region_dev)
+		return ERR_PTR(-ENOENT);
+
+	return to_cxl_region(region_dev);
+}
+
+/**
+ * cxl_delete_region - Deletes a region
+ * @cxld: Parent decoder
+ * @region_name: Named region, ie. regionX.Y:Z
+ */
+int cxl_delete_region(struct cxl_decoder *cxld, const char *region_name)
+{
+	struct cxl_region *cxlr;
+
+	device_lock(&cxld->dev);
+
+	cxlr = cxl_find_region_by_name(cxld, region_name);
+	if (IS_ERR(cxlr)) {
+		device_unlock(&cxld->dev);
+		return PTR_ERR(cxlr);
+	}
+
+	dev_dbg(&cxld->dev, "Requested removal of %s from %s\n",
+		dev_name(&cxlr->dev), dev_name(&cxld->dev));
+
+	device_unregister(&cxlr->dev);
+	device_unlock(&cxld->dev);
+
+	put_device(&cxlr->dev);
+
+	return 0;
+}
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 13fb06849199..b9f0099c1f39 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -221,6 +221,7 @@ enum cxl_decoder_type {
  * @target_type: accelerator vs expander (type2 vs type3) selector
  * @flags: memory type capabilities and locking
  * @target_lock: coordinate coherent reads of the target list
+ * @region_ida: allocator for region ids.
  * @nr_targets: number of elements in @target
  * @target: active ordered target list in current decoder configuration
  */
@@ -236,6 +237,7 @@ struct cxl_decoder {
 	enum cxl_decoder_type target_type;
 	unsigned long flags;
 	seqlock_t target_lock;
+	struct ida region_ida;
 	int nr_targets;
 	struct cxl_dport *target[];
 };
@@ -323,6 +325,13 @@ struct cxl_ep {
 	struct list_head list;
 };
 
+bool is_cxl_region(struct device *dev);
+struct cxl_region *to_cxl_region(struct device *dev);
+struct cxl_region *cxl_alloc_region(struct cxl_decoder *cxld,
+				    int interleave_ways);
+int cxl_add_region(struct cxl_decoder *cxld, struct cxl_region *cxlr);
+int cxl_delete_region(struct cxl_decoder *cxld, const char *region);
+
 static inline bool is_cxl_root(struct cxl_port *port)
 {
 	return port->uport == port->dev.parent;
diff --git a/drivers/cxl/region.h b/drivers/cxl/region.h
new file mode 100644
index 000000000000..eb1249e3c1d4
--- /dev/null
+++ b/drivers/cxl/region.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright(c) 2021 Intel Corporation. */
+#ifndef __CXL_REGION_H__
+#define __CXL_REGION_H__
+
+#include <linux/uuid.h>
+
+#include "cxl.h"
+
+/**
+ * struct cxl_region - CXL region
+ * @dev: This region's device.
+ * @id: This regions id. Id is globally unique across all regions.
+ * @list: Node in decoder's region list.
+ * @res: Resource this region carves out of the platform decode range.
+ * @config: HDM decoder program config
+ * @config.size: Size of the region determined from LSA or userspace.
+ * @config.uuid: The UUID for this region.
+ * @config.interleave_ways: Number of interleave ways this region is configured for.
+ * @config.interleave_granularity: Interleave granularity of region
+ * @config.targets: The memory devices comprising the region.
+ */
+struct cxl_region {
+	struct device dev;
+	int id;
+	struct list_head list;
+	struct resource *res;
+
+	struct {
+		u64 size;
+		uuid_t uuid;
+		int interleave_ways;
+		int interleave_granularity;
+		struct cxl_memdev *targets[CXL_DECODER_MAX_INTERLEAVE];
+	} config;
+};
+
+#endif
diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
index 82e49ab0937d..3fe6d34e6d59 100644
--- a/tools/testing/cxl/Kbuild
+++ b/tools/testing/cxl/Kbuild
@@ -46,6 +46,7 @@ cxl_core-y += $(CXL_CORE_SRC)/memdev.o
 cxl_core-y += $(CXL_CORE_SRC)/mbox.o
 cxl_core-y += $(CXL_CORE_SRC)/pci.o
 cxl_core-y += $(CXL_CORE_SRC)/hdm.o
+cxl_core-y += $(CXL_CORE_SRC)/region.o
 cxl_core-y += config_check.o
 
 obj-m += test/
-- 
2.35.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 02/14] cxl/region: Introduce concept of region configuration
  2022-01-28  0:26 [PATCH v3 00/14] CXL Region driver Ben Widawsky
  2022-01-28  0:26 ` [PATCH v3 01/14] cxl/region: Add region creation ABI Ben Widawsky
@ 2022-01-28  0:26 ` Ben Widawsky
  2022-01-29  0:25   ` Dan Williams
  2022-01-28  0:26 ` [PATCH v3 03/14] cxl/mem: Cache port created by the mem dev Ben Widawsky
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-01-28  0:26 UTC (permalink / raw)
  To: linux-cxl
  Cc: patches, Ben Widawsky, kernel test robot, Alison Schofield,
	Dan Williams, Ira Weiny, Jonathan Cameron, Vishal Verma,
	Bjorn Helgaas, nvdimm, linux-pci

The region creation APIs create a vacant region. Configuring the region
works in the same way as similar subsystems such as devdax. Sysfs attrs
will be provided to allow userspace to configure the region.  Finally
once all configuration is complete, userspace may activate the region.

Introduced here are the most basic attributes needed to configure a
region. Details of these attribute are described in the ABI
Documentation. Sanity checking of configuration parameters are done at
region binding time. This consolidates all such logic in one place,
rather than being strewn across multiple places.

A example is provided below:

/sys/bus/cxl/devices/region0.0:0
├── interleave_granularity
├── interleave_ways
├── offset
├── size
├── subsystem -> ../../../../../../bus/cxl
├── target0
├── uevent
└── uuid

Reported-by: kernel test robot <lkp@intel.com> (v2)
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
 drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
 2 files changed, 340 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index dcc728458936..50ba5018014d 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -187,3 +187,43 @@ Description:
 		region driver before being deleted. The attributes expects a
 		region in the form "regionX.Y:Z". The region's name, allocated
 		by reading create_region, will also be released.
+
+What:		/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset
+Date:		August, 2021
+KernelVersion:	v5.18
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		(RO) A region resides within an address space that is claimed by
+		a decoder. Region space allocation is handled by the driver, but
+		the offset may be read by userspace tooling in order to
+		determine fragmentation, and available size for new regions.
+
+What:
+/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
+Date:		August, 2021
+KernelVersion:	v5.18
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		(RW) Configuring regions requires a minimal set of parameters in
+		order for the subsequent bind operation to succeed. The
+		following parameters are defined:
+
+		==	========================================================
+		interleave_granularity Mandatory. Number of consecutive bytes
+			each device in the interleave set will claim. The
+			possible interleave granularity values are determined by
+			the CXL spec and the participating devices.
+		interleave_ways Mandatory. Number of devices participating in the
+			region. Each device will provide 1/interleave of storage
+			for the region.
+		size	Manadatory. Phsyical address space the region will
+			consume.
+		target  Mandatory. Memory devices are the backing storage for a
+			region. There will be N targets based on the number of
+			interleave ways that the top level decoder is configured
+			for. Each target must be set with a memdev device ie.
+			'mem1'. This attribute only becomes available after
+			setting the 'interleave' attribute.
+		uuid	Optional. A unique identifier for the region. If none is
+			selected, the kernel will create one.
+		==	========================================================
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 1a448543db0d..3b48e0469fc7 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -3,9 +3,12 @@
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/device.h>
 #include <linux/module.h>
+#include <linux/sizes.h>
 #include <linux/slab.h>
+#include <linux/uuid.h>
 #include <linux/idr.h>
 #include <region.h>
+#include <cxlmem.h>
 #include <cxl.h>
 #include "core.h"
 
@@ -18,11 +21,305 @@
  * (programming the hardware) is handled by a separate region driver.
  */
 
+struct cxl_region *to_cxl_region(struct device *dev);
+static const struct attribute_group region_interleave_group;
+
+static bool is_region_active(struct cxl_region *cxlr)
+{
+	/* TODO: Regions can't be activated yet. */
+	return false;
+}
+
+static void remove_target(struct cxl_region *cxlr, int target)
+{
+	struct cxl_memdev *cxlmd;
+
+	cxlmd = cxlr->config.targets[target];
+	if (cxlmd)
+		put_device(&cxlmd->dev);
+	cxlr->config.targets[target] = NULL;
+}
+
+static ssize_t interleave_ways_show(struct device *dev,
+				    struct device_attribute *attr, char *buf)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
+}
+
+static ssize_t interleave_ways_store(struct device *dev,
+				     struct device_attribute *attr,
+				     const char *buf, size_t len)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+	int ret, prev_iw;
+	int val;
+
+	prev_iw = cxlr->config.interleave_ways;
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+	if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
+		return -EINVAL;
+
+	cxlr->config.interleave_ways = val;
+
+	ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
+	if (ret < 0)
+		goto err;
+
+	sysfs_notify(&dev->kobj, NULL, "target_interleave");
+
+	while (prev_iw > cxlr->config.interleave_ways)
+		remove_target(cxlr, --prev_iw);
+
+	return len;
+
+err:
+	cxlr->config.interleave_ways = prev_iw;
+	return ret;
+}
+static DEVICE_ATTR_RW(interleave_ways);
+
+static ssize_t interleave_granularity_show(struct device *dev,
+					   struct device_attribute *attr,
+					   char *buf)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	return sysfs_emit(buf, "%d\n", cxlr->config.interleave_granularity);
+}
+
+static ssize_t interleave_granularity_store(struct device *dev,
+					    struct device_attribute *attr,
+					    const char *buf, size_t len)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+	int val, ret;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+	cxlr->config.interleave_granularity = val;
+
+	return len;
+}
+static DEVICE_ATTR_RW(interleave_granularity);
+
+static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
+			   char *buf)
+{
+	struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
+	struct cxl_region *cxlr = to_cxl_region(dev);
+	resource_size_t offset;
+
+	if (!cxlr->res)
+		return sysfs_emit(buf, "\n");
+
+	offset = cxld->platform_res.start - cxlr->res->start;
+
+	return sysfs_emit(buf, "%pa\n", &offset);
+}
+static DEVICE_ATTR_RO(offset);
+
+static ssize_t size_show(struct device *dev, struct device_attribute *attr,
+			 char *buf)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	return sysfs_emit(buf, "%llu\n", cxlr->config.size);
+}
+
+static ssize_t size_store(struct device *dev, struct device_attribute *attr,
+			  const char *buf, size_t len)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+	unsigned long long val;
+	ssize_t rc;
+
+	rc = kstrtoull(buf, 0, &val);
+	if (rc)
+		return rc;
+
+	device_lock(&cxlr->dev);
+	if (is_region_active(cxlr))
+		rc = -EBUSY;
+	else
+		cxlr->config.size = val;
+	device_unlock(&cxlr->dev);
+
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(size);
+
+static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
+			 char *buf)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
+}
+
+static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
+			  const char *buf, size_t len)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+	ssize_t rc;
+
+	if (len != UUID_STRING_LEN + 1)
+		return -EINVAL;
+
+	device_lock(&cxlr->dev);
+	if (is_region_active(cxlr))
+		rc = -EBUSY;
+	else
+		rc = uuid_parse(buf, &cxlr->config.uuid);
+	device_unlock(&cxlr->dev);
+
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(uuid);
+
+static struct attribute *region_attrs[] = {
+	&dev_attr_interleave_ways.attr,
+	&dev_attr_interleave_granularity.attr,
+	&dev_attr_offset.attr,
+	&dev_attr_size.attr,
+	&dev_attr_uuid.attr,
+	NULL,
+};
+
+static const struct attribute_group region_group = {
+	.attrs = region_attrs,
+};
+
+static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
+{
+	int ret;
+
+	device_lock(&cxlr->dev);
+	if (!cxlr->config.targets[n])
+		ret = sysfs_emit(buf, "\n");
+	else
+		ret = sysfs_emit(buf, "%s\n",
+				 dev_name(&cxlr->config.targets[n]->dev));
+	device_unlock(&cxlr->dev);
+
+	return ret;
+}
+
+static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
+			  size_t len)
+{
+	struct device *memdev_dev;
+	struct cxl_memdev *cxlmd;
+
+	device_lock(&cxlr->dev);
+
+	if (len == 1 || cxlr->config.targets[n])
+		remove_target(cxlr, n);
+
+	/* Remove target special case */
+	if (len == 1) {
+		device_unlock(&cxlr->dev);
+		return len;
+	}
+
+	memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
+	if (!memdev_dev) {
+		device_unlock(&cxlr->dev);
+		return -ENOENT;
+	}
+
+	/* reference to memdev held until target is unset or region goes away */
+
+	cxlmd = to_cxl_memdev(memdev_dev);
+	cxlr->config.targets[n] = cxlmd;
+
+	device_unlock(&cxlr->dev);
+
+	return len;
+}
+
+#define TARGET_ATTR_RW(n)                                                      \
+	static ssize_t target##n##_show(                                       \
+		struct device *dev, struct device_attribute *attr, char *buf)  \
+	{                                                                      \
+		return show_targetN(to_cxl_region(dev), buf, (n));             \
+	}                                                                      \
+	static ssize_t target##n##_store(struct device *dev,                   \
+					 struct device_attribute *attr,        \
+					 const char *buf, size_t len)          \
+	{                                                                      \
+		return set_targetN(to_cxl_region(dev), buf, (n), len);         \
+	}                                                                      \
+	static DEVICE_ATTR_RW(target##n)
+
+TARGET_ATTR_RW(0);
+TARGET_ATTR_RW(1);
+TARGET_ATTR_RW(2);
+TARGET_ATTR_RW(3);
+TARGET_ATTR_RW(4);
+TARGET_ATTR_RW(5);
+TARGET_ATTR_RW(6);
+TARGET_ATTR_RW(7);
+TARGET_ATTR_RW(8);
+TARGET_ATTR_RW(9);
+TARGET_ATTR_RW(10);
+TARGET_ATTR_RW(11);
+TARGET_ATTR_RW(12);
+TARGET_ATTR_RW(13);
+TARGET_ATTR_RW(14);
+TARGET_ATTR_RW(15);
+
+static struct attribute *interleave_attrs[] = {
+	&dev_attr_target0.attr,
+	&dev_attr_target1.attr,
+	&dev_attr_target2.attr,
+	&dev_attr_target3.attr,
+	&dev_attr_target4.attr,
+	&dev_attr_target5.attr,
+	&dev_attr_target6.attr,
+	&dev_attr_target7.attr,
+	&dev_attr_target8.attr,
+	&dev_attr_target9.attr,
+	&dev_attr_target10.attr,
+	&dev_attr_target11.attr,
+	&dev_attr_target12.attr,
+	&dev_attr_target13.attr,
+	&dev_attr_target14.attr,
+	&dev_attr_target15.attr,
+	NULL,
+};
+
+static umode_t visible_targets(struct kobject *kobj, struct attribute *a, int n)
+{
+	struct device *dev = container_of(kobj, struct device, kobj);
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	if (n < cxlr->config.interleave_ways)
+		return a->mode;
+	return 0;
+}
+
+static const struct attribute_group region_interleave_group = {
+	.attrs = interleave_attrs,
+	.is_visible = visible_targets,
+};
+
+static const struct attribute_group *region_groups[] = {
+	&region_group,
+	&region_interleave_group,
+	NULL,
+};
+
 static void cxl_region_release(struct device *dev);
 
 static const struct device_type cxl_region_type = {
 	.name = "cxl_region",
 	.release = cxl_region_release,
+	.groups = region_groups
 };
 
 static ssize_t create_region_show(struct device *dev,
@@ -108,8 +405,11 @@ static void cxl_region_release(struct device *dev)
 {
 	struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
 	struct cxl_region *cxlr = to_cxl_region(dev);
+	int i;
 
 	ida_free(&cxld->region_ida, cxlr->id);
+	for (i = 0; i < cxlr->config.interleave_ways; i++)
+		remove_target(cxlr, i);
 	kfree(cxlr);
 }
 
-- 
2.35.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 03/14] cxl/mem: Cache port created by the mem dev
  2022-01-28  0:26 [PATCH v3 00/14] CXL Region driver Ben Widawsky
  2022-01-28  0:26 ` [PATCH v3 01/14] cxl/region: Add region creation ABI Ben Widawsky
  2022-01-28  0:26 ` [PATCH v3 02/14] cxl/region: Introduce concept of region configuration Ben Widawsky
@ 2022-01-28  0:26 ` Ben Widawsky
  2022-02-17  1:20   ` Dan Williams
  2022-01-28  0:26 ` [PATCH v3 04/14] cxl/region: Introduce a cxl_region driver Ben Widawsky
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-01-28  0:26 UTC (permalink / raw)
  To: linux-cxl
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

Since region programming sees all components in the topology as a port,
it's required that endpoints are treated equally. The easiest way to go
from endpoint to port is to simply cache it at creation time.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>

---
Changes since v2:
- Rebased on Dan's latest port/mem changes
- Keep a reference to the port until the memdev goes away
- add action to release device reference for the port
---
 drivers/cxl/cxlmem.h |  2 ++
 drivers/cxl/mem.c    | 35 ++++++++++++++++++++++++++++-------
 2 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 7ba0edb4a1ab..2b8c66616d4e 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -37,6 +37,7 @@
  * @id: id number of this memdev instance.
  * @detach_work: active memdev lost a port in its ancestry
  * @component_reg_phys: register base of component registers
+ * @port: The port created by this device
  */
 struct cxl_memdev {
 	struct device dev;
@@ -44,6 +45,7 @@ struct cxl_memdev {
 	struct cxl_dev_state *cxlds;
 	struct work_struct detach_work;
 	int id;
+	struct cxl_port *port;
 };
 
 static inline struct cxl_memdev *to_cxl_memdev(struct device *dev)
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 27f9dd0d55b6..c36219193886 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -45,26 +45,31 @@ static int wait_for_media(struct cxl_memdev *cxlmd)
 	return 0;
 }
 
-static int create_endpoint(struct cxl_memdev *cxlmd,
-			   struct cxl_port *parent_port)
+static struct cxl_port *create_endpoint(struct cxl_memdev *cxlmd,
+					struct cxl_port *parent_port)
 {
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	struct cxl_port *endpoint;
+	int rc;
 
 	endpoint = devm_cxl_add_port(&parent_port->dev, &cxlmd->dev,
 				     cxlds->component_reg_phys, parent_port);
 	if (IS_ERR(endpoint))
-		return PTR_ERR(endpoint);
+		return endpoint;
 
 	dev_dbg(&cxlmd->dev, "add: %s\n", dev_name(&endpoint->dev));
 
 	if (!endpoint->dev.driver) {
 		dev_err(&cxlmd->dev, "%s failed probe\n",
 			dev_name(&endpoint->dev));
-		return -ENXIO;
+		return ERR_PTR(-ENXIO);
 	}
 
-	return cxl_endpoint_autoremove(cxlmd, endpoint);
+	rc = cxl_endpoint_autoremove(cxlmd, endpoint);
+	if (rc)
+		return ERR_PTR(rc);
+
+	return endpoint;
 }
 
 /**
@@ -127,11 +132,18 @@ __mock bool cxl_dvsec_decode_init(struct cxl_dev_state *cxlds)
 	return do_hdm_init;
 }
 
+static void delete_memdev(void *dev)
+{
+	struct cxl_memdev *cxlmd = dev;
+
+	put_device(&cxlmd->port->dev);
+}
+
 static int cxl_mem_probe(struct device *dev)
 {
 	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
-	struct cxl_port *parent_port;
+	struct cxl_port *parent_port, *ep_port;
 	int rc;
 
 	/*
@@ -201,7 +213,16 @@ static int cxl_mem_probe(struct device *dev)
 		goto out;
 	}
 
-	rc = create_endpoint(cxlmd, parent_port);
+	ep_port = create_endpoint(cxlmd, parent_port);
+	if (IS_ERR(ep_port)) {
+		rc = PTR_ERR(ep_port);
+		goto out;
+	}
+
+	get_device(&ep_port->dev);
+	cxlmd->port = ep_port;
+
+	rc = devm_add_action_or_reset(dev, delete_memdev, cxlmd);
 out:
 	cxl_device_unlock(&parent_port->dev);
 	put_device(&parent_port->dev);
-- 
2.35.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 04/14] cxl/region: Introduce a cxl_region driver
  2022-01-28  0:26 [PATCH v3 00/14] CXL Region driver Ben Widawsky
                   ` (2 preceding siblings ...)
  2022-01-28  0:26 ` [PATCH v3 03/14] cxl/mem: Cache port created by the mem dev Ben Widawsky
@ 2022-01-28  0:26 ` Ben Widawsky
  2022-02-01 16:21   ` Jonathan Cameron
  2022-02-17  6:04   ` Dan Williams
  2022-01-28  0:26 ` [PATCH v3 05/14] cxl/acpi: Handle address space allocation Ben Widawsky
                   ` (9 subsequent siblings)
  13 siblings, 2 replies; 70+ messages in thread
From: Ben Widawsky @ 2022-01-28  0:26 UTC (permalink / raw)
  To: linux-cxl
  Cc: patches, Ben Widawsky, kernel test robot, Alison Schofield,
	Dan Williams, Ira Weiny, Jonathan Cameron, Vishal Verma,
	Bjorn Helgaas, nvdimm, linux-pci

The cxl_region driver is responsible for managing the HDM decoder
programming in the CXL topology. Once a region is created it must be
configured and bound to the driver in order to activate it.

The following is a sample of how such controls might work:

region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
echo $region > /sys/bus/cxl/devices/decoder0.0/create_region
echo 2 > /sys/bus/cxl/devices/decoder0.0/region0.0:0/interleave
echo $((256<<20)) > /sys/bus/cxl/devices/decoder0.0/region0.0:0/size
echo mem0 > /sys/bus/cxl/devices/decoder0.0/region0.0:0/target0
echo mem1 > /sys/bus/cxl/devices/decoder0.0/region0.0:0/target1
echo region0.0:0 > /sys/bus/cxl/drivers/cxl_region/bind

In order to handle the eventual rise in failure modes of binding a
region, a new trace event is created to help track these failures for
debug and reconfiguration paths in userspace.

Reported-by: kernel test robot <lkp@intel.com> (v2)
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
Changes since v2:
- Add CONFIG_CXL_REGION
- Check ways/granularity in sanitize
---
 .../driver-api/cxl/memory-devices.rst         |   3 +
 drivers/cxl/Kconfig                           |   4 +
 drivers/cxl/Makefile                          |   2 +
 drivers/cxl/core/core.h                       |   1 +
 drivers/cxl/core/port.c                       |  17 +-
 drivers/cxl/core/region.c                     |  25 +-
 drivers/cxl/cxl.h                             |  31 ++
 drivers/cxl/region.c                          | 349 ++++++++++++++++++
 drivers/cxl/region.h                          |   4 +
 9 files changed, 431 insertions(+), 5 deletions(-)
 create mode 100644 drivers/cxl/region.c

diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst
index 66ddc58a21b1..8cb4dece5b17 100644
--- a/Documentation/driver-api/cxl/memory-devices.rst
+++ b/Documentation/driver-api/cxl/memory-devices.rst
@@ -364,6 +364,9 @@ CXL Core
 
 CXL Regions
 -----------
+.. kernel-doc:: drivers/cxl/region.c
+   :doc: cxl region
+
 .. kernel-doc:: drivers/cxl/region.h
    :identifiers:
 
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index b88ab956bb7c..742847503c16 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -98,4 +98,8 @@ config CXL_PORT
 	default CXL_BUS
 	tristate
 
+config CXL_REGION
+	default CXL_PORT
+	tristate
+
 endif
diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
index ce267ef11d93..02a4776e7ab9 100644
--- a/drivers/cxl/Makefile
+++ b/drivers/cxl/Makefile
@@ -5,9 +5,11 @@ obj-$(CONFIG_CXL_MEM) += cxl_mem.o
 obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
 obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o
 obj-$(CONFIG_CXL_PORT) += cxl_port.o
+obj-$(CONFIG_CXL_REGION) += cxl_region.o
 
 cxl_mem-y := mem.o
 cxl_pci-y := pci.o
 cxl_acpi-y := acpi.o
 cxl_pmem-y := pmem.o
 cxl_port-y := port.o
+cxl_region-y := region.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 35fd08d560e2..b8a154da34df 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -7,6 +7,7 @@
 extern const struct device_type cxl_nvdimm_bridge_type;
 extern const struct device_type cxl_nvdimm_type;
 extern const struct device_type cxl_memdev_type;
+extern const struct device_type cxl_region_type;
 
 extern struct attribute_group cxl_base_attribute_group;
 
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 0826208b2bdf..0847e6ce19ef 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -9,6 +9,7 @@
 #include <linux/idr.h>
 #include <cxlmem.h>
 #include <cxlpci.h>
+#include <region.h>
 #include <cxl.h>
 #include "core.h"
 
@@ -49,6 +50,8 @@ static int cxl_device_id(struct device *dev)
 	}
 	if (dev->type == &cxl_memdev_type)
 		return CXL_DEVICE_MEMORY_EXPANDER;
+	if (dev->type == &cxl_region_type)
+		return CXL_DEVICE_REGION;
 	return 0;
 }
 
@@ -1425,13 +1428,23 @@ static int cxl_bus_match(struct device *dev, struct device_driver *drv)
 
 static int cxl_bus_probe(struct device *dev)
 {
-	int rc;
+	int id = cxl_device_id(dev);
+	int rc = -ENODEV;
 
 	cxl_nested_lock(dev);
-	rc = to_cxl_drv(dev->driver)->probe(dev);
+	if (id == CXL_DEVICE_REGION) {
+		/* Regions cannot bind until parameters are set */
+		struct cxl_region *cxlr = to_cxl_region(dev);
+
+		if (is_cxl_region_configured(cxlr))
+			rc = to_cxl_drv(dev->driver)->probe(dev);
+	} else {
+		rc = to_cxl_drv(dev->driver)->probe(dev);
+	}
 	cxl_nested_unlock(dev);
 
 	dev_dbg(dev, "probe: %d\n", rc);
+
 	return rc;
 }
 
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 3b48e0469fc7..784e4ba25128 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -12,6 +12,8 @@
 #include <cxl.h>
 #include "core.h"
 
+#include "core.h"
+
 /**
  * DOC: cxl core region
  *
@@ -26,10 +28,27 @@ static const struct attribute_group region_interleave_group;
 
 static bool is_region_active(struct cxl_region *cxlr)
 {
-	/* TODO: Regions can't be activated yet. */
-	return false;
+	return cxlr->active;
 }
 
+/*
+ * Most sanity checking is left up to region binding. This does the most basic
+ * check to determine whether or not the core should try probing the driver.
+ */
+bool is_cxl_region_configured(const struct cxl_region *cxlr)
+{
+	/* zero sized regions aren't a thing. */
+	if (cxlr->config.size <= 0)
+		return false;
+
+	/* all regions have at least 1 target */
+	if (!cxlr->config.targets[0])
+		return false;
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(is_cxl_region_configured);
+
 static void remove_target(struct cxl_region *cxlr, int target)
 {
 	struct cxl_memdev *cxlmd;
@@ -316,7 +335,7 @@ static const struct attribute_group *region_groups[] = {
 
 static void cxl_region_release(struct device *dev);
 
-static const struct device_type cxl_region_type = {
+const struct device_type cxl_region_type = {
 	.name = "cxl_region",
 	.release = cxl_region_release,
 	.groups = region_groups
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index b9f0099c1f39..d1a8ca19c9ea 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -81,6 +81,31 @@ static inline int cxl_to_interleave_ways(u8 eniw)
 	}
 }
 
+static inline bool cxl_is_interleave_ways_valid(int iw)
+{
+	switch (iw) {
+		case 0 ... 4:
+		case 6:
+		case 8:
+		case 12:
+		case 16:
+			return true;
+		default:
+			return false;
+	}
+
+	unreachable();
+}
+
+static inline bool cxl_is_interleave_granularity_valid(int ig)
+{
+	if (!is_power_of_2(ig))
+		return false;
+
+	/* 16K is the max */
+	return ((ig >> 15) == 0);
+}
+
 /* CXL 2.0 8.2.8.1 Device Capabilities Array Register */
 #define CXLDEV_CAP_ARRAY_OFFSET 0x0
 #define   CXLDEV_CAP_ARRAY_CAP_ID 0
@@ -199,6 +224,10 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
 #define CXL_DECODER_F_ENABLE    BIT(5)
 #define CXL_DECODER_F_MASK  GENMASK(5, 0)
 
+#define cxl_is_pmem_t3(flags)                                                  \
+	(((flags) & (CXL_DECODER_F_TYPE3 | CXL_DECODER_F_PMEM)) ==             \
+	 (CXL_DECODER_F_TYPE3 | CXL_DECODER_F_PMEM))
+
 enum cxl_decoder_type {
        CXL_DECODER_ACCELERATOR = 2,
        CXL_DECODER_EXPANDER = 3,
@@ -357,6 +386,7 @@ struct cxl_dport *devm_cxl_add_dport(struct cxl_port *port,
 				     resource_size_t component_reg_phys);
 struct cxl_dport *cxl_find_dport_by_dev(struct cxl_port *port,
 					const struct device *dev);
+struct cxl_port *ep_find_cxl_port(struct cxl_memdev *cxlmd, unsigned int depth);
 
 struct cxl_decoder *to_cxl_decoder(struct device *dev);
 bool is_root_decoder(struct device *dev);
@@ -404,6 +434,7 @@ void cxl_driver_unregister(struct cxl_driver *cxl_drv);
 #define CXL_DEVICE_PORT			3
 #define CXL_DEVICE_ROOT			4
 #define CXL_DEVICE_MEMORY_EXPANDER	5
+#define CXL_DEVICE_REGION		6
 
 #define MODULE_ALIAS_CXL(type) MODULE_ALIAS("cxl:t" __stringify(type) "*")
 #define CXL_MODALIAS_FMT "cxl:t%d"
diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
new file mode 100644
index 000000000000..cc41939a2f0a
--- /dev/null
+++ b/drivers/cxl/region.c
@@ -0,0 +1,349 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2021 Intel Corporation. All rights reserved. */
+#include <linux/platform_device.h>
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include "cxlmem.h"
+#include "region.h"
+#include "cxl.h"
+
+/**
+ * DOC: cxl region
+ *
+ * This module implements a region driver that is capable of programming CXL
+ * hardware to setup regions.
+ *
+ * A CXL region encompasses a chunk of host physical address space that may be
+ * consumed by a single device (x1 interleave aka linear) or across multiple
+ * devices (xN interleaved). The region driver has the following
+ * responsibilities:
+ *
+ * * Walk topology to obtain decoder resources for region configuration.
+ * * Program decoder resources based on region configuration.
+ * * Bridge CXL regions to LIBNVDIMM
+ * * Initiates reading and configuring LSA regions
+ * * Enumerates regions created by BIOS (typically volatile)
+ */
+
+#define region_ways(region) ((region)->config.interleave_ways)
+#define region_granularity(region) ((region)->config.interleave_granularity)
+
+static struct cxl_decoder *rootd_from_region(struct cxl_region *cxlr)
+{
+	struct device *d = cxlr->dev.parent;
+
+	if (WARN_ONCE(!is_root_decoder(d),
+		      "Corrupt topology for root region\n"))
+		return NULL;
+
+	return to_cxl_decoder(d);
+}
+
+static struct cxl_port *get_hostbridge(const struct cxl_memdev *ep)
+{
+	struct cxl_port *port = ep->port;
+
+	while (!is_cxl_root(port)) {
+		port = to_cxl_port(port->dev.parent);
+		if (port->depth == 1)
+			return port;
+	}
+
+	BUG();
+	return NULL;
+}
+
+static struct cxl_port *get_root_decoder(const struct cxl_memdev *endpoint)
+{
+	struct cxl_port *hostbridge = get_hostbridge(endpoint);
+
+	if (hostbridge)
+		return to_cxl_port(hostbridge->dev.parent);
+
+	return NULL;
+}
+
+/**
+ * sanitize_region() - Check is region is reasonably configured
+ * @cxlr: The region to check
+ *
+ * Determination as to whether or not a region can possibly be configured is
+ * described in CXL Memory Device SW Guide. In order to implement the algorithms
+ * described there, certain more basic configuration parameters must first need
+ * to be validated. That is accomplished by this function.
+ *
+ * Returns 0 if the region is reasonably configured, else returns a negative
+ * error code.
+ */
+static int sanitize_region(const struct cxl_region *cxlr)
+{
+	const int ig = region_granularity(cxlr);
+	const int iw = region_ways(cxlr);
+	int i;
+
+	if (dev_WARN_ONCE(&cxlr->dev, !is_cxl_region_configured(cxlr),
+			  "unconfigured regions can't be probed (race?)\n")) {
+		return -ENXIO;
+	}
+
+	/*
+	 * Interleave attributes should be caught by later math, but it's
+	 * easiest to find those issues here, now.
+	 */
+	if (!cxl_is_interleave_ways_valid(iw)) {
+		dev_dbg(&cxlr->dev, "Invalid number of ways\n");
+		return -ENXIO;
+	}
+
+	if (!cxl_is_interleave_granularity_valid(ig)) {
+		dev_dbg(&cxlr->dev, "Invalid interleave granularity\n");
+		return -ENXIO;
+	}
+
+	if (cxlr->config.size % (SZ_256M * iw)) {
+		dev_dbg(&cxlr->dev, "Invalid size. Must be multiple of %uM\n",
+			256 * iw);
+		return -ENXIO;
+	}
+
+	for (i = 0; i < iw; i++) {
+		if (!cxlr->config.targets[i]) {
+			dev_dbg(&cxlr->dev, "Missing memory device target%u",
+				i);
+			return -ENXIO;
+		}
+		if (!cxlr->config.targets[i]->dev.driver) {
+			dev_dbg(&cxlr->dev, "%s isn't CXL.mem capable\n",
+				dev_name(&cxlr->config.targets[i]->dev));
+			return -ENODEV;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * allocate_address_space() - Gets address space for the region.
+ * @cxlr: The region that will consume the address space
+ */
+static int allocate_address_space(struct cxl_region *cxlr)
+{
+	/* TODO */
+	return 0;
+}
+
+/**
+ * find_cdat_dsmas() - Find a valid DSMAS for the region
+ * @cxlr: The region
+ */
+static bool find_cdat_dsmas(const struct cxl_region *cxlr)
+{
+	return true;
+}
+
+/**
+ * qtg_match() - Does this root decoder have desirable QTG for the endpoint
+ * @rootd: The root decoder for the region
+ * @endpoint: Endpoint whose QTG is being compared
+ *
+ * Prior to calling this function, the caller should verify that all endpoints
+ * in the region have the same QTG ID.
+ *
+ * Returns true if the QTG ID of the root decoder matches the endpoint
+ */
+static bool qtg_match(const struct cxl_decoder *rootd,
+		      const struct cxl_memdev *endpoint)
+{
+	/* TODO: */
+	return true;
+}
+
+/**
+ * region_xhb_config_valid() - determine cross host bridge validity
+ * @cxlr: The region being programmed
+ * @rootd: The root decoder to check against
+ *
+ * The algorithm is outlined in 2.13.14 "Verify XHB configuration sequence" of
+ * the CXL Memory Device SW Guide (Rev1p0).
+ *
+ * Returns true if the configuration is valid.
+ */
+static bool region_xhb_config_valid(const struct cxl_region *cxlr,
+				    const struct cxl_decoder *rootd)
+{
+	/* TODO: */
+	return true;
+}
+
+/**
+ * region_hb_rp_config_valid() - determine root port ordering is correct
+ * @cxlr: Region to validate
+ * @rootd: root decoder for this @cxlr
+ *
+ * The algorithm is outlined in 2.13.15 "Verify HB root port configuration
+ * sequence" of the CXL Memory Device SW Guide (Rev1p0).
+ *
+ * Returns true if the configuration is valid.
+ */
+static bool region_hb_rp_config_valid(const struct cxl_region *cxlr,
+				      const struct cxl_decoder *rootd)
+{
+	/* TODO: */
+	return true;
+}
+
+/**
+ * rootd_contains() - determine if this region can exist in the root decoder
+ * @rootd: root decoder that potentially decodes to this region
+ * @cxlr: region to be routed by the @rootd
+ */
+static bool rootd_contains(const struct cxl_region *cxlr,
+			   const struct cxl_decoder *rootd)
+{
+	/* TODO: */
+	return true;
+}
+
+static bool rootd_valid(const struct cxl_region *cxlr,
+			const struct cxl_decoder *rootd)
+{
+	const struct cxl_memdev *endpoint = cxlr->config.targets[0];
+
+	if (!qtg_match(rootd, endpoint))
+		return false;
+
+	if (!cxl_is_pmem_t3(rootd->flags))
+		return false;
+
+	if (!region_xhb_config_valid(cxlr, rootd))
+		return false;
+
+	if (!region_hb_rp_config_valid(cxlr, rootd))
+		return false;
+
+	if (!rootd_contains(cxlr, rootd))
+		return false;
+
+	return true;
+}
+
+struct rootd_context {
+	const struct cxl_region *cxlr;
+	struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];
+	int count;
+};
+
+static int rootd_match(struct device *dev, void *data)
+{
+	struct rootd_context *ctx = (struct rootd_context *)data;
+	const struct cxl_region *cxlr = ctx->cxlr;
+
+	if (!is_root_decoder(dev))
+		return 0;
+
+	return !!rootd_valid(cxlr, to_cxl_decoder(dev));
+}
+
+/*
+ * This is a roughly equivalent implementation to "Figure 45 - High-level
+ * sequence: Finding CFMWS for region" from the CXL Memory Device SW Guide
+ * Rev1p0.
+ */
+static struct cxl_decoder *find_rootd(const struct cxl_region *cxlr,
+				      const struct cxl_port *root)
+{
+	struct rootd_context ctx;
+	struct device *ret;
+
+	ctx.cxlr = cxlr;
+
+	ret = device_find_child((struct device *)&root->dev, &ctx, rootd_match);
+	if (ret)
+		return to_cxl_decoder(ret);
+
+	return NULL;
+}
+
+static int collect_ep_decoders(const struct cxl_region *cxlr)
+{
+	/* TODO: */
+	return 0;
+}
+
+static int bind_region(const struct cxl_region *cxlr)
+{
+	/* TODO: */
+	return 0;
+}
+
+static int cxl_region_probe(struct device *dev)
+{
+	struct cxl_region *cxlr = to_cxl_region(dev);
+	struct cxl_port *root_port;
+	struct cxl_decoder *rootd, *ours;
+	int ret;
+
+	device_lock_assert(&cxlr->dev);
+
+	if (cxlr->active)
+		return 0;
+
+	if (uuid_is_null(&cxlr->config.uuid))
+		uuid_gen(&cxlr->config.uuid);
+
+	/* TODO: What about volatile, and LSA generated regions? */
+
+	ret = sanitize_region(cxlr);
+	if (ret)
+		return ret;
+
+	ret = allocate_address_space(cxlr);
+	if (ret)
+		return ret;
+
+	if (!find_cdat_dsmas(cxlr))
+		return -ENXIO;
+
+	rootd = rootd_from_region(cxlr);
+	if (!rootd) {
+		dev_err(dev, "Couldn't find root decoder\n");
+		return -ENXIO;
+	}
+
+	if (!rootd_valid(cxlr, rootd)) {
+		dev_err(dev, "Picked invalid rootd\n");
+		return -ENXIO;
+	}
+
+	root_port = get_root_decoder(cxlr->config.targets[0]);
+	ours = find_rootd(cxlr, root_port);
+	if (ours != rootd)
+		dev_dbg(dev, "Picked different rootd %s %s\n",
+			dev_name(&rootd->dev), dev_name(&ours->dev));
+	if (ours)
+		put_device(&ours->dev);
+
+	ret = collect_ep_decoders(cxlr);
+	if (ret)
+		return ret;
+
+	ret = bind_region(cxlr);
+	if (!ret) {
+		cxlr->active = true;
+		dev_info(dev, "Bound");
+	}
+
+	return ret;
+}
+
+static struct cxl_driver cxl_region_driver = {
+	.name = "cxl_region",
+	.probe = cxl_region_probe,
+	.id = CXL_DEVICE_REGION,
+};
+module_cxl_driver(cxl_region_driver);
+
+MODULE_LICENSE("GPL v2");
+MODULE_IMPORT_NS(CXL);
+MODULE_ALIAS_CXL(CXL_DEVICE_REGION);
diff --git a/drivers/cxl/region.h b/drivers/cxl/region.h
index eb1249e3c1d4..00a6dc729c26 100644
--- a/drivers/cxl/region.h
+++ b/drivers/cxl/region.h
@@ -13,6 +13,7 @@
  * @id: This regions id. Id is globally unique across all regions.
  * @list: Node in decoder's region list.
  * @res: Resource this region carves out of the platform decode range.
+ * @active: If the region has been activated.
  * @config: HDM decoder program config
  * @config.size: Size of the region determined from LSA or userspace.
  * @config.uuid: The UUID for this region.
@@ -25,6 +26,7 @@ struct cxl_region {
 	int id;
 	struct list_head list;
 	struct resource *res;
+	bool active;
 
 	struct {
 		u64 size;
@@ -35,4 +37,6 @@ struct cxl_region {
 	} config;
 };
 
+bool is_cxl_region_configured(const struct cxl_region *cxlr);
+
 #endif
-- 
2.35.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 05/14] cxl/acpi: Handle address space allocation
  2022-01-28  0:26 [PATCH v3 00/14] CXL Region driver Ben Widawsky
                   ` (3 preceding siblings ...)
  2022-01-28  0:26 ` [PATCH v3 04/14] cxl/region: Introduce a cxl_region driver Ben Widawsky
@ 2022-01-28  0:26 ` Ben Widawsky
  2022-02-18 19:17   ` Dan Williams
  2022-01-28  0:26 ` [PATCH v3 06/14] cxl/region: Address " Ben Widawsky
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-01-28  0:26 UTC (permalink / raw)
  To: linux-cxl
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

Regions are carved out of an addresses space which is claimed by top
level decoders, and subsequently their children decoders. Regions are
created with a size and therefore must fit, with proper alignment, in
that address space. The support for doing this fitting is handled by the
driver automatically.

As an example, a platform might configure a top level decoder to claim
1TB of address space @ 0x800000000 -> 0x10800000000; it would be
possible to create M regions with appropriate alignment to occupy that
address space. Each of those regions would have a host physical address
somewhere in the range between 32G and 1.3TB, and the location will be
determined by the logic added here.

The request_region() usage is not strictly mandatory at this point as
the actual handling of the address space is done with genpools. It is
highly likely however that the resource/region APIs will become useful
in the not too distant future.

All decoders manage a host physical address space while active. Only the
root decoder has constraints on location and size. As a result, it makes
most sense for the root decoder to be responsible for managing the
entire address space, and mid-level decoders and endpoints can ask the
root decoder for suballocations.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/acpi.c | 30 ++++++++++++++++++++++++++++++
 drivers/cxl/cxl.h  |  2 ++
 2 files changed, 32 insertions(+)

diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index d6dcb2b6af48..74681bfbf53c 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /* Copyright(c) 2021 Intel Corporation. All rights reserved. */
 #include <linux/platform_device.h>
+#include <linux/genalloc.h>
 #include <linux/module.h>
 #include <linux/device.h>
 #include <linux/kernel.h>
@@ -73,6 +74,27 @@ static int cxl_acpi_cfmws_verify(struct device *dev,
 	return 0;
 }
 
+/*
+ * Every decoder while active has an address space that it is decoding. However,
+ * only the root level decoders have fixed host physical address space ranges.
+ */
+static int cxl_create_cfmws_address_space(struct cxl_decoder *cxld,
+					  struct acpi_cedt_cfmws *cfmws)
+{
+	const int order = ilog2(SZ_256M * cxld->interleave_ways);
+	struct device *dev = &cxld->dev;
+	struct gen_pool *pool;
+
+	pool = devm_gen_pool_create(dev, order, NUMA_NO_NODE, dev_name(dev));
+	if (IS_ERR(pool))
+		return PTR_ERR(pool);
+
+	cxld->address_space = pool;
+
+	return gen_pool_add(cxld->address_space, cfmws->base_hpa,
+			    cfmws->window_size, NUMA_NO_NODE);
+}
+
 struct cxl_cfmws_context {
 	struct device *dev;
 	struct cxl_port *root_port;
@@ -113,6 +135,14 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
 	cxld->interleave_ways = CFMWS_INTERLEAVE_WAYS(cfmws);
 	cxld->interleave_granularity = CFMWS_INTERLEAVE_GRANULARITY(cfmws);
 
+	rc = cxl_create_cfmws_address_space(cxld, cfmws);
+	if (rc) {
+		dev_err(dev,
+			"Failed to create CFMWS address space for decoder\n");
+		put_device(&cxld->dev);
+		return 0;
+	}
+
 	rc = cxl_decoder_add(cxld, target_map);
 	if (rc)
 		put_device(&cxld->dev);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index d1a8ca19c9ea..b300673072f5 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -251,6 +251,7 @@ enum cxl_decoder_type {
  * @flags: memory type capabilities and locking
  * @target_lock: coordinate coherent reads of the target list
  * @region_ida: allocator for region ids.
+ * @address_space: Used/free address space for regions.
  * @nr_targets: number of elements in @target
  * @target: active ordered target list in current decoder configuration
  */
@@ -267,6 +268,7 @@ struct cxl_decoder {
 	unsigned long flags;
 	seqlock_t target_lock;
 	struct ida region_ida;
+	struct gen_pool *address_space;
 	int nr_targets;
 	struct cxl_dport *target[];
 };
-- 
2.35.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 06/14] cxl/region: Address space allocation
  2022-01-28  0:26 [PATCH v3 00/14] CXL Region driver Ben Widawsky
                   ` (4 preceding siblings ...)
  2022-01-28  0:26 ` [PATCH v3 05/14] cxl/acpi: Handle address space allocation Ben Widawsky
@ 2022-01-28  0:26 ` Ben Widawsky
  2022-02-18 19:51   ` Dan Williams
  2022-01-28  0:27 ` [PATCH v3 07/14] cxl/region: Implement XHB verification Ben Widawsky
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-01-28  0:26 UTC (permalink / raw)
  To: linux-cxl
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

When a region is not assigned a host physical address, one is picked by
the driver. As the address will determine which CFMWS contains the
region, it's usually a better idea to let the driver make this
determination.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/region.c | 40 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 38 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
index cc41939a2f0a..5588873dd250 100644
--- a/drivers/cxl/region.c
+++ b/drivers/cxl/region.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /* Copyright(c) 2021 Intel Corporation. All rights reserved. */
 #include <linux/platform_device.h>
+#include <linux/genalloc.h>
 #include <linux/device.h>
 #include <linux/module.h>
 #include <linux/pci.h>
@@ -64,6 +65,20 @@ static struct cxl_port *get_root_decoder(const struct cxl_memdev *endpoint)
 	return NULL;
 }
 
+static void release_cxl_region(void *r)
+{
+	struct cxl_region *cxlr = (struct cxl_region *)r;
+	struct cxl_decoder *rootd = rootd_from_region(cxlr);
+	struct resource *res = &rootd->platform_res;
+	resource_size_t start, size;
+
+	start = cxlr->res->start;
+	size = resource_size(cxlr->res);
+
+	__release_region(res, start, size);
+	gen_pool_free(rootd->address_space, start, size);
+}
+
 /**
  * sanitize_region() - Check is region is reasonably configured
  * @cxlr: The region to check
@@ -129,8 +144,29 @@ static int sanitize_region(const struct cxl_region *cxlr)
  */
 static int allocate_address_space(struct cxl_region *cxlr)
 {
-	/* TODO */
-	return 0;
+	struct cxl_decoder *rootd = rootd_from_region(cxlr);
+	unsigned long start;
+
+	start = gen_pool_alloc(rootd->address_space, cxlr->config.size);
+	if (!start) {
+		dev_dbg(&cxlr->dev, "Couldn't allocate %lluM of address space",
+			cxlr->config.size >> 20);
+		return -ENOMEM;
+	}
+
+	cxlr->res =
+		__request_region(&rootd->platform_res, start, cxlr->config.size,
+				 dev_name(&cxlr->dev), IORESOURCE_MEM);
+	if (!cxlr->res) {
+		dev_dbg(&cxlr->dev, "Couldn't obtain region from %s (%pR)\n",
+			dev_name(&rootd->dev), &rootd->platform_res);
+		gen_pool_free(rootd->address_space, start, cxlr->config.size);
+		return -ENOMEM;
+	}
+
+	dev_dbg(&cxlr->dev, "resource %pR", cxlr->res);
+
+	return devm_add_action_or_reset(&cxlr->dev, release_cxl_region, cxlr);
 }
 
 /**
-- 
2.35.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 07/14] cxl/region: Implement XHB verification
  2022-01-28  0:26 [PATCH v3 00/14] CXL Region driver Ben Widawsky
                   ` (5 preceding siblings ...)
  2022-01-28  0:26 ` [PATCH v3 06/14] cxl/region: Address " Ben Widawsky
@ 2022-01-28  0:27 ` Ben Widawsky
  2022-02-18 20:23   ` Dan Williams
  2022-01-28  0:27 ` [PATCH v3 08/14] cxl/region: HB port config verification Ben Widawsky
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-01-28  0:27 UTC (permalink / raw)
  To: linux-cxl
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

Cross host bridge verification primarily determines if the requested
interleave ordering can be achieved by the root decoder, which isn't as
programmable as other decoders.

The algorithm implemented here is based on the CXL Type 3 Memory Device
Software Guide, chapter 2.13.14

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
Changes since v2:
- Fail earlier on lack of host bridges. This should only be capable as
  of now with cxl_test memdevs.
---
 .clang-format        |  2 +
 drivers/cxl/cxl.h    | 13 +++++++
 drivers/cxl/region.c | 89 +++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 103 insertions(+), 1 deletion(-)

diff --git a/.clang-format b/.clang-format
index fa959436bcfd..1221d53be90b 100644
--- a/.clang-format
+++ b/.clang-format
@@ -169,6 +169,8 @@ ForEachMacros:
   - 'for_each_cpu_and'
   - 'for_each_cpu_not'
   - 'for_each_cpu_wrap'
+  - 'for_each_cxl_decoder_target'
+  - 'for_each_cxl_endpoint'
   - 'for_each_dapm_widgets'
   - 'for_each_dev_addr'
   - 'for_each_dev_scope'
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index b300673072f5..a291999431c7 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -81,6 +81,19 @@ static inline int cxl_to_interleave_ways(u8 eniw)
 	}
 }
 
+static inline u8 cxl_to_eniw(u8 ways)
+{
+	if (is_power_of_2(ways))
+		return ilog2(ways);
+
+	return ways / 3 + 8;
+}
+
+static inline u8 cxl_to_ig(u16 g)
+{
+	return ilog2(g) - 8;
+}
+
 static inline bool cxl_is_interleave_ways_valid(int iw)
 {
 	switch (iw) {
diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
index 5588873dd250..562c8720da56 100644
--- a/drivers/cxl/region.c
+++ b/drivers/cxl/region.c
@@ -29,6 +29,17 @@
 
 #define region_ways(region) ((region)->config.interleave_ways)
 #define region_granularity(region) ((region)->config.interleave_granularity)
+#define region_eniw(region) (cxl_to_eniw(region_ways(region)))
+#define region_ig(region) (cxl_to_ig(region_granularity(region)))
+
+#define for_each_cxl_endpoint(ep, region, idx)                                 \
+	for (idx = 0, ep = (region)->config.targets[idx];                      \
+	     idx < region_ways(region); ep = (region)->config.targets[++idx])
+
+#define for_each_cxl_decoder_target(dport, decoder, idx)                       \
+	for (idx = 0, dport = (decoder)->target[idx];                          \
+	     idx < (decoder)->nr_targets - 1;                                  \
+	     dport = (decoder)->target[++idx])
 
 static struct cxl_decoder *rootd_from_region(struct cxl_region *cxlr)
 {
@@ -195,6 +206,30 @@ static bool qtg_match(const struct cxl_decoder *rootd,
 	return true;
 }
 
+static int get_unique_hostbridges(const struct cxl_region *cxlr,
+				  struct cxl_port **hbs)
+{
+	struct cxl_memdev *ep;
+	int i, hb_count = 0;
+
+	for_each_cxl_endpoint(ep, cxlr, i) {
+		struct cxl_port *hb = get_hostbridge(ep);
+		bool found = false;
+		int j;
+
+		BUG_ON(!hb);
+
+		for (j = 0; j < hb_count; j++) {
+			if (hbs[j] == hb)
+				found = true;
+		}
+		if (!found)
+			hbs[hb_count++] = hb;
+	}
+
+	return hb_count;
+}
+
 /**
  * region_xhb_config_valid() - determine cross host bridge validity
  * @cxlr: The region being programmed
@@ -208,7 +243,59 @@ static bool qtg_match(const struct cxl_decoder *rootd,
 static bool region_xhb_config_valid(const struct cxl_region *cxlr,
 				    const struct cxl_decoder *rootd)
 {
-	/* TODO: */
+	const int rootd_eniw = cxl_to_eniw(rootd->interleave_ways);
+	const int rootd_ig = cxl_to_ig(rootd->interleave_granularity);
+	const int cxlr_ig = region_ig(cxlr);
+	const int cxlr_iw = region_ways(cxlr);
+	struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];
+	struct cxl_dport *target;
+	int i;
+
+	i = get_unique_hostbridges(cxlr, hbs);
+	if (dev_WARN_ONCE(&cxlr->dev, i == 0, "Cannot find a valid host bridge\n"))
+		return false;
+
+	/* Are all devices in this region on the same CXL host bridge */
+	if (i == 1)
+		return true;
+
+	/* CFMWS.HBIG >= Device.Label.IG */
+	if (rootd_ig < cxlr_ig) {
+		dev_dbg(&cxlr->dev,
+			"%s HBIG must be greater than region IG (%d < %d)\n",
+			dev_name(&rootd->dev), rootd_ig, cxlr_ig);
+		return false;
+	}
+
+	/*
+	 * ((2^(CFMWS.HBIG - Device.RLabel.IG) * (2^CFMWS.ENIW)) > Device.RLabel.NLabel)
+	 *
+	 * XXX: 2^CFMWS.ENIW is trying to decode the NIW. Instead, use the look
+	 * up function which supports non power of 2 interleave configurations.
+	 */
+	if (((1 << (rootd_ig - cxlr_ig)) * (1 << rootd_eniw)) > cxlr_iw) {
+		dev_dbg(&cxlr->dev,
+			"granularity ratio requires a larger number of devices (%d) than currently configured (%d)\n",
+			((1 << (rootd_ig - cxlr_ig)) * (1 << rootd_eniw)),
+			cxlr_iw);
+		return false;
+	}
+
+	/*
+	 * CFMWS.InterleaveTargetList[n] must contain all devices, x where:
+	 *	(Device[x],RegionLabel.Position >> (CFMWS.HBIG -
+	 *	Device[x].RegionLabel.InterleaveGranularity)) &
+	 *	((2^CFMWS.ENIW) - 1) = n
+	 */
+	for_each_cxl_decoder_target(target, rootd, i) {
+		if (((i >> (rootd_ig - cxlr_ig))) &
+		    (((1 << rootd_eniw) - 1) != target->port_id)) {
+			dev_dbg(&cxlr->dev,
+				"One or more devices are not connected to the correct hostbridge.\n");
+			return false;
+		}
+	}
+
 	return true;
 }
 
-- 
2.35.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 08/14] cxl/region: HB port config verification
  2022-01-28  0:26 [PATCH v3 00/14] CXL Region driver Ben Widawsky
                   ` (6 preceding siblings ...)
  2022-01-28  0:27 ` [PATCH v3 07/14] cxl/region: Implement XHB verification Ben Widawsky
@ 2022-01-28  0:27 ` Ben Widawsky
  2022-02-14 16:20   ` Jonathan Cameron
                     ` (2 more replies)
  2022-01-28  0:27 ` [PATCH v3 09/14] cxl/region: Add infrastructure for decoder programming Ben Widawsky
                   ` (5 subsequent siblings)
  13 siblings, 3 replies; 70+ messages in thread
From: Ben Widawsky @ 2022-01-28  0:27 UTC (permalink / raw)
  To: linux-cxl
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

Host bridge root port verification determines if the device ordering in
an interleave set can be programmed through the host bridges and
switches.

The algorithm implemented here is based on the CXL Type 3 Memory Device
Software Guide, chapter 2.13.15. The current version of the guide does
not yet support x3 interleave configurations, and so that's not
supported here either.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 .clang-format           |   1 +
 drivers/cxl/core/port.c |   1 +
 drivers/cxl/cxl.h       |   2 +
 drivers/cxl/region.c    | 127 +++++++++++++++++++++++++++++++++++++++-
 4 files changed, 130 insertions(+), 1 deletion(-)

diff --git a/.clang-format b/.clang-format
index 1221d53be90b..5e20206f905e 100644
--- a/.clang-format
+++ b/.clang-format
@@ -171,6 +171,7 @@ ForEachMacros:
   - 'for_each_cpu_wrap'
   - 'for_each_cxl_decoder_target'
   - 'for_each_cxl_endpoint'
+  - 'for_each_cxl_endpoint_hb'
   - 'for_each_dapm_widgets'
   - 'for_each_dev_addr'
   - 'for_each_dev_scope'
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 0847e6ce19ef..1d81c5f56a3e 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -706,6 +706,7 @@ struct cxl_dport *devm_cxl_add_dport(struct cxl_port *port,
 		return ERR_PTR(-ENOMEM);
 
 	INIT_LIST_HEAD(&dport->list);
+	INIT_LIST_HEAD(&dport->verify_link);
 	dport->dport = dport_dev;
 	dport->port_id = port_id;
 	dport->component_reg_phys = component_reg_phys;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index a291999431c7..ed984465b59c 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -350,6 +350,7 @@ struct cxl_port {
  * @component_reg_phys: downstream port component registers
  * @port: reference to cxl_port that contains this downstream port
  * @list: node for a cxl_port's list of cxl_dport instances
+ * @verify_link: node used for hb root port verification
  */
 struct cxl_dport {
 	struct device *dport;
@@ -357,6 +358,7 @@ struct cxl_dport {
 	resource_size_t component_reg_phys;
 	struct cxl_port *port;
 	struct list_head list;
+	struct list_head verify_link;
 };
 
 /**
diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
index 562c8720da56..d2f6c990c8a8 100644
--- a/drivers/cxl/region.c
+++ b/drivers/cxl/region.c
@@ -4,6 +4,7 @@
 #include <linux/genalloc.h>
 #include <linux/device.h>
 #include <linux/module.h>
+#include <linux/sort.h>
 #include <linux/pci.h>
 #include "cxlmem.h"
 #include "region.h"
@@ -36,6 +37,12 @@
 	for (idx = 0, ep = (region)->config.targets[idx];                      \
 	     idx < region_ways(region); ep = (region)->config.targets[++idx])
 
+#define for_each_cxl_endpoint_hb(ep, region, hb, idx)                          \
+	for (idx = 0, (ep) = (region)->config.targets[idx];                    \
+	     idx < region_ways(region);                                        \
+	     idx++, (ep) = (region)->config.targets[idx])                      \
+		if (get_hostbridge(ep) == (hb))
+
 #define for_each_cxl_decoder_target(dport, decoder, idx)                       \
 	for (idx = 0, dport = (decoder)->target[idx];                          \
 	     idx < (decoder)->nr_targets - 1;                                  \
@@ -299,6 +306,59 @@ static bool region_xhb_config_valid(const struct cxl_region *cxlr,
 	return true;
 }
 
+static struct cxl_dport *get_rp(struct cxl_memdev *ep)
+{
+	struct cxl_port *port, *parent_port = port = ep->port;
+	struct cxl_dport *dport;
+
+	while (!is_cxl_root(port)) {
+		parent_port = to_cxl_port(port->dev.parent);
+		if (parent_port->depth == 1)
+			list_for_each_entry(dport, &parent_port->dports, list)
+				if (dport->dport == port->uport->parent->parent)
+					return dport;
+		port = parent_port;
+	}
+
+	BUG();
+	return NULL;
+}
+
+static int get_num_root_ports(const struct cxl_region *cxlr)
+{
+	struct cxl_memdev *endpoint;
+	struct cxl_dport *dport, *tmp;
+	int num_root_ports = 0;
+	LIST_HEAD(root_ports);
+	int idx;
+
+	for_each_cxl_endpoint(endpoint, cxlr, idx) {
+		struct cxl_dport *root_port = get_rp(endpoint);
+
+		if (list_empty(&root_port->verify_link)) {
+			list_add_tail(&root_port->verify_link, &root_ports);
+			num_root_ports++;
+		}
+	}
+
+	list_for_each_entry_safe(dport, tmp, &root_ports, verify_link)
+		list_del_init(&dport->verify_link);
+
+	return num_root_ports;
+}
+
+static bool has_switch(const struct cxl_region *cxlr)
+{
+	struct cxl_memdev *ep;
+	int i;
+
+	for_each_cxl_endpoint(ep, cxlr, i)
+		if (ep->port->depth > 2)
+			return true;
+
+	return false;
+}
+
 /**
  * region_hb_rp_config_valid() - determine root port ordering is correct
  * @cxlr: Region to validate
@@ -312,7 +372,72 @@ static bool region_xhb_config_valid(const struct cxl_region *cxlr,
 static bool region_hb_rp_config_valid(const struct cxl_region *cxlr,
 				      const struct cxl_decoder *rootd)
 {
-	/* TODO: */
+	const int num_root_ports = get_num_root_ports(cxlr);
+	struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];
+	int hb_count, i;
+
+	hb_count = get_unique_hostbridges(cxlr, hbs);
+
+	/* TODO: Switch support */
+	if (has_switch(cxlr))
+		return false;
+
+	/*
+	 * Are all devices in this region on the same CXL Host Bridge
+	 * Root Port?
+	 */
+	if (num_root_ports == 1 && !has_switch(cxlr))
+		return true;
+
+	for (i = 0; i < hb_count; i++) {
+		int idx, position_mask;
+		struct cxl_dport *rp;
+		struct cxl_port *hb;
+
+		/* Get next CXL Host Bridge this region spans */
+		hb = hbs[i];
+
+		/*
+		 * Calculate the position mask: NumRootPorts = 2^PositionMask
+		 * for this region.
+		 *
+		 * XXX: pos_mask is actually (1 << PositionMask)  - 1
+		 */
+		position_mask = (1 << (ilog2(num_root_ports))) - 1;
+
+		/*
+		 * Calculate the PortGrouping for each device on this CXL Host
+		 * Bridge Root Port:
+		 * PortGrouping = RegionLabel.Position & PositionMask
+		 *
+		 * The following nest iterators effectively iterate over each
+		 * root port in the region.
+		 *   for_each_unique_rootport(rp, cxlr)
+		 */
+		list_for_each_entry(rp, &hb->dports, list) {
+			struct cxl_memdev *ep;
+			int port_grouping = -1;
+
+			for_each_cxl_endpoint_hb(ep, cxlr, hb, idx) {
+				if (get_rp(ep) != rp)
+					continue;
+
+				if (port_grouping == -1)
+					port_grouping = idx & position_mask;
+
+				/*
+				 * Do all devices in the region connected to this CXL
+				 * Host Bridge Root Port have the same PortGrouping?
+				 */
+				if ((idx & position_mask) != port_grouping) {
+					dev_dbg(&cxlr->dev,
+						"One or more devices are not connected to the correct Host Bridge Root Port\n");
+					return false;
+				}
+			}
+		}
+	}
+
 	return true;
 }
 
-- 
2.35.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 09/14] cxl/region: Add infrastructure for decoder programming
  2022-01-28  0:26 [PATCH v3 00/14] CXL Region driver Ben Widawsky
                   ` (7 preceding siblings ...)
  2022-01-28  0:27 ` [PATCH v3 08/14] cxl/region: HB port config verification Ben Widawsky
@ 2022-01-28  0:27 ` Ben Widawsky
  2022-02-01 18:16   ` Jonathan Cameron
  2022-02-18 21:53   ` Dan Williams
  2022-01-28  0:27 ` [PATCH v3 10/14] cxl/region: Collect host bridge decoders Ben Widawsky
                   ` (4 subsequent siblings)
  13 siblings, 2 replies; 70+ messages in thread
From: Ben Widawsky @ 2022-01-28  0:27 UTC (permalink / raw)
  To: linux-cxl
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

There are 3 steps in handling region programming once it has been
configured by userspace.
1. Sanitize the parameters against the system.
2. Collect decoder resources from the topology
3. Program decoder resources

The infrastructure added here addresses #2. Two new APIs are introduced
to allow collecting and returning decoder resources. Additionally the
infrastructure includes two lists managed by the region driver, a staged
list, and a commit list. The staged list contains those collected in
step #2, and the commit list are all the decoders programmed in step #3.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/core/port.c   |  71 ++++++++++++++++++++++
 drivers/cxl/core/region.c |   2 +
 drivers/cxl/cxl.h         |   8 +++
 drivers/cxl/cxlmem.h      |   7 +++
 drivers/cxl/port.c        |  62 ++++++++++++++++++-
 drivers/cxl/region.c      | 125 +++++++++++++++++++++++++++++++++-----
 drivers/cxl/region.h      |   5 ++
 7 files changed, 263 insertions(+), 17 deletions(-)

diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 1d81c5f56a3e..92aaaa65ec61 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1212,6 +1212,8 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
 	cxld->target_type = CXL_DECODER_EXPANDER;
 	cxld->platform_res = (struct resource)DEFINE_RES_MEM(0, 0);
 
+	INIT_LIST_HEAD(&cxld->region_link);
+
 	ida_init(&cxld->region_ida);
 
 	return cxld;
@@ -1366,6 +1368,75 @@ int cxl_decoder_add(struct cxl_decoder *cxld, int *target_map)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_decoder_add, CXL);
 
+/**
+ * cxl_get_decoder() - Get an unused decoder from the port.
+ * @port: The port to obtain a decoder from.
+ *
+ * Region programming requires obtaining decoder resources from all ports that
+ * participate in the interleave set. This function shall be used to pull the
+ * decoder resource out of the list of available.
+ *
+ * Context: Process context. Takes and releases the device lock of the port.
+ *
+ * Return: A cxl_decoder that can be used for programming if successful, else a
+ *	   negative error code.
+ */
+struct cxl_decoder *cxl_get_decoder(struct cxl_port *port)
+{
+	struct cxl_hdm *cxlhdm;
+	int dec;
+
+	cxlhdm = dev_get_drvdata(&port->dev);
+	if (dev_WARN_ONCE(&port->dev, !cxlhdm, "No port drvdata\n"))
+		return ERR_PTR(-ENXIO);
+
+	device_lock(&port->dev);
+	dec = find_first_bit(cxlhdm->decoders.free_mask,
+			     cxlhdm->decoders.count);
+	if (dec == cxlhdm->decoders.count) {
+		device_unlock(&port->dev);
+		return ERR_PTR(-ENODEV);
+	}
+
+	clear_bit(dec, cxlhdm->decoders.free_mask);
+	device_unlock(&port->dev);
+
+	return cxlhdm->decoders.cxld[dec];
+}
+EXPORT_SYMBOL_NS_GPL(cxl_get_decoder, CXL);
+
+/**
+ * cxl_put_decoder() - Return an inactive decoder to the port.
+ * @cxld: The decoder being returned.
+ */
+void cxl_put_decoder(struct cxl_decoder *cxld)
+{
+	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
+	struct cxl_hdm *cxlhdm;
+	int i;
+
+	cxlhdm = dev_get_drvdata(&port->dev);
+	if (dev_WARN_ONCE(&port->dev, !cxlhdm, "No port drvdata\n"))
+		return;
+
+	device_lock(&port->dev);
+
+	for (i = 0; i < CXL_DECODER_MAX_INSTANCES; i++) {
+		struct cxl_decoder *d = cxlhdm->decoders.cxld[i];
+
+		if (!d)
+			continue;
+
+		if (d == cxld) {
+			set_bit(i, cxlhdm->decoders.free_mask);
+			break;
+		}
+	}
+
+	device_unlock(&port->dev);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_put_decoder, CXL);
+
 static void cxld_unregister(void *dev)
 {
 	device_unregister(dev);
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 784e4ba25128..a62d48454a56 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -440,6 +440,8 @@ struct cxl_region *cxl_alloc_region(struct cxl_decoder *cxld, int id)
 	if (!cxlr)
 		return ERR_PTR(-ENOMEM);
 
+	INIT_LIST_HEAD(&cxlr->staged_list);
+	INIT_LIST_HEAD(&cxlr->commit_list);
 	cxlr->id = id;
 
 	return cxlr;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index ed984465b59c..8ace6cca0776 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -35,6 +35,8 @@
 #define   CXL_CM_CAP_CAP_ID_HDM 0x5
 #define   CXL_CM_CAP_CAP_HDM_VERSION 1
 
+#define CXL_DECODER_MAX_INSTANCES 10
+
 /* HDM decoders CXL 2.0 8.2.5.12 CXL HDM Decoder Capability Structure */
 #define CXL_HDM_DECODER_CAP_OFFSET 0x0
 #define   CXL_HDM_DECODER_COUNT_MASK GENMASK(3, 0)
@@ -265,6 +267,7 @@ enum cxl_decoder_type {
  * @target_lock: coordinate coherent reads of the target list
  * @region_ida: allocator for region ids.
  * @address_space: Used/free address space for regions.
+ * @region_link: This decoder's place on either the staged, or commit list.
  * @nr_targets: number of elements in @target
  * @target: active ordered target list in current decoder configuration
  */
@@ -282,6 +285,7 @@ struct cxl_decoder {
 	seqlock_t target_lock;
 	struct ida region_ida;
 	struct gen_pool *address_space;
+	struct list_head region_link;
 	int nr_targets;
 	struct cxl_dport *target[];
 };
@@ -326,6 +330,7 @@ struct cxl_nvdimm {
  * @id: id for port device-name
  * @dports: cxl_dport instances referenced by decoders
  * @endpoints: cxl_ep instances, endpoints that are a descendant of this port
+ * @region_link: this port's node on the region's list of ports
  * @decoder_ida: allocator for decoder ids
  * @component_reg_phys: component register capability base address (optional)
  * @dead: last ep has been removed, force port re-creation
@@ -396,6 +401,8 @@ struct cxl_port *find_cxl_root(struct device *dev);
 int devm_cxl_enumerate_ports(struct cxl_memdev *cxlmd);
 int cxl_bus_rescan(void);
 struct cxl_port *cxl_mem_find_port(struct cxl_memdev *cxlmd);
+struct cxl_decoder *cxl_get_decoder(struct cxl_port *port);
+void cxl_put_decoder(struct cxl_decoder *cxld);
 bool schedule_cxl_memdev_detach(struct cxl_memdev *cxlmd);
 
 struct cxl_dport *devm_cxl_add_dport(struct cxl_port *port,
@@ -406,6 +413,7 @@ struct cxl_dport *cxl_find_dport_by_dev(struct cxl_port *port,
 struct cxl_port *ep_find_cxl_port(struct cxl_memdev *cxlmd, unsigned int depth);
 
 struct cxl_decoder *to_cxl_decoder(struct device *dev);
+bool is_cxl_decoder(struct device *dev);
 bool is_root_decoder(struct device *dev);
 bool is_cxl_decoder(struct device *dev);
 struct cxl_decoder *cxl_root_decoder_alloc(struct cxl_port *port,
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 2b8c66616d4e..6db66eaf51be 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -305,5 +305,12 @@ struct cxl_hdm {
 	unsigned int target_count;
 	unsigned int interleave_mask;
 	struct cxl_port *port;
+
+	struct port_decoders {
+		unsigned long *free_mask;
+		int count;
+
+		struct cxl_decoder *cxld[CXL_DECODER_MAX_INSTANCES];
+	} decoders;
 };
 #endif /* __CXL_MEM_H__ */
diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
index d420da5fc39c..fdb62ed06433 100644
--- a/drivers/cxl/port.c
+++ b/drivers/cxl/port.c
@@ -30,11 +30,55 @@ static void schedule_detach(void *cxlmd)
 	schedule_cxl_memdev_detach(cxlmd);
 }
 
+static int count_decoders(struct device *dev, void *data)
+{
+	if (is_cxl_decoder(dev))
+		(*(int *)data)++;
+
+	return 0;
+}
+
+struct dec_init_ctx {
+	struct cxl_hdm *cxlhdm;
+	int ndx;
+};
+
+static int set_decoders(struct device *dev, void *data)
+{
+	struct cxl_decoder *cxld;
+	struct dec_init_ctx *ctx;
+	struct cxl_hdm *cxlhdm;
+	int dec;
+
+	if (!is_cxl_decoder(dev))
+		return 0;
+
+	cxld = to_cxl_decoder(dev);
+
+	ctx = data;
+
+	cxlhdm = ctx->cxlhdm;
+	dec = ctx->ndx++;
+	cxlhdm->decoders.cxld[dec] = cxld;
+
+	if (cxld->flags & CXL_DECODER_F_ENABLE) {
+		dev_dbg(dev, "Not adding to free decoders\n");
+		return 0;
+	}
+
+	set_bit(dec, cxlhdm->decoders.free_mask);
+
+	dev_dbg(dev, "Adding to free decoder list\n");
+
+	return 0;
+}
+
 static int cxl_port_probe(struct device *dev)
 {
 	struct cxl_port *port = to_cxl_port(dev);
+	int rc, decoder_count = 0;
+	struct dec_init_ctx ctx;
 	struct cxl_hdm *cxlhdm;
-	int rc;
 
 	if (is_cxl_endpoint(port)) {
 		struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport);
@@ -61,6 +105,22 @@ static int cxl_port_probe(struct device *dev)
 		return rc;
 	}
 
+	device_for_each_child(dev, &decoder_count, count_decoders);
+
+	cxlhdm->decoders.free_mask =
+		devm_bitmap_zalloc(dev, decoder_count, GFP_KERNEL);
+	cxlhdm->decoders.count = decoder_count;
+
+	ctx.cxlhdm = cxlhdm;
+	ctx.ndx = 0;
+	if (device_for_each_child(dev, &ctx, set_decoders))
+		return -ENXIO;
+
+	dev_set_drvdata(dev, cxlhdm);
+
+	dev_dbg(dev, "Setup complete. Free decoders %*pb\n",
+		cxlhdm->decoders.count, &cxlhdm->decoders.free_mask);
+
 	return 0;
 }
 
diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
index d2f6c990c8a8..145d7bb02714 100644
--- a/drivers/cxl/region.c
+++ b/drivers/cxl/region.c
@@ -359,21 +359,59 @@ static bool has_switch(const struct cxl_region *cxlr)
 	return false;
 }
 
+static struct cxl_decoder *get_decoder(struct cxl_region *cxlr,
+				       struct cxl_port *p)
+{
+	struct cxl_decoder *cxld;
+
+	cxld = cxl_get_decoder(p);
+	if (IS_ERR(cxld)) {
+		dev_dbg(&cxlr->dev, "Couldn't get decoder for %s\n",
+			dev_name(&p->dev));
+		return cxld;
+	}
+
+	cxld->decoder_range = (struct range){ .start = cxlr->res->start,
+					      .end = cxlr->res->end };
+
+	list_add_tail(&cxld->region_link,
+		      (struct list_head *)&cxlr->staged_list);
+
+	return cxld;
+}
+
+static bool simple_config(struct cxl_region *cxlr, struct cxl_port *hb)
+{
+	struct cxl_decoder *cxld;
+
+	cxld = get_decoder(cxlr, hb);
+	if (IS_ERR(cxld))
+		return false;
+
+	cxld->interleave_ways = 1;
+	cxld->interleave_granularity = region_granularity(cxlr);
+	cxld->target[0] = get_rp(cxlr->config.targets[0]);
+	return true;
+}
+
 /**
  * region_hb_rp_config_valid() - determine root port ordering is correct
  * @cxlr: Region to validate
  * @rootd: root decoder for this @cxlr
+ * @state_update: Whether or not to update port state
  *
  * The algorithm is outlined in 2.13.15 "Verify HB root port configuration
  * sequence" of the CXL Memory Device SW Guide (Rev1p0).
  *
  * Returns true if the configuration is valid.
  */
-static bool region_hb_rp_config_valid(const struct cxl_region *cxlr,
-				      const struct cxl_decoder *rootd)
+static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
+				      const struct cxl_decoder *rootd,
+				      bool state_update)
 {
 	const int num_root_ports = get_num_root_ports(cxlr);
 	struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];
+	struct cxl_decoder *cxld, *c;
 	int hb_count, i;
 
 	hb_count = get_unique_hostbridges(cxlr, hbs);
@@ -386,8 +424,8 @@ static bool region_hb_rp_config_valid(const struct cxl_region *cxlr,
 	 * Are all devices in this region on the same CXL Host Bridge
 	 * Root Port?
 	 */
-	if (num_root_ports == 1 && !has_switch(cxlr))
-		return true;
+	if (num_root_ports == 1 && !has_switch(cxlr) && state_update)
+		return simple_config(cxlr, hbs[0]);
 
 	for (i = 0; i < hb_count; i++) {
 		int idx, position_mask;
@@ -397,6 +435,20 @@ static bool region_hb_rp_config_valid(const struct cxl_region *cxlr,
 		/* Get next CXL Host Bridge this region spans */
 		hb = hbs[i];
 
+		if (state_update) {
+			cxld = get_decoder(cxlr, hb);
+			if (IS_ERR(cxld)) {
+				dev_dbg(&cxlr->dev,
+					"Couldn't get decoder for %s\n",
+					dev_name(&hb->dev));
+				goto err;
+			}
+			cxld->interleave_ways = 0;
+			cxld->interleave_granularity = region_granularity(cxlr);
+		} else {
+			cxld = NULL;
+		}
+
 		/*
 		 * Calculate the position mask: NumRootPorts = 2^PositionMask
 		 * for this region.
@@ -432,13 +484,20 @@ static bool region_hb_rp_config_valid(const struct cxl_region *cxlr,
 				if ((idx & position_mask) != port_grouping) {
 					dev_dbg(&cxlr->dev,
 						"One or more devices are not connected to the correct Host Bridge Root Port\n");
-					return false;
+					goto err;
 				}
 			}
 		}
 	}
 
 	return true;
+
+err:
+	dev_dbg(&cxlr->dev, "Couldn't get decoder for region\n");
+	list_for_each_entry_safe(cxld, c, &cxlr->staged_list, region_link)
+		cxl_put_decoder(cxld);
+
+	return false;
 }
 
 /**
@@ -454,7 +513,7 @@ static bool rootd_contains(const struct cxl_region *cxlr,
 }
 
 static bool rootd_valid(const struct cxl_region *cxlr,
-			const struct cxl_decoder *rootd)
+			const struct cxl_decoder *rootd, bool state_update)
 {
 	const struct cxl_memdev *endpoint = cxlr->config.targets[0];
 
@@ -467,7 +526,8 @@ static bool rootd_valid(const struct cxl_region *cxlr,
 	if (!region_xhb_config_valid(cxlr, rootd))
 		return false;
 
-	if (!region_hb_rp_config_valid(cxlr, rootd))
+	if (!region_hb_rp_config_valid((struct cxl_region *)cxlr, rootd,
+				       state_update))
 		return false;
 
 	if (!rootd_contains(cxlr, rootd))
@@ -490,7 +550,7 @@ static int rootd_match(struct device *dev, void *data)
 	if (!is_root_decoder(dev))
 		return 0;
 
-	return !!rootd_valid(cxlr, to_cxl_decoder(dev));
+	return !!rootd_valid(cxlr, to_cxl_decoder(dev), false);
 }
 
 /*
@@ -513,10 +573,39 @@ static struct cxl_decoder *find_rootd(const struct cxl_region *cxlr,
 	return NULL;
 }
 
-static int collect_ep_decoders(const struct cxl_region *cxlr)
+static void cleanup_staged_decoders(struct cxl_region *cxlr)
 {
-	/* TODO: */
+	struct cxl_decoder *cxld, *d;
+
+	list_for_each_entry_safe(cxld, d, &cxlr->staged_list, region_link) {
+		cxl_put_decoder(cxld);
+		list_del_init(&cxld->region_link);
+	}
+}
+
+static int collect_ep_decoders(struct cxl_region *cxlr)
+{
+	struct cxl_memdev *ep;
+	int i, rc = 0;
+
+	for_each_cxl_endpoint(ep, cxlr, i) {
+		struct cxl_decoder *cxld;
+
+		cxld = get_decoder(cxlr, ep->port);
+		if (IS_ERR(cxld)) {
+			rc = PTR_ERR(cxld);
+			goto err;
+		}
+
+		cxld->interleave_granularity = region_granularity(cxlr);
+		cxld->interleave_ways = region_ways(cxlr);
+	}
+
 	return 0;
+
+err:
+	cleanup_staged_decoders(cxlr);
+	return rc;
 }
 
 static int bind_region(const struct cxl_region *cxlr)
@@ -559,7 +648,7 @@ static int cxl_region_probe(struct device *dev)
 		return -ENXIO;
 	}
 
-	if (!rootd_valid(cxlr, rootd)) {
+	if (!rootd_valid(cxlr, rootd, true)) {
 		dev_err(dev, "Picked invalid rootd\n");
 		return -ENXIO;
 	}
@@ -574,14 +663,18 @@ static int cxl_region_probe(struct device *dev)
 
 	ret = collect_ep_decoders(cxlr);
 	if (ret)
-		return ret;
+		goto err;
 
 	ret = bind_region(cxlr);
-	if (!ret) {
-		cxlr->active = true;
-		dev_info(dev, "Bound");
-	}
+	if (ret)
+		goto err;
 
+	cxlr->active = true;
+	dev_info(dev, "Bound");
+	return 0;
+
+err:
+	cleanup_staged_decoders(cxlr);
 	return ret;
 }
 
diff --git a/drivers/cxl/region.h b/drivers/cxl/region.h
index 00a6dc729c26..fc15abaeb638 100644
--- a/drivers/cxl/region.h
+++ b/drivers/cxl/region.h
@@ -14,6 +14,9 @@
  * @list: Node in decoder's region list.
  * @res: Resource this region carves out of the platform decode range.
  * @active: If the region has been activated.
+ * @staged_list: All decoders staged for programming.
+ * @commit_list: All decoders programmed for this region's parameters.
+ *
  * @config: HDM decoder program config
  * @config.size: Size of the region determined from LSA or userspace.
  * @config.uuid: The UUID for this region.
@@ -27,6 +30,8 @@ struct cxl_region {
 	struct list_head list;
 	struct resource *res;
 	bool active;
+	struct list_head staged_list;
+	struct list_head commit_list;
 
 	struct {
 		u64 size;
-- 
2.35.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 10/14] cxl/region: Collect host bridge decoders
  2022-01-28  0:26 [PATCH v3 00/14] CXL Region driver Ben Widawsky
                   ` (8 preceding siblings ...)
  2022-01-28  0:27 ` [PATCH v3 09/14] cxl/region: Add infrastructure for decoder programming Ben Widawsky
@ 2022-01-28  0:27 ` Ben Widawsky
  2022-02-01 18:21   ` Jonathan Cameron
  2022-02-18 23:42   ` Dan Williams
  2022-01-28  0:27 ` [PATCH v3 11/14] cxl/region: Add support for single switch level Ben Widawsky
                   ` (3 subsequent siblings)
  13 siblings, 2 replies; 70+ messages in thread
From: Ben Widawsky @ 2022-01-28  0:27 UTC (permalink / raw)
  To: linux-cxl
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

Part of host bridge verification in the CXL Type 3 Memory Device
Software Guide calculates the host bridge interleave target list (6th
step in the flow chart), ie. verification and state update are done in
the same step. Host bridge verification is already in place, so go ahead
and store the decoders with their target lists.

Switches are implemented in a separate patch.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/region.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
index 145d7bb02714..b8982be13bfe 100644
--- a/drivers/cxl/region.c
+++ b/drivers/cxl/region.c
@@ -428,6 +428,7 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
 		return simple_config(cxlr, hbs[0]);
 
 	for (i = 0; i < hb_count; i++) {
+		struct cxl_decoder *cxld;
 		int idx, position_mask;
 		struct cxl_dport *rp;
 		struct cxl_port *hb;
@@ -486,6 +487,18 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
 						"One or more devices are not connected to the correct Host Bridge Root Port\n");
 					goto err;
 				}
+
+				if (!state_update)
+					continue;
+
+				if (dev_WARN_ONCE(&cxld->dev,
+						  port_grouping >= cxld->nr_targets,
+						  "Invalid port grouping %d/%d\n",
+						  port_grouping, cxld->nr_targets))
+					goto err;
+
+				cxld->interleave_ways++;
+				cxld->target[port_grouping] = get_rp(ep);
 			}
 		}
 	}
@@ -538,7 +551,7 @@ static bool rootd_valid(const struct cxl_region *cxlr,
 
 struct rootd_context {
 	const struct cxl_region *cxlr;
-	struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];
+	const struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];
 	int count;
 };
 
@@ -564,7 +577,7 @@ static struct cxl_decoder *find_rootd(const struct cxl_region *cxlr,
 	struct rootd_context ctx;
 	struct device *ret;
 
-	ctx.cxlr = cxlr;
+	ctx.cxlr = (struct cxl_region *)cxlr;
 
 	ret = device_find_child((struct device *)&root->dev, &ctx, rootd_match);
 	if (ret)
-- 
2.35.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 11/14] cxl/region: Add support for single switch level
  2022-01-28  0:26 [PATCH v3 00/14] CXL Region driver Ben Widawsky
                   ` (9 preceding siblings ...)
  2022-01-28  0:27 ` [PATCH v3 10/14] cxl/region: Collect host bridge decoders Ben Widawsky
@ 2022-01-28  0:27 ` Ben Widawsky
  2022-02-01 18:26   ` Jonathan Cameron
  2022-02-15 16:10   ` Jonathan Cameron
  2022-01-28  0:27 ` [PATCH v3 12/14] cxl: Program decoders for regions Ben Widawsky
                   ` (2 subsequent siblings)
  13 siblings, 2 replies; 70+ messages in thread
From: Ben Widawsky @ 2022-01-28  0:27 UTC (permalink / raw)
  To: linux-cxl
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

CXL switches have HDM decoders just like host bridges and endpoints.
Their programming works in a similar fashion.

The spec does not prohibit multiple levels of switches, however, those
are not implemented at this time.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/cxl.h    |  5 ++++
 drivers/cxl/region.c | 61 ++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 8ace6cca0776..d70d8c85d05f 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -96,6 +96,11 @@ static inline u8 cxl_to_ig(u16 g)
 	return ilog2(g) - 8;
 }
 
+static inline int cxl_to_ways(u8 ways)
+{
+	return 1 << ways;
+}
+
 static inline bool cxl_is_interleave_ways_valid(int iw)
 {
 	switch (iw) {
diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
index b8982be13bfe..f748060733dd 100644
--- a/drivers/cxl/region.c
+++ b/drivers/cxl/region.c
@@ -359,6 +359,23 @@ static bool has_switch(const struct cxl_region *cxlr)
 	return false;
 }
 
+static bool has_multi_switch(const struct cxl_region *cxlr)
+{
+	struct cxl_memdev *ep;
+	int i;
+
+	for_each_cxl_endpoint(ep, cxlr, i)
+		if (ep->port->depth > 3)
+			return true;
+
+	return false;
+}
+
+static struct cxl_port *get_switch(struct cxl_memdev *ep)
+{
+	return to_cxl_port(ep->port->dev.parent);
+}
+
 static struct cxl_decoder *get_decoder(struct cxl_region *cxlr,
 				       struct cxl_port *p)
 {
@@ -409,6 +426,8 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
 				      const struct cxl_decoder *rootd,
 				      bool state_update)
 {
+	const int region_ig = cxl_to_ig(cxlr->config.interleave_granularity);
+	const int region_eniw = cxl_to_eniw(cxlr->config.interleave_ways);
 	const int num_root_ports = get_num_root_ports(cxlr);
 	struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];
 	struct cxl_decoder *cxld, *c;
@@ -416,8 +435,12 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
 
 	hb_count = get_unique_hostbridges(cxlr, hbs);
 
-	/* TODO: Switch support */
-	if (has_switch(cxlr))
+	/* TODO: support multiple levels of switches */
+	if (has_multi_switch(cxlr))
+		return false;
+
+	/* TODO: x3 interleave for switches is hard. */
+	if (has_switch(cxlr) && !is_power_of_2(region_ways(cxlr)))
 		return false;
 
 	/*
@@ -470,8 +493,14 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
 		list_for_each_entry(rp, &hb->dports, list) {
 			struct cxl_memdev *ep;
 			int port_grouping = -1;
+			int target_ndx;
 
 			for_each_cxl_endpoint_hb(ep, cxlr, hb, idx) {
+				struct cxl_decoder *switch_cxld;
+				struct cxl_dport *target;
+				struct cxl_port *switch_port;
+				bool found = false;
+
 				if (get_rp(ep) != rp)
 					continue;
 
@@ -499,6 +528,34 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
 
 				cxld->interleave_ways++;
 				cxld->target[port_grouping] = get_rp(ep);
+
+				/*
+				 * At least one switch is connected here if the endpoint
+				 * has a depth > 2
+				 */
+				if (ep->port->depth == 2)
+					continue;
+
+				/* Check the staged list to see if this
+				 * port has already been added
+				 */
+				switch_port = get_switch(ep);
+				list_for_each_entry(switch_cxld, &cxlr->staged_list, region_link) {
+					if (to_cxl_port(switch_cxld->dev.parent) == switch_port)
+						found = true;
+				}
+
+				if (found) {
+					target = cxl_find_dport_by_dev(switch_port, ep->dev.parent->parent);
+					switch_cxld->target[target_ndx++] = target;
+					continue;
+				}
+
+				target_ndx = 0;
+
+				switch_cxld = get_decoder(cxlr, switch_port);
+				switch_cxld->interleave_ways++;
+				switch_cxld->interleave_granularity = cxl_to_ways(region_ig + region_eniw);
 			}
 		}
 	}
-- 
2.35.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 12/14] cxl: Program decoders for regions
  2022-01-28  0:26 [PATCH v3 00/14] CXL Region driver Ben Widawsky
                   ` (10 preceding siblings ...)
  2022-01-28  0:27 ` [PATCH v3 11/14] cxl/region: Add support for single switch level Ben Widawsky
@ 2022-01-28  0:27 ` Ben Widawsky
  2022-02-24  0:08   ` Dan Williams
  2022-01-28  0:27 ` [PATCH v3 13/14] cxl/pmem: Convert nvdimm bridge API to use dev Ben Widawsky
  2022-01-28  0:27 ` [PATCH v3 14/14] cxl/region: Create an nd_region Ben Widawsky
  13 siblings, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-01-28  0:27 UTC (permalink / raw)
  To: linux-cxl
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

Configure and commit the HDM decoders for the region. Since the region
driver already was able to walk the topology and build the list of
needed decoders, all that was needed to finish region setup was to
actually write the HDM decoder MMIO.

CXL regions appear as linear addresses in the system's physical address
space. CXL memory devices comprise the storage for the region. In order
for traffic to be properly routed to the memory devices in the region, a
set of Host-manged Device Memory decoders must be present. The decoders
are a piece of hardware defined in the CXL specification.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>

---
Changes since v2:
- Fix unwind issue in bind_region introduced in v2
---
 drivers/cxl/core/hdm.c | 209 +++++++++++++++++++++++++++++++++++++++++
 drivers/cxl/cxl.h      |   3 +
 drivers/cxl/region.c   |  72 +++++++++++---
 3 files changed, 272 insertions(+), 12 deletions(-)

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index a28369f264da..66c08d69f7a6 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -268,3 +268,212 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
 	return 0;
 }
 EXPORT_SYMBOL_NS_GPL(devm_cxl_enumerate_decoders, CXL);
+
+#define COMMIT_TIMEOUT_MS 10
+static int wait_for_commit(struct cxl_decoder *cxld)
+{
+	const unsigned long end = jiffies + msecs_to_jiffies(COMMIT_TIMEOUT_MS);
+	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
+	void __iomem *hdm_decoder;
+	struct cxl_hdm *cxlhdm;
+	u32 ctrl;
+
+	cxlhdm = dev_get_drvdata(&port->dev);
+	hdm_decoder = cxlhdm->regs.hdm_decoder;
+
+	while (1) {
+		ctrl = readl(hdm_decoder +
+			     CXL_HDM_DECODER0_CTRL_OFFSET(cxld->id));
+		if (FIELD_GET(CXL_HDM_DECODER0_CTRL_COMMITTED, ctrl))
+			break;
+
+		if (time_after(jiffies, end)) {
+			dev_err(&cxld->dev, "HDM decoder commit timeout %x\n",
+				ctrl);
+			return -ETIMEDOUT;
+		}
+		if ((ctrl & CXL_HDM_DECODER0_CTRL_COMMIT_ERROR) != 0) {
+			dev_err(&cxld->dev, "HDM decoder commit error %x\n",
+				ctrl);
+			return -ENXIO;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * cxl_commit_decoder() - Program a configured cxl_decoder
+ * @cxld: The preconfigured cxl decoder.
+ *
+ * A cxl decoder that is to be committed should have been earmarked as enabled.
+ * This mechanism acts as a soft reservation on the decoder.
+ *
+ * Returns 0 if commit was successful, negative error code otherwise.
+ */
+int cxl_commit_decoder(struct cxl_decoder *cxld)
+{
+	u32 ctrl, tl_lo, tl_hi, base_lo, base_hi, size_lo, size_hi;
+	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
+	void __iomem *hdm_decoder;
+	struct cxl_hdm *cxlhdm;
+	int rc;
+
+	/*
+	 * Decoder flags are entirely software controlled and therefore this
+	 * case is purely a driver bug.
+	 */
+	if (dev_WARN_ONCE(&port->dev, (cxld->flags & CXL_DECODER_F_ENABLE) != 0,
+			  "Invalid %s enable state\n", dev_name(&cxld->dev)))
+		return -ENXIO;
+
+	cxlhdm = dev_get_drvdata(&port->dev);
+	hdm_decoder = cxlhdm->regs.hdm_decoder;
+	ctrl = readl(hdm_decoder + CXL_HDM_DECODER0_CTRL_OFFSET(cxld->id));
+
+	/*
+	 * A decoder that's currently active cannot be changed without the
+	 * system being quiesced. While the driver should prevent against this,
+	 * for a variety of reasons the hardware might not be in sync with the
+	 * hardware and so, do not splat on error.
+	 */
+	size_hi = readl(hdm_decoder +
+			CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(cxld->id));
+	size_lo =
+		readl(hdm_decoder + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(cxld->id));
+	if (FIELD_GET(CXL_HDM_DECODER0_CTRL_COMMITTED, ctrl) &&
+	    (size_lo + size_hi)) {
+		dev_err(&port->dev, "Tried to change an active decoder (%s)\n",
+			dev_name(&cxld->dev));
+		return -EBUSY;
+	}
+
+	u32p_replace_bits(&ctrl, cxl_to_ig(cxld->interleave_granularity),
+			  CXL_HDM_DECODER0_CTRL_IG_MASK);
+	u32p_replace_bits(&ctrl, cxl_to_eniw(cxld->interleave_ways),
+			  CXL_HDM_DECODER0_CTRL_IW_MASK);
+	u32p_replace_bits(&ctrl, 1, CXL_HDM_DECODER0_CTRL_COMMIT);
+
+	/* TODO: set based on type */
+	u32p_replace_bits(&ctrl, 1, CXL_HDM_DECODER0_CTRL_TYPE);
+
+	base_lo = GENMASK(31, 28) & lower_32_bits(cxld->decoder_range.start);
+	base_hi = upper_32_bits(cxld->decoder_range.start);
+
+	size_lo = GENMASK(31, 28) & (u32)(range_len(&cxld->decoder_range));
+	size_hi = upper_32_bits(range_len(&cxld->decoder_range) >> 32);
+
+	if (cxld->nr_targets > 0) {
+		tl_hi = 0;
+
+		tl_lo = FIELD_PREP(GENMASK(7, 0), cxld->target[0]->port_id);
+
+		if (cxld->interleave_ways > 1)
+			tl_lo |= FIELD_PREP(GENMASK(15, 8),
+					    cxld->target[1]->port_id);
+		if (cxld->interleave_ways > 2)
+			tl_lo |= FIELD_PREP(GENMASK(23, 16),
+					    cxld->target[2]->port_id);
+		if (cxld->interleave_ways > 3)
+			tl_lo |= FIELD_PREP(GENMASK(31, 24),
+					    cxld->target[3]->port_id);
+		if (cxld->interleave_ways > 4)
+			tl_hi |= FIELD_PREP(GENMASK(7, 0),
+					    cxld->target[4]->port_id);
+		if (cxld->interleave_ways > 5)
+			tl_hi |= FIELD_PREP(GENMASK(15, 8),
+					    cxld->target[5]->port_id);
+		if (cxld->interleave_ways > 6)
+			tl_hi |= FIELD_PREP(GENMASK(23, 16),
+					    cxld->target[6]->port_id);
+		if (cxld->interleave_ways > 7)
+			tl_hi |= FIELD_PREP(GENMASK(31, 24),
+					    cxld->target[7]->port_id);
+
+		writel(tl_hi, hdm_decoder + CXL_HDM_DECODER0_TL_HIGH(cxld->id));
+		writel(tl_lo, hdm_decoder + CXL_HDM_DECODER0_TL_LOW(cxld->id));
+	} else {
+		/* Zero out skip list for devices */
+		writel(0, hdm_decoder + CXL_HDM_DECODER0_TL_HIGH(cxld->id));
+		writel(0, hdm_decoder + CXL_HDM_DECODER0_TL_LOW(cxld->id));
+	}
+
+	writel(size_hi,
+	       hdm_decoder + CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(cxld->id));
+	writel(size_lo,
+	       hdm_decoder + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(cxld->id));
+	writel(base_hi,
+	       hdm_decoder + CXL_HDM_DECODER0_BASE_HIGH_OFFSET(cxld->id));
+	writel(base_lo,
+	       hdm_decoder + CXL_HDM_DECODER0_BASE_LOW_OFFSET(cxld->id));
+	writel(ctrl, hdm_decoder + CXL_HDM_DECODER0_CTRL_OFFSET(cxld->id));
+
+	rc = wait_for_commit(cxld);
+	if (rc)
+		return rc;
+
+	cxld->flags |= CXL_DECODER_F_ENABLE;
+
+#define DPORT_TL_STR "%d %d %d %d %d %d %d %d"
+#define DPORT(i)                                                               \
+	(cxld->nr_targets && cxld->interleave_ways > (i)) ?                    \
+		cxld->target[(i)]->port_id :                                   \
+		      -1
+#define DPORT_TL                                                               \
+	DPORT(0), DPORT(1), DPORT(2), DPORT(3), DPORT(4), DPORT(5), DPORT(6),  \
+		DPORT(7)
+
+	dev_dbg(&cxld->dev,
+		"%s (depth %d)\n\tBase %pa\n\tSize %llu\n\tIG %u (%ub)\n\tENIW %u (x%u)\n\tTargetList: \n" DPORT_TL_STR,
+		dev_name(&port->dev), port->depth, &cxld->decoder_range.start,
+		range_len(&cxld->decoder_range),
+		cxl_to_ig(cxld->interleave_granularity),
+		cxld->interleave_granularity,
+		cxl_to_eniw(cxld->interleave_ways), cxld->interleave_ways,
+		DPORT_TL);
+#undef DPORT_TL
+#undef DPORT
+#undef DPORT_TL_STR
+	return 0;
+}
+EXPORT_SYMBOL_GPL(cxl_commit_decoder);
+
+/**
+ * cxl_disable_decoder() - Disables a decoder
+ * @cxld: The active cxl decoder.
+ *
+ * CXL decoders (as of 2.0 spec) have no way to deactivate them other than to
+ * set the size of the HDM to 0. This function will clear all registers, and if
+ * the decoder is active, commit the 0'd out registers.
+ */
+void cxl_disable_decoder(struct cxl_decoder *cxld)
+{
+	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
+	void __iomem *hdm_decoder;
+	struct cxl_hdm *cxlhdm;
+	u32 ctrl;
+
+	cxlhdm = dev_get_drvdata(&port->dev);
+	hdm_decoder = cxlhdm->regs.hdm_decoder;
+	ctrl = readl(hdm_decoder + CXL_HDM_DECODER0_CTRL_OFFSET(cxld->id));
+
+	if (dev_WARN_ONCE(&port->dev, (cxld->flags & CXL_DECODER_F_ENABLE) == 0,
+			  "Invalid decoder enable state\n"))
+		return;
+
+	cxld->flags &= ~CXL_DECODER_F_ENABLE;
+
+	/* There's no way to "uncommit" a committed decoder, only 0 size it */
+	writel(0, hdm_decoder + CXL_HDM_DECODER0_TL_HIGH(cxld->id));
+	writel(0, hdm_decoder + CXL_HDM_DECODER0_TL_LOW(cxld->id));
+	writel(0, hdm_decoder + CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(cxld->id));
+	writel(0, hdm_decoder + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(cxld->id));
+	writel(0, hdm_decoder + CXL_HDM_DECODER0_BASE_HIGH_OFFSET(cxld->id));
+	writel(0, hdm_decoder + CXL_HDM_DECODER0_BASE_LOW_OFFSET(cxld->id));
+
+	/* If the device isn't actually active, just zero out all the fields */
+	if (FIELD_GET(CXL_HDM_DECODER0_CTRL_COMMITTED, ctrl))
+		writel(CXL_HDM_DECODER0_CTRL_COMMIT,
+		       hdm_decoder + CXL_HDM_DECODER0_CTRL_OFFSET(cxld->id));
+}
+EXPORT_SYMBOL_GPL(cxl_disable_decoder);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index d70d8c85d05f..f9dab312ed26 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -55,6 +55,7 @@
 #define   CXL_HDM_DECODER0_CTRL_LOCK BIT(8)
 #define   CXL_HDM_DECODER0_CTRL_COMMIT BIT(9)
 #define   CXL_HDM_DECODER0_CTRL_COMMITTED BIT(10)
+#define   CXL_HDM_DECODER0_CTRL_COMMIT_ERROR BIT(11)
 #define   CXL_HDM_DECODER0_CTRL_TYPE BIT(12)
 #define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24)
 #define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28)
@@ -416,6 +417,8 @@ struct cxl_dport *devm_cxl_add_dport(struct cxl_port *port,
 struct cxl_dport *cxl_find_dport_by_dev(struct cxl_port *port,
 					const struct device *dev);
 struct cxl_port *ep_find_cxl_port(struct cxl_memdev *cxlmd, unsigned int depth);
+int cxl_commit_decoder(struct cxl_decoder *cxld);
+void cxl_disable_decoder(struct cxl_decoder *cxld);
 
 struct cxl_decoder *to_cxl_decoder(struct device *dev);
 bool is_cxl_decoder(struct device *dev);
diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
index f748060733dd..ac290677534d 100644
--- a/drivers/cxl/region.c
+++ b/drivers/cxl/region.c
@@ -678,10 +678,52 @@ static int collect_ep_decoders(struct cxl_region *cxlr)
 	return rc;
 }
 
-static int bind_region(const struct cxl_region *cxlr)
+static int bind_region(struct cxl_region *cxlr)
 {
-	/* TODO: */
-	return 0;
+	struct cxl_decoder *cxld, *d;
+	int rc;
+
+	list_for_each_entry_safe(cxld, d, &cxlr->staged_list, region_link) {
+		rc = cxl_commit_decoder(cxld);
+		if (!rc) {
+			list_move_tail(&cxld->region_link, &cxlr->commit_list);
+		} else {
+			dev_dbg(&cxlr->dev, "Failed to commit %s\n",
+				dev_name(&cxld->dev));
+			break;
+		}
+	}
+
+	list_for_each_entry_safe(cxld, d, &cxlr->commit_list, region_link) {
+		if (rc) {
+			cxl_disable_decoder(cxld);
+			list_del(&cxld->region_link);
+		}
+	}
+
+	if (rc)
+		cleanup_staged_decoders(cxlr);
+
+	BUG_ON(!list_empty(&cxlr->staged_list));
+	return rc;
+}
+
+static void region_unregister(void *dev)
+{
+	struct cxl_region *region = to_cxl_region(dev);
+	struct cxl_decoder *cxld, *d;
+
+	if (dev_WARN_ONCE(dev, !list_empty(&region->staged_list),
+			  "Decoders still staged"))
+		cleanup_staged_decoders(region);
+
+	/* TODO: teardown the nd_region */
+
+	list_for_each_entry_safe(cxld, d, &region->commit_list, region_link) {
+		cxl_disable_decoder(cxld);
+		list_del(&cxld->region_link);
+		cxl_put_decoder(cxld);
+	}
 }
 
 static int cxl_region_probe(struct device *dev)
@@ -732,20 +774,26 @@ static int cxl_region_probe(struct device *dev)
 		put_device(&ours->dev);
 
 	ret = collect_ep_decoders(cxlr);
-	if (ret)
-		goto err;
+	if (ret) {
+		cleanup_staged_decoders(cxlr);
+		return ret;
+	}
 
 	ret = bind_region(cxlr);
-	if (ret)
-		goto err;
+	if (ret) {
+		/* bind_region should cleanup after itself */
+		if (dev_WARN_ONCE(dev, !list_empty(&cxlr->staged_list),
+				  "Region bind failed to cleanup staged decoders\n"))
+			cleanup_staged_decoders(cxlr);
+		if (dev_WARN_ONCE(dev, !list_empty(&cxlr->commit_list),
+				  "Region bind failed to cleanup committed decoders\n"))
+			region_unregister(&cxlr->dev);
+		return ret;
+	}
 
 	cxlr->active = true;
 	dev_info(dev, "Bound");
-	return 0;
-
-err:
-	cleanup_staged_decoders(cxlr);
-	return ret;
+	return devm_add_action_or_reset(dev, region_unregister, dev);
 }
 
 static struct cxl_driver cxl_region_driver = {
-- 
2.35.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 13/14] cxl/pmem: Convert nvdimm bridge API to use dev
  2022-01-28  0:26 [PATCH v3 00/14] CXL Region driver Ben Widawsky
                   ` (11 preceding siblings ...)
  2022-01-28  0:27 ` [PATCH v3 12/14] cxl: Program decoders for regions Ben Widawsky
@ 2022-01-28  0:27 ` Ben Widawsky
  2022-01-28  0:27 ` [PATCH v3 14/14] cxl/region: Create an nd_region Ben Widawsky
  13 siblings, 0 replies; 70+ messages in thread
From: Ben Widawsky @ 2022-01-28  0:27 UTC (permalink / raw)
  To: linux-cxl
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

The cxl_pmem driver specific cxl_nvdimm structure isn't a suitable
parameter for an exported API that can be used by other drivers.
Instead, use a dev structure, which should be woven into any caller
using this API. This will allow for either the nvdimm's dev, or the
memdev's dev to be used.

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
Changes since v2:
- Added kdoc to cxl_find_nvdimm_bridge()
---
 drivers/cxl/core/pmem.c | 12 +++++++++---
 drivers/cxl/cxl.h       |  2 +-
 drivers/cxl/pmem.c      |  2 +-
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/cxl/core/pmem.c b/drivers/cxl/core/pmem.c
index 8de240c4d96b..7e431667ade1 100644
--- a/drivers/cxl/core/pmem.c
+++ b/drivers/cxl/core/pmem.c
@@ -62,10 +62,16 @@ static int match_nvdimm_bridge(struct device *dev, void *data)
 	return is_cxl_nvdimm_bridge(dev);
 }
 
-struct cxl_nvdimm_bridge *cxl_find_nvdimm_bridge(struct cxl_nvdimm *cxl_nvd)
+/**
+ * cxl_find_nvdimm_bridge() - Find an nvdimm bridge for a given device
+ * @dev: The device to find a bridge for. This device must be in the part of the
+ *	 CXL topology which is being bridged.
+ *
+ * Return: bridge device that hosts cxl_nvdimm objects if found, else NULL.
+ */
+struct cxl_nvdimm_bridge *cxl_find_nvdimm_bridge(struct device *dev)
 {
-	struct cxl_port *port = find_cxl_root(&cxl_nvd->dev);
-	struct device *dev;
+	struct cxl_port *port = find_cxl_root(dev);
 
 	if (!port)
 		return NULL;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index f9dab312ed26..062654204eca 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -479,7 +479,7 @@ struct cxl_nvdimm *to_cxl_nvdimm(struct device *dev);
 bool is_cxl_nvdimm(struct device *dev);
 bool is_cxl_nvdimm_bridge(struct device *dev);
 int devm_cxl_add_nvdimm(struct device *host, struct cxl_memdev *cxlmd);
-struct cxl_nvdimm_bridge *cxl_find_nvdimm_bridge(struct cxl_nvdimm *cxl_nvd);
+struct cxl_nvdimm_bridge *cxl_find_nvdimm_bridge(struct device *dev);
 
 /*
  * Unit test builds overrides this to __weak, find the 'strong' version
diff --git a/drivers/cxl/pmem.c b/drivers/cxl/pmem.c
index 15ad666ab03e..fabdb0c6dbf2 100644
--- a/drivers/cxl/pmem.c
+++ b/drivers/cxl/pmem.c
@@ -39,7 +39,7 @@ static int cxl_nvdimm_probe(struct device *dev)
 	struct nvdimm *nvdimm;
 	int rc;
 
-	cxl_nvb = cxl_find_nvdimm_bridge(cxl_nvd);
+	cxl_nvb = cxl_find_nvdimm_bridge(&cxl_nvd->dev);
 	if (!cxl_nvb)
 		return -ENXIO;
 
-- 
2.35.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 14/14] cxl/region: Create an nd_region
  2022-01-28  0:26 [PATCH v3 00/14] CXL Region driver Ben Widawsky
                   ` (12 preceding siblings ...)
  2022-01-28  0:27 ` [PATCH v3 13/14] cxl/pmem: Convert nvdimm bridge API to use dev Ben Widawsky
@ 2022-01-28  0:27 ` Ben Widawsky
  13 siblings, 0 replies; 70+ messages in thread
From: Ben Widawsky @ 2022-01-28  0:27 UTC (permalink / raw)
  To: linux-cxl
  Cc: patches, Ben Widawsky, kernel test robot, Alison Schofield,
	Dan Williams, Ira Weiny, Jonathan Cameron, Vishal Verma,
	Bjorn Helgaas, nvdimm, linux-pci

LIBNVDIMM supports the creation of regions for both persistent and
volatile memory ranges. The cxl_region driver is capable of handling the
CXL side of region creation but will reuse LIBVDIMM for interfacing with
the rest of the kernel.

TODO: CXL regions can go away. As a result the nd_region must also be
torn down.

TODO2: Handle mappings. LIBNVDIMM is capable of being informed about
which parts of devices contribute to a region and validating whether or
not the region is configured properly. To do this properly requires
tracking allocations per device.

Reported-by: kernel test robot <lkp@intel.com> (v2)
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
Changes since v2:
- Check nvb is non-null
- Give a dev_dbg for non-existent nvdimm_bus
---
 drivers/cxl/Kconfig     |  3 ++-
 drivers/cxl/core/pmem.c | 16 ++++++++++++
 drivers/cxl/cxl.h       |  1 +
 drivers/cxl/region.c    | 58 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 742847503c16..054dc78d6f7d 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -99,7 +99,8 @@ config CXL_PORT
 	tristate
 
 config CXL_REGION
-	default CXL_PORT
+	depends on CXL_PMEM
+	default CXL_BUS
 	tristate
 
 endif
diff --git a/drivers/cxl/core/pmem.c b/drivers/cxl/core/pmem.c
index 7e431667ade1..58dc6fba3130 100644
--- a/drivers/cxl/core/pmem.c
+++ b/drivers/cxl/core/pmem.c
@@ -220,6 +220,22 @@ struct cxl_nvdimm *to_cxl_nvdimm(struct device *dev)
 }
 EXPORT_SYMBOL_NS_GPL(to_cxl_nvdimm, CXL);
 
+static int match_cxl_nvdimm(struct device *dev, void *data)
+{
+	return is_cxl_nvdimm(dev);
+}
+
+struct cxl_nvdimm *cxl_find_nvdimm(struct cxl_memdev *cxlmd)
+{
+	struct device *dev;
+
+	dev = device_find_child(&cxlmd->dev, NULL, match_cxl_nvdimm);
+	if (!dev)
+		return NULL;
+	return to_cxl_nvdimm(dev);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_find_nvdimm, CXL);
+
 static struct cxl_nvdimm *cxl_nvdimm_alloc(struct cxl_memdev *cxlmd)
 {
 	struct cxl_nvdimm *cxl_nvd;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 062654204eca..7eb8f36af30b 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -480,6 +480,7 @@ bool is_cxl_nvdimm(struct device *dev);
 bool is_cxl_nvdimm_bridge(struct device *dev);
 int devm_cxl_add_nvdimm(struct device *host, struct cxl_memdev *cxlmd);
 struct cxl_nvdimm_bridge *cxl_find_nvdimm_bridge(struct device *dev);
+struct cxl_nvdimm *cxl_find_nvdimm(struct cxl_memdev *cxlmd);
 
 /*
  * Unit test builds overrides this to __weak, find the 'strong' version
diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
index ac290677534d..be472560fc6a 100644
--- a/drivers/cxl/region.c
+++ b/drivers/cxl/region.c
@@ -708,6 +708,58 @@ static int bind_region(struct cxl_region *cxlr)
 	return rc;
 }
 
+static int connect_to_libnvdimm(struct cxl_region *region)
+{
+	struct nd_region_desc ndr_desc;
+	struct cxl_nvdimm_bridge *nvb;
+	struct nd_region *ndr;
+	int rc = 0;
+
+	nvb = cxl_find_nvdimm_bridge(&region->config.targets[0]->dev);
+	if (!nvb) {
+		dev_dbg(&region->dev, "Couldn't find nvdimm bridge\n");
+		return -ENODEV;
+	}
+
+	device_lock(&nvb->dev);
+	if (!nvb->nvdimm_bus) {
+		dev_dbg(&nvb->dev, "Couldn't find nvdimm bridge's bus\n");
+		rc = -ENXIO;
+		goto out;
+	}
+
+	memset(&ndr_desc, 0, sizeof(ndr_desc));
+
+	ndr_desc.res = region->res;
+
+	ndr_desc.numa_node = memory_add_physaddr_to_nid(region->res->start);
+	ndr_desc.target_node = phys_to_target_node(region->res->start);
+	if (ndr_desc.numa_node == NUMA_NO_NODE) {
+		ndr_desc.numa_node =
+			memory_add_physaddr_to_nid(region->res->start);
+		dev_info(&region->dev,
+			 "changing numa node from %d to %d for CXL region %pR",
+			 NUMA_NO_NODE, ndr_desc.numa_node, region->res);
+	}
+	if (ndr_desc.target_node == NUMA_NO_NODE) {
+		ndr_desc.target_node = ndr_desc.numa_node;
+		dev_info(&region->dev,
+			 "changing target node from %d to %d for CXL region %pR",
+			 NUMA_NO_NODE, ndr_desc.target_node, region->res);
+	}
+
+	ndr = nvdimm_pmem_region_create(nvb->nvdimm_bus, &ndr_desc);
+	if (IS_ERR(ndr))
+		rc = PTR_ERR(ndr);
+	else
+		dev_set_drvdata(&region->dev, ndr);
+
+out:
+	device_unlock(&nvb->dev);
+	put_device(&nvb->dev);
+	return rc;
+}
+
 static void region_unregister(void *dev)
 {
 	struct cxl_region *region = to_cxl_region(dev);
@@ -791,6 +843,12 @@ static int cxl_region_probe(struct device *dev)
 		return ret;
 	}
 
+	ret = connect_to_libnvdimm(cxlr);
+	if (ret) {
+		region_unregister(dev);
+		return ret;
+	}
+
 	cxlr->active = true;
 	dev_info(dev, "Bound");
 	return devm_add_action_or_reset(dev, region_unregister, dev);
-- 
2.35.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 01/14] cxl/region: Add region creation ABI
  2022-01-28  0:26 ` [PATCH v3 01/14] cxl/region: Add region creation ABI Ben Widawsky
@ 2022-01-28 18:14   ` Dan Williams
  2022-01-28 18:59     ` Dan Williams
  2022-02-01 22:42     ` Ben Widawsky
  2022-02-01 15:53   ` Jonathan Cameron
  2022-02-17 17:10   ` [PATCH v4 " Ben Widawsky
  2 siblings, 2 replies; 70+ messages in thread
From: Dan Williams @ 2022-01-28 18:14 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> Regions are created as a child of the decoder that encompasses an
> address space with constraints. Regions have a number of attributes that
> must be configured before the region can be activated.
>
> The ABI is not meant to be secure, but is meant to avoid accidental
> races. As a result, a buggy process may create a region by name that was
> allocated by a different process. However, multiple processes which are
> trying not to race with each other shouldn't need special
> synchronization to do so.
>
> // Allocate a new region name
> region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
>
> // Create a new region by name
> echo $region > /sys/bus/cxl/devices/decoder0.0/create_region

Were I someone coming in cold to this the immediate question about
this example would be "what if userspace races to create the region?".
How about showing the example that this interface requires looping
until the kernel returns success in case userspace races itself to
create the next region? I think this would work for that purpose.

---

while
region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
! echo $region > /sys/bus/cxl/devices/decoder0.0/create_region
do true; done

---

> // Region now exists in sysfs
> stat -t /sys/bus/cxl/devices/decoder0.0/$region
>
> // Delete the region, and name
> echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region
>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
>
> ---
> Changes since v2:
> - Rename 'region' variables to 'cxlr'
> - Update ABI docs for possible actual upstream version
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl       |  24 ++
>  .../driver-api/cxl/memory-devices.rst         |  11 +
>  drivers/cxl/core/Makefile                     |   1 +
>  drivers/cxl/core/core.h                       |   3 +
>  drivers/cxl/core/port.c                       |  16 ++
>  drivers/cxl/core/region.c                     | 208 ++++++++++++++++++
>  drivers/cxl/cxl.h                             |   9 +
>  drivers/cxl/region.h                          |  38 ++++
>  tools/testing/cxl/Kbuild                      |   1 +
>  9 files changed, 311 insertions(+)
>  create mode 100644 drivers/cxl/core/region.c
>  create mode 100644 drivers/cxl/region.h
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 7c2b846521f3..dcc728458936 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -163,3 +163,27 @@ Description:
>                 memory (type-3). The 'target_type' attribute indicates the
>                 current setting which may dynamically change based on what
>                 memory regions are activated in this decode hierarchy.
> +
> +What:          /sys/bus/cxl/devices/decoderX.Y/create_region
> +Date:          August, 2021

Maybe move this to January, 2022?

> +KernelVersion: v5.18
> +Contact:       linux-cxl@vger.kernel.org
> +Description:
> +               Creates a new CXL region. Writing a value of the form
> +               "regionX.Y:Z" will create a new uninitialized region that will
> +               be mapped by the CXL decoderX.Y.

"Write a value of the form 'regionX.Y:Z' to instantiate a new region
within the decode range bounded by decoderX.Y."

> +               Reading from this node will
> +               return a newly allocated region name. In order to create a
> +               region (writing) you must use a value returned from reading the
> +               node.

"The value written must match the current value returned from reading
this attribute. This behavior lets the kernel arbitrate racing
attempts to create a region. The thread that fails to write loops and
tries the next value."

> +               subsequently configured and bound to a region driver before they
> +               can be used.
> +
> +What:          /sys/bus/cxl/devices/decoderX.Y/delete_region
> +Date:          August, 2021
> +KernelVersion: v5.18
> +Contact:       linux-cxl@vger.kernel.org
> +Description:
> +               Deletes the named region. A region must be unbound from the
> +               region driver before being deleted.

....why does it need to be unbound first? device_unregister() triggers
device_release_driver()?

Side note: I am more and more thinking that even though the BIOS may
try to lock some configurations down, if the system owner wants the
region deleted the kernel should probably do everything in its power
to oblige and override the BIOS including secondary bus resets to get
the decoders unlocked. I.e. either the driver needs to hide the delete
region attribute for locked regions, or it needs to give as much power
to root as someone who has physical access and can rip out devices
that are decoding locked ranges. Something we can discuss later, but
every 'disable' and 'delete' interface requires answering the question
"what about 'locked' configs and hot-remove?".

> +               region in the form "regionX.Y:Z". The region's name, allocated
> +               by reading create_region, will also be released.
> diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst
> index db476bb170b6..66ddc58a21b1 100644
> --- a/Documentation/driver-api/cxl/memory-devices.rst
> +++ b/Documentation/driver-api/cxl/memory-devices.rst
> @@ -362,6 +362,17 @@ CXL Core
>  .. kernel-doc:: drivers/cxl/core/mbox.c
>     :doc: cxl mbox
>
> +CXL Regions
> +-----------
> +.. kernel-doc:: drivers/cxl/region.h
> +   :identifiers:
> +
> +.. kernel-doc:: drivers/cxl/core/region.c
> +   :doc: cxl core region
> +
> +.. kernel-doc:: drivers/cxl/core/region.c
> +   :identifiers:
> +
>  External Interfaces
>  ===================
>
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 6d37cd78b151..39ce8f2f2373 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -4,6 +4,7 @@ obj-$(CONFIG_CXL_BUS) += cxl_core.o
>  ccflags-y += -I$(srctree)/drivers/cxl
>  cxl_core-y := port.o
>  cxl_core-y += pmem.o
> +cxl_core-y += region.o
>  cxl_core-y += regs.o
>  cxl_core-y += memdev.o
>  cxl_core-y += mbox.o
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index efbaa851929d..35fd08d560e2 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -10,6 +10,9 @@ extern const struct device_type cxl_memdev_type;
>
>  extern struct attribute_group cxl_base_attribute_group;
>
> +extern struct device_attribute dev_attr_create_region;
> +extern struct device_attribute dev_attr_delete_region;
> +
>  struct cxl_send_command;
>  struct cxl_mem_query_commands;
>  int cxl_query_cmd(struct cxl_memdev *cxlmd,
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 631dec0fa79e..0826208b2bdf 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -215,6 +215,8 @@ static struct attribute_group cxl_decoder_base_attribute_group = {
>  };
>
>  static struct attribute *cxl_decoder_root_attrs[] = {
> +       &dev_attr_create_region.attr,
> +       &dev_attr_delete_region.attr,
>         &dev_attr_cap_pmem.attr,
>         &dev_attr_cap_ram.attr,
>         &dev_attr_cap_type2.attr,
> @@ -267,11 +269,23 @@ static const struct attribute_group *cxl_decoder_endpoint_attribute_groups[] = {
>         NULL,
>  };
>
> +static int delete_region(struct device *dev, void *arg)
> +{
> +       struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> +
> +       return cxl_delete_region(cxld, dev_name(dev));
> +}
> +
>  static void cxl_decoder_release(struct device *dev)
>  {
>         struct cxl_decoder *cxld = to_cxl_decoder(dev);
>         struct cxl_port *port = to_cxl_port(dev->parent);
>
> +       device_for_each_child(&cxld->dev, cxld, delete_region);

This is too late. Regions should be deleted before the decoder is
unregistered, and I think it happens naturally due to the root port
association with memdevs. I.e. a root decoder is unregistered by its
parent port being unregistered which is triggered by cxl_acpi
->remove(). That ->remove() event triggers all memdevs to disconnect,
albeit in a workqueue. So as long as cxl_acpi ->remove flushes that
workqueue then it knows that all memdevs have triggered ->remove().

If the behavior of a region is that it gets deleted upon the last
memdev being unmapped from it then there should not be any regions to
clean up at decoder release time.

> +
> +       dev_WARN_ONCE(dev, !ida_is_empty(&cxld->region_ida),
> +                     "Lost track of a region");
> +
>         ida_free(&port->decoder_ida, cxld->id);
>         kfree(cxld);
>  }
> @@ -1194,6 +1208,8 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
>         cxld->target_type = CXL_DECODER_EXPANDER;
>         cxld->platform_res = (struct resource)DEFINE_RES_MEM(0, 0);
>
> +       ida_init(&cxld->region_ida);
> +
>         return cxld;
>  err:
>         kfree(cxld);
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> new file mode 100644
> index 000000000000..1a448543db0d
> --- /dev/null
> +++ b/drivers/cxl/core/region.c
> @@ -0,0 +1,208 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2021 Intel Corporation. All rights reserved. */

Happy New Year! Let's go to 2022 here.

> +#include <linux/io-64-nonatomic-lo-hi.h>
> +#include <linux/device.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/idr.h>
> +#include <region.h>
> +#include <cxl.h>
> +#include "core.h"
> +
> +/**
> + * DOC: cxl core region
> + *
> + * Regions are managed through the Linux device model. Each region instance is a
> + * unique struct device. CXL core provides functionality to create, destroy, and
> + * configure regions. This is all implemented here. Binding a region
> + * (programming the hardware) is handled by a separate region driver.
> + */

Somewhat information lite, how about:

"CXL Regions represent mapped memory capacity in system physical
address space. Whereas the CXL Root Decoders identify the bounds of
potential CXL Memory ranges, Regions represent the active mapped
capacity by the HDM Decoder Capability structures throughout the Host
Bridges, Switches, and Endpoints in the topology."

> +
> +static void cxl_region_release(struct device *dev);
> +
> +static const struct device_type cxl_region_type = {
> +       .name = "cxl_region",
> +       .release = cxl_region_release,
> +};
> +
> +static ssize_t create_region_show(struct device *dev,
> +                                 struct device_attribute *attr, char *buf)
> +{
> +       struct cxl_port *port = to_cxl_port(dev->parent);
> +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> +       int rc;
> +
> +       if (dev_WARN_ONCE(dev, !is_root_decoder(dev),
> +                         "Invalid decoder selected for region.")) {
> +               return -ENODEV;
> +       }

This can go, it's already the case that this attribute is only listed
in 'cxl_decoder_root_attrs'

> +       rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);

This looks broken. What if userspace does:

region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
echo $region > /sys/bus/cxl/devices/decoder0.0/create_region

...i.e. it should only advance to create a new name after the previous
one was instantiated / confirmed via a write.

Also, sysfs values are world readable by default, so non-root can burn
up region_ida.

> +       if (rc < 0) {
> +               dev_err(&cxld->dev, "Couldn't get a new id\n");
> +               return rc;
> +       }
> +
> +       return sysfs_emit(buf, "region%d.%d:%d\n", port->id, cxld->id, rc);
> +}
> +
> +static ssize_t create_region_store(struct device *dev,
> +                                  struct device_attribute *attr,
> +                                  const char *buf, size_t len)
> +{
> +       struct cxl_port *port = to_cxl_port(dev->parent);
> +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> +       int decoder_id, port_id, region_id;
> +       struct cxl_region *cxlr;
> +       ssize_t rc;
> +
> +       if (sscanf(buf, "region%d.%d:%d", &port_id, &decoder_id, &region_id) != 3)
> +               return -EINVAL;

With the proposed change above to cache the current 'next' region name
this can just be something like:

sysfs_streq(buf, cxld->next);

> +
> +       if (decoder_id != cxld->id)
> +               return -EINVAL;
> +
> +       if (port_id != port->id)
> +               return -EINVAL;
> +
> +       cxlr = cxl_alloc_region(cxld, region_id);
> +       if (IS_ERR(cxlr))
> +               return PTR_ERR(cxlr);
> +
> +       rc = cxl_add_region(cxld, cxlr);
> +       if (rc) {
> +               kfree(cxlr);

...'add' failures usually require a put_device(), are you sure kfree()
is correct here.

> +               return rc;
> +       }
> +
> +       return len;
> +}
> +DEVICE_ATTR_RW(create_region);
> +
> +static ssize_t delete_region_store(struct device *dev,
> +                                  struct device_attribute *attr,
> +                                  const char *buf, size_t len)
> +{
> +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> +       int rc;
> +
> +       rc = cxl_delete_region(cxld, buf);

I would have expected symmetry with cxl_add_region() i.e. convert @buf
to @cxlr and keep the function signatures between add and delete
aligned.

> +       if (rc)
> +               return rc;
> +
> +       return len;
> +}
> +DEVICE_ATTR_WO(delete_region);
> +
> +struct cxl_region *to_cxl_region(struct device *dev)
> +{
> +       if (dev_WARN_ONCE(dev, dev->type != &cxl_region_type,
> +                         "not a cxl_region device\n"))
> +               return NULL;
> +
> +       return container_of(dev, struct cxl_region, dev);
> +}
> +EXPORT_SYMBOL_GPL(to_cxl_region);
> +
> +static void cxl_region_release(struct device *dev)
> +{
> +       struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +       ida_free(&cxld->region_ida, cxlr->id);
> +       kfree(cxlr);
> +}
> +
> +struct cxl_region *cxl_alloc_region(struct cxl_decoder *cxld, int id)
> +{
> +       struct cxl_region *cxlr;
> +
> +       cxlr = kzalloc(sizeof(*cxlr), GFP_KERNEL);
> +       if (!cxlr)
> +               return ERR_PTR(-ENOMEM);

To keep symmetry with other device object allocations in the cxl/core
I would expect to see the device_initialize() dev->type and dev->bus
setup occur here as well.

> +
> +       cxlr->id = id;
> +
> +       return cxlr;
> +}
> +
> +/**
> + * cxl_add_region - Adds a region to a decoder
> + * @cxld: Parent decoder.
> + * @cxlr: Region to be added to the decoder.
> + *
> + * This is the second step of region initialization. Regions exist within an
> + * address space which is mapped by a @cxld. That @cxld must be a root decoder,
> + * and it enforces constraints upon the region as it is configured.
> + *
> + * Return: 0 if the region was added to the @cxld, else returns negative error
> + * code. The region will be named "regionX.Y.Z" where X is the port, Y is the
> + * decoder id, and Z is the region number.
> + */
> +int cxl_add_region(struct cxl_decoder *cxld, struct cxl_region *cxlr)
> +{
> +       struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> +       struct device *dev = &cxlr->dev;
> +       int rc;
> +
> +       device_initialize(dev);
> +       dev->parent = &cxld->dev;
> +       device_set_pm_not_required(dev);
> +       dev->bus = &cxl_bus_type;
> +       dev->type = &cxl_region_type;
> +       rc = dev_set_name(dev, "region%d.%d:%d", port->id, cxld->id, cxlr->id);
> +       if (rc)
> +               goto err;
> +
> +       rc = device_add(dev);
> +       if (rc)
> +               goto err;
> +
> +       dev_dbg(dev, "Added to %s\n", dev_name(&cxld->dev));
> +
> +       return 0;
> +
> +err:
> +       put_device(dev);

Here is that put_device() I was expecting, that kfree() earlier was a
double-free it seems.

Also, I would have expected a devm action to remove this. Something like:

struct cxl_port *port = to_cxl_port(cxld->dev.parent);

cxl_device_lock(&port->dev);
if (port->dev.driver)
    devm_cxl_add_region(port->uport, cxld, id);
else
    rc = -ENXIO;
cxl_device_unlock(&port->dev);

...then no matter what you know the region will be unregistered when
the root port goes away.

> +       return rc;
> +}
> +
> +static struct cxl_region *cxl_find_region_by_name(struct cxl_decoder *cxld,
> +                                                 const char *name)
> +{
> +       struct device *region_dev;
> +
> +       region_dev = device_find_child_by_name(&cxld->dev, name);
> +       if (!region_dev)
> +               return ERR_PTR(-ENOENT);
> +
> +       return to_cxl_region(region_dev);
> +}
> +
> +/**
> + * cxl_delete_region - Deletes a region
> + * @cxld: Parent decoder
> + * @region_name: Named region, ie. regionX.Y:Z
> + */
> +int cxl_delete_region(struct cxl_decoder *cxld, const char *region_name)
> +{
> +       struct cxl_region *cxlr;
> +
> +       device_lock(&cxld->dev);

cxl_device_lock()

...if the lock is needed, but I don't see why the lock is needed?

> +
> +       cxlr = cxl_find_region_by_name(cxld, region_name);
> +       if (IS_ERR(cxlr)) {
> +               device_unlock(&cxld->dev);
> +               return PTR_ERR(cxlr);
> +       }
> +
> +       dev_dbg(&cxld->dev, "Requested removal of %s from %s\n",
> +               dev_name(&cxlr->dev), dev_name(&cxld->dev));
> +
> +       device_unregister(&cxlr->dev);
> +       device_unlock(&cxld->dev);

This would need to change to devm_release_action() of course if the
add side changes to devm_cxl_add_region().

> +
> +       put_device(&cxlr->dev);
> +
> +       return 0;
> +}
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 13fb06849199..b9f0099c1f39 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -221,6 +221,7 @@ enum cxl_decoder_type {
>   * @target_type: accelerator vs expander (type2 vs type3) selector
>   * @flags: memory type capabilities and locking
>   * @target_lock: coordinate coherent reads of the target list
> + * @region_ida: allocator for region ids.
>   * @nr_targets: number of elements in @target
>   * @target: active ordered target list in current decoder configuration
>   */
> @@ -236,6 +237,7 @@ struct cxl_decoder {
>         enum cxl_decoder_type target_type;
>         unsigned long flags;
>         seqlock_t target_lock;
> +       struct ida region_ida;
>         int nr_targets;
>         struct cxl_dport *target[];
>  };
> @@ -323,6 +325,13 @@ struct cxl_ep {
>         struct list_head list;
>  };
>
> +bool is_cxl_region(struct device *dev);
> +struct cxl_region *to_cxl_region(struct device *dev);
> +struct cxl_region *cxl_alloc_region(struct cxl_decoder *cxld,
> +                                   int interleave_ways);
> +int cxl_add_region(struct cxl_decoder *cxld, struct cxl_region *cxlr);
> +int cxl_delete_region(struct cxl_decoder *cxld, const char *region);
> +
>  static inline bool is_cxl_root(struct cxl_port *port)
>  {
>         return port->uport == port->dev.parent;
> diff --git a/drivers/cxl/region.h b/drivers/cxl/region.h
> new file mode 100644
> index 000000000000..eb1249e3c1d4
> --- /dev/null
> +++ b/drivers/cxl/region.h
> @@ -0,0 +1,38 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/* Copyright(c) 2021 Intel Corporation. */
> +#ifndef __CXL_REGION_H__
> +#define __CXL_REGION_H__
> +
> +#include <linux/uuid.h>
> +
> +#include "cxl.h"
> +
> +/**
> + * struct cxl_region - CXL region
> + * @dev: This region's device.
> + * @id: This regions id. Id is globally unique across all regions.

s/regions/region's/

> + * @list: Node in decoder's region list.
> + * @res: Resource this region carves out of the platform decode range.
> + * @config: HDM decoder program config
> + * @config.size: Size of the region determined from LSA or userspace.
> + * @config.uuid: The UUID for this region.
> + * @config.interleave_ways: Number of interleave ways this region is configured for.
> + * @config.interleave_granularity: Interleave granularity of region
> + * @config.targets: The memory devices comprising the region.
> + */
> +struct cxl_region {
> +       struct device dev;
> +       int id;
> +       struct list_head list;
> +       struct resource *res;
> +
> +       struct {
> +               u64 size;
> +               uuid_t uuid;
> +               int interleave_ways;
> +               int interleave_granularity;
> +               struct cxl_memdev *targets[CXL_DECODER_MAX_INTERLEAVE];
> +       } config;

Why a sub-struct?

> +};
> +
> +#endif
> diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
> index 82e49ab0937d..3fe6d34e6d59 100644
> --- a/tools/testing/cxl/Kbuild
> +++ b/tools/testing/cxl/Kbuild
> @@ -46,6 +46,7 @@ cxl_core-y += $(CXL_CORE_SRC)/memdev.o
>  cxl_core-y += $(CXL_CORE_SRC)/mbox.o
>  cxl_core-y += $(CXL_CORE_SRC)/pci.o
>  cxl_core-y += $(CXL_CORE_SRC)/hdm.o
> +cxl_core-y += $(CXL_CORE_SRC)/region.o
>  cxl_core-y += config_check.o
>
>  obj-m += test/
> --
> 2.35.0
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 01/14] cxl/region: Add region creation ABI
  2022-01-28 18:14   ` Dan Williams
@ 2022-01-28 18:59     ` Dan Williams
  2022-02-02 18:26       ` Ben Widawsky
  2022-02-01 22:42     ` Ben Widawsky
  1 sibling, 1 reply; 70+ messages in thread
From: Dan Williams @ 2022-01-28 18:59 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On Fri, Jan 28, 2022 at 10:14 AM Dan Williams <dan.j.williams@intel.com> wrote:
[..]
> Here is that put_device() I was expecting, that kfree() earlier was a
> double-free it seems.
>
> Also, I would have expected a devm action to remove this. Something like:
>
> struct cxl_port *port = to_cxl_port(cxld->dev.parent);
>
> cxl_device_lock(&port->dev);
> if (port->dev.driver)
>     devm_cxl_add_region(port->uport, cxld, id);
> else
>     rc = -ENXIO;
> cxl_device_unlock(&port->dev);
>
> ...then no matter what you know the region will be unregistered when
> the root port goes away.

...actually, the lock and ->dev.driver check here are not needed
because this attribute is only registered while the cxl_acpi driver is
bound. So, it is safe to assume this is protected as decoder remove
synchronizes against active sysfs users.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 02/14] cxl/region: Introduce concept of region configuration
  2022-01-28  0:26 ` [PATCH v3 02/14] cxl/region: Introduce concept of region configuration Ben Widawsky
@ 2022-01-29  0:25   ` Dan Williams
  2022-02-01 14:59     ` Ben Widawsky
                       ` (2 more replies)
  0 siblings, 3 replies; 70+ messages in thread
From: Dan Williams @ 2022-01-29  0:25 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Bjorn Helgaas,
	Linux NVDIMM, Linux PCI

On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> The region creation APIs create a vacant region. Configuring the region
> works in the same way as similar subsystems such as devdax. Sysfs attrs
> will be provided to allow userspace to configure the region.  Finally
> once all configuration is complete, userspace may activate the region.
>
> Introduced here are the most basic attributes needed to configure a
> region. Details of these attribute are described in the ABI

s/attribute/attributes/

> Documentation. Sanity checking of configuration parameters are done at
> region binding time. This consolidates all such logic in one place,
> rather than being strewn across multiple places.

I think that's too late for some of the validation. The complex
validation that the region driver does throughout the topology is
different from the basic input validation that can  be done at the
sysfs write time. For example ,this patch allows negative
interleave_granularity values to specified, just return -EINVAL. I
agree that sysfs should not validate everything, I disagree with
pushing all validation to cxl_region_probe().

>
> A example is provided below:
>
> /sys/bus/cxl/devices/region0.0:0
> ├── interleave_granularity
> ├── interleave_ways
> ├── offset
> ├── size
> ├── subsystem -> ../../../../../../bus/cxl
> ├── target0
> ├── uevent
> └── uuid

As mentioned off-list, it looks like devtype and modalias are missing.

>
> Reported-by: kernel test robot <lkp@intel.com> (v2)
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
>  drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
>  2 files changed, 340 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index dcc728458936..50ba5018014d 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -187,3 +187,43 @@ Description:
>                 region driver before being deleted. The attributes expects a
>                 region in the form "regionX.Y:Z". The region's name, allocated
>                 by reading create_region, will also be released.
> +
> +What:          /sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset

This is just another 'resource' attribute for the physical base
address of the region, right? 'offset' sounds like something that
would be relative instead of absolute.

> +Date:          August, 2021

Same date update comment here.

> +KernelVersion: v5.18
> +Contact:       linux-cxl@vger.kernel.org
> +Description:
> +               (RO) A region resides within an address space that is claimed by
> +               a decoder.

"A region is a contiguous partition of a CXL Root decoder address space."

>                  Region space allocation is handled by the driver, but

"Region capacity is allocated by writing to the size attribute, the
resulting physical address base determined by the driver is reflected
here."

> +               the offset may be read by userspace tooling in order to
> +               determine fragmentation, and available size for new regions.

I would also expect, before / along with these new region attributes,
there would be 'available' and 'max_extent_available' at the decoder
level to indicate how much free space the decoder has and how big the
next region creation can be. User tooling can walk  the decoder and
the regions together to determine fragmentation if necessary, but for
the most part the tool likely only cares about "how big can the next
region be?" and "how full is this decoder?".


> +
> +What:
> +/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
> +Date:          August, 2021
> +KernelVersion: v5.18
> +Contact:       linux-cxl@vger.kernel.org
> +Description:
> +               (RW) Configuring regions requires a minimal set of parameters in
> +               order for the subsequent bind operation to succeed. The
> +               following parameters are defined:

Let's split up the descriptions into individual sections. That can
also document the order that attributes must be written. For example,
doesn't size need to be set before targets are added so that targets
can be validated whether they have sufficient capacity?

> +
> +               ==      ========================================================
> +               interleave_granularity Mandatory. Number of consecutive bytes
> +                       each device in the interleave set will claim. The
> +                       possible interleave granularity values are determined by
> +                       the CXL spec and the participating devices.
> +               interleave_ways Mandatory. Number of devices participating in the
> +                       region. Each device will provide 1/interleave of storage
> +                       for the region.
> +               size    Manadatory. Phsyical address space the region will
> +                       consume.

s/Phsyical/Physical/

> +               target  Mandatory. Memory devices are the backing storage for a
> +                       region. There will be N targets based on the number of
> +                       interleave ways that the top level decoder is configured
> +                       for.

That doesn't sound right, IW at the root != IW at the endpoint level
and the region needs to record all the endpoint level targets.

> Each target must be set with a memdev device ie.
> +                       'mem1'. This attribute only becomes available after
> +                       setting the 'interleave' attribute.
> +               uuid    Optional. A unique identifier for the region. If none is
> +                       selected, the kernel will create one.

Let's drop the Mandatory / Optional distinction, or I am otherwise not
understanding what this is trying to document. For example 'uuid' is
"mandatory" for PMEM regions and "omitted" for volatile regions not
optional.

> +               ==      ========================================================
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 1a448543db0d..3b48e0469fc7 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -3,9 +3,12 @@
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <linux/device.h>
>  #include <linux/module.h>
> +#include <linux/sizes.h>
>  #include <linux/slab.h>
> +#include <linux/uuid.h>
>  #include <linux/idr.h>
>  #include <region.h>
> +#include <cxlmem.h>
>  #include <cxl.h>
>  #include "core.h"
>
> @@ -18,11 +21,305 @@
>   * (programming the hardware) is handled by a separate region driver.
>   */
>
> +struct cxl_region *to_cxl_region(struct device *dev);
> +static const struct attribute_group region_interleave_group;
> +
> +static bool is_region_active(struct cxl_region *cxlr)
> +{
> +       /* TODO: Regions can't be activated yet. */
> +       return false;

This function seems redundant with just checking "cxlr->dev.driver !=
NULL"? The benefit of that is there is no need to carry a TODO in the
series.

> +}
> +
> +static void remove_target(struct cxl_region *cxlr, int target)
> +{
> +       struct cxl_memdev *cxlmd;
> +
> +       cxlmd = cxlr->config.targets[target];
> +       if (cxlmd)
> +               put_device(&cxlmd->dev);

A memdev can be a member of multiple regions at once, shouldn't this
be an endpoint decoder or similar, not the entire memdev?

Also, if memdevs autoremove themselves from regions at memdev
->remove() time then I don't think the region needs to hold references
on memdevs.

> +       cxlr->config.targets[target] = NULL;
> +}
> +
> +static ssize_t interleave_ways_show(struct device *dev,
> +                                   struct device_attribute *attr, char *buf)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
> +}
> +
> +static ssize_t interleave_ways_store(struct device *dev,
> +                                    struct device_attribute *attr,
> +                                    const char *buf, size_t len)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +       int ret, prev_iw;
> +       int val;

I would expect:

if (dev->driver)
   return -EBUSY;

...to shutdown configuration writes once the region is active. Might
also need a region-wide seqlock like target_list_show. So that region
probe drains  all active sysfs writers before assuming the
configuration is stable.

> +
> +       prev_iw = cxlr->config.interleave_ways;
> +       ret = kstrtoint(buf, 0, &val);
> +       if (ret)
> +               return ret;
> +       if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
> +               return -EINVAL;
> +
> +       cxlr->config.interleave_ways = val;
> +
> +       ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
> +       if (ret < 0)
> +               goto err;
> +
> +       sysfs_notify(&dev->kobj, NULL, "target_interleave");

Why?

> +
> +       while (prev_iw > cxlr->config.interleave_ways)
> +               remove_target(cxlr, --prev_iw);

To make the kernel side simpler this attribute could just require that
setting interleave ways is a one way street, if you want to change it
you need to delete the region and start over.

> +
> +       return len;
> +
> +err:
> +       cxlr->config.interleave_ways = prev_iw;
> +       return ret;
> +}
> +static DEVICE_ATTR_RW(interleave_ways);
> +
> +static ssize_t interleave_granularity_show(struct device *dev,
> +                                          struct device_attribute *attr,
> +                                          char *buf)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_granularity);
> +}
> +
> +static ssize_t interleave_granularity_store(struct device *dev,
> +                                           struct device_attribute *attr,
> +                                           const char *buf, size_t len)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +       int val, ret;
> +
> +       ret = kstrtoint(buf, 0, &val);
> +       if (ret)
> +               return ret;
> +       cxlr->config.interleave_granularity = val;

This wants minimum input validation and synchronization against an
active region.

> +
> +       return len;
> +}
> +static DEVICE_ATTR_RW(interleave_granularity);
> +
> +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> +                          char *buf)
> +{
> +       struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +       resource_size_t offset;
> +
> +       if (!cxlr->res)
> +               return sysfs_emit(buf, "\n");

Should be an error I would think. I.e. require size to be set before
s/offset/resource/ can be read.

> +
> +       offset = cxld->platform_res.start - cxlr->res->start;

Why make usersapce do the offset math?

> +
> +       return sysfs_emit(buf, "%pa\n", &offset);
> +}
> +static DEVICE_ATTR_RO(offset);

This can be DEVICE_ATTR_ADMIN_RO() to hide physical address layout
information from non-root.

> +
> +static ssize_t size_show(struct device *dev, struct device_attribute *attr,
> +                        char *buf)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +       return sysfs_emit(buf, "%llu\n", cxlr->config.size);

Perhaps no need to store size separately if this becomes:

sysfs_emit(buf, "%llu\n", (unsigned long long) resource_size(cxlr->res));


...?

> +}
> +
> +static ssize_t size_store(struct device *dev, struct device_attribute *attr,
> +                         const char *buf, size_t len)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +       unsigned long long val;
> +       ssize_t rc;
> +
> +       rc = kstrtoull(buf, 0, &val);
> +       if (rc)
> +               return rc;
> +
> +       device_lock(&cxlr->dev);
> +       if (is_region_active(cxlr))
> +               rc = -EBUSY;
> +       else
> +               cxlr->config.size = val;
> +       device_unlock(&cxlr->dev);

I think lockdep will complain about device_lock() usage in an
attribute. Try changing this to cxl_device_lock() with
CONFIG_PROVE_CXL_LOCKING=y.

> +
> +       return rc ? rc : len;
> +}
> +static DEVICE_ATTR_RW(size);
> +
> +static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> +                        char *buf)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +       return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
> +}
> +
> +static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> +                         const char *buf, size_t len)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +       ssize_t rc;
> +
> +       if (len != UUID_STRING_LEN + 1)
> +               return -EINVAL;
> +
> +       device_lock(&cxlr->dev);
> +       if (is_region_active(cxlr))
> +               rc = -EBUSY;
> +       else
> +               rc = uuid_parse(buf, &cxlr->config.uuid);
> +       device_unlock(&cxlr->dev);
> +
> +       return rc ? rc : len;
> +}
> +static DEVICE_ATTR_RW(uuid);
> +
> +static struct attribute *region_attrs[] = {
> +       &dev_attr_interleave_ways.attr,
> +       &dev_attr_interleave_granularity.attr,
> +       &dev_attr_offset.attr,
> +       &dev_attr_size.attr,
> +       &dev_attr_uuid.attr,
> +       NULL,
> +};
> +
> +static const struct attribute_group region_group = {
> +       .attrs = region_attrs,
> +};
> +
> +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> +{
> +       int ret;
> +
> +       device_lock(&cxlr->dev);
> +       if (!cxlr->config.targets[n])
> +               ret = sysfs_emit(buf, "\n");
> +       else
> +               ret = sysfs_emit(buf, "%s\n",
> +                                dev_name(&cxlr->config.targets[n]->dev));
> +       device_unlock(&cxlr->dev);

The component contribution of a memdev to a region is a DPA-span, not
the whole memdev. I would expect something like dax_mapping_attributes
or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
information about the component contribution of a memdev to a region.

> +
> +       return ret;
> +}
> +
> +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> +                         size_t len)
> +{
> +       struct device *memdev_dev;
> +       struct cxl_memdev *cxlmd;
> +
> +       device_lock(&cxlr->dev);
> +
> +       if (len == 1 || cxlr->config.targets[n])
> +               remove_target(cxlr, n);
> +
> +       /* Remove target special case */
> +       if (len == 1) {
> +               device_unlock(&cxlr->dev);
> +               return len;
> +       }
> +
> +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);

I think this wants to be an endpoint decoder, not a memdev. Because
it's the decoder that joins a memdev to a region, or at least a
decoder should be picked when the memdev is assigned so that the DPA
mapping can be registered. If all the decoders are allocated then fail
here.

> +       if (!memdev_dev) {
> +               device_unlock(&cxlr->dev);
> +               return -ENOENT;
> +       }
> +
> +       /* reference to memdev held until target is unset or region goes away */
> +
> +       cxlmd = to_cxl_memdev(memdev_dev);
> +       cxlr->config.targets[n] = cxlmd;
> +
> +       device_unlock(&cxlr->dev);
> +
> +       return len;
> +}
> +
> +#define TARGET_ATTR_RW(n)                                                      \
> +       static ssize_t target##n##_show(                                       \
> +               struct device *dev, struct device_attribute *attr, char *buf)  \
> +       {                                                                      \
> +               return show_targetN(to_cxl_region(dev), buf, (n));             \
> +       }                                                                      \
> +       static ssize_t target##n##_store(struct device *dev,                   \
> +                                        struct device_attribute *attr,        \
> +                                        const char *buf, size_t len)          \
> +       {                                                                      \
> +               return set_targetN(to_cxl_region(dev), buf, (n), len);         \
> +       }                                                                      \
> +       static DEVICE_ATTR_RW(target##n)
> +
> +TARGET_ATTR_RW(0);
> +TARGET_ATTR_RW(1);
> +TARGET_ATTR_RW(2);
> +TARGET_ATTR_RW(3);
> +TARGET_ATTR_RW(4);
> +TARGET_ATTR_RW(5);
> +TARGET_ATTR_RW(6);
> +TARGET_ATTR_RW(7);
> +TARGET_ATTR_RW(8);
> +TARGET_ATTR_RW(9);
> +TARGET_ATTR_RW(10);
> +TARGET_ATTR_RW(11);
> +TARGET_ATTR_RW(12);
> +TARGET_ATTR_RW(13);
> +TARGET_ATTR_RW(14);
> +TARGET_ATTR_RW(15);
> +
> +static struct attribute *interleave_attrs[] = {
> +       &dev_attr_target0.attr,
> +       &dev_attr_target1.attr,
> +       &dev_attr_target2.attr,
> +       &dev_attr_target3.attr,
> +       &dev_attr_target4.attr,
> +       &dev_attr_target5.attr,
> +       &dev_attr_target6.attr,
> +       &dev_attr_target7.attr,
> +       &dev_attr_target8.attr,
> +       &dev_attr_target9.attr,
> +       &dev_attr_target10.attr,
> +       &dev_attr_target11.attr,
> +       &dev_attr_target12.attr,
> +       &dev_attr_target13.attr,
> +       &dev_attr_target14.attr,
> +       &dev_attr_target15.attr,
> +       NULL,
> +};
> +
> +static umode_t visible_targets(struct kobject *kobj, struct attribute *a, int n)
> +{
> +       struct device *dev = container_of(kobj, struct device, kobj);
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +       if (n < cxlr->config.interleave_ways)
> +               return a->mode;
> +       return 0;
> +}
> +
> +static const struct attribute_group region_interleave_group = {
> +       .attrs = interleave_attrs,
> +       .is_visible = visible_targets,
> +};
> +
> +static const struct attribute_group *region_groups[] = {
> +       &region_group,
> +       &region_interleave_group,
> +       NULL,
> +};
> +
>  static void cxl_region_release(struct device *dev);
>
>  static const struct device_type cxl_region_type = {
>         .name = "cxl_region",
>         .release = cxl_region_release,
> +       .groups = region_groups
>  };
>
>  static ssize_t create_region_show(struct device *dev,
> @@ -108,8 +405,11 @@ static void cxl_region_release(struct device *dev)
>  {
>         struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
>         struct cxl_region *cxlr = to_cxl_region(dev);
> +       int i;
>
>         ida_free(&cxld->region_ida, cxlr->id);
> +       for (i = 0; i < cxlr->config.interleave_ways; i++)
> +               remove_target(cxlr, i);

Like the last patch this feels too late. I expect whatever unregisters
the region should have already handled removing the targets.

>         kfree(cxlr);
>  }
>
> --
> 2.35.0
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 02/14] cxl/region: Introduce concept of region configuration
  2022-01-29  0:25   ` Dan Williams
@ 2022-02-01 14:59     ` Ben Widawsky
  2022-02-03  5:06       ` Dan Williams
  2022-02-01 23:11     ` Ben Widawsky
  2022-02-17 18:36     ` Ben Widawsky
  2 siblings, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-02-01 14:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Bjorn Helgaas,
	Linux NVDIMM, Linux PCI

I will cut to the part that effects ABI so tool development can continue. I'll
get back to the other bits later.

On 22-01-28 16:25:34, Dan Williams wrote:

[snip]

> 
> > +
> > +       return ret;
> > +}
> > +
> > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > +                         size_t len)
> > +{
> > +       struct device *memdev_dev;
> > +       struct cxl_memdev *cxlmd;
> > +
> > +       device_lock(&cxlr->dev);
> > +
> > +       if (len == 1 || cxlr->config.targets[n])
> > +               remove_target(cxlr, n);
> > +
> > +       /* Remove target special case */
> > +       if (len == 1) {
> > +               device_unlock(&cxlr->dev);
> > +               return len;
> > +       }
> > +
> > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> 
> I think this wants to be an endpoint decoder, not a memdev. Because
> it's the decoder that joins a memdev to a region, or at least a
> decoder should be picked when the memdev is assigned so that the DPA
> mapping can be registered. If all the decoders are allocated then fail
> here.
> 

You've put two points in here:

1. Handle decoder allocation at sysfs boundary. I'll respond to this when I come
back around to the rest of the review comments.

2. Take a decoder for target instead of a memdev. I don't agree with this
direction as it's asymmetric to how LSA processing works. The goal was to model
the LSA for configuration. The kernel will have to be in the business of
reserving and enumerating decoders out of memdevs for both LSA (where we have a
list of memdevs) and volatile (where we use the memdevs in the system to
enumerate populated decoders). I don't see much value in making userspace do the
same.

I'd like to ask you reconsider if you still think it's preferable to use
decoders as part of the ABI and if you still feel that way I can go change it
since it has minimal impact overall.

> > +       if (!memdev_dev) {
> > +               device_unlock(&cxlr->dev);
> > +               return -ENOENT;
> > +       }
> > +
> > +       /* reference to memdev held until target is unset or region goes away */
> > +
> > +       cxlmd = to_cxl_memdev(memdev_dev);
> > +       cxlr->config.targets[n] = cxlmd;
> > +
> > +       device_unlock(&cxlr->dev);
> > +
> > +       return len;
> > +}
> > +

[snip]

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 01/14] cxl/region: Add region creation ABI
  2022-01-28  0:26 ` [PATCH v3 01/14] cxl/region: Add region creation ABI Ben Widawsky
  2022-01-28 18:14   ` Dan Williams
@ 2022-02-01 15:53   ` Jonathan Cameron
  2022-02-17 17:10   ` [PATCH v4 " Ben Widawsky
  2 siblings, 0 replies; 70+ messages in thread
From: Jonathan Cameron @ 2022-02-01 15:53 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Dan Williams, Ira Weiny,
	Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

On Thu, 27 Jan 2022 16:26:54 -0800
Ben Widawsky <ben.widawsky@intel.com> wrote:

> Regions are created as a child of the decoder that encompasses an
> address space with constraints. Regions have a number of attributes that
> must be configured before the region can be activated.
> 
> The ABI is not meant to be secure, but is meant to avoid accidental
> races. As a result, a buggy process may create a region by name that was
> allocated by a different process. However, multiple processes which are
> trying not to race with each other shouldn't need special
> synchronization to do so.
> 
> // Allocate a new region name
> region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> 
> // Create a new region by name
> echo $region > /sys/bus/cxl/devices/decoder0.0/create_region
> 
> // Region now exists in sysfs
> stat -t /sys/bus/cxl/devices/decoder0.0/$region
> 
> // Delete the region, and name
> echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region
> 
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> 

One trivial comment below.


> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 13fb06849199..b9f0099c1f39 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -221,6 +221,7 @@ enum cxl_decoder_type {
>   * @target_type: accelerator vs expander (type2 vs type3) selector
>   * @flags: memory type capabilities and locking
>   * @target_lock: coordinate coherent reads of the target list
> + * @region_ida: allocator for region ids.
>   * @nr_targets: number of elements in @target
>   * @target: active ordered target list in current decoder configuration
>   */
> @@ -236,6 +237,7 @@ struct cxl_decoder {
>  	enum cxl_decoder_type target_type;
>  	unsigned long flags;
>  	seqlock_t target_lock;
> +	struct ida region_ida;
>  	int nr_targets;
>  	struct cxl_dport *target[];
>  };
> @@ -323,6 +325,13 @@ struct cxl_ep {
>  	struct list_head list;
>  };
>  
> +bool is_cxl_region(struct device *dev);

Not in this patch. Looks like it is in patch 4.


> +struct cxl_region *to_cxl_region(struct device *dev);
> +struct cxl_region *cxl_alloc_region(struct cxl_decoder *cxld,
> +				    int interleave_ways);
> +int cxl_add_region(struct cxl_decoder *cxld, struct cxl_region *cxlr);
> +int cxl_delete_region(struct cxl_decoder *cxld, const char *region);
> +
>  static inline bool is_cxl_root(struct cxl_port *port)
>  {
>  	return port->uport == port->dev.parent;


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 04/14] cxl/region: Introduce a cxl_region driver
  2022-01-28  0:26 ` [PATCH v3 04/14] cxl/region: Introduce a cxl_region driver Ben Widawsky
@ 2022-02-01 16:21   ` Jonathan Cameron
  2022-02-17  6:04   ` Dan Williams
  1 sibling, 0 replies; 70+ messages in thread
From: Jonathan Cameron @ 2022-02-01 16:21 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Dan Williams, Ira Weiny, Vishal Verma, Bjorn Helgaas, nvdimm,
	linux-pci

On Thu, 27 Jan 2022 16:26:57 -0800
Ben Widawsky <ben.widawsky@intel.com> wrote:

> The cxl_region driver is responsible for managing the HDM decoder
> programming in the CXL topology. Once a region is created it must be
> configured and bound to the driver in order to activate it.
> 
> The following is a sample of how such controls might work:
> 
> region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> echo $region > /sys/bus/cxl/devices/decoder0.0/create_region
> echo 2 > /sys/bus/cxl/devices/decoder0.0/region0.0:0/interleave
> echo $((256<<20)) > /sys/bus/cxl/devices/decoder0.0/region0.0:0/size
> echo mem0 > /sys/bus/cxl/devices/decoder0.0/region0.0:0/target0
> echo mem1 > /sys/bus/cxl/devices/decoder0.0/region0.0:0/target1
> echo region0.0:0 > /sys/bus/cxl/drivers/cxl_region/bind
> 
> In order to handle the eventual rise in failure modes of binding a
> region, a new trace event is created to help track these failures for
> debug and reconfiguration paths in userspace.
> 
> Reported-by: kernel test robot <lkp@intel.com> (v2)
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>

A few minor comments inline.

Thanks,

Jonathan

> ---
> Changes since v2:
> - Add CONFIG_CXL_REGION
> - Check ways/granularity in sanitize
> ---
>  .../driver-api/cxl/memory-devices.rst         |   3 +
>  drivers/cxl/Kconfig                           |   4 +
>  drivers/cxl/Makefile                          |   2 +
>  drivers/cxl/core/core.h                       |   1 +
>  drivers/cxl/core/port.c                       |  17 +-
>  drivers/cxl/core/region.c                     |  25 +-
>  drivers/cxl/cxl.h                             |  31 ++
>  drivers/cxl/region.c                          | 349 ++++++++++++++++++
>  drivers/cxl/region.h                          |   4 +
>  9 files changed, 431 insertions(+), 5 deletions(-)
>  create mode 100644 drivers/cxl/region.c
> 
> diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst
> index 66ddc58a21b1..8cb4dece5b17 100644
> --- a/Documentation/driver-api/cxl/memory-devices.rst
> +++ b/Documentation/driver-api/cxl/memory-devices.rst
> @@ -364,6 +364,9 @@ CXL Core
>  
>  CXL Regions
>  -----------
> +.. kernel-doc:: drivers/cxl/region.c
> +   :doc: cxl region
> +
>  .. kernel-doc:: drivers/cxl/region.h
>     :identifiers:
>  
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index b88ab956bb7c..742847503c16 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -98,4 +98,8 @@ config CXL_PORT
>  	default CXL_BUS
>  	tristate
>  
> +config CXL_REGION
> +	default CXL_PORT
> +	tristate
> +
>  endif
> diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
> index ce267ef11d93..02a4776e7ab9 100644
> --- a/drivers/cxl/Makefile
> +++ b/drivers/cxl/Makefile
> @@ -5,9 +5,11 @@ obj-$(CONFIG_CXL_MEM) += cxl_mem.o
>  obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
>  obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o
>  obj-$(CONFIG_CXL_PORT) += cxl_port.o
> +obj-$(CONFIG_CXL_REGION) += cxl_region.o
>  
>  cxl_mem-y := mem.o
>  cxl_pci-y := pci.o
>  cxl_acpi-y := acpi.o
>  cxl_pmem-y := pmem.o
>  cxl_port-y := port.o
> +cxl_region-y := region.o
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 35fd08d560e2..b8a154da34df 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -7,6 +7,7 @@
>  extern const struct device_type cxl_nvdimm_bridge_type;
>  extern const struct device_type cxl_nvdimm_type;
>  extern const struct device_type cxl_memdev_type;
> +extern const struct device_type cxl_region_type;
>  
>  extern struct attribute_group cxl_base_attribute_group;
>  
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 0826208b2bdf..0847e6ce19ef 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -9,6 +9,7 @@
>  #include <linux/idr.h>
>  #include <cxlmem.h>
>  #include <cxlpci.h>
> +#include <region.h>
>  #include <cxl.h>
>  #include "core.h"
>  
> @@ -49,6 +50,8 @@ static int cxl_device_id(struct device *dev)
>  	}
>  	if (dev->type == &cxl_memdev_type)
>  		return CXL_DEVICE_MEMORY_EXPANDER;
> +	if (dev->type == &cxl_region_type)
> +		return CXL_DEVICE_REGION;
>  	return 0;
>  }
>  
> @@ -1425,13 +1428,23 @@ static int cxl_bus_match(struct device *dev, struct device_driver *drv)
>  
>  static int cxl_bus_probe(struct device *dev)
>  {
> -	int rc;
> +	int id = cxl_device_id(dev);
> +	int rc = -ENODEV;
>  
>  	cxl_nested_lock(dev);
> -	rc = to_cxl_drv(dev->driver)->probe(dev);
> +	if (id == CXL_DEVICE_REGION) {
> +		/* Regions cannot bind until parameters are set */
> +		struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +		if (is_cxl_region_configured(cxlr))
> +			rc = to_cxl_drv(dev->driver)->probe(dev);
> +	} else {
> +		rc = to_cxl_drv(dev->driver)->probe(dev);
> +	}
>  	cxl_nested_unlock(dev);
>  
>  	dev_dbg(dev, "probe: %d\n", rc);
> +

Shouldn't be in this patch.

>  	return rc;
>  }
>  
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 3b48e0469fc7..784e4ba25128 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -12,6 +12,8 @@
>  #include <cxl.h>
>  #include "core.h"
>  
> +#include "core.h"
> +
>  /**
>   * DOC: cxl core region
>   *
> @@ -26,10 +28,27 @@ static const struct attribute_group region_interleave_group;
>  
>  static bool is_region_active(struct cxl_region *cxlr)
>  {
> -	/* TODO: Regions can't be activated yet. */
> -	return false;
> +	return cxlr->active;
>  }
>  
> +/*
> + * Most sanity checking is left up to region binding. This does the most basic
> + * check to determine whether or not the core should try probing the driver.
> + */
> +bool is_cxl_region_configured(const struct cxl_region *cxlr)
> +{
> +	/* zero sized regions aren't a thing. */
> +	if (cxlr->config.size <= 0)
> +		return false;
> +
> +	/* all regions have at least 1 target */
> +	if (!cxlr->config.targets[0])
> +		return false;
> +
> +	return true;
> +}
> +EXPORT_SYMBOL_GPL(is_cxl_region_configured);
> +
>  static void remove_target(struct cxl_region *cxlr, int target)
>  {
>  	struct cxl_memdev *cxlmd;
> @@ -316,7 +335,7 @@ static const struct attribute_group *region_groups[] = {
>  
>  static void cxl_region_release(struct device *dev);
>  
> -static const struct device_type cxl_region_type = {
> +const struct device_type cxl_region_type = {
>  	.name = "cxl_region",
>  	.release = cxl_region_release,
>  	.groups = region_groups
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index b9f0099c1f39..d1a8ca19c9ea 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -81,6 +81,31 @@ static inline int cxl_to_interleave_ways(u8 eniw)
>  	}
>  }
>  
> +static inline bool cxl_is_interleave_ways_valid(int iw)
> +{
> +	switch (iw) {
> +		case 0 ... 4:
> +		case 6:
> +		case 8:
> +		case 12:
> +		case 16:
> +			return true;
> +		default:
> +			return false;
> +	}
> +
> +	unreachable();

Why does this need marking?

> +}
> +
> +static inline bool cxl_is_interleave_granularity_valid(int ig)
> +{
> +	if (!is_power_of_2(ig))
> +		return false;
> +
> +	/* 16K is the max */
> +	return ((ig >> 15) == 0);
> +}
> +
>  /* CXL 2.0 8.2.8.1 Device Capabilities Array Register */
>  #define CXLDEV_CAP_ARRAY_OFFSET 0x0
>  #define   CXLDEV_CAP_ARRAY_CAP_ID 0
> @@ -199,6 +224,10 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
>  #define CXL_DECODER_F_ENABLE    BIT(5)
>  #define CXL_DECODER_F_MASK  GENMASK(5, 0)
>  
> +#define cxl_is_pmem_t3(flags)                                                  \
> +	(((flags) & (CXL_DECODER_F_TYPE3 | CXL_DECODER_F_PMEM)) ==             \
> +	 (CXL_DECODER_F_TYPE3 | CXL_DECODER_F_PMEM))
> +
>  enum cxl_decoder_type {
>         CXL_DECODER_ACCELERATOR = 2,
>         CXL_DECODER_EXPANDER = 3,
> @@ -357,6 +386,7 @@ struct cxl_dport *devm_cxl_add_dport(struct cxl_port *port,
>  				     resource_size_t component_reg_phys);
>  struct cxl_dport *cxl_find_dport_by_dev(struct cxl_port *port,
>  					const struct device *dev);
> +struct cxl_port *ep_find_cxl_port(struct cxl_memdev *cxlmd, unsigned int depth);

Doesn't seem to be in this patch or indeed anywhere in the series.


>  
>  struct cxl_decoder *to_cxl_decoder(struct device *dev);
>  bool is_root_decoder(struct device *dev);
> @@ -404,6 +434,7 @@ void cxl_driver_unregister(struct cxl_driver *cxl_drv);
>  #define CXL_DEVICE_PORT			3
>  #define CXL_DEVICE_ROOT			4
>  #define CXL_DEVICE_MEMORY_EXPANDER	5
> +#define CXL_DEVICE_REGION		6
>  
>  #define MODULE_ALIAS_CXL(type) MODULE_ALIAS("cxl:t" __stringify(type) "*")
>  #define CXL_MODALIAS_FMT "cxl:t%d"
> diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
> new file mode 100644
> index 000000000000..cc41939a2f0a
> --- /dev/null
> +++ b/drivers/cxl/region.c
> @@ -0,0 +1,349 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2021 Intel Corporation. All rights reserved. */
> +#include <linux/platform_device.h>
> +#include <linux/device.h>
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +#include "cxlmem.h"
> +#include "region.h"
> +#include "cxl.h"
> +
> +/**
> + * DOC: cxl region
> + *
> + * This module implements a region driver that is capable of programming CXL
> + * hardware to setup regions.
> + *
> + * A CXL region encompasses a chunk of host physical address space that may be
> + * consumed by a single device (x1 interleave aka linear) or across multiple
> + * devices (xN interleaved). The region driver has the following
> + * responsibilities:
> + *
> + * * Walk topology to obtain decoder resources for region configuration.
> + * * Program decoder resources based on region configuration.
> + * * Bridge CXL regions to LIBNVDIMM
> + * * Initiates reading and configuring LSA regions
> + * * Enumerates regions created by BIOS (typically volatile)
> + */
> +
> +#define region_ways(region) ((region)->config.interleave_ways)
> +#define region_granularity(region) ((region)->config.interleave_granularity)

If you drop the 'config' level of the structure as I think Dan suggested
then I don't think these macros will be justified vs
region->interleave_xxx  They are a bit borderline now for whether they
help or hinder readability!

> +
> +static struct cxl_decoder *rootd_from_region(struct cxl_region *cxlr)
> +{
> +	struct device *d = cxlr->dev.parent;
> +
> +	if (WARN_ONCE(!is_root_decoder(d),
> +		      "Corrupt topology for root region\n"))
> +		return NULL;
> +
> +	return to_cxl_decoder(d);
> +}
> +
> +static struct cxl_port *get_hostbridge(const struct cxl_memdev *ep)
> +{
> +	struct cxl_port *port = ep->port;
> +
> +	while (!is_cxl_root(port)) {
> +		port = to_cxl_port(port->dev.parent);
> +		if (port->depth == 1)
> +			return port;
> +	}
> +
> +	BUG();

Add a comment on why. For anyone hitting this they are in a weird
state, so give them a hint!

> +	return NULL;
> +}
> +
> +static struct cxl_port *get_root_decoder(const struct cxl_memdev *endpoint)
> +{
> +	struct cxl_port *hostbridge = get_hostbridge(endpoint);
> +
> +	if (hostbridge)
> +		return to_cxl_port(hostbridge->dev.parent);
> +
> +	return NULL;
> +}
> +
> +/**
> + * sanitize_region() - Check is region is reasonably configured
> + * @cxlr: The region to check
> + *
> + * Determination as to whether or not a region can possibly be configured is
> + * described in CXL Memory Device SW Guide. In order to implement the algorithms

Lines are a tiny bit long given no loss of readability in bringing them under
80 chars.

> + * described there, certain more basic configuration parameters must first need
> + * to be validated. That is accomplished by this function.
> + *
> + * Returns 0 if the region is reasonably configured, else returns a negative
> + * error code.
> + */
> +static int sanitize_region(const struct cxl_region *cxlr)
> +{
> +	const int ig = region_granularity(cxlr);
> +	const int iw = region_ways(cxlr);
> +	int i;
> +
> +	if (dev_WARN_ONCE(&cxlr->dev, !is_cxl_region_configured(cxlr),
> +			  "unconfigured regions can't be probed (race?)\n")) {
> +		return -ENXIO;
> +	}
> +
> +	/*
> +	 * Interleave attributes should be caught by later math, but it's
> +	 * easiest to find those issues here, now.
> +	 */
> +	if (!cxl_is_interleave_ways_valid(iw)) {
> +		dev_dbg(&cxlr->dev, "Invalid number of ways\n");
> +		return -ENXIO;
> +	}
> +
> +	if (!cxl_is_interleave_granularity_valid(ig)) {
> +		dev_dbg(&cxlr->dev, "Invalid interleave granularity\n");
> +		return -ENXIO;
> +	}
> +
> +	if (cxlr->config.size % (SZ_256M * iw)) {
> +		dev_dbg(&cxlr->dev, "Invalid size. Must be multiple of %uM\n",
> +			256 * iw);
> +		return -ENXIO;
> +	}
> +
> +	for (i = 0; i < iw; i++) {
> +		if (!cxlr->config.targets[i]) {
> +			dev_dbg(&cxlr->dev, "Missing memory device target%u",
> +				i);
> +			return -ENXIO;
> +		}
> +		if (!cxlr->config.targets[i]->dev.driver) {
> +			dev_dbg(&cxlr->dev, "%s isn't CXL.mem capable\n",
> +				dev_name(&cxlr->config.targets[i]->dev));
> +			return -ENODEV;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * allocate_address_space() - Gets address space for the region.
> + * @cxlr: The region that will consume the address space
> + */
> +static int allocate_address_space(struct cxl_region *cxlr)
> +{
> +	/* TODO */
> +	return 0;
> +}
> +
> +/**
> + * find_cdat_dsmas() - Find a valid DSMAS for the region
> + * @cxlr: The region
> + */
> +static bool find_cdat_dsmas(const struct cxl_region *cxlr)
> +{
> +	return true;
> +}
> +
> +/**
> + * qtg_match() - Does this root decoder have desirable QTG for the endpoint
> + * @rootd: The root decoder for the region
> + * @endpoint: Endpoint whose QTG is being compared
> + *
> + * Prior to calling this function, the caller should verify that all endpoints
> + * in the region have the same QTG ID.
> + *
> + * Returns true if the QTG ID of the root decoder matches the endpoint
> + */
> +static bool qtg_match(const struct cxl_decoder *rootd,
> +		      const struct cxl_memdev *endpoint)
> +{
> +	/* TODO: */
> +	return true;
> +}
> +
> +/**
> + * region_xhb_config_valid() - determine cross host bridge validity
> + * @cxlr: The region being programmed
> + * @rootd: The root decoder to check against
> + *
> + * The algorithm is outlined in 2.13.14 "Verify XHB configuration sequence" of
> + * the CXL Memory Device SW Guide (Rev1p0).
> + *
> + * Returns true if the configuration is valid.
> + */
> +static bool region_xhb_config_valid(const struct cxl_region *cxlr,
> +				    const struct cxl_decoder *rootd)
> +{
> +	/* TODO: */
> +	return true;
> +}
> +
> +/**
> + * region_hb_rp_config_valid() - determine root port ordering is correct
> + * @cxlr: Region to validate
> + * @rootd: root decoder for this @cxlr
> + *
> + * The algorithm is outlined in 2.13.15 "Verify HB root port configuration
> + * sequence" of the CXL Memory Device SW Guide (Rev1p0).
> + *
> + * Returns true if the configuration is valid.
> + */
> +static bool region_hb_rp_config_valid(const struct cxl_region *cxlr,
> +				      const struct cxl_decoder *rootd)
> +{
> +	/* TODO: */
> +	return true;
> +}
> +
> +/**
> + * rootd_contains() - determine if this region can exist in the root decoder
> + * @rootd: root decoder that potentially decodes to this region
> + * @cxlr: region to be routed by the @rootd
> + */
> +static bool rootd_contains(const struct cxl_region *cxlr,
> +			   const struct cxl_decoder *rootd)
> +{
> +	/* TODO: */
> +	return true;
> +}
> +
> +static bool rootd_valid(const struct cxl_region *cxlr,
> +			const struct cxl_decoder *rootd)
> +{
> +	const struct cxl_memdev *endpoint = cxlr->config.targets[0];
> +
> +	if (!qtg_match(rootd, endpoint))
> +		return false;
> +
> +	if (!cxl_is_pmem_t3(rootd->flags))
> +		return false;
> +
> +	if (!region_xhb_config_valid(cxlr, rootd))
> +		return false;
> +
> +	if (!region_hb_rp_config_valid(cxlr, rootd))
> +		return false;
> +
> +	if (!rootd_contains(cxlr, rootd))
> +		return false;
> +
> +	return true;
> +}
> +
> +struct rootd_context {
> +	const struct cxl_region *cxlr;
> +	struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];
> +	int count;
> +};
> +
> +static int rootd_match(struct device *dev, void *data)
> +{
> +	struct rootd_context *ctx = (struct rootd_context *)data;

Shouldn't need to cast from a void * to another pointer type.

> +	const struct cxl_region *cxlr = ctx->cxlr;
> +
> +	if (!is_root_decoder(dev))
> +		return 0;
> +
> +	return !!rootd_valid(cxlr, to_cxl_decoder(dev));

Extreme given rootd_valid returns a bool...


> +}
> +
> +/*
> + * This is a roughly equivalent implementation to "Figure 45 - High-level
> + * sequence: Finding CFMWS for region" from the CXL Memory Device SW Guide
> + * Rev1p0.
> + */
> +static struct cxl_decoder *find_rootd(const struct cxl_region *cxlr,
> +				      const struct cxl_port *root)
> +{
> +	struct rootd_context ctx;
> +	struct device *ret;
> +
> +	ctx.cxlr = cxlr;
> +
> +	ret = device_find_child((struct device *)&root->dev, &ctx, rootd_match);

Why do you need a cast? 

> +	if (ret)
> +		return to_cxl_decoder(ret);
> +
> +	return NULL;
> +}


> +static int cxl_region_probe(struct device *dev)
> +{
> +	struct cxl_region *cxlr = to_cxl_region(dev);
> +	struct cxl_port *root_port;
> +	struct cxl_decoder *rootd, *ours;
> +	int ret;
> +
> +	device_lock_assert(&cxlr->dev);
> +
> +	if (cxlr->active)
> +		return 0;
> +
> +	if (uuid_is_null(&cxlr->config.uuid))
> +		uuid_gen(&cxlr->config.uuid);
> +
> +	/* TODO: What about volatile, and LSA generated regions? */
> +
> +	ret = sanitize_region(cxlr);
> +	if (ret)
> +		return ret;
> +
> +	ret = allocate_address_space(cxlr);
> +	if (ret)
> +		return ret;
> +
> +	if (!find_cdat_dsmas(cxlr))
> +		return -ENXIO;
> +
> +	rootd = rootd_from_region(cxlr);
> +	if (!rootd) {
> +		dev_err(dev, "Couldn't find root decoder\n");
> +		return -ENXIO;
> +	}
> +
> +	if (!rootd_valid(cxlr, rootd)) {
> +		dev_err(dev, "Picked invalid rootd\n");
> +		return -ENXIO;
> +	}
> +
> +	root_port = get_root_decoder(cxlr->config.targets[0]);
> +	ours = find_rootd(cxlr, root_port);
> +	if (ours != rootd)
> +		dev_dbg(dev, "Picked different rootd %s %s\n",
> +			dev_name(&rootd->dev), dev_name(&ours->dev));

Good to add a comment here on why it is fine to continue after
what the dbg statement and check makes sound like a bad thing!

> +	if (ours)
> +		put_device(&ours->dev);
> +
> +	ret = collect_ep_decoders(cxlr);
> +	if (ret)
> +		return ret;
> +
> +	ret = bind_region(cxlr);
> +	if (!ret) {
> +		cxlr->active = true;
> +		dev_info(dev, "Bound");

Looks like left over debug info...

> +	}
> +
> +	return ret;
> +}
> +
> +static struct cxl_driver cxl_region_driver = {
> +	.name = "cxl_region",
> +	.probe = cxl_region_probe,
> +	.id = CXL_DEVICE_REGION,
> +};
> +module_cxl_driver(cxl_region_driver);
> +
> +MODULE_LICENSE("GPL v2");
> +MODULE_IMPORT_NS(CXL);
> +MODULE_ALIAS_CXL(CXL_DEVICE_REGION);

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 09/14] cxl/region: Add infrastructure for decoder programming
  2022-01-28  0:27 ` [PATCH v3 09/14] cxl/region: Add infrastructure for decoder programming Ben Widawsky
@ 2022-02-01 18:16   ` Jonathan Cameron
  2022-02-18 21:53   ` Dan Williams
  1 sibling, 0 replies; 70+ messages in thread
From: Jonathan Cameron @ 2022-02-01 18:16 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Dan Williams, Ira Weiny,
	Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

On Thu, 27 Jan 2022 16:27:02 -0800
Ben Widawsky <ben.widawsky@intel.com> wrote:

> There are 3 steps in handling region programming once it has been
> configured by userspace.
> 1. Sanitize the parameters against the system.
> 2. Collect decoder resources from the topology
> 3. Program decoder resources
> 
> The infrastructure added here addresses #2. Two new APIs are introduced
> to allow collecting and returning decoder resources. Additionally the
> infrastructure includes two lists managed by the region driver, a staged
> list, and a commit list. The staged list contains those collected in
> step #2, and the commit list are all the decoders programmed in step #3.
> 
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>

Minor comments inline.

>  static void cxld_unregister(void *dev)
>  {
>  	device_unregister(dev);
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 784e4ba25128..a62d48454a56 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -440,6 +440,8 @@ struct cxl_region *cxl_alloc_region(struct cxl_decoder *cxld, int id)
>  	if (!cxlr)
>  		return ERR_PTR(-ENOMEM);
>  
> +	INIT_LIST_HEAD(&cxlr->staged_list);
> +	INIT_LIST_HEAD(&cxlr->commit_list);
>  	cxlr->id = id;
>  
>  	return cxlr;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index ed984465b59c..8ace6cca0776 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -35,6 +35,8 @@
>  #define   CXL_CM_CAP_CAP_ID_HDM 0x5
>  #define   CXL_CM_CAP_CAP_HDM_VERSION 1
>  
> +#define CXL_DECODER_MAX_INSTANCES 10
> +
>  /* HDM decoders CXL 2.0 8.2.5.12 CXL HDM Decoder Capability Structure */
>  #define CXL_HDM_DECODER_CAP_OFFSET 0x0
>  #define   CXL_HDM_DECODER_COUNT_MASK GENMASK(3, 0)
> @@ -265,6 +267,7 @@ enum cxl_decoder_type {
>   * @target_lock: coordinate coherent reads of the target list
>   * @region_ida: allocator for region ids.
>   * @address_space: Used/free address space for regions.
> + * @region_link: This decoder's place on either the staged, or commit list.
>   * @nr_targets: number of elements in @target
>   * @target: active ordered target list in current decoder configuration
>   */
> @@ -282,6 +285,7 @@ struct cxl_decoder {
>  	seqlock_t target_lock;
>  	struct ida region_ida;
>  	struct gen_pool *address_space;
> +	struct list_head region_link;
>  	int nr_targets;
>  	struct cxl_dport *target[];
>  };
> @@ -326,6 +330,7 @@ struct cxl_nvdimm {
>   * @id: id for port device-name
>   * @dports: cxl_dport instances referenced by decoders
>   * @endpoints: cxl_ep instances, endpoints that are a descendant of this port
> + * @region_link: this port's node on the region's list of ports

docs but no field in the structure...


>   * @decoder_ida: allocator for decoder ids
>   * @component_reg_phys: component register capability base address (optional)
>   * @dead: last ep has been removed, force port re-creation
> @@ -396,6 +401,8 @@ struct cxl_port *find_cxl_root(struct device *dev);
>  int devm_cxl_enumerate_ports(struct cxl_memdev *cxlmd);
>  int cxl_bus_rescan(void);
>  struct cxl_port *cxl_mem_find_port(struct cxl_memdev *cxlmd);
> +struct cxl_decoder *cxl_get_decoder(struct cxl_port *port);
> +void cxl_put_decoder(struct cxl_decoder *cxld);
>  bool schedule_cxl_memdev_detach(struct cxl_memdev *cxlmd);
>  
>  struct cxl_dport *devm_cxl_add_dport(struct cxl_port *port,
> @@ -406,6 +413,7 @@ struct cxl_dport *cxl_find_dport_by_dev(struct cxl_port *port,
>  struct cxl_port *ep_find_cxl_port(struct cxl_memdev *cxlmd, unsigned int depth);
>  
>  struct cxl_decoder *to_cxl_decoder(struct device *dev);
> +bool is_cxl_decoder(struct device *dev);

They are multiplying!! (see 2 lines down.)

>  bool is_root_decoder(struct device *dev);
>  bool is_cxl_decoder(struct device *dev);
>  struct cxl_decoder *cxl_root_decoder_alloc(struct cxl_port *port,

> diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
> index d2f6c990c8a8..145d7bb02714 100644
> --- a/drivers/cxl/region.c
> +++ b/drivers/cxl/region.c
> @@ -359,21 +359,59 @@ static bool has_switch(const struct cxl_region *cxlr)
>  	return false;
>  }
>

...

>  static int bind_region(const struct cxl_region *cxlr)
> @@ -559,7 +648,7 @@ static int cxl_region_probe(struct device *dev)
>  		return -ENXIO;
>  	}
>  
> -	if (!rootd_valid(cxlr, rootd)) {
> +	if (!rootd_valid(cxlr, rootd, true)) {
>  		dev_err(dev, "Picked invalid rootd\n");
>  		return -ENXIO;
>  	}
> @@ -574,14 +663,18 @@ static int cxl_region_probe(struct device *dev)
>  
>  	ret = collect_ep_decoders(cxlr);
>  	if (ret)
> -		return ret;
> +		goto err;
>  
>  	ret = bind_region(cxlr);
> -	if (!ret) {
> -		cxlr->active = true;
> -		dev_info(dev, "Bound");
> -	}
> +	if (ret)
> +		goto err;
>  
> +	cxlr->active = true;
> +	dev_info(dev, "Bound");

Not keen on this being always printed in the logs.  dev_dbg() perhaps with
some more detail may be 

> +	return 0;
> +
> +err:
> +	cleanup_staged_decoders(cxlr);
>  	return ret;
>  }
>  
> diff --git a/drivers/cxl/region.h b/drivers/cxl/region.h
> index 00a6dc729c26..fc15abaeb638 100644
> --- a/drivers/cxl/region.h
> +++ b/drivers/cxl/region.h
> @@ -14,6 +14,9 @@
>   * @list: Node in decoder's region list.
>   * @res: Resource this region carves out of the platform decode range.
>   * @active: If the region has been activated.
> + * @staged_list: All decoders staged for programming.
> + * @commit_list: All decoders programmed for this region's parameters.
> + *

Why the blank line?  If it makes sense should be in earlier patch
and I'm not sure if kernel-doc allows blank lines in the list.


>   * @config: HDM decoder program config
>   * @config.size: Size of the region determined from LSA or userspace.
>   * @config.uuid: The UUID for this region.
> @@ -27,6 +30,8 @@ struct cxl_region {
>  	struct list_head list;
>  	struct resource *res;
>  	bool active;
> +	struct list_head staged_list;
> +	struct list_head commit_list;
>  
>  	struct {
>  		u64 size;


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 10/14] cxl/region: Collect host bridge decoders
  2022-01-28  0:27 ` [PATCH v3 10/14] cxl/region: Collect host bridge decoders Ben Widawsky
@ 2022-02-01 18:21   ` Jonathan Cameron
  2022-02-18 23:42   ` Dan Williams
  1 sibling, 0 replies; 70+ messages in thread
From: Jonathan Cameron @ 2022-02-01 18:21 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Dan Williams, Ira Weiny,
	Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

On Thu, 27 Jan 2022 16:27:03 -0800
Ben Widawsky <ben.widawsky@intel.com> wrote:

> Part of host bridge verification in the CXL Type 3 Memory Device
> Software Guide calculates the host bridge interleave target list (6th
> step in the flow chart), ie. verification and state update are done in
> the same step. Host bridge verification is already in place, so go ahead
> and store the decoders with their target lists.
> 
> Switches are implemented in a separate patch.
> 
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Looks like a little bit of code got in here that I think
belongs in a different patch.

> ---
>  drivers/cxl/region.c | 17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
> index 145d7bb02714..b8982be13bfe 100644
> --- a/drivers/cxl/region.c
> +++ b/drivers/cxl/region.c
> @@ -428,6 +428,7 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
>  		return simple_config(cxlr, hbs[0]);
>  
>  	for (i = 0; i < hb_count; i++) {
> +		struct cxl_decoder *cxld;
>  		int idx, position_mask;
>  		struct cxl_dport *rp;
>  		struct cxl_port *hb;
> @@ -486,6 +487,18 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
>  						"One or more devices are not connected to the correct Host Bridge Root Port\n");
>  					goto err;
>  				}
> +
> +				if (!state_update)
> +					continue;
> +
> +				if (dev_WARN_ONCE(&cxld->dev,
> +						  port_grouping >= cxld->nr_targets,
> +						  "Invalid port grouping %d/%d\n",
> +						  port_grouping, cxld->nr_targets))
> +					goto err;
> +
> +				cxld->interleave_ways++;
> +				cxld->target[port_grouping] = get_rp(ep);
>  			}
>  		}
>  	}
> @@ -538,7 +551,7 @@ static bool rootd_valid(const struct cxl_region *cxlr,
>  
>  struct rootd_context {
>  	const struct cxl_region *cxlr;
> -	struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];
> +	const struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];
>  	int count;
>  };
>  
> @@ -564,7 +577,7 @@ static struct cxl_decoder *find_rootd(const struct cxl_region *cxlr,
>  	struct rootd_context ctx;
>  	struct device *ret;
>  
> -	ctx.cxlr = cxlr;
> +	ctx.cxlr = (struct cxl_region *)cxlr;

Why is this here?  If it's needed, that need doesn't seem to have come in
as part of this patch.

>  
>  	ret = device_find_child((struct device *)&root->dev, &ctx, rootd_match);
>  	if (ret)


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 11/14] cxl/region: Add support for single switch level
  2022-01-28  0:27 ` [PATCH v3 11/14] cxl/region: Add support for single switch level Ben Widawsky
@ 2022-02-01 18:26   ` Jonathan Cameron
  2022-02-15 16:10   ` Jonathan Cameron
  1 sibling, 0 replies; 70+ messages in thread
From: Jonathan Cameron @ 2022-02-01 18:26 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Dan Williams, Ira Weiny,
	Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

On Thu, 27 Jan 2022 16:27:04 -0800
Ben Widawsky <ben.widawsky@intel.com> wrote:

> CXL switches have HDM decoders just like host bridges and endpoints.
> Their programming works in a similar fashion.
> 
> The spec does not prohibit multiple levels of switches, however, those
> are not implemented at this time.
> 
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
One trivial comment inline, but it's end of day here so I've not taken as
deeper look at this as I probably will at some later date.

Thanks,

Jonathan

> ---
>  drivers/cxl/cxl.h    |  5 ++++
>  drivers/cxl/region.c | 61 ++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 64 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 8ace6cca0776..d70d8c85d05f 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -96,6 +96,11 @@ static inline u8 cxl_to_ig(u16 g)
>  	return ilog2(g) - 8;
>  }
>  
> +static inline int cxl_to_ways(u8 ways)
> +{
> +	return 1 << ways;

This special case of cxl_to_interleave_ways probably needs some
documentation or a name that makes it clear why it is special.

> +}
> +
>  static inline bool cxl_is_interleave_ways_valid(int iw)
>  {
>  	switch (iw) {
> diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
> index b8982be13bfe..f748060733dd 100644
> --- a/drivers/cxl/region.c
> +++ b/drivers/cxl/region.c
> @@ -359,6 +359,23 @@ static bool has_switch(const struct cxl_region *cxlr)
>  	return false;
>  }
>  
> +static bool has_multi_switch(const struct cxl_region *cxlr)
> +{
> +	struct cxl_memdev *ep;
> +	int i;
> +
> +	for_each_cxl_endpoint(ep, cxlr, i)
> +		if (ep->port->depth > 3)
> +			return true;
> +
> +	return false;
> +}
> +
> +static struct cxl_port *get_switch(struct cxl_memdev *ep)
> +{
> +	return to_cxl_port(ep->port->dev.parent);
> +}
> +
>  static struct cxl_decoder *get_decoder(struct cxl_region *cxlr,
>  				       struct cxl_port *p)
>  {
> @@ -409,6 +426,8 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
>  				      const struct cxl_decoder *rootd,
>  				      bool state_update)
>  {
> +	const int region_ig = cxl_to_ig(cxlr->config.interleave_granularity);
> +	const int region_eniw = cxl_to_eniw(cxlr->config.interleave_ways);
>  	const int num_root_ports = get_num_root_ports(cxlr);
>  	struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];
>  	struct cxl_decoder *cxld, *c;
> @@ -416,8 +435,12 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
>  
>  	hb_count = get_unique_hostbridges(cxlr, hbs);
>  
> -	/* TODO: Switch support */
> -	if (has_switch(cxlr))
> +	/* TODO: support multiple levels of switches */
> +	if (has_multi_switch(cxlr))
> +		return false;
> +
> +	/* TODO: x3 interleave for switches is hard. */
> +	if (has_switch(cxlr) && !is_power_of_2(region_ways(cxlr)))
>  		return false;
>  
>  	/*
> @@ -470,8 +493,14 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
>  		list_for_each_entry(rp, &hb->dports, list) {
>  			struct cxl_memdev *ep;
>  			int port_grouping = -1;
> +			int target_ndx;
>  
>  			for_each_cxl_endpoint_hb(ep, cxlr, hb, idx) {
> +				struct cxl_decoder *switch_cxld;
> +				struct cxl_dport *target;
> +				struct cxl_port *switch_port;
> +				bool found = false;
> +
>  				if (get_rp(ep) != rp)
>  					continue;
>  
> @@ -499,6 +528,34 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
>  
>  				cxld->interleave_ways++;
>  				cxld->target[port_grouping] = get_rp(ep);
> +
> +				/*
> +				 * At least one switch is connected here if the endpoint
> +				 * has a depth > 2
> +				 */
> +				if (ep->port->depth == 2)
> +					continue;
> +
> +				/* Check the staged list to see if this
> +				 * port has already been added
> +				 */
> +				switch_port = get_switch(ep);
> +				list_for_each_entry(switch_cxld, &cxlr->staged_list, region_link) {
> +					if (to_cxl_port(switch_cxld->dev.parent) == switch_port)
> +						found = true;
> +				}
> +
> +				if (found) {
> +					target = cxl_find_dport_by_dev(switch_port, ep->dev.parent->parent);
> +					switch_cxld->target[target_ndx++] = target;
> +					continue;
> +				}
> +
> +				target_ndx = 0;
> +
> +				switch_cxld = get_decoder(cxlr, switch_port);
> +				switch_cxld->interleave_ways++;
> +				switch_cxld->interleave_granularity = cxl_to_ways(region_ig + region_eniw);
>  			}
>  		}
>  	}


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 01/14] cxl/region: Add region creation ABI
  2022-01-28 18:14   ` Dan Williams
  2022-01-28 18:59     ` Dan Williams
@ 2022-02-01 22:42     ` Ben Widawsky
  1 sibling, 0 replies; 70+ messages in thread
From: Ben Widawsky @ 2022-02-01 22:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On 22-01-28 10:14:09, Dan Williams wrote:
> On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > Regions are created as a child of the decoder that encompasses an
> > address space with constraints. Regions have a number of attributes that
> > must be configured before the region can be activated.
> >
> > The ABI is not meant to be secure, but is meant to avoid accidental
> > races. As a result, a buggy process may create a region by name that was
> > allocated by a different process. However, multiple processes which are
> > trying not to race with each other shouldn't need special
> > synchronization to do so.
> >
> > // Allocate a new region name
> > region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> >
> > // Create a new region by name
> > echo $region > /sys/bus/cxl/devices/decoder0.0/create_region
> 
> Were I someone coming in cold to this the immediate question about
> this example would be "what if userspace races to create the region?".
> How about showing the example that this interface requires looping
> until the kernel returns success in case userspace races itself to
> create the next region? I think this would work for that purpose.
> 
> ---
> 
> while
> region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> ! echo $region > /sys/bus/cxl/devices/decoder0.0/create_region
> do true; done
> 
> ---
> 

See below... TL;DR the interface was designed to not work this way because I
misunderstood you.

> > // Region now exists in sysfs
> > stat -t /sys/bus/cxl/devices/decoder0.0/$region
> >
> > // Delete the region, and name
> > echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region
> >
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> >
> > ---
> > Changes since v2:
> > - Rename 'region' variables to 'cxlr'
> > - Update ABI docs for possible actual upstream version
> > ---
> >  Documentation/ABI/testing/sysfs-bus-cxl       |  24 ++
> >  .../driver-api/cxl/memory-devices.rst         |  11 +
> >  drivers/cxl/core/Makefile                     |   1 +
> >  drivers/cxl/core/core.h                       |   3 +
> >  drivers/cxl/core/port.c                       |  16 ++
> >  drivers/cxl/core/region.c                     | 208 ++++++++++++++++++
> >  drivers/cxl/cxl.h                             |   9 +
> >  drivers/cxl/region.h                          |  38 ++++
> >  tools/testing/cxl/Kbuild                      |   1 +
> >  9 files changed, 311 insertions(+)
> >  create mode 100644 drivers/cxl/core/region.c
> >  create mode 100644 drivers/cxl/region.h
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > index 7c2b846521f3..dcc728458936 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > @@ -163,3 +163,27 @@ Description:
> >                 memory (type-3). The 'target_type' attribute indicates the
> >                 current setting which may dynamically change based on what
> >                 memory regions are activated in this decode hierarchy.
> > +
> > +What:          /sys/bus/cxl/devices/decoderX.Y/create_region
> > +Date:          August, 2021
> 
> Maybe move this to January, 2022?
> 

Okay.

> > +KernelVersion: v5.18
> > +Contact:       linux-cxl@vger.kernel.org
> > +Description:
> > +               Creates a new CXL region. Writing a value of the form
> > +               "regionX.Y:Z" will create a new uninitialized region that will
> > +               be mapped by the CXL decoderX.Y.
> 
> "Write a value of the form 'regionX.Y:Z' to instantiate a new region
> within the decode range bounded by decoderX.Y."
> 
> > +               Reading from this node will
> > +               return a newly allocated region name. In order to create a
> > +               region (writing) you must use a value returned from reading the
> > +               node.
> 
> "The value written must match the current value returned from reading
> this attribute. This behavior lets the kernel arbitrate racing
> attempts to create a region. The thread that fails to write loops and
> tries the next value."

Okay.

> 
> > +               subsequently configured and bound to a region driver before they
> > +               can be used.
> > +
> > +What:          /sys/bus/cxl/devices/decoderX.Y/delete_region
> > +Date:          August, 2021
> > +KernelVersion: v5.18
> > +Contact:       linux-cxl@vger.kernel.org
> > +Description:
> > +               Deletes the named region. A region must be unbound from the
> > +               region driver before being deleted.
> 
> ....why does it need to be unbound first? device_unregister() triggers
> device_release_driver()?

It shouldn't. This must be old verbiage from when the operation was expected to
be different.

> 
> Side note: I am more and more thinking that even though the BIOS may
> try to lock some configurations down, if the system owner wants the
> region deleted the kernel should probably do everything in its power
> to oblige and override the BIOS including secondary bus resets to get
> the decoders unlocked. I.e. either the driver needs to hide the delete
> region attribute for locked regions, or it needs to give as much power
> to root as someone who has physical access and can rip out devices
> that are decoding locked ranges. Something we can discuss later, but
> every 'disable' and 'delete' interface requires answering the question
> "what about 'locked' configs and hot-remove?".
> 

When I originally started on these patches, I hadn't been planning to enumerate
regions based on the current system programming. So the original answer would
have been, the region is hidden. With that change of mind, I think it makes
sense to do everything in the driver's power to obey, however, it starts to get
really messy if/when switch ports are decoding to other devices and need a reset
to unlock decoders. With that in mind, first attempt should be to disallow
deleting locked regions. Perhaps later we can add an "unlock" interface which
must be used before delete.

> > +               region in the form "regionX.Y:Z". The region's name, allocated
> > +               by reading create_region, will also be released.
> > diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst
> > index db476bb170b6..66ddc58a21b1 100644
> > --- a/Documentation/driver-api/cxl/memory-devices.rst
> > +++ b/Documentation/driver-api/cxl/memory-devices.rst
> > @@ -362,6 +362,17 @@ CXL Core
> >  .. kernel-doc:: drivers/cxl/core/mbox.c
> >     :doc: cxl mbox
> >
> > +CXL Regions
> > +-----------
> > +.. kernel-doc:: drivers/cxl/region.h
> > +   :identifiers:
> > +
> > +.. kernel-doc:: drivers/cxl/core/region.c
> > +   :doc: cxl core region
> > +
> > +.. kernel-doc:: drivers/cxl/core/region.c
> > +   :identifiers:
> > +
> >  External Interfaces
> >  ===================
> >
> > diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> > index 6d37cd78b151..39ce8f2f2373 100644
> > --- a/drivers/cxl/core/Makefile
> > +++ b/drivers/cxl/core/Makefile
> > @@ -4,6 +4,7 @@ obj-$(CONFIG_CXL_BUS) += cxl_core.o
> >  ccflags-y += -I$(srctree)/drivers/cxl
> >  cxl_core-y := port.o
> >  cxl_core-y += pmem.o
> > +cxl_core-y += region.o
> >  cxl_core-y += regs.o
> >  cxl_core-y += memdev.o
> >  cxl_core-y += mbox.o
> > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> > index efbaa851929d..35fd08d560e2 100644
> > --- a/drivers/cxl/core/core.h
> > +++ b/drivers/cxl/core/core.h
> > @@ -10,6 +10,9 @@ extern const struct device_type cxl_memdev_type;
> >
> >  extern struct attribute_group cxl_base_attribute_group;
> >
> > +extern struct device_attribute dev_attr_create_region;
> > +extern struct device_attribute dev_attr_delete_region;
> > +
> >  struct cxl_send_command;
> >  struct cxl_mem_query_commands;
> >  int cxl_query_cmd(struct cxl_memdev *cxlmd,
> > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > index 631dec0fa79e..0826208b2bdf 100644
> > --- a/drivers/cxl/core/port.c
> > +++ b/drivers/cxl/core/port.c
> > @@ -215,6 +215,8 @@ static struct attribute_group cxl_decoder_base_attribute_group = {
> >  };
> >
> >  static struct attribute *cxl_decoder_root_attrs[] = {
> > +       &dev_attr_create_region.attr,
> > +       &dev_attr_delete_region.attr,
> >         &dev_attr_cap_pmem.attr,
> >         &dev_attr_cap_ram.attr,
> >         &dev_attr_cap_type2.attr,
> > @@ -267,11 +269,23 @@ static const struct attribute_group *cxl_decoder_endpoint_attribute_groups[] = {
> >         NULL,
> >  };
> >
> > +static int delete_region(struct device *dev, void *arg)
> > +{
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > +
> > +       return cxl_delete_region(cxld, dev_name(dev));
> > +}
> > +
> >  static void cxl_decoder_release(struct device *dev)
> >  {
> >         struct cxl_decoder *cxld = to_cxl_decoder(dev);
> >         struct cxl_port *port = to_cxl_port(dev->parent);
> >
> > +       device_for_each_child(&cxld->dev, cxld, delete_region);
> 
> This is too late. Regions should be deleted before the decoder is
> unregistered, and I think it happens naturally due to the root port
> association with memdevs. I.e. a root decoder is unregistered by its
> parent port being unregistered which is triggered by cxl_acpi
> ->remove(). That ->remove() event triggers all memdevs to disconnect,
> albeit in a workqueue. So as long as cxl_acpi ->remove flushes that
> workqueue then it knows that all memdevs have triggered ->remove().
> 

Yes, I agree with this. I wasn't entirely certain of what I was doing when I
originally wrote this. However, it should probably become a WARN of some sort
that no regions are left.

> If the behavior of a region is that it gets deleted upon the last
> memdev being unmapped from it then there should not be any regions to
> clean up at decoder release time.
> 

This is the goal, but not implemented yet.

> > +
> > +       dev_WARN_ONCE(dev, !ida_is_empty(&cxld->region_ida),
> > +                     "Lost track of a region");
> > +
> >         ida_free(&port->decoder_ida, cxld->id);
> >         kfree(cxld);
> >  }
> > @@ -1194,6 +1208,8 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
> >         cxld->target_type = CXL_DECODER_EXPANDER;
> >         cxld->platform_res = (struct resource)DEFINE_RES_MEM(0, 0);
> >
> > +       ida_init(&cxld->region_ida);
> > +
> >         return cxld;
> >  err:
> >         kfree(cxld);
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > new file mode 100644
> > index 000000000000..1a448543db0d
> > --- /dev/null
> > +++ b/drivers/cxl/core/region.c
> > @@ -0,0 +1,208 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Copyright(c) 2021 Intel Corporation. All rights reserved. */
> 
> Happy New Year! Let's go to 2022 here.
> 

Sure. I've always wondered if it's supposed to be when it was written, or when
it was merged.

> > +#include <linux/io-64-nonatomic-lo-hi.h>
> > +#include <linux/device.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <linux/idr.h>
> > +#include <region.h>
> > +#include <cxl.h>
> > +#include "core.h"
> > +
> > +/**
> > + * DOC: cxl core region
> > + *
> > + * Regions are managed through the Linux device model. Each region instance is a
> > + * unique struct device. CXL core provides functionality to create, destroy, and
> > + * configure regions. This is all implemented here. Binding a region
> > + * (programming the hardware) is handled by a separate region driver.
> > + */
> 
> Somewhat information lite, how about:
> 
> "CXL Regions represent mapped memory capacity in system physical
> address space. Whereas the CXL Root Decoders identify the bounds of
> potential CXL Memory ranges, Regions represent the active mapped
> capacity by the HDM Decoder Capability structures throughout the Host
> Bridges, Switches, and Endpoints in the topology."
> 

Okay.

> > +
> > +static void cxl_region_release(struct device *dev);
> > +
> > +static const struct device_type cxl_region_type = {
> > +       .name = "cxl_region",
> > +       .release = cxl_region_release,
> > +};
> > +
> > +static ssize_t create_region_show(struct device *dev,
> > +                                 struct device_attribute *attr, char *buf)
> > +{
> > +       struct cxl_port *port = to_cxl_port(dev->parent);
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > +       int rc;
> > +
> > +       if (dev_WARN_ONCE(dev, !is_root_decoder(dev),
> > +                         "Invalid decoder selected for region.")) {
> > +               return -ENODEV;
> > +       }
> 
> This can go, it's already the case that this attribute is only listed
> in 'cxl_decoder_root_attrs'
> 
> > +       rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);
> 
> This looks broken. What if userspace does:
> 
> region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> echo $region > /sys/bus/cxl/devices/decoder0.0/create_region
> 
> ...i.e. it should only advance to create a new name after the previous
> one was instantiated / confirmed via a write.
> 
> Also, sysfs values are world readable by default, so non-root can burn
> up region_ida.
> 

I misunderstood your original request for this interface. How it's working is by
design. I'll change it to work the way you describe.

> > +       if (rc < 0) {
> > +               dev_err(&cxld->dev, "Couldn't get a new id\n");
> > +               return rc;
> > +       }
> > +
> > +       return sysfs_emit(buf, "region%d.%d:%d\n", port->id, cxld->id, rc);
> > +}
> > +
> > +static ssize_t create_region_store(struct device *dev,
> > +                                  struct device_attribute *attr,
> > +                                  const char *buf, size_t len)
> > +{
> > +       struct cxl_port *port = to_cxl_port(dev->parent);
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > +       int decoder_id, port_id, region_id;
> > +       struct cxl_region *cxlr;
> > +       ssize_t rc;
> > +
> > +       if (sscanf(buf, "region%d.%d:%d", &port_id, &decoder_id, &region_id) != 3)
> > +               return -EINVAL;
> 
> With the proposed change above to cache the current 'next' region name
> this can just be something like:
> 
> sysfs_streq(buf, cxld->next);
> 

Sounds good.

> > +
> > +       if (decoder_id != cxld->id)
> > +               return -EINVAL;
> > +
> > +       if (port_id != port->id)
> > +               return -EINVAL;
> > +
> > +       cxlr = cxl_alloc_region(cxld, region_id);
> > +       if (IS_ERR(cxlr))
> > +               return PTR_ERR(cxlr);
> > +
> > +       rc = cxl_add_region(cxld, cxlr);
> > +       if (rc) {
> > +               kfree(cxlr);
> 
> ...'add' failures usually require a put_device(), are you sure kfree()
> is correct here.
> 

You had requested, and I can't recall why, that regions are not devm managed. It
sounds like you're asking for it to be devm managed. I wish I took more notes of
our conversations...

> > +               return rc;
> > +       }
> > +
> > +       return len;
> > +}
> > +DEVICE_ATTR_RW(create_region);
> > +
> > +static ssize_t delete_region_store(struct device *dev,
> > +                                  struct device_attribute *attr,
> > +                                  const char *buf, size_t len)
> > +{
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > +       int rc;
> > +
> > +       rc = cxl_delete_region(cxld, buf);
> 
> I would have expected symmetry with cxl_add_region() i.e. convert @buf
> to @cxlr and keep the function signatures between add and delete
> aligned.
> 
> > +       if (rc)
> > +               return rc;
> > +
> > +       return len;
> > +}
> > +DEVICE_ATTR_WO(delete_region);
> > +
> > +struct cxl_region *to_cxl_region(struct device *dev)
> > +{
> > +       if (dev_WARN_ONCE(dev, dev->type != &cxl_region_type,
> > +                         "not a cxl_region device\n"))
> > +               return NULL;
> > +
> > +       return container_of(dev, struct cxl_region, dev);
> > +}
> > +EXPORT_SYMBOL_GPL(to_cxl_region);
> > +
> > +static void cxl_region_release(struct device *dev)
> > +{
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       ida_free(&cxld->region_ida, cxlr->id);
> > +       kfree(cxlr);
> > +}
> > +
> > +struct cxl_region *cxl_alloc_region(struct cxl_decoder *cxld, int id)
> > +{
> > +       struct cxl_region *cxlr;
> > +
> > +       cxlr = kzalloc(sizeof(*cxlr), GFP_KERNEL);
> > +       if (!cxlr)
> > +               return ERR_PTR(-ENOMEM);
> 
> To keep symmetry with other device object allocations in the cxl/core
> I would expect to see the device_initialize() dev->type and dev->bus
> setup occur here as well.
> 

Okay.

> > +
> > +       cxlr->id = id;
> > +
> > +       return cxlr;
> > +}
> > +
> > +/**
> > + * cxl_add_region - Adds a region to a decoder
> > + * @cxld: Parent decoder.
> > + * @cxlr: Region to be added to the decoder.
> > + *
> > + * This is the second step of region initialization. Regions exist within an
> > + * address space which is mapped by a @cxld. That @cxld must be a root decoder,
> > + * and it enforces constraints upon the region as it is configured.
> > + *
> > + * Return: 0 if the region was added to the @cxld, else returns negative error
> > + * code. The region will be named "regionX.Y.Z" where X is the port, Y is the
> > + * decoder id, and Z is the region number.
> > + */
> > +int cxl_add_region(struct cxl_decoder *cxld, struct cxl_region *cxlr)
> > +{
> > +       struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> > +       struct device *dev = &cxlr->dev;
> > +       int rc;
> > +
> > +       device_initialize(dev);
> > +       dev->parent = &cxld->dev;
> > +       device_set_pm_not_required(dev);
> > +       dev->bus = &cxl_bus_type;
> > +       dev->type = &cxl_region_type;
> > +       rc = dev_set_name(dev, "region%d.%d:%d", port->id, cxld->id, cxlr->id);
> > +       if (rc)
> > +               goto err;
> > +
> > +       rc = device_add(dev);
> > +       if (rc)
> > +               goto err;
> > +
> > +       dev_dbg(dev, "Added to %s\n", dev_name(&cxld->dev));
> > +
> > +       return 0;
> > +
> > +err:
> > +       put_device(dev);
> 
> Here is that put_device() I was expecting, that kfree() earlier was a
> double-free it seems.
> 
> Also, I would have expected a devm action to remove this. Something like:
> 
> struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> 
> cxl_device_lock(&port->dev);
> if (port->dev.driver)
>     devm_cxl_add_region(port->uport, cxld, id);
> else
>     rc = -ENXIO;
> cxl_device_unlock(&port->dev);
> 
> ...then no matter what you know the region will be unregistered when
> the root port goes away.
> 

Right. So as I said above, you had suggested devm wasn't the right interface. I
can't recall why. I can rework it to use devm, and then it should work that way.

> > +       return rc;
> > +}
> > +
> > +static struct cxl_region *cxl_find_region_by_name(struct cxl_decoder *cxld,
> > +                                                 const char *name)
> > +{
> > +       struct device *region_dev;
> > +
> > +       region_dev = device_find_child_by_name(&cxld->dev, name);
> > +       if (!region_dev)
> > +               return ERR_PTR(-ENOENT);
> > +
> > +       return to_cxl_region(region_dev);
> > +}
> > +
> > +/**
> > + * cxl_delete_region - Deletes a region
> > + * @cxld: Parent decoder
> > + * @region_name: Named region, ie. regionX.Y:Z
> > + */
> > +int cxl_delete_region(struct cxl_decoder *cxld, const char *region_name)
> > +{
> > +       struct cxl_region *cxlr;
> > +
> > +       device_lock(&cxld->dev);
> 
> cxl_device_lock()
> 
> ...if the lock is needed, but I don't see why the lock is needed?
> 

I don't see one either. I can't recall if I had one previously.

> > +
> > +       cxlr = cxl_find_region_by_name(cxld, region_name);
> > +       if (IS_ERR(cxlr)) {
> > +               device_unlock(&cxld->dev);
> > +               return PTR_ERR(cxlr);
> > +       }
> > +
> > +       dev_dbg(&cxld->dev, "Requested removal of %s from %s\n",
> > +               dev_name(&cxlr->dev), dev_name(&cxld->dev));
> > +
> > +       device_unregister(&cxlr->dev);
> > +       device_unlock(&cxld->dev);
> 
> This would need to change to devm_release_action() of course if the
> add side changes to devm_cxl_add_region().
> 

Yes.

> > +
> > +       put_device(&cxlr->dev);
> > +
> > +       return 0;
> > +}
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index 13fb06849199..b9f0099c1f39 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -221,6 +221,7 @@ enum cxl_decoder_type {
> >   * @target_type: accelerator vs expander (type2 vs type3) selector
> >   * @flags: memory type capabilities and locking
> >   * @target_lock: coordinate coherent reads of the target list
> > + * @region_ida: allocator for region ids.
> >   * @nr_targets: number of elements in @target
> >   * @target: active ordered target list in current decoder configuration
> >   */
> > @@ -236,6 +237,7 @@ struct cxl_decoder {
> >         enum cxl_decoder_type target_type;
> >         unsigned long flags;
> >         seqlock_t target_lock;
> > +       struct ida region_ida;
> >         int nr_targets;
> >         struct cxl_dport *target[];
> >  };
> > @@ -323,6 +325,13 @@ struct cxl_ep {
> >         struct list_head list;
> >  };
> >
> > +bool is_cxl_region(struct device *dev);
> > +struct cxl_region *to_cxl_region(struct device *dev);
> > +struct cxl_region *cxl_alloc_region(struct cxl_decoder *cxld,
> > +                                   int interleave_ways);
> > +int cxl_add_region(struct cxl_decoder *cxld, struct cxl_region *cxlr);
> > +int cxl_delete_region(struct cxl_decoder *cxld, const char *region);
> > +
> >  static inline bool is_cxl_root(struct cxl_port *port)
> >  {
> >         return port->uport == port->dev.parent;
> > diff --git a/drivers/cxl/region.h b/drivers/cxl/region.h
> > new file mode 100644
> > index 000000000000..eb1249e3c1d4
> > --- /dev/null
> > +++ b/drivers/cxl/region.h
> > @@ -0,0 +1,38 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/* Copyright(c) 2021 Intel Corporation. */
> > +#ifndef __CXL_REGION_H__
> > +#define __CXL_REGION_H__
> > +
> > +#include <linux/uuid.h>
> > +
> > +#include "cxl.h"
> > +
> > +/**
> > + * struct cxl_region - CXL region
> > + * @dev: This region's device.
> > + * @id: This regions id. Id is globally unique across all regions.
> 
> s/regions/region's/
> 
> > + * @list: Node in decoder's region list.
> > + * @res: Resource this region carves out of the platform decode range.
> > + * @config: HDM decoder program config
> > + * @config.size: Size of the region determined from LSA or userspace.
> > + * @config.uuid: The UUID for this region.
> > + * @config.interleave_ways: Number of interleave ways this region is configured for.
> > + * @config.interleave_granularity: Interleave granularity of region
> > + * @config.targets: The memory devices comprising the region.
> > + */
> > +struct cxl_region {
> > +       struct device dev;
> > +       int id;
> > +       struct list_head list;
> > +       struct resource *res;
> > +
> > +       struct {
> > +               u64 size;
> > +               uuid_t uuid;
> > +               int interleave_ways;
> > +               int interleave_granularity;
> > +               struct cxl_memdev *targets[CXL_DECODER_MAX_INTERLEAVE];
> > +       } config;
> 
> Why a sub-struct?
> 

Just stylistic. I originally did have this as a named struct and used it as
such. I will change it.

> > +};
> > +
> > +#endif
> > diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
> > index 82e49ab0937d..3fe6d34e6d59 100644
> > --- a/tools/testing/cxl/Kbuild
> > +++ b/tools/testing/cxl/Kbuild
> > @@ -46,6 +46,7 @@ cxl_core-y += $(CXL_CORE_SRC)/memdev.o
> >  cxl_core-y += $(CXL_CORE_SRC)/mbox.o
> >  cxl_core-y += $(CXL_CORE_SRC)/pci.o
> >  cxl_core-y += $(CXL_CORE_SRC)/hdm.o
> > +cxl_core-y += $(CXL_CORE_SRC)/region.o
> >  cxl_core-y += config_check.o
> >
> >  obj-m += test/
> > --
> > 2.35.0
> >

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 02/14] cxl/region: Introduce concept of region configuration
  2022-01-29  0:25   ` Dan Williams
  2022-02-01 14:59     ` Ben Widawsky
@ 2022-02-01 23:11     ` Ben Widawsky
  2022-02-03 17:48       ` Dan Williams
  2022-02-17 18:36     ` Ben Widawsky
  2 siblings, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-02-01 23:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Bjorn Helgaas,
	Linux NVDIMM, Linux PCI

On 22-01-28 16:25:34, Dan Williams wrote:
> On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > The region creation APIs create a vacant region. Configuring the region
> > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > will be provided to allow userspace to configure the region.  Finally
> > once all configuration is complete, userspace may activate the region.
> >
> > Introduced here are the most basic attributes needed to configure a
> > region. Details of these attribute are described in the ABI
> 
> s/attribute/attributes/
> 
> > Documentation. Sanity checking of configuration parameters are done at
> > region binding time. This consolidates all such logic in one place,
> > rather than being strewn across multiple places.
> 
> I think that's too late for some of the validation. The complex
> validation that the region driver does throughout the topology is
> different from the basic input validation that can  be done at the
> sysfs write time. For example ,this patch allows negative
> interleave_granularity values to specified, just return -EINVAL. I
> agree that sysfs should not validate everything, I disagree with
> pushing all validation to cxl_region_probe().
> 

Two points:c
1. How do we distinguish "basic input validation". It'd be good if we could
   define "basic input validation". For instance, when I first wrote these
   patches, x3 would have been EINVAL, but today it's allowed. Can you help
   enumerate what you consider basic.

2. I like the idea that all validation takes place in one place. Obviously you
   do not. So, see #1 and I will rework.

> >
> > A example is provided below:
> >
> > /sys/bus/cxl/devices/region0.0:0
> > ├── interleave_granularity
> > ├── interleave_ways
> > ├── offset
> > ├── size
> > ├── subsystem -> ../../../../../../bus/cxl
> > ├── target0
> > ├── uevent
> > └── uuid
> 
> As mentioned off-list, it looks like devtype and modalias are missing.
> 

Thanks.

> >
> > Reported-by: kernel test robot <lkp@intel.com> (v2)
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > ---
> >  Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
> >  drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
> >  2 files changed, 340 insertions(+)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > index dcc728458936..50ba5018014d 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > @@ -187,3 +187,43 @@ Description:
> >                 region driver before being deleted. The attributes expects a
> >                 region in the form "regionX.Y:Z". The region's name, allocated
> >                 by reading create_region, will also be released.
> > +
> > +What:          /sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset
> 
> This is just another 'resource' attribute for the physical base
> address of the region, right? 'offset' sounds like something that
> would be relative instead of absolute.
> 

It is offset. I can change it to physical base if you'd like but I thought that
information wasn't critically important for userspace to have. Does userspace
care about the physical base?

> > +Date:          August, 2021
> 
> Same date update comment here.
> 
> > +KernelVersion: v5.18
> > +Contact:       linux-cxl@vger.kernel.org
> > +Description:
> > +               (RO) A region resides within an address space that is claimed by
> > +               a decoder.
> 
> "A region is a contiguous partition of a CXL Root decoder address space."
> 
> >                  Region space allocation is handled by the driver, but
> 
> "Region capacity is allocated by writing to the size attribute, the
> resulting physical address base determined by the driver is reflected
> here."
> 
> > +               the offset may be read by userspace tooling in order to
> > +               determine fragmentation, and available size for new regions.
> 
> I would also expect, before / along with these new region attributes,
> there would be 'available' and 'max_extent_available' at the decoder
> level to indicate how much free space the decoder has and how big the
> next region creation can be. User tooling can walk  the decoder and
> the regions together to determine fragmentation if necessary, but for
> the most part the tool likely only cares about "how big can the next
> region be?" and "how full is this decoder?".

Sounds good.

> 
> 
> > +
> > +What:
> > +/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
> > +Date:          August, 2021
> > +KernelVersion: v5.18
> > +Contact:       linux-cxl@vger.kernel.org
> > +Description:
> > +               (RW) Configuring regions requires a minimal set of parameters in
> > +               order for the subsequent bind operation to succeed. The
> > +               following parameters are defined:
> 
> Let's split up the descriptions into individual sections. That can
> also document the order that attributes must be written. For example,
> doesn't size need to be set before targets are added so that targets
> can be validated whether they have sufficient capacity?
> 

Okay. Order doesn't matter if you do validation all in one place as it is, but
sounds like we're changing that. So I can split it when we figure out what
validation is actually occurring at the sysfs attr boundary.

> > +
> > +               ==      ========================================================
> > +               interleave_granularity Mandatory. Number of consecutive bytes
> > +                       each device in the interleave set will claim. The
> > +                       possible interleave granularity values are determined by
> > +                       the CXL spec and the participating devices.
> > +               interleave_ways Mandatory. Number of devices participating in the
> > +                       region. Each device will provide 1/interleave of storage
> > +                       for the region.
> > +               size    Manadatory. Phsyical address space the region will
> > +                       consume.
> 
> s/Phsyical/Physical/
> 
> > +               target  Mandatory. Memory devices are the backing storage for a
> > +                       region. There will be N targets based on the number of
> > +                       interleave ways that the top level decoder is configured
> > +                       for.
> 
> That doesn't sound right, IW at the root != IW at the endpoint level
> and the region needs to record all the endpoint level targets.


Yes This is wrong. I thought I had fixed it, but I guess not.

> 
> > Each target must be set with a memdev device ie.
> > +                       'mem1'. This attribute only becomes available after
> > +                       setting the 'interleave' attribute.
> > +               uuid    Optional. A unique identifier for the region. If none is
> > +                       selected, the kernel will create one.
> 
> Let's drop the Mandatory / Optional distinction, or I am otherwise not
> understanding what this is trying to document. For example 'uuid' is
> "mandatory" for PMEM regions and "omitted" for volatile regions not
> optional.
> 

Well the kernel fills it in if userspace leaves it out. I'm guessing you're
going to ask me to change that, so I will remove Mandatory/Optional.

> > +               ==      ========================================================
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > index 1a448543db0d..3b48e0469fc7 100644
> > --- a/drivers/cxl/core/region.c
> > +++ b/drivers/cxl/core/region.c
> > @@ -3,9 +3,12 @@
> >  #include <linux/io-64-nonatomic-lo-hi.h>
> >  #include <linux/device.h>
> >  #include <linux/module.h>
> > +#include <linux/sizes.h>
> >  #include <linux/slab.h>
> > +#include <linux/uuid.h>
> >  #include <linux/idr.h>
> >  #include <region.h>
> > +#include <cxlmem.h>
> >  #include <cxl.h>
> >  #include "core.h"
> >
> > @@ -18,11 +21,305 @@
> >   * (programming the hardware) is handled by a separate region driver.
> >   */
> >
> > +struct cxl_region *to_cxl_region(struct device *dev);
> > +static const struct attribute_group region_interleave_group;
> > +
> > +static bool is_region_active(struct cxl_region *cxlr)
> > +{
> > +       /* TODO: Regions can't be activated yet. */
> > +       return false;
> 
> This function seems redundant with just checking "cxlr->dev.driver !=
> NULL"? The benefit of that is there is no need to carry a TODO in the
> series.
> 

Yeah. I think checking driver bind status is sufficient to replace this.

> > +}
> > +
> > +static void remove_target(struct cxl_region *cxlr, int target)
> > +{
> > +       struct cxl_memdev *cxlmd;
> > +
> > +       cxlmd = cxlr->config.targets[target];
> > +       if (cxlmd)
> > +               put_device(&cxlmd->dev);
> 
> A memdev can be a member of multiple regions at once, shouldn't this
> be an endpoint decoder or similar, not the entire memdev?

Is this referring to the later question about whether targets are decoders or
memdevs? The thought was each region would hold a reference to all memdevs in
the interleave set.

> 
> Also, if memdevs autoremove themselves from regions at memdev
> ->remove() time then I don't think the region needs to hold references
> on memdevs.
> 

I'll defer to you on that. I'll remove holding the reference, but I definitely
haven't solved the interaction when a memdev goes away. I had been thinking the
inverse originally, a memdev can't go away until the region is gone. According
to the spec, these devices can't be hot removed, only managed remove, so if
things blew up, not our problem. However, if we have decent infrastructure to
support better than that, we should.

> > +       cxlr->config.targets[target] = NULL;
> > +}
> > +
> > +static ssize_t interleave_ways_show(struct device *dev,
> > +                                   struct device_attribute *attr, char *buf)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
> > +}
> > +
> > +static ssize_t interleave_ways_store(struct device *dev,
> > +                                    struct device_attribute *attr,
> > +                                    const char *buf, size_t len)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       int ret, prev_iw;
> > +       int val;
> 
> I would expect:
> 
> if (dev->driver)
>    return -EBUSY;
> 
> ...to shutdown configuration writes once the region is active. Might
> also need a region-wide seqlock like target_list_show. So that region
> probe drains  all active sysfs writers before assuming the
> configuration is stable.
> 

Okay.

> > +
> > +       prev_iw = cxlr->config.interleave_ways;
> > +       ret = kstrtoint(buf, 0, &val);
> > +       if (ret)
> > +               return ret;
> > +       if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
> > +               return -EINVAL;
> > +
> > +       cxlr->config.interleave_ways = val;
> > +
> > +       ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
> > +       if (ret < 0)
> > +               goto err;
> > +
> > +       sysfs_notify(&dev->kobj, NULL, "target_interleave");
> 
> Why?
> 

I copied it from another driver. I didn't check if it was actually needed or
not.

> > +
> > +       while (prev_iw > cxlr->config.interleave_ways)
> > +               remove_target(cxlr, --prev_iw);
> 
> To make the kernel side simpler this attribute could just require that
> setting interleave ways is a one way street, if you want to change it
> you need to delete the region and start over.
> 

I'm fine with that.

> > +
> > +       return len;
> > +
> > +err:
> > +       cxlr->config.interleave_ways = prev_iw;
> > +       return ret;
> > +}
> > +static DEVICE_ATTR_RW(interleave_ways);
> > +
> > +static ssize_t interleave_granularity_show(struct device *dev,
> > +                                          struct device_attribute *attr,
> > +                                          char *buf)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_granularity);
> > +}
> > +
> > +static ssize_t interleave_granularity_store(struct device *dev,
> > +                                           struct device_attribute *attr,
> > +                                           const char *buf, size_t len)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       int val, ret;
> > +
> > +       ret = kstrtoint(buf, 0, &val);
> > +       if (ret)
> > +               return ret;
> > +       cxlr->config.interleave_granularity = val;
> 
> This wants minimum input validation and synchronization against an
> active region.
> 
> > +
> > +       return len;
> > +}
> > +static DEVICE_ATTR_RW(interleave_granularity);
> > +
> > +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> > +                          char *buf)
> > +{
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       resource_size_t offset;
> > +
> > +       if (!cxlr->res)
> > +               return sysfs_emit(buf, "\n");
> 
> Should be an error I would think. I.e. require size to be set before
> s/offset/resource/ can be read.
> 
> > +
> > +       offset = cxld->platform_res.start - cxlr->res->start;
> 
> Why make usersapce do the offset math?
> 
> > +
> > +       return sysfs_emit(buf, "%pa\n", &offset);
> > +}
> > +static DEVICE_ATTR_RO(offset);
> 
> This can be DEVICE_ATTR_ADMIN_RO() to hide physical address layout
> information from non-root.
> 
> > +
> > +static ssize_t size_show(struct device *dev, struct device_attribute *attr,
> > +                        char *buf)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       return sysfs_emit(buf, "%llu\n", cxlr->config.size);
> 
> Perhaps no need to store size separately if this becomes:
> 
> sysfs_emit(buf, "%llu\n", (unsigned long long) resource_size(cxlr->res));
> 
> 
> ...?
> 
> > +}
> > +
> > +static ssize_t size_store(struct device *dev, struct device_attribute *attr,
> > +                         const char *buf, size_t len)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       unsigned long long val;
> > +       ssize_t rc;
> > +
> > +       rc = kstrtoull(buf, 0, &val);
> > +       if (rc)
> > +               return rc;
> > +
> > +       device_lock(&cxlr->dev);
> > +       if (is_region_active(cxlr))
> > +               rc = -EBUSY;
> > +       else
> > +               cxlr->config.size = val;
> > +       device_unlock(&cxlr->dev);
> 
> I think lockdep will complain about device_lock() usage in an
> attribute. Try changing this to cxl_device_lock() with
> CONFIG_PROVE_CXL_LOCKING=y.
> 
> > +
> > +       return rc ? rc : len;
> > +}
> > +static DEVICE_ATTR_RW(size);
> > +
> > +static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> > +                        char *buf)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
> > +}
> > +
> > +static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> > +                         const char *buf, size_t len)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       ssize_t rc;
> > +
> > +       if (len != UUID_STRING_LEN + 1)
> > +               return -EINVAL;
> > +
> > +       device_lock(&cxlr->dev);
> > +       if (is_region_active(cxlr))
> > +               rc = -EBUSY;
> > +       else
> > +               rc = uuid_parse(buf, &cxlr->config.uuid);
> > +       device_unlock(&cxlr->dev);
> > +
> > +       return rc ? rc : len;
> > +}
> > +static DEVICE_ATTR_RW(uuid);
> > +
> > +static struct attribute *region_attrs[] = {
> > +       &dev_attr_interleave_ways.attr,
> > +       &dev_attr_interleave_granularity.attr,
> > +       &dev_attr_offset.attr,
> > +       &dev_attr_size.attr,
> > +       &dev_attr_uuid.attr,
> > +       NULL,
> > +};
> > +
> > +static const struct attribute_group region_group = {
> > +       .attrs = region_attrs,
> > +};
> > +
> > +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> > +{
> > +       int ret;
> > +
> > +       device_lock(&cxlr->dev);
> > +       if (!cxlr->config.targets[n])
> > +               ret = sysfs_emit(buf, "\n");
> > +       else
> > +               ret = sysfs_emit(buf, "%s\n",
> > +                                dev_name(&cxlr->config.targets[n]->dev));
> > +       device_unlock(&cxlr->dev);
> 
> The component contribution of a memdev to a region is a DPA-span, not
> the whole memdev. I would expect something like dax_mapping_attributes
> or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
> information about the component contribution of a memdev to a region.
> 

I had been thinking the kernel would manage the DPS spans of a memdev (and
create the mappings). I can make this look like dax_mapping_attributes.

> > +
> > +       return ret;
> > +}
> > +
> > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > +                         size_t len)
> > +{
> > +       struct device *memdev_dev;
> > +       struct cxl_memdev *cxlmd;
> > +
> > +       device_lock(&cxlr->dev);
> > +
> > +       if (len == 1 || cxlr->config.targets[n])
> > +               remove_target(cxlr, n);
> > +
> > +       /* Remove target special case */
> > +       if (len == 1) {
> > +               device_unlock(&cxlr->dev);
> > +               return len;
> > +       }
> > +
> > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> 
> I think this wants to be an endpoint decoder, not a memdev. Because
> it's the decoder that joins a memdev to a region, or at least a
> decoder should be picked when the memdev is assigned so that the DPA
> mapping can be registered. If all the decoders are allocated then fail
> here.
> 

My preference is obviously how it is, using memdevs and having the decoders
allocated at bind time. I don't have an objective argument why one is better
than the other so I will change it. I will make the interface take a set of
decoders.

> > +       if (!memdev_dev) {
> > +               device_unlock(&cxlr->dev);
> > +               return -ENOENT;
> > +       }
> > +
> > +       /* reference to memdev held until target is unset or region goes away */
> > +
> > +       cxlmd = to_cxl_memdev(memdev_dev);
> > +       cxlr->config.targets[n] = cxlmd;
> > +
> > +       device_unlock(&cxlr->dev);
> > +
> > +       return len;
> > +}
> > +
> > +#define TARGET_ATTR_RW(n)                                                      \
> > +       static ssize_t target##n##_show(                                       \
> > +               struct device *dev, struct device_attribute *attr, char *buf)  \
> > +       {                                                                      \
> > +               return show_targetN(to_cxl_region(dev), buf, (n));             \
> > +       }                                                                      \
> > +       static ssize_t target##n##_store(struct device *dev,                   \
> > +                                        struct device_attribute *attr,        \
> > +                                        const char *buf, size_t len)          \
> > +       {                                                                      \
> > +               return set_targetN(to_cxl_region(dev), buf, (n), len);         \
> > +       }                                                                      \
> > +       static DEVICE_ATTR_RW(target##n)
> > +
> > +TARGET_ATTR_RW(0);
> > +TARGET_ATTR_RW(1);
> > +TARGET_ATTR_RW(2);
> > +TARGET_ATTR_RW(3);
> > +TARGET_ATTR_RW(4);
> > +TARGET_ATTR_RW(5);
> > +TARGET_ATTR_RW(6);
> > +TARGET_ATTR_RW(7);
> > +TARGET_ATTR_RW(8);
> > +TARGET_ATTR_RW(9);
> > +TARGET_ATTR_RW(10);
> > +TARGET_ATTR_RW(11);
> > +TARGET_ATTR_RW(12);
> > +TARGET_ATTR_RW(13);
> > +TARGET_ATTR_RW(14);
> > +TARGET_ATTR_RW(15);
> > +
> > +static struct attribute *interleave_attrs[] = {
> > +       &dev_attr_target0.attr,
> > +       &dev_attr_target1.attr,
> > +       &dev_attr_target2.attr,
> > +       &dev_attr_target3.attr,
> > +       &dev_attr_target4.attr,
> > +       &dev_attr_target5.attr,
> > +       &dev_attr_target6.attr,
> > +       &dev_attr_target7.attr,
> > +       &dev_attr_target8.attr,
> > +       &dev_attr_target9.attr,
> > +       &dev_attr_target10.attr,
> > +       &dev_attr_target11.attr,
> > +       &dev_attr_target12.attr,
> > +       &dev_attr_target13.attr,
> > +       &dev_attr_target14.attr,
> > +       &dev_attr_target15.attr,
> > +       NULL,
> > +};
> > +
> > +static umode_t visible_targets(struct kobject *kobj, struct attribute *a, int n)
> > +{
> > +       struct device *dev = container_of(kobj, struct device, kobj);
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       if (n < cxlr->config.interleave_ways)
> > +               return a->mode;
> > +       return 0;
> > +}
> > +
> > +static const struct attribute_group region_interleave_group = {
> > +       .attrs = interleave_attrs,
> > +       .is_visible = visible_targets,
> > +};
> > +
> > +static const struct attribute_group *region_groups[] = {
> > +       &region_group,
> > +       &region_interleave_group,
> > +       NULL,
> > +};
> > +
> >  static void cxl_region_release(struct device *dev);
> >
> >  static const struct device_type cxl_region_type = {
> >         .name = "cxl_region",
> >         .release = cxl_region_release,
> > +       .groups = region_groups
> >  };
> >
> >  static ssize_t create_region_show(struct device *dev,
> > @@ -108,8 +405,11 @@ static void cxl_region_release(struct device *dev)
> >  {
> >         struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> >         struct cxl_region *cxlr = to_cxl_region(dev);
> > +       int i;
> >
> >         ida_free(&cxld->region_ida, cxlr->id);
> > +       for (i = 0; i < cxlr->config.interleave_ways; i++)
> > +               remove_target(cxlr, i);
> 
> Like the last patch this feels too late. I expect whatever unregisters
> the region should have already handled removing the targets.
> 

I think I already explained why it works this way. I will change it.

> >         kfree(cxlr);
> >  }
> >
> > --
> > 2.35.0
> >

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 01/14] cxl/region: Add region creation ABI
  2022-01-28 18:59     ` Dan Williams
@ 2022-02-02 18:26       ` Ben Widawsky
  2022-02-02 18:28         ` Ben Widawsky
  0 siblings, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-02-02 18:26 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On 22-01-28 10:59:26, Dan Williams wrote:
> On Fri, Jan 28, 2022 at 10:14 AM Dan Williams <dan.j.williams@intel.com> wrote:
> [..]
> > Here is that put_device() I was expecting, that kfree() earlier was a
> > double-free it seems.
> >
> > Also, I would have expected a devm action to remove this. Something like:
> >
> > struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> >
> > cxl_device_lock(&port->dev);
> > if (port->dev.driver)
> >     devm_cxl_add_region(port->uport, cxld, id);

I assume you mean devm_cxl_delete_region(), yes?

> > else
> >     rc = -ENXIO;
> > cxl_device_unlock(&port->dev);
> >
> > ...then no matter what you know the region will be unregistered when
> > the root port goes away.
> 
> ...actually, the lock and ->dev.driver check here are not needed
> because this attribute is only registered while the cxl_acpi driver is
> bound. So, it is safe to assume this is protected as decoder remove
> synchronizes against active sysfs users.

I'm somewhat confused when you say devm action to remove this. The current auto
region deletion happens when the ->release() is called. Are you suggesting when
the root decoder is removed I delete the regions at that point?

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 01/14] cxl/region: Add region creation ABI
  2022-02-02 18:26       ` Ben Widawsky
@ 2022-02-02 18:28         ` Ben Widawsky
  2022-02-02 18:48           ` Ben Widawsky
  0 siblings, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-02-02 18:28 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On 22-02-02 10:26:06, Ben Widawsky wrote:
> On 22-01-28 10:59:26, Dan Williams wrote:
> > On Fri, Jan 28, 2022 at 10:14 AM Dan Williams <dan.j.williams@intel.com> wrote:
> > [..]
> > > Here is that put_device() I was expecting, that kfree() earlier was a
> > > double-free it seems.
> > >
> > > Also, I would have expected a devm action to remove this. Something like:
> > >
> > > struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> > >
> > > cxl_device_lock(&port->dev);
> > > if (port->dev.driver)
> > >     devm_cxl_add_region(port->uport, cxld, id);
> 
> I assume you mean devm_cxl_delete_region(), yes?
> 
> > > else
> > >     rc = -ENXIO;
> > > cxl_device_unlock(&port->dev);
> > >
> > > ...then no matter what you know the region will be unregistered when
> > > the root port goes away.
> > 
> > ...actually, the lock and ->dev.driver check here are not needed
> > because this attribute is only registered while the cxl_acpi driver is
> > bound. So, it is safe to assume this is protected as decoder remove
> > synchronizes against active sysfs users.
> 
> I'm somewhat confused when you say devm action to remove this. The current auto
> region deletion happens when the ->release() is called. Are you suggesting when
> the root decoder is removed I delete the regions at that point?

Hmm. I went back and looked and I had changed this functionality at some
point... So forget I said that, it isn't how it's working currently. But the
question remains, are you suggesting I delete in the root decoder
unregistration?

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 01/14] cxl/region: Add region creation ABI
  2022-02-02 18:28         ` Ben Widawsky
@ 2022-02-02 18:48           ` Ben Widawsky
  2022-02-02 19:00             ` Dan Williams
  0 siblings, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-02-02 18:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On 22-02-02 10:28:11, Ben Widawsky wrote:
> On 22-02-02 10:26:06, Ben Widawsky wrote:
> > On 22-01-28 10:59:26, Dan Williams wrote:
> > > On Fri, Jan 28, 2022 at 10:14 AM Dan Williams <dan.j.williams@intel.com> wrote:
> > > [..]
> > > > Here is that put_device() I was expecting, that kfree() earlier was a
> > > > double-free it seems.
> > > >
> > > > Also, I would have expected a devm action to remove this. Something like:
> > > >
> > > > struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> > > >
> > > > cxl_device_lock(&port->dev);
> > > > if (port->dev.driver)
> > > >     devm_cxl_add_region(port->uport, cxld, id);
> > 
> > I assume you mean devm_cxl_delete_region(), yes?
> > 
> > > > else
> > > >     rc = -ENXIO;
> > > > cxl_device_unlock(&port->dev);
> > > >
> > > > ...then no matter what you know the region will be unregistered when
> > > > the root port goes away.
> > > 
> > > ...actually, the lock and ->dev.driver check here are not needed
> > > because this attribute is only registered while the cxl_acpi driver is
> > > bound. So, it is safe to assume this is protected as decoder remove
> > > synchronizes against active sysfs users.
> > 
> > I'm somewhat confused when you say devm action to remove this. The current auto
> > region deletion happens when the ->release() is called. Are you suggesting when
> > the root decoder is removed I delete the regions at that point?
> 
> Hmm. I went back and looked and I had changed this functionality at some
> point... So forget I said that, it isn't how it's working currently. But the
> question remains, are you suggesting I delete in the root decoder
> unregistration?

I think it's easier if I write what I think you mean.... Here are the relevant
parts:

devm_cxl_region_delete() is removed entirely.

static void unregister_region(void *_cxlr)
{
        struct cxl_region *cxlr = _cxlr;

        device_unregister(&cxlr->dev);
}


static int devm_cxl_region_add(struct cxl_decoder *cxld, struct cxl_region *cxlr)
{
        struct cxl_port *port = to_cxl_port(cxld->dev.parent);
        struct device *dev = &cxlr->dev;
        int rc;

        rc = dev_set_name(dev, "region%d.%d:%d", port->id, cxld->id, cxlr->id);
        if (rc)
                return rc;

        rc = device_add(dev);
        if (rc)
                return rc;

        return devm_add_action_or_reset(&cxld->dev, unregister_region, cxlr);
}

static ssize_t delete_region_store(struct device *dev,
                                   struct device_attribute *attr,
                                   const char *buf, size_t len)
{
        struct cxl_decoder *cxld = to_cxl_decoder(dev);
        struct cxl_region *cxlr;

        cxlr = cxl_find_region_by_name(cxld, buf);
        if (IS_ERR(cxlr))
                return PTR_ERR(cxlr);

        devm_release_action(dev, unregister_region, cxlr);

        return len;
}
DEVICE_ATTR_WO(delete_region);

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 01/14] cxl/region: Add region creation ABI
  2022-02-02 18:48           ` Ben Widawsky
@ 2022-02-02 19:00             ` Dan Williams
  2022-02-02 19:02               ` Ben Widawsky
  0 siblings, 1 reply; 70+ messages in thread
From: Dan Williams @ 2022-02-02 19:00 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On Wed, Feb 2, 2022 at 10:48 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> On 22-02-02 10:28:11, Ben Widawsky wrote:
> > On 22-02-02 10:26:06, Ben Widawsky wrote:
> > > On 22-01-28 10:59:26, Dan Williams wrote:
> > > > On Fri, Jan 28, 2022 at 10:14 AM Dan Williams <dan.j.williams@intel.com> wrote:
> > > > [..]
> > > > > Here is that put_device() I was expecting, that kfree() earlier was a
> > > > > double-free it seems.
> > > > >
> > > > > Also, I would have expected a devm action to remove this. Something like:
> > > > >
> > > > > struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> > > > >
> > > > > cxl_device_lock(&port->dev);
> > > > > if (port->dev.driver)
> > > > >     devm_cxl_add_region(port->uport, cxld, id);
> > >
> > > I assume you mean devm_cxl_delete_region(), yes?
> > >
> > > > > else
> > > > >     rc = -ENXIO;
> > > > > cxl_device_unlock(&port->dev);
> > > > >
> > > > > ...then no matter what you know the region will be unregistered when
> > > > > the root port goes away.
> > > >
> > > > ...actually, the lock and ->dev.driver check here are not needed
> > > > because this attribute is only registered while the cxl_acpi driver is
> > > > bound. So, it is safe to assume this is protected as decoder remove
> > > > synchronizes against active sysfs users.
> > >
> > > I'm somewhat confused when you say devm action to remove this. The current auto
> > > region deletion happens when the ->release() is called. Are you suggesting when
> > > the root decoder is removed I delete the regions at that point?
> >
> > Hmm. I went back and looked and I had changed this functionality at some
> > point... So forget I said that, it isn't how it's working currently. But the
> > question remains, are you suggesting I delete in the root decoder
> > unregistration?
>
> I think it's easier if I write what I think you mean.... Here are the relevant
> parts:
>
> devm_cxl_region_delete() is removed entirely.
>
> static void unregister_region(void *_cxlr)
> {
>         struct cxl_region *cxlr = _cxlr;
>
>         device_unregister(&cxlr->dev);
> }
>
>
> static int devm_cxl_region_add(struct cxl_decoder *cxld, struct cxl_region *cxlr)
> {
>         struct cxl_port *port = to_cxl_port(cxld->dev.parent);
>         struct device *dev = &cxlr->dev;
>         int rc;
>
>         rc = dev_set_name(dev, "region%d.%d:%d", port->id, cxld->id, cxlr->id);
>         if (rc)
>                 return rc;
>
>         rc = device_add(dev);
>         if (rc)
>                 return rc;
>
>         return devm_add_action_or_reset(&cxld->dev, unregister_region, cxlr);

Decoders can't host devm actions. The host for this action would need
to be the parent port.

> }
>
> static ssize_t delete_region_store(struct device *dev,
>                                    struct device_attribute *attr,
>                                    const char *buf, size_t len)
> {
>         struct cxl_decoder *cxld = to_cxl_decoder(dev);
>         struct cxl_region *cxlr;
>
>         cxlr = cxl_find_region_by_name(cxld, buf);
>         if (IS_ERR(cxlr))
>                 return PTR_ERR(cxlr);
>
>         devm_release_action(dev, unregister_region, cxlr);

Yes, modulo the same comment as before that the decoder object is not
a suitable devm host. This also needs a solution for the race between
these 2 actions:

echo "ACPI0017:00" > /sys/bus/platform/drivers/cxl_acpi/unbind
echo $region > /sys/bus/cxl/devices/$decoder/delete_region

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 01/14] cxl/region: Add region creation ABI
  2022-02-02 19:00             ` Dan Williams
@ 2022-02-02 19:02               ` Ben Widawsky
  2022-02-02 19:15                 ` Dan Williams
  0 siblings, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-02-02 19:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On 22-02-02 11:00:04, Dan Williams wrote:
> On Wed, Feb 2, 2022 at 10:48 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > On 22-02-02 10:28:11, Ben Widawsky wrote:
> > > On 22-02-02 10:26:06, Ben Widawsky wrote:
> > > > On 22-01-28 10:59:26, Dan Williams wrote:
> > > > > On Fri, Jan 28, 2022 at 10:14 AM Dan Williams <dan.j.williams@intel.com> wrote:
> > > > > [..]
> > > > > > Here is that put_device() I was expecting, that kfree() earlier was a
> > > > > > double-free it seems.
> > > > > >
> > > > > > Also, I would have expected a devm action to remove this. Something like:
> > > > > >
> > > > > > struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> > > > > >
> > > > > > cxl_device_lock(&port->dev);
> > > > > > if (port->dev.driver)
> > > > > >     devm_cxl_add_region(port->uport, cxld, id);
> > > >
> > > > I assume you mean devm_cxl_delete_region(), yes?
> > > >
> > > > > > else
> > > > > >     rc = -ENXIO;
> > > > > > cxl_device_unlock(&port->dev);
> > > > > >
> > > > > > ...then no matter what you know the region will be unregistered when
> > > > > > the root port goes away.
> > > > >
> > > > > ...actually, the lock and ->dev.driver check here are not needed
> > > > > because this attribute is only registered while the cxl_acpi driver is
> > > > > bound. So, it is safe to assume this is protected as decoder remove
> > > > > synchronizes against active sysfs users.
> > > >
> > > > I'm somewhat confused when you say devm action to remove this. The current auto
> > > > region deletion happens when the ->release() is called. Are you suggesting when
> > > > the root decoder is removed I delete the regions at that point?
> > >
> > > Hmm. I went back and looked and I had changed this functionality at some
> > > point... So forget I said that, it isn't how it's working currently. But the
> > > question remains, are you suggesting I delete in the root decoder
> > > unregistration?
> >
> > I think it's easier if I write what I think you mean.... Here are the relevant
> > parts:
> >
> > devm_cxl_region_delete() is removed entirely.
> >
> > static void unregister_region(void *_cxlr)
> > {
> >         struct cxl_region *cxlr = _cxlr;
> >
> >         device_unregister(&cxlr->dev);
> > }
> >
> >
> > static int devm_cxl_region_add(struct cxl_decoder *cxld, struct cxl_region *cxlr)
> > {
> >         struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> >         struct device *dev = &cxlr->dev;
> >         int rc;
> >
> >         rc = dev_set_name(dev, "region%d.%d:%d", port->id, cxld->id, cxlr->id);
> >         if (rc)
> >                 return rc;
> >
> >         rc = device_add(dev);
> >         if (rc)
> >                 return rc;
> >
> >         return devm_add_action_or_reset(&cxld->dev, unregister_region, cxlr);
> 
> Decoders can't host devm actions. The host for this action would need
> to be the parent port.

Happy to change it since I can't imagine a decoder would go down without the
port also going down. Can you please explain why a decoder can't host a devm
action though. I'd like to understand that better.

> 
> > }
> >
> > static ssize_t delete_region_store(struct device *dev,
> >                                    struct device_attribute *attr,
> >                                    const char *buf, size_t len)
> > {
> >         struct cxl_decoder *cxld = to_cxl_decoder(dev);
> >         struct cxl_region *cxlr;
> >
> >         cxlr = cxl_find_region_by_name(cxld, buf);
> >         if (IS_ERR(cxlr))
> >                 return PTR_ERR(cxlr);
> >
> >         devm_release_action(dev, unregister_region, cxlr);
> 
> Yes, modulo the same comment as before that the decoder object is not
> a suitable devm host. This also needs a solution for the race between
> these 2 actions:
> 
> echo "ACPI0017:00" > /sys/bus/platform/drivers/cxl_acpi/unbind
> echo $region > /sys/bus/cxl/devices/$decoder/delete_region

Is there a better solution than taking the root port lock?


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 01/14] cxl/region: Add region creation ABI
  2022-02-02 19:02               ` Ben Widawsky
@ 2022-02-02 19:15                 ` Dan Williams
  0 siblings, 0 replies; 70+ messages in thread
From: Dan Williams @ 2022-02-02 19:15 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On Wed, Feb 2, 2022 at 11:03 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> On 22-02-02 11:00:04, Dan Williams wrote:
> > On Wed, Feb 2, 2022 at 10:48 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > >
> > > On 22-02-02 10:28:11, Ben Widawsky wrote:
> > > > On 22-02-02 10:26:06, Ben Widawsky wrote:
> > > > > On 22-01-28 10:59:26, Dan Williams wrote:
> > > > > > On Fri, Jan 28, 2022 at 10:14 AM Dan Williams <dan.j.williams@intel.com> wrote:
> > > > > > [..]
> > > > > > > Here is that put_device() I was expecting, that kfree() earlier was a
> > > > > > > double-free it seems.
> > > > > > >
> > > > > > > Also, I would have expected a devm action to remove this. Something like:
> > > > > > >
> > > > > > > struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> > > > > > >
> > > > > > > cxl_device_lock(&port->dev);
> > > > > > > if (port->dev.driver)
> > > > > > >     devm_cxl_add_region(port->uport, cxld, id);
> > > > >
> > > > > I assume you mean devm_cxl_delete_region(), yes?
> > > > >
> > > > > > > else
> > > > > > >     rc = -ENXIO;
> > > > > > > cxl_device_unlock(&port->dev);
> > > > > > >
> > > > > > > ...then no matter what you know the region will be unregistered when
> > > > > > > the root port goes away.
> > > > > >
> > > > > > ...actually, the lock and ->dev.driver check here are not needed
> > > > > > because this attribute is only registered while the cxl_acpi driver is
> > > > > > bound. So, it is safe to assume this is protected as decoder remove
> > > > > > synchronizes against active sysfs users.
> > > > >
> > > > > I'm somewhat confused when you say devm action to remove this. The current auto
> > > > > region deletion happens when the ->release() is called. Are you suggesting when
> > > > > the root decoder is removed I delete the regions at that point?
> > > >
> > > > Hmm. I went back and looked and I had changed this functionality at some
> > > > point... So forget I said that, it isn't how it's working currently. But the
> > > > question remains, are you suggesting I delete in the root decoder
> > > > unregistration?
> > >
> > > I think it's easier if I write what I think you mean.... Here are the relevant
> > > parts:
> > >
> > > devm_cxl_region_delete() is removed entirely.
> > >
> > > static void unregister_region(void *_cxlr)
> > > {
> > >         struct cxl_region *cxlr = _cxlr;
> > >
> > >         device_unregister(&cxlr->dev);
> > > }
> > >
> > >
> > > static int devm_cxl_region_add(struct cxl_decoder *cxld, struct cxl_region *cxlr)
> > > {
> > >         struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> > >         struct device *dev = &cxlr->dev;
> > >         int rc;
> > >
> > >         rc = dev_set_name(dev, "region%d.%d:%d", port->id, cxld->id, cxlr->id);
> > >         if (rc)
> > >                 return rc;
> > >
> > >         rc = device_add(dev);
> > >         if (rc)
> > >                 return rc;
> > >
> > >         return devm_add_action_or_reset(&cxld->dev, unregister_region, cxlr);
> >
> > Decoders can't host devm actions. The host for this action would need
> > to be the parent port.
>
> Happy to change it since I can't imagine a decoder would go down without the
> port also going down. Can you please explain why a decoder can't host a devm
> action though. I'd like to understand that better.

So, devm releases resources at 2 times one of which is "too late". The
natural / expected point at which they are released is by the driver
core at $driver->remove($dev) time. There is also a backstop release
point at $dev->${type,class,bus}->release() time. The latter one is
"too late" because it effectively leave the device registered
indefinitely which is broken because the parent sysfs directory
hierarchy for that device will have already been removed, so the late
release may crash depending on when the last put_device($dev) is
performed. The decoder never experiences a ->remove() event, but it's
parent port does (at least for non-root ports). For this case it will
likely need to reach further up and use the same devm host as the
decoder itself which is the ACPI0017 device.

>
> >
> > > }
> > >
> > > static ssize_t delete_region_store(struct device *dev,
> > >                                    struct device_attribute *attr,
> > >                                    const char *buf, size_t len)
> > > {
> > >         struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > >         struct cxl_region *cxlr;
> > >
> > >         cxlr = cxl_find_region_by_name(cxld, buf);
> > >         if (IS_ERR(cxlr))
> > >                 return PTR_ERR(cxlr);
> > >
> > >         devm_release_action(dev, unregister_region, cxlr);
> >
> > Yes, modulo the same comment as before that the decoder object is not
> > a suitable devm host. This also needs a solution for the race between
> > these 2 actions:
> >
> > echo "ACPI0017:00" > /sys/bus/platform/drivers/cxl_acpi/unbind
> > echo $region > /sys/bus/cxl/devices/$decoder/delete_region
>
> Is there a better solution than taking the root port lock?

Depends what lockdep says. The first choice lock to synchronize those
2 actions would be the ACPI0017 device_lock, but lockdep might point
out a problem with that vs sysfs teardown synchronization.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 02/14] cxl/region: Introduce concept of region configuration
  2022-02-01 14:59     ` Ben Widawsky
@ 2022-02-03  5:06       ` Dan Williams
  0 siblings, 0 replies; 70+ messages in thread
From: Dan Williams @ 2022-02-03  5:06 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Bjorn Helgaas,
	Linux NVDIMM, Linux PCI

On Tue, Feb 1, 2022 at 6:59 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> I will cut to the part that effects ABI so tool development can continue. I'll
> get back to the other bits later.
>
> On 22-01-28 16:25:34, Dan Williams wrote:
>
> [snip]
>
> >
> > > +
> > > +       return ret;
> > > +}
> > > +
> > > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > > +                         size_t len)
> > > +{
> > > +       struct device *memdev_dev;
> > > +       struct cxl_memdev *cxlmd;
> > > +
> > > +       device_lock(&cxlr->dev);
> > > +
> > > +       if (len == 1 || cxlr->config.targets[n])
> > > +               remove_target(cxlr, n);
> > > +
> > > +       /* Remove target special case */
> > > +       if (len == 1) {
> > > +               device_unlock(&cxlr->dev);
> > > +               return len;
> > > +       }
> > > +
> > > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> >
> > I think this wants to be an endpoint decoder, not a memdev. Because
> > it's the decoder that joins a memdev to a region, or at least a
> > decoder should be picked when the memdev is assigned so that the DPA
> > mapping can be registered. If all the decoders are allocated then fail
> > here.
> >
>
> You've put two points in here:
>
> 1. Handle decoder allocation at sysfs boundary. I'll respond to this when I come
> back around to the rest of the review comments.
>
> 2. Take a decoder for target instead of a memdev. I don't agree with this
> direction as it's asymmetric to how LSA processing works. The goal was to model
> the LSA for configuration. The kernel will have to be in the business of
> reserving and enumerating decoders out of memdevs for both LSA (where we have a
> list of memdevs) and volatile (where we use the memdevs in the system to
> enumerate populated decoders). I don't see much value in making userspace do the
> same.
>
> I'd like to ask you reconsider if you still think it's preferable to use
> decoders as part of the ABI and if you still feel that way I can go change it
> since it has minimal impact overall.

It's more than a preference. I think there are fundamental recovery
scenarios where the kernel needs userspace help to resolve decoder /
DPA assignment and conflicts.

PMEM interleaves behave similarly to RAID where you have multiple
devices in a set that can each fail independently, and because they
are hotplug capable the chances of migrating devices from one system
to another are higher than PMEM devices today where hotplug is mostly
non-existent. If you lurk on linux-raid long enough you will
inevitably encounter someone coming to the list saying, "help a drive
in my RAID array was dying. I managed to save it off, help me
reassemble my array". The story often gets worse when they say "I
managed to corrupt my metadata block, so I don't know what order the
drives are supposed to be in". There are several breadcrumbs and trial
and error steps that one takes to try to get the data back online:
https://raid.wiki.kernel.org/index.php/RAID_Recovery.

Now imagine that scenario with CXL where there are additional
complicating factors like label-storage-area can fail independently of
the data area, there are region labels with HPA fields that mandate
assembly at a given address, decoders must be programmed in increasing
DPA order, volatile memory and locked/fixed decoders complicate
decoder selection. Considering all the ways that CXL region assembly
can fail it seems inevitable that someone will get into a situation
where they need to pick the decoder and the DPA to map while also not
clobbering the LSA. I.e. I see a "CXL Recovery" wiki in our future.

The requirements that I think fall out from that are:

1/ Region assembly needs to be possible without updating labels. So,
in contrast to the way that nvdimm does it, instead of updating the
region label on every attribute write there would be a commit step
that realizes the current region configuration in the labels, or is
ommitted in recovery scenarios where you are not ready to clobber the
labels.

2/ Userspace needs the flexibility to be able to select/override which
DPA gets mapped in which decoder (kernel would handle DPA skip).

All the ways I can think of to augment the ABI to allow for this style
of recovery devolve to just assigning decoders to regions in the first
instance.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 02/14] cxl/region: Introduce concept of region configuration
  2022-02-01 23:11     ` Ben Widawsky
@ 2022-02-03 17:48       ` Dan Williams
  2022-02-03 22:23         ` Ben Widawsky
  0 siblings, 1 reply; 70+ messages in thread
From: Dan Williams @ 2022-02-03 17:48 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Bjorn Helgaas,
	Linux NVDIMM, Linux PCI

On Tue, Feb 1, 2022 at 3:11 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> On 22-01-28 16:25:34, Dan Williams wrote:
> > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > >
> > > The region creation APIs create a vacant region. Configuring the region
> > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > will be provided to allow userspace to configure the region.  Finally
> > > once all configuration is complete, userspace may activate the region.
> > >
> > > Introduced here are the most basic attributes needed to configure a
> > > region. Details of these attribute are described in the ABI
> >
> > s/attribute/attributes/
> >
> > > Documentation. Sanity checking of configuration parameters are done at
> > > region binding time. This consolidates all such logic in one place,
> > > rather than being strewn across multiple places.
> >
> > I think that's too late for some of the validation. The complex
> > validation that the region driver does throughout the topology is
> > different from the basic input validation that can  be done at the
> > sysfs write time. For example ,this patch allows negative
> > interleave_granularity values to specified, just return -EINVAL. I
> > agree that sysfs should not validate everything, I disagree with
> > pushing all validation to cxl_region_probe().
> >
>
> Two points:c
> 1. How do we distinguish "basic input validation". It'd be good if we could
>    define "basic input validation". For instance, when I first wrote these
>    patches, x3 would have been EINVAL, but today it's allowed. Can you help
>    enumerate what you consider basic.

I internalized this kernel design principle from Dave Miller many
years ago paraphrasing "push decision making out to leaf code as much
as possible", and centralizing all validation in cxl_region_probe()
violates. The software that makes the mistake does not know it made a
mistake until much later and "probe failed" is less descriptive than
"EINVAL writing interleave_ways" . I wish I could find the thread
because it also talked about his iteration process.

Basic input validation to me is things like:

- Don't allow writes while the region is active
- Check that values are in bound. So yes, the interleave-ways value of
3 would fail until the kernel supports it, and granularity values >
16K would also fail.
- Check that memdevs are actually downstream targets of the given decoder
- Check that the region uuid is unique
- Check that decoder has capacity
- Check that the memdev has capacity
- Check that the decoder to map the DPA is actually available given
decoders must be programmed in increasing DPA order

Essentially any validation short of walking the topology to program
upstream decoders since those errors are only resolved by racing
region probes that try to grab upstream decoder resources.

>
> 2. I like the idea that all validation takes place in one place. Obviously you
>    do not. So, see #1 and I will rework.

The validation helpers need to be written once, where they are called
does not much matter, does it?

>
> > >
> > > A example is provided below:
> > >
> > > /sys/bus/cxl/devices/region0.0:0
> > > ├── interleave_granularity
> > > ├── interleave_ways
> > > ├── offset
> > > ├── size
> > > ├── subsystem -> ../../../../../../bus/cxl
> > > ├── target0
> > > ├── uevent
> > > └── uuid
> >
> > As mentioned off-list, it looks like devtype and modalias are missing.
> >
>
> Thanks.
>
> > >
> > > Reported-by: kernel test robot <lkp@intel.com> (v2)
> > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > > ---
> > >  Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
> > >  drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
> > >  2 files changed, 340 insertions(+)
> > >
> > > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > > index dcc728458936..50ba5018014d 100644
> > > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > > @@ -187,3 +187,43 @@ Description:
> > >                 region driver before being deleted. The attributes expects a
> > >                 region in the form "regionX.Y:Z". The region's name, allocated
> > >                 by reading create_region, will also be released.
> > > +
> > > +What:          /sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset
> >
> > This is just another 'resource' attribute for the physical base
> > address of the region, right? 'offset' sounds like something that
> > would be relative instead of absolute.
> >
>
> It is offset. I can change it to physical base if you'd like but I thought that
> information wasn't critically important for userspace to have. Does userspace
> care about the physical base?

Yes, similar use case as /proc/iomem. Error handling comes to mind as
you can see physical address data in the messages like machine check
notification and go immediately match that to a CXL region. PCI,
NVDIMM, and DAX all emit a "resource" attribute to identify the
physical address base.

>
> > > +Date:          August, 2021
> >
> > Same date update comment here.
> >
> > > +KernelVersion: v5.18
> > > +Contact:       linux-cxl@vger.kernel.org
> > > +Description:
> > > +               (RO) A region resides within an address space that is claimed by
> > > +               a decoder.
> >
> > "A region is a contiguous partition of a CXL Root decoder address space."
> >
> > >                  Region space allocation is handled by the driver, but
> >
> > "Region capacity is allocated by writing to the size attribute, the
> > resulting physical address base determined by the driver is reflected
> > here."
> >
> > > +               the offset may be read by userspace tooling in order to
> > > +               determine fragmentation, and available size for new regions.
> >
> > I would also expect, before / along with these new region attributes,
> > there would be 'available' and 'max_extent_available' at the decoder
> > level to indicate how much free space the decoder has and how big the
> > next region creation can be. User tooling can walk  the decoder and
> > the regions together to determine fragmentation if necessary, but for
> > the most part the tool likely only cares about "how big can the next
> > region be?" and "how full is this decoder?".
>
> Sounds good.
>
> >
> >
> > > +
> > > +What:
> > > +/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
> > > +Date:          August, 2021
> > > +KernelVersion: v5.18
> > > +Contact:       linux-cxl@vger.kernel.org
> > > +Description:
> > > +               (RW) Configuring regions requires a minimal set of parameters in
> > > +               order for the subsequent bind operation to succeed. The
> > > +               following parameters are defined:
> >
> > Let's split up the descriptions into individual sections. That can
> > also document the order that attributes must be written. For example,
> > doesn't size need to be set before targets are added so that targets
> > can be validated whether they have sufficient capacity?
> >
>
> Okay. Order doesn't matter if you do validation all in one place as it is, but
> sounds like we're changing that. So I can split it when we figure out what
> validation is actually occurring at the sysfs attr boundary.

Forcing a write order simplifies the validation matrix. Consider the
reduction in test surface if the kernel is more strict about what it
allows into the kernel early. Let's make syzbot's job harder.

>
> > > +
> > > +               ==      ========================================================
> > > +               interleave_granularity Mandatory. Number of consecutive bytes
> > > +                       each device in the interleave set will claim. The
> > > +                       possible interleave granularity values are determined by
> > > +                       the CXL spec and the participating devices.
> > > +               interleave_ways Mandatory. Number of devices participating in the
> > > +                       region. Each device will provide 1/interleave of storage
> > > +                       for the region.
> > > +               size    Manadatory. Phsyical address space the region will
> > > +                       consume.
> >
> > s/Phsyical/Physical/
> >
> > > +               target  Mandatory. Memory devices are the backing storage for a
> > > +                       region. There will be N targets based on the number of
> > > +                       interleave ways that the top level decoder is configured
> > > +                       for.
> >
> > That doesn't sound right, IW at the root != IW at the endpoint level
> > and the region needs to record all the endpoint level targets.
>
>
> Yes This is wrong. I thought I had fixed it, but I guess not.
>
> >
> > > Each target must be set with a memdev device ie.
> > > +                       'mem1'. This attribute only becomes available after
> > > +                       setting the 'interleave' attribute.
> > > +               uuid    Optional. A unique identifier for the region. If none is
> > > +                       selected, the kernel will create one.
> >
> > Let's drop the Mandatory / Optional distinction, or I am otherwise not
> > understanding what this is trying to document. For example 'uuid' is
> > "mandatory" for PMEM regions and "omitted" for volatile regions not
> > optional.
> >
>
> Well the kernel fills it in if userspace leaves it out. I'm guessing you're
> going to ask me to change that, so I will remove Mandatory/Optional.

Yeah, why carry unnecessary code in the kernel? Userspace is well
equipped to meet the requirement that it writes the UUID.

>
> > > +               ==      ========================================================
> > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > > index 1a448543db0d..3b48e0469fc7 100644
> > > --- a/drivers/cxl/core/region.c
> > > +++ b/drivers/cxl/core/region.c
> > > @@ -3,9 +3,12 @@
> > >  #include <linux/io-64-nonatomic-lo-hi.h>
> > >  #include <linux/device.h>
> > >  #include <linux/module.h>
> > > +#include <linux/sizes.h>
> > >  #include <linux/slab.h>
> > > +#include <linux/uuid.h>
> > >  #include <linux/idr.h>
> > >  #include <region.h>
> > > +#include <cxlmem.h>
> > >  #include <cxl.h>
> > >  #include "core.h"
> > >
> > > @@ -18,11 +21,305 @@
> > >   * (programming the hardware) is handled by a separate region driver.
> > >   */
> > >
> > > +struct cxl_region *to_cxl_region(struct device *dev);
> > > +static const struct attribute_group region_interleave_group;
> > > +
> > > +static bool is_region_active(struct cxl_region *cxlr)
> > > +{
> > > +       /* TODO: Regions can't be activated yet. */
> > > +       return false;
> >
> > This function seems redundant with just checking "cxlr->dev.driver !=
> > NULL"? The benefit of that is there is no need to carry a TODO in the
> > series.
> >
>
> Yeah. I think checking driver bind status is sufficient to replace this.
>
> > > +}
> > > +
> > > +static void remove_target(struct cxl_region *cxlr, int target)
> > > +{
> > > +       struct cxl_memdev *cxlmd;
> > > +
> > > +       cxlmd = cxlr->config.targets[target];
> > > +       if (cxlmd)
> > > +               put_device(&cxlmd->dev);
> >
> > A memdev can be a member of multiple regions at once, shouldn't this
> > be an endpoint decoder or similar, not the entire memdev?
>
> Is this referring to the later question about whether targets are decoders or
> memdevs?

Yes.

> The thought was each region would hold a reference to all memdevs in
> the interleave set.

It's not clear that a region needs to hold a reference if a memdev
self removes itself from the region before it is unregistered. I am
open to being convinced this is needed but it would need come with an
explanation of what can a region do with a memdev reference after that
memdev has experienced a ->remove() event.

> > Also, if memdevs autoremove themselves from regions at memdev
> > ->remove() time then I don't think the region needs to hold references
> > on memdevs.
> >
>
> I'll defer to you on that. I'll remove holding the reference, but I definitely
> haven't solved the interaction when a memdev goes away. I had been thinking the
> inverse originally, a memdev can't go away until the region is gone. According
> to the spec, these devices can't be hot removed, only managed remove, so if
> things blew up, not our problem. However, if we have decent infrastructure to
> support better than that, we should.

It turns out there's no such thing as "managed remove" as far as the
kernel is concerned, it's all up to userspace. ->remove() can happen
at any time and ->remove() can not fail. Per Bjorn Linux does not
support PCIe hotplug latching so there is no kernel mechanism to block
hotplug. Unless and until the PCIe native hotplug code picks up a
mechanism to deny unplug events CXL needs to be prepared for any
memdev to experience ->remove() regardless of region status.

For example, I expect the "managed remove" flow is something like this:

# cxl disable-memdev mem3
cxl memdev: action_disable: mem3 is part of an active region
cxl memdev: cmd_disable_memdev: disabled 0 mem

...where the tool tries to enforce safety, but if someone really wants
that device gone:

# cxl disable-memdev mem3 --force
cxl memdev: cmd_disable_memdev: disabled 1 mem

...and the CXL sub-system will need to trigger memory-failure across
all the regions that were impacted by that violent event.

>
> > > +       cxlr->config.targets[target] = NULL;
> > > +}
> > > +
> > > +static ssize_t interleave_ways_show(struct device *dev,
> > > +                                   struct device_attribute *attr, char *buf)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +
> > > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
> > > +}
> > > +
> > > +static ssize_t interleave_ways_store(struct device *dev,
> > > +                                    struct device_attribute *attr,
> > > +                                    const char *buf, size_t len)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +       int ret, prev_iw;
> > > +       int val;
> >
> > I would expect:
> >
> > if (dev->driver)
> >    return -EBUSY;
> >
> > ...to shutdown configuration writes once the region is active. Might
> > also need a region-wide seqlock like target_list_show. So that region
> > probe drains  all active sysfs writers before assuming the
> > configuration is stable.
> >
>
> Okay.
>
> > > +
> > > +       prev_iw = cxlr->config.interleave_ways;
> > > +       ret = kstrtoint(buf, 0, &val);
> > > +       if (ret)
> > > +               return ret;
> > > +       if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
> > > +               return -EINVAL;
> > > +
> > > +       cxlr->config.interleave_ways = val;
> > > +
> > > +       ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
> > > +       if (ret < 0)
> > > +               goto err;
> > > +
> > > +       sysfs_notify(&dev->kobj, NULL, "target_interleave");
> >
> > Why?
> >
>
> I copied it from another driver. I didn't check if it was actually needed or
> not.

It's not needed since the agent that wrote interleave ways is also
expected to be the agent that is configuring the rest of the
parameters.

>
> > > +
> > > +       while (prev_iw > cxlr->config.interleave_ways)
> > > +               remove_target(cxlr, --prev_iw);
> >
> > To make the kernel side simpler this attribute could just require that
> > setting interleave ways is a one way street, if you want to change it
> > you need to delete the region and start over.
> >
>
> I'm fine with that.
>
> > > +
> > > +       return len;
> > > +
> > > +err:
> > > +       cxlr->config.interleave_ways = prev_iw;
> > > +       return ret;
> > > +}
> > > +static DEVICE_ATTR_RW(interleave_ways);
> > > +
> > > +static ssize_t interleave_granularity_show(struct device *dev,
> > > +                                          struct device_attribute *attr,
> > > +                                          char *buf)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +
> > > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_granularity);
> > > +}
> > > +
> > > +static ssize_t interleave_granularity_store(struct device *dev,
> > > +                                           struct device_attribute *attr,
> > > +                                           const char *buf, size_t len)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +       int val, ret;
> > > +
> > > +       ret = kstrtoint(buf, 0, &val);
> > > +       if (ret)
> > > +               return ret;
> > > +       cxlr->config.interleave_granularity = val;
> >
> > This wants minimum input validation and synchronization against an
> > active region.
> >
> > > +
> > > +       return len;
> > > +}
> > > +static DEVICE_ATTR_RW(interleave_granularity);
> > > +
> > > +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> > > +                          char *buf)
> > > +{
> > > +       struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +       resource_size_t offset;
> > > +
> > > +       if (!cxlr->res)
> > > +               return sysfs_emit(buf, "\n");
> >
> > Should be an error I would think. I.e. require size to be set before
> > s/offset/resource/ can be read.
> >
> > > +
> > > +       offset = cxld->platform_res.start - cxlr->res->start;
> >
> > Why make usersapce do the offset math?
> >
> > > +
> > > +       return sysfs_emit(buf, "%pa\n", &offset);
> > > +}
> > > +static DEVICE_ATTR_RO(offset);
> >
> > This can be DEVICE_ATTR_ADMIN_RO() to hide physical address layout
> > information from non-root.
> >
> > > +
> > > +static ssize_t size_show(struct device *dev, struct device_attribute *attr,
> > > +                        char *buf)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +
> > > +       return sysfs_emit(buf, "%llu\n", cxlr->config.size);
> >
> > Perhaps no need to store size separately if this becomes:
> >
> > sysfs_emit(buf, "%llu\n", (unsigned long long) resource_size(cxlr->res));
> >
> >
> > ...?
> >
> > > +}
> > > +
> > > +static ssize_t size_store(struct device *dev, struct device_attribute *attr,
> > > +                         const char *buf, size_t len)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +       unsigned long long val;
> > > +       ssize_t rc;
> > > +
> > > +       rc = kstrtoull(buf, 0, &val);
> > > +       if (rc)
> > > +               return rc;
> > > +
> > > +       device_lock(&cxlr->dev);
> > > +       if (is_region_active(cxlr))
> > > +               rc = -EBUSY;
> > > +       else
> > > +               cxlr->config.size = val;
> > > +       device_unlock(&cxlr->dev);
> >
> > I think lockdep will complain about device_lock() usage in an
> > attribute. Try changing this to cxl_device_lock() with
> > CONFIG_PROVE_CXL_LOCKING=y.
> >
> > > +
> > > +       return rc ? rc : len;
> > > +}
> > > +static DEVICE_ATTR_RW(size);
> > > +
> > > +static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> > > +                        char *buf)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +
> > > +       return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
> > > +}
> > > +
> > > +static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> > > +                         const char *buf, size_t len)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +       ssize_t rc;
> > > +
> > > +       if (len != UUID_STRING_LEN + 1)
> > > +               return -EINVAL;
> > > +
> > > +       device_lock(&cxlr->dev);
> > > +       if (is_region_active(cxlr))
> > > +               rc = -EBUSY;
> > > +       else
> > > +               rc = uuid_parse(buf, &cxlr->config.uuid);
> > > +       device_unlock(&cxlr->dev);
> > > +
> > > +       return rc ? rc : len;
> > > +}
> > > +static DEVICE_ATTR_RW(uuid);
> > > +
> > > +static struct attribute *region_attrs[] = {
> > > +       &dev_attr_interleave_ways.attr,
> > > +       &dev_attr_interleave_granularity.attr,
> > > +       &dev_attr_offset.attr,
> > > +       &dev_attr_size.attr,
> > > +       &dev_attr_uuid.attr,
> > > +       NULL,
> > > +};
> > > +
> > > +static const struct attribute_group region_group = {
> > > +       .attrs = region_attrs,
> > > +};
> > > +
> > > +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> > > +{
> > > +       int ret;
> > > +
> > > +       device_lock(&cxlr->dev);
> > > +       if (!cxlr->config.targets[n])
> > > +               ret = sysfs_emit(buf, "\n");
> > > +       else
> > > +               ret = sysfs_emit(buf, "%s\n",
> > > +                                dev_name(&cxlr->config.targets[n]->dev));
> > > +       device_unlock(&cxlr->dev);
> >
> > The component contribution of a memdev to a region is a DPA-span, not
> > the whole memdev. I would expect something like dax_mapping_attributes
> > or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
> > information about the component contribution of a memdev to a region.
> >
>
> I had been thinking the kernel would manage the DPS spans of a memdev (and
> create the mappings). I can make this look like dax_mapping_attributes.

I think we get this for free by just linking the region to each
endpoint decoder in use and then userspace can walk that to get DPA
info from the decoder (modulo extending endpoint decoders with DPA
extent info).

>
> > > +
> > > +       return ret;
> > > +}
> > > +
> > > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > > +                         size_t len)
> > > +{
> > > +       struct device *memdev_dev;
> > > +       struct cxl_memdev *cxlmd;
> > > +
> > > +       device_lock(&cxlr->dev);
> > > +
> > > +       if (len == 1 || cxlr->config.targets[n])
> > > +               remove_target(cxlr, n);
> > > +
> > > +       /* Remove target special case */
> > > +       if (len == 1) {
> > > +               device_unlock(&cxlr->dev);
> > > +               return len;
> > > +       }
> > > +
> > > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> >
> > I think this wants to be an endpoint decoder, not a memdev. Because
> > it's the decoder that joins a memdev to a region, or at least a
> > decoder should be picked when the memdev is assigned so that the DPA
> > mapping can be registered. If all the decoders are allocated then fail
> > here.
> >
>
> My preference is obviously how it is, using memdevs and having the decoders
> allocated at bind time. I don't have an objective argument why one is better
> than the other so I will change it. I will make the interface take a set of
> decoders.

Let me know if the arguments I have made for why more granularity is
necessary have changed your preference. I'm open to hearing alternate
ideas.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 02/14] cxl/region: Introduce concept of region configuration
  2022-02-03 17:48       ` Dan Williams
@ 2022-02-03 22:23         ` Ben Widawsky
  2022-02-03 23:27           ` Dan Williams
  0 siblings, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-02-03 22:23 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Bjorn Helgaas,
	Linux NVDIMM, Linux PCI

On 22-02-03 09:48:49, Dan Williams wrote:
> On Tue, Feb 1, 2022 at 3:11 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > On 22-01-28 16:25:34, Dan Williams wrote:
> > > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > >
> > > > The region creation APIs create a vacant region. Configuring the region
> > > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > > will be provided to allow userspace to configure the region.  Finally
> > > > once all configuration is complete, userspace may activate the region.
> > > >
> > > > Introduced here are the most basic attributes needed to configure a
> > > > region. Details of these attribute are described in the ABI
> > >
> > > s/attribute/attributes/
> > >
> > > > Documentation. Sanity checking of configuration parameters are done at
> > > > region binding time. This consolidates all such logic in one place,
> > > > rather than being strewn across multiple places.
> > >
> > > I think that's too late for some of the validation. The complex
> > > validation that the region driver does throughout the topology is
> > > different from the basic input validation that can  be done at the
> > > sysfs write time. For example ,this patch allows negative
> > > interleave_granularity values to specified, just return -EINVAL. I
> > > agree that sysfs should not validate everything, I disagree with
> > > pushing all validation to cxl_region_probe().
> > >
> >
> > Two points:c
> > 1. How do we distinguish "basic input validation". It'd be good if we could
> >    define "basic input validation". For instance, when I first wrote these
> >    patches, x3 would have been EINVAL, but today it's allowed. Can you help
> >    enumerate what you consider basic.
> 
> I internalized this kernel design principle from Dave Miller many
> years ago paraphrasing "push decision making out to leaf code as much
> as possible", and centralizing all validation in cxl_region_probe()
> violates. The software that makes the mistake does not know it made a
> mistake until much later and "probe failed" is less descriptive than
> "EINVAL writing interleave_ways" . I wish I could find the thread
> because it also talked about his iteration process.

It would definitely be interesting to understand why pushing decision making
into the leaf code is a violation. Was it primary around the descriptiveness of
the error?

> 
> Basic input validation to me is things like:
> 
> - Don't allow writes while the region is active
> - Check that values are in bound. So yes, the interleave-ways value of
> 3 would fail until the kernel supports it, and granularity values >
> 16K would also fail.
> - Check that memdevs are actually downstream targets of the given decoder
> - Check that the region uuid is unique

These are obviously easy and informative at attr store time (in fact, active was
meant to be checked already for many cases). So if we agree to codify this at
probe via WARN, and add it to kdoc, I've no problem with it.

> - Check that decoder has capacity
> - Check that the memdev has capacity
> - Check that the decoder to map the DPA is actually available given
> decoders must be programmed in increasing DPA order
> 
> Essentially any validation short of walking the topology to program
> upstream decoders since those errors are only resolved by racing
> region probes that try to grab upstream decoder resources.
> 

I intentionally avoided doing a lot of these until probe because it seemed like
not a great policy to deny regions from being populated if another region
utilizing those resources hasn't been bound yes. For a simple example, if x1
region A is created and utilizes all of memdev ɑ's capacity you block out any
other region setup using memdev ɑ, even if region A wasn't bound. There's a
similar problem with specifying decoders as part of configuration.

I'll infer from your comment that you are fine with this tradeoff, or you have
some other way to manage this in mind.

I really see any validation which requires removal of resources from the system
to be more fit for bind time. I suppose if the proposal is to move the region
attributes to be DEVICE_ATTR_ADMIN, that pushes the problem onto the system
administrator. It just seemed like most of the interface could be non-root.

> >
> > 2. I like the idea that all validation takes place in one place. Obviously you
> >    do not. So, see #1 and I will rework.
> 
> The validation helpers need to be written once, where they are called
> does not much matter, does it?
> 

Somewhat addressed above too...

I think that depends on whether the full list is established as mentioned. If in
the region driver we can put several assertions that a variety of things don't
need [re]validation, then it doesn't matter. Without this, when trying to debug
or add code you need to figure out which place is doing the validation and which
place should do it.

At the very least I think the plan should be established in a kdoc.

> >
> > > >
> > > > A example is provided below:
> > > >
> > > > /sys/bus/cxl/devices/region0.0:0
> > > > ├── interleave_granularity
> > > > ├── interleave_ways
> > > > ├── offset
> > > > ├── size
> > > > ├── subsystem -> ../../../../../../bus/cxl
> > > > ├── target0
> > > > ├── uevent
> > > > └── uuid
> > >
> > > As mentioned off-list, it looks like devtype and modalias are missing.
> > >
> >
> > Thanks.
> >
> > > >
> > > > Reported-by: kernel test robot <lkp@intel.com> (v2)
> > > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > > > ---
> > > >  Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
> > > >  drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
> > > >  2 files changed, 340 insertions(+)
> > > >
> > > > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > > > index dcc728458936..50ba5018014d 100644
> > > > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > > > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > > > @@ -187,3 +187,43 @@ Description:
> > > >                 region driver before being deleted. The attributes expects a
> > > >                 region in the form "regionX.Y:Z". The region's name, allocated
> > > >                 by reading create_region, will also be released.
> > > > +
> > > > +What:          /sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset
> > >
> > > This is just another 'resource' attribute for the physical base
> > > address of the region, right? 'offset' sounds like something that
> > > would be relative instead of absolute.
> > >
> >
> > It is offset. I can change it to physical base if you'd like but I thought that
> > information wasn't critically important for userspace to have. Does userspace
> > care about the physical base?
> 
> Yes, similar use case as /proc/iomem. Error handling comes to mind as
> you can see physical address data in the messages like machine check
> notification and go immediately match that to a CXL region. PCI,
> NVDIMM, and DAX all emit a "resource" attribute to identify the
> physical address base.
> 
> >
> > > > +Date:          August, 2021
> > >
> > > Same date update comment here.
> > >
> > > > +KernelVersion: v5.18
> > > > +Contact:       linux-cxl@vger.kernel.org
> > > > +Description:
> > > > +               (RO) A region resides within an address space that is claimed by
> > > > +               a decoder.
> > >
> > > "A region is a contiguous partition of a CXL Root decoder address space."
> > >
> > > >                  Region space allocation is handled by the driver, but
> > >
> > > "Region capacity is allocated by writing to the size attribute, the
> > > resulting physical address base determined by the driver is reflected
> > > here."
> > >
> > > > +               the offset may be read by userspace tooling in order to
> > > > +               determine fragmentation, and available size for new regions.
> > >
> > > I would also expect, before / along with these new region attributes,
> > > there would be 'available' and 'max_extent_available' at the decoder
> > > level to indicate how much free space the decoder has and how big the
> > > next region creation can be. User tooling can walk  the decoder and
> > > the regions together to determine fragmentation if necessary, but for
> > > the most part the tool likely only cares about "how big can the next
> > > region be?" and "how full is this decoder?".
> >
> > Sounds good.
> >
> > >
> > >
> > > > +
> > > > +What:
> > > > +/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
> > > > +Date:          August, 2021
> > > > +KernelVersion: v5.18
> > > > +Contact:       linux-cxl@vger.kernel.org
> > > > +Description:
> > > > +               (RW) Configuring regions requires a minimal set of parameters in
> > > > +               order for the subsequent bind operation to succeed. The
> > > > +               following parameters are defined:
> > >
> > > Let's split up the descriptions into individual sections. That can
> > > also document the order that attributes must be written. For example,
> > > doesn't size need to be set before targets are added so that targets
> > > can be validated whether they have sufficient capacity?
> > >
> >
> > Okay. Order doesn't matter if you do validation all in one place as it is, but
> > sounds like we're changing that. So I can split it when we figure out what
> > validation is actually occurring at the sysfs attr boundary.
> 
> Forcing a write order simplifies the validation matrix. Consider the
> reduction in test surface if the kernel is more strict about what it
> allows into the kernel early. Let's make syzbot's job harder.
> 
> >
> > > > +
> > > > +               ==      ========================================================
> > > > +               interleave_granularity Mandatory. Number of consecutive bytes
> > > > +                       each device in the interleave set will claim. The
> > > > +                       possible interleave granularity values are determined by
> > > > +                       the CXL spec and the participating devices.
> > > > +               interleave_ways Mandatory. Number of devices participating in the
> > > > +                       region. Each device will provide 1/interleave of storage
> > > > +                       for the region.
> > > > +               size    Manadatory. Phsyical address space the region will
> > > > +                       consume.
> > >
> > > s/Phsyical/Physical/
> > >
> > > > +               target  Mandatory. Memory devices are the backing storage for a
> > > > +                       region. There will be N targets based on the number of
> > > > +                       interleave ways that the top level decoder is configured
> > > > +                       for.
> > >
> > > That doesn't sound right, IW at the root != IW at the endpoint level
> > > and the region needs to record all the endpoint level targets.
> >
> >
> > Yes This is wrong. I thought I had fixed it, but I guess not.
> >
> > >
> > > > Each target must be set with a memdev device ie.
> > > > +                       'mem1'. This attribute only becomes available after
> > > > +                       setting the 'interleave' attribute.
> > > > +               uuid    Optional. A unique identifier for the region. If none is
> > > > +                       selected, the kernel will create one.
> > >
> > > Let's drop the Mandatory / Optional distinction, or I am otherwise not
> > > understanding what this is trying to document. For example 'uuid' is
> > > "mandatory" for PMEM regions and "omitted" for volatile regions not
> > > optional.
> > >
> >
> > Well the kernel fills it in if userspace leaves it out. I'm guessing you're
> > going to ask me to change that, so I will remove Mandatory/Optional.
> 
> Yeah, why carry unnecessary code in the kernel? Userspace is well
> equipped to meet the requirement that it writes the UUID.
> 
> >
> > > > +               ==      ========================================================
> > > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > > > index 1a448543db0d..3b48e0469fc7 100644
> > > > --- a/drivers/cxl/core/region.c
> > > > +++ b/drivers/cxl/core/region.c
> > > > @@ -3,9 +3,12 @@
> > > >  #include <linux/io-64-nonatomic-lo-hi.h>
> > > >  #include <linux/device.h>
> > > >  #include <linux/module.h>
> > > > +#include <linux/sizes.h>
> > > >  #include <linux/slab.h>
> > > > +#include <linux/uuid.h>
> > > >  #include <linux/idr.h>
> > > >  #include <region.h>
> > > > +#include <cxlmem.h>
> > > >  #include <cxl.h>
> > > >  #include "core.h"
> > > >
> > > > @@ -18,11 +21,305 @@
> > > >   * (programming the hardware) is handled by a separate region driver.
> > > >   */
> > > >
> > > > +struct cxl_region *to_cxl_region(struct device *dev);
> > > > +static const struct attribute_group region_interleave_group;
> > > > +
> > > > +static bool is_region_active(struct cxl_region *cxlr)
> > > > +{
> > > > +       /* TODO: Regions can't be activated yet. */
> > > > +       return false;
> > >
> > > This function seems redundant with just checking "cxlr->dev.driver !=
> > > NULL"? The benefit of that is there is no need to carry a TODO in the
> > > series.
> > >
> >
> > Yeah. I think checking driver bind status is sufficient to replace this.
> >
> > > > +}
> > > > +
> > > > +static void remove_target(struct cxl_region *cxlr, int target)
> > > > +{
> > > > +       struct cxl_memdev *cxlmd;
> > > > +
> > > > +       cxlmd = cxlr->config.targets[target];
> > > > +       if (cxlmd)
> > > > +               put_device(&cxlmd->dev);
> > >
> > > A memdev can be a member of multiple regions at once, shouldn't this
> > > be an endpoint decoder or similar, not the entire memdev?
> >
> > Is this referring to the later question about whether targets are decoders or
> > memdevs?
> 
> Yes.
> 
> > The thought was each region would hold a reference to all memdevs in
> > the interleave set.
> 
> It's not clear that a region needs to hold a reference if a memdev
> self removes itself from the region before it is unregistered. I am
> open to being convinced this is needed but it would need come with an
> explanation of what can a region do with a memdev reference after that
> memdev has experienced a ->remove() event.
> 
> > > Also, if memdevs autoremove themselves from regions at memdev
> > > ->remove() time then I don't think the region needs to hold references
> > > on memdevs.
> > >
> >
> > I'll defer to you on that. I'll remove holding the reference, but I definitely
> > haven't solved the interaction when a memdev goes away. I had been thinking the
> > inverse originally, a memdev can't go away until the region is gone. According
> > to the spec, these devices can't be hot removed, only managed remove, so if
> > things blew up, not our problem. However, if we have decent infrastructure to
> > support better than that, we should.
> 
> It turns out there's no such thing as "managed remove" as far as the
> kernel is concerned, it's all up to userspace. ->remove() can happen
> at any time and ->remove() can not fail. Per Bjorn Linux does not
> support PCIe hotplug latching so there is no kernel mechanism to block
> hotplug. Unless and until the PCIe native hotplug code picks up a
> mechanism to deny unplug events CXL needs to be prepared for any
> memdev to experience ->remove() regardless of region status.
> 
> For example, I expect the "managed remove" flow is something like this:
> 
> # cxl disable-memdev mem3
> cxl memdev: action_disable: mem3 is part of an active region
> cxl memdev: cmd_disable_memdev: disabled 0 mem
> 
> ...where the tool tries to enforce safety, but if someone really wants
> that device gone:
> 
> # cxl disable-memdev mem3 --force
> cxl memdev: cmd_disable_memdev: disabled 1 mem
> 
> ...and the CXL sub-system will need to trigger memory-failure across
> all the regions that were impacted by that violent event.
> 
> >
> > > > +       cxlr->config.targets[target] = NULL;
> > > > +}
> > > > +
> > > > +static ssize_t interleave_ways_show(struct device *dev,
> > > > +                                   struct device_attribute *attr, char *buf)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
> > > > +}
> > > > +
> > > > +static ssize_t interleave_ways_store(struct device *dev,
> > > > +                                    struct device_attribute *attr,
> > > > +                                    const char *buf, size_t len)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       int ret, prev_iw;
> > > > +       int val;
> > >
> > > I would expect:
> > >
> > > if (dev->driver)
> > >    return -EBUSY;
> > >
> > > ...to shutdown configuration writes once the region is active. Might
> > > also need a region-wide seqlock like target_list_show. So that region
> > > probe drains  all active sysfs writers before assuming the
> > > configuration is stable.
> > >
> >
> > Okay.
> >
> > > > +
> > > > +       prev_iw = cxlr->config.interleave_ways;
> > > > +       ret = kstrtoint(buf, 0, &val);
> > > > +       if (ret)
> > > > +               return ret;
> > > > +       if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
> > > > +               return -EINVAL;
> > > > +
> > > > +       cxlr->config.interleave_ways = val;
> > > > +
> > > > +       ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
> > > > +       if (ret < 0)
> > > > +               goto err;
> > > > +
> > > > +       sysfs_notify(&dev->kobj, NULL, "target_interleave");
> > >
> > > Why?
> > >
> >
> > I copied it from another driver. I didn't check if it was actually needed or
> > not.
> 
> It's not needed since the agent that wrote interleave ways is also
> expected to be the agent that is configuring the rest of the
> parameters.
> 
> >
> > > > +
> > > > +       while (prev_iw > cxlr->config.interleave_ways)
> > > > +               remove_target(cxlr, --prev_iw);
> > >
> > > To make the kernel side simpler this attribute could just require that
> > > setting interleave ways is a one way street, if you want to change it
> > > you need to delete the region and start over.
> > >
> >
> > I'm fine with that.
> >
> > > > +
> > > > +       return len;
> > > > +
> > > > +err:
> > > > +       cxlr->config.interleave_ways = prev_iw;
> > > > +       return ret;
> > > > +}
> > > > +static DEVICE_ATTR_RW(interleave_ways);
> > > > +
> > > > +static ssize_t interleave_granularity_show(struct device *dev,
> > > > +                                          struct device_attribute *attr,
> > > > +                                          char *buf)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_granularity);
> > > > +}
> > > > +
> > > > +static ssize_t interleave_granularity_store(struct device *dev,
> > > > +                                           struct device_attribute *attr,
> > > > +                                           const char *buf, size_t len)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       int val, ret;
> > > > +
> > > > +       ret = kstrtoint(buf, 0, &val);
> > > > +       if (ret)
> > > > +               return ret;
> > > > +       cxlr->config.interleave_granularity = val;
> > >
> > > This wants minimum input validation and synchronization against an
> > > active region.
> > >
> > > > +
> > > > +       return len;
> > > > +}
> > > > +static DEVICE_ATTR_RW(interleave_granularity);
> > > > +
> > > > +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> > > > +                          char *buf)
> > > > +{
> > > > +       struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       resource_size_t offset;
> > > > +
> > > > +       if (!cxlr->res)
> > > > +               return sysfs_emit(buf, "\n");
> > >
> > > Should be an error I would think. I.e. require size to be set before
> > > s/offset/resource/ can be read.
> > >
> > > > +
> > > > +       offset = cxld->platform_res.start - cxlr->res->start;
> > >
> > > Why make usersapce do the offset math?
> > >
> > > > +
> > > > +       return sysfs_emit(buf, "%pa\n", &offset);
> > > > +}
> > > > +static DEVICE_ATTR_RO(offset);
> > >
> > > This can be DEVICE_ATTR_ADMIN_RO() to hide physical address layout
> > > information from non-root.
> > >
> > > > +
> > > > +static ssize_t size_show(struct device *dev, struct device_attribute *attr,
> > > > +                        char *buf)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "%llu\n", cxlr->config.size);
> > >
> > > Perhaps no need to store size separately if this becomes:
> > >
> > > sysfs_emit(buf, "%llu\n", (unsigned long long) resource_size(cxlr->res));
> > >
> > >
> > > ...?
> > >
> > > > +}
> > > > +
> > > > +static ssize_t size_store(struct device *dev, struct device_attribute *attr,
> > > > +                         const char *buf, size_t len)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       unsigned long long val;
> > > > +       ssize_t rc;
> > > > +
> > > > +       rc = kstrtoull(buf, 0, &val);
> > > > +       if (rc)
> > > > +               return rc;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +       if (is_region_active(cxlr))
> > > > +               rc = -EBUSY;
> > > > +       else
> > > > +               cxlr->config.size = val;
> > > > +       device_unlock(&cxlr->dev);
> > >
> > > I think lockdep will complain about device_lock() usage in an
> > > attribute. Try changing this to cxl_device_lock() with
> > > CONFIG_PROVE_CXL_LOCKING=y.
> > >
> > > > +
> > > > +       return rc ? rc : len;
> > > > +}
> > > > +static DEVICE_ATTR_RW(size);
> > > > +
> > > > +static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> > > > +                        char *buf)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
> > > > +}
> > > > +
> > > > +static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> > > > +                         const char *buf, size_t len)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       ssize_t rc;
> > > > +
> > > > +       if (len != UUID_STRING_LEN + 1)
> > > > +               return -EINVAL;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +       if (is_region_active(cxlr))
> > > > +               rc = -EBUSY;
> > > > +       else
> > > > +               rc = uuid_parse(buf, &cxlr->config.uuid);
> > > > +       device_unlock(&cxlr->dev);
> > > > +
> > > > +       return rc ? rc : len;
> > > > +}
> > > > +static DEVICE_ATTR_RW(uuid);
> > > > +
> > > > +static struct attribute *region_attrs[] = {
> > > > +       &dev_attr_interleave_ways.attr,
> > > > +       &dev_attr_interleave_granularity.attr,
> > > > +       &dev_attr_offset.attr,
> > > > +       &dev_attr_size.attr,
> > > > +       &dev_attr_uuid.attr,
> > > > +       NULL,
> > > > +};
> > > > +
> > > > +static const struct attribute_group region_group = {
> > > > +       .attrs = region_attrs,
> > > > +};
> > > > +
> > > > +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> > > > +{
> > > > +       int ret;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +       if (!cxlr->config.targets[n])
> > > > +               ret = sysfs_emit(buf, "\n");
> > > > +       else
> > > > +               ret = sysfs_emit(buf, "%s\n",
> > > > +                                dev_name(&cxlr->config.targets[n]->dev));
> > > > +       device_unlock(&cxlr->dev);
> > >
> > > The component contribution of a memdev to a region is a DPA-span, not
> > > the whole memdev. I would expect something like dax_mapping_attributes
> > > or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
> > > information about the component contribution of a memdev to a region.
> > >
> >
> > I had been thinking the kernel would manage the DPS spans of a memdev (and
> > create the mappings). I can make this look like dax_mapping_attributes.
> 
> I think we get this for free by just linking the region to each
> endpoint decoder in use and then userspace can walk that to get DPA
> info from the decoder (modulo extending endpoint decoders with DPA
> extent info).
> 
> >
> > > > +
> > > > +       return ret;
> > > > +}
> > > > +
> > > > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > > > +                         size_t len)
> > > > +{
> > > > +       struct device *memdev_dev;
> > > > +       struct cxl_memdev *cxlmd;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +
> > > > +       if (len == 1 || cxlr->config.targets[n])
> > > > +               remove_target(cxlr, n);
> > > > +
> > > > +       /* Remove target special case */
> > > > +       if (len == 1) {
> > > > +               device_unlock(&cxlr->dev);
> > > > +               return len;
> > > > +       }
> > > > +
> > > > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> > >
> > > I think this wants to be an endpoint decoder, not a memdev. Because
> > > it's the decoder that joins a memdev to a region, or at least a
> > > decoder should be picked when the memdev is assigned so that the DPA
> > > mapping can be registered. If all the decoders are allocated then fail
> > > here.
> > >
> >
> > My preference is obviously how it is, using memdevs and having the decoders
> > allocated at bind time. I don't have an objective argument why one is better
> > than the other so I will change it. I will make the interface take a set of
> > decoders.
> 
> Let me know if the arguments I have made for why more granularity is
> necessary have changed your preference. I'm open to hearing alternate
> ideas.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 02/14] cxl/region: Introduce concept of region configuration
  2022-02-03 22:23         ` Ben Widawsky
@ 2022-02-03 23:27           ` Dan Williams
  2022-02-04  0:19             ` Ben Widawsky
  0 siblings, 1 reply; 70+ messages in thread
From: Dan Williams @ 2022-02-03 23:27 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Bjorn Helgaas,
	Linux NVDIMM, Linux PCI

On Thu, Feb 3, 2022 at 2:23 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> On 22-02-03 09:48:49, Dan Williams wrote:
> > On Tue, Feb 1, 2022 at 3:11 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > >
> > > On 22-01-28 16:25:34, Dan Williams wrote:
> > > > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > > >
> > > > > The region creation APIs create a vacant region. Configuring the region
> > > > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > > > will be provided to allow userspace to configure the region.  Finally
> > > > > once all configuration is complete, userspace may activate the region.
> > > > >
> > > > > Introduced here are the most basic attributes needed to configure a
> > > > > region. Details of these attribute are described in the ABI
> > > >
> > > > s/attribute/attributes/
> > > >
> > > > > Documentation. Sanity checking of configuration parameters are done at
> > > > > region binding time. This consolidates all such logic in one place,
> > > > > rather than being strewn across multiple places.
> > > >
> > > > I think that's too late for some of the validation. The complex
> > > > validation that the region driver does throughout the topology is
> > > > different from the basic input validation that can  be done at the
> > > > sysfs write time. For example ,this patch allows negative
> > > > interleave_granularity values to specified, just return -EINVAL. I
> > > > agree that sysfs should not validate everything, I disagree with
> > > > pushing all validation to cxl_region_probe().
> > > >
> > >
> > > Two points:c
> > > 1. How do we distinguish "basic input validation". It'd be good if we could
> > >    define "basic input validation". For instance, when I first wrote these
> > >    patches, x3 would have been EINVAL, but today it's allowed. Can you help
> > >    enumerate what you consider basic.
> >
> > I internalized this kernel design principle from Dave Miller many
> > years ago paraphrasing "push decision making out to leaf code as much
> > as possible", and centralizing all validation in cxl_region_probe()
> > violates. The software that makes the mistake does not know it made a
> > mistake until much later and "probe failed" is less descriptive than
> > "EINVAL writing interleave_ways" . I wish I could find the thread
> > because it also talked about his iteration process.
>
> It would definitely be interesting to understand why pushing decision making
> into the leaf code is a violation. Was it primary around the descriptiveness of
> the error?

You mean the other way round, why is it a violation to move decision
making into the core? It was a comment about the inflexibility of the
core logic vs leaf logic, in the case of CXL it's about the
observability of errors at the right granularity which the core can
not do because the core is disconnected from the transaction that
injected the error.

> > Basic input validation to me is things like:
> >
> > - Don't allow writes while the region is active
> > - Check that values are in bound. So yes, the interleave-ways value of
> > 3 would fail until the kernel supports it, and granularity values >
> > 16K would also fail.
> > - Check that memdevs are actually downstream targets of the given decoder
> > - Check that the region uuid is unique
>
> These are obviously easy and informative at attr store time (in fact, active was
> meant to be checked already for many cases). So if we agree to codify this at
> probe via WARN, and add it to kdoc, I've no problem with it.

Why is WARN needed? Either the sysfs validation does it job correctly
or it doesn't. Also if sysfs didn't WARN when the bad input is
specified why would the core do anything higher than dev_err()?
Basically I think the bar for WARN is obvious kernel programming error
where only a kernel-developer will see it vs EINVAL at runtime
scenarios. I have seen Greg raise the bar for WARN in his reviews
given how many deployments turn on 'panic_on_warn'.

> > - Check that decoder has capacity
> > - Check that the memdev has capacity
> > - Check that the decoder to map the DPA is actually available given
> > decoders must be programmed in increasing DPA order
> >
> > Essentially any validation short of walking the topology to program
> > upstream decoders since those errors are only resolved by racing
> > region probes that try to grab upstream decoder resources.
> >
>
> I intentionally avoided doing a lot of these until probe because it seemed like
> not a great policy to deny regions from being populated if another region
> utilizing those resources hasn't been bound yes. For a simple example, if x1
> region A is created and utilizes all of memdev ɑ's capacity you block out any
> other region setup using memdev ɑ, even if region A wasn't bound. There's a
> similar problem with specifying decoders as part of configuration.
>
> I'll infer from your comment that you are fine with this tradeoff, or you have
> some other way to manage this in mind.

It comes back to observability if threadA allocates all the DPA then
yes all other threads should see -ENOSPC. No different than if 3 fdisk
threads all tried to create a partition, the first one to the kernel
wins. If threadA does not end up activating that regionA's capacity
that's userspace's fault, and the admin needs to make sure that
configuration does not race itself. The kernel allocating DPA
immediately lets those races be found early such that threadB finds
all the DPA gone and stops trying to create the region.

> I really see any validation which requires removal of resources from the system
> to be more fit for bind time. I suppose if the proposal is to move the region
> attributes to be DEVICE_ATTR_ADMIN, that pushes the problem onto the system
> administrator. It just seemed like most of the interface could be non-root.

None of the sysfs entries for CXL are writable by non-root.

DEVICE_ATTR_RW() is 0644
DEVICE_ATTR_ADMIN_RW() is 0600

Yes, pushing the problem onto the sysadmin is the only option. Only
CAP_SYS_ADMIN can be trusted to muck with the physical address layout
of the system. Even then CONFIG_LOCKDOWN_KERNEL wants to limit what
CAP_SYS_ADMIN can to do the memory configuration, so I don't see any
room for non-root to be considered in this ABI.

>
> > >
> > > 2. I like the idea that all validation takes place in one place. Obviously you
> > >    do not. So, see #1 and I will rework.
> >
> > The validation helpers need to be written once, where they are called
> > does not much matter, does it?
> >
>
> Somewhat addressed above too...
>
> I think that depends on whether the full list is established as mentioned. If in
> the region driver we can put several assertions that a variety of things don't
> need [re]validation, then it doesn't matter. Without this, when trying to debug
> or add code you need to figure out which place is doing the validation and which
> place should do it.

All I can say it has not been a problem in practice for NVDIMM debug
scenarios which does validation at probe for pre-existing namespaces
and validation at sysfs write for namespace creation.

> At the very least I think the plan should be established in a kdoc.

Sure, a "CXL Region: Theory of Operation" would be a good document to
lead into the patch series as a follow-on to "CXL Bus: Theory of
Operation".

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 02/14] cxl/region: Introduce concept of region configuration
  2022-02-03 23:27           ` Dan Williams
@ 2022-02-04  0:19             ` Ben Widawsky
  2022-02-04  2:45               ` Dan Williams
  0 siblings, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-02-04  0:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Bjorn Helgaas,
	Linux NVDIMM, Linux PCI

On 22-02-03 15:27:02, Dan Williams wrote:
> On Thu, Feb 3, 2022 at 2:23 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > On 22-02-03 09:48:49, Dan Williams wrote:
> > > On Tue, Feb 1, 2022 at 3:11 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > >
> > > > On 22-01-28 16:25:34, Dan Williams wrote:
> > > > > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > > > >
> > > > > > The region creation APIs create a vacant region. Configuring the region
> > > > > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > > > > will be provided to allow userspace to configure the region.  Finally
> > > > > > once all configuration is complete, userspace may activate the region.
> > > > > >
> > > > > > Introduced here are the most basic attributes needed to configure a
> > > > > > region. Details of these attribute are described in the ABI
> > > > >
> > > > > s/attribute/attributes/
> > > > >
> > > > > > Documentation. Sanity checking of configuration parameters are done at
> > > > > > region binding time. This consolidates all such logic in one place,
> > > > > > rather than being strewn across multiple places.
> > > > >
> > > > > I think that's too late for some of the validation. The complex
> > > > > validation that the region driver does throughout the topology is
> > > > > different from the basic input validation that can  be done at the
> > > > > sysfs write time. For example ,this patch allows negative
> > > > > interleave_granularity values to specified, just return -EINVAL. I
> > > > > agree that sysfs should not validate everything, I disagree with
> > > > > pushing all validation to cxl_region_probe().
> > > > >
> > > >
> > > > Two points:c
> > > > 1. How do we distinguish "basic input validation". It'd be good if we could
> > > >    define "basic input validation". For instance, when I first wrote these
> > > >    patches, x3 would have been EINVAL, but today it's allowed. Can you help
> > > >    enumerate what you consider basic.
> > >
> > > I internalized this kernel design principle from Dave Miller many
> > > years ago paraphrasing "push decision making out to leaf code as much
> > > as possible", and centralizing all validation in cxl_region_probe()
> > > violates. The software that makes the mistake does not know it made a
> > > mistake until much later and "probe failed" is less descriptive than
> > > "EINVAL writing interleave_ways" . I wish I could find the thread
> > > because it also talked about his iteration process.
> >
> > It would definitely be interesting to understand why pushing decision making
> > into the leaf code is a violation. Was it primary around the descriptiveness of
> > the error?
> 
> You mean the other way round, why is it a violation to move decision
> making into the core? It was a comment about the inflexibility of the
> core logic vs leaf logic, in the case of CXL it's about the
> observability of errors at the right granularity which the core can
> not do because the core is disconnected from the transaction that
> injected the error.

I did mean the other way around. The thing that gets tricky if you do it at the
sysfs boundary is you do have to start seeing the interface as stateful. Perhaps
the complexity I see arising from this won't materialize, so I'll try it and
see. It seems like it can get messy quickly though.

> 
> > > Basic input validation to me is things like:
> > >
> > > - Don't allow writes while the region is active
> > > - Check that values are in bound. So yes, the interleave-ways value of
> > > 3 would fail until the kernel supports it, and granularity values >
> > > 16K would also fail.
> > > - Check that memdevs are actually downstream targets of the given decoder
> > > - Check that the region uuid is unique
> >
> > These are obviously easy and informative at attr store time (in fact, active was
> > meant to be checked already for many cases). So if we agree to codify this at
> > probe via WARN, and add it to kdoc, I've no problem with it.
> 
> Why is WARN needed? Either the sysfs validation does it job correctly
> or it doesn't. Also if sysfs didn't WARN when the bad input is
> specified why would the core do anything higher than dev_err()?
> Basically I think the bar for WARN is obvious kernel programming error
> where only a kernel-developer will see it vs EINVAL at runtime
> scenarios. I have seen Greg raise the bar for WARN in his reviews
> given how many deployments turn on 'panic_on_warn'.

Ultimately some checking will need to occur in one form or another in region
probe(). Either explicit via conditional: if (!is_valid(interleave_ways)) return
-EINVAL, or implicitly, for example 1 << (rootd_ig - cxlr_ig) is some invalid
nonsense which later fails host bridge programming verification. Before
discussing further, which are you suggesting?

> 
> > > - Check that decoder has capacity
> > > - Check that the memdev has capacity
> > > - Check that the decoder to map the DPA is actually available given
> > > decoders must be programmed in increasing DPA order
> > >
> > > Essentially any validation short of walking the topology to program
> > > upstream decoders since those errors are only resolved by racing
> > > region probes that try to grab upstream decoder resources.
> > >
> >
> > I intentionally avoided doing a lot of these until probe because it seemed like
> > not a great policy to deny regions from being populated if another region
> > utilizing those resources hasn't been bound yes. For a simple example, if x1
> > region A is created and utilizes all of memdev ɑ's capacity you block out any
> > other region setup using memdev ɑ, even if region A wasn't bound. There's a
> > similar problem with specifying decoders as part of configuration.
> >
> > I'll infer from your comment that you are fine with this tradeoff, or you have
> > some other way to manage this in mind.
> 
> It comes back to observability if threadA allocates all the DPA then
> yes all other threads should see -ENOSPC. No different than if 3 fdisk
> threads all tried to create a partition, the first one to the kernel
> wins. If threadA does not end up activating that regionA's capacity
> that's userspace's fault, and the admin needs to make sure that
> configuration does not race itself. The kernel allocating DPA
> immediately lets those races be found early such that threadB finds
> all the DPA gone and stops trying to create the region.

Okay. I don't have a strong opinion on how userspace should or shouldn't use
this interface. It seems less friendly to do it this way, but per the following
comment, if it's root only, it doesn't really matter.

I was under the impression you expected userspace to manage the DPA as well. I
don't really see any reason why the kernel should manage it if userspace is
already handling all these other bits. Let userspace set the offset and size
(can make a single device attr for it), and upon doing so it gets reserved.

> 
> > I really see any validation which requires removal of resources from the system
> > to be more fit for bind time. I suppose if the proposal is to move the region
> > attributes to be DEVICE_ATTR_ADMIN, that pushes the problem onto the system
> > administrator. It just seemed like most of the interface could be non-root.
> 
> None of the sysfs entries for CXL are writable by non-root.
> 
> DEVICE_ATTR_RW() is 0644
> DEVICE_ATTR_ADMIN_RW() is 0600

My mistake. I forgot about that.

> 
> Yes, pushing the problem onto the sysadmin is the only option. Only
> CAP_SYS_ADMIN can be trusted to muck with the physical address layout
> of the system. Even then CONFIG_LOCKDOWN_KERNEL wants to limit what
> CAP_SYS_ADMIN can to do the memory configuration, so I don't see any
> room for non-root to be considered in this ABI.

That's fine. As the interface is today (before your requested changes) only
region->probe() is something that mucks with the physical address layout. It
could theoretically be entirely configured (not bound) by userspace.

> 
> >
> > > >
> > > > 2. I like the idea that all validation takes place in one place. Obviously you
> > > >    do not. So, see #1 and I will rework.
> > >
> > > The validation helpers need to be written once, where they are called
> > > does not much matter, does it?
> > >
> >
> > Somewhat addressed above too...
> >
> > I think that depends on whether the full list is established as mentioned. If in
> > the region driver we can put several assertions that a variety of things don't
> > need [re]validation, then it doesn't matter. Without this, when trying to debug
> > or add code you need to figure out which place is doing the validation and which
> > place should do it.
> 
> All I can say it has not been a problem in practice for NVDIMM debug
> scenarios which does validation at probe for pre-existing namespaces
> and validation at sysfs write for namespace creation.
> 
> > At the very least I think the plan should be established in a kdoc.
> 
> Sure, a "CXL Region: Theory of Operation" would be a good document to
> lead into the patch series as a follow-on to "CXL Bus: Theory of
> Operation".

Yeah. I will write it once we close this discussion.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 02/14] cxl/region: Introduce concept of region configuration
  2022-02-04  0:19             ` Ben Widawsky
@ 2022-02-04  2:45               ` Dan Williams
  0 siblings, 0 replies; 70+ messages in thread
From: Dan Williams @ 2022-02-04  2:45 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Bjorn Helgaas,
	Linux NVDIMM, Linux PCI

On Thu, Feb 3, 2022 at 4:20 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
[..]
> > > > Basic input validation to me is things like:
> > > >
> > > > - Don't allow writes while the region is active
> > > > - Check that values are in bound. So yes, the interleave-ways value of
> > > > 3 would fail until the kernel supports it, and granularity values >
> > > > 16K would also fail.
> > > > - Check that memdevs are actually downstream targets of the given decoder
> > > > - Check that the region uuid is unique
> > >
> > > These are obviously easy and informative at attr store time (in fact, active was
> > > meant to be checked already for many cases). So if we agree to codify this at
> > > probe via WARN, and add it to kdoc, I've no problem with it.
> >
> > Why is WARN needed? Either the sysfs validation does it job correctly
> > or it doesn't. Also if sysfs didn't WARN when the bad input is
> > specified why would the core do anything higher than dev_err()?
> > Basically I think the bar for WARN is obvious kernel programming error
> > where only a kernel-developer will see it vs EINVAL at runtime
> > scenarios. I have seen Greg raise the bar for WARN in his reviews
> > given how many deployments turn on 'panic_on_warn'.
>
> Ultimately some checking will need to occur in one form or another in region
> probe(). Either explicit via conditional: if (!is_valid(interleave_ways)) return
> -EINVAL, or implicitly, for example 1 << (rootd_ig - cxlr_ig) is some invalid
> nonsense which later fails host bridge programming verification. Before
> discussing further, which are you suggesting?

Explicit validation at probe in addition to the explicit validation at
the sysfs boundary (as much as possible to report errors early). The
"at probe time" validation does not know if this was a new region, or
one enumerated from LSA or the configuration that the BIOS specified.
So I do expect validation overlap, but there will also be distinct
checks for those different scenarios. For example, see how NVDIMM
validates namespace configuration writes via sysfs, but does not
validate the LSA because it's writing the label and had better be
prepared to read what it writes.

>
> >
> > > > - Check that decoder has capacity
> > > > - Check that the memdev has capacity
> > > > - Check that the decoder to map the DPA is actually available given
> > > > decoders must be programmed in increasing DPA order
> > > >
> > > > Essentially any validation short of walking the topology to program
> > > > upstream decoders since those errors are only resolved by racing
> > > > region probes that try to grab upstream decoder resources.
> > > >
> > >
> > > I intentionally avoided doing a lot of these until probe because it seemed like
> > > not a great policy to deny regions from being populated if another region
> > > utilizing those resources hasn't been bound yes. For a simple example, if x1
> > > region A is created and utilizes all of memdev ɑ's capacity you block out any
> > > other region setup using memdev ɑ, even if region A wasn't bound. There's a
> > > similar problem with specifying decoders as part of configuration.
> > >
> > > I'll infer from your comment that you are fine with this tradeoff, or you have
> > > some other way to manage this in mind.
> >
> > It comes back to observability if threadA allocates all the DPA then
> > yes all other threads should see -ENOSPC. No different than if 3 fdisk
> > threads all tried to create a partition, the first one to the kernel
> > wins. If threadA does not end up activating that regionA's capacity
> > that's userspace's fault, and the admin needs to make sure that
> > configuration does not race itself. The kernel allocating DPA
> > immediately lets those races be found early such that threadB finds
> > all the DPA gone and stops trying to create the region.
>
> Okay. I don't have a strong opinion on how userspace should or shouldn't use
> this interface. It seems less friendly to do it this way, but per the following
> comment, if it's root only, it doesn't really matter.
>
> I was under the impression you expected userspace to manage the DPA as well. I
> don't really see any reason why the kernel should manage it if userspace is
> already handling all these other bits. Let userspace set the offset and size
> (can make a single device attr for it), and upon doing so it gets reserved.

Userspace sense requests, kernel allocates or denies that request
after resolving races with other requesters. Yes, this makes the
interface stateful. Sysfs is not suited to stateless operation.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 08/14] cxl/region: HB port config verification
  2022-01-28  0:27 ` [PATCH v3 08/14] cxl/region: HB port config verification Ben Widawsky
@ 2022-02-14 16:20   ` Jonathan Cameron
  2022-02-14 17:51     ` Ben Widawsky
  2022-02-15 16:35   ` Jonathan Cameron
  2022-02-18 21:04   ` Dan Williams
  2 siblings, 1 reply; 70+ messages in thread
From: Jonathan Cameron @ 2022-02-14 16:20 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Dan Williams, Ira Weiny,
	Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

On Thu, 27 Jan 2022 16:27:01 -0800
Ben Widawsky <ben.widawsky@intel.com> wrote:

> Host bridge root port verification determines if the device ordering in
> an interleave set can be programmed through the host bridges and
> switches.
> 
> The algorithm implemented here is based on the CXL Type 3 Memory Device
> Software Guide, chapter 2.13.15. The current version of the guide does
> not yet support x3 interleave configurations, and so that's not
> supported here either.
> 
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>


> +static struct cxl_dport *get_rp(struct cxl_memdev *ep)
> +{
> +	struct cxl_port *port, *parent_port = port = ep->port;
> +	struct cxl_dport *dport;
> +
> +	while (!is_cxl_root(port)) {
> +		parent_port = to_cxl_port(port->dev.parent);
> +		if (parent_port->depth == 1)
> +			list_for_each_entry(dport, &parent_port->dports, list)
> +				if (dport->dport == port->uport->parent->parent)
> +					return dport;
> +		port = parent_port;
> +	}
> +
> +	BUG();

I know you mentioned you were reworking this patch set anyway, but
I thought I'd give some quick debugging related feedback.

When running against a single switch in qemu (patches out once
things are actually working), I hit this BUG()
printing dev_name for the port->uport->parent->parent gives
pci0000:0c but the matches are sort against
0000:0c:00.0 etc

So looks like one too many levels of parent in this case at least.

The other bug I haven't chased down yet is that if we happen
to have downstream ports of the switch with duplicate ids
(far too easy to do in QEMU as port_num is an optional
parameter for switch DS ports) it's detected and the probe fails
- but then it tries again and we get an infinite loop of new
ports being created and failing to probe...
I'll get back this one once I have it working with
a valid switch config.

Jonathan

> +	return NULL;
> +}

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 08/14] cxl/region: HB port config verification
  2022-02-14 16:20   ` Jonathan Cameron
@ 2022-02-14 17:51     ` Ben Widawsky
  2022-02-14 18:09       ` Jonathan Cameron
  0 siblings, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-02-14 17:51 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, patches, Alison Schofield, Dan Williams, Ira Weiny,
	Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

On 22-02-14 16:20:37, Jonathan Cameron wrote:
> On Thu, 27 Jan 2022 16:27:01 -0800
> Ben Widawsky <ben.widawsky@intel.com> wrote:
> 
> > Host bridge root port verification determines if the device ordering in
> > an interleave set can be programmed through the host bridges and
> > switches.
> > 
> > The algorithm implemented here is based on the CXL Type 3 Memory Device
> > Software Guide, chapter 2.13.15. The current version of the guide does
> > not yet support x3 interleave configurations, and so that's not
> > supported here either.
> > 
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> 
> 
> > +static struct cxl_dport *get_rp(struct cxl_memdev *ep)
> > +{
> > +	struct cxl_port *port, *parent_port = port = ep->port;
> > +	struct cxl_dport *dport;
> > +
> > +	while (!is_cxl_root(port)) {
> > +		parent_port = to_cxl_port(port->dev.parent);
> > +		if (parent_port->depth == 1)
> > +			list_for_each_entry(dport, &parent_port->dports, list)
> > +				if (dport->dport == port->uport->parent->parent)
> > +					return dport;
> > +		port = parent_port;
> > +	}
> > +
> > +	BUG();
> 
> I know you mentioned you were reworking this patch set anyway, but
> I thought I'd give some quick debugging related feedback.
> 
> When running against a single switch in qemu (patches out once
> things are actually working), I hit this BUG()
> printing dev_name for the port->uport->parent->parent gives
> pci0000:0c but the matches are sort against
> 0000:0c:00.0 etc
> 
> So looks like one too many levels of parent in this case at least.

Hmm. This definitely looks dubious now that I see it again. Let me try to figure
out how to rework it. I think it would be good to ask Dan as well. Much of the
topology relationship works from bottom up, but top down is less easy.
Previously I had used pci-isms to do this but Dan has been working on keeping
the two domains isolated, which I agree is a good idea.

> 
> The other bug I haven't chased down yet is that if we happen
> to have downstream ports of the switch with duplicate ids
> (far too easy to do in QEMU as port_num is an optional
> parameter for switch DS ports) it's detected and the probe fails
> - but then it tries again and we get an infinite loop of new
> ports being created and failing to probe...

Is this allowed by spec? We shouldn't infinite loop, but I can't imagine the
driver could do anything saner than fail to probe for such a case.

> I'll get back this one once I have it working with
> a valid switch config.

Thanks.

> 
> Jonathan
> 
> > +	return NULL;
> > +}

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 08/14] cxl/region: HB port config verification
  2022-02-14 17:51     ` Ben Widawsky
@ 2022-02-14 18:09       ` Jonathan Cameron
  0 siblings, 0 replies; 70+ messages in thread
From: Jonathan Cameron @ 2022-02-14 18:09 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Dan Williams, Ira Weiny,
	Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

On Mon, 14 Feb 2022 09:51:55 -0800
Ben Widawsky <ben.widawsky@intel.com> wrote:

> On 22-02-14 16:20:37, Jonathan Cameron wrote:
> > On Thu, 27 Jan 2022 16:27:01 -0800
> > Ben Widawsky <ben.widawsky@intel.com> wrote:
> >   
> > > Host bridge root port verification determines if the device ordering in
> > > an interleave set can be programmed through the host bridges and
> > > switches.
> > > 
> > > The algorithm implemented here is based on the CXL Type 3 Memory Device
> > > Software Guide, chapter 2.13.15. The current version of the guide does
> > > not yet support x3 interleave configurations, and so that's not
> > > supported here either.
> > > 
> > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>  
> > 
> >   
> > > +static struct cxl_dport *get_rp(struct cxl_memdev *ep)
> > > +{
> > > +	struct cxl_port *port, *parent_port = port = ep->port;
> > > +	struct cxl_dport *dport;
> > > +
> > > +	while (!is_cxl_root(port)) {
> > > +		parent_port = to_cxl_port(port->dev.parent);
> > > +		if (parent_port->depth == 1)
> > > +			list_for_each_entry(dport, &parent_port->dports, list)
> > > +				if (dport->dport == port->uport->parent->parent)
> > > +					return dport;
> > > +		port = parent_port;
> > > +	}
> > > +
> > > +	BUG();  
> > 
> > I know you mentioned you were reworking this patch set anyway, but
> > I thought I'd give some quick debugging related feedback.
> > 
> > When running against a single switch in qemu (patches out once
> > things are actually working), I hit this BUG()
> > printing dev_name for the port->uport->parent->parent gives
> > pci0000:0c but the matches are sort against
> > 0000:0c:00.0 etc
> > 
> > So looks like one too many levels of parent in this case at least.  
> 
> Hmm. This definitely looks dubious now that I see it again. Let me try to figure
> out how to rework it. I think it would be good to ask Dan as well. Much of the
> topology relationship works from bottom up, but top down is less easy.
> Previously I had used pci-isms to do this but Dan has been working on keeping
> the two domains isolated, which I agree is a good idea.
> 
> > 
> > The other bug I haven't chased down yet is that if we happen
> > to have downstream ports of the switch with duplicate ids
> > (far too easy to do in QEMU as port_num is an optional
> > parameter for switch DS ports) it's detected and the probe fails
> > - but then it tries again and we get an infinite loop of new
> > ports being created and failing to probe...  
> 
> Is this allowed by spec? We shouldn't infinite loop, but I can't imagine the
> driver could do anything saner than fail to probe for such a case.

It would be a hardware bug, however I suspect any failure to probe will
cause it rather that this specific case.  I'll inject another failure
when I get back to this properly.

Jonathan

> 
> > I'll get back this one once I have it working with
> > a valid switch config.  
> 
> Thanks.
> 
> > 
> > Jonathan
> >   
> > > +	return NULL;
> > > +}  


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 11/14] cxl/region: Add support for single switch level
  2022-01-28  0:27 ` [PATCH v3 11/14] cxl/region: Add support for single switch level Ben Widawsky
  2022-02-01 18:26   ` Jonathan Cameron
@ 2022-02-15 16:10   ` Jonathan Cameron
  2022-02-18 18:23     ` Jonathan Cameron
  1 sibling, 1 reply; 70+ messages in thread
From: Jonathan Cameron @ 2022-02-15 16:10 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Dan Williams, Ira Weiny,
	Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

On Thu, 27 Jan 2022 16:27:04 -0800
Ben Widawsky <ben.widawsky@intel.com> wrote:

> CXL switches have HDM decoders just like host bridges and endpoints.
> Their programming works in a similar fashion.
> 
> The spec does not prohibit multiple levels of switches, however, those
> are not implemented at this time.
> 
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Hi Ben,

I'm still hammering away at trying to bring up qemu switch emulation.
Even though I know you are reworking this, seems only sensible to point
out issues when I hit them.  If no longer relevant you can ignore them!

With these bits and a few other minor tweaks the decoders now look
to be right - I just need to wire up the QEMU side so that
I don't get hardware exceptions on actually reading and writing once
the region is bound :) 

Thanks,

J
> ---
>  drivers/cxl/cxl.h    |  5 ++++
>  drivers/cxl/region.c | 61 ++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 64 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 8ace6cca0776..d70d8c85d05f 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -96,6 +96,11 @@ static inline u8 cxl_to_ig(u16 g)
>  	return ilog2(g) - 8;
>  }
>  
> +static inline int cxl_to_ways(u8 ways)
> +{
> +	return 1 << ways;
> +}
> +
>  static inline bool cxl_is_interleave_ways_valid(int iw)
>  {
>  	switch (iw) {
> diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
> index b8982be13bfe..f748060733dd 100644
> --- a/drivers/cxl/region.c
> +++ b/drivers/cxl/region.c
> @@ -359,6 +359,23 @@ static bool has_switch(const struct cxl_region *cxlr)
>  	return false;
>  }
>  
> +static bool has_multi_switch(const struct cxl_region *cxlr)
> +{
> +	struct cxl_memdev *ep;
> +	int i;
> +
> +	for_each_cxl_endpoint(ep, cxlr, i)
> +		if (ep->port->depth > 3)
> +			return true;
> +
> +	return false;
> +}
> +
> +static struct cxl_port *get_switch(struct cxl_memdev *ep)
> +{
> +	return to_cxl_port(ep->port->dev.parent);
> +}
> +
>  static struct cxl_decoder *get_decoder(struct cxl_region *cxlr,
>  				       struct cxl_port *p)
>  {
> @@ -409,6 +426,8 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
>  				      const struct cxl_decoder *rootd,
>  				      bool state_update)
>  {
> +	const int region_ig = cxl_to_ig(cxlr->config.interleave_granularity);
> +	const int region_eniw = cxl_to_eniw(cxlr->config.interleave_ways);
>  	const int num_root_ports = get_num_root_ports(cxlr);
>  	struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];
>  	struct cxl_decoder *cxld, *c;
> @@ -416,8 +435,12 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
>  
>  	hb_count = get_unique_hostbridges(cxlr, hbs);
>  
> -	/* TODO: Switch support */
> -	if (has_switch(cxlr))
> +	/* TODO: support multiple levels of switches */
> +	if (has_multi_switch(cxlr))
> +		return false;
> +
> +	/* TODO: x3 interleave for switches is hard. */
> +	if (has_switch(cxlr) && !is_power_of_2(region_ways(cxlr)))
>  		return false;
>  
>  	/*
> @@ -470,8 +493,14 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
>  		list_for_each_entry(rp, &hb->dports, list) {
>  			struct cxl_memdev *ep;
>  			int port_grouping = -1;
> +			int target_ndx;
As things currently stand, with a switch connected below a single port
of a host bridge (4 type 3 off the switch) this will program the HB
decoder to have 4 targets, all routed to the switch USP.

There is an argument that this is correct but its not what I'd expect.
I'd expect to see 1 target only.  It's not a problem for small cases, but
with enough rp and switches we can run out of targets.

>  
>  			for_each_cxl_endpoint_hb(ep, cxlr, hb, idx) {
> +				struct cxl_decoder *switch_cxld;
> +				struct cxl_dport *target;
> +				struct cxl_port *switch_port;
> +				bool found = false;
> +
>  				if (get_rp(ep) != rp)
>  					continue;
>  
> @@ -499,6 +528,34 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
>  
>  				cxld->interleave_ways++;
>  				cxld->target[port_grouping] = get_rp(ep);
> +
> +				/*
> +				 * At least one switch is connected here if the endpoint
> +				 * has a depth > 2
> +				 */
> +				if (ep->port->depth == 2)
> +					continue;
> +
> +				/* Check the staged list to see if this
> +				 * port has already been added
> +				 */
> +				switch_port = get_switch(ep);
> +				list_for_each_entry(switch_cxld, &cxlr->staged_list, region_link) {
> +					if (to_cxl_port(switch_cxld->dev.parent) == switch_port)
> +						found = true;

break;

> +				}
> +
> +				if (found) {
> +					target = cxl_find_dport_by_dev(switch_port, ep->dev.parent->parent);
> +					switch_cxld->target[target_ndx++] = target;
> +					continue;
> +				}
> +
> +				target_ndx = 0;
> +
> +				switch_cxld = get_decoder(cxlr, switch_port);
> +				switch_cxld->interleave_ways++;
> +				switch_cxld->interleave_granularity = cxl_to_ways(region_ig + region_eniw);

I'm not following this.  Perhaps comment on why this particular maths?  I was assuming the switch
interleave granularity would that of the region as the switch is last level of decode.

Need to do the equivalent here of what you do in the if (found) or the first target is missed.
Also need to be updating interleave_ways only in the found path, not here (as the default is 1)

>  			}
>  		}
>  	}


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 08/14] cxl/region: HB port config verification
  2022-01-28  0:27 ` [PATCH v3 08/14] cxl/region: HB port config verification Ben Widawsky
  2022-02-14 16:20   ` Jonathan Cameron
@ 2022-02-15 16:35   ` Jonathan Cameron
  2022-02-18 21:04   ` Dan Williams
  2 siblings, 0 replies; 70+ messages in thread
From: Jonathan Cameron @ 2022-02-15 16:35 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Dan Williams, Ira Weiny,
	Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

On Thu, 27 Jan 2022 16:27:01 -0800
Ben Widawsky <ben.widawsky@intel.com> wrote:

> Host bridge root port verification determines if the device ordering in
> an interleave set can be programmed through the host bridges and
> switches.
> 
> The algorithm implemented here is based on the CXL Type 3 Memory Device
> Software Guide, chapter 2.13.15. The current version of the guide does
> not yet support x3 interleave configurations, and so that's not
> supported here either.
> 
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> ---
>  .clang-format           |   1 +
>  drivers/cxl/core/port.c |   1 +
>  drivers/cxl/cxl.h       |   2 +
>  drivers/cxl/region.c    | 127 +++++++++++++++++++++++++++++++++++++++-
>  4 files changed, 130 insertions(+), 1 deletion(-)
> 
> diff --git a/.clang-format b/.clang-format
> index 1221d53be90b..5e20206f905e 100644
> --- a/.clang-format
> +++ b/.clang-format
> @@ -171,6 +171,7 @@ ForEachMacros:
>    - 'for_each_cpu_wrap'
>    - 'for_each_cxl_decoder_target'
>    - 'for_each_cxl_endpoint'
> +  - 'for_each_cxl_endpoint_hb'
>    - 'for_each_dapm_widgets'
>    - 'for_each_dev_addr'
>    - 'for_each_dev_scope'
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 0847e6ce19ef..1d81c5f56a3e 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -706,6 +706,7 @@ struct cxl_dport *devm_cxl_add_dport(struct cxl_port *port,
>  		return ERR_PTR(-ENOMEM);
>  
>  	INIT_LIST_HEAD(&dport->list);
> +	INIT_LIST_HEAD(&dport->verify_link);
>  	dport->dport = dport_dev;
>  	dport->port_id = port_id;
>  	dport->component_reg_phys = component_reg_phys;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index a291999431c7..ed984465b59c 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -350,6 +350,7 @@ struct cxl_port {
>   * @component_reg_phys: downstream port component registers
>   * @port: reference to cxl_port that contains this downstream port
>   * @list: node for a cxl_port's list of cxl_dport instances
> + * @verify_link: node used for hb root port verification
>   */
>  struct cxl_dport {
>  	struct device *dport;
> @@ -357,6 +358,7 @@ struct cxl_dport {
>  	resource_size_t component_reg_phys;
>  	struct cxl_port *port;
>  	struct list_head list;
> +	struct list_head verify_link;
>  };
>  
>  /**
> diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
> index 562c8720da56..d2f6c990c8a8 100644
> --- a/drivers/cxl/region.c
> +++ b/drivers/cxl/region.c
> @@ -4,6 +4,7 @@
>  #include <linux/genalloc.h>
>  #include <linux/device.h>
>  #include <linux/module.h>
> +#include <linux/sort.h>
>  #include <linux/pci.h>
>  #include "cxlmem.h"
>  #include "region.h"
> @@ -36,6 +37,12 @@
>  	for (idx = 0, ep = (region)->config.targets[idx];                      \
>  	     idx < region_ways(region); ep = (region)->config.targets[++idx])
>  
> +#define for_each_cxl_endpoint_hb(ep, region, hb, idx)                          \
> +	for (idx = 0, (ep) = (region)->config.targets[idx];                    \
> +	     idx < region_ways(region);                                        \
> +	     idx++, (ep) = (region)->config.targets[idx])                      \
> +		if (get_hostbridge(ep) == (hb))
> +
>  #define for_each_cxl_decoder_target(dport, decoder, idx)                       \
>  	for (idx = 0, dport = (decoder)->target[idx];                          \
>  	     idx < (decoder)->nr_targets - 1;                                  \
> @@ -299,6 +306,59 @@ static bool region_xhb_config_valid(const struct cxl_region *cxlr,
>  	return true;
>  }
>  
> +static struct cxl_dport *get_rp(struct cxl_memdev *ep)
> +{
> +	struct cxl_port *port, *parent_port = port = ep->port;
> +	struct cxl_dport *dport;
> +
> +	while (!is_cxl_root(port)) {
> +		parent_port = to_cxl_port(port->dev.parent);
> +		if (parent_port->depth == 1)
> +			list_for_each_entry(dport, &parent_port->dports, list)
> +				if (dport->dport == port->uport->parent->parent)
> +					return dport;
> +		port = parent_port;
> +	}
> +
> +	BUG();
> +	return NULL;
> +}
> +
> +static int get_num_root_ports(const struct cxl_region *cxlr)
> +{
> +	struct cxl_memdev *endpoint;
> +	struct cxl_dport *dport, *tmp;
> +	int num_root_ports = 0;
> +	LIST_HEAD(root_ports);
> +	int idx;
> +
> +	for_each_cxl_endpoint(endpoint, cxlr, idx) {
> +		struct cxl_dport *root_port = get_rp(endpoint);
> +
> +		if (list_empty(&root_port->verify_link)) {
> +			list_add_tail(&root_port->verify_link, &root_ports);
> +			num_root_ports++;
> +		}
> +	}
> +
> +	list_for_each_entry_safe(dport, tmp, &root_ports, verify_link)
> +		list_del_init(&dport->verify_link);
> +
> +	return num_root_ports;
> +}
> +
> +static bool has_switch(const struct cxl_region *cxlr)
> +{
> +	struct cxl_memdev *ep;
> +	int i;
> +
> +	for_each_cxl_endpoint(ep, cxlr, i)
> +		if (ep->port->depth > 2)
> +			return true;
> +
> +	return false;
> +}
> +
>  /**
>   * region_hb_rp_config_valid() - determine root port ordering is correct
>   * @cxlr: Region to validate
> @@ -312,7 +372,72 @@ static bool region_xhb_config_valid(const struct cxl_region *cxlr,
>  static bool region_hb_rp_config_valid(const struct cxl_region *cxlr,
>  				      const struct cxl_decoder *rootd)
>  {
> -	/* TODO: */
> +	const int num_root_ports = get_num_root_ports(cxlr);
> +	struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];
> +	int hb_count, i;
> +
> +	hb_count = get_unique_hostbridges(cxlr, hbs);
> +
> +	/* TODO: Switch support */
> +	if (has_switch(cxlr))
> +		return false;
> +
> +	/*
> +	 * Are all devices in this region on the same CXL Host Bridge
> +	 * Root Port?
> +	 */
> +	if (num_root_ports == 1 && !has_switch(cxlr))
> +		return true;
> +
> +	for (i = 0; i < hb_count; i++) {
> +		int idx, position_mask;
> +		struct cxl_dport *rp;
> +		struct cxl_port *hb;
> +
> +		/* Get next CXL Host Bridge this region spans */
> +		hb = hbs[i];
> +
> +		/*
> +		 * Calculate the position mask: NumRootPorts = 2^PositionMask
> +		 * for this region.
> +		 *
> +		 * XXX: pos_mask is actually (1 << PositionMask)  - 1
> +		 */
> +		position_mask = (1 << (ilog2(num_root_ports))) - 1;

Needs to account for the root ports potentially being spread over multiple host
bridges.  For now I'm assuming some symmetry to move my own testing forwards
but that's not strictly required if we want to be really flexible.

So I've been using

		position_mask = (1 << (ilog2(num_root_ports/hb_count)) - 1;


> +
> +		/*
> +		 * Calculate the PortGrouping for each device on this CXL Host
> +		 * Bridge Root Port:
> +		 * PortGrouping = RegionLabel.Position & PositionMask
> +		 *
> +		 * The following nest iterators effectively iterate over each
> +		 * root port in the region.
> +		 *   for_each_unique_rootport(rp, cxlr)
> +		 */
> +		list_for_each_entry(rp, &hb->dports, list) {
> +			struct cxl_memdev *ep;
> +			int port_grouping = -1;
> +
> +			for_each_cxl_endpoint_hb(ep, cxlr, hb, idx) {
> +				if (get_rp(ep) != rp)
> +					continue;
> +
> +				if (port_grouping == -1)
> +					port_grouping = idx & position_mask;
> +
> +				/*
> +				 * Do all devices in the region connected to this CXL
> +				 * Host Bridge Root Port have the same PortGrouping?
> +				 */
> +				if ((idx & position_mask) != port_grouping) {
> +					dev_dbg(&cxlr->dev,
> +						"One or more devices are not connected to the correct Host Bridge Root Port\n");
> +					return false;
> +				}
> +			}
> +		}
> +	}
> +
>  	return true;
>  }
>  


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 03/14] cxl/mem: Cache port created by the mem dev
  2022-01-28  0:26 ` [PATCH v3 03/14] cxl/mem: Cache port created by the mem dev Ben Widawsky
@ 2022-02-17  1:20   ` Dan Williams
  0 siblings, 0 replies; 70+ messages in thread
From: Dan Williams @ 2022-02-17  1:20 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> Since region programming sees all components in the topology as a port,
> it's required that endpoints are treated equally. The easiest way to go
> from endpoint to port is to simply cache it at creation time.

As of 8dd2bc0f8e02 ("cxl/mem: Add the cxl_mem driver"),
cxl_endpoint_autoremove() already sets cxlmd drvdata to @endpoint, so
this patch isn't needed.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 04/14] cxl/region: Introduce a cxl_region driver
  2022-01-28  0:26 ` [PATCH v3 04/14] cxl/region: Introduce a cxl_region driver Ben Widawsky
  2022-02-01 16:21   ` Jonathan Cameron
@ 2022-02-17  6:04   ` Dan Williams
  1 sibling, 0 replies; 70+ messages in thread
From: Dan Williams @ 2022-02-17  6:04 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Bjorn Helgaas,
	Linux NVDIMM, Linux PCI

On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> The cxl_region driver is responsible for managing the HDM decoder
> programming in the CXL topology. Once a region is created it must be
> configured and bound to the driver in order to activate it.
>
> The following is a sample of how such controls might work:
>
> region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> echo $region > /sys/bus/cxl/devices/decoder0.0/create_region
> echo 2 > /sys/bus/cxl/devices/decoder0.0/region0.0:0/interleave
> echo $((256<<20)) > /sys/bus/cxl/devices/decoder0.0/region0.0:0/size
> echo mem0 > /sys/bus/cxl/devices/decoder0.0/region0.0:0/target0
> echo mem1 > /sys/bus/cxl/devices/decoder0.0/region0.0:0/target1
> echo region0.0:0 > /sys/bus/cxl/drivers/cxl_region/bind
>
> In order to handle the eventual rise in failure modes of binding a
> region, a new trace event is created to help track these failures for
> debug and reconfiguration paths in userspace.
>
> Reported-by: kernel test robot <lkp@intel.com> (v2)
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> ---
> Changes since v2:
> - Add CONFIG_CXL_REGION
> - Check ways/granularity in sanitize
> ---
>  .../driver-api/cxl/memory-devices.rst         |   3 +
>  drivers/cxl/Kconfig                           |   4 +
>  drivers/cxl/Makefile                          |   2 +
>  drivers/cxl/core/core.h                       |   1 +
>  drivers/cxl/core/port.c                       |  17 +-
>  drivers/cxl/core/region.c                     |  25 +-
>  drivers/cxl/cxl.h                             |  31 ++
>  drivers/cxl/region.c                          | 349 ++++++++++++++++++
>  drivers/cxl/region.h                          |   4 +
>  9 files changed, 431 insertions(+), 5 deletions(-)
>  create mode 100644 drivers/cxl/region.c
>
> diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst
> index 66ddc58a21b1..8cb4dece5b17 100644
> --- a/Documentation/driver-api/cxl/memory-devices.rst
> +++ b/Documentation/driver-api/cxl/memory-devices.rst
> @@ -364,6 +364,9 @@ CXL Core
>
>  CXL Regions
>  -----------
> +.. kernel-doc:: drivers/cxl/region.c
> +   :doc: cxl region
> +
>  .. kernel-doc:: drivers/cxl/region.h
>     :identifiers:
>
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index b88ab956bb7c..742847503c16 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -98,4 +98,8 @@ config CXL_PORT
>         default CXL_BUS
>         tristate
>
> +config CXL_REGION
> +       default CXL_PORT
> +       tristate
> +
>  endif
> diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
> index ce267ef11d93..02a4776e7ab9 100644
> --- a/drivers/cxl/Makefile
> +++ b/drivers/cxl/Makefile
> @@ -5,9 +5,11 @@ obj-$(CONFIG_CXL_MEM) += cxl_mem.o
>  obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
>  obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o
>  obj-$(CONFIG_CXL_PORT) += cxl_port.o
> +obj-$(CONFIG_CXL_REGION) += cxl_region.o
>
>  cxl_mem-y := mem.o
>  cxl_pci-y := pci.o
>  cxl_acpi-y := acpi.o
>  cxl_pmem-y := pmem.o
>  cxl_port-y := port.o
> +cxl_region-y := region.o
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 35fd08d560e2..b8a154da34df 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -7,6 +7,7 @@
>  extern const struct device_type cxl_nvdimm_bridge_type;
>  extern const struct device_type cxl_nvdimm_type;
>  extern const struct device_type cxl_memdev_type;
> +extern const struct device_type cxl_region_type;
>
>  extern struct attribute_group cxl_base_attribute_group;
>
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 0826208b2bdf..0847e6ce19ef 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -9,6 +9,7 @@
>  #include <linux/idr.h>
>  #include <cxlmem.h>
>  #include <cxlpci.h>
> +#include <region.h>
>  #include <cxl.h>
>  #include "core.h"
>
> @@ -49,6 +50,8 @@ static int cxl_device_id(struct device *dev)
>         }
>         if (dev->type == &cxl_memdev_type)
>                 return CXL_DEVICE_MEMORY_EXPANDER;
> +       if (dev->type == &cxl_region_type)
> +               return CXL_DEVICE_REGION;
>         return 0;
>  }
>
> @@ -1425,13 +1428,23 @@ static int cxl_bus_match(struct device *dev, struct device_driver *drv)
>
>  static int cxl_bus_probe(struct device *dev)
>  {
> -       int rc;
> +       int id = cxl_device_id(dev);
> +       int rc = -ENODEV;
>
>         cxl_nested_lock(dev);
> -       rc = to_cxl_drv(dev->driver)->probe(dev);
> +       if (id == CXL_DEVICE_REGION) {
> +               /* Regions cannot bind until parameters are set */
> +               struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +               if (is_cxl_region_configured(cxlr))
> +                       rc = to_cxl_drv(dev->driver)->probe(dev);

Setting aside whether is_cxl_region_configured() is sufficient, all
probe failures of the region belong to cxl_region_probe(), no special
casing in cxl_bus_probe() required. I would only expect special casing
here if there were responsibilities that the core needed to manage
relative to region probe, but I think all of that can be safely pushed
out to leaf functions.

> +       } else {
> +               rc = to_cxl_drv(dev->driver)->probe(dev);
> +       }
>         cxl_nested_unlock(dev);
>
>         dev_dbg(dev, "probe: %d\n", rc);
> +
>         return rc;
>  }
>
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 3b48e0469fc7..784e4ba25128 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -12,6 +12,8 @@
>  #include <cxl.h>
>  #include "core.h"
>
> +#include "core.h"
> +
>  /**
>   * DOC: cxl core region
>   *
> @@ -26,10 +28,27 @@ static const struct attribute_group region_interleave_group;
>
>  static bool is_region_active(struct cxl_region *cxlr)
>  {
> -       /* TODO: Regions can't be activated yet. */
> -       return false;
> +       return cxlr->active;

I think we already talked about this being redundant with the state of
cxlr->dev.driver...

>  }
>
> +/*
> + * Most sanity checking is left up to region binding. This does the most basic
> + * check to determine whether or not the core should try probing the driver.
> + */
> +bool is_cxl_region_configured(const struct cxl_region *cxlr)
> +{
> +       /* zero sized regions aren't a thing. */
> +       if (cxlr->config.size <= 0)
> +               return false;
> +
> +       /* all regions have at least 1 target */
> +       if (!cxlr->config.targets[0])
> +               return false;
> +
> +       return true;
> +}
> +EXPORT_SYMBOL_GPL(is_cxl_region_configured);
> +
>  static void remove_target(struct cxl_region *cxlr, int target)
>  {
>         struct cxl_memdev *cxlmd;
> @@ -316,7 +335,7 @@ static const struct attribute_group *region_groups[] = {
>
>  static void cxl_region_release(struct device *dev);
>
> -static const struct device_type cxl_region_type = {
> +const struct device_type cxl_region_type = {
>         .name = "cxl_region",
>         .release = cxl_region_release,
>         .groups = region_groups
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index b9f0099c1f39..d1a8ca19c9ea 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -81,6 +81,31 @@ static inline int cxl_to_interleave_ways(u8 eniw)
>         }
>  }
>
> +static inline bool cxl_is_interleave_ways_valid(int iw)
> +{
> +       switch (iw) {
> +               case 0 ... 4:
> +               case 6:
> +               case 8:
> +               case 12:
> +               case 16:
> +                       return true;
> +               default:
> +                       return false;
> +       }
> +
> +       unreachable();
> +}
> +
> +static inline bool cxl_is_interleave_granularity_valid(int ig)
> +{
> +       if (!is_power_of_2(ig))
> +               return false;
> +
> +       /* 16K is the max */
> +       return ((ig >> 15) == 0);
> +}

...already discussed that this validation needs to happen at input time.

> +
>  /* CXL 2.0 8.2.8.1 Device Capabilities Array Register */
>  #define CXLDEV_CAP_ARRAY_OFFSET 0x0
>  #define   CXLDEV_CAP_ARRAY_CAP_ID 0
> @@ -199,6 +224,10 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
>  #define CXL_DECODER_F_ENABLE    BIT(5)
>  #define CXL_DECODER_F_MASK  GENMASK(5, 0)
>
> +#define cxl_is_pmem_t3(flags)                                                  \
> +       (((flags) & (CXL_DECODER_F_TYPE3 | CXL_DECODER_F_PMEM)) ==             \
> +        (CXL_DECODER_F_TYPE3 | CXL_DECODER_F_PMEM))

It's not clear that this macro is more readable than open coding the
flag checking. For example all the decoders that this driver cares
about will have CXL_DECODER_F_TYPE3, right?

> +
>  enum cxl_decoder_type {
>         CXL_DECODER_ACCELERATOR = 2,
>         CXL_DECODER_EXPANDER = 3,
> @@ -357,6 +386,7 @@ struct cxl_dport *devm_cxl_add_dport(struct cxl_port *port,
>                                      resource_size_t component_reg_phys);
>  struct cxl_dport *cxl_find_dport_by_dev(struct cxl_port *port,
>                                         const struct device *dev);
> +struct cxl_port *ep_find_cxl_port(struct cxl_memdev *cxlmd, unsigned int depth);
>
>  struct cxl_decoder *to_cxl_decoder(struct device *dev);
>  bool is_root_decoder(struct device *dev);
> @@ -404,6 +434,7 @@ void cxl_driver_unregister(struct cxl_driver *cxl_drv);
>  #define CXL_DEVICE_PORT                        3
>  #define CXL_DEVICE_ROOT                        4
>  #define CXL_DEVICE_MEMORY_EXPANDER     5
> +#define CXL_DEVICE_REGION              6
>
>  #define MODULE_ALIAS_CXL(type) MODULE_ALIAS("cxl:t" __stringify(type) "*")
>  #define CXL_MODALIAS_FMT "cxl:t%d"
> diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
> new file mode 100644
> index 000000000000..cc41939a2f0a
> --- /dev/null
> +++ b/drivers/cxl/region.c
> @@ -0,0 +1,349 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2021 Intel Corporation. All rights reserved. */
> +#include <linux/platform_device.h>
> +#include <linux/device.h>
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +#include "cxlmem.h"
> +#include "region.h"
> +#include "cxl.h"
> +
> +/**
> + * DOC: cxl region
> + *
> + * This module implements a region driver that is capable of programming CXL
> + * hardware to setup regions.

This lead-in does not clarify anything, I'd just delete it to jump
straight to this goodness in the next paragraph.

> + *
> + * A CXL region encompasses a chunk of host physical address space that may be
> + * consumed by a single device (x1 interleave aka linear) or across multiple
> + * devices (xN interleaved). The region driver has the following
> + * responsibilities:
> + *
> + * * Walk topology to obtain decoder resources for region configuration.
> + * * Program decoder resources based on region configuration.
> + * * Bridge CXL regions to LIBNVDIMM
> + * * Initiates reading and configuring LSA regions
> + * * Enumerates regions created by BIOS (typically volatile)
> + */
> +
> +#define region_ways(region) ((region)->config.interleave_ways)
> +#define region_granularity(region) ((region)->config.interleave_granularity)

This seems to lend credence to just dropping the "config." indirection
and just open-coding region->interleave_{ways,granularity}.

I can read region->interleave_ways and not worry whereas
region_ways(region) might have side effects.

> +
> +static struct cxl_decoder *rootd_from_region(struct cxl_region *cxlr)
> +{
> +       struct device *d = cxlr->dev.parent;
> +
> +       if (WARN_ONCE(!is_root_decoder(d),
> +                     "Corrupt topology for root region\n"))

This can't happen. The WARN's in the other to_cxl_<object> helpers are
due to missing type-safety of the argument. A cxl_region is an
explicit type with an implicit relationship to its parent device.

> +               return NULL;
> +
> +       return to_cxl_decoder(d);
> +}
> +
> +static struct cxl_port *get_hostbridge(const struct cxl_memdev *ep)

I'd prefer to call this walk_to_port_level() and take a @depth arg
instead of get_hostbridge(). The walk_ instead of get_ prefix since
this helper does not take a reference (get_device()) on the returned
object, and _to_port_level instead of _hostbridge because this is CXL
subsystem topology independent of the PCIe naming, and will need the
depth support for switch support.

> +{
> +       struct cxl_port *port = ep->port;
> +
> +       while (!is_cxl_root(port)) {
> +               port = to_cxl_port(port->dev.parent);
> +               if (port->depth == 1)
> +                       return port;
> +       }
> +
> +       BUG();

No need to crash, just have the caller handle NULL.

> +       return NULL;
> +}
> +
> +static struct cxl_port *get_root_decoder(const struct cxl_memdev *endpoint)

Why is a function named _decoder returning a 'struct cxl_port *'?

> +{
> +       struct cxl_port *hostbridge = get_hostbridge(endpoint);
> +
> +       if (hostbridge)
> +               return to_cxl_port(hostbridge->dev.parent);
> +
> +       return NULL;
> +}
> +
> +/**
> + * sanitize_region() - Check is region is reasonably configured
> + * @cxlr: The region to check
> + *
> + * Determination as to whether or not a region can possibly be configured is
> + * described in CXL Memory Device SW Guide. In order to implement the algorithms
> + * described there, certain more basic configuration parameters must first need
> + * to be validated. That is accomplished by this function.
> + *
> + * Returns 0 if the region is reasonably configured, else returns a negative
> + * error code.
> + */
> +static int sanitize_region(const struct cxl_region *cxlr)
> +{
> +       const int ig = region_granularity(cxlr);
> +       const int iw = region_ways(cxlr);
> +       int i;
> +
> +       if (dev_WARN_ONCE(&cxlr->dev, !is_cxl_region_configured(cxlr),
> +                         "unconfigured regions can't be probed (race?)\n")) {
> +               return -ENXIO;

Just let unconfigured regions be probed and then fail the probe.

> +       }
> +
> +       /*
> +        * Interleave attributes should be caught by later math, but it's
> +        * easiest to find those issues here, now.
> +        */
> +       if (!cxl_is_interleave_ways_valid(iw)) {
> +               dev_dbg(&cxlr->dev, "Invalid number of ways\n");
> +               return -ENXIO;
> +       }
> +
> +       if (!cxl_is_interleave_granularity_valid(ig)) {
> +               dev_dbg(&cxlr->dev, "Invalid interleave granularity\n");
> +               return -ENXIO;
> +       }
> +
> +       if (cxlr->config.size % (SZ_256M * iw)) {
> +               dev_dbg(&cxlr->dev, "Invalid size. Must be multiple of %uM\n",
> +                       256 * iw);
> +               return -ENXIO;
> +       }
> +

All of the above looks ok modulo also validating at input time.

> +       for (i = 0; i < iw; i++) {
> +               if (!cxlr->config.targets[i]) {
> +                       dev_dbg(&cxlr->dev, "Missing memory device target%u",
> +                               i);
> +                       return -ENXIO;
> +               }
> +               if (!cxlr->config.targets[i]->dev.driver) {

The above is an argument for why the region should track endpoint
decoders as targets not memdevs, because endpoint decoders only exist
while memdevs are enabled. The endpoint decoder unregistration path
can take care to trigger driver detachment of all associated regions
first.

> +                       dev_dbg(&cxlr->dev, "%s isn't CXL.mem capable\n",
> +                               dev_name(&cxlr->config.targets[i]->dev));
> +                       return -ENODEV;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +/**
> + * allocate_address_space() - Gets address space for the region.
> + * @cxlr: The region that will consume the address space
> + */
> +static int allocate_address_space(struct cxl_region *cxlr)
> +{
> +       /* TODO */
> +       return 0;

I'd prefer this function definition just move to the later patch where
this gets implemented.

> +}
> +
> +/**
> + * find_cdat_dsmas() - Find a valid DSMAS for the region
> + * @cxlr: The region
> + */
> +static bool find_cdat_dsmas(const struct cxl_region *cxlr)

CDAT support is a long ways off, too early to include this.
\

> +{
> +       return true;
> +}
> +
> +/**
> + * qtg_match() - Does this root decoder have desirable QTG for the endpoint
> + * @rootd: The root decoder for the region
> + * @endpoint: Endpoint whose QTG is being compared
> + *
> + * Prior to calling this function, the caller should verify that all endpoints
> + * in the region have the same QTG ID.
> + *
> + * Returns true if the QTG ID of the root decoder matches the endpoint
> + */
> +static bool qtg_match(const struct cxl_decoder *rootd,
> +                     const struct cxl_memdev *endpoint)
> +{
> +       /* TODO: */

QTG support is also a long ways away, no need to carry this cruft now.
Also QTG is an ACPI'ism we probably need a CXL core generic term for
this.


> +       return true;
> +}
> +
> +/**
> + * region_xhb_config_valid() - determine cross host bridge validity
> + * @cxlr: The region being programmed
> + * @rootd: The root decoder to check against
> + *
> + * The algorithm is outlined in 2.13.14 "Verify XHB configuration sequence" of
> + * the CXL Memory Device SW Guide (Rev1p0).
> + *
> + * Returns true if the configuration is valid.
> + */
> +static bool region_xhb_config_valid(const struct cxl_region *cxlr,
> +                                   const struct cxl_decoder *rootd)
> +{
> +       /* TODO: */
> +       return true;
> +}
> +
> +/**
> + * region_hb_rp_config_valid() - determine root port ordering is correct
> + * @cxlr: Region to validate
> + * @rootd: root decoder for this @cxlr
> + *
> + * The algorithm is outlined in 2.13.15 "Verify HB root port configuration
> + * sequence" of the CXL Memory Device SW Guide (Rev1p0).

Similar to the feedback on the port patches. The guide is not a spec.
The commentary in the core should be CXL spec relative. Sure you can
paraphrase what the guide is recommending, but the Linux
implementation needs to stand alone from the guide.

> + *
> + * Returns true if the configuration is valid.
> + */
> +static bool region_hb_rp_config_valid(const struct cxl_region *cxlr,
> +                                     const struct cxl_decoder *rootd)
> +{
> +       /* TODO: */
> +       return true;
> +}
> +
> +/**
> + * rootd_contains() - determine if this region can exist in the root decoder
> + * @rootd: root decoder that potentially decodes to this region
> + * @cxlr: region to be routed by the @rootd
> + */
> +static bool rootd_contains(const struct cxl_region *cxlr,
> +                          const struct cxl_decoder *rootd)
> +{
> +       /* TODO: */
> +       return true;
> +}

The short names like xhb, rp, and rootd feel like they could be
spelled out or made more generic in anticipation of switch support.
Depth based naming would seem more generic at first glance.

> +
> +static bool rootd_valid(const struct cxl_region *cxlr,
> +                       const struct cxl_decoder *rootd)
> +{
> +       const struct cxl_memdev *endpoint = cxlr->config.targets[0];
> +
> +       if (!qtg_match(rootd, endpoint))
> +               return false;
> +
> +       if (!cxl_is_pmem_t3(rootd->flags))
> +               return false;
> +
> +       if (!region_xhb_config_valid(cxlr, rootd))
> +               return false;
> +
> +       if (!region_hb_rp_config_valid(cxlr, rootd))
> +               return false;
> +
> +       if (!rootd_contains(cxlr, rootd))
> +               return false;
> +
> +       return true;
> +}
> +
> +struct rootd_context {
> +       const struct cxl_region *cxlr;
> +       struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];
> +       int count;
> +};
> +
> +static int rootd_match(struct device *dev, void *data)
> +{
> +       struct rootd_context *ctx = (struct rootd_context *)data;
> +       const struct cxl_region *cxlr = ctx->cxlr;
> +
> +       if (!is_root_decoder(dev))
> +               return 0;
> +
> +       return !!rootd_valid(cxlr, to_cxl_decoder(dev));
> +}
> +
> +/*
> + * This is a roughly equivalent implementation to "Figure 45 - High-level
> + * sequence: Finding CFMWS for region" from the CXL Memory Device SW Guide
> + * Rev1p0.
> + */
> +static struct cxl_decoder *find_rootd(const struct cxl_region *cxlr,
> +                                     const struct cxl_port *root)
> +{
> +       struct rootd_context ctx;
> +       struct device *ret;
> +
> +       ctx.cxlr = cxlr;
> +
> +       ret = device_find_child((struct device *)&root->dev, &ctx, rootd_match);
> +       if (ret)
> +               return to_cxl_decoder(ret);
> +
> +       return NULL;
> +}
> +
> +static int collect_ep_decoders(const struct cxl_region *cxlr)
> +{
> +       /* TODO: */
> +       return 0;
> +}
> +
> +static int bind_region(const struct cxl_region *cxlr)
> +{
> +       /* TODO: */
> +       return 0;

More too early to include helpers.

> +}
> +
> +static int cxl_region_probe(struct device *dev)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +       struct cxl_port *root_port;
> +       struct cxl_decoder *rootd, *ours;
> +       int ret;
> +
> +       device_lock_assert(&cxlr->dev);

This is a driver-core guarantee, no need to assert.

> +
> +       if (cxlr->active)
> +               return 0;

Given cxlr->active can be replaced by cxlr->dev.driver this can't happen.

> +
> +       if (uuid_is_null(&cxlr->config.uuid))
> +               uuid_gen(&cxlr->config.uuid);

Too late to generate a uuid.

> +
> +       /* TODO: What about volatile, and LSA generated regions? */
> +
> +       ret = sanitize_region(cxlr);
> +       if (ret)
> +               return ret;
> +
> +       ret = allocate_address_space(cxlr);
> +       if (ret)
> +               return ret;

I expect address space to be allocated before the region is activated
so that resource conflicts are negotiated early.

> +
> +       if (!find_cdat_dsmas(cxlr))
> +               return -ENXIO;
> +
> +       rootd = rootd_from_region(cxlr);
> +       if (!rootd) {
> +               dev_err(dev, "Couldn't find root decoder\n");
> +               return -ENXIO;
> +       }
> +
> +       if (!rootd_valid(cxlr, rootd)) {
> +               dev_err(dev, "Picked invalid rootd\n");
> +               return -ENXIO;
> +       }
> +
> +       root_port = get_root_decoder(cxlr->config.targets[0]);
> +       ours = find_rootd(cxlr, root_port);
> +       if (ours != rootd)
> +               dev_dbg(dev, "Picked different rootd %s %s\n",
> +                       dev_name(&rootd->dev), dev_name(&ours->dev));
> +       if (ours)
> +               put_device(&ours->dev);
> +
> +       ret = collect_ep_decoders(cxlr);
> +       if (ret)
> +               return ret;
> +
> +       ret = bind_region(cxlr);

Not sure what this is supposed to do...

> +       if (!ret) {
> +               cxlr->active = true;
> +               dev_info(dev, "Bound");

Not a useful log message...

> +       }
> +
> +       return ret;
> +}
> +
> +static struct cxl_driver cxl_region_driver = {
> +       .name = "cxl_region",
> +       .probe = cxl_region_probe,
> +       .id = CXL_DEVICE_REGION,
> +};
> +module_cxl_driver(cxl_region_driver);
> +
> +MODULE_LICENSE("GPL v2");
> +MODULE_IMPORT_NS(CXL);
> +MODULE_ALIAS_CXL(CXL_DEVICE_REGION);
> diff --git a/drivers/cxl/region.h b/drivers/cxl/region.h
> index eb1249e3c1d4..00a6dc729c26 100644
> --- a/drivers/cxl/region.h
> +++ b/drivers/cxl/region.h
> @@ -13,6 +13,7 @@
>   * @id: This regions id. Id is globally unique across all regions.
>   * @list: Node in decoder's region list.
>   * @res: Resource this region carves out of the platform decode range.
> + * @active: If the region has been activated.
>   * @config: HDM decoder program config
>   * @config.size: Size of the region determined from LSA or userspace.
>   * @config.uuid: The UUID for this region.
> @@ -25,6 +26,7 @@ struct cxl_region {
>         int id;
>         struct list_head list;
>         struct resource *res;
> +       bool active;
>
>         struct {
>                 u64 size;
> @@ -35,4 +37,6 @@ struct cxl_region {
>         } config;
>  };
>
> +bool is_cxl_region_configured(const struct cxl_region *cxlr);
> +
>  #endif
> --
> 2.35.0
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH v4 01/14] cxl/region: Add region creation ABI
  2022-01-28  0:26 ` [PATCH v3 01/14] cxl/region: Add region creation ABI Ben Widawsky
  2022-01-28 18:14   ` Dan Williams
  2022-02-01 15:53   ` Jonathan Cameron
@ 2022-02-17 17:10   ` Ben Widawsky
  2022-02-17 17:19     ` [PATCH v5 01/15] " Ben Widawsky
  2 siblings, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-02-17 17:10 UTC (permalink / raw)
  To: linux-cxl
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

Regions are created as a child of the decoder that encompasses an
address space with constraints. Regions have a number of attributes that
must be configured before the region can be activated.

The ABI is not meant to be secure, but is meant to avoid accidental
races. As a result, a buggy process may create a region by name that was
allocated by a different process. However, multiple processes which are
trying not to race with each other shouldn't need special
synchronization to do so.

// Allocate a new region name
region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)

// Create a new region by name
while
region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
! echo $region > /sys/bus/cxl/devices/decoder0.0/create_region
do true; done

// Region now exists in sysfs
stat -t /sys/bus/cxl/devices/decoder0.0/$region

// Delete the region, and name
echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>

---
Changes since v3:
- Change ABI for creation (cache a next) (Dan)
- For above, Use ratelimited dev_err for new ABI
- Use devm management for regions (Dan)
- Move device initialization to alloc (Dan)
- Move device alloc to add
- Change functions naming to use foo_bar_alloc instead of foo_alloc_bar
- Update commit message example for ABI change (Dan)
- Update ABI documentation (Dan)
- Update copyright date (Dan)
- Update region kdoc (Dan)
- Remove unnecessary WARN in region creation (Dan)
- Remove is_cxl_region() (for now) (Jonathan)
- Check driver binding to avoid double free of region (Dan)
---
 Documentation/ABI/testing/sysfs-bus-cxl       |  23 ++
 .../driver-api/cxl/memory-devices.rst         |  11 +
 drivers/cxl/core/Makefile                     |   1 +
 drivers/cxl/core/core.h                       |   3 +
 drivers/cxl/core/port.c                       |  11 +
 drivers/cxl/core/region.c                     | 213 ++++++++++++++++++
 drivers/cxl/cxl.h                             |   5 +
 drivers/cxl/region.h                          |  23 ++
 tools/testing/cxl/Kbuild                      |   1 +
 9 files changed, 291 insertions(+)
 create mode 100644 drivers/cxl/core/region.c
 create mode 100644 drivers/cxl/region.h

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 7c2b846521f3..e5db45ea70ad 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -163,3 +163,26 @@ Description:
 		memory (type-3). The 'target_type' attribute indicates the
 		current setting which may dynamically change based on what
 		memory regions are activated in this decode hierarchy.
+
+What:		/sys/bus/cxl/devices/decoderX.Y/create_region
+Date:		January, 2022
+KernelVersion:	v5.18
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		Write a value of the form 'regionX.Y:Z' to instantiate a new
+		region within the decode range bounded by decoderX.Y. The value
+		written must match the current value returned from reading this
+		attribute. This behavior lets the kernel arbitrate racing
+		attempts to create a region. The thread that fails to write
+		loops and tries the next value. Regions must be created for root
+		decoders, and must subsequently configured and bound to a region
+		driver before they can be used.
+
+What:		/sys/bus/cxl/devices/decoderX.Y/delete_region
+Date:		January, 2022
+KernelVersion:	v5.18
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		Deletes the named region.  The attribute expects a region in the
+		form "regionX.Y:Z". The region's name, allocated by reading
+		create_region, will also be released.
diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst
index db476bb170b6..66ddc58a21b1 100644
--- a/Documentation/driver-api/cxl/memory-devices.rst
+++ b/Documentation/driver-api/cxl/memory-devices.rst
@@ -362,6 +362,17 @@ CXL Core
 .. kernel-doc:: drivers/cxl/core/mbox.c
    :doc: cxl mbox
 
+CXL Regions
+-----------
+.. kernel-doc:: drivers/cxl/region.h
+   :identifiers:
+
+.. kernel-doc:: drivers/cxl/core/region.c
+   :doc: cxl core region
+
+.. kernel-doc:: drivers/cxl/core/region.c
+   :identifiers:
+
 External Interfaces
 ===================
 
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 6d37cd78b151..39ce8f2f2373 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -4,6 +4,7 @@ obj-$(CONFIG_CXL_BUS) += cxl_core.o
 ccflags-y += -I$(srctree)/drivers/cxl
 cxl_core-y := port.o
 cxl_core-y += pmem.o
+cxl_core-y += region.o
 cxl_core-y += regs.o
 cxl_core-y += memdev.o
 cxl_core-y += mbox.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 1a50c0fc399c..adfd42370b28 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -9,6 +9,9 @@ extern const struct device_type cxl_nvdimm_type;
 
 extern struct attribute_group cxl_base_attribute_group;
 
+extern struct device_attribute dev_attr_create_region;
+extern struct device_attribute dev_attr_delete_region;
+
 struct cxl_send_command;
 struct cxl_mem_query_commands;
 int cxl_query_cmd(struct cxl_memdev *cxlmd,
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 1e785a3affaa..860e91cae29b 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -213,6 +213,8 @@ static struct attribute_group cxl_decoder_base_attribute_group = {
 };
 
 static struct attribute *cxl_decoder_root_attrs[] = {
+	&dev_attr_create_region.attr,
+	&dev_attr_delete_region.attr,
 	&dev_attr_cap_pmem.attr,
 	&dev_attr_cap_ram.attr,
 	&dev_attr_cap_type2.attr,
@@ -270,6 +272,8 @@ static void cxl_decoder_release(struct device *dev)
 	struct cxl_decoder *cxld = to_cxl_decoder(dev);
 	struct cxl_port *port = to_cxl_port(dev->parent);
 
+	ida_free(&cxld->region_ida, cxld->next_region_id);
+	ida_destroy(&cxld->region_ida);
 	ida_free(&port->decoder_ida, cxld->id);
 	kfree(cxld);
 }
@@ -1244,6 +1248,13 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
 	cxld->target_type = CXL_DECODER_EXPANDER;
 	cxld->platform_res = (struct resource)DEFINE_RES_MEM(0, 0);
 
+	mutex_init(&cxld->id_lock);
+	ida_init(&cxld->region_ida);
+	rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);
+	if (rc < 0)
+		goto err;
+
+	cxld->next_region_id = rc;
 	return cxld;
 err:
 	kfree(cxld);
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
new file mode 100644
index 000000000000..5576952e4aa1
--- /dev/null
+++ b/drivers/cxl/core/region.c
@@ -0,0 +1,213 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2022 Intel Corporation. All rights reserved. */
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/idr.h>
+#include <region.h>
+#include <cxl.h>
+#include "core.h"
+
+/**
+ * DOC: cxl core region
+ *
+ * CXL Regions represent mapped memory capacity in system physical address
+ * space. Whereas the CXL Root Decoders identify the bounds of potential CXL
+ * Memory ranges, Regions represent the active mapped capacity by the HDM
+ * Decoder Capability structures throughout the Host Bridges, Switches, and
+ * Endpoints in the topology.
+ */
+
+static void cxl_region_release(struct device *dev);
+
+static const struct device_type cxl_region_type = {
+	.name = "cxl_region",
+	.release = cxl_region_release,
+};
+
+static struct cxl_region *to_cxl_region(struct device *dev)
+{
+	if (dev_WARN_ONCE(dev, dev->type != &cxl_region_type,
+			  "not a cxl_region device\n"))
+		return NULL;
+
+	return container_of(dev, struct cxl_region, dev);
+}
+
+static struct cxl_region *cxl_region_alloc(struct cxl_decoder *cxld)
+{
+	struct cxl_region *cxlr;
+	struct device *dev;
+
+	cxlr = kzalloc(sizeof(*cxlr), GFP_KERNEL);
+	if (!cxlr)
+		return ERR_PTR(-ENOMEM);
+
+	dev = &cxlr->dev;
+	device_initialize(dev);
+	dev->parent = &cxld->dev;
+	device_set_pm_not_required(dev);
+	dev->bus = &cxl_bus_type;
+	dev->type = &cxl_region_type;
+
+	return cxlr;
+}
+
+static void unregister_region(void *_cxlr)
+{
+	struct cxl_region *cxlr = _cxlr;
+
+	if (!test_and_set_bit(REGION_DEAD, &cxlr->flags))
+		device_unregister(&cxlr->dev);
+}
+
+/**
+ * devm_cxl_add_region - Adds a region to a decoder
+ * @cxld: Parent decoder.
+ * @cxlr: Region to be added to the decoder.
+ *
+ * This is the second step of region initialization. Regions exist within an
+ * address space which is mapped by a @cxld. That @cxld must be a root decoder,
+ * and it enforces constraints upon the region as it is configured.
+ *
+ * Return: 0 if the region was added to the @cxld, else returns negative error
+ * code. The region will be named "regionX.Y.Z" where X is the port, Y is the
+ * decoder id, and Z is the region number.
+ */
+static struct cxl_region *devm_cxl_add_region(struct cxl_decoder *cxld)
+{
+	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
+	struct cxl_region *cxlr;
+	struct device *dev;
+	int rc;
+
+	cxlr = cxl_region_alloc(cxld);
+	if (IS_ERR(cxlr))
+		return cxlr;
+
+	dev = &cxlr->dev;
+
+	cxlr->id = cxld->next_region_id;
+	rc = dev_set_name(dev, "region%d.%d:%d", port->id, cxld->id, cxlr->id);
+	if (rc)
+		goto err_out;
+
+	/* affirm that release will have access to the decoder's region ida  */
+	get_device(&cxld->dev);
+
+	rc = device_add(dev);
+	if (!rc)
+		rc = devm_add_action_or_reset(port->uport, unregister_region,
+					      cxlr);
+	if (rc)
+		goto err_out;
+
+	return cxlr;
+
+err_out:
+	put_device(dev);
+	kfree(cxlr);
+	return ERR_PTR(rc);
+}
+
+static ssize_t create_region_show(struct device *dev,
+				  struct device_attribute *attr, char *buf)
+{
+	struct cxl_port *port = to_cxl_port(dev->parent);
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+
+	return sysfs_emit(buf, "region%d.%d:%d\n", port->id, cxld->id,
+			  cxld->next_region_id);
+}
+
+static ssize_t create_region_store(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t len)
+{
+	struct cxl_port *port = to_cxl_port(dev->parent);
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+	struct cxl_region *cxlr;
+	int d, p, r, rc = 0;
+
+	if (sscanf(buf, "region%d.%d:%d", &p, &d, &r) != 3)
+		return -EINVAL;
+
+	if (port->id != p || cxld->id != d)
+		return -EINVAL;
+
+	rc = mutex_lock_interruptible(&cxld->id_lock);
+	if (rc)
+		return rc;
+
+	if (cxld->next_region_id != r) {
+		rc = -EINVAL;
+		goto out;
+	}
+
+	rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);
+	if (rc < 0) {
+		dev_dbg(dev, "Failed to get next cached id (%d)\n", rc);
+		goto out;
+	}
+
+	cxlr = devm_cxl_add_region(cxld);
+	if (IS_ERR(cxlr)) {
+		rc = PTR_ERR(cxlr);
+		goto out;
+	}
+
+	cxld->next_region_id = rc;
+	dev_dbg(dev, "Created %s\n", dev_name(&cxlr->dev));
+
+out:
+	mutex_unlock(&cxld->id_lock);
+	return rc ? rc : len;
+}
+DEVICE_ATTR_RW(create_region);
+
+static struct cxl_region *cxl_find_region_by_name(struct cxl_decoder *cxld,
+						  const char *name)
+{
+	struct device *region_dev;
+
+	region_dev = device_find_child_by_name(&cxld->dev, name);
+	if (!region_dev)
+		return ERR_PTR(-ENOENT);
+
+	return to_cxl_region(region_dev);
+}
+
+static ssize_t delete_region_store(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t len)
+{
+	struct cxl_port *port = to_cxl_port(dev->parent);
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+	struct cxl_region *cxlr;
+
+	cxlr = cxl_find_region_by_name(cxld, buf);
+	if (IS_ERR(cxlr))
+		return PTR_ERR(cxlr);
+
+	/* After this, the region is no longer a child of the decoder. */
+	devm_release_action(port->uport, unregister_region, cxlr);
+
+	/* Release is likely called here, so cxlr is not safe to reference. */
+	put_device(&cxlr->dev);
+	cxlr = NULL;
+
+	dev_dbg(dev, "Deleted %s\n", buf);
+	return len;
+}
+DEVICE_ATTR_WO(delete_region);
+
+static void cxl_region_release(struct device *dev)
+{
+	struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	dev_dbg(&cxld->dev, "Releasing %s\n", dev_name(dev));
+	ida_free(&cxld->region_ida, cxlr->id);
+	kfree(cxlr);
+	put_device(&cxld->dev);
+}
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index b4047a310340..d5397f7dfcf4 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -221,6 +221,8 @@ enum cxl_decoder_type {
  * @target_type: accelerator vs expander (type2 vs type3) selector
  * @flags: memory type capabilities and locking
  * @target_lock: coordinate coherent reads of the target list
+ * @region_ida: allocator for region ids.
+ * @next_region_id: Cached region id for next region.
  * @nr_targets: number of elements in @target
  * @target: active ordered target list in current decoder configuration
  */
@@ -236,6 +238,9 @@ struct cxl_decoder {
 	enum cxl_decoder_type target_type;
 	unsigned long flags;
 	seqlock_t target_lock;
+	struct mutex id_lock;
+	struct ida region_ida;
+	int next_region_id;
 	int nr_targets;
 	struct cxl_dport *target[];
 };
diff --git a/drivers/cxl/region.h b/drivers/cxl/region.h
new file mode 100644
index 000000000000..0016f83bbdfd
--- /dev/null
+++ b/drivers/cxl/region.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright(c) 2021 Intel Corporation. */
+#ifndef __CXL_REGION_H__
+#define __CXL_REGION_H__
+
+#include <linux/uuid.h>
+
+#include "cxl.h"
+
+/**
+ * struct cxl_region - CXL region
+ * @dev: This region's device.
+ * @id: This region's id. Id is globally unique across all regions.
+ * @flags: Flags representing the current state of the region.
+ */
+struct cxl_region {
+	struct device dev;
+	int id;
+	unsigned long flags;
+#define REGION_DEAD 0
+};
+
+#endif
diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
index 82e49ab0937d..3fe6d34e6d59 100644
--- a/tools/testing/cxl/Kbuild
+++ b/tools/testing/cxl/Kbuild
@@ -46,6 +46,7 @@ cxl_core-y += $(CXL_CORE_SRC)/memdev.o
 cxl_core-y += $(CXL_CORE_SRC)/mbox.o
 cxl_core-y += $(CXL_CORE_SRC)/pci.o
 cxl_core-y += $(CXL_CORE_SRC)/hdm.o
+cxl_core-y += $(CXL_CORE_SRC)/region.o
 cxl_core-y += config_check.o
 
 obj-m += test/

base-commit: 3bdf187d313e067de2a81109f9a1dd3da7f3dc2c
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v5 01/15] cxl/region: Add region creation ABI
  2022-02-17 17:10   ` [PATCH v4 " Ben Widawsky
@ 2022-02-17 17:19     ` Ben Widawsky
  2022-02-17 17:33       ` Ben Widawsky
  2022-02-17 17:58       ` Dan Williams
  0 siblings, 2 replies; 70+ messages in thread
From: Ben Widawsky @ 2022-02-17 17:19 UTC (permalink / raw)
  To: linux-cxl
  Cc: patches, Ben Widawsky, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

Regions are created as a child of the decoder that encompasses an
address space with constraints. Regions have a number of attributes that
must be configured before the region can be activated.

The ABI is not meant to be secure, but is meant to avoid accidental
races. As a result, a buggy process may create a region by name that was
allocated by a different process. However, multiple processes which are
trying not to race with each other shouldn't need special
synchronization to do so.

// Allocate a new region name
region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)

// Create a new region by name
while
region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
! echo $region > /sys/bus/cxl/devices/decoder0.0/create_region
do true; done

// Region now exists in sysfs
stat -t /sys/bus/cxl/devices/decoder0.0/$region

// Delete the region, and name
echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>

---
Changes since v4:
- Add the missed base attributes addition

---
 Documentation/ABI/testing/sysfs-bus-cxl       |  23 ++
 .../driver-api/cxl/memory-devices.rst         |  11 +
 drivers/cxl/core/Makefile                     |   1 +
 drivers/cxl/core/core.h                       |   3 +
 drivers/cxl/core/port.c                       |  11 +
 drivers/cxl/core/region.c                     | 213 ++++++++++++++++++
 drivers/cxl/cxl.h                             |   5 +
 drivers/cxl/region.h                          |  23 ++
 tools/testing/cxl/Kbuild                      |   1 +
 9 files changed, 291 insertions(+)
 create mode 100644 drivers/cxl/core/region.c
 create mode 100644 drivers/cxl/region.h

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 7c2b846521f3..e5db45ea70ad 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -163,3 +163,26 @@ Description:
 		memory (type-3). The 'target_type' attribute indicates the
 		current setting which may dynamically change based on what
 		memory regions are activated in this decode hierarchy.
+
+What:		/sys/bus/cxl/devices/decoderX.Y/create_region
+Date:		January, 2022
+KernelVersion:	v5.18
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		Write a value of the form 'regionX.Y:Z' to instantiate a new
+		region within the decode range bounded by decoderX.Y. The value
+		written must match the current value returned from reading this
+		attribute. This behavior lets the kernel arbitrate racing
+		attempts to create a region. The thread that fails to write
+		loops and tries the next value. Regions must be created for root
+		decoders, and must subsequently configured and bound to a region
+		driver before they can be used.
+
+What:		/sys/bus/cxl/devices/decoderX.Y/delete_region
+Date:		January, 2022
+KernelVersion:	v5.18
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		Deletes the named region.  The attribute expects a region in the
+		form "regionX.Y:Z". The region's name, allocated by reading
+		create_region, will also be released.
diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst
index db476bb170b6..66ddc58a21b1 100644
--- a/Documentation/driver-api/cxl/memory-devices.rst
+++ b/Documentation/driver-api/cxl/memory-devices.rst
@@ -362,6 +362,17 @@ CXL Core
 .. kernel-doc:: drivers/cxl/core/mbox.c
    :doc: cxl mbox
 
+CXL Regions
+-----------
+.. kernel-doc:: drivers/cxl/region.h
+   :identifiers:
+
+.. kernel-doc:: drivers/cxl/core/region.c
+   :doc: cxl core region
+
+.. kernel-doc:: drivers/cxl/core/region.c
+   :identifiers:
+
 External Interfaces
 ===================
 
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 6d37cd78b151..39ce8f2f2373 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -4,6 +4,7 @@ obj-$(CONFIG_CXL_BUS) += cxl_core.o
 ccflags-y += -I$(srctree)/drivers/cxl
 cxl_core-y := port.o
 cxl_core-y += pmem.o
+cxl_core-y += region.o
 cxl_core-y += regs.o
 cxl_core-y += memdev.o
 cxl_core-y += mbox.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 1a50c0fc399c..adfd42370b28 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -9,6 +9,9 @@ extern const struct device_type cxl_nvdimm_type;
 
 extern struct attribute_group cxl_base_attribute_group;
 
+extern struct device_attribute dev_attr_create_region;
+extern struct device_attribute dev_attr_delete_region;
+
 struct cxl_send_command;
 struct cxl_mem_query_commands;
 int cxl_query_cmd(struct cxl_memdev *cxlmd,
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 1e785a3affaa..860e91cae29b 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -213,6 +213,8 @@ static struct attribute_group cxl_decoder_base_attribute_group = {
 };
 
 static struct attribute *cxl_decoder_root_attrs[] = {
+	&dev_attr_create_region.attr,
+	&dev_attr_delete_region.attr,
 	&dev_attr_cap_pmem.attr,
 	&dev_attr_cap_ram.attr,
 	&dev_attr_cap_type2.attr,
@@ -270,6 +272,8 @@ static void cxl_decoder_release(struct device *dev)
 	struct cxl_decoder *cxld = to_cxl_decoder(dev);
 	struct cxl_port *port = to_cxl_port(dev->parent);
 
+	ida_free(&cxld->region_ida, cxld->next_region_id);
+	ida_destroy(&cxld->region_ida);
 	ida_free(&port->decoder_ida, cxld->id);
 	kfree(cxld);
 }
@@ -1244,6 +1248,13 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
 	cxld->target_type = CXL_DECODER_EXPANDER;
 	cxld->platform_res = (struct resource)DEFINE_RES_MEM(0, 0);
 
+	mutex_init(&cxld->id_lock);
+	ida_init(&cxld->region_ida);
+	rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);
+	if (rc < 0)
+		goto err;
+
+	cxld->next_region_id = rc;
 	return cxld;
 err:
 	kfree(cxld);
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
new file mode 100644
index 000000000000..5576952e4aa1
--- /dev/null
+++ b/drivers/cxl/core/region.c
@@ -0,0 +1,213 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2022 Intel Corporation. All rights reserved. */
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/idr.h>
+#include <region.h>
+#include <cxl.h>
+#include "core.h"
+
+/**
+ * DOC: cxl core region
+ *
+ * CXL Regions represent mapped memory capacity in system physical address
+ * space. Whereas the CXL Root Decoders identify the bounds of potential CXL
+ * Memory ranges, Regions represent the active mapped capacity by the HDM
+ * Decoder Capability structures throughout the Host Bridges, Switches, and
+ * Endpoints in the topology.
+ */
+
+static void cxl_region_release(struct device *dev);
+
+static const struct device_type cxl_region_type = {
+	.name = "cxl_region",
+	.release = cxl_region_release,
+};
+
+static struct cxl_region *to_cxl_region(struct device *dev)
+{
+	if (dev_WARN_ONCE(dev, dev->type != &cxl_region_type,
+			  "not a cxl_region device\n"))
+		return NULL;
+
+	return container_of(dev, struct cxl_region, dev);
+}
+
+static struct cxl_region *cxl_region_alloc(struct cxl_decoder *cxld)
+{
+	struct cxl_region *cxlr;
+	struct device *dev;
+
+	cxlr = kzalloc(sizeof(*cxlr), GFP_KERNEL);
+	if (!cxlr)
+		return ERR_PTR(-ENOMEM);
+
+	dev = &cxlr->dev;
+	device_initialize(dev);
+	dev->parent = &cxld->dev;
+	device_set_pm_not_required(dev);
+	dev->bus = &cxl_bus_type;
+	dev->type = &cxl_region_type;
+
+	return cxlr;
+}
+
+static void unregister_region(void *_cxlr)
+{
+	struct cxl_region *cxlr = _cxlr;
+
+	if (!test_and_set_bit(REGION_DEAD, &cxlr->flags))
+		device_unregister(&cxlr->dev);
+}
+
+/**
+ * devm_cxl_add_region - Adds a region to a decoder
+ * @cxld: Parent decoder.
+ * @cxlr: Region to be added to the decoder.
+ *
+ * This is the second step of region initialization. Regions exist within an
+ * address space which is mapped by a @cxld. That @cxld must be a root decoder,
+ * and it enforces constraints upon the region as it is configured.
+ *
+ * Return: 0 if the region was added to the @cxld, else returns negative error
+ * code. The region will be named "regionX.Y.Z" where X is the port, Y is the
+ * decoder id, and Z is the region number.
+ */
+static struct cxl_region *devm_cxl_add_region(struct cxl_decoder *cxld)
+{
+	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
+	struct cxl_region *cxlr;
+	struct device *dev;
+	int rc;
+
+	cxlr = cxl_region_alloc(cxld);
+	if (IS_ERR(cxlr))
+		return cxlr;
+
+	dev = &cxlr->dev;
+
+	cxlr->id = cxld->next_region_id;
+	rc = dev_set_name(dev, "region%d.%d:%d", port->id, cxld->id, cxlr->id);
+	if (rc)
+		goto err_out;
+
+	/* affirm that release will have access to the decoder's region ida  */
+	get_device(&cxld->dev);
+
+	rc = device_add(dev);
+	if (!rc)
+		rc = devm_add_action_or_reset(port->uport, unregister_region,
+					      cxlr);
+	if (rc)
+		goto err_out;
+
+	return cxlr;
+
+err_out:
+	put_device(dev);
+	kfree(cxlr);
+	return ERR_PTR(rc);
+}
+
+static ssize_t create_region_show(struct device *dev,
+				  struct device_attribute *attr, char *buf)
+{
+	struct cxl_port *port = to_cxl_port(dev->parent);
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+
+	return sysfs_emit(buf, "region%d.%d:%d\n", port->id, cxld->id,
+			  cxld->next_region_id);
+}
+
+static ssize_t create_region_store(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t len)
+{
+	struct cxl_port *port = to_cxl_port(dev->parent);
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+	struct cxl_region *cxlr;
+	int d, p, r, rc = 0;
+
+	if (sscanf(buf, "region%d.%d:%d", &p, &d, &r) != 3)
+		return -EINVAL;
+
+	if (port->id != p || cxld->id != d)
+		return -EINVAL;
+
+	rc = mutex_lock_interruptible(&cxld->id_lock);
+	if (rc)
+		return rc;
+
+	if (cxld->next_region_id != r) {
+		rc = -EINVAL;
+		goto out;
+	}
+
+	rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);
+	if (rc < 0) {
+		dev_dbg(dev, "Failed to get next cached id (%d)\n", rc);
+		goto out;
+	}
+
+	cxlr = devm_cxl_add_region(cxld);
+	if (IS_ERR(cxlr)) {
+		rc = PTR_ERR(cxlr);
+		goto out;
+	}
+
+	cxld->next_region_id = rc;
+	dev_dbg(dev, "Created %s\n", dev_name(&cxlr->dev));
+
+out:
+	mutex_unlock(&cxld->id_lock);
+	return rc ? rc : len;
+}
+DEVICE_ATTR_RW(create_region);
+
+static struct cxl_region *cxl_find_region_by_name(struct cxl_decoder *cxld,
+						  const char *name)
+{
+	struct device *region_dev;
+
+	region_dev = device_find_child_by_name(&cxld->dev, name);
+	if (!region_dev)
+		return ERR_PTR(-ENOENT);
+
+	return to_cxl_region(region_dev);
+}
+
+static ssize_t delete_region_store(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t len)
+{
+	struct cxl_port *port = to_cxl_port(dev->parent);
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+	struct cxl_region *cxlr;
+
+	cxlr = cxl_find_region_by_name(cxld, buf);
+	if (IS_ERR(cxlr))
+		return PTR_ERR(cxlr);
+
+	/* After this, the region is no longer a child of the decoder. */
+	devm_release_action(port->uport, unregister_region, cxlr);
+
+	/* Release is likely called here, so cxlr is not safe to reference. */
+	put_device(&cxlr->dev);
+	cxlr = NULL;
+
+	dev_dbg(dev, "Deleted %s\n", buf);
+	return len;
+}
+DEVICE_ATTR_WO(delete_region);
+
+static void cxl_region_release(struct device *dev)
+{
+	struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
+	struct cxl_region *cxlr = to_cxl_region(dev);
+
+	dev_dbg(&cxld->dev, "Releasing %s\n", dev_name(dev));
+	ida_free(&cxld->region_ida, cxlr->id);
+	kfree(cxlr);
+	put_device(&cxld->dev);
+}
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index b4047a310340..d5397f7dfcf4 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -221,6 +221,8 @@ enum cxl_decoder_type {
  * @target_type: accelerator vs expander (type2 vs type3) selector
  * @flags: memory type capabilities and locking
  * @target_lock: coordinate coherent reads of the target list
+ * @region_ida: allocator for region ids.
+ * @next_region_id: Cached region id for next region.
  * @nr_targets: number of elements in @target
  * @target: active ordered target list in current decoder configuration
  */
@@ -236,6 +238,9 @@ struct cxl_decoder {
 	enum cxl_decoder_type target_type;
 	unsigned long flags;
 	seqlock_t target_lock;
+	struct mutex id_lock;
+	struct ida region_ida;
+	int next_region_id;
 	int nr_targets;
 	struct cxl_dport *target[];
 };
diff --git a/drivers/cxl/region.h b/drivers/cxl/region.h
new file mode 100644
index 000000000000..0016f83bbdfd
--- /dev/null
+++ b/drivers/cxl/region.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright(c) 2021 Intel Corporation. */
+#ifndef __CXL_REGION_H__
+#define __CXL_REGION_H__
+
+#include <linux/uuid.h>
+
+#include "cxl.h"
+
+/**
+ * struct cxl_region - CXL region
+ * @dev: This region's device.
+ * @id: This region's id. Id is globally unique across all regions.
+ * @flags: Flags representing the current state of the region.
+ */
+struct cxl_region {
+	struct device dev;
+	int id;
+	unsigned long flags;
+#define REGION_DEAD 0
+};
+
+#endif
diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
index 82e49ab0937d..3fe6d34e6d59 100644
--- a/tools/testing/cxl/Kbuild
+++ b/tools/testing/cxl/Kbuild
@@ -46,6 +46,7 @@ cxl_core-y += $(CXL_CORE_SRC)/memdev.o
 cxl_core-y += $(CXL_CORE_SRC)/mbox.o
 cxl_core-y += $(CXL_CORE_SRC)/pci.o
 cxl_core-y += $(CXL_CORE_SRC)/hdm.o
+cxl_core-y += $(CXL_CORE_SRC)/region.o
 cxl_core-y += config_check.o
 
 obj-m += test/
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 01/15] cxl/region: Add region creation ABI
  2022-02-17 17:19     ` [PATCH v5 01/15] " Ben Widawsky
@ 2022-02-17 17:33       ` Ben Widawsky
  2022-02-17 17:58       ` Dan Williams
  1 sibling, 0 replies; 70+ messages in thread
From: Ben Widawsky @ 2022-02-17 17:33 UTC (permalink / raw)
  To: linux-cxl
  Cc: patches, Alison Schofield, Dan Williams, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

On 22-02-17 09:19:31, Ben Widawsky wrote:
> Regions are created as a child of the decoder that encompasses an
> address space with constraints. Regions have a number of attributes that
> must be configured before the region can be activated.
> 
> The ABI is not meant to be secure, but is meant to avoid accidental
> races. As a result, a buggy process may create a region by name that was
> allocated by a different process. However, multiple processes which are
> trying not to race with each other shouldn't need special
> synchronization to do so.
> 
> // Allocate a new region name
> region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> 
> // Create a new region by name
> while
> region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> ! echo $region > /sys/bus/cxl/devices/decoder0.0/create_region
> do true; done
> 
> // Region now exists in sysfs
> stat -t /sys/bus/cxl/devices/decoder0.0/$region
> 
> // Delete the region, and name
> echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region
> 
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> 
> ---
> Changes since v4:
> - Add the missed base attributes addition
> 
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl       |  23 ++
>  .../driver-api/cxl/memory-devices.rst         |  11 +
>  drivers/cxl/core/Makefile                     |   1 +
>  drivers/cxl/core/core.h                       |   3 +
>  drivers/cxl/core/port.c                       |  11 +
>  drivers/cxl/core/region.c                     | 213 ++++++++++++++++++
>  drivers/cxl/cxl.h                             |   5 +
>  drivers/cxl/region.h                          |  23 ++
>  tools/testing/cxl/Kbuild                      |   1 +
>  9 files changed, 291 insertions(+)
>  create mode 100644 drivers/cxl/core/region.c
>  create mode 100644 drivers/cxl/region.h
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 7c2b846521f3..e5db45ea70ad 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -163,3 +163,26 @@ Description:
>  		memory (type-3). The 'target_type' attribute indicates the
>  		current setting which may dynamically change based on what
>  		memory regions are activated in this decode hierarchy.
> +
> +What:		/sys/bus/cxl/devices/decoderX.Y/create_region
> +Date:		January, 2022
> +KernelVersion:	v5.18
> +Contact:	linux-cxl@vger.kernel.org
> +Description:
> +		Write a value of the form 'regionX.Y:Z' to instantiate a new
> +		region within the decode range bounded by decoderX.Y. The value
> +		written must match the current value returned from reading this
> +		attribute. This behavior lets the kernel arbitrate racing
> +		attempts to create a region. The thread that fails to write
> +		loops and tries the next value. Regions must be created for root
> +		decoders, and must subsequently configured and bound to a region
> +		driver before they can be used.
> +
> +What:		/sys/bus/cxl/devices/decoderX.Y/delete_region
> +Date:		January, 2022
> +KernelVersion:	v5.18
> +Contact:	linux-cxl@vger.kernel.org
> +Description:
> +		Deletes the named region.  The attribute expects a region in the
> +		form "regionX.Y:Z". The region's name, allocated by reading
> +		create_region, will also be released.
> diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst
> index db476bb170b6..66ddc58a21b1 100644
> --- a/Documentation/driver-api/cxl/memory-devices.rst
> +++ b/Documentation/driver-api/cxl/memory-devices.rst
> @@ -362,6 +362,17 @@ CXL Core
>  .. kernel-doc:: drivers/cxl/core/mbox.c
>     :doc: cxl mbox
>  
> +CXL Regions
> +-----------
> +.. kernel-doc:: drivers/cxl/region.h
> +   :identifiers:
> +
> +.. kernel-doc:: drivers/cxl/core/region.c
> +   :doc: cxl core region
> +
> +.. kernel-doc:: drivers/cxl/core/region.c
> +   :identifiers:
> +
>  External Interfaces
>  ===================
>  
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 6d37cd78b151..39ce8f2f2373 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -4,6 +4,7 @@ obj-$(CONFIG_CXL_BUS) += cxl_core.o
>  ccflags-y += -I$(srctree)/drivers/cxl
>  cxl_core-y := port.o
>  cxl_core-y += pmem.o
> +cxl_core-y += region.o
>  cxl_core-y += regs.o
>  cxl_core-y += memdev.o
>  cxl_core-y += mbox.o
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 1a50c0fc399c..adfd42370b28 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -9,6 +9,9 @@ extern const struct device_type cxl_nvdimm_type;
>  
>  extern struct attribute_group cxl_base_attribute_group;
>  
> +extern struct device_attribute dev_attr_create_region;
> +extern struct device_attribute dev_attr_delete_region;
> +
>  struct cxl_send_command;
>  struct cxl_mem_query_commands;
>  int cxl_query_cmd(struct cxl_memdev *cxlmd,
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 1e785a3affaa..860e91cae29b 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -213,6 +213,8 @@ static struct attribute_group cxl_decoder_base_attribute_group = {
>  };
>  
>  static struct attribute *cxl_decoder_root_attrs[] = {
> +	&dev_attr_create_region.attr,
> +	&dev_attr_delete_region.attr,
>  	&dev_attr_cap_pmem.attr,
>  	&dev_attr_cap_ram.attr,
>  	&dev_attr_cap_type2.attr,
> @@ -270,6 +272,8 @@ static void cxl_decoder_release(struct device *dev)
>  	struct cxl_decoder *cxld = to_cxl_decoder(dev);
>  	struct cxl_port *port = to_cxl_port(dev->parent);
>  
> +	ida_free(&cxld->region_ida, cxld->next_region_id);
> +	ida_destroy(&cxld->region_ida);
>  	ida_free(&port->decoder_ida, cxld->id);
>  	kfree(cxld);
>  }
> @@ -1244,6 +1248,13 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
>  	cxld->target_type = CXL_DECODER_EXPANDER;
>  	cxld->platform_res = (struct resource)DEFINE_RES_MEM(0, 0);
>  
> +	mutex_init(&cxld->id_lock);
> +	ida_init(&cxld->region_ida);
> +	rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);
> +	if (rc < 0)
> +		goto err;
> +
> +	cxld->next_region_id = rc;
>  	return cxld;
>  err:
>  	kfree(cxld);
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> new file mode 100644
> index 000000000000..5576952e4aa1
> --- /dev/null
> +++ b/drivers/cxl/core/region.c
> @@ -0,0 +1,213 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2022 Intel Corporation. All rights reserved. */
> +#include <linux/device.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/idr.h>
> +#include <region.h>
> +#include <cxl.h>
> +#include "core.h"
> +
> +/**
> + * DOC: cxl core region
> + *
> + * CXL Regions represent mapped memory capacity in system physical address
> + * space. Whereas the CXL Root Decoders identify the bounds of potential CXL
> + * Memory ranges, Regions represent the active mapped capacity by the HDM
> + * Decoder Capability structures throughout the Host Bridges, Switches, and
> + * Endpoints in the topology.
> + */
> +
> +static void cxl_region_release(struct device *dev);
> +
> +static const struct device_type cxl_region_type = {
> +	.name = "cxl_region",
> +	.release = cxl_region_release,
> +};
> +
> +static struct cxl_region *to_cxl_region(struct device *dev)
> +{
> +	if (dev_WARN_ONCE(dev, dev->type != &cxl_region_type,
> +			  "not a cxl_region device\n"))
> +		return NULL;
> +
> +	return container_of(dev, struct cxl_region, dev);
> +}
> +
> +static struct cxl_region *cxl_region_alloc(struct cxl_decoder *cxld)
> +{
> +	struct cxl_region *cxlr;
> +	struct device *dev;
> +
> +	cxlr = kzalloc(sizeof(*cxlr), GFP_KERNEL);
> +	if (!cxlr)
> +		return ERR_PTR(-ENOMEM);
> +
> +	dev = &cxlr->dev;
> +	device_initialize(dev);
> +	dev->parent = &cxld->dev;
> +	device_set_pm_not_required(dev);
> +	dev->bus = &cxl_bus_type;
> +	dev->type = &cxl_region_type;
> +
> +	return cxlr;
> +}
> +
> +static void unregister_region(void *_cxlr)
> +{
> +	struct cxl_region *cxlr = _cxlr;
> +
> +	if (!test_and_set_bit(REGION_DEAD, &cxlr->flags))
> +		device_unregister(&cxlr->dev);
> +}
> +
> +/**
> + * devm_cxl_add_region - Adds a region to a decoder
> + * @cxld: Parent decoder.
> + * @cxlr: Region to be added to the decoder.
> + *
> + * This is the second step of region initialization. Regions exist within an
> + * address space which is mapped by a @cxld. That @cxld must be a root decoder,
> + * and it enforces constraints upon the region as it is configured.
> + *
> + * Return: 0 if the region was added to the @cxld, else returns negative error
> + * code. The region will be named "regionX.Y.Z" where X is the port, Y is the
> + * decoder id, and Z is the region number.
> + */
> +static struct cxl_region *devm_cxl_add_region(struct cxl_decoder *cxld)
> +{
> +	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> +	struct cxl_region *cxlr;
> +	struct device *dev;
> +	int rc;
> +
> +	cxlr = cxl_region_alloc(cxld);
> +	if (IS_ERR(cxlr))
> +		return cxlr;
> +
> +	dev = &cxlr->dev;
> +
> +	cxlr->id = cxld->next_region_id;
> +	rc = dev_set_name(dev, "region%d.%d:%d", port->id, cxld->id, cxlr->id);
> +	if (rc)
> +		goto err_out;
> +
> +	/* affirm that release will have access to the decoder's region ida  */
> +	get_device(&cxld->dev);
> +
> +	rc = device_add(dev);
> +	if (!rc)
> +		rc = devm_add_action_or_reset(port->uport, unregister_region,
> +					      cxlr);
> +	if (rc)
> +		goto err_out;
> +
> +	return cxlr;
> +
> +err_out:
> +	put_device(dev);
> +	kfree(cxlr);
> +	return ERR_PTR(rc);
> +}
> +
> +static ssize_t create_region_show(struct device *dev,
> +				  struct device_attribute *attr, char *buf)
> +{
> +	struct cxl_port *port = to_cxl_port(dev->parent);
> +	struct cxl_decoder *cxld = to_cxl_decoder(dev);
> +
> +	return sysfs_emit(buf, "region%d.%d:%d\n", port->id, cxld->id,
> +			  cxld->next_region_id);
> +}
> +
> +static ssize_t create_region_store(struct device *dev,
> +				   struct device_attribute *attr,
> +				   const char *buf, size_t len)
> +{
> +	struct cxl_port *port = to_cxl_port(dev->parent);
> +	struct cxl_decoder *cxld = to_cxl_decoder(dev);
> +	struct cxl_region *cxlr;
> +	int d, p, r, rc = 0;
> +
> +	if (sscanf(buf, "region%d.%d:%d", &p, &d, &r) != 3)
> +		return -EINVAL;
> +
> +	if (port->id != p || cxld->id != d)
> +		return -EINVAL;
> +
> +	rc = mutex_lock_interruptible(&cxld->id_lock);
> +	if (rc)
> +		return rc;
> +
> +	if (cxld->next_region_id != r) {
> +		rc = -EINVAL;
> +		goto out;
> +	}
> +
> +	rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);
> +	if (rc < 0) {
> +		dev_dbg(dev, "Failed to get next cached id (%d)\n", rc);
> +		goto out;
> +	}
> +
> +	cxlr = devm_cxl_add_region(cxld);
> +	if (IS_ERR(cxlr)) {
> +		rc = PTR_ERR(cxlr);
> +		goto out;
> +	}
> +
> +	cxld->next_region_id = rc;
> +	dev_dbg(dev, "Created %s\n", dev_name(&cxlr->dev));
> +
> +out:
> +	mutex_unlock(&cxld->id_lock);
> +	return rc ? rc : len;

Sigh. This is incorrect after a last minute change I made. While the regions do
get created you get back an errno here based on the ID allocated. Will not
bother resending v6 until there's some other review to avoid more thrash.

> +}
> +DEVICE_ATTR_RW(create_region);
> +
> +static struct cxl_region *cxl_find_region_by_name(struct cxl_decoder *cxld,
> +						  const char *name)
> +{
> +	struct device *region_dev;
> +
> +	region_dev = device_find_child_by_name(&cxld->dev, name);
> +	if (!region_dev)
> +		return ERR_PTR(-ENOENT);
> +
> +	return to_cxl_region(region_dev);
> +}
> +
> +static ssize_t delete_region_store(struct device *dev,
> +				   struct device_attribute *attr,
> +				   const char *buf, size_t len)
> +{
> +	struct cxl_port *port = to_cxl_port(dev->parent);
> +	struct cxl_decoder *cxld = to_cxl_decoder(dev);
> +	struct cxl_region *cxlr;
> +
> +	cxlr = cxl_find_region_by_name(cxld, buf);
> +	if (IS_ERR(cxlr))
> +		return PTR_ERR(cxlr);
> +
> +	/* After this, the region is no longer a child of the decoder. */
> +	devm_release_action(port->uport, unregister_region, cxlr);
> +
> +	/* Release is likely called here, so cxlr is not safe to reference. */
> +	put_device(&cxlr->dev);
> +	cxlr = NULL;
> +
> +	dev_dbg(dev, "Deleted %s\n", buf);
> +	return len;
> +}
> +DEVICE_ATTR_WO(delete_region);
> +
> +static void cxl_region_release(struct device *dev)
> +{
> +	struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> +	struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +	dev_dbg(&cxld->dev, "Releasing %s\n", dev_name(dev));
> +	ida_free(&cxld->region_ida, cxlr->id);
> +	kfree(cxlr);
> +	put_device(&cxld->dev);
> +}
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index b4047a310340..d5397f7dfcf4 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -221,6 +221,8 @@ enum cxl_decoder_type {
>   * @target_type: accelerator vs expander (type2 vs type3) selector
>   * @flags: memory type capabilities and locking
>   * @target_lock: coordinate coherent reads of the target list
> + * @region_ida: allocator for region ids.
> + * @next_region_id: Cached region id for next region.
>   * @nr_targets: number of elements in @target
>   * @target: active ordered target list in current decoder configuration
>   */
> @@ -236,6 +238,9 @@ struct cxl_decoder {
>  	enum cxl_decoder_type target_type;
>  	unsigned long flags;
>  	seqlock_t target_lock;
> +	struct mutex id_lock;
> +	struct ida region_ida;
> +	int next_region_id;
>  	int nr_targets;
>  	struct cxl_dport *target[];
>  };
> diff --git a/drivers/cxl/region.h b/drivers/cxl/region.h
> new file mode 100644
> index 000000000000..0016f83bbdfd
> --- /dev/null
> +++ b/drivers/cxl/region.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/* Copyright(c) 2021 Intel Corporation. */
> +#ifndef __CXL_REGION_H__
> +#define __CXL_REGION_H__
> +
> +#include <linux/uuid.h>
> +
> +#include "cxl.h"
> +
> +/**
> + * struct cxl_region - CXL region
> + * @dev: This region's device.
> + * @id: This region's id. Id is globally unique across all regions.
> + * @flags: Flags representing the current state of the region.
> + */
> +struct cxl_region {
> +	struct device dev;
> +	int id;
> +	unsigned long flags;
> +#define REGION_DEAD 0
> +};
> +
> +#endif
> diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
> index 82e49ab0937d..3fe6d34e6d59 100644
> --- a/tools/testing/cxl/Kbuild
> +++ b/tools/testing/cxl/Kbuild
> @@ -46,6 +46,7 @@ cxl_core-y += $(CXL_CORE_SRC)/memdev.o
>  cxl_core-y += $(CXL_CORE_SRC)/mbox.o
>  cxl_core-y += $(CXL_CORE_SRC)/pci.o
>  cxl_core-y += $(CXL_CORE_SRC)/hdm.o
> +cxl_core-y += $(CXL_CORE_SRC)/region.o
>  cxl_core-y += config_check.o
>  
>  obj-m += test/
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 01/15] cxl/region: Add region creation ABI
  2022-02-17 17:19     ` [PATCH v5 01/15] " Ben Widawsky
  2022-02-17 17:33       ` Ben Widawsky
@ 2022-02-17 17:58       ` Dan Williams
  2022-02-17 18:58         ` Ben Widawsky
  2022-02-17 22:22         ` Ben Widawsky
  1 sibling, 2 replies; 70+ messages in thread
From: Dan Williams @ 2022-02-17 17:58 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On Thu, Feb 17, 2022 at 9:19 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> Regions are created as a child of the decoder that encompasses an
> address space with constraints. Regions have a number of attributes that
> must be configured before the region can be activated.
>
> The ABI is not meant to be secure, but is meant to avoid accidental
> races. As a result, a buggy process may create a region by name that was
> allocated by a different process. However, multiple processes which are
> trying not to race with each other shouldn't need special
> synchronization to do so.
>
> // Allocate a new region name
> region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
>
> // Create a new region by name
> while
> region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> ! echo $region > /sys/bus/cxl/devices/decoder0.0/create_region
> do true; done
>
> // Region now exists in sysfs
> stat -t /sys/bus/cxl/devices/decoder0.0/$region
>
> // Delete the region, and name
> echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region
>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>

Looking good, a few more fixes and cleanups identified below.

>
> ---
> Changes since v4:
> - Add the missed base attributes addition
>
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl       |  23 ++
>  .../driver-api/cxl/memory-devices.rst         |  11 +
>  drivers/cxl/core/Makefile                     |   1 +
>  drivers/cxl/core/core.h                       |   3 +
>  drivers/cxl/core/port.c                       |  11 +
>  drivers/cxl/core/region.c                     | 213 ++++++++++++++++++
>  drivers/cxl/cxl.h                             |   5 +
>  drivers/cxl/region.h                          |  23 ++
>  tools/testing/cxl/Kbuild                      |   1 +
>  9 files changed, 291 insertions(+)
>  create mode 100644 drivers/cxl/core/region.c
>  create mode 100644 drivers/cxl/region.h
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 7c2b846521f3..e5db45ea70ad 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -163,3 +163,26 @@ Description:
>                 memory (type-3). The 'target_type' attribute indicates the
>                 current setting which may dynamically change based on what
>                 memory regions are activated in this decode hierarchy.
> +
> +What:          /sys/bus/cxl/devices/decoderX.Y/create_region
> +Date:          January, 2022
> +KernelVersion: v5.18
> +Contact:       linux-cxl@vger.kernel.org
> +Description:
> +               Write a value of the form 'regionX.Y:Z' to instantiate a new
> +               region within the decode range bounded by decoderX.Y. The value
> +               written must match the current value returned from reading this
> +               attribute. This behavior lets the kernel arbitrate racing
> +               attempts to create a region. The thread that fails to write
> +               loops and tries the next value. Regions must be created for root
> +               decoders, and must subsequently configured and bound to a region
> +               driver before they can be used.
> +
> +What:          /sys/bus/cxl/devices/decoderX.Y/delete_region
> +Date:          January, 2022
> +KernelVersion: v5.18
> +Contact:       linux-cxl@vger.kernel.org
> +Description:
> +               Deletes the named region.  The attribute expects a region in the
> +               form "regionX.Y:Z". The region's name, allocated by reading
> +               create_region, will also be released.
> diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst
> index db476bb170b6..66ddc58a21b1 100644
> --- a/Documentation/driver-api/cxl/memory-devices.rst
> +++ b/Documentation/driver-api/cxl/memory-devices.rst
> @@ -362,6 +362,17 @@ CXL Core
>  .. kernel-doc:: drivers/cxl/core/mbox.c
>     :doc: cxl mbox
>
> +CXL Regions
> +-----------
> +.. kernel-doc:: drivers/cxl/region.h
> +   :identifiers:
> +
> +.. kernel-doc:: drivers/cxl/core/region.c
> +   :doc: cxl core region
> +
> +.. kernel-doc:: drivers/cxl/core/region.c
> +   :identifiers:
> +
>  External Interfaces
>  ===================
>
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 6d37cd78b151..39ce8f2f2373 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -4,6 +4,7 @@ obj-$(CONFIG_CXL_BUS) += cxl_core.o
>  ccflags-y += -I$(srctree)/drivers/cxl
>  cxl_core-y := port.o
>  cxl_core-y += pmem.o
> +cxl_core-y += region.o
>  cxl_core-y += regs.o
>  cxl_core-y += memdev.o
>  cxl_core-y += mbox.o
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 1a50c0fc399c..adfd42370b28 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -9,6 +9,9 @@ extern const struct device_type cxl_nvdimm_type;
>
>  extern struct attribute_group cxl_base_attribute_group;
>
> +extern struct device_attribute dev_attr_create_region;
> +extern struct device_attribute dev_attr_delete_region;
> +
>  struct cxl_send_command;
>  struct cxl_mem_query_commands;
>  int cxl_query_cmd(struct cxl_memdev *cxlmd,
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 1e785a3affaa..860e91cae29b 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -213,6 +213,8 @@ static struct attribute_group cxl_decoder_base_attribute_group = {
>  };
>
>  static struct attribute *cxl_decoder_root_attrs[] = {
> +       &dev_attr_create_region.attr,
> +       &dev_attr_delete_region.attr,
>         &dev_attr_cap_pmem.attr,
>         &dev_attr_cap_ram.attr,
>         &dev_attr_cap_type2.attr,
> @@ -270,6 +272,8 @@ static void cxl_decoder_release(struct device *dev)
>         struct cxl_decoder *cxld = to_cxl_decoder(dev);
>         struct cxl_port *port = to_cxl_port(dev->parent);
>
> +       ida_free(&cxld->region_ida, cxld->next_region_id);
> +       ida_destroy(&cxld->region_ida);
>         ida_free(&port->decoder_ida, cxld->id);
>         kfree(cxld);
>  }
> @@ -1244,6 +1248,13 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
>         cxld->target_type = CXL_DECODER_EXPANDER;
>         cxld->platform_res = (struct resource)DEFINE_RES_MEM(0, 0);
>
> +       mutex_init(&cxld->id_lock);
> +       ida_init(&cxld->region_ida);
> +       rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);
> +       if (rc < 0)
> +               goto err;
> +
> +       cxld->next_region_id = rc;
>         return cxld;
>  err:
>         kfree(cxld);
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> new file mode 100644
> index 000000000000..5576952e4aa1
> --- /dev/null
> +++ b/drivers/cxl/core/region.c
> @@ -0,0 +1,213 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2022 Intel Corporation. All rights reserved. */
> +#include <linux/device.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/idr.h>
> +#include <region.h>
> +#include <cxl.h>
> +#include "core.h"
> +
> +/**
> + * DOC: cxl core region
> + *
> + * CXL Regions represent mapped memory capacity in system physical address
> + * space. Whereas the CXL Root Decoders identify the bounds of potential CXL
> + * Memory ranges, Regions represent the active mapped capacity by the HDM
> + * Decoder Capability structures throughout the Host Bridges, Switches, and
> + * Endpoints in the topology.
> + */
> +
> +static void cxl_region_release(struct device *dev);

Why forward declare this versus move cxl_region_type after the definition?

No other CXL object release functions are forward declared.

> +
> +static const struct device_type cxl_region_type = {
> +       .name = "cxl_region",
> +       .release = cxl_region_release,
> +};
> +
> +static struct cxl_region *to_cxl_region(struct device *dev)
> +{
> +       if (dev_WARN_ONCE(dev, dev->type != &cxl_region_type,
> +                         "not a cxl_region device\n"))
> +               return NULL;
> +
> +       return container_of(dev, struct cxl_region, dev);
> +}
> +
> +static struct cxl_region *cxl_region_alloc(struct cxl_decoder *cxld)
> +{
> +       struct cxl_region *cxlr;
> +       struct device *dev;
> +
> +       cxlr = kzalloc(sizeof(*cxlr), GFP_KERNEL);
> +       if (!cxlr)
> +               return ERR_PTR(-ENOMEM);
> +
> +       dev = &cxlr->dev;
> +       device_initialize(dev);
> +       dev->parent = &cxld->dev;
> +       device_set_pm_not_required(dev);
> +       dev->bus = &cxl_bus_type;
> +       dev->type = &cxl_region_type;
> +
> +       return cxlr;
> +}
> +
> +static void unregister_region(void *_cxlr)
> +{
> +       struct cxl_region *cxlr = _cxlr;
> +
> +       if (!test_and_set_bit(REGION_DEAD, &cxlr->flags))
> +               device_unregister(&cxlr->dev);

I thought REGION_DEAD was needed to prevent double
devm_release_action(), not double unregister?

> +}
> +
> +/**
> + * devm_cxl_add_region - Adds a region to a decoder
> + * @cxld: Parent decoder.
> + * @cxlr: Region to be added to the decoder.
> + *
> + * This is the second step of region initialization. Regions exist within an
> + * address space which is mapped by a @cxld. That @cxld must be a root decoder,
> + * and it enforces constraints upon the region as it is configured.
> + *
> + * Return: 0 if the region was added to the @cxld, else returns negative error
> + * code. The region will be named "regionX.Y.Z" where X is the port, Y is the
> + * decoder id, and Z is the region number.
> + */
> +static struct cxl_region *devm_cxl_add_region(struct cxl_decoder *cxld)
> +{
> +       struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> +       struct cxl_region *cxlr;
> +       struct device *dev;
> +       int rc;
> +
> +       cxlr = cxl_region_alloc(cxld);
> +       if (IS_ERR(cxlr))
> +               return cxlr;
> +
> +       dev = &cxlr->dev;
> +
> +       cxlr->id = cxld->next_region_id;
> +       rc = dev_set_name(dev, "region%d.%d:%d", port->id, cxld->id, cxlr->id);
> +       if (rc)
> +               goto err_out;
> +
> +       /* affirm that release will have access to the decoder's region ida  */
> +       get_device(&cxld->dev);
> +
> +       rc = device_add(dev);
> +       if (!rc)
> +               rc = devm_add_action_or_reset(port->uport, unregister_region,
> +                                             cxlr);
> +       if (rc)
> +               goto err_out;

All the other usages in device_add() in the subsystem follow the style of:

rc = device_add(dev);
if (rc)
    goto err;

...any reason to be unique here and indent the success case?


> +
> +       return cxlr;
> +
> +err_out:
> +       put_device(dev);
> +       kfree(cxlr);

This is a double-free of cxlr;

> +       return ERR_PTR(rc);
> +}
> +
> +static ssize_t create_region_show(struct device *dev,
> +                                 struct device_attribute *attr, char *buf)
> +{
> +       struct cxl_port *port = to_cxl_port(dev->parent);
> +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> +
> +       return sysfs_emit(buf, "region%d.%d:%d\n", port->id, cxld->id,
> +                         cxld->next_region_id);
> +}
> +
> +static ssize_t create_region_store(struct device *dev,
> +                                  struct device_attribute *attr,
> +                                  const char *buf, size_t len)
> +{
> +       struct cxl_port *port = to_cxl_port(dev->parent);
> +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> +       struct cxl_region *cxlr;
> +       int d, p, r, rc = 0;
> +
> +       if (sscanf(buf, "region%d.%d:%d", &p, &d, &r) != 3)
> +               return -EINVAL;
> +
> +       if (port->id != p || cxld->id != d)
> +               return -EINVAL;
> +
> +       rc = mutex_lock_interruptible(&cxld->id_lock);
> +       if (rc)
> +               return rc;
> +
> +       if (cxld->next_region_id != r) {
> +               rc = -EINVAL;
> +               goto out;
> +       }
> +
> +       rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);
> +       if (rc < 0) {
> +               dev_dbg(dev, "Failed to get next cached id (%d)\n", rc);
> +               goto out;
> +       }
> +
> +       cxlr = devm_cxl_add_region(cxld);
> +       if (IS_ERR(cxlr)) {
> +               rc = PTR_ERR(cxlr);
> +               goto out;
> +       }
> +
> +       cxld->next_region_id = rc;

This looks like a leak in the case when devm_cxl_add_region() fails,
so just move it above that call.

> +       dev_dbg(dev, "Created %s\n", dev_name(&cxlr->dev));
> +
> +out:
> +       mutex_unlock(&cxld->id_lock);
> +       return rc ? rc : len;

if (rc)
    return rc;
return len;

> +}
> +DEVICE_ATTR_RW(create_region);
> +
> +static struct cxl_region *cxl_find_region_by_name(struct cxl_decoder *cxld,
> +                                                 const char *name)
> +{
> +       struct device *region_dev;
> +
> +       region_dev = device_find_child_by_name(&cxld->dev, name);
> +       if (!region_dev)
> +               return ERR_PTR(-ENOENT);
> +
> +       return to_cxl_region(region_dev);
> +}
> +
> +static ssize_t delete_region_store(struct device *dev,
> +                                  struct device_attribute *attr,
> +                                  const char *buf, size_t len)
> +{
> +       struct cxl_port *port = to_cxl_port(dev->parent);
> +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> +       struct cxl_region *cxlr;
> +
> +       cxlr = cxl_find_region_by_name(cxld, buf);
> +       if (IS_ERR(cxlr))
> +               return PTR_ERR(cxlr);
> +
> +       /* After this, the region is no longer a child of the decoder. */
> +       devm_release_action(port->uport, unregister_region, cxlr);

This may trigger a WARN in the case where 2 threads race to trigger
the release action. I think the DEAD check is needed to gate this
call, not device_unregister().

> +
> +       /* Release is likely called here, so cxlr is not safe to reference. */

This is always the case with any put_device(), so no need for this comment.

> +       put_device(&cxlr->dev);
> +       cxlr = NULL;

This NULL assignment has no value.

> +
> +       dev_dbg(dev, "Deleted %s\n", buf);

Not sure a debug statement is needed for something userspace can
directly view itself with the result code from the sysfs write.

> +       return len;
> +}
> +DEVICE_ATTR_WO(delete_region);
> +
> +static void cxl_region_release(struct device *dev)
> +{
> +       struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +       dev_dbg(&cxld->dev, "Releasing %s\n", dev_name(dev));
> +       ida_free(&cxld->region_ida, cxlr->id);
> +       kfree(cxlr);
> +       put_device(&cxld->dev);
> +}
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index b4047a310340..d5397f7dfcf4 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -221,6 +221,8 @@ enum cxl_decoder_type {
>   * @target_type: accelerator vs expander (type2 vs type3) selector
>   * @flags: memory type capabilities and locking
>   * @target_lock: coordinate coherent reads of the target list
> + * @region_ida: allocator for region ids.
> + * @next_region_id: Cached region id for next region.
>   * @nr_targets: number of elements in @target
>   * @target: active ordered target list in current decoder configuration
>   */
> @@ -236,6 +238,9 @@ struct cxl_decoder {
>         enum cxl_decoder_type target_type;
>         unsigned long flags;
>         seqlock_t target_lock;
> +       struct mutex id_lock;
> +       struct ida region_ida;
> +       int next_region_id;
>         int nr_targets;
>         struct cxl_dport *target[];
>  };
> diff --git a/drivers/cxl/region.h b/drivers/cxl/region.h
> new file mode 100644
> index 000000000000..0016f83bbdfd
> --- /dev/null
> +++ b/drivers/cxl/region.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/* Copyright(c) 2021 Intel Corporation. */
> +#ifndef __CXL_REGION_H__
> +#define __CXL_REGION_H__
> +
> +#include <linux/uuid.h>
> +
> +#include "cxl.h"
> +
> +/**
> + * struct cxl_region - CXL region
> + * @dev: This region's device.
> + * @id: This region's id. Id is globally unique across all regions.
> + * @flags: Flags representing the current state of the region.
> + */
> +struct cxl_region {
> +       struct device dev;
> +       int id;
> +       unsigned long flags;
> +#define REGION_DEAD 0
> +};
> +
> +#endif
> diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
> index 82e49ab0937d..3fe6d34e6d59 100644
> --- a/tools/testing/cxl/Kbuild
> +++ b/tools/testing/cxl/Kbuild
> @@ -46,6 +46,7 @@ cxl_core-y += $(CXL_CORE_SRC)/memdev.o
>  cxl_core-y += $(CXL_CORE_SRC)/mbox.o
>  cxl_core-y += $(CXL_CORE_SRC)/pci.o
>  cxl_core-y += $(CXL_CORE_SRC)/hdm.o
> +cxl_core-y += $(CXL_CORE_SRC)/region.o
>  cxl_core-y += config_check.o
>
>  obj-m += test/
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 02/14] cxl/region: Introduce concept of region configuration
  2022-01-29  0:25   ` Dan Williams
  2022-02-01 14:59     ` Ben Widawsky
  2022-02-01 23:11     ` Ben Widawsky
@ 2022-02-17 18:36     ` Ben Widawsky
  2022-02-17 19:57       ` Dan Williams
  2 siblings, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-02-17 18:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Bjorn Helgaas,
	Linux NVDIMM, Linux PCI

Consolidating earlier discussions...

On 22-01-28 16:25:34, Dan Williams wrote:
> On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > The region creation APIs create a vacant region. Configuring the region
> > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > will be provided to allow userspace to configure the region.  Finally
> > once all configuration is complete, userspace may activate the region.
> >
> > Introduced here are the most basic attributes needed to configure a
> > region. Details of these attribute are described in the ABI
> 
> s/attribute/attributes/
> 
> > Documentation. Sanity checking of configuration parameters are done at
> > region binding time. This consolidates all such logic in one place,
> > rather than being strewn across multiple places.
> 
> I think that's too late for some of the validation. The complex
> validation that the region driver does throughout the topology is
> different from the basic input validation that can  be done at the
> sysfs write time. For example ,this patch allows negative
> interleave_granularity values to specified, just return -EINVAL. I
> agree that sysfs should not validate everything, I disagree with
> pushing all validation to cxl_region_probe().
> 

Okay. It might save us some back and forth if you could outline everything you'd
expect to be validated, but I can also make an attempt to figure out the
reasonable set of things.

> >
> > A example is provided below:
> >
> > /sys/bus/cxl/devices/region0.0:0
> > ├── interleave_granularity
> > ├── interleave_ways
> > ├── offset
> > ├── size
> > ├── subsystem -> ../../../../../../bus/cxl
> > ├── target0
> > ├── uevent
> > └── uuid
> 
> As mentioned off-list, it looks like devtype and modalias are missing.
> 

Yep. This belongs in the previous patch though.

> >
> > Reported-by: kernel test robot <lkp@intel.com> (v2)
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > ---
> >  Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
> >  drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
> >  2 files changed, 340 insertions(+)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > index dcc728458936..50ba5018014d 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > @@ -187,3 +187,43 @@ Description:
> >                 region driver before being deleted. The attributes expects a
> >                 region in the form "regionX.Y:Z". The region's name, allocated
> >                 by reading create_region, will also be released.
> > +
> > +What:          /sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset
> 
> This is just another 'resource' attribute for the physical base
> address of the region, right? 'offset' sounds like something that
> would be relative instead of absolute.
> 

It was meant to be relative. I can make it absolute if that's preferable but the
physical base is known at the decoder level already.

> > +Date:          August, 2021
> 
> Same date update comment here.
> 
> > +KernelVersion: v5.18
> > +Contact:       linux-cxl@vger.kernel.org
> > +Description:
> > +               (RO) A region resides within an address space that is claimed by
> > +               a decoder.
> 
> "A region is a contiguous partition of a CXL Root decoder address space."
> 
> >                  Region space allocation is handled by the driver, but
> 
> "Region capacity is allocated by writing to the size attribute, the
> resulting physical address base determined by the driver is reflected
> here."
> 
> > +               the offset may be read by userspace tooling in order to
> > +               determine fragmentation, and available size for new regions.
> 
> I would also expect, before / along with these new region attributes,
> there would be 'available' and 'max_extent_available' at the decoder
> level to indicate how much free space the decoder has and how big the
> next region creation can be. User tooling can walk  the decoder and
> the regions together to determine fragmentation if necessary, but for
> the most part the tool likely only cares about "how big can the next
> region be?" and "how full is this decoder?".
> 

Since this is the configuration part of the ABI, I'd rather add that information
when the plumbing to report them exists. I'm struggling to understand the
balance (as mentioned also earlier in this mail thread) as to what userspace
does and what the kernel does. I will add these as you request.

> 
> > +
> > +What:
> > +/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
> > +Date:          August, 2021
> > +KernelVersion: v5.18
> > +Contact:       linux-cxl@vger.kernel.org
> > +Description:
> > +               (RW) Configuring regions requires a minimal set of parameters in
> > +               order for the subsequent bind operation to succeed. The
> > +               following parameters are defined:
> 
> Let's split up the descriptions into individual sections. That can
> also document the order that attributes must be written. For example,
> doesn't size need to be set before targets are added so that targets
> can be validated whether they have sufficient capacity?
> 

Okay. Since we're moving toward making the sysfs ABI stateful, would you like me
to make the attrs only visible when they can actually be set?

> > +
> > +               ==      ========================================================
> > +               interleave_granularity Mandatory. Number of consecutive bytes
> > +                       each device in the interleave set will claim. The
> > +                       possible interleave granularity values are determined by
> > +                       the CXL spec and the participating devices.
> > +               interleave_ways Mandatory. Number of devices participating in the
> > +                       region. Each device will provide 1/interleave of storage
> > +                       for the region.
> > +               size    Manadatory. Phsyical address space the region will
> > +                       consume.
> 
> s/Phsyical/Physical/
> 
> > +               target  Mandatory. Memory devices are the backing storage for a
> > +                       region. There will be N targets based on the number of
> > +                       interleave ways that the top level decoder is configured
> > +                       for.
> 
> That doesn't sound right, IW at the root != IW at the endpoint level
> and the region needs to record all the endpoint level targets.

Correct.

> 
> > Each target must be set with a memdev device ie.
> > +                       'mem1'. This attribute only becomes available after
> > +                       setting the 'interleave' attribute.
> > +               uuid    Optional. A unique identifier for the region. If none is
> > +                       selected, the kernel will create one.
> 
> Let's drop the Mandatory / Optional distinction, or I am otherwise not
> understanding what this is trying to document. For example 'uuid' is
> "mandatory" for PMEM regions and "omitted" for volatile regions not
> optional.

Okay.

> 
> > +               ==      ========================================================
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > index 1a448543db0d..3b48e0469fc7 100644
> > --- a/drivers/cxl/core/region.c
> > +++ b/drivers/cxl/core/region.c
> > @@ -3,9 +3,12 @@
> >  #include <linux/io-64-nonatomic-lo-hi.h>
> >  #include <linux/device.h>
> >  #include <linux/module.h>
> > +#include <linux/sizes.h>
> >  #include <linux/slab.h>
> > +#include <linux/uuid.h>
> >  #include <linux/idr.h>
> >  #include <region.h>
> > +#include <cxlmem.h>
> >  #include <cxl.h>
> >  #include "core.h"
> >
> > @@ -18,11 +21,305 @@
> >   * (programming the hardware) is handled by a separate region driver.
> >   */
> >
> > +struct cxl_region *to_cxl_region(struct device *dev);
> > +static const struct attribute_group region_interleave_group;
> > +
> > +static bool is_region_active(struct cxl_region *cxlr)
> > +{
> > +       /* TODO: Regions can't be activated yet. */
> > +       return false;
> 
> This function seems redundant with just checking "cxlr->dev.driver !=
> NULL"? The benefit of that is there is no need to carry a TODO in the
> series.
> 

The idea behind this was to give the reviewer somewhat of a bigger picture as to
how things should work in the code rather than in a commit message. I will
remove this.

> > +}
> > +
> > +static void remove_target(struct cxl_region *cxlr, int target)
> > +{
> > +       struct cxl_memdev *cxlmd;
> > +
> > +       cxlmd = cxlr->config.targets[target];
> > +       if (cxlmd)
> > +               put_device(&cxlmd->dev);
> 
> A memdev can be a member of multiple regions at once, shouldn't this
> be an endpoint decoder or similar, not the entire memdev?
> 
> Also, if memdevs autoremove themselves from regions at memdev
> ->remove() time then I don't think the region needs to hold references
> on memdevs.
> 

This needs some work. The concern I have is region operations will need to
operate on memdevs/decoders at various points in time. When the memdev goes
away, the region will also need to go away. None of that plumbing was in place
in v3 and the reference on the memdev was just a half-hearted attempt at doing
the right thing.

For now if you prefer I remove the reference, but perhaps the decoder reference
would buy us some safety?

> > +       cxlr->config.targets[target] = NULL;
> > +}
> > +
> > +static ssize_t interleave_ways_show(struct device *dev,
> > +                                   struct device_attribute *attr, char *buf)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
> > +}
> > +
> > +static ssize_t interleave_ways_store(struct device *dev,
> > +                                    struct device_attribute *attr,
> > +                                    const char *buf, size_t len)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       int ret, prev_iw;
> > +       int val;
> 
> I would expect:
> 
> if (dev->driver)
>    return -EBUSY;
> 
> ...to shutdown configuration writes once the region is active. Might
> also need a region-wide seqlock like target_list_show. So that region
> probe drains  all active sysfs writers before assuming the
> configuration is stable.

Initially my thought here is that this is a problem for userspace to deal with.
If userspace can't figure out how to synchronously configure and bind the
region, that's not a kernel problem. However, we've put some effort into
protecting userspace from itself in the create ABI, so it might be more in line
to do that here.

In summary, I'm fine to add it, but I think I really need to get more in your
brain about the userspace/kernel divide sooner rather than later.

> 
> > +
> > +       prev_iw = cxlr->config.interleave_ways;
> > +       ret = kstrtoint(buf, 0, &val);
> > +       if (ret)
> > +               return ret;
> > +       if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
> > +               return -EINVAL;
> > +
> > +       cxlr->config.interleave_ways = val;
> > +
> > +       ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
> > +       if (ret < 0)
> > +               goto err;
> > +
> > +       sysfs_notify(&dev->kobj, NULL, "target_interleave");
> 
> Why?
> 

copypasta

> > +
> > +       while (prev_iw > cxlr->config.interleave_ways)
> > +               remove_target(cxlr, --prev_iw);
> 
> To make the kernel side simpler this attribute could just require that
> setting interleave ways is a one way street, if you want to change it
> you need to delete the region and start over.
> 

Okay. One of the earlier versions did this implicitly since the #ways was needed
to create the region. I thought from the ABI perspective, flexibility was good.
Userspace may choose not to utilize it.

> > +
> > +       return len;
> > +
> > +err:
> > +       cxlr->config.interleave_ways = prev_iw;
> > +       return ret;
> > +}
> > +static DEVICE_ATTR_RW(interleave_ways);
> > +
> > +static ssize_t interleave_granularity_show(struct device *dev,
> > +                                          struct device_attribute *attr,
> > +                                          char *buf)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_granularity);
> > +}
> > +
> > +static ssize_t interleave_granularity_store(struct device *dev,
> > +                                           struct device_attribute *attr,
> > +                                           const char *buf, size_t len)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       int val, ret;
> > +
> > +       ret = kstrtoint(buf, 0, &val);
> > +       if (ret)
> > +               return ret;
> > +       cxlr->config.interleave_granularity = val;
> 
> This wants minimum input validation and synchronization against an
> active region.
> 

Okay. Already comprehended above.

> > +
> > +       return len;
> > +}
> > +static DEVICE_ATTR_RW(interleave_granularity);
> > +
> > +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> > +                          char *buf)
> > +{
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       resource_size_t offset;
> > +
> > +       if (!cxlr->res)
> > +               return sysfs_emit(buf, "\n");
> 
> Should be an error I would think. I.e. require size to be set before
> s/offset/resource/ can be read.
> 

Okay. Already comprehended above.

> > +
> > +       offset = cxld->platform_res.start - cxlr->res->start;
> 
> Why make usersapce do the offset math?
> 

Okay. Already comprehended above.

> > +
> > +       return sysfs_emit(buf, "%pa\n", &offset);
> > +}
> > +static DEVICE_ATTR_RO(offset);
> 
> This can be DEVICE_ATTR_ADMIN_RO() to hide physical address layout
> information from non-root.
> 
> > +
> > +static ssize_t size_show(struct device *dev, struct device_attribute *attr,
> > +                        char *buf)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       return sysfs_emit(buf, "%llu\n", cxlr->config.size);
> 
> Perhaps no need to store size separately if this becomes:
> 
> sysfs_emit(buf, "%llu\n", (unsigned long long) resource_size(cxlr->res));

iirc that broke the 80 character limit so I opted to do it this way. I can
change it.

> 
> 
> ...?
> 
> > +}
> > +
> > +static ssize_t size_store(struct device *dev, struct device_attribute *attr,
> > +                         const char *buf, size_t len)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       unsigned long long val;
> > +       ssize_t rc;
> > +
> > +       rc = kstrtoull(buf, 0, &val);
> > +       if (rc)
> > +               return rc;
> > +
> > +       device_lock(&cxlr->dev);
> > +       if (is_region_active(cxlr))
> > +               rc = -EBUSY;
> > +       else
> > +               cxlr->config.size = val;
> > +       device_unlock(&cxlr->dev);
> 
> I think lockdep will complain about device_lock() usage in an
> attribute. Try changing this to cxl_device_lock() with
> CONFIG_PROVE_CXL_LOCKING=y.
> 

I might have messed it up, but I didn't seem to run into an issue. With the
driver bound check though, it can go away.

I think it would be really good to add this kind of detail to sysfs.rst. Quick
grep finds me arm64/kernel/mte and the nfit driver taking the device lock in an
attr.


> > +
> > +       return rc ? rc : len;
> > +}
> > +static DEVICE_ATTR_RW(size);
> > +
> > +static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> > +                        char *buf)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
> > +}
> > +
> > +static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> > +                         const char *buf, size_t len)
> > +{
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +       ssize_t rc;
> > +
> > +       if (len != UUID_STRING_LEN + 1)
> > +               return -EINVAL;
> > +
> > +       device_lock(&cxlr->dev);
> > +       if (is_region_active(cxlr))
> > +               rc = -EBUSY;
> > +       else
> > +               rc = uuid_parse(buf, &cxlr->config.uuid);
> > +       device_unlock(&cxlr->dev);
> > +
> > +       return rc ? rc : len;
> > +}
> > +static DEVICE_ATTR_RW(uuid);
> > +
> > +static struct attribute *region_attrs[] = {
> > +       &dev_attr_interleave_ways.attr,
> > +       &dev_attr_interleave_granularity.attr,
> > +       &dev_attr_offset.attr,
> > +       &dev_attr_size.attr,
> > +       &dev_attr_uuid.attr,
> > +       NULL,
> > +};
> > +
> > +static const struct attribute_group region_group = {
> > +       .attrs = region_attrs,
> > +};
> > +
> > +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> > +{
> > +       int ret;
> > +
> > +       device_lock(&cxlr->dev);
> > +       if (!cxlr->config.targets[n])
> > +               ret = sysfs_emit(buf, "\n");
> > +       else
> > +               ret = sysfs_emit(buf, "%s\n",
> > +                                dev_name(&cxlr->config.targets[n]->dev));
> > +       device_unlock(&cxlr->dev);
> 
> The component contribution of a memdev to a region is a DPA-span, not
> the whole memdev. I would expect something like dax_mapping_attributes
> or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
> information about the component contribution of a memdev to a region.
> 

I think show_target should just return the chosen decoder and then the decoder
attributes will tell the rest, wouldn't they?

> > +
> > +       return ret;
> > +}
> > +
> > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > +                         size_t len)
> > +{
> > +       struct device *memdev_dev;
> > +       struct cxl_memdev *cxlmd;
> > +
> > +       device_lock(&cxlr->dev);
> > +
> > +       if (len == 1 || cxlr->config.targets[n])
> > +               remove_target(cxlr, n);
> > +
> > +       /* Remove target special case */
> > +       if (len == 1) {
> > +               device_unlock(&cxlr->dev);
> > +               return len;
> > +       }
> > +
> > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> 
> I think this wants to be an endpoint decoder, not a memdev. Because
> it's the decoder that joins a memdev to a region, or at least a
> decoder should be picked when the memdev is assigned so that the DPA
> mapping can be registered. If all the decoders are allocated then fail
> here.
>

Per above, I think making this decoders makes sense. I could make it flexible
for ease of use, like if you specify memX, the kernel will pick a decoder for
you however I suspect you won't like that.

> > +       if (!memdev_dev) {
> > +               device_unlock(&cxlr->dev);
> > +               return -ENOENT;
> > +       }
> > +
> > +       /* reference to memdev held until target is unset or region goes away */
> > +
> > +       cxlmd = to_cxl_memdev(memdev_dev);
> > +       cxlr->config.targets[n] = cxlmd;
> > +
> > +       device_unlock(&cxlr->dev);
> > +
> > +       return len;
> > +}
> > +
> > +#define TARGET_ATTR_RW(n)                                                      \
> > +       static ssize_t target##n##_show(                                       \
> > +               struct device *dev, struct device_attribute *attr, char *buf)  \
> > +       {                                                                      \
> > +               return show_targetN(to_cxl_region(dev), buf, (n));             \
> > +       }                                                                      \
> > +       static ssize_t target##n##_store(struct device *dev,                   \
> > +                                        struct device_attribute *attr,        \
> > +                                        const char *buf, size_t len)          \
> > +       {                                                                      \
> > +               return set_targetN(to_cxl_region(dev), buf, (n), len);         \
> > +       }                                                                      \
> > +       static DEVICE_ATTR_RW(target##n)
> > +
> > +TARGET_ATTR_RW(0);
> > +TARGET_ATTR_RW(1);
> > +TARGET_ATTR_RW(2);
> > +TARGET_ATTR_RW(3);
> > +TARGET_ATTR_RW(4);
> > +TARGET_ATTR_RW(5);
> > +TARGET_ATTR_RW(6);
> > +TARGET_ATTR_RW(7);
> > +TARGET_ATTR_RW(8);
> > +TARGET_ATTR_RW(9);
> > +TARGET_ATTR_RW(10);
> > +TARGET_ATTR_RW(11);
> > +TARGET_ATTR_RW(12);
> > +TARGET_ATTR_RW(13);
> > +TARGET_ATTR_RW(14);
> > +TARGET_ATTR_RW(15);
> > +
> > +static struct attribute *interleave_attrs[] = {
> > +       &dev_attr_target0.attr,
> > +       &dev_attr_target1.attr,
> > +       &dev_attr_target2.attr,
> > +       &dev_attr_target3.attr,
> > +       &dev_attr_target4.attr,
> > +       &dev_attr_target5.attr,
> > +       &dev_attr_target6.attr,
> > +       &dev_attr_target7.attr,
> > +       &dev_attr_target8.attr,
> > +       &dev_attr_target9.attr,
> > +       &dev_attr_target10.attr,
> > +       &dev_attr_target11.attr,
> > +       &dev_attr_target12.attr,
> > +       &dev_attr_target13.attr,
> > +       &dev_attr_target14.attr,
> > +       &dev_attr_target15.attr,
> > +       NULL,
> > +};
> > +
> > +static umode_t visible_targets(struct kobject *kobj, struct attribute *a, int n)
> > +{
> > +       struct device *dev = container_of(kobj, struct device, kobj);
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       if (n < cxlr->config.interleave_ways)
> > +               return a->mode;
> > +       return 0;
> > +}
> > +
> > +static const struct attribute_group region_interleave_group = {
> > +       .attrs = interleave_attrs,
> > +       .is_visible = visible_targets,
> > +};
> > +
> > +static const struct attribute_group *region_groups[] = {
> > +       &region_group,
> > +       &region_interleave_group,
> > +       NULL,
> > +};
> > +
> >  static void cxl_region_release(struct device *dev);
> >
> >  static const struct device_type cxl_region_type = {
> >         .name = "cxl_region",
> >         .release = cxl_region_release,
> > +       .groups = region_groups
> >  };
> >
> >  static ssize_t create_region_show(struct device *dev,
> > @@ -108,8 +405,11 @@ static void cxl_region_release(struct device *dev)
> >  {
> >         struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> >         struct cxl_region *cxlr = to_cxl_region(dev);
> > +       int i;
> >
> >         ida_free(&cxld->region_ida, cxlr->id);
> > +       for (i = 0; i < cxlr->config.interleave_ways; i++)
> > +               remove_target(cxlr, i);
> 
> Like the last patch this feels too late. I expect whatever unregisters
> the region should have already handled removing the targets.
> 

Would remove() be more appropriate?

> >         kfree(cxlr);
> >  }
> >
> > --
> > 2.35.0
> >

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 01/15] cxl/region: Add region creation ABI
  2022-02-17 17:58       ` Dan Williams
@ 2022-02-17 18:58         ` Ben Widawsky
  2022-02-17 20:26           ` Dan Williams
  2022-02-17 22:22         ` Ben Widawsky
  1 sibling, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-02-17 18:58 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On 22-02-17 09:58:04, Dan Williams wrote:
> On Thu, Feb 17, 2022 at 9:19 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > Regions are created as a child of the decoder that encompasses an
> > address space with constraints. Regions have a number of attributes that
> > must be configured before the region can be activated.
> >
> > The ABI is not meant to be secure, but is meant to avoid accidental
> > races. As a result, a buggy process may create a region by name that was
> > allocated by a different process. However, multiple processes which are
> > trying not to race with each other shouldn't need special
> > synchronization to do so.
> >
> > // Allocate a new region name
> > region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> >
> > // Create a new region by name
> > while
> > region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> > ! echo $region > /sys/bus/cxl/devices/decoder0.0/create_region
> > do true; done
> >
> > // Region now exists in sysfs
> > stat -t /sys/bus/cxl/devices/decoder0.0/$region
> >
> > // Delete the region, and name
> > echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region
> >
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> 
> Looking good, a few more fixes and cleanups identified below.
> 
> >
> > ---
> > Changes since v4:
> > - Add the missed base attributes addition
> >
> > ---
> >  Documentation/ABI/testing/sysfs-bus-cxl       |  23 ++
> >  .../driver-api/cxl/memory-devices.rst         |  11 +
> >  drivers/cxl/core/Makefile                     |   1 +
> >  drivers/cxl/core/core.h                       |   3 +
> >  drivers/cxl/core/port.c                       |  11 +
> >  drivers/cxl/core/region.c                     | 213 ++++++++++++++++++
> >  drivers/cxl/cxl.h                             |   5 +
> >  drivers/cxl/region.h                          |  23 ++
> >  tools/testing/cxl/Kbuild                      |   1 +
> >  9 files changed, 291 insertions(+)
> >  create mode 100644 drivers/cxl/core/region.c
> >  create mode 100644 drivers/cxl/region.h
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > index 7c2b846521f3..e5db45ea70ad 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > @@ -163,3 +163,26 @@ Description:
> >                 memory (type-3). The 'target_type' attribute indicates the
> >                 current setting which may dynamically change based on what
> >                 memory regions are activated in this decode hierarchy.
> > +
> > +What:          /sys/bus/cxl/devices/decoderX.Y/create_region
> > +Date:          January, 2022
> > +KernelVersion: v5.18
> > +Contact:       linux-cxl@vger.kernel.org
> > +Description:
> > +               Write a value of the form 'regionX.Y:Z' to instantiate a new
> > +               region within the decode range bounded by decoderX.Y. The value
> > +               written must match the current value returned from reading this
> > +               attribute. This behavior lets the kernel arbitrate racing
> > +               attempts to create a region. The thread that fails to write
> > +               loops and tries the next value. Regions must be created for root
> > +               decoders, and must subsequently configured and bound to a region
> > +               driver before they can be used.
> > +
> > +What:          /sys/bus/cxl/devices/decoderX.Y/delete_region
> > +Date:          January, 2022
> > +KernelVersion: v5.18
> > +Contact:       linux-cxl@vger.kernel.org
> > +Description:
> > +               Deletes the named region.  The attribute expects a region in the
> > +               form "regionX.Y:Z". The region's name, allocated by reading
> > +               create_region, will also be released.
> > diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst
> > index db476bb170b6..66ddc58a21b1 100644
> > --- a/Documentation/driver-api/cxl/memory-devices.rst
> > +++ b/Documentation/driver-api/cxl/memory-devices.rst
> > @@ -362,6 +362,17 @@ CXL Core
> >  .. kernel-doc:: drivers/cxl/core/mbox.c
> >     :doc: cxl mbox
> >
> > +CXL Regions
> > +-----------
> > +.. kernel-doc:: drivers/cxl/region.h
> > +   :identifiers:
> > +
> > +.. kernel-doc:: drivers/cxl/core/region.c
> > +   :doc: cxl core region
> > +
> > +.. kernel-doc:: drivers/cxl/core/region.c
> > +   :identifiers:
> > +
> >  External Interfaces
> >  ===================
> >
> > diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> > index 6d37cd78b151..39ce8f2f2373 100644
> > --- a/drivers/cxl/core/Makefile
> > +++ b/drivers/cxl/core/Makefile
> > @@ -4,6 +4,7 @@ obj-$(CONFIG_CXL_BUS) += cxl_core.o
> >  ccflags-y += -I$(srctree)/drivers/cxl
> >  cxl_core-y := port.o
> >  cxl_core-y += pmem.o
> > +cxl_core-y += region.o
> >  cxl_core-y += regs.o
> >  cxl_core-y += memdev.o
> >  cxl_core-y += mbox.o
> > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> > index 1a50c0fc399c..adfd42370b28 100644
> > --- a/drivers/cxl/core/core.h
> > +++ b/drivers/cxl/core/core.h
> > @@ -9,6 +9,9 @@ extern const struct device_type cxl_nvdimm_type;
> >
> >  extern struct attribute_group cxl_base_attribute_group;
> >
> > +extern struct device_attribute dev_attr_create_region;
> > +extern struct device_attribute dev_attr_delete_region;
> > +
> >  struct cxl_send_command;
> >  struct cxl_mem_query_commands;
> >  int cxl_query_cmd(struct cxl_memdev *cxlmd,
> > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > index 1e785a3affaa..860e91cae29b 100644
> > --- a/drivers/cxl/core/port.c
> > +++ b/drivers/cxl/core/port.c
> > @@ -213,6 +213,8 @@ static struct attribute_group cxl_decoder_base_attribute_group = {
> >  };
> >
> >  static struct attribute *cxl_decoder_root_attrs[] = {
> > +       &dev_attr_create_region.attr,
> > +       &dev_attr_delete_region.attr,
> >         &dev_attr_cap_pmem.attr,
> >         &dev_attr_cap_ram.attr,
> >         &dev_attr_cap_type2.attr,
> > @@ -270,6 +272,8 @@ static void cxl_decoder_release(struct device *dev)
> >         struct cxl_decoder *cxld = to_cxl_decoder(dev);
> >         struct cxl_port *port = to_cxl_port(dev->parent);
> >
> > +       ida_free(&cxld->region_ida, cxld->next_region_id);
> > +       ida_destroy(&cxld->region_ida);
> >         ida_free(&port->decoder_ida, cxld->id);
> >         kfree(cxld);
> >  }
> > @@ -1244,6 +1248,13 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
> >         cxld->target_type = CXL_DECODER_EXPANDER;
> >         cxld->platform_res = (struct resource)DEFINE_RES_MEM(0, 0);
> >
> > +       mutex_init(&cxld->id_lock);
> > +       ida_init(&cxld->region_ida);
> > +       rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);
> > +       if (rc < 0)
> > +               goto err;
> > +
> > +       cxld->next_region_id = rc;
> >         return cxld;
> >  err:
> >         kfree(cxld);
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > new file mode 100644
> > index 000000000000..5576952e4aa1
> > --- /dev/null
> > +++ b/drivers/cxl/core/region.c
> > @@ -0,0 +1,213 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Copyright(c) 2022 Intel Corporation. All rights reserved. */
> > +#include <linux/device.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <linux/idr.h>
> > +#include <region.h>
> > +#include <cxl.h>
> > +#include "core.h"
> > +
> > +/**
> > + * DOC: cxl core region
> > + *
> > + * CXL Regions represent mapped memory capacity in system physical address
> > + * space. Whereas the CXL Root Decoders identify the bounds of potential CXL
> > + * Memory ranges, Regions represent the active mapped capacity by the HDM
> > + * Decoder Capability structures throughout the Host Bridges, Switches, and
> > + * Endpoints in the topology.
> > + */
> > +
> > +static void cxl_region_release(struct device *dev);
> 
> Why forward declare this versus move cxl_region_type after the definition?
> 
> No other CXL object release functions are forward declared.
> 

I liked having the device_type declared at the top. I will change it to match
the other code.

> > +
> > +static const struct device_type cxl_region_type = {
> > +       .name = "cxl_region",
> > +       .release = cxl_region_release,
> > +};
> > +
> > +static struct cxl_region *to_cxl_region(struct device *dev)
> > +{
> > +       if (dev_WARN_ONCE(dev, dev->type != &cxl_region_type,
> > +                         "not a cxl_region device\n"))
> > +               return NULL;
> > +
> > +       return container_of(dev, struct cxl_region, dev);
> > +}
> > +
> > +static struct cxl_region *cxl_region_alloc(struct cxl_decoder *cxld)
> > +{
> > +       struct cxl_region *cxlr;
> > +       struct device *dev;
> > +
> > +       cxlr = kzalloc(sizeof(*cxlr), GFP_KERNEL);
> > +       if (!cxlr)
> > +               return ERR_PTR(-ENOMEM);
> > +
> > +       dev = &cxlr->dev;
> > +       device_initialize(dev);
> > +       dev->parent = &cxld->dev;
> > +       device_set_pm_not_required(dev);
> > +       dev->bus = &cxl_bus_type;
> > +       dev->type = &cxl_region_type;
> > +
> > +       return cxlr;
> > +}
> > +
> > +static void unregister_region(void *_cxlr)
> > +{
> > +       struct cxl_region *cxlr = _cxlr;
> > +
> > +       if (!test_and_set_bit(REGION_DEAD, &cxlr->flags))
> > +               device_unregister(&cxlr->dev);
> 
> I thought REGION_DEAD was needed to prevent double
> devm_release_action(), not double unregister?
> 

I believe that's correct, repeating what you said on our internal list:

On 22-02-14 14:11:41, Dan Williams wrote:
  True, you do need to solve the race between multiple writers racing to
  do the unregistration, but that could be done with something like:

  if (!test_and_set_bit(REGION_DEAD, &cxlr->flags))
      device_unregister(&cxlr->dev);

So I was just trying to implement what you said. Remainder of the discussion
below...

> > +}
> > +
> > +/**
> > + * devm_cxl_add_region - Adds a region to a decoder
> > + * @cxld: Parent decoder.
> > + * @cxlr: Region to be added to the decoder.
> > + *
> > + * This is the second step of region initialization. Regions exist within an
> > + * address space which is mapped by a @cxld. That @cxld must be a root decoder,
> > + * and it enforces constraints upon the region as it is configured.
> > + *
> > + * Return: 0 if the region was added to the @cxld, else returns negative error
> > + * code. The region will be named "regionX.Y.Z" where X is the port, Y is the
> > + * decoder id, and Z is the region number.
> > + */
> > +static struct cxl_region *devm_cxl_add_region(struct cxl_decoder *cxld)
> > +{
> > +       struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> > +       struct cxl_region *cxlr;
> > +       struct device *dev;
> > +       int rc;
> > +
> > +       cxlr = cxl_region_alloc(cxld);
> > +       if (IS_ERR(cxlr))
> > +               return cxlr;
> > +
> > +       dev = &cxlr->dev;
> > +
> > +       cxlr->id = cxld->next_region_id;
> > +       rc = dev_set_name(dev, "region%d.%d:%d", port->id, cxld->id, cxlr->id);
> > +       if (rc)
> > +               goto err_out;
> > +
> > +       /* affirm that release will have access to the decoder's region ida  */
> > +       get_device(&cxld->dev);
> > +
> > +       rc = device_add(dev);
> > +       if (!rc)
> > +               rc = devm_add_action_or_reset(port->uport, unregister_region,
> > +                                             cxlr);
> > +       if (rc)
> > +               goto err_out;
> 
> All the other usages in device_add() in the subsystem follow the style of:
> 
> rc = device_add(dev);
> if (rc)
>     goto err;
> 
> ...any reason to be unique here and indent the success case?
> 

It allowed skipping a goto. I can change it.

> 
> > +
> > +       return cxlr;
> > +
> > +err_out:
> > +       put_device(dev);
> > +       kfree(cxlr);
> 
> This is a double-free of cxlr;
> 

Because of release()? How does release get called if the region device wasn't
added? Or is there something else?

> > +       return ERR_PTR(rc);
> > +}
> > +
> > +static ssize_t create_region_show(struct device *dev,
> > +                                 struct device_attribute *attr, char *buf)
> > +{
> > +       struct cxl_port *port = to_cxl_port(dev->parent);
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > +
> > +       return sysfs_emit(buf, "region%d.%d:%d\n", port->id, cxld->id,
> > +                         cxld->next_region_id);
> > +}
> > +
> > +static ssize_t create_region_store(struct device *dev,
> > +                                  struct device_attribute *attr,
> > +                                  const char *buf, size_t len)
> > +{
> > +       struct cxl_port *port = to_cxl_port(dev->parent);
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > +       struct cxl_region *cxlr;
> > +       int d, p, r, rc = 0;
> > +
> > +       if (sscanf(buf, "region%d.%d:%d", &p, &d, &r) != 3)
> > +               return -EINVAL;
> > +
> > +       if (port->id != p || cxld->id != d)
> > +               return -EINVAL;
> > +
> > +       rc = mutex_lock_interruptible(&cxld->id_lock);
> > +       if (rc)
> > +               return rc;
> > +
> > +       if (cxld->next_region_id != r) {
> > +               rc = -EINVAL;
> > +               goto out;
> > +       }
> > +
> > +       rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);
> > +       if (rc < 0) {
> > +               dev_dbg(dev, "Failed to get next cached id (%d)\n", rc);
> > +               goto out;
> > +       }
> > +
> > +       cxlr = devm_cxl_add_region(cxld);
> > +       if (IS_ERR(cxlr)) {
> > +               rc = PTR_ERR(cxlr);
> > +               goto out;
> > +       }
> > +
> > +       cxld->next_region_id = rc;
> 
> This looks like a leak in the case when devm_cxl_add_region() fails,
> so just move it above that call.
> 
> > +       dev_dbg(dev, "Created %s\n", dev_name(&cxlr->dev));
> > +
> > +out:
> > +       mutex_unlock(&cxld->id_lock);
> > +       return rc ? rc : len;
> 
> if (rc)
>     return rc;
> return len;
> 

Yeah. This was a bug I pointed out already. I have it fixed locally.

> > +}
> > +DEVICE_ATTR_RW(create_region);
> > +
> > +static struct cxl_region *cxl_find_region_by_name(struct cxl_decoder *cxld,
> > +                                                 const char *name)
> > +{
> > +       struct device *region_dev;
> > +
> > +       region_dev = device_find_child_by_name(&cxld->dev, name);
> > +       if (!region_dev)
> > +               return ERR_PTR(-ENOENT);
> > +
> > +       return to_cxl_region(region_dev);
> > +}
> > +
> > +static ssize_t delete_region_store(struct device *dev,
> > +                                  struct device_attribute *attr,
> > +                                  const char *buf, size_t len)
> > +{
> > +       struct cxl_port *port = to_cxl_port(dev->parent);
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > +       struct cxl_region *cxlr;
> > +
> > +       cxlr = cxl_find_region_by_name(cxld, buf);
> > +       if (IS_ERR(cxlr))
> > +               return PTR_ERR(cxlr);
> > +
> > +       /* After this, the region is no longer a child of the decoder. */
> > +       devm_release_action(port->uport, unregister_region, cxlr);
> 
> This may trigger a WARN in the case where 2 threads race to trigger
> the release action. I think the DEAD check is needed to gate this
> call, not device_unregister().

Continuing from above use of REGION_DEAD....

Dang. I actually had added a mutex to protect this and I cherry-picked the wrong
patch.

This looks good to me though:

cxlr = cxl_find_region_by_name(cxld, buf);
if (IS_ERR(cxlr))
	return PTR_ERR(cxlr);

if (!test_and_set_bit(REGION_DEAD, &cxlr->flags)) {
	devm_release_action(...)
}

put_device(dev);
return len;

> 
> > +
> > +       /* Release is likely called here, so cxlr is not safe to reference. */
> 
> This is always the case with any put_device(), so no need for this comment.
> 
> > +       put_device(&cxlr->dev);
> > +       cxlr = NULL;
> 
> This NULL assignment has no value.

It was meant to prevent future me from trying to use cxlr after this point,
which I've done. I can remove it.

> 
> > +
> > +       dev_dbg(dev, "Deleted %s\n", buf);
> 
> Not sure a debug statement is needed for something userspace can
> directly view itself with the result code from the sysfs write.
> 

It was helpful for my driver development, but I can remove it.

> > +       return len;
> > +}
> > +DEVICE_ATTR_WO(delete_region);
> > +
> > +static void cxl_region_release(struct device *dev)
> > +{
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       dev_dbg(&cxld->dev, "Releasing %s\n", dev_name(dev));
> > +       ida_free(&cxld->region_ida, cxlr->id);
> > +       kfree(cxlr);
> > +       put_device(&cxld->dev);
> > +}
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index b4047a310340..d5397f7dfcf4 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -221,6 +221,8 @@ enum cxl_decoder_type {
> >   * @target_type: accelerator vs expander (type2 vs type3) selector
> >   * @flags: memory type capabilities and locking
> >   * @target_lock: coordinate coherent reads of the target list
> > + * @region_ida: allocator for region ids.
> > + * @next_region_id: Cached region id for next region.
> >   * @nr_targets: number of elements in @target
> >   * @target: active ordered target list in current decoder configuration
> >   */
> > @@ -236,6 +238,9 @@ struct cxl_decoder {
> >         enum cxl_decoder_type target_type;
> >         unsigned long flags;
> >         seqlock_t target_lock;
> > +       struct mutex id_lock;
> > +       struct ida region_ida;
> > +       int next_region_id;
> >         int nr_targets;
> >         struct cxl_dport *target[];
> >  };
> > diff --git a/drivers/cxl/region.h b/drivers/cxl/region.h
> > new file mode 100644
> > index 000000000000..0016f83bbdfd
> > --- /dev/null
> > +++ b/drivers/cxl/region.h
> > @@ -0,0 +1,23 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/* Copyright(c) 2021 Intel Corporation. */
> > +#ifndef __CXL_REGION_H__
> > +#define __CXL_REGION_H__
> > +
> > +#include <linux/uuid.h>
> > +
> > +#include "cxl.h"
> > +
> > +/**
> > + * struct cxl_region - CXL region
> > + * @dev: This region's device.
> > + * @id: This region's id. Id is globally unique across all regions.
> > + * @flags: Flags representing the current state of the region.
> > + */
> > +struct cxl_region {
> > +       struct device dev;
> > +       int id;
> > +       unsigned long flags;
> > +#define REGION_DEAD 0
> > +};
> > +
> > +#endif
> > diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
> > index 82e49ab0937d..3fe6d34e6d59 100644
> > --- a/tools/testing/cxl/Kbuild
> > +++ b/tools/testing/cxl/Kbuild
> > @@ -46,6 +46,7 @@ cxl_core-y += $(CXL_CORE_SRC)/memdev.o
> >  cxl_core-y += $(CXL_CORE_SRC)/mbox.o
> >  cxl_core-y += $(CXL_CORE_SRC)/pci.o
> >  cxl_core-y += $(CXL_CORE_SRC)/hdm.o
> > +cxl_core-y += $(CXL_CORE_SRC)/region.o
> >  cxl_core-y += config_check.o
> >
> >  obj-m += test/
> > --
> > 2.35.1
> >

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 02/14] cxl/region: Introduce concept of region configuration
  2022-02-17 18:36     ` Ben Widawsky
@ 2022-02-17 19:57       ` Dan Williams
  2022-02-17 20:20         ` Ben Widawsky
  2022-02-23 21:49         ` Ben Widawsky
  0 siblings, 2 replies; 70+ messages in thread
From: Dan Williams @ 2022-02-17 19:57 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Bjorn Helgaas,
	Linux NVDIMM, Linux PCI

On Thu, Feb 17, 2022 at 10:36 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> Consolidating earlier discussions...
>
> On 22-01-28 16:25:34, Dan Williams wrote:
> > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > >
> > > The region creation APIs create a vacant region. Configuring the region
> > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > will be provided to allow userspace to configure the region.  Finally
> > > once all configuration is complete, userspace may activate the region.
> > >
> > > Introduced here are the most basic attributes needed to configure a
> > > region. Details of these attribute are described in the ABI
> >
> > s/attribute/attributes/
> >
> > > Documentation. Sanity checking of configuration parameters are done at
> > > region binding time. This consolidates all such logic in one place,
> > > rather than being strewn across multiple places.
> >
> > I think that's too late for some of the validation. The complex
> > validation that the region driver does throughout the topology is
> > different from the basic input validation that can  be done at the
> > sysfs write time. For example ,this patch allows negative
> > interleave_granularity values to specified, just return -EINVAL. I
> > agree that sysfs should not validate everything, I disagree with
> > pushing all validation to cxl_region_probe().
> >
>
> Okay. It might save us some back and forth if you could outline everything you'd
> expect to be validated, but I can also make an attempt to figure out the
> reasonable set of things.

Input validation. Every value that gets written to a sysfs attribute
should be checked for validity, more below:

>
> > >
> > > A example is provided below:
> > >
> > > /sys/bus/cxl/devices/region0.0:0
> > > ├── interleave_granularity

...validate granularity is within spec and can be supported by the root decoder.

> > > ├── interleave_ways

...validate ways is within spec and can be supported by the root decoder.

> > > ├── offset
> > > ├── size

...try to reserve decoder capacity to validate that there is available space.

> > > ├── subsystem -> ../../../../../../bus/cxl
> > > ├── target0

...validate that the target maps to the decoder.

> > > ├── uevent
> > > └── uuid

...validate that the uuid is unique relative to other regions.

> >
> > As mentioned off-list, it looks like devtype and modalias are missing.
> >
>
> Yep. This belongs in the previous patch though.
>
> > >
> > > Reported-by: kernel test robot <lkp@intel.com> (v2)
> > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > > ---
> > >  Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
> > >  drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
> > >  2 files changed, 340 insertions(+)
> > >
> > > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > > index dcc728458936..50ba5018014d 100644
> > > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > > @@ -187,3 +187,43 @@ Description:
> > >                 region driver before being deleted. The attributes expects a
> > >                 region in the form "regionX.Y:Z". The region's name, allocated
> > >                 by reading create_region, will also be released.
> > > +
> > > +What:          /sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset
> >
> > This is just another 'resource' attribute for the physical base
> > address of the region, right? 'offset' sounds like something that
> > would be relative instead of absolute.
> >
>
> It was meant to be relative. I can make it absolute if that's preferable but the
> physical base is known at the decoder level already.

Yes, but it saves userspace a step to get the absolute value here and
matches what happens in PCI sysfs where the user is not required to
look up the bridge base to calculate the absolute value.

>
> > > +Date:          August, 2021
> >
> > Same date update comment here.
> >
> > > +KernelVersion: v5.18
> > > +Contact:       linux-cxl@vger.kernel.org
> > > +Description:
> > > +               (RO) A region resides within an address space that is claimed by
> > > +               a decoder.
> >
> > "A region is a contiguous partition of a CXL Root decoder address space."
> >
> > >                  Region space allocation is handled by the driver, but
> >
> > "Region capacity is allocated by writing to the size attribute, the
> > resulting physical address base determined by the driver is reflected
> > here."
> >
> > > +               the offset may be read by userspace tooling in order to
> > > +               determine fragmentation, and available size for new regions.
> >
> > I would also expect, before / along with these new region attributes,
> > there would be 'available' and 'max_extent_available' at the decoder
> > level to indicate how much free space the decoder has and how big the
> > next region creation can be. User tooling can walk  the decoder and
> > the regions together to determine fragmentation if necessary, but for
> > the most part the tool likely only cares about "how big can the next
> > region be?" and "how full is this decoder?".
> >
>
> Since this is the configuration part of the ABI, I'd rather add that information
> when the plumbing to report them exists. I'm struggling to understand the
> balance (as mentioned also earlier in this mail thread) as to what userspace
> does and what the kernel does. I will add these as you request.

Userspace asks by DPA size and SPA size and the kernel validates /
performs the allocation on its behalf.

> > > +
> > > +What:
> > > +/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
> > > +Date:          August, 2021
> > > +KernelVersion: v5.18
> > > +Contact:       linux-cxl@vger.kernel.org
> > > +Description:
> > > +               (RW) Configuring regions requires a minimal set of parameters in
> > > +               order for the subsequent bind operation to succeed. The
> > > +               following parameters are defined:
> >
> > Let's split up the descriptions into individual sections. That can
> > also document the order that attributes must be written. For example,
> > doesn't size need to be set before targets are added so that targets
> > can be validated whether they have sufficient capacity?
> >
>
> Okay. Since we're moving toward making the sysfs ABI stateful,

sysfs is always stateful. Stateless would be an ioctl.

> would you like me
> to make the attrs only visible when they can actually be set?

No, that's a bit too much magic, and it would be racy.

>
> > > +
> > > +               ==      ========================================================
> > > +               interleave_granularity Mandatory. Number of consecutive bytes
> > > +                       each device in the interleave set will claim. The
> > > +                       possible interleave granularity values are determined by
> > > +                       the CXL spec and the participating devices.
> > > +               interleave_ways Mandatory. Number of devices participating in the
> > > +                       region. Each device will provide 1/interleave of storage
> > > +                       for the region.
> > > +               size    Manadatory. Phsyical address space the region will
> > > +                       consume.
> >
> > s/Phsyical/Physical/
> >
> > > +               target  Mandatory. Memory devices are the backing storage for a
> > > +                       region. There will be N targets based on the number of
> > > +                       interleave ways that the top level decoder is configured
> > > +                       for.
> >
> > That doesn't sound right, IW at the root != IW at the endpoint level
> > and the region needs to record all the endpoint level targets.
>
> Correct.
>
> >
> > > Each target must be set with a memdev device ie.
> > > +                       'mem1'. This attribute only becomes available after
> > > +                       setting the 'interleave' attribute.
> > > +               uuid    Optional. A unique identifier for the region. If none is
> > > +                       selected, the kernel will create one.
> >
> > Let's drop the Mandatory / Optional distinction, or I am otherwise not
> > understanding what this is trying to document. For example 'uuid' is
> > "mandatory" for PMEM regions and "omitted" for volatile regions not
> > optional.
>
> Okay.
>
> >
> > > +               ==      ========================================================
> > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > > index 1a448543db0d..3b48e0469fc7 100644
> > > --- a/drivers/cxl/core/region.c
> > > +++ b/drivers/cxl/core/region.c
> > > @@ -3,9 +3,12 @@
> > >  #include <linux/io-64-nonatomic-lo-hi.h>
> > >  #include <linux/device.h>
> > >  #include <linux/module.h>
> > > +#include <linux/sizes.h>
> > >  #include <linux/slab.h>
> > > +#include <linux/uuid.h>
> > >  #include <linux/idr.h>
> > >  #include <region.h>
> > > +#include <cxlmem.h>
> > >  #include <cxl.h>
> > >  #include "core.h"
> > >
> > > @@ -18,11 +21,305 @@
> > >   * (programming the hardware) is handled by a separate region driver.
> > >   */
> > >
> > > +struct cxl_region *to_cxl_region(struct device *dev);
> > > +static const struct attribute_group region_interleave_group;
> > > +
> > > +static bool is_region_active(struct cxl_region *cxlr)
> > > +{
> > > +       /* TODO: Regions can't be activated yet. */
> > > +       return false;
> >
> > This function seems redundant with just checking "cxlr->dev.driver !=
> > NULL"? The benefit of that is there is no need to carry a TODO in the
> > series.
> >
>
> The idea behind this was to give the reviewer somewhat of a bigger picture as to
> how things should work in the code rather than in a commit message. I will
> remove this.

They look premature to me.

>
> > > +}
> > > +
> > > +static void remove_target(struct cxl_region *cxlr, int target)
> > > +{
> > > +       struct cxl_memdev *cxlmd;
> > > +
> > > +       cxlmd = cxlr->config.targets[target];
> > > +       if (cxlmd)
> > > +               put_device(&cxlmd->dev);
> >
> > A memdev can be a member of multiple regions at once, shouldn't this
> > be an endpoint decoder or similar, not the entire memdev?
> >
> > Also, if memdevs autoremove themselves from regions at memdev
> > ->remove() time then I don't think the region needs to hold references
> > on memdevs.
> >
>
> This needs some work. The concern I have is region operations will need to
> operate on memdevs/decoders at various points in time. When the memdev goes
> away, the region will also need to go away. None of that plumbing was in place
> in v3 and the reference on the memdev was just a half-hearted attempt at doing
> the right thing.
>
> For now if you prefer I remove the reference, but perhaps the decoder reference
> would buy us some safety?

So, I don't want to merge an interim solution. I think this series
needs to prove out the end to end final ABI with all the lifetime
issues worked out before committing to it upstream because lifetime
issues get much harder to fix when they also need to conform to a
legacy ABI.

>
> > > +       cxlr->config.targets[target] = NULL;
> > > +}
> > > +
> > > +static ssize_t interleave_ways_show(struct device *dev,
> > > +                                   struct device_attribute *attr, char *buf)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +
> > > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
> > > +}
> > > +
> > > +static ssize_t interleave_ways_store(struct device *dev,
> > > +                                    struct device_attribute *attr,
> > > +                                    const char *buf, size_t len)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +       int ret, prev_iw;
> > > +       int val;
> >
> > I would expect:
> >
> > if (dev->driver)
> >    return -EBUSY;
> >
> > ...to shutdown configuration writes once the region is active. Might
> > also need a region-wide seqlock like target_list_show. So that region
> > probe drains  all active sysfs writers before assuming the
> > configuration is stable.
>
> Initially my thought here is that this is a problem for userspace to deal with.
> If userspace can't figure out how to synchronously configure and bind the
> region, that's not a kernel problem.

The kernel always needs to protect itself. Userspace is free to race
itself, but it can not be allowed to trigger a kernel race. So there
needs to be protection against userspace writing interleave_ways and
the kernel being able to trust that interleave_ways is now static for
the life of the region.

> However, we've put some effort into
> protecting userspace from itself in the create ABI, so it might be more in line
> to do that here.

That safety was about preventing userspace from leaking kernel memory,
not about protecting userspace from itself. It's still the case that
userspace racing itself will get horribly confused when it collides
region creation, but the kernel protects itself by resolving the race.

> In summary, I'm fine to add it, but I think I really need to get more in your
> brain about the userspace/kernel divide sooner rather than later.

Don't let userspace break the kernel, that's it.

>
> >
> > > +
> > > +       prev_iw = cxlr->config.interleave_ways;
> > > +       ret = kstrtoint(buf, 0, &val);
> > > +       if (ret)
> > > +               return ret;
> > > +       if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
> > > +               return -EINVAL;
> > > +
> > > +       cxlr->config.interleave_ways = val;
> > > +
> > > +       ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
> > > +       if (ret < 0)
> > > +               goto err;
> > > +
> > > +       sysfs_notify(&dev->kobj, NULL, "target_interleave");
> >
> > Why?
> >
>
> copypasta
>
> > > +
> > > +       while (prev_iw > cxlr->config.interleave_ways)
> > > +               remove_target(cxlr, --prev_iw);
> >
> > To make the kernel side simpler this attribute could just require that
> > setting interleave ways is a one way street, if you want to change it
> > you need to delete the region and start over.
> >
>
> Okay. One of the earlier versions did this implicitly since the #ways was needed
> to create the region. I thought from the ABI perspective, flexibility was good.
> Userspace may choose not to utilize it.

More flexibility == more maintenance burden. If it's not strictly
necessary, don't expose it, so making this read-only seems simpler to
me.

[..]
> > > +       device_lock(&cxlr->dev);
> > > +       if (is_region_active(cxlr))
> > > +               rc = -EBUSY;
> > > +       else
> > > +               cxlr->config.size = val;
> > > +       device_unlock(&cxlr->dev);
> >
> > I think lockdep will complain about device_lock() usage in an
> > attribute. Try changing this to cxl_device_lock() with
> > CONFIG_PROVE_CXL_LOCKING=y.
> >
>
> I might have messed it up, but I didn't seem to run into an issue. With the
> driver bound check though, it can go away.
>
> I think it would be really good to add this kind of detail to sysfs.rst. Quick
> grep finds me arm64/kernel/mte and the nfit driver taking the device lock in an
> attr.

Yeah, CONFIG_PROVE_{NVDIMM,CXL}_LOCKING needs to annotate the
driver-core as well. I'm concerned there's a class of deadlocks that
lockdep just can't see.

>
>
> > > +
> > > +       return rc ? rc : len;
> > > +}
> > > +static DEVICE_ATTR_RW(size);
> > > +
> > > +static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> > > +                        char *buf)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +
> > > +       return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
> > > +}
> > > +
> > > +static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> > > +                         const char *buf, size_t len)
> > > +{
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +       ssize_t rc;
> > > +
> > > +       if (len != UUID_STRING_LEN + 1)
> > > +               return -EINVAL;
> > > +
> > > +       device_lock(&cxlr->dev);
> > > +       if (is_region_active(cxlr))
> > > +               rc = -EBUSY;
> > > +       else
> > > +               rc = uuid_parse(buf, &cxlr->config.uuid);
> > > +       device_unlock(&cxlr->dev);
> > > +
> > > +       return rc ? rc : len;
> > > +}
> > > +static DEVICE_ATTR_RW(uuid);
> > > +
> > > +static struct attribute *region_attrs[] = {
> > > +       &dev_attr_interleave_ways.attr,
> > > +       &dev_attr_interleave_granularity.attr,
> > > +       &dev_attr_offset.attr,
> > > +       &dev_attr_size.attr,
> > > +       &dev_attr_uuid.attr,
> > > +       NULL,
> > > +};
> > > +
> > > +static const struct attribute_group region_group = {
> > > +       .attrs = region_attrs,
> > > +};
> > > +
> > > +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> > > +{
> > > +       int ret;
> > > +
> > > +       device_lock(&cxlr->dev);
> > > +       if (!cxlr->config.targets[n])
> > > +               ret = sysfs_emit(buf, "\n");
> > > +       else
> > > +               ret = sysfs_emit(buf, "%s\n",
> > > +                                dev_name(&cxlr->config.targets[n]->dev));
> > > +       device_unlock(&cxlr->dev);
> >
> > The component contribution of a memdev to a region is a DPA-span, not
> > the whole memdev. I would expect something like dax_mapping_attributes
> > or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
> > information about the component contribution of a memdev to a region.
> >
>
> I think show_target should just return the chosen decoder and then the decoder
> attributes will tell the rest, wouldn't they?

Given the conflicts that can arise between HDM decoders needing to map
increasing DPA values and other conflicts that there will be
situations where the kernel auto-picking a decoder will get in the
way. Exposing the decoder selection to userspace also gives one more
place to do leaf validation. I.e. at decoder-to-region assignment time
the kernel can validate that the DPA is available and can be mapped by
the given decoder given the state of other decoders on that device.

>
> > > +
> > > +       return ret;
> > > +}
> > > +
> > > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > > +                         size_t len)
> > > +{
> > > +       struct device *memdev_dev;
> > > +       struct cxl_memdev *cxlmd;
> > > +
> > > +       device_lock(&cxlr->dev);
> > > +
> > > +       if (len == 1 || cxlr->config.targets[n])
> > > +               remove_target(cxlr, n);
> > > +
> > > +       /* Remove target special case */
> > > +       if (len == 1) {
> > > +               device_unlock(&cxlr->dev);
> > > +               return len;
> > > +       }
> > > +
> > > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> >
> > I think this wants to be an endpoint decoder, not a memdev. Because
> > it's the decoder that joins a memdev to a region, or at least a
> > decoder should be picked when the memdev is assigned so that the DPA
> > mapping can be registered. If all the decoders are allocated then fail
> > here.
> >
>
> Per above, I think making this decoders makes sense. I could make it flexible
> for ease of use, like if you specify memX, the kernel will pick a decoder for
> you however I suspect you won't like that.

Right, put the user friendliness in the tooling, not sysfs ABI.

>
> > > +       if (!memdev_dev) {
> > > +               device_unlock(&cxlr->dev);
> > > +               return -ENOENT;
> > > +       }
> > > +
> > > +       /* reference to memdev held until target is unset or region goes away */
> > > +
> > > +       cxlmd = to_cxl_memdev(memdev_dev);
> > > +       cxlr->config.targets[n] = cxlmd;
> > > +
> > > +       device_unlock(&cxlr->dev);
> > > +
> > > +       return len;
> > > +}
> > > +
> > > +#define TARGET_ATTR_RW(n)                                                      \
> > > +       static ssize_t target##n##_show(                                       \
> > > +               struct device *dev, struct device_attribute *attr, char *buf)  \
> > > +       {                                                                      \
> > > +               return show_targetN(to_cxl_region(dev), buf, (n));             \
> > > +       }                                                                      \
> > > +       static ssize_t target##n##_store(struct device *dev,                   \
> > > +                                        struct device_attribute *attr,        \
> > > +                                        const char *buf, size_t len)          \
> > > +       {                                                                      \
> > > +               return set_targetN(to_cxl_region(dev), buf, (n), len);         \
> > > +       }                                                                      \
> > > +       static DEVICE_ATTR_RW(target##n)
> > > +
> > > +TARGET_ATTR_RW(0);
> > > +TARGET_ATTR_RW(1);
> > > +TARGET_ATTR_RW(2);
> > > +TARGET_ATTR_RW(3);
> > > +TARGET_ATTR_RW(4);
> > > +TARGET_ATTR_RW(5);
> > > +TARGET_ATTR_RW(6);
> > > +TARGET_ATTR_RW(7);
> > > +TARGET_ATTR_RW(8);
> > > +TARGET_ATTR_RW(9);
> > > +TARGET_ATTR_RW(10);
> > > +TARGET_ATTR_RW(11);
> > > +TARGET_ATTR_RW(12);
> > > +TARGET_ATTR_RW(13);
> > > +TARGET_ATTR_RW(14);
> > > +TARGET_ATTR_RW(15);
> > > +
> > > +static struct attribute *interleave_attrs[] = {
> > > +       &dev_attr_target0.attr,
> > > +       &dev_attr_target1.attr,
> > > +       &dev_attr_target2.attr,
> > > +       &dev_attr_target3.attr,
> > > +       &dev_attr_target4.attr,
> > > +       &dev_attr_target5.attr,
> > > +       &dev_attr_target6.attr,
> > > +       &dev_attr_target7.attr,
> > > +       &dev_attr_target8.attr,
> > > +       &dev_attr_target9.attr,
> > > +       &dev_attr_target10.attr,
> > > +       &dev_attr_target11.attr,
> > > +       &dev_attr_target12.attr,
> > > +       &dev_attr_target13.attr,
> > > +       &dev_attr_target14.attr,
> > > +       &dev_attr_target15.attr,
> > > +       NULL,
> > > +};
> > > +
> > > +static umode_t visible_targets(struct kobject *kobj, struct attribute *a, int n)
> > > +{
> > > +       struct device *dev = container_of(kobj, struct device, kobj);
> > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > +
> > > +       if (n < cxlr->config.interleave_ways)
> > > +               return a->mode;
> > > +       return 0;
> > > +}
> > > +
> > > +static const struct attribute_group region_interleave_group = {
> > > +       .attrs = interleave_attrs,
> > > +       .is_visible = visible_targets,
> > > +};
> > > +
> > > +static const struct attribute_group *region_groups[] = {
> > > +       &region_group,
> > > +       &region_interleave_group,
> > > +       NULL,
> > > +};
> > > +
> > >  static void cxl_region_release(struct device *dev);
> > >
> > >  static const struct device_type cxl_region_type = {
> > >         .name = "cxl_region",
> > >         .release = cxl_region_release,
> > > +       .groups = region_groups
> > >  };
> > >
> > >  static ssize_t create_region_show(struct device *dev,
> > > @@ -108,8 +405,11 @@ static void cxl_region_release(struct device *dev)
> > >  {
> > >         struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > >         struct cxl_region *cxlr = to_cxl_region(dev);
> > > +       int i;
> > >
> > >         ida_free(&cxld->region_ida, cxlr->id);
> > > +       for (i = 0; i < cxlr->config.interleave_ways; i++)
> > > +               remove_target(cxlr, i);
> >
> > Like the last patch this feels too late. I expect whatever unregisters
> > the region should have already handled removing the targets.
> >
>
> Would remove() be more appropriate?

->remove() does not seem a good fit since it may be the case that
someone wants do "echo $region >
/sys/bus/cxl/drivers/cxl_region/unbind; echo $region >
/sys/bus/cxl/drivers/cxl_region/bind;" without needing to go
reconfigure the targets. I am suggesting that before
device_unregister(&cxlr->dev) the targets are released.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 02/14] cxl/region: Introduce concept of region configuration
  2022-02-17 19:57       ` Dan Williams
@ 2022-02-17 20:20         ` Ben Widawsky
  2022-02-17 21:12           ` Dan Williams
  2022-02-23 21:49         ` Ben Widawsky
  1 sibling, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-02-17 20:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Bjorn Helgaas,
	Linux NVDIMM, Linux PCI

On 22-02-17 11:57:59, Dan Williams wrote:
> On Thu, Feb 17, 2022 at 10:36 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > Consolidating earlier discussions...
> >
> > On 22-01-28 16:25:34, Dan Williams wrote:
> > > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > >
> > > > The region creation APIs create a vacant region. Configuring the region
> > > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > > will be provided to allow userspace to configure the region.  Finally
> > > > once all configuration is complete, userspace may activate the region.
> > > >
> > > > Introduced here are the most basic attributes needed to configure a
> > > > region. Details of these attribute are described in the ABI
> > >
> > > s/attribute/attributes/
> > >
> > > > Documentation. Sanity checking of configuration parameters are done at
> > > > region binding time. This consolidates all such logic in one place,
> > > > rather than being strewn across multiple places.
> > >
> > > I think that's too late for some of the validation. The complex
> > > validation that the region driver does throughout the topology is
> > > different from the basic input validation that can  be done at the
> > > sysfs write time. For example ,this patch allows negative
> > > interleave_granularity values to specified, just return -EINVAL. I
> > > agree that sysfs should not validate everything, I disagree with
> > > pushing all validation to cxl_region_probe().
> > >
> >
> > Okay. It might save us some back and forth if you could outline everything you'd
> > expect to be validated, but I can also make an attempt to figure out the
> > reasonable set of things.
> 
> Input validation. Every value that gets written to a sysfs attribute
> should be checked for validity, more below:
> 
> >
> > > >
> > > > A example is provided below:
> > > >
> > > > /sys/bus/cxl/devices/region0.0:0
> > > > ├── interleave_granularity
> 
> ...validate granularity is within spec and can be supported by the root decoder.
> 
> > > > ├── interleave_ways
> 
> ...validate ways is within spec and can be supported by the root decoder.
> 
> > > > ├── offset
> > > > ├── size
> 
> ...try to reserve decoder capacity to validate that there is available space.
> 
> > > > ├── subsystem -> ../../../../../../bus/cxl
> > > > ├── target0
> 
> ...validate that the target maps to the decoder.
> 
> > > > ├── uevent
> > > > └── uuid
> 
> ...validate that the uuid is unique relative to other regions.
> 

Okay.

> > >
> > > As mentioned off-list, it looks like devtype and modalias are missing.
> > >
> >
> > Yep. This belongs in the previous patch though.
> >
> > > >
> > > > Reported-by: kernel test robot <lkp@intel.com> (v2)
> > > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > > > ---
> > > >  Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
> > > >  drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
> > > >  2 files changed, 340 insertions(+)
> > > >
> > > > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > > > index dcc728458936..50ba5018014d 100644
> > > > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > > > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > > > @@ -187,3 +187,43 @@ Description:
> > > >                 region driver before being deleted. The attributes expects a
> > > >                 region in the form "regionX.Y:Z". The region's name, allocated
> > > >                 by reading create_region, will also be released.
> > > > +
> > > > +What:          /sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset
> > >
> > > This is just another 'resource' attribute for the physical base
> > > address of the region, right? 'offset' sounds like something that
> > > would be relative instead of absolute.
> > >
> >
> > It was meant to be relative. I can make it absolute if that's preferable but the
> > physical base is known at the decoder level already.
> 
> Yes, but it saves userspace a step to get the absolute value here and
> matches what happens in PCI sysfs where the user is not required to
> look up the bridge base to calculate the absolute value.
> 

Okay.

> >
> > > > +Date:          August, 2021
> > >
> > > Same date update comment here.
> > >
> > > > +KernelVersion: v5.18
> > > > +Contact:       linux-cxl@vger.kernel.org
> > > > +Description:
> > > > +               (RO) A region resides within an address space that is claimed by
> > > > +               a decoder.
> > >
> > > "A region is a contiguous partition of a CXL Root decoder address space."
> > >
> > > >                  Region space allocation is handled by the driver, but
> > >
> > > "Region capacity is allocated by writing to the size attribute, the
> > > resulting physical address base determined by the driver is reflected
> > > here."
> > >
> > > > +               the offset may be read by userspace tooling in order to
> > > > +               determine fragmentation, and available size for new regions.
> > >
> > > I would also expect, before / along with these new region attributes,
> > > there would be 'available' and 'max_extent_available' at the decoder
> > > level to indicate how much free space the decoder has and how big the
> > > next region creation can be. User tooling can walk  the decoder and
> > > the regions together to determine fragmentation if necessary, but for
> > > the most part the tool likely only cares about "how big can the next
> > > region be?" and "how full is this decoder?".
> > >
> >
> > Since this is the configuration part of the ABI, I'd rather add that information
> > when the plumbing to report them exists. I'm struggling to understand the
> > balance (as mentioned also earlier in this mail thread) as to what userspace
> > does and what the kernel does. I will add these as you request.
> 
> Userspace asks by DPA size and SPA size and the kernel validates /
> performs the allocation on its behalf.

Okay, I guess that addresses below regarding the tuple.

> 
> > > > +
> > > > +What:
> > > > +/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
> > > > +Date:          August, 2021
> > > > +KernelVersion: v5.18
> > > > +Contact:       linux-cxl@vger.kernel.org
> > > > +Description:
> > > > +               (RW) Configuring regions requires a minimal set of parameters in
> > > > +               order for the subsequent bind operation to succeed. The
> > > > +               following parameters are defined:
> > >
> > > Let's split up the descriptions into individual sections. That can
> > > also document the order that attributes must be written. For example,
> > > doesn't size need to be set before targets are added so that targets
> > > can be validated whether they have sufficient capacity?
> > >
> >
> > Okay. Since we're moving toward making the sysfs ABI stateful,
> 
> sysfs is always stateful. Stateless would be an ioctl.
> 
> > would you like me
> > to make the attrs only visible when they can actually be set?
> 
> No, that's a bit too much magic, and it would be racy.
> 
> >
> > > > +
> > > > +               ==      ========================================================
> > > > +               interleave_granularity Mandatory. Number of consecutive bytes
> > > > +                       each device in the interleave set will claim. The
> > > > +                       possible interleave granularity values are determined by
> > > > +                       the CXL spec and the participating devices.
> > > > +               interleave_ways Mandatory. Number of devices participating in the
> > > > +                       region. Each device will provide 1/interleave of storage
> > > > +                       for the region.
> > > > +               size    Manadatory. Phsyical address space the region will
> > > > +                       consume.
> > >
> > > s/Phsyical/Physical/
> > >
> > > > +               target  Mandatory. Memory devices are the backing storage for a
> > > > +                       region. There will be N targets based on the number of
> > > > +                       interleave ways that the top level decoder is configured
> > > > +                       for.
> > >
> > > That doesn't sound right, IW at the root != IW at the endpoint level
> > > and the region needs to record all the endpoint level targets.
> >
> > Correct.
> >
> > >
> > > > Each target must be set with a memdev device ie.
> > > > +                       'mem1'. This attribute only becomes available after
> > > > +                       setting the 'interleave' attribute.
> > > > +               uuid    Optional. A unique identifier for the region. If none is
> > > > +                       selected, the kernel will create one.
> > >
> > > Let's drop the Mandatory / Optional distinction, or I am otherwise not
> > > understanding what this is trying to document. For example 'uuid' is
> > > "mandatory" for PMEM regions and "omitted" for volatile regions not
> > > optional.
> >
> > Okay.
> >
> > >
> > > > +               ==      ========================================================
> > > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > > > index 1a448543db0d..3b48e0469fc7 100644
> > > > --- a/drivers/cxl/core/region.c
> > > > +++ b/drivers/cxl/core/region.c
> > > > @@ -3,9 +3,12 @@
> > > >  #include <linux/io-64-nonatomic-lo-hi.h>
> > > >  #include <linux/device.h>
> > > >  #include <linux/module.h>
> > > > +#include <linux/sizes.h>
> > > >  #include <linux/slab.h>
> > > > +#include <linux/uuid.h>
> > > >  #include <linux/idr.h>
> > > >  #include <region.h>
> > > > +#include <cxlmem.h>
> > > >  #include <cxl.h>
> > > >  #include "core.h"
> > > >
> > > > @@ -18,11 +21,305 @@
> > > >   * (programming the hardware) is handled by a separate region driver.
> > > >   */
> > > >
> > > > +struct cxl_region *to_cxl_region(struct device *dev);
> > > > +static const struct attribute_group region_interleave_group;
> > > > +
> > > > +static bool is_region_active(struct cxl_region *cxlr)
> > > > +{
> > > > +       /* TODO: Regions can't be activated yet. */
> > > > +       return false;
> > >
> > > This function seems redundant with just checking "cxlr->dev.driver !=
> > > NULL"? The benefit of that is there is no need to carry a TODO in the
> > > series.
> > >
> >
> > The idea behind this was to give the reviewer somewhat of a bigger picture as to
> > how things should work in the code rather than in a commit message. I will
> > remove this.
> 
> They look premature to me.
> 

Given that you don't want me to reference the DWG, it is. The steps outlined
with TODOs were all based on the DWG's overall flow.

> >
> > > > +}
> > > > +
> > > > +static void remove_target(struct cxl_region *cxlr, int target)
> > > > +{
> > > > +       struct cxl_memdev *cxlmd;
> > > > +
> > > > +       cxlmd = cxlr->config.targets[target];
> > > > +       if (cxlmd)
> > > > +               put_device(&cxlmd->dev);
> > >
> > > A memdev can be a member of multiple regions at once, shouldn't this
> > > be an endpoint decoder or similar, not the entire memdev?
> > >
> > > Also, if memdevs autoremove themselves from regions at memdev
> > > ->remove() time then I don't think the region needs to hold references
> > > on memdevs.
> > >
> >
> > This needs some work. The concern I have is region operations will need to
> > operate on memdevs/decoders at various points in time. When the memdev goes
> > away, the region will also need to go away. None of that plumbing was in place
> > in v3 and the reference on the memdev was just a half-hearted attempt at doing
> > the right thing.
> >
> > For now if you prefer I remove the reference, but perhaps the decoder reference
> > would buy us some safety?
> 
> So, I don't want to merge an interim solution. I think this series
> needs to prove out the end to end final ABI with all the lifetime
> issues worked out before committing to it upstream because lifetime
> issues get much harder to fix when they also need to conform to a
> legacy ABI.
> 

I should have been clearer. I had been planning to send at least one more
version before promising to fix lifetime issues. I will skip that step.

> >
> > > > +       cxlr->config.targets[target] = NULL;
> > > > +}
> > > > +
> > > > +static ssize_t interleave_ways_show(struct device *dev,
> > > > +                                   struct device_attribute *attr, char *buf)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
> > > > +}
> > > > +
> > > > +static ssize_t interleave_ways_store(struct device *dev,
> > > > +                                    struct device_attribute *attr,
> > > > +                                    const char *buf, size_t len)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       int ret, prev_iw;
> > > > +       int val;
> > >
> > > I would expect:
> > >
> > > if (dev->driver)
> > >    return -EBUSY;
> > >
> > > ...to shutdown configuration writes once the region is active. Might
> > > also need a region-wide seqlock like target_list_show. So that region
> > > probe drains  all active sysfs writers before assuming the
> > > configuration is stable.
> >
> > Initially my thought here is that this is a problem for userspace to deal with.
> > If userspace can't figure out how to synchronously configure and bind the
> > region, that's not a kernel problem.
> 
> The kernel always needs to protect itself. Userspace is free to race
> itself, but it can not be allowed to trigger a kernel race. So there
> needs to be protection against userspace writing interleave_ways and
> the kernel being able to trust that interleave_ways is now static for
> the life of the region.

Yeah - originally I was relying on the device_lock for this, but that now
doesn't work. seqlock is fine. I could also copy all the config information at
the beginning of probe and simply use that.

If we're going the route of making interleave_ways write-once, why not make all
attributes the same?

> 
> > However, we've put some effort into
> > protecting userspace from itself in the create ABI, so it might be more in line
> > to do that here.
> 
> That safety was about preventing userspace from leaking kernel memory,
> not about protecting userspace from itself. It's still the case that
> userspace racing itself will get horribly confused when it collides
> region creation, but the kernel protects itself by resolving the race.
> 
> > In summary, I'm fine to add it, but I think I really need to get more in your
> > brain about the userspace/kernel divide sooner rather than later.
> 
> Don't let userspace break the kernel, that's it.
> 
> >
> > >
> > > > +
> > > > +       prev_iw = cxlr->config.interleave_ways;
> > > > +       ret = kstrtoint(buf, 0, &val);
> > > > +       if (ret)
> > > > +               return ret;
> > > > +       if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
> > > > +               return -EINVAL;
> > > > +
> > > > +       cxlr->config.interleave_ways = val;
> > > > +
> > > > +       ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
> > > > +       if (ret < 0)
> > > > +               goto err;
> > > > +
> > > > +       sysfs_notify(&dev->kobj, NULL, "target_interleave");
> > >
> > > Why?
> > >
> >
> > copypasta
> >
> > > > +
> > > > +       while (prev_iw > cxlr->config.interleave_ways)
> > > > +               remove_target(cxlr, --prev_iw);
> > >
> > > To make the kernel side simpler this attribute could just require that
> > > setting interleave ways is a one way street, if you want to change it
> > > you need to delete the region and start over.
> > >
> >
> > Okay. One of the earlier versions did this implicitly since the #ways was needed
> > to create the region. I thought from the ABI perspective, flexibility was good.
> > Userspace may choose not to utilize it.
> 
> More flexibility == more maintenance burden. If it's not strictly
> necessary, don't expose it, so making this read-only seems simpler to
> me.
> 
> [..]
> > > > +       device_lock(&cxlr->dev);
> > > > +       if (is_region_active(cxlr))
> > > > +               rc = -EBUSY;
> > > > +       else
> > > > +               cxlr->config.size = val;
> > > > +       device_unlock(&cxlr->dev);
> > >
> > > I think lockdep will complain about device_lock() usage in an
> > > attribute. Try changing this to cxl_device_lock() with
> > > CONFIG_PROVE_CXL_LOCKING=y.
> > >
> >
> > I might have messed it up, but I didn't seem to run into an issue. With the
> > driver bound check though, it can go away.
> >
> > I think it would be really good to add this kind of detail to sysfs.rst. Quick
> > grep finds me arm64/kernel/mte and the nfit driver taking the device lock in an
> > attr.
> 
> Yeah, CONFIG_PROVE_{NVDIMM,CXL}_LOCKING needs to annotate the
> driver-core as well. I'm concerned there's a class of deadlocks that
> lockdep just can't see.
> 
> >
> >
> > > > +
> > > > +       return rc ? rc : len;
> > > > +}
> > > > +static DEVICE_ATTR_RW(size);
> > > > +
> > > > +static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> > > > +                        char *buf)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
> > > > +}
> > > > +
> > > > +static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> > > > +                         const char *buf, size_t len)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       ssize_t rc;
> > > > +
> > > > +       if (len != UUID_STRING_LEN + 1)
> > > > +               return -EINVAL;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +       if (is_region_active(cxlr))
> > > > +               rc = -EBUSY;
> > > > +       else
> > > > +               rc = uuid_parse(buf, &cxlr->config.uuid);
> > > > +       device_unlock(&cxlr->dev);
> > > > +
> > > > +       return rc ? rc : len;
> > > > +}
> > > > +static DEVICE_ATTR_RW(uuid);
> > > > +
> > > > +static struct attribute *region_attrs[] = {
> > > > +       &dev_attr_interleave_ways.attr,
> > > > +       &dev_attr_interleave_granularity.attr,
> > > > +       &dev_attr_offset.attr,
> > > > +       &dev_attr_size.attr,
> > > > +       &dev_attr_uuid.attr,
> > > > +       NULL,
> > > > +};
> > > > +
> > > > +static const struct attribute_group region_group = {
> > > > +       .attrs = region_attrs,
> > > > +};
> > > > +
> > > > +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> > > > +{
> > > > +       int ret;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +       if (!cxlr->config.targets[n])
> > > > +               ret = sysfs_emit(buf, "\n");
> > > > +       else
> > > > +               ret = sysfs_emit(buf, "%s\n",
> > > > +                                dev_name(&cxlr->config.targets[n]->dev));
> > > > +       device_unlock(&cxlr->dev);
> > >
> > > The component contribution of a memdev to a region is a DPA-span, not
> > > the whole memdev. I would expect something like dax_mapping_attributes
> > > or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
> > > information about the component contribution of a memdev to a region.
> > >
> >
> > I think show_target should just return the chosen decoder and then the decoder
> > attributes will tell the rest, wouldn't they?
> 
> Given the conflicts that can arise between HDM decoders needing to map
> increasing DPA values and other conflicts that there will be
> situations where the kernel auto-picking a decoder will get in the
> way. Exposing the decoder selection to userspace also gives one more
> place to do leaf validation. I.e. at decoder-to-region assignment time
> the kernel can validate that the DPA is available and can be mapped by
> the given decoder given the state of other decoders on that device.
> 

Okay, but per below, these are associated with setting the target. The attribute
show does only need to provide the decoder, then userspace can look at the
decoder to find if it's active/DPAs/etc.

> >
> > > > +
> > > > +       return ret;
> > > > +}
> > > > +
> > > > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > > > +                         size_t len)
> > > > +{
> > > > +       struct device *memdev_dev;
> > > > +       struct cxl_memdev *cxlmd;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +
> > > > +       if (len == 1 || cxlr->config.targets[n])
> > > > +               remove_target(cxlr, n);
> > > > +
> > > > +       /* Remove target special case */
> > > > +       if (len == 1) {
> > > > +               device_unlock(&cxlr->dev);
> > > > +               return len;
> > > > +       }
> > > > +
> > > > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> > >
> > > I think this wants to be an endpoint decoder, not a memdev. Because
> > > it's the decoder that joins a memdev to a region, or at least a
> > > decoder should be picked when the memdev is assigned so that the DPA
> > > mapping can be registered. If all the decoders are allocated then fail
> > > here.
> > >
> >
> > Per above, I think making this decoders makes sense. I could make it flexible
> > for ease of use, like if you specify memX, the kernel will pick a decoder for
> > you however I suspect you won't like that.
> 
> Right, put the user friendliness in the tooling, not sysfs ABI.
> 

Okay.

> >
> > > > +       if (!memdev_dev) {
> > > > +               device_unlock(&cxlr->dev);
> > > > +               return -ENOENT;
> > > > +       }
> > > > +
> > > > +       /* reference to memdev held until target is unset or region goes away */
> > > > +
> > > > +       cxlmd = to_cxl_memdev(memdev_dev);
> > > > +       cxlr->config.targets[n] = cxlmd;
> > > > +
> > > > +       device_unlock(&cxlr->dev);
> > > > +
> > > > +       return len;
> > > > +}
> > > > +
> > > > +#define TARGET_ATTR_RW(n)                                                      \
> > > > +       static ssize_t target##n##_show(                                       \
> > > > +               struct device *dev, struct device_attribute *attr, char *buf)  \
> > > > +       {                                                                      \
> > > > +               return show_targetN(to_cxl_region(dev), buf, (n));             \
> > > > +       }                                                                      \
> > > > +       static ssize_t target##n##_store(struct device *dev,                   \
> > > > +                                        struct device_attribute *attr,        \
> > > > +                                        const char *buf, size_t len)          \
> > > > +       {                                                                      \
> > > > +               return set_targetN(to_cxl_region(dev), buf, (n), len);         \
> > > > +       }                                                                      \
> > > > +       static DEVICE_ATTR_RW(target##n)
> > > > +
> > > > +TARGET_ATTR_RW(0);
> > > > +TARGET_ATTR_RW(1);
> > > > +TARGET_ATTR_RW(2);
> > > > +TARGET_ATTR_RW(3);
> > > > +TARGET_ATTR_RW(4);
> > > > +TARGET_ATTR_RW(5);
> > > > +TARGET_ATTR_RW(6);
> > > > +TARGET_ATTR_RW(7);
> > > > +TARGET_ATTR_RW(8);
> > > > +TARGET_ATTR_RW(9);
> > > > +TARGET_ATTR_RW(10);
> > > > +TARGET_ATTR_RW(11);
> > > > +TARGET_ATTR_RW(12);
> > > > +TARGET_ATTR_RW(13);
> > > > +TARGET_ATTR_RW(14);
> > > > +TARGET_ATTR_RW(15);
> > > > +
> > > > +static struct attribute *interleave_attrs[] = {
> > > > +       &dev_attr_target0.attr,
> > > > +       &dev_attr_target1.attr,
> > > > +       &dev_attr_target2.attr,
> > > > +       &dev_attr_target3.attr,
> > > > +       &dev_attr_target4.attr,
> > > > +       &dev_attr_target5.attr,
> > > > +       &dev_attr_target6.attr,
> > > > +       &dev_attr_target7.attr,
> > > > +       &dev_attr_target8.attr,
> > > > +       &dev_attr_target9.attr,
> > > > +       &dev_attr_target10.attr,
> > > > +       &dev_attr_target11.attr,
> > > > +       &dev_attr_target12.attr,
> > > > +       &dev_attr_target13.attr,
> > > > +       &dev_attr_target14.attr,
> > > > +       &dev_attr_target15.attr,
> > > > +       NULL,
> > > > +};
> > > > +
> > > > +static umode_t visible_targets(struct kobject *kobj, struct attribute *a, int n)
> > > > +{
> > > > +       struct device *dev = container_of(kobj, struct device, kobj);
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       if (n < cxlr->config.interleave_ways)
> > > > +               return a->mode;
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +static const struct attribute_group region_interleave_group = {
> > > > +       .attrs = interleave_attrs,
> > > > +       .is_visible = visible_targets,
> > > > +};
> > > > +
> > > > +static const struct attribute_group *region_groups[] = {
> > > > +       &region_group,
> > > > +       &region_interleave_group,
> > > > +       NULL,
> > > > +};
> > > > +
> > > >  static void cxl_region_release(struct device *dev);
> > > >
> > > >  static const struct device_type cxl_region_type = {
> > > >         .name = "cxl_region",
> > > >         .release = cxl_region_release,
> > > > +       .groups = region_groups
> > > >  };
> > > >
> > > >  static ssize_t create_region_show(struct device *dev,
> > > > @@ -108,8 +405,11 @@ static void cxl_region_release(struct device *dev)
> > > >  {
> > > >         struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > > >         struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       int i;
> > > >
> > > >         ida_free(&cxld->region_ida, cxlr->id);
> > > > +       for (i = 0; i < cxlr->config.interleave_ways; i++)
> > > > +               remove_target(cxlr, i);
> > >
> > > Like the last patch this feels too late. I expect whatever unregisters
> > > the region should have already handled removing the targets.
> > >
> >
> > Would remove() be more appropriate?
> 
> ->remove() does not seem a good fit since it may be the case that
> someone wants do "echo $region >
> /sys/bus/cxl/drivers/cxl_region/unbind; echo $region >
> /sys/bus/cxl/drivers/cxl_region/bind;" without needing to go
> reconfigure the targets. I am suggesting that before
> device_unregister(&cxlr->dev) the targets are released.

Okay.

Why would one want to do this? I acknowledge someone *may* do that. I'd like to
know what value you see there.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 01/15] cxl/region: Add region creation ABI
  2022-02-17 18:58         ` Ben Widawsky
@ 2022-02-17 20:26           ` Dan Williams
  0 siblings, 0 replies; 70+ messages in thread
From: Dan Williams @ 2022-02-17 20:26 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On Thu, Feb 17, 2022 at 10:58 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> On 22-02-17 09:58:04, Dan Williams wrote:
> > On Thu, Feb 17, 2022 at 9:19 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > >
> > > Regions are created as a child of the decoder that encompasses an
> > > address space with constraints. Regions have a number of attributes that
> > > must be configured before the region can be activated.
> > >
> > > The ABI is not meant to be secure, but is meant to avoid accidental
> > > races. As a result, a buggy process may create a region by name that was
> > > allocated by a different process. However, multiple processes which are
> > > trying not to race with each other shouldn't need special
> > > synchronization to do so.
> > >
> > > // Allocate a new region name
> > > region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> > >
> > > // Create a new region by name
> > > while
> > > region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> > > ! echo $region > /sys/bus/cxl/devices/decoder0.0/create_region
> > > do true; done
> > >
> > > // Region now exists in sysfs
> > > stat -t /sys/bus/cxl/devices/decoder0.0/$region
> > >
> > > // Delete the region, and name
> > > echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region
> > >
> > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
[..]
> > > +static void unregister_region(void *_cxlr)
> > > +{
> > > +       struct cxl_region *cxlr = _cxlr;
> > > +
> > > +       if (!test_and_set_bit(REGION_DEAD, &cxlr->flags))
> > > +               device_unregister(&cxlr->dev);
> >
> > I thought REGION_DEAD was needed to prevent double
> > devm_release_action(), not double unregister?
> >
>
> I believe that's correct, repeating what you said on our internal list:
>
> On 22-02-14 14:11:41, Dan Williams wrote:
>   True, you do need to solve the race between multiple writers racing to
>   do the unregistration, but that could be done with something like:
>
>   if (!test_and_set_bit(REGION_DEAD, &cxlr->flags))
>       device_unregister(&cxlr->dev);
>
> So I was just trying to implement what you said. Remainder of the discussion
> below...

That was in the context of moving the unregistration to a workqueue
and taking the device lock to validate whether the device has already
been unbound. In this case keeping the devm_release_action() inline in
the sysfs attribute the flag needs to protect against racing
devm_release_action(). I am not saying that a workqueue is now needed,
just clarifying the context of that suggestion.

[..]
> > > +
> > > +       return cxlr;
> > > +
> > > +err_out:
> > > +       put_device(dev);
> > > +       kfree(cxlr);
> >
> > This is a double-free of cxlr;
> >
>
> Because of release()? How does release get called if the region device wasn't
> added? Or is there something else?

->release() is always called at final put_device() regardless of
whether the device was registered with device_add() or not. I.e. see
all the other dev_set_name() error handling in the core that just does
put_device().

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 02/14] cxl/region: Introduce concept of region configuration
  2022-02-17 20:20         ` Ben Widawsky
@ 2022-02-17 21:12           ` Dan Williams
  0 siblings, 0 replies; 70+ messages in thread
From: Dan Williams @ 2022-02-17 21:12 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Bjorn Helgaas,
	Linux NVDIMM, Linux PCI

On Thu, Feb 17, 2022 at 12:20 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
[..]
> > > > > +static bool is_region_active(struct cxl_region *cxlr)
> > > > > +{
> > > > > +       /* TODO: Regions can't be activated yet. */
> > > > > +       return false;
> > > >
> > > > This function seems redundant with just checking "cxlr->dev.driver !=
> > > > NULL"? The benefit of that is there is no need to carry a TODO in the
> > > > series.
> > > >
> > >
> > > The idea behind this was to give the reviewer somewhat of a bigger picture as to
> > > how things should work in the code rather than in a commit message. I will
> > > remove this.
> >
> > They look premature to me.
> >
>
> Given that you don't want me to reference the DWG, it is. The steps outlined
> with TODOs were all based on the DWG's overall flow.

Right, the DWG is a good bootstrap, but the Linux implementation is
going to go beyond it so might as well couch all the language in terms
of base spec references and Linux documentation and not have this
indirection to a 3rd document.

[..]
> > > > ...to shutdown configuration writes once the region is active. Might
> > > > also need a region-wide seqlock like target_list_show. So that region
> > > > probe drains  all active sysfs writers before assuming the
> > > > configuration is stable.
> > >
> > > Initially my thought here is that this is a problem for userspace to deal with.
> > > If userspace can't figure out how to synchronously configure and bind the
> > > region, that's not a kernel problem.
> >
> > The kernel always needs to protect itself. Userspace is free to race
> > itself, but it can not be allowed to trigger a kernel race. So there
> > needs to be protection against userspace writing interleave_ways and
> > the kernel being able to trust that interleave_ways is now static for
> > the life of the region.
>
> Yeah - originally I was relying on the device_lock for this, but that now
> doesn't work. seqlock is fine. I could also copy all the config information at
> the beginning of probe and simply use that.
>
> If we're going the route of making interleave_ways write-once, why not make all
> attributes the same?

Sure. It could always be relaxed later if there was a convincing need
to modify an existing region without tearing it down first. In fact,
that reconfigure flexibility was a source of bugs in NVDIMM sysfs ABI
that the tooling did not leverage because "ndctl create-namespace
--reconfigure" internally did: read namespace attributes, destroy
namepsace, create new namespace with saved attributes.

[..]
> > > > > +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> > > > > +{
> > > > > +       int ret;
> > > > > +
> > > > > +       device_lock(&cxlr->dev);
> > > > > +       if (!cxlr->config.targets[n])
> > > > > +               ret = sysfs_emit(buf, "\n");
> > > > > +       else
> > > > > +               ret = sysfs_emit(buf, "%s\n",
> > > > > +                                dev_name(&cxlr->config.targets[n]->dev));
> > > > > +       device_unlock(&cxlr->dev);
> > > >
> > > > The component contribution of a memdev to a region is a DPA-span, not
> > > > the whole memdev. I would expect something like dax_mapping_attributes
> > > > or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
> > > > information about the component contribution of a memdev to a region.
> > > >
> > >
> > > I think show_target should just return the chosen decoder and then the decoder
> > > attributes will tell the rest, wouldn't they?
> >
> > Given the conflicts that can arise between HDM decoders needing to map
> > increasing DPA values and other conflicts that there will be
> > situations where the kernel auto-picking a decoder will get in the
> > way. Exposing the decoder selection to userspace also gives one more
> > place to do leaf validation. I.e. at decoder-to-region assignment time
> > the kernel can validate that the DPA is available and can be mapped by
> > the given decoder given the state of other decoders on that device.
> >
>
> Okay, but per below, these are associated with setting the target. The attribute
> show does only need to provide the decoder, then userspace can look at the
> decoder to find if it's active/DPAs/etc.

Yes.

>
> > >
> > > > > +
> > > > > +       return ret;
> > > > > +}
> > > > > +
> > > > > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > > > > +                         size_t len)
> > > > > +{
> > > > > +       struct device *memdev_dev;
> > > > > +       struct cxl_memdev *cxlmd;
> > > > > +
> > > > > +       device_lock(&cxlr->dev);
> > > > > +
> > > > > +       if (len == 1 || cxlr->config.targets[n])
> > > > > +               remove_target(cxlr, n);
> > > > > +
> > > > > +       /* Remove target special case */
> > > > > +       if (len == 1) {
> > > > > +               device_unlock(&cxlr->dev);
> > > > > +               return len;
> > > > > +       }
> > > > > +
> > > > > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> > > >
> > > > I think this wants to be an endpoint decoder, not a memdev. Because
> > > > it's the decoder that joins a memdev to a region, or at least a
> > > > decoder should be picked when the memdev is assigned so that the DPA
> > > > mapping can be registered. If all the decoders are allocated then fail
> > > > here.
> > > >
> > >
> > > Per above, I think making this decoders makes sense. I could make it flexible
> > > for ease of use, like if you specify memX, the kernel will pick a decoder for
> > > you however I suspect you won't like that.
> >
> > Right, put the user friendliness in the tooling, not sysfs ABI.
> >
>
> Okay.
>
> > >
> > > > > +       if (!memdev_dev) {
> > > > > +               device_unlock(&cxlr->dev);
> > > > > +               return -ENOENT;
> > > > > +       }
> > > > > +
> > > > > +       /* reference to memdev held until target is unset or region goes away */
> > > > > +
> > > > > +       cxlmd = to_cxl_memdev(memdev_dev);
> > > > > +       cxlr->config.targets[n] = cxlmd;
> > > > > +
> > > > > +       device_unlock(&cxlr->dev);
> > > > > +
> > > > > +       return len;
> > > > > +}
> > > > > +
> > > > > +#define TARGET_ATTR_RW(n)                                                      \
> > > > > +       static ssize_t target##n##_show(                                       \
> > > > > +               struct device *dev, struct device_attribute *attr, char *buf)  \
> > > > > +       {                                                                      \
> > > > > +               return show_targetN(to_cxl_region(dev), buf, (n));             \
> > > > > +       }                                                                      \
> > > > > +       static ssize_t target##n##_store(struct device *dev,                   \
> > > > > +                                        struct device_attribute *attr,        \
> > > > > +                                        const char *buf, size_t len)          \
> > > > > +       {                                                                      \
> > > > > +               return set_targetN(to_cxl_region(dev), buf, (n), len);         \
> > > > > +       }                                                                      \
> > > > > +       static DEVICE_ATTR_RW(target##n)
> > > > > +
> > > > > +TARGET_ATTR_RW(0);
> > > > > +TARGET_ATTR_RW(1);
> > > > > +TARGET_ATTR_RW(2);
> > > > > +TARGET_ATTR_RW(3);
> > > > > +TARGET_ATTR_RW(4);
> > > > > +TARGET_ATTR_RW(5);
> > > > > +TARGET_ATTR_RW(6);
> > > > > +TARGET_ATTR_RW(7);
> > > > > +TARGET_ATTR_RW(8);
> > > > > +TARGET_ATTR_RW(9);
> > > > > +TARGET_ATTR_RW(10);
> > > > > +TARGET_ATTR_RW(11);
> > > > > +TARGET_ATTR_RW(12);
> > > > > +TARGET_ATTR_RW(13);
> > > > > +TARGET_ATTR_RW(14);
> > > > > +TARGET_ATTR_RW(15);
> > > > > +
> > > > > +static struct attribute *interleave_attrs[] = {
> > > > > +       &dev_attr_target0.attr,
> > > > > +       &dev_attr_target1.attr,
> > > > > +       &dev_attr_target2.attr,
> > > > > +       &dev_attr_target3.attr,
> > > > > +       &dev_attr_target4.attr,
> > > > > +       &dev_attr_target5.attr,
> > > > > +       &dev_attr_target6.attr,
> > > > > +       &dev_attr_target7.attr,
> > > > > +       &dev_attr_target8.attr,
> > > > > +       &dev_attr_target9.attr,
> > > > > +       &dev_attr_target10.attr,
> > > > > +       &dev_attr_target11.attr,
> > > > > +       &dev_attr_target12.attr,
> > > > > +       &dev_attr_target13.attr,
> > > > > +       &dev_attr_target14.attr,
> > > > > +       &dev_attr_target15.attr,
> > > > > +       NULL,
> > > > > +};
> > > > > +
> > > > > +static umode_t visible_targets(struct kobject *kobj, struct attribute *a, int n)
> > > > > +{
> > > > > +       struct device *dev = container_of(kobj, struct device, kobj);
> > > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > > +
> > > > > +       if (n < cxlr->config.interleave_ways)
> > > > > +               return a->mode;
> > > > > +       return 0;
> > > > > +}
> > > > > +
> > > > > +static const struct attribute_group region_interleave_group = {
> > > > > +       .attrs = interleave_attrs,
> > > > > +       .is_visible = visible_targets,
> > > > > +};
> > > > > +
> > > > > +static const struct attribute_group *region_groups[] = {
> > > > > +       &region_group,
> > > > > +       &region_interleave_group,
> > > > > +       NULL,
> > > > > +};
> > > > > +
> > > > >  static void cxl_region_release(struct device *dev);
> > > > >
> > > > >  static const struct device_type cxl_region_type = {
> > > > >         .name = "cxl_region",
> > > > >         .release = cxl_region_release,
> > > > > +       .groups = region_groups
> > > > >  };
> > > > >
> > > > >  static ssize_t create_region_show(struct device *dev,
> > > > > @@ -108,8 +405,11 @@ static void cxl_region_release(struct device *dev)
> > > > >  {
> > > > >         struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > > > >         struct cxl_region *cxlr = to_cxl_region(dev);
> > > > > +       int i;
> > > > >
> > > > >         ida_free(&cxld->region_ida, cxlr->id);
> > > > > +       for (i = 0; i < cxlr->config.interleave_ways; i++)
> > > > > +               remove_target(cxlr, i);
> > > >
> > > > Like the last patch this feels too late. I expect whatever unregisters
> > > > the region should have already handled removing the targets.
> > > >
> > >
> > > Would remove() be more appropriate?
> >
> > ->remove() does not seem a good fit since it may be the case that
> > someone wants do "echo $region >
> > /sys/bus/cxl/drivers/cxl_region/unbind; echo $region >
> > /sys/bus/cxl/drivers/cxl_region/bind;" without needing to go
> > reconfigure the targets. I am suggesting that before
> > device_unregister(&cxlr->dev) the targets are released.
>
> Okay.
>
> Why would one want to do this? I acknowledge someone *may* do that. I'd like to
> know what value you see there.

There are several debug and error handling scenarios that say "quiesce
CXL.mem". Seems reasonable to map those to "cxl disable-region", and
seems unreasonable that "cxl disable-region" requires "cxl
create-region" to get back to operational state.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 01/15] cxl/region: Add region creation ABI
  2022-02-17 17:58       ` Dan Williams
  2022-02-17 18:58         ` Ben Widawsky
@ 2022-02-17 22:22         ` Ben Widawsky
  2022-02-17 23:32           ` Dan Williams
  1 sibling, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-02-17 22:22 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On 22-02-17 09:58:04, Dan Williams wrote:
> On Thu, Feb 17, 2022 at 9:19 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > Regions are created as a child of the decoder that encompasses an
> > address space with constraints. Regions have a number of attributes that
> > must be configured before the region can be activated.
> >
> > The ABI is not meant to be secure, but is meant to avoid accidental
> > races. As a result, a buggy process may create a region by name that was
> > allocated by a different process. However, multiple processes which are
> > trying not to race with each other shouldn't need special
> > synchronization to do so.
> >
> > // Allocate a new region name
> > region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> >
> > // Create a new region by name
> > while
> > region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> > ! echo $region > /sys/bus/cxl/devices/decoder0.0/create_region
> > do true; done
> >
> > // Region now exists in sysfs
> > stat -t /sys/bus/cxl/devices/decoder0.0/$region
> >
> > // Delete the region, and name
> > echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region
> >
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> 
> Looking good, a few more fixes and cleanups identified below.
> 
> >
> > ---
> > Changes since v4:
> > - Add the missed base attributes addition
> >
> > ---
> >  Documentation/ABI/testing/sysfs-bus-cxl       |  23 ++
> >  .../driver-api/cxl/memory-devices.rst         |  11 +
> >  drivers/cxl/core/Makefile                     |   1 +
> >  drivers/cxl/core/core.h                       |   3 +
> >  drivers/cxl/core/port.c                       |  11 +
> >  drivers/cxl/core/region.c                     | 213 ++++++++++++++++++
> >  drivers/cxl/cxl.h                             |   5 +
> >  drivers/cxl/region.h                          |  23 ++
> >  tools/testing/cxl/Kbuild                      |   1 +
> >  9 files changed, 291 insertions(+)
> >  create mode 100644 drivers/cxl/core/region.c
> >  create mode 100644 drivers/cxl/region.h
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > index 7c2b846521f3..e5db45ea70ad 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > @@ -163,3 +163,26 @@ Description:
> >                 memory (type-3). The 'target_type' attribute indicates the
> >                 current setting which may dynamically change based on what
> >                 memory regions are activated in this decode hierarchy.
> > +
> > +What:          /sys/bus/cxl/devices/decoderX.Y/create_region
> > +Date:          January, 2022
> > +KernelVersion: v5.18
> > +Contact:       linux-cxl@vger.kernel.org
> > +Description:
> > +               Write a value of the form 'regionX.Y:Z' to instantiate a new
> > +               region within the decode range bounded by decoderX.Y. The value
> > +               written must match the current value returned from reading this
> > +               attribute. This behavior lets the kernel arbitrate racing
> > +               attempts to create a region. The thread that fails to write
> > +               loops and tries the next value. Regions must be created for root
> > +               decoders, and must subsequently configured and bound to a region
> > +               driver before they can be used.
> > +
> > +What:          /sys/bus/cxl/devices/decoderX.Y/delete_region
> > +Date:          January, 2022
> > +KernelVersion: v5.18
> > +Contact:       linux-cxl@vger.kernel.org
> > +Description:
> > +               Deletes the named region.  The attribute expects a region in the
> > +               form "regionX.Y:Z". The region's name, allocated by reading
> > +               create_region, will also be released.
> > diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst
> > index db476bb170b6..66ddc58a21b1 100644
> > --- a/Documentation/driver-api/cxl/memory-devices.rst
> > +++ b/Documentation/driver-api/cxl/memory-devices.rst
> > @@ -362,6 +362,17 @@ CXL Core
> >  .. kernel-doc:: drivers/cxl/core/mbox.c
> >     :doc: cxl mbox
> >
> > +CXL Regions
> > +-----------
> > +.. kernel-doc:: drivers/cxl/region.h
> > +   :identifiers:
> > +
> > +.. kernel-doc:: drivers/cxl/core/region.c
> > +   :doc: cxl core region
> > +
> > +.. kernel-doc:: drivers/cxl/core/region.c
> > +   :identifiers:
> > +
> >  External Interfaces
> >  ===================
> >
> > diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> > index 6d37cd78b151..39ce8f2f2373 100644
> > --- a/drivers/cxl/core/Makefile
> > +++ b/drivers/cxl/core/Makefile
> > @@ -4,6 +4,7 @@ obj-$(CONFIG_CXL_BUS) += cxl_core.o
> >  ccflags-y += -I$(srctree)/drivers/cxl
> >  cxl_core-y := port.o
> >  cxl_core-y += pmem.o
> > +cxl_core-y += region.o
> >  cxl_core-y += regs.o
> >  cxl_core-y += memdev.o
> >  cxl_core-y += mbox.o
> > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> > index 1a50c0fc399c..adfd42370b28 100644
> > --- a/drivers/cxl/core/core.h
> > +++ b/drivers/cxl/core/core.h
> > @@ -9,6 +9,9 @@ extern const struct device_type cxl_nvdimm_type;
> >
> >  extern struct attribute_group cxl_base_attribute_group;
> >
> > +extern struct device_attribute dev_attr_create_region;
> > +extern struct device_attribute dev_attr_delete_region;
> > +
> >  struct cxl_send_command;
> >  struct cxl_mem_query_commands;
> >  int cxl_query_cmd(struct cxl_memdev *cxlmd,
> > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > index 1e785a3affaa..860e91cae29b 100644
> > --- a/drivers/cxl/core/port.c
> > +++ b/drivers/cxl/core/port.c
> > @@ -213,6 +213,8 @@ static struct attribute_group cxl_decoder_base_attribute_group = {
> >  };
> >
> >  static struct attribute *cxl_decoder_root_attrs[] = {
> > +       &dev_attr_create_region.attr,
> > +       &dev_attr_delete_region.attr,
> >         &dev_attr_cap_pmem.attr,
> >         &dev_attr_cap_ram.attr,
> >         &dev_attr_cap_type2.attr,
> > @@ -270,6 +272,8 @@ static void cxl_decoder_release(struct device *dev)
> >         struct cxl_decoder *cxld = to_cxl_decoder(dev);
> >         struct cxl_port *port = to_cxl_port(dev->parent);
> >
> > +       ida_free(&cxld->region_ida, cxld->next_region_id);
> > +       ida_destroy(&cxld->region_ida);
> >         ida_free(&port->decoder_ida, cxld->id);
> >         kfree(cxld);
> >  }
> > @@ -1244,6 +1248,13 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
> >         cxld->target_type = CXL_DECODER_EXPANDER;
> >         cxld->platform_res = (struct resource)DEFINE_RES_MEM(0, 0);
> >
> > +       mutex_init(&cxld->id_lock);
> > +       ida_init(&cxld->region_ida);
> > +       rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);
> > +       if (rc < 0)
> > +               goto err;
> > +
> > +       cxld->next_region_id = rc;
> >         return cxld;
> >  err:
> >         kfree(cxld);
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > new file mode 100644
> > index 000000000000..5576952e4aa1
> > --- /dev/null
> > +++ b/drivers/cxl/core/region.c
> > @@ -0,0 +1,213 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Copyright(c) 2022 Intel Corporation. All rights reserved. */
> > +#include <linux/device.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <linux/idr.h>
> > +#include <region.h>
> > +#include <cxl.h>
> > +#include "core.h"
> > +
> > +/**
> > + * DOC: cxl core region
> > + *
> > + * CXL Regions represent mapped memory capacity in system physical address
> > + * space. Whereas the CXL Root Decoders identify the bounds of potential CXL
> > + * Memory ranges, Regions represent the active mapped capacity by the HDM
> > + * Decoder Capability structures throughout the Host Bridges, Switches, and
> > + * Endpoints in the topology.
> > + */
> > +
> > +static void cxl_region_release(struct device *dev);
> 
> Why forward declare this versus move cxl_region_type after the definition?
> 
> No other CXL object release functions are forward declared.
> 
> > +
> > +static const struct device_type cxl_region_type = {
> > +       .name = "cxl_region",
> > +       .release = cxl_region_release,
> > +};
> > +
> > +static struct cxl_region *to_cxl_region(struct device *dev)
> > +{
> > +       if (dev_WARN_ONCE(dev, dev->type != &cxl_region_type,
> > +                         "not a cxl_region device\n"))
> > +               return NULL;
> > +
> > +       return container_of(dev, struct cxl_region, dev);
> > +}
> > +
> > +static struct cxl_region *cxl_region_alloc(struct cxl_decoder *cxld)
> > +{
> > +       struct cxl_region *cxlr;
> > +       struct device *dev;
> > +
> > +       cxlr = kzalloc(sizeof(*cxlr), GFP_KERNEL);
> > +       if (!cxlr)
> > +               return ERR_PTR(-ENOMEM);
> > +
> > +       dev = &cxlr->dev;
> > +       device_initialize(dev);
> > +       dev->parent = &cxld->dev;
> > +       device_set_pm_not_required(dev);
> > +       dev->bus = &cxl_bus_type;
> > +       dev->type = &cxl_region_type;
> > +
> > +       return cxlr;
> > +}
> > +
> > +static void unregister_region(void *_cxlr)
> > +{
> > +       struct cxl_region *cxlr = _cxlr;
> > +
> > +       if (!test_and_set_bit(REGION_DEAD, &cxlr->flags))
> > +               device_unregister(&cxlr->dev);
> 
> I thought REGION_DEAD was needed to prevent double
> devm_release_action(), not double unregister?
> 
> > +}
> > +
> > +/**
> > + * devm_cxl_add_region - Adds a region to a decoder
> > + * @cxld: Parent decoder.
> > + * @cxlr: Region to be added to the decoder.
> > + *
> > + * This is the second step of region initialization. Regions exist within an
> > + * address space which is mapped by a @cxld. That @cxld must be a root decoder,
> > + * and it enforces constraints upon the region as it is configured.
> > + *
> > + * Return: 0 if the region was added to the @cxld, else returns negative error
> > + * code. The region will be named "regionX.Y.Z" where X is the port, Y is the
> > + * decoder id, and Z is the region number.
> > + */
> > +static struct cxl_region *devm_cxl_add_region(struct cxl_decoder *cxld)
> > +{
> > +       struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> > +       struct cxl_region *cxlr;
> > +       struct device *dev;
> > +       int rc;
> > +
> > +       cxlr = cxl_region_alloc(cxld);
> > +       if (IS_ERR(cxlr))
> > +               return cxlr;
> > +
> > +       dev = &cxlr->dev;
> > +
> > +       cxlr->id = cxld->next_region_id;
> > +       rc = dev_set_name(dev, "region%d.%d:%d", port->id, cxld->id, cxlr->id);
> > +       if (rc)
> > +               goto err_out;
> > +
> > +       /* affirm that release will have access to the decoder's region ida  */
> > +       get_device(&cxld->dev);
> > +
> > +       rc = device_add(dev);
> > +       if (!rc)
> > +               rc = devm_add_action_or_reset(port->uport, unregister_region,
> > +                                             cxlr);
> > +       if (rc)
> > +               goto err_out;
> 
> All the other usages in device_add() in the subsystem follow the style of:
> 
> rc = device_add(dev);
> if (rc)
>     goto err;
> 
> ...any reason to be unique here and indent the success case?
> 
> 
> > +
> > +       return cxlr;
> > +
> > +err_out:
> > +       put_device(dev);
> > +       kfree(cxlr);
> 
> This is a double-free of cxlr;
> 
> > +       return ERR_PTR(rc);
> > +}
> > +
> > +static ssize_t create_region_show(struct device *dev,
> > +                                 struct device_attribute *attr, char *buf)
> > +{
> > +       struct cxl_port *port = to_cxl_port(dev->parent);
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > +
> > +       return sysfs_emit(buf, "region%d.%d:%d\n", port->id, cxld->id,
> > +                         cxld->next_region_id);
> > +}
> > +
> > +static ssize_t create_region_store(struct device *dev,
> > +                                  struct device_attribute *attr,
> > +                                  const char *buf, size_t len)
> > +{
> > +       struct cxl_port *port = to_cxl_port(dev->parent);
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > +       struct cxl_region *cxlr;
> > +       int d, p, r, rc = 0;
> > +
> > +       if (sscanf(buf, "region%d.%d:%d", &p, &d, &r) != 3)
> > +               return -EINVAL;
> > +
> > +       if (port->id != p || cxld->id != d)
> > +               return -EINVAL;
> > +
> > +       rc = mutex_lock_interruptible(&cxld->id_lock);
> > +       if (rc)
> > +               return rc;
> > +
> > +       if (cxld->next_region_id != r) {
> > +               rc = -EINVAL;
> > +               goto out;
> > +       }
> > +
> > +       rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);
> > +       if (rc < 0) {
> > +               dev_dbg(dev, "Failed to get next cached id (%d)\n", rc);
> > +               goto out;
> > +       }
> > +
> > +       cxlr = devm_cxl_add_region(cxld);
> > +       if (IS_ERR(cxlr)) {
> > +               rc = PTR_ERR(cxlr);
> > +               goto out;
> > +       }
> > +
> > +       cxld->next_region_id = rc;
> 
> This looks like a leak in the case when devm_cxl_add_region() fails,
> so just move it above that call.
> 

It's not super simple with the current pre-caching. If you move this above
devm_cxl_add_region(), then you lose the previously pre-cached region. I think
the cleaner solution is to just free the ida on failure. Pretty sure no matter
what method you go, you need an ida_free in there somewhere. Do you see another
way?

> > +       dev_dbg(dev, "Created %s\n", dev_name(&cxlr->dev));
> > +
> > +out:
> > +       mutex_unlock(&cxld->id_lock);
> > +       return rc ? rc : len;
> 
> if (rc)
>     return rc;
> return len;
> 
> > +}
> > +DEVICE_ATTR_RW(create_region);
> > +
> > +static struct cxl_region *cxl_find_region_by_name(struct cxl_decoder *cxld,
> > +                                                 const char *name)
> > +{
> > +       struct device *region_dev;
> > +
> > +       region_dev = device_find_child_by_name(&cxld->dev, name);
> > +       if (!region_dev)
> > +               return ERR_PTR(-ENOENT);
> > +
> > +       return to_cxl_region(region_dev);
> > +}
> > +
> > +static ssize_t delete_region_store(struct device *dev,
> > +                                  struct device_attribute *attr,
> > +                                  const char *buf, size_t len)
> > +{
> > +       struct cxl_port *port = to_cxl_port(dev->parent);
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > +       struct cxl_region *cxlr;
> > +
> > +       cxlr = cxl_find_region_by_name(cxld, buf);
> > +       if (IS_ERR(cxlr))
> > +               return PTR_ERR(cxlr);
> > +
> > +       /* After this, the region is no longer a child of the decoder. */
> > +       devm_release_action(port->uport, unregister_region, cxlr);
> 
> This may trigger a WARN in the case where 2 threads race to trigger
> the release action. I think the DEAD check is needed to gate this
> call, not device_unregister().
> 
> > +
> > +       /* Release is likely called here, so cxlr is not safe to reference. */
> 
> This is always the case with any put_device(), so no need for this comment.
> 
> > +       put_device(&cxlr->dev);
> > +       cxlr = NULL;
> 
> This NULL assignment has no value.
> 
> > +
> > +       dev_dbg(dev, "Deleted %s\n", buf);
> 
> Not sure a debug statement is needed for something userspace can
> directly view itself with the result code from the sysfs write.
> 
> > +       return len;
> > +}
> > +DEVICE_ATTR_WO(delete_region);
> > +
> > +static void cxl_region_release(struct device *dev)
> > +{
> > +       struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > +
> > +       dev_dbg(&cxld->dev, "Releasing %s\n", dev_name(dev));
> > +       ida_free(&cxld->region_ida, cxlr->id);
> > +       kfree(cxlr);
> > +       put_device(&cxld->dev);
> > +}
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index b4047a310340..d5397f7dfcf4 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -221,6 +221,8 @@ enum cxl_decoder_type {
> >   * @target_type: accelerator vs expander (type2 vs type3) selector
> >   * @flags: memory type capabilities and locking
> >   * @target_lock: coordinate coherent reads of the target list
> > + * @region_ida: allocator for region ids.
> > + * @next_region_id: Cached region id for next region.
> >   * @nr_targets: number of elements in @target
> >   * @target: active ordered target list in current decoder configuration
> >   */
> > @@ -236,6 +238,9 @@ struct cxl_decoder {
> >         enum cxl_decoder_type target_type;
> >         unsigned long flags;
> >         seqlock_t target_lock;
> > +       struct mutex id_lock;
> > +       struct ida region_ida;
> > +       int next_region_id;
> >         int nr_targets;
> >         struct cxl_dport *target[];
> >  };
> > diff --git a/drivers/cxl/region.h b/drivers/cxl/region.h
> > new file mode 100644
> > index 000000000000..0016f83bbdfd
> > --- /dev/null
> > +++ b/drivers/cxl/region.h
> > @@ -0,0 +1,23 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/* Copyright(c) 2021 Intel Corporation. */
> > +#ifndef __CXL_REGION_H__
> > +#define __CXL_REGION_H__
> > +
> > +#include <linux/uuid.h>
> > +
> > +#include "cxl.h"
> > +
> > +/**
> > + * struct cxl_region - CXL region
> > + * @dev: This region's device.
> > + * @id: This region's id. Id is globally unique across all regions.
> > + * @flags: Flags representing the current state of the region.
> > + */
> > +struct cxl_region {
> > +       struct device dev;
> > +       int id;
> > +       unsigned long flags;
> > +#define REGION_DEAD 0
> > +};
> > +
> > +#endif
> > diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
> > index 82e49ab0937d..3fe6d34e6d59 100644
> > --- a/tools/testing/cxl/Kbuild
> > +++ b/tools/testing/cxl/Kbuild
> > @@ -46,6 +46,7 @@ cxl_core-y += $(CXL_CORE_SRC)/memdev.o
> >  cxl_core-y += $(CXL_CORE_SRC)/mbox.o
> >  cxl_core-y += $(CXL_CORE_SRC)/pci.o
> >  cxl_core-y += $(CXL_CORE_SRC)/hdm.o
> > +cxl_core-y += $(CXL_CORE_SRC)/region.o
> >  cxl_core-y += config_check.o
> >
> >  obj-m += test/
> > --
> > 2.35.1
> >

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 01/15] cxl/region: Add region creation ABI
  2022-02-17 22:22         ` Ben Widawsky
@ 2022-02-17 23:32           ` Dan Williams
  2022-02-18 16:41             ` Ben Widawsky
  0 siblings, 1 reply; 70+ messages in thread
From: Dan Williams @ 2022-02-17 23:32 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On Thu, Feb 17, 2022 at 2:22 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> On 22-02-17 09:58:04, Dan Williams wrote:
> > On Thu, Feb 17, 2022 at 9:19 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > >
> > > Regions are created as a child of the decoder that encompasses an
> > > address space with constraints. Regions have a number of attributes that
> > > must be configured before the region can be activated.
> > >
> > > The ABI is not meant to be secure, but is meant to avoid accidental
> > > races. As a result, a buggy process may create a region by name that was
> > > allocated by a different process. However, multiple processes which are
> > > trying not to race with each other shouldn't need special
> > > synchronization to do so.
> > >
> > > // Allocate a new region name
> > > region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> > >
> > > // Create a new region by name
> > > while
> > > region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> > > ! echo $region > /sys/bus/cxl/devices/decoder0.0/create_region
> > > do true; done
> > >
> > > // Region now exists in sysfs
> > > stat -t /sys/bus/cxl/devices/decoder0.0/$region
> > >
> > > // Delete the region, and name
> > > echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region
> > >
> > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> >
> > Looking good, a few more fixes and cleanups identified below.
> >
> > >
> > > ---
> > > Changes since v4:
> > > - Add the missed base attributes addition
> > >
> > > ---
> > >  Documentation/ABI/testing/sysfs-bus-cxl       |  23 ++
> > >  .../driver-api/cxl/memory-devices.rst         |  11 +
> > >  drivers/cxl/core/Makefile                     |   1 +
> > >  drivers/cxl/core/core.h                       |   3 +
> > >  drivers/cxl/core/port.c                       |  11 +
> > >  drivers/cxl/core/region.c                     | 213 ++++++++++++++++++
> > >  drivers/cxl/cxl.h                             |   5 +
> > >  drivers/cxl/region.h                          |  23 ++
> > >  tools/testing/cxl/Kbuild                      |   1 +
> > >  9 files changed, 291 insertions(+)
> > >  create mode 100644 drivers/cxl/core/region.c
> > >  create mode 100644 drivers/cxl/region.h
> > >
> > > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > > index 7c2b846521f3..e5db45ea70ad 100644
> > > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > > @@ -163,3 +163,26 @@ Description:
> > >                 memory (type-3). The 'target_type' attribute indicates the
> > >                 current setting which may dynamically change based on what
> > >                 memory regions are activated in this decode hierarchy.
> > > +
> > > +What:          /sys/bus/cxl/devices/decoderX.Y/create_region
> > > +Date:          January, 2022
> > > +KernelVersion: v5.18
> > > +Contact:       linux-cxl@vger.kernel.org
> > > +Description:
> > > +               Write a value of the form 'regionX.Y:Z' to instantiate a new
> > > +               region within the decode range bounded by decoderX.Y. The value
> > > +               written must match the current value returned from reading this
> > > +               attribute. This behavior lets the kernel arbitrate racing
> > > +               attempts to create a region. The thread that fails to write
> > > +               loops and tries the next value. Regions must be created for root
> > > +               decoders, and must subsequently configured and bound to a region
> > > +               driver before they can be used.
> > > +
> > > +What:          /sys/bus/cxl/devices/decoderX.Y/delete_region
> > > +Date:          January, 2022
> > > +KernelVersion: v5.18
> > > +Contact:       linux-cxl@vger.kernel.org
> > > +Description:
> > > +               Deletes the named region.  The attribute expects a region in the
> > > +               form "regionX.Y:Z". The region's name, allocated by reading
> > > +               create_region, will also be released.
> > > diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst
> > > index db476bb170b6..66ddc58a21b1 100644
> > > --- a/Documentation/driver-api/cxl/memory-devices.rst
> > > +++ b/Documentation/driver-api/cxl/memory-devices.rst
> > > @@ -362,6 +362,17 @@ CXL Core
> > >  .. kernel-doc:: drivers/cxl/core/mbox.c
> > >     :doc: cxl mbox
> > >
> > > +CXL Regions
> > > +-----------
> > > +.. kernel-doc:: drivers/cxl/region.h
> > > +   :identifiers:
> > > +
> > > +.. kernel-doc:: drivers/cxl/core/region.c
> > > +   :doc: cxl core region
> > > +
> > > +.. kernel-doc:: drivers/cxl/core/region.c
> > > +   :identifiers:
> > > +
> > >  External Interfaces
> > >  ===================
> > >
> > > diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> > > index 6d37cd78b151..39ce8f2f2373 100644
> > > --- a/drivers/cxl/core/Makefile
> > > +++ b/drivers/cxl/core/Makefile
> > > @@ -4,6 +4,7 @@ obj-$(CONFIG_CXL_BUS) += cxl_core.o
> > >  ccflags-y += -I$(srctree)/drivers/cxl
> > >  cxl_core-y := port.o
> > >  cxl_core-y += pmem.o
> > > +cxl_core-y += region.o
> > >  cxl_core-y += regs.o
> > >  cxl_core-y += memdev.o
> > >  cxl_core-y += mbox.o
> > > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> > > index 1a50c0fc399c..adfd42370b28 100644
> > > --- a/drivers/cxl/core/core.h
> > > +++ b/drivers/cxl/core/core.h
> > > @@ -9,6 +9,9 @@ extern const struct device_type cxl_nvdimm_type;
> > >
> > >  extern struct attribute_group cxl_base_attribute_group;
> > >
> > > +extern struct device_attribute dev_attr_create_region;
> > > +extern struct device_attribute dev_attr_delete_region;
> > > +
> > >  struct cxl_send_command;
> > >  struct cxl_mem_query_commands;
> > >  int cxl_query_cmd(struct cxl_memdev *cxlmd,
> > > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > > index 1e785a3affaa..860e91cae29b 100644
> > > --- a/drivers/cxl/core/port.c
> > > +++ b/drivers/cxl/core/port.c
> > > @@ -213,6 +213,8 @@ static struct attribute_group cxl_decoder_base_attribute_group = {
> > >  };
> > >
> > >  static struct attribute *cxl_decoder_root_attrs[] = {
> > > +       &dev_attr_create_region.attr,
> > > +       &dev_attr_delete_region.attr,
> > >         &dev_attr_cap_pmem.attr,
> > >         &dev_attr_cap_ram.attr,
> > >         &dev_attr_cap_type2.attr,
> > > @@ -270,6 +272,8 @@ static void cxl_decoder_release(struct device *dev)
> > >         struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > >         struct cxl_port *port = to_cxl_port(dev->parent);
> > >
> > > +       ida_free(&cxld->region_ida, cxld->next_region_id);
> > > +       ida_destroy(&cxld->region_ida);
> > >         ida_free(&port->decoder_ida, cxld->id);
> > >         kfree(cxld);
> > >  }
> > > @@ -1244,6 +1248,13 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
> > >         cxld->target_type = CXL_DECODER_EXPANDER;
> > >         cxld->platform_res = (struct resource)DEFINE_RES_MEM(0, 0);
> > >
> > > +       mutex_init(&cxld->id_lock);
> > > +       ida_init(&cxld->region_ida);
> > > +       rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);
> > > +       if (rc < 0)
> > > +               goto err;
> > > +
> > > +       cxld->next_region_id = rc;
> > >         return cxld;
> > >  err:
> > >         kfree(cxld);
> > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > > new file mode 100644
> > > index 000000000000..5576952e4aa1
> > > --- /dev/null
> > > +++ b/drivers/cxl/core/region.c
> > > @@ -0,0 +1,213 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/* Copyright(c) 2022 Intel Corporation. All rights reserved. */
> > > +#include <linux/device.h>
> > > +#include <linux/module.h>
> > > +#include <linux/slab.h>
> > > +#include <linux/idr.h>
> > > +#include <region.h>
> > > +#include <cxl.h>
> > > +#include "core.h"
> > > +
> > > +/**
> > > + * DOC: cxl core region
> > > + *
> > > + * CXL Regions represent mapped memory capacity in system physical address
> > > + * space. Whereas the CXL Root Decoders identify the bounds of potential CXL
> > > + * Memory ranges, Regions represent the active mapped capacity by the HDM
> > > + * Decoder Capability structures throughout the Host Bridges, Switches, and
> > > + * Endpoints in the topology.
> > > + */
> > > +
> > > +static void cxl_region_release(struct device *dev);
> >
> > Why forward declare this versus move cxl_region_type after the definition?
> >
> > No other CXL object release functions are forward declared.
> >
> > > +
> > > +static const struct device_type cxl_region_type = {
> > > +       .name = "cxl_region",
> > > +       .release = cxl_region_release,
> > > +};
> > > +
> > > +static struct cxl_region *to_cxl_region(struct device *dev)
> > > +{
> > > +       if (dev_WARN_ONCE(dev, dev->type != &cxl_region_type,
> > > +                         "not a cxl_region device\n"))
> > > +               return NULL;
> > > +
> > > +       return container_of(dev, struct cxl_region, dev);
> > > +}
> > > +
> > > +static struct cxl_region *cxl_region_alloc(struct cxl_decoder *cxld)
> > > +{
> > > +       struct cxl_region *cxlr;
> > > +       struct device *dev;
> > > +
> > > +       cxlr = kzalloc(sizeof(*cxlr), GFP_KERNEL);
> > > +       if (!cxlr)
> > > +               return ERR_PTR(-ENOMEM);
> > > +
> > > +       dev = &cxlr->dev;
> > > +       device_initialize(dev);
> > > +       dev->parent = &cxld->dev;
> > > +       device_set_pm_not_required(dev);
> > > +       dev->bus = &cxl_bus_type;
> > > +       dev->type = &cxl_region_type;
> > > +
> > > +       return cxlr;
> > > +}
> > > +
> > > +static void unregister_region(void *_cxlr)
> > > +{
> > > +       struct cxl_region *cxlr = _cxlr;
> > > +
> > > +       if (!test_and_set_bit(REGION_DEAD, &cxlr->flags))
> > > +               device_unregister(&cxlr->dev);
> >
> > I thought REGION_DEAD was needed to prevent double
> > devm_release_action(), not double unregister?
> >
> > > +}
> > > +
> > > +/**
> > > + * devm_cxl_add_region - Adds a region to a decoder
> > > + * @cxld: Parent decoder.
> > > + * @cxlr: Region to be added to the decoder.
> > > + *
> > > + * This is the second step of region initialization. Regions exist within an
> > > + * address space which is mapped by a @cxld. That @cxld must be a root decoder,
> > > + * and it enforces constraints upon the region as it is configured.
> > > + *
> > > + * Return: 0 if the region was added to the @cxld, else returns negative error
> > > + * code. The region will be named "regionX.Y.Z" where X is the port, Y is the
> > > + * decoder id, and Z is the region number.
> > > + */
> > > +static struct cxl_region *devm_cxl_add_region(struct cxl_decoder *cxld)
> > > +{
> > > +       struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> > > +       struct cxl_region *cxlr;
> > > +       struct device *dev;
> > > +       int rc;
> > > +
> > > +       cxlr = cxl_region_alloc(cxld);
> > > +       if (IS_ERR(cxlr))
> > > +               return cxlr;
> > > +
> > > +       dev = &cxlr->dev;
> > > +
> > > +       cxlr->id = cxld->next_region_id;
> > > +       rc = dev_set_name(dev, "region%d.%d:%d", port->id, cxld->id, cxlr->id);
> > > +       if (rc)
> > > +               goto err_out;
> > > +
> > > +       /* affirm that release will have access to the decoder's region ida  */
> > > +       get_device(&cxld->dev);
> > > +
> > > +       rc = device_add(dev);
> > > +       if (!rc)
> > > +               rc = devm_add_action_or_reset(port->uport, unregister_region,
> > > +                                             cxlr);
> > > +       if (rc)
> > > +               goto err_out;
> >
> > All the other usages in device_add() in the subsystem follow the style of:
> >
> > rc = device_add(dev);
> > if (rc)
> >     goto err;
> >
> > ...any reason to be unique here and indent the success case?
> >
> >
> > > +
> > > +       return cxlr;
> > > +
> > > +err_out:
> > > +       put_device(dev);
> > > +       kfree(cxlr);
> >
> > This is a double-free of cxlr;
> >
> > > +       return ERR_PTR(rc);
> > > +}
> > > +
> > > +static ssize_t create_region_show(struct device *dev,
> > > +                                 struct device_attribute *attr, char *buf)
> > > +{
> > > +       struct cxl_port *port = to_cxl_port(dev->parent);
> > > +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > > +
> > > +       return sysfs_emit(buf, "region%d.%d:%d\n", port->id, cxld->id,
> > > +                         cxld->next_region_id);
> > > +}
> > > +
> > > +static ssize_t create_region_store(struct device *dev,
> > > +                                  struct device_attribute *attr,
> > > +                                  const char *buf, size_t len)
> > > +{
> > > +       struct cxl_port *port = to_cxl_port(dev->parent);
> > > +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > > +       struct cxl_region *cxlr;
> > > +       int d, p, r, rc = 0;
> > > +
> > > +       if (sscanf(buf, "region%d.%d:%d", &p, &d, &r) != 3)
> > > +               return -EINVAL;
> > > +
> > > +       if (port->id != p || cxld->id != d)
> > > +               return -EINVAL;
> > > +
> > > +       rc = mutex_lock_interruptible(&cxld->id_lock);
> > > +       if (rc)
> > > +               return rc;
> > > +
> > > +       if (cxld->next_region_id != r) {
> > > +               rc = -EINVAL;
> > > +               goto out;
> > > +       }
> > > +
> > > +       rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);
> > > +       if (rc < 0) {
> > > +               dev_dbg(dev, "Failed to get next cached id (%d)\n", rc);
> > > +               goto out;
> > > +       }
> > > +
> > > +       cxlr = devm_cxl_add_region(cxld);
> > > +       if (IS_ERR(cxlr)) {
> > > +               rc = PTR_ERR(cxlr);
> > > +               goto out;
> > > +       }
> > > +
> > > +       cxld->next_region_id = rc;
> >
> > This looks like a leak in the case when devm_cxl_add_region() fails,
> > so just move it above that call.
> >
>
> It's not super simple with the current pre-caching. If you move this above
> devm_cxl_add_region(), then you lose the previously pre-cached region. I think
> the cleaner solution is to just free the ida on failure. Pretty sure no matter
> what method you go, you need an ida_free in there somewhere. Do you see another
> way?

As soon as one thread has successfully acquired the next_region_id
then it is safe to advance and assume that the devm_cxl_add_region()
owns releasing that id.

In fact, just make that implicit. Move the ida_alloc() and the
next_region_id advancement internal to cxl_region_alloc() with a
device_lock_assert() for the id_lock. Then the recovery of the
allocated id happens naturally like all the other ids in the subsystem
i.e. at release time.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 01/15] cxl/region: Add region creation ABI
  2022-02-17 23:32           ` Dan Williams
@ 2022-02-18 16:41             ` Ben Widawsky
  0 siblings, 0 replies; 70+ messages in thread
From: Ben Widawsky @ 2022-02-18 16:41 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On 22-02-17 15:32:44, Dan Williams wrote:
> On Thu, Feb 17, 2022 at 2:22 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > On 22-02-17 09:58:04, Dan Williams wrote:
> > > On Thu, Feb 17, 2022 at 9:19 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > >
> > > > Regions are created as a child of the decoder that encompasses an
> > > > address space with constraints. Regions have a number of attributes that
> > > > must be configured before the region can be activated.
> > > >
> > > > The ABI is not meant to be secure, but is meant to avoid accidental
> > > > races. As a result, a buggy process may create a region by name that was
> > > > allocated by a different process. However, multiple processes which are
> > > > trying not to race with each other shouldn't need special
> > > > synchronization to do so.
> > > >
> > > > // Allocate a new region name
> > > > region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> > > >
> > > > // Create a new region by name
> > > > while
> > > > region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region)
> > > > ! echo $region > /sys/bus/cxl/devices/decoder0.0/create_region
> > > > do true; done
> > > >
> > > > // Region now exists in sysfs
> > > > stat -t /sys/bus/cxl/devices/decoder0.0/$region
> > > >
> > > > // Delete the region, and name
> > > > echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region
> > > >
> > > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > >
> > > Looking good, a few more fixes and cleanups identified below.
> > >
> > > >
> > > > ---
> > > > Changes since v4:
> > > > - Add the missed base attributes addition
> > > >
> > > > ---
> > > >  Documentation/ABI/testing/sysfs-bus-cxl       |  23 ++
> > > >  .../driver-api/cxl/memory-devices.rst         |  11 +
> > > >  drivers/cxl/core/Makefile                     |   1 +
> > > >  drivers/cxl/core/core.h                       |   3 +
> > > >  drivers/cxl/core/port.c                       |  11 +
> > > >  drivers/cxl/core/region.c                     | 213 ++++++++++++++++++
> > > >  drivers/cxl/cxl.h                             |   5 +
> > > >  drivers/cxl/region.h                          |  23 ++
> > > >  tools/testing/cxl/Kbuild                      |   1 +
> > > >  9 files changed, 291 insertions(+)
> > > >  create mode 100644 drivers/cxl/core/region.c
> > > >  create mode 100644 drivers/cxl/region.h
> > > >
> > > > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > > > index 7c2b846521f3..e5db45ea70ad 100644
> > > > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > > > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > > > @@ -163,3 +163,26 @@ Description:
> > > >                 memory (type-3). The 'target_type' attribute indicates the
> > > >                 current setting which may dynamically change based on what
> > > >                 memory regions are activated in this decode hierarchy.
> > > > +
> > > > +What:          /sys/bus/cxl/devices/decoderX.Y/create_region
> > > > +Date:          January, 2022
> > > > +KernelVersion: v5.18
> > > > +Contact:       linux-cxl@vger.kernel.org
> > > > +Description:
> > > > +               Write a value of the form 'regionX.Y:Z' to instantiate a new
> > > > +               region within the decode range bounded by decoderX.Y. The value
> > > > +               written must match the current value returned from reading this
> > > > +               attribute. This behavior lets the kernel arbitrate racing
> > > > +               attempts to create a region. The thread that fails to write
> > > > +               loops and tries the next value. Regions must be created for root
> > > > +               decoders, and must subsequently configured and bound to a region
> > > > +               driver before they can be used.
> > > > +
> > > > +What:          /sys/bus/cxl/devices/decoderX.Y/delete_region
> > > > +Date:          January, 2022
> > > > +KernelVersion: v5.18
> > > > +Contact:       linux-cxl@vger.kernel.org
> > > > +Description:
> > > > +               Deletes the named region.  The attribute expects a region in the
> > > > +               form "regionX.Y:Z". The region's name, allocated by reading
> > > > +               create_region, will also be released.
> > > > diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst
> > > > index db476bb170b6..66ddc58a21b1 100644
> > > > --- a/Documentation/driver-api/cxl/memory-devices.rst
> > > > +++ b/Documentation/driver-api/cxl/memory-devices.rst
> > > > @@ -362,6 +362,17 @@ CXL Core
> > > >  .. kernel-doc:: drivers/cxl/core/mbox.c
> > > >     :doc: cxl mbox
> > > >
> > > > +CXL Regions
> > > > +-----------
> > > > +.. kernel-doc:: drivers/cxl/region.h
> > > > +   :identifiers:
> > > > +
> > > > +.. kernel-doc:: drivers/cxl/core/region.c
> > > > +   :doc: cxl core region
> > > > +
> > > > +.. kernel-doc:: drivers/cxl/core/region.c
> > > > +   :identifiers:
> > > > +
> > > >  External Interfaces
> > > >  ===================
> > > >
> > > > diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> > > > index 6d37cd78b151..39ce8f2f2373 100644
> > > > --- a/drivers/cxl/core/Makefile
> > > > +++ b/drivers/cxl/core/Makefile
> > > > @@ -4,6 +4,7 @@ obj-$(CONFIG_CXL_BUS) += cxl_core.o
> > > >  ccflags-y += -I$(srctree)/drivers/cxl
> > > >  cxl_core-y := port.o
> > > >  cxl_core-y += pmem.o
> > > > +cxl_core-y += region.o
> > > >  cxl_core-y += regs.o
> > > >  cxl_core-y += memdev.o
> > > >  cxl_core-y += mbox.o
> > > > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> > > > index 1a50c0fc399c..adfd42370b28 100644
> > > > --- a/drivers/cxl/core/core.h
> > > > +++ b/drivers/cxl/core/core.h
> > > > @@ -9,6 +9,9 @@ extern const struct device_type cxl_nvdimm_type;
> > > >
> > > >  extern struct attribute_group cxl_base_attribute_group;
> > > >
> > > > +extern struct device_attribute dev_attr_create_region;
> > > > +extern struct device_attribute dev_attr_delete_region;
> > > > +
> > > >  struct cxl_send_command;
> > > >  struct cxl_mem_query_commands;
> > > >  int cxl_query_cmd(struct cxl_memdev *cxlmd,
> > > > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > > > index 1e785a3affaa..860e91cae29b 100644
> > > > --- a/drivers/cxl/core/port.c
> > > > +++ b/drivers/cxl/core/port.c
> > > > @@ -213,6 +213,8 @@ static struct attribute_group cxl_decoder_base_attribute_group = {
> > > >  };
> > > >
> > > >  static struct attribute *cxl_decoder_root_attrs[] = {
> > > > +       &dev_attr_create_region.attr,
> > > > +       &dev_attr_delete_region.attr,
> > > >         &dev_attr_cap_pmem.attr,
> > > >         &dev_attr_cap_ram.attr,
> > > >         &dev_attr_cap_type2.attr,
> > > > @@ -270,6 +272,8 @@ static void cxl_decoder_release(struct device *dev)
> > > >         struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > > >         struct cxl_port *port = to_cxl_port(dev->parent);
> > > >
> > > > +       ida_free(&cxld->region_ida, cxld->next_region_id);
> > > > +       ida_destroy(&cxld->region_ida);
> > > >         ida_free(&port->decoder_ida, cxld->id);
> > > >         kfree(cxld);
> > > >  }
> > > > @@ -1244,6 +1248,13 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port,
> > > >         cxld->target_type = CXL_DECODER_EXPANDER;
> > > >         cxld->platform_res = (struct resource)DEFINE_RES_MEM(0, 0);
> > > >
> > > > +       mutex_init(&cxld->id_lock);
> > > > +       ida_init(&cxld->region_ida);
> > > > +       rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);
> > > > +       if (rc < 0)
> > > > +               goto err;
> > > > +
> > > > +       cxld->next_region_id = rc;
> > > >         return cxld;
> > > >  err:
> > > >         kfree(cxld);
> > > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > > > new file mode 100644
> > > > index 000000000000..5576952e4aa1
> > > > --- /dev/null
> > > > +++ b/drivers/cxl/core/region.c
> > > > @@ -0,0 +1,213 @@
> > > > +// SPDX-License-Identifier: GPL-2.0-only
> > > > +/* Copyright(c) 2022 Intel Corporation. All rights reserved. */
> > > > +#include <linux/device.h>
> > > > +#include <linux/module.h>
> > > > +#include <linux/slab.h>
> > > > +#include <linux/idr.h>
> > > > +#include <region.h>
> > > > +#include <cxl.h>
> > > > +#include "core.h"
> > > > +
> > > > +/**
> > > > + * DOC: cxl core region
> > > > + *
> > > > + * CXL Regions represent mapped memory capacity in system physical address
> > > > + * space. Whereas the CXL Root Decoders identify the bounds of potential CXL
> > > > + * Memory ranges, Regions represent the active mapped capacity by the HDM
> > > > + * Decoder Capability structures throughout the Host Bridges, Switches, and
> > > > + * Endpoints in the topology.
> > > > + */
> > > > +
> > > > +static void cxl_region_release(struct device *dev);
> > >
> > > Why forward declare this versus move cxl_region_type after the definition?
> > >
> > > No other CXL object release functions are forward declared.
> > >
> > > > +
> > > > +static const struct device_type cxl_region_type = {
> > > > +       .name = "cxl_region",
> > > > +       .release = cxl_region_release,
> > > > +};
> > > > +
> > > > +static struct cxl_region *to_cxl_region(struct device *dev)
> > > > +{
> > > > +       if (dev_WARN_ONCE(dev, dev->type != &cxl_region_type,
> > > > +                         "not a cxl_region device\n"))
> > > > +               return NULL;
> > > > +
> > > > +       return container_of(dev, struct cxl_region, dev);
> > > > +}
> > > > +
> > > > +static struct cxl_region *cxl_region_alloc(struct cxl_decoder *cxld)
> > > > +{
> > > > +       struct cxl_region *cxlr;
> > > > +       struct device *dev;
> > > > +
> > > > +       cxlr = kzalloc(sizeof(*cxlr), GFP_KERNEL);
> > > > +       if (!cxlr)
> > > > +               return ERR_PTR(-ENOMEM);
> > > > +
> > > > +       dev = &cxlr->dev;
> > > > +       device_initialize(dev);
> > > > +       dev->parent = &cxld->dev;
> > > > +       device_set_pm_not_required(dev);
> > > > +       dev->bus = &cxl_bus_type;
> > > > +       dev->type = &cxl_region_type;
> > > > +
> > > > +       return cxlr;
> > > > +}
> > > > +
> > > > +static void unregister_region(void *_cxlr)
> > > > +{
> > > > +       struct cxl_region *cxlr = _cxlr;
> > > > +
> > > > +       if (!test_and_set_bit(REGION_DEAD, &cxlr->flags))
> > > > +               device_unregister(&cxlr->dev);
> > >
> > > I thought REGION_DEAD was needed to prevent double
> > > devm_release_action(), not double unregister?
> > >
> > > > +}
> > > > +
> > > > +/**
> > > > + * devm_cxl_add_region - Adds a region to a decoder
> > > > + * @cxld: Parent decoder.
> > > > + * @cxlr: Region to be added to the decoder.
> > > > + *
> > > > + * This is the second step of region initialization. Regions exist within an
> > > > + * address space which is mapped by a @cxld. That @cxld must be a root decoder,
> > > > + * and it enforces constraints upon the region as it is configured.
> > > > + *
> > > > + * Return: 0 if the region was added to the @cxld, else returns negative error
> > > > + * code. The region will be named "regionX.Y.Z" where X is the port, Y is the
> > > > + * decoder id, and Z is the region number.
> > > > + */
> > > > +static struct cxl_region *devm_cxl_add_region(struct cxl_decoder *cxld)
> > > > +{
> > > > +       struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> > > > +       struct cxl_region *cxlr;
> > > > +       struct device *dev;
> > > > +       int rc;
> > > > +
> > > > +       cxlr = cxl_region_alloc(cxld);
> > > > +       if (IS_ERR(cxlr))
> > > > +               return cxlr;
> > > > +
> > > > +       dev = &cxlr->dev;
> > > > +
> > > > +       cxlr->id = cxld->next_region_id;
> > > > +       rc = dev_set_name(dev, "region%d.%d:%d", port->id, cxld->id, cxlr->id);
> > > > +       if (rc)
> > > > +               goto err_out;
> > > > +
> > > > +       /* affirm that release will have access to the decoder's region ida  */
> > > > +       get_device(&cxld->dev);
> > > > +
> > > > +       rc = device_add(dev);
> > > > +       if (!rc)
> > > > +               rc = devm_add_action_or_reset(port->uport, unregister_region,
> > > > +                                             cxlr);
> > > > +       if (rc)
> > > > +               goto err_out;
> > >
> > > All the other usages in device_add() in the subsystem follow the style of:
> > >
> > > rc = device_add(dev);
> > > if (rc)
> > >     goto err;
> > >
> > > ...any reason to be unique here and indent the success case?
> > >
> > >
> > > > +
> > > > +       return cxlr;
> > > > +
> > > > +err_out:
> > > > +       put_device(dev);
> > > > +       kfree(cxlr);
> > >
> > > This is a double-free of cxlr;
> > >
> > > > +       return ERR_PTR(rc);
> > > > +}
> > > > +
> > > > +static ssize_t create_region_show(struct device *dev,
> > > > +                                 struct device_attribute *attr, char *buf)
> > > > +{
> > > > +       struct cxl_port *port = to_cxl_port(dev->parent);
> > > > +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "region%d.%d:%d\n", port->id, cxld->id,
> > > > +                         cxld->next_region_id);
> > > > +}
> > > > +
> > > > +static ssize_t create_region_store(struct device *dev,
> > > > +                                  struct device_attribute *attr,
> > > > +                                  const char *buf, size_t len)
> > > > +{
> > > > +       struct cxl_port *port = to_cxl_port(dev->parent);
> > > > +       struct cxl_decoder *cxld = to_cxl_decoder(dev);
> > > > +       struct cxl_region *cxlr;
> > > > +       int d, p, r, rc = 0;
> > > > +
> > > > +       if (sscanf(buf, "region%d.%d:%d", &p, &d, &r) != 3)
> > > > +               return -EINVAL;
> > > > +
> > > > +       if (port->id != p || cxld->id != d)
> > > > +               return -EINVAL;
> > > > +
> > > > +       rc = mutex_lock_interruptible(&cxld->id_lock);
> > > > +       if (rc)
> > > > +               return rc;
> > > > +
> > > > +       if (cxld->next_region_id != r) {
> > > > +               rc = -EINVAL;
> > > > +               goto out;
> > > > +       }
> > > > +
> > > > +       rc = ida_alloc(&cxld->region_ida, GFP_KERNEL);
> > > > +       if (rc < 0) {
> > > > +               dev_dbg(dev, "Failed to get next cached id (%d)\n", rc);
> > > > +               goto out;
> > > > +       }
> > > > +
> > > > +       cxlr = devm_cxl_add_region(cxld);
> > > > +       if (IS_ERR(cxlr)) {
> > > > +               rc = PTR_ERR(cxlr);
> > > > +               goto out;
> > > > +       }
> > > > +
> > > > +       cxld->next_region_id = rc;
> > >
> > > This looks like a leak in the case when devm_cxl_add_region() fails,
> > > so just move it above that call.
> > >
> >
> > It's not super simple with the current pre-caching. If you move this above
> > devm_cxl_add_region(), then you lose the previously pre-cached region. I think
> > the cleaner solution is to just free the ida on failure. Pretty sure no matter
> > what method you go, you need an ida_free in there somewhere. Do you see another
> > way?
> 
> As soon as one thread has successfully acquired the next_region_id
> then it is safe to advance and assume that the devm_cxl_add_region()
> owns releasing that id.
> 
> In fact, just make that implicit. Move the ida_alloc() and the
> next_region_id advancement internal to cxl_region_alloc() with a
> device_lock_assert() for the id_lock. Then the recovery of the
> allocated id happens naturally like all the other ids in the subsystem
> i.e. at release time.

This is the right answer. Regions created from LSA/BIOS will need this anyway.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 11/14] cxl/region: Add support for single switch level
  2022-02-15 16:10   ` Jonathan Cameron
@ 2022-02-18 18:23     ` Jonathan Cameron
  0 siblings, 0 replies; 70+ messages in thread
From: Jonathan Cameron @ 2022-02-18 18:23 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Dan Williams, Ira Weiny,
	Vishal Verma, Bjorn Helgaas, nvdimm, linux-pci

On Tue, 15 Feb 2022 16:10:14 +0000
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Thu, 27 Jan 2022 16:27:04 -0800
> Ben Widawsky <ben.widawsky@intel.com> wrote:
> 
> > CXL switches have HDM decoders just like host bridges and endpoints.
> > Their programming works in a similar fashion.
> > 
> > The spec does not prohibit multiple levels of switches, however, those
> > are not implemented at this time.
> > 
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>  
> Hi Ben,
> 
> I'm still hammering away at trying to bring up qemu switch emulation.
> Even though I know you are reworking this, seems only sensible to point
> out issues when I hit them.  If no longer relevant you can ignore them!
> 
> With these bits and a few other minor tweaks the decoders now look
> to be right - I just need to wire up the QEMU side so that
> I don't get hardware exceptions on actually reading and writing once
> the region is bound :) 

QEMU emulation of a very basic CXL switch at:
https://gitlab.com/jic23/qemu/-/commits/cxl-v7-draft-for-test

Hope that is helpful for testing next version of this.
I've been testing with a 4 port switch and 4 type 3 devices
and with a few hacks/fixes as mentioned in review feedback it all
works (well devmem2 gives me what I expect. :)

Jonathan

> 
> Thanks,
> 
> J
> > ---
> >  drivers/cxl/cxl.h    |  5 ++++
> >  drivers/cxl/region.c | 61 ++++++++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 64 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index 8ace6cca0776..d70d8c85d05f 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -96,6 +96,11 @@ static inline u8 cxl_to_ig(u16 g)
> >  	return ilog2(g) - 8;
> >  }
> >  
> > +static inline int cxl_to_ways(u8 ways)
> > +{
> > +	return 1 << ways;
> > +}
> > +
> >  static inline bool cxl_is_interleave_ways_valid(int iw)
> >  {
> >  	switch (iw) {
> > diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
> > index b8982be13bfe..f748060733dd 100644
> > --- a/drivers/cxl/region.c
> > +++ b/drivers/cxl/region.c
> > @@ -359,6 +359,23 @@ static bool has_switch(const struct cxl_region *cxlr)
> >  	return false;
> >  }
> >  
> > +static bool has_multi_switch(const struct cxl_region *cxlr)
> > +{
> > +	struct cxl_memdev *ep;
> > +	int i;
> > +
> > +	for_each_cxl_endpoint(ep, cxlr, i)
> > +		if (ep->port->depth > 3)
> > +			return true;
> > +
> > +	return false;
> > +}
> > +
> > +static struct cxl_port *get_switch(struct cxl_memdev *ep)
> > +{
> > +	return to_cxl_port(ep->port->dev.parent);
> > +}
> > +
> >  static struct cxl_decoder *get_decoder(struct cxl_region *cxlr,
> >  				       struct cxl_port *p)
> >  {
> > @@ -409,6 +426,8 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
> >  				      const struct cxl_decoder *rootd,
> >  				      bool state_update)
> >  {
> > +	const int region_ig = cxl_to_ig(cxlr->config.interleave_granularity);
> > +	const int region_eniw = cxl_to_eniw(cxlr->config.interleave_ways);
> >  	const int num_root_ports = get_num_root_ports(cxlr);
> >  	struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];
> >  	struct cxl_decoder *cxld, *c;
> > @@ -416,8 +435,12 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
> >  
> >  	hb_count = get_unique_hostbridges(cxlr, hbs);
> >  
> > -	/* TODO: Switch support */
> > -	if (has_switch(cxlr))
> > +	/* TODO: support multiple levels of switches */
> > +	if (has_multi_switch(cxlr))
> > +		return false;
> > +
> > +	/* TODO: x3 interleave for switches is hard. */
> > +	if (has_switch(cxlr) && !is_power_of_2(region_ways(cxlr)))
> >  		return false;
> >  
> >  	/*
> > @@ -470,8 +493,14 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
> >  		list_for_each_entry(rp, &hb->dports, list) {
> >  			struct cxl_memdev *ep;
> >  			int port_grouping = -1;
> > +			int target_ndx;  
> As things currently stand, with a switch connected below a single port
> of a host bridge (4 type 3 off the switch) this will program the HB
> decoder to have 4 targets, all routed to the switch USP.
> 
> There is an argument that this is correct but its not what I'd expect.
> I'd expect to see 1 target only.  It's not a problem for small cases, but
> with enough rp and switches we can run out of targets.
> 
> >  
> >  			for_each_cxl_endpoint_hb(ep, cxlr, hb, idx) {
> > +				struct cxl_decoder *switch_cxld;
> > +				struct cxl_dport *target;
> > +				struct cxl_port *switch_port;
> > +				bool found = false;
> > +
> >  				if (get_rp(ep) != rp)
> >  					continue;
> >  
> > @@ -499,6 +528,34 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
> >  
> >  				cxld->interleave_ways++;
> >  				cxld->target[port_grouping] = get_rp(ep);
> > +
> > +				/*
> > +				 * At least one switch is connected here if the endpoint
> > +				 * has a depth > 2
> > +				 */
> > +				if (ep->port->depth == 2)
> > +					continue;
> > +
> > +				/* Check the staged list to see if this
> > +				 * port has already been added
> > +				 */
> > +				switch_port = get_switch(ep);
> > +				list_for_each_entry(switch_cxld, &cxlr->staged_list, region_link) {
> > +					if (to_cxl_port(switch_cxld->dev.parent) == switch_port)
> > +						found = true;  
> 
> break;
> 
> > +				}
> > +
> > +				if (found) {
> > +					target = cxl_find_dport_by_dev(switch_port, ep->dev.parent->parent);
> > +					switch_cxld->target[target_ndx++] = target;
> > +					continue;
> > +				}
> > +
> > +				target_ndx = 0;
> > +
> > +				switch_cxld = get_decoder(cxlr, switch_port);
> > +				switch_cxld->interleave_ways++;
> > +				switch_cxld->interleave_granularity = cxl_to_ways(region_ig + region_eniw);  
> 
> I'm not following this.  Perhaps comment on why this particular maths?  I was assuming the switch
> interleave granularity would that of the region as the switch is last level of decode.
> 
> Need to do the equivalent here of what you do in the if (found) or the first target is missed.
> Also need to be updating interleave_ways only in the found path, not here (as the default is 1)
> 
> >  			}
> >  		}
> >  	}  
> 


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 05/14] cxl/acpi: Handle address space allocation
  2022-01-28  0:26 ` [PATCH v3 05/14] cxl/acpi: Handle address space allocation Ben Widawsky
@ 2022-02-18 19:17   ` Dan Williams
  0 siblings, 0 replies; 70+ messages in thread
From: Dan Williams @ 2022-02-18 19:17 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> Regions are carved out of an addresses space which is claimed by top
> level decoders, and subsequently their children decoders. Regions are

s/children/descendant/

> created with a size and therefore must fit, with proper alignment, in
> that address space. The support for doing this fitting is handled by the
> driver automatically.
>
> As an example, a platform might configure a top level decoder to claim
> 1TB of address space @ 0x800000000 -> 0x10800000000; it would be
> possible to create M regions with appropriate alignment to occupy that
> address space. Each of those regions would have a host physical address
> somewhere in the range between 32G and 1.3TB, and the location will be
> determined by the logic added here.
>
> The request_region() usage is not strictly mandatory at this point as
> the actual handling of the address space is done with genpools. It is
> highly likely however that the resource/region APIs will become useful
> in the not too distant future.

More on this below, but I think resource APIs are critical for the
pre-existing / BIOS created region case and I have a feeling gen_pool
is not a good fit.

> All decoders manage a host physical address space while active. Only the
> root decoder has constraints on location and size. As a result, it makes
> most sense for the root decoder to be responsible for managing the
> entire address space, and mid-level decoders and endpoints can ask the
> root decoder for suballocations.
>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> ---
>  drivers/cxl/acpi.c | 30 ++++++++++++++++++++++++++++++
>  drivers/cxl/cxl.h  |  2 ++
>  2 files changed, 32 insertions(+)
>
> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index d6dcb2b6af48..74681bfbf53c 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -1,6 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  /* Copyright(c) 2021 Intel Corporation. All rights reserved. */
>  #include <linux/platform_device.h>
> +#include <linux/genalloc.h>
>  #include <linux/module.h>
>  #include <linux/device.h>
>  #include <linux/kernel.h>
> @@ -73,6 +74,27 @@ static int cxl_acpi_cfmws_verify(struct device *dev,
>         return 0;
>  }
>
> +/*
> + * Every decoder while active has an address space that it is decoding. However,
> + * only the root level decoders have fixed host physical address space ranges.
> + */
> +static int cxl_create_cfmws_address_space(struct cxl_decoder *cxld,
> +                                         struct acpi_cedt_cfmws *cfmws)
> +{
> +       const int order = ilog2(SZ_256M * cxld->interleave_ways);
> +       struct device *dev = &cxld->dev;
> +       struct gen_pool *pool;
> +
> +       pool = devm_gen_pool_create(dev, order, NUMA_NO_NODE, dev_name(dev));

The cxld dev is not a suitable devm host.

Moreover, the address space is a generic property of root decoders, it
belongs in the core not in cxl_acpi.

As for the data structure / APIs to manage the address space I'm not
sure gen_pool is the right answer, because the capacity tracking will
be done in terms of __request_region() and resource trees. The
infrastructure to keep the gen_pool aligned with the resource tree
drops away if there was an interface for allocating free space out of
a resource tree to augment the base API of requesting space with known
addresses. In fact, there is already the request_free_mem_region()
helper. Did you consider that vs gen_pool? Otherwise, how to solve the
problem of pre-populating the busy areas of the gen_pool relative to
capacity that the BIOS may have consumed out of the decoder range?
That comes for free with just walking decoders at boot and doing
__request_region() against the root decoders. Then the allocation
helper can just walk that free space similar to
request_free_mem_region().

> +       if (IS_ERR(pool))
> +               return PTR_ERR(pool);
> +
> +       cxld->address_space = pool;
> +
> +       return gen_pool_add(cxld->address_space, cfmws->base_hpa,
> +                           cfmws->window_size, NUMA_NO_NODE);
> +}
> +
>  struct cxl_cfmws_context {
>         struct device *dev;
>         struct cxl_port *root_port;
> @@ -113,6 +135,14 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
>         cxld->interleave_ways = CFMWS_INTERLEAVE_WAYS(cfmws);
>         cxld->interleave_granularity = CFMWS_INTERLEAVE_GRANULARITY(cfmws);
>
> +       rc = cxl_create_cfmws_address_space(cxld, cfmws);
> +       if (rc) {
> +               dev_err(dev,
> +                       "Failed to create CFMWS address space for decoder\n");
> +               put_device(&cxld->dev);
> +               return 0;
> +       }
> +
>         rc = cxl_decoder_add(cxld, target_map);
>         if (rc)
>                 put_device(&cxld->dev);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index d1a8ca19c9ea..b300673072f5 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -251,6 +251,7 @@ enum cxl_decoder_type {
>   * @flags: memory type capabilities and locking
>   * @target_lock: coordinate coherent reads of the target list
>   * @region_ida: allocator for region ids.
> + * @address_space: Used/free address space for regions.
>   * @nr_targets: number of elements in @target
>   * @target: active ordered target list in current decoder configuration
>   */
> @@ -267,6 +268,7 @@ struct cxl_decoder {
>         unsigned long flags;
>         seqlock_t target_lock;
>         struct ida region_ida;
> +       struct gen_pool *address_space;
>         int nr_targets;
>         struct cxl_dport *target[];
>  };
> --
> 2.35.0
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 06/14] cxl/region: Address space allocation
  2022-01-28  0:26 ` [PATCH v3 06/14] cxl/region: Address " Ben Widawsky
@ 2022-02-18 19:51   ` Dan Williams
  0 siblings, 0 replies; 70+ messages in thread
From: Dan Williams @ 2022-02-18 19:51 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> When a region is not assigned a host physical address,

Curious, is there a way to pick one in the current ABI? Not that I
want one, in fact I think Linux should make it as difficult as
possible to create a region with a fixed address (per the 'HPA' field
of the region label) given all the problems it can cause with decoder
allocation ordering. Unless and until someone identifies a solid use
case for that capability it should be de-emphasized.

> one is picked by
> the driver. As the address will determine which CFMWS contains the
> region, it's usually a better idea to let the driver make this
> determination.
>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> ---
>  drivers/cxl/region.c | 40 ++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 38 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
> index cc41939a2f0a..5588873dd250 100644
> --- a/drivers/cxl/region.c
> +++ b/drivers/cxl/region.c
> @@ -1,6 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  /* Copyright(c) 2021 Intel Corporation. All rights reserved. */
>  #include <linux/platform_device.h>
> +#include <linux/genalloc.h>
>  #include <linux/device.h>
>  #include <linux/module.h>
>  #include <linux/pci.h>
> @@ -64,6 +65,20 @@ static struct cxl_port *get_root_decoder(const struct cxl_memdev *endpoint)
>         return NULL;
>  }
>
> +static void release_cxl_region(void *r)
> +{
> +       struct cxl_region *cxlr = (struct cxl_region *)r;
> +       struct cxl_decoder *rootd = rootd_from_region(cxlr);
> +       struct resource *res = &rootd->platform_res;
> +       resource_size_t start, size;
> +
> +       start = cxlr->res->start;
> +       size = resource_size(cxlr->res);
> +
> +       __release_region(res, start, size);
> +       gen_pool_free(rootd->address_space, start, size);

If the need to keep the gen_pool in sync is dropped then this
open-coded devm release handler can be replaced with
__devm_request_region().

> +}
> +
>  /**
>   * sanitize_region() - Check is region is reasonably configured
>   * @cxlr: The region to check
> @@ -129,8 +144,29 @@ static int sanitize_region(const struct cxl_region *cxlr)
>   */
>  static int allocate_address_space(struct cxl_region *cxlr)
>  {
> -       /* TODO */

The problem with TODOs is now I forget which context calls
allocate_address_space(). If the caller was added in this patch it
would be reviewable, as is, I need to go to another window to search
"allocate_address_space" to recall that it is called from
cxl_region_probe(). That's too late as someone defining a region
should know upfront a region creation time that space has been
reserved, or not.

> -       return 0;
> +       struct cxl_decoder *rootd = rootd_from_region(cxlr);
> +       unsigned long start;

s/unsigned long/resource_size_t/?

> +
> +       start = gen_pool_alloc(rootd->address_space, cxlr->config.size);
> +       if (!start) {
> +               dev_dbg(&cxlr->dev, "Couldn't allocate %lluM of address space",
> +                       cxlr->config.size >> 20);
> +               return -ENOMEM;
> +       }
> +
> +       cxlr->res =
> +               __request_region(&rootd->platform_res, start, cxlr->config.size,
> +                                dev_name(&cxlr->dev), IORESOURCE_MEM);
> +       if (!cxlr->res) {
> +               dev_dbg(&cxlr->dev, "Couldn't obtain region from %s (%pR)\n",
> +                       dev_name(&rootd->dev), &rootd->platform_res);
> +               gen_pool_free(rootd->address_space, start, cxlr->config.size);
> +               return -ENOMEM;
> +       }
> +
> +       dev_dbg(&cxlr->dev, "resource %pR", cxlr->res);
> +
> +       return devm_add_action_or_reset(&cxlr->dev, release_cxl_region, cxlr);
>  }
>
>  /**
> --
> 2.35.0
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 07/14] cxl/region: Implement XHB verification
  2022-01-28  0:27 ` [PATCH v3 07/14] cxl/region: Implement XHB verification Ben Widawsky
@ 2022-02-18 20:23   ` Dan Williams
  0 siblings, 0 replies; 70+ messages in thread
From: Dan Williams @ 2022-02-18 20:23 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> Cross host bridge verification primarily determines if the requested
> interleave ordering can be achieved by the root decoder, which isn't as
> programmable as other decoders.

I don't understand that comment. Are you talking about the CFMWS
static decoders that can not be programmed at all, or the 1st level
decoders beneath that.

> The algorithm implemented here is based on the CXL Type 3 Memory Device
> Software Guide, chapter 2.13.14

Just spell out the support here and don't require the reader to read
that other doc to understand if it follows it exactly, or takes some
liberties. I.e. the assumptions and tradeoffs built into the design
choices in the patch need to be spelled out here.

>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> ---
> Changes since v2:
> - Fail earlier on lack of host bridges. This should only be capable as
>   of now with cxl_test memdevs.
> ---
>  .clang-format        |  2 +
>  drivers/cxl/cxl.h    | 13 +++++++
>  drivers/cxl/region.c | 89 +++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 103 insertions(+), 1 deletion(-)
>
> diff --git a/.clang-format b/.clang-format
> index fa959436bcfd..1221d53be90b 100644
> --- a/.clang-format
> +++ b/.clang-format
> @@ -169,6 +169,8 @@ ForEachMacros:
>    - 'for_each_cpu_and'
>    - 'for_each_cpu_not'
>    - 'for_each_cpu_wrap'
> +  - 'for_each_cxl_decoder_target'
> +  - 'for_each_cxl_endpoint'
>    - 'for_each_dapm_widgets'
>    - 'for_each_dev_addr'
>    - 'for_each_dev_scope'
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index b300673072f5..a291999431c7 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -81,6 +81,19 @@ static inline int cxl_to_interleave_ways(u8 eniw)
>         }
>  }
>
> +static inline u8 cxl_to_eniw(u8 ways)
> +{
> +       if (is_power_of_2(ways))
> +               return ilog2(ways);
> +
> +       return ways / 3 + 8;
> +}
> +
> +static inline u8 cxl_to_ig(u16 g)
> +{
> +       return ilog2(g) - 8;
> +}

These need better names to not be confused with the reverse helpers.
How about interleave_ways_to_cxl_iw() and
interleave_granularity_to_cxl_ig()?

> +
>  static inline bool cxl_is_interleave_ways_valid(int iw)
>  {
>         switch (iw) {
> diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
> index 5588873dd250..562c8720da56 100644
> --- a/drivers/cxl/region.c
> +++ b/drivers/cxl/region.c
> @@ -29,6 +29,17 @@
>
>  #define region_ways(region) ((region)->config.interleave_ways)
>  #define region_granularity(region) ((region)->config.interleave_granularity)
> +#define region_eniw(region) (cxl_to_eniw(region_ways(region)))
> +#define region_ig(region) (cxl_to_ig(region_granularity(region)))

This feels like too much indirection...

> +
> +#define for_each_cxl_endpoint(ep, region, idx)                                 \
> +       for (idx = 0, ep = (region)->config.targets[idx];                      \
> +            idx < region_ways(region); ep = (region)->config.targets[++idx])

Is the macro really buying anything in terms of readability?

for (i = 0; i < region->interleave_ways; i++)

...looks ok to me.

> +
> +#define for_each_cxl_decoder_target(dport, decoder, idx)                       \
> +       for (idx = 0, dport = (decoder)->target[idx];                          \
> +            idx < (decoder)->nr_targets - 1;                                  \
> +            dport = (decoder)->target[++idx])

Doesn't this need locking to protect against target array updates?
Another detail that might be less obfuscated with an open coded loop.
I would only expect a new for_each() macro when the iterator is a
function call, not a simple array de-reference with an incremented
index.

>
>  static struct cxl_decoder *rootd_from_region(struct cxl_region *cxlr)
>  {
> @@ -195,6 +206,30 @@ static bool qtg_match(const struct cxl_decoder *rootd,
>         return true;
>  }
>
> +static int get_unique_hostbridges(const struct cxl_region *cxlr,
> +                                 struct cxl_port **hbs)
> +{
> +       struct cxl_memdev *ep;
> +       int i, hb_count = 0;
> +
> +       for_each_cxl_endpoint(ep, cxlr, i) {
> +               struct cxl_port *hb = get_hostbridge(ep);
> +               bool found = false;
> +               int j;
> +
> +               BUG_ON(!hb);

Doesn't seem like a reason to crash the kernel.

> +
> +               for (j = 0; j < hb_count; j++) {
> +                       if (hbs[j] == hb)
> +                               found = true;
> +               }
> +               if (!found)
> +                       hbs[hb_count++] = hb;
> +       }
> +
> +       return hb_count;
> +}
> +
>  /**
>   * region_xhb_config_valid() - determine cross host bridge validity
>   * @cxlr: The region being programmed
> @@ -208,7 +243,59 @@ static bool qtg_match(const struct cxl_decoder *rootd,
>  static bool region_xhb_config_valid(const struct cxl_region *cxlr,
>                                     const struct cxl_decoder *rootd)
>  {
> -       /* TODO: */
> +       const int rootd_eniw = cxl_to_eniw(rootd->interleave_ways);
> +       const int rootd_ig = cxl_to_ig(rootd->interleave_granularity);
> +       const int cxlr_ig = region_ig(cxlr);
> +       const int cxlr_iw = region_ways(cxlr);
> +       struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];

I'm worried about the stack usage of this. 0day has not complained
yet, but is it necessary to collect everything into an array versus
having a helper that does the lookup based on an index? It's not like
cxl_region_probe() is a fast path.

> +       struct cxl_dport *target;
> +       int i;
> +
> +       i = get_unique_hostbridges(cxlr, hbs);
> +       if (dev_WARN_ONCE(&cxlr->dev, i == 0, "Cannot find a valid host bridge\n"))

Doesn't seem like a reason to crash the kernel. At least the topology
basics like this can be validated at target assignment time.

> +               return false;
> +
> +       /* Are all devices in this region on the same CXL host bridge */
> +       if (i == 1)
> +               return true;

Doesn't this also need to check that the decoder is not interleaved
across host bridges?

> +
> +       /* CFMWS.HBIG >= Device.Label.IG */
> +       if (rootd_ig < cxlr_ig) {
> +               dev_dbg(&cxlr->dev,
> +                       "%s HBIG must be greater than region IG (%d < %d)\n",
> +                       dev_name(&rootd->dev), rootd_ig, cxlr_ig);
> +               return false;
> +       }
> +
> +       /*
> +        * ((2^(CFMWS.HBIG - Device.RLabel.IG) * (2^CFMWS.ENIW)) > Device.RLabel.NLabel)
> +        *
> +        * XXX: 2^CFMWS.ENIW is trying to decode the NIW. Instead, use the look
> +        * up function which supports non power of 2 interleave configurations.
> +        */
> +       if (((1 << (rootd_ig - cxlr_ig)) * (1 << rootd_eniw)) > cxlr_iw) {

Now here is where some helper macros could make things more readable.
This looks like an acronym soup despite me being familiar with the CXL
spec.

> +               dev_dbg(&cxlr->dev,
> +                       "granularity ratio requires a larger number of devices (%d) than currently configured (%d)\n",
> +                       ((1 << (rootd_ig - cxlr_ig)) * (1 << rootd_eniw)),
> +                       cxlr_iw);
> +               return false;
> +       }
> +
> +       /*
> +        * CFMWS.InterleaveTargetList[n] must contain all devices, x where:
> +        *      (Device[x],RegionLabel.Position >> (CFMWS.HBIG -
> +        *      Device[x].RegionLabel.InterleaveGranularity)) &
> +        *      ((2^CFMWS.ENIW) - 1) = n
> +        */
> +       for_each_cxl_decoder_target(target, rootd, i) {
> +               if (((i >> (rootd_ig - cxlr_ig))) &
> +                   (((1 << rootd_eniw) - 1) != target->port_id)) {
> +                       dev_dbg(&cxlr->dev,
> +                               "One or more devices are not connected to the correct hostbridge.\n");
> +                       return false;
> +               }
> +       }
> +
>         return true;
>  }
>
> --
> 2.35.0
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 08/14] cxl/region: HB port config verification
  2022-01-28  0:27 ` [PATCH v3 08/14] cxl/region: HB port config verification Ben Widawsky
  2022-02-14 16:20   ` Jonathan Cameron
  2022-02-15 16:35   ` Jonathan Cameron
@ 2022-02-18 21:04   ` Dan Williams
  2 siblings, 0 replies; 70+ messages in thread
From: Dan Williams @ 2022-02-18 21:04 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> Host bridge root port verification determines if the device ordering in
> an interleave set can be programmed through the host bridges and
> switches.
>
> The algorithm implemented here is based on the CXL Type 3 Memory Device
> Software Guide, chapter 2.13.15. The current version of the guide does
> not yet support x3 interleave configurations, and so that's not
> supported here either.

Given x3 is in a released ECN lets go ahead and include it because it
may have a material effect on the design, but more importantly the
ABI.

>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> ---
>  .clang-format           |   1 +
>  drivers/cxl/core/port.c |   1 +
>  drivers/cxl/cxl.h       |   2 +
>  drivers/cxl/region.c    | 127 +++++++++++++++++++++++++++++++++++++++-
>  4 files changed, 130 insertions(+), 1 deletion(-)
>
> diff --git a/.clang-format b/.clang-format
> index 1221d53be90b..5e20206f905e 100644
> --- a/.clang-format
> +++ b/.clang-format
> @@ -171,6 +171,7 @@ ForEachMacros:
>    - 'for_each_cpu_wrap'
>    - 'for_each_cxl_decoder_target'
>    - 'for_each_cxl_endpoint'
> +  - 'for_each_cxl_endpoint_hb'
>    - 'for_each_dapm_widgets'
>    - 'for_each_dev_addr'
>    - 'for_each_dev_scope'
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 0847e6ce19ef..1d81c5f56a3e 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -706,6 +706,7 @@ struct cxl_dport *devm_cxl_add_dport(struct cxl_port *port,
>                 return ERR_PTR(-ENOMEM);
>
>         INIT_LIST_HEAD(&dport->list);
> +       INIT_LIST_HEAD(&dport->verify_link);
>         dport->dport = dport_dev;
>         dport->port_id = port_id;
>         dport->component_reg_phys = component_reg_phys;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index a291999431c7..ed984465b59c 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -350,6 +350,7 @@ struct cxl_port {
>   * @component_reg_phys: downstream port component registers
>   * @port: reference to cxl_port that contains this downstream port
>   * @list: node for a cxl_port's list of cxl_dport instances
> + * @verify_link: node used for hb root port verification
>   */
>  struct cxl_dport {
>         struct device *dport;
> @@ -357,6 +358,7 @@ struct cxl_dport {
>         resource_size_t component_reg_phys;
>         struct cxl_port *port;
>         struct list_head list;
> +       struct list_head verify_link;

Is this list also protected by the port device_lock?

>  };
>
>  /**
> diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
> index 562c8720da56..d2f6c990c8a8 100644
> --- a/drivers/cxl/region.c
> +++ b/drivers/cxl/region.c
> @@ -4,6 +4,7 @@
>  #include <linux/genalloc.h>
>  #include <linux/device.h>
>  #include <linux/module.h>
> +#include <linux/sort.h>
>  #include <linux/pci.h>
>  #include "cxlmem.h"
>  #include "region.h"
> @@ -36,6 +37,12 @@
>         for (idx = 0, ep = (region)->config.targets[idx];                      \
>              idx < region_ways(region); ep = (region)->config.targets[++idx])
>
> +#define for_each_cxl_endpoint_hb(ep, region, hb, idx)                          \
> +       for (idx = 0, (ep) = (region)->config.targets[idx];                    \
> +            idx < region_ways(region);                                        \
> +            idx++, (ep) = (region)->config.targets[idx])                      \
> +               if (get_hostbridge(ep) == (hb))
> +
>  #define for_each_cxl_decoder_target(dport, decoder, idx)                       \
>         for (idx = 0, dport = (decoder)->target[idx];                          \
>              idx < (decoder)->nr_targets - 1;                                  \
> @@ -299,6 +306,59 @@ static bool region_xhb_config_valid(const struct cxl_region *cxlr,
>         return true;
>  }
>
> +static struct cxl_dport *get_rp(struct cxl_memdev *ep)
> +{
> +       struct cxl_port *port, *parent_port = port = ep->port;
> +       struct cxl_dport *dport;
> +
> +       while (!is_cxl_root(port)) {
> +               parent_port = to_cxl_port(port->dev.parent);
> +               if (parent_port->depth == 1)
> +                       list_for_each_entry(dport, &parent_port->dports, list)

Locking?

> +                               if (dport->dport == port->uport->parent->parent)

This assumes no switches.

Effectively it is identical to what devm_cxl_enumerate_ports(), which
does support switches, is doing. To reduce maintenance burden it could
follow the same pattern of:

for (iter = dev; iter; iter = grandparent(iter))
...
if (dev_is_cxl_root_child(&port->dev))
...

> +                                       return dport;
> +               port = parent_port;
> +       }
> +
> +       BUG();

more kernel crashing... why?

> +       return NULL;
> +}
> +
> +static int get_num_root_ports(const struct cxl_region *cxlr)
> +{
> +       struct cxl_memdev *endpoint;
> +       struct cxl_dport *dport, *tmp;
> +       int num_root_ports = 0;
> +       LIST_HEAD(root_ports);
> +       int idx;
> +
> +       for_each_cxl_endpoint(endpoint, cxlr, idx) {
> +               struct cxl_dport *root_port = get_rp(endpoint);
> +
> +               if (list_empty(&root_port->verify_link)) {
> +                       list_add_tail(&root_port->verify_link, &root_ports);

Doesn't this run into problems when there are multiple regions per root port?

> +                       num_root_ports++;
> +               }
> +       }
> +
> +       list_for_each_entry_safe(dport, tmp, &root_ports, verify_link)
> +               list_del_init(&dport->verify_link);
> +
> +       return num_root_ports;
> +}
> +
> +static bool has_switch(const struct cxl_region *cxlr)
> +{
> +       struct cxl_memdev *ep;
> +       int i;
> +
> +       for_each_cxl_endpoint(ep, cxlr, i)
> +               if (ep->port->depth > 2)
> +                       return true;
> +
> +       return false;
> +}
> +
>  /**
>   * region_hb_rp_config_valid() - determine root port ordering is correct
>   * @cxlr: Region to validate
> @@ -312,7 +372,72 @@ static bool region_xhb_config_valid(const struct cxl_region *cxlr,
>  static bool region_hb_rp_config_valid(const struct cxl_region *cxlr,
>                                       const struct cxl_decoder *rootd)
>  {
> -       /* TODO: */
> +       const int num_root_ports = get_num_root_ports(cxlr);
> +       struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];

In terms of stack usage, doesn't the caller also have one of these on
the stack when this is called?

> +       int hb_count, i;
> +
> +       hb_count = get_unique_hostbridges(cxlr, hbs);
> +
> +       /* TODO: Switch support */
> +       if (has_switch(cxlr))
> +               return false;
> +
> +       /*
> +        * Are all devices in this region on the same CXL Host Bridge
> +        * Root Port?
> +        */
> +       if (num_root_ports == 1 && !has_switch(cxlr))
> +               return true;

How can this happen without any intervening switch?

> +
> +       for (i = 0; i < hb_count; i++) {
> +               int idx, position_mask;
> +               struct cxl_dport *rp;
> +               struct cxl_port *hb;
> +
> +               /* Get next CXL Host Bridge this region spans */
> +               hb = hbs[i];
> +
> +               /*
> +                * Calculate the position mask: NumRootPorts = 2^PositionMask
> +                * for this region.
> +                *
> +                * XXX: pos_mask is actually (1 << PositionMask)  - 1
> +                */
> +               position_mask = (1 << (ilog2(num_root_ports))) - 1;

Isn't "1 << ilog2(num_root_ports)" just "num_root_ports"?

> +
> +               /*
> +                * Calculate the PortGrouping for each device on this CXL Host
> +                * Bridge Root Port:
> +                * PortGrouping = RegionLabel.Position & PositionMask

Still confused what a port grouping is and what it means for the
algorithm especially since RegionLabels are not relevant to this part
of the algorithm. This assumes someone is familiar with "guide"
terminology?

> +                *
> +                * The following nest iterators effectively iterate over each
> +                * root port in the region.
> +                *   for_each_unique_rootport(rp, cxlr)
> +                */
> +               list_for_each_entry(rp, &hb->dports, list) {
> +                       struct cxl_memdev *ep;
> +                       int port_grouping = -1;
> +
> +                       for_each_cxl_endpoint_hb(ep, cxlr, hb, idx) {
> +                               if (get_rp(ep) != rp)
> +                                       continue;
> +
> +                               if (port_grouping == -1)
> +                                       port_grouping = idx & position_mask;
> +
> +                               /*
> +                                * Do all devices in the region connected to this CXL
> +                                * Host Bridge Root Port have the same PortGrouping?
> +                                */
> +                               if ((idx & position_mask) != port_grouping) {
> +                                       dev_dbg(&cxlr->dev,
> +                                               "One or more devices are not connected to the correct Host Bridge Root Port\n");
> +                                       return false;
> +                               }
> +                       }
> +               }
> +       }
> +
>         return true;
>  }
>
> --
> 2.35.0
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 09/14] cxl/region: Add infrastructure for decoder programming
  2022-01-28  0:27 ` [PATCH v3 09/14] cxl/region: Add infrastructure for decoder programming Ben Widawsky
  2022-02-01 18:16   ` Jonathan Cameron
@ 2022-02-18 21:53   ` Dan Williams
  1 sibling, 0 replies; 70+ messages in thread
From: Dan Williams @ 2022-02-18 21:53 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> There are 3 steps in handling region programming once it has been
> configured by userspace.
> 1. Sanitize the parameters against the system.
> 2. Collect decoder resources from the topology
> 3. Program decoder resources
>
> The infrastructure added here addresses #2. Two new APIs are introduced
> to allow collecting and returning decoder resources. Additionally the
> infrastructure includes two lists managed by the region driver, a staged
> list, and a commit list. The staged list contains those collected in
> step #2, and the commit list are all the decoders programmed in step #3.

I expect this patch will see significant rewrites with the ABI change
to register endpoint decoders with regions. It's otherwise redundant
to have a 'targets' array and then yet mosre linked lists to walk the
decoders associated with a region.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 10/14] cxl/region: Collect host bridge decoders
  2022-01-28  0:27 ` [PATCH v3 10/14] cxl/region: Collect host bridge decoders Ben Widawsky
  2022-02-01 18:21   ` Jonathan Cameron
@ 2022-02-18 23:42   ` Dan Williams
  1 sibling, 0 replies; 70+ messages in thread
From: Dan Williams @ 2022-02-18 23:42 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> Part of host bridge verification in the CXL Type 3 Memory Device
> Software Guide calculates the host bridge interleave target list (6th
> step in the flow chart), ie. verification and state update are done in
> the same step. Host bridge verification is already in place, so go ahead
> and store the decoders with their target lists.
>
> Switches are implemented in a separate patch.
>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> ---
>  drivers/cxl/region.c | 17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
> index 145d7bb02714..b8982be13bfe 100644
> --- a/drivers/cxl/region.c
> +++ b/drivers/cxl/region.c
> @@ -428,6 +428,7 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
>                 return simple_config(cxlr, hbs[0]);
>
>         for (i = 0; i < hb_count; i++) {
> +               struct cxl_decoder *cxld;
>                 int idx, position_mask;
>                 struct cxl_dport *rp;
>                 struct cxl_port *hb;
> @@ -486,6 +487,18 @@ static bool region_hb_rp_config_valid(struct cxl_region *cxlr,
>                                                 "One or more devices are not connected to the correct Host Bridge Root Port\n");
>                                         goto err;
>                                 }
> +
> +                               if (!state_update)
> +                                       continue;
> +
> +                               if (dev_WARN_ONCE(&cxld->dev,
> +                                                 port_grouping >= cxld->nr_targets,
> +                                                 "Invalid port grouping %d/%d\n",
> +                                                 port_grouping, cxld->nr_targets))
> +                                       goto err;
> +
> +                               cxld->interleave_ways++;
> +                               cxld->target[port_grouping] = get_rp(ep);

There is not enough context in the changelog to understand what this
code is doing, but I do want to react to all this caching of objects
without references. I'd prefer helpers that walk the device that are
already synced with device_del() events than worry about these caches
and when to invalidate their references.

>                         }
>                 }
>         }
> @@ -538,7 +551,7 @@ static bool rootd_valid(const struct cxl_region *cxlr,
>
>  struct rootd_context {
>         const struct cxl_region *cxlr;
> -       struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];
> +       const struct cxl_port *hbs[CXL_DECODER_MAX_INTERLEAVE];
>         int count;
>  };
>
> @@ -564,7 +577,7 @@ static struct cxl_decoder *find_rootd(const struct cxl_region *cxlr,
>         struct rootd_context ctx;
>         struct device *ret;
>
> -       ctx.cxlr = cxlr;
> +       ctx.cxlr = (struct cxl_region *)cxlr;

If const requires casting then don't use const.

>
>         ret = device_find_child((struct device *)&root->dev, &ctx, rootd_match);
>         if (ret)
> --
> 2.35.0
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 02/14] cxl/region: Introduce concept of region configuration
  2022-02-17 19:57       ` Dan Williams
  2022-02-17 20:20         ` Ben Widawsky
@ 2022-02-23 21:49         ` Ben Widawsky
  2022-02-23 22:24           ` Dan Williams
  1 sibling, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-02-23 21:49 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Bjorn Helgaas,
	Linux NVDIMM, Linux PCI

On 22-02-17 11:57:59, Dan Williams wrote:
> On Thu, Feb 17, 2022 at 10:36 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > Consolidating earlier discussions...
> >
> > On 22-01-28 16:25:34, Dan Williams wrote:
> > > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > >
> > > > The region creation APIs create a vacant region. Configuring the region
> > > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > > will be provided to allow userspace to configure the region.  Finally
> > > > once all configuration is complete, userspace may activate the region.
> > > >
> > > > Introduced here are the most basic attributes needed to configure a
> > > > region. Details of these attribute are described in the ABI
> > >
> > > s/attribute/attributes/
> > >
> > > > Documentation. Sanity checking of configuration parameters are done at
> > > > region binding time. This consolidates all such logic in one place,
> > > > rather than being strewn across multiple places.
> > >
> > > I think that's too late for some of the validation. The complex
> > > validation that the region driver does throughout the topology is
> > > different from the basic input validation that can  be done at the
> > > sysfs write time. For example ,this patch allows negative
> > > interleave_granularity values to specified, just return -EINVAL. I
> > > agree that sysfs should not validate everything, I disagree with
> > > pushing all validation to cxl_region_probe().
> > >
> >
> > Okay. It might save us some back and forth if you could outline everything you'd
> > expect to be validated, but I can also make an attempt to figure out the
> > reasonable set of things.
> 
> Input validation. Every value that gets written to a sysfs attribute
> should be checked for validity, more below:
> 
> >
> > > >
> > > > A example is provided below:
> > > >
> > > > /sys/bus/cxl/devices/region0.0:0
> > > > ├── interleave_granularity
> 
> ...validate granularity is within spec and can be supported by the root decoder.
> 
> > > > ├── interleave_ways
> 
> ...validate ways is within spec and can be supported by the root decoder.

I'm not sure how to do this one. Validation requires device positions and we
can't set the targets until ways is set. Can you please provide some more
insight on what you'd like me to check in addition to the value being within
spec?

> 
> > > > ├── offset
> > > > ├── size
> 
> ...try to reserve decoder capacity to validate that there is available space.
> 
> > > > ├── subsystem -> ../../../../../../bus/cxl
> > > > ├── target0
> 
> ...validate that the target maps to the decoder.
> 
> > > > ├── uevent
> > > > └── uuid
> 
> ...validate that the uuid is unique relative to other regions.
> 
> > >
> > > As mentioned off-list, it looks like devtype and modalias are missing.
> > >
> >
> > Yep. This belongs in the previous patch though.
> >
> > > >
> > > > Reported-by: kernel test robot <lkp@intel.com> (v2)
> > > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > > > ---
> > > >  Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
> > > >  drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
> > > >  2 files changed, 340 insertions(+)
> > > >
> > > > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > > > index dcc728458936..50ba5018014d 100644
> > > > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > > > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > > > @@ -187,3 +187,43 @@ Description:
> > > >                 region driver before being deleted. The attributes expects a
> > > >                 region in the form "regionX.Y:Z". The region's name, allocated
> > > >                 by reading create_region, will also be released.
> > > > +
> > > > +What:          /sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset
> > >
> > > This is just another 'resource' attribute for the physical base
> > > address of the region, right? 'offset' sounds like something that
> > > would be relative instead of absolute.
> > >
> >
> > It was meant to be relative. I can make it absolute if that's preferable but the
> > physical base is known at the decoder level already.
> 
> Yes, but it saves userspace a step to get the absolute value here and
> matches what happens in PCI sysfs where the user is not required to
> look up the bridge base to calculate the absolute value.
> 
> >
> > > > +Date:          August, 2021
> > >
> > > Same date update comment here.
> > >
> > > > +KernelVersion: v5.18
> > > > +Contact:       linux-cxl@vger.kernel.org
> > > > +Description:
> > > > +               (RO) A region resides within an address space that is claimed by
> > > > +               a decoder.
> > >
> > > "A region is a contiguous partition of a CXL Root decoder address space."
> > >
> > > >                  Region space allocation is handled by the driver, but
> > >
> > > "Region capacity is allocated by writing to the size attribute, the
> > > resulting physical address base determined by the driver is reflected
> > > here."
> > >
> > > > +               the offset may be read by userspace tooling in order to
> > > > +               determine fragmentation, and available size for new regions.
> > >
> > > I would also expect, before / along with these new region attributes,
> > > there would be 'available' and 'max_extent_available' at the decoder
> > > level to indicate how much free space the decoder has and how big the
> > > next region creation can be. User tooling can walk  the decoder and
> > > the regions together to determine fragmentation if necessary, but for
> > > the most part the tool likely only cares about "how big can the next
> > > region be?" and "how full is this decoder?".
> > >
> >
> > Since this is the configuration part of the ABI, I'd rather add that information
> > when the plumbing to report them exists. I'm struggling to understand the
> > balance (as mentioned also earlier in this mail thread) as to what userspace
> > does and what the kernel does. I will add these as you request.
> 
> Userspace asks by DPA size and SPA size and the kernel validates /
> performs the allocation on its behalf.
> 
> > > > +
> > > > +What:
> > > > +/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
> > > > +Date:          August, 2021
> > > > +KernelVersion: v5.18
> > > > +Contact:       linux-cxl@vger.kernel.org
> > > > +Description:
> > > > +               (RW) Configuring regions requires a minimal set of parameters in
> > > > +               order for the subsequent bind operation to succeed. The
> > > > +               following parameters are defined:
> > >
> > > Let's split up the descriptions into individual sections. That can
> > > also document the order that attributes must be written. For example,
> > > doesn't size need to be set before targets are added so that targets
> > > can be validated whether they have sufficient capacity?
> > >
> >
> > Okay. Since we're moving toward making the sysfs ABI stateful,
> 
> sysfs is always stateful. Stateless would be an ioctl.
> 
> > would you like me
> > to make the attrs only visible when they can actually be set?
> 
> No, that's a bit too much magic, and it would be racy.
> 
> >
> > > > +
> > > > +               ==      ========================================================
> > > > +               interleave_granularity Mandatory. Number of consecutive bytes
> > > > +                       each device in the interleave set will claim. The
> > > > +                       possible interleave granularity values are determined by
> > > > +                       the CXL spec and the participating devices.
> > > > +               interleave_ways Mandatory. Number of devices participating in the
> > > > +                       region. Each device will provide 1/interleave of storage
> > > > +                       for the region.
> > > > +               size    Manadatory. Phsyical address space the region will
> > > > +                       consume.
> > >
> > > s/Phsyical/Physical/
> > >
> > > > +               target  Mandatory. Memory devices are the backing storage for a
> > > > +                       region. There will be N targets based on the number of
> > > > +                       interleave ways that the top level decoder is configured
> > > > +                       for.
> > >
> > > That doesn't sound right, IW at the root != IW at the endpoint level
> > > and the region needs to record all the endpoint level targets.
> >
> > Correct.
> >
> > >
> > > > Each target must be set with a memdev device ie.
> > > > +                       'mem1'. This attribute only becomes available after
> > > > +                       setting the 'interleave' attribute.
> > > > +               uuid    Optional. A unique identifier for the region. If none is
> > > > +                       selected, the kernel will create one.
> > >
> > > Let's drop the Mandatory / Optional distinction, or I am otherwise not
> > > understanding what this is trying to document. For example 'uuid' is
> > > "mandatory" for PMEM regions and "omitted" for volatile regions not
> > > optional.
> >
> > Okay.
> >
> > >
> > > > +               ==      ========================================================
> > > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > > > index 1a448543db0d..3b48e0469fc7 100644
> > > > --- a/drivers/cxl/core/region.c
> > > > +++ b/drivers/cxl/core/region.c
> > > > @@ -3,9 +3,12 @@
> > > >  #include <linux/io-64-nonatomic-lo-hi.h>
> > > >  #include <linux/device.h>
> > > >  #include <linux/module.h>
> > > > +#include <linux/sizes.h>
> > > >  #include <linux/slab.h>
> > > > +#include <linux/uuid.h>
> > > >  #include <linux/idr.h>
> > > >  #include <region.h>
> > > > +#include <cxlmem.h>
> > > >  #include <cxl.h>
> > > >  #include "core.h"
> > > >
> > > > @@ -18,11 +21,305 @@
> > > >   * (programming the hardware) is handled by a separate region driver.
> > > >   */
> > > >
> > > > +struct cxl_region *to_cxl_region(struct device *dev);
> > > > +static const struct attribute_group region_interleave_group;
> > > > +
> > > > +static bool is_region_active(struct cxl_region *cxlr)
> > > > +{
> > > > +       /* TODO: Regions can't be activated yet. */
> > > > +       return false;
> > >
> > > This function seems redundant with just checking "cxlr->dev.driver !=
> > > NULL"? The benefit of that is there is no need to carry a TODO in the
> > > series.
> > >
> >
> > The idea behind this was to give the reviewer somewhat of a bigger picture as to
> > how things should work in the code rather than in a commit message. I will
> > remove this.
> 
> They look premature to me.
> 
> >
> > > > +}
> > > > +
> > > > +static void remove_target(struct cxl_region *cxlr, int target)
> > > > +{
> > > > +       struct cxl_memdev *cxlmd;
> > > > +
> > > > +       cxlmd = cxlr->config.targets[target];
> > > > +       if (cxlmd)
> > > > +               put_device(&cxlmd->dev);
> > >
> > > A memdev can be a member of multiple regions at once, shouldn't this
> > > be an endpoint decoder or similar, not the entire memdev?
> > >
> > > Also, if memdevs autoremove themselves from regions at memdev
> > > ->remove() time then I don't think the region needs to hold references
> > > on memdevs.
> > >
> >
> > This needs some work. The concern I have is region operations will need to
> > operate on memdevs/decoders at various points in time. When the memdev goes
> > away, the region will also need to go away. None of that plumbing was in place
> > in v3 and the reference on the memdev was just a half-hearted attempt at doing
> > the right thing.
> >
> > For now if you prefer I remove the reference, but perhaps the decoder reference
> > would buy us some safety?
> 
> So, I don't want to merge an interim solution. I think this series
> needs to prove out the end to end final ABI with all the lifetime
> issues worked out before committing to it upstream because lifetime
> issues get much harder to fix when they also need to conform to a
> legacy ABI.
> 
> >
> > > > +       cxlr->config.targets[target] = NULL;
> > > > +}
> > > > +
> > > > +static ssize_t interleave_ways_show(struct device *dev,
> > > > +                                   struct device_attribute *attr, char *buf)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
> > > > +}
> > > > +
> > > > +static ssize_t interleave_ways_store(struct device *dev,
> > > > +                                    struct device_attribute *attr,
> > > > +                                    const char *buf, size_t len)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       int ret, prev_iw;
> > > > +       int val;
> > >
> > > I would expect:
> > >
> > > if (dev->driver)
> > >    return -EBUSY;
> > >
> > > ...to shutdown configuration writes once the region is active. Might
> > > also need a region-wide seqlock like target_list_show. So that region
> > > probe drains  all active sysfs writers before assuming the
> > > configuration is stable.
> >
> > Initially my thought here is that this is a problem for userspace to deal with.
> > If userspace can't figure out how to synchronously configure and bind the
> > region, that's not a kernel problem.
> 
> The kernel always needs to protect itself. Userspace is free to race
> itself, but it can not be allowed to trigger a kernel race. So there
> needs to be protection against userspace writing interleave_ways and
> the kernel being able to trust that interleave_ways is now static for
> the life of the region.
> 
> > However, we've put some effort into
> > protecting userspace from itself in the create ABI, so it might be more in line
> > to do that here.
> 
> That safety was about preventing userspace from leaking kernel memory,
> not about protecting userspace from itself. It's still the case that
> userspace racing itself will get horribly confused when it collides
> region creation, but the kernel protects itself by resolving the race.
> 
> > In summary, I'm fine to add it, but I think I really need to get more in your
> > brain about the userspace/kernel divide sooner rather than later.
> 
> Don't let userspace break the kernel, that's it.
> 
> >
> > >
> > > > +
> > > > +       prev_iw = cxlr->config.interleave_ways;
> > > > +       ret = kstrtoint(buf, 0, &val);
> > > > +       if (ret)
> > > > +               return ret;
> > > > +       if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
> > > > +               return -EINVAL;
> > > > +
> > > > +       cxlr->config.interleave_ways = val;
> > > > +
> > > > +       ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
> > > > +       if (ret < 0)
> > > > +               goto err;
> > > > +
> > > > +       sysfs_notify(&dev->kobj, NULL, "target_interleave");
> > >
> > > Why?
> > >
> >
> > copypasta
> >
> > > > +
> > > > +       while (prev_iw > cxlr->config.interleave_ways)
> > > > +               remove_target(cxlr, --prev_iw);
> > >
> > > To make the kernel side simpler this attribute could just require that
> > > setting interleave ways is a one way street, if you want to change it
> > > you need to delete the region and start over.
> > >
> >
> > Okay. One of the earlier versions did this implicitly since the #ways was needed
> > to create the region. I thought from the ABI perspective, flexibility was good.
> > Userspace may choose not to utilize it.
> 
> More flexibility == more maintenance burden. If it's not strictly
> necessary, don't expose it, so making this read-only seems simpler to
> me.
> 
> [..]
> > > > +       device_lock(&cxlr->dev);
> > > > +       if (is_region_active(cxlr))
> > > > +               rc = -EBUSY;
> > > > +       else
> > > > +               cxlr->config.size = val;
> > > > +       device_unlock(&cxlr->dev);
> > >
> > > I think lockdep will complain about device_lock() usage in an
> > > attribute. Try changing this to cxl_device_lock() with
> > > CONFIG_PROVE_CXL_LOCKING=y.
> > >
> >
> > I might have messed it up, but I didn't seem to run into an issue. With the
> > driver bound check though, it can go away.
> >
> > I think it would be really good to add this kind of detail to sysfs.rst. Quick
> > grep finds me arm64/kernel/mte and the nfit driver taking the device lock in an
> > attr.
> 
> Yeah, CONFIG_PROVE_{NVDIMM,CXL}_LOCKING needs to annotate the
> driver-core as well. I'm concerned there's a class of deadlocks that
> lockdep just can't see.
> 
> >
> >
> > > > +
> > > > +       return rc ? rc : len;
> > > > +}
> > > > +static DEVICE_ATTR_RW(size);
> > > > +
> > > > +static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> > > > +                        char *buf)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
> > > > +}
> > > > +
> > > > +static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> > > > +                         const char *buf, size_t len)
> > > > +{
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       ssize_t rc;
> > > > +
> > > > +       if (len != UUID_STRING_LEN + 1)
> > > > +               return -EINVAL;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +       if (is_region_active(cxlr))
> > > > +               rc = -EBUSY;
> > > > +       else
> > > > +               rc = uuid_parse(buf, &cxlr->config.uuid);
> > > > +       device_unlock(&cxlr->dev);
> > > > +
> > > > +       return rc ? rc : len;
> > > > +}
> > > > +static DEVICE_ATTR_RW(uuid);
> > > > +
> > > > +static struct attribute *region_attrs[] = {
> > > > +       &dev_attr_interleave_ways.attr,
> > > > +       &dev_attr_interleave_granularity.attr,
> > > > +       &dev_attr_offset.attr,
> > > > +       &dev_attr_size.attr,
> > > > +       &dev_attr_uuid.attr,
> > > > +       NULL,
> > > > +};
> > > > +
> > > > +static const struct attribute_group region_group = {
> > > > +       .attrs = region_attrs,
> > > > +};
> > > > +
> > > > +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> > > > +{
> > > > +       int ret;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +       if (!cxlr->config.targets[n])
> > > > +               ret = sysfs_emit(buf, "\n");
> > > > +       else
> > > > +               ret = sysfs_emit(buf, "%s\n",
> > > > +                                dev_name(&cxlr->config.targets[n]->dev));
> > > > +       device_unlock(&cxlr->dev);
> > >
> > > The component contribution of a memdev to a region is a DPA-span, not
> > > the whole memdev. I would expect something like dax_mapping_attributes
> > > or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
> > > information about the component contribution of a memdev to a region.
> > >
> >
> > I think show_target should just return the chosen decoder and then the decoder
> > attributes will tell the rest, wouldn't they?
> 
> Given the conflicts that can arise between HDM decoders needing to map
> increasing DPA values and other conflicts that there will be
> situations where the kernel auto-picking a decoder will get in the
> way. Exposing the decoder selection to userspace also gives one more
> place to do leaf validation. I.e. at decoder-to-region assignment time
> the kernel can validate that the DPA is available and can be mapped by
> the given decoder given the state of other decoders on that device.
> 
> >
> > > > +
> > > > +       return ret;
> > > > +}
> > > > +
> > > > +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> > > > +                         size_t len)
> > > > +{
> > > > +       struct device *memdev_dev;
> > > > +       struct cxl_memdev *cxlmd;
> > > > +
> > > > +       device_lock(&cxlr->dev);
> > > > +
> > > > +       if (len == 1 || cxlr->config.targets[n])
> > > > +               remove_target(cxlr, n);
> > > > +
> > > > +       /* Remove target special case */
> > > > +       if (len == 1) {
> > > > +               device_unlock(&cxlr->dev);
> > > > +               return len;
> > > > +       }
> > > > +
> > > > +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> > >
> > > I think this wants to be an endpoint decoder, not a memdev. Because
> > > it's the decoder that joins a memdev to a region, or at least a
> > > decoder should be picked when the memdev is assigned so that the DPA
> > > mapping can be registered. If all the decoders are allocated then fail
> > > here.
> > >
> >
> > Per above, I think making this decoders makes sense. I could make it flexible
> > for ease of use, like if you specify memX, the kernel will pick a decoder for
> > you however I suspect you won't like that.
> 
> Right, put the user friendliness in the tooling, not sysfs ABI.
> 
> >
> > > > +       if (!memdev_dev) {
> > > > +               device_unlock(&cxlr->dev);
> > > > +               return -ENOENT;
> > > > +       }
> > > > +
> > > > +       /* reference to memdev held until target is unset or region goes away */
> > > > +
> > > > +       cxlmd = to_cxl_memdev(memdev_dev);
> > > > +       cxlr->config.targets[n] = cxlmd;
> > > > +
> > > > +       device_unlock(&cxlr->dev);
> > > > +
> > > > +       return len;
> > > > +}
> > > > +
> > > > +#define TARGET_ATTR_RW(n)                                                      \
> > > > +       static ssize_t target##n##_show(                                       \
> > > > +               struct device *dev, struct device_attribute *attr, char *buf)  \
> > > > +       {                                                                      \
> > > > +               return show_targetN(to_cxl_region(dev), buf, (n));             \
> > > > +       }                                                                      \
> > > > +       static ssize_t target##n##_store(struct device *dev,                   \
> > > > +                                        struct device_attribute *attr,        \
> > > > +                                        const char *buf, size_t len)          \
> > > > +       {                                                                      \
> > > > +               return set_targetN(to_cxl_region(dev), buf, (n), len);         \
> > > > +       }                                                                      \
> > > > +       static DEVICE_ATTR_RW(target##n)
> > > > +
> > > > +TARGET_ATTR_RW(0);
> > > > +TARGET_ATTR_RW(1);
> > > > +TARGET_ATTR_RW(2);
> > > > +TARGET_ATTR_RW(3);
> > > > +TARGET_ATTR_RW(4);
> > > > +TARGET_ATTR_RW(5);
> > > > +TARGET_ATTR_RW(6);
> > > > +TARGET_ATTR_RW(7);
> > > > +TARGET_ATTR_RW(8);
> > > > +TARGET_ATTR_RW(9);
> > > > +TARGET_ATTR_RW(10);
> > > > +TARGET_ATTR_RW(11);
> > > > +TARGET_ATTR_RW(12);
> > > > +TARGET_ATTR_RW(13);
> > > > +TARGET_ATTR_RW(14);
> > > > +TARGET_ATTR_RW(15);
> > > > +
> > > > +static struct attribute *interleave_attrs[] = {
> > > > +       &dev_attr_target0.attr,
> > > > +       &dev_attr_target1.attr,
> > > > +       &dev_attr_target2.attr,
> > > > +       &dev_attr_target3.attr,
> > > > +       &dev_attr_target4.attr,
> > > > +       &dev_attr_target5.attr,
> > > > +       &dev_attr_target6.attr,
> > > > +       &dev_attr_target7.attr,
> > > > +       &dev_attr_target8.attr,
> > > > +       &dev_attr_target9.attr,
> > > > +       &dev_attr_target10.attr,
> > > > +       &dev_attr_target11.attr,
> > > > +       &dev_attr_target12.attr,
> > > > +       &dev_attr_target13.attr,
> > > > +       &dev_attr_target14.attr,
> > > > +       &dev_attr_target15.attr,
> > > > +       NULL,
> > > > +};
> > > > +
> > > > +static umode_t visible_targets(struct kobject *kobj, struct attribute *a, int n)
> > > > +{
> > > > +       struct device *dev = container_of(kobj, struct device, kobj);
> > > > +       struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +
> > > > +       if (n < cxlr->config.interleave_ways)
> > > > +               return a->mode;
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +static const struct attribute_group region_interleave_group = {
> > > > +       .attrs = interleave_attrs,
> > > > +       .is_visible = visible_targets,
> > > > +};
> > > > +
> > > > +static const struct attribute_group *region_groups[] = {
> > > > +       &region_group,
> > > > +       &region_interleave_group,
> > > > +       NULL,
> > > > +};
> > > > +
> > > >  static void cxl_region_release(struct device *dev);
> > > >
> > > >  static const struct device_type cxl_region_type = {
> > > >         .name = "cxl_region",
> > > >         .release = cxl_region_release,
> > > > +       .groups = region_groups
> > > >  };
> > > >
> > > >  static ssize_t create_region_show(struct device *dev,
> > > > @@ -108,8 +405,11 @@ static void cxl_region_release(struct device *dev)
> > > >  {
> > > >         struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> > > >         struct cxl_region *cxlr = to_cxl_region(dev);
> > > > +       int i;
> > > >
> > > >         ida_free(&cxld->region_ida, cxlr->id);
> > > > +       for (i = 0; i < cxlr->config.interleave_ways; i++)
> > > > +               remove_target(cxlr, i);
> > >
> > > Like the last patch this feels too late. I expect whatever unregisters
> > > the region should have already handled removing the targets.
> > >
> >
> > Would remove() be more appropriate?
> 
> ->remove() does not seem a good fit since it may be the case that
> someone wants do "echo $region >
> /sys/bus/cxl/drivers/cxl_region/unbind; echo $region >
> /sys/bus/cxl/drivers/cxl_region/bind;" without needing to go
> reconfigure the targets. I am suggesting that before
> device_unregister(&cxlr->dev) the targets are released.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 02/14] cxl/region: Introduce concept of region configuration
  2022-02-23 21:49         ` Ben Widawsky
@ 2022-02-23 22:24           ` Dan Williams
  2022-02-23 22:31             ` Ben Widawsky
  0 siblings, 1 reply; 70+ messages in thread
From: Dan Williams @ 2022-02-23 22:24 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Bjorn Helgaas,
	Linux NVDIMM, Linux PCI

On Wed, Feb 23, 2022 at 1:50 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> On 22-02-17 11:57:59, Dan Williams wrote:
> > On Thu, Feb 17, 2022 at 10:36 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > >
> > > Consolidating earlier discussions...
> > >
> > > On 22-01-28 16:25:34, Dan Williams wrote:
> > > > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > > >
> > > > > The region creation APIs create a vacant region. Configuring the region
> > > > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > > > will be provided to allow userspace to configure the region.  Finally
> > > > > once all configuration is complete, userspace may activate the region.
> > > > >
> > > > > Introduced here are the most basic attributes needed to configure a
> > > > > region. Details of these attribute are described in the ABI
> > > >
> > > > s/attribute/attributes/
> > > >
> > > > > Documentation. Sanity checking of configuration parameters are done at
> > > > > region binding time. This consolidates all such logic in one place,
> > > > > rather than being strewn across multiple places.
> > > >
> > > > I think that's too late for some of the validation. The complex
> > > > validation that the region driver does throughout the topology is
> > > > different from the basic input validation that can  be done at the
> > > > sysfs write time. For example ,this patch allows negative
> > > > interleave_granularity values to specified, just return -EINVAL. I
> > > > agree that sysfs should not validate everything, I disagree with
> > > > pushing all validation to cxl_region_probe().
> > > >
> > >
> > > Okay. It might save us some back and forth if you could outline everything you'd
> > > expect to be validated, but I can also make an attempt to figure out the
> > > reasonable set of things.
> >
> > Input validation. Every value that gets written to a sysfs attribute
> > should be checked for validity, more below:
> >
> > >
> > > > >
> > > > > A example is provided below:
> > > > >
> > > > > /sys/bus/cxl/devices/region0.0:0
> > > > > ├── interleave_granularity
> >
> > ...validate granularity is within spec and can be supported by the root decoder.
> >
> > > > > ├── interleave_ways
> >
> > ...validate ways is within spec and can be supported by the root decoder.
>
> I'm not sure how to do this one. Validation requires device positions and we
> can't set the targets until ways is set. Can you please provide some more
> insight on what you'd like me to check in addition to the value being within
> spec?

For example you could check that interleave_ways is >= to the root
level interleave. I.e. it would be invalid to attempt a x1 interleave
on a decoder that is x2 interleaved at the host-bridge level.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 02/14] cxl/region: Introduce concept of region configuration
  2022-02-23 22:24           ` Dan Williams
@ 2022-02-23 22:31             ` Ben Widawsky
  2022-02-23 22:42               ` Dan Williams
  0 siblings, 1 reply; 70+ messages in thread
From: Ben Widawsky @ 2022-02-23 22:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Bjorn Helgaas,
	Linux NVDIMM, Linux PCI

On 22-02-23 14:24:00, Dan Williams wrote:
> On Wed, Feb 23, 2022 at 1:50 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > On 22-02-17 11:57:59, Dan Williams wrote:
> > > On Thu, Feb 17, 2022 at 10:36 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > >
> > > > Consolidating earlier discussions...
> > > >
> > > > On 22-01-28 16:25:34, Dan Williams wrote:
> > > > > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > > > >
> > > > > > The region creation APIs create a vacant region. Configuring the region
> > > > > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > > > > will be provided to allow userspace to configure the region.  Finally
> > > > > > once all configuration is complete, userspace may activate the region.
> > > > > >
> > > > > > Introduced here are the most basic attributes needed to configure a
> > > > > > region. Details of these attribute are described in the ABI
> > > > >
> > > > > s/attribute/attributes/
> > > > >
> > > > > > Documentation. Sanity checking of configuration parameters are done at
> > > > > > region binding time. This consolidates all such logic in one place,
> > > > > > rather than being strewn across multiple places.
> > > > >
> > > > > I think that's too late for some of the validation. The complex
> > > > > validation that the region driver does throughout the topology is
> > > > > different from the basic input validation that can  be done at the
> > > > > sysfs write time. For example ,this patch allows negative
> > > > > interleave_granularity values to specified, just return -EINVAL. I
> > > > > agree that sysfs should not validate everything, I disagree with
> > > > > pushing all validation to cxl_region_probe().
> > > > >
> > > >
> > > > Okay. It might save us some back and forth if you could outline everything you'd
> > > > expect to be validated, but I can also make an attempt to figure out the
> > > > reasonable set of things.
> > >
> > > Input validation. Every value that gets written to a sysfs attribute
> > > should be checked for validity, more below:
> > >
> > > >
> > > > > >
> > > > > > A example is provided below:
> > > > > >
> > > > > > /sys/bus/cxl/devices/region0.0:0
> > > > > > ├── interleave_granularity
> > >
> > > ...validate granularity is within spec and can be supported by the root decoder.
> > >
> > > > > > ├── interleave_ways
> > >
> > > ...validate ways is within spec and can be supported by the root decoder.
> >
> > I'm not sure how to do this one. Validation requires device positions and we
> > can't set the targets until ways is set. Can you please provide some more
> > insight on what you'd like me to check in addition to the value being within
> > spec?
> 
> For example you could check that interleave_ways is >= to the root
> level interleave. I.e. it would be invalid to attempt a x1 interleave
> on a decoder that is x2 interleaved at the host-bridge level.

I tried to convince myself that that assertion always holds and didn't feel
super comfortable. If you do, I can add those kinds of checks.

Thanks.
Ben

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 02/14] cxl/region: Introduce concept of region configuration
  2022-02-23 22:31             ` Ben Widawsky
@ 2022-02-23 22:42               ` Dan Williams
  0 siblings, 0 replies; 70+ messages in thread
From: Dan Williams @ 2022-02-23 22:42 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, kernel test robot, Alison Schofield,
	Ira Weiny, Jonathan Cameron, Vishal Verma, Bjorn Helgaas,
	Linux NVDIMM, Linux PCI

On Wed, Feb 23, 2022 at 2:31 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> On 22-02-23 14:24:00, Dan Williams wrote:
> > On Wed, Feb 23, 2022 at 1:50 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > >
> > > On 22-02-17 11:57:59, Dan Williams wrote:
> > > > On Thu, Feb 17, 2022 at 10:36 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > > >
> > > > > Consolidating earlier discussions...
> > > > >
> > > > > On 22-01-28 16:25:34, Dan Williams wrote:
> > > > > > On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > > > > >
> > > > > > > The region creation APIs create a vacant region. Configuring the region
> > > > > > > works in the same way as similar subsystems such as devdax. Sysfs attrs
> > > > > > > will be provided to allow userspace to configure the region.  Finally
> > > > > > > once all configuration is complete, userspace may activate the region.
> > > > > > >
> > > > > > > Introduced here are the most basic attributes needed to configure a
> > > > > > > region. Details of these attribute are described in the ABI
> > > > > >
> > > > > > s/attribute/attributes/
> > > > > >
> > > > > > > Documentation. Sanity checking of configuration parameters are done at
> > > > > > > region binding time. This consolidates all such logic in one place,
> > > > > > > rather than being strewn across multiple places.
> > > > > >
> > > > > > I think that's too late for some of the validation. The complex
> > > > > > validation that the region driver does throughout the topology is
> > > > > > different from the basic input validation that can  be done at the
> > > > > > sysfs write time. For example ,this patch allows negative
> > > > > > interleave_granularity values to specified, just return -EINVAL. I
> > > > > > agree that sysfs should not validate everything, I disagree with
> > > > > > pushing all validation to cxl_region_probe().
> > > > > >
> > > > >
> > > > > Okay. It might save us some back and forth if you could outline everything you'd
> > > > > expect to be validated, but I can also make an attempt to figure out the
> > > > > reasonable set of things.
> > > >
> > > > Input validation. Every value that gets written to a sysfs attribute
> > > > should be checked for validity, more below:
> > > >
> > > > >
> > > > > > >
> > > > > > > A example is provided below:
> > > > > > >
> > > > > > > /sys/bus/cxl/devices/region0.0:0
> > > > > > > ├── interleave_granularity
> > > >
> > > > ...validate granularity is within spec and can be supported by the root decoder.
> > > >
> > > > > > > ├── interleave_ways
> > > >
> > > > ...validate ways is within spec and can be supported by the root decoder.
> > >
> > > I'm not sure how to do this one. Validation requires device positions and we
> > > can't set the targets until ways is set. Can you please provide some more
> > > insight on what you'd like me to check in addition to the value being within
> > > spec?
> >
> > For example you could check that interleave_ways is >= to the root
> > level interleave. I.e. it would be invalid to attempt a x1 interleave
> > on a decoder that is x2 interleaved at the host-bridge level.
>
> I tried to convince myself that that assertion always holds and didn't feel
> super comfortable. If you do, I can add those kinds of checks.

The only way to support a x1 region on an x2 interleave is to have the
size be equal to interleave granularity so that accesses stay
contained to that one device.

In fact that's another validation step, which you might already have,
region size must be >= and aligned to interleave_granularity *
interleave_ways.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 12/14] cxl: Program decoders for regions
  2022-01-28  0:27 ` [PATCH v3 12/14] cxl: Program decoders for regions Ben Widawsky
@ 2022-02-24  0:08   ` Dan Williams
  0 siblings, 0 replies; 70+ messages in thread
From: Dan Williams @ 2022-02-24  0:08 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-cxl, patches, Alison Schofield, Ira Weiny,
	Jonathan Cameron, Vishal Verma, Bjorn Helgaas, Linux NVDIMM,
	Linux PCI

On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> Configure and commit the HDM decoders for the region. Since the region
> driver already was able to walk the topology and build the list of
> needed decoders, all that was needed to finish region setup was to
> actually write the HDM decoder MMIO.
>
> CXL regions appear as linear addresses in the system's physical address
> space. CXL memory devices comprise the storage for the region. In order
> for traffic to be properly routed to the memory devices in the region, a
> set of Host-manged Device Memory decoders must be present. The decoders
> are a piece of hardware defined in the CXL specification.
>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
>
> ---
> Changes since v2:
> - Fix unwind issue in bind_region introduced in v2
> ---
>  drivers/cxl/core/hdm.c | 209 +++++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/cxl.h      |   3 +
>  drivers/cxl/region.c   |  72 +++++++++++---
>  3 files changed, 272 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index a28369f264da..66c08d69f7a6 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -268,3 +268,212 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm)
>         return 0;
>  }
>  EXPORT_SYMBOL_NS_GPL(devm_cxl_enumerate_decoders, CXL);
> +
> +#define COMMIT_TIMEOUT_MS 10
> +static int wait_for_commit(struct cxl_decoder *cxld)
> +{
> +       const unsigned long end = jiffies + msecs_to_jiffies(COMMIT_TIMEOUT_MS);
> +       struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> +       void __iomem *hdm_decoder;
> +       struct cxl_hdm *cxlhdm;
> +       u32 ctrl;
> +
> +       cxlhdm = dev_get_drvdata(&port->dev);
> +       hdm_decoder = cxlhdm->regs.hdm_decoder;
> +
> +       while (1) {

A live wait is too expensive. Also, given region programming is
writing multiple decoders it would seem to make sense to amortize the
waiting over multiple decoders. I.e. a flow like:

program all decoders
commit all decoders
10ms wait
check all decoder for commit timeouts

> +               ctrl = readl(hdm_decoder +
> +                            CXL_HDM_DECODER0_CTRL_OFFSET(cxld->id));
> +               if (FIELD_GET(CXL_HDM_DECODER0_CTRL_COMMITTED, ctrl))
> +                       break;
> +
> +               if (time_after(jiffies, end)) {
> +                       dev_err(&cxld->dev, "HDM decoder commit timeout %x\n",
> +                               ctrl);
> +                       return -ETIMEDOUT;
> +               }
> +               if ((ctrl & CXL_HDM_DECODER0_CTRL_COMMIT_ERROR) != 0) {
> +                       dev_err(&cxld->dev, "HDM decoder commit error %x\n",
> +                               ctrl);
> +                       return -ENXIO;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +/**
> + * cxl_commit_decoder() - Program a configured cxl_decoder
> + * @cxld: The preconfigured cxl decoder.
> + *
> + * A cxl decoder that is to be committed should have been earmarked as enabled.
> + * This mechanism acts as a soft reservation on the decoder.
> + *
> + * Returns 0 if commit was successful, negative error code otherwise.
> + */
> +int cxl_commit_decoder(struct cxl_decoder *cxld)
> +{
> +       u32 ctrl, tl_lo, tl_hi, base_lo, base_hi, size_lo, size_hi;
> +       struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> +       void __iomem *hdm_decoder;
> +       struct cxl_hdm *cxlhdm;
> +       int rc;
> +
> +       /*
> +        * Decoder flags are entirely software controlled and therefore this
> +        * case is purely a driver bug.
> +        */
> +       if (dev_WARN_ONCE(&port->dev, (cxld->flags & CXL_DECODER_F_ENABLE) != 0,
> +                         "Invalid %s enable state\n", dev_name(&cxld->dev)))
> +               return -ENXIO;
> +
> +       cxlhdm = dev_get_drvdata(&port->dev);
> +       hdm_decoder = cxlhdm->regs.hdm_decoder;
> +       ctrl = readl(hdm_decoder + CXL_HDM_DECODER0_CTRL_OFFSET(cxld->id));
> +
> +       /*
> +        * A decoder that's currently active cannot be changed without the
> +        * system being quiesced. While the driver should prevent against this,
> +        * for a variety of reasons the hardware might not be in sync with the
> +        * hardware and so, do not splat on error.
> +        */
> +       size_hi = readl(hdm_decoder +
> +                       CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(cxld->id));
> +       size_lo =
> +               readl(hdm_decoder + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(cxld->id));
> +       if (FIELD_GET(CXL_HDM_DECODER0_CTRL_COMMITTED, ctrl) &&
> +           (size_lo + size_hi)) {
> +               dev_err(&port->dev, "Tried to change an active decoder (%s)\n",
> +                       dev_name(&cxld->dev));
> +               return -EBUSY;
> +       }
> +
> +       u32p_replace_bits(&ctrl, cxl_to_ig(cxld->interleave_granularity),
> +                         CXL_HDM_DECODER0_CTRL_IG_MASK);

Usage of u32p_replace_bits() makes me wonder what is worth preserving
in the old control value? In this case the code is completely
overwriting the old value so it can just do typical updates for a
@ctrl variable initialized to zero at the start. Otherwise a comment
is needed as to what fields in the current @ctrl need to be preserved
after programming granularity, ways, type, and commit.

> +       u32p_replace_bits(&ctrl, cxl_to_eniw(cxld->interleave_ways),
> +                         CXL_HDM_DECODER0_CTRL_IW_MASK);
> +       u32p_replace_bits(&ctrl, 1, CXL_HDM_DECODER0_CTRL_COMMIT);
> +
> +       /* TODO: set based on type */
> +       u32p_replace_bits(&ctrl, 1, CXL_HDM_DECODER0_CTRL_TYPE);
> +
> +       base_lo = GENMASK(31, 28) & lower_32_bits(cxld->decoder_range.start);
> +       base_hi = upper_32_bits(cxld->decoder_range.start);
> +
> +       size_lo = GENMASK(31, 28) & (u32)(range_len(&cxld->decoder_range));

Why the cast vs just lower_32_bits(range_len(&cxld->decoder_range))?

> +       size_hi = upper_32_bits(range_len(&cxld->decoder_range) >> 32);

Isn't this always 0? I would expect upper_32_bits() without the shift.

> +
> +       if (cxld->nr_targets > 0) {
> +               tl_hi = 0;
> +
> +               tl_lo = FIELD_PREP(GENMASK(7, 0), cxld->target[0]->port_id);
> +
> +               if (cxld->interleave_ways > 1)
> +                       tl_lo |= FIELD_PREP(GENMASK(15, 8),
> +                                           cxld->target[1]->port_id);
> +               if (cxld->interleave_ways > 2)
> +                       tl_lo |= FIELD_PREP(GENMASK(23, 16),
> +                                           cxld->target[2]->port_id);
> +               if (cxld->interleave_ways > 3)
> +                       tl_lo |= FIELD_PREP(GENMASK(31, 24),
> +                                           cxld->target[3]->port_id);
> +               if (cxld->interleave_ways > 4)
> +                       tl_hi |= FIELD_PREP(GENMASK(7, 0),
> +                                           cxld->target[4]->port_id);
> +               if (cxld->interleave_ways > 5)
> +                       tl_hi |= FIELD_PREP(GENMASK(15, 8),
> +                                           cxld->target[5]->port_id);
> +               if (cxld->interleave_ways > 6)
> +                       tl_hi |= FIELD_PREP(GENMASK(23, 16),
> +                                           cxld->target[6]->port_id);
> +               if (cxld->interleave_ways > 7)
> +                       tl_hi |= FIELD_PREP(GENMASK(31, 24),
> +                                           cxld->target[7]->port_id);
> +
> +               writel(tl_hi, hdm_decoder + CXL_HDM_DECODER0_TL_HIGH(cxld->id));
> +               writel(tl_lo, hdm_decoder + CXL_HDM_DECODER0_TL_LOW(cxld->id));
> +       } else {
> +               /* Zero out skip list for devices */

Seems unwieldy to mix root and downstream port decoder programming in
the same function as endpoint decoder programming. Let's keep them
separate. They can of course share helpers, but I don't want the
mental strain of reading this function and wondering what context it
is being called, that should be evident from the function name.

> +               writel(0, hdm_decoder + CXL_HDM_DECODER0_TL_HIGH(cxld->id));
> +               writel(0, hdm_decoder + CXL_HDM_DECODER0_TL_LOW(cxld->id));
> +       }
> +
> +       writel(size_hi,
> +              hdm_decoder + CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(cxld->id));
> +       writel(size_lo,
> +              hdm_decoder + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(cxld->id));
> +       writel(base_hi,
> +              hdm_decoder + CXL_HDM_DECODER0_BASE_HIGH_OFFSET(cxld->id));
> +       writel(base_lo,
> +              hdm_decoder + CXL_HDM_DECODER0_BASE_LOW_OFFSET(cxld->id));
> +       writel(ctrl, hdm_decoder + CXL_HDM_DECODER0_CTRL_OFFSET(cxld->id));
> +
> +       rc = wait_for_commit(cxld);

per above, do you think it's workable to have a common wait for all
decoders at least per level of the hierarcy, i.e.:

program root decoders, wait;
program switch decoders, wait;
program endpoint decoders, wait;

> +       if (rc)
> +               return rc;
> +
> +       cxld->flags |= CXL_DECODER_F_ENABLE;
> +
> +#define DPORT_TL_STR "%d %d %d %d %d %d %d %d"
> +#define DPORT(i)                                                               \
> +       (cxld->nr_targets && cxld->interleave_ways > (i)) ?                    \
> +               cxld->target[(i)]->port_id :                                   \
> +                     -1
> +#define DPORT_TL                                                               \
> +       DPORT(0), DPORT(1), DPORT(2), DPORT(3), DPORT(4), DPORT(5), DPORT(6),  \
> +               DPORT(7)
> +
> +       dev_dbg(&cxld->dev,
> +               "%s (depth %d)\n\tBase %pa\n\tSize %llu\n\tIG %u (%ub)\n\tENIW %u (x%u)\n\tTargetList: \n" DPORT_TL_STR,
> +               dev_name(&port->dev), port->depth, &cxld->decoder_range.start,
> +               range_len(&cxld->decoder_range),
> +               cxl_to_ig(cxld->interleave_granularity),
> +               cxld->interleave_granularity,
> +               cxl_to_eniw(cxld->interleave_ways), cxld->interleave_ways,
> +               DPORT_TL);
> +#undef DPORT_TL
> +#undef DPORT
> +#undef DPORT_TL_STR

It just seems like all of this data is available in sysfs, does the
kernel need to print it?

> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(cxl_commit_decoder);

EXPOER_SYMBOL_NS_GPL

> +
> +/**
> + * cxl_disable_decoder() - Disables a decoder
> + * @cxld: The active cxl decoder.
> + *
> + * CXL decoders (as of 2.0 spec) have no way to deactivate them other than to
> + * set the size of the HDM to 0. This function will clear all registers, and if
> + * the decoder is active, commit the 0'd out registers.
> + */
> +void cxl_disable_decoder(struct cxl_decoder *cxld)
> +{
> +       struct cxl_port *port = to_cxl_port(cxld->dev.parent);
> +       void __iomem *hdm_decoder;
> +       struct cxl_hdm *cxlhdm;
> +       u32 ctrl;
> +
> +       cxlhdm = dev_get_drvdata(&port->dev);
> +       hdm_decoder = cxlhdm->regs.hdm_decoder;
> +       ctrl = readl(hdm_decoder + CXL_HDM_DECODER0_CTRL_OFFSET(cxld->id));
> +
> +       if (dev_WARN_ONCE(&port->dev, (cxld->flags & CXL_DECODER_F_ENABLE) == 0,
> +                         "Invalid decoder enable state\n"))

Why crash the kernel when trying to disable a disabled decoder?

Where does the "locked" check happen?

> +               return;
> +
> +       cxld->flags &= ~CXL_DECODER_F_ENABLE;
> +
> +       /* There's no way to "uncommit" a committed decoder, only 0 size it */
> +       writel(0, hdm_decoder + CXL_HDM_DECODER0_TL_HIGH(cxld->id));
> +       writel(0, hdm_decoder + CXL_HDM_DECODER0_TL_LOW(cxld->id));

Shouldn't size be first just in case there are cycles still in flight?

> +       writel(0, hdm_decoder + CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(cxld->id));
> +       writel(0, hdm_decoder + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(cxld->id));
> +       writel(0, hdm_decoder + CXL_HDM_DECODER0_BASE_HIGH_OFFSET(cxld->id));
> +       writel(0, hdm_decoder + CXL_HDM_DECODER0_BASE_LOW_OFFSET(cxld->id));
> +
> +       /* If the device isn't actually active, just zero out all the fields */

Why not unconditionally try to commit the new size here? I'd be ok to
keep this policy as you have it, but the comment needs to be changed
to indicate "why?", not "what?" because the code answers the latter.

> +       if (FIELD_GET(CXL_HDM_DECODER0_CTRL_COMMITTED, ctrl))
> +               writel(CXL_HDM_DECODER0_CTRL_COMMIT,
> +                      hdm_decoder + CXL_HDM_DECODER0_CTRL_OFFSET(cxld->id));
> +}
> +EXPORT_SYMBOL_GPL(cxl_disable_decoder);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index d70d8c85d05f..f9dab312ed26 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -55,6 +55,7 @@
>  #define   CXL_HDM_DECODER0_CTRL_LOCK BIT(8)
>  #define   CXL_HDM_DECODER0_CTRL_COMMIT BIT(9)
>  #define   CXL_HDM_DECODER0_CTRL_COMMITTED BIT(10)
> +#define   CXL_HDM_DECODER0_CTRL_COMMIT_ERROR BIT(11)
>  #define   CXL_HDM_DECODER0_CTRL_TYPE BIT(12)
>  #define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24)
>  #define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28)
> @@ -416,6 +417,8 @@ struct cxl_dport *devm_cxl_add_dport(struct cxl_port *port,
>  struct cxl_dport *cxl_find_dport_by_dev(struct cxl_port *port,
>                                         const struct device *dev);
>  struct cxl_port *ep_find_cxl_port(struct cxl_memdev *cxlmd, unsigned int depth);
> +int cxl_commit_decoder(struct cxl_decoder *cxld);
> +void cxl_disable_decoder(struct cxl_decoder *cxld);
>
>  struct cxl_decoder *to_cxl_decoder(struct device *dev);
>  bool is_cxl_decoder(struct device *dev);
> diff --git a/drivers/cxl/region.c b/drivers/cxl/region.c
> index f748060733dd..ac290677534d 100644
> --- a/drivers/cxl/region.c
> +++ b/drivers/cxl/region.c
> @@ -678,10 +678,52 @@ static int collect_ep_decoders(struct cxl_region *cxlr)
>         return rc;
>  }
>
> -static int bind_region(const struct cxl_region *cxlr)

Try not to change the function signature twice in the same patch
series, i.e. make it non-const at the outset... although I'm not sure
this function survives the new ABI that is assigning decoders to
regions as that would also impact the need to have these ->staged_list
and ->commit_list trackers.

> +static int bind_region(struct cxl_region *cxlr)
>  {
> -       /* TODO: */
> -       return 0;
> +       struct cxl_decoder *cxld, *d;
> +       int rc;
> +
> +       list_for_each_entry_safe(cxld, d, &cxlr->staged_list, region_link) {
> +               rc = cxl_commit_decoder(cxld);
> +               if (!rc) {
> +                       list_move_tail(&cxld->region_link, &cxlr->commit_list);
> +               } else {
> +                       dev_dbg(&cxlr->dev, "Failed to commit %s\n",
> +                               dev_name(&cxld->dev));
> +                       break;
> +               }
> +       }
> +
> +       list_for_each_entry_safe(cxld, d, &cxlr->commit_list, region_link) {
> +               if (rc) {
> +                       cxl_disable_decoder(cxld);
> +                       list_del(&cxld->region_link);
> +               }
> +       }
> +
> +       if (rc)
> +               cleanup_staged_decoders(cxlr);
> +
> +       BUG_ON(!list_empty(&cxlr->staged_list));
> +       return rc;
> +}
> +
> +static void region_unregister(void *dev)
> +{
> +       struct cxl_region *region = to_cxl_region(dev);
> +       struct cxl_decoder *cxld, *d;
> +
> +       if (dev_WARN_ONCE(dev, !list_empty(&region->staged_list),
> +                         "Decoders still staged"))

If I am reading this correctly, why is it a programming error to
unregister a region mid-programming? Certainly userspace could abort a
create attempt and delete after adding some targets, but not others.

> +               cleanup_staged_decoders(region);
> +
> +       /* TODO: teardown the nd_region */
> +
> +       list_for_each_entry_safe(cxld, d, &region->commit_list, region_link) {
> +               cxl_disable_decoder(cxld);
> +               list_del(&cxld->region_link);
> +               cxl_put_decoder(cxld);
> +       }
>  }
>
>  static int cxl_region_probe(struct device *dev)
> @@ -732,20 +774,26 @@ static int cxl_region_probe(struct device *dev)
>                 put_device(&ours->dev);
>
>         ret = collect_ep_decoders(cxlr);
> -       if (ret)
> -               goto err;
> +       if (ret) {
> +               cleanup_staged_decoders(cxlr);
> +               return ret;
> +       }
>
>         ret = bind_region(cxlr);
> -       if (ret)
> -               goto err;
> +       if (ret) {
> +               /* bind_region should cleanup after itself */
> +               if (dev_WARN_ONCE(dev, !list_empty(&cxlr->staged_list),
> +                                 "Region bind failed to cleanup staged decoders\n"))
> +                       cleanup_staged_decoders(cxlr);
> +               if (dev_WARN_ONCE(dev, !list_empty(&cxlr->commit_list),
> +                                 "Region bind failed to cleanup committed decoders\n"))
> +                       region_unregister(&cxlr->dev);
> +               return ret;
> +       }
>
>         cxlr->active = true;
>         dev_info(dev, "Bound");
> -       return 0;
> -
> -err:
> -       cleanup_staged_decoders(cxlr);
> -       return ret;
> +       return devm_add_action_or_reset(dev, region_unregister, dev);
>  }
>
>  static struct cxl_driver cxl_region_driver = {
> --
> 2.35.0
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2022-02-24  0:09 UTC | newest]

Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-28  0:26 [PATCH v3 00/14] CXL Region driver Ben Widawsky
2022-01-28  0:26 ` [PATCH v3 01/14] cxl/region: Add region creation ABI Ben Widawsky
2022-01-28 18:14   ` Dan Williams
2022-01-28 18:59     ` Dan Williams
2022-02-02 18:26       ` Ben Widawsky
2022-02-02 18:28         ` Ben Widawsky
2022-02-02 18:48           ` Ben Widawsky
2022-02-02 19:00             ` Dan Williams
2022-02-02 19:02               ` Ben Widawsky
2022-02-02 19:15                 ` Dan Williams
2022-02-01 22:42     ` Ben Widawsky
2022-02-01 15:53   ` Jonathan Cameron
2022-02-17 17:10   ` [PATCH v4 " Ben Widawsky
2022-02-17 17:19     ` [PATCH v5 01/15] " Ben Widawsky
2022-02-17 17:33       ` Ben Widawsky
2022-02-17 17:58       ` Dan Williams
2022-02-17 18:58         ` Ben Widawsky
2022-02-17 20:26           ` Dan Williams
2022-02-17 22:22         ` Ben Widawsky
2022-02-17 23:32           ` Dan Williams
2022-02-18 16:41             ` Ben Widawsky
2022-01-28  0:26 ` [PATCH v3 02/14] cxl/region: Introduce concept of region configuration Ben Widawsky
2022-01-29  0:25   ` Dan Williams
2022-02-01 14:59     ` Ben Widawsky
2022-02-03  5:06       ` Dan Williams
2022-02-01 23:11     ` Ben Widawsky
2022-02-03 17:48       ` Dan Williams
2022-02-03 22:23         ` Ben Widawsky
2022-02-03 23:27           ` Dan Williams
2022-02-04  0:19             ` Ben Widawsky
2022-02-04  2:45               ` Dan Williams
2022-02-17 18:36     ` Ben Widawsky
2022-02-17 19:57       ` Dan Williams
2022-02-17 20:20         ` Ben Widawsky
2022-02-17 21:12           ` Dan Williams
2022-02-23 21:49         ` Ben Widawsky
2022-02-23 22:24           ` Dan Williams
2022-02-23 22:31             ` Ben Widawsky
2022-02-23 22:42               ` Dan Williams
2022-01-28  0:26 ` [PATCH v3 03/14] cxl/mem: Cache port created by the mem dev Ben Widawsky
2022-02-17  1:20   ` Dan Williams
2022-01-28  0:26 ` [PATCH v3 04/14] cxl/region: Introduce a cxl_region driver Ben Widawsky
2022-02-01 16:21   ` Jonathan Cameron
2022-02-17  6:04   ` Dan Williams
2022-01-28  0:26 ` [PATCH v3 05/14] cxl/acpi: Handle address space allocation Ben Widawsky
2022-02-18 19:17   ` Dan Williams
2022-01-28  0:26 ` [PATCH v3 06/14] cxl/region: Address " Ben Widawsky
2022-02-18 19:51   ` Dan Williams
2022-01-28  0:27 ` [PATCH v3 07/14] cxl/region: Implement XHB verification Ben Widawsky
2022-02-18 20:23   ` Dan Williams
2022-01-28  0:27 ` [PATCH v3 08/14] cxl/region: HB port config verification Ben Widawsky
2022-02-14 16:20   ` Jonathan Cameron
2022-02-14 17:51     ` Ben Widawsky
2022-02-14 18:09       ` Jonathan Cameron
2022-02-15 16:35   ` Jonathan Cameron
2022-02-18 21:04   ` Dan Williams
2022-01-28  0:27 ` [PATCH v3 09/14] cxl/region: Add infrastructure for decoder programming Ben Widawsky
2022-02-01 18:16   ` Jonathan Cameron
2022-02-18 21:53   ` Dan Williams
2022-01-28  0:27 ` [PATCH v3 10/14] cxl/region: Collect host bridge decoders Ben Widawsky
2022-02-01 18:21   ` Jonathan Cameron
2022-02-18 23:42   ` Dan Williams
2022-01-28  0:27 ` [PATCH v3 11/14] cxl/region: Add support for single switch level Ben Widawsky
2022-02-01 18:26   ` Jonathan Cameron
2022-02-15 16:10   ` Jonathan Cameron
2022-02-18 18:23     ` Jonathan Cameron
2022-01-28  0:27 ` [PATCH v3 12/14] cxl: Program decoders for regions Ben Widawsky
2022-02-24  0:08   ` Dan Williams
2022-01-28  0:27 ` [PATCH v3 13/14] cxl/pmem: Convert nvdimm bridge API to use dev Ben Widawsky
2022-01-28  0:27 ` [PATCH v3 14/14] cxl/region: Create an nd_region Ben Widawsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).