From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4826C17CC for ; Tue, 1 Mar 2022 21:22:53 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646169773; x=1677705773; h=date:from:to:cc:subject:message-id:references: mime-version:in-reply-to; bh=slAFL62x2RpMTB+tCwMl9I8IQMQZ64D4xzfHyZAwzLM=; b=lnWsg19OftinhkCJrF+Qy03SRjzkXNDZYpQ84G00C1oJbJCwgzsnmvia Td1fIP1n3BzEDRj7vHRk9dIBqbXsrB6A4xpwxGwFCdxXaLkWxFibYCfDv VFq/oanqpWaBFG1+8l/Cd8neuxhbF815QKtfcmO9THEn1d6SZr7uamc+f 7V8Ey24LP332UAL70v6U8Cv+3o8RZ4DjNmtBBJ6tlsH/TSftSlR8ZyZzu MbVjo+B3Dd2tsVOBMQmH+4m8oSpgzLaXErvYX2ANu9tBJi7pOxWQwd8Ac 7t7oUDybsvucuyhjg+RFzq4E84spdlnqP2D6Cpct5P93aMq+swfRBXwuj A==; X-IronPort-AV: E=McAfee;i="6200,9189,10273"; a="252979619" X-IronPort-AV: E=Sophos;i="5.90,146,1643702400"; d="scan'208";a="252979619" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Mar 2022 13:22:52 -0800 X-IronPort-AV: E=Sophos;i="5.90,146,1643702400"; d="scan'208";a="575858071" Received: from paggarwa-mobl2.amr.corp.intel.com (HELO intel.com) ([10.252.142.236]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Mar 2022 13:22:52 -0800 Date: Tue, 1 Mar 2022 13:22:51 -0800 From: Ben Widawsky To: Dan Williams Cc: linux-cxl@vger.kernel.org, patches@lists.linux.dev, Alison Schofield , Ira Weiny , Jonathan Cameron , Vishal Verma Subject: Re: [RFC PATCH 1/2] cxl/region: Add region creation ABI Message-ID: <20220301212251.leoflc2ia5ddhwyn@intel.com> References: <20220225060038.1511562-1-ben.widawsky@intel.com> <20220225060038.1511562-2-ben.widawsky@intel.com> Precedence: bulk X-Mailing-List: patches@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: On 22-02-28 15:48:43, Dan Williams wrote: > On Thu, Feb 24, 2022 at 10:01 PM Ben Widawsky wrote: > > > > Regions are created as a child of the decoder that encompasses an > > address space with constraints. Regions have a number of attributes that > > must be configured before the region can be activated. > > > > The ABI is not meant to be secure, > > What does that mean? It's "secure" by virtue of only being root > writable and if root userspace process race themselves the kernel will > coherently handle the conflict and pick a winner. > > > but is meant to avoid accidental > > races. > > It also solves purposeful race. > > > As a result, a buggy process may create a region by name that was > > allocated by a different process. > > That's not a bug, one process atomically claims the next region, the > other loops. > > > However, multiple processes which are > > trying not to race with each other shouldn't need special > > synchronization to do so. > > > > // Allocate a new region name > > region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region) > > > > // Create a new region by name > > while > > region=$(cat /sys/bus/cxl/devices/decoder0.0/create_region) > > ! echo $region > /sys/bus/cxl/devices/decoder0.0/create_region > > do true; done > > > > // Region now exists in sysfs > > stat -t /sys/bus/cxl/devices/decoder0.0/$region > > > > // Delete the region, and name > > echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region > > > > Signed-off-by: Ben Widawsky > > > > --- > > Changes since v5 (all Dan): > > - Fix erroneous return on create > > - Fix ida leak on error > > - forward declare to_cxl_region instead of cxl_region_release > > - Use REGION_DEAD in the right place > > - Allocate next id in region_alloc > > --- > > Documentation/ABI/testing/sysfs-bus-cxl | 23 ++ > > .../driver-api/cxl/memory-devices.rst | 11 + > > drivers/cxl/core/Makefile | 1 + > > drivers/cxl/core/core.h | 3 + > > drivers/cxl/core/port.c | 18 ++ > > drivers/cxl/core/region.c | 223 ++++++++++++++++++ > > drivers/cxl/cxl.h | 5 + > > drivers/cxl/region.h | 28 +++ > > tools/testing/cxl/Kbuild | 1 + > > 9 files changed, 313 insertions(+) > > create mode 100644 drivers/cxl/core/region.c > > create mode 100644 drivers/cxl/region.h > > > > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl > > index 7c2b846521f3..e5db45ea70ad 100644 > > --- a/Documentation/ABI/testing/sysfs-bus-cxl > > +++ b/Documentation/ABI/testing/sysfs-bus-cxl > > @@ -163,3 +163,26 @@ Description: > > memory (type-3). The 'target_type' attribute indicates the > > current setting which may dynamically change based on what > > memory regions are activated in this decode hierarchy. > > + > > +What: /sys/bus/cxl/devices/decoderX.Y/create_region > > +Date: January, 2022 > > +KernelVersion: v5.18 > > +Contact: linux-cxl@vger.kernel.org > > +Description: > > + Write a value of the form 'regionX.Y:Z' to instantiate a new > > Per internal conversation this will move to "write a value of the form > "%u", right? Yep. > > > + region within the decode range bounded by decoderX.Y. The value > > + written must match the current value returned from reading this > > + attribute. This behavior lets the kernel arbitrate racing > > + attempts to create a region. The thread that fails to write > > + loops and tries the next value. Regions must be created for root > > + decoders, > > Does this need to be stated? It's already implicit by the fact that > 'create_region' does not appear on anything but root decoders. > Yes. Historically I had added this attribute for all decoders... > > > and must subsequently configured and bound to a region > > + driver before they can be used. > > + > > +What: /sys/bus/cxl/devices/decoderX.Y/delete_region > > +Date: January, 2022 > > +KernelVersion: v5.18 > > +Contact: linux-cxl@vger.kernel.org > > +Description: > > + Deletes the named region. The attribute expects a region in the > > + form "regionX.Y:Z". The region's name, allocated by reading > > + create_region, will also be released. > > diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst > > index db476bb170b6..66ddc58a21b1 100644 > > --- a/Documentation/driver-api/cxl/memory-devices.rst > > +++ b/Documentation/driver-api/cxl/memory-devices.rst > > @@ -362,6 +362,17 @@ CXL Core > > .. kernel-doc:: drivers/cxl/core/mbox.c > > :doc: cxl mbox > > > > +CXL Regions > > +----------- > > +.. kernel-doc:: drivers/cxl/region.h > > + :identifiers: > > + > > +.. kernel-doc:: drivers/cxl/core/region.c > > + :doc: cxl core region > > + > > +.. kernel-doc:: drivers/cxl/core/region.c > > + :identifiers: > > + > > External Interfaces > > =================== > > > > diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile > > index 6d37cd78b151..39ce8f2f2373 100644 > > --- a/drivers/cxl/core/Makefile > > +++ b/drivers/cxl/core/Makefile > > @@ -4,6 +4,7 @@ obj-$(CONFIG_CXL_BUS) += cxl_core.o > > ccflags-y += -I$(srctree)/drivers/cxl > > cxl_core-y := port.o > > cxl_core-y += pmem.o > > +cxl_core-y += region.o > > cxl_core-y += regs.o > > cxl_core-y += memdev.o > > cxl_core-y += mbox.o > > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h > > index 1a50c0fc399c..adfd42370b28 100644 > > --- a/drivers/cxl/core/core.h > > +++ b/drivers/cxl/core/core.h > > @@ -9,6 +9,9 @@ extern const struct device_type cxl_nvdimm_type; > > > > extern struct attribute_group cxl_base_attribute_group; > > > > +extern struct device_attribute dev_attr_create_region; > > +extern struct device_attribute dev_attr_delete_region; > > + > > struct cxl_send_command; > > struct cxl_mem_query_commands; > > int cxl_query_cmd(struct cxl_memdev *cxlmd, > > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c > > index 1e785a3affaa..f3e1313217a8 100644 > > --- a/drivers/cxl/core/port.c > > +++ b/drivers/cxl/core/port.c > > @@ -9,6 +9,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include "core.h" > > > > @@ -213,6 +214,8 @@ static struct attribute_group cxl_decoder_base_attribute_group = { > > }; > > > > static struct attribute *cxl_decoder_root_attrs[] = { > > + &dev_attr_create_region.attr, > > + &dev_attr_delete_region.attr, > > &dev_attr_cap_pmem.attr, > > &dev_attr_cap_ram.attr, > > &dev_attr_cap_type2.attr, > > @@ -270,6 +273,8 @@ static void cxl_decoder_release(struct device *dev) > > struct cxl_decoder *cxld = to_cxl_decoder(dev); > > struct cxl_port *port = to_cxl_port(dev->parent); > > > > + ida_free(&cxld->region_ida, cxld->next_region_id); > > + ida_destroy(&cxld->region_ida); > > ida_free(&port->decoder_ida, cxld->id); > > kfree(cxld); > > } > > @@ -1244,6 +1249,13 @@ static struct cxl_decoder *cxl_decoder_alloc(struct cxl_port *port, > > cxld->target_type = CXL_DECODER_EXPANDER; > > cxld->platform_res = (struct resource)DEFINE_RES_MEM(0, 0); > > > > + mutex_init(&cxld->id_lock); > > + ida_init(&cxld->region_ida); > > + rc = ida_alloc(&cxld->region_ida, GFP_KERNEL); > > + if (rc < 0) > > + goto err; > > + > > + cxld->next_region_id = rc; > > return cxld; > > err: > > kfree(cxld); > > @@ -1502,6 +1514,12 @@ bool schedule_cxl_memdev_detach(struct cxl_memdev *cxlmd) > > } > > EXPORT_SYMBOL_NS_GPL(schedule_cxl_memdev_detach, CXL); > > > > +bool schedule_cxl_region_unregister(struct cxl_region *cxlr) > > +{ > > + return queue_work(cxl_bus_wq, &cxlr->unregister_work); > > +} > > +EXPORT_SYMBOL_NS_GPL(schedule_cxl_region_unregister, CXL); > > + > > /* for user tooling to ensure port disable work has completed */ > > static ssize_t flush_store(struct bus_type *bus, const char *buf, size_t count) > > { > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c > > new file mode 100644 > > index 000000000000..a934938f8630 > > --- /dev/null > > +++ b/drivers/cxl/core/region.c > > @@ -0,0 +1,223 @@ > > +// SPDX-License-Identifier: GPL-2.0-only > > +/* Copyright(c) 2022 Intel Corporation. All rights reserved. */ > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include "core.h" > > + > > +/** > > + * DOC: cxl core region > > + * > > + * CXL Regions represent mapped memory capacity in system physical address > > + * space. Whereas the CXL Root Decoders identify the bounds of potential CXL > > + * Memory ranges, Regions represent the active mapped capacity by the HDM > > + * Decoder Capability structures throughout the Host Bridges, Switches, and > > + * Endpoints in the topology. > > + */ > > + > > + > > +static struct cxl_region *to_cxl_region(struct device *dev); > > cxl_lock_class() will want to know the lock_class for the cxl_region > device lock, so this can go straight to be an extern declaration > alongside is_cxl_decoder(). > Yeah, this is stale already. I took what you have in the preview branch which sets it up right before device_add(). > > + > > +static void cxl_region_release(struct device *dev) > > +{ > > + struct cxl_decoder *cxld = to_cxl_decoder(dev->parent); > > + struct cxl_region *cxlr = to_cxl_region(dev); > > + > > + dev_dbg(&cxld->dev, "Releasing %s\n", dev_name(dev)); > > + ida_free(&cxld->region_ida, cxlr->id); > > + kfree(cxlr); > > + put_device(&cxld->dev); > > +} > > + > > +static const struct device_type cxl_region_type = { > > + .name = "cxl_region", > > + .release = cxl_region_release, > > +}; > > + > > +static struct cxl_region *to_cxl_region(struct device *dev) > > +{ > > + if (dev_WARN_ONCE(dev, dev->type != &cxl_region_type, > > + "not a cxl_region device\n")) > > + return NULL; > > + > > + return container_of(dev, struct cxl_region, dev); > > +} > > + > > +static void unregister_region(struct work_struct *work) > > +{ > > + struct cxl_region *cxlr; > > + > > + cxlr = container_of(work, typeof(*cxlr), unregister_work); > > + device_unregister(&cxlr->dev); > > +} > > + > > +static void schedule_unregister(void *cxlr) > > +{ > > + schedule_cxl_region_unregister(cxlr); > > +} > > + > > +static struct cxl_region *cxl_region_alloc(struct cxl_decoder *cxld) > > +{ > > + struct cxl_region *cxlr; > > + struct device *dev; > > + int rc; > > + > > + lockdep_assert_held(&cxld->id_lock); > > + > > + rc = ida_alloc(&cxld->region_ida, GFP_KERNEL); > > + if (rc < 0) { > > + dev_dbg(dev, "Failed to get next cached id (%d)\n", rc); > > + return ERR_PTR(rc); > > + } > > + > > + cxlr = kzalloc(sizeof(*cxlr), GFP_KERNEL); > > + if (!cxlr) { > > + ida_free(&cxld->region_ida, rc); > > + return ERR_PTR(-ENOMEM); > > + } > > + > > + cxld->next_region_id = rc; > > + > > + dev = &cxlr->dev; > > + device_initialize(dev); > > + dev->parent = &cxld->dev; > > + device_set_pm_not_required(dev); > > + dev->bus = &cxl_bus_type; > > + dev->type = &cxl_region_type; > > + INIT_WORK(&cxlr->unregister_work, unregister_region); > > Perhaps name it "detach_work" to match the same for memdevs, or second > choice, go rename the memdev one to "unregister_work". Keep the naming > consistent for data structures that fill the same role. > Okay. Do you want me to rename schedule_cxl_region_unregister() also? I assume yes. > > + > > + return cxlr; > > +} > > + > > +/** > > + * devm_cxl_add_region - Adds a region to a decoder > > + * @cxld: Parent decoder. > > + * @cxlr: Region to be added to the decoder. > > + * > > + * This is the second step of region initialization. Regions exist within an > > + * address space which is mapped by a @cxld. That @cxld must be a root decoder, > > + * and it enforces constraints upon the region as it is configured. > > + * > > + * Return: 0 if the region was added to the @cxld, else returns negative error > > + * code. The region will be named "regionX.Y.Z" where X is the port, Y is the > > + * decoder id, and Z is the region number. > > + */ > > +static struct cxl_region *devm_cxl_add_region(struct cxl_decoder *cxld) > > +{ > > + struct cxl_port *port = to_cxl_port(cxld->dev.parent); > > + struct cxl_region *cxlr; > > + struct device *dev; > > + int rc; > > + > > + cxlr = cxl_region_alloc(cxld); > > + if (IS_ERR(cxlr)) > > + return cxlr; > > + > > + dev = &cxlr->dev; > > + > > + cxlr->id = cxld->next_region_id; > > + rc = dev_set_name(dev, "region%d.%d:%d", port->id, cxld->id, cxlr->id); > > + if (rc) > > + goto err_out; > > + > > + /* affirm that release will have access to the decoder's region ida */ > > s/affirm that release will/arrange for cxl_region_release to/ > > > + get_device(&cxld->dev); > > + > > + rc = device_add(dev); > > + if (rc) > > + goto err_put; > > + > > + rc = devm_add_action_or_reset(port->uport, schedule_unregister, cxlr); > > + if (rc) > > + goto err_put; > > + > > + return cxlr; > > + > > +err_put: > > + put_device(&cxld->dev); > > + > > +err_out: > > + put_device(dev); > > + return ERR_PTR(rc); > > +} > > + > > +static ssize_t create_region_show(struct device *dev, > > + struct device_attribute *attr, char *buf) > > +{ > > + struct cxl_port *port = to_cxl_port(dev->parent); > > + struct cxl_decoder *cxld = to_cxl_decoder(dev); > > + > > + return sysfs_emit(buf, "region%d.%d:%d\n", port->id, cxld->id, > > + cxld->next_region_id); > > To cut down on userspace racing itself this should acquire the id_lock > to synchronize against the store side. i.e. no point in returning > known bad answers when the lock is held on the store side, even though > the answer given here may be immediately invalidated as soon as the > lock is dropped it's still useful to throttle readers in the presence > of writers. > > > +} > > + > > +static ssize_t create_region_store(struct device *dev, > > + struct device_attribute *attr, > > + const char *buf, size_t len) > > +{ > > + struct cxl_port *port = to_cxl_port(dev->parent); > > + struct cxl_decoder *cxld = to_cxl_decoder(dev); > > + struct cxl_region *cxlr; > > + int d, p, r, rc = 0; > > + > > + if (sscanf(buf, "region%d.%d:%d", &p, &d, &r) != 3) > > + return -EINVAL; > > + > > + if (port->id != p || cxld->id != d) > > + return -EINVAL; > > + > > + rc = mutex_lock_interruptible(&cxld->id_lock); > > + if (rc) > > + return rc; > > + > > + if (cxld->next_region_id != r) { > > + rc = -EINVAL; > > + goto out; > > + } > > + > > + cxlr = devm_cxl_add_region(cxld); > > + rc = 0; > > + dev_dbg(dev, "Created %s\n", dev_name(&cxlr->dev)); > > + > > +out: > > + mutex_unlock(&cxld->id_lock); > > + if (rc) > > + return rc; > > + return len; > > +} > > +DEVICE_ATTR_RW(create_region); > > + > > +static struct cxl_region *cxl_find_region_by_name(struct cxl_decoder *cxld, > > + const char *name) > > +{ > > + struct device *region_dev; > > + > > + region_dev = device_find_child_by_name(&cxld->dev, name); > > + if (!region_dev) > > + return ERR_PTR(-ENOENT); > > + > > + return to_cxl_region(region_dev); > > +} > > + > > +static ssize_t delete_region_store(struct device *dev, > > + struct device_attribute *attr, > > + const char *buf, size_t len) > > +{ > > + struct cxl_port *port = to_cxl_port(dev->parent); > > + struct cxl_decoder *cxld = to_cxl_decoder(dev); > > + struct cxl_region *cxlr; > > + > > + cxlr = cxl_find_region_by_name(cxld, buf); > > + if (IS_ERR(cxlr)) > > + return PTR_ERR(cxlr); > > + > > + if (!test_and_set_bit(REGION_DEAD, &cxlr->flags)) > > + devm_release_action(port->uport, schedule_unregister, cxlr); > > Where is the synchronization against port ->remove()? That could have > started before this sysfs file was deleted and trigger double > device_unregister(). Maybe I'm missing something obvious, but I don't see a way to do this without adding another lock. I need the root port's device lock for this, but when the port goes away, so does that device, right? > Also while any workqueue thread may want to access the region a reference > needs to be held otherwise you can get a use after free. I expect that this > should unconditionally schedule the unregister work, then in the workqueue > check that only one invocation actually performs the unregistration. Okay. > > Given that the work is only related to unregistration this can fail > requests to delete something that has already been deleted. If > userspace sees that error and wants to synchronize it can use the > bus/flush attribute for that. I.e. > > if (work_pending(&cxlr->detach_work)) > return -EBUSY; > I'm not following, you mean to put this in flush_store? > > + put_device(&cxlr->dev); > > + > > + return len; > > +} > > +DEVICE_ATTR_WO(delete_region); > > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h > > index b4047a310340..d5397f7dfcf4 100644 > > --- a/drivers/cxl/cxl.h > > +++ b/drivers/cxl/cxl.h > > @@ -221,6 +221,8 @@ enum cxl_decoder_type { > > * @target_type: accelerator vs expander (type2 vs type3) selector > > * @flags: memory type capabilities and locking > > * @target_lock: coordinate coherent reads of the target list > > + * @region_ida: allocator for region ids. > > + * @next_region_id: Cached region id for next region. > > * @nr_targets: number of elements in @target > > * @target: active ordered target list in current decoder configuration > > */ > > @@ -236,6 +238,9 @@ struct cxl_decoder { > > enum cxl_decoder_type target_type; > > unsigned long flags; > > seqlock_t target_lock; > > + struct mutex id_lock; > > + struct ida region_ida; > > + int next_region_id; > > int nr_targets; > > struct cxl_dport *target[]; > > }; > > diff --git a/drivers/cxl/region.h b/drivers/cxl/region.h > > new file mode 100644 > > index 000000000000..7025f6785f83 > > --- /dev/null > > +++ b/drivers/cxl/region.h > > @@ -0,0 +1,28 @@ > > +/* SPDX-License-Identifier: GPL-2.0-only */ > > +/* Copyright(c) 2021 Intel Corporation. */ > > +#ifndef __CXL_REGION_H__ > > +#define __CXL_REGION_H__ > > + > > +#include > > + > > +#include "cxl.h" > > + > > +/** > > + * struct cxl_region - CXL region > > + * @dev: This region's device. > > + * @id: This region's id. Id is globally unique across all regions. > > + * @flags: Flags representing the current state of the region. > > + * @unregister_work: Async unregister to allow attrs to take device_lock. > > + */ > > +struct cxl_region { > > + struct device dev; > > + int id; > > + unsigned long flags; > > +#define REGION_DEAD 0 > > + struct work_struct unregister_work; > > + > > +}; > > + > > +bool schedule_cxl_region_unregister(struct cxl_region *cxlr); > > + > > +#endif > > diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild > > index 82e49ab0937d..3fe6d34e6d59 100644 > > --- a/tools/testing/cxl/Kbuild > > +++ b/tools/testing/cxl/Kbuild > > @@ -46,6 +46,7 @@ cxl_core-y += $(CXL_CORE_SRC)/memdev.o > > cxl_core-y += $(CXL_CORE_SRC)/mbox.o > > cxl_core-y += $(CXL_CORE_SRC)/pci.o > > cxl_core-y += $(CXL_CORE_SRC)/hdm.o > > +cxl_core-y += $(CXL_CORE_SRC)/region.o > > cxl_core-y += config_check.o > > > > obj-m += test/ > > -- > > 2.35.1 > >