* [PATCH 0/8] device-dax: sub-division support
@ 2016-12-11  6:28 Dan Williams
  2016-12-11  6:28 ` [PATCH 1/8] dax: add region-available-size attribute Dan Williams
                   ` (8 more replies)
  0 siblings, 9 replies; 21+ messages in thread
From: Dan Williams @ 2016-12-11  6:28 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-kernel

From [PATCH 6/8] dax: sub-division support:

Device-DAX is a mechanism to establish mappings of performance / feature
differentiated memory with strict fault behavior guarantees.  With
sub-division support a platform owner can provision sub-allocations of a
dax-region into separate devices. The provisioning mechanism follows the
same scheme as the libnvdimm sub-system in that a 'seed' device is
created at initialization time that can be resized from zero to become
enabled.

Unlike the nvdimm sub-system there is no on media labelling scheme
associated with this partitioning. Provisioning decisions are ephemeral
/ not automatically restored after reboot. While the initial use case of
device-dax is persistent memory, other use cases may be volatile, so the
device-dax core is unable to assume the underlying memory is pmem.  The
task of recalling a partitioning scheme or permissions on the device(s)
is left to userspace.

For persistent allocations, naming, and permissions automatically
recalled by the kernel, use filesystem-DAX. For a userspace helper
library and utility for manipulating device-dax instances see libdaxctl
and the daxctl utility here: https://github.com/pmem/ndctl
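
As a rough sketch of the intended flow, the current region seed can be
grown from userspace with a single sysfs write.  The helper below is
illustrative only: the dax_region/seed attribute comes from patch 3, the
writable per-instance size attribute from patch 6, and the
/sys/class/dax location is an assumption rather than something this
series guarantees.

/* usage: dax-grow-seed <region-parent-sysfs-dir> <bytes> */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[256], seed[64];
	FILE *f;

	if (argc != 3)
		return 1;

	/* read the name of the current seed instance, e.g. "dax0.1" */
	snprintf(path, sizeof(path), "%s/dax_region/seed", argv[1]);
	f = fopen(path, "r");
	if (!f || !fgets(seed, sizeof(seed), f))
		return 1;
	fclose(f);
	seed[strcspn(seed, "\n")] = '\0';

	/* resize it from zero to the requested size, enabling it */
	snprintf(path, sizeof(path), "/sys/class/dax/%s/size", seed);
	f = fopen(path, "w");
	if (!f || fprintf(f, "%s\n", argv[2]) < 0)
		return 1;
	return fclose(f) ? 1 : 0;
}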

---

Dan Williams (8):
      dax: add region-available-size attribute
      dax: add region 'id', 'size', and 'align' attributes
      dax: register seed device
      dax: use multi-order radix for resource lookup
      dax: refactor locking out of size calculation routines
      dax: sub-division support
      dax: add / remove dax devices after provisioning
      dax: add debug for region available_size


 drivers/dax/Kconfig |    1 
 drivers/dax/dax.c   |  747 ++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 698 insertions(+), 50 deletions(-)


* [PATCH 1/8] dax: add region-available-size attribute
  2016-12-11  6:28 [PATCH 0/8] device-dax: sub-division support Dan Williams
@ 2016-12-11  6:28 ` Dan Williams
  2016-12-14 14:38   ` Johannes Thumshirn
  2016-12-11  6:28 ` [PATCH 2/8] dax: add region 'id', 'size', and 'align' attributes Dan Williams
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2016-12-11  6:28 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-kernel

In preparation for a facility that enables dax regions to be
sub-divided, introduce a 'dax_region/available_size' attribute.  This attribute
appears under the parent device that registered the device-dax region,
and it assumes that the device-dax-core owns the driver-data for that
device.

'dax_region/available_size' adjusts dynamically as dax-device instances are
registered and unregistered.

As a side effect of using __request_region() to reserve capacity from
the dax_region, we now track pointers to those returned resources rather
than duplicating the passed-in resource array.
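
For reference, a minimal reader of the new attribute might look like the
sketch below; the parent device's sysfs directory is taken as an
argument because its location depends on the bus that registered the
region.

/* print the region's remaining capacity in bytes */
#include <stdio.h>

int main(int argc, char **argv)
{
	char path[256], buf[32];
	FILE *f;

	if (argc != 2)
		return 1;
	snprintf(path, sizeof(path), "%s/dax_region/available_size", argv[1]);
	f = fopen(path, "r");
	if (!f || !fgets(buf, sizeof(buf), f))
		return 1;
	fputs(buf, stdout);
	return fclose(f) ? 1 : 0;
}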

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/dax.c |  135 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 123 insertions(+), 12 deletions(-)

diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index 3d94ff20fdca..cc7c4aa4bcc2 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -38,6 +38,7 @@ MODULE_PARM_DESC(nr_dax, "max number of device-dax instances");
  * @id: kernel-wide unique region for a memory range
  * @base: linear address corresponding to @res
  * @kref: to pin while other agents have a need to do lookups
+ * @lock: synchronize changes / consistent-access to the resource tree (@res)
  * @dev: parent device backing this region
  * @align: allocation and mapping alignment for child dax devices
  * @res: physical address range of the region
@@ -48,6 +49,7 @@ struct dax_region {
 	struct ida ida;
 	void *base;
 	struct kref kref;
+	struct mutex lock;
 	struct device *dev;
 	unsigned int align;
 	struct resource res;
@@ -72,7 +74,56 @@ struct dax_dev {
 	bool alive;
 	int id;
 	int num_resources;
-	struct resource res[0];
+	struct resource **res;
+};
+
+#define for_each_dax_region_resource(dax_region, res) \
+	for (res = (dax_region)->res.child; res; res = res->sibling)
+
+static unsigned long long dax_region_avail_size(
+		struct dax_region *dax_region)
+{
+	unsigned long long size;
+	struct resource *res;
+
+	mutex_lock(&dax_region->lock);
+	size = resource_size(&dax_region->res);
+	for_each_dax_region_resource(dax_region, res)
+		size -= resource_size(res);
+	mutex_unlock(&dax_region->lock);
+
+	return size;
+}
+
+static ssize_t available_size_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dax_region *dax_region;
+	ssize_t rc = -ENXIO;
+
+	device_lock(dev);
+	dax_region = dev_get_drvdata(dev);
+	if (dax_region)
+		rc = sprintf(buf, "%llu\n", dax_region_avail_size(dax_region));
+	device_unlock(dev);
+
+	return rc;
+}
+static DEVICE_ATTR_RO(available_size);
+
+static struct attribute *dax_region_attributes[] = {
+	&dev_attr_available_size.attr,
+	NULL,
+};
+
+static const struct attribute_group dax_region_attribute_group = {
+	.name = "dax_region",
+	.attrs = dax_region_attributes,
+};
+
+static const struct attribute_group *dax_region_attribute_groups[] = {
+	&dax_region_attribute_group,
+	NULL,
 };
 
 static struct inode *dax_alloc_inode(struct super_block *sb)
@@ -200,12 +251,27 @@ void dax_region_put(struct dax_region *dax_region)
 }
 EXPORT_SYMBOL_GPL(dax_region_put);
 
+
+static void dax_region_unregister(void *region)
+{
+	struct dax_region *dax_region = region;
+
+	sysfs_remove_groups(&dax_region->dev->kobj,
+			dax_region_attribute_groups);
+	dax_region_put(dax_region);
+}
+
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 		struct resource *res, unsigned int align, void *addr,
 		unsigned long pfn_flags)
 {
 	struct dax_region *dax_region;
 
+	if (dev_get_drvdata(parent)) {
+		dev_WARN(parent, "dax core found drvdata already in use\n");
+		return NULL;
+	}
+
 	if (!IS_ALIGNED(res->start, align)
 			|| !IS_ALIGNED(resource_size(res), align))
 		return NULL;
@@ -213,16 +279,26 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 	dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL);
 	if (!dax_region)
 		return NULL;
-
-	memcpy(&dax_region->res, res, sizeof(*res));
+	dev_set_drvdata(parent, dax_region);
+	dax_region->res.name = dev_name(parent);
+	dax_region->res.start = res->start;
+	dax_region->res.end = res->end;
 	dax_region->pfn_flags = pfn_flags;
+	mutex_init(&dax_region->lock);
 	kref_init(&dax_region->kref);
 	dax_region->id = region_id;
 	ida_init(&dax_region->ida);
 	dax_region->align = align;
 	dax_region->dev = parent;
 	dax_region->base = addr;
+	if (sysfs_create_groups(&parent->kobj, dax_region_attribute_groups)) {
+		kfree(dax_region);
+		return NULL;
+	}
 
+	kref_get(&dax_region->kref);
+	if (devm_add_action_or_reset(parent, dax_region_unregister, dax_region))
+		return NULL;
 	return dax_region;
 }
 EXPORT_SYMBOL_GPL(alloc_dax_region);
@@ -240,7 +316,7 @@ static ssize_t size_show(struct device *dev,
 	int i;
 
 	for (i = 0; i < dax_dev->num_resources; i++)
-		size += resource_size(&dax_dev->res[i]);
+		size += resource_size(dax_dev->res[i]);
 
 	return sprintf(buf, "%llu\n", size);
 }
@@ -309,7 +385,7 @@ static phys_addr_t pgoff_to_phys(struct dax_dev *dax_dev, pgoff_t pgoff,
 	int i;
 
 	for (i = 0; i < dax_dev->num_resources; i++) {
-		res = &dax_dev->res[i];
+		res = dax_dev->res[i];
 		phys = pgoff * PAGE_SIZE + res->start;
 		if (phys >= res->start && phys <= res->end)
 			break;
@@ -317,7 +393,7 @@ static phys_addr_t pgoff_to_phys(struct dax_dev *dax_dev, pgoff_t pgoff,
 	}
 
 	if (i < dax_dev->num_resources) {
-		res = &dax_dev->res[i];
+		res = dax_dev->res[i];
 		if (phys + size - 1 <= res->end)
 			return phys;
 	}
@@ -534,13 +610,16 @@ static void dax_dev_release(struct device *dev)
 	ida_simple_remove(&dax_minor_ida, MINOR(dev->devt));
 	dax_region_put(dax_region);
 	iput(dax_dev->inode);
+	kfree(dax_dev->res);
 	kfree(dax_dev);
 }
 
 static void unregister_dax_dev(void *dev)
 {
 	struct dax_dev *dax_dev = to_dax_dev(dev);
+	struct dax_region *dax_region = dax_dev->region;
 	struct cdev *cdev = &dax_dev->cdev;
+	int i;
 
 	dev_dbg(dev, "%s\n", __func__);
 
@@ -554,6 +633,13 @@ static void unregister_dax_dev(void *dev)
 	dax_dev->alive = false;
 	synchronize_rcu();
 	unmap_mapping_range(dax_dev->inode->i_mapping, 0, 0, 1);
+
+	mutex_lock(&dax_region->lock);
+	for (i = 0; i < dax_dev->num_resources; i++)
+		__release_region(&dax_region->res, dax_dev->res[i]->start,
+				resource_size(dax_dev->res[i]));
+	mutex_unlock(&dax_region->lock);
+
 	cdev_del(cdev);
 	device_unregister(dev);
 }
@@ -568,28 +654,42 @@ struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
 	struct cdev *cdev;
 	dev_t dev_t;
 
-	dax_dev = kzalloc(sizeof(*dax_dev) + sizeof(*res) * count, GFP_KERNEL);
+	dax_dev = kzalloc(sizeof(*dax_dev), GFP_KERNEL);
 	if (!dax_dev)
 		return ERR_PTR(-ENOMEM);
 
+	dax_dev->res = kzalloc(sizeof(res) * count, GFP_KERNEL);
+	if (!dax_dev->res)
+		goto err_res;
+
 	for (i = 0; i < count; i++) {
+		struct resource *dax_res;
+
 		if (!IS_ALIGNED(res[i].start, dax_region->align)
 				|| !IS_ALIGNED(resource_size(&res[i]),
 					dax_region->align)) {
 			rc = -EINVAL;
 			break;
 		}
-		dax_dev->res[i].start = res[i].start;
-		dax_dev->res[i].end = res[i].end;
+
+		mutex_lock(&dax_region->lock);
+		dax_res = __request_region(&dax_region->res, res[i].start,
+				resource_size(&res[i]), NULL, 0);
+		mutex_unlock(&dax_region->lock);
+		if (!dax_res) {
+			rc = -EBUSY;
+			break;
+		}
+		dax_dev->res[i] = dax_res;
 	}
 
 	if (i < count)
-		goto err_id;
+		goto err_request_region;
 
 	dax_dev->id = ida_simple_get(&dax_region->ida, 0, 0, GFP_KERNEL);
 	if (dax_dev->id < 0) {
 		rc = dax_dev->id;
-		goto err_id;
+		goto err_request_region;
 	}
 
 	minor = ida_simple_get(&dax_minor_ida, 0, 0, GFP_KERNEL);
@@ -629,6 +729,10 @@ struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
 	dev->groups = dax_attribute_groups;
 	dev->release = dax_dev_release;
 	dev_set_name(dev, "dax%d.%d", dax_region->id, dax_dev->id);
+	/* update resource names now that the owner device is named */
+	for (i = 0; i < dax_dev->num_resources; i++)
+		dax_dev->res[i]->name = dev_name(dev);
+
 	rc = device_add(dev);
 	if (rc) {
 		put_device(dev);
@@ -647,7 +751,14 @@ struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
 	ida_simple_remove(&dax_minor_ida, minor);
  err_minor:
 	ida_simple_remove(&dax_region->ida, dax_dev->id);
- err_id:
+ err_request_region:
+	mutex_lock(&dax_region->lock);
+	for (i--; i >= 0; i--)
+		__release_region(&dax_region->res, dax_dev->res[i]->start,
+				resource_size(dax_dev->res[i]));
+	mutex_unlock(&dax_region->lock);
+	kfree(dax_dev->res);
+ err_res:
 	kfree(dax_dev);
 
 	return ERR_PTR(rc);


* [PATCH 2/8] dax: add region 'id', 'size', and 'align' attributes
  2016-12-11  6:28 [PATCH 0/8] device-dax: sub-division support Dan Williams
  2016-12-11  6:28 ` [PATCH 1/8] dax: add region-available-size attribute Dan Williams
@ 2016-12-11  6:28 ` Dan Williams
  2016-12-11  6:28 ` [PATCH 3/8] dax: register seed device Dan Williams
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2016-12-11  6:28 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-kernel

While this information is available by looking at the nvdimm parent
device, that may not always be the case when/if we add support for other
memory regions. Tooling should not depend on walking a given ancestor
topology that is not guaranteed by the device's class. For example, a
device-dax instance will always have a dax_region parent, but it may not
always have a libnvdimm "dax" device as a grandparent.
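
To make that concrete, a tool could resolve the region attributes
through the instance's immediate parent instead of hard-coding any
grandparent topology.  The sketch below assumes the instance appears
under /sys/class/dax and that its sysfs parent is the device that
registered the region; both are assumptions about the eventual layout.

#include <stdio.h>

static int show(const char *dax, const char *attr)
{
	char path[256], buf[64];
	FILE *f;

	/* hop to the parent device that holds the dax_region group */
	snprintf(path, sizeof(path), "/sys/class/dax/%s/../dax_region/%s",
			dax, attr);
	f = fopen(path, "r");
	if (!f || !fgets(buf, sizeof(buf), f))
		return -1;
	printf("%s: %s", attr, buf);
	return fclose(f);
}

int main(int argc, char **argv)
{
	if (argc != 2)
		return 1;
	return show(argv[1], "id") || show(argv[1], "size")
		|| show(argv[1], "align");
}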

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/dax.c |   53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index cc7c4aa4bcc2..da2962b6f8de 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -111,8 +111,61 @@ static ssize_t available_size_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(available_size);
 
+static ssize_t id_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dax_region *dax_region;
+	ssize_t rc = -ENXIO;
+
+	device_lock(dev);
+	dax_region = dev_get_drvdata(dev);
+	if (dax_region)
+		rc = sprintf(buf, "%d\n", dax_region->id);
+	device_unlock(dev);
+
+	return rc;
+}
+static DEVICE_ATTR_RO(id);
+
+static ssize_t region_size_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dax_region *dax_region;
+	ssize_t rc = -ENXIO;
+
+	device_lock(dev);
+	dax_region = dev_get_drvdata(dev);
+	if (dax_region)
+		rc = sprintf(buf, "%llu\n", (unsigned long long)
+				resource_size(&dax_region->res));
+	device_unlock(dev);
+
+	return rc;
+}
+static struct device_attribute dev_attr_region_size = __ATTR(size, 0444,
+		region_size_show, NULL);
+
+static ssize_t align_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dax_region *dax_region;
+	ssize_t rc = -ENXIO;
+
+	device_lock(dev);
+	dax_region = dev_get_drvdata(dev);
+	if (dax_region)
+		rc = sprintf(buf, "%u\n", dax_region->align);
+	device_unlock(dev);
+
+	return rc;
+}
+static DEVICE_ATTR_RO(align);
+
 static struct attribute *dax_region_attributes[] = {
 	&dev_attr_available_size.attr,
+	&dev_attr_region_size.attr,
+	&dev_attr_align.attr,
+	&dev_attr_id.attr,
 	NULL,
 };
 


* [PATCH 3/8] dax: register seed device
  2016-12-11  6:28 [PATCH 0/8] device-dax: sub-division support Dan Williams
  2016-12-11  6:28 ` [PATCH 1/8] dax: add region-available-size attribute Dan Williams
  2016-12-11  6:28 ` [PATCH 2/8] dax: add region 'id', 'size', and 'align' attributes Dan Williams
@ 2016-12-11  6:28 ` Dan Williams
  2016-12-11  6:28 ` [PATCH 4/8] dax: use multi-order radix for resource lookup Dan Williams
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2016-12-11  6:28 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-kernel

Towards adding support for sub-dividing device-dax regions, add a seed
device.  As with libnvdimm, a 'seed' device is an un-configured device
that is enabled after setting configuration parameters.  After a given
seed device is enabled, another is created, and this process repeats
until no more seed devices can be activated because the region's
available_size is exhausted.

For now, this simply registers a seed instance after the first dax
device instance is established.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/dax.c |   41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index da2962b6f8de..c9ba011e333b 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -40,8 +40,10 @@ MODULE_PARM_DESC(nr_dax, "max number of device-dax instances");
  * @kref: to pin while other agents have a need to do lookups
  * @lock: synchronize changes / consistent-access to the resource tree (@res)
  * @dev: parent device backing this region
+ * @seed: next device for dynamic allocation / configuration
  * @align: allocation and mapping alignment for child dax devices
  * @res: physical address range of the region
+ * @child_count: number of registered dax device instances
  * @pfn_flags: identify whether the pfns are paged back or not
  */
 struct dax_region {
@@ -51,8 +53,10 @@ struct dax_region {
 	struct kref kref;
 	struct mutex lock;
 	struct device *dev;
+	struct device *seed;
 	unsigned int align;
 	struct resource res;
+	atomic_t child_count;
 	unsigned long pfn_flags;
 };
 
@@ -161,10 +165,31 @@ static ssize_t align_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(align);
 
+static ssize_t seed_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dax_region *dax_region;
+	ssize_t rc = -ENXIO;
+
+	device_lock(dev);
+	dax_region = dev_get_drvdata(dev);
+	if (dax_region) {
+		mutex_lock(&dax_region->lock);
+		if (dax_region->seed)
+			rc = sprintf(buf, "%s\n", dev_name(dax_region->seed));
+		mutex_unlock(&dax_region->lock);
+	}
+	device_unlock(dev);
+
+	return rc;
+}
+static DEVICE_ATTR_RO(seed);
+
 static struct attribute *dax_region_attributes[] = {
 	&dev_attr_available_size.attr,
 	&dev_attr_region_size.attr,
 	&dev_attr_align.attr,
+	&dev_attr_seed.attr,
 	&dev_attr_id.attr,
 	NULL,
 };
@@ -295,6 +320,9 @@ static void dax_region_free(struct kref *kref)
 	struct dax_region *dax_region;
 
 	dax_region = container_of(kref, struct dax_region, kref);
+	WARN(atomic_read(&dax_region->child_count),
+			"%s: child count not zero\n",
+			dev_name(dax_region->dev));
 	kfree(dax_region);
 }
 
@@ -691,7 +719,10 @@ static void unregister_dax_dev(void *dev)
 	for (i = 0; i < dax_dev->num_resources; i++)
 		__release_region(&dax_region->res, dax_dev->res[i]->start,
 				resource_size(dax_dev->res[i]));
+	if (dax_region->seed == dev)
+		dax_region->seed = NULL;
 	mutex_unlock(&dax_region->lock);
+	atomic_dec(&dax_region->child_count);
 
 	cdev_del(cdev);
 	device_unregister(dev);
@@ -796,6 +827,16 @@ struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
 	if (rc)
 		return ERR_PTR(rc);
 
+	if (atomic_inc_return(&dax_region->child_count) == 1) {
+		struct dax_dev *seed;
+
+		seed = devm_create_dax_dev(dax_region, NULL, 0);
+		if (IS_ERR(seed))
+			dev_warn(parent, "failed to create region seed\n");
+		else
+			dax_region->seed = &seed->dev;
+	}
+
 	return dax_dev;
 
  err_cdev:


* [PATCH 4/8] dax: use multi-order radix for resource lookup
  2016-12-11  6:28 [PATCH 0/8] device-dax: sub-division support Dan Williams
                   ` (2 preceding siblings ...)
  2016-12-11  6:28 ` [PATCH 3/8] dax: register seed device Dan Williams
@ 2016-12-11  6:28 ` Dan Williams
  2016-12-11  6:28 ` [PATCH 5/8] dax: refactor locking out of size calculation routines Dan Williams
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2016-12-11  6:28 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-kernel

Before we add support for multiple discontiguous extents in a device-dax
region, convert the file-offset to physical-address lookup to use the
multi-order radix tree.
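
As a worked example, the chunking produced by order_at() /
foreach_order_pgoff() can be modeled in userspace.  The program below is
a simplified model that mirrors the arithmetic in this patch (it is not
kernel code) so the resulting radix entries can be inspected for a
sample geometry:

#include <stdio.h>
#include <limits.h>

static unsigned ilog2(unsigned long v)	/* v must be non-zero */
{
	return (sizeof(v) * 8 - 1) - __builtin_clzl(v);
}

/* base: the resource's starting page offset within the device */
static unsigned order_at(unsigned long base, unsigned long nr_pages,
		unsigned long pgoff)
{
	unsigned long dev_pgoff = base + pgoff;
	unsigned order_max, order_pgoff;

	if (pgoff == nr_pages)
		return UINT_MAX;
	order_pgoff = ilog2(nr_pages | dev_pgoff);
	order_max = ilog2(nr_pages - pgoff);
	return order_max < order_pgoff ? order_max : order_pgoff;
}

int main(void)
{
	/* a 1536-page (6MB with 4K pages) resource at device offset 0 */
	unsigned long base = 0, nr_pages = 1536, pgoff;
	unsigned order;

	for (pgoff = 0, order = order_at(base, nr_pages, pgoff);
			order < UINT_MAX;
			pgoff += 1UL << order,
			order = order_at(base, nr_pages, pgoff))
		printf("entry at dev_pgoff %lu, order %u (%lu pages)\n",
				base + pgoff, order, 1UL << order);
	return 0;
}

For this geometry the device range is covered by a 1024-page entry at
pgoff 0 and a 512-page entry at pgoff 1024, so any file offset resolves
with a single radix lookup instead of a walk over the resource array.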

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/Kconfig |    1 
 drivers/dax/dax.c   |  120 ++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 104 insertions(+), 17 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 3e2ab3b14eea..f01d1b353b8e 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -2,6 +2,7 @@ menuconfig DEV_DAX
 	tristate "DAX: direct access to differentiated memory"
 	default m if NVDIMM_DAX
 	depends on TRANSPARENT_HUGEPAGE
+	select RADIX_TREE_MULTIORDER
 	help
 	  Support raw access to differentiated (persistence, bandwidth,
 	  latency...) memory via an mmap(2) capable character
diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index c9ba011e333b..d878a56cf3e3 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -458,28 +458,34 @@ static int check_vma(struct dax_dev *dax_dev, struct vm_area_struct *vma,
 	return 0;
 }
 
+/*
+ * Reuse the unused ->desc attribute of a dax_dev resource to store the
+ * relative pgoff of the resource within the device.
+ */
+static unsigned long to_dev_pgoff(struct resource *res)
+{
+	return res->desc;
+}
+
+static void set_dev_pgoff(struct resource *res, unsigned long dev_pgoff)
+{
+	res->desc = dev_pgoff;
+}
+
 static phys_addr_t pgoff_to_phys(struct dax_dev *dax_dev, pgoff_t pgoff,
 		unsigned long size)
 {
+	struct address_space *mapping = dax_dev->inode->i_mapping;
+	phys_addr_t res_offset;
 	struct resource *res;
-	phys_addr_t phys;
-	int i;
-
-	for (i = 0; i < dax_dev->num_resources; i++) {
-		res = dax_dev->res[i];
-		phys = pgoff * PAGE_SIZE + res->start;
-		if (phys >= res->start && phys <= res->end)
-			break;
-		pgoff -= PHYS_PFN(resource_size(res));
-	}
-
-	if (i < dax_dev->num_resources) {
-		res = dax_dev->res[i];
-		if (phys + size - 1 <= res->end)
-			return phys;
-	}
 
-	return -1;
+	res = radix_tree_lookup(&mapping->page_tree, pgoff);
+	if (!res)
+		return -1;
+	res_offset = PFN_PHYS(pgoff - to_dev_pgoff(res));
+	if (res_offset + size >= resource_size(res))
+		return -1;
+	return res->start + res_offset;
 }
 
 static int __dax_dev_fault(struct dax_dev *dax_dev, struct vm_area_struct *vma,
@@ -682,6 +688,58 @@ static const struct file_operations dax_fops = {
 	.mmap = dax_mmap,
 };
 
+static unsigned order_at(struct resource *res, unsigned long pgoff)
+{
+	unsigned long dev_pgoff = to_dev_pgoff(res) + pgoff;
+	unsigned long nr_pages = PHYS_PFN(resource_size(res));
+	unsigned order_max, order_pgoff;
+
+	if (nr_pages == pgoff)
+		return UINT_MAX;
+
+	/*
+	 * What is the largest power-of-2 range available from this
+	 * resource pgoff to the end of the resource range, considering
+	 * the alignment of the current dev_pgoff?
+	 */
+	order_pgoff = ilog2(nr_pages | dev_pgoff);
+	order_max = ilog2(nr_pages - pgoff);
+	return min(order_max, order_pgoff);
+}
+
+#define foreach_order_pgoff(res, order, pgoff) \
+	for (pgoff = 0, order = order_at((res), pgoff); order < UINT_MAX; \
+		pgoff += 1UL << order, order = order_at(res, pgoff))
+
+static void clear_dax_dev_radix(struct dax_dev *dax_dev)
+{
+	struct address_space *mapping = dax_dev->inode->i_mapping;
+	struct radix_tree_iter iter;
+	void **slot;
+
+	rcu_read_lock();
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, 0) {
+		struct resource *res;
+		unsigned long pgoff;
+		unsigned order;
+
+		res = radix_tree_deref_slot(slot);
+		if (unlikely(!res))
+			continue;
+		if (radix_tree_deref_retry(res)) {
+			slot = radix_tree_iter_retry(&iter);
+			continue;
+		}
+
+		foreach_order_pgoff(res, order, pgoff)
+			radix_tree_delete(&mapping->page_tree,
+					to_dev_pgoff(res) + pgoff);
+	}
+	rcu_read_unlock();
+
+	synchronize_rcu();
+}
+
 static void dax_dev_release(struct device *dev)
 {
 	struct dax_dev *dax_dev = to_dax_dev(dev);
@@ -716,6 +774,7 @@ static void unregister_dax_dev(void *dev)
 	unmap_mapping_range(dax_dev->inode->i_mapping, 0, 0, 1);
 
 	mutex_lock(&dax_region->lock);
+	clear_dax_dev_radix(dax_dev);
 	for (i = 0; i < dax_dev->num_resources; i++)
 		__release_region(&dax_region->res, dax_dev->res[i]->start,
 				resource_size(dax_dev->res[i]));
@@ -734,6 +793,7 @@ struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
 	struct device *parent = dax_region->dev;
 	struct dax_dev *dax_dev;
 	int rc = 0, minor, i;
+	unsigned long pgoff;
 	struct device *dev;
 	struct cdev *cdev;
 	dev_t dev_t;
@@ -790,6 +850,28 @@ struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
 		goto err_inode;
 	}
 
+	for (i = 0, pgoff = 0; i < count; i++) {
+		struct address_space *mapping = dax_dev->inode->i_mapping;
+		struct resource *dax_res;
+		int order;
+
+		dax_res = dax_dev->res[i];
+		set_dev_pgoff(dax_res, pgoff);
+		mutex_lock(&dax_region->lock);
+		foreach_order_pgoff(dax_res, order, pgoff) {
+			rc = __radix_tree_insert(&mapping->page_tree,
+					to_dev_pgoff(dax_res) + pgoff, order,
+					dax_res);
+			if (rc)
+				break;
+		}
+		mutex_unlock(&dax_region->lock);
+		pgoff = to_dev_pgoff(dax_res) + PHYS_PFN(resource_size(dax_res));
+
+		if (rc)
+			goto err_radix_insert;
+	}
+
 	/* device_initialize() so cdev can reference kobj parent */
 	device_initialize(dev);
 
@@ -840,6 +922,10 @@ struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
 	return dax_dev;
 
  err_cdev:
+	mutex_lock(&dax_region->lock);
+	clear_dax_dev_radix(dax_dev);
+	mutex_unlock(&dax_region->lock);
+ err_radix_insert:
 	iput(dax_dev->inode);
  err_inode:
 	ida_simple_remove(&dax_minor_ida, minor);


* [PATCH 5/8] dax: refactor locking out of size calculation routines
  2016-12-11  6:28 [PATCH 0/8] device-dax: sub-division support Dan Williams
                   ` (3 preceding siblings ...)
  2016-12-11  6:28 ` [PATCH 4/8] dax: use multi-order radix for resource lookup Dan Williams
@ 2016-12-11  6:28 ` Dan Williams
  2016-12-14 15:01   ` Johannes Thumshirn
  2016-12-11  6:28 ` [PATCH 6/8] dax: sub-division support Dan Williams
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2016-12-11  6:28 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-kernel

In preparation for other callers of these routines, make the locking the
responsibility of the caller.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/dax.c |   30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index d878a56cf3e3..5b65eaff6ace 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -90,11 +90,11 @@ static unsigned long long dax_region_avail_size(
 	unsigned long long size;
 	struct resource *res;
 
-	mutex_lock(&dax_region->lock);
+	WARN_ON_ONCE(!mutex_is_locked(&dax_region->lock));
+
 	size = resource_size(&dax_region->res);
 	for_each_dax_region_resource(dax_region, res)
 		size -= resource_size(res);
-	mutex_unlock(&dax_region->lock);
 
 	return size;
 }
@@ -107,8 +107,11 @@ static ssize_t available_size_show(struct device *dev,
 
 	device_lock(dev);
 	dax_region = dev_get_drvdata(dev);
-	if (dax_region)
+	if (dax_region) {
+		mutex_lock(&dax_region->lock);
 		rc = sprintf(buf, "%llu\n", dax_region_avail_size(dax_region));
+		mutex_unlock(&dax_region->lock);
+	}
 	device_unlock(dev);
 
 	return rc;
@@ -389,16 +392,31 @@ static struct dax_dev *to_dax_dev(struct device *dev)
 	return container_of(dev, struct dax_dev, dev);
 }
 
-static ssize_t size_show(struct device *dev,
-		struct device_attribute *attr, char *buf)
+static unsigned long long dax_dev_size(struct dax_dev *dax_dev)
 {
-	struct dax_dev *dax_dev = to_dax_dev(dev);
+	struct dax_region *dax_region = dax_dev->region;
 	unsigned long long size = 0;
 	int i;
 
+	WARN_ON_ONCE(!mutex_is_locked(&dax_region->lock));
+
 	for (i = 0; i < dax_dev->num_resources; i++)
 		size += resource_size(dax_dev->res[i]);
 
+	return size;
+}
+
+static ssize_t size_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	unsigned long long size;
+	struct dax_dev *dax_dev = to_dax_dev(dev);
+	struct dax_region *dax_region = dax_dev->region;
+
+	mutex_lock(&dax_region->lock);
+	size = dax_dev_size(dax_dev);
+	mutex_unlock(&dax_region->lock);
+
 	return sprintf(buf, "%llu\n", size);
 }
 static DEVICE_ATTR_RO(size);


* [PATCH 6/8] dax: sub-division support
  2016-12-11  6:28 [PATCH 0/8] device-dax: sub-division support Dan Williams
                   ` (4 preceding siblings ...)
  2016-12-11  6:28 ` [PATCH 5/8] dax: refactor locking out of size calculation routines Dan Williams
@ 2016-12-11  6:28 ` Dan Williams
  2016-12-11  6:29 ` [PATCH 7/8] dax: add / remove dax devices after provisioning Dan Williams
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2016-12-11  6:28 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-kernel

Device-DAX is a mechanism to establish mappings of performance / feature
differentiated memory with strict fault behavior guarantees. With
sub-division support a platform owner can provision sub-allocations of a
dax-region into separate devices. The provisioning mechanism follows the
same scheme as the libnvdimm sub-system in that a 'seed' device is
created at initialization time that can be resized from zero to become
enabled. Note that a later patch handles creating a new seed when the
current one is "planted" (enabled).

Unlike the nvdimm sub-system there is no on media labelling scheme
associated with this partitioning. Provisioning decisions are ephemeral
/ not automatically restored after reboot. While the initial use case of
device-dax is persistent memory, other use cases may be volatile, so the
device-dax core is unable to assume the underlying memory is pmem.  The
task of recalling a partitioning scheme or permissions on the device(s)
is left to userspace.

For persistent allocations, naming, and permissions automatically
recalled by the kernel, use filesystem-DAX. For a userspace helper
library and utility for manipulating device-dax instances see libdaxctl
and the daxctl utility here: https://github.com/pmem/ndctl
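
From userspace a resize then boils down to a single sysfs write.  The
sketch below rounds the request up to the region alignment since
misaligned sizes are rejected with -EINVAL; the /sys/class/dax location
and the '..' hop to the region's parent device are assumptions.

/* usage: dax-resize <daxX.Y> <bytes> */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	unsigned long long req, align;
	char path[256];
	FILE *f;

	if (argc != 3)
		return 1;
	req = strtoull(argv[2], NULL, 0);

	/* the region alignment is exported by the parent device */
	snprintf(path, sizeof(path), "/sys/class/dax/%s/../dax_region/align",
			argv[1]);
	f = fopen(path, "r");
	if (!f || fscanf(f, "%llu", &align) != 1 || !align)
		return 1;
	fclose(f);

	/* round up to a multiple of the region alignment */
	req = (req + align - 1) / align * align;

	snprintf(path, sizeof(path), "/sys/class/dax/%s/size", argv[1]);
	f = fopen(path, "w");
	if (!f || fprintf(f, "%llu\n", req) < 0)
		return 1;
	return fclose(f) ? 1 : 0;
}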

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/dax.c |  351 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 312 insertions(+), 39 deletions(-)

diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index 5b65eaff6ace..9b641c079e52 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -63,6 +63,7 @@ struct dax_region {
 /**
  * struct dax_dev - subdivision of a dax region
  * @region - parent region
+ * @resize_lock - for resource size reductions
  * @dev - device backing the character device
  * @cdev - core chardev data
  * @alive - !alive + rcu grace period == no new mappings can be established
@@ -72,6 +73,7 @@ struct dax_region {
  */
 struct dax_dev {
 	struct dax_region *region;
+	rwlock_t resize_lock;
 	struct inode *inode;
 	struct device dev;
 	struct cdev cdev;
@@ -419,7 +421,302 @@ static ssize_t size_show(struct device *dev,
 
 	return sprintf(buf, "%llu\n", size);
 }
-static DEVICE_ATTR_RO(size);
+
+/*
+ * Reuse the unused ->desc attribute of a dax_dev resource to store the
+ * relative pgoff of the resource within the device.
+ */
+static unsigned long to_dev_pgoff(struct resource *res)
+{
+	return res->desc;
+}
+
+static void set_dev_pgoff(struct resource *res, unsigned long dev_pgoff)
+{
+	res->desc = dev_pgoff;
+}
+
+static unsigned order_at(struct resource *res, unsigned long pgoff)
+{
+	unsigned long dev_pgoff = to_dev_pgoff(res) + pgoff;
+	unsigned long nr_pages = PHYS_PFN(resource_size(res));
+	unsigned order_max, order_pgoff;
+
+	if (nr_pages == pgoff)
+		return UINT_MAX;
+
+	/*
+	 * What is the largest power-of-2 range available from this
+	 * resource pgoff to the end of the resource range, considering
+	 * the alignment of the current dev_pgoff?
+	 */
+	order_pgoff = ilog2(nr_pages | dev_pgoff);
+	order_max = ilog2(nr_pages - pgoff);
+	return min(order_max, order_pgoff);
+}
+
+#define foreach_order_pgoff(res, order, pgoff) \
+	for (pgoff = 0, order = order_at((res), pgoff); order < UINT_MAX; \
+		pgoff += 1UL << order, order = order_at(res, pgoff))
+
+static int dax_dev_adjust_resource(struct dax_dev *dax_dev,
+		struct resource *res, resource_size_t size)
+{
+	struct address_space *mapping = dax_dev->inode->i_mapping;
+	unsigned long pgoff;
+	int rc = 0, order;
+
+	/*
+	 * Take the lock to prevent false negative lookups while we
+	 * adjust both the resource and radix entries. Note that the
+	 * false *positive* lookups that are allowed by not locking when
+	 * deleting full resources are permissible because we will end
+	 * up invalidating those mappings before completing the resize.
+	 */
+	write_lock(&dax_dev->resize_lock);
+	foreach_order_pgoff(res, order, pgoff)
+		radix_tree_delete(&mapping->page_tree,
+				to_dev_pgoff(res) + pgoff);
+
+	adjust_resource(res, res->start, size);
+
+	foreach_order_pgoff(res, order, pgoff) {
+		rc = __radix_tree_insert(&mapping->page_tree,
+				to_dev_pgoff(res) + pgoff, order, res);
+		if (rc) {
+			dev_WARN(&dax_dev->dev,
+					"error: %d adjusting size\n", rc);
+			break;
+		}
+	}
+	write_unlock(&dax_dev->resize_lock);
+
+	return rc;
+}
+
+static int dax_dev_shrink(struct dax_region *dax_region,
+		struct dax_dev *dax_dev, unsigned long long size)
+{
+	struct address_space *mapping = dax_dev->inode->i_mapping;
+	resource_size_t dev_size = dax_dev_size(dax_dev);
+	resource_size_t res_size, to_free;
+	struct resource *max_res, *res;
+	unsigned long pgoff;
+	int i, order, rc = 0;
+
+	to_free = dev_size - size;
+
+retry:
+	max_res = NULL;
+	/* delete from the highest pgoff resource */
+	for (i = 0; i < dax_dev->num_resources; i++) {
+		res = dax_dev->res[i];
+		if (!max_res || to_dev_pgoff(res) > to_dev_pgoff(max_res))
+			max_res = res;
+	}
+
+	res = max_res;
+	if (!res)
+		return -ENXIO;
+	res_size = resource_size(res);
+
+	if (to_free >= res_size) {
+		foreach_order_pgoff(res, order, pgoff)
+			radix_tree_delete(&mapping->page_tree,
+					to_dev_pgoff(res) + pgoff);
+		synchronize_rcu();
+		__release_region(&dax_region->res, res->start, res_size);
+		for (i = 0; i < dax_dev->num_resources; i++)
+			if (res == dax_dev->res[i])
+				break;
+		for (i = i + 1; i < dax_dev->num_resources; i++)
+			dax_dev->res[i - 1] = dax_dev->res[i];
+		dax_dev->num_resources--;
+		to_free -= res_size;
+
+		/*
+		 * Once we've deleted a resource we need to search the
+		 * next resource at the highest remaining dev_pgoff.
+		 */
+		if (to_free)
+			goto retry;
+	} else {
+		rc = dax_dev_adjust_resource(dax_dev, res, res_size - to_free);
+		synchronize_rcu();
+	}
+
+	/*
+	 * Now that the lookup radix and resource tree has been cleaned
+	 * up we can invalidate any remaining mappings in the deleted
+	 * range.
+	 */
+	unmap_mapping_range(mapping, size, dev_size - size, 1);
+
+	return rc;
+}
+
+static int dax_dev_add_resource(struct dax_region *dax_region,
+		struct dax_dev *dax_dev, resource_size_t start,
+		resource_size_t size, unsigned long dev_pgoff)
+{
+	struct address_space *mapping = dax_dev->inode->i_mapping;
+	struct resource *res, **resources;
+	int order, rc = -ENOMEM;
+	unsigned long pgoff;
+
+	res = __request_region(&dax_region->res, start, size,
+			dev_name(&dax_dev->dev), 0);
+	if (!res)
+		return -EBUSY;
+	set_dev_pgoff(res, dev_pgoff);
+	resources = krealloc(dax_dev->res, sizeof(struct resource *)
+			* (dax_dev->num_resources + 1), GFP_KERNEL);
+	if (!resources)
+		goto err_resources;
+	dax_dev->res = resources;
+	dax_dev->res[dax_dev->num_resources++] = res;
+
+	foreach_order_pgoff(res, order, pgoff) {
+		rc = __radix_tree_insert(&mapping->page_tree,
+				to_dev_pgoff(res) + pgoff, order, res);
+		if (rc)
+			goto err_radix;
+	}
+
+	return 0;
+
+err_radix:
+	foreach_order_pgoff(res, order, pgoff)
+		radix_tree_delete(&mapping->page_tree,
+				to_dev_pgoff(res) + pgoff);
+	dax_dev->res[--dax_dev->num_resources] = NULL;
+err_resources:
+	__release_region(&dax_region->res, start, size);
+	return -ENOMEM;
+
+}
+
+static ssize_t dax_dev_resize(struct dax_region *dax_region,
+		struct dax_dev *dax_dev, resource_size_t size)
+{
+	resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
+	resource_size_t dev_size = dax_dev_size(dax_dev);
+	struct resource *max_res = NULL, *res, *first;
+	unsigned long dev_pgoff = PHYS_PFN(dev_size);
+	const char *name = dev_name(&dax_dev->dev);
+	resource_size_t region_end;
+	int i, rc;
+
+	if (size == dev_size)
+		return 0;
+	if (size > dev_size && size - dev_size > avail)
+		return -ENOSPC;
+
+	if (size < dev_size)
+		return dax_dev_shrink(dax_region, dax_dev, size);
+
+	to_alloc = size - dev_size;
+	if (!IS_ALIGNED(to_alloc, dax_region->align)) {
+		WARN_ON(1);
+		return -ENXIO;
+	}
+
+	for (i = 0; i < dax_dev->num_resources; i++) {
+		res = dax_dev->res[i];
+		if (!max_res || to_dev_pgoff(res) > to_dev_pgoff(max_res))
+			max_res = res;
+	}
+
+	/*
+	 * Expand the device into the unused portion of the region. This
+	 * may involve adjusting the end of an existing resource, or
+	 * allocating a new disjoint resource.
+	 */
+	region_end = dax_region->res.start + resource_size(&dax_region->res);
+	first = dax_region->res.child;
+	for (res = first; to_alloc && res; res = res->sibling) {
+		struct resource *next = res->sibling;
+		resource_size_t alloc, res_end;
+
+		res_end = res->start + resource_size(res);
+
+		/* space at the beginning of the region */
+		if (res == first && res->start > dax_region->res.start) {
+			alloc = res->start - dax_region->res.start;
+			alloc = min(alloc, to_alloc);
+			rc = dax_dev_add_resource(dax_region, dax_dev,
+					dax_region->res.start, alloc,
+					dev_pgoff);
+			if (rc)
+				return rc;
+			to_alloc -= alloc;
+			dev_pgoff += PHYS_PFN(alloc);
+		}
+
+		/* space between allocations */
+		if (to_alloc && next && next->start > res_end) {
+			alloc = next->start - res_end;
+			alloc = min(alloc, to_alloc);
+			if (res == max_res && strcmp(name, res->name) == 0)
+				rc = dax_dev_adjust_resource(dax_dev, res,
+						resource_size(res) + alloc);
+			else
+				rc = dax_dev_add_resource(dax_region, dax_dev,
+						res_end, alloc, dev_pgoff);
+			if (rc)
+				return rc;
+			to_alloc -= alloc;
+			dev_pgoff += PHYS_PFN(alloc);
+		}
+
+		/* space at the end of the region */
+		if (to_alloc && !next && res_end < region_end) {
+			alloc = region_end - res_end;
+			alloc = min(alloc, to_alloc);
+			if (res == max_res && strcmp(name, res->name) == 0)
+				rc = dax_dev_adjust_resource(dax_dev, res,
+						resource_size(res) + alloc);
+			else
+				rc = dax_dev_add_resource(dax_region, dax_dev,
+						res_end, alloc, dev_pgoff);
+			if (rc)
+				return rc;
+			to_alloc -= alloc;
+			dev_pgoff += PHYS_PFN(alloc);
+		}
+	}
+
+	return 0;
+}
+
+static ssize_t size_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	ssize_t rc;
+	unsigned long long val;
+	struct dax_dev *dax_dev = to_dax_dev(dev);
+	struct dax_region *dax_region = dax_dev->region;
+
+	rc = kstrtoull(buf, 0, &val);
+	if (rc)
+		return rc;
+
+	if (!IS_ALIGNED(val, dax_region->align)) {
+		dev_dbg(&dax_dev->dev, "%s: size: %llu misaligned\n",
+				__func__, val);
+		return -EINVAL;
+	}
+
+	mutex_lock(&dax_region->lock);
+	rc = dax_dev_resize(dax_region, dax_dev, val);
+	mutex_unlock(&dax_region->lock);
+
+	if (rc == 0)
+		return len;
+
+	return rc;
+}
+static DEVICE_ATTR_RW(size);
 
 static struct attribute *dax_device_attributes[] = {
 	&dev_attr_size.attr,
@@ -476,21 +773,7 @@ static int check_vma(struct dax_dev *dax_dev, struct vm_area_struct *vma,
 	return 0;
 }
 
-/*
- * Reuse the unused ->desc attribute of a dax_dev resource to store the
- * relative pgoff of the resource within the device.
- */
-static unsigned long to_dev_pgoff(struct resource *res)
-{
-	return res->desc;
-}
-
-static void set_dev_pgoff(struct resource *res, unsigned long dev_pgoff)
-{
-	res->desc = dev_pgoff;
-}
-
-static phys_addr_t pgoff_to_phys(struct dax_dev *dax_dev, pgoff_t pgoff,
+static phys_addr_t __pgoff_to_phys(struct dax_dev *dax_dev, pgoff_t pgoff,
 		unsigned long size)
 {
 	struct address_space *mapping = dax_dev->inode->i_mapping;
@@ -506,6 +789,18 @@ static phys_addr_t pgoff_to_phys(struct dax_dev *dax_dev, pgoff_t pgoff,
 	return res->start + res_offset;
 }
 
+static phys_addr_t pgoff_to_phys(struct dax_dev *dax_dev, pgoff_t pgoff,
+                unsigned long size)
+{
+	phys_addr_t phys;
+
+	read_lock(&dax_dev->resize_lock);
+	phys = __pgoff_to_phys(dax_dev, pgoff, size);
+	read_unlock(&dax_dev->resize_lock);
+
+	return phys;
+}
+
 static int __dax_dev_fault(struct dax_dev *dax_dev, struct vm_area_struct *vma,
 		struct vm_fault *vmf)
 {
@@ -706,29 +1001,6 @@ static const struct file_operations dax_fops = {
 	.mmap = dax_mmap,
 };
 
-static unsigned order_at(struct resource *res, unsigned long pgoff)
-{
-	unsigned long dev_pgoff = to_dev_pgoff(res) + pgoff;
-	unsigned long nr_pages = PHYS_PFN(resource_size(res));
-	unsigned order_max, order_pgoff;
-
-	if (nr_pages == pgoff)
-		return UINT_MAX;
-
-	/*
-	 * What is the largest power-of-2 range available from this
-	 * resource pgoff to the end of the resource range, considering
-	 * the alignment of the current dev_pgoff?
-	 */
-	order_pgoff = ilog2(nr_pages | dev_pgoff);
-	order_max = ilog2(nr_pages - pgoff);
-	return min(order_max, order_pgoff);
-}
-
-#define foreach_order_pgoff(res, order, pgoff) \
-	for (pgoff = 0, order = order_at((res), pgoff); order < UINT_MAX; \
-		pgoff += 1UL << order, order = order_at(res, pgoff))
-
 static void clear_dax_dev_radix(struct dax_dev *dax_dev)
 {
 	struct address_space *mapping = dax_dev->inode->i_mapping;
@@ -905,6 +1177,7 @@ struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
 	dax_dev->num_resources = count;
 	dax_dev->alive = true;
 	dax_dev->region = dax_region;
+	rwlock_init(&dax_dev->resize_lock);
 	kref_get(&dax_region->kref);
 
 	dev->devt = dev_t;


* [PATCH 7/8] dax: add / remove dax devices after provisioning
  2016-12-11  6:28 [PATCH 0/8] device-dax: sub-division support Dan Williams
                   ` (5 preceding siblings ...)
  2016-12-11  6:28 ` [PATCH 6/8] dax: sub-division support Dan Williams
@ 2016-12-11  6:29 ` Dan Williams
  2016-12-11  6:29 ` [PATCH 8/8] dax: add debug for region available_size Dan Williams
  2016-12-12 17:15 ` [PATCH 0/8] device-dax: sub-division support Jeff Moyer
  8 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2016-12-11  6:29 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-kernel

Create a new device-dax seed when incrementing the size of the existing
seed from zero.

Destroy a device-dax instance when its size changes from non-zero to
zero.
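
Taken together with the previous patches this lets userspace carve a
region up in a loop: grow the seed, a new seed appears, repeat until the
remaining capacity is exhausted.  The sketch below is illustrative only
(the sysfs locations are assumptions, and the daxctl utility is the
intended interface for this):

/* usage: dax-carve <region-parent-sysfs-dir> <bytes-per-device> */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int read_attr(const char *dir, const char *attr, char *buf, int len)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/dax_region/%s", dir, attr);
	f = fopen(path, "r");
	if (!f || !fgets(buf, len, f))
		return -1;
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

int main(int argc, char **argv)
{
	unsigned long long bytes, avail;
	char buf[64], path[256];
	FILE *f;

	if (argc != 3)
		return 1;
	bytes = strtoull(argv[2], NULL, 0);
	if (!bytes)
		return 1;

	for (;;) {
		if (read_attr(argv[1], "available_size", buf, sizeof(buf)))
			return 1;
		avail = strtoull(buf, NULL, 0);
		if (avail < bytes)
			break;

		/* grow the current seed; a new seed is created on success */
		if (read_attr(argv[1], "seed", buf, sizeof(buf)))
			return 1;
		snprintf(path, sizeof(path), "/sys/class/dax/%s/size", buf);
		f = fopen(path, "w");
		if (!f || fprintf(f, "%llu\n", bytes) < 0 || fclose(f))
			return 1;
		printf("enabled %s (%llu bytes)\n", buf, bytes);
	}
	return 0;
}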

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/dax.c |  195 +++++++++++++++++++++++++++++++++++------------------
 1 file changed, 128 insertions(+), 67 deletions(-)

diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index 9b641c079e52..b130eff91b83 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -15,6 +15,7 @@
 #include <linux/device.h>
 #include <linux/mount.h>
 #include <linux/pfn_t.h>
+#include <linux/async.h>
 #include <linux/hash.h>
 #include <linux/cdev.h>
 #include <linux/slab.h>
@@ -32,6 +33,7 @@ static struct vfsmount *dax_mnt;
 static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
 MODULE_PARM_DESC(nr_dax, "max number of device-dax instances");
+static ASYNC_DOMAIN_EXCLUSIVE(dax_dev_async);
 
 /**
  * struct dax_region - mapping infrastructure for dax devices
@@ -329,6 +331,7 @@ static void dax_region_free(struct kref *kref)
 			"%s: child count not zero\n",
 			dev_name(dax_region->dev));
 	kfree(dax_region);
+	module_put(THIS_MODULE);
 }
 
 void dax_region_put(struct dax_region *dax_region)
@@ -377,15 +380,22 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 	dax_region->align = align;
 	dax_region->dev = parent;
 	dax_region->base = addr;
-	if (sysfs_create_groups(&parent->kobj, dax_region_attribute_groups)) {
-		kfree(dax_region);
-		return NULL;;
-	}
+	if (!try_module_get(THIS_MODULE))
+		goto err_module;
+
+	if (sysfs_create_groups(&parent->kobj, dax_region_attribute_groups))
+		goto err_groups;
 
 	kref_get(&dax_region->kref);
 	if (devm_add_action_or_reset(parent, dax_region_unregister, dax_region))
 		return NULL;
 	return dax_region;
+
+err_groups:
+	module_put(THIS_MODULE);
+err_module:
+	kfree(dax_region);
+	return NULL;
 }
 EXPORT_SYMBOL_GPL(alloc_dax_region);
 
@@ -402,6 +412,9 @@ static unsigned long long dax_dev_size(struct dax_dev *dax_dev)
 
 	WARN_ON_ONCE(!mutex_is_locked(&dax_region->lock));
 
+	if (!dax_dev->alive)
+		return 0;
+
 	for (i = 0; i < dax_dev->num_resources; i++)
 		size += resource_size(dax_dev->res[i]);
 
@@ -415,6 +428,9 @@ static ssize_t size_show(struct device *dev,
 	struct dax_dev *dax_dev = to_dax_dev(dev);
 	struct dax_region *dax_region = dax_dev->region;
 
+	/* flush previous size operations */
+	async_synchronize_full_domain(&dax_dev_async);
+
 	mutex_lock(&dax_region->lock);
 	size = dax_dev_size(dax_dev);
 	mutex_unlock(&dax_region->lock);
@@ -494,6 +510,89 @@ static int dax_dev_adjust_resource(struct dax_dev *dax_dev,
 	return rc;
 }
 
+static void clear_dax_dev_radix(struct dax_dev *dax_dev)
+{
+	struct address_space *mapping = dax_dev->inode->i_mapping;
+	struct radix_tree_iter iter;
+	void **slot;
+
+	rcu_read_lock();
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, 0) {
+		struct resource *res;
+		unsigned long pgoff;
+		unsigned order;
+
+		res = radix_tree_deref_slot(slot);
+		if (unlikely(!res))
+			continue;
+		if (radix_tree_deref_retry(res)) {
+			slot = radix_tree_iter_retry(&iter);
+			continue;
+		}
+
+		foreach_order_pgoff(res, order, pgoff)
+			radix_tree_delete(&mapping->page_tree,
+					to_dev_pgoff(res) + pgoff);
+	}
+	rcu_read_unlock();
+
+	synchronize_rcu();
+}
+
+static void unregister_dax_dev(void *dev)
+{
+	struct dax_dev *dax_dev = to_dax_dev(dev);
+	struct dax_region *dax_region = dax_dev->region;
+	struct cdev *cdev = &dax_dev->cdev;
+	int i;
+
+	dev_dbg(dev, "%s\n", __func__);
+
+	/*
+	 * Note, rcu is not protecting the liveness of dax_dev, rcu is
+	 * ensuring that any fault handlers that might have seen
+	 * dax_dev->alive == true, have completed.  Any fault handlers
+	 * that start after synchronize_rcu() has started will abort
+	 * upon seeing dax_dev->alive == false.
+	 */
+	dax_dev->alive = false;
+	synchronize_rcu();
+	unmap_mapping_range(dax_dev->inode->i_mapping, 0, 0, 1);
+
+	mutex_lock(&dax_region->lock);
+	clear_dax_dev_radix(dax_dev);
+	for (i = 0; i < dax_dev->num_resources; i++)
+		__release_region(&dax_region->res, dax_dev->res[i]->start,
+				resource_size(dax_dev->res[i]));
+	if (dax_region->seed == dev)
+		dax_region->seed = NULL;
+	mutex_unlock(&dax_region->lock);
+	atomic_dec(&dax_region->child_count);
+
+	cdev_del(cdev);
+	device_unregister(dev);
+}
+
+static void dax_dev_async_unregister(void *d, async_cookie_t cookie)
+{
+	struct device *dev = d;
+	struct dax_dev *dax_dev = to_dax_dev(dev);
+	struct dax_region *dax_region = dax_dev->region;
+
+	/*
+	 * Check that we still have an enabled region, if not then this
+	 * device was unregistered when the region was disabled.
+	 */
+	device_lock(dax_region->dev);
+	if (dev_get_drvdata(dax_region->dev)) {
+		devm_remove_action(dax_region->dev, unregister_dax_dev, dev);
+		unregister_dax_dev(dev);
+	}
+	device_unlock(dax_region->dev);
+
+	put_device(dev);
+}
+
 static int dax_dev_shrink(struct dax_region *dax_region,
 		struct dax_dev *dax_dev, unsigned long long size)
 {
@@ -552,6 +651,14 @@ static int dax_dev_shrink(struct dax_region *dax_region,
 	 */
 	unmap_mapping_range(mapping, size, dev_size - size, 1);
 
+	if (size == 0 && &dax_dev->dev != dax_region->seed) {
+		get_device(&dax_dev->dev);
+		dax_dev->alive = false;
+		synchronize_rcu();
+		async_schedule_domain(dax_dev_async_unregister, &dax_dev->dev,
+				&dax_dev_async);
+	}
+
 	return rc;
 }
 
@@ -607,6 +714,9 @@ static ssize_t dax_dev_resize(struct dax_region *dax_region,
 	resource_size_t region_end;
 	int i, rc;
 
+	if (!dax_dev->alive)
+		return -ENXIO;
+
 	if (size == dev_size)
 		return 0;
 	if (size > dev_size && size - dev_size > avail)
@@ -686,6 +796,20 @@ static ssize_t dax_dev_resize(struct dax_region *dax_region,
 		}
 	}
 
+	device_lock(dax_region->dev);
+	if (dev_get_drvdata(dax_region->dev) && dev_size == 0
+			&& &dax_dev->dev == dax_region->seed) {
+		struct dax_dev *seed;
+
+		seed = devm_create_dax_dev(dax_region, NULL, 0);
+		if (IS_ERR(seed))
+			dev_warn(dax_region->dev,
+					"failed to create new region seed\n");
+		else
+			dax_region->seed = &seed->dev;
+	}
+	device_unlock(dax_region->dev);
+
 	return 0;
 }
 
@@ -1001,35 +1125,6 @@ static const struct file_operations dax_fops = {
 	.mmap = dax_mmap,
 };
 
-static void clear_dax_dev_radix(struct dax_dev *dax_dev)
-{
-	struct address_space *mapping = dax_dev->inode->i_mapping;
-	struct radix_tree_iter iter;
-	void **slot;
-
-	rcu_read_lock();
-	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, 0) {
-		struct resource *res;
-		unsigned long pgoff;
-		unsigned order;
-
-		res = radix_tree_deref_slot(slot);
-		if (unlikely(!res))
-			continue;
-		if (radix_tree_deref_retry(res)) {
-			slot = radix_tree_iter_retry(&iter);
-			continue;
-		}
-
-		foreach_order_pgoff(res, order, pgoff)
-			radix_tree_delete(&mapping->page_tree,
-					to_dev_pgoff(res) + pgoff);
-	}
-	rcu_read_unlock();
-
-	synchronize_rcu();
-}
-
 static void dax_dev_release(struct device *dev)
 {
 	struct dax_dev *dax_dev = to_dax_dev(dev);
@@ -1043,40 +1138,6 @@ static void dax_dev_release(struct device *dev)
 	kfree(dax_dev);
 }
 
-static void unregister_dax_dev(void *dev)
-{
-	struct dax_dev *dax_dev = to_dax_dev(dev);
-	struct dax_region *dax_region = dax_dev->region;
-	struct cdev *cdev = &dax_dev->cdev;
-	int i;
-
-	dev_dbg(dev, "%s\n", __func__);
-
-	/*
-	 * Note, rcu is not protecting the liveness of dax_dev, rcu is
-	 * ensuring that any fault handlers that might have seen
-	 * dax_dev->alive == true, have completed.  Any fault handlers
-	 * that start after synchronize_rcu() has started will abort
-	 * upon seeing dax_dev->alive == false.
-	 */
-	dax_dev->alive = false;
-	synchronize_rcu();
-	unmap_mapping_range(dax_dev->inode->i_mapping, 0, 0, 1);
-
-	mutex_lock(&dax_region->lock);
-	clear_dax_dev_radix(dax_dev);
-	for (i = 0; i < dax_dev->num_resources; i++)
-		__release_region(&dax_region->res, dax_dev->res[i]->start,
-				resource_size(dax_dev->res[i]));
-	if (dax_region->seed == dev)
-		dax_region->seed = NULL;
-	mutex_unlock(&dax_region->lock);
-	atomic_dec(&dax_region->child_count);
-
-	cdev_del(cdev);
-	device_unregister(dev);
-}
-
 struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
 		struct resource *res, int count)
 {


* [PATCH 8/8] dax: add debug for region available_size
  2016-12-11  6:28 [PATCH 0/8] device-dax: sub-division support Dan Williams
                   ` (6 preceding siblings ...)
  2016-12-11  6:29 ` [PATCH 7/8] dax: add / remove dax devices after provisioning Dan Williams
@ 2016-12-11  6:29 ` Dan Williams
  2016-12-12 17:15 ` [PATCH 0/8] device-dax: sub-division support Jeff Moyer
  8 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2016-12-11  6:29 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-kernel

Add dynamic debug statements to dump the per-dax-dev resource
allocations and device offsets.
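
They are off by default; one way to enable them at runtime is a write to
the dynamic debug control file, for example (assuming
CONFIG_DYNAMIC_DEBUG and debugfs mounted at /sys/kernel/debug):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/kernel/debug/dynamic_debug/control", "w");

	if (!f)
		return 1;
	/* enable the dev_dbg() statements compiled into dax.c */
	fprintf(f, "file dax.c +p\n");
	return fclose(f) ? 1 : 0;
}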

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/dax.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index b130eff91b83..d1641e69a088 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -97,8 +97,11 @@ static unsigned long long dax_region_avail_size(
 	WARN_ON_ONCE(!mutex_is_locked(&dax_region->lock));
 
 	size = resource_size(&dax_region->res);
-	for_each_dax_region_resource(dax_region, res)
+	for_each_dax_region_resource(dax_region, res) {
+		dev_dbg(dax_region->dev, "%s: %pr offset: %lx\n",
+				res->name, res, res->desc);
 		size -= resource_size(res);
+	}
 
 	return size;
 }
@@ -372,6 +375,7 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 	dax_region->res.name = dev_name(parent);
 	dax_region->res.start = res->start;
 	dax_region->res.end = res->end;
+	dax_region->res.flags = IORESOURCE_MEM;
 	dax_region->pfn_flags = pfn_flags;
 	mutex_init(&dax_region->lock);
 	kref_init(&dax_region->kref);


* Re: [PATCH 0/8] device-dax: sub-division support
  2016-12-11  6:28 [PATCH 0/8] device-dax: sub-division support Dan Williams
                   ` (7 preceding siblings ...)
  2016-12-11  6:29 ` [PATCH 8/8] dax: add debug for region available_size Dan Williams
@ 2016-12-12 17:15 ` Jeff Moyer
  2016-12-12 18:46   ` Dan Williams
  8 siblings, 1 reply; 21+ messages in thread
From: Jeff Moyer @ 2016-12-12 17:15 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm, linux-kernel

Hi, Dan,

Dan Williams <dan.j.williams@intel.com> writes:

> From [PATCH 6/8] dax: sub-division support:
>
> Device-DAX is a mechanism to establish mappings of performance / feature
> differentiated memory with strict fault behavior guarantees.  With
> sub-division support a platform owner can provision sub-allocations of a
> dax-region into separate devices. The provisioning mechanism follows the
> same scheme as the libnvdimm sub-system in that a 'seed' device is
> created at initialization time that can be resized from zero to become
> enabled.
>
> Unlike the nvdimm sub-system there is no on media labelling scheme
> associated with this partitioning. Provisioning decisions are ephemeral
> / not automatically restored after reboot. While the initial use case of
> device-dax is persistent memory, other use cases may be volatile, so the
> device-dax core is unable to assume the underlying memory is pmem.  The
> task of recalling a partitioning scheme or permissions on the device(s)
> is left to userspace.

Can you explain this reasoning in a bit more detail, please?  If you
have specific use cases in mind, that would be helpful.

> For persistent allocations, naming, and permissions automatically
> recalled by the kernel, use filesystem-DAX. For a userspace helper

I'd agree with that guidance if it wasn't for the fact that device dax
was born out of the need to be able to flush dirty data in a safe manner
from userspace.  At best, we're giving mixed guidance to application
developers.

-Jeff

> library and utility for manipulating device-dax instances see libdaxctl
> and the daxctl utility here: https://github.com/pmem/ndctl
>
> ---
>
> Dan Williams (8):
>       dax: add region-available-size attribute
>       dax: add region 'id', 'size', and 'align' attributes
>       dax: register seed device
>       dax: use multi-order radix for resource lookup
>       dax: refactor locking out of size calculation routines
>       dax: sub-division support
>       dax: add / remove dax devices after provisioning
>       dax: add debug for region available_size
>
>
>  drivers/dax/Kconfig |    1 
>  drivers/dax/dax.c   |  747 ++++++++++++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 698 insertions(+), 50 deletions(-)


* Re: [PATCH 0/8] device-dax: sub-division support
  2016-12-12 17:15 ` [PATCH 0/8] device-dax: sub-division support Jeff Moyer
@ 2016-12-12 18:46   ` Dan Williams
  2016-12-13 23:46     ` Jeff Moyer
  0 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2016-12-12 18:46 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-nvdimm, linux-kernel

On Mon, Dec 12, 2016 at 9:15 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Hi, Dan,
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
>> From [PATCH 6/8] dax: sub-division support:
>>
>> Device-DAX is a mechanism to establish mappings of performance / feature
>> differentiated memory with strict fault behavior guarantees.  With
>> sub-division support a platform owner can provision sub-allocations of a
>> dax-region into separate devices. The provisioning mechanism follows the
>> same scheme as the libnvdimm sub-system in that a 'seed' device is
>> created at initialization time that can be resized from zero to become
>> enabled.
>>
>> Unlike the nvdimm sub-system there is no on media labelling scheme
>> associated with this partitioning. Provisioning decisions are ephemeral
>> / not automatically restored after reboot. While the initial use case of
>> device-dax is persistent memory, other use cases may be volatile, so the
>> device-dax core is unable to assume the underlying memory is pmem.  The
>> task of recalling a partitioning scheme or permissions on the device(s)
>> is left to userspace.
>
> Can you explain this reasoning in a bit more detail, please?  If you
> have specific use cases in mind, that would be helpful.

A few use cases are top of mind:

* userspace persistence support: filesystem-DAX as implemented in XFS
and EXT4 requires filesystem coordination for persistence; device-dax
does not. An application may not need a full namespace worth of
persistent memory, or may want to dynamically resize the amount of
persistent memory it is consuming. This enabling allows online resize
of device-dax file/instance.

* allocation + access mechanism for performance differentiated memory:
Persistent memory is one example of a reserved memory pool with
different performance characteristics than typical DRAM in a system,
and there are examples of other performance differentiated memory
pools (high bandwidth or low latency) showing up on commonly available
platforms. This mechanism gives purpose-built applications (high
performance computing, databases, etc...) a way to establish mappings
with predictable fault granularities and performance, but also allows
for different permissions per allocation (a minimal mmap sketch follows
after this list).

* carving up a PCI-E device memory bar for managing peer-to-peer
transactions: In the thread about enabling P2P DMA one of the
concerns that was raised was security separation of different users of
a device: http://marc.info/?l=linux-kernel&m=148106083913173&w=2
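
To make the second use case above concrete, here is a minimal
user-space sketch of the access model (illustration only, not part of
this patch set; it assumes a /dev/dax0.0 instance with a 2MB
alignment):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MAP_LEN (16UL << 21)    /* 32MB: a multiple of the 2MB granularity */

    int main(void)
    {
            int fd = open("/dev/dax0.0", O_RDWR);

            if (fd < 0) {
                    perror("open");
                    return EXIT_FAILURE;
            }

            /* device-dax only allows shared mappings, aligned to the region */
            void *buf = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
            if (buf == MAP_FAILED) {
                    perror("mmap");
                    close(fd);
                    return EXIT_FAILURE;
            }

            /* faults resolve at the region's alignment (2MB here), never smaller */
            ((char *)buf)[0] = 1;

            munmap(buf, MAP_LEN);
            close(fd);
            return EXIT_SUCCESS;
    }
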

>> For persistent allocations, naming, and permissions automatically
>> recalled by the kernel, use filesystem-DAX. For a userspace helper
>
> I'd agree with that guidance if it wasn't for the fact that device dax
> was born out of the need to be able to flush dirty data in a safe manner
> from userspace.  At best, we're giving mixed guidance to application
> developers.

Yes, but at the same time device-DAX is sufficiently painful (no
read(2)/write(2) support, no builtin metadata support) that it may
spur application developers to lobby for a filesystem that offers
userspace dirty-data flushing. Until then we have this vehicle to test
the difference and dax-support for memory types beyond persistent
memory.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/8] device-dax: sub-division support
  2016-12-12 18:46   ` Dan Williams
@ 2016-12-13 23:46     ` Jeff Moyer
  2016-12-14  1:17       ` Dan Williams
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Moyer @ 2016-12-13 23:46 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm, linux-kernel

Hi, Dan,

In general, I have a couple of concerns with this patchset:
1) You're making a case that subdivision shouldn't be persistent, which
   means that all of the code we already have for subdividing devices
   (partitions, libnvdimm) has to be re-invented in userspace, and
   existing tools can't be used to manage nvdimms.
2) You're pushing file system features into a character device.

I think that using device dax for both volatile and non-volatile
memories is a mistake.  For persistent memory, I think users would want
any subdivision to be persistent.  I also think that using a familiar
storage model, like block devices and partitions, would make a heck of a
lot more sense than this proposal.  For volatile use cases, I don't have
a problem with what you've proposed.  But then, I don't really think too
much about those use cases, either, so maybe I'm not the best person to
ask.

So, in my opinion, you should make device dax all about the volatile use
case and we can go back to pushing dax for block devices to support use
cases like big databases and passing NVDIMMs into VMs.  Yes, I'm signing
up to help.

More detailed responses are inline below.

Dan Williams <dan.j.williams@intel.com> writes:

> On Mon, Dec 12, 2016 at 9:15 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Hi, Dan,
>>
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>>>>From [PATCH 6/8] dax: sub-division support:
>>>
>>> Device-DAX is a mechanism to establish mappings of performance / feature
>>> differentiated memory with strict fault behavior guarantees.  With
>>> sub-division support a platform owner can provision sub-allocations of a
>>> dax-region into separate devices. The provisioning mechanism follows the
>>> same scheme as the libnvdimm sub-system in that a 'seed' device is
>>> created at initialization time that can be resized from zero to become
>>> enabled.
>>>
>>> Unlike the nvdimm sub-system there is no on media labelling scheme
>>> associated with this partitioning. Provisioning decisions are ephemeral
>>> / not automatically restored after reboot. While the initial use case of
>>> device-dax is persistent memory, other use cases may be volatile, so the
>>> device-dax core is unable to assume the underlying memory is pmem.  The
>>> task of recalling a partitioning scheme or permissions on the device(s)
>>> is left to userspace.
>>
>> Can you explain this reasoning in a bit more detail, please?  If you
>> have specific use cases in mind, that would be helpful.
> 
> A few use cases are top of mind:
>
> * userspace persistence support: filesystem-DAX as implemented in XFS
> and EXT4 requires filesystem coordination for persistence, device-dax
> does not. An application may not need a full namespace worth of
> persistent memory, or may want to dynamically resize the amount of
> persistent memory it is consuming. This enabling allows online resize
> of device-dax file/instance.

OK, so you've now implemented file extending and truncation (and block
mapping, I guess).  Where does this end?  How many more file-system
features will you add to this character device?

> * allocation + access mechanism for performance differentiated memory:
> Persistent memory is one example of a reserved memory pool with
> different performance characteristics than typical DRAM in a system,
> and there are examples of other performance differentiated memory
> pools (high bandwidth or low latency) showing up on commonly available
> platforms. This mechanism gives purpose built applications (high
> performance computing, databases, etc...) a way to establish mappings
> with predictable fault-granularities and performance, but also allow
> for different permissions per allocation.

So, how would an application that wishes to use a device-dax subdivision
of performance differentiated memory get access to it?
1) administrator subdivides space and assigns it to a user
2) application gets to use it

Something like that?  Or do you expect applications to sub-divide the
device-dax instance programmatically?  Why wouldn't you want the mapping
to live beyond a single boot?

> * carving up a PCI-E device memory bar for managing peer-to-peer
> transactions: In the thread about enabling P2P DMA one of the
> concerns that was raised was security separation of different users of
> a device: http://marc.info/?l=linux-kernel&m=148106083913173&w=2

OK, but I wasn't sure that there was consensus in that thread.  It
seemed more likely that the block device ioctl path would be pursued.
If this is the preferred method, I think you should document their
requirements and show how the implementation meets them, instead of
leaving that up to reviewers.  Or, at the very least, CC the interested
parties?

>>> For persistent allocations, naming, and permissions automatically
>>> recalled by the kernel, use filesystem-DAX. For a userspace helper
>>
>> I'd agree with that guidance if it wasn't for the fact that device dax
>> was born out of the need to be able to flush dirty data in a safe manner
>> from userspace.  At best, we're giving mixed guidance to application
>> developers.
>
> Yes, but at the same time device-DAX is sufficiently painful (no
> read(2)/write(2) support, no builtin metadata support) that it may
> spur application developers to lobby for a filesystem that offers
> userspace dirty-data flushing. Until then we have this vehicle to test
> the difference and dax-support for memory types beyond persistent
> memory.

Let's just work on the PMEM_IMMUTABLE flag that Dave suggested[1] and
make device dax just for volatile memories.

-Jeff

[1] http://lkml.iu.edu/hypermail/linux/kernel/1609.1/05372.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/8] device-dax: sub-division support
  2016-12-13 23:46     ` Jeff Moyer
@ 2016-12-14  1:17       ` Dan Williams
  2016-12-15 16:50         ` Jeff Moyer
  0 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2016-12-14  1:17 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-nvdimm, linux-kernel, Dave Hansen

On Tue, Dec 13, 2016 at 3:46 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Hi, Dan,
>
> In general, I have a couple of concerns with this patchset:
> 1) You're making a case that subdivision shouldn't be persistent, which
>    means that all of the code we already have for subdividing devices
>    (partitions, libnvdimm) has to be re-invented in userspace, and
>    existing tools can't be used to manage nvdimms.

Keep in mind that the device-dax core just operates on address ranges;
whether those address ranges are persistent or not is invisible to the
core. The core simply can not assume that the address ranges it is
managing are persistent or volatile. For environments that want to use
traditional partitioning or libnvdimm namespace labels, nothing stops
them. The difference is just a mode setting at the namespace level,
for example:

    ndctl create-namespace --reconfig=namespace0.0 --mode=dax --force
    ndctl create-namespace --reconfig=namespace0.0 --mode=memory --force

Also recall that we have namespace sub-division support, with
persistent labels, that was added in 4.9. So instead of being limited
to one namespace per pmem-region we can now support multiple
namespaces per region.

> 2) You're pushing file system features into a character device.

Yes, just like a block device the device-dax instances support
sub-division and a unified inode. But, unlike a block device, no
entanglements with the page cache or other higher level file api
features.

> I think that using device dax for both volatile and non-volatile
> memories is a mistake.  For persistent memory, I think users would want
> any subdivision to be persistent.  I also think that using a familiar
> storage model, like block devices and partitions, would make a heck of a
> lot more sense than this proposal.  For volatile use cases, I don't have
> a problem with what you've proposed.  But then, I don't really think too
> much about those use cases, either, so maybe I'm not the best person to
> ask.
>
> So, in my opinion, you should make device dax all about the volatile use
> case and we can go back to pushing dax for block devices to support use
> cases like big databases and passing NVDIMMs into VMs.  Yes, I'm signing
> up to help.
>
> More detailed responses are inline below.
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
>> On Mon, Dec 12, 2016 at 9:15 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>> Hi, Dan,
>>>
>>> Dan Williams <dan.j.williams@intel.com> writes:
>>>
>>>>>From [PATCH 6/8] dax: sub-division support:
>>>>
>>>> Device-DAX is a mechanism to establish mappings of performance / feature
>>>> differentiated memory with strict fault behavior guarantees.  With
>>>> sub-division support a platform owner can provision sub-allocations of a
>>>> dax-region into separate devices. The provisioning mechanism follows the
>>>> same scheme as the libnvdimm sub-system in that a 'seed' device is
>>>> created at initialization time that can be resized from zero to become
>>>> enabled.
>>>>
>>>> Unlike the nvdimm sub-system there is no on media labelling scheme
>>>> associated with this partitioning. Provisioning decisions are ephemeral
>>>> / not automatically restored after reboot. While the initial use case of
>>>> device-dax is persistent memory, other use cases may be volatile, so the
>>>> device-dax core is unable to assume the underlying memory is pmem.  The
>>>> task of recalling a partitioning scheme or permissions on the device(s)
>>>> is left to userspace.
>>>
>>> Can you explain this reasoning in a bit more detail, please?  If you
>>> have specific use cases in mind, that would be helpful.
>>
>> A few use cases are top of mind:
>>
>> * userspace persistence support: filesystem-DAX as implemented in XFS
>> and EXT4 requires filesystem coordination for persistence, device-dax
>> does not. An application may not need a full namespace worth of
>> persistent memory, or may want to dynamically resize the amount of
>> persistent memory it is consuming. This enabling allows online resize
>> of device-dax file/instance.
>
> OK, so you've now implemented file extending and truncation (and block
> mapping, I guess).  Where does this end?  How many more file-system
> features will you add to this character device?
>

It ends here. A device-file per sub-allocation of a memory range is
the bare minimum to take device-dax from being a toy to something
usable in practice. Device-DAX will never support read(2)/write(2),
never support MAP_PRIVATE, and being DAX it will never interact with
the page cache which eliminates most of the rest of the file apis. It
will also never support higher order mm capabilities like overcommit
and migration.
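
For reference, those restrictions boil down to a per-vma check along
these lines at mmap/fault time (a simplified sketch, not the literal
driver code; the helper name is made up):

    /* simplified sketch of the vma checks device-dax applies on mmap/fault */
    static int example_check_vma(struct dax_dev *dax_dev,
                    struct vm_area_struct *vma)
    {
            struct dax_region *dax_region = dax_dev->region;
            unsigned long mask = dax_region->align - 1;

            /* no MAP_PRIVATE: only shared mappings are allowed */
            if (!(vma->vm_flags & VM_SHARED))
                    return -EINVAL;

            /* start/end must honor the region's fault granularity */
            if ((vma->vm_start & mask) || (vma->vm_end & mask))
                    return -EINVAL;

            return 0;
    }
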

>> * allocation + access mechanism for performance differentiated memory:
>> Persistent memory is one example of a reserved memory pool with
>> different performance characteristics than typical DRAM in a system,
>> and there are examples of other performance differentiated memory
>> pools (high bandwidth or low latency) showing up on commonly available
>> platforms. This mechanism gives purpose built applications (high
>> performance computing, databases, etc...) a way to establish mappings
>> with predictable fault-granularities and performance, but also allow
>> for different permissions per allocation.
>
> So, how would an application that wishes to use a device-dax subdivision
> of performance differentiated memory get access to it?
> 1) administrator subdivides space and assigns it to a user
> 2) application gets to use it
>
> Something like that?  Or do you expect applications to sub-divide the
> device-dax instance programmatically?

No, not programmatically. I would expect this to be a provisioning-time
setup operation when the server/service is instantiated.

> Why wouldn't you want the mapping
> to live beyond a single boot?

This goes back to not being able to assume that the media is
persistent. If an application/use case needs the kernel to recall
provisioning decisions then that application needs to stick to
libnvdimm namespace labels, block device partitions, or a filesystem.

>> * carving up a PCI-E device memory bar for managing peer-to-peer
>> transactions: In the thread about enabling P2P DMA one of the
>> concerns that was raised was security separation of different users of
>> a device: http://marc.info/?l=linux-kernel&m=148106083913173&w=2
>
> OK, but I wasn't sure that there was consensus in that thread.  It
> seemed more likely that the block device ioctl path would be pursued.
> If this is the preferred method, I think you should document their
> requirements and show how the implementation meets them, instead of
> leaving that up to reviewers.  Or, at the very least, CC the interested
> parties?

I put those details here [1]. That thread did try to gather
requirements, but got muddled between graphics, mm, and RDMA concerns.
Device-dax is not attempting to solve problems outside of its core use
of allowing an application to statically allocate reserved memory. If
it works by accident for a P2P RDMA use case, great, but to your
earlier concern we're not going to chase that use case with ever more
device-dax features.

[1]: http://marc.info/?l=linux-kernel&m=147983832620658&w=2

>>>> For persistent allocations, naming, and permissions automatically
>>>> recalled by the kernel, use filesystem-DAX. For a userspace helper
>>>
>>> I'd agree with that guidance if it wasn't for the fact that device dax
>>> was born out of the need to be able to flush dirty data in a safe manner
>>> from userspace.  At best, we're giving mixed guidance to application
>>> developers.
>>
>> Yes, but at the same time device-DAX is sufficiently painful (no
>> read(2)/write(2) support, no builtin metadata support) that it may
>> spur application developers to lobby for a filesystem that offers
>> userspace dirty-data flushing. Until then we have this vehicle to test
>> the difference and dax-support for memory types beyond persistent
>> memory.
>
> Let's just work on the PMEM_IMMUTABLE flag that Dave suggested[1] and
> make device dax just for volatile memories.

Yes, let's work on PMEM_IMMUTABLE, and in the meantime we have
device-dax. It's not a zero sum situation.

Device-dax handles physical memory ranges generically, if you want to
"make device dax just for volatile memories", that's a user decision
to not give persistent memory ranges to device-dax.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/8] dax: add region-available-size attribute
  2016-12-11  6:28 ` [PATCH 1/8] dax: add region-available-size attribute Dan Williams
@ 2016-12-14 14:38   ` Johannes Thumshirn
  2016-12-14 15:53     ` Dan Williams
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Thumshirn @ 2016-12-14 14:38 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm, linux-kernel

Hi Dan,

On Sat, Dec 10, 2016 at 10:28:30PM -0800, Dan Williams wrote:
> In preparation for a facility that enables dax regions to be
> sub-divided, introduce a 'dax/available_size' attribute.  This attribute
> appears under the parent device that registered the device-dax region,
> and it assumes that the device-dax-core owns the driver-data for that
> device.
> 
> 'dax/available_size' adjusts dynamically as dax-device instances are
> registered and unregistered.
> 
> As a side effect of using __request_region() to reserve capacity from
> the dax_region we now track pointers to those returned resources rather
> than duplicating the passed in resource array.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---

[...]

> +static const struct attribute_group *dax_region_attribute_groups[] = {
> +	&dax_region_attribute_group,
> +	NULL,
>  };
>  
>  static struct inode *dax_alloc_inode(struct super_block *sb)
> @@ -200,12 +251,27 @@ void dax_region_put(struct dax_region *dax_region)
>  }
>  EXPORT_SYMBOL_GPL(dax_region_put);
>  
> +

Stray extra newline?

[...]

>  struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>  		struct resource *res, unsigned int align, void *addr,
>  		unsigned long pfn_flags)
>  {
>  	struct dax_region *dax_region;
>  
> +	if (dev_get_drvdata(parent)) {
> +		dev_WARN(parent, "dax core found drvdata already in use\n");
> +		return NULL;
> +	}
> +

My first thought was, it might be interesting to see who already claimed
the drvdata. Then I figured, how are multiple sub-regions of a dax-device
supposed to work? What am I missing here?

>  	if (!IS_ALIGNED(res->start, align)
>  			|| !IS_ALIGNED(resource_size(res), align))
>  		return NULL;
> @@ -213,16 +279,26 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>  	dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL);
>  	if (!dax_region)
>  		return NULL;
> -
> -	memcpy(&dax_region->res, res, sizeof(*res));
> +	dev_set_drvdata(parent, dax_region);
> +	dax_region->res.name = dev_name(parent);
> +	dax_region->res.start = res->start;
> +	dax_region->res.end = res->end;
>  	dax_region->pfn_flags = pfn_flags;
> +	mutex_init(&dax_region->lock);
>  	kref_init(&dax_region->kref);
>  	dax_region->id = region_id;
>  	ida_init(&dax_region->ida);
>  	dax_region->align = align;
>  	dax_region->dev = parent;
>  	dax_region->base = addr;
> +	if (sysfs_create_groups(&parent->kobj, dax_region_attribute_groups)) {
> +		kfree(dax_region);
> +		return NULL;
> +	}
>  
> +	kref_get(&dax_region->kref);
> +	if (devm_add_action_or_reset(parent, dax_region_unregister, dax_region))
> +		return NULL;
>  	return dax_region;
>  }
>  EXPORT_SYMBOL_GPL(alloc_dax_region);

[...]

> @@ -568,28 +654,42 @@ struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
>  	struct cdev *cdev;
>  	dev_t dev_t;
>  
> -	dax_dev = kzalloc(sizeof(*dax_dev) + sizeof(*res) * count, GFP_KERNEL);
> +	dax_dev = kzalloc(sizeof(*dax_dev), GFP_KERNEL);
>  	if (!dax_dev)
>  		return ERR_PTR(-ENOMEM);
>  
> +	dax_dev->res = kzalloc(sizeof(res) * count, GFP_KERNEL);

	dax_dev->res = kcalloc(count, sizeof(res), GFP_KERNEL); ?

> +	if (!dax_dev->res)
> +		goto err_res;
> +

Byte,
	Johannes

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/8] dax: refactor locking out of size calculation routines
  2016-12-11  6:28 ` [PATCH 5/8] dax: refactor locking out of size calculation routines Dan Williams
@ 2016-12-14 15:01   ` Johannes Thumshirn
  2016-12-14 15:55     ` Dan Williams
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Thumshirn @ 2016-12-14 15:01 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm, linux-kernel

Hi Dan,

On Sat, Dec 10, 2016 at 10:28:51PM -0800, Dan Williams wrote:
> In preparation for other callers of these routines make the locking the
> responsibility of the caller.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/dax/dax.c |   30 ++++++++++++++++++++++++------
>  1 file changed, 24 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
> index d878a56cf3e3..5b65eaff6ace 100644
> --- a/drivers/dax/dax.c
> +++ b/drivers/dax/dax.c
> @@ -90,11 +90,11 @@ static unsigned long long dax_region_avail_size(
>  	unsigned long long size;
>  	struct resource *res;
>  
> -	mutex_lock(&dax_region->lock);
> +	WARN_ON_ONCE(!mutex_is_locked(&dax_region->lock));

I'd prefer lockdep_assert_held(&dax_region->lock) here. This of course has the
drawback that it won't trigger unless lockdep is enabled, but I don't think it's
the responsibility of a production kernel to have these warnings anyway. On the
flip side you get all the lockdep beauty for free with it.

> +
>  	size = resource_size(&dax_region->res);
>  	for_each_dax_region_resource(dax_region, res)
>  		size -= resource_size(res);
> -	mutex_unlock(&dax_region->lock);
>  
>  	return size;
>  }

[...]

> +static unsigned long long dax_dev_size(struct dax_dev *dax_dev)
>  {
> -	struct dax_dev *dax_dev = to_dax_dev(dev);
> +	struct dax_region *dax_region = dax_dev->region;
>  	unsigned long long size = 0;
>  	int i;
>  
> +	WARN_ON_ONCE(!mutex_is_locked(&dax_region->lock));
> +

See above

>  	for (i = 0; i < dax_dev->num_resources; i++)
>  		size += resource_size(dax_dev->res[i]);
>  
> +	return size;
> +}
> +

Byte,
	Johannes

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/8] dax: add region-available-size attribute
  2016-12-14 14:38   ` Johannes Thumshirn
@ 2016-12-14 15:53     ` Dan Williams
  2016-12-15  6:47       ` Dan Williams
  0 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2016-12-14 15:53 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-nvdimm@lists.01.org, linux-kernel

On Wed, Dec 14, 2016 at 6:38 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> Hi Dan,
>
> On Sat, Dec 10, 2016 at 10:28:30PM -0800, Dan Williams wrote:
>> In preparation for a facility that enables dax regions to be
>> sub-divided, introduce a 'dax/available_size' attribute.  This attribute
>> appears under the parent device that registered the device-dax region,
>> and it assumes that the device-dax-core owns the driver-data for that
>> device.
>>
>> 'dax/available_size' adjusts dynamically as dax-device instances are
>> registered and unregistered.
>>
>> As a side effect of using __request_region() to reserve capacity from
>> the dax_region we now track pointers to those returned resources rather
>> than duplicating the passed in resource array.
>>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>
> [...]
>
>> +static const struct attribute_group *dax_region_attribute_groups[] = {
>> +     &dax_region_attribute_group,
>> +     NULL,
>>  };
>>
>>  static struct inode *dax_alloc_inode(struct super_block *sb)
>> @@ -200,12 +251,27 @@ void dax_region_put(struct dax_region *dax_region)
>>  }
>>  EXPORT_SYMBOL_GPL(dax_region_put);
>>
>> +
>
> Stray extra newline?
>
> [...]
>
>>  struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>>               struct resource *res, unsigned int align, void *addr,
>>               unsigned long pfn_flags)
>>  {
>>       struct dax_region *dax_region;
>>
>> +     if (dev_get_drvdata(parent)) {
>> +             dev_WARN(parent, "dax core found drvdata already in use\n");
>> +             return NULL;
>> +     }
>> +
>
> My first thought was, it might be interesting to see who already claimed
> the drvdata. Then I figured, how are multiple sub-regions of a dax-device
> supposed to work? What am I missing here?

This is a check similar to the -EBUSY return you would get from
request_mem_region(). In fact if all dax drivers are correctly calling
request_mem_region() before alloc_dax_region() then it would be
impossible for this check to ever fire. It's already impossible
because there's only one dax driver upstream (dax_pmem). It's not
really benefiting the kernel at all until we have multiple dax
drivers, I'll remove it.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/8] dax: refactor locking out of size calculation routines
  2016-12-14 15:01   ` Johannes Thumshirn
@ 2016-12-14 15:55     ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2016-12-14 15:55 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-nvdimm@lists.01.org, linux-kernel

On Wed, Dec 14, 2016 at 7:01 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> Hi Dan,
>
> On Sat, Dec 10, 2016 at 10:28:51PM -0800, Dan Williams wrote:
>> In preparation for other callers of these routines make the locking the
>> responsibility of the caller.
>>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  drivers/dax/dax.c |   30 ++++++++++++++++++++++++------
>>  1 file changed, 24 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
>> index d878a56cf3e3..5b65eaff6ace 100644
>> --- a/drivers/dax/dax.c
>> +++ b/drivers/dax/dax.c
>> @@ -90,11 +90,11 @@ static unsigned long long dax_region_avail_size(
>>       unsigned long long size;
>>       struct resource *res;
>>
>> -     mutex_lock(&dax_region->lock);
>> +     WARN_ON_ONCE(!mutex_is_locked(&dax_region->lock));
>
> I'd prefer lockdep_assert_held(&dax_region->lock) here. This of course has the
> drawback that it won't trigger unless lockdep is enabled, but I don't think it's
> the responsibility of a production kernel to have these warnings anyway. On the
> flip side you get all the lockdep beauty for free with it.

True, yes, it's only a developer debug warning. I'll change it and fix
up the other pattern like this in libnvdimm.
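
Roughly, the hunk above would then look like this (sketch):

    static unsigned long long dax_region_avail_size(
                    struct dax_region *dax_region)
    {
            unsigned long long size;
            struct resource *res;

            /* a no-op unless lockdep is enabled, as you note */
            lockdep_assert_held(&dax_region->lock);

            size = resource_size(&dax_region->res);
            for_each_dax_region_resource(dax_region, res)
                    size -= resource_size(res);

            return size;
    }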

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/8] dax: add region-available-size attribute
  2016-12-14 15:53     ` Dan Williams
@ 2016-12-15  6:47       ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2016-12-15  6:47 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-nvdimm@lists.01.org, linux-kernel

On Wed, Dec 14, 2016 at 7:53 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, Dec 14, 2016 at 6:38 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
>> Hi Dan,
>>
>> On Sat, Dec 10, 2016 at 10:28:30PM -0800, Dan Williams wrote:
>>> In preparation for a facility that enables dax regions to be
>>> sub-divided, introduce a 'dax/available_size' attribute.  This attribute
>>> appears under the parent device that registered the device-dax region,
>>> and it assumes that the device-dax-core owns the driver-data for that
>>> device.
>>>
>>> 'dax/available_size' adjusts dynamically as dax-device instances are
>>> registered and unregistered.
>>>
>>> As a side effect of using __request_region() to reserve capacity from
>>> the dax_region we now track pointers to those returned resources rather
>>> than duplicating the passed in resource array.
>>>
>>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>>> ---
>>
>> [...]
>>
>>> +static const struct attribute_group *dax_region_attribute_groups[] = {
>>> +     &dax_region_attribute_group,
>>> +     NULL,
>>>  };
>>>
>>>  static struct inode *dax_alloc_inode(struct super_block *sb)
>>> @@ -200,12 +251,27 @@ void dax_region_put(struct dax_region *dax_region)
>>>  }
>>>  EXPORT_SYMBOL_GPL(dax_region_put);
>>>
>>> +
>>
>> Stray extra newline?
>>
>> [...]
>>
>>>  struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>>>               struct resource *res, unsigned int align, void *addr,
>>>               unsigned long pfn_flags)
>>>  {
>>>       struct dax_region *dax_region;
>>>
>>> +     if (dev_get_drvdata(parent)) {
>>> +             dev_WARN(parent, "dax core found drvdata already in use\n");
>>> +             return NULL;
>>> +     }
>>> +
>>
>> My first thought was, it might be interesting to see who already claimed
>> the drvdata. Then I figured, how are multiple sub-regions of a dax-device
>> supposed to work? What am I missing here?
>
> This is a check similar to the -EBUSY return you would get from
> request_mem_region(). In fact if all dax drivers are correctly calling
> request_mem_region() before alloc_dax_region() then it would be
> impossible for this check to ever fire. It's already impossible
> because there's only one dax driver upstream (dax_pmem). It's not
> really benefiting the kernel at all until we have multiple dax
> drivers, I'll remove it.

No, I went to delete this and remembered the real reason this was
added. A device driver that calls alloc_dax_region() commits to
letting the dax core own dev->driver_data. Since this wasn't even
clear to me, I'll go fix up the comment.
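
For illustration, the convention being described looks roughly like
this (a sketch modeled on the dax_pmem flow; error unwinding and the
devm_memremap_pages() step that produces the mapped address are
elided, and the "example_dax" name is made up):

    static int example_dax_probe(struct device *dev, struct resource *res,
                    void *addr)
    {
            struct dax_region *dax_region;

            /* claim the range first; a second claimant fails with -EBUSY */
            if (!devm_request_mem_region(dev, res->start, resource_size(res),
                                    dev_name(dev)))
                    return -EBUSY;

            /*
             * calling alloc_dax_region() commits the driver to letting the
             * dax core own dev->driver_data -- no dev_set_drvdata() of its
             * own; region id 0 and the 2MB alignment are example values
             */
            dax_region = alloc_dax_region(dev, 0, res, SZ_2M, addr,
                            PFN_DEV | PFN_MAP);
            if (!dax_region)
                    return -ENOMEM;

            /* ... create the dax device instance(s) against the region ... */

            return 0;
    }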

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/8] device-dax: sub-division support
  2016-12-14  1:17       ` Dan Williams
@ 2016-12-15 16:50         ` Jeff Moyer
  2016-12-15 23:48           ` Dan Williams
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Moyer @ 2016-12-15 16:50 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm, linux-kernel, Dave Hansen

Hi, Dan,

Dan Williams <dan.j.williams@intel.com> writes:

> On Tue, Dec 13, 2016 at 3:46 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Hi, Dan,
>>
>> In general, I have a couple of concerns with this patchset:
>> 1) You're making a case that subdivision shouldn't be persistent, which
>>    means that all of the code we already have for subdividing devices
>>    (partitions, libnvdimm) has to be re-invented in userspace, and
>>    existing tools can't be used to manage nvdimms.
>
> Keep in mind that the device-dax core just operates on address ranges,
> whether those address ranges are persistent or not is invisible to the
> core. The core simply can not assume that the address ranges it is
> managing are persistent or volatile. For environments that want to use
> traditional partitioning or libnvdimm namespace labels, nothing stops
> them. The difference is just a mode setting at the namespace level,
> for example:
>
>     ndctl create-namespace --reconfig=namespace0.0 --mode=dax --force
>     ndctl create-namespace --reconfig=namespace0.0 --mode=memory --force
>
> Also recall that we have namespace sub-division support, with
> persistent labels, that was added in 4.9. So instead of being limited
> to one namespace per pmem-region we can now support multiple
> namespaces per region.

Namespace subdivision requires label support in the NVDIMM, and given
that there are no NVDIMMs out there today that support labels, that's
not an option.

It makes a heck of a lot more sense to continue to manage storage via a
block device.  I know that DAX support for block devices ran into
roadblocks before, but I'm willing to give it another try.  You
mentioned on irc that we may be able to emulate label support for
DIMMs that don't have it.  I guess that would be another way forward.
Did you have any ideas on how that might be implemented?

>> 2) You're pushing file system features into a character device.
>
> Yes, just like a block device the device-dax instances support
> sub-division and a unified inode.

I'm not sure what you mean by a unified inode.

> But, unlike a block device, no entanglements with the page cache or
> other higher level file api features.

>> OK, so you've now implemented file extending and truncation (and block
>> mapping, I guess).  Where does this end?  How many more file-system
>> features will you add to this character device?
>>
>
> It ends here. A device-file per sub-allocation of a memory range is
> the bare minimum to take device-dax from being a toy to something
> usable in practice. Device-DAX will never support read(2)/write(2),
> never support MAP_PRIVATE, and being DAX it will never interact with
> the page cache which eliminates most of the rest of the file apis. It
> will also never support higher order mm capabilities like overcommit
> and migration.

Well, Dave Jiang posted patches to add fallocate support.  So it didn't
quite end there.

>>> * allocation + access mechanism for performance differentiated memory:
>>> Persistent memory is one example of a reserved memory pool with
>>> different performance characteristics than typical DRAM in a system,
>>> and there are examples of other performance differentiated memory
>>> pools (high bandwidth or low latency) showing up on commonly available
>>> platforms. This mechanism gives purpose built applications (high
>>> performance computing, databases, etc...) a way to establish mappings
>>> with predictable fault-granularities and performance, but also allow
>>> for different permissions per allocation.
>>
>> So, how would an application that wishes to use a device-dax subdivision
>> of performance differentiated memory get access to it?
>> 1) administrator subdivides space and assigns it to a user
>> 2) application gets to use it
>>
>> Something like that?  Or do you expect applications to sub-divide the
>> device-dax instance programmatically?
>
> No, not programmatically, I would expect this would be a provisioning
> time setup operation when the server/service is instantiated.

That's a terrible model for storage.  If you're going to continue on
this path, then I'd suggest that the moment the namespace is converted
to be "device dax", the initial device should have a size of 0.  At
least that way nobody can accidentally open it and scribble all over
the full device.

>> Why wouldn't you want the mapping to live beyond a single boot?
>
> This goes back to not being able to assume that the media is
> persistent. If an application/use case needs the kernel to recall
> provisioning decisions then that application needs to stick to
> libnvdimm namespace labels, block device partitions, or a filesystem.

You can't have it both ways.  Either device-dax is meant for persistent
memory or it isn't.  You're stating that the right way to divide up a
persistent memory namespace is to use labels, which don't exist.  Then
you're proposing this method for dividing up device-dax as well, without
anybody from the non-persistent memory camp even chiming in that this is
something that they want.  What is the urgency here, and where are the
users?

I can only conclude that you actually do intend the subdivision to be
used for persistent memory, and I'm telling you that what you've
implemented doesn't fit that use case well at all.

>>> * carving up a PCI-E device memory bar for managing peer-to-peer
>>> transactions: In the thread about enabling P2P DMA one of the
>>> concerns that was raised was security separation of different users of
>>> a device: http://marc.info/?l=linux-kernel&m=148106083913173&w=2
>>
>> OK, but I wasn't sure that there was consensus in that thread.  It
>> seemed more likely that the block device ioctl path would be pursued.
>> If this is the preferred method, I think you should document their
>> requirements and show how the implementation meets them, instead of
>> leaving that up to reviewers.  Or, at the very least, CC the interested
>> parties?
>
> I put those details here [1]. That thread did try to gather
> requirements, but got muddled between graphics, mm, and RDMA concerns.
> Device-dax is not attempting to solve problems outside of its core use
> of allowing an application to statically allocate reserved memory. If
> it works by accident for a P2P RDMA use case, great, but to your
> earlier concern we're not going to chase that use case with ever more
> device-dax features.
>
> [1]: http://marc.info/?l=linux-kernel&m=147983832620658&w=2

That's a pretty thin proposal.  I'd much rather see the rest of the
supporting code implemented as a proof of concept before we start taking
interfaces into the kernel.  If your only justification is to use this
with persistent memory, then I'm telling you I think it's a bad
interface.

>>>>> For persistent allocations, naming, and permissions automatically
>>>>> recalled by the kernel, use filesystem-DAX. For a userspace helper
>>>>
>>>> I'd agree with that guidance if it wasn't for the fact that device dax
>>>> was born out of the need to be able to flush dirty data in a safe manner
>>>> from userspace.  At best, we're giving mixed guidance to application
>>>> developers.
>>>
>>> Yes, but at the same time device-DAX is sufficiently painful (no
>>> read(2)/write(2) support, no builtin metadata support) that it may
>>> spur application developers to lobby for a filesystem that offers
>>> userspace dirty-data flushing. Until then we have this vehicle to test
>>> the difference and dax-support for memory types beyond persistent
>>> memory.
>>
>> Let's just work on the PMEM_IMMUTABLE flag that Dave suggested[1] and
>> make device dax just for volatile memories.
>
> Yes, let's work on PMEM_IMMUTABLE, and in the meantime we have
> device-dax. It's not a zero sum situation.
>
> Device-dax handles physical memory ranges generically, if you want to
> "make device dax just for volatile memories", that's a user decision
> to not give persistent memory ranges to device-dax.

Right now, your only use case is persistent memory, and I don't think
this is the right interface for it.  Clearly someone is asking for this
support.  Can you convince them to chime in on the mailing list with
their requirements?  Alternatively, can you state what the requirements
were that lead to this solution?

Thanks,
Jeff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/8] device-dax: sub-division support
  2016-12-15 16:50         ` Jeff Moyer
@ 2016-12-15 23:48           ` Dan Williams
  2016-12-16  2:33             ` Dan Williams
  0 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2016-12-15 23:48 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-nvdimm, linux-kernel, Dave Hansen

On Thu, Dec 15, 2016 at 8:50 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Hi, Dan,
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
>> On Tue, Dec 13, 2016 at 3:46 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>> Hi, Dan,
>>>
>>> In general, I have a couple of concerns with this patchset:
>>> 1) You're making a case that subdivision shouldn't be persistent, which
>>>    means that all of the code we already have for subdividing devices
>>>    (partitions, libnvdimm) has to be re-invented in userspace, and
>>>    existing tools can't be used to manage nvdimms.
>>
>> Keep in mind that the device-dax core just operates on address ranges,
>> whether those address ranges are persistent or not is invisible to the
>> core. The core simply can not assume that the address ranges it is
>> managing are persistent or volatile. For environments that want to use
>> traditional partitioning or libnvdimm namespace labels, nothing stops
>> them. The difference is just a mode setting at the namespace level,
>> for example:
>>
>>     ndctl create-namespace --reconfig=namespace0.0 --mode=dax --force
>>     ndctl create-namespace --reconfig=namespace0.0 --mode=memory --force
>>
>> Also recall that we have namespace sub-division support, with
>> persistent labels, that was added in 4.9. So instead of being limited
>> to one namespace per pmem-region we can now support multiple
>> namespaces per region.
>
> Namespace subdivision requires label support in the NVDIMM, and given
> that there are no NVDIMMs out there today that support labels, that's
> not an option.
>
> It makes a heck of a lot more sense to continue to manage storage via a
> block device.  I know that DAX support for block devices ran into
> roadblocks before, but I'm willing to give it another try.  You
> mentioned on irc that we may be able to emulate label support for
> DIMMs that don't have it.  I guess that would be another way forward.
> Did you have any ideas on how that might be implemented?

Yes, place an information block at the start of the region that
reserves space for a label. Extend the new dynamic label support [1]
to switch to label mode upon finding this signature.

Linux could go off and do this privately today, or we can work on
trying to get this scheme into a standard so that other operating
environments honor those boundaries.

[1]: 42237e393f64 libnvdimm: allow a platform to force enable label support
https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git/commit/?h=libnvdimm-for-next&id=42237e393f64
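
Purely for illustration, such an info block might carry fields along
these lines (the layout and names below are hypothetical, not a
proposed on-media format):

    struct example_label_info_block {
            u8 signature[16];           /* hypothetical magic string */
            u8 uuid[16];                /* identity of the emulated label area */
            __le32 flags;
            __le32 label_area_offset;   /* start of the reserved label storage */
            __le32 label_area_size;     /* bytes reserved for namespace labels */
            __le64 checksum;            /* e.g. fletcher64 over this block */
    };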

>>> 2) You're pushing file system features into a character device.
>>
>> Yes, just like a block device the device-dax instances support
>> sub-division and a unified inode.
>
> I'm not sure what you mean by a unified inode.

For example if I had a situation like the following:

# ls -l dax1.0 another-dax1.0
crw------- 1 root root 243, 0 Dec 14 15:08 dax1.0
crw------- 1 root root 243, 0 Dec 14 15:08 another-dax1.0

Without a "unified inode" an mmap on 'dax1.0' is isolated from an mmap
on 'another-dax1.0' . Each of those files is related to their own
inode in the devtmpfs filesystem. For dax we allocate our own inode
and redirect the i_mapping of those per-device-file inodes to the
single inode allocated per dax device. This is identical to what
happens in bd_acquire() for a block device.
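
A minimal sketch of that redirection (names simplified; the real logic
lives in the dax character-device open() path):

    static int dax_open(struct inode *inode, struct file *filp)
    {
            /* assumes the cdev is embedded in struct dax_dev */
            struct dax_dev *dax_dev = container_of(inode->i_cdev,
                            struct dax_dev, cdev);

            /*
             * whichever /dev node was opened (dax1.0, another-dax1.0, ...),
             * point it at the one inode allocated per dax device so that
             * all mmaps share a single address_space
             */
            inode->i_mapping = dax_dev->inode->i_mapping;
            filp->f_mapping = inode->i_mapping;
            filp->private_data = dax_dev;

            return 0;
    }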


>> But, unlike a block device, no entanglements with the page cache or
>> other higher level file api features.
>
>>> OK, so you've now implemented file extending and truncation (and block
>>> mapping, I guess).  Where does this end?  How many more file-system
>>> features will you add to this character device?
>>>
>>
>> It ends here. A device-file per sub-allocation of a memory range is
>> the bare minimum to take device-dax from being a toy to something
>> usable in practice. Device-DAX will never support read(2)/write(2),
>> never support MAP_PRIVATE, and being DAX it will never interact with
>> the page cache which eliminates most of the rest of the file apis. It
>> will also never support higher order mm capabilities like overcommit
>> and migration.
>
> Well, Dave Jiang posted patches to add fallocate support.  So it didn't
> quite end there.

True, sorry, I should have brought that up. I didn't want to invent a
private ioctl path to clear errors. The fallocate support is limited
to triggering a firmware "clear error" command for the given range.

>>>> * allocation + access mechanism for performance differentiated memory:
>>>> Persistent memory is one example of a reserved memory pool with
>>>> different performance characteristics than typical DRAM in a system,
>>>> and there are examples of other performance differentiated memory
>>>> pools (high bandwidth or low latency) showing up on commonly available
>>>> platforms. This mechanism gives purpose built applications (high
>>>> performance computing, databases, etc...) a way to establish mappings
>>>> with predictable fault-granularities and performance, but also allow
>>>> for different permissions per allocation.
>>>
>>> So, how would an application that wishes to use a device-dax subdivision
>>> of performance differentiated memory get access to it?
>>> 1) administrator subdivides space and assigns it to a user
>>> 2) application gets to use it
>>>
>>> Something like that?  Or do you expect applications to sub-divide the
>>> device-dax instance programmatically?
>>
>> No, not programmatically, I would expect this would be a provisioning
>> time setup operation when the server/service is instantiated.
>
> That's a terrible model for storage.  If you're going to continue on
> this path, then I'd suggest that the moment the namespace is converted
> to be "device dax", the initial device should have a size of 0.  At
> least that way nobody can accidentally open it and scribble all over
> the full device.

Fair enough, I'll make that change... or better yet just turn it off
by default for pmem, see below.

>>> Why wouldn't you want the mapping to live beyond a single boot?
>>
>> This goes back to not being able to assume that the media is
>> persistent. If an application/use case needs the kernel to recall
>> provisioning decisions then that application needs to stick to
>> libnvdimm namespace labels, block device partitions, or a filesystem.
>
> You can't have it both ways.  Either device-dax is meant for persistent
> memory or it isn't.

I'm not trying to have it both ways; I'm trying to develop an interface
that is generically useful for reserved memory where the persistence
of provisioning decisions is handled at a higher level in the
device-model hierarchy, or in userspace.

> You're stating that the right way to divide up a
> persistent memory namespace is to use labels, which don't exist.

QEMU ships support for the current label mechanisms in Linux; we
expect future platforms to be compatible with that enabling, and I gave
you a plan above for existing NVDIMMs without a label area.

> Then you're proposing this method for dividing up device-dax as well, without
> anybody from the non-persistent memory camp even chiming in that this is
> something that they want.  What is the urgency here, and where are the
> users?

The same urgency as when we merged ACPI NFIT support. We also had
support for custom-defined memory ranges (memmap=ss!nn) upstream in
advance of that.

There are platforms with these differentiated (non-persistent) memory
types [2], and device-dax is a mechanism to get guaranteed mappings of
that memory. Guaranteed mappings without playing games with
numa-distance to keep stray kernel allocations from landing in the
"precious" memory pool, a pool that is expected to be 100% available
for a given application.

[2]: https://software.intel.com/en-us/articles/mcdram-high-bandwidth-memory-on-knights-landing-analysis-methods-tools

> I can only conclude that you actually do intend the subdivision to be
> used for persistent memory, and I'm telling you that what you've
> implemented doesn't fit that use case well at all.

I do intend the base device-dax capability to be used with persistent
memory, but you're right this sub-division mechanism is terrible for
storage in the general case. Let's make it default to off for pmem,
with a module parameter override if an application really thinks it
knows what it is doing. Does that address your concern?

>>>> * carving up a PCI-E device memory bar for managing peer-to-peer
>>>> transactions: In the thread about enabling P2P DMA one of the
>>>> concerns that was raised was security separation of different users of
>>>> a device: http://marc.info/?l=linux-kernel&m=148106083913173&w=2
>>>
>>> OK, but I wasn't sure that there was consensus in that thread.  It
>>> seemed more likely that the block device ioctl path would be pursued.
>>> If this is the preferred method, I think you should document their
>>> requirements and show how the implementation meets them, instead of
>>> leaving that up to reviewers.  Or, at the very least, CC the interested
>>> parties?
>>
>> I put those details here [1]. That thread did try to gather
>> requirements, but got muddled between graphics, mm, and RDMA concerns.
>> Device-dax is not attempting to solve problems outside of its core use
>> of allowing an application to statically allocate reserved memory. If
>> it works by accident for a P2P RDMA use case, great, but to your
>> earlier concern we're not going to chase that use case with ever more
>> device-dax features.
>>
>> [1]: http://marc.info/?l=linux-kernel&m=147983832620658&w=2
>
> That's a pretty thin proposal.  I'd much rather see the rest of the
> supporting code implemented as a proof of concept before we start taking
> interfaces into the kernel.  If your only justification is to use this
> with persistent memory, then I'm telling you I think it's a bad
> interface.
>

Noted. I'm not going to merge this as is over your objection.

I still think whole namespace (indivisible) device-dax for pmem is needed.

>>>>>> For persistent allocations, naming, and permissions automatically
>>>>>> recalled by the kernel, use filesystem-DAX. For a userspace helper
>>>>>
>>>>> I'd agree with that guidance if it wasn't for the fact that device dax
>>>>> was born out of the need to be able to flush dirty data in a safe manner
>>>>> from userspace.  At best, we're giving mixed guidance to application
>>>>> developers.
>>>>
>>>> Yes, but at the same time device-DAX is sufficiently painful (no
>>>> read(2)/write(2) support, no builtin metadata support) that it may
>>>> spur application developers to lobby for a filesystem that offers
>>>> userspace dirty-data flushing. Until then we have this vehicle to test
>>>> the difference and dax-support for memory types beyond persistent
>>>> memory.
>>>
>>> Let's just work on the PMEM_IMMUTABLE flag that Dave suggested[1] and
>>> make device dax just for volatile memories.
>>
>> Yes, let's work on PMEM_IMMUTABLE, and in the meantime we have
>> device-dax. It's not a zero sum situation.
>>
>> Device-dax handles physical memory ranges generically, if you want to
>> "make device dax just for volatile memories", that's a user decision
>> to not give persistent memory ranges to device-dax.
>
> Right now, your only use case is persistent memory, and I don't think
> this is the right interface for it.  Clearly someone is asking for this
> support.  Can you convince them to chime in on the mailing list with
> their requirements?  Alternatively, can you state what the requirements
> were that led to this solution?

The feedback I received from an application developer that convinced
me to push on the device-dax interface (outside of the volatile uses)
was that it allowed applications to implement metadata redundancy. In
the filesystem-DAX case we don't have a way to identify and survive
media errors that land in the metadata that describes a dax file. With
device-dax an application at least has the visibility to manage all
the metadata that describes a dax-file.

As for sub-division it was designed for the volatile case, I've heard
feedback that some want it for the pmem case. You and I agree that
it's *not* a good interface for the pmem case, so I'm fine disabling
sub-division and telling them that they need to override a kernel
default to use this mechanism for pmem, but otherwise say "just use
libnvdimm namespace labels" for storage use cases.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/8] device-dax: sub-division support
  2016-12-15 23:48           ` Dan Williams
@ 2016-12-16  2:33             ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2016-12-16  2:33 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-nvdimm, linux-kernel, Dave Hansen

On Thu, Dec 15, 2016 at 3:48 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Thu, Dec 15, 2016 at 8:50 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
[..]
> As for sub-division it was designed for the volatile case, I've heard
> feedback that some want it for the pmem case. You and I agree that
> it's *not* a good interface for the pmem case, so I'm fine disabling
> sub-division and telling them that they need to override a kernel
> default to use this mechanism for pmem, but otherwise say "just use
> libnvdimm namespace labels" for storage use cases.

So I went to add support for making sub-division default to off and
zero-sizing the initial device when sub-division is enabled. However,
if the policy is that the initial device is zero-sized for safety, then
we also need to remember whether the dax-region was *ever* in dynamic
mode, to make sure that moving to dynamic mode is a one-way street.

If we're persisting the region setting then we might as well be
persisting the device-dax sub-division decisions. At that point we
might as well just wait for libnvdimm to grow namespace label support
for all NVDIMM types and let the provisioning happen at the level that
we agree is the right level for pmem.

Long story short, you're right, a volatile provisioning mechanism is
only tenable for volatile media. This enabling should wait until a
volatile consumer of device-dax surfaces; the dax_pmem driver should
never use it.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2016-12-16  2:40 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-11  6:28 [PATCH 0/8] device-dax: sub-division support Dan Williams
2016-12-11  6:28 ` [PATCH 1/8] dax: add region-available-size attribute Dan Williams
2016-12-14 14:38   ` Johannes Thumshirn
2016-12-14 15:53     ` Dan Williams
2016-12-15  6:47       ` Dan Williams
2016-12-11  6:28 ` [PATCH 2/8] dax: add region 'id', 'size', and 'align' attributes Dan Williams
2016-12-11  6:28 ` [PATCH 3/8] dax: register seed device Dan Williams
2016-12-11  6:28 ` [PATCH 4/8] dax: use multi-order radix for resource lookup Dan Williams
2016-12-11  6:28 ` [PATCH 5/8] dax: refactor locking out of size calculation routines Dan Williams
2016-12-14 15:01   ` Johannes Thumshirn
2016-12-14 15:55     ` Dan Williams
2016-12-11  6:28 ` [PATCH 6/8] dax: sub-division support Dan Williams
2016-12-11  6:29 ` [PATCH 7/8] dax: add / remove dax devices after provisioning Dan Williams
2016-12-11  6:29 ` [PATCH 8/8] dax: add debug for region available_size Dan Williams
2016-12-12 17:15 ` [PATCH 0/8] device-dax: sub-division support Jeff Moyer
2016-12-12 18:46   ` Dan Williams
2016-12-13 23:46     ` Jeff Moyer
2016-12-14  1:17       ` Dan Williams
2016-12-15 16:50         ` Jeff Moyer
2016-12-15 23:48           ` Dan Williams
2016-12-16  2:33             ` Dan Williams
