linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/2] libnvdimm: export wpq flush interface
@ 2017-04-24 23:49 Dan Williams
  2017-04-24 23:49 ` [PATCH v2 1/2] libnvdimm, region: fix flush hint detection crash Dan Williams
  2017-04-24 23:50 ` [PATCH v2 2/2] libnvdimm, region: sysfs trigger for nvdimm_flush() Dan Williams
  0 siblings, 2 replies; 21+ messages in thread
From: Dan Williams @ 2017-04-24 23:49 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Masayoshi Mizuma, linux-acpi, Jeff Moyer, linux-kernel, stable

Changes since v1 [1]:
* Clean up the changelog to clarify how this capability is expected to
  be used (Jeff)
* Make the attribute read / write to separate detection and flush
  triggering (Masayoshi)
* Add the flush hint detection crash fix

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009898.html

---

Quoting the changelog from patch 2:

The nvdimm_flush() mechanism helps to reduce the impact of an ADR
(asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
platform WPQ (write-pending-queue) buffers when power is removed. The
nvdimm_flush() mechanism performs that same function on-demand.

When a pmem namespace is associated with a block device, an
nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
request. These requests are typically associated with filesystem
metadata updates. However, when a namespace is in device-dax mode,
userspace (think database metadata) needs another path to perform the
same flushing. In other words this is not required to make data
persistent, but in the case of metadata it allows for a smaller failure
domain in the unlikely event of an ADR failure.
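As a rough illustration of what this series exports, a userspace consumer
might probe and trigger the region attribute like this. This is only a
sketch: the helper names are made up, and the sysfs path is an assumption
(on a real system it would be something like
/sys/bus/nd/devices/regionN/flush from patch 2).

```c
#include <stdio.h>
#include <string.h>

/* Read the region's 'flush' attribute: 1 => WPQ flush can be triggered,
 * 0 => flush is a platform nop, -1 => attribute missing/unreadable
 * (in the kernel's error case the attribute is not visible at all). */
static int region_has_wpq_flush(const char *attr_path)
{
	char buf[8] = "";
	FILE *f = fopen(attr_path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return strncmp(buf, "1", 1) == 0;
}

/* Trigger an on-demand WPQ flush by writing "1"; per the patch, only a
 * "true" value is accepted, anything else is rejected with -EINVAL. */
static int region_trigger_wpq_flush(const char *attr_path)
{
	FILE *f = fopen(attr_path, "w");

	if (!f)
		return -1;
	if (fputs("1\n", f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```

A device-dax consumer (e.g. a database managing its own metadata) would
call the trigger helper after updating metadata it wants covered by a
smaller failure domain.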

---

Dan Williams (2):
      libnvdimm, region: fix flush hint detection crash
      libnvdimm, region: sysfs trigger for nvdimm_flush()


 drivers/nvdimm/region_devs.c |   52 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 48 insertions(+), 4 deletions(-)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 1/2] libnvdimm, region: fix flush hint detection crash
  2017-04-24 23:49 [PATCH v2 0/2] libnvdimm: export wpq flush interface Dan Williams
@ 2017-04-24 23:49 ` Dan Williams
  2017-04-26 19:43   ` Jeff Moyer
  2017-04-24 23:50 ` [PATCH v2 2/2] libnvdimm, region: sysfs trigger for nvdimm_flush() Dan Williams
  1 sibling, 1 reply; 21+ messages in thread
From: Dan Williams @ 2017-04-24 23:49 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-acpi, linux-kernel, stable

In the case where a dimm does not have any associated flush hints the
ndrd->flush_wpq array may be uninitialized leading to crashes with the
following signature:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
 IP: region_visible+0x10f/0x160 [libnvdimm]

 Call Trace:
  internal_create_group+0xbe/0x2f0
  sysfs_create_groups+0x40/0x80
  device_add+0x2d8/0x650
  nd_async_device_register+0x12/0x40 [libnvdimm]
  async_run_entry_fn+0x39/0x170
  process_one_work+0x212/0x6c0
  ? process_one_work+0x197/0x6c0
  worker_thread+0x4e/0x4a0
  kthread+0x10c/0x140
  ? process_one_work+0x6c0/0x6c0
  ? kthread_create_on_node+0x60/0x60
  ret_from_fork+0x31/0x40

Cc: <stable@vger.kernel.org>
Fixes: f284a4f23752 ("libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush()")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/region_devs.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 8de5a04644a1..24abceda986a 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1000,17 +1000,20 @@ EXPORT_SYMBOL_GPL(nvdimm_flush);
  */
 int nvdimm_has_flush(struct nd_region *nd_region)
 {
-	struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
 	int i;
 
 	/* no nvdimm == flushing capability unknown */
 	if (nd_region->ndr_mappings == 0)
 		return -ENXIO;
 
-	for (i = 0; i < nd_region->ndr_mappings; i++)
-		/* flush hints present, flushing required */
-		if (ndrd_get_flush_wpq(ndrd, i, 0))
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nvdimm *nvdimm = nd_mapping->nvdimm;
+
+		/* flush hints present / available */
+		if (nvdimm->num_flush)
 			return 1;
+	}
 
 	/*
 	 * The platform defines dimm devices without hints, assume


* [PATCH v2 2/2] libnvdimm, region: sysfs trigger for nvdimm_flush()
  2017-04-24 23:49 [PATCH v2 0/2] libnvdimm: export wpq flush interface Dan Williams
  2017-04-24 23:49 ` [PATCH v2 1/2] libnvdimm, region: fix flush hint detection crash Dan Williams
@ 2017-04-24 23:50 ` Dan Williams
  2017-04-25 16:37   ` Ross Zwisler
  2017-04-25 20:17   ` [PATCH v3] " Dan Williams
  1 sibling, 2 replies; 21+ messages in thread
From: Dan Williams @ 2017-04-24 23:50 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: Masayoshi Mizuma, linux-acpi, Jeff Moyer, linux-kernel

The nvdimm_flush() mechanism helps to reduce the impact of an ADR
(asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
platform WPQ (write-pending-queue) buffers when power is removed. The
nvdimm_flush() mechanism performs that same function on-demand.

When a pmem namespace is associated with a block device, an
nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
request. These requests are typically associated with filesystem
metadata updates. However, when a namespace is in device-dax mode,
userspace (think database metadata) needs another path to perform the
same flushing. In other words this is not required to make data
persistent, but in the case of metadata it allows for a smaller failure
domain in the unlikely event of an ADR failure.

The new 'flush' attribute is visible when the individual DIMMs backing a
given interleave-set are described by platform firmware. In ACPI terms
this is "NVDIMM Region Mapping Structures" and associated "Flush Hint
Address Structures". Reads return "1" if the region supports triggering
WPQ flushes on all DIMMs. Reads return "0" if the flush operation is a
platform nop, and in that case the attribute is read-only.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/region_devs.c |   41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 24abceda986a..c48f3eddce2d 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -255,6 +255,35 @@ static ssize_t size_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(size);
 
+static ssize_t flush_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	/*
+	 * NOTE: in the nvdimm_has_flush() error case this attribute is
+	 * not visible.
+	 */
+	return sprintf(buf, "%d\n", nvdimm_has_flush(nd_region));
+}
+
+static ssize_t flush_store(struct device *dev, struct device_attribute *attr,
+		const char *buf, size_t len)
+{
+	bool flush;
+	int rc = strtobool(buf, &flush);
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	if (rc)
+		return rc;
+	if (!flush)
+		return -EINVAL;
+	nvdimm_flush(nd_region);
+
+	return len;
+}
+static DEVICE_ATTR_RW(flush);
+
 static ssize_t mappings_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -474,6 +503,7 @@ static DEVICE_ATTR_RO(resource);
 
 static struct attribute *nd_region_attributes[] = {
 	&dev_attr_size.attr,
+	&dev_attr_flush.attr,
 	&dev_attr_nstype.attr,
 	&dev_attr_mappings.attr,
 	&dev_attr_btt_seed.attr,
@@ -508,6 +538,17 @@ static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
 	if (!is_nd_pmem(dev) && a == &dev_attr_resource.attr)
 		return 0;
 
+	if (a == &dev_attr_flush.attr) {
+		int has_flush = nvdimm_has_flush(nd_region);
+
+		if (has_flush == 1)
+			return a->mode;
+		else if (has_flush == 0)
+			return 0400;
+		else
+			return 0;
+	}
+
 	if (a != &dev_attr_set_cookie.attr
 			&& a != &dev_attr_available_size.attr)
 		return a->mode;


* Re: [PATCH v2 2/2] libnvdimm, region: sysfs trigger for nvdimm_flush()
  2017-04-24 23:50 ` [PATCH v2 2/2] libnvdimm, region: sysfs trigger for nvdimm_flush() Dan Williams
@ 2017-04-25 16:37   ` Ross Zwisler
  2017-04-25 16:38     ` Dan Williams
  2017-04-25 20:17   ` [PATCH v3] " Dan Williams
  1 sibling, 1 reply; 21+ messages in thread
From: Ross Zwisler @ 2017-04-25 16:37 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm, linux-acpi, linux-kernel

On Mon, Apr 24, 2017 at 04:50:01PM -0700, Dan Williams wrote:
> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
> platform WPQ (write-pending-queue) buffers when power is removed. The
> nvdimm_flush() mechanism performs that same function on-demand.
> 
> When a pmem namespace is associated with a block device, an
> nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
> request. These requests are typically associated with filesystem
> metadata updates. However, when a namespace is in device-dax mode,
> userspace (think database metadata) needs another path to perform the
> same flushing. In other words this is not required to make data
> persistent, but in the case of metadata it allows for a smaller failure
> domain in the unlikely event of an ADR failure.
> 
> The new 'flush' attribute is visible when the individual DIMMs backing a
> given interleave-set are described by platform firmware. In ACPI terms
> this is "NVDIMM Region Mapping Structures" and associated "Flush Hint
> Address Structures". Reads return "1" if the region supports triggering
> WPQ flushes on all DIMMs. Reads return "0" if the flush operation is a
> platform nop, and in that case the attribute is read-only.
> 
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/nvdimm/region_devs.c |   41 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 41 insertions(+)
> 
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 24abceda986a..c48f3eddce2d 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -255,6 +255,35 @@ static ssize_t size_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(size);
>  
> +static ssize_t flush_show(struct device *dev,
> +		struct device_attribute *attr, char *buf)
> +{
> +	struct nd_region *nd_region = to_nd_region(dev);
> +
> +	/*
> +	 * NOTE: in the nvdimm_has_flush() error case this attribute is
> +	 * not visible.
> +	 */
> +	return sprintf(buf, "%d\n", nvdimm_has_flush(nd_region));
> +}
> +
> +static ssize_t flush_store(struct device *dev, struct device_attribute *attr,
> +		const char *buf, size_t len)
> +{
> +	bool flush;
> +	int rc = strtobool(buf, &flush);
> +	struct nd_region *nd_region = to_nd_region(dev);
> +
> +	if (rc)
> +		return rc;
> +	if (!flush)
> +		return -EINVAL;

Is there a benefit to verifying whether the user actually pushed a "1" into
our flush sysfs entry?  Why have an -EINVAL error case at all?

Flushing is non-destructive and we don't actually need the user to give us any
data, so it seems simpler to just have this code flush, regardless of what
input we received.

> +	nvdimm_flush(nd_region);
> +
> +	return len;
> +}
> +static DEVICE_ATTR_RW(flush);


* Re: [PATCH v2 2/2] libnvdimm, region: sysfs trigger for nvdimm_flush()
  2017-04-25 16:37   ` Ross Zwisler
@ 2017-04-25 16:38     ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2017-04-25 16:38 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: linux-nvdimm@lists.01.org, Linux ACPI, linux-kernel

On Tue, Apr 25, 2017 at 9:37 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Mon, Apr 24, 2017 at 04:50:01PM -0700, Dan Williams wrote:
>> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
>> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
>> platform WPQ (write-pending-queue) buffers when power is removed. The
>> nvdimm_flush() mechanism performs that same function on-demand.
>>
>> When a pmem namespace is associated with a block device, an
>> nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
>> request. These requests are typically associated with filesystem
>> metadata updates. However, when a namespace is in device-dax mode,
>> userspace (think database metadata) needs another path to perform the
>> same flushing. In other words this is not required to make data
>> persistent, but in the case of metadata it allows for a smaller failure
>> domain in the unlikely event of an ADR failure.
>>
>> The new 'flush' attribute is visible when the individual DIMMs backing a
>> given interleave-set are described by platform firmware. In ACPI terms
>> this is "NVDIMM Region Mapping Structures" and associated "Flush Hint
>> Address Structures". Reads return "1" if the region supports triggering
>> WPQ flushes on all DIMMs. Reads return "0" if the flush operation is a
>> platform nop, and in that case the attribute is read-only.
>>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  drivers/nvdimm/region_devs.c |   41 +++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 41 insertions(+)
>>
>> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
>> index 24abceda986a..c48f3eddce2d 100644
>> --- a/drivers/nvdimm/region_devs.c
>> +++ b/drivers/nvdimm/region_devs.c
>> @@ -255,6 +255,35 @@ static ssize_t size_show(struct device *dev,
>>  }
>>  static DEVICE_ATTR_RO(size);
>>
>> +static ssize_t flush_show(struct device *dev,
>> +             struct device_attribute *attr, char *buf)
>> +{
>> +     struct nd_region *nd_region = to_nd_region(dev);
>> +
>> +     /*
>> +      * NOTE: in the nvdimm_has_flush() error case this attribute is
>> +      * not visible.
>> +      */
>> +     return sprintf(buf, "%d\n", nvdimm_has_flush(nd_region));
>> +}
>> +
>> +static ssize_t flush_store(struct device *dev, struct device_attribute *attr,
>> +             const char *buf, size_t len)
>> +{
>> +     bool flush;
>> +     int rc = strtobool(buf, &flush);
>> +     struct nd_region *nd_region = to_nd_region(dev);
>> +
>> +     if (rc)
>> +             return rc;
>> +     if (!flush)
>> +             return -EINVAL;
>
> Is there a benefit to verifying whether the user actually pushed a "1" into
> our flush sysfs entry?  Why have an -EINVAL error case at all?
>
> Flushing is non-destructive and we don't actually need the user to give us any
> data, so it seems simpler to just have this code flush, regardless of what
> input we received.

I want to be specific so that in the future, if we decide that "0" or
some other value should have a different meaning than "1", we won't
need to contend with userspace that may be expecting any random value
to work.
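Dan's reasoning can be restated as a few lines of decision logic. The
sketch below is a userspace model of flush_store(), not the kernel
source: the helper name is made up, and the accepted character set only
approximates the kernel's strtobool(), which keys off the first
character of the input.

```c
#include <errno.h>
#include <stddef.h>

/* Model of the store handler's decision logic: parse a boolean, reject
 * parse failures and explicit "false" writes, flush only on "true". */
static int flush_store_model(const char *buf, size_t len)
{
	int flush;

	if (len == 0)
		return -EINVAL;

	switch (buf[0]) {
	case '1': case 'y': case 'Y':
		flush = 1;
		break;
	case '0': case 'n': case 'N':
		flush = 0;
		break;
	default:
		return -EINVAL;	/* strtobool() parse failure */
	}
	if (!flush)
		return -EINVAL;	/* "0" rejected: reserved for future semantics */
	/* nvdimm_flush(nd_region) would run here */
	return (int)len;	/* consume the whole write */
}
```

This is exactly the strictness being debated: accepting arbitrary input
would forever pin "write anything" as the ABI, while rejecting "0" keeps
that value free to mean something else later.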


* [PATCH v3] libnvdimm, region: sysfs trigger for nvdimm_flush()
  2017-04-24 23:50 ` [PATCH v2 2/2] libnvdimm, region: sysfs trigger for nvdimm_flush() Dan Williams
  2017-04-25 16:37   ` Ross Zwisler
@ 2017-04-25 20:17   ` Dan Williams
  2017-04-26 20:38     ` Jeff Moyer
  1 sibling, 1 reply; 21+ messages in thread
From: Dan Williams @ 2017-04-25 20:17 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: Masayoshi Mizuma, linux-acpi, Jeff Moyer, linux-kernel

The nvdimm_flush() mechanism helps to reduce the impact of an ADR
(asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
platform WPQ (write-pending-queue) buffers when power is removed. The
nvdimm_flush() mechanism performs that same function on-demand.

When a pmem namespace is associated with a block device, an
nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
request. These requests are typically associated with filesystem
metadata updates. However, when a namespace is in device-dax mode,
userspace (think database metadata) needs another path to perform the
same flushing. In other words this is not required to make data
persistent, but in the case of metadata it allows for a smaller failure
domain in the unlikely event of an ADR failure.

The new 'flush' attribute is visible when the individual DIMMs backing a
given interleave-set are described by platform firmware. In ACPI terms
this is "NVDIMM Region Mapping Structures" and associated "Flush Hint
Address Structures". Reads return "1" if the region supports triggering
WPQ flushes on all DIMMs. Reads return "0" if the flush operation is a
platform nop, and in that case the attribute is read-only.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Changes in v3:
* Fixed up the permissions in the read-only case to 0444 instead of
  0400 as is typical for read-only sysfs attributes.
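The permission handling in this revision reduces to a three-way mapping
from nvdimm_has_flush()'s return value; a standalone restatement of the
region_visible() policy (the helper name here is an assumption, added
only for illustration):

```c
/* Mode of the 'flush' attribute as a function of flush capability:
 * writable when WPQ flushes can be triggered, world-readable 0444 when
 * the flush is a platform nop, hidden when capability is unknown. */
static unsigned int flush_attr_mode(int has_flush, unsigned int default_mode)
{
	if (has_flush == 1)
		return default_mode;	/* read/write: flushing works */
	if (has_flush == 0)
		return 0444;		/* read-only: flush is a nop */
	return 0;			/* hidden: e.g. -ENXIO, no mappings */
}
```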

 drivers/nvdimm/region_devs.c |   41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 24abceda986a..9c4dc8bc759a 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -255,6 +255,35 @@ static ssize_t size_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(size);
 
+static ssize_t flush_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	/*
+	 * NOTE: in the nvdimm_has_flush() error case this attribute is
+	 * not visible.
+	 */
+	return sprintf(buf, "%d\n", nvdimm_has_flush(nd_region));
+}
+
+static ssize_t flush_store(struct device *dev, struct device_attribute *attr,
+		const char *buf, size_t len)
+{
+	bool flush;
+	int rc = strtobool(buf, &flush);
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	if (rc)
+		return rc;
+	if (!flush)
+		return -EINVAL;
+	nvdimm_flush(nd_region);
+
+	return len;
+}
+static DEVICE_ATTR_RW(flush);
+
 static ssize_t mappings_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -474,6 +503,7 @@ static DEVICE_ATTR_RO(resource);
 
 static struct attribute *nd_region_attributes[] = {
 	&dev_attr_size.attr,
+	&dev_attr_flush.attr,
 	&dev_attr_nstype.attr,
 	&dev_attr_mappings.attr,
 	&dev_attr_btt_seed.attr,
@@ -508,6 +538,17 @@ static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
 	if (!is_nd_pmem(dev) && a == &dev_attr_resource.attr)
 		return 0;
 
+	if (a == &dev_attr_flush.attr) {
+		int has_flush = nvdimm_has_flush(nd_region);
+
+		if (has_flush == 1)
+			return a->mode;
+		else if (has_flush == 0)
+			return 0444;
+		else
+			return 0;
+	}
+
 	if (a != &dev_attr_set_cookie.attr
 			&& a != &dev_attr_available_size.attr)
 		return a->mode;


* Re: [PATCH v2 1/2] libnvdimm, region: fix flush hint detection crash
  2017-04-24 23:49 ` [PATCH v2 1/2] libnvdimm, region: fix flush hint detection crash Dan Williams
@ 2017-04-26 19:43   ` Jeff Moyer
  2017-04-26 20:04     ` Dan Williams
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Moyer @ 2017-04-26 19:43 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm, linux-acpi, linux-kernel, stable

Dan Williams <dan.j.williams@intel.com> writes:

> In the case where a dimm does not have any associated flush hints the
> ndrd->flush_wpq array may be uninitialized leading to crashes with the
> following signature:
>
>  BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
>  IP: region_visible+0x10f/0x160 [libnvdimm]
>
>  Call Trace:
>   internal_create_group+0xbe/0x2f0
>   sysfs_create_groups+0x40/0x80
>   device_add+0x2d8/0x650
>   nd_async_device_register+0x12/0x40 [libnvdimm]
>   async_run_entry_fn+0x39/0x170
>   process_one_work+0x212/0x6c0
>   ? process_one_work+0x197/0x6c0
>   worker_thread+0x4e/0x4a0
>   kthread+0x10c/0x140
>   ? process_one_work+0x6c0/0x6c0
>   ? kthread_create_on_node+0x60/0x60
>   ret_from_fork+0x31/0x40

Sorry for being dense, but I'm having a tough time connecting the dots
here.  How does region_visible trip over the missing (not uninitialized,
you're actually walking off the end of the structure) flush_wpq array?

Anyway, the fix looks valid.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>

Thanks,
Jeff

>
> Cc: <stable@vger.kernel.org>
> Fixes: f284a4f23752 ("libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush()")
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/nvdimm/region_devs.c |   11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 8de5a04644a1..24abceda986a 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -1000,17 +1000,20 @@ EXPORT_SYMBOL_GPL(nvdimm_flush);
>   */
>  int nvdimm_has_flush(struct nd_region *nd_region)
>  {
> -	struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
>  	int i;
>  
>  	/* no nvdimm == flushing capability unknown */
>  	if (nd_region->ndr_mappings == 0)
>  		return -ENXIO;
>  
> -	for (i = 0; i < nd_region->ndr_mappings; i++)
> -		/* flush hints present, flushing required */
> -		if (ndrd_get_flush_wpq(ndrd, i, 0))
> +	for (i = 0; i < nd_region->ndr_mappings; i++) {
> +		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
> +		struct nvdimm *nvdimm = nd_mapping->nvdimm;
> +
> +		/* flush hints present / available */
> +		if (nvdimm->num_flush)
>  			return 1;
> +	}
>  
>  	/*
>  	 * The platform defines dimm devices without hints, assume
>


* Re: [PATCH v2 1/2] libnvdimm, region: fix flush hint detection crash
  2017-04-26 19:43   ` Jeff Moyer
@ 2017-04-26 20:04     ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2017-04-26 20:04 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-nvdimm@lists.01.org, Linux ACPI, linux-kernel, stable

On Wed, Apr 26, 2017 at 12:43 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
>
>> In the case where a dimm does not have any associated flush hints the
>> ndrd->flush_wpq array may be uninitialized leading to crashes with the
>> following signature:
>>
>>  BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
>>  IP: region_visible+0x10f/0x160 [libnvdimm]
>>
>>  Call Trace:
>>   internal_create_group+0xbe/0x2f0
>>   sysfs_create_groups+0x40/0x80
>>   device_add+0x2d8/0x650
>>   nd_async_device_register+0x12/0x40 [libnvdimm]
>>   async_run_entry_fn+0x39/0x170
>>   process_one_work+0x212/0x6c0
>>   ? process_one_work+0x197/0x6c0
>>   worker_thread+0x4e/0x4a0
>>   kthread+0x10c/0x140
>>   ? process_one_work+0x6c0/0x6c0
>>   ? kthread_create_on_node+0x60/0x60
>>   ret_from_fork+0x31/0x40
>
> Sorry for being dense, but I'm having a tough time connecting the dots
> here.  How does region_visible trip over the missing (not uninitialized,
> you're actually walking off the end of the structure) flush_wpq array?

So, you're not dense, or you're at least as dense as I am, because I
didn't immediately understand where this failure was coming from
either. I just happened to trigger it while running patch 2 and
thought the current code just looked unsafe by inspection.

> Anyway, the fix looks valid.
>
> Reviewed-by: Jeff Moyer <jmoyer@redhat.com>

Thanks!


* Re: [PATCH v3] libnvdimm, region: sysfs trigger for nvdimm_flush()
  2017-04-25 20:17   ` [PATCH v3] " Dan Williams
@ 2017-04-26 20:38     ` Jeff Moyer
  2017-04-26 23:00       ` Dan Williams
  2017-04-26 23:36       ` [PATCH v4] " Dan Williams
  0 siblings, 2 replies; 21+ messages in thread
From: Jeff Moyer @ 2017-04-26 20:38 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm, Masayoshi Mizuma, linux-acpi, linux-kernel

Dan Williams <dan.j.williams@intel.com> writes:

> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
> platform WPQ (write-pending-queue) buffers when power is removed. The
> nvdimm_flush() mechanism performs that same function on-demand.
>
> When a pmem namespace is associated with a block device, an
> nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
> request. These requests are typically associated with filesystem
> metadata updates. However, when a namespace is in device-dax mode,
> userspace (think database metadata) needs another path to perform the
> same flushing. In other words this is not required to make data
> persistent, but in the case of metadata it allows for a smaller failure
> domain in the unlikely event of an ADR failure.
>
> The new 'flush' attribute is visible when the individual DIMMs backing a
> given interleave-set are described by platform firmware. In ACPI terms
> this is "NVDIMM Region Mapping Structures" and associated "Flush Hint
> Address Structures". Reads return "1" if the region supports triggering
> WPQ flushes on all DIMMs. Reads return "0" if the flush operation is a
> platform nop, and in that case the attribute is read-only.

I can make peace with exposing this to userspace, though I am mostly
against its use.  However, sysfs feels like the wrong interface.
Believe it or not, I'd rather see this implemented as an ioctl.

This isn't a NACK, it's me giving my opinion.  Do with it what you will.

Cheers,
Jeff


* Re: [PATCH v3] libnvdimm, region: sysfs trigger for nvdimm_flush()
  2017-04-26 20:38     ` Jeff Moyer
@ 2017-04-26 23:00       ` Dan Williams
  2017-04-27 13:45         ` Jeff Moyer
  2017-04-26 23:36       ` [PATCH v4] " Dan Williams
  1 sibling, 1 reply; 21+ messages in thread
From: Dan Williams @ 2017-04-26 23:00 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-nvdimm@lists.01.org, Masayoshi Mizuma, Linux ACPI, linux-kernel

On Wed, Apr 26, 2017 at 1:38 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
>
>> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
>> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
>> platform WPQ (write-pending-queue) buffers when power is removed. The
>> nvdimm_flush() mechanism performs that same function on-demand.
>>
>> When a pmem namespace is associated with a block device, an
>> nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
>> request. These requests are typically associated with filesystem
>> metadata updates. However, when a namespace is in device-dax mode,
>> userspace (think database metadata) needs another path to perform the
>> same flushing. In other words this is not required to make data
>> persistent, but in the case of metadata it allows for a smaller failure
>> domain in the unlikely event of an ADR failure.
>>
>> The new 'flush' attribute is visible when the individual DIMMs backing a
>> given interleave-set are described by platform firmware. In ACPI terms
>> this is "NVDIMM Region Mapping Structures" and associated "Flush Hint
>> Address Structures". Reads return "1" if the region supports triggering
>> WPQ flushes on all DIMMs. Reads return "0" if the flush operation is a
>> platform nop, and in that case the attribute is read-only.
>
> I can make peace with exposing this to userspace, though I am mostly
> against its use.  However, sysfs feels like the wrong interface.
> Believe it or not, I'd rather see this implemented as an ioctl.
>
> This isn't a NACK, it's me giving my opinion.  Do with it what you will.

I hate ioctls with a burning passion, so I can't get on board with that
change, but perhaps the sentiment behind it is that this is too visible
and too attractive when called "flush" in sysfs? Would a name more
specific to the mechanism make it more palatable? Like
"flush_hint_trigger" or "wpq_drain"?


* [PATCH v4] libnvdimm, region: sysfs trigger for nvdimm_flush()
  2017-04-26 20:38     ` Jeff Moyer
  2017-04-26 23:00       ` Dan Williams
@ 2017-04-26 23:36       ` Dan Williams
  2017-04-27 22:17         ` [PATCH v5] " Dan Williams
  1 sibling, 1 reply; 21+ messages in thread
From: Dan Williams @ 2017-04-26 23:36 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: Masayoshi Mizuma, linux-acpi, Jeff Moyer, linux-kernel

The nvdimm_flush() mechanism helps to reduce the impact of an ADR
(asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
platform WPQ (write-pending-queue) buffers when power is removed. The
nvdimm_flush() mechanism performs that same function on-demand.

When a pmem namespace is associated with a block device, an
nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
request. These requests are typically associated with filesystem
metadata updates. However, when a namespace is in device-dax mode,
userspace (think database metadata) needs another path to perform the
same flushing. In other words this is not required to make data
persistent, but in the case of metadata it allows for a smaller failure
domain in the unlikely event of an ADR failure.

The new 'wpq_drain' attribute is visible when the individual DIMMs
backing a given interleave-set are described by platform firmware. In
ACPI terms this is "NVDIMM Region Mapping Structures" and associated
"Flush Hint Address Structures". Reads return "1" if the region supports
triggering WPQ flushes on all DIMMs. Reads return "0" if the flush
operation is a platform nop, and in that case the attribute is
read-only.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Changes in v4:
* Rename the attribute from "flush" to "wpq_drain". Short of moving this
  functionality to an ioctl(), this rename hopefully makes it clearer that
  this is an optional, hardware-specific operation and not a general-purpose
  "flush" for reaching persistence. (Jeff)

 drivers/nvdimm/region_devs.c |   41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 24abceda986a..6bcbbfd9f015 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -255,6 +255,35 @@ static ssize_t size_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(size);
 
+static ssize_t wpq_drain_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	/*
+	 * NOTE: in the nvdimm_has_flush() error case this attribute is
+	 * not visible.
+	 */
+	return sprintf(buf, "%d\n", nvdimm_has_flush(nd_region));
+}
+
+static ssize_t wpq_drain_store(struct device *dev, struct device_attribute *attr,
+		const char *buf, size_t len)
+{
+	bool flush;
+	int rc = strtobool(buf, &flush);
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	if (rc)
+		return rc;
+	if (!flush)
+		return -EINVAL;
+	nvdimm_flush(nd_region);
+
+	return len;
+}
+static DEVICE_ATTR_RW(wpq_drain);
+
 static ssize_t mappings_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -479,6 +508,7 @@ static struct attribute *nd_region_attributes[] = {
 	&dev_attr_btt_seed.attr,
 	&dev_attr_pfn_seed.attr,
 	&dev_attr_dax_seed.attr,
+	&dev_attr_wpq_drain.attr,
 	&dev_attr_read_only.attr,
 	&dev_attr_set_cookie.attr,
 	&dev_attr_available_size.attr,
@@ -508,6 +538,17 @@ static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
 	if (!is_nd_pmem(dev) && a == &dev_attr_resource.attr)
 		return 0;
 
+	if (a == &dev_attr_wpq_drain.attr) {
+		int has_flush = nvdimm_has_flush(nd_region);
+
+		if (has_flush == 1)
+			return a->mode;
+		else if (has_flush == 0)
+			return 0444;
+		else
+			return 0;
+	}
+
 	if (a != &dev_attr_set_cookie.attr
 			&& a != &dev_attr_available_size.attr)
 		return a->mode;

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] libnvdimm, region: sysfs trigger for nvdimm_flush()
  2017-04-26 23:00       ` Dan Williams
@ 2017-04-27 13:45         ` Jeff Moyer
  2017-04-27 16:56           ` Dan Williams
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Moyer @ 2017-04-27 13:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm@lists.01.org, Masayoshi Mizuma, Linux ACPI, linux-kernel

Dan Williams <dan.j.williams@intel.com> writes:

> On Wed, Apr 26, 2017 at 1:38 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>>> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
>>> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
>>> platform WPQ (write-pending-queue) buffers when power is removed. The
>>> nvdimm_flush() mechanism performs that same function on-demand.
>>>
>>> When a pmem namespace is associated with a block device, an
>>> nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
>>> request. These requests are typically associated with filesystem
>>> metadata updates. However, when a namespace is in device-dax mode,
>>> userspace (think database metadata) needs another path to perform the
>>> same flushing. In other words this is not required to make data
>>> persistent, but in the case of metadata it allows for a smaller failure
>>> domain in the unlikely event of an ADR failure.
>>>
>>> The new 'flush' attribute is visible when the individual DIMMs backing a
>>> given interleave-set are described by platform firmware. In ACPI terms
>>> this is "NVDIMM Region Mapping Structures" and associated "Flush Hint
>>> Address Structures". Reads return "1" if the region supports triggering
>>> WPQ flushes on all DIMMs. Reads return "0" the flush operation is a
>>> platform nop, and in that case the attribute is read-only.
>>
>> I can make peace with exposing this to userspace, though I am mostly
>> against its use.  However, sysfs feels like the wrong interface.
>> Believe it or not, I'd rather see this implemented as an ioctl.
>>
>> This isn't a NACK, it's me giving my opinion.  Do with it what you will.
>
> I hate ioctls with a burning passion so I can't get on board with that
> change, but perhaps the sentiment behind it is that this is too
> visible and too attractive being called "flush" in sysfs? Would a name
> more specific to the mechanism make it more palatable? Like
> "flush_hint_trigger" or "wpq_drain"?

The sentiment is that programs shouldn't have to grovel around in sysfs
to do stuff related to an open file descriptor or mapping.  I don't take
issue with the name.  I do worry that something like 'wpq_drain' may be
too platform specific, though.  The NVM Programming Model specification
is going to call this "deep flush", so maybe that will give you
some inspiration if you do want to change the name.

Cheers,
Jeff

* Re: [PATCH v3] libnvdimm, region: sysfs trigger for nvdimm_flush()
  2017-04-27 13:45         ` Jeff Moyer
@ 2017-04-27 16:56           ` Dan Williams
  2017-04-27 18:41             ` Jeff Moyer
  0 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2017-04-27 16:56 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-nvdimm@lists.01.org, Masayoshi Mizuma, Linux ACPI, linux-kernel

On Thu, Apr 27, 2017 at 6:45 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
>
>> On Wed, Apr 26, 2017 at 1:38 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>> Dan Williams <dan.j.williams@intel.com> writes:
>>>
>>>> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
>>>> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
>>>> platform WPQ (write-pending-queue) buffers when power is removed. The
>>>> nvdimm_flush() mechanism performs that same function on-demand.
>>>>
>>>> When a pmem namespace is associated with a block device, an
>>>> nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
>>>> request. These requests are typically associated with filesystem
>>>> metadata updates. However, when a namespace is in device-dax mode,
>>>> userspace (think database metadata) needs another path to perform the
>>>> same flushing. In other words this is not required to make data
>>>> persistent, but in the case of metadata it allows for a smaller failure
>>>> domain in the unlikely event of an ADR failure.
>>>>
>>>> The new 'flush' attribute is visible when the individual DIMMs backing a
>>>> given interleave-set are described by platform firmware. In ACPI terms
>>>> this is "NVDIMM Region Mapping Structures" and associated "Flush Hint
>>>> Address Structures". Reads return "1" if the region supports triggering
>>>> WPQ flushes on all DIMMs. Reads return "0" the flush operation is a
>>>> platform nop, and in that case the attribute is read-only.
>>>
>>> I can make peace with exposing this to userspace, though I am mostly
>>> against its use.  However, sysfs feels like the wrong interface.
>>> Believe it or not, I'd rather see this implemented as an ioctl.
>>>
>>> This isn't a NACK, it's me giving my opinion.  Do with it what you will.
>>
>> I hate ioctls with a burning passion so I can't get on board with that
>> change, but perhaps the sentiment behind it is that this is too
>> visible and too attractive being called "flush" in sysfs? Would a name
>> more specific to the mechanism make it more palatable? Like
>> "flush_hint_trigger" or "wpq_drain"?
>
> The sentiment is that programs shouldn't have to grovel around in sysfs
> to do stuff related to an open file descriptor or mapping.  I don't take
> issue with the name.  I do worry that something like 'wpq_drain' may be
> too platform specific, though.  The NVM Programming Model specification
> is going to call this "deep flush", so maybe that will give you
> some inspiration if you do want to change the name.

I'll change to "deep_flush", and I quibble that this is related to a
single open file descriptor or mapping. It really is a "region flush"
for giving extra protection for global metadata, but the persistence
of individual fds or mappings is handled by ADR. I think an ioctl
might give the false impression that every time you flush a cacheline
to persistence you need to call the ioctl.

* Re: [PATCH v3] libnvdimm, region: sysfs trigger for nvdimm_flush()
  2017-04-27 16:56           ` Dan Williams
@ 2017-04-27 18:41             ` Jeff Moyer
  2017-04-27 19:17               ` Dan Williams
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Moyer @ 2017-04-27 18:41 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm@lists.01.org, Masayoshi Mizuma, Linux ACPI, linux-kernel

Dan Williams <dan.j.williams@intel.com> writes:

>> The sentiment is that programs shouldn't have to grovel around in sysfs
>> to do stuff related to an open file descriptor or mapping.  I don't take
>> issue with the name.  I do worry that something like 'wpq_drain' may be
>> too platform specific, though.  The NVM Programming Model specification
>> is going to call this "deep flush", so maybe that will give you
>> some inspiration if you do want to change the name.
>
> I'll change to "deep_flush", and I quibble that this is related to a
> single open file descriptor or mapping. It really is a "region flush"
> for giving extra protection for global metadata, but the persistence
> of individual fds or mappings is handled by ADR. I think an ioctl
> might give the false impression that every time you flush a cacheline
> to persistence you need to call the ioctl.

fsync, for example, may affect more than one fd--all data in the drive
write cache will be flushed.  I don't see how this is so different.  I
think a sysfs file is awkward because it requires an application to
chase down the correct file in the sysfs hierarchy.  If the application
already has an open fd or a mapping, it should be able to operate on
that.

As for confusion on when to use the interface, I think that's inevitable
no matter how it's implemented.  We're introducing a flush type that has
never been exposed before, and we're not giving any information on how
likely an ADR failure is, or how expensive this flush will be.

Cheers,
Jeff

* Re: [PATCH v3] libnvdimm, region: sysfs trigger for nvdimm_flush()
  2017-04-27 18:41             ` Jeff Moyer
@ 2017-04-27 19:17               ` Dan Williams
  2017-04-27 19:21                 ` Dan Williams
  2017-04-27 19:40                 ` Jeff Moyer
  0 siblings, 2 replies; 21+ messages in thread
From: Dan Williams @ 2017-04-27 19:17 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-nvdimm@lists.01.org, Masayoshi Mizuma, Linux ACPI, linux-kernel

On Thu, Apr 27, 2017 at 11:41 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
>
>>> The sentiment is that programs shouldn't have to grovel around in sysfs
>>> to do stuff related to an open file descriptor or mapping.  I don't take
>>> issue with the name.  I do worry that something like 'wpq_drain' may be
>>> too platform specific, though.  The NVM Programming Model specification
>>> is going to call this "deep flush", so maybe that will give you
>>> some inspiration if you do want to change the name.
>>
>> I'll change to "deep_flush", and I quibble that this is related to a
>> single open file descriptor or mapping. It really is a "region flush"
>> for giving extra protection for global metadata, but the persistence
>> of individual fds or mappings is handled by ADR. I think an ioctl
>> might give the false impression that every time you flush a cacheline
>> to persistence you need to call the ioctl.
>
> fsync, for example, may affect more than one fd--all data in the drive
> write cache will be flushed.  I don't see how this is so different.  I
> think a sysfs file is awkward because it requires an application to
> chase down the correct file in the sysfs hierarchy.  If the application
> already has an open fd or a mapping, it should be able to operate on
> that.

I'm teetering, but still leaning towards sysfs. The use case that
needs this is device-dax because we otherwise silently do this behind
the application's back on filesystem-dax for fsync / msync. A
device-dax ioctl would be straightforward, but 'deep flush' assumes
that the device-dax instance is fronting persistent memory. There's
nothing persistent memory specific about device-dax except that today
only the nvdimm sub-system knows how to create them, but there's
nothing that prevents other memory regions from being mapped this way.
So I'd rather this persistent memory specific mechanism stay with the
persistent memory specific portion of the interface rather than plumb
persistent memory details out through the generic device-dax interface
since we have no other intercept point like we do in the
filesystem-dax case to hide this flush.

* Re: [PATCH v3] libnvdimm, region: sysfs trigger for nvdimm_flush()
  2017-04-27 19:17               ` Dan Williams
@ 2017-04-27 19:21                 ` Dan Williams
  2017-04-27 19:43                   ` Jeff Moyer
  2017-04-27 19:40                 ` Jeff Moyer
  1 sibling, 1 reply; 21+ messages in thread
From: Dan Williams @ 2017-04-27 19:21 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-nvdimm@lists.01.org, Masayoshi Mizuma, Linux ACPI, linux-kernel

On Thu, Apr 27, 2017 at 12:17 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Thu, Apr 27, 2017 at 11:41 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>>>> The sentiment is that programs shouldn't have to grovel around in sysfs
>>>> to do stuff related to an open file descriptor or mapping.  I don't take
>>>> issue with the name.  I do worry that something like 'wpq_drain' may be
>>>> too platform specific, though.  The NVM Programming Model specification
>>>> is going to call this "deep flush", so maybe that will give you
>>>> some inspiration if you do want to change the name.
>>>
>>> I'll change to "deep_flush", and I quibble that this is related to a
>>> single open file descriptor or mapping. It really is a "region flush"
>>> for giving extra protection for global metadata, but the persistence
>>> of individual fds or mappings is handled by ADR. I think an ioctl
>>> might give the false impression that every time you flush a cacheline
>>> to persistence you need to call the ioctl.
>>
>> fsync, for example, may affect more than one fd--all data in the drive
>> write cache will be flushed.  I don't see how this is so different.  I
>> think a sysfs file is awkward because it requires an application to
>> chase down the correct file in the sysfs hierarchy.  If the application
>> already has an open fd or a mapping, it should be able to operate on
>> that.
>
> I'm teetering, but still leaning towards sysfs. The use case that
> needs this is device-dax because we otherwise silently do this behind
> the application's back on filesystem-dax for fsync / msync. A
> device-dax ioctl would be straightforward, but 'deep flush' assumes
> that the device-dax instance is fronting persistent memory. There's
> nothing persistent memory specific about device-dax except that today
> only the nvdimm sub-system knows how to create them, but there's
> nothing that prevents other memory regions from being mapped this way.
> So I'd rather this persistent memory specific mechanism stay with the
> persistent memory specific portion of the interface rather than plumb
> persistent memory details out through the generic device-dax interface
> since we have no other intercept point like we do in the
> filesystem-dax case to hide this flush.

We also still seem to need a discovery mechanism as I've had questions
about "how do I tell if my system supports deep flush?". That's where
sysfs is much better than an ioctl. The need for deep flush discovery
tips the scales, at least for me, to also do deep flush triggering
through the same interface.

* Re: [PATCH v3] libnvdimm, region: sysfs trigger for nvdimm_flush()
  2017-04-27 19:17               ` Dan Williams
  2017-04-27 19:21                 ` Dan Williams
@ 2017-04-27 19:40                 ` Jeff Moyer
  2017-04-27 20:02                   ` Dan Williams
  1 sibling, 1 reply; 21+ messages in thread
From: Jeff Moyer @ 2017-04-27 19:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm@lists.01.org, Masayoshi Mizuma, Linux ACPI, linux-kernel

Dan Williams <dan.j.williams@intel.com> writes:

> On Thu, Apr 27, 2017 at 11:41 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>>>> The sentiment is that programs shouldn't have to grovel around in sysfs
>>>> to do stuff related to an open file descriptor or mapping.  I don't take
>>>> issue with the name.  I do worry that something like 'wpq_drain' may be
>>>> too platform specific, though.  The NVM Programming Model specification
>>>> is going to call this "deep flush", so maybe that will give you
>>>> some inspiration if you do want to change the name.
>>>
>>> I'll change to "deep_flush", and I quibble that this is related to a
>>> single open file descriptor or mapping. It really is a "region flush"
>>> for giving extra protection for global metadata, but the persistence
>>> of individual fds or mappings is handled by ADR. I think an ioctl
>>> might give the false impression that every time you flush a cacheline
>>> to persistence you need to call the ioctl.
>>
>> fsync, for example, may affect more than one fd--all data in the drive
>> write cache will be flushed.  I don't see how this is so different.  I
>> think a sysfs file is awkward because it requires an application to
>> chase down the correct file in the sysfs hierarchy.  If the application
>> already has an open fd or a mapping, it should be able to operate on
>> that.
>
> I'm teetering, but still leaning towards sysfs. The use case that
> needs this is device-dax because we otherwise silently do this behind
> the application's back on filesystem-dax for fsync / msync.

We may yet get file system support for flush from userspace (NOVA, for
example).  So I don't think we should restrict ourselves to only
thinking about the device dax use case.

> A device-dax ioctl would be straightforward, but 'deep flush' assumes
> that the device-dax instance is fronting persistent memory.  There's
> nothing persistent memory specific about device-dax except that today
> only the nvdimm sub-system knows how to create them, but there's
> nothing that prevents other memory regions from being mapped this way.

You're concerned that applications operating on device dax instances
that are not backed by pmem will try to issue a deep flush?  Why would
they do that, and why can't you just return failure from the ioctl?

> So I'd rather this persistent memory specific mechanism stay with the
> persistent memory specific portion of the interface rather than plumb
> persistent memory details out through the generic device-dax interface
> since we have no other intercept point like we do in the
> filesystem-dax case to hide this flush.

Look at the block layer.  You can issue an ioctl on a block device, and
if the generic block layer can handle it, it does.  If not, it gets
passed down to lower layers until either it gets handled, or it bubbles
back up because nobody knew what to do with it.  I think you can do the
same thing here, and that solves your layering violation.

Cheers,
Jeff

* Re: [PATCH v3] libnvdimm, region: sysfs trigger for nvdimm_flush()
  2017-04-27 19:21                 ` Dan Williams
@ 2017-04-27 19:43                   ` Jeff Moyer
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff Moyer @ 2017-04-27 19:43 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm@lists.01.org, Masayoshi Mizuma, Linux ACPI, linux-kernel

Dan Williams <dan.j.williams@intel.com> writes:

> We also still seem to need a discovery mechanism as I've had questions
> about "how do I tell if my system supports deep flush?". That's where
> sysfs is much better than an ioctl. The need for deep flush discovery
> tips the scales, at least for me, to also do deep flush triggering
> through the same interface.

Return ENXIO or EOPNOTSUPP from the ioctl when it isn't supported?

Cheers,
Jeff

* Re: [PATCH v3] libnvdimm, region: sysfs trigger for nvdimm_flush()
  2017-04-27 19:40                 ` Jeff Moyer
@ 2017-04-27 20:02                   ` Dan Williams
  2017-04-27 21:36                     ` Dan Williams
  0 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2017-04-27 20:02 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-nvdimm@lists.01.org, Masayoshi Mizuma, Linux ACPI, linux-kernel

On Thu, Apr 27, 2017 at 12:40 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
>
>> On Thu, Apr 27, 2017 at 11:41 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>> Dan Williams <dan.j.williams@intel.com> writes:
>>>
>>>>> The sentiment is that programs shouldn't have to grovel around in sysfs
>>>>> to do stuff related to an open file descriptor or mapping.  I don't take
>>>>> issue with the name.  I do worry that something like 'wpq_drain' may be
>>>>> too platform specific, though.  The NVM Programming Model specification
>>>>> is going to call this "deep flush", so maybe that will give you
>>>>> some inspiration if you do want to change the name.
>>>>
>>>> I'll change to "deep_flush", and I quibble that this is related to a
>>>> single open file descriptor or mapping. It really is a "region flush"
>>>> for giving extra protection for global metadata, but the persistence
>>>> of individual fds or mappings is handled by ADR. I think an ioctl
>>>> might give the false impression that every time you flush a cacheline
>>>> to persistence you need to call the ioctl.
>>>
>>> fsync, for example, may affect more than one fd--all data in the drive
>>> write cache will be flushed.  I don't see how this is so different.  I
>>> think a sysfs file is awkward because it requires an application to
>>> chase down the correct file in the sysfs hierarchy.  If the application
>>> already has an open fd or a mapping, it should be able to operate on
>>> that.
>>
>> I'm teetering, but still leaning towards sysfs. The use case that
>> needs this is device-dax because we otherwise silently do this behind
>> the application's back on filesystem-dax for fsync / msync.
>
> We may yet get file system support for flush from userspace (NOVA, for
> example).  So I don't think we should restrict ourselves to only
> thinking about the device dax use case.
>
>> A device-dax ioctl would be straightforward, but 'deep flush' assumes
>> that the device-dax instance is fronting persistent memory.  There's
>> nothing persistent memory specific about device-dax except that today
>> only the nvdimm sub-system knows how to create them, but there's
>> nothing that prevents other memory regions from being mapped this way.
>
> You're concerned that applications operating on device dax instances
> that are not backed by pmem will try to issue a deep flush?  Why would
> they do that, and why can't you just return failure from the ioctl?
>
>> So I'd rather this persistent memory specific mechanism stay with the
>> persistent memory specific portion of the interface rather than plumb
>> persistent memory details out through the generic device-dax interface
>> since we have no other intercept point like we do in the
>> filesystem-dax case to hide this flush.
>
> Look at the block layer.  You can issue an ioctl on a block device, and
> if the generic block layer can handle it, it does.  If not, it gets
> passed down to lower layers until either it gets handled, or it bubbles
> back up because nobody knew what to do with it.  I think you can do the
> same thing here, and that solves your layering violation.

So this is where I started. I was going to follow the block layer.
Except recently the block layer has been leaning away from ioctls and
implementing support for syscalls directly. The same approach for
device-dax fallocate() support got NAKd, so I opted for sysfs out of
the gate.

However, since there really is no analog for "deep flush" in the
syscall namespace, let's (*gag*) implement an ioctl for this.

* Re: [PATCH v3] libnvdimm, region: sysfs trigger for nvdimm_flush()
  2017-04-27 20:02                   ` Dan Williams
@ 2017-04-27 21:36                     ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2017-04-27 21:36 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-nvdimm@lists.01.org, Masayoshi Mizuma, Linux ACPI, linux-kernel

On Thu, Apr 27, 2017 at 1:02 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Thu, Apr 27, 2017 at 12:40 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>>> On Thu, Apr 27, 2017 at 11:41 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>>> Dan Williams <dan.j.williams@intel.com> writes:
>>>>
>>>>>> The sentiment is that programs shouldn't have to grovel around in sysfs
>>>>>> to do stuff related to an open file descriptor or mapping.  I don't take
>>>>>> issue with the name.  I do worry that something like 'wpq_drain' may be
>>>>>> too platform specific, though.  The NVM Programming Model specification
>>>>>> is going to call this "deep flush", so maybe that will give you
>>>>>> some inspiration if you do want to change the name.
>>>>>
>>>>> I'll change to "deep_flush", and I quibble that this is related to a
>>>>> single open file descriptor or mapping. It really is a "region flush"
>>>>> for giving extra protection for global metadata, but the persistence
>>>>> of individual fds or mappings is handled by ADR. I think an ioctl
>>>>> might give the false impression that every time you flush a cacheline
>>>>> to persistence you need to call the ioctl.
>>>>
>>>> fsync, for example, may affect more than one fd--all data in the drive
>>>> write cache will be flushed.  I don't see how this is so different.  I
>>>> think a sysfs file is awkward because it requires an application to
>>>> chase down the correct file in the sysfs hierarchy.  If the application
>>>> already has an open fd or a mapping, it should be able to operate on
>>>> that.
>>>
>>> I'm teetering, but still leaning towards sysfs. The use case that
>>> needs this is device-dax because we otherwise silently do this behind
>>> the application's back on filesystem-dax for fsync / msync.
>>
>> We may yet get file system support for flush from userspace (NOVA, for
>> example).  So I don't think we should restrict ourselves to only
>> thinking about the device dax use case.
>>
>>> A device-dax ioctl would be straightforward, but 'deep flush' assumes
>>> that the device-dax instance is fronting persistent memory.  There's
>>> nothing persistent memory specific about device-dax except that today
>>> only the nvdimm sub-system knows how to create them, but there's
>>> nothing that prevents other memory regions from being mapped this way.
>>
>> You're concerned that applications operating on device dax instances
>> that are not backed by pmem will try to issue a deep flush?  Why would
>> they do that, and why can't you just return failure from the ioctl?
>>
>>> So I'd rather this persistent memory specific mechanism stay with the
>>> persistent memory specific portion of the interface rather than plumb
>>> persistent memory details out through the generic device-dax interface
>>> since we have no other intercept point like we do in the
>>> filesystem-dax case to hide this flush.
>>
>> Look at the block layer.  You can issue an ioctl on a block device, and
>> if the generic block layer can handle it, it does.  If not, it gets
>> passed down to lower layers until either it gets handled, or it bubbles
>> back up because nobody knew what to do with it.  I think you can do the
>> same thing here, and that solves your layering violation.
>
> So this is where I started. I was going to follow the block layer.
> Except recently the block layer has been leaning away from ioctls and
> implementing support for syscalls directly. The same approach for
> device-dax fallocate() support got NAKd, so I opted for sysfs out of
> the gate.
>
> However, since there really is no analog for "deep flush" in the
> syscall namespace lets (*gag*) implement an ioctl for this.

So I think about it for 2 seconds and now I'm veering back to sysfs.
We don't want applications calling this ioctl on filesystem-dax fds. I
simply can't bring myself to do the work to pick a unique ioctl number
that is known to be unique for the full filesystem and block-layer
paths when we can put this interface right where it belongs in nvdimm
specific sysfs.

* [PATCH v5] libnvdimm, region: sysfs trigger for nvdimm_flush()
  2017-04-26 23:36       ` [PATCH v4] " Dan Williams
@ 2017-04-27 22:17         ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2017-04-27 22:17 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: Masayoshi Mizuma, linux-acpi, Jeff Moyer, linux-kernel

The nvdimm_flush() mechanism helps to reduce the impact of an ADR
(asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
platform WPQ (write-pending-queue) buffers when power is removed. The
nvdimm_flush() mechanism performs that same function on-demand.

When a pmem namespace is associated with a block device, an
nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
request. These requests are typically associated with filesystem
metadata updates. However, when a namespace is in device-dax mode,
userspace (think database metadata) needs another path to perform the
same flushing. In other words this is not required to make data
persistent, but in the case of metadata it allows for a smaller failure
domain in the unlikely event of an ADR failure.

The new 'deep_flush' attribute is visible when the individual DIMMs
backing a given interleave-set are described by platform firmware. In
ACPI terms this is "NVDIMM Region Mapping Structures" and associated
"Flush Hint Address Structures". Reads return "1" if the region supports
triggering WPQ flushes on all DIMMs. Reads return "0" if the flush
operation is a platform nop, and in that case the attribute is
read-only.

Why sysfs and not an ioctl? An ioctl requires establishing a new
ioctl function number space for device-dax. Given that this would be
called on a device-dax fd an application could be forgiven for
accidentally calling this on a filesystem-dax fd. Placing this interface
in libnvdimm sysfs removes that potential for collision with a
filesystem ioctl, and it keeps ioctls out of the generic device-dax
implementation.
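
The discovery-plus-trigger flow this changelog describes can be sketched
from userspace as follows (the helper name and sysfs path are illustrative,
not part of this patch):

```shell
# deep_flush_region <region-sysfs-dir>: hypothetical helper. A read of the
# attribute answers "does my system support deep flush?"; writing "1"
# triggers an on-demand WPQ flush of all DIMMs in the region. When the
# read returns "0" the flush is a platform nop and the attribute is
# read-only, so no write is attempted.
deep_flush_region() {
    attr="$1/deep_flush"
    [ -f "$attr" ] || { echo "unsupported"; return 1; }
    if [ "$(cat "$attr")" = "1" ]; then
        echo 1 > "$attr"   # triggers nvdimm_flush() for the region
        echo "flushed"
    else
        echo "nop"         # persistence is still handled by ADR
    fi
}

# Typical invocation (path illustrative):
# deep_flush_region /sys/bus/nd/devices/region0
```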

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/region_devs.c |   41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 24abceda986a..53d1ba4e6d99 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -255,6 +255,35 @@ static ssize_t size_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(size);
 
+static ssize_t deep_flush_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	/*
+	 * NOTE: in the nvdimm_has_flush() error case this attribute is
+	 * not visible.
+	 */
+	return sprintf(buf, "%d\n", nvdimm_has_flush(nd_region));
+}
+
+static ssize_t deep_flush_store(struct device *dev, struct device_attribute *attr,
+		const char *buf, size_t len)
+{
+	bool flush;
+	int rc = strtobool(buf, &flush);
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	if (rc)
+		return rc;
+	if (!flush)
+		return -EINVAL;
+	nvdimm_flush(nd_region);
+
+	return len;
+}
+static DEVICE_ATTR_RW(deep_flush);
+
 static ssize_t mappings_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -479,6 +508,7 @@ static struct attribute *nd_region_attributes[] = {
 	&dev_attr_btt_seed.attr,
 	&dev_attr_pfn_seed.attr,
 	&dev_attr_dax_seed.attr,
+	&dev_attr_deep_flush.attr,
 	&dev_attr_read_only.attr,
 	&dev_attr_set_cookie.attr,
 	&dev_attr_available_size.attr,
@@ -508,6 +538,17 @@ static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
 	if (!is_nd_pmem(dev) && a == &dev_attr_resource.attr)
 		return 0;
 
+	if (a == &dev_attr_deep_flush.attr) {
+		int has_flush = nvdimm_has_flush(nd_region);
+
+		if (has_flush == 1)
+			return a->mode;
+		else if (has_flush == 0)
+			return 0444;
+		else
+			return 0;
+	}
+
 	if (a != &dev_attr_set_cookie.attr
 			&& a != &dev_attr_available_size.attr)
 		return a->mode;

end of thread, other threads:[~2017-04-27 22:23 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-24 23:49 [PATCH v2 0/2] libnvdimm: export wpq flush interface Dan Williams
2017-04-24 23:49 ` [PATCH v2 1/2] libnvdimm, region: fix flush hint detection crash Dan Williams
2017-04-26 19:43   ` Jeff Moyer
2017-04-26 20:04     ` Dan Williams
2017-04-24 23:50 ` [PATCH v2 2/2] libnvdimm, region: sysfs trigger for nvdimm_flush() Dan Williams
2017-04-25 16:37   ` Ross Zwisler
2017-04-25 16:38     ` Dan Williams
2017-04-25 20:17   ` [PATCH v3] " Dan Williams
2017-04-26 20:38     ` Jeff Moyer
2017-04-26 23:00       ` Dan Williams
2017-04-27 13:45         ` Jeff Moyer
2017-04-27 16:56           ` Dan Williams
2017-04-27 18:41             ` Jeff Moyer
2017-04-27 19:17               ` Dan Williams
2017-04-27 19:21                 ` Dan Williams
2017-04-27 19:43                   ` Jeff Moyer
2017-04-27 19:40                 ` Jeff Moyer
2017-04-27 20:02                   ` Dan Williams
2017-04-27 21:36                     ` Dan Williams
2017-04-26 23:36       ` [PATCH v4] " Dan Williams
2017-04-27 22:17         ` [PATCH v5] " Dan Williams
