* [PATCH v5 0/8] dax/pmem: Provide a dax operation to zero range of memory
From: Vivek Goyal @ 2020-02-18 21:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-nvdimm, hch, dan.j.williams; +Cc: dm-devel

Hi, 

This is V5 of the patches. They are also available at:

https://github.com/rhvgoyal/linux/commits/dax-zero-range-v5

Changes since V4:

- Rebased on top of 5.6-rc2
- Added a separate patch so that pmem_clear_poison() accepts arbitrary
  offset and len and aligns these as needed. This takes away the burden
  of aligning from callers.

Previous versions of the patches are here:
v4:
https://lore.kernel.org/linux-fsdevel/20200217181653.4706-1-vgoyal@redhat.com/
v3:
https://lore.kernel.org/linux-fsdevel/20200214125717.GA18654@redhat.com/T/#t
v2:
https://lore.kernel.org/linux-fsdevel/20200203200029.4592-1-vgoyal@redhat.com/
v1:
https://lore.kernel.org/linux-fsdevel/20200123165249.GA7664@redhat.com/

Thanks
Vivek

Vivek Goyal (8):
  pmem: Add functions for reading/writing page to/from pmem
  drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and
    len
  pmem: Enable pmem_do_write() to deal with arbitrary ranges
  dax, pmem: Add a dax operation zero_page_range
  s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
  dm,dax: Add dax zero_page_range operation
  dax,iomap: Start using dax native zero_page_range()
  dax,iomap: Add helper dax_iomap_zero() to zero a range

 drivers/dax/super.c           |  19 +++++
 drivers/md/dm-linear.c        |  21 ++++++
 drivers/md/dm-log-writes.c    |  19 +++++
 drivers/md/dm-stripe.c        |  26 +++++++
 drivers/md/dm.c               |  31 ++++++++
 drivers/nvdimm/pmem.c         | 134 +++++++++++++++++++++++-----------
 drivers/s390/block/dcssblk.c  |  17 +++++
 fs/dax.c                      |  53 ++++----------
 fs/iomap/buffered-io.c        |   9 +--
 include/linux/dax.h           |  20 ++---
 include/linux/device-mapper.h |   3 +
 11 files changed, 246 insertions(+), 106 deletions(-)

-- 
2.20.1

* [PATCH v5 1/8] pmem: Add functions for reading/writing page to/from pmem
From: Vivek Goyal @ 2020-02-18 21:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-nvdimm, hch, dan.j.williams
  Cc: dm-devel, Christoph Hellwig

Split pmem_do_bvec() into pmem_do_read() and pmem_do_write().
pmem_do_write() will be used by the pmem zero_page_range() implementation
as well, so the write path can share the same code.
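
For context, the reuse this split enables (implemented by patch 4 of this
series, once patch 3 has changed pmem_do_write() to take a byte offset
instead of a sector) boils down to writing the shared zero page through the
new write helper:

	static int pmem_dax_zero_page_range(struct dax_device *dax_dev,
					    u64 offset, size_t len)
	{
		struct pmem_device *pmem = dax_get_private(dax_dev);

		/* Zero by writing the shared zero page; the write path
		 * also clears any known poison in the range. */
		return blk_status_to_errno(pmem_do_write(pmem, ZERO_PAGE(0),
							 0, offset, len));
	}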

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 drivers/nvdimm/pmem.c | 86 +++++++++++++++++++++++++------------------
 1 file changed, 50 insertions(+), 36 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4eae441f86c9..075b11682192 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -136,9 +136,25 @@ static blk_status_t read_pmem(struct page *page, unsigned int off,
 	return BLK_STS_OK;
 }
 
-static blk_status_t pmem_do_bvec(struct pmem_device *pmem, struct page *page,
-			unsigned int len, unsigned int off, unsigned int op,
-			sector_t sector)
+static blk_status_t pmem_do_read(struct pmem_device *pmem,
+			struct page *page, unsigned int page_off,
+			sector_t sector, unsigned int len)
+{
+	blk_status_t rc;
+	phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
+	void *pmem_addr = pmem->virt_addr + pmem_off;
+
+	if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
+		return BLK_STS_IOERR;
+
+	rc = read_pmem(page, page_off, pmem_addr, len);
+	flush_dcache_page(page);
+	return rc;
+}
+
+static blk_status_t pmem_do_write(struct pmem_device *pmem,
+			struct page *page, unsigned int page_off,
+			sector_t sector, unsigned int len)
 {
 	blk_status_t rc = BLK_STS_OK;
 	bool bad_pmem = false;
@@ -148,34 +164,25 @@ static blk_status_t pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 	if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
 		bad_pmem = true;
 
-	if (!op_is_write(op)) {
-		if (unlikely(bad_pmem))
-			rc = BLK_STS_IOERR;
-		else {
-			rc = read_pmem(page, off, pmem_addr, len);
-			flush_dcache_page(page);
-		}
-	} else {
-		/*
-		 * Note that we write the data both before and after
-		 * clearing poison.  The write before clear poison
-		 * handles situations where the latest written data is
-		 * preserved and the clear poison operation simply marks
-		 * the address range as valid without changing the data.
-		 * In this case application software can assume that an
-		 * interrupted write will either return the new good
-		 * data or an error.
-		 *
-		 * However, if pmem_clear_poison() leaves the data in an
-		 * indeterminate state we need to perform the write
-		 * after clear poison.
-		 */
-		flush_dcache_page(page);
-		write_pmem(pmem_addr, page, off, len);
-		if (unlikely(bad_pmem)) {
-			rc = pmem_clear_poison(pmem, pmem_off, len);
-			write_pmem(pmem_addr, page, off, len);
-		}
+	/*
+	 * Note that we write the data both before and after
+	 * clearing poison.  The write before clear poison
+	 * handles situations where the latest written data is
+	 * preserved and the clear poison operation simply marks
+	 * the address range as valid without changing the data.
+	 * In this case application software can assume that an
+	 * interrupted write will either return the new good
+	 * data or an error.
+	 *
+	 * However, if pmem_clear_poison() leaves the data in an
+	 * indeterminate state we need to perform the write
+	 * after clear poison.
+	 */
+	flush_dcache_page(page);
+	write_pmem(pmem_addr, page, page_off, len);
+	if (unlikely(bad_pmem)) {
+		rc = pmem_clear_poison(pmem, pmem_off, len);
+		write_pmem(pmem_addr, page, page_off, len);
 	}
 
 	return rc;
@@ -197,8 +204,12 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 
 	do_acct = nd_iostat_start(bio, &start);
 	bio_for_each_segment(bvec, bio, iter) {
-		rc = pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
-				bvec.bv_offset, bio_op(bio), iter.bi_sector);
+		if (op_is_write(bio_op(bio)))
+			rc = pmem_do_write(pmem, bvec.bv_page, bvec.bv_offset,
+				iter.bi_sector, bvec.bv_len);
+		else
+			rc = pmem_do_read(pmem, bvec.bv_page, bvec.bv_offset,
+				iter.bi_sector, bvec.bv_len);
 		if (rc) {
 			bio->bi_status = rc;
 			break;
@@ -223,9 +234,12 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 	struct pmem_device *pmem = bdev->bd_queue->queuedata;
 	blk_status_t rc;
 
-	rc = pmem_do_bvec(pmem, page, hpage_nr_pages(page) * PAGE_SIZE,
-			  0, op, sector);
-
+	if (op_is_write(op))
+		rc = pmem_do_write(pmem, page, 0, sector,
+				   hpage_nr_pages(page) * PAGE_SIZE);
+	else
+		rc = pmem_do_read(pmem, page, 0, sector,
+				   hpage_nr_pages(page) * PAGE_SIZE);
 	/*
 	 * The ->rw_page interface is subtle and tricky.  The core
 	 * retries on any error, so we can only invoke page_endio() in
-- 
2.20.1

* [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
From: Vivek Goyal @ 2020-02-18 21:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-nvdimm, hch, dan.j.williams; +Cc: dm-devel

Currently pmem_clear_poison() expects offset and len to be sector aligned;
at least that seems to be the assumption the code has been written with. It
is called only from pmem_do_bvec(), which in turn is called only from
pmem_rw_page() and pmem_make_request(), both of which pass only
sector-aligned offsets and lengths.

Soon we want to use this function from the dax_zero_page_range() code path,
which can try to zero an arbitrary range of memory within a page. So update
this function to assume that offset and length can be arbitrary, and to do
the necessary alignment as needed.

nvdimm_clear_poison() seems to assume offset and len are aligned to the
clear_err_unit boundary. But that is currently an internal detail which is
not exported for others to use. So for now, continue to align offset and
length to the SECTOR_SIZE boundary; aligning to the clear_err_unit boundary
instead is a TODO item for the future.
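
To make the inward alignment concrete, a worked example using the same
ALIGN()/ALIGN_DOWN() arithmetic as the hunk below (SECTOR_SIZE is 512; the
numbers are illustrative, not taken from the patch):

	/* offset = 700, len = 1000, i.e. the byte range [700, 1700) */
	start_aligned = ALIGN(700, 512);               /* rounds up to 1024 */
	end_aligned = ALIGN_DOWN(700 + 1000, 512) - 1; /* rounds down to 1535 */
	len_aligned = 1535 - 1024 + 1;                 /* 512: one full sector */

Note the alignment is inward, so only whole sectors fully contained in the
requested range have their poison cleared.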

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 drivers/nvdimm/pmem.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 075b11682192..e72959203253 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -74,14 +74,28 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
 	sector_t sector;
 	long cleared;
 	blk_status_t rc = BLK_STS_OK;
+	phys_addr_t start_aligned, end_aligned;
+	unsigned int len_aligned;
 
-	sector = (offset - pmem->data_offset) / 512;
+	/*
+	 * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
+	 * expects memory offset and length to meet certain alignment
+	 * restriction (clear_err_unit). Currently nvdimm does not export
+	 * required alignment. So align offset and length to sector boundary
+	 * before passing it to nvdimm_clear_poison().
+	 */
+	start_aligned = ALIGN(offset, SECTOR_SIZE);
+	end_aligned = ALIGN_DOWN((offset + len), SECTOR_SIZE) - 1;
+	len_aligned = end_aligned - start_aligned + 1;
+
+	sector = (start_aligned - pmem->data_offset) / 512;
 
-	cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
-	if (cleared < len)
+	cleared = nvdimm_clear_poison(dev, pmem->phys_addr + start_aligned,
+				      len_aligned);
+	if (cleared < len_aligned)
 		rc = BLK_STS_IOERR;
 	if (cleared > 0 && cleared / 512) {
-		hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
+		hwpoison_clear(pmem, pmem->phys_addr + start_aligned, cleared);
 		cleared /= 512;
 		dev_dbg(dev, "%#llx clear %ld sector%s\n",
 				(unsigned long long) sector, cleared,
-- 
2.20.1

* [PATCH v5 3/8] pmem: Enable pmem_do_write() to deal with arbitrary ranges
From: Vivek Goyal @ 2020-02-18 21:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-nvdimm, hch, dan.j.williams; +Cc: dm-devel

Currently pmem_do_write() is written with the assumption that all I/O is
sector aligned. Soon I want to use this function in zero_page_range(),
where the range passed in does not have to be sector aligned.

Modify this function to be able to deal with an arbitrary range, specified
by pmem_off and len.
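
A worked example of the sector-range computation the hunk below performs
(SECTOR_SIZE = 512, SECTOR_SHIFT = 9; the numbers are illustrative):

	/* pmem_off = 700, len = 1000, i.e. the byte range [700, 1700) */
	sector_start = DIV_ROUND_UP(700, 512); /* 2: first full sector inside */
	sector_end = (700 + 1000) >> 9;        /* 3: first sector past the end */
	nr_sectors = 3 - 2;                    /* 1 full sector to check */

is_bad_pmem() is consulted only for the whole sectors contained in the
range; if the range covers no complete sector, the badblock check is
skipped entirely.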

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 drivers/nvdimm/pmem.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index e72959203253..3c46e9e6d04c 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -168,15 +168,23 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
 
 static blk_status_t pmem_do_write(struct pmem_device *pmem,
 			struct page *page, unsigned int page_off,
-			sector_t sector, unsigned int len)
+			u64 pmem_off, unsigned int len)
 {
 	blk_status_t rc = BLK_STS_OK;
 	bool bad_pmem = false;
-	phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
-	void *pmem_addr = pmem->virt_addr + pmem_off;
-
-	if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
-		bad_pmem = true;
+	phys_addr_t pmem_real_off = pmem_off + pmem->data_offset;
+	void *pmem_addr = pmem->virt_addr + pmem_real_off;
+	sector_t sector_start, sector_end;
+	unsigned nr_sectors;
+
+	sector_start = DIV_ROUND_UP(pmem_off, SECTOR_SIZE);
+	sector_end = (pmem_off + len) >> SECTOR_SHIFT;
+	if (sector_end > sector_start) {
+		nr_sectors = sector_end - sector_start;
+		if (is_bad_pmem(&pmem->bb, sector_start,
+				nr_sectors << SECTOR_SHIFT))
+			bad_pmem = true;
+	}
 
 	/*
 	 * Note that we write the data both before and after
@@ -195,7 +203,7 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
 	flush_dcache_page(page);
 	write_pmem(pmem_addr, page, page_off, len);
 	if (unlikely(bad_pmem)) {
-		rc = pmem_clear_poison(pmem, pmem_off, len);
+		rc = pmem_clear_poison(pmem, pmem_real_off, len);
 		write_pmem(pmem_addr, page, page_off, len);
 	}
 
@@ -220,7 +228,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 	bio_for_each_segment(bvec, bio, iter) {
 		if (op_is_write(bio_op(bio)))
 			rc = pmem_do_write(pmem, bvec.bv_page, bvec.bv_offset,
-				iter.bi_sector, bvec.bv_len);
+				iter.bi_sector << SECTOR_SHIFT, bvec.bv_len);
 		else
 			rc = pmem_do_read(pmem, bvec.bv_page, bvec.bv_offset,
 				iter.bi_sector, bvec.bv_len);
@@ -249,7 +257,7 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 	blk_status_t rc;
 
 	if (op_is_write(op))
-		rc = pmem_do_write(pmem, page, 0, sector,
+		rc = pmem_do_write(pmem, page, 0, sector << SECTOR_SHIFT,
 				   hpage_nr_pages(page) * PAGE_SIZE);
 	else
 		rc = pmem_do_read(pmem, page, 0, sector,
-- 
2.20.1

* [PATCH v5 4/8] dax, pmem: Add a dax operation zero_page_range
From: Vivek Goyal @ 2020-02-18 21:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-nvdimm, hch, dan.j.williams
  Cc: dm-devel, Christoph Hellwig

Add a dax operation, zero_page_range, to zero a range of memory. This will
also clear any poison in the range being zeroed.

As of now, zeroing of at most one page is allowed in a single call, and
there are no callers which try to zero more than a page per call. Once
callers that zero more than a page in a single call appear, that support
can be added. The primary reason for not doing it yet is that it would add
some complexity to the dm implementation, where a range might span multiple
underlying targets and would have to be split into sub-ranges, with
zero_page_range() called on each target individually; see the sketch after
this paragraph.
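
If such callers appear before dm learns to split ranges, the split could
also live on the caller side. A hypothetical sketch (dax_zero_range() is a
made-up name, not part of this series):

	/* Hypothetical: zero an arbitrary range one page at a time. */
	static int dax_zero_range(struct dax_device *dax_dev, u64 offset,
				  u64 len)
	{
		while (len) {
			/* Never cross a page boundary in one call, as
			 * dax_zero_page_range() requires. */
			u64 todo = min_t(u64, len,
					 PAGE_SIZE - offset_in_page(offset));
			int rc = dax_zero_page_range(dax_dev, offset, todo);

			if (rc)
				return rc;
			offset += todo;
			len -= todo;
		}
		return 0;
	}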

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 drivers/dax/super.c   | 19 +++++++++++++++++++
 drivers/nvdimm/pmem.c | 10 ++++++++++
 include/linux/dax.h   |  3 +++
 3 files changed, 32 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0aa4b6bc5101..c912808bc886 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -344,6 +344,25 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 }
 EXPORT_SYMBOL_GPL(dax_copy_to_iter);
 
+int dax_zero_page_range(struct dax_device *dax_dev, u64 offset, size_t len)
+{
+	if (!dax_alive(dax_dev))
+		return -ENXIO;
+
+	if (!dax_dev->ops->zero_page_range)
+		return -EOPNOTSUPP;
+	/*
+	 * There are no callers that want to zero across a page boundary as of
+	 * now. Once users are there, this check can be removed after the
+	 * device mapper code has been updated to split ranges across targets.
+	 */
+	if (offset_in_page(offset) + len > PAGE_SIZE)
+		return -EIO;
+
+	return dax_dev->ops->zero_page_range(dax_dev, offset, len);
+}
+EXPORT_SYMBOL_GPL(dax_zero_page_range);
+
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3c46e9e6d04c..e17f9f56d6fe 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -304,6 +304,15 @@ static const struct block_device_operations pmem_fops = {
 	.revalidate_disk =	nvdimm_revalidate_disk,
 };
 
+static int pmem_dax_zero_page_range(struct dax_device *dax_dev, u64 offset,
+				    size_t len)
+{
+	struct pmem_device *pmem = dax_get_private(dax_dev);
+
+	return blk_status_to_errno(pmem_do_write(pmem, ZERO_PAGE(0), 0, offset,
+				   len));
+}
+
 static long pmem_dax_direct_access(struct dax_device *dax_dev,
 		pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
 {
@@ -335,6 +344,7 @@ static const struct dax_operations pmem_dax_ops = {
 	.dax_supported = generic_fsdax_supported,
 	.copy_from_iter = pmem_copy_from_iter,
 	.copy_to_iter = pmem_copy_to_iter,
+	.zero_page_range = pmem_dax_zero_page_range,
 };
 
 static const struct attribute_group *pmem_attribute_groups[] = {
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 328c2dbb4409..93a663c26d6a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -34,6 +34,8 @@ struct dax_operations {
 	/* copy_to_iter: required operation for fs-dax direct-i/o */
 	size_t (*copy_to_iter)(struct dax_device *, pgoff_t, void *, size_t,
 			struct iov_iter *);
+	/* zero_page_range: required operation. Zero range within a page */
+	int (*zero_page_range)(struct dax_device *, u64, size_t);
 };
 
 extern struct attribute_group dax_attribute_group;
@@ -199,6 +201,7 @@ size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 		size_t bytes, struct iov_iter *i);
 size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 		size_t bytes, struct iov_iter *i);
+int dax_zero_page_range(struct dax_device *dax_dev, u64 offset, size_t len);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
-- 
2.20.1

* [PATCH v5 5/8] s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
From: Vivek Goyal @ 2020-02-18 21:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-nvdimm, hch, dan.j.williams
  Cc: dm-devel, linux-s390, Gerald Schaefer

Add the dax zero_page_range operation to the dcssblk driver.

CC: linux-s390@vger.kernel.org
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 drivers/s390/block/dcssblk.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 63502ca537eb..331abab5d066 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -57,11 +57,28 @@ static size_t dcssblk_dax_copy_to_iter(struct dax_device *dax_dev,
 	return copy_to_iter(addr, bytes, i);
 }
 
+static int dcssblk_dax_zero_page_range(struct dax_device *dax_dev, u64 offset,
+				       size_t len)
+{
+	long rc;
+	void *kaddr;
+	pgoff_t pgoff = offset >> PAGE_SHIFT;
+	unsigned page_offset = offset_in_page(offset);
+
+	rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
+	if (rc < 0)
+		return rc;
+	memset(kaddr + page_offset, 0, len);
+	dax_flush(dax_dev, kaddr + page_offset, len);
+	return 0;
+}
+
 static const struct dax_operations dcssblk_dax_ops = {
 	.direct_access = dcssblk_dax_direct_access,
 	.dax_supported = generic_fsdax_supported,
 	.copy_from_iter = dcssblk_dax_copy_from_iter,
 	.copy_to_iter = dcssblk_dax_copy_to_iter,
+	.zero_page_range = dcssblk_dax_zero_page_range,
 };
 
 struct dcssblk_dev_info {
-- 
2.20.1

* [PATCH v5 6/8] dm,dax: Add dax zero_page_range operation
From: Vivek Goyal @ 2020-02-18 21:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-nvdimm, hch, dan.j.williams; +Cc: dm-devel

This patch adds support for the dax zero_page_range operation to dm targets.
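
Each per-target hook below follows the same pattern: convert the byte
offset to a page-aligned sector, remap that sector through the target,
translate back to a page offset on the underlying dax device with
bdev_dax_pgoff(), and call dax_zero_page_range() on that device. Condensed
from the linear target implementation in the first hunk below:

	pgoff_t pgoff = offset >> PAGE_SHIFT;
	sector_t sector = pgoff * PAGE_SECTORS;              /* page-aligned */
	sector_t dev_sector = linear_map_sector(ti, sector); /* remap to lc->dev */

	ret = bdev_dax_pgoff(bdev, dev_sector, ALIGN(len, PAGE_SIZE), &pgoff);
	if (ret)
		return ret;
	return dax_zero_page_range(dax_dev,
			(pgoff << PAGE_SHIFT) + offset_in_page(offset), len);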

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 drivers/md/dm-linear.c        | 21 +++++++++++++++++++++
 drivers/md/dm-log-writes.c    | 19 +++++++++++++++++++
 drivers/md/dm-stripe.c        | 26 ++++++++++++++++++++++++++
 drivers/md/dm.c               | 31 +++++++++++++++++++++++++++++++
 include/linux/device-mapper.h |  3 +++
 5 files changed, 100 insertions(+)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 8d07fdf63a47..03f99e6ad372 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -201,10 +201,30 @@ static size_t linear_dax_copy_to_iter(struct dm_target *ti, pgoff_t pgoff,
 	return dax_copy_to_iter(dax_dev, pgoff, addr, bytes, i);
 }
 
+static int linear_dax_zero_page_range(struct dm_target *ti, u64 offset,
+				      size_t len)
+{
+	int ret;
+	struct linear_c *lc = ti->private;
+	struct block_device *bdev = lc->dev->bdev;
+	struct dax_device *dax_dev = lc->dev->dax_dev;
+	pgoff_t pgoff = offset >> PAGE_SHIFT;
+	unsigned page_offset = offset_in_page(offset);
+	sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+
+	dev_sector = linear_map_sector(ti, sector);
+	ret = bdev_dax_pgoff(bdev, dev_sector, ALIGN(len, PAGE_SIZE), &pgoff);
+	if (ret)
+		return ret;
+	return dax_zero_page_range(dax_dev, (pgoff << PAGE_SHIFT) + page_offset,
+				   len);
+}
+
 #else
 #define linear_dax_direct_access NULL
 #define linear_dax_copy_from_iter NULL
 #define linear_dax_copy_to_iter NULL
+#define linear_dax_zero_page_range NULL
 #endif
 
 static struct target_type linear_target = {
@@ -226,6 +246,7 @@ static struct target_type linear_target = {
 	.direct_access = linear_dax_direct_access,
 	.dax_copy_from_iter = linear_dax_copy_from_iter,
 	.dax_copy_to_iter = linear_dax_copy_to_iter,
+	.dax_zero_page_range = linear_dax_zero_page_range,
 };
 
 int __init dm_linear_init(void)
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index 99721c76225d..f36ee223cb60 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -994,10 +994,28 @@ static size_t log_writes_dax_copy_to_iter(struct dm_target *ti,
 	return dax_copy_to_iter(lc->dev->dax_dev, pgoff, addr, bytes, i);
 }
 
+static int log_writes_dax_zero_page_range(struct dm_target *ti, u64 offset,
+					  size_t len)
+{
+	int ret;
+	struct log_writes_c *lc = ti->private;
+	pgoff_t pgoff = offset >> PAGE_SHIFT;
+	unsigned page_offset = offset_in_page(offset);
+	sector_t sector = pgoff * PAGE_SECTORS;
+
+	ret = bdev_dax_pgoff(lc->dev->bdev, sector, ALIGN(len, PAGE_SIZE),
+			     &pgoff);
+	if (ret)
+		return ret;
+	return dax_zero_page_range(lc->dev->dax_dev,
+				   (pgoff << PAGE_SHIFT) + page_offset, len);
+}
+
 #else
 #define log_writes_dax_direct_access NULL
 #define log_writes_dax_copy_from_iter NULL
 #define log_writes_dax_copy_to_iter NULL
+#define log_writes_dax_zero_page_range NULL
 #endif
 
 static struct target_type log_writes_target = {
@@ -1016,6 +1034,7 @@ static struct target_type log_writes_target = {
 	.direct_access = log_writes_dax_direct_access,
 	.dax_copy_from_iter = log_writes_dax_copy_from_iter,
 	.dax_copy_to_iter = log_writes_dax_copy_to_iter,
+	.dax_zero_page_range = log_writes_dax_zero_page_range,
 };
 
 static int __init dm_log_writes_init(void)
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 63bbcc20f49a..f5e17284c615 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -360,10 +360,35 @@ static size_t stripe_dax_copy_to_iter(struct dm_target *ti, pgoff_t pgoff,
 	return dax_copy_to_iter(dax_dev, pgoff, addr, bytes, i);
 }
 
+static int stripe_dax_zero_page_range(struct dm_target *ti, u64 offset,
+				      size_t len)
+{
+	int ret;
+	pgoff_t pgoff = offset >> PAGE_SHIFT;
+	unsigned page_offset = offset_in_page(offset);
+	sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+	struct stripe_c *sc = ti->private;
+	struct dax_device *dax_dev;
+	struct block_device *bdev;
+	uint32_t stripe;
+
+	stripe_map_sector(sc, sector, &stripe, &dev_sector);
+	dev_sector += sc->stripe[stripe].physical_start;
+	dax_dev = sc->stripe[stripe].dev->dax_dev;
+	bdev = sc->stripe[stripe].dev->bdev;
+
+	ret = bdev_dax_pgoff(bdev, dev_sector, ALIGN(len, PAGE_SIZE), &pgoff);
+	if (ret)
+		return ret;
+	return dax_zero_page_range(dax_dev, (pgoff << PAGE_SHIFT) + page_offset,
+				   len);
+}
+
 #else
 #define stripe_dax_direct_access NULL
 #define stripe_dax_copy_from_iter NULL
 #define stripe_dax_copy_to_iter NULL
+#define stripe_dax_zero_page_range NULL
 #endif
 
 /*
@@ -486,6 +511,7 @@ static struct target_type stripe_target = {
 	.direct_access = stripe_dax_direct_access,
 	.dax_copy_from_iter = stripe_dax_copy_from_iter,
 	.dax_copy_to_iter = stripe_dax_copy_to_iter,
+	.dax_zero_page_range = stripe_dax_zero_page_range,
 };
 
 int __init dm_stripe_init(void)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b89f07ee2eff..c87cabdf7f18 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1198,6 +1198,36 @@ static size_t dm_dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 	return ret;
 }
 
+static int dm_dax_zero_page_range(struct dax_device *dax_dev, u64 offset,
+				  size_t len)
+{
+	struct mapped_device *md = dax_get_private(dax_dev);
+	pgoff_t pgoff = offset >> PAGE_SHIFT;
+	sector_t sector = pgoff * PAGE_SECTORS;
+	struct dm_target *ti;
+	int ret = -EIO;
+	int srcu_idx;
+
+	ti = dm_dax_get_live_target(md, sector, &srcu_idx);
+
+	if (!ti)
+		goto out;
+	if (WARN_ON(!ti->type->dax_zero_page_range)) {
+		/*
+		 * ->zero_page_range() is a mandatory dax operation. If we
+		 * are here, something is wrong. Note dm_put_live_table()
+		 * is done once, at the "out" label below.
+		 */
+		goto out;
+	}
+	ret = ti->type->dax_zero_page_range(ti, offset, len);
+
+ out:
+	dm_put_live_table(md, srcu_idx);
+
+	return ret;
+}
+
 /*
  * A target may call dm_accept_partial_bio only from the map routine.  It is
  * allowed for all bio types except REQ_PREFLUSH, REQ_OP_ZONE_RESET,
@@ -3199,6 +3229,7 @@ static const struct dax_operations dm_dax_ops = {
 	.dax_supported = dm_dax_supported,
 	.copy_from_iter = dm_dax_copy_from_iter,
 	.copy_to_iter = dm_dax_copy_to_iter,
+	.zero_page_range = dm_dax_zero_page_range,
 };
 
 /*
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 475668c69dbc..b4ef5b07be74 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -141,6 +141,8 @@ typedef long (*dm_dax_direct_access_fn) (struct dm_target *ti, pgoff_t pgoff,
 		long nr_pages, void **kaddr, pfn_t *pfn);
 typedef size_t (*dm_dax_copy_iter_fn)(struct dm_target *ti, pgoff_t pgoff,
 		void *addr, size_t bytes, struct iov_iter *i);
+typedef int (*dm_dax_zero_page_range_fn)(struct dm_target *ti, u64 offset,
+		size_t len);
 #define PAGE_SECTORS (PAGE_SIZE / 512)
 
 void dm_error(const char *message);
@@ -195,6 +197,7 @@ struct target_type {
 	dm_dax_direct_access_fn direct_access;
 	dm_dax_copy_iter_fn dax_copy_from_iter;
 	dm_dax_copy_iter_fn dax_copy_to_iter;
+	dm_dax_zero_page_range_fn dax_zero_page_range;
 
 	/* For internal device-mapper use. */
 	struct list_head list;
-- 
2.20.1

* [PATCH v5 7/8] dax,iomap: Start using dax native zero_page_range()
From: Vivek Goyal @ 2020-02-18 21:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-nvdimm, hch, dan.j.williams
  Cc: dm-devel, Christoph Hellwig

Get rid of calling the block device interface for zeroing in the iomap dax
zeroing path, and use the native dax zeroing interface instead.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/dax.c | 45 +++++++++------------------------------------
 1 file changed, 9 insertions(+), 36 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 35da144375a0..f8ae0a9984fa 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1038,48 +1038,21 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
 	return ret;
 }
 
-static bool dax_range_is_aligned(struct block_device *bdev,
-				 unsigned int offset, unsigned int length)
-{
-	unsigned short sector_size = bdev_logical_block_size(bdev);
-
-	if (!IS_ALIGNED(offset, sector_size))
-		return false;
-	if (!IS_ALIGNED(length, sector_size))
-		return false;
-
-	return true;
-}
-
 int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int size)
 {
-	if (dax_range_is_aligned(bdev, offset, size)) {
-		sector_t start_sector = sector + (offset >> 9);
-
-		return blkdev_issue_zeroout(bdev, start_sector,
-				size >> 9, GFP_NOFS, 0);
-	} else {
-		pgoff_t pgoff;
-		long rc, id;
-		void *kaddr;
+	pgoff_t pgoff;
+	long rc, id;
 
-		rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
-		if (rc)
-			return rc;
+	rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
+	if (rc)
+		return rc;
 
-		id = dax_read_lock();
-		rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
-		if (rc < 0) {
-			dax_read_unlock(id);
-			return rc;
-		}
-		memset(kaddr + offset, 0, size);
-		dax_flush(dax_dev, kaddr + offset, size);
-		dax_read_unlock(id);
-	}
-	return 0;
+	id = dax_read_lock();
+	rc = dax_zero_page_range(dax_dev, (pgoff << PAGE_SHIFT) + offset, size);
+	dax_read_unlock(id);
+	return rc;
 }
 EXPORT_SYMBOL_GPL(__dax_zero_page_range);
 
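
To make the new call concrete (a hypothetical example, assuming 4K pages
and a dax device whose data starts at sector 0): zeroing 100 bytes at byte
100 of the second page means sector = 8, bdev_dax_pgoff() yields pgoff = 1,
and the call becomes dax_zero_page_range(dax_dev, (1 << PAGE_SHIFT) + 100,
100), i.e. byte offset 4196 into the device.
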
-- 
2.20.1


* [PATCH v5 8/8] dax,iomap: Add helper dax_iomap_zero() to zero a range
  2020-02-18 21:48 ` Vivek Goyal
@ 2020-02-18 21:48   ` Vivek Goyal
  -1 siblings, 0 replies; 103+ messages in thread
From: Vivek Goyal @ 2020-02-18 21:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-nvdimm, hch, dan.j.williams
  Cc: dm-devel, Christoph Hellwig

Add a helper dax_iomap_zero() to zero a range. This patch basically
merges __dax_zero_page_range() and iomap_dax_zero().

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/dax.c               | 12 ++++++------
 fs/iomap/buffered-io.c |  9 +--------
 include/linux/dax.h    | 17 +++--------------
 3 files changed, 10 insertions(+), 28 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index f8ae0a9984fa..4be581009db5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1038,23 +1038,23 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
 	return ret;
 }
 
-int __dax_zero_page_range(struct block_device *bdev,
-		struct dax_device *dax_dev, sector_t sector,
-		unsigned int offset, unsigned int size)
+int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
+		   struct iomap *iomap)
 {
 	pgoff_t pgoff;
 	long rc, id;
+	sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
 
-	rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
+	rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, &pgoff);
 	if (rc)
 		return rc;
 
 	id = dax_read_lock();
-	rc = dax_zero_page_range(dax_dev, (pgoff << PAGE_SHIFT) + offset, size);
+	rc = dax_zero_page_range(iomap->dax_dev, (pgoff << PAGE_SHIFT) + offset,
+				 size);
 	dax_read_unlock(id);
 	return rc;
 }
-EXPORT_SYMBOL_GPL(__dax_zero_page_range);
 
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 7c84c4c027c4..6f750da545e5 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -974,13 +974,6 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
 	return iomap_write_end(inode, pos, bytes, bytes, page, iomap, srcmap);
 }
 
-static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
-		struct iomap *iomap)
-{
-	return __dax_zero_page_range(iomap->bdev, iomap->dax_dev,
-			iomap_sector(iomap, pos & PAGE_MASK), offset, bytes);
-}
-
 static loff_t
 iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		void *data, struct iomap *iomap, struct iomap *srcmap)
@@ -1000,7 +993,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
 
 		if (IS_DAX(inode))
-			status = iomap_dax_zero(pos, offset, bytes, iomap);
+			status = dax_iomap_zero(pos, offset, bytes, iomap);
 		else
 			status = iomap_zero(inode, pos, offset, bytes, iomap,
 					srcmap);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 93a663c26d6a..7b681acfc522 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -13,6 +13,7 @@
 typedef unsigned long dax_entry_t;
 
 struct iomap_ops;
+struct iomap;
 struct dax_device;
 struct dax_operations {
 	/*
@@ -213,20 +214,8 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
-
-#ifdef CONFIG_FS_DAX
-int __dax_zero_page_range(struct block_device *bdev,
-		struct dax_device *dax_dev, sector_t sector,
-		unsigned int offset, unsigned int length);
-#else
-static inline int __dax_zero_page_range(struct block_device *bdev,
-		struct dax_device *dax_dev, sector_t sector,
-		unsigned int offset, unsigned int length)
-{
-	return -ENXIO;
-}
-#endif
-
+int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
+			struct iomap *iomap);
 static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
-- 
2.20.1


* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-18 21:48   ` Vivek Goyal
@ 2020-02-20 16:17     ` Christoph Hellwig
  -1 siblings, 0 replies; 103+ messages in thread
From: Christoph Hellwig @ 2020-02-20 16:17 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-fsdevel, linux-nvdimm, hch, dm-devel

On Tue, Feb 18, 2020 at 04:48:35PM -0500, Vivek Goyal wrote:
> Currently pmem_clear_poison() expects offset and len to be sector aligned.
> At least that seems to be the assumption with which the code has been written.
> It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> and pmem_make_request() which will only pass sector aligned offset and len.
> 
> Soon we want to use this function from the dax_zero_page_range() code path which
> can try to zero an arbitrary range of memory within a page. So update this
> function to assume that offset and length can be arbitrary and do the
> necessary alignments as needed.
> 
> nvdimm_clear_poison() seems to assume offset and len to be aligned to
> clear_err_unit boundary. But this is currently internal detail and is
> not exported for others to use. So for now, continue to align offset and
> length to SECTOR_SIZE boundary. Improving it further and to align it
> to clear_err_unit boundary is a TODO item for future.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

This looks sensible to me, but I'd really like to have Dan take a look.


* Re: [PATCH v5 3/8] pmem: Enable pmem_do_write() to deal with arbitrary ranges
  2020-02-18 21:48   ` Vivek Goyal
@ 2020-02-20 16:17     ` Christoph Hellwig
  -1 siblings, 0 replies; 103+ messages in thread
From: Christoph Hellwig @ 2020-02-20 16:17 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-fsdevel, linux-nvdimm, hch, dm-devel

On Tue, Feb 18, 2020 at 04:48:36PM -0500, Vivek Goyal wrote:
> Currently pmem_do_write() is written with the assumption that all I/O is
> sector aligned. Soon I want to use this function in zero_page_range(),
> where the range passed in does not have to be sector aligned.
> 
> Modify this function to be able to deal with an arbitrary range, which
> is specified by pmem_off and len.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-18 21:48   ` Vivek Goyal
@ 2020-02-20 21:35     ` Jeff Moyer
  -1 siblings, 0 replies; 103+ messages in thread
From: Jeff Moyer @ 2020-02-20 21:35 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-fsdevel, linux-nvdimm, hch, dm-devel

Vivek Goyal <vgoyal@redhat.com> writes:

> Currently pmem_clear_poison() expects offset and len to be sector aligned.
> At least that seems to be the assumption with which the code has been written.
> It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> and pmem_make_request() which will only pass sector aligned offset and len.
>
> Soon we want to use this function from the dax_zero_page_range() code path which
> can try to zero an arbitrary range of memory within a page. So update this
> function to assume that offset and length can be arbitrary and do the
> necessary alignments as needed.

What caller will try to zero a range that is smaller than a sector?

> nvdimm_clear_poison() seems to assume offset and len to be aligned to
> clear_err_unit boundary. But this is currently internal detail and is
> not exported for others to use. So for now, continue to align offset and
> length to SECTOR_SIZE boundary. Improving it further and to align it
> to clear_err_unit boundary is a TODO item for future.

When there is a poisoned range of persistent memory, it is recorded by
the badblocks infrastructure, which currently operates on sectors.  So,
no matter what the error unit is for the hardware, we currently can't
record/report to userspace anything smaller than a sector, and so that
is what we expect when clearing errors.

Continuing on for completeness, we will currently not map a page with
badblocks into a process' address space.  So, let's say you have 256
bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
you access a valid mmap()d address in the same page as the poisoned
memory, you will get a segfault.

Userspace can fix up the error by calling write(2) and friends to
provide new data, or by punching a hole and writing new data to the hole
(which may result in getting a new block, or reallocating the old block
and zeroing it, which will clear the error).
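
For illustration, a minimal sketch of that second recovery path from
userspace (assuming a file on a DAX filesystem; the function name and
sizes here are made up for the example, error handling trimmed):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Replace a poisoned extent: punch it out, then write fresh data. */
static int recover_range(int fd, off_t off, size_t len, const void *newdata)
{
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      off, len) < 0)
		return -1;
	if (pwrite(fd, newdata, len, off) < (ssize_t)len)
		return -1;
	return fsync(fd);
}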

More comments below...

> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  drivers/nvdimm/pmem.c | 22 ++++++++++++++++++----
>  1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 075b11682192..e72959203253 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -74,14 +74,28 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
>  	sector_t sector;
>  	long cleared;
>  	blk_status_t rc = BLK_STS_OK;
> +	phys_addr_t start_aligned, end_aligned;
> +	unsigned int len_aligned;
>  
> -	sector = (offset - pmem->data_offset) / 512;
> +	/*
> +	 * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
> +	 * expects memory offset and length to meet certain alignment
> +	 * restrction (clear_err_unit). Currently nvdimm does not export
                                                  ^^^^^^^^^^^^^^^^^^^^^^
> +	 * required alignment. So align offset and length to sector boundary

What is "nvdimm" in that sentence?  Because the nvdimm most certainly
does export the required alignment.  Perhaps you meant libnvdimm?

> +	 * before passing it to nvdimm_clear_poison().
> +	 */
> +	start_aligned = ALIGN(offset, SECTOR_SIZE);
> +	end_aligned = ALIGN_DOWN((offset + len), SECTOR_SIZE) - 1;
> +	len_aligned = end_aligned - start_aligned + 1;
> +
> +	sector = (start_aligned - pmem->data_offset) / 512;
>  
> -	cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
> -	if (cleared < len)
> +	cleared = nvdimm_clear_poison(dev, pmem->phys_addr + start_aligned,
> +				      len_aligned);
> +	if (cleared < len_aligned)
>  		rc = BLK_STS_IOERR;
>  	if (cleared > 0 && cleared / 512) {
> -		hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
> +		hwpoison_clear(pmem, pmem->phys_addr + start_aligned, cleared);
>  		cleared /= 512;
>  		dev_dbg(dev, "%#llx clear %ld sector%s\n",
>  				(unsigned long long) sector, cleared,

We could potentially support clearing less than a sector, but I'd have
to understand the use cases better before offering implementation
suggestions.

-Jeff


* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-20 21:35     ` Jeff Moyer
@ 2020-02-20 21:57       ` Vivek Goyal
  -1 siblings, 0 replies; 103+ messages in thread
From: Vivek Goyal @ 2020-02-20 21:57 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-fsdevel, linux-nvdimm, hch, dm-devel

On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > At least that seems to be the assumption with which the code has been written.
> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> > and pmem_make_request() which will only pass sector aligned offset and len.
> >
> > Soon we want to use this function from the dax_zero_page_range() code path which
> > can try to zero an arbitrary range of memory within a page. So update this
> > function to assume that offset and length can be arbitrary and do the
> > necessary alignments as needed.
> 
> What caller will try to zero a range that is smaller than a sector?

Hi Jeff,

New dax zeroing interface (dax_zero_page_range()) can technically pass
a range which is less than a sector. Or which is bigger than a sector
but start and end are not aligned on sector boundaries.

At this point of time, all I care about is that case of an arbitrary
range is handled well. So if a caller passes a range in, we figure
out subrange which is sector aligned in terms of start and end, and
clear poison on those sectors and ignore rest of the range. And
this itself will be an improvement over current behavior where 
nothing is cleared if I/O is not sector aligned.
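
To make that concrete with the math from the patch (a worked example,
assuming a 512-byte SECTOR_SIZE and one poisoned 4096-byte page): for
offset = 24 and len = 4072, ALIGN(24, 512) = 512 and
ALIGN_DOWN(24 + 4072, 512) - 1 = 4095, so poison is cleared on sectors
1-7, while bytes 24-511 in sector 0 are only written with zeroes.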

> 
> > nvdimm_clear_poison() seems to assume offset and len to be aligned to
> > clear_err_unit boundary. But this is currently internal detail and is
> > not exported for others to use. So for now, continue to align offset and
> > length to SECTOR_SIZE boundary. Improving it further and to align it
> > to clear_err_unit boundary is a TODO item for future.
> 
> When there is a poisoned range of persistent memory, it is recorded by
> the badblocks infrastructure, which currently operates on sectors.  So,
> no matter what the error unit is for the hardware, we currently can't
> record/report to userspace anything smaller than a sector, and so that
> is what we expect when clearing errors.
> 
> Continuing on for completeness, we will currently not map a page with
> badblocks into a process' address space.  So, let's say you have 256
> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
> you access a valid mmap()d address in the same page as the poisoned
> memory, you will get a segfault.
> 
> Userspace can fix up the error by calling write(2) and friends to
> provide new data, or by punching a hole and writing new data to the hole
> (which may result in getting a new block, or reallocating the old block
> and zeroing it, which will clear the error).

Fair enough. I do not need poison clearing at finer granularity. It might
be needed once dev_dax path wants to clear poison. Not sure how exactly
that works.

> 
> More comments below...
> 
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  drivers/nvdimm/pmem.c | 22 ++++++++++++++++++----
> >  1 file changed, 18 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index 075b11682192..e72959203253 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -74,14 +74,28 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
> >  	sector_t sector;
> >  	long cleared;
> >  	blk_status_t rc = BLK_STS_OK;
> > +	phys_addr_t start_aligned, end_aligned;
> > +	unsigned int len_aligned;
> >  
> > -	sector = (offset - pmem->data_offset) / 512;
> > +	/*
> > +	 * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
> > +	 * expects memory offset and length to meet certain alignment
> > +	 * restrction (clear_err_unit). Currently nvdimm does not export
>                                                   ^^^^^^^^^^^^^^^^^^^^^^
> > +	 * required alignment. So align offset and length to sector boundary
> 
> What is "nvdimm" in that sentence?  Because the nvdimm most certainly
> does export the required alignment.  Perhaps you meant libnvdimm?

I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever
it is called. It first queries the alignment required (clear_err_unit) and
then makes sure range passed in meets that alignment requirement.

> 
> > +	 * before passing it to nvdimm_clear_poison().
> > +	 */
> > +	start_aligned = ALIGN(offset, SECTOR_SIZE);
> > +	end_aligned = ALIGN_DOWN((offset + len), SECTOR_SIZE) - 1;
> > +	len_aligned = end_aligned - start_aligned + 1;
> > +
> > +	sector = (start_aligned - pmem->data_offset) / 512;
> >  
> > -	cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
> > -	if (cleared < len)
> > +	cleared = nvdimm_clear_poison(dev, pmem->phys_addr + start_aligned,
> > +				      len_aligned);
> > +	if (cleared < len_aligned)
> >  		rc = BLK_STS_IOERR;
> >  	if (cleared > 0 && cleared / 512) {
> > -		hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
> > +		hwpoison_clear(pmem, pmem->phys_addr + start_aligned, cleared);
> >  		cleared /= 512;
> >  		dev_dbg(dev, "%#llx clear %ld sector%s\n",
> >  				(unsigned long long) sector, cleared,
> 
> We could potentially support clearing less than a sector, but I'd have
> to understand the use cases better before offering implementation
> suggestions.

I don't need clearing less than a sector. Once somebody needs it they
can implement it. All I am doing is making sure current logic is not
broken when dax_zero_page_range() starts using this logic and passes
an arbitrary range. We need to make sure we internally align I/O 
and carve out an aligned sub-range and pass that subrange to
nvdimm_clear_poison().

So if you can make sure I am not breaking things and new interface
will continue to clear poison on sector boundary, that will be great.

Thanks
Vivek


* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-20 21:57       ` Vivek Goyal
@ 2020-02-21 18:32         ` Jeff Moyer
  -1 siblings, 0 replies; 103+ messages in thread
From: Jeff Moyer @ 2020-02-21 18:32 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-fsdevel, linux-nvdimm, hch, dm-devel

Vivek Goyal <vgoyal@redhat.com> writes:

> On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
>> Vivek Goyal <vgoyal@redhat.com> writes:
>> 
>> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
>> > At least that seems to be the assumption with which the code has been written.
>> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
>> > and pmem_make_request() which will only pass sector aligned offset and len.
>> >
>> > Soon we want to use this function from the dax_zero_page_range() code path which
>> > can try to zero an arbitrary range of memory within a page. So update this
>> > function to assume that offset and length can be arbitrary and do the
>> > necessary alignments as needed.
>> 
>> What caller will try to zero a range that is smaller than a sector?
>
> Hi Jeff,
>
> New dax zeroing interface (dax_zero_page_range()) can technically pass
> a range which is less than a sector. Or which is bigger than a sector
> but start and end are not aligned on sector boundaries.

Sure, but who will call it with misaligned ranges?

> At this point of time, all I care about is that case of an arbitrary
> range is handled well. So if a caller passes a range in, we figure
> out subrange which is sector aligned in terms of start and end, and
> clear poison on those sectors and ignore rest of the range. And
> this itself will be an improvement over current behavior where 
> nothing is cleared if I/O is not sector aligned.

I don't think this makes sense.  The caller needs to know about the
blast radius of errors.  This is why I asked for a concrete example.
It might make more sense, for example, to return an error if not all of
the errors could be cleared.

>> > nvdimm_clear_poison() seems to assume offset and len to be aligned to
>> > clear_err_unit boundary. But this is currently internal detail and is
>> > not exported for others to use. So for now, continue to align offset and
>> > length to SECTOR_SIZE boundary. Improving it further and to align it
>> > to clear_err_unit boundary is a TODO item for future.
>> 
>> When there is a poisoned range of persistent memory, it is recorded by
>> the badblocks infrastructure, which currently operates on sectors.  So,
>> no matter what the error unit is for the hardware, we currently can't
>> record/report to userspace anything smaller than a sector, and so that
>> is what we expect when clearing errors.
>> 
>> Continuing on for completeness, we will currently not map a page with
>> badblocks into a process' address space.  So, let's say you have 256
>> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
>> you access a valid mmap()d address in the same page as the poisoned
>> memory, you will get a segfault.
>> 
>> Userspace can fix up the error by calling write(2) and friends to
>> provide new data, or by punching a hole and writing new data to the hole
>> (which may result in getting a new block, or reallocating the old block
>> and zeroing it, which will clear the error).
>
> Fair enough. I do not need poison clearing at finer granularity. It might
> be needed once dev_dax path wants to clear poison. Not sure how exactly
> that works.

It doesn't.  :)

>> > +	/*
>> > +	 * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
>> > +	 * expects memory offset and length to meet certain alignment
>> > +	 * restrction (clear_err_unit). Currently nvdimm does not export
>>                                                   ^^^^^^^^^^^^^^^^^^^^^^
>> > +	 * required alignment. So align offset and length to sector boundary
>> 
>> What is "nvdimm" in that sentence?  Because the nvdimm most certainly
>> does export the required alignment.  Perhaps you meant libnvdimm?
>
> I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever
> it is called. It first queries the alignment required (clear_err_unit) and
> then makes sure range passed in meets that alignment requirement.

My point was your comment is misleading.

>> We could potentially support clearing less than a sector, but I'd have
>> to understand the use cases better before offering implementation
>> suggestions.
>
> I don't need clearing less than a sector. Once somebody needs it they
> can implement it. All I am doing is making sure current logic is not
> broken when dax_zero_page_range() starts using this logic and passes
> an arbitrary range. We need to make sure we internally align I/O 

An arbitrary range is the same thing as less than a sector.  :)  Do you
know of an instance where the range will not be sector-aligned and sized?

> and carve out an aligned sub-range and pass that subrange to
> nvdimm_clear_poison().

And what happens to the rest?  The caller is left to trip over the
errors?  That sounds pretty terrible.  I really think there needs to be
an explicit contract here.

> So if you can make sure I am not breaking things and new interface
> will continue to clear poison on sector boundary, that will be great.

I think allowing arbitrary ranges /could/ break things.  How it breaks
things depends on what the caller is doing.

If there are no callers using the interface in this way, then I see no
need to relax the restriction.  I do think we could document it better.

-Jeff


* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-21 18:32         ` Jeff Moyer
@ 2020-02-21 20:17           ` Vivek Goyal
  -1 siblings, 0 replies; 103+ messages in thread
From: Vivek Goyal @ 2020-02-21 20:17 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-fsdevel, linux-nvdimm, hch, dm-devel

On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> >> Vivek Goyal <vgoyal@redhat.com> writes:
> >> 
> >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> >> > At least that seems to be the assumption with which the code has been written.
> >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> >> > and pmem_make_request() which will only pass sector aligned offset and len.
> >> >
> >> > Soon we want to use this function from the dax_zero_page_range() code path which
> >> > can try to zero an arbitrary range of memory within a page. So update this
> >> > function to assume that offset and length can be arbitrary and do the
> >> > necessary alignments as needed.
> >> 
> >> What caller will try to zero a range that is smaller than a sector?
> >
> > Hi Jeff,
> >
> > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > a range which is less than a sector. Or which is bigger than a sector
> > but start and end are not aligned on sector boundaries.
> 
> Sure, but who will call it with misaligned ranges?

Create a file foo.txt of size 4K and then truncate it:

"truncate -s 23 foo.txt". The filesystem then tries to zero the bytes from
24 to 4095.

I have also written a test for this.

https://github.com/rhvgoyal/misc/blob/master/pmem-tests/iomap-range-test.sh#L102

> 
> > At this point of time, all I care about is that case of an arbitrary
> > range is handled well. So if a caller passes a range in, we figure
> > out subrange which is sector aligned in terms of start and end, and
> > clear poison on those sectors and ignore rest of the range. And
> > this itself will be an improvement over current behavior where 
> > nothing is cleared if I/O is not sector aligned.
> 
> I don't think this makes sense.  The caller needs to know about the
> blast radius of errors.  This is why I asked for a concrete example.
> It might make more sense, for example, to return an error if not all of
> the errors could be cleared.
> 
> >> > nvdimm_clear_poison() seems to assume offset and len to be aligned to
> >> > clear_err_unit boundary. But this is currently internal detail and is
> >> > not exported for others to use. So for now, continue to align offset and
> >> > length to SECTOR_SIZE boundary. Improving it further and to align it
> >> > to clear_err_unit boundary is a TODO item for future.
> >> 
> >> When there is a poisoned range of persistent memory, it is recorded by
> >> the badblocks infrastructure, which currently operates on sectors.  So,
> >> no matter what the error unit is for the hardware, we currently can't
> >> record/report to userspace anything smaller than a sector, and so that
> >> is what we expect when clearing errors.
> >> 
> >> Continuing on for completeness, we will currently not map a page with
> >> badblocks into a process' address space.  So, let's say you have 256
> >> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
> >> you access a valid mmap()d address in the same page as the poisoned
> >> memory, you will get a segfault.
> >> 
> >> Userspace can fix up the error by calling write(2) and friends to
> >> provide new data, or by punching a hole and writing new data to the hole
> >> (which may result in getting a new block, or reallocating the old block
> >> and zeroing it, which will clear the error).
> >
> > Fair enough. I do not need poison clearing at finer granularity. It might
> > be needed once dev_dax path wants to clear poison. Not sure how exactly
> > that works.
> 
> It doesn't.  :)
> 
> >> > +	/*
> >> > +	 * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
> >> > +	 * expects memory offset and length to meet certain alignment
> >> > +	 * restrction (clear_err_unit). Currently nvdimm does not export
> >>                                                   ^^^^^^^^^^^^^^^^^^^^^^
> >> > +	 * required alignment. So align offset and length to sector boundary
> >> 
> >> What is "nvdimm" in that sentence?  Because the nvdimm most certainly
> >> does export the required alignment.  Perhaps you meant libnvdimm?
> >
> > I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever
> > it is called. It first queries the alignment required (clear_err_unit) and
> > then makes sure range passed in meets that alignment requirement.
> 
> My point was your comment is misleading.
> 
> >> We could potentially support clearing less than a sector, but I'd have
> >> to understand the use cases better before offering implementation
> >> suggestions.
> >
> > I don't need clearing less than a sector. Once somebody needs it they
> > can implement it. All I am doing is making sure current logic is not
> > broken when dax_zero_page_range() starts using this logic and passes
> > an arbitrary range. We need to make sure we internally align I/O 
> 
> An arbitrary range is the same thing as less than a sector.  :)  Do you
> know of an instance where the range will not be sector-aligned and sized?
> 
> > and carve out an aligned sub-range and pass that subrange to
> > nvdimm_clear_poison().
> 
> And what happens to the rest?  The caller is left to trip over the
> errors?  That sounds pretty terrible.  I really think there needs to be
> an explicit contract here.

Ok, I think this is the contentious bit. The current interface
(__dax_zero_page_range()) either clears the poison (if the I/O is sector
aligned) or expects the page to be free of poison.

So in the above example of "truncate -s 23 foo.txt", currently I get an
error because the range being zeroed is not sector aligned:
__dax_zero_page_range() falls back to calling direct_access(), which
fails because there are poisoned sectors in the page.

With my patches, dax_zero_page_range() clears the poison from sectors 1 to
7 but leaves sector 0 untouched (it just writes zeroes from byte 24 to 511
there) and returns success.

So the question is, is this better or worse behavior? If sector 0 was
poisoned, it will remain poisoned, and the caller will come to know about
it on the next read; it should then truncate the file to length 0, unlink
the file, or restore it to get rid of the poison.

IOW, if a partial block is being zeroed and it is poisoned, the caller
will not be returned an error, the poison will not be cleared, and the
memory will be zeroed. What do we expect in such cases?

Do we expect an interface where, if there are any bad blocks in the range
being zeroed, they all should be cleared (and hence all I/O should be
aligned), and otherwise an error is returned? If yes, I could make that
change.

The downside of the current interface is that it will clear as many blocks
as possible in the given range, leave the starting and ending blocks
poisoned (if the range is unaligned), and not return an error. That means
a reader will get errors on these blocks again and will have to try to
clear them again.
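
If the strict contract is preferred, the check at the top of
pmem_clear_poison() might look something like this (just a sketch of
that idea, not tested):

	/*
	 * Strict variant: require a fully sector-aligned range so that
	 * every bad block in the range is either cleared or reported,
	 * instead of silently leaving an unaligned head or tail poisoned.
	 */
	if (!IS_ALIGNED(offset, SECTOR_SIZE) || !IS_ALIGNED(len, SECTOR_SIZE))
		return BLK_STS_IOERR;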

Thanks
Vivek

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-21 20:17           ` Vivek Goyal
@ 2020-02-21 21:00             ` Dan Williams
  -1 siblings, 0 replies; 103+ messages in thread
From: Dan Williams @ 2020-02-21 21:00 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Fri, Feb 21, 2020 at 12:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > Vivek Goyal <vgoyal@redhat.com> writes:
> >
> > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > >>
> > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > >> > At least that seems to be the assumption with which the code has been written.
> > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> > >> > and pmem_make_request() which will only pass sector aligned offset and len.
> > >> >
> > >> > Soon we want to use this function from the dax_zero_page_range() code path which
> > >> > can try to zero an arbitrary range of memory within a page. So update this
> > >> > function to assume that offset and length can be arbitrary and do the
> > >> > necessary alignments as needed.
> > >>
> > >> What caller will try to zero a range that is smaller than a sector?
> > >
> > > Hi Jeff,
> > >
> > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > a range which is less than a sector. Or which is bigger than a sector
> > > but start and end are not aligned on sector boundaries.
> >
> > Sure, but who will call it with misaligned ranges?
>
> create a file foo.txt of size 4K and then truncate it.
>
> "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> 4095.
>
> I have also written a test for this.
>
> https://github.com/rhvgoyal/misc/blob/master/pmem-tests/iomap-range-test.sh#L102
>
> >
> > > At this point of time, all I care about is that case of an arbitrary
> > > range is handled well. So if a caller passes a range in, we figure
> > > out subrange which is sector aligned in terms of start and end, and
> > > clear poison on those sectors and ignore rest of the range. And
> > > this itself will be an improvement over current behavior where
> > > nothing is cleared if I/O is not sector aligned.
> >
> > I don't think this makes sense.  The caller needs to know about the
> > blast radius of errors.  This is why I asked for a concrete example.
> > It might make more sense, for example, to return an error if not all of
> > the errors could be cleared.
> >
> > >> > nvdimm_clear_poison() seems to assume offset and len to be aligned to
> > >> > clear_err_unit boundary. But this is currently internal detail and is
> > >> > not exported for others to use. So for now, continue to align offset and
> > >> > length to SECTOR_SIZE boundary. Improving it further and to align it
> > >> > to clear_err_unit boundary is a TODO item for future.
> > >>
> > >> When there is a poisoned range of persistent memory, it is recorded by
> > >> the badblocks infrastructure, which currently operates on sectors.  So,
> > >> no matter what the error unit is for the hardware, we currently can't
> > >> record/report to userspace anything smaller than a sector, and so that
> > >> is what we expect when clearing errors.
> > >>
> > >> Continuing on for completeness, we will currently not map a page with
> > >> badblocks into a process' address space.  So, let's say you have 256
> > >> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
> > >> you access a valid mmap()d address in the same page as the poisoned
> > >> memory, you will get a segfault.
> > >>
> > >> Userspace can fix up the error by calling write(2) and friends to
> > >> provide new data, or by punching a hole and writing new data to the hole
> > >> (which may result in getting a new block, or reallocating the old block
> > >> and zeroing it, which will clear the error).
> > >
> > > Fair enough. I do not need poison clearing at finer granularity. It might
> > > be needed once dev_dax path wants to clear poison. Not sure how exactly
> > > that works.
> >
> > It doesn't.  :)
> >
> > >> > +        /*
> > >> > +         * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
> > >> > +         * expects memory offset and length to meet certain alignment
> > >> > +         * restrction (clear_err_unit). Currently nvdimm does not export
> > >>                                                   ^^^^^^^^^^^^^^^^^^^^^^
> > >> > +         * required alignment. So align offset and length to sector boundary
> > >>
> > >> What is "nvdimm" in that sentence?  Because the nvdimm most certainly
> > >> does export the required alignment.  Perhaps you meant libnvdimm?
> > >
> > > I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever
> > > it is called. It first queries alignment required (clear_err_unit) and
> > > then makes sure range passed in meets that alignment requirement.
> >
> > My point was your comment is misleading.
> >
> > >> We could potentially support clearing less than a sector, but I'd have
> > >> to understand the use cases better before offerring implementation
> > >> suggestions.
> > >
> > > I don't need clearing less than a sector. Once somebody needs it they
> > > can implement it. All I am doing is making sure current logic is not
> > > broken when dax_zero_page_range() starts using this logic and passes
> > > an arbitrary range. We need to make sure we internally align I/O
> >
> > An arbitrary range is the same thing as less than a sector.  :)  Do you
> > know of an instance where the range will not be sector-aligned and sized?
> >
> > > and carve out an aligned sub-range and pass that subrange to
> > > nvdimm_clear_poison().
> >
> > And what happens to the rest?  The caller is left to trip over the
> > errors?  That sounds pretty terrible.  I really think there needs to be
> > an explicit contract here.
>
> > Ok, I think this is the contentious bit. The current interface
> (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> sector) or expects page to be free of poison.
>
> So in above example, of "truncate -s 23 foo.txt", currently I get an error
> because range being zeroed is not sector aligned. So
> __dax_zero_page_range() falls back to calling direct_access(). Which
> fails because there are poisoned sectors in the page.
>
> With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> and returns success.
>
> So question is, is this better behavior or worse behavior. If sector 0
> was poisoned, it will continue to remain poisoned and caller will come
> to know about it on next read and then it should try to truncate file
> to length 0 or unlink file or restore that file to get rid of poison.
>
> > IOW, if a partial block being zeroed is poisoned, the caller will not
> > be returned an error, the poison will not be cleared, and the memory
> > will be zeroed. What do we expect in such cases?
>
> Do we expect an interface where if there are any bad blocks in the range
> being zeroed, then they all should be cleared (and hence all I/O should
> be aligned) otherwise error is returned. If yes, I could make that
> change.

This does not strike me as a good idea because it's a false sense of
security compared to the latent poison case. If the writes to an
unknown poisoned location would otherwise succeed via a different I/O
path (dax), it's an asymmetric surprise to start returning errors just
because you wrote zeroes as a side effect of truncate.

> Downside of current interface is that it will clear as many blocks as
> possible in the given range and leave starting and end blocks poisoned
> (if it is unaligned) and not return error. That means a reader will
> get error on these blocks again and they will have to try to clear it
> again.

I think what you have described in your truncate example is an
improvement on what we have currently because x86 does not communicate
write errors. Specifically, writing zeros via dax from userspace over
unknown poison behaves the same as writing unaligned zeros over known
poison. In both cases it's a best effort that always succeeds (no cpu
exception), and may inadvertently clear poison as a side-effect.
Otherwise, an error-block-aligned hole punch is the only way to
trigger the kernel to try to clear known poison when the full block is
reallocated.
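
To make that concrete, here is a user-space sketch of the dax write
path being described (the file name, page size and offset are
assumptions; it also presumes a fsdax mount and a page without *known*
badblocks, since those are kept out of the address space):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Best-effort zeroing of one page through an mmap()ed dax file. */
static int dax_zero_page(const char *path, off_t pg_off)
{
	void *p;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return -1;
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, pg_off);
	close(fd);
	if (p == MAP_FAILED)
		return -1;
	memset(p, 0, 4096);	/* plain stores; no exception over latent poison */
	return munmap(p, 4096);
}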

On movdir64b capable cpus the error clearing unit becomes 64 bytes
rather than 256 bytes, because movdir64b allows a cacheline to be
written without triggering a line-fill read. So the error clearing
granularity gets better over time, but unfortunately there is still no
synchronous error detection in the I/O path.
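
For illustration, here is a sketch of such a 64-byte direct store via
the _movdir64b() compiler intrinsic (assumes a movdir64b-capable cpu, a
compiler invoked with -mmovdir64b, and a 64-byte aligned destination):

#include <immintrin.h>

/* Write one zeroed cacheline without triggering a line-fill read. */
static void zero_line_movdir64b(void *dst)
{
	unsigned char zeros[64] __attribute__((aligned(64))) = { 0 };

	_movdir64b(dst, zeros);	/* weakly-ordered, non-torn 64-byte store */
}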

I think a better way to improve poison handling is the long-standing
idea of integrating the badblock tracking into the filesystem directly.
That way driver notifications of poison can be ingested into the
filesystem and notifications sent on filenames rather than the current
TOCTOU mess of trying to do a reverse lookup of badblock numbers to
files. If the application can efficiently list and be notified of
poison it can mitigate it immediately rather than trying to rely on
write side effects.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-21 21:00             ` Dan Williams
@ 2020-02-21 21:24               ` Vivek Goyal
  -1 siblings, 0 replies; 103+ messages in thread
From: Vivek Goyal @ 2020-02-21 21:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Fri, Feb 21, 2020 at 01:00:29PM -0800, Dan Williams wrote:
> On Fri, Feb 21, 2020 at 12:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > Vivek Goyal <vgoyal@redhat.com> writes:
> > >
> > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > > >>
> > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > > >> > At least that seems to be the assumption with which the code has been written.
> > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> > > >> > and pmem_make_request() which will only pass sector aligned offset and len.
> > > >> >
> > > >> > Soon we want to use this function from the dax_zero_page_range() code path which
> > > >> > can try to zero an arbitrary range of memory within a page. So update this
> > > >> > function to assume that offset and length can be arbitrary and do the
> > > >> > necessary alignments as needed.
> > > >>
> > > >> What caller will try to zero a range that is smaller than a sector?
> > > >
> > > > Hi Jeff,
> > > >
> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > a range which is less than a sector. Or which is bigger than a sector
> > > > but start and end are not aligned on sector boundaries.
> > >
> > > Sure, but who will call it with misaligned ranges?
> >
> > create a file foo.txt of size 4K and then truncate it.
> >
> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > 4095.
> >
> > I have also written a test for this.
> >
> > https://github.com/rhvgoyal/misc/blob/master/pmem-tests/iomap-range-test.sh#L102
> >
> > >
> > > > At this point of time, all I care about is that case of an arbitrary
> > > > range is handled well. So if a caller passes a range in, we figure
> > > > out subrange which is sector aligned in terms of start and end, and
> > > > clear poison on those sectors and ignore rest of the range. And
> > > > this itself will be an improvement over current behavior where
> > > > nothing is cleared if I/O is not sector aligned.
> > >
> > > I don't think this makes sense.  The caller needs to know about the
> > > blast radius of errors.  This is why I asked for a concrete example.
> > > It might make more sense, for example, to return an error if not all of
> > > the errors could be cleared.
> > >
> > > >> > nvdimm_clear_poison() seems to assume offset and len to be aligned to
> > > >> > clear_err_unit boundary. But this is currently internal detail and is
> > > >> > not exported for others to use. So for now, continue to align offset and
> > > >> > length to SECTOR_SIZE boundary. Improving it further and to align it
> > > >> > to clear_err_unit boundary is a TODO item for future.
> > > >>
> > > >> When there is a poisoned range of persistent memory, it is recorded by
> > > >> the badblocks infrastructure, which currently operates on sectors.  So,
> > > >> no matter what the error unit is for the hardware, we currently can't
> > > >> record/report to userspace anything smaller than a sector, and so that
> > > >> is what we expect when clearing errors.
> > > >>
> > > >> Continuing on for completeness, we will currently not map a page with
> > > >> badblocks into a process' address space.  So, let's say you have 256
> > > >> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
> > > >> you access a valid mmap()d address in the same page as the poisoned
> > > >> memory, you will get a segfault.
> > > >>
> > > >> Userspace can fix up the error by calling write(2) and friends to
> > > >> provide new data, or by punching a hole and writing new data to the hole
> > > >> (which may result in getting a new block, or reallocating the old block
> > > >> and zeroing it, which will clear the error).
> > > >
> > > > Fair enough. I do not need poison clearing at finer granularity. It might
> > > > be needed once dev_dax path wants to clear poison. Not sure how exactly
> > > > that works.
> > >
> > > It doesn't.  :)
> > >
> > > >> > +        /*
> > > >> > +         * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
> > > >> > +         * expects memory offset and length to meet certain alignment
> > > >> > +         * restrction (clear_err_unit). Currently nvdimm does not export
> > > >>                                                   ^^^^^^^^^^^^^^^^^^^^^^
> > > >> > +         * required alignment. So align offset and length to sector boundary
> > > >>
> > > >> What is "nvdimm" in that sentence?  Because the nvdimm most certainly
> > > >> does export the required alignment.  Perhaps you meant libnvdimm?
> > > >
> > > > I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever
> > > > it is called. It first queries alignment required (clear_err_unit) and
> > > > then makes sure range passed in meets that alignment requirement.
> > >
> > > My point was your comment is misleading.
> > >
> > > >> We could potentially support clearing less than a sector, but I'd have
> > > >> to understand the use cases better before offering implementation
> > > >> suggestions.
> > > >
> > > > I don't need clearing less than a sector. Once somebody needs it they
> > > > can implement it. All I am doing is making sure current logic is not
> > > > broken when dax_zero_page_range() starts using this logic and passes
> > > > an arbitrary range. We need to make sure we internally align I/O
> > >
> > > An arbitrary range is the same thing as less than a sector.  :)  Do you
> > > know of an instance where the range will not be sector-aligned and sized?
> > >
> > > > and carve out an aligned sub-range and pass that subrange to
> > > > nvdimm_clear_poison().
> > >
> > > And what happens to the rest?  The caller is left to trip over the
> > > errors?  That sounds pretty terrible.  I really think there needs to be
> > > an explicit contract here.
> >
> > Ok, I think this is the contentious bit. The current interface
> > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> > sector) or expects page to be free of poison.
> >
> > So in above example, of "truncate -s 23 foo.txt", currently I get an error
> > because range being zeroed is not sector aligned. So
> > __dax_zero_page_range() falls back to calling direct_access(). Which
> > fails because there are poisoned sectors in the page.
> >
> > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> > and returns success.
> >
> > So question is, is this better behavior or worse behavior. If sector 0
> > was poisoned, it will continue to remain poisoned and caller will come
> > to know about it on next read and then it should try to truncate file
> > to length 0 or unlink file or restore that file to get rid of poison.
> >
> > IOW, if a partial block being zeroed is poisoned, the caller will not
> > be returned an error, the poison will not be cleared, and the memory
> > will be zeroed. What do we expect in such cases?
> >
> > Do we expect an interface where if there are any bad blocks in the range
> > being zeroed, then they all should be cleared (and hence all I/O should
> > be aligned) otherwise error is returned. If yes, I could make that
> > change.
> 
> This does not strike me as a good idea because it's a false security
> compared to the latent poison case. If the writes to an unknown
> poisoned location would otherwise succeed via a different I/O path
> (dax), it's an asymmetric surprise to start returning errors just
> because you wrote zeroes as a side effect of truncate.
> 
> > Downside of current interface is that it will clear as many blocks as
> > possible in the given range and leave starting and end blocks poisoned
> > (if it is unaligned) and not return error. That means a reader will
> > get error on these blocks again and they will have to try to clear it
> > again.
> 
> I think what you have described in your truncate example is an
> improvement on what we have currently because x86 does not communicate
> write errors. Specifically, writing zeros via dax from userspace over
> unknown poison behaves the same as writing unaligned zeros over known
> poison. In both cases it's a best effort that always succeeds (no cpu
> exception), and may inadvertently clear poison as a side-effect.
> Otherwise, an error-block-aligned hole punch is the only way to
> trigger the kernel to try to clear known poison when the full block is
> reallocated.

Hi Dan,

Agreed. This new interface works uniformly for both the known poison
and latent poison cases. The existing interface is asymmetric: if the
poison is latent, an unaligned zero range will succeed, but if the
poison is known, an unaligned zero range will fail.
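
Schematically, the asymmetry in the existing helper looks like this (a
simplified sketch of the fs/dax.c logic, not the literal code):

	/* __dax_zero_page_range(), simplified */
	if (IS_ALIGNED(offset, SECTOR_SIZE) && IS_ALIGNED(size, SECTOR_SIZE))
		/* aligned: block layer zeroout, clears known poison */
		return blkdev_issue_zeroout(bdev, sector, size >> 9,
					    GFP_NOFS, 0);
	/* unaligned: load/store path, fails only on *known* poison */
	rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
	if (rc < 0)
		return rc;
	memset(kaddr + offset, 0, size);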

> 
> On movdir64b capable cpus the error clearing unit becomes 64-bytes
> rather than 256-bytes because that allows a cacheline to be written
> without triggering a line fill read. So the error clearing granularity
> gets better over time, but unfortunately not synchronous detection in
> the I/O path.
> 
> I think a better way to improve poison handling is the long standing
> idea to integrate the badblock tracking into the filesystem directly.
> That way driver notifications of poison can be ingested into the
> filesystem and notifications sent on filenames rather than the current
> TOCTOU mess of trying to do a reverse lookup of badblock numbers to
> files. If the application can efficiently list and be notified of
> poison it can mitigate it immediately rather than trying to rely on
> write side effects.

Moving the badblocks infrastructure into the filesystem sounds like a
major rework which should be taken up in a separate patch series in
the future.

For now, can we please take these patches, which are an improvement
over the existing interface?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-21 21:24               ` Vivek Goyal
@ 2020-02-21 21:30                 ` Dan Williams
  -1 siblings, 0 replies; 103+ messages in thread
From: Dan Williams @ 2020-02-21 21:30 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Fri, Feb 21, 2020 at 1:25 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Fri, Feb 21, 2020 at 01:00:29PM -0800, Dan Williams wrote:
> > On Fri, Feb 21, 2020 at 12:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > > Vivek Goyal <vgoyal@redhat.com> writes:
> > > >
> > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > > > >>
> > > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > > > >> > At least that seems to be the assumption with which the code has been written.
> > > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> > > > >> > and pmem_make_request() which will only pass sector aligned offset and len.
> > > > >> >
> > > > >> > Soon we want to use this function from the dax_zero_page_range() code path which
> > > > >> > can try to zero an arbitrary range of memory within a page. So update this
> > > > >> > function to assume that offset and length can be arbitrary and do the
> > > > >> > necessary alignments as needed.
> > > > >>
> > > > >> What caller will try to zero a range that is smaller than a sector?
> > > > >
> > > > > Hi Jeff,
> > > > >
> > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > > a range which is less than a sector. Or which is bigger than a sector
> > > > > but start and end are not aligned on sector boundaries.
> > > >
> > > > Sure, but who will call it with misaligned ranges?
> > >
> > > create a file foo.txt of size 4K and then truncate it.
> > >
> > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > 4095.
> > >
> > > I have also written a test for this.
> > >
> > > https://github.com/rhvgoyal/misc/blob/master/pmem-tests/iomap-range-test.sh#L102
> > >
> > > >
> > > > > At this point of time, all I care about is that case of an arbitrary
> > > > > range is handled well. So if a caller passes a range in, we figure
> > > > > out subrange which is sector aligned in terms of start and end, and
> > > > > clear poison on those sectors and ignore rest of the range. And
> > > > > this itself will be an improvement over current behavior where
> > > > > nothing is cleared if I/O is not sector aligned.
> > > >
> > > > I don't think this makes sense.  The caller needs to know about the
> > > > blast radius of errors.  This is why I asked for a concrete example.
> > > > It might make more sense, for example, to return an error if not all of
> > > > the errors could be cleared.
> > > >
> > > > >> > nvdimm_clear_poison() seems to assume offset and len to be aligned to
> > > > >> > clear_err_unit boundary. But this is currently internal detail and is
> > > > >> > not exported for others to use. So for now, continue to align offset and
> > > > >> > length to SECTOR_SIZE boundary. Improving it further and to align it
> > > > >> > to clear_err_unit boundary is a TODO item for future.
> > > > >>
> > > > >> When there is a poisoned range of persistent memory, it is recorded by
> > > > >> the badblocks infrastructure, which currently operates on sectors.  So,
> > > > >> no matter what the error unit is for the hardware, we currently can't
> > > > >> record/report to userspace anything smaller than a sector, and so that
> > > > >> is what we expect when clearing errors.
> > > > >>
> > > > >> Continuing on for completeness, we will currently not map a page with
> > > > >> badblocks into a process' address space.  So, let's say you have 256
> > > > >> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
> > > > >> you access a valid mmap()d address in the same page as the poisoned
> > > > >> memory, you will get a segfault.
> > > > >>
> > > > >> Userspace can fix up the error by calling write(2) and friends to
> > > > >> provide new data, or by punching a hole and writing new data to the hole
> > > > >> (which may result in getting a new block, or reallocating the old block
> > > > >> and zeroing it, which will clear the error).
> > > > >
> > > > > Fair enough. I do not need poison clearing at finer granularity. It might
> > > > > be needed once dev_dax path wants to clear poison. Not sure how exactly
> > > > > that works.
> > > >
> > > > It doesn't.  :)
> > > >
> > > > >> > +        /*
> > > > >> > +         * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
> > > > >> > +         * expects memory offset and length to meet certain alignment
> > > > >> > +         * restrction (clear_err_unit). Currently nvdimm does not export
> > > > >>                                                   ^^^^^^^^^^^^^^^^^^^^^^
> > > > >> > +         * required alignment. So align offset and length to sector boundary
> > > > >>
> > > > >> What is "nvdimm" in that sentence?  Because the nvdimm most certainly
> > > > >> does export the required alignment.  Perhaps you meant libnvdimm?
> > > > >
> > > > > I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever
> > > > > it is called. It first queries alignment required (clear_err_unit) and
> > > > > then makes sure range passed in meets that alignment requirement.
> > > >
> > > > My point was your comment is misleading.
> > > >
> > > > >> We could potentially support clearing less than a sector, but I'd have
> > > > >> to understand the use cases better before offering implementation
> > > > >> suggestions.
> > > > >
> > > > > I don't need clearing less than a sector. Once somebody needs it they
> > > > > can implement it. All I am doing is making sure current logic is not
> > > > > broken when dax_zero_page_range() starts using this logic and passes
> > > > > an arbitrary range. We need to make sure we internally align I/O
> > > >
> > > > An arbitrary range is the same thing as less than a sector.  :)  Do you
> > > > know of an instance where the range will not be sector-aligned and sized?
> > > >
> > > > > and carve out an aligned sub-range and pass that subrange to
> > > > > nvdimm_clear_poison().
> > > >
> > > > And what happens to the rest?  The caller is left to trip over the
> > > > errors?  That sounds pretty terrible.  I really think there needs to be
> > > > an explicit contract here.
> > >
> > > Ok, I think this is the contentious bit. The current interface
> > > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> > > sector) or expects page to be free of poison.
> > >
> > > So in above example, of "truncate -s 23 foo.txt", currently I get an error
> > > because range being zeroed is not sector aligned. So
> > > __dax_zero_page_range() falls back to calling direct_access(). Which
> > > fails because there are poisoned sectors in the page.
> > >
> > > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> > > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> > > and returns success.
> > >
> > > So question is, is this better behavior or worse behavior. If sector 0
> > > was poisoned, it will continue to remain poisoned and caller will come
> > > to know about it on next read and then it should try to truncate file
> > > to length 0 or unlink file or restore that file to get rid of poison.
> > >
> > > IOW, if a partial block being zeroed is poisoned, the caller will not
> > > be returned an error, the poison will not be cleared, and the memory
> > > will be zeroed. What do we expect in such cases?
> > >
> > > Do we expect an interface where if there are any bad blocks in the range
> > > being zeroed, then they all should be cleared (and hence all I/O should
> > > be aligned) otherwise error is returned. If yes, I could make that
> > > change.
> >
> > This does not strike me as a good idea because it's a false security
> > compared to the latent poison case. If the writes to an unknown
> > poisoned location would otherwise succeed via a different I/O path
> > (dax), it's an asymmetric surprise to start returning errors just
> > because you wrote zeroes as a side effect of truncate.
> >
> > > Downside of current interface is that it will clear as many blocks as
> > > possible in the given range and leave starting and end blocks poisoned
> > > (if it is unaligned) and not return error. That means a reader will
> > > get error on these blocks again and they will have to try to clear it
> > > again.
> >
> > I think what you have described in your truncate example is an
> > improvement on what we have currently because x86 does not communicate
> > write errors. Specifically, writing zeros via dax from userspace over
> > unknown poison behaves the same as writing unaligned zeros over known
> > poison. In both cases it's a best effort that always succeeds (no cpu
> > exception), and may inadvertently clear poison as a side-effect.
> > Otherwise, an error-block-aligned hole punch is the only way to
> > trigger the kernel to try to clear known poison when the full block is
> > reallocated.
>
> Hi Dan,
>
> Agreed. This new interface works uniformly for both the known poison and
> latent poison cases. The existing interface is asymmetric: if the poison
> is latent, an unaligned zero range will succeed, but if the poison is
> known, an unaligned zero range will fail.
>
> >
> > On movdir64b capable cpus the error clearing unit becomes 64 bytes
> > rather than 256 bytes because that allows a cacheline to be written
> > without triggering a line fill read. So the error clearing granularity
> > gets better over time, but unfortunately it still does not give
> > synchronous detection in the I/O path.
> >
> > I think a better way to improve poison handling is the long-standing
> > idea to integrate the badblock tracking into the filesystem directly.
> > That way driver notifications of poison can be ingested into the
> > filesystem and notifications sent on filenames rather than the current
> > TOCTOU mess of trying to do a reverse lookup of badblock numbers to
> > files. If the application can efficiently list and be notified of
> > poison it can mitigate it immediately rather than trying to rely on
> > write side effects.
>
> Moving the badblocks infrastructure into the filesystem sounds like a
> major rework which should be taken up in a separate patch series in the
> future.
>
> For now, can we please take these patches, which are an improvement
> over the existing interface?

Oh, you misunderstood my comment: the "move badblocks to filesystem"
proposal is a long-term / down-the-road thing to consider. In the near
term this unaligned block zeroing facility is an improvement.
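
For readers following along, here is a minimal sketch of the carve-out
behavior being discussed: an arbitrary (offset, len) range is shrunk inward
to sector boundaries before the poison-clearing call, so partially covered
edge sectors (like sector 0 in the truncate example) are zeroed but not
clear-poisoned. The _sketch helper names are illustrative, not the actual
patch code.

/*
 * Illustrative sketch only: shrink an arbitrary range inward to
 * sector boundaries and clear poison on the fully covered sectors.
 * For "truncate -s 23" on a 4K block, start rounds up to 512, so
 * sector 0 is left untouched, matching the behavior described above.
 */
static void pmem_clear_poison_sketch(struct pmem_device *pmem,
				     phys_addr_t offset, unsigned int len)
{
	phys_addr_t start = ALIGN(offset, SECTOR_SIZE);
	phys_addr_t end = ALIGN_DOWN(offset + len, SECTOR_SIZE);

	if (end > start)
		nvdimm_clear_poison_sketch(pmem, start, end - start);
	/* bytes outside [start, end) are zeroed but stay poisoned */
}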

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-21 21:30                 ` Dan Williams
@ 2020-02-21 21:33                   ` Jeff Moyer
  -1 siblings, 0 replies; 103+ messages in thread
From: Jeff Moyer @ 2020-02-21 21:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

Dan Williams <dan.j.williams@intel.com> writes:

> Oh you misunderstood my comment, the "move badblocks to filesystem"
> proposal is long term / down the road thing to consider. In the near
> term this unaligned block zeroing facility is an improvement.

I'm not sure I agree.  I'm going to think about it and get back to you.

-Jeff

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-21 20:17           ` Vivek Goyal
@ 2020-02-23 23:03             ` Dave Chinner
  -1 siblings, 0 replies; 103+ messages in thread
From: Dave Chinner @ 2020-02-23 23:03 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-fsdevel, linux-nvdimm, hch, dm-devel

On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > Vivek Goyal <vgoyal@redhat.com> writes:
> > 
> > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > >> 
> > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > >> > At least that seems to be the assumption with which the code has been
> > >> > written. It is called only from pmem_do_bvec(), which is called only from
> > >> > pmem_rw_page() and pmem_make_request(), which will only pass sector
> > >> > aligned offset and len.
> > >> >
> > >> > Soon we want to use this function from the dax_zero_page_range() code
> > >> > path, which can try to zero an arbitrary range of memory within a page.
> > >> > So update this function to assume that offset and length can be
> > >> > arbitrary and do the necessary alignments as needed.
> > >> 
> > >> What caller will try to zero a range that is smaller than a sector?
> > >
> > > Hi Jeff,
> > >
> > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > a range which is less than a sector. Or which is bigger than a sector
> > > but start and end are not aligned on sector boundaries.
> > 
> > Sure, but who will call it with misaligned ranges?
> 
> create a file foo.txt of size 4K and then truncate it.
> 
> "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> 4095.

This should fail with EIO. Only full page writes should clear the
bad page state, and partial writes should therefore fail because
they do not guarantee the data in the filesystem block is all good.

If this zeroing was a buffered write to an address with a bad
sector, then the writeback will fail and the user will (eventually)
get an EIO on the file.

DAX should do the same thing, except because the zeroing is
synchronous (i.e. done directly by the truncate syscall) we can -
and should - return EIO immediately.

Indeed, with your code, if we then extend the file by truncating up
back to 4k, then the range between 23 and 512 is still bad, even
though we've successfully zeroed it and the user knows it. An
attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
will fail with EIO, but reading 10 bytes at offset 2000 will
succeed.

That's *awful* behaviour to expose to userspace, especially when
they look at the fs config and see that it's using both 4kB block
and sector sizes...

The only thing that makes sense from a filesystem perspective is
clearing bad page state when entire filesystem blocks are
overwritten. The data in a filesystem block is either good or bad,
and it doesn't matter how many internal (kernel or device) sectors
it has.
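
As a rough sketch of that contract, a DAX zeroing path could fail
partial-block zeroing over known poison immediately and only clear bad
page state on whole-block overwrites. All _sketch helpers below are
hypothetical, assuming the caller knows the filesystem block size and can
consult the badblocks list:

/*
 * Illustrative sketch of the "whole block or EIO" contract: a
 * sub-block zero over known poison fails immediately; only a zero
 * of the entire filesystem block may clear the bad page state.
 */
static int dax_zero_range_sketch(struct block_device *bdev,
				 sector_t sector, unsigned int offset,
				 unsigned int len, unsigned int fs_blocksize)
{
	bool whole_block = (offset == 0 && len == fs_blocksize);

	if (range_has_badblocks_sketch(bdev, sector, len)) {
		if (!whole_block)
			return -EIO;	/* partial write over known poison */
		return clear_poison_and_zero_sketch(bdev, sector, len);
	}
	return zero_via_dax_memset_sketch(bdev, sector, offset, len);
}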

> > And what happens to the rest?  The caller is left to trip over the
> > errors?  That sounds pretty terrible.  I really think there needs to be
> > an explicit contract here.
> 
> > Ok, I think this is the contentious bit. The current interface
> > (__dax_zero_page_range()) either clears the poison (if the I/O is
> > sector aligned) or expects the page to be free of poison.
> >
> > So in the above example of "truncate -s 23 foo.txt", currently I get an
> > error because the range being zeroed is not sector aligned, so
> > __dax_zero_page_range() falls back to calling direct_access(), which
> > fails because there are poisoned sectors in the page.
> >
> > With my patches, dax_zero_page_range() clears the poison from sectors 1
> > to 7 but leaves sector 0 untouched, just writes zeroes over bytes 0 to
> > 511, and returns success.

Ok, kernel sectors are not the unit of granularity bad page state
should be managed at. They don't match page state granularity, and
they don't match filesystem block granularity, and the wacky
"partial writes silently succeed, reads fail unpredictably"
asymmetry it leads to will just cause problems for users.

> > So the question is, is this better or worse behavior? If sector 0 was
> > poisoned, it will remain poisoned, and the caller will come to know
> > about it on the next read, at which point it should truncate the file
> > to length 0, unlink the file, or restore it to get rid of the poison.

Worse, because the filesystem can't track what sub-parts of the
block are bad and that leads to inconsistent data integrity status
being exposed to userspace.


> > IOW, if a partial block being zeroed is poisoned, the caller will not
> > be returned an error, the poison will not be cleared, and the memory
> > will be zeroed. What do we expect in such cases?
> >
> > Do we expect an interface where, if there are any bad blocks in the
> > range being zeroed, they all should be cleared (and hence all I/O
> > should be aligned), otherwise an error is returned? If yes, I could
> > make that change.
> >
> > The downside of the current interface is that it will clear as many
> > blocks as possible in the given range, leave the starting and ending
> > blocks poisoned (if the range is unaligned), and not return an error.
> > That means a reader will get an error on these blocks again and will
> > have to try to clear them again.

Which is solved by having partial page writes always EIO on poisoned
memory.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-23 23:03             ` Dave Chinner
@ 2020-02-24  0:40               ` Dan Williams
  -1 siblings, 0 replies; 103+ messages in thread
From: Dan Williams @ 2020-02-24  0:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Sun, Feb 23, 2020 at 3:03 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > Vivek Goyal <vgoyal@redhat.com> writes:
> > >
> > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > > >>
> > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > > >> > At least that seems to be the assumption with which the code has been
> > > >> > written. It is called only from pmem_do_bvec(), which is called only from
> > > >> > pmem_rw_page() and pmem_make_request(), which will only pass sector
> > > >> > aligned offset and len.
> > > >> >
> > > >> > Soon we want to use this function from the dax_zero_page_range() code
> > > >> > path, which can try to zero an arbitrary range of memory within a page.
> > > >> > So update this function to assume that offset and length can be
> > > >> > arbitrary and do the necessary alignments as needed.
> > > >>
> > > >> What caller will try to zero a range that is smaller than a sector?
> > > >
> > > > Hi Jeff,
> > > >
> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > a range which is less than a sector. Or which is bigger than a sector
> > > > but start and end are not aligned on sector boundaries.
> > >
> > > Sure, but who will call it with misaligned ranges?
> >
> > create a file foo.txt of size 4K and then truncate it.
> >
> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > 4095.
>
> This should fail with EIO. Only full page writes should clear the
> bad page state, and partial writes should therefore fail because
> they do not guarantee the data in the filesystem block is all good.
>
> If this zeroing was a buffered write to an address with a bad
> sector, then the writeback will fail and the user will (eventually)
> get an EIO on the file.
>
> DAX should do the same thing, except because the zeroing is
> synchronous (i.e. done directly by the truncate syscall) we can -
> and should - return EIO immediately.
>
> Indeed, with your code, if we then extend the file by truncating up
> back to 4k, then the range between 23 and 512 is still bad, even
> though we've successfully zeroed it and the user knows it. An
> attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> will fail with EIO, but reading 10 bytes at offset 2000 will
> succeed.
>
> That's *awful* behaviour to expose to userspace, especially when
> they look at the fs config and see that it's using both 4kB block
> and sector sizes...
>
> The only thing that makes sense from a filesystem perspective is
> clearing bad page state when entire filesystem blocks are
> overwritten. The data in a filesystem block is either good or bad,
> and it doesn't matter how many internal (kernel or device) sectors
> it has.
>
> > > And what happens to the rest?  The caller is left to trip over the
> > > errors?  That sounds pretty terrible.  I really think there needs to be
> > > an explicit contract here.
> >
> > Ok, I think this is the contentious bit. The current interface
> > (__dax_zero_page_range()) either clears the poison (if the I/O is
> > sector aligned) or expects the page to be free of poison.
> >
> > So in the above example of "truncate -s 23 foo.txt", currently I get an
> > error because the range being zeroed is not sector aligned, so
> > __dax_zero_page_range() falls back to calling direct_access(), which
> > fails because there are poisoned sectors in the page.
> >
> > With my patches, dax_zero_page_range() clears the poison from sectors 1
> > to 7 but leaves sector 0 untouched, just writes zeroes over bytes 0 to
> > 511, and returns success.
>
> Ok, kernel sectors are not the unit of granularity bad page state
> should be managed at. They don't match page state granularity, and
> they don't match filesystem block granularity, and the wacky
> "partial writes silently succeed, reads fail unpredictably"
> asymmetry it leads to will just cause problems for users.
>
> > So the question is, is this better or worse behavior? If sector 0 was
> > poisoned, it will remain poisoned, and the caller will come to know
> > about it on the next read, at which point it should truncate the file
> > to length 0, unlink the file, or restore it to get rid of the poison.
>
> Worse, because the filesystem can't track what sub-parts of the
> block are bad and that leads to inconsistent data integrity status
> being exposed to userspace.

The driver can't track it either. Latent poison isn't known until it is
consumed, and writes to latent poison are not guaranteed to clear it.

>
>
> > IOW, if a partial block being zeroed is poisoned, the caller will not
> > be returned an error, the poison will not be cleared, and the memory
> > will be zeroed. What do we expect in such cases?
> >
> > Do we expect an interface where, if there are any bad blocks in the
> > range being zeroed, they all should be cleared (and hence all I/O
> > should be aligned), otherwise an error is returned? If yes, I could
> > make that change.
> >
> > The downside of the current interface is that it will clear as many
> > blocks as possible in the given range, leave the starting and ending
> > blocks poisoned (if the range is unaligned), and not return an error.
> > That means a reader will get an error on these blocks again and will
> > have to try to clear them again.
>
> Which is solved by having partial page writes always EIO on poisoned
> memory.

The problem with the above is that partial page writes cannot be
guaranteed to return EIO. Poison is only detected on consumed reads,
or a periodic scrub, not writes. IFF poison detection were always
synchronous with poison creation, then the above would make sense.
However, with asynchronous signaling, it's fundamentally a false
security blanket to assume even full block writes will clear poison
unless a callback to firmware is made for every block.
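
For context, that per-range firmware callback in the libnvdimm path looks
roughly like the sketch below, simplified from the ND_CMD_CLEAR_ERROR flow
with alignment handling and most error paths omitted; treat the details as
an approximation rather than the exact driver code:

/*
 * Rough sketch of the firmware call behind poison clearing: the
 * platform is asked to clear [phys, phys + len), and the range is
 * assumed to respect the platform's clear_err_unit alignment.
 */
static long nvdimm_clear_poison_sketch(struct nvdimm_bus_descriptor *nd_desc,
				       phys_addr_t phys, unsigned int len)
{
	struct nd_cmd_clear_error clear_err = {
		.address = phys,
		.length = len,	/* assumed clear_err_unit aligned */
	};
	int rc, cmd_rc;

	rc = nd_desc->ndctl(nd_desc, NULL, ND_CMD_CLEAR_ERROR, &clear_err,
			    sizeof(clear_err), &cmd_rc);
	if (rc < 0)
		return rc;
	if (cmd_rc < 0)
		return cmd_rc;
	return clear_err.cleared;	/* bytes the firmware actually cleared */
}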

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-24  0:40               ` Dan Williams
@ 2020-02-24 13:50                 ` Jeff Moyer
  -1 siblings, 0 replies; 103+ messages in thread
From: Jeff Moyer @ 2020-02-24 13:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

Dan Williams <dan.j.williams@intel.com> writes:

> On Sun, Feb 23, 2020 at 3:03 PM Dave Chinner <david@fromorbit.com> wrote:
>>
>> On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
>> > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
>> > > Vivek Goyal <vgoyal@redhat.com> writes:
>> > >
>> > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
>> > > >> Vivek Goyal <vgoyal@redhat.com> writes:
>> > > >>
>> > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
>> > > >> > At least that seems to be the assumption with which the code has been
>> > > >> > written. It is called only from pmem_do_bvec(), which is called only from
>> > > >> > pmem_rw_page() and pmem_make_request(), which will only pass sector
>> > > >> > aligned offset and len.
>> > > >> >
>> > > >> > Soon we want to use this function from the dax_zero_page_range() code
>> > > >> > path, which can try to zero an arbitrary range of memory within a page.
>> > > >> > So update this function to assume that offset and length can be
>> > > >> > arbitrary and do the necessary alignments as needed.
>> > > >>
>> > > >> What caller will try to zero a range that is smaller than a sector?
>> > > >
>> > > > Hi Jeff,
>> > > >
>> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
>> > > > a range which is less than a sector. Or which is bigger than a sector
>> > > > but start and end are not aligned on sector boundaries.
>> > >
>> > > Sure, but who will call it with misaligned ranges?
>> >
>> > create a file foo.txt of size 4K and then truncate it.
>> >
>> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
>> > 4095.
>>
>> This should fail with EIO. Only full page writes should clear the
>> bad page state, and partial writes should therefore fail because
>> they do not guarantee the data in the filesystem block is all good.
>>
>> If this zeroing was a buffered write to an address with a bad
>> sector, then the writeback will fail and the user will (eventually)
>> get an EIO on the file.
>>
>> DAX should do the same thing, except because the zeroing is
>> synchronous (i.e. done directly by the truncate syscall) we can -
>> and should - return EIO immediately.
>>
>> Indeed, with your code, if we then extend the file by truncating up
>> back to 4k, then the range between 23 and 512 is still bad, even
>> though we've successfully zeroed it and the user knows it. An
>> attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
>> will fail with EIO, but reading 10 bytes at offset 2000 will
>> succeed.
>>
>> That's *awful* behaviour to expose to userspace, especially when
>> they look at the fs config and see that it's using both 4kB block
>> and sector sizes...
>>
>> The only thing that makes sense from a filesystem perspective is
>> clearing bad page state when entire filesystem blocks are
>> overwritten. The data in a filesystem block is either good or bad,
>> and it doesn't matter how many internal (kernel or device) sectors
>> it has.
>>
>> > > And what happens to the rest?  The caller is left to trip over the
>> > > errors?  That sounds pretty terrible.  I really think there needs to be
>> > > an explicit contract here.
>> >
>> > Ok, I think this is the contentious bit. The current interface
>> > (__dax_zero_page_range()) either clears the poison (if the I/O is
>> > sector aligned) or expects the page to be free of poison.
>> >
>> > So in the above example of "truncate -s 23 foo.txt", currently I get an
>> > error because the range being zeroed is not sector aligned, so
>> > __dax_zero_page_range() falls back to calling direct_access(), which
>> > fails because there are poisoned sectors in the page.
>> >
>> > With my patches, dax_zero_page_range() clears the poison from sectors 1
>> > to 7 but leaves sector 0 untouched, just writes zeroes over bytes 0 to
>> > 511, and returns success.
>>
>> Ok, kernel sectors are not the unit of granularity bad page state
>> should be managed at. They don't match page state granularity, and
>> they don't match filesystem block granularity, and the wacky
>> "partial writes silently succeed, reads fail unpredictably"
>> asymmetry it leads to will just cause problems for users.
>>
>> > So the question is, is this better or worse behavior? If sector 0 was
>> > poisoned, it will remain poisoned, and the caller will come to know
>> > about it on the next read, at which point it should truncate the file
>> > to length 0, unlink the file, or restore it to get rid of the poison.
>>
>> Worse, because the filesystem can't track what sub-parts of the
>> block are bad and that leads to inconsistent data integrity status
>> being exposed to userspace.
>
> The driver can't track it either. Latent poison isn't known until it is
> consumed, and writes to latent poison are not guaranteed to clear it.

I believe we're discussing the case where we know there is a bad block.
Obviously we can't know about latent errors.

>> > IOW, if a partial block being zeroed is poisoned, the caller will not
>> > be returned an error, the poison will not be cleared, and the memory
>> > will be zeroed. What do we expect in such cases?
>> >
>> > Do we expect an interface where, if there are any bad blocks in the
>> > range being zeroed, they all should be cleared (and hence all I/O
>> > should be aligned), otherwise an error is returned? If yes, I could
>> > make that change.
>> >
>> > The downside of the current interface is that it will clear as many
>> > blocks as possible in the given range, leave the starting and ending
>> > blocks poisoned (if the range is unaligned), and not return an error.
>> > That means a reader will get an error on these blocks again and will
>> > have to try to clear them again.
>>
>> Which is solved by having partial page writes always EIO on poisoned
>> memory.
>
> The problem with the above is that partial page writes cannot be
> guaranteed to return EIO. Poison is only detected on consumed reads,
> or a periodic scrub, not writes. IFF poison detection were always
> synchronous with poison creation, then the above would make sense.
> However, with asynchronous signaling, it's fundamentally a false
> security blanket to assume even full block writes will clear poison
> unless a callback to firmware is made for every block.

Let's just focus on reporting errors when we know we have them.

-Jeff

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-23 23:03             ` Dave Chinner
@ 2020-02-24 15:38               ` Vivek Goyal
  -1 siblings, 0 replies; 103+ messages in thread
From: Vivek Goyal @ 2020-02-24 15:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-nvdimm, hch, dm-devel

On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:

[..]
> > > > Hi Jeff,
> > > >
> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > a range which is less than a sector. Or which is bigger than a sector
> > > > but start and end are not aligned on sector boundaries.
> > > 
> > > Sure, but who will call it with misaligned ranges?
> > 
> > create a file foo.txt of size 4K and then truncate it.
> > 
> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > 4095.
> 
> This should fail with EIO. Only full page writes should clear the
> bad page state, and partial writes should therefore fail because
> they do not guarantee the data in the filesystem block is all good.
> 
> If this zeroing was a buffered write to an address with a bad
> sector, then the writeback will fail and the user will (eventually)
> get an EIO on the file.
> 
> DAX should do the same thing, except because the zeroing is
> synchronous (i.e. done directly by the truncate syscall) we can -
> and should - return EIO immediately.
> 
> Indeed, with your code, if we then extend the file by truncating up
> back to 4k, then the range between 23 and 512 is still bad, even
> though we've successfully zeroed it and the user knows it. An
> attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> will fail with EIO, but reading 10 bytes at offset 2000 will
> succeed.

Hi Dave,

What is expected if I do "truncate -s 512 foo.txt"? Say the first sector
(bytes 0 to 511) is poisoned and the rest of the page is not. Should this
fail with -EIO?

In the current implementation it does not, because we redirect all sector
aligned I/O through blkdev_issue_zeroout(), which will happily zero out
sectors 2-8 without worrying about the state of sector 1. Hence a user
who tries to read 10 bytes at offset 100 will still fail. This probably
should be fixed if we want to retain the existing behavior.

Anyway, a partial page truncate can't ensure that the data in the rest of
the page can be read back successfully. Memory can get poisoned after the
write, and hence a read after truncate will still fail.

Hence, all we are trying to ensure is that if poison is known at the time
of writing a partial page, then we should return an error to user space.
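
A minimal sketch of the sector-aligned redirect described above, assuming
the pre-patch shape of __dax_zero_page_range(); the memset fallback helper
name is illustrative:

/*
 * Illustrative sketch: sector aligned zeroing goes through the block
 * layer, which can clear known poison per sector; unaligned zeroing
 * falls back to direct_access() + memset(), which faults on poison.
 */
static int dax_zero_page_range_sketch(struct block_device *bdev,
				      struct dax_device *dax_dev,
				      sector_t sector, unsigned int offset,
				      unsigned int size)
{
	if (IS_ALIGNED(offset, SECTOR_SIZE) && IS_ALIGNED(size, SECTOR_SIZE))
		return blkdev_issue_zeroout(bdev,
					    sector + (offset >> SECTOR_SHIFT),
					    size >> SECTOR_SHIFT, GFP_NOFS, 0);

	return dax_memset_zero_sketch(dax_dev, sector, offset, size);
}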

Thanks
Vivek

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-23 23:03             ` Dave Chinner
@ 2020-02-24 20:13               ` Vivek Goyal
  -1 siblings, 0 replies; 103+ messages in thread
From: Vivek Goyal @ 2020-02-24 20:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-nvdimm, hch, dm-devel

On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > Vivek Goyal <vgoyal@redhat.com> writes:
> > > 
> > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > > >> 
> > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > > >> > At least that seems to be the assumption with which the code has been written.
> > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> > > >> > and pmem_make_request() which will only pass sector aligned offset and len.
> > > >> >
> > > >> > Soon we want to use this function from the dax_zero_page_range() code path which
> > > >> > can try to zero an arbitrary range of memory within a page. So update this
> > > >> > function to assume that offset and length can be arbitrary and do the
> > > >> > necessary alignments as needed.
> > > >> 
> > > >> What caller will try to zero a range that is smaller than a sector?
> > > >
> > > > Hi Jeff,
> > > >
> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > a range which is less than a sector. Or which is bigger than a sector
> > > > but start and end are not aligned on sector boundaries.
> > > 
> > > Sure, but who will call it with misaligned ranges?
> > 
> > create a file foo.txt of size 4K and then truncate it.
> > 
> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > 4095.
> 
> This should fail with EIO. Only full page writes should clear the
> bad page state, and partial writes should therefore fail because
> they do not guarantee the data in the filesystem block is all good.
> 
> If this zeroing was a buffered write to an address with a bad
> sector, then the writeback will fail and the user will (eventually)
> get an EIO on the file.
> 
> DAX should do the same thing, except because the zeroing is
> synchronous (i.e. done directly by the truncate syscall) we can -
> and should - return EIO immediately.
> 
> Indeed, with your code, if we then extend the file by truncating up
> back to 4k, then the range between 23 and 512 is still bad, even
> though we've successfully zeroed it and the user knows it. An
> attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> will fail with EIO, but reading 10 bytes at offset 2000 will
> succeed.
> 
> That's *awful* behaviour to expose to userspace, especially when
> they look at the fs config and see that it's using both 4kB block
> and sector sizes...
> 
> The only thing that makes sense from a filesystem perspective is
> clearing bad page state when entire filesystem blocks are
> overwritten. The data in a filesystem block is either good or bad,
> and it doesn't matter how many internal (kernel or device) sectors
> it has.
> 
> > > And what happens to the rest?  The caller is left to trip over the
> > > errors?  That sounds pretty terrible.  I really think there needs to be
> > > an explicit contract here.
> > 
> > Ok, I think this is the contentious bit. The current interface
> > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> > sector) or expects page to be free of poison.
> > 
> > So in above example, of "truncate -s 23 foo.txt", currently I get an error
> > because range being zeroed is not sector aligned. So
> > __dax_zero_page_range() falls back to calling direct_access(). Which
> > fails because there are poisoned sectors in the page.
> > 
> > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> > and returns success.
> 
> Ok, kernel sectors are not the unit of granularity bad page state
> should be managed at. They don't match page state granularity, and
> they don't match filesystem block granularity, and the whacky
> "partial writes silently succeed, reads fail unpredictably"
> asymmetry it leads to will just cause problems for users.
> 
> > So question is, is this better behavior or worse behavior. If sector 0
> > was poisoned, it will continue to remain poisoned and caller will come
> > to know about it on next read and then it should try to truncate file
> > to length 0 or unlink file or restore that file to get rid of poison.
> 
> Worse, because the filesystem can't track what sub-parts of the
> block are bad and that leads to inconsistent data integrity status
> being exposed to userspace.
> 
> 
> > IOW, if a partial block is being zeroed and if it is poisoned, caller
> > will not be returned an error and poison will not be cleared and memory
> > will be zeroed. What do we expect in such cases.
> > 
> > Do we expect an interface where if there are any bad blocks in the range
> > being zeroed, then they all should be cleared (and hence all I/O should
> > be aligned) otherwise error is returned. If yes, I could make that
> > change.
> > 
> > Downside of current interface is that it will clear as many blocks as
> > possible in the given range and leave starting and end blocks poisoned
> > (if it is unaligned) and not return error. That means a reader will
> > get error on these blocks again and they will have to try to clear it
> > again.
> 
> Which is solved by having partial page writes always EIO on poisoned
> memory.

Ok, how about if I add one more patch to the series which will check
whether the unwritten portion of the page has known poison. If it does,
-EIO is returned.


Subject: pmem: zero page range return error if poisoned memory in unwritten area

Filesystems call into pmem_dax_zero_page_range() to zero a partial page upon
truncate. If a partial page is being zeroed, then at the end of the operation
filesystems expect that there is no poison in the whole page (at least no
known poison).

So make sure the part of the partial page which is not being written does not
have poison. If it does, return an error. If there is poison in the area of
the page being written, it will be cleared.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 drivers/nvdimm/pmem.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

Index: redhat-linux/drivers/nvdimm/pmem.c
===================================================================
--- redhat-linux.orig/drivers/nvdimm/pmem.c	2020-02-24 10:57:48.520451478 -0500
+++ redhat-linux/drivers/nvdimm/pmem.c	2020-02-24 15:01:58.119854835 -0500
@@ -309,6 +309,19 @@ static int pmem_dax_zero_page_range(stru
 {
 	struct pmem_device *pmem = dax_get_private(dax_dev);
 
+	if (offset_in_page(offset)) {
+		sector_t unwritten_start = (offset & PAGE_MASK) >> SECTOR_SHIFT;
+		unsigned int unwritten_len = ALIGN(offset_in_page(offset), SECTOR_SIZE);
+		/*
+		 * If a partial page is being zeroed, make sure there is no known
+		 * poison in the area of the page not being zeroed; filesystems
+		 * expect the full page to be free of *known* poison afterwards.
+		 */
+		if (unlikely(is_bad_pmem(&pmem->bb, unwritten_start,
+					 unwritten_len)))
+			return -EIO;
+	}
+
 	return blk_status_to_errno(pmem_do_write(pmem, ZERO_PAGE(0), 0, offset,
 				   len));
 }
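
To make the effect concrete for the "truncate -s 23" example above, where
the zeroed range starts at byte 24 of the page, the new check works out
roughly as:

	offset_in_page(offset)  == 24, so we take the branch
	unwritten_start         == the first 512-byte sector of the page
	unwritten_len           == ALIGN(24, 512) == 512

i.e. only sector 0 contains unwritten bytes, and if that sector is listed
in pmem->bb the whole zeroing operation now fails with -EIO instead of
silently leaving poison behind.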


* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-24 13:50                 ` Jeff Moyer
@ 2020-02-24 20:48                   ` Dan Williams
  -1 siblings, 0 replies; 103+ messages in thread
From: Dan Williams @ 2020-02-24 20:48 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Dave Chinner, linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Mon, Feb 24, 2020 at 5:50 AM Jeff Moyer <jmoyer@redhat.com> wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
> > On Sun, Feb 23, 2020 at 3:03 PM Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> >> > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> >> > > Vivek Goyal <vgoyal@redhat.com> writes:
> >> > >
> >> > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> >> > > >> Vivek Goyal <vgoyal@redhat.com> writes:
> >> > > >>
> >> > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> >> > > >> > At least that seems to be the assumption with which the code has been written.
> >> > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> >> > > >> > and pmem_make_request() which will only pass sector aligned offset and len.
> >> > > >> >
> >> > > >> > Soon we want to use this function from the dax_zero_page_range() code path which
> >> > > >> > can try to zero an arbitrary range of memory within a page. So update this
> >> > > >> > function to assume that offset and length can be arbitrary and do the
> >> > > >> > necessary alignments as needed.
> >> > > >>
> >> > > >> What caller will try to zero a range that is smaller than a sector?
> >> > > >
> >> > > > Hi Jeff,
> >> > > >
> >> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> >> > > > a range which is less than a sector. Or which is bigger than a sector
> >> > > > but start and end are not aligned on sector boundaries.
> >> > >
> >> > > Sure, but who will call it with misaligned ranges?
> >> >
> >> > create a file foo.txt of size 4K and then truncate it.
> >> >
> >> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> >> > 4095.
> >>
> >> This should fail with EIO. Only full page writes should clear the
> >> bad page state, and partial writes should therefore fail because
> >> they do not guarantee the data in the filesystem block is all good.
> >>
> >> If this zeroing was a buffered write to an address with a bad
> >> sector, then the writeback will fail and the user will (eventually)
> >> get an EIO on the file.
> >>
> >> DAX should do the same thing, except because the zeroing is
> >> synchronous (i.e. done directly by the truncate syscall) we can -
> >> and should - return EIO immediately.
> >>
> >> Indeed, with your code, if we then extend the file by truncating up
> >> back to 4k, then the range between 23 and 512 is still bad, even
> >> though we've successfully zeroed it and the user knows it. An
> >> attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> >> will fail with EIO, but reading 10 bytes at offset 2000 will
> >> succeed.
> >>
> >> That's *awful* behaviour to expose to userspace, especially when
> >> they look at the fs config and see that it's using both 4kB block
> >> and sector sizes...
> >>
> >> The only thing that makes sense from a filesystem perspective is
> >> clearing bad page state when entire filesystem blocks are
> >> overwritten. The data in a filesystem block is either good or bad,
> >> and it doesn't matter how many internal (kernel or device) sectors
> >> it has.
> >>
> >> > > And what happens to the rest?  The caller is left to trip over the
> >> > > errors?  That sounds pretty terrible.  I really think there needs to be
> >> > > an explicit contract here.
> >> >
> >> > Ok, I think this is the contentious bit. The current interface
> >> > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> >> > sector) or expects page to be free of poison.
> >> >
> >> > So in above example, of "truncate -s 23 foo.txt", currently I get an error
> >> > because range being zeroed is not sector aligned. So
> >> > __dax_zero_page_range() falls back to calling direct_access(). Which
> >> > fails because there are poisoned sectors in the page.
> >> >
> >> > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> >> > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> >> > and returns success.
> >>
> >> Ok, kernel sectors are not the unit of granularity bad page state
> >> should be managed at. They don't match page state granularity, and
> >> they don't match filesystem block granularity, and the whacky
> >> "partial writes silently succeed, reads fail unpredictably"
> >> asymmetry it leads to will just cause problems for users.
> >>
> >> > So question is, is this better behavior or worse behavior. If sector 0
> >> > was poisoned, it will continue to remain poisoned and caller will come
> >> > to know about it on next read and then it should try to truncate file
> >> > to length 0 or unlink file or restore that file to get rid of poison.
> >>
> >> Worse, because the filesystem can't track what sub-parts of the
> >> block are bad and that leads to inconsistent data integrity status
> >> being exposed to userspace.
> >
> > The driver can't track it either. Latent poison isn't known until it is
> > consumed, and writes to latent poison are not guaranteed to clear it.
>
> I believe we're discussing the case where we know there is a bad block.
> Obviously we can't know about latent errors.
>
> >> > IOW, if a partial block is being zeroed and if it is poisoned, caller
> >> > will not be returned an error and poison will not be cleared and memory
> >> > will be zeroed. What do we expect in such cases.
> >> >
> >> > Do we expect an interface where if there are any bad blocks in the range
> >> > being zeroed, then they all should be cleared (and hence all I/O should
> >> > be aligned) otherwise error is returned. If yes, I could make that
> >> > change.
> >> >
> >> > Downside of current interface is that it will clear as many blocks as
> >> > possible in the given range and leave starting and end blocks poisoned
> >> > (if it is unaligned) and not return error. That means a reader will
> >> > get error on these blocks again and they will have to try to clear it
> >> > again.
> >>
> >> Which is solved by having partial page writes always EIO on poisoned
> >> memory.
> >
> > The problem with the above is that partial page writes can not be
> > guaranteed to return EIO. Poison is only detected on consumed reads,
> > or a periodic scrub, not writes. IFF poison detection was always
> > synchronous with poison creation then the above makes sense. However,
> > with asynchronous signaling, it's fundamentally a false security
> > blanket to assume even full block writes will clear poison unless a
> > callback to firmware is made for every block.
>
> Let's just focus on reporting errors when we know we have them.

That's the problem in my eyes. If software needs to contend with
latent error reporting then it should always contend with it; otherwise
software has multiple error models to wrangle.

Setting that aside we can start with just treating zeroing the same as
the copy_from_iter() case and fail the I/O at the dax_direct_access()
step.

I'd rather have a separate op that filesystems can use to clear errors
at block allocation time that can be enforced to have the correct
alignment.
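
For illustration, the kind of op I mean might look like this (hypothetical
sketch only -- this op does not exist today and the name is made up):

	/*
	 * Hypothetical: clear errors / initialize newly allocated blocks.
	 * Called by the filesystem at block allocation time, so the op can
	 * demand whatever alignment the device needs.
	 */
	int (*clear_poison)(struct dax_device *dax_dev, pgoff_t pgoff,
			    size_t nr_pages);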


* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-24 20:13               ` Vivek Goyal
@ 2020-02-24 20:52                 ` Dan Williams
  -1 siblings, 0 replies; 103+ messages in thread
From: Dan Williams @ 2020-02-24 20:52 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Dave Chinner, linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Mon, Feb 24, 2020 at 12:13 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> > On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > > Vivek Goyal <vgoyal@redhat.com> writes:
> > > >
> > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > > > >>
> > > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > > > >> > At least that seems to be the assumption with which the code has been written.
> > > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> > > > >> > and pmem_make_request() which will only pass sector aligned offset and len.
> > > > >> >
> > > > >> > Soon we want to use this function from the dax_zero_page_range() code path which
> > > > >> > can try to zero an arbitrary range of memory within a page. So update this
> > > > >> > function to assume that offset and length can be arbitrary and do the
> > > > >> > necessary alignments as needed.
> > > > >>
> > > > >> What caller will try to zero a range that is smaller than a sector?
> > > > >
> > > > > Hi Jeff,
> > > > >
> > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > > a range which is less than a sector. Or which is bigger than a sector
> > > > > but start and end are not aligned on sector boundaries.
> > > >
> > > > Sure, but who will call it with misaligned ranges?
> > >
> > > create a file foo.txt of size 4K and then truncate it.
> > >
> > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > 4095.
> >
> > This should fail with EIO. Only full page writes should clear the
> > bad page state, and partial writes should therefore fail because
> > they do not guarantee the data in the filesystem block is all good.
> >
> > If this zeroing was a buffered write to an address with a bad
> > sector, then the writeback will fail and the user will (eventually)
> > get an EIO on the file.
> >
> > DAX should do the same thing, except because the zeroing is
> > synchronous (i.e. done directly by the truncate syscall) we can -
> > and should - return EIO immediately.
> >
> > Indeed, with your code, if we then extend the file by truncating up
> > back to 4k, then the range between 23 and 512 is still bad, even
> > though we've successfully zeroed it and the user knows it. An
> > attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> > will fail with EIO, but reading 10 bytes at offset 2000 will
> > succeed.
> >
> > That's *awful* behaviour to expose to userspace, especially when
> > they look at the fs config and see that it's using both 4kB block
> > and sector sizes...
> >
> > The only thing that makes sense from a filesystem perspective is
> > clearing bad page state when entire filesystem blocks are
> > overwritten. The data in a filesystem block is either good or bad,
> > and it doesn't matter how many internal (kernel or device) sectors
> > it has.
> >
> > > > And what happens to the rest?  The caller is left to trip over the
> > > > errors?  That sounds pretty terrible.  I really think there needs to be
> > > > an explicit contract here.
> > >
> > > Ok, I think this is the contentious bit. The current interface
> > > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> > > sector) or expects page to be free of poison.
> > >
> > > So in above example, of "truncate -s 23 foo.txt", currently I get an error
> > > because range being zeroed is not sector aligned. So
> > > __dax_zero_page_range() falls back to calling direct_access(). Which
> > > fails because there are poisoned sectors in the page.
> > >
> > > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> > > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> > > and returns success.
> >
> > Ok, kernel sectors are not the unit of granularity bad page state
> > should be managed at. They don't match page state granularity, and
> > they don't match filesystem block granularity, and the whacky
> > "partial writes silently succeed, reads fail unpredictably"
> > asymmetry it leads to will just cause problems for users.
> >
> > > So question is, is this better behavior or worse behavior. If sector 0
> > > was poisoned, it will continue to remain poisoned and caller will come
> > > to know about it on next read and then it should try to truncate file
> > > to length 0 or unlink file or restore that file to get rid of poison.
> >
> > Worse, because the filesystem can't track what sub-parts of the
> > block are bad and that leads to inconsistent data integrity status
> > being exposed to userspace.
> >
> >
> > > IOW, if a partial block is being zeroed and if it is poisoned, caller
> > > will not be returned an error and poison will not be cleared and memory
> > > will be zeroed. What do we expect in such cases.
> > >
> > > Do we expect an interface where if there are any bad blocks in the range
> > > being zeroed, then they all should be cleared (and hence all I/O should
> > > be aligned) otherwise error is returned. If yes, I could make that
> > > change.
> > >
> > > Downside of current interface is that it will clear as many blocks as
> > > possible in the given range and leave starting and end blocks poisoned
> > > (if it is unaligned) and not return error. That means a reader will
> > > get error on these blocks again and they will have to try to clear it
> > > again.
> >
> > Which is solved by having partial page writes always EIO on poisoned
> > memory.
>
> Ok, how about if I add one more patch to the series which will check
> if unwritten portion of the page has known poison. If it has, then
> -EIO is returned.
>
>
> Subject: pmem: zero page range return error if poisoned memory in unwritten area
>
> Filesystems call into pmem_dax_zero_page_range() to zero a partial page upon
> truncate. If a partial page is being zeroed, then at the end of the operation
> filesystems expect that there is no poison in the whole page (at least no
> known poison).
>
> So make sure the part of the partial page which is not being written does not
> have poison. If it does, return an error. If there is poison in the area of
> the page being written, it will be cleared.

No, I don't like that the zero operation is special cased compared to
the write case. I'd say let's make them identical for now. I.e. fail
the I/O at dax_direct_access() time. I think the error clearing
interface should be an explicit / separate op rather than a
side-effect. What about an explicit interface for initializing newly
allocated blocks, and the only reliable way to destroy poison through
the filesystem is to free the block?
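
For reference, dax_direct_access() already fails on known poison today;
the check in the pmem driver looks roughly like this (simplified from
drivers/nvdimm/pmem.c):

	/* reject direct_access if the range overlaps a known badblock */
	if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
					PFN_PHYS(nr_pages) / 512)))
		return -EIO;

so routing zeroing through the same check makes a misaligned zero of a
poisoned page fail up front instead of half-succeeding.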


* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-24 20:52                 ` Dan Williams
@ 2020-02-24 21:15                   ` Vivek Goyal
  -1 siblings, 0 replies; 103+ messages in thread
From: Vivek Goyal @ 2020-02-24 21:15 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Mon, Feb 24, 2020 at 12:52:13PM -0800, Dan Williams wrote:
> On Mon, Feb 24, 2020 at 12:13 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> > > On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> > > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > > > Vivek Goyal <vgoyal@redhat.com> writes:
> > > > >
> > > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > > > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > > > > >>
> > > > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > > > > >> > At least that seems to be the assumption with which the code has been written.
> > > > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> > > > > >> > and pmem_make_request() which will only pass sector aligned offset and len.
> > > > > >> >
> > > > > >> > Soon we want to use this function from the dax_zero_page_range() code path which
> > > > > >> > can try to zero an arbitrary range of memory within a page. So update this
> > > > > >> > function to assume that offset and length can be arbitrary and do the
> > > > > >> > necessary alignments as needed.
> > > > > >>
> > > > > >> What caller will try to zero a range that is smaller than a sector?
> > > > > >
> > > > > > Hi Jeff,
> > > > > >
> > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > > > a range which is less than a sector. Or which is bigger than a sector
> > > > > > but start and end are not aligned on sector boundaries.
> > > > >
> > > > > Sure, but who will call it with misaligned ranges?
> > > >
> > > > create a file foo.txt of size 4K and then truncate it.
> > > >
> > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > > 4095.
> > >
> > > This should fail with EIO. Only full page writes should clear the
> > > bad page state, and partial writes should therefore fail because
> > > they do not guarantee the data in the filesystem block is all good.
> > >
> > > If this zeroing was a buffered write to an address with a bad
> > > sector, then the writeback will fail and the user will (eventually)
> > > get an EIO on the file.
> > >
> > > DAX should do the same thing, except because the zeroing is
> > > synchronous (i.e. done directly by the truncate syscall) we can -
> > > and should - return EIO immediately.
> > >
> > > Indeed, with your code, if we then extend the file by truncating up
> > > back to 4k, then the range between 23 and 512 is still bad, even
> > > though we've successfully zeroed it and the user knows it. An
> > > attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> > > will fail with EIO, but reading 10 bytes at offset 2000 will
> > > succeed.
> > >
> > > That's *awful* behaviour to expose to userspace, especially when
> > > they look at the fs config and see that it's using both 4kB block
> > > and sector sizes...
> > >
> > > The only thing that makes sense from a filesystem perspective is
> > > clearing bad page state when entire filesystem blocks are
> > > overwritten. The data in a filesystem block is either good or bad,
> > > and it doesn't matter how many internal (kernel or device) sectors
> > > it has.
> > >
> > > > > And what happens to the rest?  The caller is left to trip over the
> > > > > errors?  That sounds pretty terrible.  I really think there needs to be
> > > > > an explicit contract here.
> > > >
> > > > Ok, I think this is the contentious bit. The current interface
> > > > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> > > > sector) or expects page to be free of poison.
> > > >
> > > > So in above example, of "truncate -s 23 foo.txt", currently I get an error
> > > > because range being zeroed is not sector aligned. So
> > > > __dax_zero_page_range() falls back to calling direct_access(). Which
> > > > fails because there are poisoned sectors in the page.
> > > >
> > > > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> > > > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> > > > and returns success.
> > >
> > > Ok, kernel sectors are not the unit of granularity bad page state
> > > should be managed at. They don't match page state granularity, and
> > > they don't match filesystem block granularity, and the whacky
> > > "partial writes silently succeed, reads fail unpredictably"
> > > asymmetry it leads to will just cause problems for users.
> > >
> > > > So question is, is this better behavior or worse behavior. If sector 0
> > > > was poisoned, it will continue to remain poisoned and caller will come
> > > > to know about it on next read and then it should try to truncate file
> > > > to length 0 or unlink file or restore that file to get rid of poison.
> > >
> > > Worse, because the filesystem can't track what sub-parts of the
> > > block are bad and that leads to inconsistent data integrity status
> > > being exposed to userspace.
> > >
> > >
> > > > IOW, if a partial block is being zeroed and if it is poisoned, caller
> > > > will not be returned an error and poison will not be cleared and memory
> > > > will be zeroed. What do we expect in such cases.
> > > >
> > > > Do we expect an interface where if there are any bad blocks in the range
> > > > being zeroed, then they all should be cleared (and hence all I/O should
> > > > be aligned) otherwise error is returned. If yes, I could make that
> > > > change.
> > > >
> > > > Downside of current interface is that it will clear as many blocks as
> > > > possible in the given range and leave starting and end blocks poisoned
> > > > (if it is unaligned) and not return error. That means a reader will
> > > > get error on these blocks again and they will have to try to clear it
> > > > again.
> > >
> > > Which is solved by having partial page writes always EIO on poisoned
> > > memory.
> >
> > Ok, how about if I add one more patch to the series which will check
> > whether the unwritten portion of the page has known poison. If it does,
> > -EIO is returned.
> >
> >
> > Subject: pmem: zero page range return error if poisoned memory in unwritten area
> >
> > Filesystems call into pmem_dax_zero_page_range() to zero a partial page upon
> > truncate. If a partial page is being zeroed, then at the end of the operation
> > filesystems expect that there is no poison in the whole page (at least no
> > known poison).
> >
> > So make sure the part of the partial page which is not being written does not
> > have poison. If it does, return an error. If there is poison in the area of
> > the page being written, it will be cleared.
> 
> No, I don't like that the zero operation is special cased compared to
> the write case. I'd say let's make them identical for now. I.e. fail
> the I/O at dax_direct_access() time.

So basically __dax_zero_page_range() will only write zeros (and not
try to clear any poison). Right?

> I think the error clearing
> interface should be an explicit / separate op rather than a
> side-effect. What about an explicit interface for initializing newly
> allocated blocks, and the only reliable way to destroy poison through
> the filesystem is to free the block?

Effectively, pmem_make_request() is already that interface: filesystems
use it to initialize blocks and clear poison. So we don't really have to
introduce a new interface?
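
The write side of pmem_do_bvec() already does, roughly (simplified sketch
of the existing driver code):

	flush_dcache_page(page);
	write_pmem(pmem_addr, page, off, len);
	if (unlikely(bad_pmem)) {
		/* clear the badblocks entry, then rewrite the data */
		rc = pmem_clear_poison(pmem, pmem_off, len);
		write_pmem(pmem_addr, page, off, len);
	}

i.e. a sector aligned write through the block layer clears known poison
as a side effect.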

Or are you suggesting a separate dax_zero_page_range() interface which will
always call into firmware to clear poison? That would make sure latent
poison is cleared as well, and filesystems should use that for block
initialization instead. I do like the idea of not having to differentiate
between known poison and latent poison. Once a block has been initialized,
all poison should be cleared (known/latent). I am worried, though, that
on large devices this might slow down filesystem initialization a lot
if they are zeroing a large range of blocks.

If yes, this sounds like two different patch series. The first patch series
takes care of removing blkdev_issue_zeroout() from
__dax_zero_page_range(), plus a couple of iomap-related cleanups Christoph
wanted.

And a second patch series adds a new dax operation to zero a range,
always calls into firmware to clear poison, and modifies filesystems
accordingly.

Thanks
Vivek

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-24 21:15                   ` Vivek Goyal
@ 2020-02-24 21:32                     ` Dan Williams
  -1 siblings, 0 replies; 103+ messages in thread
From: Dan Williams @ 2020-02-24 21:32 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Dave Chinner, linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Mon, Feb 24, 2020 at 1:17 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Mon, Feb 24, 2020 at 12:52:13PM -0800, Dan Williams wrote:
> > On Mon, Feb 24, 2020 at 12:13 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> > > > On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> > > > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > > > > Vivek Goyal <vgoyal@redhat.com> writes:
> > > > > >
> > > > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > > > > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > > > > > >>
> > > > > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > > > > > >> > At least that seems to be the assumption with which the code has been written.
> > > > > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> > > > > > >> > and pmem_make_request() which will only pass sector aligned offset and len.
> > > > > > >> >
> > > > > > >> > Soon we want to use this function from the dax_zero_page_range() code path which
> > > > > > >> > can try to zero an arbitrary range of memory within a page. So update this
> > > > > > >> > function to assume that offset and length can be arbitrary and do the
> > > > > > >> > necessary alignments as needed.
> > > > > > >>
> > > > > > >> What caller will try to zero a range that is smaller than a sector?
> > > > > > >
> > > > > > > Hi Jeff,
> > > > > > >
> > > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > > > > a range which is less than a sector. Or which is bigger than a sector
> > > > > > > but start and end are not aligned on sector boundaries.
> > > > > >
> > > > > > Sure, but who will call it with misaligned ranges?
> > > > >
> > > > > create a file foo.txt of size 4K and then truncate it.
> > > > >
> > > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > > > 4095.
> > > >
> > > > This should fail with EIO. Only full page writes should clear the
> > > > bad page state, and partial writes should therefore fail because
> > > > they do not guarantee the data in the filesystem block is all good.
> > > >
> > > > If this zeroing was a buffered write to an address with a bad
> > > > sector, then the writeback will fail and the user will (eventually)
> > > > get an EIO on the file.
> > > >
> > > > DAX should do the same thing, except because the zeroing is
> > > > synchronous (i.e. done directly by the truncate syscall) we can -
> > > > and should - return EIO immediately.
> > > >
> > > > Indeed, with your code, if we then extend the file by truncating up
> > > > back to 4k, then the range between 23 and 512 is still bad, even
> > > > though we've successfully zeroed it and the user knows it. An
> > > > attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> > > > will fail with EIO, but reading 10 bytes at offset 2000 will
> > > > succeed.
> > > >
> > > > That's *awful* behaviour to expose to userspace, especially when
> > > > they look at the fs config and see that it's using both 4kB block
> > > > and sector sizes...
> > > >
> > > > The only thing that makes sense from a filesystem perspective is
> > > > clearing bad page state when entire filesystem blocks are
> > > > overwritten. The data in a filesystem block is either good or bad,
> > > > and it doesn't matter how many internal (kernel or device) sectors
> > > > it has.
> > > >
> > > > > > And what happens to the rest?  The caller is left to trip over the
> > > > > > errors?  That sounds pretty terrible.  I really think there needs to be
> > > > > > an explicit contract here.
> > > > >
> > > > > Ok, I think this is the contentious bit. The current interface
> > > > > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> > > > > sector) or expects page to be free of poison.
> > > > >
> > > > > So in above example, of "truncate -s 23 foo.txt", currently I get an error
> > > > > because range being zeroed is not sector aligned. So
> > > > > __dax_zero_page_range() falls back to calling direct_access(). Which
> > > > > fails because there are poisoned sectors in the page.
> > > > >
> > > > > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> > > > > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> > > > > and returns success.
> > > >
> > > > Ok, kernel sectors are not the unit of granularity bad page state
> > > > should be managed at. They don't match page state granularity, and
> > > > they don't match filesystem block granularity, and the wacky
> > > > "partial writes silently succeed, reads fail unpredictably"
> > > > asymmetry it leads to will just cause problems for users.
> > > >
> > > > > So question is, is this better behavior or worse behavior. If sector 0
> > > > > was poisoned, it will continue to remain poisoned and caller will come
> > > > > to know about it on next read and then it should try to truncate file
> > > > > to length 0 or unlink file or restore that file to get rid of poison.
> > > >
> > > > Worse, because the filesystem can't track what sub-parts of the
> > > > block are bad and that leads to inconsistent data integrity status
> > > > being exposed to userspace.
> > > >
> > > >
> > > > > IOW, if a partial block is being zeroed and if it is poisoned, the caller
> > > > > will not be returned an error, poison will not be cleared, and memory
> > > > > will be zeroed. What do we expect in such cases?
> > > > >
> > > > > Do we expect an interface where if there are any bad blocks in the range
> > > > > being zeroed, then they all should be cleared (and hence all I/O should
> > > > > be aligned) otherwise error is returned. If yes, I could make that
> > > > > change.
> > > > >
> > > > > Downside of current interface is that it will clear as many blocks as
> > > > > possible in the given range and leave starting and end blocks poisoned
> > > > > (if it is unaligned) and not return error. That means a reader will
> > > > > get error on these blocks again and they will have to try to clear it
> > > > > again.
> > > >
> > > > Which is solved by having partial page writes always EIO on poisoned
> > > > memory.
> > >
> > > Ok, how about if I add one more patch to the series which will check
> > > if unwritten portion of the page has known poison. If it has, then
> > > -EIO is returned.
> > >
> > >
> > > Subject: pmem: zero page range return error if poisoned memory in unwritten area
> > >
> > > Filesystems call into pmem_dax_zero_page_range() to zero a partial page upon
> > > truncate. If a partial page is being zeroed, then at the end of the operation
> > > file systems expect that there is no poison in the whole page (at least
> > > known poison).
> > >
> > > So make sure the part of the partial page which is not being written does not
> > > have poison. If it does, return an error. If there is poison in the area of the
> > > page being written, it will be cleared.
> >
> > No, I don't like that the zero operation is special cased compared to
> > the write case. I'd say let's make them identical for now. I.e. fail
> > the I/O at dax_direct_access() time.
>
> So basically __dax_zero_page_range() will only write zeros (and not
> try to clear any poison). Right?

Yes, the zero operation would have already failed at the
dax_direct_access() step if poison was present.

> > I think the error clearing
> > interface should be an explicit / separate op rather than a
> > side-effect. What about an explicit interface for initializing newly
> > allocated blocks, and the only reliable way to destroy poison through
> > the filesystem is to free the block?
>
> Effectively pmem_make_request() is already that interface filesystems
> use to initialize blocks and clear poison. So we don't really have to
> introduce a new interface?

pmem_make_request() is shared with the I/O path and is too low in the
stack to understand intent. DAX intercepts the I/O path closer to the
filesystem and can understand zeroing vs writing today. I'm proposing
we go a step further and make DAX understand free-to-allocated-block
initialization instead of just zeroing. Inject the error clearing into
that initialization interface.
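
Roughly the below (hypothetical sketch, all names invented), where the
op would only be reachable from the filesystem's block allocation path:

/*
 * Hypothetical: initialization of a newly allocated block both zeroes
 * it and destroys any poison; plain zeroing / writes never would.
 */
static int pmem_dax_initialize_page(struct dax_device *dax_dev, pgoff_t pgoff)
{
	struct pmem_device *pmem = dax_get_private(dax_dev);
	phys_addr_t off = PFN_PHYS(pgoff) + pmem->data_offset;

	/* destroy known poison first, then write zeroes */
	if (pmem_clear_poison(pmem, off, PAGE_SIZE) != BLK_STS_OK)
		return -EIO;
	write_pmem(pmem->virt_addr + off, ZERO_PAGE(0), 0, PAGE_SIZE);
	return 0;
}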

> Or are you suggesting a separate dax_zero_page_range() interface which will
> always call into firmware to clear poison? That will make sure latent
> poison is cleared as well, and filesystems should use that for block
> initialization instead.

Yes, except latent poison would not be cleared until the zeroing is
implemented with movdir64b instead of callouts to firmware. It's
otherwise too slow to call out to firmware unconditionally.
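
(For the record, there is no kernel helper for movdir64b today, so a
sketch of such zeroing would have to open-code the instruction,
something like:)

/*
 * Hypothetical sketch: zero a 64-byte-aligned range with MOVDIR64B.
 * A direct store writes the full cache line without a read-modify-
 * write, which is what would let it overwrite latent poison.
 */
static void movdir64b_zero(void *dst, size_t len)
{
	static const char zeroes[64] __aligned(64);
	char *d = dst;

	for (; len >= 64; len -= 64, d += 64)
		/* movdir64b (%rdx), %rax */
		asm volatile(".byte 0x66, 0x0f, 0x38, 0xf8, 0x02"
			     : : "a" (d), "d" (zeroes) : "memory");
}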

> I do like the idea of not having to differentiate
> between known poison and latent poison. Once a block has been initialized,
> all poison should be cleared (known/latent). I am worried though that
> on large devices this might slow down filesystem initialization a lot
> if they are zeroing a large range of blocks.
>
> If yes, this sounds like two different patch series. The first patch series
> takes care of removing blkdev_issue_zeroout() from
> __dax_zero_page_range() and a couple of iomap-related cleanups Christoph
> wanted.
>
> And a second patch series for adding a new dax operation to zero a range
> that always calls into firmware to clear poison, and modifying filesystems
> accordingly.

Yes, but they may need to be merged together. I don't want to regress
the ability of a block-aligned hole-punch to clear errors.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-24 20:48                   ` Dan Williams
@ 2020-02-24 21:53                     ` Jeff Moyer
  -1 siblings, 0 replies; 103+ messages in thread
From: Jeff Moyer @ 2020-02-24 21:53 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

Dan Williams <dan.j.williams@intel.com> writes:

>> Let's just focus on reporting errors when we know we have them.
>
> That's the problem in my eyes. If software needs to contend with
> latent error reporting then it should always contend otherwise
> software has multiple error models to wrangle.

The only way for an application to know that the data has been written
successfully would be to issue a read after every write.  That's not a
performance hit most applications are willing to take.  And, of course,
the media can still go bad at a later time, so it only guarantees the
data is accessible immediately after having been written.

What I'm suggesting is that we should not complete a write successfully
if we know that the data will not be retrievable.  I wouldn't call this
adding an extra error model to contend with.  Applications should
already be checking for errors on write.

Does that make sense?  Are we talking past each other?

> Setting that aside we can start with just treating zeroing the same as
> the copy_from_iter() case and fail the I/O at the dax_direct_access()
> step.

OK.

> I'd rather have a separate op that filesystems can use to clear errors
> at block allocation time that can be enforced to have the correct
> alignment.

So would file systems always call that routine instead of zeroing, or
would they first check to see if there are badblocks?

-Jeff

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-24 21:53                     ` Jeff Moyer
@ 2020-02-25  0:26                       ` Dan Williams
  -1 siblings, 0 replies; 103+ messages in thread
From: Dan Williams @ 2020-02-25  0:26 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Dave Chinner, linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer <jmoyer@redhat.com> wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
> >> Let's just focus on reporting errors when we know we have them.
> >
> > That's the problem in my eyes. If software needs to contend with
> > latent error reporting then it should always contend otherwise
> > software has multiple error models to wrangle.
>
> The only way for an application to know that the data has been written
> successfully would be to issue a read after every write.  That's not a
> performance hit most applications are willing to take.  And, of course,
> the media can still go bad at a later time, so it only guarantees the
> data is accessible immediately after having been written.
>
> What I'm suggesting is that we should not complete a write successfully
> if we know that the data will not be retrievable.  I wouldn't call this
> adding an extra error model to contend with.  Applications should
> already be checking for errors on write.
>
> Does that make sense? Are we talking past each other?

The badblocks list is late to update in both directions: late to add
entries that the scrub still needs to find, and late to delete entries
that were inadvertently cleared by cache-line writes that did not first
ingest the poison for a read-modify-write. So I see the above as being
wishful in using the error list as the hard source of truth, and
unfortunate in up-leveling all sub-sector error entries into full
PAGE_SIZE data-offline events.

I'm hoping we can find a way to make the error handling more fine
grained over time, but for the current patch, managing the blast
radius as PAGE_SIZE granularity at least matches the zero path with
the write path.
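
(Concretely, pmem already answers "is this range poisoned?" from the
badblocks list at sector granularity; a PAGE_SIZE policy just rounds
the checked range out to the containing page. Sketch, modeled on
pmem's is_bad_pmem() helper:)

static bool page_has_poison(struct badblocks *bb, phys_addr_t off)
{
	sector_t first_bad;
	int num_bad;

	/* check all sectors of the page containing 'off' */
	return bb->count &&
		badblocks_check(bb, ALIGN_DOWN(off, PAGE_SIZE) >> SECTOR_SHIFT,
				PAGE_SIZE >> SECTOR_SHIFT, &first_bad,
				&num_bad);
}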

> > Setting that aside we can start with just treating zeroing the same as
> > the copy_from_iter() case and fail the I/O at the dax_direct_access()
> > step.
>
> OK.
>
> > I'd rather have a separate op that filesystems can use to clear errors
> > at block allocation time that can be enforced to have the correct
> > alignment.
>
> So would file systems always call that routine instead of zeroing, or
> would they first check to see if there are badblocks?

The proposal is that filesystems distinguish zeroing from free-block
allocation/initialization such that the fsdax implementation directs
initialization to a driver callback. This "initialization op" would
take care to check for poison and clear it. All other dax paths would
not consult the badblocks list.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-24 21:32                     ` Dan Williams
@ 2020-02-25 13:36                       ` Vivek Goyal
  -1 siblings, 0 replies; 103+ messages in thread
From: Vivek Goyal @ 2020-02-25 13:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Mon, Feb 24, 2020 at 01:32:58PM -0800, Dan Williams wrote:

[..]
> > > > Ok, how about if I add one more patch to the series which will check
> > > > if unwritten portion of the page has known poison. If it has, then
> > > > -EIO is returned.
> > > >
> > > >
> > > > Subject: pmem: zero page range return error if poisoned memory in unwritten area
> > > >
> > > > > Filesystems call into pmem_dax_zero_page_range() to zero a partial page upon
> > > > > truncate. If a partial page is being zeroed, then at the end of the operation
> > > > > file systems expect that there is no poison in the whole page (at least
> > > > > known poison).
> > > > >
> > > > > So make sure the part of the partial page which is not being written does not
> > > > > have poison. If it does, return an error. If there is poison in the area of the
> > > > > page being written, it will be cleared.
> > >
> > > No, I don't like that the zero operation is special cased compared to
> > > the write case. I'd say let's make them identical for now. I.e. fail
> > > the I/O at dax_direct_access() time.
> >
> > So basically __dax_zero_page_range() will only write zeros (and not
> > try to clear any poison). Right?
> 
> Yes, the zero operation would have already failed at the
> dax_direct_access() step if poison was present.
> 
> > > I think the error clearing
> > > interface should be an explicit / separate op rather than a
> > > side-effect. What about an explicit interface for initializing newly
> > > allocated blocks, and the only reliable way to destroy poison through
> > > the filesystem is to free the block?
> >
> > Effectively pmem_make_request() is already that interface filesystems
> > use to initialize blocks and clear poison. So we don't really have to
> > introduce a new interface?
> 
> pmem_make_request() is shared with the I/O path and is too low in the
> stack to understand intent. DAX intercepts the I/O path closer to the
> filesystem and can understand zeroing vs writing today. I'm proposing
> we go a step further and make DAX understand free-to-allocated-block
> initialization instead of just zeroing. Inject the error clearing into
> that initialization interface.
> 
> > Or are you suggesting a separate dax_zero_page_range() interface which will
> > always call into firmware to clear poison? That will make sure latent
> > poison is cleared as well, and filesystems should use that for block
> > initialization instead.
> 
> Yes, except latent poison would not be cleared until the zeroing is
> implemented with movdir64b instead of callouts to firmware. It's
> otherwise too slow to call out to firmware unconditionally.
> 
> > I do like the idea of not having to differentiate
> > between known poison and latent poison. Once a block has been initialized,
> > all poison should be cleared (known/latent). I am worried though that
> > on large devices this might slow down filesystem initialization a lot
> > if they are zeroing a large range of blocks.
> >
> > If yes, this sounds like two different patch series. The first patch series
> > takes care of removing blkdev_issue_zeroout() from
> > __dax_zero_page_range() and a couple of iomap-related cleanups Christoph
> > wanted.
> >
> > And a second patch series for adding a new dax operation to zero a range
> > that always calls into firmware to clear poison, and modifying filesystems
> > accordingly.
> 
> Yes, but they may need to be merged together. I don't want to regress
> the ability of a block-aligned hole-punch to clear errors.

Hi Dan,

IIUC, block aligned hole punches don't go through the __dax_zero_page_range()
path. Instead they call blkdev_issue_zeroout() at a later point in time.

Only the partial block zeroing path takes __dax_zero_page_range(). So
even if we remove the poison clearing code from __dax_zero_page_range(),
there should not be a regression w.r.t. full block zeroing. The only
possible regression will be if somebody was doing partial block zeroing
on a sector boundary; then poison will not be cleared.

We now seem to be discussing too many issues w.r.t. poison clearing
and dax. At least 3 issues are mentioned in this thread.

A. Get rid of dependency on block device in dax zeroing path.
   (__dax_zero_page_range)

B. Provide a way to clear latent poison. And possibly use movdir64b to
   do that and make filesystems use that interface for initialization
   of blocks.

C. The dax zero operation clears known poison while copy_from_iter() does
   not. I guess this ship has already sailed. If we change it now,
   somebody will complain of some regression.

For issue A, there are two possible ways to deal with it.

1. Implement a dax method to zero page. And this method will also clear
   known poison. This is what my patch series is doing.

2. Just get rid of blkdev_issue_zeroout() from __dax_zero_page_range()
   so that no poison will be cleared in __dax_zero_page_range() path. This
   path is currently used in partial page zeroing path and full filesystem
   block zeroing happens with blkdev_issue_zeroout(). There is a small
   chance of regression here in case of sector aligned partial block
   zeroing.

My patch series takes care of issue A without any regressions. In fact it
improves the current interface. For example, currently "truncate -s 512
foo.txt" will succeed even if the first sector in the block is poisoned. My
patch series fixes it. The current implementation will return an error if any
non sector aligned truncate is done and any of the sectors is poisoned. My
implementation will not return an error if poison can be cleared as part
of zeroing. It will return an error only if poison is present in the part
not being zeroed.
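
IOW the series rounds an arbitrary range inward to whole sectors before
clearing poison, roughly (sketch; names approximate patch 2):

static void clear_poison_aligned(struct pmem_device *pmem,
				 struct device *dev, phys_addr_t off,
				 unsigned int len)
{
	/* shrink to whole sectors; partial head/tail bytes are only
	 * zeroed, any poison there is left in place */
	phys_addr_t start = ALIGN(off, SECTOR_SIZE);
	phys_addr_t end = ALIGN_DOWN(off + len, SECTOR_SIZE);

	if (start < end)
		nvdimm_clear_poison(dev, pmem->phys_addr + start,
				    end - start);
}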

Why don't we solve issue A now and deal with issues B and C later in
a separate patch series? This patch series gets rid of the dependency on
the block device in the dax path and also makes the current zeroing
interface better.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-25 13:36                       ` Vivek Goyal
@ 2020-02-25 16:25                         ` Dan Williams
  -1 siblings, 0 replies; 103+ messages in thread
From: Dan Williams @ 2020-02-25 16:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Dave Chinner, linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Tue, Feb 25, 2020 at 5:37 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Mon, Feb 24, 2020 at 01:32:58PM -0800, Dan Williams wrote:
>
> [..]
> > > > > Ok, how about if I add one more patch to the series which will check
> > > > > if unwritten portion of the page has known poison. If it has, then
> > > > > -EIO is returned.
> > > > >
> > > > >
> > > > > Subject: pmem: zero page range return error if poisoned memory in unwritten area
> > > > >
> > > > > Filesystems call into pmem_dax_zero_page_range() to zero partial page upon
> > > > > truncate. If partial page is being zeroed, then at the end of operation
> > > > > file systems expect that there is no poison in the whole page (atleast
> > > > > known poison).
> > > > >
> > > > > So make sure part of the partial page which is not being written, does not
> > > > > have poison. If it does, return error. If there is poison in area of page
> > > > > being written, it will be cleared.
> > > >
> > > > No, I don't like that the zero operation is special cased compared to
> > > > the write case. I'd say let's make them identical for now. I.e. fail
> > > > the I/O at dax_direct_access() time.
> > >
> > > So basically __dax_zero_page_range() will only write zeros (and not
> > > try to clear any poison). Right?
> >
> > Yes, the zero operation would have already failed at the
> > dax_direct_access() step if there was present poison.
> >
> > > > I think the error clearing
> > > > interface should be an explicit / separate op rather than a
> > > > side-effect. What about an explicit interface for initializing newly
> > > > allocated blocks, and the only reliable way to destroy poison through
> > > > the filesystem is to free the block?
> > >
> > > Effectively pmem_make_request() is already that interface filesystems
> > > use to initialize blocks and clear poison. So we don't really have to
> > > introduce a new interface?
> >
> > pmem_make_request() is shared with the I/O path and is too low in the
> > stack to understand intent. DAX intercepts the I/O path closer to the
> > filesystem and can understand zeroing vs writing today. I'm proposing
> > we go a step further and make DAX understand free-to-allocated-block
> > initialization instead of just zeroing. Inject the error clearing into
> > that initialization interface.
> >
> > > Or you are suggesting separate dax_zero_page_range() interface which will
> > > always call into firmware to clear poison. And that will make sure latent
> > > poison is cleared as well and filesystem should use that for block
> > > initialization instead?
> >
> > Yes, except latent poison would not be cleared until the zeroing is
> > implemented with movdir64b instead of callouts to firmware. It's
> > otherwise too slow to call out to firmware unconditionally.
> >
> > > I do like the idea of not having to differentiate
> > > between known poison and latent poison. Once a block has been initialized
> > > all poison should be cleared (known/latent). I am worried though that
> > > on large devices this might slowdown filesystem initialization a lot
> > > if they are zeroing large range of blocks.
> > >
> > > If yes, this sounds like two different patch series. First patch series
> > > takes care of removing blkdev_issue_zeroout() from
> > > __dax_zero_page_range() and couple of iomap related cleans christoph
> > > wanted.
> > >
> > > And second patch series for adding new dax operation to zero a range
> > > and always call info firmware to clear poison and modify filesystems
> > > accordingly.
> >
> > Yes, but they may need to be merged together. I don't want to regress
> > the ability of a block-aligned hole-punch to clear errors.
>
> Hi Dan,
>
> IIUC, block aligned hole punch don't go through __dax_zero_page_range()
> path. Instead they call blkdev_issue_zeroout() at later point of time.
>
> Only partial block zeroing path is taking __dax_zero_page_range(). So
> even if we remove poison clearing code from __dax_zero_page_range(),
> there should not be a regression w.r.t full block zeroing. Only possible
> regression will be if somebody was doing partial block zeroing on sector
> boundary, then poison will not be cleared.
>
> We now seem to be discussing too many issues w.r.t poison clearing
> and dax. Atleast 3 issues are mentioned in this thread.
>
> A. Get rid of dependency on block device in dax zeroing path.
>    (__dax_zero_page_range)
>
> B. Provide a way to clear latent poison. And possibly use movdir64b to
>    do that and make filesystems use that interface for initialization
>    of blocks.
>
> C. Dax zero operation is clearing known poison while copy_from_iter() is
>    not. I guess this ship has already sailed. If we change it now,
>    somebody will complain of some regression.
>
> For issue A, there are two possible ways to deal with it.
>
> 1. Implement a dax method to zero page. And this method will also clear
>    known poison. This is what my patch series is doing.
>
> 2. Just get rid of blkdev_issue_zeroout() from __dax_zero_page_range()
>    so that no poison will be cleared in __dax_zero_page_range() path. This
>    path is currently used in partial page zeroing path and full filesystem
>    block zeroing happens with blkdev_issue_zeroout(). There is a small
>    chance of regression here in case of sector aligned partial block
>    zeroing.
>
> My patch series takes care of issue A without any regressions. In fact it
> improves current interface. For example, currently "truncate -s 512
> foo.txt" will succeed even if first sector in the block is poisoned. My
> patch series fixes it. Current implementation will return error on if any
> non sector aligned truncate is done and any of the sector is poisoned. My
> implementation will not return error if poisoned can be cleared as part
> of zeroing. It will return only if poison is present in non-zeoring part.

That asymmetry makes the implementation too much of a special case. If
the dax mapping path forces error boundaries on PAGE_SIZE blocks then
so should zeroing.

>
> Why don't we solve one issue A now and deal with issue B and C later in
> a sepaprate patch series. This patch series gets rid of dependency on
> block device in dax path and also makes current zeroing interface better.

I'm ok with replacing blkdev_issue_zeroout() with a dax operation
callback that deals with page aligned entries. That change at least
makes the error boundary symmetric across copy_from_iter() and the
zeroing path.
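
For reference, the shape of the callback being discussed, as a rough
sketch in the direction of this series (the exact v5 signature and the
pgoff/nr_pages units are assumptions here, not the final form):

/* Sketch only: a page-aligned zeroing op on the dax_device. */
struct dax_operations {
	...
	int (*zero_page_range)(struct dax_device *dax_dev,
			       pgoff_t pgoff, size_t nr_pages);
};

int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
			size_t nr_pages)
{
	if (!dax_alive(dax_dev))
		return -ENXIO;
	if (!dax_dev->ops->zero_page_range)
		return -EOPNOTSUPP;
	/* page-aligned entries only, matching the mapping path */
	return dax_dev->ops->zero_page_range(dax_dev, pgoff, nr_pages);
}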

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-25 16:25                         ` Dan Williams
@ 2020-02-25 20:08                           ` Vivek Goyal
  -1 siblings, 0 replies; 103+ messages in thread
From: Vivek Goyal @ 2020-02-25 20:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Tue, Feb 25, 2020 at 08:25:27AM -0800, Dan Williams wrote:
> On Tue, Feb 25, 2020 at 5:37 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Mon, Feb 24, 2020 at 01:32:58PM -0800, Dan Williams wrote:
> >
> > [..]
> > > > > > Ok, how about if I add one more patch to the series which will check
> > > > > > if unwritten portion of the page has known poison. If it has, then
> > > > > > -EIO is returned.
> > > > > >
> > > > > >
> > > > > > Subject: pmem: zero page range return error if poisoned memory in unwritten area
> > > > > >
> > > > > Filesystems call into pmem_dax_zero_page_range() to zero a partial page
> > > > > upon truncate. If a partial page is being zeroed, then at the end of the
> > > > > operation file systems expect that there is no poison in the whole page
> > > > > (at least known poison).
> > > > >
> > > > > So make sure the part of the partial page which is not being written does
> > > > > not have poison. If it does, return an error. If there is poison in the
> > > > > area of the page being written, it will be cleared.
> > > > >
> > > > > No, I don't like that the zero operation is special cased compared to
> > > > > the write case. I'd say let's make them identical for now. I.e. fail
> > > > > the I/O at dax_direct_access() time.
> > > >
> > > > So basically __dax_zero_page_range() will only write zeros (and not
> > > > try to clear any poison). Right?
> > >
> > > Yes, the zero operation would have already failed at the
> > > dax_direct_access() step if there was present poison.
> > >
> > > > > I think the error clearing
> > > > > interface should be an explicit / separate op rather than a
> > > > > side-effect. What about an explicit interface for initializing newly
> > > > > allocated blocks, and the only reliable way to destroy poison through
> > > > > the filesystem is to free the block?
> > > >
> > > > Effectively pmem_make_request() is already that interface filesystems
> > > > use to initialize blocks and clear poison. So we don't really have to
> > > > introduce a new interface?
> > >
> > > pmem_make_request() is shared with the I/O path and is too low in the
> > > stack to understand intent. DAX intercepts the I/O path closer to the
> > > filesystem and can understand zeroing vs writing today. I'm proposing
> > > we go a step further and make DAX understand free-to-allocated-block
> > > initialization instead of just zeroing. Inject the error clearing into
> > > that initialization interface.
> > >
> > > > Or are you suggesting a separate dax_zero_page_range() interface which
> > > > will always call into firmware to clear poison? And that will make sure
> > > > latent poison is cleared as well, and filesystems should use that for
> > > > block initialization instead?
> > >
> > > Yes, except latent poison would not be cleared until the zeroing is
> > > implemented with movdir64b instead of callouts to firmware. It's
> > > otherwise too slow to call out to firmware unconditionally.
> > >
> > > > I do like the idea of not having to differentiate
> > > > between known poison and latent poison. Once a block has been initialized
> > > > all poison should be cleared (known/latent). I am worried though that
> > > > on large devices this might slow down filesystem initialization a lot
> > > > if they are zeroing a large range of blocks.
> > > >
> > > > If yes, this sounds like two different patch series. The first patch
> > > > series takes care of removing blkdev_issue_zeroout() from
> > > > __dax_zero_page_range() and a couple of iomap-related cleanups
> > > > Christoph wanted.
> > > >
> > > > And a second patch series for adding a new dax operation to zero a
> > > > range, always calling into firmware to clear poison, and modifying
> > > > filesystems accordingly.
> > >
> > > Yes, but they may need to be merged together. I don't want to regress
> > > the ability of a block-aligned hole-punch to clear errors.
> >
> > Hi Dan,
> >
> > IIUC, block-aligned hole punches don't go through the
> > __dax_zero_page_range() path. Instead they call blkdev_issue_zeroout()
> > at a later point in time.
> >
> > Only the partial block zeroing path goes through
> > __dax_zero_page_range(). So even if we remove the poison clearing code
> > from __dax_zero_page_range(), there should not be a regression w.r.t.
> > full block zeroing. The only possible regression will be if somebody
> > was doing partial block zeroing on a sector boundary; then poison will
> > not be cleared.
> >
> > We now seem to be discussing too many issues w.r.t. poison clearing
> > and dax. At least three issues are mentioned in this thread.
> >
> > A. Get rid of dependency on block device in dax zeroing path.
> >    (__dax_zero_page_range)
> >
> > B. Provide a way to clear latent poison. And possibly use movdir64b to
> >    do that and make filesystems use that interface for initialization
> >    of blocks.
> >
> > C. The dax zero operation clears known poison while copy_from_iter()
> >    does not. I guess this ship has already sailed. If we change it
> >    now, somebody will complain of some regression.
> >
> > For issue A, there are two possible ways to deal with it.
> >
> > 1. Implement a dax method to zero a page, which will also clear known
> >    poison. This is what my patch series is doing.
> >
> > 2. Just get rid of blkdev_issue_zeroout() from __dax_zero_page_range()
> >    so that no poison will be cleared in the __dax_zero_page_range()
> >    path. This path is currently used for partial page zeroing, and
> >    full filesystem block zeroing happens with blkdev_issue_zeroout().
> >    There is a small chance of regression here in the case of
> >    sector-aligned partial block zeroing.
> >
> > My patch series takes care of issue A without any regressions. In
> > fact it improves the current interface. For example, currently
> > "truncate -s 512 foo.txt" will succeed even if the first sector in
> > the block is poisoned. My patch series fixes it. The current
> > implementation will return an error if any non-sector-aligned
> > truncate is done and any of the sectors is poisoned. My implementation
> > will not return an error if poison can be cleared as part of zeroing.
> > It will return an error only if poison is present in the non-zeroed
> > part.
> 
> That asymmetry makes the implementation too much of a special case. If
> the dax mapping path forces error boundaries on PAGE_SIZE blocks then
> so should zeroing.
> 
> >
> > Why don't we solve issue A now and deal with issues B and C later in
> > a separate patch series? This patch series gets rid of the dependency
> > on the block device in the dax path and also makes the current zeroing
> > interface better.
> 
> I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> callback that deals with page aligned entries. That change at least
> makes the error boundary symmetric across copy_from_iter() and the
> zeroing path.

IIUC, you are suggesting that we modify dax_zero_page_range() to take a
page-aligned start and size, call this interface from
__dax_zero_page_range(), and get rid of blkdev_issue_zeroout() in that
path?

Something like:

__dax_zero_page_range()
{
	if (page_aligned_io)
		call_dax_page_zero_range();
	else
		use_direct_access_and_memcpy();
}

And other callers of blkdev_issue_zeroout() in filesystems can migrate
to calling dax_zero_page_range() instead.

If yes, I am not seeing what advantage we get from this change.

- __dax_zero_page_range() seems to be called only by the partial block
  zeroing code, so the dax_zero_page_range() call will remain unused.

- dax_zero_page_range() will be an exact replacement of
  blkdev_issue_zeroout(), so filesystems will not gain anything. It will
  just create a dax-specific hook.

In that case it might be simpler to just get rid of the
blkdev_issue_zeroout() call from __dax_zero_page_range() and make sure
there are no callers doing full block zeroing through this path.
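
For illustration, the migration floated above (other
blkdev_issue_zeroout() callers switching to dax_zero_page_range())
would look something like this hypothetical helper (not from this
series; the pgoff computation is illustrative only):

/*
 * Hypothetical sketch: prefer a dax zeroing op for page-aligned
 * ranges and fall back to the block layer otherwise.
 */
static int fs_zero_range(struct inode *inode, struct dax_device *dax_dev,
			 struct block_device *bdev, loff_t pos, loff_t len)
{
	if (IS_DAX(inode) && PAGE_ALIGNED(pos) && PAGE_ALIGNED(len))
		return dax_zero_page_range(dax_dev, pos >> PAGE_SHIFT,
					   len >> PAGE_SHIFT);

	return blkdev_issue_zeroout(bdev, pos >> SECTOR_SHIFT,
				    len >> SECTOR_SHIFT, GFP_KERNEL, 0);
}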

Thanks
Vivek

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-25  0:26                       ` Dan Williams
@ 2020-02-25 20:32                         ` Jeff Moyer
  -1 siblings, 0 replies; 103+ messages in thread
From: Jeff Moyer @ 2020-02-25 20:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

Dan Williams <dan.j.williams@intel.com> writes:

> On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer <jmoyer@redhat.com> wrote:
>>
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>> >> Let's just focus on reporting errors when we know we have them.
>> >
>> > That's the problem in my eyes. If software needs to contend with
>> > latent error reporting then it should always contend; otherwise
>> > software has multiple error models to wrangle.
>>
>> The only way for an application to know that the data has been written
>> successfully would be to issue a read after every write.  That's not a
>> performance hit most applications are willing to take.  And, of course,
>> the media can still go bad at a later time, so it only guarantees the
>> data is accessible immediately after having been written.
>>
>> What I'm suggesting is that we should not complete a write successfully
>> if we know that the data will not be retrievable.  I wouldn't call this
>> adding an extra error model to contend with.  Applications should
>> already be checking for errors on write.
>>
>> Does that make sense? Are we talking past each other?
>
> The badblock list is late to update in both directions, late to add
> entries that the scrub needs to find and late to delete entries that
> were inadvertently cleared by cache-line writes that did not first
> ingest the poison for a read-modify-write.

We aren't looking for perfection.  If the badblocks list is populated,
then report the error instead of letting the user write data that we
know they won't be able to access later.

You have confused me, though, since I thought that stores to bad media
would not clear errors.  Perhaps you are talking about some future
hardware implementation that doesn't yet exist?

> So I see the above as being wishful in using the error list as the
> hard source of truth and unfortunate to up-level all sub-sector error
> entries into full PAGE_SIZE data offline events.

The page size granularity is only an issue for mmap().  If you are using
read/write, then 512 byte granularity can be achieved.  Even with mmap,
if you encounter an error on a 4k page, you can query the status of each
sector in that page to isolate the error.  So I'm not quite sure I
understand what you're getting at.
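
As an aside, the per-sector query described here is available via the
block layer's badblocks list, which sysfs exports in 512-byte-sector
units. A minimal userspace reader (the device name is an assumption):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/block/pmem0/badblocks", "r");
	unsigned long long start, len;

	if (!f)
		return 1;
	/* each line is "<first bad sector> <number of bad sectors>" */
	while (fscanf(f, "%llu %llu", &start, &len) == 2)
		printf("bad: sectors %llu-%llu\n", start, start + len - 1);
	fclose(f);
	return 0;
}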

> I'm hoping we can find a way to make the error handling more fine
> grained over time, but for the current patch, managing the blast
> radius as PAGE_SIZE granularity at least matches the zero path with
> the write path.

I think the write path can do 512 byte granularity, not page size.
Anyway, I think we've gone far enough off into the weeds that more
patches will have to be posted for debate.  :)

>> > Setting that aside we can start with just treating zeroing the same as
>> > the copy_from_iter() case and fail the I/O at the dax_direct_access()
>> > step.
>>
>> OK.
>>
>> > I'd rather have a separate op that filesystems can use to clear errors
>> > at block allocation time that can be enforced to have the correct
>> > alignment.
>>
>> So would file systems always call that routine instead of zeroing, or
>> would they first check to see if there are badblocks?
>
> The proposal is that filesystems distinguish zeroing from free-block
> allocation/initialization such that the fsdax implementation directs
> initialization to a driver callback. This "initialization op" would
> take care to check for poison and clear it. All other dax paths would
> not consult the badblocks list.

What do you mean by "all other dax paths?"  Would that include
mmap/direct_access?  Because that definitely should still consult the
badblocks list.

I'm not against having a separate operation for clearing errors, but I
guess I'm not convinced it's cleaner, either.

-Jeff

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-25 20:32                         ` Jeff Moyer
@ 2020-02-25 21:52                           ` Dan Williams
  -1 siblings, 0 replies; 103+ messages in thread
From: Dan Williams @ 2020-02-25 21:52 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Dave Chinner, linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Tue, Feb 25, 2020 at 12:33 PM Jeff Moyer <jmoyer@redhat.com> wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
> > On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer <jmoyer@redhat.com> wrote:
> >>
> >> Dan Williams <dan.j.williams@intel.com> writes:
> >>
> >> >> Let's just focus on reporting errors when we know we have them.
> >> >
> >> > That's the problem in my eyes. If software needs to contend with
> >> > latent error reporting then it should always contend; otherwise
> >> > software has multiple error models to wrangle.
> >>
> >> The only way for an application to know that the data has been written
> >> successfully would be to issue a read after every write.  That's not a
> >> performance hit most applications are willing to take.  And, of course,
> >> the media can still go bad at a later time, so it only guarantees the
> >> data is accessible immediately after having been written.
> >>
> >> What I'm suggesting is that we should not complete a write successfully
> >> if we know that the data will not be retrievable.  I wouldn't call this
> >> adding an extra error model to contend with.  Applications should
> >> already be checking for errors on write.
> >>
> >> Does that make sense? Are we talking past each other?
> >
> > The badblock list is late to update in both directions, late to add
> > entries that the scrub needs to find and late to delete entries that
> > were inadvertently cleared by cache-line writes that did not first
> > ingest the poison for a read-modify-write.
>
> We aren't looking for perfection.  If the badblocks list is populated,
> then report the error instead of letting the user write data that we
> know they won't be able to access later.
>
> You have confused me, though, since I thought that stores to bad media
> would not clear errors.  Perhaps you are talking about some future
> hardware implementation that doesn't yet exist?

No, today cacheline-aligned DMA, and CPU cache lines that are fully
dirtied without a demand fill, can end up clearing poison. The
movdir64b instruction provides a way to force this behavior from the
CPU, where previously it was a matter of luck.
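
For context, movdir64b performs a 64-byte direct store that does not
first fill the cache line, which is why it can overwrite poison
deterministically. A sketch of the primitive with a hand-coded opcode
(the kernel had no movdir64b() helper at the time of this thread; the
encoding below is an assumption to illustrate the idea):

/* Sketch only: movdir64b (%rdx), %rax -- one 64-byte direct store. */
static inline void movdir64b_store(void *dst, const void *src)
{
	/* 66 0f 38 f8 /r with ModRM 0x02: reg = rax (dst), r/m = (%rdx) */
	asm volatile(".byte 0x66, 0x0f, 0x38, 0xf8, 0x02"
		     : "+m" (*(char (*)[64])dst)
		     : "m" (*(const char (*)[64])src),
		       "a" (dst), "d" (src));
}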

> > So I see the above as being wishful in using the error list as the
> > hard source of truth and unfortunate to up-level all sub-sector error
> > entries into full PAGE_SIZE data offline events.
>
> The page size granularity is only an issue for mmap().  If you are using
> read/write, then 512 byte granularity can be achieved.  Even with mmap,
> if you encounter an error on a 4k page, you can query the status of each
> sector in that page to isolate the error.  So I'm not quite sure I
> understand what you're getting at.

I'm getting at the fact that we don't allow mmap on PAGE_SIZE
granularity in the presence of errors, and don't allow dax I/O to the
page when an error is present. Only non-dax direct-I/O can read /
write at sub-page granularity in the presence of errors.

The interface to query the status is not coordinated with the
filesystem, but that's a whole other discussion. Yes, we're getting a
bit off in the weeds.

> > I'm hoping we can find a way to make the error handling more fine
> > grained over time, but for the current patch, managing the blast
> > radius as PAGE_SIZE granularity at least matches the zero path with
> > the write path.
>
> I think the write path can do 512 byte granularity, not page size.
> Anyway, I think we've gone far enough off into the weeds that more
> patches will have to be posted for debate.  :)
>

It can't, see dax_iomap_actor(). That call to dax_direct_access() will
fail if it hits a known badblock.

> >> > Setting that aside we can start with just treating zeroing the same as
> >> > the copy_from_iter() case and fail the I/O at the dax_direct_access()
> >> > step.
> >>
> >> OK.
> >>
> >> > I'd rather have a separate op that filesystems can use to clear errors
> >> > at block allocation time that can be enforced to have the correct
> >> > alignment.
> >>
> >> So would file systems always call that routine instead of zeroing, or
> >> would they first check to see if there are badblocks?
> >
> > The proposal is that filesystems distinguish zeroing from free-block
> > allocation/initialization such that the fsdax implementation directs
> > initialization to a driver callback. This "initialization op" would
> > take care to check for poison and clear it. All other dax paths would
> > not consult the badblocks list.
>
> What do you mean by "all other dax paths?"  Would that include
> mmap/direct_access?  Because that definitely should still consult the
> badblocks list.

dax_direct_access() consults the badblock list,
dax_copy_{to,from}_iter() do not, and badblocks discovered after a
page is mapped do not invalidate the page unless the poison is
consumed.
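
Concretely, that consultation lives in the pmem driver's direct_access
implementation; roughly (paraphrased from drivers/nvdimm/pmem.c of this
era, with details elided):

/* Paraphrased sketch of the known-poison check in direct_access. */
static long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
				 long nr_pages, void **kaddr, pfn_t *pfn)
{
	resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;

	/* refuse to hand out a mapping that covers known poison */
	if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
				 PFN_PHYS(nr_pages))))
		return -EIO;

	if (kaddr)
		*kaddr = pmem->virt_addr + offset;
	if (pfn)
		*pfn = phys_to_pfn_t(pmem->phys_addr + offset,
				     pmem->pfn_flags);
	return nr_pages;
}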

> I'm not against having a separate operation for clearing errors, but I
> guess I'm not convinced it's cleaner, either.

The idea with a separate op is to have an explicit ABI to clear errors
like page-aligned-hole-punch (FALLOC_FL_PUNCH_ERROR?) vs some implicit
side effect of direct-I/O writes.

I also feel like we're heading in a direction that tries to avoid
relying on the cpu machine-check recovery path at all costs. That's
the kernel's prerogative, of course, but it seems like at a minimum
we're not quite on the same page about what role machine-check
recovery plays in the error handling model.

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-25 20:08                           ` Vivek Goyal
@ 2020-02-25 22:49                             ` Dan Williams
  -1 siblings, 0 replies; 103+ messages in thread
From: Dan Williams @ 2020-02-25 22:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Dave Chinner, linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Tue, Feb 25, 2020 at 12:08 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Tue, Feb 25, 2020 at 08:25:27AM -0800, Dan Williams wrote:
> > On Tue, Feb 25, 2020 at 5:37 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > [..]
> > >
> > > My patch series takes care of issue A without any regressions. In
> > > fact it improves the current interface. For example, currently
> > > "truncate -s 512 foo.txt" will succeed even if the first sector in
> > > the block is poisoned. My patch series fixes it. The current
> > > implementation will return an error if any non-sector-aligned
> > > truncate is done and any of the sectors is poisoned. My
> > > implementation will not return an error if poison can be cleared as
> > > part of zeroing. It will return an error only if poison is present
> > > in the non-zeroed part.
> >
> > That asymmetry makes the implementation too much of a special case. If
> > the dax mapping path forces error boundaries on PAGE_SIZE blocks then
> > so should zeroing.
> >
> > >
> > > Why don't we solve issue A now and deal with issues B and C later
> > > in a separate patch series? This patch series gets rid of the
> > > dependency on the block device in the dax path and also makes the
> > > current zeroing interface better.
> >
> > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > callback that deals with page aligned entries. That change at least
> > makes the error boundary symmetric across copy_from_iter() and the
> > zeroing path.
>
> IIUC, you are suggesting that we modify dax_zero_page_range() to take
> a page-aligned start and size, call this interface from
> __dax_zero_page_range(), and get rid of blkdev_issue_zeroout() in that
> path?
>
> Something like:
>
> __dax_zero_page_range()
> {
> 	if (page_aligned_io)
> 		call_dax_page_zero_range();
> 	else
> 		use_direct_access_and_memcpy();
> }
>
> And other callers of blkdev_issue_zeroout() in filesystems can migrate
> to calling dax_zero_page_range() instead.
>
> If yes, I am not seeing what advantage we get from this change.
>
> - __dax_zero_page_range() seems to be called only by the partial block
>   zeroing code, so the dax_zero_page_range() call will remain unused.
>
> - dax_zero_page_range() will be an exact replacement of
>   blkdev_issue_zeroout(), so filesystems will not gain anything. It
>   will just create a dax-specific hook.
>
> In that case it might be simpler to just get rid of the
> blkdev_issue_zeroout() call from __dax_zero_page_range() and make sure
> there are no callers doing full block zeroing through this path.

I think you're right. The path I'm concerned about not regressing is
the error clearing on new block allocation, and we get that already via
xfs_zero_extent() and sb_issue_zeroout(). For your fs we'll want a
dax-device equivalent for that path, but that does mean that
__dax_zero_page_range() stays out of the error clearing game.
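
The error-clearing allocation path referred to here, simplified from
fs/xfs/xfs_bmap_util.c of this era (new-block initialization goes
through the block layer, where pmem_make_request() clears known
poison):

int
xfs_zero_extent(struct xfs_inode *ip, xfs_fsblock_t start_fsb,
		xfs_off_t count_fsb)
{
	struct xfs_mount *mp = ip->i_mount;
	xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb);
	sector_t block = XFS_BB_TO_FSBT(mp, sector);

	return blkdev_issue_zeroout(xfs_find_bdev_for_inode(VFS_I(ip)),
			block << (mp->m_super->s_blocksize_bits - 9),
			count_fsb << (mp->m_super->s_blocksize_bits - 9),
			GFP_NOFS, 0);
}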
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
@ 2020-02-25 22:49                             ` Dan Williams
  0 siblings, 0 replies; 103+ messages in thread
From: Dan Williams @ 2020-02-25 22:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Dave Chinner, Jeff Moyer, linux-fsdevel, linux-nvdimm,
	Christoph Hellwig, device-mapper development

On Tue, Feb 25, 2020 at 12:08 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Tue, Feb 25, 2020 at 08:25:27AM -0800, Dan Williams wrote:
> > On Tue, Feb 25, 2020 at 5:37 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Mon, Feb 24, 2020 at 01:32:58PM -0800, Dan Williams wrote:
> > >
> > > [..]
> > > > > > > Ok, how about if I add one more patch to the series which will check
> > > > > > > if unwritten portion of the page has known poison. If it has, then
> > > > > > > -EIO is returned.
> > > > > > >
> > > > > > >
> > > > > > > Subject: pmem: zero page range return error if poisoned memory in unwritten area
> > > > > > >
> > > > > > > Filesystems call into pmem_dax_zero_page_range() to zero partial page upon
> > > > > > > truncate. If partial page is being zeroed, then at the end of operation
> > > > > > > file systems expect that there is no poison in the whole page (atleast
> > > > > > > known poison).
> > > > > > >
> > > > > > > So make sure part of the partial page which is not being written, does not
> > > > > > > have poison. If it does, return error. If there is poison in area of page
> > > > > > > being written, it will be cleared.
> > > > > >
> > > > > > No, I don't like that the zero operation is special cased compared to
> > > > > > the write case. I'd say let's make them identical for now. I.e. fail
> > > > > > the I/O at dax_direct_access() time.
> > > > >
> > > > > So basically __dax_zero_page_range() will only write zeros (and not
> > > > > try to clear any poison). Right?
> > > >
> > > > Yes, the zero operation would have already failed at the
> > > > dax_direct_access() step if there was present poison.
> > > >
> > > > > > I think the error clearing
> > > > > > interface should be an explicit / separate op rather than a
> > > > > > side-effect. What about an explicit interface for initializing newly
> > > > > > allocated blocks, and the only reliable way to destroy poison through
> > > > > > the filesystem is to free the block?
> > > > >
> > > > > Effectively pmem_make_request() is already that interface filesystems
> > > > > use to initialize blocks and clear poison. So we don't really have to
> > > > > introduce a new interface?
> > > >
> > > > pmem_make_request() is shared with the I/O path and is too low in the
> > > > stack to understand intent. DAX intercepts the I/O path closer to the
> > > > filesystem and can understand zeroing vs writing today. I'm proposing
> > > > we go a step further and make DAX understand free-to-allocated-block
> > > > initialization instead of just zeroing. Inject the error clearing into
> > > > that initialization interface.
> > > >
> > > > > Or you are suggesting separate dax_zero_page_range() interface which will
> > > > > always call into firmware to clear poison. And that will make sure latent
> > > > > poison is cleared as well and filesystem should use that for block
> > > > > initialization instead?
> > > >
> > > > Yes, except latent poison would not be cleared until the zeroing is
> > > > implemented with movdir64b instead of callouts to firmware. It's
> > > > otherwise too slow to call out to firmware unconditionally.
> > > >
> > > > > I do like the idea of not having to differentiate
> > > > > between known poison and latent poison. Once a block has been initialized,
> > > > > all poison should be cleared (known/latent). I am worried though that
> > > > > on large devices this might slow down filesystem initialization a lot
> > > > > if they are zeroing a large range of blocks.
> > > > >
> > > > > If yes, this sounds like two different patch series. The first patch series
> > > > > takes care of removing blkdev_issue_zeroout() from
> > > > > __dax_zero_page_range() and a couple of iomap related cleanups Christoph
> > > > > wanted.
> > > > >
> > > > > And a second patch series for adding a new dax operation to zero a range
> > > > > that always calls into firmware to clear poison, and for modifying
> > > > > filesystems accordingly.
> > > >
> > > > Yes, but they may need to be merged together. I don't want to regress
> > > > the ability of a block-aligned hole-punch to clear errors.
> > >
> > > Hi Dan,
> > >
> > > IIUC, block aligned hole punches don't go through the __dax_zero_page_range()
> > > path. Instead they call blkdev_issue_zeroout() at a later point in time.
> > >
> > > Only the partial block zeroing path takes __dax_zero_page_range(). So
> > > even if we remove the poison clearing code from __dax_zero_page_range(),
> > > there should not be a regression w.r.t. full block zeroing. The only possible
> > > regression would be if somebody was doing partial block zeroing on a sector
> > > boundary; then poison will not be cleared.
> > >
> > > We now seem to be discussing too many issues w.r.t. poison clearing
> > > and dax. At least 3 issues are mentioned in this thread.
> > >
> > > A. Get rid of the dependency on a block device in the dax zeroing path
> > >    (__dax_zero_page_range).
> > >
> > > B. Provide a way to clear latent poison, possibly using movdir64b to
> > >    do that, and make filesystems use that interface for initialization
> > >    of blocks.
> > >
> > > C. The dax zero operation clears known poison while copy_from_iter() does
> > >    not. I guess this ship has already sailed. If we change it now,
> > >    somebody will complain of a regression.
> > >
> > > For issue A, there are two possible ways to deal with it.
> > >
> > > 1. Implement a dax method to zero a page. This method will also clear
> > >    known poison. This is what my patch series is doing.
> > >
> > > 2. Just get rid of blkdev_issue_zeroout() from __dax_zero_page_range()
> > >    so that no poison will be cleared in the __dax_zero_page_range() path.
> > >    This path is currently used for partial page zeroing, and full filesystem
> > >    block zeroing happens with blkdev_issue_zeroout(). There is a small
> > >    chance of regression here in the case of sector aligned partial block
> > >    zeroing.
> > >
> > > My patch series takes care of issue A without any regressions. In fact it
> > > improves the current interface. For example, currently "truncate -s 512
> > > foo.txt" will succeed even if the first sector in the block is poisoned. My
> > > patch series fixes that. The current implementation will return an error if
> > > any non-sector-aligned truncate is done and any of the sectors is poisoned.
> > > My implementation will not return an error if the poison can be cleared as
> > > part of zeroing; it will return an error only if poison is present in the
> > > non-zeroed part.
> >
> > That asymmetry makes the implementation too much of a special case. If
> > the dax mapping path forces error boundaries on PAGE_SIZE blocks then
> > so should zeroing.
> >
> > >
> > > Why don't we solve issue A now and deal with issues B and C later in
> > > a separate patch series? This patch series gets rid of the dependency on
> > > a block device in the dax path and also makes the current zeroing interface
> > > better.
> >
> > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > callback that deals with page aligned entries. That change at least
> > makes the error boundary symmetric across copy_from_iter() and the
> > zeroing path.
>
> IIUC, you are suggesting that we modify dax_zero_page_range() to take a page
> aligned start and size, call this interface from
> __dax_zero_page_range(), and get rid of blkdev_issue_zeroout() in that
> path?
>
> Something like.
>
> __dax_zero_page_range() {
>   if(page_aligned_io)
>         call_dax_page_zero_range()
>   else
>         use_direct_access_and_memcpy;
> }
>
> And other callers of blkdev_issue_zeroout() in filesystems can migrate
> to calling dax_zero_page_range() instead.
>
> If yes, I am not seeing what advantage we get from this change.
>
> - __dax_zero_page_range() seems to be called only by the partial block
>   zeroing code, so the dax_zero_page_range() call will remain unused.
>
> - dax_zero_page_range() will be an exact replacement of
>   blkdev_issue_zeroout(), so filesystems will not gain anything. It will
>   just create a dax-specific hook.
>
> In that case it might be simpler to just get rid of the blkdev_issue_zeroout()
> call from __dax_zero_page_range() and make sure there are no callers of
> full block zeroing from this path.

I think you're right. The path I'm concerned about not regressing is
the error clearing on new block allocation, and we get that already via
xfs_zero_extent() and sb_issue_zeroout(). For your fs we'll want a
dax-device equivalent for that path, but that does mean that
__dax_zero_page_range() stays out of the error clearing game.
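
[ For concreteness, the driver-side hook being converged on here could look
  roughly like the sketch below. This is a sketch only: pmem_do_write() is
  the helper added earlier in this series, and the exact signature of the
  new dax_operations entry is an assumption, not settled API. ]

static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
                                    size_t nr_pages)
{
        struct pmem_device *pmem = dax_get_private(dax_dev);
        size_t i;
        int rc;

        /*
         * Zero one page at a time through the driver write path, so
         * that any known poison in the range is cleared as a side
         * effect. ZERO_PAGE(0) is the kernel's shared zeroed page.
         */
        for (i = 0; i < nr_pages; i++) {
                rc = blk_status_to_errno(pmem_do_write(pmem, ZERO_PAGE(0), 0,
                                PFN_PHYS(pgoff + i) >> SECTOR_SHIFT,
                                PAGE_SIZE));
                if (rc)
                        return rc;
        }
        return 0;
}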

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-25  0:26                       ` Dan Williams
@ 2020-02-25 23:26                         ` Jane Chu
  -1 siblings, 0 replies; 103+ messages in thread
From: Jane Chu @ 2020-02-25 23:26 UTC (permalink / raw)
  To: Dan Williams, Jeff Moyer
  Cc: Dave Chinner, linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On 2/24/2020 4:26 PM, Dan Williams wrote:
> On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer <jmoyer@redhat.com> wrote:
>>
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>>>> Let's just focus on reporting errors when we know we have them.
>>>
>>> That's the problem in my eyes. If software needs to contend with
>>> latent error reporting then it should always contend; otherwise
>>> software has multiple error models to wrangle.
>>
>> The only way for an application to know that the data has been written
>> successfully would be to issue a read after every write.  That's not a
>> performance hit most applications are willing to take.  And, of course,
>> the media can still go bad at a later time, so it only guarantees the
>> data is accessible immediately after having been written.
>>
>> What I'm suggesting is that we should not complete a write successfully
>> if we know that the data will not be retrievable.  I wouldn't call this
>> adding an extra error model to contend with.  Applications should
>> already be checking for errors on write.
>>
>> Does that make sense? Are we talking past each other?
> 
> The badblock list is late to update in both directions, late to add
> entries that the scrub needs to find and late to delete entries that
> were inadvertently cleared by cache-line writes that did not first
> ingest the poison for a read-modify-write. So I see the above as being
> wishful in using the error list as the hard source of truth and
> unfortunate to up-level all sub-sector error entries into full
> PAGE_SIZE data offline events.

Sorry, I don't mean to distract from the discussion, but I'm wondering if
anyone has noticed a SIGBUS with si_code = MCEERR_AO in a single-process
poison test over a dax-xfs file? There is only one poison in the file, and
it has been consumed; it's the recovery code path (hole punch/
munmap/mmap/pwrite/read) that encounters the _AO. I'm confident that a
latent error isn't the scenario, per ARS scrub. Also, the _AO appears
rarely. This is unexplainable given the kernel's MCE pmem handling
implementation.

> 
> I'm hoping we can find a way to make the error handling more fine
> grained over time, but for the current patch, managing the blast
> radius as PAGE_SIZE granularity at least matches the zero path with
> the write path.

Maybe the new filesystem op for clearing pmem poison should insist on
4K alignment? In hwpoison_clear() the starting pfn is given by PHYS_PFN(),
which rounds down to the nearest page, so we might inadvertently clear the
poison bit and 'nocache' bit from a page when we only cleared a poison that
was, e.g., in the second half of the page.
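
(For illustration, PHYS_PFN() is a plain right shift, so the sub-page part
of a physical address is simply discarded. A standalone sketch of the
rounding, with a made-up address:

    #define PAGE_SHIFT  12
    #define PHYS_PFN(x) ((unsigned long)((x) >> PAGE_SHIFT))

    unsigned long long base = 0x1850600ULL << PAGE_SHIFT;
    unsigned long long bad  = base + 0x800;  /* poison in 2nd half of page */

    /* both resolve to the same pfn, so state keyed by pfn covers the
     * whole 4K page even though only half of it was poisoned */
    PHYS_PFN(bad) == PHYS_PFN(base);  /* true */
)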

BTW, set_mce_nospec() doesn't seem to work in the 5.5 release:
   [ 2321.209382] Could not invalidate pfn=0x1850600 from 1:1 map
I will see if I can find more information.

> 
>>> Setting that aside we can start with just treating zeroing the same as
>>> the copy_from_iter() case and fail the I/O at the dax_direct_access()
>>> step.
>>
>> OK.
>>
>>> I'd rather have a separate op that filesystems can use to clear errors
>>> at block allocation time that can be enforced to have the correct
>>> alignment.
>>
>> So would file systems always call that routine instead of zeroing, or
>> would they first check to see if there are badblocks?
> 
> The proposal is that filesystems distinguish zeroing from free-block
> allocation/initialization such that the fsdax implementation directs
> initialization to a driver callback. This "initialization op" would
> take care to check for poison and clear it. All other dax paths would
> not consult the badblocks list.

thanks!
-jane



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-25 22:49                             ` Dan Williams
@ 2020-02-26 13:51                               ` Vivek Goyal
  -1 siblings, 0 replies; 103+ messages in thread
From: Vivek Goyal @ 2020-02-26 13:51 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote:
[..]
> > > > Hi Dan,
> > > >
> > > > IIUC, block aligned hole punches don't go through the __dax_zero_page_range()
> > > > path. Instead they call blkdev_issue_zeroout() at a later point in time.
> > > >
> > > > Only the partial block zeroing path takes __dax_zero_page_range(). So
> > > > even if we remove the poison clearing code from __dax_zero_page_range(),
> > > > there should not be a regression w.r.t. full block zeroing. The only possible
> > > > regression would be if somebody was doing partial block zeroing on a sector
> > > > boundary; then poison will not be cleared.
> > > >
> > > > We now seem to be discussing too many issues w.r.t. poison clearing
> > > > and dax. At least 3 issues are mentioned in this thread.
> > > >
> > > > A. Get rid of the dependency on a block device in the dax zeroing path
> > > >    (__dax_zero_page_range).
> > > >
> > > > B. Provide a way to clear latent poison, possibly using movdir64b to
> > > >    do that, and make filesystems use that interface for initialization
> > > >    of blocks.
> > > >
> > > > C. The dax zero operation clears known poison while copy_from_iter() does
> > > >    not. I guess this ship has already sailed. If we change it now,
> > > >    somebody will complain of a regression.
> > > >
> > > > For issue A, there are two possible ways to deal with it.
> > > >
> > > > 1. Implement a dax method to zero a page. This method will also clear
> > > >    known poison. This is what my patch series is doing.
> > > >
> > > > 2. Just get rid of blkdev_issue_zeroout() from __dax_zero_page_range()
> > > >    so that no poison will be cleared in the __dax_zero_page_range() path.
> > > >    This path is currently used for partial page zeroing, and full filesystem
> > > >    block zeroing happens with blkdev_issue_zeroout(). There is a small
> > > >    chance of regression here in the case of sector aligned partial block
> > > >    zeroing.
> > > >
> > > > My patch series takes care of issue A without any regressions. In fact it
> > > > improves the current interface. For example, currently "truncate -s 512
> > > > foo.txt" will succeed even if the first sector in the block is poisoned. My
> > > > patch series fixes that. The current implementation will return an error if
> > > > any non-sector-aligned truncate is done and any of the sectors is poisoned.
> > > > My implementation will not return an error if the poison can be cleared as
> > > > part of zeroing; it will return an error only if poison is present in the
> > > > non-zeroed part.
> > >
> > > That asymmetry makes the implementation too much of a special case. If
> > > the dax mapping path forces error boundaries on PAGE_SIZE blocks then
> > > so should zeroing.
> > >
> > > >
> > > > Why don't we solve issue A now and deal with issues B and C later in
> > > > a separate patch series? This patch series gets rid of the dependency on
> > > > a block device in the dax path and also makes the current zeroing
> > > > interface better.
> > >
> > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > > callback that deals with page aligned entries. That change at least
> > > makes the error boundary symmetric across copy_from_iter() and the
> > > zeroing path.
> >
> > IIUC, you are suggesting that we modify dax_zero_page_range() to take a page
> > aligned start and size, call this interface from
> > __dax_zero_page_range(), and get rid of blkdev_issue_zeroout() in that
> > path?
> >
> > Something like.
> >
> > __dax_zero_page_range() {
> >   if(page_aligned_io)
> >         call_dax_page_zero_range()
> >   else
> >         use_direct_access_and_memcpy;
> > }
> >
> > And other callers of blkdev_issue_zeroout() in filesystems can migrate
> > to calling dax_zero_page_range() instead.
> >
> > If yes, I am not seeing what advantage we get from this change.
> >
> > - __dax_zero_page_range() seems to be called only by the partial block
> >   zeroing code, so the dax_zero_page_range() call will remain unused.
> >
> > - dax_zero_page_range() will be an exact replacement of
> >   blkdev_issue_zeroout(), so filesystems will not gain anything. It will
> >   just create a dax-specific hook.
> >
> > In that case it might be simpler to just get rid of the blkdev_issue_zeroout()
> > call from __dax_zero_page_range() and make sure there are no callers of
> > full block zeroing from this path.
> 
> I think you're right. The path I'm concerned about not regressing is
> the error clearing on new block allocation, and we get that already via
> xfs_zero_extent() and sb_issue_zeroout(). For your fs we'll want a
> dax-device equivalent for that path, but that does mean that
> __dax_zero_page_range() stays out of the error clearing game.

In virtiofs we do not manage our own blocks. We let the host filesystem
do that; we are just a passthrough filesystem passing fuse messages
around. So I have not seen a need for a block zeroing interface yet.

I just happened to carry a patch in this area in my patch series because
we wanted to get rid of the assumption that dax always has an associated
block device. Apart from that, I don't need __dax_zero_page_range()
for virtiofs.

I am doing this cleanup so that we don't even try to use a block device
in this dax zeroing path.

Anyway, I will clean up this patch series, get rid of the
blkdev_issue_zeroout() call from __dax_zero_page_range(), post it again
for review, and see where it goes from there.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-25 22:49                             ` Dan Williams
@ 2020-02-26 16:57                               ` Vivek Goyal
  -1 siblings, 0 replies; 103+ messages in thread
From: Vivek Goyal @ 2020-02-26 16:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote:
[..]
> > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > > callback that deals with page aligned entries. That change at least
> > > makes the error boundary symmetric across copy_from_iter() and the
> > > zeroing path.
> >
> > IIUC, you are suggesting that we modify dax_zero_page_range() to take a page
> > aligned start and size, call this interface from
> > __dax_zero_page_range(), and get rid of blkdev_issue_zeroout() in that
> > path?
> >
> > Something like.
> >
> > __dax_zero_page_range() {
> >   if(page_aligned_io)
> >         call_dax_page_zero_range()
> >   else
> >         use_direct_access_and_memcpy;
> > }
> >
> > And other callers of blkdev_issue_zeroout() in filesystems can migrate
> > to calling dax_zero_page_range() instead.
> >
> > If yes, I am not seeing what advantage we get from this change.
> >
> > - __dax_zero_page_range() seems to be called only by the partial block
> >   zeroing code, so the dax_zero_page_range() call will remain unused.
> >
> > - dax_zero_page_range() will be an exact replacement of
> >   blkdev_issue_zeroout(), so filesystems will not gain anything. It will
> >   just create a dax-specific hook.
> >
> > In that case it might be simpler to just get rid of the blkdev_issue_zeroout()
> > call from __dax_zero_page_range() and make sure there are no callers of
> > full block zeroing from this path.
> 
> I think you're right. The path I'm concerned about not regressing is
> the error clearing on new block allocation, and we get that already via
> xfs_zero_extent() and sb_issue_zeroout().

Well, I was wrong. I found at least one user that uses __dax_zero_page_range()
to zero full PAGE_SIZE blocks:

xfs_io -c "allocsp 32K 0" foo.txt

In that case, I will add a new dax method, say dax_zero_page_range(), which
will only take a PAGE_SIZE aligned range, will clear known poison, and will
be called from __dax_zero_page_range() if the I/O is PAGE_SIZE aligned.

For now I will limit this interface to a single page at a time, because
otherwise the implementation becomes complex in the dm/md stack, where a
range has to be broken up across multiple devices, and there are no users
anyway since the current iomap API passes one page at a time.

Once we have grown users, one can also tackle the complexity of
modifying dm/md to break a page range across multiple devices.
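
A minimal sketch of such a guard (assuming the generic wrapper lives in
drivers/dax/super.c and dispatches to the new driver op; the exact home of
this check is still to be decided):

int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
                        size_t nr_pages)
{
        if (!dax_alive(dax_dev))
                return -ENXIO;
        /*
         * Reject multi-page ranges until the dm/md stack learns to
         * split a range across targets; the iomap path only passes
         * a single page at a time anyway.
         */
        if (nr_pages != 1)
                return -EIO;

        return dax_dev->ops->zero_page_range(dax_dev, pgoff, nr_pages);
}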

Will post a patch series for this.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-24 15:38               ` Vivek Goyal
@ 2020-02-27  3:02                 ` Dave Chinner
  -1 siblings, 0 replies; 103+ messages in thread
From: Dave Chinner @ 2020-02-27  3:02 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-fsdevel, linux-nvdimm, hch, dm-devel

On Mon, Feb 24, 2020 at 10:38:44AM -0500, Vivek Goyal wrote:
> On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> 
> [..]
> > > > > Hi Jeff,
> > > > >
> > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > > a range which is less than a sector. Or which is bigger than a sector
> > > > > but start and end are not aligned on sector boundaries.
> > > > 
> > > > Sure, but who will call it with misaligned ranges?
> > > 
> > > create a file foo.txt of size 4K and then truncate it.
> > > 
> > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > 4095.
> > 
> > This should fail with EIO. Only full page writes should clear the
> > bad page state, and partial writes should therefore fail because
> > they do not guarantee the data in the filesystem block is all good.
> > 
> > If this zeroing was a buffered write to an address with a bad
> > sector, then the writeback will fail and the user will (eventually)
> > get an EIO on the file.
> > 
> > DAX should do the same thing, except because the zeroing is
> > synchronous (i.e. done directly by the truncate syscall) we can -
> > and should - return EIO immediately.
> > 
> > Indeed, with your code, if we then extend the file by truncating up
> > back to 4k, then the range between 23 and 512 is still bad, even
> > though we've successfully zeroed it and the user knows it. An
> > attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> > will fail with EIO, but reading 10 bytes at offset 2000 will
> > succeed.
> 
> Hi Dave,
> 
> What is expected if I do "truncate -s 512 foo.txt"? Say the first sector (0 to
> 511) is poisoned and the rest don't have poison. Should this fail with -EIO?

Yes - the filesystem block still contains bad data.

> In the current implementation it does not.

I'm not surprised - the whole hardware error handling architecture
for FS-DAX is fundamentally broken. It was designed for device-dax,
and it just doesn't work for FS-DAX.

For example, to get the hardware error handling to be able to kill
userspace applications, a 1:1 physical-to-logical association
constraint was added to fs/dax.c:

/*
 * TODO: for reflink+dax we need a way to associate a single page with
 * multiple address_space instances at different linear_page_index()
 * offsets.
 */
static void dax_associate_entry(void *entry, struct address_space *mapping,
                struct vm_area_struct *vma, unsigned long address)

because device-dax only has *linear mappings* and so has a fixed
reverse mapping architecture.

i.e. the whole hardware error handling infrastructure was designed
around the constraints of device-dax. device-dax does not have any
structure to serialise access to the physical storage, so locking
was added to the mapping tree. The mapping tree locking is accessed
on hardware error via the reverse mapping association in the struct
page, and that's how device-dax serialises direct physical storage
access against hardware error processing. And while the page index
is locked in the mapping tree, it can walk the process vmas that
have the page mapped to kill them so that they don't try to access
the bad page.

That bad physical storage state is carried in a volatile struct page
flag, hence requiring some mechanism to make it persistent (the
device bad blocks register) and some other mechanism to clear the
poison state (direct IO, IIRC).

It's also a big, nasty, solid roadblock to implementing shared
data extents in FS-DAX. We basically have to completely re-architect
the hardware error handling for FS-DAX so that such errors are
reported to the filesystem first and then the filesystem does what
is needed to handle the error.

None of this works for filesystems because they need to perform
different operations depending on what the page that went bad
contains. FS-DAX should never trip over an unexpected poisoned page;
we do so now because such physical storage errors are completely
hidden from the filesystem.

What you are trying to do is slap a band-aid over what to do when we
hit an unexpected page containing bad data. Filesystems expect to
find out about bad data in storage when they marshall the data into
or out of memory. They make the assumption that once it is in memory
it remains valid on the physical storage. Hence if an in-memory
error occurs, we can just toss it away and re-read it from storage,
and all is good.

FS-DAX changes that - we are no longer marshalling data into and out
of memory so we don't have a mechanism to get EIO when reading the
page into the page cache or writing it back to disk. We also don't
have an in-memory copy of the data - the physical storage is the
in-memory copy, and so we can't just toss it away when an error
occurs.

What we actually require is instantaneous notification of physical
storage errors so we can handle the error immediately. And that, in
turn, means we should never poison or see poisoned pages during
direct access operations because the filesystem doesn't need to
poison pages to prevent user access - it controls how the storage is
accessed directly.
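
To make that shape concrete: the hook this implies is something like the
sketch below (entirely hypothetical, nothing like it exists in this series).
It inverts today's flow by having the device call up into the filesystem
that owns it:

/* registered by the filesystem that holds the dax device */
struct dax_holder_operations {
        /*
         * Media error at [offset, offset + len) of the device; the
         * filesystem maps it to metadata, data or free space and
         * reacts as described below.
         */
        int (*notify_failure)(struct dax_device *dax_dev,
                              u64 offset, u64 len, int flags);
};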

e.g. if this error is in filesystem metadata, we might be able to
recover from it as that metadata might have a full copy in memory
(metadata is buffered in both XFS and ext4) or we might be able to
reconstruct it from other sources. Worst case, we shut the
filesystem down completely so the admin can repair the damage the
lost page has caused.

e.g. The physical page may be located in free space, in which case
we don't care and can just zero it so all the bad hardware state is
cleared. The page never goes bad or gets poisoned in that case.

e.g. The physical page may be user data, in which case it may be
buffered in the page cache (non-dax) and so can easily be recovered.
It may not be recoverable, in which case we need to issue log
messages indicating that data has been lost (path, offset, length),
and do the VMA walk and kill processes that map that page. Then we
can zero the page to clear the bad state.

If, at any point, we can't clear the bad state (e.g. the zeroing or
the read-back verification fails), then we need to make sure that the
filesystem block is marked as allocated in the free space map, then
tell the reverse map that its owner is now "bad storage" so it
never gets used again. i.e. this is the persistent bad block
storage, but doing it this way results in the rest of the filesystem
code never, ever seeing a poisoned page. And users never see it
either, because it will never be returned to the free space pool.

Of course, this relies on the filesystem having reverse mapping
capability. XFS already has this functionality available as a mkfs
option (mkfs.xfs -m rmapbt=1 ...), and we really need this so we can
get rid of the association of a physical page with a mapping and
file offset that device-dax requires for hardware page error
handling.  This means we don't need the physical storage to try to
hold filesystem layer reverse mapping information for us, and this
also removes the roadblock that the hardware error handling has
placed on implementing reflink w/ FS-DAX.

IOWs, the problem you are trying to solve is a direct result of
filesystems not being informed when a physical pmem page goes bad
and the current error handling being implemented at entirely the
wrong layer for FS-DAX. It may work for device-dax, but it's most
definitely insufficient for correct error handling for filesystems.

> Anyway, a partial page truncate can't ensure that data in the rest of the page
> can be read back successfully. Memory can get poisoned after the write, and
> hence a read after truncate will still fail.

Which is where the notification requirement comes in. Yes, we may
still get errors on read or write, but if memory at rest goes bad,
we want to handle that and correct it ASAP, not wait days or months
for something to trip over the poisoned page before we find out
about it.

> Hence, all we are trying to ensure is that if a poison is known at the
> time of writing a partial page, then we should return an error to user space.

I think that, within the FS-DAX infrastructure, any access to the data (read
or write) within a poisoned page or a page marked with PageError()
should return EIO to the caller, unless it's the specific command to
clear the error/poison state on the page. What happens with that
error state is then up to the caller.
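
(As a sketch, that policy is essentially this check at the top of every
dax data access path; a hypothetical fragment, not code from this series:

    /*
     * Fail any access that touches known-bad state, unless the
     * caller is explicitly clearing that state.
     */
    if (!clearing && (PageHWPoison(page) || PageError(page)))
        return -EIO;
)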

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-26 16:57                               ` Vivek Goyal
@ 2020-02-27  3:11                                 ` Dave Chinner
  -1 siblings, 0 replies; 103+ messages in thread
From: Dave Chinner @ 2020-02-27  3:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Wed, Feb 26, 2020 at 11:57:56AM -0500, Vivek Goyal wrote:
> On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote:
> [..]
> > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > > > callback that deals with page aligned entries. That change at least
> > > > makes the error boundary symmetric across copy_from_iter() and the
> > > > zeroing path.
> > >
> > > IIUC, you are suggesting that we modify dax_zero_page_range() to take a page
> > > aligned start and size, call this interface from
> > > __dax_zero_page_range(), and get rid of blkdev_issue_zeroout() in that
> > > path?
> > >
> > > Something like.
> > >
> > > __dax_zero_page_range() {
> > >   if(page_aligned_io)
> > >         call_dax_page_zero_range()
> > >   else
> > >         use_direct_access_and_memcpy;
> > > }
> > >
> > > And other callers of blkdev_issue_zeroout() in filesystems can migrate
> > > to calling dax_zero_page_range() instead.
> > >
> > > If yes, I am not seeing what advantage we get from this change.
> > >
> > > - __dax_zero_page_range() seems to be called only by the partial block
> > >   zeroing code, so the dax_zero_page_range() call will remain unused.
> > >
> > > - dax_zero_page_range() will be an exact replacement of
> > >   blkdev_issue_zeroout(), so filesystems will not gain anything. It will
> > >   just create a dax-specific hook.
> > >
> > > In that case it might be simpler to just get rid of the blkdev_issue_zeroout()
> > > call from __dax_zero_page_range() and make sure there are no callers of
> > > full block zeroing from this path.
> > 
> > I think you're right. The path I'm concerned about not regressing is
> > the error clearing on new block allocation, and we get that already via
> > xfs_zero_extent() and sb_issue_zeroout().
> 
> Well, I was wrong. I found at least one user that uses __dax_zero_page_range()
> to zero full PAGE_SIZE blocks:
> 
> xfs_io -c "allocsp 32K 0" foo.txt

That ioctl interface is deprecated and likely unused by any new
application written since 1999. It predates unwritten extents (1998)
and I don't think any native Linux applications have ever used it. A
newly written DAX-aware application would almost certainly not use
this interface.

IOWs, I wouldn't use it as justification for some special case
behaviour; I'm more likely to say "rip that ancient ioctl out" than
to jump through hoops because of its implementation behaviour.
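
(For reference, the modern equivalent a newly written application would use
is fallocate(2); a userspace sketch of the allocsp example above:

    #include <fcntl.h>

    int fd = open("foo.txt", O_RDWR);
    /* allocate, and extend the file size to, 32K from offset 0 */
    fallocate(fd, 0, 0, 32768);
)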

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-27  3:02                 ` Dave Chinner
@ 2020-02-27  4:19                   ` Dan Williams
  -1 siblings, 0 replies; 103+ messages in thread
From: Dan Williams @ 2020-02-27  4:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Wed, Feb 26, 2020 at 7:03 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Feb 24, 2020 at 10:38:44AM -0500, Vivek Goyal wrote:
> > On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> >
> > [..]
> > > > > > Hi Jeff,
> > > > > >
> > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > > > a range which is less than a sector. Or which is bigger than a sector
> > > > > > but start and end are not aligned on sector boundaries.
> > > > >
> > > > > Sure, but who will call it with misaligned ranges?
> > > >
> > > > create a file foo.txt of size 4K and then truncate it.
> > > >
> > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > > 4095.
> > >
> > > This should fail with EIO. Only full page writes should clear the
> > > bad page state, and partial writes should therefore fail because
> > > they do not guarantee the data in the filesystem block is all good.
> > >
> > > If this zeroing was a buffered write to an address with a bad
> > > sector, then the writeback will fail and the user will (eventually)
> > > get an EIO on the file.
> > >
> > > DAX should do the same thing, except because the zeroing is
> > > synchronous (i.e. done directly by the truncate syscall) we can -
> > > and should - return EIO immediately.
> > >
> > > Indeed, with your code, if we then extend the file by truncating up
> > > back to 4k, then the range between 23 and 512 is still bad, even
> > > though we've successfully zeroed it and the user knows it. An
> > > attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> > > will fail with EIO, but reading 10 bytes at offset 2000 will
> > > succeed.
> >
> > Hi Dave,
> >
> > What is expected if I do "truncate -s 512 foo.txt". Say first sector (0 to
> > 511) is poisoned and rest don't have poison. Should this fail with -EIO.
>
> Yes - the filesystem block still contains bad data.
>
> > In current implementation it does not.
>
> I'm not surprised - the whole hardware error handling architecture
> for FS-DAX is fundamentally broken. It was designed for device-dax,
> and it just doesn't work for FS-DAX.
>
> For example, to get the hardware error handling to be able to kill
> userspace applications, a 1:1 physical-to-logical association
> constraint was added to fs/dax.c:
>
> /*
>  * TODO: for reflink+dax we need a way to associate a single page with
>  * multiple address_space instances at different linear_page_index()
>  * offsets.
>  */
> static void dax_associate_entry(void *entry, struct address_space *mapping,
>                 struct vm_area_struct *vma, unsigned long address)
>
> because device-dax only has *linear mappings* and so has a fixed
> reverse mapping architecture.
>
> i.e. the whole hardware error handling infrastructure was designed
> around the constraints of device-dax. device-dax does not have any
> structure to serialise access to the physical storage, so locking
> was added to the mapping tree. The mapping tree locking is accessed
> on hardware error via the reverse mapping association in the struct
> page and that's how device-dax serialises direct physical storage
> access against hardware error processing.  And while the page index
> is locked in the mapping tree, it can walk the process vmas that
> have the page mapped to kill them so that they don't try to access
> the bad page.
>
> That bad physical storage state is carried in a volatile struct page
> flag, hence requiring some mechanism to make it persistent (the
> device bad blocks register) and some other mechanism to clear the
> poison state (direct IO, IIRC).
>
> It's also a big, nasty, solid roadblock to implementing shared
> data extents in FS-DAX. We basically have to completely re-architect
> the hardware error handling for FS-DAX so that such errors are
> reported to the filesystem first and then the filesystem does what
> is needed to handle the error.
>
> None of this works for filesystems because they need to perform
> different operations depending on what the page that went bad
> contains. FS-DAX should never trip over an unexpected poisoned page;
> we do so now because such physical storage errors are completely
> hidden from the filesystem.
>
> What you are trying to do is slap a band-aid over what to do when we
> hit an unexpected page containing bad data. Filesystems expect to
> find out about bad data in storage when they marshall the data into
> or out of memory. They make the assumption that once it is in memory
> it remains valid on the physical storage. Hence if an in-memory
> error occurs, we can just toss it away and re-read it from storage,
> and all is good.
>
> FS-DAX changes that - we are no longer marshalling data into and out
> of memory so we don't have a mechanism to get EIO when reading the
> page into the page cache or writing it back to disk. We also don't
> have an in-memory copy of the data - the physical storage is the
> in-memory copy, and so we can't just toss it away when an error
> occurs.
>
> What we actually require is instantaneous notification of physical
> storage errors so we can handle the error immediately. And that, in
> turn, means we should never poison or see poisoned pages during
> direct access operations because the filesystem doesn't need to
> poison pages to prevent user access - it controls how the storage is
> accessed directly.
>
> e.g. if this error is in filesystem metadata, we might be able to
> recover from it as that metadata might have a full copy in memory
> (metadata is buffered in both XFS and ext4) or we might be able to
> reconstruct it from other sources. Worst case, we have to shut the
> filesystem down completely so the admin can repair the damage the
> lost page has caused.
>
> e.g. The physical page may be located in free space, in which case
> we don't care and can just zero it so all the bad hardware state is
> cleared. The page never goes bad or gets poisoned in that case.
>
> e.g. The physical page may be user data, in which case it may be
> buffered in the page cache (non-dax) and so can easily be recovered.
> It may not be recoverable, in which case we need to issue log
> messages indicating that data has been lost (path, offset, length),
> and do the VMA walk and kill processes that map that page. Then we
> can zero the page to clear the bad state.
>
> If, at any point we can't clear the bad state (e.g. the zeroing or
> the read-back verification fails), then we need to make sure that
> filesystem block is marked as allocated in the free space map, then
> tell the reverse map that its owner is now "bad storage" so it
> never gets used again. i.e. this is the persistent bad block
> storage, but doing it this way results in the rest of the filesystem
> code never, ever seeing a poisoned page. And users never see it
> either, because it will never be returned to the free space pool.
>
> Of course, this relies on the filesystem having reverse mapping
> capability. XFS already has this functionality available as a mkfs
> option (mkfs.xfs -m rmapbt=1 ...), and we really need this so we can
> get rid of the association of a physical page with a mapping and
> file offset that device-dax requires for hardware page error
> handling.  This means we don't need the physical storage to try to
> hold filesystem layer reverse mapping information for us, and this
> also removes the roadblock that the hardware error handling has
> placed on implementing reflink w/ FS-DAX.
>
> IOWs, the problem you are trying to solve is a direct result of
> filesystems not being informed when a physical pmem page goes bad
> and the current error handling being implemented at entirely the
> wrong layer for FS-DAX. It may work for device-dax, but it's most
> definitely insufficient for correct error handling for filesystems.
>
> > Anyway, partial page truncate can't ensure that data in the rest of the page
> > can be read back successfully. Memory can get poisoned after the write and
> > hence a read after truncate will still fail.
>
> Which is where the notification requirement comes in. Yes, we may
> still get errors on read or write, but if memory at rest goes bad,
> we want to handle that and correct it ASAP, not wait days or months
> for something to trip over the poisoned page before we find out
> about it.
>
> > Hence, all we are trying to ensure is that if poison is known at the
> > time of writing a partial page, then we should return an error to user space.
>
> I think within FS-DAX infrastructure, any access to the data (read
> or write) within a poisoned page or a page marked with PageError()
> should return EIO to the caller, unless it's the specific command to
> clear the error/poison state on the page. What happens with that
> error state is then up to the caller.
>

I agree with most of the above if you replace "device-dax error
handling" with "System RAM error handling". It's typical memory error
handling that injects the page->index and page->mapping dependency.
In fact it was difficult to map this to device-dax without the hack
that device-dax does not need to take a page lock. I do think that the
FS needs this error information; at the same time, it's also true that
historically no FS pushed for this information for page-cache and
in-memory metadata prior to FS-DAX. So the design was not "device-dax
first", it was "existing memory error handling first", with all the
warts that implied.

So you want the FS to have error handling for just pmem errors or all
memory errors? And you want this to be done without the mm core using
page->index to identify what to unmap when the error happens? Memory
error scanning is not universally available on all pmem
implementations, so the FS will need to subscribe to memory-error
handling events. The implementation is clearer for pmem if the FS is
only interested in memory-error handling events for memory it
ostensibly owns (mounted pmem) vs other pages that are only
ephemerally related to the FS (typical RAM).

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-27  3:11                                 ` Dave Chinner
@ 2020-02-27 15:25                                   ` Vivek Goyal
  -1 siblings, 0 replies; 103+ messages in thread
From: Vivek Goyal @ 2020-02-27 15:25 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Thu, Feb 27, 2020 at 02:11:43PM +1100, Dave Chinner wrote:
> On Wed, Feb 26, 2020 at 11:57:56AM -0500, Vivek Goyal wrote:
> > On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote:
> > [..]
> > > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > > > > callback that deals with page aligned entries. That change at least
> > > > > makes the error boundary symmetric across copy_from_iter() and the
> > > > > zeroing path.
> > > >
> > > > IIUC, you are suggesting that we modify dax_zero_page_range() to take page
> > > > aligned start and size and call this interface from
> > > > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that
> > > > path?
> > > >
> > > > Something like.
> > > >
> > > > __dax_zero_page_range() {
> > > >   if(page_aligned_io)
> > > >         call_dax_page_zero_range()
> > > >   else
> > > >         use_direct_access_and_memcpy;
> > > > }
> > > >
> > > > And other callers of blkdev_issue_zeroout() in filesystems can migrate
> > > > to calling dax_zero_page_range() instead.
> > > >
> > > > If yes, I am not seeing what advantage we get by this change.
> > > >
> > > > - __dax_zero_page_range() seems to be called by only partial block
> > > >   zeroing code. So dax_zero_page_range() call will remain unused.
> > > >
> > > >
> > > > - dax_zero_page_range() will be an exact replacement of
> > > >   blkdev_issue_zeroout() so filesystems will not gain anything. Just that
> > > >   it will create a dax specific hook.
> > > >
> > > > In that case it might be simpler to just get rid of blkdev_issue_zeroout()
> > > > call from __dax_zero_page_range() and make sure there are no callers of
> > > > full block zeroing from this path.
> > > 
> > > I think you're right. The path I'm concerned about not regressing is
> > > the error clearing on new block allocation and we get that already via
> > > xfs_zero_extent() and sb_issue_zeroout().
> > 
> > Well I was wrong. I found at least one user that uses __dax_zero_page_range()
> > to zero full PAGE_SIZE blocks.
> > 
> > xfs_io -c "allocsp 32K 0" foo.txt
> 
> That ioctl interface is deprecated and likely unused by any new
> application written since 1999. It predates unwritten extents (1998)
> and I don't think any native linux applications have ever used it. A
> newly written DAX aware application would almost certainly not use
> this interface.
> 
> IOWs, I wouldn't use it as justification for some special case
> behaviour; I'm more likely to say "rip that ancient ioctl out" than
> to jump through hoops because it's implementation behaviour.

Hi Dave,

Do you see any other path in xfs using iomap_zero_range() and zeroing
full blocks? iomap_zero_range() already skips IOMAP_HOLE and
IOMAP_UNWRITTEN, so it has to be full block zeroing of a type that is
neither IOMAP_HOLE nor IOMAP_UNWRITTEN.
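
For reference, the check being referred to, roughly as it appears in
5.6-era fs/iomap/buffered-io.c (trimmed to the relevant lines):

static loff_t
iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
		void *data, struct iomap *iomap, struct iomap *srcmap)
{
	/* already zeroed?  we're done. */
	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
		return count;
	...
}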

My understanding is that ext4 and xfs both are initializing full blocks
using blkdev_issue_zeroout(). Only partial blocks are being zeroed using
this dax zeroing path.

If there are no callers of full block zeroing through
__dax_zero_page_range(), then I can simply get rid of the
blkdev_issue_zeroout() call from __dax_zero_page_range().
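
A minimal sketch of what would remain, built only from the 5.6-era
fs/dax.c helpers (error handling trimmed, so a sketch rather than the
actual patch):

static int __dax_zero_page_range(struct block_device *bdev,
		struct dax_device *dax_dev, sector_t sector,
		unsigned int offset, unsigned int size)
{
	pgoff_t pgoff;
	long rc, id;
	void *kaddr;

	rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
	if (rc)
		return rc;

	id = dax_read_lock();
	rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
	if (rc < 0) {
		dax_read_unlock(id);
		return rc;
	}
	/* zero the (possibly sub-page) range and flush it to media */
	memset(kaddr + offset, 0, size);
	dax_flush(dax_dev, kaddr + offset, size);
	dax_read_unlock(id);
	return 0;
}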

Thanks
Vivek

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-27  4:19                   ` Dan Williams
@ 2020-02-28  1:30                     ` Dave Chinner
  -1 siblings, 0 replies; 103+ messages in thread
From: Dave Chinner @ 2020-02-28  1:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Wed, Feb 26, 2020 at 08:19:37PM -0800, Dan Williams wrote:
> On Wed, Feb 26, 2020 at 7:03 PM Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Feb 24, 2020 at 10:38:44AM -0500, Vivek Goyal wrote:
> > > Anyway, partial page truncate can't ensure that data in the rest of the page
> > > can be read back successfully. Memory can get poisoned after the write and
> > > hence a read after truncate will still fail.
> >
> > Which is where the notification requirement comes in. Yes, we may
> > still get errors on read or write, but if memory at rest goes bad,
> > we want to handle that and correct it ASAP, not wait days or months
> > for something to trip over the poisoned page before we find out
> > about it.
> >
> > > Hence, all we are trying to ensure is that if poison is known at the
> > > time of writing a partial page, then we should return an error to user space.
> >
> > I think within FS-DAX infrastructure, any access to the data (read
> > or write) within a poisoned page or a page marked with PageError()
> > should return EIO to the caller, unless it's the specific command to
> > clear the error/poison state on the page. What happens with that
> > error state is then up to the caller.
> >
> 
> I agree with most of the above if you replace "device-dax error
> handling" with "System RAM error handling". It's typical memory error
> handling that injects the page->index and page->mapping dependency.

I disagree, but that's beside the point and not worth arguing.

> So you want the FS to have error handling for just pmem errors or all
> memory errors?

Just pmem errors in the specific range the filesystem manages - we
really only care about storage errors because those are the only errors
the filesystem is responsible for handling correctly.

Somebody else can worry about errors that hit page cache pages -
page cache pages require mapping/index pointers on each page anyway,
so a generic mechanism for handling those errors can be built into
the page cache. And if the error is in general kernel memory, then
it's game over for the entire kernel at that point, not just the
filesystem.

> And you want this to be done without the mm core using
> page->index to identify what to unmap when the error happens?

Isn't that exactly what I just said? We get the page address that
failed, the daxdev can turn that into a sector address and call into
the filesystem with a {sector, len, errno} tuple. We then do a
reverse mapping lookup on {sector, len} to find all the owners of
that range in the filesystem. If it's file data, that owner record
gives us the inode and the offset into the file, which then gives us
a {mapping, index} tuple.

Further, the filesystem reverse map is aware that its blocks can be
shared by multiple owners, so it will have a record for every inode
and file offset that maps to that page. Hence we can simply iterate
the reverse map and do that whacky collect_procs/kill_procs dance
for every {mapping, index} pair that references the bad range.

Nothing ever needs to be stored on the struct page...
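
To sketch that shape concretely (every name here is hypothetical; the
point is only the {sector, len, errno} tuple and the direction of the
callout):

/* hypothetical interface - nothing like this exists in the kernel today */
struct dax_media_error {
	sector_t	sector;	/* device-relative start of the bad range */
	unsigned int	len;	/* length of the bad range */
	int		error;	/* e.g. -EIO */
};

struct dax_holder_ops {
	/* daxdev -> filesystem callout when the media reports an error */
	int (*notify_media_error)(struct super_block *sb,
				  const struct dax_media_error *dme);
};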

> Memory
> error scanning is not universally available on all pmem
> implementations, so the FS will need to subscribe to memory-error
> handling events.

No. Filesystems interact with the underlying device abstraction, not
the physical storage that lies behind that device abstraction.  The
filesystem should not interface with pmem directly in any way (all
direct accesses are hidden inside fs/dax.c!), nor should it care
about the quirks of the pmem implementation it is sitting on. That's
what the daxdev abstraction is supposed to hide from the users of
the pmem.

IOWs, the daxdev infrastructure subscribes to the memory-error event
subsystem and calls out to the filesystem when an error in a page in
the daxdev is reported. The filesystem only needs to know the
{sector, len, errno} tuple related to the error; it is the device's
responsibility to handle the physical mechanics of listening,
filtering and translating MCEs to a format the filesystem
understands....

Another reason it should be provided by the daxdev as a {sector,
len, errno} tuple is that it also allows non-dax block devices to
implement similar error notifications and provide filesystems
with exactly the same information so the filesystem can start
auto-recovery processes....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-27 15:25                                   ` Vivek Goyal
@ 2020-02-28  1:50                                     ` Dave Chinner
  -1 siblings, 0 replies; 103+ messages in thread
From: Dave Chinner @ 2020-02-28  1:50 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Thu, Feb 27, 2020 at 10:25:17AM -0500, Vivek Goyal wrote:
> On Thu, Feb 27, 2020 at 02:11:43PM +1100, Dave Chinner wrote:
> > On Wed, Feb 26, 2020 at 11:57:56AM -0500, Vivek Goyal wrote:
> > > On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote:
> > > [..]
> > > > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > > > > > callback that deals with page aligned entries. That change at least
> > > > > > makes the error boundary symmetric across copy_from_iter() and the
> > > > > > zeroing path.
> > > > >
> > > > > IIUC, you are suggesting that we modify dax_zero_page_range() to take page
> > > > > aligned start and size and call this interface from
> > > > > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that
> > > > > path?
> > > > >
> > > > > Something like.
> > > > >
> > > > > __dax_zero_page_range() {
> > > > >   if(page_aligned_io)
> > > > >         call_dax_page_zero_range()
> > > > >   else
> > > > >         use_direct_access_and_memcpy;
> > > > > }
> > > > >
> > > > > And other callers of blkdev_issue_zeroout() in filesystems can migrate
> > > > > to calling dax_zero_page_range() instead.
> > > > >
> > > > > If yes, I am not seeing what advantage we get by this change.
> > > > >
> > > > > - __dax_zero_page_range() seems to be called by only partial block
> > > > >   zeroing code. So dax_zero_page_range() call will remain unused.
> > > > >
> > > > >
> > > > > - dax_zero_page_range() will be an exact replacement of
> > > > >   blkdev_issue_zeroout() so filesystems will not gain anything. Just that
> > > > >   it will create a dax specific hook.
> > > > >
> > > > > In that case it might be simpler to just get rid of blkdev_issue_zeroout()
> > > > > call from __dax_zero_page_range() and make sure there are no callers of
> > > > > full block zeroing from this path.
> > > > 
> > > > I think you're right. The path I'm concerned about not regressing is
> > > > the error clearing on new block allocation and we get that already via
> > > > xfs_zero_extent() and sb_issue_zeroout().
> > > 
> > > Well I was wrong. I found at least one user that uses __dax_zero_page_range()
> > > to zero full PAGE_SIZE blocks.
> > > 
> > > xfs_io -c "allocsp 32K 0" foo.txt
> > 
> > That ioctl interface is deprecated and likely unused by any new
> > application written since 1999. It predates unwritten extents (1998)
> > and I don't think any native linux applications have ever used it. A
> > newly written DAX aware application would almost certainly not use
> > this interface.
> > 
> > IOWs, I wouldn't use it as justification for some special case
> > behaviour; I'm more likely to say "rip that ancient ioctl out" than
> > to jump through hoops because it's implementation behaviour.
> 
> Hi Dave,
> 
> Do you see any other path in xfs using iomap_zero_range() and zeroing
> full blocks?

Yes:

- xfs_file_aio_write_checks() for zeroing blocks between the
  existing EOF and the start of the incoming write beyond EOF
- xfs_setattr_size() on truncate up for zeroing blocks between the
  existing EOF and the new EOF.
- xfs_reflink_zero_posteof() for zeroing blocks between the old EOF
  and where the new reflinked extents are going to land beyond EOF

And don't think that blocks beyond EOF can't exist when DAX is
enabled. We can turn DAX on and off, we can crash between allocation
and file size extension, etc. Hence this code must be able to handle
zeroing large ranges of blocks beyond EOF...

> iomap_zero_range() already skips IOMAP_HOLE and
> IOMAP_UNWRITTEN, so it has to be full block zeroing of a type that is
> neither IOMAP_HOLE nor IOMAP_UNWRITTEN.
> 
> My understanding is that ext4 and xfs both are initializing full blocks
> using blkdev_issue_zeroout(). Only partial blocks are being zeroed using
> this dax zeroing path.

Look at the API, not the callers: iomap_zero_range takes a 64 bit
length parameter. It can be asked to zero blocks across petabytes of
a file. If there's a single block somewhere in that range, it will
only zero that block. If the entire range is allocated, it will zero
that entire range (yes, it will take forever!) as that is what it
is intended to do.

It should be pretty clear that it needs to be able to zero entire
pages, regardless of how it is currently used/called by filesystems.
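
For context, the 5.6-era prototype, plus a paraphrase of the
xfs_file_aio_write_checks() usage (the exact ops name varies by kernel
version):

int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
		bool *did_zero, const struct iomap_ops *ops);

/* zero from the old EOF up to the start of a write beyond EOF */
error = iomap_zero_range(inode, isize, iocb->ki_pos - isize,
		NULL, &xfs_buffered_write_iomap_ops);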

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-28  1:30                     ` Dave Chinner
@ 2020-02-28  3:28                       ` Dan Williams
  -1 siblings, 0 replies; 103+ messages in thread
From: Dan Williams @ 2020-02-28  3:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Thu, Feb 27, 2020 at 5:31 PM Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Feb 26, 2020 at 08:19:37PM -0800, Dan Williams wrote:
[..]
> > So you want the FS to have error handling for just pmem errors or all
> > memory errors?
>
> Just pmem errors in the specific range the filesystem manages - we
> really only care about storage errors because those are the only errors
> the filesystem is responsible for handling correctly.
>
> Somebody else can worry about errors that hit page cache pages -
> page cache pages require mapping/index pointers on each page anyway,
> so a generic mechanism for handling those errors can be built into
> the page cache. And if the error is in general kernel memory, then
> it's game over for the entire kernel at that point, not just the
> filesystem.
>
> > And you want this to be done without the mm core using
> > page->index to identify what to unmap when the error happens?
>
> Isn't that exactly what I just said? We get the page address that
> failed, the daxdev can turn that into a sector address and call into
> the filesystem with a {sector, len, errno} tuple. We then do a
> reverse mapping lookup on {sector, len} to find all the owners of
> that range in the filesystem. If it's file data, that owner record
> gives us the inode and the offset into the file, which then gives us
> a {mapping, index} tuple.
>
> Further, the filesystem reverse map is aware that its blocks can be
> shared by multiple owners, so it will have a record for every inode
> and file offset that maps to that page. Hence we can simply iterate
> the reverse map and do that whacky collect_procs/kill_procs dance
> for every {mapping, index} pair that references the bad range.
>
> Nothing ever needs to be stored on the struct page...

Ok, so fs/dax.c needs to coordinate with mm/memory-failure.c to say
"don't perform generic memory-error-handling, there's an fs that owns
this daxdev and wants to disposition errors". The fs/dax.c
infrastructure that sets up ->index and ->mapping would still need to
be there for ext4 until it's ready to take on the same responsibility.
Last I checked, the ext4 reverse mapping implementation was not as
capable as XFS's. This goes back to why the initial design avoided
relying on reverse-mapping facilities that are not universally
available or stable, and opted for extending the generic
mm/memory-failure.c implementation.

If XFS optionally supplants mm/memory-failure.c I would expect we'd
want to do better than the "whacky collect_procs/kill_procs"
implementation and let applications register for a notification format
better than BUS_MCEERR_AO signals.

> > Memory
> > error scanning is not universally available on all pmem
> > implementations, so the FS will need to subscribe to memory-error
> > handling events.
>
> No. Filesystems interact with the underlying device abstraction, not
> the physical storage that lies behind that device abstraction.  The
> filesystem should not interface with pmem directly in any way (all
> direct accesses are hidden inside fs/dax.c!), nor should it care
> about the quirks of the pmem implementation it is sitting on. That's
> what the daxdev abstraction is supposed to hide from the users of
> the pmem.

I wasn't proposing that XFS have a machine-check handler; I was trying
to get to the next level of detail on the async notification interface
to the fs.

>
> IOWs, the daxdev infrastructure subscribes to the memory-error event
> subsystem and calls out to the filesystem when an error in a page in
> the daxdev is reported. The filesystem only needs to know the
> {sector, len, errno} tuple related to the error; it is the device's
> responsibility to handle the physical mechanics of listening,
> filtering and translating MCEs to a format the filesystem
> understands....
>
> Another reason it should be provided by the daxdev as a {sector,
> len, errno} tuple is that it also allows non-dax block devices to
> implement similar error notifications and provide filesystems
> with exactly the same information so the filesystem can start
> auto-recovery processes....

The driver layer does already have 'struct badblocks' that both pmem
and md use, just no current way for the FS to get a reference to it.
However, my hope was to get away from the interface being sector-based
because the error granularity is already smaller than a sector in the
daxdev case as compared to a bdev. A daxdev-native error record should
be a daxdev-relative byte offset, not a sector.
align the blast radius of the error to sectors or fs-blocks that's its
choice, but I don't think the driver interface should preclude
sub-sector error handling. Essentially I don't want to add more bdev
semantics to fs/dax.c while Vivek is busy removing them.
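
For reference, the sector granularity is baked into the existing helper
(simplified from drivers/nvdimm/pmem.h):

static bool is_bad_pmem(struct badblocks *bb, sector_t sector,
		unsigned int len)
{
	if (bb->count) {
		sector_t first_bad;
		int num_bad;

		/* badblocks_check() only works in 512-byte sectors */
		return !!badblocks_check(bb, sector, len / 512,
					 &first_bad, &num_bad);
	}
	return false;
}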

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
@ 2020-02-28  3:28                       ` Dan Williams
  0 siblings, 0 replies; 103+ messages in thread
From: Dan Williams @ 2020-02-28  3:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Vivek Goyal, Jeff Moyer, linux-fsdevel, linux-nvdimm,
	Christoph Hellwig, device-mapper development

On Thu, Feb 27, 2020 at 5:31 PM Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Feb 26, 2020 at 08:19:37PM -0800, Dan Williams wrote:
[..]
> > So you want the FS to have error handling for just pmem errors or all
> > memory errors?
>
> Just pmem errors in the specific range the filesystem manages - we
> really only care storage errors because those are the only errors
> the filesystem is responsible for handling correctly.
>
> Somebody else can worry about errors that hit page cache pages -
> page cache pages require mapping/index pointers on each page anyway,
> so a generic mechanism for handling those errors can be built into
> the page cache. And if the error is in general kernel memory, then
> it's game over for the entire kernel at that point, not just the
> filesystem.
>
> > And you want this to be done without the mm core using
> > page->index to identify what to unmap when the error happens?
>
> Isn't that exactly what I just said? We get the page address that
> failed, the daxdev can turn that into a sector address and call into
> the filesystem with a {sector, len, errno} tuple. We then do a
> reverse mapping lookup on {sector, len} to find all the owners of
> that range in the filesystem. If it's file data, that owner record
> gives us the inode and the offset into the file, which then gives us
> a {mapping, index} tuple.
>
> Further, the filesytem reverse map is aware that it's blocks can be
> shared by multiple owners, so it will have a record for every inode
> and file offset that maps to that page. Hence we can simply iterate
> the reverse map and do that whacky collect_procs/kill_procs dance
> for every {mapping, index} pair that references the the bad range.
>
> Nothing ever needs to be stored on the struct page...

Ok, so fs/dax.c needs to coordinate with mm/memory-failure.c to say
"don't perform generic memory-error-handling, there's an fs that owns
this daxdev and wants to disposition errors". The fs/dax.c
infrastructure that sets up ->index and ->mapping would still need to
be there for ext4 until its ready to take on the same responsibility.
Last I checked the ext4 reverse mapping implementation was not as
capable as XFS. This goes back to why the initial design avoided
relying on not universally available / stable reverse-mapping
facilities and opted for extending the generic mm/memory-failure.c
implementation.

If XFS optionally supplants mm/memory-failure.c I would expect we'd
want to do better than the "whacky collect_procs/kill_procs"
implementation and let applications register for a notification format
better than BUS_MCEERR_AO signals.

> > Memory
> > error scanning is not universally available on all pmem
> > implementations, so FS will need to subscribe for memory-error
> > handling events.
>
> No. Filesystems interact with the underlying device abstraction, not
> the physical storage that lies behind that device abstraction.  The
> filesystem should not interface with pmem directly in any way (all
> direct accesses are hidden inside fs/dax.c!), nor should it care
> about the quirks of the pmem implementation it is sitting on. That's
> what the daxdev abstraction is supposed to hide from the users of
> the pmem.

I wasn't proposing that XFS have a machine-check handler, I was trying
to get to the next level detail of the async notification interface to
the fs.

>
> IOWs, the daxdev infrastructure subscribes to memory-error event
> subsystem, calls out to the filesystem when an error in a page in
> the daxdev is reported. The filesystem only needs to know the
> {sector, len, errno} tuple related to the error; it is the device's
> responsibility to handle the physical mechanics of listening,
> filtering and translating MCEs to a format the filesystem
> understands....
>
> Another reason it should be provided by the daxdev as a {sector,
> len, errno} tuple is that it also allows non-dax block devices to
> implement similar error notifications and provide filesystems
> with exactly the same information so the filesystem can start
> auto-recovery processes....

The driver layer does already have a 'struct badblocks' that both pmem
and md use, but there is currently no way for the FS to get a
reference to it. However, my hope was to get away from the interface
being sector-based, because the error granularity in the daxdev case
is already smaller than a sector, as compared to a bdev. A
daxdev-native error record should be a daxdev-relative byte offset,
not a sector. If the fs wants to align the blast radius of the error
to sectors or fs-blocks, that's its choice, but I don't think the
driver interface should preclude
sub-sector error handling. Essentially I don't want to add more bdev
semantics to fs/dax.c while Vivek is busy removing them.
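
I.e. a daxdev-native record would carry something like the following
(illustrative only, not an existing structure):

	struct dax_badrange {
		u64	offset;	/* byte offset into the daxdev */
		u64	len;	/* length in bytes, may be sub-sector */
	};

rather than anything expressed in sector_t units.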

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-28  3:28                       ` Dan Williams
@ 2020-02-28 14:05                         ` Christoph Hellwig
  -1 siblings, 0 replies; 103+ messages in thread
From: Christoph Hellwig @ 2020-02-28 14:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development

On Thu, Feb 27, 2020 at 07:28:41PM -0800, Dan Williams wrote:
> "don't perform generic memory-error-handling, there's an fs that owns
> this daxdev and wants to disposition errors". The fs/dax.c
> infrastructure that sets up ->index and ->mapping would still need to
> be there for ext4 until it's ready to take on the same responsibility.
> Last I checked, the ext4 reverse mapping implementation was not as
> capable as XFS's. This goes back to why the initial design avoided
> relying on reverse-mapping facilities that are not universally
> available / stable, and opted for extending the generic
> mm/memory-failure.c implementation.

The important bit is that we stop using ->index and ->mapping when XFS
is used, as that enables DAX+reflink, which is actually the most
requested DAX feature on XFS (way more than silly runtime flag
switches, for example).

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
  2020-02-28 14:05                         ` Christoph Hellwig
@ 2020-02-28 16:26                           ` Dan Williams
  -1 siblings, 0 replies; 103+ messages in thread
From: Dan Williams @ 2020-02-28 16:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, linux-fsdevel, linux-nvdimm, device-mapper development

On Fri, Feb 28, 2020 at 6:05 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Feb 27, 2020 at 07:28:41PM -0800, Dan Williams wrote:
> > "don't perform generic memory-error-handling, there's an fs that owns
> > this daxdev and wants to disposition errors". The fs/dax.c
> > infrastructure that sets up ->index and ->mapping would still need to
> > be there for ext4 until it's ready to take on the same responsibility.
> > Last I checked, the ext4 reverse mapping implementation was not as
> > capable as XFS's. This goes back to why the initial design avoided
> > relying on reverse-mapping facilities that are not universally
> > available / stable, and opted for extending the generic
> > mm/memory-failure.c implementation.
>
> The important bit is that we stop using ->index and ->mapping when XFS
> is used, as that enables DAX+reflink, which is actually the most
> requested DAX feature on XFS (way more than silly runtime flag
> switches, for example).

Understood. To be clear, the plan we are marching toward is to knock
down all the known objections to the removal of the "experimental"
designation. reflink is on that list, and so is per-file dax. The
thought was that per-file dax was a nearer-term goal than reflink.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 4/8] dax, pmem: Add a dax operation zero_page_range
  2020-02-18 21:48   ` Vivek Goyal
@ 2020-03-31 19:38     ` Dan Williams
  -1 siblings, 0 replies; 103+ messages in thread
From: Dan Williams @ 2020-03-31 19:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development, Christoph Hellwig

On Tue, Feb 18, 2020 at 1:49 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> Add a dax operation zero_page_range, to zero a range of memory. This will
> also clear any poison in the range being zeroed.
>
> As of now, zeroing of up to one page is allowed in a single call. There
> are no callers which are trying to zero more than a page in a single call.
> Once we grow callers which zero more than a page in a single call, we
> can add that support. The primary reason for not doing that yet is that
> it would add a little complexity to the dm implementation, where a range
> might span multiple underlying targets and one would have to split the
> range into multiple sub-ranges and call zero_page_range() on individual
> targets.
>
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  drivers/dax/super.c   | 19 +++++++++++++++++++
>  drivers/nvdimm/pmem.c | 10 ++++++++++
>  include/linux/dax.h   |  3 +++
>  3 files changed, 32 insertions(+)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 0aa4b6bc5101..c912808bc886 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -344,6 +344,25 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
>  }
>  EXPORT_SYMBOL_GPL(dax_copy_to_iter);
>
> +int dax_zero_page_range(struct dax_device *dax_dev, u64 offset, size_t len)
> +{
> +       if (!dax_alive(dax_dev))
> +               return -ENXIO;
> +
> +       if (!dax_dev->ops->zero_page_range)
> +               return -EOPNOTSUPP;

This seems too late to be doing the validation. It would be odd for
random filesystem operations to see this error. I would move the check
to alloc_dax() and fail that if the caller fails to implement the
operation.

An incremental patch on top to fix this up would be ok. Something like
"Now that all dax_operations providers implement zero_page_range()
mandate it at alloc_dax time".
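
Roughly the following (sketch only, against this series, untested):

	struct dax_device *alloc_dax(void *private, const char *host,
			const struct dax_operations *ops, unsigned long flags)
	{
		...
		/* every provider must now supply zero_page_range() */
		if (ops && !ops->zero_page_range) {
			pr_debug("%s: error: ops missing zero_page_range()\n",
				 host ? : "Unknown");
			return ERR_PTR(-EINVAL);
		}
		...
	}

With that in place, the -EOPNOTSUPP branch in dax_zero_page_range()
can be dropped.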

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 4/8] dax, pmem: Add a dax operation zero_page_range
  2020-03-31 19:38     ` Dan Williams
@ 2020-04-01 13:15       ` Vivek Goyal
  -1 siblings, 0 replies; 103+ messages in thread
From: Vivek Goyal @ 2020-04-01 13:15 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development, Christoph Hellwig

On Tue, Mar 31, 2020 at 12:38:16PM -0700, Dan Williams wrote:
> On Tue, Feb 18, 2020 at 1:49 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > Add a dax operation zero_page_range, to zero a range of memory. This will
> > also clear any poison in the range being zeroed.
> >
> > As of now, zeroing of up to one page is allowed in a single call. There
> > are no callers which are trying to zero more than a page in a single call.
> > Once we grow callers which zero more than a page in a single call, we
> > can add that support. The primary reason for not doing that yet is that
> > it would add a little complexity to the dm implementation, where a range
> > might span multiple underlying targets and one would have to split the
> > range into multiple sub-ranges and call zero_page_range() on individual
> > targets.
> >
> > Suggested-by: Christoph Hellwig <hch@infradead.org>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  drivers/dax/super.c   | 19 +++++++++++++++++++
> >  drivers/nvdimm/pmem.c | 10 ++++++++++
> >  include/linux/dax.h   |  3 +++
> >  3 files changed, 32 insertions(+)
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index 0aa4b6bc5101..c912808bc886 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -344,6 +344,25 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
> >  }
> >  EXPORT_SYMBOL_GPL(dax_copy_to_iter);
> >
> > +int dax_zero_page_range(struct dax_device *dax_dev, u64 offset, size_t len)
> > +{
> > +       if (!dax_alive(dax_dev))
> > +               return -ENXIO;
> > +
> > +       if (!dax_dev->ops->zero_page_range)
> > +               return -EOPNOTSUPP;
> 
> This seems too late to be doing the validation. It would be odd for
> random filesystem operations to see this error. I would move the check
> to alloc_dax() and fail that if the caller fails to implement the
> operation.
> 
> An incremental patch on top to fix this up would be ok. Something like
> "Now that all dax_operations providers implement zero_page_range()
> mandate it at alloc_dax time".

Hi Dan,

Ok, I will send an incremental patch for this.

BTW, I have posted V6 of this patch series and you might want to look
at that instead of V5.

https://lore.kernel.org/linux-fsdevel/20200228163456.1587-1-vgoyal@redhat.com/

Vivek

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v5 4/8] dax, pmem: Add a dax operation zero_page_range
  2020-03-31 19:38     ` Dan Williams
@ 2020-04-01 16:14       ` Vivek Goyal
  -1 siblings, 0 replies; 103+ messages in thread
From: Vivek Goyal @ 2020-04-01 16:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-fsdevel, linux-nvdimm, Christoph Hellwig,
	device-mapper development, Christoph Hellwig

On Tue, Mar 31, 2020 at 12:38:16PM -0700, Dan Williams wrote:
> On Tue, Feb 18, 2020 at 1:49 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > Add a dax operation zero_page_range, to zero a range of memory. This will
> > also clear any poison in the range being zeroed.
> >
> > As of now, zeroing of up to one page is allowed in a single call. There
> > are no callers which are trying to zero more than a page in a single call.
> > Once we grow callers which zero more than a page in a single call, we
> > can add that support. The primary reason for not doing that yet is that
> > it would add a little complexity to the dm implementation, where a range
> > might span multiple underlying targets and one would have to split the
> > range into multiple sub-ranges and call zero_page_range() on individual
> > targets.
> >
> > Suggested-by: Christoph Hellwig <hch@infradead.org>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  drivers/dax/super.c   | 19 +++++++++++++++++++
> >  drivers/nvdimm/pmem.c | 10 ++++++++++
> >  include/linux/dax.h   |  3 +++
> >  3 files changed, 32 insertions(+)
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index 0aa4b6bc5101..c912808bc886 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -344,6 +344,25 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
> >  }
> >  EXPORT_SYMBOL_GPL(dax_copy_to_iter);
> >
> > +int dax_zero_page_range(struct dax_device *dax_dev, u64 offset, size_t len)
> > +{
> > +       if (!dax_alive(dax_dev))
> > +               return -ENXIO;
> > +
> > +       if (!dax_dev->ops->zero_page_range)
> > +               return -EOPNOTSUPP;
> 
> This seems too late to be doing the validation. It would be odd for
> random filesystem operations to see this error. I would move the check
> to alloc_dax() and fail that if the caller fails to implement the
> operation.
> 
> An incremental patch on top to fix this up would be ok. Something like
> "Now that all dax_operations providers implement zero_page_range()
> mandate it at alloc_dax time".

Hi Dan,

Posted an extra patch in the same patch series for this.

https://lore.kernel.org/linux-fsdevel/20200228163456.1587-1-vgoyal@redhat.com/T/#m624680cbb5e714266d4b34ade2d6c390dae69598

Vivek

^ permalink raw reply	[flat|nested] 103+ messages in thread

end of thread, other threads:[~2020-04-25 11:32 UTC | newest]

Thread overview: 103+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-18 21:48 [PATCH v5 0/8] dax/pmem: Provide a dax operation to zero range of memory Vivek Goyal
2020-02-18 21:48 ` Vivek Goyal
2020-02-18 21:48 ` [PATCH v5 1/8] pmem: Add functions for reading/writing page to/from pmem Vivek Goyal
2020-02-18 21:48   ` Vivek Goyal
2020-02-18 21:48 ` [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len Vivek Goyal
2020-02-18 21:48   ` Vivek Goyal
2020-02-20 16:17   ` Christoph Hellwig
2020-02-20 16:17     ` Christoph Hellwig
2020-02-20 21:35   ` Jeff Moyer
2020-02-20 21:35     ` Jeff Moyer
2020-02-20 21:57     ` Vivek Goyal
2020-02-20 21:57       ` Vivek Goyal
2020-02-21 18:32       ` Jeff Moyer
2020-02-21 18:32         ` Jeff Moyer
2020-02-21 20:17         ` Vivek Goyal
2020-02-21 20:17           ` Vivek Goyal
2020-02-21 21:00           ` Dan Williams
2020-02-21 21:00             ` Dan Williams
2020-02-21 21:24             ` Vivek Goyal
2020-02-21 21:24               ` Vivek Goyal
2020-02-21 21:30               ` Dan Williams
2020-02-21 21:30                 ` Dan Williams
2020-02-21 21:33                 ` Jeff Moyer
2020-02-21 21:33                   ` Jeff Moyer
2020-02-23 23:03           ` Dave Chinner
2020-02-23 23:03             ` Dave Chinner
2020-02-24  0:40             ` Dan Williams
2020-02-24  0:40               ` Dan Williams
2020-02-24 13:50               ` Jeff Moyer
2020-02-24 13:50                 ` Jeff Moyer
2020-02-24 20:48                 ` Dan Williams
2020-02-24 20:48                   ` Dan Williams
2020-02-24 21:53                   ` Jeff Moyer
2020-02-24 21:53                     ` Jeff Moyer
2020-02-25  0:26                     ` Dan Williams
2020-02-25  0:26                       ` Dan Williams
2020-02-25 20:32                       ` Jeff Moyer
2020-02-25 20:32                         ` Jeff Moyer
2020-02-25 21:52                         ` Dan Williams
2020-02-25 21:52                           ` Dan Williams
2020-02-25 23:26                       ` Jane Chu
2020-02-25 23:26                         ` Jane Chu
2020-02-24 15:38             ` Vivek Goyal
2020-02-24 15:38               ` Vivek Goyal
2020-02-27  3:02               ` Dave Chinner
2020-02-27  3:02                 ` Dave Chinner
2020-02-27  4:19                 ` Dan Williams
2020-02-27  4:19                   ` Dan Williams
2020-02-28  1:30                   ` Dave Chinner
2020-02-28  1:30                     ` Dave Chinner
2020-02-28  3:28                     ` Dan Williams
2020-02-28  3:28                       ` Dan Williams
2020-02-28 14:05                       ` Christoph Hellwig
2020-02-28 14:05                         ` Christoph Hellwig
2020-02-28 16:26                         ` Dan Williams
2020-02-28 16:26                           ` Dan Williams
2020-02-24 20:13             ` Vivek Goyal
2020-02-24 20:13               ` Vivek Goyal
2020-02-24 20:52               ` Dan Williams
2020-02-24 20:52                 ` Dan Williams
2020-02-24 21:15                 ` Vivek Goyal
2020-02-24 21:15                   ` Vivek Goyal
2020-02-24 21:32                   ` Dan Williams
2020-02-24 21:32                     ` Dan Williams
2020-02-25 13:36                     ` Vivek Goyal
2020-02-25 13:36                       ` Vivek Goyal
2020-02-25 16:25                       ` Dan Williams
2020-02-25 16:25                         ` Dan Williams
2020-02-25 20:08                         ` Vivek Goyal
2020-02-25 20:08                           ` Vivek Goyal
2020-02-25 22:49                           ` Dan Williams
2020-02-25 22:49                             ` Dan Williams
2020-02-26 13:51                             ` Vivek Goyal
2020-02-26 13:51                               ` Vivek Goyal
2020-02-26 16:57                             ` Vivek Goyal
2020-02-26 16:57                               ` Vivek Goyal
2020-02-27  3:11                               ` Dave Chinner
2020-02-27  3:11                                 ` Dave Chinner
2020-02-27 15:25                                 ` Vivek Goyal
2020-02-27 15:25                                   ` Vivek Goyal
2020-02-28  1:50                                   ` Dave Chinner
2020-02-28  1:50                                     ` Dave Chinner
2020-02-18 21:48 ` [PATCH v5 3/8] pmem: Enable pmem_do_write() to deal with arbitrary ranges Vivek Goyal
2020-02-18 21:48   ` Vivek Goyal
2020-02-20 16:17   ` Christoph Hellwig
2020-02-20 16:17     ` Christoph Hellwig
2020-02-18 21:48 ` [PATCH v5 4/8] dax, pmem: Add a dax operation zero_page_range Vivek Goyal
2020-02-18 21:48   ` Vivek Goyal
2020-03-31 19:38   ` Dan Williams
2020-03-31 19:38     ` Dan Williams
2020-04-01 13:15     ` Vivek Goyal
2020-04-01 13:15       ` Vivek Goyal
2020-04-01 16:14     ` Vivek Goyal
2020-04-01 16:14       ` Vivek Goyal
2020-02-18 21:48 ` [PATCH v5 5/8] s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver Vivek Goyal
2020-02-18 21:48   ` Vivek Goyal
2020-02-18 21:48 ` [PATCH v5 6/8] dm,dax: Add dax zero_page_range operation Vivek Goyal
2020-02-18 21:48   ` Vivek Goyal
2020-02-18 21:48 ` [PATCH v5 7/8] dax,iomap: Start using dax native zero_page_range() Vivek Goyal
2020-02-18 21:48   ` Vivek Goyal
2020-02-18 21:48 ` [PATCH v5 8/8] dax,iomap: Add helper dax_iomap_zero() to zero a range Vivek Goyal
2020-02-18 21:48   ` Vivek Goyal
