linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v2 0/2] Dax poison recovery
@ 2021-11-06  1:16 Jane Chu
  2021-11-06  1:16 ` [PATCH v2 1/2] dax: Introduce normal and recovery dax operation modes Jane Chu
  2021-11-06  1:16 ` [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery Jane Chu
  0 siblings, 2 replies; 21+ messages in thread
From: Jane Chu @ 2021-11-06  1:16 UTC (permalink / raw)
  To: david, djwong, dan.j.williams, hch, vishal.l.verma, dave.jiang,
	agk, snitzer, dm-devel, ira.weiny, willy, vgoyal, linux-fsdevel,
	nvdimm, linux-kernel, linux-xfs

Until now, the method commonly used to recover data from a dax
media error has been a hole-punch followed by a pwrite(2). The
downside of this method is that it fragments the pmem backend,
which brings a host of very challenging issues.
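
(For illustration, a minimal sketch of that existing flow, assuming
'fd' refers to a file on an fsdax filesystem and 'buf' holds a
known-good copy of the affected, page-aligned range; error handling
elided:)

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	/* punch out the poisoned extent, then rewrite the good data;
	 * the filesystem allocates a fresh pmem extent for the write */
	static int recover_by_hole_punch(int fd, off_t off, size_t len,
					 const void *buf)
	{
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      off, len) < 0)
			return -1;
		return pwrite(fd, buf, len, off) == (ssize_t)len ? 0 : -1;
	}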

This patch set provides an alternative way for dax users to
recover data from a pmem media error without fragmenting the
pmem backend.

A dax media error may be manifested to a user process as SIGBUS
with .si_code BUS_MCEERR_AR when a load instruction consumes
poison in the media, as SIGBUS with .si_code BUS_ADRERR when a
page fault fails to resolve due to an existing poison record,
or as an IO error returned from a block read or write.
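
(A sketch of how a process might observe the first two cases via the
standard siginfo fields:)

	#define _GNU_SOURCE
	#include <signal.h>

	static void bus_handler(int sig, siginfo_t *si, void *ucontext)
	{
		if (si->si_code == BUS_MCEERR_AR) {
			/* poison consumed at si->si_addr; the poisoned
			 * range spans (1UL << si->si_addr_lsb) bytes */
		} else if (si->si_code == BUS_ADRERR) {
			/* fault on an existing poison record at si->si_addr */
		}
	}

	int main(void)
	{
		struct sigaction sa = {
			.sa_sigaction = bus_handler,
			.sa_flags = SA_SIGINFO,
		};

		sigaction(SIGBUS, &sa, NULL);
		/* ... mmap the fsdax file and load/store ... */
		return 0;
	}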

With the patches in place, a user process can recover the data
with just a pwrite(2) to a page-aligned range, without the need
to punch a hole. In the BUS_MCEERR_AR case, the range is
indicated by the signal payload: .si_addr and .si_addr_lsb.
In the BUS_ADRERR case, the range is the user mapping page size
starting from .si_addr. If the clue to the media error comes from
a block IO error, the range is the block IO range stretched to
page alignment.
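
(Continuing the sketch above, inside the BUS_MCEERR_AR branch; 'fd',
'good_copy' and the address-to-offset translation are hypothetical,
application-specific pieces:)

	/* recover in place: page-aligned overwrite, no hole-punch */
	size_t len = (size_t)1 << si->si_addr_lsb;	/* poisoned extent */
	off_t off = addr_to_file_offset(si->si_addr);	/* hypothetical */
	pwrite(fd, good_copy, len, off);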

Changes from v1:
- Instead of giving the user the choice to start dax data recovery,
  let the dax filesystem decide when to enter data recovery mode.
  Hence 99% of the non-dax uses of pwrite and its variants remain
  unaware of dax-specific recovery.
- Instead of exporting a dax_clear_error() API that does one thing,
  combine dax {poison-clearing, error-record-update, writing-good-data}
  into one tight operation to better protect data integrity.
- Semantic and format fixes.

v1: 
https://lore.kernel.org/lkml/20211029223233.GB449541@dread.disaster.area/T/
  
Jane Chu (2):
  dax: Introduce normal and recovery dax operation modes
  dax,pmem: Implement pmem based dax data recovery

 drivers/dax/super.c             | 15 +++---
 drivers/md/dm-linear.c          | 14 +++---
 drivers/md/dm-log-writes.c      | 19 +++++---
 drivers/md/dm-stripe.c          | 14 +++---
 drivers/md/dm-target.c          |  2 +-
 drivers/md/dm-writecache.c      |  8 +--
 drivers/md/dm.c                 | 16 +++---
 drivers/nvdimm/pmem.c           | 86 +++++++++++++++++++++++++++++----
 drivers/nvdimm/pmem.h           |  2 +-
 drivers/s390/block/dcssblk.c    | 13 +++--
 fs/dax.c                        | 32 +++++++++---
 fs/fuse/dax.c                   |  4 +-
 fs/fuse/virtio_fs.c             | 12 +++--
 include/linux/dax.h             | 18 ++++---
 include/linux/device-mapper.h   |  5 +-
 tools/testing/nvdimm/pmem-dax.c |  2 +-
 16 files changed, 187 insertions(+), 75 deletions(-)

-- 
2.18.4



* [PATCH v2 1/2] dax: Introduce normal and recovery dax operation modes
  2021-11-06  1:16 [PATCH v2 0/2] Dax poison recovery Jane Chu
@ 2021-11-06  1:16 ` Jane Chu
  2021-11-06  1:50   ` Darrick J. Wong
  2021-11-06 16:48   ` Dan Williams
  2021-11-06  1:16 ` [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery Jane Chu
  1 sibling, 2 replies; 21+ messages in thread
From: Jane Chu @ 2021-11-06  1:16 UTC (permalink / raw)
  To: david, djwong, dan.j.williams, hch, vishal.l.verma, dave.jiang,
	agk, snitzer, dm-devel, ira.weiny, willy, vgoyal, linux-fsdevel,
	nvdimm, linux-kernel, linux-xfs

Introduce DAX_OP_NORMAL and DAX_OP_RECOVERY operation modes to
{dax_direct_access, dax_copy_from_iter, dax_copy_to_iter}.
DAX_OP_NORMAL is the default (existing) mode, and
DAX_OP_RECOVERY is a new mode for data recovery.

When a dax filesystem suspects that a dax media error may be
encountered on a read or write, it can enact a recovery-mode read
or write by passing DAX_OP_RECOVERY to the aforementioned APIs.
A read in recovery mode attempts to fetch as much data as possible
until the first poisoned page is encountered. A write in recovery
mode attempts to clear the poison(s) in a page-aligned range and
then writes the user-provided data over it.

DAX_OP_NORMAL should be used for all non-recovery code paths.
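
A sketch of the intended calling convention, mirroring the fs/dax.c
change in patch 2: attempt DAX_OP_NORMAL first, and fall back to
DAX_OP_RECOVERY only when the normal attempt fails with -EIO:

	map_len = dax_direct_access(dax_dev, pgoff, nr_pages,
				    DAX_OP_NORMAL, &kaddr, NULL);
	if (map_len == -EIO) {	/* the range contains poison */
		map_len = dax_direct_access(dax_dev, pgoff, nr_pages,
					    DAX_OP_RECOVERY, &kaddr, NULL);
	}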

Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 drivers/dax/super.c             | 15 +++++++++------
 drivers/md/dm-linear.c          | 14 ++++++++------
 drivers/md/dm-log-writes.c      | 19 +++++++++++--------
 drivers/md/dm-stripe.c          | 14 ++++++++------
 drivers/md/dm-target.c          |  2 +-
 drivers/md/dm-writecache.c      |  8 +++++---
 drivers/md/dm.c                 | 14 ++++++++------
 drivers/nvdimm/pmem.c           | 11 ++++++-----
 drivers/nvdimm/pmem.h           |  2 +-
 drivers/s390/block/dcssblk.c    | 13 ++++++++-----
 fs/dax.c                        | 14 ++++++++------
 fs/fuse/dax.c                   |  4 ++--
 fs/fuse/virtio_fs.c             | 12 ++++++++----
 include/linux/dax.h             | 18 +++++++++++-------
 include/linux/device-mapper.h   |  5 +++--
 tools/testing/nvdimm/pmem-dax.c |  2 +-
 16 files changed, 98 insertions(+), 69 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c0910687fbcb..90cae9d84b9c 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -110,6 +110,7 @@ enum dax_device_flags {
  * @dax_dev: a dax_device instance representing the logical memory range
  * @pgoff: offset in pages from the start of the device to translate
  * @nr_pages: number of consecutive pages caller can handle relative to @pfn
+ * @mode: indicate whether dax operation is in normal or recovery mode
  * @kaddr: output parameter that returns a virtual address mapping of pfn
  * @pfn: output parameter that returns an absolute pfn translation of @pgoff
  *
@@ -117,7 +118,7 @@ enum dax_device_flags {
  * pages accessible at the device relative @pgoff.
  */
 long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
-		void **kaddr, pfn_t *pfn)
+		int mode, void **kaddr, pfn_t *pfn)
 {
 	long avail;
 
@@ -131,7 +132,7 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
 		return -EINVAL;
 
 	avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages,
-			kaddr, pfn);
+			mode, kaddr, pfn);
 	if (!avail)
 		return -ERANGE;
 	return min(avail, nr_pages);
@@ -139,22 +140,24 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
 EXPORT_SYMBOL_GPL(dax_direct_access);
 
 size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
-		size_t bytes, struct iov_iter *i)
+		size_t bytes, struct iov_iter *i, int mode)
 {
 	if (!dax_alive(dax_dev))
 		return 0;
 
-	return dax_dev->ops->copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+	return dax_dev->ops->copy_from_iter(dax_dev, pgoff, addr, bytes, i,
+						mode);
 }
 EXPORT_SYMBOL_GPL(dax_copy_from_iter);
 
 size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
-		size_t bytes, struct iov_iter *i)
+		size_t bytes, struct iov_iter *i, int mode)
 {
 	if (!dax_alive(dax_dev))
 		return 0;
 
-	return dax_dev->ops->copy_to_iter(dax_dev, pgoff, addr, bytes, i);
+	return dax_dev->ops->copy_to_iter(dax_dev, pgoff, addr, bytes, i,
+						mode);
 }
 EXPORT_SYMBOL_GPL(dax_copy_to_iter);
 
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 90de42f6743a..c73ac6b98801 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -173,27 +173,29 @@ static struct dax_device *linear_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff)
 }
 
 static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, int mode, void **kaddr, pfn_t *pfn)
 {
 	struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff);
 
-	return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
+	return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
 }
 
 static size_t linear_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
-		void *addr, size_t bytes, struct iov_iter *i)
+		void *addr, size_t bytes, struct iov_iter *i,
+		int mode)
 {
 	struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff);
 
-	return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+	return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i, mode);
 }
 
 static size_t linear_dax_copy_to_iter(struct dm_target *ti, pgoff_t pgoff,
-		void *addr, size_t bytes, struct iov_iter *i)
+		void *addr, size_t bytes, struct iov_iter *i,
+		int mode)
 {
 	struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff);
 
-	return dax_copy_to_iter(dax_dev, pgoff, addr, bytes, i);
+	return dax_copy_to_iter(dax_dev, pgoff, addr, bytes, i, mode);
 }
 
 static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index df3cd78223fb..1e9847f904ef 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -959,16 +959,18 @@ static struct dax_device *log_writes_dax_pgoff(struct dm_target *ti,
 }
 
 static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
-					 long nr_pages, void **kaddr, pfn_t *pfn)
+					 long nr_pages, int mode,
+					 void **kaddr, pfn_t *pfn)
 {
 	struct dax_device *dax_dev = log_writes_dax_pgoff(ti, &pgoff);
 
-	return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
+	return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
 }
 
 static size_t log_writes_dax_copy_from_iter(struct dm_target *ti,
-					    pgoff_t pgoff, void *addr, size_t bytes,
-					    struct iov_iter *i)
+					    pgoff_t pgoff, void *addr,
+					    size_t bytes, struct iov_iter *i,
+					    int mode)
 {
 	struct log_writes_c *lc = ti->private;
 	sector_t sector = pgoff * PAGE_SECTORS;
@@ -985,16 +987,17 @@ static size_t log_writes_dax_copy_from_iter(struct dm_target *ti,
 		return 0;
 	}
 dax_copy:
-	return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+	return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i, mode);
 }
 
 static size_t log_writes_dax_copy_to_iter(struct dm_target *ti,
-					  pgoff_t pgoff, void *addr, size_t bytes,
-					  struct iov_iter *i)
+					  pgoff_t pgoff, void *addr,
+					  size_t bytes, struct iov_iter *i,
+					  int mode)
 {
 	struct dax_device *dax_dev = log_writes_dax_pgoff(ti, &pgoff);
 
-	return dax_copy_to_iter(dax_dev, pgoff, addr, bytes, i);
+	return dax_copy_to_iter(dax_dev, pgoff, addr, bytes, i, mode);
 }
 
 static int log_writes_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 50dba3f39274..4c098452268b 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -317,27 +317,29 @@ static struct dax_device *stripe_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff)
 }
 
 static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, int mode, void **kaddr, pfn_t *pfn)
 {
 	struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff);
 
-	return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
+	return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
 }
 
 static size_t stripe_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
-		void *addr, size_t bytes, struct iov_iter *i)
+		void *addr, size_t bytes, struct iov_iter *i,
+		int mode)
 {
 	struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff);
 
-	return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+	return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i, mode);
 }
 
 static size_t stripe_dax_copy_to_iter(struct dm_target *ti, pgoff_t pgoff,
-		void *addr, size_t bytes, struct iov_iter *i)
+		void *addr, size_t bytes, struct iov_iter *i,
+		int mode)
 {
 	struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff);
 
-	return dax_copy_to_iter(dax_dev, pgoff, addr, bytes, i);
+	return dax_copy_to_iter(dax_dev, pgoff, addr, bytes, i, mode);
 }
 
 static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
index 64dd0b34fcf4..2de1073dbad6 100644
--- a/drivers/md/dm-target.c
+++ b/drivers/md/dm-target.c
@@ -142,7 +142,7 @@ static void io_err_release_clone_rq(struct request *clone,
 }
 
 static long io_err_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, int mode, void **kaddr, pfn_t *pfn)
 {
 	return -EIO;
 }
diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index 0af464a863fe..b2e4ff922fe2 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -286,7 +286,8 @@ static int persistent_memory_claim(struct dm_writecache *wc)
 
 	id = dax_read_lock();
 
-	da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, &wc->memory_map, &pfn);
+	da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, DAX_OP_NORMAL,
+				&wc->memory_map, &pfn);
 	if (da < 0) {
 		wc->memory_map = NULL;
 		r = da;
@@ -308,8 +309,9 @@ static int persistent_memory_claim(struct dm_writecache *wc)
 		i = 0;
 		do {
 			long daa;
-			daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i, p - i,
-						NULL, &pfn);
+			daa = dax_direct_access(wc->ssd_dev->dax_dev,
+						offset + i, p - i,
+						DAX_OP_NORMAL, NULL, &pfn);
 			if (daa <= 0) {
 				r = daa ? daa : -EINVAL;
 				goto err3;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 282008afc465..dc354db22ef9 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1001,7 +1001,8 @@ static struct dm_target *dm_dax_get_live_target(struct mapped_device *md,
 }
 
 static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
-				 long nr_pages, void **kaddr, pfn_t *pfn)
+				 long nr_pages, int mode, void **kaddr,
+				 pfn_t *pfn)
 {
 	struct mapped_device *md = dax_get_private(dax_dev);
 	sector_t sector = pgoff * PAGE_SECTORS;
@@ -1019,7 +1020,7 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
 	if (len < 1)
 		goto out;
 	nr_pages = min(len, nr_pages);
-	ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
+	ret = ti->type->direct_access(ti, pgoff, nr_pages, mode, kaddr, pfn);
 
  out:
 	dm_put_live_table(md, srcu_idx);
@@ -1028,7 +1029,8 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
 }
 
 static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
-				    void *addr, size_t bytes, struct iov_iter *i)
+				    void *addr, size_t bytes,
+				    struct iov_iter *i, int mode)
 {
 	struct mapped_device *md = dax_get_private(dax_dev);
 	sector_t sector = pgoff * PAGE_SECTORS;
@@ -1044,7 +1046,7 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 		ret = copy_from_iter(addr, bytes, i);
 		goto out;
 	}
-	ret = ti->type->dax_copy_from_iter(ti, pgoff, addr, bytes, i);
+	ret = ti->type->dax_copy_from_iter(ti, pgoff, addr, bytes, i, mode);
  out:
 	dm_put_live_table(md, srcu_idx);
 
@@ -1052,7 +1054,7 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 }
 
 static size_t dm_dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
-		void *addr, size_t bytes, struct iov_iter *i)
+		void *addr, size_t bytes, struct iov_iter *i, int mode)
 {
 	struct mapped_device *md = dax_get_private(dax_dev);
 	sector_t sector = pgoff * PAGE_SECTORS;
@@ -1068,7 +1070,7 @@ static size_t dm_dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 		ret = copy_to_iter(addr, bytes, i);
 		goto out;
 	}
-	ret = ti->type->dax_copy_to_iter(ti, pgoff, addr, bytes, i);
+	ret = ti->type->dax_copy_to_iter(ti, pgoff, addr, bytes, i, mode);
  out:
 	dm_put_live_table(md, srcu_idx);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 0d6633987552..3dc99e0bf633 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -255,7 +255,7 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 
 /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
 __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, int mode, void **kaddr, pfn_t *pfn)
 {
 	resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
 
@@ -294,11 +294,12 @@ static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 }
 
 static long pmem_dax_direct_access(struct dax_device *dax_dev,
-		pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
+		pgoff_t pgoff, long nr_pages, int mode, void **kaddr,
+		pfn_t *pfn)
 {
 	struct pmem_device *pmem = dax_get_private(dax_dev);
 
-	return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
+	return __pmem_direct_access(pmem, pgoff, nr_pages, mode, kaddr, pfn);
 }
 
 /*
@@ -308,13 +309,13 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
  * dax_iomap_actor()
  */
 static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
-		void *addr, size_t bytes, struct iov_iter *i)
+		void *addr, size_t bytes, struct iov_iter *i, int mode)
 {
 	return _copy_from_iter_flushcache(addr, bytes, i);
 }
 
 static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
-		void *addr, size_t bytes, struct iov_iter *i)
+		void *addr, size_t bytes, struct iov_iter *i, int mode)
 {
 	return _copy_mc_to_iter(addr, bytes, i);
 }
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 59cfe13ea8a8..bda6a898ba81 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -27,7 +27,7 @@ struct pmem_device {
 };
 
 long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn);
+		long nr_pages, int mode, void **kaddr, pfn_t *pfn);
 
 #ifdef CONFIG_MEMORY_FAILURE
 static inline bool test_and_clear_pmem_poison(struct page *page)
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index e65e83764d1c..fb9f768e12a1 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -32,7 +32,7 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static void dcssblk_submit_bio(struct bio *bio);
 static long dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn);
+		long nr_pages, int mode, void **kaddr, pfn_t *pfn);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -45,13 +45,15 @@ static const struct block_device_operations dcssblk_devops = {
 };
 
 static size_t dcssblk_dax_copy_from_iter(struct dax_device *dax_dev,
-		pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i)
+		pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i,
+		int mode)
 {
 	return copy_from_iter(addr, bytes, i);
 }
 
 static size_t dcssblk_dax_copy_to_iter(struct dax_device *dax_dev,
-		pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i)
+		pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i,
+		int mode)
 {
 	return copy_to_iter(addr, bytes, i);
 }
@@ -62,7 +64,8 @@ static int dcssblk_dax_zero_page_range(struct dax_device *dax_dev,
 	long rc;
 	void *kaddr;
 
-	rc = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, NULL);
+	rc = dax_direct_access(dax_dev, pgoff, nr_pages, DAX_OP_NORMAL,
+			       &kaddr, NULL);
 	if (rc < 0)
 		return rc;
 	memset(kaddr, 0, nr_pages << PAGE_SHIFT);
@@ -941,7 +944,7 @@ __dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff,
 
 static long
 dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, int mode, void **kaddr, pfn_t *pfn)
 {
 	struct dcssblk_dev_info *dev_info = dax_get_private(dax_dev);
 
diff --git a/fs/dax.c b/fs/dax.c
index eb715363fd66..bea6df1498c3 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -735,7 +735,8 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
 		return rc;
 
 	id = dax_read_lock();
-	rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
+	rc = dax_direct_access(dax_dev, pgoff, 1, DAX_OP_NORMAL, &kaddr,
+			       NULL);
 	if (rc < 0) {
 		dax_read_unlock(id);
 		return rc;
@@ -1036,7 +1037,7 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
 		return rc;
 	id = dax_read_lock();
 	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
-				   NULL, pfnp);
+				   DAX_OP_NORMAL, NULL, pfnp);
 	if (length < 0) {
 		rc = length;
 		goto out;
@@ -1162,7 +1163,8 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
 	if (page_aligned)
 		rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
 	else
-		rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
+		rc = dax_direct_access(iomap->dax_dev, pgoff, 1,
+				       DAX_OP_NORMAL, &kaddr, NULL);
 	if (rc < 0) {
 		dax_read_unlock(id);
 		return rc;
@@ -1231,7 +1233,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 			break;
 
 		map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
-				&kaddr, NULL);
+					    DAX_OP_NORMAL, &kaddr, NULL);
 		if (map_len < 0) {
 			ret = map_len;
 			break;
@@ -1250,10 +1252,10 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 		 */
 		if (iov_iter_rw(iter) == WRITE)
 			xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
-					map_len, iter);
+					map_len, iter, DAX_OP_NORMAL);
 		else
 			xfer = dax_copy_to_iter(dax_dev, pgoff, kaddr,
-					map_len, iter);
+					map_len, iter, DAX_OP_NORMAL);
 
 		pos += xfer;
 		length -= xfer;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 713818d74de6..755d8d4b7d34 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -1241,8 +1241,8 @@ static int fuse_dax_mem_range_init(struct fuse_conn_dax *fcd)
 	INIT_DELAYED_WORK(&fcd->free_work, fuse_dax_free_mem_worker);
 
 	id = dax_read_lock();
-	nr_pages = dax_direct_access(fcd->dev, 0, PHYS_PFN(dax_size), NULL,
-				     NULL);
+	nr_pages = dax_direct_access(fcd->dev, 0, PHYS_PFN(dax_size),
+				     DAX_OP_NORMAL, NULL, NULL);
 	dax_read_unlock(id);
 	if (nr_pages < 0) {
 		pr_debug("dax_direct_access() returned %ld\n", nr_pages);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index b4c7c7fa987f..fb5433a37a7b 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -739,7 +739,8 @@ static void virtio_fs_cleanup_vqs(struct virtio_device *vdev,
  * offset.
  */
 static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
-				    long nr_pages, void **kaddr, pfn_t *pfn)
+				    long nr_pages, int mode, void **kaddr,
+				    pfn_t *pfn)
 {
 	struct virtio_fs *fs = dax_get_private(dax_dev);
 	phys_addr_t offset = PFN_PHYS(pgoff);
@@ -755,14 +756,16 @@ static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
 
 static size_t virtio_fs_copy_from_iter(struct dax_device *dax_dev,
 				       pgoff_t pgoff, void *addr,
-				       size_t bytes, struct iov_iter *i)
+				       size_t bytes, struct iov_iter *i,
+				       int mode)
 {
 	return copy_from_iter(addr, bytes, i);
 }
 
 static size_t virtio_fs_copy_to_iter(struct dax_device *dax_dev,
 				       pgoff_t pgoff, void *addr,
-				       size_t bytes, struct iov_iter *i)
+				       size_t bytes, struct iov_iter *i,
+				       int mode)
 {
 	return copy_to_iter(addr, bytes, i);
 }
@@ -773,7 +776,8 @@ static int virtio_fs_zero_page_range(struct dax_device *dax_dev,
 	long rc;
 	void *kaddr;
 
-	rc = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, NULL);
+	rc = dax_direct_access(dax_dev, pgoff, nr_pages, DAX_OP_NORMAL,
+			       &kaddr, NULL);
 	if (rc < 0)
 		return rc;
 	memset(kaddr, 0, nr_pages << PAGE_SHIFT);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 324363b798ec..931586df2905 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -9,6 +9,10 @@
 /* Flag for synchronous flush */
 #define DAXDEV_F_SYNC (1UL << 0)
 
+/* dax operation mode dynamically set by caller */
+#define	DAX_OP_NORMAL		0
+#define	DAX_OP_RECOVERY		1
+
 typedef unsigned long dax_entry_t;
 
 struct dax_device;
@@ -22,8 +26,8 @@ struct dax_operations {
 	 * logical-page-offset into an absolute physical pfn. Return the
 	 * number of pages available for DAX at that pfn.
 	 */
-	long (*direct_access)(struct dax_device *, pgoff_t, long,
-			void **, pfn_t *);
+	long (*direct_access)(struct dax_device *, pgoff_t, long, int,
+				void **, pfn_t *);
 	/*
 	 * Validate whether this device is usable as an fsdax backing
 	 * device.
@@ -32,10 +36,10 @@ struct dax_operations {
 			sector_t, sector_t);
 	/* copy_from_iter: required operation for fs-dax direct-i/o */
 	size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
-			struct iov_iter *);
+			struct iov_iter *, int);
 	/* copy_to_iter: required operation for fs-dax direct-i/o */
 	size_t (*copy_to_iter)(struct dax_device *, pgoff_t, void *, size_t,
-			struct iov_iter *);
+			struct iov_iter *, int);
 	/* zero_page_range: required operation. Zero page range   */
 	int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
 };
@@ -186,11 +190,11 @@ static inline void dax_read_unlock(int id)
 bool dax_alive(struct dax_device *dax_dev);
 void *dax_get_private(struct dax_device *dax_dev);
 long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
-		void **kaddr, pfn_t *pfn);
+		int mode, void **kaddr, pfn_t *pfn);
 size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
-		size_t bytes, struct iov_iter *i);
+		size_t bytes, struct iov_iter *i, int mode);
 size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
-		size_t bytes, struct iov_iter *i);
+		size_t bytes, struct iov_iter *i, int mode);
 int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 			size_t nr_pages);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a7df155ea49b..6596a8e0ceed 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -146,9 +146,10 @@ typedef int (*dm_busy_fn) (struct dm_target *ti);
  * >= 0 : the number of bytes accessible at the address
  */
 typedef long (*dm_dax_direct_access_fn) (struct dm_target *ti, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn);
+		long nr_pages, int mode, void **kaddr, pfn_t *pfn);
 typedef size_t (*dm_dax_copy_iter_fn)(struct dm_target *ti, pgoff_t pgoff,
-		void *addr, size_t bytes, struct iov_iter *i);
+		void *addr, size_t bytes, struct iov_iter *i,
+		int mode);
 typedef int (*dm_dax_zero_page_range_fn)(struct dm_target *ti, pgoff_t pgoff,
 		size_t nr_pages);
 
diff --git a/tools/testing/nvdimm/pmem-dax.c b/tools/testing/nvdimm/pmem-dax.c
index af19c85558e7..71c225630e7e 100644
--- a/tools/testing/nvdimm/pmem-dax.c
+++ b/tools/testing/nvdimm/pmem-dax.c
@@ -8,7 +8,7 @@
 #include <nd.h>
 
 long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, int mode, void **kaddr, pfn_t *pfn)
 {
 	resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
 
-- 
2.18.4



* [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery
  2021-11-06  1:16 [PATCH v2 0/2] Dax poison recovery Jane Chu
  2021-11-06  1:16 ` [PATCH v2 1/2] dax: Introduce normal and recovery dax operation modes Jane Chu
@ 2021-11-06  1:16 ` Jane Chu
  2021-11-06  2:04   ` Darrick J. Wong
  2021-11-09  7:27   ` Christoph Hellwig
  1 sibling, 2 replies; 21+ messages in thread
From: Jane Chu @ 2021-11-06  1:16 UTC (permalink / raw)
  To: david, djwong, dan.j.williams, hch, vishal.l.verma, dave.jiang,
	agk, snitzer, dm-devel, ira.weiny, willy, vgoyal, linux-fsdevel,
	nvdimm, linux-kernel, linux-xfs

For /dev/pmem based dax, enable DAX_OP_RECOVERY mode in
dax_direct_access to translate 'kaddr' over a range that
may contain poison(s); enable dax_copy_to_iter to
read as much data as possible up until a poisoned page is
encountered; and enable dax_copy_from_iter to clear poison
within a page-aligned range and then write the good data over it.

Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 drivers/md/dm.c       |  2 ++
 drivers/nvdimm/pmem.c | 75 ++++++++++++++++++++++++++++++++++++++++---
 fs/dax.c              | 24 +++++++++++---
 3 files changed, 92 insertions(+), 9 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dc354db22ef9..9b3dac916f22 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1043,6 +1043,7 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 	if (!ti)
 		goto out;
 	if (!ti->type->dax_copy_from_iter) {
+		WARN_ON(mode == DAX_OP_RECOVERY);
 		ret = copy_from_iter(addr, bytes, i);
 		goto out;
 	}
@@ -1067,6 +1068,7 @@ static size_t dm_dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 	if (!ti)
 		goto out;
 	if (!ti->type->dax_copy_to_iter) {
+		WARN_ON(mode == DAX_OP_RECOVERY);
 		ret = copy_to_iter(addr, bytes, i);
 		goto out;
 	}
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3dc99e0bf633..8ae6aa678c51 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -260,7 +260,7 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
 	resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
 
 	if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
-					PFN_PHYS(nr_pages))))
+				 PFN_PHYS(nr_pages)) && mode == DAX_OP_NORMAL))
 		return -EIO;
 
 	if (kaddr)
@@ -303,20 +303,85 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
 }
 
 /*
- * Use the 'no check' versions of copy_from_iter_flushcache() and
- * copy_mc_to_iter() to bypass HARDENED_USERCOPY overhead. Bounds
- * checking, both file offset and device offset, is handled by
- * dax_iomap_actor()
+ * Even though the 'no check' versions of copy_from_iter_flushcache()
+ * and copy_mc_to_iter() are used to bypass HARDENED_USERCOPY overhead,
+ * 'read'/'write' aren't always safe when poison is consumed. They happen
+ * to be safe because the 'read'/'write' range has been guaranteed to
+ * be free of poison(s) by a prior call to dax_direct_access() on the
+ * caller stack.
+ * But on a data recovery code path, the 'read'/'write' range is expected
+ * to contain poison(s), and so poison(s) are explicitly checked, such that
+ * 'read' can fetch data from clean page(s) up until the first poison is
+ * encountered, and 'write' requires the range to be page aligned in order
+ * to restore the poisoned page's memory type back to "rw" after clearing
+ * the poison(s).
+ * In the event of a poison-related failure, (size_t) -EIO is returned and
+ * the caller may check the return value after casting it to (ssize_t).
+ *
+ * TODO: add support for CPUs that support MOVDIR64B instruction for
+ * faster poison clearing, and possibly smaller error blast radius.
  */
 static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 		void *addr, size_t bytes, struct iov_iter *i, int mode)
 {
+	phys_addr_t pmem_off;
+	size_t len, lead_off;
+	struct pmem_device *pmem = dax_get_private(dax_dev);
+	struct device *dev = pmem->bb.dev;
+
+	if (unlikely(mode == DAX_OP_RECOVERY)) {
+		lead_off = (unsigned long)addr & ~PAGE_MASK;
+		len = PFN_PHYS(PFN_UP(lead_off + bytes));
+		if (is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, len)) {
+			if (lead_off || !(PAGE_ALIGNED(bytes))) {
+				dev_warn(dev, "Found poison, but addr(%p) and/or bytes(%#lx) not page aligned\n",
+					addr, bytes);
+				return (size_t) -EIO;
+			}
+			pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
+			if (pmem_clear_poison(pmem, pmem_off, bytes) !=
+						BLK_STS_OK)
+				return (size_t) -EIO;
+		}
+	}
+
 	return _copy_from_iter_flushcache(addr, bytes, i);
 }
 
 static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 		void *addr, size_t bytes, struct iov_iter *i, int mode)
 {
+	int num_bad;
+	size_t len, lead_off;
+	unsigned long bad_pfn;
+	bool bad_pmem = false;
+	size_t adj_len = bytes;
+	sector_t sector, first_bad;
+	struct pmem_device *pmem = dax_get_private(dax_dev);
+	struct device *dev = pmem->bb.dev;
+
+	if (unlikely(mode == DAX_OP_RECOVERY)) {
+		sector = PFN_PHYS(pgoff) / 512;
+		lead_off = (unsigned long)addr & ~PAGE_MASK;
+		len = PFN_PHYS(PFN_UP(lead_off + bytes));
+		if (pmem->bb.count)
+			bad_pmem = !!badblocks_check(&pmem->bb, sector,
+					len / 512, &first_bad, &num_bad);
+		if (bad_pmem) {
+			bad_pfn = PHYS_PFN(first_bad * 512);
+			if (bad_pfn == pgoff) {
+				dev_warn(dev, "Found poison in page: pgoff(%#lx)\n",
+					pgoff);
+				return -EIO;
+			}
+			adj_len = PFN_PHYS(bad_pfn - pgoff) - lead_off;
+			dev_WARN_ONCE(dev, (adj_len > bytes),
+					"out-of-range first_bad?");
+		}
+		if (adj_len == 0)
+			return (size_t) -EIO;
+	}
+
 	return _copy_mc_to_iter(addr, bytes, i);
 }
 
diff --git a/fs/dax.c b/fs/dax.c
index bea6df1498c3..7640be6b6a97 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1219,6 +1219,8 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 		unsigned offset = pos & (PAGE_SIZE - 1);
 		const size_t size = ALIGN(length + offset, PAGE_SIZE);
 		const sector_t sector = dax_iomap_sector(iomap, pos);
+		long nr_page = PHYS_PFN(size);
+		int dax_mode = DAX_OP_NORMAL;
 		ssize_t map_len;
 		pgoff_t pgoff;
 		void *kaddr;
@@ -1232,8 +1234,13 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 		if (ret)
 			break;
 
-		map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
-					    DAX_OP_NORMAL, &kaddr, NULL);
+		map_len = dax_direct_access(dax_dev, pgoff, nr_page, dax_mode,
+					    &kaddr, NULL);
+		if (unlikely(map_len == -EIO)) {
+			dax_mode = DAX_OP_RECOVERY;
+			map_len = dax_direct_access(dax_dev, pgoff, nr_page,
+						    dax_mode, &kaddr, NULL);
+		}
 		if (map_len < 0) {
 			ret = map_len;
 			break;
@@ -1252,11 +1259,20 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 		 */
 		if (iov_iter_rw(iter) == WRITE)
 			xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
-					map_len, iter, DAX_OP_NORMAL);
+					map_len, iter, dax_mode);
 		else
 			xfer = dax_copy_to_iter(dax_dev, pgoff, kaddr,
-					map_len, iter, DAX_OP_NORMAL);
+					map_len, iter, dax_mode);
 
+		/*
+		 * If dax data recovery is enacted via DAX_OP_RECOVERY,
+		 * recovery failure would be indicated by a -EIO return
+		 * in 'xfer' casted as (size_t).
+		 */
+		if ((ssize_t)xfer == -EIO) {
+			ret = -EIO;
+			break;
+		}
 		pos += xfer;
 		length -= xfer;
 		done += xfer;
-- 
2.18.4



* Re: [PATCH v2 1/2] dax: Introduce normal and recovery dax operation modes
  2021-11-06  1:16 ` [PATCH v2 1/2] dax: Introduce normal and recovery dax operation modes Jane Chu
@ 2021-11-06  1:50   ` Darrick J. Wong
  2021-11-08 20:43     ` Jane Chu
  2021-11-06 16:48   ` Dan Williams
  1 sibling, 1 reply; 21+ messages in thread
From: Darrick J. Wong @ 2021-11-06  1:50 UTC (permalink / raw)
  To: Jane Chu
  Cc: david, dan.j.williams, hch, vishal.l.verma, dave.jiang, agk,
	snitzer, dm-devel, ira.weiny, willy, vgoyal, linux-fsdevel,
	nvdimm, linux-kernel, linux-xfs

On Fri, Nov 05, 2021 at 07:16:37PM -0600, Jane Chu wrote:
> Introduce DAX_OP_NORMAL and DAX_OP_RECOVERY operation modes to
> {dax_direct_access, dax_copy_from_iter, dax_copy_to_iter}.
> DAX_OP_NORMAL is the default (existing) mode, and
> DAX_OP_RECOVERY is a new mode for data recovery.
> 
> When a dax filesystem suspects that a dax media error may be
> encountered on a read or write, it can enact a recovery-mode read
> or write by passing DAX_OP_RECOVERY to the aforementioned APIs.
> A read in recovery mode attempts to fetch as much data as possible
> until the first poisoned page is encountered. A write in recovery
> mode attempts to clear the poison(s) in a page-aligned range and
> then writes the user-provided data over it.
> 
> DAX_OP_NORMAL should be used for all non-recovery code paths.
> 
> Signed-off-by: Jane Chu <jane.chu@oracle.com>
> ---
>  drivers/dax/super.c             | 15 +++++++++------
>  drivers/md/dm-linear.c          | 14 ++++++++------
>  drivers/md/dm-log-writes.c      | 19 +++++++++++--------
>  drivers/md/dm-stripe.c          | 14 ++++++++------
>  drivers/md/dm-target.c          |  2 +-
>  drivers/md/dm-writecache.c      |  8 +++++---
>  drivers/md/dm.c                 | 14 ++++++++------
>  drivers/nvdimm/pmem.c           | 11 ++++++-----
>  drivers/nvdimm/pmem.h           |  2 +-
>  drivers/s390/block/dcssblk.c    | 13 ++++++++-----
>  fs/dax.c                        | 14 ++++++++------
>  fs/fuse/dax.c                   |  4 ++--
>  fs/fuse/virtio_fs.c             | 12 ++++++++----
>  include/linux/dax.h             | 18 +++++++++++-------
>  include/linux/device-mapper.h   |  5 +++--
>  tools/testing/nvdimm/pmem-dax.c |  2 +-
>  16 files changed, 98 insertions(+), 69 deletions(-)
> 

<snip>

> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 324363b798ec..931586df2905 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -9,6 +9,10 @@
>  /* Flag for synchronous flush */
>  #define DAXDEV_F_SYNC (1UL << 0)
>  
> +/* dax operation mode dynamically set by caller */
> +#define	DAX_OP_NORMAL		0
> +#define	DAX_OP_RECOVERY		1

Mostly looks ok to me, but since this is an operation mode, should this
be an enum instead of an int?
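
Something like this, say (naming hypothetical):

	enum dax_op_mode {
		DAX_OP_NORMAL,
		DAX_OP_RECOVERY,
	};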

Granted I also think six arguments is a lot... though I don't really
see any better way to do this.

(Dunno, I spent all day running internal patches through the process
gauntlet so this is the remaining 2% of my brain speaking...)

--D

> +
>  typedef unsigned long dax_entry_t;
>  
>  struct dax_device;
> @@ -22,8 +26,8 @@ struct dax_operations {
>  	 * logical-page-offset into an absolute physical pfn. Return the
>  	 * number of pages available for DAX at that pfn.
>  	 */
> -	long (*direct_access)(struct dax_device *, pgoff_t, long,
> -			void **, pfn_t *);
> +	long (*direct_access)(struct dax_device *, pgoff_t, long, int,
> +				void **, pfn_t *);
>  	/*
>  	 * Validate whether this device is usable as an fsdax backing
>  	 * device.
> @@ -32,10 +36,10 @@ struct dax_operations {
>  			sector_t, sector_t);
>  	/* copy_from_iter: required operation for fs-dax direct-i/o */
>  	size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
> -			struct iov_iter *);
> +			struct iov_iter *, int);
>  	/* copy_to_iter: required operation for fs-dax direct-i/o */
>  	size_t (*copy_to_iter)(struct dax_device *, pgoff_t, void *, size_t,
> -			struct iov_iter *);
> +			struct iov_iter *, int);
>  	/* zero_page_range: required operation. Zero page range   */
>  	int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
>  };
> @@ -186,11 +190,11 @@ static inline void dax_read_unlock(int id)
>  bool dax_alive(struct dax_device *dax_dev);
>  void *dax_get_private(struct dax_device *dax_dev);
>  long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
> -		void **kaddr, pfn_t *pfn);
> +		int mode, void **kaddr, pfn_t *pfn);
>  size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
> -		size_t bytes, struct iov_iter *i);
> +		size_t bytes, struct iov_iter *i, int mode);
>  size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
> -		size_t bytes, struct iov_iter *i);
> +		size_t bytes, struct iov_iter *i, int mode);
>  int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
>  			size_t nr_pages);
>  void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
> diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> index a7df155ea49b..6596a8e0ceed 100644
> --- a/include/linux/device-mapper.h
> +++ b/include/linux/device-mapper.h
> @@ -146,9 +146,10 @@ typedef int (*dm_busy_fn) (struct dm_target *ti);
>   * >= 0 : the number of bytes accessible at the address
>   */
>  typedef long (*dm_dax_direct_access_fn) (struct dm_target *ti, pgoff_t pgoff,
> -		long nr_pages, void **kaddr, pfn_t *pfn);
> +		long nr_pages, int mode, void **kaddr, pfn_t *pfn);
>  typedef size_t (*dm_dax_copy_iter_fn)(struct dm_target *ti, pgoff_t pgoff,
> -		void *addr, size_t bytes, struct iov_iter *i);
> +		void *addr, size_t bytes, struct iov_iter *i,
> +		int mode);
>  typedef int (*dm_dax_zero_page_range_fn)(struct dm_target *ti, pgoff_t pgoff,
>  		size_t nr_pages);
>  
> diff --git a/tools/testing/nvdimm/pmem-dax.c b/tools/testing/nvdimm/pmem-dax.c
> index af19c85558e7..71c225630e7e 100644
> --- a/tools/testing/nvdimm/pmem-dax.c
> +++ b/tools/testing/nvdimm/pmem-dax.c
> @@ -8,7 +8,7 @@
>  #include <nd.h>
>  
>  long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
> -		long nr_pages, void **kaddr, pfn_t *pfn)
> +		long nr_pages, int mode, void **kaddr, pfn_t *pfn)
>  {
>  	resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
>  
> -- 
> 2.18.4
> 


* Re: [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery
  2021-11-06  1:16 ` [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery Jane Chu
@ 2021-11-06  2:04   ` Darrick J. Wong
  2021-11-08 20:53     ` Jane Chu
  2021-11-08 21:00     ` Jane Chu
  2021-11-09  7:27   ` Christoph Hellwig
  1 sibling, 2 replies; 21+ messages in thread
From: Darrick J. Wong @ 2021-11-06  2:04 UTC (permalink / raw)
  To: Jane Chu
  Cc: david, dan.j.williams, hch, vishal.l.verma, dave.jiang, agk,
	snitzer, dm-devel, ira.weiny, willy, vgoyal, linux-fsdevel,
	nvdimm, linux-kernel, linux-xfs

On Fri, Nov 05, 2021 at 07:16:38PM -0600, Jane Chu wrote:
> For /dev/pmem based dax, enable DAX_OP_RECOVERY mode in
> dax_direct_access to translate 'kaddr' over a range that
> may contain poison(s); enable dax_copy_to_iter to
> read as much data as possible up until a poisoned page is
> encountered; and enable dax_copy_from_iter to clear poison
> within a page-aligned range and then write the good data over it.
> 
> Signed-off-by: Jane Chu <jane.chu@oracle.com>
> ---
>  drivers/md/dm.c       |  2 ++
>  drivers/nvdimm/pmem.c | 75 ++++++++++++++++++++++++++++++++++++++++---
>  fs/dax.c              | 24 +++++++++++---
>  3 files changed, 92 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index dc354db22ef9..9b3dac916f22 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1043,6 +1043,7 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>  	if (!ti)
>  		goto out;
>  	if (!ti->type->dax_copy_from_iter) {
> +		WARN_ON(mode == DAX_OP_RECOVERY);
>  		ret = copy_from_iter(addr, bytes, i);
>  		goto out;
>  	}
> @@ -1067,6 +1068,7 @@ static size_t dm_dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>  	if (!ti)
>  		goto out;
>  	if (!ti->type->dax_copy_to_iter) {
> +		WARN_ON(mode == DAX_OP_RECOVERY);

Maybe just return -EOPNOTSUPP here?

Warnings are kinda loud.

>  		ret = copy_to_iter(addr, bytes, i);
>  		goto out;
>  	}
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 3dc99e0bf633..8ae6aa678c51 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -260,7 +260,7 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
>  	resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
>  
>  	if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
> -					PFN_PHYS(nr_pages))))
> +				 PFN_PHYS(nr_pages)) && mode == DAX_OP_NORMAL))
>  		return -EIO;
>  
>  	if (kaddr)
> @@ -303,20 +303,85 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
>  }
>  
>  /*
> - * Use the 'no check' versions of copy_from_iter_flushcache() and
> - * copy_mc_to_iter() to bypass HARDENED_USERCOPY overhead. Bounds
> - * checking, both file offset and device offset, is handled by
> - * dax_iomap_actor()
> + * Even though the 'no check' versions of copy_from_iter_flushcache()
> + * and copy_mc_to_iter() are used to bypass HARDENED_USERCOPY overhead,
> + * 'read'/'write' aren't always safe when poison is consumed. They happen
> + * to be safe because the 'read'/'write' range has been guaranteed to
> + * be free of poison(s) by a prior call to dax_direct_access() on the
> + * caller stack.
> + * But on a data recovery code path, the 'read'/'write' range is expected
> + * to contain poison(s), and so poison(s) are explicitly checked, such that
> + * 'read' can fetch data from clean page(s) up until the first poison is
> + * encountered, and 'write' requires the range to be page aligned in order
> + * to restore the poisoned page's memory type back to "rw" after clearing
> + * the poison(s).
> + * In the event of a poison-related failure, (size_t) -EIO is returned and
> + * the caller may check the return value after casting it to (ssize_t).
> + *
> + * TODO: add support for CPUs that support MOVDIR64B instruction for
> + * faster poison clearing, and possibly smaller error blast radius.

I get that it's still early days yet for whatever pmem stuff is going on
for 5.17, but I feel like this ought to be a separate function called by
pmem_copy_from_iter, with this huge comment attached to that recovery
function.

>   */
>  static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>  		void *addr, size_t bytes, struct iov_iter *i, int mode)
>  {
> +	phys_addr_t pmem_off;
> +	size_t len, lead_off;
> +	struct pmem_device *pmem = dax_get_private(dax_dev);
> +	struct device *dev = pmem->bb.dev;
> +
> +	if (unlikely(mode == DAX_OP_RECOVERY)) {
> +		lead_off = (unsigned long)addr & ~PAGE_MASK;
> +		len = PFN_PHYS(PFN_UP(lead_off + bytes));
> +		if (is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, len)) {
> +			if (lead_off || !(PAGE_ALIGNED(bytes))) {
> +				dev_warn(dev, "Found poison, but addr(%p) and/or bytes(%#lx) not page aligned\n",
> +					addr, bytes);
> +				return (size_t) -EIO;
> +			}
> +			pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
> +			if (pmem_clear_poison(pmem, pmem_off, bytes) !=
> +						BLK_STS_OK)
> +				return (size_t) -EIO;

Looks reasonable enough to me, though you might want to restructure this
to reduce the amount of indent.

FWIW I dislike how is_bad_pmem mixes units (sector_t vs. bytes); that
was seriously confusing.  But I guess that's a weird quirk of the
badblocks API and .... ugh.

(I dunno, can we at least clean up the nvdimm parts and some day replace
the badblocks backend with something that can handle more than 16
records?  interval_tree is more than up to that task, I know, I use it
for xfs online fsck...)
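
(To make the unit mixing concrete, this is the check from the hunk
above -- the second argument is in 512-byte sectors, the third is a
byte count:)

	if (is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,	/* sectors */
			PFN_PHYS(nr_pages)))			/* bytes */
		return -EIO;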

> +		}
> +	}
> +
>  	return _copy_from_iter_flushcache(addr, bytes, i);
>  }
>  
>  static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>  		void *addr, size_t bytes, struct iov_iter *i, int mode)
>  {
> +	int num_bad;
> +	size_t len, lead_off;
> +	unsigned long bad_pfn;
> +	bool bad_pmem = false;
> +	size_t adj_len = bytes;
> +	sector_t sector, first_bad;
> +	struct pmem_device *pmem = dax_get_private(dax_dev);
> +	struct device *dev = pmem->bb.dev;
> +
> +	if (unlikely(mode == DAX_OP_RECOVERY)) {
> +		sector = PFN_PHYS(pgoff) / 512;
> +		lead_off = (unsigned long)addr & ~PAGE_MASK;
> +		len = PFN_PHYS(PFN_UP(lead_off + bytes));
> +		if (pmem->bb.count)
> +			bad_pmem = !!badblocks_check(&pmem->bb, sector,
> +					len / 512, &first_bad, &num_bad);
> +		if (bad_pmem) {
> +			bad_pfn = PHYS_PFN(first_bad * 512);
> +			if (bad_pfn == pgoff) {
> +				dev_warn(dev, "Found poison in page: pgoff(%#lx)\n",
> +					pgoff);
> +				return -EIO;
> +			}
> +			adj_len = PFN_PHYS(bad_pfn - pgoff) - lead_off;
> +			dev_WARN_ONCE(dev, (adj_len > bytes),
> +					"out-of-range first_bad?");
> +		}
> +		if (adj_len == 0)
> +			return (size_t) -EIO;

Uh, are we supposed to adjust bytes here or something?

--D

> +	}
> +
>  	return _copy_mc_to_iter(addr, bytes, i);
>  }
>  
> diff --git a/fs/dax.c b/fs/dax.c
> index bea6df1498c3..7640be6b6a97 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1219,6 +1219,8 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
>  		unsigned offset = pos & (PAGE_SIZE - 1);
>  		const size_t size = ALIGN(length + offset, PAGE_SIZE);
>  		const sector_t sector = dax_iomap_sector(iomap, pos);
> +		long nr_page = PHYS_PFN(size);
> +		int dax_mode = DAX_OP_NORMAL;
>  		ssize_t map_len;
>  		pgoff_t pgoff;
>  		void *kaddr;
> @@ -1232,8 +1234,13 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
>  		if (ret)
>  			break;
>  
> -		map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
> -					    DAX_OP_NORMAL, &kaddr, NULL);
> +		map_len = dax_direct_access(dax_dev, pgoff, nr_page, dax_mode,
> +					    &kaddr, NULL);
> +		if (unlikely(map_len == -EIO)) {
> +			dax_mode = DAX_OP_RECOVERY;
> +			map_len = dax_direct_access(dax_dev, pgoff, nr_page,
> +						    dax_mode, &kaddr, NULL);
> +		}
>  		if (map_len < 0) {
>  			ret = map_len;
>  			break;
> @@ -1252,11 +1259,20 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
>  		 */
>  		if (iov_iter_rw(iter) == WRITE)
>  			xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
> -					map_len, iter, DAX_OP_NORMAL);
> +					map_len, iter, dax_mode);
>  		else
>  			xfer = dax_copy_to_iter(dax_dev, pgoff, kaddr,
> -					map_len, iter, DAX_OP_NORMAL);
> +					map_len, iter, dax_mode);
>  
> +		/*
> +		 * If dax data recovery is enacted via DAX_OP_RECOVERY,
> +		 * recovery failure would be indicated by a -EIO return
> +		 * in 'xfer' casted as (size_t).
> +		 */
> +		if ((ssize_t)xfer == -EIO) {
> +			ret = -EIO;
> +			break;
> +		}
>  		pos += xfer;
>  		length -= xfer;
>  		done += xfer;
> -- 
> 2.18.4
> 


* Re: [PATCH v2 1/2] dax: Introduce normal and recovery dax operation modes
  2021-11-06  1:16 ` [PATCH v2 1/2] dax: Introduce normal and recovery dax operation modes Jane Chu
  2021-11-06  1:50   ` Darrick J. Wong
@ 2021-11-06 16:48   ` Dan Williams
  2021-11-08 21:02     ` Jane Chu
  1 sibling, 1 reply; 21+ messages in thread
From: Dan Williams @ 2021-11-06 16:48 UTC (permalink / raw)
  To: Jane Chu
  Cc: david, Darrick J. Wong, Christoph Hellwig, Vishal L Verma,
	Dave Jiang, Alasdair Kergon, Mike Snitzer,
	device-mapper development, Weiny, Ira, Matthew Wilcox,
	Vivek Goyal, linux-fsdevel, Linux NVDIMM,
	Linux Kernel Mailing List, linux-xfs

On Fri, Nov 5, 2021 at 6:17 PM Jane Chu <jane.chu@oracle.com> wrote:
>
> Introduce DAX_OP_NORMAL and DAX_OP_RECOVERY operation modes to
> {dax_direct_access, dax_copy_from_iter, dax_copy_to_iter}.
> DAX_OP_NORMAL is the default (existing) mode, and
> DAX_OP_RECOVERY is a new mode for data recovery.
>
> When a dax filesystem suspects that a dax media error may be
> encountered on a read or write, it can enact a recovery-mode read
> or write by passing DAX_OP_RECOVERY to the aforementioned APIs.
> A read in recovery mode attempts to fetch as much data as possible
> until the first poisoned page is encountered. A write in recovery
> mode attempts to clear the poison(s) in a page-aligned range and
> then writes the user-provided data over it.
>
> DAX_OP_NORMAL should be used for all non-recovery code paths.
>
> Signed-off-by: Jane Chu <jane.chu@oracle.com>
[..]
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 324363b798ec..931586df2905 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -9,6 +9,10 @@
>  /* Flag for synchronous flush */
>  #define DAXDEV_F_SYNC (1UL << 0)
>
> +/* dax operation mode dynamically set by caller */
> +#define        DAX_OP_NORMAL           0

Perhaps this should be called DAX_OP_FAILFAST?

> +#define        DAX_OP_RECOVERY         1
> +
>  typedef unsigned long dax_entry_t;
>
>  struct dax_device;
> @@ -22,8 +26,8 @@ struct dax_operations {
>          * logical-page-offset into an absolute physical pfn. Return the
>          * number of pages available for DAX at that pfn.
>          */
> -       long (*direct_access)(struct dax_device *, pgoff_t, long,
> -                       void **, pfn_t *);
> +       long (*direct_access)(struct dax_device *, pgoff_t, long, int,

Would be nice if that 'int' was an enum, but I'm not sure a new
parameter is needed at all, see below...

> +                               void **, pfn_t *);
>         /*
>          * Validate whether this device is usable as an fsdax backing
>          * device.
> @@ -32,10 +36,10 @@ struct dax_operations {
>                         sector_t, sector_t);
>         /* copy_from_iter: required operation for fs-dax direct-i/o */
>         size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
> -                       struct iov_iter *);
> +                       struct iov_iter *, int);

I'm not sure the flag is needed here, as the "void *" could carry a
flag in the pointer to indicate that it is a recovery kaddr.

>         /* copy_to_iter: required operation for fs-dax direct-i/o */
>         size_t (*copy_to_iter)(struct dax_device *, pgoff_t, void *, size_t,
> -                       struct iov_iter *);
> +                       struct iov_iter *, int);

Same comment here.

>         /* zero_page_range: required operation. Zero page range   */
>         int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
>  };
> @@ -186,11 +190,11 @@ static inline void dax_read_unlock(int id)
>  bool dax_alive(struct dax_device *dax_dev);
>  void *dax_get_private(struct dax_device *dax_dev);
>  long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
> -               void **kaddr, pfn_t *pfn);
> +               int mode, void **kaddr, pfn_t *pfn);

How about the dax_direct_access() calling convention stays the same,
but kaddr is optionally updated to carry a flag in its lower unused
bits? So:

void **kaddr = NULL; /* caller only cares about the pfn */

void *failfast = NULL;
void **kaddr = &failfast; /* caller wants -EIO not recovery */

void *recovery = (void *) DAX_OP_RECOVERY;
void **kaddr = &recovery; /* caller wants to carefully access page(s)
containing poison */
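
The driver side could then decode the request from the pointer,
roughly (a sketch of the idea, not a settled API):

	unsigned long req = kaddr ? (unsigned long)*kaddr : 0;

	if (req & DAX_OP_RECOVERY) {
		/* map carefully; the page(s) may contain poison */
	} else if (kaddr) {
		/* failfast: return -EIO if the range has known poison */
	}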


* Re: [PATCH v2 1/2] dax: Introduce normal and recovery dax operation modes
  2021-11-06  1:50   ` Darrick J. Wong
@ 2021-11-08 20:43     ` Jane Chu
  0 siblings, 0 replies; 21+ messages in thread
From: Jane Chu @ 2021-11-08 20:43 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, dan.j.williams, hch, vishal.l.verma, dave.jiang, agk,
	snitzer, dm-devel, ira.weiny, willy, vgoyal, linux-fsdevel,
	nvdimm, linux-kernel, linux-xfs

On 11/5/2021 6:50 PM, Darrick J. Wong wrote:
> 
> <snip>
> 
>> diff --git a/include/linux/dax.h b/include/linux/dax.h
>> index 324363b798ec..931586df2905 100644
>> --- a/include/linux/dax.h
>> +++ b/include/linux/dax.h
>> @@ -9,6 +9,10 @@
>>   /* Flag for synchronous flush */
>>   #define DAXDEV_F_SYNC (1UL << 0)
>>   
>> +/* dax operation mode dynamically set by caller */
>> +#define	DAX_OP_NORMAL		0
>> +#define	DAX_OP_RECOVERY		1
> 
> Mostly looks ok to me, but since this is an operation mode, should this
> be an enum instead of an int?

Yeah, I tried an enum at first, and then noticed that the
new dax enum type would need to be introduced to device-mapper.h
by either including dax.h or defining a mirrored enum, and
I wondered if that would be overkill, so I ended up
settling on #define.

> 
> Granted I also think six arguments is a lot... though I don't really
> see any better way to do this.

Dan has a suggestion, and that'll reduce the number of args to 5.
> 
> (Dunno, I spent all day running internal patches through the process
> gauntlet so this is the remaining 2% of my brain speaking...)

Thanks!
-jane

> 
> --D
> 


* Re: [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery
  2021-11-06  2:04   ` Darrick J. Wong
@ 2021-11-08 20:53     ` Jane Chu
  2021-11-08 21:00     ` Jane Chu
  1 sibling, 0 replies; 21+ messages in thread
From: Jane Chu @ 2021-11-08 20:53 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, dan.j.williams, hch, vishal.l.verma, dave.jiang, agk,
	snitzer, dm-devel, ira.weiny, willy, vgoyal, linux-fsdevel,
	nvdimm, linux-kernel, linux-xfs

On 11/5/2021 7:04 PM, Darrick J. Wong wrote:
<snip>
>>
>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>> index dc354db22ef9..9b3dac916f22 100644
>> --- a/drivers/md/dm.c
>> +++ b/drivers/md/dm.c
>> @@ -1043,6 +1043,7 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>>   	if (!ti)
>>   		goto out;
>>   	if (!ti->type->dax_copy_from_iter) {
>> +		WARN_ON(mode == DAX_OP_RECOVERY);
>>   		ret = copy_from_iter(addr, bytes, i);
>>   		goto out;
>>   	}
>> @@ -1067,6 +1068,7 @@ static size_t dm_dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>>   	if (!ti)
>>   		goto out;
>>   	if (!ti->type->dax_copy_to_iter) {
>> +		WARN_ON(mode == DAX_OP_RECOVERY);
> 
> Maybe just return -EOPNOTSUPP here?
> 
> Warnings are kinda loud.
> 

Indeed.  Looks like the
   "if (!ti->type->dax_copy_to_iter) {"
clause was there to allow mixed dax targets in dm, such as the dcss,
fuse and virtio_fs targets. These targets either don't export
.dax_copy_from/to_iter, or don't need to.
And their .dax_direct_access doesn't check for poison, and can't
repair poison anyway.

I think these targets may safely ignore the flag.  However, returning
-EOPNOTSUPP is helpful for catching future bugs, such as someone
adding a method to detect poison without adding a method to clear
poison; in that case, we fail the call.
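
As a sketch against the quoted dm_dax_copy_to_iter() hunk (the exact
error value and its placement are only illustrative):

  if (!ti->type->dax_copy_to_iter) {
          if (mode == DAX_OP_RECOVERY) {
                  /* target can neither detect nor clear poison */
                  ret = (size_t)-EOPNOTSUPP;
                  goto out;
          }
          ret = copy_to_iter(addr, bytes, i);
          goto out;
  }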

Dan, do you have a preference?

thanks!
-jane


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery
  2021-11-06  2:04   ` Darrick J. Wong
  2021-11-08 20:53     ` Jane Chu
@ 2021-11-08 21:00     ` Jane Chu
  1 sibling, 0 replies; 21+ messages in thread
From: Jane Chu @ 2021-11-08 21:00 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, dan.j.williams, hch, vishal.l.verma, dave.jiang, agk,
	snitzer, dm-devel, ira.weiny, willy, vgoyal, linux-fsdevel,
	nvdimm, linux-kernel, linux-xfs

On 11/5/2021 7:04 PM, Darrick J. Wong wrote:
<snip>
>> @@ -303,20 +303,85 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
>>   }
>>   
>>   /*
>> - * Use the 'no check' versions of copy_from_iter_flushcache() and
>> - * copy_mc_to_iter() to bypass HARDENED_USERCOPY overhead. Bounds
>> - * checking, both file offset and device offset, is handled by
>> - * dax_iomap_actor()
>> + * Even though the 'no check' versions of copy_from_iter_flushcache()
>> + * and copy_mc_to_iter() are used to bypass HARDENED_USERCOPY overhead,
>> + * 'read'/'write' aren't always safe when poison is consumed. They happen
>> + * to be safe because the 'read'/'write' range has been guaranteed to
>> + * be free of poison(s) by a prior call to dax_direct_access() on the
>> + * caller stack.
>> + * But on a data recovery code path, the 'read'/'write' range is expected
>> + * to contain poison(s), and so poison(s) is explicitly checked, such that
>> + * 'read' can fetch data from clean page(s) until the first poison is
>> + * encountered, and 'write' requires the range to be page aligned in order
>> + * to restore the poisoned page's memory type back to "rw" after clearing
>> + * the poison(s).
>> + * In the event of poison related failure, (size_t) -EIO is returned and
>> + * caller may check the return value after casting it to (ssize_t).
>> + *
>> + * TODO: add support for CPUs that support MOVDIR64B instruction for
>> + * faster poison clearing, and possibly smaller error blast radius.
> 
> I get that it's still early days yet for whatever pmem stuff is going on
> for 5.17, but I feel like this ought to be a separate function called by
> pmem_copy_from_iter, with this huge comment attached to that recovery
> function.

Thanks, will refactor both functions.

> 
>>    */
>>   static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>>   		void *addr, size_t bytes, struct iov_iter *i, int mode)
>>   {
>> +	phys_addr_t pmem_off;
>> +	size_t len, lead_off;
>> +	struct pmem_device *pmem = dax_get_private(dax_dev);
>> +	struct device *dev = pmem->bb.dev;
>> +
>> +	if (unlikely(mode == DAX_OP_RECOVERY)) {
>> +		lead_off = (unsigned long)addr & ~PAGE_MASK;
>> +		len = PFN_PHYS(PFN_UP(lead_off + bytes));
>> +		if (is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, len)) {
>> +			if (lead_off || !(PAGE_ALIGNED(bytes))) {
>> +				dev_warn(dev, "Found poison, but addr(%p) and/or bytes(%#lx) not page aligned\n",
>> +					addr, bytes);
>> +				return (size_t) -EIO;
>> +			}
>> +			pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
>> +			if (pmem_clear_poison(pmem, pmem_off, bytes) !=
>> +						BLK_STS_OK)
>> +				return (size_t) -EIO;
> 
> Looks reasonable enough to me, though you might want to restructure this
> to reduce the amount of indent.

Agreed.
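
For illustration, an early-return shape with the same logic as the
quoted hunk (untested; likely()/unlikely() annotations dropped):

  if (mode != DAX_OP_RECOVERY)
          return _copy_from_iter_flushcache(addr, bytes, i);

  lead_off = (unsigned long)addr & ~PAGE_MASK;
  len = PFN_PHYS(PFN_UP(lead_off + bytes));
  if (!is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, len))
          return _copy_from_iter_flushcache(addr, bytes, i);

  if (lead_off || !PAGE_ALIGNED(bytes)) {
          dev_warn(dev, "Found poison, but addr(%p) and/or bytes(%#lx) not page aligned\n",
                  addr, bytes);
          return (size_t)-EIO;
  }

  pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
  if (pmem_clear_poison(pmem, pmem_off, bytes) != BLK_STS_OK)
          return (size_t)-EIO;

  return _copy_from_iter_flushcache(addr, bytes, i);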

> 
> FWIW I dislike how is_bad_pmem mixes units (sector_t vs. bytes), that
> was seriously confusing.  But I guess that's a weird quirk of the
> badblocks API and .... ugh.
> 
> (I dunno, can we at least clean up the nvdimm parts and some day replace
> the badblocks backend with something that can handle more than 16
> records?  interval_tree is more than up to that task, I know, I use it
> for xfs online fsck...)

Let me look into this and get back to you.

> 
>> +		}
>> +	}
>> +
>>   	return _copy_from_iter_flushcache(addr, bytes, i);
>>   }
>>   
>>   static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>>   		void *addr, size_t bytes, struct iov_iter *i, int mode)
>>   {
>> +	int num_bad;
>> +	size_t len, lead_off;
>> +	unsigned long bad_pfn;
>> +	bool bad_pmem = false;
>> +	size_t adj_len = bytes;
>> +	sector_t sector, first_bad;
>> +	struct pmem_device *pmem = dax_get_private(dax_dev);
>> +	struct device *dev = pmem->bb.dev;
>> +
>> +	if (unlikely(mode == DAX_OP_RECOVERY)) {
>> +		sector = PFN_PHYS(pgoff) / 512;
>> +		lead_off = (unsigned long)addr & ~PAGE_MASK;
>> +		len = PFN_PHYS(PFN_UP(lead_off + bytes));
>> +		if (pmem->bb.count)
>> +			bad_pmem = !!badblocks_check(&pmem->bb, sector,
>> +					len / 512, &first_bad, &num_bad);
>> +		if (bad_pmem) {
>> +			bad_pfn = PHYS_PFN(first_bad * 512);
>> +			if (bad_pfn == pgoff) {
>> +				dev_warn(dev, "Found poison in page: pgoff(%#lx)\n",
>> +					pgoff);
>> +				return -EIO;
>> +			}
>> +			adj_len = PFN_PHYS(bad_pfn - pgoff) - lead_off;
>> +			dev_WARN_ONCE(dev, (adj_len > bytes),
>> +					"out-of-range first_bad?");
>> +		}
>> +		if (adj_len == 0)
>> +			return (size_t) -EIO;
> 
> Uh, are we supposed to adjust bytes here or something?

Because we're trying to read as much data as possible...
What is your concern here?

thanks!
-jane

> 
> --D
> 
>> +	}
>> +
>>   	return _copy_mc_to_iter(addr, bytes, i);
>>   }
>>   



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/2] dax: Introduce normal and recovery dax operation modes
  2021-11-06 16:48   ` Dan Williams
@ 2021-11-08 21:02     ` Jane Chu
  2021-11-09  5:26       ` Ira Weiny
  0 siblings, 1 reply; 21+ messages in thread
From: Jane Chu @ 2021-11-08 21:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: david, Darrick J. Wong, Christoph Hellwig, Vishal L Verma,
	Dave Jiang, Alasdair Kergon, Mike Snitzer,
	device-mapper development, Weiny, Ira, Matthew Wilcox,
	Vivek Goyal, linux-fsdevel, Linux NVDIMM,
	Linux Kernel Mailing List, linux-xfs

On 11/6/2021 9:48 AM, Dan Williams wrote:
> On Fri, Nov 5, 2021 at 6:17 PM Jane Chu <jane.chu@oracle.com> wrote:
>>
>> Introduce DAX_OP_NORMAL and DAX_OP_RECOVERY operation modes to
>> {dax_direct_access, dax_copy_from_iter, dax_copy_to_iter}.
>> DAX_OP_NORMAL is the default or the existing mode, and
>> DAX_OP_RECOVERY is a new mode for data recovery purpose.
>>
>> When dax-FS suspects dax media error might be encountered
>> on a read or write, it can enact the recovery mode read or write
>> by setting DAX_OP_RECOVERY in the aforementioned APIs. A read
>> in recovery mode attempts to fetch as much data as possible
>> until the first poisoned page is encountered. A write in recovery
>> mode attempts to clear poison(s) in a page-aligned range and
>> then write the user provided data over.
>>
>> DAX_OP_NORMAL should be used for all non-recovery code path.
>>
>> Signed-off-by: Jane Chu <jane.chu@oracle.com>
> [..]
>> diff --git a/include/linux/dax.h b/include/linux/dax.h
>> index 324363b798ec..931586df2905 100644
>> --- a/include/linux/dax.h
>> +++ b/include/linux/dax.h
>> @@ -9,6 +9,10 @@
>>   /* Flag for synchronous flush */
>>   #define DAXDEV_F_SYNC (1UL << 0)
>>
>> +/* dax operation mode dynamically set by caller */
>> +#define        DAX_OP_NORMAL           0
> 
> Perhaps this should be called DAX_OP_FAILFAST?

Sure.

> 
>> +#define        DAX_OP_RECOVERY         1
>> +
>>   typedef unsigned long dax_entry_t;
>>
>>   struct dax_device;
>> @@ -22,8 +26,8 @@ struct dax_operations {
>>           * logical-page-offset into an absolute physical pfn. Return the
>>           * number of pages available for DAX at that pfn.
>>           */
>> -       long (*direct_access)(struct dax_device *, pgoff_t, long,
>> -                       void **, pfn_t *);
>> +       long (*direct_access)(struct dax_device *, pgoff_t, long, int,
> 
> Would be nice if that 'int' was an enum, but I'm not sure a new
> parameter is needed at all, see below...

Let's do your suggestion below. :)

> 
>> +                               void **, pfn_t *);
>>          /*
>>           * Validate whether this device is usable as an fsdax backing
>>           * device.
>> @@ -32,10 +36,10 @@ struct dax_operations {
>>                          sector_t, sector_t);
>>          /* copy_from_iter: required operation for fs-dax direct-i/o */
>>          size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
>> -                       struct iov_iter *);
>> +                       struct iov_iter *, int);
> 
> I'm not sure the flag is needed here as the "void *" could carry a
> flag in the pointer to indicate that is a recovery kaddr.

Agreed.

> 
>>          /* copy_to_iter: required operation for fs-dax direct-i/o */
>>          size_t (*copy_to_iter)(struct dax_device *, pgoff_t, void *, size_t,
>> -                       struct iov_iter *);
>> +                       struct iov_iter *, int);
> 
> Same comment here.
> 
>>          /* zero_page_range: required operation. Zero page range   */
>>          int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
>>   };
>> @@ -186,11 +190,11 @@ static inline void dax_read_unlock(int id)
>>   bool dax_alive(struct dax_device *dax_dev);
>>   void *dax_get_private(struct dax_device *dax_dev);
>>   long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
>> -               void **kaddr, pfn_t *pfn);
>> +               int mode, void **kaddr, pfn_t *pfn);
> 
> How about dax_direct_access() calling convention stays the same, but
> the kaddr is optionally updated to carry a flag in the lower unused
> bits. So:
> 
> void **kaddr = NULL; /* caller only cares about the pfn */
> 
> void *failfast = NULL;
> void **kaddr = &failfast; /* caller wants -EIO not recovery */
> 
> void *recovery = (void *) DAX_OP_RECOVERY;
> void **kaddr = &recovery; /* caller wants to carefully access page(s)
> containing poison */
> 

Got it.

thanks!
-jane


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/2] dax: Introduce normal and recovery dax operation modes
  2021-11-08 21:02     ` Jane Chu
@ 2021-11-09  5:26       ` Ira Weiny
  2021-11-09  6:04         ` Dan Williams
  0 siblings, 1 reply; 21+ messages in thread
From: Ira Weiny @ 2021-11-09  5:26 UTC (permalink / raw)
  To: Jane Chu
  Cc: Dan Williams, david, Darrick J. Wong, Christoph Hellwig,
	Vishal L Verma, Dave Jiang, Alasdair Kergon, Mike Snitzer,
	device-mapper development, Matthew Wilcox, Vivek Goyal,
	linux-fsdevel, Linux NVDIMM, Linux Kernel Mailing List,
	linux-xfs

On Mon, Nov 08, 2021 at 09:02:29PM +0000, Jane Chu wrote:
> On 11/6/2021 9:48 AM, Dan Williams wrote:
> > <snip>
> > I'm not sure the flag is needed here as the "void *" could carry a
> > flag in the pointer to indicate that is a recovery kaddr.
> 
> Agreed.

Not sure if this is implied, but I would like some macros or other helper
functions to check these flags hidden in the addresses.

For me, I'm a bit scared about having flags hidden in the address like this
because it can lead to some confusion IMO.

But if we have some macros or other calls which make it more obvious
what is going on, I think that would help.

Apologies if this was what you were already going to do...  :-D

Ira

> <snip>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/2] dax: Introduce normal and recovery dax operation modes
  2021-11-09  5:26       ` Ira Weiny
@ 2021-11-09  6:04         ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2021-11-09  6:04 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jane Chu, david, Darrick J. Wong, Christoph Hellwig,
	Vishal L Verma, Dave Jiang, Alasdair Kergon, Mike Snitzer,
	device-mapper development, Matthew Wilcox, Vivek Goyal,
	linux-fsdevel, Linux NVDIMM, Linux Kernel Mailing List,
	linux-xfs

On Mon, Nov 8, 2021 at 9:26 PM Ira Weiny <ira.weiny@intel.com> wrote:
>
> On Mon, Nov 08, 2021 at 09:02:29PM +0000, Jane Chu wrote:
> > On 11/6/2021 9:48 AM, Dan Williams wrote:
> > > <snip>
> > > I'm not sure the flag is needed here as the "void *" could carry a
> > > flag in the pointer to indicate that is a recovery kaddr.
> >
> > Agreed.
>
> Not sure if this is implied, but I would like some macros or other helper
> functions to check these flags hidden in the addresses.
>
> For me, I'm a bit scared about having flags hidden in the address like this
> because it can lead to some confusion IMO.
>
> But if we have some macros or other calls which make it more obvious
> what is going on, I think that would help.

You could go further and mark it as an 'unsigned long __bitwise' type
to get the compiler to help with enforcing accessors to strip off the
flag bits.
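
For illustration, a __bitwise variant of the flag helpers might look
like this (sparse-only annotations; the type and names are
hypothetical):

  typedef unsigned long __bitwise dax_kaddr_t;

  #define DAX_RECOVERY ((__force dax_kaddr_t)1UL)

  static inline dax_kaddr_t dax_mk_recovery(void *kaddr)
  {
          return ((__force dax_kaddr_t)(unsigned long)kaddr) | DAX_RECOVERY;
  }

  static inline void *dax_kaddr(dax_kaddr_t kaddr)
  {
          /* sparse warns on any use that bypasses this accessor */
          return (void *)((__force unsigned long)kaddr &
                          ~(__force unsigned long)DAX_RECOVERY);
  }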

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery
  2021-11-06  1:16 ` [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery Jane Chu
  2021-11-06  2:04   ` Darrick J. Wong
@ 2021-11-09  7:27   ` Christoph Hellwig
  2021-11-09 18:48     ` Dan Williams
  2021-11-09 19:14     ` Jane Chu
  1 sibling, 2 replies; 21+ messages in thread
From: Christoph Hellwig @ 2021-11-09  7:27 UTC (permalink / raw)
  To: Jane Chu
  Cc: david, djwong, dan.j.williams, hch, vishal.l.verma, dave.jiang,
	agk, snitzer, dm-devel, ira.weiny, willy, vgoyal, linux-fsdevel,
	nvdimm, linux-kernel, linux-xfs

On Fri, Nov 05, 2021 at 07:16:38PM -0600, Jane Chu wrote:
>  static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>  		void *addr, size_t bytes, struct iov_iter *i, int mode)
>  {
> +	phys_addr_t pmem_off;
> +	size_t len, lead_off;
> +	struct pmem_device *pmem = dax_get_private(dax_dev);
> +	struct device *dev = pmem->bb.dev;
> +
> +	if (unlikely(mode == DAX_OP_RECOVERY)) {
> +		lead_off = (unsigned long)addr & ~PAGE_MASK;
> +		len = PFN_PHYS(PFN_UP(lead_off + bytes));
> +		if (is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, len)) {
> +			if (lead_off || !(PAGE_ALIGNED(bytes))) {
> +				dev_warn(dev, "Found poison, but addr(%p) and/or bytes(%#lx) not page aligned\n",
> +					addr, bytes);
> +				return (size_t) -EIO;
> +			}
> +			pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
> +			if (pmem_clear_poison(pmem, pmem_off, bytes) !=
> +						BLK_STS_OK)
> +				return (size_t) -EIO;
> +		}
> +	}

This is in the wrong spot.  As seen in my WIP series individual drivers
really should not hook into copying to and from the iter, because it
really is just one way to write to a nvdimm.  How would dm-writecache
clear the errors with this scheme?

So IMHO going back to the separate recovery method as in your previous
patch really is the way to go.  If/when the 64-bit store happens we
need to figure out a good way to clear the bad block list for that.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery
  2021-11-09  7:27   ` Christoph Hellwig
@ 2021-11-09 18:48     ` Dan Williams
  2021-11-09 19:52       ` Christoph Hellwig
  2021-11-09 19:58       ` Jane Chu
  2021-11-09 19:14     ` Jane Chu
  1 sibling, 2 replies; 21+ messages in thread
From: Dan Williams @ 2021-11-09 18:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jane Chu, david, Darrick J. Wong, Vishal L Verma, Dave Jiang,
	Alasdair Kergon, Mike Snitzer, device-mapper development, Weiny,
	Ira, Matthew Wilcox, Vivek Goyal, linux-fsdevel, Linux NVDIMM,
	Linux Kernel Mailing List, linux-xfs

On Mon, Nov 8, 2021 at 11:27 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Fri, Nov 05, 2021 at 07:16:38PM -0600, Jane Chu wrote:
> > <snip>
>
> This is in the wrong spot.  As seen in my WIP series individual drivers
> really should not hook into copying to and from the iter, because it
> really is just one way to write to a nvdimm.  How would dm-writecache
> clear the errors with this scheme?
>
> So IMHO going back to the separate recovery method as in your previous
> patch really is the way to go.  If/when the 64-bit store happens we
> need to figure out a good way to clear the bad block list for that.

I think we just make error management a first class citizen of a
dax-device and stop abstracting it behind a driver callback. That way
the driver that registers the dax-device can optionally register error
management as well. Then fsdax path can do:

        rc = dax_direct_access(..., &kaddr, ...);
        if (unlikely(rc)) {
                kaddr = dax_mk_recovery(kaddr);
                dax_direct_access(..., &kaddr, ...);
                return dax_recovery_{read,write}(..., kaddr, ...);
        }
        return copy_{mc_to_iter,from_iter_flushcache}(...);

Where, the recovery version of dax_direct_access() has the opportunity
to change the page permissions / use an alias mapping for the access,
dax_recovery_read() allows reading the good cachelines out of a
poisoned page, and dax_recovery_write() coordinates error list
management and returning a poison page to full write-back caching
operation when no more poisoned cacheline are detected in the page.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery
  2021-11-09  7:27   ` Christoph Hellwig
  2021-11-09 18:48     ` Dan Williams
@ 2021-11-09 19:14     ` Jane Chu
  1 sibling, 0 replies; 21+ messages in thread
From: Jane Chu @ 2021-11-09 19:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: david, djwong, dan.j.williams, vishal.l.verma, dave.jiang, agk,
	snitzer, dm-devel, ira.weiny, willy, vgoyal, linux-fsdevel,
	nvdimm, linux-kernel, linux-xfs

On 11/8/2021 11:27 PM, Christoph Hellwig wrote:
> On Fri, Nov 05, 2021 at 07:16:38PM -0600, Jane Chu wrote:
>> <snip>
> 
> This is in the wrong spot.  As seen in my WIP series individual drivers
> really should not hook into copying to and from the iter, because it
> really is just one way to write to a nvdimm.  How would dm-writecache
> clear the errors with this scheme?

How does dm-writecache detect and clear errors today?

> 
> So IMHO going back to the separate recovery method as in your previous
> patch really is the way to go.  If/when the 64-bit store happens we
> need to figure out a good way to clear the bad block list for that.
> 
Yeah, but the separate .dax_clear_poison API may not be called
arbitrarily.  It must be followed by a 'write' operation, and so, unless
we bind the two operations together, how do we enforce that to avoid
silent data corruption?

thanks!
-jane




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery
  2021-11-09 18:48     ` Dan Williams
@ 2021-11-09 19:52       ` Christoph Hellwig
  2021-11-09 19:58       ` Jane Chu
  1 sibling, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2021-11-09 19:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jane Chu, david, Darrick J. Wong,
	Vishal L Verma, Dave Jiang, Alasdair Kergon, Mike Snitzer,
	device-mapper development, Weiny, Ira, Matthew Wilcox,
	Vivek Goyal, linux-fsdevel, Linux NVDIMM,
	Linux Kernel Mailing List, linux-xfs

On Tue, Nov 09, 2021 at 10:48:51AM -0800, Dan Williams wrote:
> I think we just make error management a first class citizen of a
> dax-device and stop abstracting it behind a driver callback. That way
> the driver that registers the dax-device can optionally register error
> management as well. Then fsdax path can do:

This sounds pretty sensible.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery
  2021-11-09 18:48     ` Dan Williams
  2021-11-09 19:52       ` Christoph Hellwig
@ 2021-11-09 19:58       ` Jane Chu
  2021-11-09 21:02         ` Dan Williams
  1 sibling, 1 reply; 21+ messages in thread
From: Jane Chu @ 2021-11-09 19:58 UTC (permalink / raw)
  To: Dan Williams, Christoph Hellwig
  Cc: david, Darrick J. Wong, Vishal L Verma, Dave Jiang,
	Alasdair Kergon, Mike Snitzer, device-mapper development, Weiny,
	Ira, Matthew Wilcox, Vivek Goyal, linux-fsdevel, Linux NVDIMM,
	Linux Kernel Mailing List, linux-xfs

On 11/9/2021 10:48 AM, Dan Williams wrote:
> On Mon, Nov 8, 2021 at 11:27 PM Christoph Hellwig <hch@infradead.org> wrote:
>>
>> On Fri, Nov 05, 2021 at 07:16:38PM -0600, Jane Chu wrote:
>>> <snip>
>>
>> This is in the wrong spot.  As seen in my WIP series individual drivers
>> really should not hook into copying to and from the iter, because it
>> really is just one way to write to a nvdimm.  How would dm-writecache
>> clear the errors with this scheme?
>>
>> So IMHO going back to the separate recovery method as in your previous
>> patch really is the way to go.  If/when the 64-bit store happens we
>> need to figure out a good way to clear the bad block list for that.
> 
> I think we just make error management a first class citizen of a
> dax-device and stop abstracting it behind a driver callback. That way
> the driver that registers the dax-device can optionally register error
> management as well. Then fsdax path can do:
> 
>          rc = dax_direct_access(..., &kaddr, ...);
>          if (unlikely(rc)) {
>                  kaddr = dax_mk_recovery(kaddr);

Sorry, what does dax_mk_recovery(kaddr) do?

>                  dax_direct_access(..., &kaddr, ...);
>                  return dax_recovery_{read,write}(..., kaddr, ...);
>          }
>          return copy_{mc_to_iter,from_iter_flushcache}(...);
> 
> Where, the recovery version of dax_direct_access() has the opportunity
> to change the page permissions / use an alias mapping for the access,

again, sorry, what 'page permissions'?  memory_failure_dev_pagemap()
changes the poisoned page mem_type from 'rw' to 'uc-' (should be NP?),
do you mean to reverse the change?

> dax_recovery_read() allows reading the good cachelines out of a
> poisoned page, and dax_recovery_write() coordinates error list
> management and returning a poison page to full write-back caching
> operation when no more poisoned cacheline are detected in the page.
> 

How about introducing 3 dax_recover_ APIs:
   dax_recover_direct_access(): similar to dax_direct_access() except
      that it ignores the error list and returns the kaddr, and hence
      is also optional, exported by a device driver that has the
      ability to detect errors;
   dax_recovery_read(): optional, supported by the pmem driver only,
      reads as much data as possible up to the poisoned page;
   dax_recovery_write(): optional, supported by the pmem driver only,
      first clear-poison, then write.
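
A rough signature sketch of those three, for discussion only (all
hypothetical):

  long dax_recover_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
                  long nr_pages, void **kaddr, pfn_t *pfn);
  size_t dax_recovery_read(struct dax_device *dax_dev, pgoff_t pgoff,
                  void *addr, size_t bytes, struct iov_iter *i);
  size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
                  void *addr, size_t bytes, struct iov_iter *i);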

Should we worry about the dm targets?

Both dax_recovery_read/write() are hooked up to dax_iomap_iter().

Thanks,
-jane




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery
  2021-11-09 19:58       ` Jane Chu
@ 2021-11-09 21:02         ` Dan Williams
  2021-11-10 18:26           ` Jane Chu
  0 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2021-11-09 21:02 UTC (permalink / raw)
  To: Jane Chu
  Cc: Christoph Hellwig, david, Darrick J. Wong, Vishal L Verma,
	Dave Jiang, Alasdair Kergon, Mike Snitzer,
	device-mapper development, Weiny, Ira, Matthew Wilcox,
	Vivek Goyal, linux-fsdevel, Linux NVDIMM,
	Linux Kernel Mailing List, linux-xfs

On Tue, Nov 9, 2021 at 11:59 AM Jane Chu <jane.chu@oracle.com> wrote:
>
> On 11/9/2021 10:48 AM, Dan Williams wrote:
> > On Mon, Nov 8, 2021 at 11:27 PM Christoph Hellwig <hch@infradead.org> wrote:
> >> <snip>
> >
> > I think we just make error management a first class citizen of a
> > dax-device and stop abstracting it behind a driver callback. That way
> > the driver that registers the dax-device can optionally register error
> > management as well. Then fsdax path can do:
> >
> >          rc = dax_direct_access(..., &kaddr, ...);
> >          if (unlikely(rc)) {
> >                  kaddr = dax_mk_recovery(kaddr);
>
> Sorry, what does dax_mk_recovery(kaddr) do?

I was thinking this just does the hackery to set a flag bit in the
pointer, something like:

return (void *) ((unsigned long) kaddr | DAX_RECOVERY)

>
> >                  dax_direct_access(..., &kaddr, ...);
> >                  return dax_recovery_{read,write}(..., kaddr, ...);
> >          }
> >          return copy_{mc_to_iter,from_iter_flushcache}(...);
> >
> > Where, the recovery version of dax_direct_access() has the opportunity
> > to change the page permissions / use an alias mapping for the access,
>
> again, sorry, what 'page permissions'?  memory_failure_dev_pagemap()
> changes the poisoned page mem_type from 'rw' to 'uc-' (should be NP?),
> do you mean to reverse the change?

Right, the result of the conversation with Boris is that
memory_failure() should mark the page as NP in all cases, so
dax_direct_access() needs to create a UC mapping and
dax_recover_{read,write}() would sink that operation and either return
the page to NP after the access completes, or convert it to WB if the
operation cleared the error.
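
As a rough illustration of that dance inside a hypothetical
dax_recovery_write() (x86 set_memory_*() shown; the pmem_* names are
stand-ins and error handling is elided):

  unsigned long va = (unsigned long)kaddr & PAGE_MASK;
  size_t len = 0;

  set_memory_uc(va, 1);           /* lift NP: uncached window into the page */
  if (pmem_clear_poison(pmem, pmem_off, PAGE_SIZE) == BLK_STS_OK) {
          len = _copy_from_iter_flushcache(kaddr, bytes, i);
          set_memory_wb(va, 1);   /* error cleared: restore write-back */
  } else {
          set_memory_np(va, 1);   /* still poisoned: back to not-present */
  }
  return len;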

> > dax_recovery_read() allows reading the good cachelines out of a
> > poisoned page, and dax_recovery_write() coordinates error list
> > management and returning a poison page to full write-back caching
> > operation when no more poisoned cacheline are detected in the page.
> >
>
> How about introducing 3 dax_recover_ APIs:
>    dax_recover_direct_access(): similar to dax_direct_access() except
>       that it ignores the error list and returns the kaddr, and hence
>       is also optional, exported by a device driver that has the
>       ability to detect errors;
>    dax_recovery_read(): optional, supported by the pmem driver only,
>       reads as much data as possible up to the poisoned page;

It wouldn't be a property of the pmem driver, I expect it would be a
flag on the dax device whether to attempt recovery or not. I.e. get
away from this being a pmem callback and make this a native capability
of a dax device.

>    dax_recovery_write(): optional, supported by the pmem driver only,
>       first clear-poison, then write.
>
> Should we worry about the dm targets?

The dm targets after Christoph's conversion should be able to do all
the translation at direct access time and then dax_recovery_X can be
done on the resulting already translated kaddr.

> Both dax_recovery_read/write() are hooked up to dax_iomap_iter().

Yes.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery
  2021-11-09 21:02         ` Dan Williams
@ 2021-11-10 18:26           ` Jane Chu
  2021-11-12 15:36             ` Mike Snitzer
  0 siblings, 1 reply; 21+ messages in thread
From: Jane Chu @ 2021-11-10 18:26 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, david, Darrick J. Wong, Vishal L Verma,
	Dave Jiang, Alasdair Kergon, Mike Snitzer,
	device-mapper development, Weiny, Ira, Matthew Wilcox,
	Vivek Goyal, linux-fsdevel, Linux NVDIMM,
	Linux Kernel Mailing List, linux-xfs

On 11/9/2021 1:02 PM, Dan Williams wrote:
> On Tue, Nov 9, 2021 at 11:59 AM Jane Chu <jane.chu@oracle.com> wrote:
>>
>> On 11/9/2021 10:48 AM, Dan Williams wrote:
>>> On Mon, Nov 8, 2021 at 11:27 PM Christoph Hellwig <hch@infradead.org> wrote:
>>>> <snip>
>>>
>>> I think we just make error management a first class citizen of a
>>> dax-device and stop abstracting it behind a driver callback. That way
>>> the driver that registers the dax-device can optionally register error
>>> management as well. Then fsdax path can do:
>>>
>>>           rc = dax_direct_access(..., &kaddr, ...);
>>>           if (unlikely(rc)) {
>>>                   kaddr = dax_mk_recovery(kaddr);
>>
>> Sorry, what does dax_mk_recovery(kaddr) do?
> 
> I was thinking this just does the hackery to set a flag bit in the
> pointer, something like:
> 
> return (void *) ((unsigned long) kaddr | DAX_RECOVERY)

Okay, how about calling it dax_prep_recovery()?

> 
>>
>>>                   dax_direct_access(..., &kaddr, ...);
>>>                   return dax_recovery_{read,write}(..., kaddr, ...);
>>>           }
>>>           return copy_{mc_to_iter,from_iter_flushcache}(...);
>>>
>>> Where, the recovery version of dax_direct_access() has the opportunity
>>> to change the page permissions / use an alias mapping for the access,
>>
>> again, sorry, what 'page permissions'?  memory_failure_dev_pagemap()
>> changes the poisoned page mem_type from 'rw' to 'uc-' (should be NP?),
>> do you mean to reverse the change?
> 
> Right, the result of the conversation with Boris is that
> memory_failure() should mark the page as NP in all cases, so
> dax_direct_access() needs to create a UC mapping and
> dax_recover_{read,write}() would sink that operation and either return
> the page to NP after the access completes, or convert it to WB if the
> operation cleared the error.

Okay, I will add a patch to fix set_mce_nospec().

How about moving set_memory_uc() and set_memory_np() down to
dax_recovery_read(), so that we don't split the set_memory_X calls
over different APIs, because we can't enforce what follows
dax_direct_access()?

> 
>>> dax_recovery_read() allows reading the good cachelines out of a
>>> poisoned page, and dax_recovery_write() coordinates error list
>>> management and returning a poison page to full write-back caching
>>> operation when no more poisoned cacheline are detected in the page.
>>>
>>
>> How about introducing 3 dax_recover_ APIs:
>>     dax_recover_direct_access(): similar to dax_direct_access() except
>>        that it ignores the error list and returns the kaddr, and hence
>>        is also optional, exported by a device driver that has the
>>        ability to detect errors;
>>     dax_recovery_read(): optional, supported by the pmem driver only,
>>        reads as much data as possible up to the poisoned page;
> 
> It wouldn't be a property of the pmem driver, I expect it would be a
> flag on the dax device whether to attempt recovery or not. I.e. get
> away from this being a pmem callback and make this a native capability
> of a dax device.
> 
>>     dax_recovery_write(): optional, supported by the pmem driver only,
>>        first clear-poison, then write.
>>
>> Should we worry about the dm targets?
> 
> The dm targets after Christoph's conversion should be able to do all
> the translation at direct access time and then dax_recovery_X can be
> done on the resulting already translated kaddr.

I'm thinking about the mixed-device dm case where some targets provide
dax_recovery_X and others don't.  In that case, do we disallow dax
recovery entirely because partial support causes confusion, or should
we still allow recovery for the devices that support it?

> 
>> Both dax_recovery_read/write() are hooked up to dax_iomap_iter().
> 
> Yes.
> 

thanks!
-jane

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery
  2021-11-10 18:26           ` Jane Chu
@ 2021-11-12 15:36             ` Mike Snitzer
  2021-11-12 18:00               ` Jane Chu
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Snitzer @ 2021-11-12 15:36 UTC (permalink / raw)
  To: Jane Chu
  Cc: Dan Williams, Christoph Hellwig, david, Darrick J. Wong,
	Vishal L Verma, Dave Jiang, Alasdair Kergon,
	device-mapper development, Weiny, Ira, Matthew Wilcox,
	Vivek Goyal, linux-fsdevel, Linux NVDIMM,
	Linux Kernel Mailing List, linux-xfs

On Wed, Nov 10 2021 at  1:26P -0500,
Jane Chu <jane.chu@oracle.com> wrote:

> On 11/9/2021 1:02 PM, Dan Williams wrote:
> > <snip>
> >>
> >> Should we worry about the dm targets?
> > 
> > The dm targets after Christoph's conversion should be able to do all
> > the translation at direct access time and then dax_recovery_X can be
> > done on the resulting already translated kaddr.
> 
> I'm thinking about the mixed-device dm case where some targets provide
> dax_recovery_X and others don't.  In that case, do we disallow dax
> recovery entirely because partial support causes confusion, or should
> we still allow recovery for the devices that support it?

I really don't like the all-or-nothing approach if it can be avoided.
I would imagine that if recovery is possible, it is best to support it
even if the DM device happens to span a mix of devices with varying
support for recovery.

Thanks,
Mike


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery
  2021-11-12 15:36             ` Mike Snitzer
@ 2021-11-12 18:00               ` Jane Chu
  0 siblings, 0 replies; 21+ messages in thread
From: Jane Chu @ 2021-11-12 18:00 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Dan Williams, Christoph Hellwig, david, Darrick J. Wong,
	Vishal L Verma, Dave Jiang, Alasdair Kergon,
	device-mapper development, Weiny, Ira, Matthew Wilcox,
	Vivek Goyal, linux-fsdevel, Linux NVDIMM,
	Linux Kernel Mailing List, linux-xfs

On 11/12/2021 7:36 AM, Mike Snitzer wrote:
> On Wed, Nov 10 2021 at  1:26P -0500,
> Jane Chu <jane.chu@oracle.com> wrote:
> 
>> <snip>
>> I'm thinking about the mixed-device dm case where some targets provide
>> dax_recovery_X and others don't.  In that case, do we disallow dax
>> recovery entirely because partial support causes confusion, or should
>> we still allow recovery for the devices that support it?
> 
> I really don't like the all-or-nothing approach if it can be avoided.
> I would imagine that if recovery is possible, it is best to support it
> even if the DM device happens to span a mix of devices with varying
> support for recovery.

Got it!

thanks!
-jane

> 
> Thanks,
> Mike
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-11-12 18:00 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-06  1:16 [PATCH v2 0/2] Dax poison recovery Jane Chu
2021-11-06  1:16 ` [PATCH v2 1/2] dax: Introduce normal and recovery dax operation modes Jane Chu
2021-11-06  1:50   ` Darrick J. Wong
2021-11-08 20:43     ` Jane Chu
2021-11-06 16:48   ` Dan Williams
2021-11-08 21:02     ` Jane Chu
2021-11-09  5:26       ` Ira Weiny
2021-11-09  6:04         ` Dan Williams
2021-11-06  1:16 ` [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery Jane Chu
2021-11-06  2:04   ` Darrick J. Wong
2021-11-08 20:53     ` Jane Chu
2021-11-08 21:00     ` Jane Chu
2021-11-09  7:27   ` Christoph Hellwig
2021-11-09 18:48     ` Dan Williams
2021-11-09 19:52       ` Christoph Hellwig
2021-11-09 19:58       ` Jane Chu
2021-11-09 21:02         ` Dan Williams
2021-11-10 18:26           ` Jane Chu
2021-11-12 15:36             ` Mike Snitzer
2021-11-12 18:00               ` Jane Chu
2021-11-09 19:14     ` Jane Chu
