* [PATCH v8 0/7] DAX poison recovery
@ 2022-04-20  2:04 ` Jane Chu
  0 siblings, 0 replies; 44+ messages in thread
From: Jane Chu @ 2022-04-20  2:04 UTC (permalink / raw)
  To: dan.j.williams, bp, hch, dave.hansen, peterz, luto, david,
	djwong, linux-fsdevel, nvdimm, linux-kernel, x86
  Cc: vishal.l.verma, dave.jiang, agk, snitzer, dm-devel, ira.weiny,
	willy, vgoyal

In this series, the dax recovery code path is independent of the
normal write path. Competing dax recovery threads are serialized,
and racing read threads are guaranteed not to overlap with the
recovery process.

In this phase, the recovery granularity is a page; a future patch
will explore recovery at a finer granularity.

Changelog:
v7 -> v8:
  - add a patch to teach the nfit driver to rely on mce->misc for poison
    granularity, suggested by Dan
  - add set_memory_present() helper to be invoked by set_mce_nospec() for
    better readability, suggested by Dan
  - folded a trivial fix to comments in patch 2/NN, suggested by Boris,
    and more commit message comments from Boris
  - move .recovery_write to dax_operations as dev_pagemap_ops is meant for
    device-agnostic operations, suggested by Christoph
  - replace the DAX_RECOVERY flag with enum dax_access_mode, suggested by Dan
  - re-organized __pmem_direct_access as provided by Christoph
  - split [PATCH v7 4/6] into two patches: one introduces
    DAX_RECOVERY_WRITE, and the other introduces the .recovery_write operation
  
v6 -> v7:
 . incorporated comments from Christoph, and picked up a reviewed-by
 . add x86@kernel.org per Boris
 . discovered that pmem firmware doesn't reliably handle a request to
   clear poison over a large range (such as 2M), hence worked around
   the issue by limiting the size of the requested range to the kernel
   page size.

v5->v6:
  . per Christoph, move set{clear}_mce_nospec() inline functions out
    of include/linux/set_memory.h and into arch/x86/mm/pat/set_memory.c,
    so that there is no need to export _set_memory_present().
  . per Christoph, ratelimit the warning message in pmem_do_write()
  . per both Christoph and Dan, switch back to adding a flag to
    dax_direct_access() instead of embedding the flag in kvaddr
  . suggestions from Christoph for improving code structure and
    readability
  . per Dan, add .recovery_write to dev_pagemap.ops instead of adding
    it to dax_operations, such that the DM layer doesn't need to be
    involved explicitly in the dax recovery write
  . per Dan, is_bad_pmem() takes a seqlock, so there is no need to
    place it under recovery_lock.
  Many thanks to both reviewers!

v4->v5:
  Fixed build errors reported by kernel test robot

v3->v4:
  Rebased to v5.17-rc1-81-g0280e3c58f92

References:
v4 https://lore.kernel.org/lkml/20220126211116.860012-1-jane.chu@oracle.com/T/
v3 https://lkml.org/lkml/2022/1/11/900
v2 https://lore.kernel.org/all/20211106011638.2613039-1-jane.chu@oracle.com/
Discussions about marking a poisoned page as 'np'
https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/

Jane Chu (7):
  acpi/nfit: rely on mce->misc to determine poison granularity
  x86/mce: relocate set{clear}_mce_nospec() functions
  mce: fix set_mce_nospec to always unmap the whole page
  dax: introduce DAX_RECOVERY_WRITE dax access mode
  dax: add .recovery_write dax_operation
  pmem: refactor pmem_clear_poison()
  pmem: implement pmem_recovery_write()

 arch/x86/include/asm/set_memory.h |  52 --------
 arch/x86/kernel/cpu/mce/core.c    |   6 +-
 arch/x86/mm/pat/set_memory.c      |  48 ++++++-
 drivers/acpi/nfit/mce.c           |   4 +-
 drivers/dax/super.c               |  14 +-
 drivers/md/dm-linear.c            |  15 ++-
 drivers/md/dm-log-writes.c        |  15 ++-
 drivers/md/dm-stripe.c            |  15 ++-
 drivers/md/dm-target.c            |   4 +-
 drivers/md/dm-writecache.c        |   7 +-
 drivers/md/dm.c                   |  25 +++-
 drivers/nvdimm/pmem.c             | 207 +++++++++++++++++++++---------
 drivers/nvdimm/pmem.h             |   6 +-
 drivers/s390/block/dcssblk.c      |   9 +-
 fs/dax.c                          |  22 +++-
 fs/fuse/dax.c                     |   4 +-
 include/linux/dax.h               |  22 +++-
 include/linux/device-mapper.h     |  13 +-
 include/linux/set_memory.h        |  10 +-
 tools/testing/nvdimm/pmem-dax.c   |   3 +-
 20 files changed, 351 insertions(+), 150 deletions(-)


base-commit: 028192fea1de083f4f12bfb1eb7c4d7beb5c8ecd
-- 
2.18.4


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v8 1/7] acpi/nfit: rely on mce->misc to determine poison granularity
  2022-04-20  2:04 ` [dm-devel] " Jane Chu
@ 2022-04-20  2:04   ` Jane Chu
  -1 siblings, 0 replies; 44+ messages in thread
From: Jane Chu @ 2022-04-20  2:04 UTC (permalink / raw)
  To: dan.j.williams, bp, hch, dave.hansen, peterz, luto, david,
	djwong, linux-fsdevel, nvdimm, linux-kernel, x86
  Cc: vishal.l.verma, dave.jiang, agk, snitzer, dm-devel, ira.weiny,
	willy, vgoyal

nfit_handle_mce() hardcodes the poison granularity at L1_CACHE_BYTES.
Instead, let the driver rely on the mce->misc register to determine
the poison granularity.
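
As an illustration (not part of the patch): per the x86 MCA
definitions, MCI_MISC_ADDR_LSB() extracts the "recoverable address
LSB" field (bits 5:0) of the MCi_MISC register, i.e. the position of
the least significant valid bit of the recorded address, so the
granularity is simply 1 << field.  The values below are hypothetical:

	unsigned int align = 1UL << MCI_MISC_ADDR_LSB(mce->misc);
	/* LSB field == 6  -> align == 64   (one cache line) */
	/* LSB field == 12 -> align == 4096 (one 4K page)    */

	/* ALIGN() would round a mid-page address *up* past the poison;
	 * ALIGN_DOWN() keeps the bad range covered, e.g.:
	 */
	ALIGN(0x10000040UL, 4096);	/* == 0x10001000, misses the poison */
	ALIGN_DOWN(0x10000040UL, 4096);	/* == 0x10000000, covers it */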

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 drivers/acpi/nfit/mce.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index ee8d9973f60b..d48a388b796e 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -32,6 +32,7 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
 	 */
 	mutex_lock(&acpi_desc_lock);
 	list_for_each_entry(acpi_desc, &acpi_descs, list) {
+		unsigned int align = 1UL << MCI_MISC_ADDR_LSB(mce->misc);
 		struct device *dev = acpi_desc->dev;
 		int found_match = 0;
 
@@ -63,8 +64,7 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
 
 		/* If this fails due to an -ENOMEM, there is little we can do */
 		nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
-				ALIGN(mce->addr, L1_CACHE_BYTES),
-				L1_CACHE_BYTES);
+				ALIGN_DOWN(mce->addr, align), align);
 		nvdimm_region_notify(nfit_spa->nd_region,
 				NVDIMM_REVALIDATE_POISON);
 
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v8 2/7] x86/mce: relocate set{clear}_mce_nospec() functions
  2022-04-20  2:04 ` [dm-devel] " Jane Chu
@ 2022-04-20  2:04   ` Jane Chu
  -1 siblings, 0 replies; 44+ messages in thread
From: Jane Chu @ 2022-04-20  2:04 UTC (permalink / raw)
  To: dan.j.williams, bp, hch, dave.hansen, peterz, luto, david,
	djwong, linux-fsdevel, nvdimm, linux-kernel, x86
  Cc: vishal.l.verma, dave.jiang, agk, snitzer, dm-devel, ira.weiny,
	willy, vgoyal

Relocate the twin mce functions to arch/x86/mm/pat/set_memory.c,
where they belong.

While at it, fix up a function name in a comment.
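
For reference, a condensed sketch of the decoy-address trick that the
relocated set_mce_nospec() relies on (illustrative only; see the full
comment in the moved code below):

	/*
	 * The 1:1-map virtual address of the pfn with bit 63 flipped
	 * is effectively a non-canonical "decoy" that can never sit
	 * in a register as a dereferenceable pointer, yet
	 * set_memory_np()/set_memory_uc() still resolve the right PTE
	 * because they sanitize __pa() results with __PHYSICAL_MASK
	 * or PTE_PFN_MASK.
	 */
	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));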

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 arch/x86/include/asm/set_memory.h | 52 -------------------------------
 arch/x86/mm/pat/set_memory.c      | 49 ++++++++++++++++++++++++++++-
 include/linux/set_memory.h        |  8 ++---
 3 files changed, 52 insertions(+), 57 deletions(-)

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 78ca53512486..b45c4d27fd46 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -86,56 +86,4 @@ bool kernel_page_present(struct page *page);
 
 extern int kernel_set_to_readonly;
 
-#ifdef CONFIG_X86_64
-/*
- * Prevent speculative access to the page by either unmapping
- * it (if we do not require access to any part of the page) or
- * marking it uncacheable (if we want to try to retrieve data
- * from non-poisoned lines in the page).
- */
-static inline int set_mce_nospec(unsigned long pfn, bool unmap)
-{
-	unsigned long decoy_addr;
-	int rc;
-
-	/* SGX pages are not in the 1:1 map */
-	if (arch_is_platform_page(pfn << PAGE_SHIFT))
-		return 0;
-	/*
-	 * We would like to just call:
-	 *      set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
-	 * but doing that would radically increase the odds of a
-	 * speculative access to the poison page because we'd have
-	 * the virtual address of the kernel 1:1 mapping sitting
-	 * around in registers.
-	 * Instead we get tricky.  We create a non-canonical address
-	 * that looks just like the one we want, but has bit 63 flipped.
-	 * This relies on set_memory_XX() properly sanitizing any __pa()
-	 * results with __PHYSICAL_MASK or PTE_PFN_MASK.
-	 */
-	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
-
-	if (unmap)
-		rc = set_memory_np(decoy_addr, 1);
-	else
-		rc = set_memory_uc(decoy_addr, 1);
-	if (rc)
-		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
-	return rc;
-}
-#define set_mce_nospec set_mce_nospec
-
-/* Restore full speculative operation to the pfn. */
-static inline int clear_mce_nospec(unsigned long pfn)
-{
-	return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
-}
-#define clear_mce_nospec clear_mce_nospec
-#else
-/*
- * Few people would run a 32-bit kernel on a machine that supports
- * recoverable errors because they have too much memory to boot 32-bit.
- */
-#endif
-
 #endif /* _ASM_X86_SET_MEMORY_H */
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index abf5ed76e4b7..978cf5bd2ab6 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1816,7 +1816,7 @@ static inline int cpa_clear_pages_array(struct page **pages, int numpages,
 }
 
 /*
- * _set_memory_prot is an internal helper for callers that have been passed
+ * __set_memory_prot is an internal helper for callers that have been passed
  * a pgprot_t value from upper layers and a reservation has already been taken.
  * If you want to set the pgprot to a specific page protocol, use the
  * set_memory_xx() functions.
@@ -1925,6 +1925,53 @@ int set_memory_wb(unsigned long addr, int numpages)
 }
 EXPORT_SYMBOL(set_memory_wb);
 
+/*
+ * Prevent speculative access to the page by either unmapping
+ * it (if we do not require access to any part of the page) or
+ * marking it uncacheable (if we want to try to retrieve data
+ * from non-poisoned lines in the page).
+ */
+int set_mce_nospec(unsigned long pfn, bool unmap)
+{
+	unsigned long decoy_addr;
+	int rc;
+
+	if (!IS_ENABLED(CONFIG_64BIT))
+		return 0;
+
+	/* SGX pages are not in the 1:1 map */
+	if (arch_is_platform_page(pfn << PAGE_SHIFT))
+		return 0;
+	/*
+	 * We would like to just call:
+	 *      set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
+	 * but doing that would radically increase the odds of a
+	 * speculative access to the poison page because we'd have
+	 * the virtual address of the kernel 1:1 mapping sitting
+	 * around in registers.
+	 * Instead we get tricky.  We create a non-canonical address
+	 * that looks just like the one we want, but has bit 63 flipped.
+	 * This relies on set_memory_XX() properly sanitizing any __pa()
+	 * results with __PHYSICAL_MASK or PTE_PFN_MASK.
+	 */
+	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
+
+	if (unmap)
+		rc = set_memory_np(decoy_addr, 1);
+	else
+		rc = set_memory_uc(decoy_addr, 1);
+	if (rc)
+		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
+	return rc;
+}
+
+/* Restore full speculative operation to the pfn. */
+int clear_mce_nospec(unsigned long pfn)
+{
+	return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
+}
+EXPORT_SYMBOL_GPL(clear_mce_nospec);
+
 int set_memory_x(unsigned long addr, int numpages)
 {
 	if (!(__supported_pte_mask & _PAGE_NX))
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index f36be5166c19..683a6c3f7179 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -42,14 +42,14 @@ static inline bool can_set_direct_map(void)
 #endif
 #endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
 
-#ifndef set_mce_nospec
+#ifdef CONFIG_X86_64
+int set_mce_nospec(unsigned long pfn, bool unmap);
+int clear_mce_nospec(unsigned long pfn);
+#else
 static inline int set_mce_nospec(unsigned long pfn, bool unmap)
 {
 	return 0;
 }
-#endif
-
-#ifndef clear_mce_nospec
 static inline int clear_mce_nospec(unsigned long pfn)
 {
 	return 0;
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v8 3/7] mce: fix set_mce_nospec to always unmap the whole page
  2022-04-20  2:04 ` [dm-devel] " Jane Chu
@ 2022-04-20  2:04   ` Jane Chu
  -1 siblings, 0 replies; 44+ messages in thread
From: Jane Chu @ 2022-04-20  2:04 UTC (permalink / raw)
  To: dan.j.williams, bp, hch, dave.hansen, peterz, luto, david,
	djwong, linux-fsdevel, nvdimm, linux-kernel, x86
  Cc: vishal.l.verma, dave.jiang, agk, snitzer, dm-devel, ira.weiny,
	willy, vgoyal

The set_memory_uc() approach doesn't work well in all cases.
As Dan pointed out: "The VMM unmapped the bad page from guest
physical space and passed the machine check to the guest. The
guest gets a virtual #MC on an access to that page. When the
guest tries to do set_memory_uc() and instructs cpa_flush() to
clean caches, that results in taking another fault / exception,
perhaps because the VMM unmapped the page from the guest."

Since the driver has the special knowledge needed to handle NP
or UC, mark the poisoned page as NP and let the driver handle it
when it comes time to repair.

Please refer to the discussion here for more details.
https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/

Now that the poisoned page is marked not-present, also fix
pmem_do_write() so that it does not write to a not-present page
and trigger a kernel Oops.
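
A condensed view of the resulting pmem_do_write() ordering (see the
hunk below): clearing poison also restores the page mapping, so it
must complete before the write touches the otherwise not-present
kaddr.

	if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
		/* clears poison and re-maps the page */
		rc = pmem_clear_poison(pmem, pmem_off, len);
		if (rc != BLK_STS_OK)
			return rc;
	}
	flush_dcache_page(page);
	write_pmem(pmem_addr, page, page_off, len);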

Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 arch/x86/kernel/cpu/mce/core.c |  6 +++---
 arch/x86/mm/pat/set_memory.c   | 23 +++++++++++------------
 drivers/nvdimm/pmem.c          | 30 +++++++-----------------------
 include/linux/set_memory.h     |  4 ++--
 4 files changed, 23 insertions(+), 40 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 981496e6bc0e..fa67bb9d1afe 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -579,7 +579,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 
 	pfn = mce->addr >> PAGE_SHIFT;
 	if (!memory_failure(pfn, 0)) {
-		set_mce_nospec(pfn, whole_page(mce));
+		set_mce_nospec(pfn);
 		mce->kflags |= MCE_HANDLED_UC;
 	}
 
@@ -1316,7 +1316,7 @@ static void kill_me_maybe(struct callback_head *cb)
 
 	ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
 	if (!ret) {
-		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+		set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
 		sync_core();
 		return;
 	}
@@ -1342,7 +1342,7 @@ static void kill_me_never(struct callback_head *cb)
 	p->mce_count = 0;
 	pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr);
 	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0))
-		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+		set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
 }
 
 static void queue_task_work(struct mce *m, char *msg, void (*func)(struct callback_head *))
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 978cf5bd2ab6..e3a5e55f3e08 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1925,13 +1925,8 @@ int set_memory_wb(unsigned long addr, int numpages)
 }
 EXPORT_SYMBOL(set_memory_wb);
 
-/*
- * Prevent speculative access to the page by either unmapping
- * it (if we do not require access to any part of the page) or
- * marking it uncacheable (if we want to try to retrieve data
- * from non-poisoned lines in the page).
- */
-int set_mce_nospec(unsigned long pfn, bool unmap)
+/* Prevent speculative access to a page by marking it not-present */
+int set_mce_nospec(unsigned long pfn)
 {
 	unsigned long decoy_addr;
 	int rc;
@@ -1956,19 +1951,23 @@ int set_mce_nospec(unsigned long pfn, bool unmap)
 	 */
 	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
 
-	if (unmap)
-		rc = set_memory_np(decoy_addr, 1);
-	else
-		rc = set_memory_uc(decoy_addr, 1);
+	rc = set_memory_np(decoy_addr, 1);
 	if (rc)
 		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
 	return rc;
 }
 
+static int set_memory_present(unsigned long *addr, int numpages)
+{
+	return change_page_attr_set(addr, numpages, __pgprot(_PAGE_PRESENT), 0);
+}
+
 /* Restore full speculative operation to the pfn. */
 int clear_mce_nospec(unsigned long pfn)
 {
-	return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
+	unsigned long addr = (unsigned long) pfn_to_kaddr(pfn);
+
+	return set_memory_present(&addr, 1);
 }
 EXPORT_SYMBOL_GPL(clear_mce_nospec);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 58d95242a836..4aa17132a557 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -158,36 +158,20 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
 			struct page *page, unsigned int page_off,
 			sector_t sector, unsigned int len)
 {
-	blk_status_t rc = BLK_STS_OK;
-	bool bad_pmem = false;
 	phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
 	void *pmem_addr = pmem->virt_addr + pmem_off;
 
-	if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
-		bad_pmem = true;
+	if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
+		blk_status_t rc = pmem_clear_poison(pmem, pmem_off, len);
+
+		if (rc != BLK_STS_OK)
+			return rc;
+	}
 
-	/*
-	 * Note that we write the data both before and after
-	 * clearing poison.  The write before clear poison
-	 * handles situations where the latest written data is
-	 * preserved and the clear poison operation simply marks
-	 * the address range as valid without changing the data.
-	 * In this case application software can assume that an
-	 * interrupted write will either return the new good
-	 * data or an error.
-	 *
-	 * However, if pmem_clear_poison() leaves the data in an
-	 * indeterminate state we need to perform the write
-	 * after clear poison.
-	 */
 	flush_dcache_page(page);
 	write_pmem(pmem_addr, page, page_off, len);
-	if (unlikely(bad_pmem)) {
-		rc = pmem_clear_poison(pmem, pmem_off, len);
-		write_pmem(pmem_addr, page, page_off, len);
-	}
 
-	return rc;
+	return BLK_STS_OK;
 }
 
 static void pmem_submit_bio(struct bio *bio)
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 683a6c3f7179..369769ce7399 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -43,10 +43,10 @@ static inline bool can_set_direct_map(void)
 #endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
 
 #ifdef CONFIG_X86_64
-int set_mce_nospec(unsigned long pfn, bool unmap);
+int set_mce_nospec(unsigned long pfn);
 int clear_mce_nospec(unsigned long pfn);
 #else
-static inline int set_mce_nospec(unsigned long pfn, bool unmap)
+static inline int set_mce_nospec(unsigned long pfn)
 {
 	return 0;
 }
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v8 4/7] dax: introduce DAX_RECOVERY_WRITE dax access mode
  2022-04-20  2:04 ` [dm-devel] " Jane Chu
@ 2022-04-20  2:04   ` Jane Chu
  -1 siblings, 0 replies; 44+ messages in thread
From: Jane Chu @ 2022-04-20  2:04 UTC (permalink / raw)
  To: dan.j.williams, bp, hch, dave.hansen, peterz, luto, david,
	djwong, linux-fsdevel, nvdimm, linux-kernel, x86
  Cc: vishal.l.verma, dave.jiang, agk, snitzer, dm-devel, ira.weiny,
	willy, vgoyal

Up till now, dax_direct_access() has been used implicitly for
normal access, but for the purpose of a recovery write, a dax
range with poison must be requested.  To make the interface clear,
introduce
	enum dax_access_mode {
		DAX_ACCESS,
		DAX_RECOVERY_WRITE,
	}
where DAX_ACCESS is used for normal dax access, and
DAX_RECOVERY_WRITE is used for dax recovery write.
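
A minimal caller sketch (assuming a poisoned range; see the
__pmem_direct_access() hunk below): a DAX_ACCESS request over poison
fails with -EIO, while a DAX_RECOVERY_WRITE request still returns the
mapping so the caller can attempt a repair, with the recovery stride
limited to the kernel page size per the firmware limitation noted in
the cover letter.

	nr = dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS, &kaddr, NULL);
	if (nr == -EIO)
		/* poisoned: re-request the mapping in recovery mode */
		nr = dax_direct_access(dax_dev, pgoff, 1,
				DAX_RECOVERY_WRITE, &kaddr, NULL);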

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 drivers/dax/super.c             |  5 ++--
 drivers/md/dm-linear.c          |  5 ++--
 drivers/md/dm-log-writes.c      |  5 ++--
 drivers/md/dm-stripe.c          |  5 ++--
 drivers/md/dm-target.c          |  4 ++-
 drivers/md/dm-writecache.c      |  7 +++---
 drivers/md/dm.c                 |  5 ++--
 drivers/nvdimm/pmem.c           | 44 +++++++++++++++++++++++++--------
 drivers/nvdimm/pmem.h           |  5 +++-
 drivers/s390/block/dcssblk.c    |  9 ++++---
 fs/dax.c                        |  9 ++++---
 fs/fuse/dax.c                   |  4 +--
 include/linux/dax.h             |  9 +++++--
 include/linux/device-mapper.h   |  4 ++-
 tools/testing/nvdimm/pmem-dax.c |  3 ++-
 15 files changed, 85 insertions(+), 38 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0211e6f7b47a..5405eb553430 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -117,6 +117,7 @@ enum dax_device_flags {
  * @dax_dev: a dax_device instance representing the logical memory range
  * @pgoff: offset in pages from the start of the device to translate
  * @nr_pages: number of consecutive pages caller can handle relative to @pfn
+ * @mode: indicator of normal access or recovery write
  * @kaddr: output parameter that returns a virtual address mapping of pfn
  * @pfn: output parameter that returns an absolute pfn translation of @pgoff
  *
@@ -124,7 +125,7 @@ enum dax_device_flags {
  * pages accessible at the device relative @pgoff.
  */
 long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
-		void **kaddr, pfn_t *pfn)
+		enum dax_access_mode mode, void **kaddr, pfn_t *pfn)
 {
 	long avail;
 
@@ -138,7 +139,7 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
 		return -EINVAL;
 
 	avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages,
-			kaddr, pfn);
+			mode, kaddr, pfn);
 	if (!avail)
 		return -ERANGE;
 	return min(avail, nr_pages);
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 76b486e4d2be..13e263299c9c 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -172,11 +172,12 @@ static struct dax_device *linear_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff)
 }
 
 static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn)
 {
 	struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff);
 
-	return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
+	return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
 }
 
 static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index c9d036d6bb2e..06bdbed65eb1 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -889,11 +889,12 @@ static struct dax_device *log_writes_dax_pgoff(struct dm_target *ti,
 }
 
 static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
-					 long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn)
 {
 	struct dax_device *dax_dev = log_writes_dax_pgoff(ti, &pgoff);
 
-	return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
+	return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
 }
 
 static int log_writes_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index c81d331d1afe..77d72900e997 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -315,11 +315,12 @@ static struct dax_device *stripe_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff)
 }
 
 static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn)
 {
 	struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff);
 
-	return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
+	return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
 }
 
 static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
index 64dd0b34fcf4..8cd5184e62f0 100644
--- a/drivers/md/dm-target.c
+++ b/drivers/md/dm-target.c
@@ -10,6 +10,7 @@
 #include <linux/init.h>
 #include <linux/kmod.h>
 #include <linux/bio.h>
+#include <linux/dax.h>
 
 #define DM_MSG_PREFIX "target"
 
@@ -142,7 +143,8 @@ static void io_err_release_clone_rq(struct request *clone,
 }
 
 static long io_err_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn)
 {
 	return -EIO;
 }
diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index 5630b470ba42..d74c5a7a0ab4 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -286,7 +286,8 @@ static int persistent_memory_claim(struct dm_writecache *wc)
 
 	id = dax_read_lock();
 
-	da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, &wc->memory_map, &pfn);
+	da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, DAX_ACCESS,
+			&wc->memory_map, &pfn);
 	if (da < 0) {
 		wc->memory_map = NULL;
 		r = da;
@@ -308,8 +309,8 @@ static int persistent_memory_claim(struct dm_writecache *wc)
 		i = 0;
 		do {
 			long daa;
-			daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i, p - i,
-						NULL, &pfn);
+			daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i,
+					p - i, DAX_ACCESS, NULL, &pfn);
 			if (daa <= 0) {
 				r = daa ? daa : -EINVAL;
 				goto err3;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3c5fad7c4ee6..8258676a352f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1093,7 +1093,8 @@ static struct dm_target *dm_dax_get_live_target(struct mapped_device *md,
 }
 
 static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
-				 long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn)
 {
 	struct mapped_device *md = dax_get_private(dax_dev);
 	sector_t sector = pgoff * PAGE_SECTORS;
@@ -1111,7 +1112,7 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
 	if (len < 1)
 		goto out;
 	nr_pages = min(len, nr_pages);
-	ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
+	ret = ti->type->direct_access(ti, pgoff, nr_pages, mode, kaddr, pfn);
 
  out:
 	dm_put_live_table(md, srcu_idx);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4aa17132a557..c77b7cf19639 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -239,24 +239,47 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 
 /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
 __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn)
 {
 	resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
-
-	if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
-					PFN_PHYS(nr_pages))))
-		return -EIO;
+	sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
+	unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT;
+	struct badblocks *bb = &pmem->bb;
+	sector_t first_bad;
+	int num_bad;
 
 	if (kaddr)
 		*kaddr = pmem->virt_addr + offset;
 	if (pfn)
 		*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
 
+	if (bb->count &&
+		badblocks_check(bb, sector, num, &first_bad, &num_bad)) {
+		long actual_nr;
+
+		if (mode != DAX_RECOVERY_WRITE)
+			return -EIO;
+
+		/*
+		 * The recovery stride is set to the kernel page size because
+		 * the underlying driver and firmware clear poison functions
+		 * don't appear to handle large chunks (such as 2MiB) reliably.
+		 */
+		actual_nr = PHYS_PFN(
+			PAGE_ALIGN((first_bad - sector) << SECTOR_SHIFT));
+		dev_dbg(pmem->bb.dev, "start sector(%llu), nr_pages(%ld), first_bad(%llu), actual_nr(%ld)\n",
+				sector, nr_pages, first_bad, actual_nr);
+		if (actual_nr)
+			return actual_nr;
+		return 1;
+	}
+
 	/*
-	 * If badblocks are present, limit known good range to the
-	 * requested range.
+	 * If badblocks are present but not in the range, limit known good range
+	 * to the requested range.
 	 */
-	if (unlikely(pmem->bb.count))
+	if (bb->count)
 		return nr_pages;
 	return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
 }
@@ -278,11 +301,12 @@ static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 }
 
 static long pmem_dax_direct_access(struct dax_device *dax_dev,
-		pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
+		pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+		void **kaddr, pfn_t *pfn)
 {
 	struct pmem_device *pmem = dax_get_private(dax_dev);
 
-	return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
+	return __pmem_direct_access(pmem, pgoff, nr_pages, mode, kaddr, pfn);
 }
 
 static const struct dax_operations pmem_dax_ops = {
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 1f51a2361429..392b0b38acb9 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -8,6 +8,8 @@
 #include <linux/pfn_t.h>
 #include <linux/fs.h>
 
+enum dax_access_mode;
+
 /* this definition is in it's own header for tools/testing/nvdimm to consume */
 struct pmem_device {
 	/* One contiguous memory region per device */
@@ -28,7 +30,8 @@ struct pmem_device {
 };
 
 long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn);
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn);
 
 #ifdef CONFIG_MEMORY_FAILURE
 static inline bool test_and_clear_pmem_poison(struct page *page)
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index d614843caf6c..8d0d0eaa3059 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -32,7 +32,8 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static void dcssblk_submit_bio(struct bio *bio);
 static long dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn);
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -50,7 +51,8 @@ static int dcssblk_dax_zero_page_range(struct dax_device *dax_dev,
 	long rc;
 	void *kaddr;
 
-	rc = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, NULL);
+	rc = dax_direct_access(dax_dev, pgoff, nr_pages, DAX_ACCESS,
+			&kaddr, NULL);
 	if (rc < 0)
 		return rc;
 	memset(kaddr, 0, nr_pages << PAGE_SHIFT);
@@ -927,7 +929,8 @@ __dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff,
 
 static long
 dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn)
 {
 	struct dcssblk_dev_info *dev_info = dax_get_private(dax_dev);
 
diff --git a/fs/dax.c b/fs/dax.c
index 67a08a32fccb..ef3103107104 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -721,7 +721,8 @@ static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter *iter
 	int id;
 
 	id = dax_read_lock();
-	rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, &kaddr, NULL);
+	rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, DAX_ACCESS,
+				&kaddr, NULL);
 	if (rc < 0) {
 		dax_read_unlock(id);
 		return rc;
@@ -1013,7 +1014,7 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
 
 	id = dax_read_lock();
 	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
-				   NULL, pfnp);
+				   DAX_ACCESS, NULL, pfnp);
 	if (length < 0) {
 		rc = length;
 		goto out;
@@ -1122,7 +1123,7 @@ static int dax_memzero(struct dax_device *dax_dev, pgoff_t pgoff,
 	void *kaddr;
 	long ret;
 
-	ret = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
+	ret = dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS, &kaddr, NULL);
 	if (ret > 0) {
 		memset(kaddr + offset, 0, size);
 		dax_flush(dax_dev, kaddr + offset, size);
@@ -1247,7 +1248,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 		}
 
 		map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
-				&kaddr, NULL);
+				DAX_ACCESS, &kaddr, NULL);
 		if (map_len < 0) {
 			ret = map_len;
 			break;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index d7d3a7f06862..10eb50cbf398 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -1241,8 +1241,8 @@ static int fuse_dax_mem_range_init(struct fuse_conn_dax *fcd)
 	INIT_DELAYED_WORK(&fcd->free_work, fuse_dax_free_mem_worker);
 
 	id = dax_read_lock();
-	nr_pages = dax_direct_access(fcd->dev, 0, PHYS_PFN(dax_size), NULL,
-				     NULL);
+	nr_pages = dax_direct_access(fcd->dev, 0, PHYS_PFN(dax_size),
+			DAX_ACCESS, NULL, NULL);
 	dax_read_unlock(id);
 	if (nr_pages < 0) {
 		pr_debug("dax_direct_access() returned %ld\n", nr_pages);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9fc5f99a0ae2..3f1339bce3c0 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -14,6 +14,11 @@ struct iomap_ops;
 struct iomap_iter;
 struct iomap;
 
+enum dax_access_mode {
+	DAX_ACCESS,
+	DAX_RECOVERY_WRITE,
+};
+
 struct dax_operations {
 	/*
 	 * direct_access: translate a device-relative
@@ -21,7 +26,7 @@ struct dax_operations {
 	 * number of pages available for DAX at that pfn.
 	 */
 	long (*direct_access)(struct dax_device *, pgoff_t, long,
-			void **, pfn_t *);
+			enum dax_access_mode, void **, pfn_t *);
 	/*
 	 * Validate whether this device is usable as an fsdax backing
 	 * device.
@@ -178,7 +183,7 @@ static inline void dax_read_unlock(int id)
 bool dax_alive(struct dax_device *dax_dev);
 void *dax_get_private(struct dax_device *dax_dev);
 long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
-		void **kaddr, pfn_t *pfn);
+		enum dax_access_mode mode, void **kaddr, pfn_t *pfn);
 size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 		size_t bytes, struct iov_iter *i);
 size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index c2a3758c4aaa..acdedda0d12b 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -20,6 +20,7 @@ struct dm_table;
 struct dm_report_zones_args;
 struct mapped_device;
 struct bio_vec;
+enum dax_access_mode;
 
 /*
  * Type of table, mapped_device's mempool and request_queue
@@ -146,7 +147,8 @@ typedef int (*dm_busy_fn) (struct dm_target *ti);
  * >= 0 : the number of bytes accessible at the address
  */
 typedef long (*dm_dax_direct_access_fn) (struct dm_target *ti, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn);
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn);
 typedef int (*dm_dax_zero_page_range_fn)(struct dm_target *ti, pgoff_t pgoff,
 		size_t nr_pages);
 
diff --git a/tools/testing/nvdimm/pmem-dax.c b/tools/testing/nvdimm/pmem-dax.c
index af19c85558e7..dcc328eba811 100644
--- a/tools/testing/nvdimm/pmem-dax.c
+++ b/tools/testing/nvdimm/pmem-dax.c
@@ -8,7 +8,8 @@
 #include <nd.h>
 
 long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn)
 {
 	resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
 
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [dm-devel] [PATCH v8 4/7] dax: introduce DAX_RECOVERY_WRITE dax access mode
@ 2022-04-20  2:04   ` Jane Chu
  0 siblings, 0 replies; 44+ messages in thread
From: Jane Chu @ 2022-04-20  2:04 UTC (permalink / raw)
  To: dan.j.williams, bp, hch, dave.hansen, peterz, luto, david,
	djwong, linux-fsdevel, nvdimm, linux-kernel, x86
  Cc: dave.jiang, snitzer, vishal.l.verma, willy, dm-devel, vgoyal,
	ira.weiny, agk

Up till now, dax_direct_access() is used implicitly for normal
access, but for the purpose of recovery write, dax range with
poison is requested.  To make the interface clear, introduce
	enum dax_access_mode {
		DAX_ACCESS,
		DAX_RECOVERY_WRITE,
	}
where DAX_ACCESS is used for normal dax access, and
DAX_RECOVERY_WRITE is used for dax recovery write.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 drivers/dax/super.c             |  5 ++--
 drivers/md/dm-linear.c          |  5 ++--
 drivers/md/dm-log-writes.c      |  5 ++--
 drivers/md/dm-stripe.c          |  5 ++--
 drivers/md/dm-target.c          |  4 ++-
 drivers/md/dm-writecache.c      |  7 +++---
 drivers/md/dm.c                 |  5 ++--
 drivers/nvdimm/pmem.c           | 44 +++++++++++++++++++++++++--------
 drivers/nvdimm/pmem.h           |  5 +++-
 drivers/s390/block/dcssblk.c    |  9 ++++---
 fs/dax.c                        |  9 ++++---
 fs/fuse/dax.c                   |  4 +--
 include/linux/dax.h             |  9 +++++--
 include/linux/device-mapper.h   |  4 ++-
 tools/testing/nvdimm/pmem-dax.c |  3 ++-
 15 files changed, 85 insertions(+), 38 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0211e6f7b47a..5405eb553430 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -117,6 +117,7 @@ enum dax_device_flags {
  * @dax_dev: a dax_device instance representing the logical memory range
  * @pgoff: offset in pages from the start of the device to translate
  * @nr_pages: number of consecutive pages caller can handle relative to @pfn
+ * @mode: whether this is a normal access or a recovery write
  * @kaddr: output parameter that returns a virtual address mapping of pfn
  * @pfn: output parameter that returns an absolute pfn translation of @pgoff
  *
@@ -124,7 +125,7 @@ enum dax_device_flags {
  * pages accessible at the device relative @pgoff.
  */
 long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
-		void **kaddr, pfn_t *pfn)
+		enum dax_access_mode mode, void **kaddr, pfn_t *pfn)
 {
 	long avail;
 
@@ -138,7 +139,7 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
 		return -EINVAL;
 
 	avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages,
-			kaddr, pfn);
+			mode, kaddr, pfn);
 	if (!avail)
 		return -ERANGE;
 	return min(avail, nr_pages);
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 76b486e4d2be..13e263299c9c 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -172,11 +172,12 @@ static struct dax_device *linear_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff)
 }
 
 static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn)
 {
 	struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff);
 
-	return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
+	return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
 }
 
 static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index c9d036d6bb2e..06bdbed65eb1 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -889,11 +889,12 @@ static struct dax_device *log_writes_dax_pgoff(struct dm_target *ti,
 }
 
 static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
-					 long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn)
 {
 	struct dax_device *dax_dev = log_writes_dax_pgoff(ti, &pgoff);
 
-	return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
+	return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
 }
 
 static int log_writes_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index c81d331d1afe..77d72900e997 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -315,11 +315,12 @@ static struct dax_device *stripe_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff)
 }
 
 static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn)
 {
 	struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff);
 
-	return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
+	return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
 }
 
 static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
index 64dd0b34fcf4..8cd5184e62f0 100644
--- a/drivers/md/dm-target.c
+++ b/drivers/md/dm-target.c
@@ -10,6 +10,7 @@
 #include <linux/init.h>
 #include <linux/kmod.h>
 #include <linux/bio.h>
+#include <linux/dax.h>
 
 #define DM_MSG_PREFIX "target"
 
@@ -142,7 +143,8 @@ static void io_err_release_clone_rq(struct request *clone,
 }
 
 static long io_err_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn)
 {
 	return -EIO;
 }
diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index 5630b470ba42..d74c5a7a0ab4 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -286,7 +286,8 @@ static int persistent_memory_claim(struct dm_writecache *wc)
 
 	id = dax_read_lock();
 
-	da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, &wc->memory_map, &pfn);
+	da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, DAX_ACCESS,
+			&wc->memory_map, &pfn);
 	if (da < 0) {
 		wc->memory_map = NULL;
 		r = da;
@@ -308,8 +309,8 @@ static int persistent_memory_claim(struct dm_writecache *wc)
 		i = 0;
 		do {
 			long daa;
-			daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i, p - i,
-						NULL, &pfn);
+			daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i,
+					p - i, DAX_ACCESS, NULL, &pfn);
 			if (daa <= 0) {
 				r = daa ? daa : -EINVAL;
 				goto err3;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3c5fad7c4ee6..8258676a352f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1093,7 +1093,8 @@ static struct dm_target *dm_dax_get_live_target(struct mapped_device *md,
 }
 
 static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
-				 long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn)
 {
 	struct mapped_device *md = dax_get_private(dax_dev);
 	sector_t sector = pgoff * PAGE_SECTORS;
@@ -1111,7 +1112,7 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
 	if (len < 1)
 		goto out;
 	nr_pages = min(len, nr_pages);
-	ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
+	ret = ti->type->direct_access(ti, pgoff, nr_pages, mode, kaddr, pfn);
 
  out:
 	dm_put_live_table(md, srcu_idx);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4aa17132a557..c77b7cf19639 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -239,24 +239,47 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 
 /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
 __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn)
 {
 	resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
-
-	if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
-					PFN_PHYS(nr_pages))))
-		return -EIO;
+	sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
+	unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT;
+	struct badblocks *bb = &pmem->bb;
+	sector_t first_bad;
+	int num_bad;
 
 	if (kaddr)
 		*kaddr = pmem->virt_addr + offset;
 	if (pfn)
 		*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
 
+	if (bb->count &&
+		badblocks_check(bb, sector, num, &first_bad, &num_bad)) {
+		long actual_nr;
+
+		if (mode != DAX_RECOVERY_WRITE)
+			return -EIO;
+
+		/*
+		 * The recovery stride is set to the kernel page size because
+		 * the underlying driver and firmware clear-poison functions
+		 * don't appear to handle large chunks (such as 2MiB) reliably.
+		 */
+		actual_nr = PHYS_PFN(
+			PAGE_ALIGN((first_bad - sector) << SECTOR_SHIFT));
+		dev_dbg(pmem->bb.dev, "start sector(%llu), nr_pages(%ld), first_bad(%llu), actual_nr(%ld)\n",
+				sector, nr_pages, first_bad, actual_nr);
+		if (actual_nr)
+			return actual_nr;
+		return 1;
+	}
+
 	/*
-	 * If badblocks are present, limit known good range to the
-	 * requested range.
+	 * If badblocks are present but not in the range, limit known good range
+	 * to the requested range.
 	 */
-	if (unlikely(pmem->bb.count))
+	if (bb->count)
 		return nr_pages;
 	return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
 }
@@ -278,11 +301,12 @@ static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 }
 
 static long pmem_dax_direct_access(struct dax_device *dax_dev,
-		pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
+		pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+		void **kaddr, pfn_t *pfn)
 {
 	struct pmem_device *pmem = dax_get_private(dax_dev);
 
-	return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
+	return __pmem_direct_access(pmem, pgoff, nr_pages, mode, kaddr, pfn);
 }
 
 static const struct dax_operations pmem_dax_ops = {
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 1f51a2361429..392b0b38acb9 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -8,6 +8,8 @@
 #include <linux/pfn_t.h>
 #include <linux/fs.h>
 
+enum dax_access_mode;
+
 /* this definition is in it's own header for tools/testing/nvdimm to consume */
 struct pmem_device {
 	/* One contiguous memory region per device */
@@ -28,7 +30,8 @@ struct pmem_device {
 };
 
 long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn);
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn);
 
 #ifdef CONFIG_MEMORY_FAILURE
 static inline bool test_and_clear_pmem_poison(struct page *page)
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index d614843caf6c..8d0d0eaa3059 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -32,7 +32,8 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static void dcssblk_submit_bio(struct bio *bio);
 static long dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn);
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -50,7 +51,8 @@ static int dcssblk_dax_zero_page_range(struct dax_device *dax_dev,
 	long rc;
 	void *kaddr;
 
-	rc = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, NULL);
+	rc = dax_direct_access(dax_dev, pgoff, nr_pages, DAX_ACCESS,
+			&kaddr, NULL);
 	if (rc < 0)
 		return rc;
 	memset(kaddr, 0, nr_pages << PAGE_SHIFT);
@@ -927,7 +929,8 @@ __dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff,
 
 static long
 dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn)
 {
 	struct dcssblk_dev_info *dev_info = dax_get_private(dax_dev);
 
diff --git a/fs/dax.c b/fs/dax.c
index 67a08a32fccb..ef3103107104 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -721,7 +721,8 @@ static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter *iter
 	int id;
 
 	id = dax_read_lock();
-	rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, &kaddr, NULL);
+	rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, DAX_ACCESS,
+				&kaddr, NULL);
 	if (rc < 0) {
 		dax_read_unlock(id);
 		return rc;
@@ -1013,7 +1014,7 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
 
 	id = dax_read_lock();
 	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
-				   NULL, pfnp);
+				   DAX_ACCESS, NULL, pfnp);
 	if (length < 0) {
 		rc = length;
 		goto out;
@@ -1122,7 +1123,7 @@ static int dax_memzero(struct dax_device *dax_dev, pgoff_t pgoff,
 	void *kaddr;
 	long ret;
 
-	ret = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
+	ret = dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS, &kaddr, NULL);
 	if (ret > 0) {
 		memset(kaddr + offset, 0, size);
 		dax_flush(dax_dev, kaddr + offset, size);
@@ -1247,7 +1248,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 		}
 
 		map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
-				&kaddr, NULL);
+				DAX_ACCESS, &kaddr, NULL);
 		if (map_len < 0) {
 			ret = map_len;
 			break;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index d7d3a7f06862..10eb50cbf398 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -1241,8 +1241,8 @@ static int fuse_dax_mem_range_init(struct fuse_conn_dax *fcd)
 	INIT_DELAYED_WORK(&fcd->free_work, fuse_dax_free_mem_worker);
 
 	id = dax_read_lock();
-	nr_pages = dax_direct_access(fcd->dev, 0, PHYS_PFN(dax_size), NULL,
-				     NULL);
+	nr_pages = dax_direct_access(fcd->dev, 0, PHYS_PFN(dax_size),
+			DAX_ACCESS, NULL, NULL);
 	dax_read_unlock(id);
 	if (nr_pages < 0) {
 		pr_debug("dax_direct_access() returned %ld\n", nr_pages);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9fc5f99a0ae2..3f1339bce3c0 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -14,6 +14,11 @@ struct iomap_ops;
 struct iomap_iter;
 struct iomap;
 
+enum dax_access_mode {
+	DAX_ACCESS,
+	DAX_RECOVERY_WRITE,
+};
+
 struct dax_operations {
 	/*
 	 * direct_access: translate a device-relative
@@ -21,7 +26,7 @@ struct dax_operations {
 	 * number of pages available for DAX at that pfn.
 	 */
 	long (*direct_access)(struct dax_device *, pgoff_t, long,
-			void **, pfn_t *);
+			enum dax_access_mode, void **, pfn_t *);
 	/*
 	 * Validate whether this device is usable as an fsdax backing
 	 * device.
@@ -178,7 +183,7 @@ static inline void dax_read_unlock(int id)
 bool dax_alive(struct dax_device *dax_dev);
 void *dax_get_private(struct dax_device *dax_dev);
 long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
-		void **kaddr, pfn_t *pfn);
+		enum dax_access_mode mode, void **kaddr, pfn_t *pfn);
 size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 		size_t bytes, struct iov_iter *i);
 size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index c2a3758c4aaa..acdedda0d12b 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -20,6 +20,7 @@ struct dm_table;
 struct dm_report_zones_args;
 struct mapped_device;
 struct bio_vec;
+enum dax_access_mode;
 
 /*
  * Type of table, mapped_device's mempool and request_queue
@@ -146,7 +147,8 @@ typedef int (*dm_busy_fn) (struct dm_target *ti);
  * >= 0 : the number of bytes accessible at the address
  */
 typedef long (*dm_dax_direct_access_fn) (struct dm_target *ti, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn);
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn);
 typedef int (*dm_dax_zero_page_range_fn)(struct dm_target *ti, pgoff_t pgoff,
 		size_t nr_pages);
 
diff --git a/tools/testing/nvdimm/pmem-dax.c b/tools/testing/nvdimm/pmem-dax.c
index af19c85558e7..dcc328eba811 100644
--- a/tools/testing/nvdimm/pmem-dax.c
+++ b/tools/testing/nvdimm/pmem-dax.c
@@ -8,7 +8,8 @@
 #include <nd.h>
 
 long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		pfn_t *pfn)
 {
 	resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
 
-- 
2.18.4

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v8 5/7] dax: add .recovery_write dax_operation
  2022-04-20  2:04 ` [dm-devel] " Jane Chu
@ 2022-04-20  2:04   ` Jane Chu
  -1 siblings, 0 replies; 44+ messages in thread
From: Jane Chu @ 2022-04-20  2:04 UTC (permalink / raw)
  To: dan.j.williams, bp, hch, dave.hansen, peterz, luto, david,
	djwong, linux-fsdevel, nvdimm, linux-kernel, x86
  Cc: vishal.l.verma, dave.jiang, agk, snitzer, dm-devel, ira.weiny,
	willy, vgoyal

Introduce the dax_recovery_write() operation. The function is used to
recover a dax range that contains poison. The typical use case is a
user process receiving a SIGBUS with si_code BUS_MCEERR_AR indicating
poison in a dax range; in response, the user process issues a pwrite()
to the page-aligned dax range, which clears the poison and puts valid
data in the range.
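
The user-space half of that flow might look roughly like the sketch
below; the fd, the offset and the data-restoration helper are
illustrative assumptions, not part of this patch:

	/* sketch: rewrite the whole poisoned page with valid data */
	char buf[4096] __attribute__((aligned(4096)));

	restore_from_backup(buf, sizeof(buf));	/* hypothetical helper */
	if (pwrite(fd, buf, sizeof(buf), poisoned_off) != sizeof(buf))
		perror("dax recovery pwrite");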

Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 drivers/dax/super.c           |  9 +++++++++
 drivers/md/dm-linear.c        | 10 ++++++++++
 drivers/md/dm-log-writes.c    | 10 ++++++++++
 drivers/md/dm-stripe.c        | 10 ++++++++++
 drivers/md/dm.c               | 20 ++++++++++++++++++++
 drivers/nvdimm/pmem.c         |  7 +++++++
 fs/dax.c                      | 13 ++++++++++++-
 include/linux/dax.h           | 13 +++++++++++++
 include/linux/device-mapper.h |  9 +++++++++
 9 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 5405eb553430..50a08b2ec247 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -195,6 +195,15 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 }
 EXPORT_SYMBOL_GPL(dax_zero_page_range);
 
+size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *iter)
+{
+	if (!dax_dev->ops->recovery_write)
+		return 0;
+	return dax_dev->ops->recovery_write(dax_dev, pgoff, addr, bytes, iter);
+}
+EXPORT_SYMBOL_GPL(dax_recovery_write);
+
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 13e263299c9c..cdf48bc8c5b0 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -188,9 +188,18 @@ static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
 	return dax_zero_page_range(dax_dev, pgoff, nr_pages);
 }
 
+static size_t linear_dax_recovery_write(struct dm_target *ti, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff);
+
+	return dax_recovery_write(dax_dev, pgoff, addr, bytes, i);
+}
+
 #else
 #define linear_dax_direct_access NULL
 #define linear_dax_zero_page_range NULL
+#define linear_dax_recovery_write NULL
 #endif
 
 static struct target_type linear_target = {
@@ -208,6 +217,7 @@ static struct target_type linear_target = {
 	.iterate_devices = linear_iterate_devices,
 	.direct_access = linear_dax_direct_access,
 	.dax_zero_page_range = linear_dax_zero_page_range,
+	.dax_recovery_write = linear_dax_recovery_write,
 };
 
 int __init dm_linear_init(void)
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index 06bdbed65eb1..22739dccdd17 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -905,9 +905,18 @@ static int log_writes_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
 	return dax_zero_page_range(dax_dev, pgoff, nr_pages << PAGE_SHIFT);
 }
 
+static size_t log_writes_dax_recovery_write(struct dm_target *ti,
+		pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i)
+{
+	struct dax_device *dax_dev = log_writes_dax_pgoff(ti, &pgoff);
+
+	return dax_recovery_write(dax_dev, pgoff, addr, bytes, i);
+}
+
 #else
 #define log_writes_dax_direct_access NULL
 #define log_writes_dax_zero_page_range NULL
+#define log_writes_dax_recovery_write NULL
 #endif
 
 static struct target_type log_writes_target = {
@@ -925,6 +934,7 @@ static struct target_type log_writes_target = {
 	.io_hints = log_writes_io_hints,
 	.direct_access = log_writes_dax_direct_access,
 	.dax_zero_page_range = log_writes_dax_zero_page_range,
+	.dax_recovery_write = log_writes_dax_recovery_write,
 };
 
 static int __init dm_log_writes_init(void)
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 77d72900e997..baa085cc67bd 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -331,9 +331,18 @@ static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
 	return dax_zero_page_range(dax_dev, pgoff, nr_pages);
 }
 
+static size_t stripe_dax_recovery_write(struct dm_target *ti, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff);
+
+	return dax_recovery_write(dax_dev, pgoff, addr, bytes, i);
+}
+
 #else
 #define stripe_dax_direct_access NULL
 #define stripe_dax_zero_page_range NULL
+#define stripe_dax_recovery_write NULL
 #endif
 
 /*
@@ -470,6 +479,7 @@ static struct target_type stripe_target = {
 	.io_hints = stripe_io_hints,
 	.direct_access = stripe_dax_direct_access,
 	.dax_zero_page_range = stripe_dax_zero_page_range,
+	.dax_recovery_write = stripe_dax_recovery_write,
 };
 
 int __init dm_stripe_init(void)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8258676a352f..5374c8aba2d6 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1147,6 +1147,25 @@ static int dm_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 	return ret;
 }
 
+static size_t dm_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	struct mapped_device *md = dax_get_private(dax_dev);
+	sector_t sector = pgoff * PAGE_SECTORS;
+	struct dm_target *ti;
+	int srcu_idx;
+	long ret = 0;
+
+	ti = dm_dax_get_live_target(md, sector, &srcu_idx);
+	if (!ti || !ti->type->dax_recovery_write)
+		goto out;
+
+	ret = ti->type->dax_recovery_write(ti, pgoff, addr, bytes, i);
+out:
+	dm_put_live_table(md, srcu_idx);
+	return ret;
+}
+
 /*
  * A target may call dm_accept_partial_bio only from the map routine.  It is
  * allowed for all bio types except REQ_PREFLUSH, REQ_OP_ZONE_* zone management
@@ -3151,6 +3170,7 @@ static const struct block_device_operations dm_rq_blk_dops = {
 static const struct dax_operations dm_dax_ops = {
 	.direct_access = dm_dax_direct_access,
 	.zero_page_range = dm_dax_zero_page_range,
+	.recovery_write = dm_dax_recovery_write,
 };
 
 /*
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index c77b7cf19639..3c0cad38ec33 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -309,9 +309,16 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
 	return __pmem_direct_access(pmem, pgoff, nr_pages, mode, kaddr, pfn);
 }
 
+static size_t pmem_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	return 0;
+}
+
 static const struct dax_operations pmem_dax_ops = {
 	.direct_access = pmem_dax_direct_access,
 	.zero_page_range = pmem_dax_zero_page_range,
+	.recovery_write = pmem_recovery_write,
 };
 
 static ssize_t write_cache_show(struct device *dev,
diff --git a/fs/dax.c b/fs/dax.c
index ef3103107104..a1e4b45cbf55 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1240,6 +1240,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 		const size_t size = ALIGN(length + offset, PAGE_SIZE);
 		pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
 		ssize_t map_len;
+		bool recovery = false;
 		void *kaddr;
 
 		if (fatal_signal_pending(current)) {
@@ -1249,6 +1250,13 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 
 		map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
 				DAX_ACCESS, &kaddr, NULL);
+		if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
+			map_len = dax_direct_access(dax_dev, pgoff,
+					PHYS_PFN(size), DAX_RECOVERY_WRITE,
+					&kaddr, NULL);
+			if (map_len > 0)
+				recovery = true;
+		}
 		if (map_len < 0) {
 			ret = map_len;
 			break;
@@ -1260,7 +1268,10 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 		if (map_len > end - pos)
 			map_len = end - pos;
 
-		if (iov_iter_rw(iter) == WRITE)
+		if (recovery)
+			xfer = dax_recovery_write(dax_dev, pgoff, kaddr,
+					map_len, iter);
+		else if (iov_iter_rw(iter) == WRITE)
 			xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
 					map_len, iter);
 		else
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 3f1339bce3c0..e7b81634c52a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -35,6 +35,12 @@ struct dax_operations {
 			sector_t, sector_t);
 	/* zero_page_range: required operation. Zero page range   */
 	int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
+	/*
+	 * recovery_write: recover a poisoned range by DAX device driver
+	 * capable of clearing poison.
+	 */
+	size_t (*recovery_write)(struct dax_device *dax_dev, pgoff_t pgoff,
+			void *addr, size_t bytes, struct iov_iter *iter);
 };
 
 #if IS_ENABLED(CONFIG_DAX)
@@ -45,6 +51,8 @@ void dax_write_cache(struct dax_device *dax_dev, bool wc);
 bool dax_write_cache_enabled(struct dax_device *dax_dev);
 bool dax_synchronous(struct dax_device *dax_dev);
 void set_dax_synchronous(struct dax_device *dax_dev);
+size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i);
 /*
  * Check if given mapping is supported by the file / underlying device.
  */
@@ -92,6 +100,11 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
 {
 	return !(vma->vm_flags & VM_SYNC);
 }
+static inline size_t dax_recovery_write(struct dax_device *dax_dev,
+		pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i)
+{
+	return 0;
+}
 #endif
 
 void set_dax_nocache(struct dax_device *dax_dev);
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index acdedda0d12b..47a01c7cffdf 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -152,6 +152,14 @@ typedef long (*dm_dax_direct_access_fn) (struct dm_target *ti, pgoff_t pgoff,
 typedef int (*dm_dax_zero_page_range_fn)(struct dm_target *ti, pgoff_t pgoff,
 		size_t nr_pages);
 
+/*
+ * Returns:
+ * != 0 : number of bytes transferred
+ * 0    : recovery write failed
+ */
+typedef size_t (*dm_dax_recovery_write_fn)(struct dm_target *ti, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i);
+
 void dm_error(const char *message);
 
 struct dm_dev {
@@ -201,6 +209,7 @@ struct target_type {
 	dm_io_hints_fn io_hints;
 	dm_dax_direct_access_fn direct_access;
 	dm_dax_zero_page_range_fn dax_zero_page_range;
+	dm_dax_recovery_write_fn dax_recovery_write;
 
 	/* For internal device-mapper use. */
 	struct list_head list;
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v8 6/7] pmem: refactor pmem_clear_poison()
  2022-04-20  2:04 ` [dm-devel] " Jane Chu
@ 2022-04-20  2:04   ` Jane Chu
  -1 siblings, 0 replies; 44+ messages in thread
From: Jane Chu @ 2022-04-20  2:04 UTC (permalink / raw)
  To: dan.j.williams, bp, hch, dave.hansen, peterz, luto, david,
	djwong, linux-fsdevel, nvdimm, linux-kernel, x86
  Cc: vishal.l.verma, dave.jiang, agk, snitzer, dm-devel, ira.weiny,
	willy, vgoyal

Refactor the pmem_clear_poison() function so that the code shared
between the typical write path and the recovery write path is
factored out.
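
The new to_phys()/to_sect()/to_offset() helpers make the address
arithmetic explicit; as a quick sanity sketch (SECTOR_SHIFT is 9),
to_offset() and to_sect() are inverses of each other:

	/* sketch: sector -> device offset -> sector round-trips */
	phys_addr_t off = to_offset(pmem, sector); /* (sector << 9) + data_offset */
	WARN_ON(to_sect(pmem, off) != sector);	   /* (off - data_offset) >> 9 */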

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 drivers/nvdimm/pmem.c | 73 ++++++++++++++++++++++++++++---------------
 1 file changed, 48 insertions(+), 25 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3c0cad38ec33..c3772304d417 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -45,9 +45,25 @@ static struct nd_region *to_region(struct pmem_device *pmem)
 	return to_nd_region(to_dev(pmem)->parent);
 }
 
-static void hwpoison_clear(struct pmem_device *pmem,
-		phys_addr_t phys, unsigned int len)
+static phys_addr_t to_phys(struct pmem_device *pmem, phys_addr_t offset)
 {
+	return pmem->phys_addr + offset;
+}
+
+static sector_t to_sect(struct pmem_device *pmem, phys_addr_t offset)
+{
+	return (offset - pmem->data_offset) >> SECTOR_SHIFT;
+}
+
+static phys_addr_t to_offset(struct pmem_device *pmem, sector_t sector)
+{
+	return (sector << SECTOR_SHIFT) + pmem->data_offset;
+}
+
+static void pmem_mkpage_present(struct pmem_device *pmem, phys_addr_t offset,
+		unsigned int len)
+{
+	phys_addr_t phys = to_phys(pmem, offset);
 	unsigned long pfn_start, pfn_end, pfn;
 
 	/* only pmem in the linear map supports HWPoison */
@@ -69,33 +85,40 @@ static void hwpoison_clear(struct pmem_device *pmem,
 	}
 }
 
-static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
-		phys_addr_t offset, unsigned int len)
+static void pmem_clear_bb(struct pmem_device *pmem, sector_t sector, long blks)
 {
-	struct device *dev = to_dev(pmem);
-	sector_t sector;
-	long cleared;
-	blk_status_t rc = BLK_STS_OK;
+	if (blks == 0)
+		return;
+	badblocks_clear(&pmem->bb, sector, blks);
+	if (pmem->bb_state)
+		sysfs_notify_dirent(pmem->bb_state);
+}
 
-	sector = (offset - pmem->data_offset) / 512;
+static long __pmem_clear_poison(struct pmem_device *pmem,
+		phys_addr_t offset, unsigned int len)
+{
+	phys_addr_t phys = to_phys(pmem, offset);
+	long cleared = nvdimm_clear_poison(to_dev(pmem), phys, len);
 
-	cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
-	if (cleared < len)
-		rc = BLK_STS_IOERR;
-	if (cleared > 0 && cleared / 512) {
-		hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
-		cleared /= 512;
-		dev_dbg(dev, "%#llx clear %ld sector%s\n",
-				(unsigned long long) sector, cleared,
-				cleared > 1 ? "s" : "");
-		badblocks_clear(&pmem->bb, sector, cleared);
-		if (pmem->bb_state)
-			sysfs_notify_dirent(pmem->bb_state);
+	if (cleared > 0) {
+		pmem_mkpage_present(pmem, offset, cleared);
+		arch_invalidate_pmem(pmem->virt_addr + offset, len);
 	}
+	return cleared;
+}
 
-	arch_invalidate_pmem(pmem->virt_addr + offset, len);
+static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
+		phys_addr_t offset, unsigned int len)
+{
+	long cleared = __pmem_clear_poison(pmem, offset, len);
 
-	return rc;
+	if (cleared < 0)
+		return BLK_STS_IOERR;
+
+	pmem_clear_bb(pmem, to_sect(pmem, offset), cleared >> SECTOR_SHIFT);
+	if (cleared < len)
+		return BLK_STS_IOERR;
+	return BLK_STS_OK;
 }
 
 static void write_pmem(void *pmem_addr, struct page *page,
@@ -143,7 +166,7 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
 			sector_t sector, unsigned int len)
 {
 	blk_status_t rc;
-	phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
+	phys_addr_t pmem_off = to_offset(pmem, sector);
 	void *pmem_addr = pmem->virt_addr + pmem_off;
 
 	if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
@@ -158,7 +181,7 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
 			struct page *page, unsigned int page_off,
 			sector_t sector, unsigned int len)
 {
-	phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
+	phys_addr_t pmem_off = to_offset(pmem, sector);
 	void *pmem_addr = pmem->virt_addr + pmem_off;
 
 	if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v8 7/7] pmem: implement pmem_recovery_write()
  2022-04-20  2:04 ` [dm-devel] " Jane Chu
@ 2022-04-20  2:04   ` Jane Chu
  -1 siblings, 0 replies; 44+ messages in thread
From: Jane Chu @ 2022-04-20  2:04 UTC (permalink / raw)
  To: dan.j.williams, bp, hch, dave.hansen, peterz, luto, david,
	djwong, linux-fsdevel, nvdimm, linux-kernel, x86
  Cc: vishal.l.verma, dave.jiang, agk, snitzer, dm-devel, ira.weiny,
	willy, vgoyal

The recovery write thread starts out as a normal pwrite thread; when
the filesystem is told about a potential media error in the range, it
turns the normal pwrite into a dax_recovery_write.

The recovery write consists of clearing the media poison, clearing the
page's HWPoison bit, re-enabling page-wide read-write permission,
flushing the caches, and finally writing.  A competing pread thread is
held off during the recovery process since data read back might not be
valid; this is achieved by clearing the badblock records only after
the recovery write is complete. Competing recovery write threads are
serialized by the pmem device level .recovery_lock.
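
In outline, the ordering the patch implements is (a condensed
paraphrase of the code below, error handling elided):

	mutex_lock(&pmem->recovery_lock);
	cleared = __pmem_clear_poison(pmem, pmem_off, len); /* poison + HWPoison */
	olen = _copy_from_iter_flushcache(addr, bytes, i);  /* the actual write */
	pmem_clear_bb(pmem, sector, cleared >> SECTOR_SHIFT); /* readers resume */
	mutex_unlock(&pmem->recovery_lock);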

Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 drivers/nvdimm/pmem.c | 56 ++++++++++++++++++++++++++++++++++++++++++-
 drivers/nvdimm/pmem.h |  1 +
 2 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index c3772304d417..134f8909eb65 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -332,10 +332,63 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
 	return __pmem_direct_access(pmem, pgoff, nr_pages, mode, kaddr, pfn);
 }
 
+/*
+ * The recovery write thread starts out as a normal pwrite thread; when
+ * the filesystem is told about a potential media error in the range,
+ * it turns the normal pwrite into a dax_recovery_write.
+ *
+ * The recovery write consists of clearing the media poison, clearing
+ * the page's HWPoison bit, re-enabling page-wide read-write permission,
+ * flushing the caches, and finally writing.  A competing pread thread
+ * is held off during the recovery process since data read back might
+ * not be valid; this is achieved by clearing the badblock records only
+ * after the recovery write is complete. Competing recovery write
+ * threads are serialized by the pmem device level .recovery_lock.
+ */
 static size_t pmem_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
 		void *addr, size_t bytes, struct iov_iter *i)
 {
-	return 0;
+	struct pmem_device *pmem = dax_get_private(dax_dev);
+	size_t olen, len, off;
+	phys_addr_t pmem_off;
+	struct device *dev = pmem->bb.dev;
+	long cleared;
+
+	off = offset_in_page(addr);
+	len = PFN_PHYS(PFN_UP(off + bytes));
+	if (!is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) >> SECTOR_SHIFT, len))
+		return _copy_from_iter_flushcache(addr, bytes, i);
+
+	/*
+	 * A range that is not page-aligned cannot be recovered. This
+	 * should not happen unless something else went wrong.
+	 */
+	if (off || !PAGE_ALIGNED(bytes)) {
+		dev_warn(dev, "Found poison, but addr(%p) or bytes(%#lx) not page aligned\n",
+			addr, bytes);
+		return 0;
+	}
+
+	mutex_lock(&pmem->recovery_lock);
+	pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
+	cleared = __pmem_clear_poison(pmem, pmem_off, len);
+	if (cleared > 0 && cleared < len) {
+		dev_warn(dev, "poison cleared only %ld out of %lu bytes\n",
+			cleared, len);
+		mutex_unlock(&pmem->recovery_lock);
+		return 0;
+	}
+	if (cleared < 0) {
+		dev_warn(dev, "poison clear failed: %ld\n", cleared);
+		mutex_unlock(&pmem->recovery_lock);
+		return 0;
+	}
+
+	olen = _copy_from_iter_flushcache(addr, bytes, i);
+	pmem_clear_bb(pmem, to_sect(pmem, pmem_off), cleared >> SECTOR_SHIFT);
+
+	mutex_unlock(&pmem->recovery_lock);
+	return olen;
 }
 
 static const struct dax_operations pmem_dax_ops = {
@@ -525,6 +578,7 @@ static int pmem_attach_disk(struct device *dev,
 	if (rc)
 		goto out_cleanup_dax;
 	dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
+	mutex_init(&pmem->recovery_lock);
 	pmem->dax_dev = dax_dev;
 
 	rc = device_add_disk(dev, disk, pmem_attribute_groups);
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 392b0b38acb9..91e40f5e3c0e 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -27,6 +27,7 @@ struct pmem_device {
 	struct dax_device	*dax_dev;
 	struct gendisk		*disk;
 	struct dev_pagemap	pgmap;
+	struct mutex		recovery_lock;
 };
 
 long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v8 7/7] pmem: implement pmem_recovery_write()
  2022-04-20  2:04   ` [dm-devel] " Jane Chu
@ 2022-04-20  6:26     ` Dan Williams
  -1 siblings, 0 replies; 44+ messages in thread
From: Dan Williams @ 2022-04-20  6:26 UTC (permalink / raw)
  To: Jane Chu
  Cc: Borislav Petkov, Christoph Hellwig, Dave Hansen, Peter Zijlstra,
	Andy Lutomirski, david, Darrick J. Wong, linux-fsdevel,
	Linux NVDIMM, Linux Kernel Mailing List, X86 ML, Vishal L Verma,
	Dave Jiang, Alasdair Kergon, Mike Snitzer,
	device-mapper development, Weiny, Ira, Matthew Wilcox,
	Vivek Goyal

On Tue, Apr 19, 2022 at 7:06 PM Jane Chu <jane.chu@oracle.com> wrote:
>
> The recovery write thread started out as a normal pwrite thread and
> when the filesystem was told about potential media error in the
> range, filesystem turns the normal pwrite to a dax_recovery_write.
>
> The recovery write consists of clearing media poison, clearing page
> HWPoison bit, reenable page-wide read-write permission, flush the
> caches and finally write.  A competing pread thread will be held
> off during the recovery process since data read back might not be
> valid, and this is achieved by clearing the badblock records after
> the recovery write is complete. Competing recovery write threads
> are serialized by pmem device level .recovery_lock.
>
> Signed-off-by: Jane Chu <jane.chu@oracle.com>
> ---
>  drivers/nvdimm/pmem.c | 56 ++++++++++++++++++++++++++++++++++++++++++-
>  drivers/nvdimm/pmem.h |  1 +
>  2 files changed, 56 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index c3772304d417..134f8909eb65 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -332,10 +332,63 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
>         return __pmem_direct_access(pmem, pgoff, nr_pages, mode, kaddr, pfn);
>  }
>
> +/*
> + * The recovery write thread starts out as a normal pwrite thread;
> + * when the filesystem is told about a potential media error in the
> + * range, it turns the normal pwrite into a dax_recovery_write.
> + *
> + * The recovery write consists of clearing the media poison, clearing
> + * the page HWPoison bit, re-enabling page-wide read-write permission,
> + * flushing the caches, and finally writing.  A competing pread thread
> + * will be held off during the recovery process since data read back
> + * might not be valid; this is achieved by clearing the badblock
> + * records only after the recovery write is complete.  Competing
> + * recovery write threads are serialized by the pmem device level
> + * .recovery_lock.
> + */
>  static size_t pmem_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
>                 void *addr, size_t bytes, struct iov_iter *i)
>  {
> -       return 0;
> +       struct pmem_device *pmem = dax_get_private(dax_dev);
> +       size_t olen, len, off;
> +       phys_addr_t pmem_off;
> +       struct device *dev = pmem->bb.dev;
> +       long cleared;
> +
> +       off = offset_in_page(addr);
> +       len = PFN_PHYS(PFN_UP(off + bytes));
> +       if (!is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) >> SECTOR_SHIFT, len))
> +               return _copy_from_iter_flushcache(addr, bytes, i);
> +
> +       /*
> +        * A non-page-aligned range cannot be recovered. This should not
> +        * happen unless something else went wrong.
> +        */
> +       if (off || !PAGE_ALIGNED(bytes)) {
> +               dev_warn(dev, "Found poison, but addr(%p) or bytes(%#lx) not page aligned\n",
> +                       addr, bytes);

If this warn stays:

s/dev_warn/dev_warn_ratelimited/

The kernel prints hashed addresses for %p, so I'm not sure printing
@addr or @bytes is useful, because there is nothing actionable that
can be done with that information in the log. @pgoff is probably the
only variable worth printing (after converting to bytes or sectors),
as it might be reverse-mapped back to the impacted data.
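
For illustration, a sketch of what a single ratelimited print keyed to
@pgoff could look like (the message text and conversion here are my
assumption, not the final patch):

	/* hypothetical: report the device offset in bytes, ratelimited */
	dev_warn_ratelimited(dev,
		"recovery write failed: offset %#llx not page aligned\n",
		(unsigned long long)PFN_PHYS(pgoff));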

> +               return 0;
> +       }
> +
> +       mutex_lock(&pmem->recovery_lock);
> +       pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
> +       cleared = __pmem_clear_poison(pmem, pmem_off, len);
> +       if (cleared > 0 && cleared < len) {
> +               dev_warn(dev, "poison cleared only %ld out of %lu bytes\n",
> +                       cleared, len);

This looks like dev_dbg() to me, or at minimum the same
dev_warn_ratelimited() print as the one above to just indicate a write
to the device at the given offset failed.

> +               mutex_unlock(&pmem->recovery_lock);
> +               return 0;
> +       }
> +       if (cleared < 0) {
> +               dev_warn(dev, "poison clear failed: %ld\n", cleared);

Same feedback here, these should probably all map to the identical
error exit ratelimited print.

> +               mutex_unlock(&pmem->recovery_lock);

It occurs to me that all callers of this are arriving through the
fsdax iomap ops and all of these callers take an exclusive lock to
prevent simultaneous access to the inode. If recovery_write() is only
used through that path then this lock is redundant. Simultaneous reads
are protected by the fact that the badblocks are cleared last. I think
this can wait to add the lock when / if a non-fsdax access path
arrives for dax_recovery_write(), and even then it should probably
enforce the single writer exclusion guarantee of the fsdax path.
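
To spell out the ordering that protects readers without a device-level
lock, these are the same three calls as in the patch, annotated here
only for illustration:

	cleared = __pmem_clear_poison(pmem, pmem_off, len);  /* fix the media */
	olen = _copy_from_iter_flushcache(addr, bytes, i);   /* land new data */
	pmem_clear_bb(pmem, to_sect(pmem, pmem_off),
			cleared >> SECTOR_SHIFT);            /* unblock readers last */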

> +               return 0;
> +       }
> +
> +       olen = _copy_from_iter_flushcache(addr, bytes, i);
> +       pmem_clear_bb(pmem, to_sect(pmem, pmem_off), cleared >> SECTOR_SHIFT);
> +
> +       mutex_unlock(&pmem->recovery_lock);
> +       return olen;
>  }
>
>  static const struct dax_operations pmem_dax_ops = {
> @@ -525,6 +578,7 @@ static int pmem_attach_disk(struct device *dev,
>         if (rc)
>                 goto out_cleanup_dax;
>         dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
> +       mutex_init(&pmem->recovery_lock);
>         pmem->dax_dev = dax_dev;
>
>         rc = device_add_disk(dev, disk, pmem_attribute_groups);
> diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
> index 392b0b38acb9..91e40f5e3c0e 100644
> --- a/drivers/nvdimm/pmem.h
> +++ b/drivers/nvdimm/pmem.h
> @@ -27,6 +27,7 @@ struct pmem_device {
>         struct dax_device       *dax_dev;
>         struct gendisk          *disk;
>         struct dev_pagemap      pgmap;
> +       struct mutex            recovery_lock;
>  };
>
>  long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
> --
> 2.18.4
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8 7/7] pmem: implement pmem_recovery_write()
  2022-04-20  6:26     ` [dm-devel] " Dan Williams
@ 2022-04-20 17:01       ` Jane Chu
  -1 siblings, 0 replies; 44+ messages in thread
From: Jane Chu @ 2022-04-20 17:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Borislav Petkov, Christoph Hellwig, Dave Hansen, Peter Zijlstra,
	Andy Lutomirski, david, Darrick J. Wong, linux-fsdevel,
	Linux NVDIMM, Linux Kernel Mailing List, X86 ML, Vishal L Verma,
	Dave Jiang, Alasdair Kergon, Mike Snitzer,
	device-mapper development, Weiny, Ira, Matthew Wilcox,
	Vivek Goyal

On 4/19/2022 11:26 PM, Dan Williams wrote:
> On Tue, Apr 19, 2022 at 7:06 PM Jane Chu <jane.chu@oracle.com> wrote:
>>
>> The recovery write thread starts out as a normal pwrite thread; when
>> the filesystem is told about a potential media error in the range, it
>> turns the normal pwrite into a dax_recovery_write.
>>
>> The recovery write consists of clearing the media poison, clearing
>> the page HWPoison bit, re-enabling page-wide read-write permission,
>> flushing the caches, and finally writing.  A competing pread thread
>> will be held off during the recovery process since data read back
>> might not be valid; this is achieved by clearing the badblock records
>> only after the recovery write is complete.  Competing recovery write
>> threads are serialized by the pmem device level .recovery_lock.
>>
>> Signed-off-by: Jane Chu <jane.chu@oracle.com>
>> ---
>>   drivers/nvdimm/pmem.c | 56 ++++++++++++++++++++++++++++++++++++++++++-
>>   drivers/nvdimm/pmem.h |  1 +
>>   2 files changed, 56 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
>> index c3772304d417..134f8909eb65 100644
>> --- a/drivers/nvdimm/pmem.c
>> +++ b/drivers/nvdimm/pmem.c
>> @@ -332,10 +332,63 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
>>          return __pmem_direct_access(pmem, pgoff, nr_pages, mode, kaddr, pfn);
>>   }
>>
>> +/*
>> + * The recovery write thread starts out as a normal pwrite thread;
>> + * when the filesystem is told about a potential media error in the
>> + * range, it turns the normal pwrite into a dax_recovery_write.
>> + *
>> + * The recovery write consists of clearing the media poison, clearing
>> + * the page HWPoison bit, re-enabling page-wide read-write permission,
>> + * flushing the caches, and finally writing.  A competing pread thread
>> + * will be held off during the recovery process since data read back
>> + * might not be valid; this is achieved by clearing the badblock
>> + * records only after the recovery write is complete.  Competing
>> + * recovery write threads are serialized by the pmem device level
>> + * .recovery_lock.
>> + */
>>   static size_t pmem_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
>>                  void *addr, size_t bytes, struct iov_iter *i)
>>   {
>> -       return 0;
>> +       struct pmem_device *pmem = dax_get_private(dax_dev);
>> +       size_t olen, len, off;
>> +       phys_addr_t pmem_off;
>> +       struct device *dev = pmem->bb.dev;
>> +       long cleared;
>> +
>> +       off = offset_in_page(addr);
>> +       len = PFN_PHYS(PFN_UP(off + bytes));
>> +       if (!is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) >> SECTOR_SHIFT, len))
>> +               return _copy_from_iter_flushcache(addr, bytes, i);
>> +
>> +       /*
>> +        * A non-page-aligned range cannot be recovered. This should not
>> +        * happen unless something else went wrong.
>> +        */
>> +       if (off || !PAGE_ALIGNED(bytes)) {
>> +               dev_warn(dev, "Found poison, but addr(%p) or bytes(%#lx) not page aligned\n",
>> +                       addr, bytes);
> 
> If this warn stays:
> 
> s/dev_warn/dev_warn_ratelimited/
> 
> The kernel prints hashed addresses for %p, so I'm not sure printing
> @addr or @bytes is useful, because there is nothing actionable that
> can be done with that information in the log. @pgoff is probably the
> only variable worth printing (after converting to bytes or sectors),
> as it might be reverse-mapped back to the impacted data.

The intention with printing @addr and @bytes is to show the
mis-alignment. In the past, when UC- was set on a poisoned dax page,
returning less than a page's worth of written data would cause
dax_iomap_iter to produce the next iteration with @addr and @bytes not
page-aligned.  Although UC- doesn't apply here, I thought it might
still be worthwhile to watch for a similar scenario.  That's also why
@pgoff isn't helpful.

How about s/dev_warn/dev_dbg/ ?

> 
>> +               return 0;
>> +       }
>> +
>> +       mutex_lock(&pmem->recovery_lock);
>> +       pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
>> +       cleared = __pmem_clear_poison(pmem, pmem_off, len);
>> +       if (cleared > 0 && cleared < len) {
>> +               dev_warn(dev, "poison cleared only %ld out of %lu bytes\n",
>> +                       cleared, len);
> 
> This looks like dev_dbg() to me, or at minimum the same
> dev_warn_ratelimited() print as the one above to just indicate a write
> to the device at the given offset failed.

Will s/dev_warn/dev_dbg/

> 
>> +               mutex_unlock(&pmem->recovery_lock);
>> +               return 0;
>> +       }
>> +       if (cleared < 0) {
>> +               dev_warn(dev, "poison clear failed: %ld\n", cleared);
> 
> Same feedback here, these should probably all map to the identical
> error exit ratelimited print.

Will s/dev_warn/dev_dbg/

> 
>> +               mutex_unlock(&pmem->recovery_lock);
> 
> It occurs to me that all callers of this are arriving through the
> fsdax iomap ops and all of these callers take an exclusive lock to
> prevent simultaneous access to the inode. If recovery_write() is only
> used through that path then this lock is redundant. Simultaneous reads
> are protected by the fact that the badblocks are cleared last. I think
> this can wait to add the lock when / if a non-fsdax access path
> arrives for dax_recovery_write(), and even then it should probably
> enforce the single writer exclusion guarantee of the fsdax path.
> 

Indeed, the caller dax_iomap_rw already holds the writer lock.

Will remove .recovery_lock for now.
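
A rough sketch of the resulting tail of pmem_recovery_write() with the
device-level mutex dropped (assuming the fsdax writer lock provides
the exclusion, and with the prints demoted to dev_dbg() as above):

	pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
	cleared = __pmem_clear_poison(pmem, pmem_off, len);
	if (cleared < 0 || (size_t)cleared < len) {
		dev_dbg(dev, "poison clear failed or short: %ld\n", cleared);
		return 0;
	}
	olen = _copy_from_iter_flushcache(addr, bytes, i);
	pmem_clear_bb(pmem, to_sect(pmem, pmem_off), cleared >> SECTOR_SHIFT);
	return olen;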

BTW, how do the other patches look to you?

Thanks!
-jane

>> +               return 0;
>> +       }
>> +
>> +       olen = _copy_from_iter_flushcache(addr, bytes, i);
>> +       pmem_clear_bb(pmem, to_sect(pmem, pmem_off), cleared >> SECTOR_SHIFT);
>> +
>> +       mutex_unlock(&pmem->recovery_lock);
>> +       return olen;
>>   }
>>
>>   static const struct dax_operations pmem_dax_ops = {
>> @@ -525,6 +578,7 @@ static int pmem_attach_disk(struct device *dev,
>>          if (rc)
>>                  goto out_cleanup_dax;
>>          dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
>> +       mutex_init(&pmem->recovery_lock);
>>          pmem->dax_dev = dax_dev;
>>
>>          rc = device_add_disk(dev, disk, pmem_attribute_groups);
>> diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
>> index 392b0b38acb9..91e40f5e3c0e 100644
>> --- a/drivers/nvdimm/pmem.h
>> +++ b/drivers/nvdimm/pmem.h
>> @@ -27,6 +27,7 @@ struct pmem_device {
>>          struct dax_device       *dax_dev;
>>          struct gendisk          *disk;
>>          struct dev_pagemap      pgmap;
>> +       struct mutex            recovery_lock;
>>   };
>>
>>   long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
>> --
>> 2.18.4
>>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8 7/7] pmem: implement pmem_recovery_write()
  2022-04-20 17:01       ` [dm-devel] " Jane Chu
@ 2022-04-20 17:05         ` Dan Williams
  -1 siblings, 0 replies; 44+ messages in thread
From: Dan Williams @ 2022-04-20 17:05 UTC (permalink / raw)
  To: Jane Chu
  Cc: Borislav Petkov, Christoph Hellwig, Dave Hansen, Peter Zijlstra,
	Andy Lutomirski, david, Darrick J. Wong, linux-fsdevel,
	Linux NVDIMM, Linux Kernel Mailing List, X86 ML, Vishal L Verma,
	Dave Jiang, Alasdair Kergon, Mike Snitzer,
	device-mapper development, Weiny, Ira, Matthew Wilcox,
	Vivek Goyal

On Wed, Apr 20, 2022 at 10:02 AM Jane Chu <jane.chu@oracle.com> wrote:
>
> On 4/19/2022 11:26 PM, Dan Williams wrote:
> > On Tue, Apr 19, 2022 at 7:06 PM Jane Chu <jane.chu@oracle.com> wrote:
> >>
> >> The recovery write thread starts out as a normal pwrite thread;
> >> when the filesystem is told about a potential media error in the
> >> range, it turns the normal pwrite into a dax_recovery_write.
> >>
> >> The recovery write consists of clearing the media poison, clearing
> >> the page HWPoison bit, re-enabling page-wide read-write permission,
> >> flushing the caches, and finally writing.  A competing pread thread
> >> will be held off during the recovery process since data read back
> >> might not be valid; this is achieved by clearing the badblock
> >> records only after the recovery write is complete.  Competing
> >> recovery write threads are serialized by the pmem device level
> >> .recovery_lock.
> >>
> >> Signed-off-by: Jane Chu <jane.chu@oracle.com>
> >> ---
> >>   drivers/nvdimm/pmem.c | 56 ++++++++++++++++++++++++++++++++++++++++++-
> >>   drivers/nvdimm/pmem.h |  1 +
> >>   2 files changed, 56 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> >> index c3772304d417..134f8909eb65 100644
> >> --- a/drivers/nvdimm/pmem.c
> >> +++ b/drivers/nvdimm/pmem.c
> >> @@ -332,10 +332,63 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
> >>          return __pmem_direct_access(pmem, pgoff, nr_pages, mode, kaddr, pfn);
> >>   }
> >>
> >> +/*
> >> + * The recovery write thread starts out as a normal pwrite thread;
> >> + * when the filesystem is told about a potential media error in the
> >> + * range, it turns the normal pwrite into a dax_recovery_write.
> >> + *
> >> + * The recovery write consists of clearing the media poison,
> >> + * clearing the page HWPoison bit, re-enabling page-wide read-write
> >> + * permission, flushing the caches, and finally writing.  A
> >> + * competing pread thread will be held off during the recovery
> >> + * process since data read back might not be valid; this is achieved
> >> + * by clearing the badblock records only after the recovery write is
> >> + * complete.  Competing recovery write threads are serialized by the
> >> + * pmem device level .recovery_lock.
> >> + */
> >>   static size_t pmem_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
> >>                  void *addr, size_t bytes, struct iov_iter *i)
> >>   {
> >> -       return 0;
> >> +       struct pmem_device *pmem = dax_get_private(dax_dev);
> >> +       size_t olen, len, off;
> >> +       phys_addr_t pmem_off;
> >> +       struct device *dev = pmem->bb.dev;
> >> +       long cleared;
> >> +
> >> +       off = offset_in_page(addr);
> >> +       len = PFN_PHYS(PFN_UP(off + bytes));
> >> +       if (!is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) >> SECTOR_SHIFT, len))
> >> +               return _copy_from_iter_flushcache(addr, bytes, i);
> >> +
> >> +        * A non-page-aligned range cannot be recovered. This should not
> >> +        * Not page-aligned range cannot be recovered. This should not
> >> +        * happen unless something else went wrong.
> >> +        */
> >> +       if (off || !PAGE_ALIGNED(bytes)) {
> >> +               dev_warn(dev, "Found poison, but addr(%p) or bytes(%#lx) not page aligned\n",
> >> +                       addr, bytes);
> >
> > If this warn stays:
> >
> > s/dev_warn/dev_warn_ratelimited/
> >
> > The kernel prints hashed addresses for %p, so I'm not sure printing
> > @addr or @bytes is useful, because there is nothing actionable that
> > can be done with that information in the log. @pgoff is probably the
> > only variable worth printing (after converting to bytes or sectors),
> > as it might be reverse-mapped back to the impacted data.
>
> The intention with printing @addr and @bytes is to show the
> mis-alignment. In the past, when UC- was set on a poisoned dax page,
> returning less than a page's worth of written data would cause
> dax_iomap_iter to produce the next iteration with @addr and @bytes
> not page-aligned.  Although UC- doesn't apply here, I thought it
> might still be worthwhile to watch for a similar scenario.  That's
> also why @pgoff isn't helpful.
>
> How about s/dev_warn/dev_dbg/ ?
>
> >
> >> +               return 0;
> >> +       }
> >> +
> >> +       mutex_lock(&pmem->recovery_lock);
> >> +       pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
> >> +       cleared = __pmem_clear_poison(pmem, pmem_off, len);
> >> +       if (cleared > 0 && cleared < len) {
> >> +               dev_warn(dev, "poison cleared only %ld out of %lu bytes\n",
> >> +                       cleared, len);
> >
> > This looks like dev_dbg() to me, or at minimum the same
> > dev_warn_ratelimited() print as the one above to just indicate a write
> > to the device at the given offset failed.
>
> Will s/dev_warn/dev_dbg/
>
> >
> >> +               mutex_unlock(&pmem->recovery_lock);
> >> +               return 0;
> >> +       }
> >> +       if (cleared < 0) {
> >> +               dev_warn(dev, "poison clear failed: %ld\n", cleared);
> >
> > Same feedback here, these should probably all map to the identical
> > error exit ratelimited print.
>
> Will s/dev_warn/dev_dbg/
>
> >
> >> +               mutex_unlock(&pmem->recovery_lock);
> >
> > It occurs to me that all callers of this are arriving through the
> > fsdax iomap ops and all of these callers take an exclusive lock to
> > prevent simultaneous access to the inode. If recovery_write() is only
> > used through that path then this lock is redundant. Simultaneous reads
> > are protected by the fact that the badblocks are cleared last. I think
> > this can wait to add the lock when / if a non-fsdax access path
> > arrives for dax_recovery_write(), and even then it should probably
> > enforce the single writer exclusion guarantee of the fsdax path.
> >
>
> Indeed, the caller dax_iomap_rw already holds the writer lock.
>
> Will remove .recovery_lock for now.
>
> BTW, how do the other patches look to you?

First pass looked good, so I'll do one more look-over, but this is
coming together nicely. Thanks for all the effort on this!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8 2/7] x86/mce: relocate set{clear}_mce_nospec() functions
  2022-04-20  2:04   ` [dm-devel] " Jane Chu
@ 2022-04-21  6:50     ` Christoph Hellwig
  -1 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2022-04-21  6:50 UTC (permalink / raw)
  To: Jane Chu
  Cc: dan.j.williams, bp, hch, dave.hansen, peterz, luto, david,
	djwong, linux-fsdevel, nvdimm, linux-kernel, x86, vishal.l.verma,
	dave.jiang, agk, snitzer, dm-devel, ira.weiny, willy, vgoyal

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8 3/7] mce: fix set_mce_nospec to always unmap the whole page
  2022-04-20  2:04   ` [dm-devel] " Jane Chu
@ 2022-04-21  6:51     ` Christoph Hellwig
  -1 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2022-04-21  6:51 UTC (permalink / raw)
  To: Jane Chu
  Cc: dan.j.williams, bp, hch, dave.hansen, peterz, luto, david,
	djwong, linux-fsdevel, nvdimm, linux-kernel, x86, vishal.l.verma,
	dave.jiang, agk, snitzer, dm-devel, ira.weiny, willy, vgoyal

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8 4/7] dax: introduce DAX_RECOVERY_WRITE dax access mode
  2022-04-20  2:04   ` [dm-devel] " Jane Chu
@ 2022-04-21  6:57     ` Christoph Hellwig
  -1 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2022-04-21  6:57 UTC (permalink / raw)
  To: Jane Chu
  Cc: dan.j.williams, bp, hch, dave.hansen, peterz, luto, david,
	djwong, linux-fsdevel, nvdimm, linux-kernel, x86, vishal.l.verma,
	dave.jiang, agk, snitzer, dm-devel, ira.weiny, willy, vgoyal

> +	if (bb->count &&
> +		badblocks_check(bb, sector, num, &first_bad, &num_bad)) {

Weird alignment here; continuation lines for conditionals are aligned
either after the opening brace:

	if (bb->count &&
	    badblocks_check(bb, sector, num, &first_bad, &num_bad)) {

or with double tabs.  I tend to prefer the version I posted above.

That being said, should this change even be in this patch, rather than
added only once we add actual recovery support?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8 5/7] dax: add .recovery_write dax_operation
  2022-04-20  2:04   ` [dm-devel] " Jane Chu
@ 2022-04-21  6:57     ` Christoph Hellwig
  -1 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2022-04-21  6:57 UTC (permalink / raw)
  To: Jane Chu
  Cc: dan.j.williams, bp, hch, dave.hansen, peterz, luto, david,
	djwong, linux-fsdevel, nvdimm, linux-kernel, x86, vishal.l.verma,
	dave.jiang, agk, snitzer, dm-devel, ira.weiny, willy, vgoyal

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8 1/7] acpi/nfit: rely on mce->misc to determine poison granularity
  2022-04-20  2:04   ` [dm-devel] " Jane Chu
@ 2022-04-21 19:25     ` Dan Williams
  -1 siblings, 0 replies; 44+ messages in thread
From: Dan Williams @ 2022-04-21 19:25 UTC (permalink / raw)
  To: Jane Chu
  Cc: Borislav Petkov, Christoph Hellwig, Dave Hansen, Peter Zijlstra,
	Andy Lutomirski, david, Darrick J. Wong, linux-fsdevel,
	Linux NVDIMM, Linux Kernel Mailing List, X86 ML, Vishal L Verma,
	Dave Jiang, Alasdair Kergon, Mike Snitzer,
	device-mapper development, Weiny, Ira, Matthew Wilcox,
	Vivek Goyal

On Tue, Apr 19, 2022 at 7:05 PM Jane Chu <jane.chu@oracle.com> wrote:
>
> nfit_handle_mec() hardcodes the poison granularity at L1_CACHE_BYTES.
> Instead, let the driver rely on the mce->misc register to determine
> the poison granularity.
>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Jane Chu <jane.chu@oracle.com>

Looks good,

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

...I'll add the Fixes: line when applying this.
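
For reference, a sketch of the derivation the patch description implies
(assuming the usual MCi_MISC encoding, where the low bits of the
register give the LSB of the recoverable address; not a quote of the
actual patch):

	/* granularity implied by the machine check, not L1_CACHE_BYTES */
	unsigned long align = 1UL << MCI_MISC_ADDR_LSB(mce->misc);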

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8 3/7] mce: fix set_mce_nospec to always unmap the whole page
  2022-04-20  2:04   ` [dm-devel] " Jane Chu
@ 2022-04-21 19:26     ` Dan Williams
  -1 siblings, 0 replies; 44+ messages in thread
From: Dan Williams @ 2022-04-21 19:26 UTC (permalink / raw)
  To: Jane Chu
  Cc: Borislav Petkov, Christoph Hellwig, Dave Hansen, Peter Zijlstra,
	Andy Lutomirski, david, Darrick J. Wong, linux-fsdevel,
	Linux NVDIMM, Linux Kernel Mailing List, X86 ML, Vishal L Verma,
	Dave Jiang, Alasdair Kergon, Mike Snitzer,
	device-mapper development, Weiny, Ira, Matthew Wilcox,
	Vivek Goyal

On Tue, Apr 19, 2022 at 7:05 PM Jane Chu <jane.chu@oracle.com> wrote:
>
> The set_memory_uc() approach doesn't work well in all cases.
> As Dan pointed out: "The VMM unmapped the bad page from
> guest physical space and passed the machine check to the guest."
> "The guest gets virtual #MC on an access to that page. When
> the guest tries to do set_memory_uc() and instructs cpa_flush()
> to do clean caches that results in taking another fault / exception
> perhaps because the VMM unmapped the page from the guest."
>
> Since the driver has special knowledge to handle NP or UC,
> mark the poisoned page with NP and let the driver handle it when
> it comes down to repair.
>
> Please refer to discussions here for more details.
> https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/
>
> Now, since the poisoned page is marked not-present, also fix
> pmem_do_write() in order to avoid writing to a not-present page
> and triggering a kernel Oops.
>
> Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
> Signed-off-by: Jane Chu <jane.chu@oracle.com>

Looks good to me:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

> ---
>  arch/x86/kernel/cpu/mce/core.c |  6 +++---
>  arch/x86/mm/pat/set_memory.c   | 23 +++++++++++------------
>  drivers/nvdimm/pmem.c          | 30 +++++++-----------------------
>  include/linux/set_memory.h     |  4 ++--
>  4 files changed, 23 insertions(+), 40 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 981496e6bc0e..fa67bb9d1afe 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -579,7 +579,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
>
>         pfn = mce->addr >> PAGE_SHIFT;
>         if (!memory_failure(pfn, 0)) {
> -               set_mce_nospec(pfn, whole_page(mce));
> +               set_mce_nospec(pfn);
>                 mce->kflags |= MCE_HANDLED_UC;
>         }
>
> @@ -1316,7 +1316,7 @@ static void kill_me_maybe(struct callback_head *cb)
>
>         ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
>         if (!ret) {
> -               set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> +               set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
>                 sync_core();
>                 return;
>         }
> @@ -1342,7 +1342,7 @@ static void kill_me_never(struct callback_head *cb)
>         p->mce_count = 0;
>         pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr);
>         if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0))
> -               set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> +               set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
>  }
>
>  static void queue_task_work(struct mce *m, char *msg, void (*func)(struct callback_head *))
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 978cf5bd2ab6..e3a5e55f3e08 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -1925,13 +1925,8 @@ int set_memory_wb(unsigned long addr, int numpages)
>  }
>  EXPORT_SYMBOL(set_memory_wb);
>
> -/*
> - * Prevent speculative access to the page by either unmapping
> - * it (if we do not require access to any part of the page) or
> - * marking it uncacheable (if we want to try to retrieve data
> - * from non-poisoned lines in the page).
> - */
> -int set_mce_nospec(unsigned long pfn, bool unmap)
> +/* Prevent speculative access to a page by marking it not-present */
> +int set_mce_nospec(unsigned long pfn)
>  {
>         unsigned long decoy_addr;
>         int rc;
> @@ -1956,19 +1951,23 @@ int set_mce_nospec(unsigned long pfn, bool unmap)
>          */
>         decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
>
> -       if (unmap)
> -               rc = set_memory_np(decoy_addr, 1);
> -       else
> -               rc = set_memory_uc(decoy_addr, 1);
> +       rc = set_memory_np(decoy_addr, 1);
>         if (rc)
>                 pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
>         return rc;
>  }
>
> +static int set_memory_present(unsigned long *addr, int numpages)
> +{
> +       return change_page_attr_set(addr, numpages, __pgprot(_PAGE_PRESENT), 0);
> +}
> +
>  /* Restore full speculative operation to the pfn. */
>  int clear_mce_nospec(unsigned long pfn)
>  {
> -       return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
> +       unsigned long addr = (unsigned long) pfn_to_kaddr(pfn);
> +
> +       return set_memory_present(&addr, 1);
>  }
>  EXPORT_SYMBOL_GPL(clear_mce_nospec);
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 58d95242a836..4aa17132a557 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -158,36 +158,20 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
>                         struct page *page, unsigned int page_off,
>                         sector_t sector, unsigned int len)
>  {
> -       blk_status_t rc = BLK_STS_OK;
> -       bool bad_pmem = false;
>         phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
>         void *pmem_addr = pmem->virt_addr + pmem_off;
>
> -       if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
> -               bad_pmem = true;
> +       if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
> +               blk_status_t rc = pmem_clear_poison(pmem, pmem_off, len);
> +
> +               if (rc != BLK_STS_OK)
> +                       return rc;
> +       }
>
> -       /*
> -        * Note that we write the data both before and after
> -        * clearing poison.  The write before clear poison
> -        * handles situations where the latest written data is
> -        * preserved and the clear poison operation simply marks
> -        * the address range as valid without changing the data.
> -        * In this case application software can assume that an
> -        * interrupted write will either return the new good
> -        * data or an error.
> -        *
> -        * However, if pmem_clear_poison() leaves the data in an
> -        * indeterminate state we need to perform the write
> -        * after clear poison.
> -        */
>         flush_dcache_page(page);
>         write_pmem(pmem_addr, page, page_off, len);
> -       if (unlikely(bad_pmem)) {
> -               rc = pmem_clear_poison(pmem, pmem_off, len);
> -               write_pmem(pmem_addr, page, page_off, len);
> -       }
>
> -       return rc;
> +       return BLK_STS_OK;
>  }
>
>  static void pmem_submit_bio(struct bio *bio)
> diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
> index 683a6c3f7179..369769ce7399 100644
> --- a/include/linux/set_memory.h
> +++ b/include/linux/set_memory.h
> @@ -43,10 +43,10 @@ static inline bool can_set_direct_map(void)
>  #endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
>
>  #ifdef CONFIG_X86_64
> -int set_mce_nospec(unsigned long pfn, bool unmap);
> +int set_mce_nospec(unsigned long pfn);
>  int clear_mce_nospec(unsigned long pfn);
>  #else
> -static inline int set_mce_nospec(unsigned long pfn, bool unmap)
> +static inline int set_mce_nospec(unsigned long pfn)
>  {
>         return 0;
>  }
> --
> 2.18.4
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8 4/7] dax: introduce DAX_RECOVERY_WRITE dax access mode
  2022-04-20  2:04   ` [dm-devel] " Jane Chu
@ 2022-04-21 19:35     ` Dan Williams
  -1 siblings, 0 replies; 44+ messages in thread
From: Dan Williams @ 2022-04-21 19:35 UTC (permalink / raw)
  To: Jane Chu
  Cc: Borislav Petkov, Christoph Hellwig, Dave Hansen, Peter Zijlstra,
	Andy Lutomirski, david, Darrick J. Wong, linux-fsdevel,
	Linux NVDIMM, Linux Kernel Mailing List, X86 ML, Vishal L Verma,
	Dave Jiang, Alasdair Kergon, Mike Snitzer,
	device-mapper development, Weiny, Ira, Matthew Wilcox,
	Vivek Goyal

On Tue, Apr 19, 2022 at 7:05 PM Jane Chu <jane.chu@oracle.com> wrote:
>
> Up till now, dax_direct_access() has been used implicitly for normal
> access, but a recovery write needs access to a dax range that
> contains poison.  To make the interface explicit, introduce
>         enum dax_access_mode {
>                 DAX_ACCESS,
>                 DAX_RECOVERY_WRITE,
>         }
> where DAX_ACCESS is used for normal dax access, and
> DAX_RECOVERY_WRITE is used for dax recovery write.
>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Jane Chu <jane.chu@oracle.com>
> ---
>  drivers/dax/super.c             |  5 ++--
>  drivers/md/dm-linear.c          |  5 ++--
>  drivers/md/dm-log-writes.c      |  5 ++--
>  drivers/md/dm-stripe.c          |  5 ++--
>  drivers/md/dm-target.c          |  4 ++-
>  drivers/md/dm-writecache.c      |  7 +++---
>  drivers/md/dm.c                 |  5 ++--
>  drivers/nvdimm/pmem.c           | 44 +++++++++++++++++++++++++--------
>  drivers/nvdimm/pmem.h           |  5 +++-
>  drivers/s390/block/dcssblk.c    |  9 ++++---
>  fs/dax.c                        |  9 ++++---
>  fs/fuse/dax.c                   |  4 +--
>  include/linux/dax.h             |  9 +++++--
>  include/linux/device-mapper.h   |  4 ++-
>  tools/testing/nvdimm/pmem-dax.c |  3 ++-
>  15 files changed, 85 insertions(+), 38 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 0211e6f7b47a..5405eb553430 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -117,6 +117,7 @@ enum dax_device_flags {
>   * @dax_dev: a dax_device instance representing the logical memory range
>   * @pgoff: offset in pages from the start of the device to translate
>   * @nr_pages: number of consecutive pages caller can handle relative to @pfn
> + * @mode: indicator of normal access or recovery write
>   * @kaddr: output parameter that returns a virtual address mapping of pfn
>   * @pfn: output parameter that returns an absolute pfn translation of @pgoff
>   *
> @@ -124,7 +125,7 @@ enum dax_device_flags {
>   * pages accessible at the device relative @pgoff.
>   */
>  long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
> -               void **kaddr, pfn_t *pfn)
> +               enum dax_access_mode mode, void **kaddr, pfn_t *pfn)
>  {
>         long avail;
>
> @@ -138,7 +139,7 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
>                 return -EINVAL;
>
>         avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages,
> -                       kaddr, pfn);
> +                       mode, kaddr, pfn);
>         if (!avail)
>                 return -ERANGE;
>         return min(avail, nr_pages);
> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> index 76b486e4d2be..13e263299c9c 100644
> --- a/drivers/md/dm-linear.c
> +++ b/drivers/md/dm-linear.c
> @@ -172,11 +172,12 @@ static struct dax_device *linear_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff)
>  }
>
>  static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
> -               long nr_pages, void **kaddr, pfn_t *pfn)
> +               long nr_pages, enum dax_access_mode mode, void **kaddr,
> +               pfn_t *pfn)
>  {
>         struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff);
>
> -       return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
> +       return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
>  }
>
>  static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
> diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
> index c9d036d6bb2e..06bdbed65eb1 100644
> --- a/drivers/md/dm-log-writes.c
> +++ b/drivers/md/dm-log-writes.c
> @@ -889,11 +889,12 @@ static struct dax_device *log_writes_dax_pgoff(struct dm_target *ti,
>  }
>
>  static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
> -                                        long nr_pages, void **kaddr, pfn_t *pfn)
> +               long nr_pages, enum dax_access_mode mode, void **kaddr,
> +               pfn_t *pfn)
>  {
>         struct dax_device *dax_dev = log_writes_dax_pgoff(ti, &pgoff);
>
> -       return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
> +       return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
>  }
>
>  static int log_writes_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
> diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
> index c81d331d1afe..77d72900e997 100644
> --- a/drivers/md/dm-stripe.c
> +++ b/drivers/md/dm-stripe.c
> @@ -315,11 +315,12 @@ static struct dax_device *stripe_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff)
>  }
>
>  static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
> -               long nr_pages, void **kaddr, pfn_t *pfn)
> +               long nr_pages, enum dax_access_mode mode, void **kaddr,
> +               pfn_t *pfn)
>  {
>         struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff);
>
> -       return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
> +       return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
>  }
>
>  static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
> diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
> index 64dd0b34fcf4..8cd5184e62f0 100644
> --- a/drivers/md/dm-target.c
> +++ b/drivers/md/dm-target.c
> @@ -10,6 +10,7 @@
>  #include <linux/init.h>
>  #include <linux/kmod.h>
>  #include <linux/bio.h>
> +#include <linux/dax.h>
>
>  #define DM_MSG_PREFIX "target"
>
> @@ -142,7 +143,8 @@ static void io_err_release_clone_rq(struct request *clone,
>  }
>
>  static long io_err_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
> -               long nr_pages, void **kaddr, pfn_t *pfn)
> +               long nr_pages, enum dax_access_mode mode, void **kaddr,
> +               pfn_t *pfn)
>  {
>         return -EIO;
>  }
> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> index 5630b470ba42..d74c5a7a0ab4 100644
> --- a/drivers/md/dm-writecache.c
> +++ b/drivers/md/dm-writecache.c
> @@ -286,7 +286,8 @@ static int persistent_memory_claim(struct dm_writecache *wc)
>
>         id = dax_read_lock();
>
> -       da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, &wc->memory_map, &pfn);
> +       da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, DAX_ACCESS,
> +                       &wc->memory_map, &pfn);
>         if (da < 0) {
>                 wc->memory_map = NULL;
>                 r = da;
> @@ -308,8 +309,8 @@ static int persistent_memory_claim(struct dm_writecache *wc)
>                 i = 0;
>                 do {
>                         long daa;
> -                       daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i, p - i,
> -                                               NULL, &pfn);
> +                       daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i,
> +                                       p - i, DAX_ACCESS, NULL, &pfn);
>                         if (daa <= 0) {
>                                 r = daa ? daa : -EINVAL;
>                                 goto err3;
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3c5fad7c4ee6..8258676a352f 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1093,7 +1093,8 @@ static struct dm_target *dm_dax_get_live_target(struct mapped_device *md,
>  }
>
>  static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> -                                long nr_pages, void **kaddr, pfn_t *pfn)
> +               long nr_pages, enum dax_access_mode mode, void **kaddr,
> +               pfn_t *pfn)
>  {
>         struct mapped_device *md = dax_get_private(dax_dev);
>         sector_t sector = pgoff * PAGE_SECTORS;
> @@ -1111,7 +1112,7 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
>         if (len < 1)
>                 goto out;
>         nr_pages = min(len, nr_pages);
> -       ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
> +       ret = ti->type->direct_access(ti, pgoff, nr_pages, mode, kaddr, pfn);
>
>   out:
>         dm_put_live_table(md, srcu_idx);
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 4aa17132a557..c77b7cf19639 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -239,24 +239,47 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
>
>  /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
>  __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
> -               long nr_pages, void **kaddr, pfn_t *pfn)
> +               long nr_pages, enum dax_access_mode mode, void **kaddr,
> +               pfn_t *pfn)
>  {
>         resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
> -
> -       if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
> -                                       PFN_PHYS(nr_pages))))
> -               return -EIO;
> +       sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
> +       unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT;
> +       struct badblocks *bb = &pmem->bb;
> +       sector_t first_bad;
> +       int num_bad;
>
>         if (kaddr)
>                 *kaddr = pmem->virt_addr + offset;
>         if (pfn)
>                 *pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
>
> +       if (bb->count &&
> +               badblocks_check(bb, sector, num, &first_bad, &num_bad)) {
> +               long actual_nr;
> +
> +               if (mode != DAX_RECOVERY_WRITE)
> +                       return -EIO;
> +
> +               /*
> +                * The recovery stride is set to the kernel page size because
> +                * the underlying driver and firmware clear-poison functions
> +                * don't appear to handle large chunks (such as 2MiB) reliably.
> +                */
> +               actual_nr = PHYS_PFN(
> +                       PAGE_ALIGN((first_bad - sector) << SECTOR_SHIFT));
> +               dev_dbg(pmem->bb.dev, "start sector(%llu), nr_pages(%ld), first_bad(%llu), actual_nr(%ld)\n",
> +                               sector, nr_pages, first_bad, actual_nr);
> +               if (actual_nr)
> +                       return actual_nr;
> +               return 1;
> +       }
> +

Similar feedback as Christoph: let's keep this patch to the simple
task of plumbing the @mode argument to dax_direct_access(), and save
the logic changes for DAX_RECOVERY_WRITE for a later patch. The idea
being that, in general, you want to limit the blast radius of
regressions so that reverts stay simple if they become necessary. If
these logic changes cause a regression, then reverting this patch
also undoes all the other innocent boilerplate plumbing.

^ permalink raw reply	[flat|nested] 44+ messages in thread
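
For reference, the actual_nr arithmetic quoted above computes the number
of pages ahead of the first bad sector, rounded up to a whole page. A
self-contained user-space model (PAGE_ALIGN and PHYS_PFN are
re-implemented here only so the example runs outside the kernel, and the
sector values are made up):

#include <stdint.h>
#include <stdio.h>

#define SECTOR_SHIFT    9
#define PAGE_SIZE       4096ULL
#define PAGE_ALIGN(x)   (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))
#define PHYS_PFN(x)     ((x) / PAGE_SIZE)

int main(void)
{
        uint64_t sector = 0;    /* start of the requested range */
        uint64_t first_bad = 8; /* first poisoned sector reported */
        uint64_t actual_nr =
                PHYS_PFN(PAGE_ALIGN((first_bad - sector) << SECTOR_SHIFT));

        /* (8 - 0) << 9 = 4096 bytes; PAGE_ALIGN -> 4096; PHYS_PFN -> 1,
         * i.e. one clean page precedes the poison. */
        printf("actual_nr = %llu\n", (unsigned long long)actual_nr);
        return 0;
}

Because PAGE_ALIGN() rounds up, a first_bad in the middle of a page still
yields a stride covering that whole page, and when first_bad equals the
start sector the computed actual_nr is 0, so the quoted code returns 1
instead: a single page, keeping the recovery stride at kernel page size.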

* Re: [PATCH v8 2/7] x86/mce: relocate set{clear}_mce_nospec() functions
  2022-04-21  6:50     ` [dm-devel] " Christoph Hellwig
@ 2022-04-22  3:46       ` Jane Chu
  -1 siblings, 0 replies; 44+ messages in thread
From: Jane Chu @ 2022-04-22  3:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dan.j.williams, bp, dave.hansen, peterz, luto, david, djwong,
	linux-fsdevel, nvdimm, linux-kernel, x86, vishal.l.verma,
	dave.jiang, agk, snitzer, dm-devel, ira.weiny, willy, vgoyal

On 4/20/2022 11:50 PM, Christoph Hellwig wrote:
> Looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks!
-jane

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8 3/7] mce: fix set_mce_nospec to always unmap the whole page
  2022-04-21  6:51     ` [dm-devel] " Christoph Hellwig
@ 2022-04-22  3:47       ` Jane Chu
  -1 siblings, 0 replies; 44+ messages in thread
From: Jane Chu @ 2022-04-22  3:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dan.j.williams, bp, dave.hansen, peterz, luto, david, djwong,
	linux-fsdevel, nvdimm, linux-kernel, x86, vishal.l.verma,
	dave.jiang, agk, snitzer, dm-devel, ira.weiny, willy, vgoyal

On 4/20/2022 11:51 PM, Christoph Hellwig wrote:
> Looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

thanks!
-jane

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8 4/7] dax: introduce DAX_RECOVERY_WRITE dax access mode
  2022-04-21  6:57     ` [dm-devel] " Christoph Hellwig
@ 2022-04-22  4:15       ` Jane Chu
  -1 siblings, 0 replies; 44+ messages in thread
From: Jane Chu @ 2022-04-22  4:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dan.j.williams, bp, dave.hansen, peterz, luto, david, djwong,
	linux-fsdevel, nvdimm, linux-kernel, x86, vishal.l.verma,
	dave.jiang, agk, snitzer, dm-devel, ira.weiny, willy, vgoyal

On 4/20/2022 11:57 PM, Christoph Hellwig wrote:
>> +	if (bb->count &&
>> +		badblocks_check(bb, sector, num, &first_bad, &num_bad)) {
> 
> Weird alignment here; continuation lines for conditionals are aligned
> either after the opening brace:
> 
> 	if (bb->count &&
> 	    badblocks_check(bb, sector, num, &first_bad, &num_bad)) {
> 
> or with double tabs.  I tend to prefer the version I posted above.
> 
> That being said, should this change be in this patch at all, rather
> than added only once we add actual recovery support?

Right, will move the recovery_write-related changes to the next patch.

thanks!
-jane


^ permalink raw reply	[flat|nested] 44+ messages in thread
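
Once the recovery logic is split out as agreed above, the intended
calling convention is roughly the following user-space sketch. The
retry-on-EIO shape is inferred from the series description, and
direct_access_stub() is a made-up stand-in for dax_direct_access(); the
exact call sites live in the later patches:

#include <errno.h>
#include <stdio.h>

enum dax_access_mode { DAX_ACCESS, DAX_RECOVERY_WRITE };

/* Stand-in: normal access to a poisoned range fails, while
 * recovery-write access maps one page at a time. */
static long direct_access_stub(enum dax_access_mode mode)
{
        return (mode == DAX_ACCESS) ? -EIO : 1;
}

int main(void)
{
        long map_len = direct_access_stub(DAX_ACCESS);

        if (map_len == -EIO) {
                /* Poison found on the normal path; a writer may retry
                 * in recovery mode and route the data through the
                 * .recovery_write path instead of a plain memcpy. */
                map_len = direct_access_stub(DAX_RECOVERY_WRITE);
                if (map_len > 0)
                        printf("recovery write over %ld page(s)\n", map_len);
        }
        return 0;
}

The design point is that DAX_ACCESS keeps the fail-fast behavior for
poisoned ranges, while DAX_RECOVERY_WRITE is an explicit opt-in used
only by a write path that is prepared to repair the poison.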

* Re: [PATCH v8 4/7] dax: introduce DAX_RECOVERY_WRITE dax access mode
  2022-04-21 19:35     ` [dm-devel] " Dan Williams
@ 2022-04-22  4:15       ` Jane Chu
  -1 siblings, 0 replies; 44+ messages in thread
From: Jane Chu @ 2022-04-22  4:15 UTC (permalink / raw)
  To: Dan Williams
  Cc: Borislav Petkov, Christoph Hellwig, Dave Hansen, Peter Zijlstra,
	Andy Lutomirski, david, Darrick J. Wong, linux-fsdevel,
	Linux NVDIMM, Linux Kernel Mailing List, X86 ML, Vishal L Verma,
	Dave Jiang, Alasdair Kergon, Mike Snitzer,
	device-mapper development, Weiny, Ira, Matthew Wilcox,
	Vivek Goyal

On 4/21/2022 12:35 PM, Dan Williams wrote:
> On Tue, Apr 19, 2022 at 7:05 PM Jane Chu <jane.chu@oracle.com> wrote:
>>
>> Up till now, dax_direct_access() has been used implicitly for normal
>> access, but a recovery write needs access to a dax range that
>> contains poison.  To make the interface explicit, introduce
>>          enum dax_access_mode {
>>                  DAX_ACCESS,
>>                  DAX_RECOVERY_WRITE,
>>          }
>> where DAX_ACCESS is used for normal dax access, and
>> DAX_RECOVERY_WRITE is used for dax recovery write.
>>
>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: Jane Chu <jane.chu@oracle.com>
>> ---
>>   drivers/dax/super.c             |  5 ++--
>>   drivers/md/dm-linear.c          |  5 ++--
>>   drivers/md/dm-log-writes.c      |  5 ++--
>>   drivers/md/dm-stripe.c          |  5 ++--
>>   drivers/md/dm-target.c          |  4 ++-
>>   drivers/md/dm-writecache.c      |  7 +++---
>>   drivers/md/dm.c                 |  5 ++--
>>   drivers/nvdimm/pmem.c           | 44 +++++++++++++++++++++++++--------
>>   drivers/nvdimm/pmem.h           |  5 +++-
>>   drivers/s390/block/dcssblk.c    |  9 ++++---
>>   fs/dax.c                        |  9 ++++---
>>   fs/fuse/dax.c                   |  4 +--
>>   include/linux/dax.h             |  9 +++++--
>>   include/linux/device-mapper.h   |  4 ++-
>>   tools/testing/nvdimm/pmem-dax.c |  3 ++-
>>   15 files changed, 85 insertions(+), 38 deletions(-)
>>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index 0211e6f7b47a..5405eb553430 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -117,6 +117,7 @@ enum dax_device_flags {
>>    * @dax_dev: a dax_device instance representing the logical memory range
>>    * @pgoff: offset in pages from the start of the device to translate
>>    * @nr_pages: number of consecutive pages caller can handle relative to @pfn
>> + * @mode: indicator of normal access or recovery write
>>    * @kaddr: output parameter that returns a virtual address mapping of pfn
>>    * @pfn: output parameter that returns an absolute pfn translation of @pgoff
>>    *
>> @@ -124,7 +125,7 @@ enum dax_device_flags {
>>    * pages accessible at the device relative @pgoff.
>>    */
>>   long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
>> -               void **kaddr, pfn_t *pfn)
>> +               enum dax_access_mode mode, void **kaddr, pfn_t *pfn)
>>   {
>>          long avail;
>>
>> @@ -138,7 +139,7 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
>>                  return -EINVAL;
>>
>>          avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages,
>> -                       kaddr, pfn);
>> +                       mode, kaddr, pfn);
>>          if (!avail)
>>                  return -ERANGE;
>>          return min(avail, nr_pages);
>> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
>> index 76b486e4d2be..13e263299c9c 100644
>> --- a/drivers/md/dm-linear.c
>> +++ b/drivers/md/dm-linear.c
>> @@ -172,11 +172,12 @@ static struct dax_device *linear_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff)
>>   }
>>
>>   static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
>> -               long nr_pages, void **kaddr, pfn_t *pfn)
>> +               long nr_pages, enum dax_access_mode mode, void **kaddr,
>> +               pfn_t *pfn)
>>   {
>>          struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff);
>>
>> -       return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
>> +       return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
>>   }
>>
>>   static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
>> diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
>> index c9d036d6bb2e..06bdbed65eb1 100644
>> --- a/drivers/md/dm-log-writes.c
>> +++ b/drivers/md/dm-log-writes.c
>> @@ -889,11 +889,12 @@ static struct dax_device *log_writes_dax_pgoff(struct dm_target *ti,
>>   }
>>
>>   static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
>> -                                        long nr_pages, void **kaddr, pfn_t *pfn)
>> +               long nr_pages, enum dax_access_mode mode, void **kaddr,
>> +               pfn_t *pfn)
>>   {
>>          struct dax_device *dax_dev = log_writes_dax_pgoff(ti, &pgoff);
>>
>> -       return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
>> +       return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
>>   }
>>
>>   static int log_writes_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
>> diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
>> index c81d331d1afe..77d72900e997 100644
>> --- a/drivers/md/dm-stripe.c
>> +++ b/drivers/md/dm-stripe.c
>> @@ -315,11 +315,12 @@ static struct dax_device *stripe_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff)
>>   }
>>
>>   static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
>> -               long nr_pages, void **kaddr, pfn_t *pfn)
>> +               long nr_pages, enum dax_access_mode mode, void **kaddr,
>> +               pfn_t *pfn)
>>   {
>>          struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff);
>>
>> -       return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
>> +       return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
>>   }
>>
>>   static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
>> diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
>> index 64dd0b34fcf4..8cd5184e62f0 100644
>> --- a/drivers/md/dm-target.c
>> +++ b/drivers/md/dm-target.c
>> @@ -10,6 +10,7 @@
>>   #include <linux/init.h>
>>   #include <linux/kmod.h>
>>   #include <linux/bio.h>
>> +#include <linux/dax.h>
>>
>>   #define DM_MSG_PREFIX "target"
>>
>> @@ -142,7 +143,8 @@ static void io_err_release_clone_rq(struct request *clone,
>>   }
>>
>>   static long io_err_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
>> -               long nr_pages, void **kaddr, pfn_t *pfn)
>> +               long nr_pages, enum dax_access_mode mode, void **kaddr,
>> +               pfn_t *pfn)
>>   {
>>          return -EIO;
>>   }
>> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
>> index 5630b470ba42..d74c5a7a0ab4 100644
>> --- a/drivers/md/dm-writecache.c
>> +++ b/drivers/md/dm-writecache.c
>> @@ -286,7 +286,8 @@ static int persistent_memory_claim(struct dm_writecache *wc)
>>
>>          id = dax_read_lock();
>>
>> -       da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, &wc->memory_map, &pfn);
>> +       da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, DAX_ACCESS,
>> +                       &wc->memory_map, &pfn);
>>          if (da < 0) {
>>                  wc->memory_map = NULL;
>>                  r = da;
>> @@ -308,8 +309,8 @@ static int persistent_memory_claim(struct dm_writecache *wc)
>>                  i = 0;
>>                  do {
>>                          long daa;
>> -                       daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i, p - i,
>> -                                               NULL, &pfn);
>> +                       daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i,
>> +                                       p - i, DAX_ACCESS, NULL, &pfn);
>>                          if (daa <= 0) {
>>                                  r = daa ? daa : -EINVAL;
>>                                  goto err3;
>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>> index 3c5fad7c4ee6..8258676a352f 100644
>> --- a/drivers/md/dm.c
>> +++ b/drivers/md/dm.c
>> @@ -1093,7 +1093,8 @@ static struct dm_target *dm_dax_get_live_target(struct mapped_device *md,
>>   }
>>
>>   static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
>> -                                long nr_pages, void **kaddr, pfn_t *pfn)
>> +               long nr_pages, enum dax_access_mode mode, void **kaddr,
>> +               pfn_t *pfn)
>>   {
>>          struct mapped_device *md = dax_get_private(dax_dev);
>>          sector_t sector = pgoff * PAGE_SECTORS;
>> @@ -1111,7 +1112,7 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
>>          if (len < 1)
>>                  goto out;
>>          nr_pages = min(len, nr_pages);
>> -       ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
>> +       ret = ti->type->direct_access(ti, pgoff, nr_pages, mode, kaddr, pfn);
>>
>>    out:
>>          dm_put_live_table(md, srcu_idx);
>> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
>> index 4aa17132a557..c77b7cf19639 100644
>> --- a/drivers/nvdimm/pmem.c
>> +++ b/drivers/nvdimm/pmem.c
>> @@ -239,24 +239,47 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
>>
>>   /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
>>   __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
>> -               long nr_pages, void **kaddr, pfn_t *pfn)
>> +               long nr_pages, enum dax_access_mode mode, void **kaddr,
>> +               pfn_t *pfn)
>>   {
>>          resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
>> -
>> -       if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
>> -                                       PFN_PHYS(nr_pages))))
>> -               return -EIO;
>> +       sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
>> +       unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT;
>> +       struct badblocks *bb = &pmem->bb;
>> +       sector_t first_bad;
>> +       int num_bad;
>>
>>          if (kaddr)
>>                  *kaddr = pmem->virt_addr + offset;
>>          if (pfn)
>>                  *pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
>>
>> +       if (bb->count &&
>> +               badblocks_check(bb, sector, num, &first_bad, &num_bad)) {
>> +               long actual_nr;
>> +
>> +               if (mode != DAX_RECOVERY_WRITE)
>> +                       return -EIO;
>> +
>> +               /*
>> +                * The recovery stride is set to the kernel page size because
>> +                * the underlying driver and firmware clear-poison functions
>> +                * don't appear to handle large chunks (such as 2MiB) reliably.
>> +                */
>> +               actual_nr = PHYS_PFN(
>> +                       PAGE_ALIGN((first_bad - sector) << SECTOR_SHIFT));
>> +               dev_dbg(pmem->bb.dev, "start sector(%llu), nr_pages(%ld), first_bad(%llu), actual_nr(%ld)\n",
>> +                               sector, nr_pages, first_bad, actual_nr);
>> +               if (actual_nr)
>> +                       return actual_nr;
>> +               return 1;
>> +       }
>> +
> 
> Similar feedback as Christoph: let's keep this patch to the simple
> task of plumbing the @mode argument to dax_direct_access(), and save
> the logic changes for DAX_RECOVERY_WRITE for a later patch. The idea
> being that, in general, you want to limit the blast radius of
> regressions so that reverts stay simple if they become necessary. If
> these logic changes cause a regression, then reverting this patch
> also undoes all the other innocent boilerplate plumbing.

Got it, will do.

thanks!
-jane

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dm-devel] [PATCH v8 4/7] dax: introduce DAX_RECOVERY_WRITE dax access mode
@ 2022-04-22  4:15       ` Jane Chu
  0 siblings, 0 replies; 44+ messages in thread
From: Jane Chu @ 2022-04-22  4:15 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux NVDIMM, Dave Jiang, Mike Snitzer, Peter Zijlstra,
	Darrick J. Wong, X86 ML, david, Linux Kernel Mailing List,
	Matthew Wilcox, Christoph Hellwig, Dave Hansen,
	device-mapper development, Borislav Petkov, Vivek Goyal,
	Andy Lutomirski, Vishal L Verma, linux-fsdevel, Weiny, Ira,
	Alasdair Kergon

On 4/21/2022 12:35 PM, Dan Williams wrote:
> On Tue, Apr 19, 2022 at 7:05 PM Jane Chu <jane.chu@oracle.com> wrote:
>>
>> Up till now, dax_direct_access() is used implicitly for normal
>> access, but for the purpose of recovery write, dax range with
>> poison is requested.  To make the interface clear, introduce
>>          enum dax_access_mode {
>>                  DAX_ACCESS,
>>                  DAX_RECOVERY_WRITE,
>>          }
>> where DAX_ACCESS is used for normal dax access, and
>> DAX_RECOVERY_WRITE is used for dax recovery write.
>>
>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: Jane Chu <jane.chu@oracle.com>
>> ---
>>   drivers/dax/super.c             |  5 ++--
>>   drivers/md/dm-linear.c          |  5 ++--
>>   drivers/md/dm-log-writes.c      |  5 ++--
>>   drivers/md/dm-stripe.c          |  5 ++--
>>   drivers/md/dm-target.c          |  4 ++-
>>   drivers/md/dm-writecache.c      |  7 +++---
>>   drivers/md/dm.c                 |  5 ++--
>>   drivers/nvdimm/pmem.c           | 44 +++++++++++++++++++++++++--------
>>   drivers/nvdimm/pmem.h           |  5 +++-
>>   drivers/s390/block/dcssblk.c    |  9 ++++---
>>   fs/dax.c                        |  9 ++++---
>>   fs/fuse/dax.c                   |  4 +--
>>   include/linux/dax.h             |  9 +++++--
>>   include/linux/device-mapper.h   |  4 ++-
>>   tools/testing/nvdimm/pmem-dax.c |  3 ++-
>>   15 files changed, 85 insertions(+), 38 deletions(-)
>>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index 0211e6f7b47a..5405eb553430 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -117,6 +117,7 @@ enum dax_device_flags {
>>    * @dax_dev: a dax_device instance representing the logical memory range
>>    * @pgoff: offset in pages from the start of the device to translate
>>    * @nr_pages: number of consecutive pages caller can handle relative to @pfn
>> + * @mode: indicator on normal access or recovery write
>>    * @kaddr: output parameter that returns a virtual address mapping of pfn
>>    * @pfn: output parameter that returns an absolute pfn translation of @pgoff
>>    *
>> @@ -124,7 +125,7 @@ enum dax_device_flags {
>>    * pages accessible at the device relative @pgoff.
>>    */
>>   long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
>> -               void **kaddr, pfn_t *pfn)
>> +               enum dax_access_mode mode, void **kaddr, pfn_t *pfn)
>>   {
>>          long avail;
>>
>> @@ -138,7 +139,7 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
>>                  return -EINVAL;
>>
>>          avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages,
>> -                       kaddr, pfn);
>> +                       mode, kaddr, pfn);
>>          if (!avail)
>>                  return -ERANGE;
>>          return min(avail, nr_pages);
>> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
>> index 76b486e4d2be..13e263299c9c 100644
>> --- a/drivers/md/dm-linear.c
>> +++ b/drivers/md/dm-linear.c
>> @@ -172,11 +172,12 @@ static struct dax_device *linear_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff)
>>   }
>>
>>   static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
>> -               long nr_pages, void **kaddr, pfn_t *pfn)
>> +               long nr_pages, enum dax_access_mode mode, void **kaddr,
>> +               pfn_t *pfn)
>>   {
>>          struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff);
>>
>> -       return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
>> +       return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
>>   }
>>
>>   static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
>> diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
>> index c9d036d6bb2e..06bdbed65eb1 100644
>> --- a/drivers/md/dm-log-writes.c
>> +++ b/drivers/md/dm-log-writes.c
>> @@ -889,11 +889,12 @@ static struct dax_device *log_writes_dax_pgoff(struct dm_target *ti,
>>   }
>>
>>   static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
>> -                                        long nr_pages, void **kaddr, pfn_t *pfn)
>> +               long nr_pages, enum dax_access_mode mode, void **kaddr,
>> +               pfn_t *pfn)
>>   {
>>          struct dax_device *dax_dev = log_writes_dax_pgoff(ti, &pgoff);
>>
>> -       return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
>> +       return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
>>   }
>>
>>   static int log_writes_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
>> diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
>> index c81d331d1afe..77d72900e997 100644
>> --- a/drivers/md/dm-stripe.c
>> +++ b/drivers/md/dm-stripe.c
>> @@ -315,11 +315,12 @@ static struct dax_device *stripe_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff)
>>   }
>>
>>   static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
>> -               long nr_pages, void **kaddr, pfn_t *pfn)
>> +               long nr_pages, enum dax_access_mode mode, void **kaddr,
>> +               pfn_t *pfn)
>>   {
>>          struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff);
>>
>> -       return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
>> +       return dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
>>   }
>>
>>   static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
>> diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
>> index 64dd0b34fcf4..8cd5184e62f0 100644
>> --- a/drivers/md/dm-target.c
>> +++ b/drivers/md/dm-target.c
>> @@ -10,6 +10,7 @@
>>   #include <linux/init.h>
>>   #include <linux/kmod.h>
>>   #include <linux/bio.h>
>> +#include <linux/dax.h>
>>
>>   #define DM_MSG_PREFIX "target"
>>
>> @@ -142,7 +143,8 @@ static void io_err_release_clone_rq(struct request *clone,
>>   }
>>
>>   static long io_err_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
>> -               long nr_pages, void **kaddr, pfn_t *pfn)
>> +               long nr_pages, enum dax_access_mode mode, void **kaddr,
>> +               pfn_t *pfn)
>>   {
>>          return -EIO;
>>   }
>> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
>> index 5630b470ba42..d74c5a7a0ab4 100644
>> --- a/drivers/md/dm-writecache.c
>> +++ b/drivers/md/dm-writecache.c
>> @@ -286,7 +286,8 @@ static int persistent_memory_claim(struct dm_writecache *wc)
>>
>>          id = dax_read_lock();
>>
>> -       da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, &wc->memory_map, &pfn);
>> +       da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, DAX_ACCESS,
>> +                       &wc->memory_map, &pfn);
>>          if (da < 0) {
>>                  wc->memory_map = NULL;
>>                  r = da;
>> @@ -308,8 +309,8 @@ static int persistent_memory_claim(struct dm_writecache *wc)
>>                  i = 0;
>>                  do {
>>                          long daa;
>> -                       daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i, p - i,
>> -                                               NULL, &pfn);
>> +                       daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i,
>> +                                       p - i, DAX_ACCESS, NULL, &pfn);
>>                          if (daa <= 0) {
>>                                  r = daa ? daa : -EINVAL;
>>                                  goto err3;
>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>> index 3c5fad7c4ee6..8258676a352f 100644
>> --- a/drivers/md/dm.c
>> +++ b/drivers/md/dm.c
>> @@ -1093,7 +1093,8 @@ static struct dm_target *dm_dax_get_live_target(struct mapped_device *md,
>>   }
>>
>>   static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
>> -                                long nr_pages, void **kaddr, pfn_t *pfn)
>> +               long nr_pages, enum dax_access_mode mode, void **kaddr,
>> +               pfn_t *pfn)
>>   {
>>          struct mapped_device *md = dax_get_private(dax_dev);
>>          sector_t sector = pgoff * PAGE_SECTORS;
>> @@ -1111,7 +1112,7 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
>>          if (len < 1)
>>                  goto out;
>>          nr_pages = min(len, nr_pages);
>> -       ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
>> +       ret = ti->type->direct_access(ti, pgoff, nr_pages, mode, kaddr, pfn);
>>
>>    out:
>>          dm_put_live_table(md, srcu_idx);
>> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
>> index 4aa17132a557..c77b7cf19639 100644
>> --- a/drivers/nvdimm/pmem.c
>> +++ b/drivers/nvdimm/pmem.c
>> @@ -239,24 +239,47 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
>>
>>   /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
>>   __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
>> -               long nr_pages, void **kaddr, pfn_t *pfn)
>> +               long nr_pages, enum dax_access_mode mode, void **kaddr,
>> +               pfn_t *pfn)
>>   {
>>          resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
>> -
>> -       if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
>> -                                       PFN_PHYS(nr_pages))))
>> -               return -EIO;
>> +       sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
>> +       unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT;
>> +       struct badblocks *bb = &pmem->bb;
>> +       sector_t first_bad;
>> +       int num_bad;
>>
>>          if (kaddr)
>>                  *kaddr = pmem->virt_addr + offset;
>>          if (pfn)
>>                  *pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
>>
>> +       if (bb->count &&
>> +               badblocks_check(bb, sector, num, &first_bad, &num_bad)) {
>> +               long actual_nr;
>> +
>> +               if (mode != DAX_RECOVERY_WRITE)
>> +                       return -EIO;
>> +
>> +               /*
>> +                * The recovery stride is set to the kernel page size
>> +                * because the underlying driver and firmware clear-poison
>> +                * functions don't appear to handle large chunks (such as
>> +                * 2MiB) reliably.
>> +                */
>> +               actual_nr = PHYS_PFN(
>> +                       PAGE_ALIGN((first_bad - sector) << SECTOR_SHIFT));
>> +               dev_dbg(pmem->bb.dev, "start sector(%llu), nr_pages(%ld), first_bad(%llu), actual_nr(%ld)\n",
>> +                               sector, nr_pages, first_bad, actual_nr);
>> +               if (actual_nr)
>> +                       return actual_nr;
>> +               return 1;
>> +       }
>> +
> 
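
To illustrate the computation above with hypothetical values (assuming
4K pages, i.e. PAGE_SHIFT = 12, with SECTOR_SHIFT = 9):

   /*
    * Say the access starts at sector 8 (byte offset 4096) and
    * badblocks_check() reports first_bad = 24 (byte offset 12288):
    *
    *   (first_bad - sector) << SECTOR_SHIFT  =  16 << 9  =  8192
    *   PAGE_ALIGN(8192)                                  =  8192
    *   PHYS_PFN(8192)                                    =  2
    *
    * i.e. two clean pages precede the poison, so the recovery-write
    * caller may proceed over those 2 pages first. Once sector reaches
    * first_bad, actual_nr computes to 0 and the function returns 1
    * instead, so the caller still makes forward progress, one page at
    * a time, through the poisoned range.
    */
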
> Similar feedback to Christoph's: let's keep this patch to the simple
> task of plumbing the @mode argument to dax_direct_access(), and save
> the logic changes for DAX_RECOVERY_WRITE for a later patch. The idea
> is that, in general, you want to limit the blast radius of potential
> regressions so that reverts stay simple if they become necessary. If
> these logic changes cause a regression, then reverting this patch
> would also undo all the other innocent boilerplate plumbing.

Got it, will do.
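
For reference, the caller-side pattern the new mode enables looks
roughly like the sketch below. Only dax_direct_access() and the two
access modes come from this series; dax_recovery_write_helper() is a
hypothetical stand-in for the .recovery_write plumbing, and a real
write path would use a flushing copy rather than memcpy():

   /* Sketch only: error paths trimmed, helper name hypothetical.
    * The caller is assumed to hold dax_read_lock().
    */
   static ssize_t dax_write_one_chunk(struct dax_device *dax_dev,
                   pgoff_t pgoff, long nr_pages, const void *buf,
                   size_t len)
   {
           void *kaddr;
           long avail;

           avail = dax_direct_access(dax_dev, pgoff, nr_pages,
                           DAX_ACCESS, &kaddr, NULL);
           if (avail == -EIO) {
                   /* poison in the range: ask for a recovery mapping */
                   avail = dax_direct_access(dax_dev, pgoff, nr_pages,
                                   DAX_RECOVERY_WRITE, &kaddr, NULL);
                   if (avail < 0)
                           return avail;
                   /* clear poison and write via .recovery_write */
                   return dax_recovery_write_helper(dax_dev, pgoff,
                                   kaddr, buf, len);
           }
           if (avail < 0)
                   return avail;
           memcpy(kaddr, buf, len);    /* illustrative; see note above */
           return len;
   }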

thanks!
-jane



end of thread, newest message: 2022-04-22  4:16 UTC

Thread overview: 44+ messages
2022-04-20  2:04 [PATCH v8 0/7] DAX poison recovery Jane Chu
2022-04-20  2:04 ` [PATCH v8 1/7] acpi/nfit: rely on mce->misc to determine poison granularity Jane Chu
2022-04-21 19:25   ` Dan Williams
2022-04-20  2:04 ` [PATCH v8 2/7] x86/mce: relocate set{clear}_mce_nospec() functions Jane Chu
2022-04-21  6:50   ` Christoph Hellwig
2022-04-22  3:46     ` Jane Chu
2022-04-20  2:04 ` [PATCH v8 3/7] mce: fix set_mce_nospec to always unmap the whole page Jane Chu
2022-04-21  6:51   ` Christoph Hellwig
2022-04-22  3:47     ` Jane Chu
2022-04-21 19:26   ` Dan Williams
2022-04-20  2:04 ` [PATCH v8 4/7] dax: introduce DAX_RECOVERY_WRITE dax access mode Jane Chu
2022-04-21  6:57   ` Christoph Hellwig
2022-04-22  4:15     ` Jane Chu
2022-04-21 19:35   ` Dan Williams
2022-04-22  4:15     ` Jane Chu
2022-04-20  2:04 ` [PATCH v8 5/7] dax: add .recovery_write dax_operation Jane Chu
2022-04-21  6:57   ` Christoph Hellwig
2022-04-20  2:04 ` [PATCH v8 6/7] pmem: refactor pmem_clear_poison() Jane Chu
2022-04-20  2:04 ` [PATCH v8 7/7] pmem: implement pmem_recovery_write() Jane Chu
2022-04-20  6:26   ` Dan Williams
2022-04-20 17:01     ` Jane Chu
2022-04-20 17:05       ` Dan Williams
